Peptide comparative genomics
Each Ensembl Genomes division, apart from Ensembl Bacteria, performs comparative analyses at the peptide level, and an additional pan-taxonomic comparative analysis is performed for a set of representative species from across the taxonomic space. In brief, peptide sequence alignments are used to cluster proteins; the proteins in each cluster are then aligned, and a phylogenetic tree is inferred from each alignment. Finally, lists of orthologues and paralogues are derived from the gene trees.
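The final step, deriving homologue lists from a gene tree, can be illustrated with a minimal sketch. It assumes the widely used convention that two genes are orthologues when their last common ancestor in the tree is a speciation node, and paralogues when it is a duplication node; the text above does not state the exact rule used by the pipeline. The nested-tuple tree encoding, the function names and the gene names are all hypothetical.

```python
# A gene tree is encoded as nested tuples: an internal node is
# (event, left_subtree, right_subtree), where event is "speciation" or
# "duplication"; a leaf is simply a gene name. This encoding is an
# illustrative assumption, not the pipeline's own data model.

def leaves(node):
    """Return the gene names found beneath a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, orthologues=None, paralogues=None):
    """Split all gene pairs into orthologues and paralogues.

    Every pair with one gene on each side of a node has that node as its
    last common ancestor, so the node's event type decides the class.
    """
    if orthologues is None:
        orthologues, paralogues = [], []
    if isinstance(node, str):
        return orthologues, paralogues
    event, left, right = node
    target = orthologues if event == "speciation" else paralogues
    target.extend((a, b) for a in leaves(left) for b in leaves(right))
    classify_pairs(left, orthologues, paralogues)
    classify_pairs(right, orthologues, paralogues)
    return orthologues, paralogues


# Hypothetical tree: a duplication in the fly lineage after the fly/bee split.
tree = ("speciation", ("duplication", "fly_A1", "fly_A2"), "bee_A")
orthologues, paralogues = classify_pairs(tree)
```

In this toy tree both fly genes are orthologous to the single bee gene (a two-to-one relationship), and the two fly genes are paralogues of each other.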
For Ensembl Protists and Ensembl Fungi, one genome per species is included in the peptide comparative analysis. The selection of this genome is based on prior inclusion in Ensembl Genomes, or the date of its submission to INSDC. For Ensembl Bacteria, 108 important genomes are included in the pan-taxonomic comparative analysis; these are selected to provide one representative genome from each species that either features in the curated reference proteome set from UniProt or has a high citation rate in Europe PubMed Central.
Orthologue QC
A subset of orthologues is classified as a "high confidence" set. At a minimum, orthologous proteins must have a percentage identity above a certain threshold, currently set at 25% for all species, and must satisfy a "tree-compliance" metric that identifies orthologues inferred from dubious tree topologies.
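The minimum criteria can be expressed as a simple filter. The sketch below is illustrative only: the record layout, the field names and the function name are hypothetical, the tree-compliance metric is assumed to be available as a precomputed boolean, and requiring both proteins in the pair to exceed the identity threshold is an assumption.

```python
from dataclasses import dataclass

MIN_PERC_ID = 25.0  # percentage identity threshold stated in the text


@dataclass
class OrthologuePair:
    """Hypothetical record for one orthologue pair."""
    gene_a: str
    gene_b: str
    perc_id_a: float      # % identity measured over protein A
    perc_id_b: float      # % identity measured over protein B
    tree_compliant: bool  # assumed to be precomputed from the gene tree


def passes_minimum_criteria(pair, min_perc_id=MIN_PERC_ID):
    """Return True if the pair meets both minimum requirements.

    Assumption: the lower of the two identities must be strictly above
    the threshold ("above" in the text is read as a strict inequality).
    """
    lowest_identity = min(pair.perc_id_a, pair.perc_id_b)
    return pair.tree_compliant and lowest_identity > min_perc_id


candidate = OrthologuePair("geneA", "geneB", 62.0, 48.5, True)
is_candidate = passes_minimum_criteria(candidate)
```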
For some species in Ensembl Metazoa, orthologue metrics have been calculated using two orthogonal sources of information: gene order conservation (GOC) and whole genome alignments (WGA). The methodology is described in more detail elsewhere, but briefly: the "GOC score" metric for a pair of orthologues measures whether the two genes upstream and the two genes downstream of each gene in the orthologue pair are also orthologous, and allows for inversions and gene insertions. The "WGA coverage" metric determines the extent to which the orthologous regions have been aligned by pairwise genome alignments, primarily based on exonic coverage, with a small contribution from intronic coverage. Both metrics take values between 0 and 100.
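To make the idea behind the GOC score concrete, here is a deliberately simplified sketch, not the published method. It assumes each genome is given as an ordered list of gene names on one chromosome and that orthology between neighbouring genes is already known. It ignores orientation (so an inverted neighbourhood still scores fully) but does not model the tolerance of gene insertions mentioned above. All names are hypothetical.

```python
def goc_score(gene_a, gene_b, order_a, order_b, orthologues, flank=2):
    """Toy gene order conservation score in the range 0 to 100.

    order_a, order_b: gene names in chromosomal order for each species.
    orthologues: set of (gene_in_a, gene_in_b) pairs.
    Each of the 2 * flank neighbours of gene_a that has an orthologue
    among the neighbours of gene_b contributes equally to the score.
    """
    def neighbours(gene, order):
        i = order.index(gene)
        # Up to `flank` genes on either side; fewer at chromosome ends.
        return order[max(0, i - flank):i] + order[i + 1:i + 1 + flank]

    near_a = neighbours(gene_a, order_a)
    near_b = set(neighbours(gene_b, order_b))
    conserved = sum(
        1 for a in near_a
        if any((a, b) in orthologues for b in near_b)
    )
    return 100 * conserved // (2 * flank)


pairs = {(f"a{i}", f"b{i}") for i in range(1, 6)}
species_a = ["a1", "a2", "a3", "a4", "a5"]
inverted_b = ["b5", "b4", "b3", "b2", "b1"]    # same neighbourhood, inverted
diverged_b = ["b1", "b2", "b3", "x1", "x2"]    # downstream genes unrelated

score_inverted = goc_score("a3", "b3", species_a, inverted_b, pairs)
score_diverged = goc_score("a3", "b3", species_a, diverged_b, pairs)
```

With two flanking genes on each side, this toy score moves in steps of 25: the inverted neighbourhood scores 100, and the half-conserved one scores 50.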
Gene order conservation is only expected between species that are evolutionarily close; thus the GOC score is only calculated within Diptera, Hymenoptera, and Nematoda. Similarly, pairwise WGAs, and thus the related metric, are only available for a subset of closely related species. To classify orthologues as "high confidence", thresholds are applied to the orthologue metrics, according to the evolutionary distance between the species. Within Aculeata, Caenorhabditis, Drosophila, and Onchocercidae, the GOC and WGA thresholds are both 50; no thresholds are applied beyond these clades.
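The clade-specific thresholds can be sketched as a lookup followed by a comparison. The text does not say how the two metrics are combined, so the sketch makes several assumptions, each flagged in the comments: a pair passes if at least one available metric reaches its threshold, a score equal to the threshold passes, and a pair with neither metric available is not rejected at this step. The `clade` argument is assumed to be the listed clade containing both species, or `None` if there is none; the function and variable names are hypothetical.

```python
# Thresholds from the text: 50 for both metrics within these four clades.
CLADE_THRESHOLDS = {
    clade: {"goc": 50, "wga": 50}
    for clade in ("Aculeata", "Caenorhabditis", "Drosophila", "Onchocercidae")
}


def meets_metric_thresholds(clade, goc=None, wga=None):
    """Apply the GOC/WGA thresholds for a pair of species in `clade`.

    goc, wga: scores from 0 to 100, or None when a metric was not computed.
    """
    thresholds = CLADE_THRESHOLDS.get(clade)
    if thresholds is None:
        # No thresholds are applied beyond the listed clades.
        return True
    scored = [
        (value, thresholds[name])
        for name, value in (("goc", goc), ("wga", wga))
        if value is not None
    ]
    if not scored:
        # Assumption: with no metric available, do not reject the pair here.
        return True
    # Assumption: one passing metric is enough, and ">=" counts as passing.
    return any(value >= threshold for value, threshold in scored)


within_drosophila = meets_metric_thresholds("Drosophila", goc=75, wga=20)
```

A pair would be labelled "high confidence" only if it also satisfies the minimum percentage identity and tree-compliance criteria described earlier.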
Data Access
Interactive gene trees and homologue data can be accessed through the genome browser, the Perl API, and the REST service. Data files are also made available via the Ensembl Genomes FTP site.
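As an illustration of programmatic access, the sketch below builds and issues a homology query against the REST service using only the Python standard library. The server address, the `homology/id/:species/:id` path and the `type` and `compara` query parameters reflect the public Ensembl REST documentation as best understood here and are not taken from the text above; they should be checked against the current documentation before use. The function names are hypothetical.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

# Assumption: the public Ensembl REST server also serves Ensembl Genomes data.
REST_SERVER = "https://rest.ensembl.org"


def build_homology_url(species, gene_id, compara, homology_type="orthologues"):
    """Build the URL for a homology lookup by stable gene identifier.

    compara selects the division's comparative database (e.g. "metazoa");
    the endpoint path and parameter names are assumptions to verify.
    """
    path = f"/homology/id/{quote(species)}/{quote(gene_id)}"
    query = urlencode({"type": homology_type, "compara": compara})
    return f"{REST_SERVER}{path}?{query}"


def fetch_homologues(species, gene_id, compara, homology_type="orthologues"):
    """Request homologues as JSON and return the decoded response."""
    url = build_homology_url(species, gene_id, compara, homology_type)
    request = Request(url, headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)
```

A call such as `fetch_homologues("drosophila_melanogaster", gene_id, compara="metazoa")`, with `gene_id` set to a stable gene identifier of interest, would return the decoded JSON document; its exact structure is best confirmed from the service documentation rather than assumed.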