GFF annotation import

Gene models for some species are loaded from GFF files provided by the research community. (If the gene models are available in INSDC, however, we usually import them directly from there instead.) In most cases, the GFF files consist of protein-coding genes, pseudogenes, transcripts, exons, and CDS regions. In a small number of cases, the GFF files also include non-coding RNA annotation, which is imported as well; the GFF type of the ncRNA is used as the biotype in Ensembl Genomes.
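As an illustration of the rule that an ncRNA feature's GFF type becomes its biotype, here is a minimal Python sketch. It is not the Ensembl Genomes import code. The function names and the NCRNA_TYPES set are hypothetical, chosen only for this example, and the sketch assumes tab-separated GFF3 lines with nine columns and an ID attribute.

```python
# Illustrative only: a hypothetical set of feature types treated as ncRNA.
# A real import would decide this per data provider.
NCRNA_TYPES = {"tRNA", "rRNA", "snRNA", "snoRNA", "miRNA", "ncRNA"}


def parse_gff_line(line):
    """Parse one GFF3 feature line into a dict, or return None for
    comments, directives, blank lines, and lines without nine columns."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None
    # Column 9 holds semicolon-separated key=value pairs.
    # (URL-encoded characters are left undecoded in this sketch.)
    attributes = dict(
        pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
    )
    return {
        "seqid": cols[0],
        "type": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "strand": cols[6],
        "attributes": attributes,
    }


def ncrna_biotypes(lines):
    """Map feature ID -> biotype, using the GFF type column as the biotype
    for every feature whose type is in NCRNA_TYPES."""
    biotypes = {}
    for line in lines:
        feature = parse_gff_line(line)
        if feature is None or feature["type"] not in NCRNA_TYPES:
            continue
        feature_id = feature["attributes"].get("ID")
        if feature_id is not None:
            biotypes[feature_id] = feature["type"]
    return biotypes
```

For example, a line whose third column is tRNA and whose attributes include ID=trna1 would yield the entry trna1 -> tRNA, while gene, exon, and CDS lines are ignored by this function.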

When importing genes from a GFF file, we check whether each produces a valid translation, i.e. one without internal stop codons. If there are invalid translations, we first raise them with the data provider to determine whether they result from errors or from unannotated pseudogenes, and amend the GFF file accordingly. If the data provider believes the genes to be protein-coding, we can accommodate changes to either the nucleotide or the amino acid sequence, or insert frameshifts. We aim to minimise these changes, both because they add complexity when interpreting the data and because the role of Ensembl Genomes is to present, rather than modify, the data of others. If the status of a gene with an invalid translation remains unclear, it is assigned a "nontranslating CDS" biotype; no protein sequence is associated with such a gene, but it is otherwise displayed as a normal gene in Ensembl Genomes.
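The check for internal stop codons can be sketched in a few lines of Python. This is a simplified illustration and not the Ensembl Genomes implementation. It assumes the standard genetic code (some genomes and organelles use other tables), and it assumes the CDS has already been spliced, is given on the coding strand, and starts in phase 0. The names translate and has_internal_stop are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons enumerated
# in T, C, A, G order for each of the three positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a spliced, phase-0 CDS into a protein string.
    Stop codons become '*'; codons with ambiguous bases become 'X';
    a trailing partial codon is ignored."""
    cds = cds.upper()
    usable_length = len(cds) - len(cds) % 3
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable_length, 3)
    )


def has_internal_stop(cds):
    """Return True if the translation contains a stop codon anywhere
    other than the final position."""
    protein = translate(cds)
    # A single terminal stop is expected, so drop it before checking.
    body = protein[:-1] if protein.endswith("*") else protein
    return "*" in body
```

Under these assumptions, a CDS such as ATGAAATAA passes the check, whereas ATGTAAAAATAA is flagged because its second codon is a stop. A flagged gene would then be followed up as described above rather than rejected automatically.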