GFF annotation import

Gene models for some species are loaded from GFF files provided by the research community. (If the gene models are available in INSDC, however, we usually import directly from INSDC.) In most cases, the GFF files consist of protein-coding genes, pseudogenes, transcripts, exons, and CDS regions. In a small number of cases, the GFF files include non-coding RNA annotation, which is also imported; the GFF type of the ncRNA is used as the biotype in Ensembl Genomes.
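As a rough illustration of how the GFF type can carry over as the biotype, the sketch below parses one tab-separated GFF3 line and returns the feature type unchanged when it is an ncRNA type. The function names and the NCRNA_TYPES set are assumptions made for this example; they are not part of the Ensembl Genomes import code.

```python
# Illustrative set of ncRNA feature types; the real list depends on the
# GFF files supplied by each data provider.
NCRNA_TYPES = {"tRNA", "rRNA", "snRNA", "snoRNA", "miRNA", "ncRNA"}


def parse_gff3_line(line):
    """Split one GFF3 feature line into its nine standard columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    # Column 9 holds semicolon-separated key=value pairs.
    attributes = dict(
        pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
    )
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


def ncrna_biotype(feature):
    """Return the GFF type as the biotype for ncRNA features, else None."""
    ftype = feature["type"]
    return ftype if ftype in NCRNA_TYPES else None


example = "chr1\tprovider\ttRNA\t100\t172\t.\t+\t.\tID=trna1;Name=tRNA-Ala"
feature = parse_gff3_line(example)
print(ncrna_biotype(feature))
```

Features whose type is not in the set (for example gene or CDS) return None here, leaving their biotype to be decided elsewhere.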

When importing genes from a GFF file, we check whether each one produces a valid translation, i.e. one without internal stop codons. If there are invalid translations, we first raise them with the data provider to determine whether they reflect errors or unannotated pseudogenes, and amend the GFF file accordingly. If the data provider believes the genes to be protein-coding, we can accommodate changes in either the nucleotide or amino acid sequence, or insert frameshifts. We aim to minimise these changes, because they add complexity when interpreting the data and because the role of Ensembl Genomes is to present, rather than modify, the data of others. If the status of a gene with an invalid translation remains unclear, it is assigned a "nontranslating CDS" biotype; no protein sequence is associated with such a gene, but it is otherwise displayed as a normal gene in Ensembl Genomes.
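The translation check described above can be sketched as follows. This is a minimal example, not the Ensembl Genomes implementation: it assumes the standard genetic code, a CDS sequence already spliced and in frame, and that a single terminal stop codon is acceptable. The function names and the biotype strings returned by provisional_biotype are hypothetical labels chosen for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(cds):
    """Translate an in-frame CDS; unknown codons become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )


def has_valid_translation(cds):
    """True if the translation contains no internal stop codons."""
    protein = translate(cds)
    if protein.endswith("*"):
        protein = protein[:-1]  # a terminal stop is expected, not an error
    return "*" not in protein


def provisional_biotype(cds):
    """Hypothetical labelling for genes whose status is still unresolved."""
    return "protein_coding" if has_valid_translation(cds) else "nontranslating_CDS"


print(provisional_biotype("ATGAAATAA"))
print(provisional_biotype("ATGTAAAAATAA"))
```

In practice the second case would first be raised with the data provider; the fallback label applies only if the gene's status remains unclear.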