Repeat feature annotation
If repeat data is present in INSDC when a genome is loaded, then those features are imported into Ensembl Genomes. For bacterial genomes, this is currently the only source of repeat data. For other divisions, a computational pipeline is additionally run, to annotate three types of repeat:
- Low-complexity regions (Dust [1])
- Tandem repeats (TRF [2])
- Complex repeats (RepeatMasker [3])
Annotating repeats with RepeatMasker requires a repeat library. In most cases, a species-specific library is not available, so the RepBase [4] database of eukaryotic repetitive elements is used. Where available, repeat libraries from the following sources are used, and combined where possible:
- TREP for Triticeae genomes.
- The MIPS Repeat Database and RepetDB for plant genomes.
- A rice repeat library from the Arizona Genomics Institute for rice genomes.
- VectorBase species use custom libraries, some of which are publicly available. Contact VectorBase for further details.
- WormBase species use custom libraries which are usually not made publicly available. Contact WormBase for further details.
Viewing and accessing repeat features
By default, repeat features are not displayed in the genome browser; to display them, use the "Configure this page" option. You can view all repeats, or a subset of repeats based on type.
The repeat annotations can be accessed programmatically using the Ensembl API. See the RepeatFeature and RepeatFeatureAdaptor documentation for further details.
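As an alternative illustration of programmatic access, the sketch below uses the Ensembl REST service rather than the API named above. It rests on several assumptions not stated in this document: that the server at https://rest.ensembl.org hosts the species of interest, that its overlap/region endpoint accepts feature=repeat, and that each returned record carries a "description" field naming the repeat. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter

# Assumption: the public Ensembl REST server (not named in this document).
REST_SERVER = "https://rest.ensembl.org"


def repeat_overlap_url(species, region, server=REST_SERVER):
    """Build the URL for repeat features overlapping a region such as '1:10000-20000'."""
    path = "/overlap/region/{}/{}".format(
        urllib.parse.quote(species),
        urllib.parse.quote(region, safe=":"),
    )
    return server + path + "?" + urllib.parse.urlencode({"feature": "repeat"})


def fetch_repeats(species, region):
    """Request the repeat features as JSON and return them as a list of dicts."""
    request = urllib.request.Request(
        repeat_overlap_url(species, region),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def count_by_description(features):
    """Tally features by their 'description' field (assumed to hold the repeat name)."""
    return Counter(f.get("description", "unknown") for f in features)
```

A call such as count_by_description(fetch_repeats("arabidopsis_thaliana", "1:10000-20000")) would then summarise the repeats in that region by name; the species and region here are only examples.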
For Ensembl Plants species only, tandem repeats annotated by the TRF program are not used when soft-masking or hard-masking the genome sequences.
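To make the masking terms concrete, here is a minimal sketch of the usual convention: soft-masking lowercases repeat bases, hard-masking replaces them with N. It is an illustration only, not the pipeline's own code. The function name, the (start, end, source) tuple layout and the source labels "repeatmasker" and "trf" are hypothetical; coordinates are assumed to be 1-based and inclusive.

```python
def mask_sequence(sequence, repeats, hard=False, skip_sources=()):
    """Mask repeat intervals in a sequence.

    repeats: iterable of (start, end, source) tuples, 1-based inclusive.
    hard: if True, replace repeat bases with 'N'; otherwise lowercase them.
    skip_sources: sources whose repeats are left unmasked.
    """
    bases = list(sequence)
    for start, end, source in repeats:
        if source in skip_sources:
            continue
        # Convert to 0-based indices and clip to the sequence bounds.
        for i in range(max(start, 1) - 1, min(end, len(bases))):
            bases[i] = "N" if hard else bases[i].lower()
    return "".join(bases)


# Hypothetical repeat annotations on a 10-base sequence.
repeats = [(3, 6, "repeatmasker"), (8, 9, "trf")]
soft = mask_sequence("ACGTACGTAC", repeats)
hard_no_trf = mask_sequence("ACGTACGTAC", repeats, hard=True, skip_sources=("trf",))
```

Passing skip_sources=("trf",) mimics the Ensembl Plants behaviour described above, where tandem repeats from TRF do not contribute to the masked sequence.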
References
- [1] Morgulis A et al. (2006) A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J Comput Biol. 13:1028-40
- [2] Benson G (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27:573-580
- [3] Smit AFA, Hubley R, Green P (1996-2010) RepeatMasker Open-3.0 http://www.repeatmasker.org
- [4] Jurka J et al. (2005) Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res. 110:462-467