For rapid bulk download of files, the Ensembl FTP site is available as an endpoint in the Globus Online system. To access the data, you need to sign up for an account with Globus, install the Globus Connect Personal software and set up a personal endpoint to download the data. The Ensembl data is hosted at the EMBL-EBI endpoint called “Shared EMBL-EBI public endpoint”. Data from the Ensembl FTP site can then be found under the "/gridftp/ensemblorg/pub" directory within that endpoint.
If you do not have access to git, you can obtain our latest API code as a gzipped tarball:
Note: the API version needs to match the version of the databases you are accessing, so please use git to obtain a previous version if you are querying older databases.
Entire databases can be downloaded from our FTP site in a variety of formats. Please be aware that some of these files can run to many gigabytes of data.
Single species data
Data files containing metadata for Ensembl Genomes from release 15 onwards can be found in the root directory or the appropriate division directory of each release, e.g. /current/protists/.
The following files are provided:
- species.txt (or e.g. species_EnsemblProtists.txt) - simple tab-separated file containing basic information about each genome
- species_metadata.json (or e.g. species_metadata_EnsemblProtists.json) - full metadata about each genome in JSON format, including comparative analyses, sequence region names etc.
- species_metadata.xml (or e.g. species_metadata_EnsemblProtists.xml) - full metadata about each genome in XML format, including comparative analyses, sequence region names etc.
- uniprot_report.txt (or e.g. uniprot_report_EnsemblProtists.txt) - specialised tab-separated file containing information about mapping of genome to UniProtKB
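As a rough illustration of working with the tab-separated metadata file, the sketch below reads a species.txt-style table into a list of dictionaries. It assumes that the first line holds the column names and may be prefixed with "#"; this layout is an assumption, and the column names in the sample are made up for the example, so check a real file before relying on them.

```python
import csv
import io


def read_species_table(handle):
    """Parse a tab-separated table whose first line holds the column names.

    Assumption: the header line may start with '#'; that prefix is stripped.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        rows.append(dict(zip(header, fields)))
    return rows


# Hypothetical sample; the real column names may differ.
sample = "#name\tdivision\nExample species\tEnsemblProtists\n"
genomes = read_species_table(io.StringIO(sample))
print(genomes[0]["division"])
```

For a downloaded file, pass an open file handle instead of the in-memory sample, e.g. `open("species.txt", newline="")`.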
To facilitate storage and download, all databases are compressed with GNU Zip (gzip, *.gz).
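Because some files run to many gigabytes, it can be convenient to read a gzip-compressed dump as a stream instead of unpacking it to disk first. The following is a minimal sketch using Python's standard gzip module; the helper name `iter_gzip_lines` and the file name in the usage comment are hypothetical, and the files are assumed to be plain text once decompressed.

```python
import gzip


def iter_gzip_lines(path):
    """Yield the lines of a gzip-compressed text file one at a time.

    The file is decompressed on the fly, so it is never held in memory
    or written to disk in uncompressed form.
    """
    # "rt" opens the compressed stream in text mode.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


# Usage (hypothetical file name):
# for line in iter_gzip_lines("example_dump.txt.gz"):
#     print(line)
```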
About the data
The following types of data dumps are available on the FTP site.
- FASTA sequence databases of Ensembl gene, transcript and protein model predictions. Since the FASTA format does not permit sequence annotation, these database files are mainly intended for use with local sequence similarity search algorithms. Each directory has a README file with a detailed description of the header line format and the file naming conventions.
- Masked and unmasked genome sequences associated with the assembly (contigs, chromosomes etc.).
- The header line in a FASTA dump file containing DNA sequence consists of the following attributes: coord_system:version:name:start:end:strand. This coordinate-system string is used in the Ensembl API to retrieve slices with the SliceAdaptor.
- Coding sequences for Ensembl or ab initio predicted genes.
- cDNA sequences for Ensembl or ab initio predicted genes.
- Protein sequences for Ensembl or ab initio predicted genes.
- Non-coding RNA gene predictions.
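The coordinate-system string found in DNA FASTA headers (coord_system:version:name:start:end:strand) can be split into its six fields with a few lines of code. The sketch below is only an illustration: the `SliceLocation` type and `parse_location` function are hypothetical names, not part of the Ensembl API, and the sample values are invented. It also assumes that the string has already been extracted from the header line; consult the README in each directory for the full header layout.

```python
from typing import NamedTuple


class SliceLocation(NamedTuple):
    """The six colon-separated fields of a coordinate-system string."""
    coord_system: str
    version: str
    name: str
    start: int
    end: int
    strand: int


def parse_location(location: str) -> SliceLocation:
    """Split 'coord_system:version:name:start:end:strand' into typed fields."""
    parts = location.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(parts)}")
    coord_system, version, name, start, end, strand = parts
    return SliceLocation(
        coord_system, version, name, int(start), int(end), int(strand)
    )


# Illustrative values only.
loc = parse_location("chromosome:GRCh38:1:1:1000:1")
print(loc.coord_system, loc.name, loc.start, loc.end)
```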
- Annotated sequence
- Flat files allow more extensive sequence annotation by means of feature tables and thus contain the genome sequence as annotated by the automated Ensembl genome annotation pipeline. Each nucleotide sequence record in a flat file represents a 1Mb slice of the genome sequence. Flat files are broken into chunks of 1000 sequence records for easier downloading.
- All Ensembl MySQL databases are available in text format as are the SQL table definition files. These can be imported into any SQL database for a local installation of a mirror site. Generally, the FTP directory tree contains one directory per database. For more information about these databases and their Application Programming Interfaces (or APIs) see the API section.
- Gene sets for each species. These files include annotations of both coding and non-coding genes. This file format is described here.
- GFF3 provides access to all annotated transcripts which make up an Ensembl gene set. This file format is described here.
- EMF flatfile dumps (comparative data)
Alignments of resequencing data are available for several species as Ensembl Multi Format (EMF) flatfile dumps. The accompanying README file describes the file format.
The same format is also used to dump whole-genome multiple alignments, as well as gene-based multiple alignments and the phylogenetic trees used to infer Ensembl orthologues and paralogues. These files are available in the ensembl_compara database, which can be found in the mysql directory.
- MAF (comparative data)
MAF files are provided for all pairwise alignments containing human (GRCh38), and all multiple alignments. The MAF file format is described here.
- GVF (variation data)
- GVF (Genome Variation Format) is a simple tab-delimited format derived from GFF3 for variation positions across the genome. There are GVF files for different types of variation data (e.g. somatic variants, structural variants etc.). For more information see the "README" files in the GVF directory.
- VCF (variation data)
- VCF (Variant Call Format) is a text file format containing meta-information lines, a header line, and then data lines each containing information about a position in the genome. This file format can also contain genotype information on samples for each position. More details about the format and its specifications are available here.
- VEP (variation data)
- Compressed text files (called "cache files") used by the Variant Effect Predictor tool. More information about these files is available here.
- BED format files (comparative data)
Constrained elements calculated using GERP are available in BED format. For more information see the accompanying README file.
BED format is a simple line-based format. The first 3 mandatory columns are:
- chromosome name (may start with 'chr' for compliance with UCSC)
- start position. This is a 0-based position
- end position.
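Since the BED start position is 0-based, it usually needs adjusting before it is compared with 1-based coordinates. The sketch below reads the three mandatory columns and converts them to a 1-based, inclusive interval. It assumes the standard UCSC BED convention in which the end position is not included in the interval, so only the start changes; the function name `parse_bed_line` is hypothetical, and removing the 'chr' prefix is a simplification that will not reconcile every naming difference between resources.

```python
def parse_bed_line(line, strip_chr=True):
    """Return (chromosome, start, end) as a 1-based, inclusive interval.

    Assumption: the input follows the UCSC BED convention of a 0-based
    start and an end that is excluded from the interval.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if strip_chr and chrom.startswith("chr"):
        chrom = chrom[3:]  # drop the UCSC-style prefix
    # A 0-based, end-exclusive [start, end) interval covers the same
    # bases as the 1-based, inclusive interval [start + 1, end].
    return chrom, start + 1, end


# Illustrative line: bases 100 to 200 in 1-based coordinates.
print(parse_bed_line("chr1\t99\t200\n"))
```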
The entire Ensembl API is packaged as a single gzipped TAR file, which is updated daily.