Short-read sequence alignment

The Sequence Read Archive (SRA) is an international public archive of raw short-read sequencing data from next-generation sequencing platforms, established under the guidance of the International Nucleotide Sequence Database Collaboration (INSDC). More information about the SRA metadata and data storage model is available on the EBI website.

Use of SRA in Ensembl Genomes

In Ensembl Genomes, expression data from various studies are mapped to the genome, and the resulting BAM files are configured for display in the genome browser. Re-sequencing data from the SRA are used for SNP calling and displayed as variation data in Ensembl Genomes.

Generation of alignments

Reads from the SRA are downloaded from the European Nucleotide Archive (ENA) in FASTQ format and mapped to the genome using GSNAP or the Burrows-Wheeler Aligner (BWA). The alignments are stored in Sequence Alignment/Map (SAM) format, and SAMtools is then used to convert them into the binary equivalent of that format (BAM).
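
The mapping and conversion steps can be sketched in Python as a list of external commands. This is a minimal illustration, not the production Ensembl Genomes pipeline: the file names are hypothetical, the choice of the `bwa mem` algorithm and of a coordinate-sorted, indexed BAM are assumptions (the text above only names BWA and SAMtools), and GSNAP is not shown.

```python
import subprocess
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    argv: List[str]          # command and arguments
    stdout: Optional[str]    # file that captures standard output, if any


def alignment_steps(reference: str, fastq: str, prefix: str) -> List[Step]:
    """Build the commands that map one FASTQ file and produce an indexed BAM.

    All paths are hypothetical placeholders supplied by the caller.
    """
    sam = prefix + ".sam"
    bam = prefix + ".bam"
    return [
        # Build the BWA index for the reference genome.
        Step(["bwa", "index", reference], None),
        # Map the reads; bwa mem writes SAM to standard output.
        Step(["bwa", "mem", reference, fastq], sam),
        # Sort by coordinate and write BAM (format taken from the .bam suffix).
        Step(["samtools", "sort", "-o", bam, sam], None),
        # Index the BAM so that a browser can fetch regions on demand.
        Step(["samtools", "index", bam], None),
    ]


def run_steps(steps: List[Step]) -> None:
    """Run each step in order, stopping at the first failure."""
    for step in steps:
        if step.stdout is None:
            subprocess.run(step.argv, check=True)
        else:
            with open(step.stdout, "w") as out:
                subprocess.run(step.argv, check=True, stdout=out)
```

Calling `run_steps(alignment_steps("genome.fa", "reads.fastq", "sample"))` would require `bwa` and `samtools` to be installed and on the path. Building the commands separately from running them makes it possible to inspect or log them first.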

SNP Calling

BAM files obtained by mapping re-sequencing data with the above method are also used for SNP calling. SAMtools calls sequence variants from the mapped data and stores them in Variant Call Format (VCF). The VCF file is then imported into an Ensembl Variation database.
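
To show what a VCF file contributes to the import, the following sketch reads the eight fixed, tab-separated columns of each VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) into a small record. It is an illustration only and is not the Ensembl import code; the `Variant` type and the function names are hypothetical, and per-sample genotype columns are ignored.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str
    pos: int                  # 1-based position on the reference
    ref: str                  # reference allele
    alts: List[str]           # one or more alternate alleles
    qual: Optional[float]     # None when the file gives "."
    info: Dict[str, str]      # INFO key/value pairs; flags map to ""


def parse_vcf_line(line: str) -> Variant:
    """Parse one VCF data line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual, _filter, info = fields[:8]
    info_map: Dict[str, str] = {}
    if info != ".":
        for entry in info.split(";"):
            # Entries look like KEY=VALUE, or a bare KEY for a flag.
            key, _, value = entry.partition("=")
            info_map[key] = value
    return Variant(
        chrom=chrom,
        pos=int(pos),
        ref=ref,
        alts=alt.split(","),
        qual=None if qual == "." else float(qual),
        info=info_map,
    )


def read_variants(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield a Variant for every data line, skipping headers and blank lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)
```

With an open file handle, `for variant in read_variants(handle): ...` walks through the calls one at a time, so even a large VCF file never has to be held in memory.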

Re-submission to ENA

The BAM and VCF files obtained using the above method are then resubmitted to the European Nucleotide Archive as an 'Analysis' object that references the original Study and the Assembly version of the genome to which the data were mapped.
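
As a rough sketch of what such an 'Analysis' object might look like for a BAM file, the following code assembles a small XML document with the Python standard library. The element and attribute names are assumed from the ENA analysis XML schema and should be checked against its current version. The function name and all accessions, file names and checksums passed to it are hypothetical placeholders. The result is not a complete, valid submission, and a VCF file would need a different analysis type, which is not shown.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def build_alignment_analysis(
    alias: str,
    title: str,
    study_accession: str,
    assembly_accession: str,
    files: Iterable[Tuple[str, str, str]],
) -> str:
    """Return analysis XML linking files to a Study and an Assembly.

    `files` holds (filename, filetype, md5) tuples. Element names are
    assumptions based on the ENA analysis schema, not verified here.
    """
    root = ET.Element("ANALYSIS_SET")
    analysis = ET.SubElement(root, "ANALYSIS", alias=alias)
    ET.SubElement(analysis, "TITLE").text = title
    # Reference to the original Study.
    ET.SubElement(analysis, "STUDY_REF", accession=study_accession)
    # Reference to the Assembly version the reads were mapped to.
    analysis_type = ET.SubElement(analysis, "ANALYSIS_TYPE")
    alignment = ET.SubElement(analysis_type, "REFERENCE_ALIGNMENT")
    assembly = ET.SubElement(alignment, "ASSEMBLY")
    ET.SubElement(assembly, "STANDARD", accession=assembly_accession)
    # The data files that make up the analysis.
    files_element = ET.SubElement(analysis, "FILES")
    for filename, filetype, md5 in files:
        ET.SubElement(
            files_element,
            "FILE",
            filename=filename,
            filetype=filetype,
            checksum_method="MD5",
            checksum=md5,
        )
    return ET.tostring(root, encoding="unicode")
```

Keeping the Study and Assembly accessions as explicit arguments mirrors the two references described above, so the caller has to supply both for every submitted file set.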

Data Visualisation

The variation data obtained from mapping SRA data can be visualised using the Ensembl Variation Browser and API. The expression data mapped in BAM format can be visualised in the Ensembl Genomes browser as customised tracks.