Microarray probeset mapping
Inclusion of microarray data
Microarray data are added to Ensembl Genomes after consultation with the relevant scientific user communities (e.g. WormBase for worms, VectorBase for arthropod disease vectors) to identify experiments of general interest within each community. Suggestions for new data sets are welcome; wherever possible, data will be derived from publicly accessible sources such as ArrayExpress.
Microarray probe mapping
Microarray probe mapping is conducted for non-VectorBase species as described by the Ensembl microarray probe mapping pipeline. The following modifications have been made for VectorBase-derived species.
Step one: Genome/transcript alignment
Probes are mapped using two different programs according to probe length: short probes are mapped using Bowtie2, while longer probes (>200bp) are mapped using Exonerate. Probes are processed differently according to length because large numbers of probes must be mapped rapidly across many different species. The majority of VectorBase probes are short and are mapped with Bowtie2, which is significantly faster than Exonerate. Longer probes (>200bp) are often derived from cDNA/EST sequences and span multiple exon/intron junctions; these probes are mapped with Exonerate because it detects such junctions accurately and sensitively. Probes are mapped against both the genomic sequence and the transcripts. Where high quality hits are found on a transcript but not on the genomic sequence, the probe alignment is projected from the transcript sequence back into genomic coordinates.
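The length-based routing and the transcript-to-genome projection described above can be illustrated with a minimal Python sketch. This is not the pipeline's own code: the function names, the exon representation, and the forward-strand, 1-based inclusive coordinate convention are all assumptions made for illustration. Treating a probe of exactly 200bp as short is also an assumption about the boundary case.

```python
from typing import List, Tuple

# Probes longer than this are routed to Exonerate (the >200bp rule in the text).
LONG_PROBE_THRESHOLD = 200

# Hypothetical exon representation: (genomic_start, genomic_end),
# 1-based inclusive, forward strand, listed in 5'->3' order.
Exon = Tuple[int, int]


def choose_aligner(probe_sequence: str) -> str:
    """Return the aligner name for a probe based on its length."""
    if len(probe_sequence) > LONG_PROBE_THRESHOLD:
        return "exonerate"
    return "bowtie2"


def project_to_genome(exons: List[Exon], tx_start: int, tx_end: int) -> List[Exon]:
    """Project a transcript interval (1-based inclusive) onto genomic blocks.

    A hit that crosses an exon/exon junction yields one block per exon.
    """
    blocks = []
    offset = 0  # transcript bases consumed by earlier exons
    for g_start, g_end in exons:
        exon_len = g_end - g_start + 1
        exon_tx_start = offset + 1
        exon_tx_end = offset + exon_len
        # Part of the hit that falls inside this exon, in transcript coordinates.
        lo = max(tx_start, exon_tx_start)
        hi = min(tx_end, exon_tx_end)
        if lo <= hi:
            blocks.append((g_start + lo - exon_tx_start,
                           g_start + hi - exon_tx_start))
        offset += exon_len
    return blocks
```

For example, with two 50bp exons at 100-149 and 200-249, a hit covering transcript positions 41-60 is split into the genomic blocks 140-149 and 200-209. Reverse-strand transcripts would need the exon order and offsets mirrored, which this sketch does not handle.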
Alignments are stored as described by the Ensembl microarray probe mapping pipeline, but only probes that match 10 or fewer unique locations in the genome are stored.
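The storage rule above (keep only probes with 10 or fewer unique genomic locations) could be sketched as follows. The hit tuple layout and the function name are hypothetical, and this sketch treats a location as unique when its sequence region, start, end, and strand all differ from every other hit for that probe; the pipeline's own definition of uniqueness may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

MAX_LOCATIONS = 10  # threshold stated in the text

# Hypothetical hit layout: (probe_id, seq_region, start, end, strand)
Hit = Tuple[str, str, int, int, int]
Location = Tuple[str, int, int, int]


def filter_promiscuous(hits: Iterable[Hit],
                       max_locations: int = MAX_LOCATIONS) -> List[Hit]:
    """Keep hits only for probes that map to at most max_locations unique places."""
    locations: Dict[str, Set[Location]] = defaultdict(set)
    for probe_id, region, start, end, strand in hits:
        # A set collapses duplicate reports of the same location.
        locations[probe_id].add((region, start, end, strand))

    kept: List[Hit] = []
    for probe_id, locs in locations.items():
        if len(locs) <= max_locations:
            kept.extend((probe_id, *loc) for loc in sorted(locs))
    return kept
```

Note that a probe exceeding the threshold is dropped entirely, rather than being truncated to its first 10 hits.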
Step two: Ensembl transcript annotation
The association of probes and probe sets with transcripts is performed as described in the Ensembl microarray probe mapping pipeline, except that for VectorBase species the UTR annotation of the transcript is followed strictly and no attempt is made to extend the UTR region based on the mean/modal UTR statistics of the transcripts.
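The effect of following the annotated UTR strictly can be shown with a small, deliberately simplified sketch. It assumes a forward-strand transcript and a single extension applied downstream of the annotated 3' end. The function name, the parameters, and the extension value used in the example are hypothetical, and the real pipeline applies further criteria that are not modelled here.

```python
def probe_overlaps_transcript(probe_start: int, probe_end: int,
                              tx_start: int, tx_end: int,
                              utr_extension: int = 0) -> bool:
    """Return True if a probe feature overlaps the transcript span.

    Coordinates are 1-based inclusive on the forward strand. With
    utr_extension=0 (the strict behaviour described for VectorBase
    species) only the annotated span, including annotated UTR, counts.
    """
    return probe_start <= tx_end + utr_extension and probe_end >= tx_start


# A probe lying just downstream of the annotated 3' end:
strict = probe_overlaps_transcript(1050, 1074, 100, 1000)              # no extension
extended = probe_overlaps_transcript(1050, 1074, 100, 1000,
                                     utr_extension=100)                # hypothetical value
```

Under the strict setting the probe is not associated with the transcript, whereas a hypothetical 100bp extension would have captured it.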
Data access
The probe mappings and transcript annotations are stored in Ensembl functional genomics (funcgen) databases and can be programmatically accessed using the funcgen API. POD documentation is available for the following modules:
- Bio::EnsEMBL::Funcgen::Array
- Bio::EnsEMBL::Funcgen::Probe
- Bio::EnsEMBL::Funcgen::ProbeSet
- Bio::EnsEMBL::Funcgen::ProbeFeature
Probe and ProbeSet level transcript annotations are stored in the funcgen databases and can be accessed using the API, or the relevant Ensembl Genomes BioMart.