Ensembl Pipeline - Operations Overview

Glossary

Important terms which will be used throughout the document:

"The Database"	The Ensembl database used by the pipeline.
Tag	The 16- or 17-bp sequence read by the Illumina GA, originally cut adjacent to a CATG or GATC site. Depending on the context, "tag" may refer to the sequence with or without the restriction site (including it increases tag length to 20 or 21 bp, respectively).
Site	The actual 20- or 21-bp tag footprint, including the CATG or GATC.
Class	A number from 1-27, assigned to each tag based on several factors. Underpins the assignment of tags to tag tables and expedites annotation, but is not especially useful for postprocessing. More detail below.
Canonical	Tags which would normally arise from the known transcriptome (exons, splice junctions).
Noncanonical	Tags which would not normally arise from the known transcriptome (introns, exon-intron or exon-intergenic overlaps).
Genic	Tags which still come from a gene, but whose canonical/noncanonical status cannot be determined, for various reasons.
Intergenic	Tags which do not overlap any annotated gene, including Ensembl's "blessed" predictions.
Ambiguous	Tags which come from areas of SAME-STRAND gene overlap (which gene generated the tag?).
Silent Repeat	Tags which come from multiple sites, but all within the same gene (can be assigned to one gene).
Repeat	Tags which come from multiple sites, NOT all within the same gene (cannot be assigned to one gene).

3 Steps

Once the database has been installed, the process occurs in 3 steps:

Tag Table Generation (Pipeline "Step 1")
Alignment
Alignment Interpretation (Pipeline "Step 2")

Optionally, some postprocessing may be desired, depending on what the reads were generated for.

Step 1: Tag Table Generation

The Nutshell

The Ensembl database is downloaded and installed. A minimal installation can be used if desired (see Details).
The generator script is edited to reflect the user's database connection settings, and a few other things.
The generator script is run, producing tag tables, reports, and new mysql tables.
Reports are reviewed to ensure all is in order.
New mysql tables are uploaded to the database. If no further table sets will be produced, the original tables can now be dropped (but the database cannot be renamed).
Tag tables are used for alignment.

For editing instructions and runtime parameters, see the User's Guide.

PLEASE NOTE: Processing may require a significant amount of RAM. Memory usage is roughly 10GB per gigabase, plus overhead, so expect a 3 gigabase genome to absorb ~33GB of RAM during processing. A 1 gigabase genome will take ~16GB.
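
As a rough planning aid, the two figures quoted above can be joined by a straight line. The sketch below is only an interpolation through those two data points (~16GB for 1 gigabase, ~33GB for 3 gigabases). The function name and the assumption that memory scales linearly with genome size are illustrative and are not part of the pipeline.

```python
# Hypothetical helper: linear interpolation through the two RAM figures
# quoted in this document. Not part of the pipeline itself.

def estimate_step1_ram_gb(genome_gigabases: float) -> float:
    """Estimate Step 1 RAM (in GB) for a genome of the given size (in gigabases)."""
    x1, y1 = 1.0, 16.0  # ~16GB for a 1 gigabase genome (from the text)
    x2, y2 = 3.0, 33.0  # ~33GB for a 3 gigabase genome (from the text)
    slope = (y2 - y1) / (x2 - x1)  # GB of RAM per additional gigabase
    return y1 + slope * (genome_gigabases - x1)


if __name__ == "__main__":
    for size in (1.0, 2.0, 3.0):
        print(f"{size} gigabases: ~{estimate_step1_ram_gb(size):.1f}GB")
```

With only two data points behind it, treat the result as a ballpark figure and leave headroom.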

The Tag Tables

Step 1 will generate up to 10 tag tables. Six are "general" tag tables, based on tag class. Definitions are given in the glossary above.

Canonical
Noncanonical
Genic
Intergenic
Ambiguous
Repeat

Another four are "specific", reserved for specific transcriptional loci, and contain unique tags of any class -- i.e. canonicals, noncanonicals, and genics get pooled together.

Mitochondrial
Chloroplast
Ribosomal
ncRNA

For the specific tables, precedence is in decreasing order (mitochondrial takes precedence over chloroplast, chloroplast takes precedence over ribosomal, and ribosomal takes precedence over ncRNA).
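
The precedence rule can be sketched as a simple ordered lookup. This is an illustration only: the function name and the idea of passing in a set of candidate tables are assumptions, not the generator script's actual interface.

```python
# Order taken from the text: earlier entries take precedence over later ones.
SPECIFIC_TABLE_PRECEDENCE = ["Mitochondrial", "Chloroplast", "Ribosomal", "ncRNA"]


def pick_specific_table(candidates):
    """Return the highest-precedence specific table among `candidates`, or None.

    `candidates` is any collection of specific-table names a tag qualifies for.
    """
    for table in SPECIFIC_TABLE_PRECEDENCE:
        if table in candidates:
            return table
    return None


if __name__ == "__main__":
    # A tag qualifying for both Ribosomal and Mitochondrial goes to Mitochondrial.
    print(pick_specific_table({"Ribosomal", "Mitochondrial"}))
```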

Depending on the organism, available annotations, and exclusion level (see Details), some tables may not be created.

A readme is generated for every set of tag tables, giving the run parameters and some descriptive stats.

The Mysql Tables

Step 1 produces a gene data table (solexa_gene_data) and two to three tables for the particular tag table set you are creating. The gene data table is always the same, regardless of run parameters, so only one copy needs to exist in the database. Each mysql table set comes with its own loader .sql file.

Tables in an annotation set have systematic names of the format:

solexa_{Enzyme}_{TagLength}_{C,S,CS}_X{0,1,2}_{TableType}

Parameters being defined as:

Enzyme	NlaIII, DpnII (user can add more).
TagLength	17, 16, or other user-specified length.
C,S,CS	Tags taken from Chromosomes only, Scaffolds only, or both.
X{0,1,2}	Exclusion level (X0, X1, or X2).

The three values of TableType are:

Index	Minimal data about a tag index (indexes allow rapid annotation of mapped reads).
Locations	Minimal data for each genomic site where the tag was found.
Rescues	May or may not be created -- tag-to-family mappings for rescued repeat tags.

At least two of these mysql tables will be generated for every run: Index and Locations. Creation of a Rescues table depends on whether the repeat rescue feature was engaged and had any success.

An example set, using chromosomes only, NlaIII, exclusion level 2:

solexa_gene_data
solexa_NlaIII_17_C_X2_index
solexa_NlaIII_17_C_X2_locations
solexa_NlaIII_17_C_X2_48_rescues

The rescue table has an extended systematic name, where {TableType} becomes {ComparaVersion}_{TableType}. This allows for multiple rescue versions per table set. In the above example, Compara 48 was used.
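
To make the naming scheme concrete, here is a minimal sketch that assembles table names from the parameters described above. The function and argument names are hypothetical, and the lowercase table types follow the example set rather than any confirmed pipeline code.

```python
def table_name(enzyme, tag_length, source, exclusion_level, table_type,
               compara_version=None):
    """Build a systematic table name such as solexa_NlaIII_17_C_X2_index."""
    if source not in ("C", "S", "CS"):
        raise ValueError("source must be 'C', 'S', or 'CS'")
    if exclusion_level not in (0, 1, 2):
        raise ValueError("exclusion level must be 0, 1, or 2")
    if table_type not in ("index", "locations", "rescues"):
        raise ValueError("unknown table type")

    parts = ["solexa", enzyme, str(tag_length), source, f"X{exclusion_level}"]
    if table_type == "rescues":
        # Rescue tables carry the Compara version ahead of the table type.
        if compara_version is None:
            raise ValueError("rescues tables need a Compara version")
        parts.append(str(compara_version))
    parts.append(table_type)
    return "_".join(parts)


if __name__ == "__main__":
    print(table_name("NlaIII", 17, "C", 2, "index"))
    print(table_name("NlaIII", 17, "C", 2, "rescues", compara_version=48))
```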

Creation of a rescues table requires four things:

An Ensembl-supported organism
An Ensembl Compara database (see Details for minimal installation instructions)
Exclusion Level 1 or 2
At least one successfully rescued repeat tag

Repeat rescue is automatically attempted if the first three conditions are met.
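
The conditions above amount to two checks: whether rescue is attempted, and whether a rescues table results. The sketch below expresses them as boolean functions. All names are hypothetical and chosen for illustration.

```python
def rescue_attempted(ensembl_supported: bool, compara_available: bool,
                     exclusion_level: int) -> bool:
    """Rescue is attempted when the first three conditions are met."""
    return ensembl_supported and compara_available and exclusion_level in (1, 2)


def rescues_table_created(ensembl_supported: bool, compara_available: bool,
                          exclusion_level: int, rescued_tag_count: int) -> bool:
    """A rescues table additionally needs at least one rescued repeat tag."""
    attempted = rescue_attempted(ensembl_supported, compara_available,
                                 exclusion_level)
    return attempted and rescued_tag_count >= 1


if __name__ == "__main__":
    # Rescue is attempted, but no table results if nothing was rescued.
    print(rescue_attempted(True, True, 2), rescues_table_created(True, True, 2, 0))
```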

The Reports

A large number of reports are generated to allow easy validation of almost any aspect of tag generation, classification, and assignment. The reports fall into two groups. The first is the "pre-fasta" reports, which give details for each tag found in each tag table (these have the same names as the tag table fastas). The second is everything else, which falls into several categories:

1. General reports:

query_dump.txt	Dump of the Exon-Transcript-Gene relationship query
chromosome_data.txt	Lengths, genes, and restriction site counts for each chromosome (and/or scaffold)
complete_locations_report.txt	Full report on every location for every tag found in the genome and transcriptome
gene_tag_report.txt	Subset of the above; gene-associated tags only
tagless_3'_exons.txt	A list of 3' exons which cannot be tagged, FYI

2. Special gene reports (subsets of the solexa_gene_data table), for checking of special gene assignments:

mito_genes.txt	All mitochondrial-genome, mitochondrion-associated and tRNA genes
chloro_genes.txt	All chloroplast-genome and chloroplast-associated genes
ribo_genes.txt	All rRNA and ribosomal protein genes
ncRNA_genes.txt	All remaining noncoding RNA genes (not associated with above groups)

3. Repeat rescue reports: (only generated if repeat rescues were attempted)

compara_query_dump.txt	Dump of the Compara gene-family relationships query
rescue_report.txt	Results for every tag for which rescue was attempted
rescue_successful.txt	Condensed results for tags that were rescued
rescue_unsuccessful.txt	Condensed results for tags that were not rescued

Alignment of Reads

Tag tables replace the reference genome in the alignment process. Tables should work with any alignment software, but currently only Eland "results" output can be interpreted by the second stage of the pipeline.

Step 2: Alignment Interpretation

The Nutshell

Eland output is acquired.
The interpreter script is edited to reflect the user's database connection settings, and a few other things.
The interpreter script is run, producing lanewise tag count tables and various reports.
Reports are reviewed to ensure all is in order.
Tag count tables are sent to postprocessing / statistical analysis.

For editing instructions and runtime parameters, see the User's Guide.

PLEASE NOTE: Processing may require a significant amount of RAM. Highest observed memory usage is ~13.5GB for a 51-million read run. A 26-million read run still takes ~11GB.

The Tag Count Tables

Tag count tables are the main product of Step 2, giving lanewise read counts and gene associations for each tag. Every tag that made it through alignment will be reported in some table, no matter what the condition.

There are five main tables:

genic_table.txt	Tags associated with one gene
intergenic_table.txt	Single-site tags associated with no genes
ambig_table.txt	Single-site tags associated with >1 genes
repeat_table.txt	Multiple-site tags associated with >1 genes (or no genes)
annotations_table.txt	Annotations for every gene encountered in the tables (including those below).

The above tables are limited to tags with 2 or more reads (for the entire experiment). So-called "singleton" tags are reported in a separate set of tables:

genic_singletons.txt
intergenic_singletons.txt
ambig_singletons.txt
repeat_singletons.txt
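
The routing of a tag to one of the eight count tables can be sketched from the descriptions above. This reads the table descriptions literally (site count, gene count, total reads) and is not the interpreter script's actual logic, which may weigh other factors such as tag class or rescues. The function and argument names are hypothetical.

```python
def count_table_for(n_sites: int, n_genes: int, total_reads: int) -> str:
    """Pick the output file for a tag from its site count, gene count, and reads."""
    if n_sites < 1 or n_genes < 0 or total_reads < 1:
        raise ValueError("need n_sites >= 1, n_genes >= 0, total_reads >= 1")

    if n_genes == 1:
        # One gene, whether single-site or a silent repeat within that gene.
        category = "genic"
    elif n_sites == 1:
        category = "intergenic" if n_genes == 0 else "ambig"
    else:
        # Multiple sites with no genes or with more than one gene.
        category = "repeat"

    # Main tables hold tags with 2 or more reads across the whole experiment.
    suffix = "table" if total_reads >= 2 else "singletons"
    return f"{category}_{suffix}.txt"


if __name__ == "__main__":
    print(count_table_for(n_sites=1, n_genes=2, total_reads=7))
```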

Last are the miscellaneous tables, which contain 'FYI' data or material for troubleshooting / further investigation:

adaptor_tags.txt	NM tags which were found to contain portions of the 3' sequencing adaptor.
processed_NMs.txt	"Clean" NM tags (no adaptor pieces), ranked by read count.
mapping_errors.txt	Mapping error tags, where off-center alignment occurred within the tag tables. Most of these are defective reads anyway, but all data from the alignment run makes it into a report one way or the other.

The Reports

The additional reports:

code_summary.txt	A comprehensive breakdown of Eland read-code counts and tag table hits. Numbers are reported both readwise and tagwise, using filtered and unfiltered reads, allowing inspection of how the alignments fared, what areas were most impacted by filtering, what types and percentages of reads were mapped to what table, and more.
tables_README.txt	A short file capturing runtime parameters, some descriptive stats, and a rescue summary (where applicable).

The Track Files

Track files give tag positions and read counts in Log10 TPM (tags-per-million). Since tags with < 10 TPM will appear below the line in a .wig file, both strands cannot be accurately represented in one track, so W and C appear in independent files. The raw tracks can be very large, so .bz2 versions are also created automatically. Here ExpName is the experiment name, a user-defined runtime parameter.

{ExpName}_W.wig
{ExpName}_C.wig
{ExpName}_W.wig.bz2
{ExpName}_C.wig.bz2

Also note that repeat tags are excluded, and read counts from silent repeat tags are divided equally among their respective sites.
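
As an illustration of how a per-site track value could be derived, the sketch below splits a tag's reads equally among its sites and converts the result to Log10 TPM. The function name is hypothetical, and the choice of denominator (here simply a total read count passed in by the caller) is an assumption, since the document does not say which total the pipeline uses.

```python
import math


def site_log10_tpm(tag_reads: int, n_sites: int, total_reads: int) -> float:
    """Log10 tags-per-million for one site of a tag.

    Reads from a silent repeat tag (n_sites > 1) are divided equally
    among its sites. A single-site tag uses n_sites = 1.
    """
    if tag_reads < 1 or n_sites < 1 or total_reads < 1:
        raise ValueError("all arguments must be positive")
    reads_per_site = tag_reads / n_sites
    tpm = reads_per_site * 1_000_000 / total_reads
    return math.log10(tpm)


if __name__ == "__main__":
    # A silent repeat tag with 2000 reads over 2 sites, in a 1-million-read run.
    print(site_log10_tpm(2000, 2, 1_000_000))
```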

As the tracks are in .wig format, they are intended for viewing in UCSC and are not Ensembl-compatible. Ensembl-ready, LDAS-type track files (similar to GFF) are in development.

Postprocessing

Some Ideas

Check if normalization is required: Convert values to TPM and look at housekeeper behavior across lanes.
Assess differential expression in genic reads (if your experiment involved 2 or more conditions).
BLAST the high-copy-count NM tags.
Look for read clusters in intergenic space.
Inspect gene prediction sets for possible supporting evidence.
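
For the normalization check in the first idea, a minimal sketch: convert one housekeeper gene's lanewise read counts to TPM and look at the spread across lanes. The function names, the dictionary layout, and the max/min ratio as a spread measure are all assumptions made for illustration. No particular threshold is implied.

```python
def lane_tpm(counts_by_lane, lane_totals):
    """Convert raw per-lane read counts to tags-per-million."""
    return {
        lane: counts_by_lane[lane] * 1_000_000 / lane_totals[lane]
        for lane in counts_by_lane
    }


def fold_spread(values):
    """Ratio of the largest to the smallest value (inf if the smallest is 0)."""
    values = list(values)
    lowest, highest = min(values), max(values)
    return float("inf") if lowest == 0 else highest / lowest


if __name__ == "__main__":
    # Hypothetical housekeeper counts in two lanes of different depth.
    housekeeper = {"lane1": 50, "lane2": 100}
    totals = {"lane1": 1_000_000, "lane2": 2_000_000}
    tpm = lane_tpm(housekeeper, totals)
    print(tpm, fold_spread(tpm.values()))
```

A spread close to 1 for several housekeepers suggests that TPM scaling alone brings the lanes into line, while a large spread points to a need for further normalization.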