Ensembl Pipeline - Operations Overview


Glossary

Important terms used throughout the document:

"The Database": The Ensembl database used by the pipeline.
Tag: The 16- or 17-bp sequence read by the Illumina GA, originally cut adjacent to a CATG or GATC. Depending on the context, "tag" may refer to the sequence with or without the restriction site (including it increases tag length to 20 or 21 bp, respectively).
Site: The actual 20- or 21-bp tag footprint, including the CATG or GATC.
Class: A number from 1 to 27, assigned to each tag based on several factors. Underpins the assignment of tags to tag tables and expedites annotation, but is not especially useful for postprocessing. More detail below.
Canonical: Tags which would normally arise from the known transcriptome (exons, splice junctions).
Noncanonical: Tags which would not normally arise from the known transcriptome (introns, exon-intron or exon-intergenic overlaps).
Genic: Tags for which, for various reasons, canonical/noncanonical status cannot be determined, but which still come from a gene.
Intergenic: Tags which do not overlap any annotated gene, including Ensembl's "blessed" predictions.
Ambiguous: Tags which come from areas of SAME-STRAND gene overlap (which gene generated the tag?).
Silent Repeat: Tags which come from multiple sites, but all within the same gene (can be assigned to one gene).
Repeat: Tags which come from multiple sites, NOT all within the same gene (cannot be assigned to one gene).
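
To make the Tag and Site definitions concrete, here is a minimal Python sketch that scans a sequence for a recognition site and reports the adjacent tag and the full site footprint. It is an illustration only, not the pipeline's generator: the function name find_sites is hypothetical, only the forward strand is scanned, and the tag is assumed to be the bases immediately 3' of the recognition sequence.

```python
def find_sites(sequence, recognition="CATG", tag_length=17):
    """Return (position, tag, site) for each forward-strand recognition site.

    Assumption: the tag is the tag_length bases immediately 3' of the
    recognition sequence. Reverse-strand sites are not handled here.
    """
    sequence = sequence.upper()
    hits = []
    start = sequence.find(recognition)
    while start != -1:
        tag_start = start + len(recognition)
        tag = sequence[tag_start:tag_start + tag_length]
        if len(tag) == tag_length:  # skip sites too close to the sequence end
            hits.append((start, tag, recognition + tag))
        start = sequence.find(recognition, start + 1)
    return hits


example = "TTCATGACGTACGTACGTACGTAGGCATGA"
for position, tag, site in find_sites(example):
    # A 17-bp tag plus the 4-bp CATG gives a 21-bp site.
    print(position, tag, len(tag), site, len(site))
```

With a 16-bp tag and recognition="GATC", the same function yields 20-bp sites.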


3 Steps

Once the database has been installed, the process occurs in 3 steps:
  1. Tag Table Generation (Pipeline "Step 1")
  2. Alignment
  3. Alignment Interpretation (Pipeline "Step 2")
Optionally, some postprocessing may be desired, depending on what the reads were generated for.



Step 1: Tag Table Generation

The Nutshell

  1. The Ensembl database is downloaded and installed. A minimal installation can be used if desired (see Details).
  2. The generator script is edited to reflect the user's database connection settings, and a few other things.
  3. The generator script is run, producing tag tables, reports, and new mysql tables.
  4. Reports are reviewed to ensure all is in order.
  5. New mysql tables are uploaded to the database. If no further table sets will be produced, the original tables can now be dropped (but the database cannot be renamed).
  6. Tag tables are used for alignment.
For editing instructions and runtime parameters, see the User's Guide.

PLEASE NOTE: Processing may require a significant amount of RAM.  Memory usage is roughly 10GB per gigabase, plus overhead, so expect a 3 gigabase genome to absorb ~33GB of RAM during processing. A 1 gigabase genome will take ~16GB.


The Tag Tables

Step 1 will generate up to 10 tag tables. Six are "general" tag tables, based on tag class. Definitions are given in the glossary above.
  • Canonical
  • Noncanonical
  • Genic
  • Intergenic
  • Ambiguous
  • Repeat
Another four are "specific", reserved for specific transcriptional loci, and contain unique tags of any class -- i.e. canonicals, noncanonicals, and genics get pooled together.
  • Mitochondrial
  • Chloroplast
  • Ribosomal
  • ncRNA
For the specific tables, precedence is in decreasing order (mitochondrial takes precedence over chloroplast, chloroplast takes precedence over ribosomal, and ribosomal takes precedence over ncRNA).
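
The precedence rule can be sketched in Python as follows. This is only an illustration of the ordering described above; the function name, the table labels, and the idea that a tag arrives with a set of candidate specific tables are assumptions, not the pipeline's actual data model.

```python
# Specific tables in decreasing order of precedence.
SPECIFIC_PRECEDENCE = ["mitochondrial", "chloroplast", "ribosomal", "ncRNA"]


def pick_specific_table(candidates):
    """Return the highest-precedence specific table among candidates.

    candidates is any collection of table labels a tag qualifies for
    (hypothetical input). Returns None if no specific table applies.
    """
    for table in SPECIFIC_PRECEDENCE:
        if table in candidates:
            return table
    return None


# A tag qualifying for both ribosomal and mitochondrial goes to mitochondrial.
print(pick_specific_table({"ribosomal", "mitochondrial"}))
```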

Depending on the organism, available annotations, and exclusion level (see Details), some tables may not be created.

A readme is generated for every set of tag tables, giving the run parameters and some descriptive stats.


The Mysql Tables

Step 1 produces a gene data table (solexa_gene_data) and two to three tables for the particular tag table set you are creating. The gene data table is always the same, regardless of run parameters, so only one copy needs to exist in the database. Each mysql table set comes with its own loader .sql file.

Tables in an annotation set have systematic names of the format:

solexa_{Enzyme}_{TagLength}_{C,S,CS}_X{0,1,2}_{TableType}

Parameters being defined as:

Enzyme: NlaIII, DpnII (user can add more).
TagLength: 17, 16, or other user-specified length.
C,S,CS: Tags taken from Chromosomes only, Scaffolds only, or both.
X{0,1,2}: Exclusion level (X0, X1, or X2).

The three values of TableType are:

Index: Minimal data about a tag index (indexes allow rapid annotation of mapped reads).
Locations: Minimal data for each genomic site where the tag was found.
Rescues: May or may not be created; tag-to-family mappings for rescued repeat tags.

At least two of these mysql tables will be generated for every run: Index and Locations. Creation of a Rescues table depends on whether the repeat rescue feature was engaged and had any success.

An example set, using chromosomes only, NlaIII, exclusion level 2:
  • solexa_gene_data
  • solexa_NlaIII_17_C_X2_index
  • solexa_NlaIII_17_C_X2_locations
  • solexa_NlaIII_17_C_X2_48_rescues
The rescue table has an extended systematic name, where {TableType} becomes {ComparaVersion}_{TableType}. This allows for multiple rescue versions per table set. In the above example, Compara 48 was used.
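
As an illustration of the naming scheme, the following Python sketch assembles a systematic table name from its parameters, including the extended form used for rescue tables. The function and argument names are hypothetical; only the name format itself comes from the description above.

```python
def table_name(enzyme, tag_length, source, exclusion, table_type,
               compara_version=None):
    """Build a systematic table name of the form
    solexa_{Enzyme}_{TagLength}_{C,S,CS}_X{0,1,2}_{TableType}.
    """
    if source not in ("C", "S", "CS"):
        raise ValueError("source must be 'C', 'S', or 'CS'")
    if exclusion not in (0, 1, 2):
        raise ValueError("exclusion level must be 0, 1, or 2")
    if table_type not in ("index", "locations", "rescues"):
        raise ValueError("unknown table type")
    if table_type == "rescues":
        # Rescue tables carry the Compara version before the table type.
        if compara_version is None:
            raise ValueError("rescue tables need a Compara version")
        suffix = f"{compara_version}_{table_type}"
    else:
        suffix = table_type
    return f"solexa_{enzyme}_{tag_length}_{source}_X{exclusion}_{suffix}"


print(table_name("NlaIII", 17, "C", 2, "index"))
print(table_name("NlaIII", 17, "C", 2, "rescues", compara_version=48))
```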

Creation of a rescues table requires four things:
  1. An Ensembl-supported organism
  2. An Ensembl Compara database (see Details for minimal installation instructions)
  3. Exclusion Level 1 or 2
  4. At least one successfully rescued repeat tag
Repeat rescue is automatically attempted if the first three conditions are met.


The Reports

A large number of reports are generated to allow easy validation of almost any aspect of tag generation, classification, and assignment. The reports fall into two groups. The first is the "pre-fasta" reports, which give details for each tag found in each tag table (these have the same names as the tag table fastas). The second is everything else, in several categories:

1. General reports:

query_dump.txt: Dump of the Exon-Transcript-Gene relationship query
chromosome_data.txt: Lengths, genes, and restriction site counts for each chromosome (and/or scaffold)
complete_locations_report.txt: Full report on every location for every tag found in the genome and transcriptome
gene_tag_report.txt: Subset of the above; gene-associated tags only
tagless_3'_exons.txt: A list of 3' exons which cannot be tagged, FYI

2. Special gene reports (subsets of the solexa_gene_data table), for checking of special gene assignments:

mito_genes.txt: All mitochondrial-genome, mitochondrion-associated and tRNA genes
chloro_genes.txt: All chloroplast-genome and chloroplast-associated genes
ribo_genes.txt: All rRNA and ribosomal protein genes
ncRNA_genes.txt: All remaining noncoding RNA genes (not associated with above groups)

3. Repeat rescue reports: (only generated if repeat rescues were attempted)

compara_query_dump.txt: Dump of the Compara gene-family relationships query
rescue_report.txt: Results for every tag for which rescue was attempted
rescue_successful.txt: Condensed results for tags that were rescued
rescue_unsuccessful.txt: Condensed results for tags that were not rescued



Alignment of Reads


Tag tables replace the reference genome in the alignment process. Tables should work with any alignment software, but currently only Eland "results" output can be interpreted by the second stage of the pipeline.



Step 2: Alignment Interpretation

The Nutshell

  1. Eland output is acquired.
  2. The interpreter script is edited to reflect the user's database connection settings, and a few other things.
  3. The interpreter script is run, producing lanewise tag count tables and various reports.
  4. Reports are reviewed to ensure all is in order.
  5. Tag count tables are sent to postprocessing / statistical analysis.
For editing instructions and runtime parameters, see the User's Guide.

PLEASE NOTE: Processing may require a significant amount of RAM.  Highest observed memory usage is ~13.5GB for a 51-million read run. A 26-million read run still takes ~11GB.


The Tag Count Tables

Tag count tables are the main product of Step 2, giving lanewise read counts and gene associations for each tag. Every tag that made it through alignment will be reported in some table, no matter what the condition.

There are five main tables:

genic_table.txt: Tags associated with one gene
intergenic_table.txt: Single-site tags associated with no genes
ambig_table.txt: Single-site tags associated with >1 genes
repeat_table.txt: Multiple-site tags associated with >1 genes (or no genes)
annotations_table.txt: Annotations for every gene encountered in the tables (including those below).
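
One way to read the four tag-level table descriptions is as a decision on two numbers: how many sites a tag maps to and how many genes those sites touch. The Python sketch below encodes that reading. It is an interpretation of the descriptions, not the interpreter script's logic; the function name and its inputs are hypothetical.

```python
def count_table_for(n_sites, n_genes):
    """Pick the tag count table for a tag, given its number of genomic
    sites and the number of distinct genes those sites are associated with.
    """
    if n_genes == 1:
        # Includes silent repeats: several sites, all within one gene.
        return "genic_table.txt"
    if n_sites == 1:
        return "intergenic_table.txt" if n_genes == 0 else "ambig_table.txt"
    # Multiple sites with no genes, or with more than one gene.
    return "repeat_table.txt"


for sites, genes in [(1, 1), (3, 1), (1, 0), (1, 2), (4, 0), (4, 3)]:
    print(sites, genes, count_table_for(sites, genes))
```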

The above tables are limited to tags with 2 or more reads (for the entire experiment). So-called "singleton" tags are reported in a separate set of tables:

genic_singletons.txt
intergenic_singletons.txt
ambig_singletons.txt
repeat_singletons.txt
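
The singleton rule applies to the read total over the whole experiment, not to individual lanes. A minimal Python sketch of that split, assuming lanewise counts are held in a dictionary keyed by tag sequence (a hypothetical layout, not the pipeline's file format):

```python
def split_singletons(lane_counts):
    """Split tags into (main, singletons) by total reads across all lanes.

    lane_counts maps each tag to a list of per-lane read counts.
    Tags with 2 or more reads in total go to the main tables.
    """
    main, singletons = {}, {}
    for tag, counts in lane_counts.items():
        target = main if sum(counts) >= 2 else singletons
        target[tag] = counts
    return main, singletons


counts = {
    "ACGTACGTACGTACGTA": [5, 0, 3],  # 8 reads in total: main table
    "TTTTACGTACGTACGTA": [1, 1, 0],  # 1 read in each of two lanes: main table
    "GGGGACGTACGTACGTA": [0, 1, 0],  # 1 read in total: singleton
}
main, singletons = split_singletons(counts)
print(sorted(main), sorted(singletons))
```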

Lastly are the miscellaneous tables, which contain 'FYI' data or material for troubleshooting / further investigation:

adaptor_tags.txt: NM tags which were found to contain portions of the 3' sequencing adaptor.
processed_NMs.txt: "Clean" NM tags (no adaptor pieces), ranked by read count.
mapping_errors.txt: Mapping error tags, where off-center alignment occurred within the tag tables. Most of these are defective reads anyway, but all data from the alignment run makes it into a report one way or the other.


The Reports

The additional reports:

code_summary.txt: A comprehensive breakdown of Eland read-code counts and tag table hits. Numbers are reported both readwise and tagwise, using filtered and unfiltered reads, allowing inspection of how the alignments fared, what areas were most impacted by filtering, what types and percentages of reads were mapped to what table, and more.
tables_README.txt: A short file capturing runtime parameters, some descriptive stats, and a rescue summary (where applicable).


The Track Files

Track files give tag positions and read counts in Log10 TPM (tags-per-million). Since tags with < 10 TPM will appear below the line in a .wig file, both strands cannot be accurately represented in one track, so W and C appear in independent files. The raw tracks can be very large, so .bz2 versions are also created automatically. Here ExpName is the experiment name, a user-defined runtime parameter.

{ExpName}_W.wig
{ExpName}_C.wig
{ExpName}_W.wig.bz2
{ExpName}_C.wig.bz2

Also note that repeat tags are excluded, and read counts from silent repeat tags are divided equally among their respective sites.
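
The track values can be sketched as follows in Python. This assumes TPM is computed against a total read count chosen by the caller (the exact denominator the pipeline uses is not stated here) and that a silent repeat tag's share is split evenly over its sites before taking the logarithm. The function name and arguments are hypothetical.

```python
import math


def track_values(read_count, total_reads, site_positions):
    """Return {position: log10 TPM} for one tag.

    read_count must be positive. For a silent repeat tag (several sites),
    the tag's TPM is divided equally among its sites.
    """
    tpm = read_count * 1_000_000 / total_reads
    per_site = tpm / len(site_positions)
    return {pos: math.log10(per_site) for pos in site_positions}


# 200 reads out of 1,000,000 is 200 TPM; over two sites that is 100 TPM each.
print(track_values(200, 1_000_000, [1500, 9200]))
```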

As the tracks are in .wig format, they are intended for viewing in UCSC and are not Ensembl-compatible. Ensembl-ready, LDAS-type track files (similar to GFF) are in development.



Postprocessing

Some Ideas

  • Check if normalization is required: Convert values to TPM and look at housekeeper behavior across lanes.
  • Assess differential expression in genic reads (if your experiment involved 2 or more conditions).
  • BLAST the high-copy-count NM tags.
  • Look for read clusters in intergenic space.
  • Inspect gene prediction sets for possible supporting evidence.