Ensembl Pipeline - Operations Overview
Glossary
Important terms which will be used throughout the document:
"The Database" | The Ensembl database used by the pipeline. |
Tag | The 16- or 17-bp sequence read by the Illumina GA, originally cut adjacent to a CATG or GATC site. Depending on the context, "tag" may refer to the sequence with or without the restriction site (including the site brings the length to 20 or 21 bp, respectively). |
Site | The actual 20- or 21-bp tag footprint, including the CATG or GATC. |
Class | A number from 1-27, assigned to each tag based on several factors. Underpins the assignment of tags to tag tables and expedites annotation, but is not especially useful for postprocessing. More detail below. |
Canonical | Tags which would normally arise from the known transcriptome (exons, splice junctions). |
Noncanonical | Tags which would not normally arise from the known transcriptome (introns, exon-intron or exon-intergenic overlaps). |
Genic | Tags whose canonical/noncanonical status cannot be determined, for various reasons, but which still come from a gene. |
Intergenic | Tags which do not overlap any annotated gene, including Ensembl's "blessed" predictions. |
Ambiguous | Tags which come from areas of SAME-STRAND gene overlap (which gene generated the tag?). |
Silent Repeat | Tags which come from multiple sites, but all within the same gene (can be assigned to one gene). |
Repeat | Tags which come from multiple sites, NOT all within the same gene (cannot be assigned to one gene). |
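To make the Tag and Site definitions concrete, here is a minimal Python sketch that scans one strand of a sequence for recognition sites and reports the site footprint and the tag. It is not the pipeline's own code. The function name `find_sites` is hypothetical, and the choice to take the tag from the bases immediately 3' of the recognition sequence is an assumption, since the glossary only says the tag is cut "adjacent to" the site.

```python
import re

# Recognition sequences named in the glossary.
SITES = {"NlaIII": "CATG", "DpnII": "GATC"}

def find_sites(seq, enzyme="NlaIII", tag_length=17):
    """Yield (position, site_footprint, tag) for each recognition site.

    ASSUMPTION: the tag is the tag_length bases immediately 3' of the
    recognition sequence on the strand given. Only this one strand is scanned.
    """
    recog = SITES[enzyme]
    seq = seq.upper()
    full = len(recog) + tag_length  # 20 or 21 bp for 16- or 17-bp tags
    # A lookahead pattern reports every occurrence, including overlapping ones.
    for m in re.finditer("(?=%s)" % recog, seq):
        start = m.start()
        footprint = seq[start:start + full]
        if len(footprint) < full:
            continue  # too close to the end of the sequence for a full tag
        yield start, footprint, footprint[len(recog):]

example = "AA" + "CATG" + "ACGTACGTACGTACGTA" + "TT"
for pos, site, tag in find_sites(example):
    print(pos, site, tag)
```

With the default 17-bp tag length the footprint is 21 bp; passing `tag_length=16` gives the 20-bp footprint.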
3 Steps
Once the database has been installed, the process occurs in 3 steps:
- Tag Table Generation (Pipeline "Step 1")
- Alignment
- Alignment Interpretation (Pipeline "Step 2")
Optionally, some postprocessing may be desired, depending on what the reads were generated for.
Step 1: Tag Table Generation
The Nutshell
- The Ensembl database is downloaded and installed. A minimal installation can be used if desired (see Details).
- The generator script is edited to reflect the user's database connection settings and a few other parameters.
- The generator script is run, producing tag tables, reports, and new mysql tables.
- Reports are reviewed to ensure all is in order.
- New mysql tables are uploaded to the database. If no further table sets will be produced, the original tables can now be dropped (but the database cannot be renamed).
- Tag tables are used for alignment.
For editing instructions and runtime parameters, see the User's Guide.
PLEASE NOTE: Processing may require a significant amount of RAM. Memory usage is roughly 10GB per gigabase, plus overhead,
so expect a 3 gigabase genome to absorb ~33GB of RAM during processing. A 1 gigabase genome will take ~16GB.
The Tag Tables
Step 1 will generate up to 10 tag tables. Six are "general" tag tables, based on tag class. Definitions are given in the glossary above.
- Canonical
- Noncanonical
- Genic
- Intergenic
- Ambiguous
- Repeat
Another four are "specific", reserved for specific transcriptional loci, and contain unique tags of any class -- i.e. canonicals,
noncanonicals, and genics get pooled together.
- Mitochondrial
- Chloroplast
- Ribosomal
- ncRNA
For the specific tables, precedence is in decreasing order (mitochondrial takes precedence over chloroplast, chloroplast takes precedence
over ribosomal, and ribosomal takes precedence over ncRNA).
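The precedence rule can be sketched in a few lines of Python. This is only an illustration of the ordering described above; the function name `pick_specific_table` and the idea of passing in a set of matching table names are assumptions, not part of the generator script.

```python
# Specific tables in decreasing order of precedence, as described above.
SPECIFIC_PRECEDENCE = ["mitochondrial", "chloroplast", "ribosomal", "ncRNA"]

def pick_specific_table(matches):
    """Return the highest-precedence specific table among `matches`, or None.

    `matches` is a hypothetical collection of specific-table names that a
    tag qualifies for.
    """
    for table in SPECIFIC_PRECEDENCE:
        if table in matches:
            return table
    return None

print(pick_specific_table({"ncRNA", "ribosomal"}))
```

A tag that matches none of the four loci returns `None` and would fall through to the general tables.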
Depending on the organism, available annotations, and exclusion level (see Details), some tables may not be created.
A readme is generated for every set of tag tables, giving the run parameters and some descriptive stats.
The Mysql Tables
Step 1 produces a gene data table (solexa_gene_data) and two to three tables for the particular tag table set you are creating. The gene
data table is always the same, regardless of run parameters, so only one copy needs to exist in the database. Each mysql table set comes
with its own loader .sql file.
Tables in an annotation set have systematic names of the format:
solexa_{Enzyme}_{TagLength}_{C,S,CS}_X{0,1,2}_{TableType}
The parameters are defined as:
Enzyme | NlaIII, DpnII (user can add more). |
TagLength | 17, 16, or other user-specified length. |
C,S,CS | Tags taken from Chromosomes only, Scaffolds only, or both. |
X{0,1,2} | Exclusion level (X0, X1, or X2). |
The three values of TableType are:
Index | Minimal data about a tag index (indexes allow rapid annotation of mapped reads). |
Locations | Minimal data for each genomic site where the tag was found. |
Rescues | May or may not be created -- tag-to-family mappings for rescued repeat tags. |
At least two of these mysql tables will be generated for every run: Index and Locations. Creation of a Rescues table depends on whether the
repeat rescue feature was engaged and had any success.
An example set, using chromosomes only, NlaIII, exclusion level 2:
- solexa_gene_data
- solexa_NlaIII_17_C_X2_index
- solexa_NlaIII_17_C_X2_locations
- solexa_NlaIII_17_C_X2_48_rescues
The rescue table has an extended systematic name, where {TableType} becomes {ComparaVersion}_{TableType}. This allows for multiple rescue
versions per table set. In the above example, Compara 48 was used.
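As an illustration of the naming scheme, the following Python sketch assembles a systematic table name from its parts. The function `table_name` and its argument names are hypothetical; only the name format and the lowercase table types are taken from the example set above.

```python
def table_name(enzyme, tag_length, source, exclusion, table_type, compara=None):
    """Build a systematic mysql table name.

    source:     "C", "S", or "CS" (chromosomes, scaffolds, or both)
    exclusion:  0, 1, or 2
    table_type: "index", "locations", or "rescues"
    compara:    Compara version, required only for rescues tables
    """
    if source not in ("C", "S", "CS"):
        raise ValueError("source must be C, S, or CS")
    if exclusion not in (0, 1, 2):
        raise ValueError("exclusion level must be 0, 1, or 2")
    if table_type not in ("index", "locations", "rescues"):
        raise ValueError("unknown table type")
    parts = ["solexa", enzyme, str(tag_length), source, "X%d" % exclusion]
    if table_type == "rescues":
        if compara is None:
            raise ValueError("rescues tables need a Compara version")
        parts.append(str(compara))  # extended name: {ComparaVersion}_{TableType}
    parts.append(table_type)
    return "_".join(parts)

print(table_name("NlaIII", 17, "C", 2, "index"))
print(table_name("NlaIII", 17, "C", 2, "rescues", compara=48))
```

The two calls reproduce the index and rescues names from the example set.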
Creation of a rescues table requires four things:
- An Ensembl-supported organism
- An Ensembl Compara database (see Details for minimal installation instructions)
- Exclusion Level 1 or 2
- At least one successfully rescued repeat tag
Repeat rescue is automatically attempted if the first three conditions are met.
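The four conditions reduce to two small checks, sketched here in Python. The function and parameter names are hypothetical and are not taken from the generator script.

```python
def rescue_attempted(ensembl_supported, compara_available, exclusion_level):
    """True when the first three conditions hold, so rescue is attempted."""
    return bool(ensembl_supported and compara_available
                and exclusion_level in (1, 2))

def rescues_table_created(ensembl_supported, compara_available,
                          exclusion_level, n_rescued):
    """True when rescue was attempted and at least one tag was rescued."""
    return (rescue_attempted(ensembl_supported, compara_available,
                             exclusion_level)
            and n_rescued >= 1)

print(rescue_attempted(True, True, 2))
print(rescues_table_created(True, True, 2, n_rescued=0))
```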
The Reports
A large number of reports are generated to allow easy validation of almost any aspect of tag generation, classification, and assignment.
There are two groups of reports. The first is the "pre-fasta" reports, which give details for each tag found in each tag table (these have the
same names as the tag table fastas). The second is everything else, which falls into several categories:
1. General reports:
query_dump.txt | Dump of the Exon-Transcript-Gene relationship query |
chromosome_data.txt | Lengths, genes, and restriction site counts for each chromosome (and/or scaffold) |
complete_locations_report.txt | Full report on every location for every tag found in the genome and transcriptome |
gene_tag_report.txt | Subset of the above; gene-associated tags only |
tagless_3'_exons.txt | A list of 3' exons which cannot be tagged, FYI |
2. Special gene reports (subsets of the solexa_gene_data table), for checking of special gene assignments:
mito_genes.txt | All mitochondrial-genome, mitochondrion-associated and tRNA genes |
chloro_genes.txt | All chloroplast-genome and chloroplast-associated genes |
ribo_genes.txt | All rRNA and ribosomal protein genes |
ncRNA_genes.txt | All remaining noncoding RNA genes (not associated with above groups) |
3. Repeat rescue reports: (only generated if repeat rescues were attempted)
compara_query_dump.txt | Dump of the Compara gene-family relationships query |
rescue_report.txt | Results for every tag for which rescue was attempted |
rescue_successful.txt | Condensed results for tags that were rescued |
rescue_unsuccessful.txt | Condensed results for tags that were not rescued |
Alignment of Reads
Tag tables replace the reference genome in the alignment process. Tables should work with any alignment software, but currently only Eland
"results" output can be interpreted by the second stage of the pipeline.
Step 2: Alignment Interpretation
The Nutshell
- Eland output is acquired.
- The interpreter script is edited to reflect the user's database connection settings and a few other parameters.
- The interpreter script is run, producing lanewise tag count tables and various reports.
- Reports are reviewed to ensure all is in order.
- Tag count tables are sent to postprocessing / statistical analysis.
For editing instructions and runtime parameters, see the User's Guide.
PLEASE NOTE: Processing may require a significant amount of RAM. Highest observed memory usage is ~13.5GB for a
51-million read run. A 26-million read run still takes ~11GB.
The Tag Count Tables
Tag count tables are the main product of Step 2, giving lanewise read counts and gene associations for each tag. Every tag that made it
through alignment will be reported in some table, no matter what the condition.
There are five main tables:
genic_table.txt | Tags associated with one gene |
intergenic_table.txt | Single-site tags associated with no genes |
ambig_table.txt | Single-site tags associated with more than one gene |
repeat_table.txt | Multiple-site tags associated with more than one gene (or with no genes) |
annotations_table.txt | Annotations for every gene encountered in the tables (including those below). |
The above tables are limited to tags with 2 or more reads (for the entire experiment). So-called "singleton" tags are reported in a
separate set of tables:
genic_singletons.txt | |
intergenic_singletons.txt | |
ambig_singletons.txt | |
repeat_singletons.txt | |
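The routing of a tag to one of these eight files can be sketched as follows. This is a reading of the table descriptions above, not the interpreter script's logic. In particular, sending a multiple-site tag with exactly one gene (a silent repeat) to the genic table is an assumption, and the function names and inputs are hypothetical.

```python
def main_table(n_sites, n_genes):
    """Pick the table family from the number of sites and associated genes."""
    if n_genes == 1:
        return "genic"  # ASSUMPTION: includes silent repeats (many sites, one gene)
    if n_sites == 1:
        return "intergenic" if n_genes == 0 else "ambig"
    return "repeat"     # multiple sites with more than one gene, or with no genes

def output_file(n_sites, n_genes, lane_counts):
    """Return the file a tag is reported in, given its read counts per lane."""
    total = sum(lane_counts)  # reads over the entire experiment
    suffix = "table" if total >= 2 else "singletons"
    return "%s_%s.txt" % (main_table(n_sites, n_genes), suffix)

print(output_file(1, 1, [1, 0, 0]))
print(output_file(3, 0, [2, 5]))
```

Note that the singleton test uses the total over all lanes, so a tag seen once in each of two lanes is not a singleton.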
Lastly are the miscellaneous tables, which contain 'FYI' data or material for troubleshooting / further investigation:
adaptor_tags.txt | NM tags which were found to contain portions of the 3' sequencing adaptor. |
processed_NMs.txt | "Clean" NM tags (no adaptor pieces), ranked by read count. |
mapping_errors.txt | Mapping error tags, where off-center alignment occurred within the tag tables. Most of these are defective reads anyway, but all data from the alignment run makes it into a report one way or the other. |
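One simple way to spot adaptor carry-over in an NM tag is to look for a tag suffix that matches the start of the adaptor. The Python sketch below shows that idea only; how the pipeline actually detects adaptor portions is not described here. The function name, the `min_len` threshold, and the adaptor string in the example are all hypothetical placeholders, not real Illumina sequences.

```python
def adaptor_overlap(tag, adaptor, min_len=4):
    """Length of the longest tag suffix that equals an adaptor prefix.

    Returns 0 when the overlap is shorter than min_len (hypothetical
    threshold to avoid calling chance matches).
    """
    longest = min(len(tag), len(adaptor))
    for k in range(longest, min_len - 1, -1):
        if tag.endswith(adaptor[:k]):
            return k
    return 0

# Placeholder sequences for illustration only.
print(adaptor_overlap("ACGTACGTTTGGA", "TTGGAC"))
print(adaptor_overlap("ACGTACGT", "TTGGAC"))
```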
The Reports
The additional reports:
code_summary.txt | A comprehensive breakdown of Eland read-code counts and tag table hits. Numbers are reported both readwise and tagwise, using filtered and unfiltered reads, allowing inspection of how the alignments fared, what areas were most impacted by filtering, what types and percentages of reads were mapped to what table, and more. |
tables_README.txt | A short file capturing runtime parameters, some descriptive stats, and a rescue summary (where applicable). |
The Track Files
Track files give tag positions and read counts in log10 TPM (tags per million). Since tags with < 1 TPM have a negative log10 value and will appear below the line in a
.wig file, both strands cannot be accurately represented in one track, so W and C appear in independent files. The raw tracks can be very
large, so .bz2 versions are also created automatically. Here ExpName is the experiment name, a user-defined runtime parameter.
{ExpName}_W.wig | |
{ExpName}_C.wig | |
{ExpName}_W.wig.bz2 | |
{ExpName}_C.wig.bz2 | |
Also note that repeat tags are excluded, and read counts from silent repeat tags are divided equally among their respective sites.
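A short Python sketch of the track values: convert a read count to TPM, take log10, and split a silent repeat tag's reads equally among its sites. The function names are hypothetical, and the choice of `total_reads` as the denominator is an assumption, since the text does not say which read total the pipeline uses.

```python
import math

def log10_tpm(reads, total_reads):
    """log10 of tags-per-million; negative whenever TPM is below 1."""
    tpm = reads * 1e6 / total_reads
    return math.log10(tpm)

def per_site_values(reads, n_sites, total_reads):
    """Divide a tag's reads equally among its sites, then convert each share."""
    share = reads / n_sites
    return [log10_tpm(share, total_reads)] * n_sites

print(log10_tpm(10, 1_000_000))           # 10 TPM
print(log10_tpm(1, 2_000_000))            # 0.5 TPM, below the line
print(per_site_values(20, 2, 1_000_000))  # silent repeat tag with two sites
```

The second call illustrates why values can fall below zero, which is what prevents both strands from sharing a single track.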
As the tracks are in .wig format, they are intended for viewing in UCSC and are not Ensembl-compatible. Ensembl-ready, LDAS-type track
files (similar to GFF) are in development.
Postprocessing
Some Ideas
- Check if normalization is required: Convert values to TPM and look at housekeeper behavior across lanes.
- Assess differential expression in genic reads (if your experiment involved 2 or more conditions).
- BLAST the high-copy-count NM tags.
- Look for read clusters in intergenic space.
- Inspect gene prediction sets for possible supporting evidence.
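The first idea, converting lanewise counts to TPM and checking how steady a housekeeper is across lanes, might look like this in Python. It is a rough sketch, not a prescribed method. The input layout (a dict of per-lane counts keyed by tag), the use of each lane's column total as the denominator, and the max/min ratio as the stability measure are all assumptions.

```python
def lane_tpm(counts_by_tag):
    """Convert {tag: [reads per lane]} to {tag: [TPM per lane]}.

    ASSUMPTION: each lane's denominator is the sum of counts in that lane
    over the tags supplied. Every lane must have a nonzero total.
    """
    lanes = list(zip(*counts_by_tag.values()))  # one tuple of counts per lane
    totals = [sum(lane) for lane in lanes]
    return {tag: [c * 1e6 / t for c, t in zip(counts, totals)]
            for tag, counts in counts_by_tag.items()}

def fold_range(tpms):
    """Ratio of highest to lowest TPM across lanes (inf if a lane is zero)."""
    lo, hi = min(tpms), max(tpms)
    return float("inf") if lo == 0 else hi / lo

counts = {"housekeeper": [10, 20], "other": [90, 180]}
tpm = lane_tpm(counts)
print(tpm["housekeeper"], fold_range(tpm["housekeeper"]))
```

In this toy example the second lane is simply sequenced twice as deeply, so the housekeeper's TPM is identical in both lanes and its fold range is 1. A housekeeper whose fold range is far from 1 would suggest that further normalization is worth a look.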