Why Use Tag Tables?
Since the introduction of SAGE, the use of tagging reactions to characterize transcriptional activity has become a widespread practice.
Recently, next-generation sequencing platforms like the Illumina Genome Analyzer and ABI SOLiD have taken SAGE to a new level, generating
tag libraries thousands of times larger than previously possible. This has created new issues for tag annotation, alignment, and quality
assessment, as well as amplified old issues. One of these issues is the problem of mapping tags back to a reference genome.
Genomes of higher eukaryotes are vast and far from random. The probability of randomly generating any given 17-bp tag is less
than 1 in 17 billion (or about 1 in 4.3 billion for 16-bp tags), and the number of possible tags is usually much larger than the
reference genome. Even so, many real tags map to tens, hundreds, or thousands of genomic locations. Some means of restricting the
genomic search space is critical for having confidence in mapping and for getting good yields of annotated tags.
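The figures above follow from counting sequences over a four-letter alphabet. The sketch below reproduces them and estimates how often one tag would match a uniformly random genome by chance. The function names and the 3-billion-bp genome size are illustrative assumptions, not part of the pipeline.

```python
def tag_space(length: int) -> int:
    """Number of distinct DNA sequences of a given length (4 bases per position)."""
    return 4 ** length


def expected_random_hits(genome_size: int, length: int) -> float:
    """Expected exact matches of one tag in a uniformly random genome.

    Counts every start position on both strands. Real genomes are not
    random, so this is only a baseline to compare observed hit counts against.
    """
    positions = 2 * (genome_size - length + 1)
    return positions / tag_space(length)


if __name__ == "__main__":
    # Hypothetical round figure for a mammalian-scale genome.
    genome_size = 3_000_000_000
    for length in (16, 17):
        print(length, tag_space(length), expected_random_hits(genome_size, length))
```

Under the random model a 17-bp tag is expected to match less than once, so a tag that maps to hundreds of locations reflects repeats and other non-random structure, not chance.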
Because tagging reactions use anchoring (restriction) enzymes, the meaningful search space of a genome can be reduced to only those
sequences which flank the restriction site of interest. This step is crucial to getting useful results (a .pdf with more detail is
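As a rough illustration of restricting the search space to restriction-site flanks, the sketch below collects the fixed-length sequence immediately downstream of every anchoring-enzyme site on both strands. It assumes the NlaIII recognition site (CATG), a common SAGE anchoring enzyme, and a 17-bp tag. The function names are hypothetical, and this is not the pipeline's own code.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def flanking_tags(sequence: str, site: str = "CATG", tag_length: int = 17):
    """List (strand, site_position, tag) for each site with a full-length flank.

    Positions are 0-based offsets on the strand being scanned. The default
    site and tag length are assumptions made for this example.
    """
    seq = sequence.upper()
    tags = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        # A lookahead pattern finds overlapping site occurrences as well.
        for match in re.finditer(f"(?={re.escape(site)})", s):
            start = match.start() + len(site)
            tag = s[start:start + tag_length]
            if len(tag) == tag_length:  # skip flanks that run off the end
                tags.append((strand, match.start(), tag))
    return tags


if __name__ == "__main__":
    example = "AACATG" + "ACGTACGTACGTACGTA" + "GG"
    for record in flanking_tags(example):
        print(record)
```

Only sequences adjacent to a site are kept, so the candidate set grows with the number of sites in the genome, not with every possible start position.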
The Ensembl Tag-Table Pipeline
At the moment, only genomes which can be packaged into an Ensembl database shell can be converted into tag tables. The generator script
converts Ensembl databases of type "core" or "otherfeatures" into tag tables, reports, and a few mysql tables to be uploaded back to the
database. Several processing options are available, allowing the user to generate tag table sets which better match their analytical needs.
While the tag tables can be used with any alignment software (Eland, MAQ, BLAT, etc.), the interpreter script currently only works with
Eland output. The script will use the mysql tables generated in the first step to annotate the alignment results. The new mysql tables
contain all the data necessary for tag identification and annotation, but table entities are still matched to those of the host database
wherever possible, to facilitate user-defined queries and joins between the new tables and the original Ensembl tables.
See the Pipeline Overview for more details.
Support and Development
The pipeline was developed for internal use and is being made available as an open-source beta, under the BSD license. If people find it
useful enough, it will be rewritten in more formal code (Perl -w, use strict, object-orientation, etc.) and developed further. There are
some things which are already in the queue for development:
- Ensembl-compatible, LDAS-type track files.
- Inclusion of custom reporter/plasmid constructs.
- Coverage for tags from unknown alternative splicing events.
- A flatfile-based pipeline is in the works, which will accommodate flatfile genome formats from several major providers (NCBI, JGI, Broad, Sanger).
Features currently available are detailed in the User's Guide.
Pipeline and website development and maintenance by Ariel Paulson
Please address any correspondence to [ apa at stowers-institute dot org ]
Special thanks to other members of the Institute for their input and assistance, and for making all this possible:
Chris Seidel | Aaron Noll | Malcolm Cook | Madelaine Gogol | Sachin Mathur | Brandon Young
And many thanks to Kevin M. Carr at U Michigan for many key improvements and for being the alpha tester!