
Why Use Tag Tables?

Since the introduction of SAGE, the use of tagging reactions to characterize transcriptional activity has become a widespread practice. Recently, next-generation sequencing platforms like the Illumina Genome Analyzer and ABI SOLiD have taken SAGE to a new level, generating tag libraries thousands of times larger than previously possible. This has created new issues for tag annotation, alignment, and quality assessment, and has amplified old ones. One of these issues is the problem of mapping tags back to a reference genome.

Genomes of higher eukaryotes are vast and far from random. The probability of randomly generating any given 17-bp tag is less than 1 in 17 billion (1 in 4.3 billion for 16-bp tags), and the number of possible tags is usually much larger than the reference genome. Even so, many real tags map to tens, hundreds, or thousands of genomic locations. Some means of restricting the genomic search space is critical for confident mapping and for good yields of annotated tags.
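The figures quoted above follow from counting the possible sequences of a given length over a four-letter alphabet. A minimal sketch of that arithmetic (the function name tag_space is only an illustration and is not part of the pipeline):

```python
def tag_space(length: int) -> int:
    """Number of distinct DNA sequences of the given length (4 bases per position)."""
    return 4 ** length


# Tag lengths discussed in the text above.
for n in (16, 17):
    print(f"{n}-bp tags: {tag_space(n):,} possible sequences")
```

This gives 4,294,967,296 possible 16-bp tags and 17,179,869,184 possible 17-bp tags, which is where the "1 in 4.3 billion" and "1 in 17 billion" figures come from.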

Because tagging reactions use anchoring (restriction) enzymes, the meaningful search space of a genome can be reduced to only those sequences that flank the restriction site of interest. This step is crucial to getting useful results (a .pdf with more detail is available here).
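The idea of restricting the search to site-flanking sequence can be sketched as follows. This is an illustration only, not the pipeline's own code: it assumes the anchoring enzyme is NlaIII (recognition site CATG), a 17-bp tag taken immediately downstream of the site, and the forward strand only. The function name candidate_tags and its parameters are hypothetical.

```python
import re


def candidate_tags(sequence, site="CATG", tag_length=17):
    """Yield (site_position, tag) for each full-length tag found
    immediately downstream of an anchoring site on the forward strand.

    Assumptions: the site and tag length are illustrative defaults,
    and the reverse strand is not scanned.
    """
    seq = sequence.upper()
    # A zero-width lookahead finds every site, including overlapping ones.
    for match in re.finditer(f"(?={re.escape(site)})", seq):
        start = match.start() + len(site)
        tag = seq[start:start + tag_length]
        # Skip sites too close to the end to give a full-length tag.
        if len(tag) == tag_length:
            yield match.start(), tag


example = "AACATGACGTACGTACGTACGTAGGCATGAA"
tags = list(candidate_tags(example))
```

Only the sequences returned by such a scan would need to be kept as mapping targets. A real implementation would also have to cover the reverse strand and, for transcript-derived tags, sequence spanning splice junctions.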

The Ensembl Tag-Table Pipeline

Table Generation

At the moment, only genomes that can be packaged into an Ensembl database shell can be converted into tag tables. The generator script converts Ensembl databases of type "core" or "otherfeatures" into tag tables, reports, and a few MySQL tables to be uploaded back to the database. Several processing options are available, allowing users to generate tag table sets that better match their analytical needs.

Interpretation of Alignment Data

While the tag tables can be used with any alignment software (Eland, MAQ, BLAT, etc.), the interpreter script currently only works with Eland output. The script uses the MySQL tables generated in the first step to annotate the alignment results. The new MySQL tables contain all the data necessary for tag identification and annotation, but table entities are still matched to those of the host database wherever possible, to facilitate user-defined queries and joins between the new tables and the original Ensembl tables.
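The benefit of keeping table entities matched to the host database can be illustrated with a small join. Everything in this sketch is hypothetical: the table and column names (tag_annotation, gene, gene_id, name) are simplified stand-ins rather than the pipeline's or Ensembl's actual schema, and an in-memory SQLite database is used in place of MySQL so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for a host-database table (simplified, not the real schema).
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT);
    -- Hypothetical tag table that reuses the host table's gene_id.
    CREATE TABLE tag_annotation (tag TEXT, gene_id INTEGER);
    INSERT INTO gene VALUES (1, 'geneA'), (2, 'geneB');
    INSERT INTO tag_annotation VALUES
        ('ACGTACGTACGTACGTA', 1),
        ('TTTTGGGGCCCCAAAAT', 2);
""")


def annotate(tags):
    """Return (tag, gene_name) pairs; tags with no match get None."""
    query = """
        SELECT t.tag, g.name
        FROM tag_annotation AS t
        JOIN gene AS g ON g.gene_id = t.gene_id
        WHERE t.tag = ?
    """
    results = []
    for tag in tags:
        row = conn.execute(query, (tag,)).fetchone()
        results.append(row if row else (tag, None))
    return results


annotated = annotate(["ACGTACGTACGTACGTA", "GGGG"])
```

Because the tag table shares an identifier with the host table, a single join is enough to move from an aligned tag to the host database's own annotation.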

See the Pipeline Overview for more details.

Support and Development

The pipeline was developed for internal use and is being made available as an open-source beta, under the BSD license. If people find it useful enough, it will be rewritten in more formal code (Perl -w, use strict, object-orientation, etc.) and developed further. Several items are already in the queue for development:

  • Ensembl-compatible, LDAS-type track files.
  • Inclusion of custom reporter/plasmid constructs.
  • Coverage for tags from unknown alternative splicing events.
  • A flatfile-based pipeline (in the works) that will accommodate flatfile genome formats from several major providers (NCBI, JGI, Broad, Sanger).

Features currently available are detailed in the User's Guide.


Pipeline and website development and maintenance by Ariel Paulson
Please address any correspondence to [ apa at stowers-institute dot org ]
Special thanks to other members of the Institute for their input and assistance, and for making all this possible:
  Chris Seidel | Aaron Noll | Malcolm Cook | Madelaine Gogol | Sachin Mathur | Brandon Young
And many thanks to Kevin M. Carr at U Michigan for many key improvements and for being the alpha tester!