Ensembl Pipeline - Operations Details


Jump to details about:

  1. Minimal Database Installations
    1. Core
    2. Compara
  2. Step 1 Operations
    1. Feature Types
    2. Hierarchical Exclusion
    3. Tag Classes
    4. The Tag Table Creation Process
      1. Scanning
      2. Classification
      3. Repeat Rescue
      4. Sorting
  3. Step 2 Operations
    1. * Coming Soon *
For details about installing and running the pipeline, please see the User's Guide.



Minimal Database Installations


Core Database

The Ensembl core (or otherfeatures) database must always have its systematic Ensembl name (e.g. homo_sapiens_core_49_36k). The scripts will not recognize it otherwise.

There are two sets of tables required for a minimal install:

Feature Data Tables:
  1. gene
  2. gene_stable_id
  3. transcript
  4. transcript_stable_id
  5. exon
  6. exon_stable_id
  7. exon_transcript
  8. xref
  9. external_synonym
Sequence Data Tables:
  1. assembly
  2. assembly_exception
  3. attrib_type
  4. coord_system
  5. dna
  6. meta
  7. seq_region
  8. seq_region_attrib
The above tables are required for Step 1 only.
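
As a rough illustration (not the pipeline's own code), the naming and minimal-table requirements above could be checked along these lines. The regular expression and the helper names `is_core_name` and `missing_tables` are assumptions made for this sketch; only the table names come from the lists above.

```python
import re

# Assumed pattern for systematic names such as homo_sapiens_core_49_36k.
# The pipeline's real recognition logic may differ.
CORE_NAME = re.compile(r"^[a-z]+(?:_[a-z]+)+_(?:core|otherfeatures)_\d+_\w+$")

FEATURE_TABLES = {
    "gene", "gene_stable_id", "transcript", "transcript_stable_id",
    "exon", "exon_stable_id", "exon_transcript", "xref", "external_synonym",
}
SEQUENCE_TABLES = {
    "assembly", "assembly_exception", "attrib_type", "coord_system",
    "dna", "meta", "seq_region", "seq_region_attrib",
}


def is_core_name(db_name):
    """True if db_name looks like a systematic core/otherfeatures name."""
    return CORE_NAME.match(db_name) is not None


def missing_tables(present_tables):
    """Return the sorted minimal-install tables absent from present_tables."""
    required = FEATURE_TABLES | SEQUENCE_TABLES
    return sorted(required - set(present_tables))
```

Calling `missing_tables` with the table names reported by the database server gives a quick list of what still needs to be loaded before Step 1.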


Compara Database

The Ensembl Compara database must always have its systematic Ensembl name (e.g. ensembl_compara_48). The scripts will not recognize it otherwise.

Compara is a very large database system, and only a very small portion is necessary to enable the repeat-rescue feature:
  1. family
  2. family_member
  3. member
  4. ncbi_taxa_name


Step 1 Operations


Feature Types

A "site" is the actual 20- or 21-bp tag footprint, including the CATG or GATC. When stored as a coordinate, it is always the position of the first base in the restriction site (on the W strand).

A "feature", as defined here, is the source, or partial source, of a tag: a genomic element which contributes basepairs to a site. The site consists of the tag and its restriction site (so, 20 or 21 bp, as above), and its feature composition is the set of all feature types which contribute basepairs to that site. There are four feature types recognized by the tag-finding algorithm:
  1. I:   Intergenes.
  2. N:  Introns.
  3. E:  Exons.
  4. T:  Transcript-Only features, which are either splice junctions or polyadenylation junctions.
Although junctions contribute no actual basepairs, making them a separate feature is important for keeping track of tag origins. Any exonic tag will be found again in the transcriptome (often multiple times); in order to separate exonic tags from those which are truly unique to transcripts, feature type T is required.

Tags are stranded, and only features from the same strand as the tag may contribute basepairs. Antisense transcription is not included in the classification model. So if the tag overlaps an exon, but the exon is on the other strand (and nothing is on the tag's strand), the tag is considered intergenic.

Combinations of the above four features determine whether a tag is canonical or not. E and T are normal transcript features, and tags with only these features are called canonical.

N is not a "normal" feature in a mature transcript (I know, I know...); tags coming from introns are considered noncanonical. Likewise, if the tag overlaps an exon-intron boundary, it gets features E and N; we do not expect this in normal transcripts, so the tag is noncanonical. The same applies if the tag hangs off the end of a gene (into intergenic space), getting features E and I.

I is the intergenic feature; tags with only this feature are intergenic.
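
The feature rules above can be sketched as a small per-base check. This is a simplified illustration, not the pipeline's implementation: it handles a single gene on the tag's own strand, uses 1-based inclusive coordinates, and ignores feature T (which only arises from transcript scanning). The function names `site_features` and `site_status` are hypothetical.

```python
def site_features(site_start, site_end, gene_start, gene_end, exons):
    """Return the set of feature letters contributing basepairs to a site.

    exons is a list of (start, end) pairs belonging to the gene. Every base
    of the site is labelled I (outside the gene), E (inside an exon) or
    N (inside the gene but outside every exon).
    """
    features = set()
    for pos in range(site_start, site_end + 1):
        if pos < gene_start or pos > gene_end:
            features.add("I")
        elif any(start <= pos <= end for start, end in exons):
            features.add("E")
        else:
            features.add("N")
    return features


def site_status(features):
    """Map a feature composition to canonical / noncanonical / intergenic."""
    if features <= {"E", "T"}:
        return "canonical"
    if features == {"I"}:
        return "intergenic"
    return "noncanonical"
```

For a gene spanning 100-500 with exons at 100-200 and 300-500, a site at 190-210 gets {E, N} and is noncanonical, while a site at 150-170 gets {E} and is canonical.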


Hierarchical Exclusion

The exclusion process allows more stringent assignment of tags to features, basically by playing "rock, paper, scissors" with transcriptional likelihoods. There are three levels of stringency:
  1. "X0":  No exclusions -- all tag sources treated equally.
  2. "X1":  If a tag has a genic source, then ignore any intergenic sources.
  3. "X2":  X1 stringency, and if a tag has a canonical source, then also ignore any noncanonical sources.
Exclusion levels do not change the number of tags in the tables, only how they are annotated.

At X0, there will be many genic/intergenic repeats, where a tag sequence may be found both in genes and in intergenic space. X1 and X2 eliminate this; if the tag sequence is found in a gene, the classifier will ignore all non-genic features for that tag. X0 is best for badly underannotated genomes, where intergenic space is rife with undiscovered genes, and one does not want to overestimate the accuracy of any given assignment.

X1 is better for decently-annotated genomes, where most genes have been found, but the transcriptome is not well established. X1 prevents genic/intergenic repeat tags, but it will not distinguish between canonical and noncanonical sources. If a tag sequence is found at two sites -- one is the exon of gene A, the other is the intron of gene B, and the two do not overlap -- it will be classified as a repeat. If a tag comes from the exon of gene A, but gene A is floating in the intron of gene B, the tag will be classified as ambiguous.

X2 is the most restrictive and is best for skimming only the most likely loci for any tag. X2 resolves the previous cases: exons are canonical and take priority over introns, which are noncanonical. In both cases the tag will be assigned to gene A alone. However, if a tag is still found in two unrelated exons, or two unrelated introns (i.e. multiple features of the same priority), it will still be classified as a repeat. X2 is good for organisms with well-characterized transcriptomes, where overestimating the accuracy of assignments is less likely.
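
The three stringency levels can be illustrated with a short filter. This is only a sketch of the rules as described here: each tag source is represented as a hypothetical `(site, gene, status)` tuple, where status is "canonical", "noncanonical" or "intergenic", and `apply_exclusion` is an invented name.

```python
def apply_exclusion(sources, level):
    """Return the tag sources that survive exclusion level 0, 1 or 2."""
    kept = list(sources)
    # X1 and X2: any genic source causes intergenic sources to be ignored.
    if level >= 1 and any(status != "intergenic" for _, _, status in kept):
        kept = [s for s in kept if s[2] != "intergenic"]
    # X2 only: any canonical source causes noncanonical sources to be ignored.
    if level >= 2 and any(status == "canonical" for _, _, status in kept):
        kept = [s for s in kept if s[2] == "canonical"]
    return kept


# A tag found in an exon of gene A, an intron of gene B, and intergenic space.
sources = [
    (1000, "geneA", "canonical"),
    (5000, "geneB", "noncanonical"),
    (9000, None, "intergenic"),
]
```

With these sources, X0 keeps all three, X1 keeps the two genic sources (so the tag is still a repeat), and X2 keeps only gene A, matching the cases discussed above.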


Tag Classes

Tag classes speed up tag annotation and assignment to tables, and enable the hierarchical exclusion process. They also assist in track file creation and pipeline troubleshooting. They are of limited use outside the pipeline, although they do provide a little context about each tag.

Tags are classified using three pieces of information:
  • 'sites': one or many: number of sites where the tag occurs
  • 'genes': one or many: number of genes associated with these sites
  • 'INET': various combinations: feature composition of each site
All possible combinations of the INET variables, plus the two binary variables 'sites' and 'genes', result in 56 combinations. However, some of these are biologically impossible, and others are practically redundant. The short list has 27 combinations, which become classes.

Table of Classes:
#  | Name                         | Explanation
01 | Exonic                       | Single site, single gene; exon
02 | Exonic Ambiguous             | Single site, 2+ genes with exon overlap
03 | Exonic Silent Repeat         | Multiple sites, single gene; all from exons
04 | Exonic Repeat                | Multiple sites, multiple genes; all from exons
05 | Junction/PolyA               | Single site, single gene, but transcript-only
06 | Junction/PolyA Ambiguous     | Single site, 2+ gene overlap; 2+ transcripts cutting same tag
07 | Junction/PolyA Silent Repeat | Multiple sites, single gene; all from transcripts only
08 | Junction/PolyA Repeat        | Multiple sites, multiple genes; all from transcripts only
09 | Canonical                    | Currently impossible, but that could change in a later version
10 | Canonical Ambiguous          | Single site, 2+ gene overlap; one gene is exon-only, the other is transcript-only
11 | Canonical Silent Repeat      | Multiple sites, single gene; some are exon hits, some from splice junctions
12 | Canonical Repeat             | Multiple sites, multiple genes; all are either exon or transcript-only hits
13 | Intronic                     | Single site, single gene; intron
14 | Intronic Ambiguous           | Single site, 2+ genes with intron overlap
15 | Intronic Silent Repeat       | Multiple sites, single gene; all from introns
16 | Intronic Repeat              | Multiple sites, multiple genes; all from introns
17 | Noncanonical                 | Single site, single gene; exon-intron / exon-intergenic (or intron-exon-intergenic) overhang
18 | Noncanonical Ambiguous       | Single site, 2+ genes with exon-intron / exon-intergenic overhangs overlapping tag site
19 | Noncanonical Silent Repeat   | Multiple sites, single gene; all sites have exon-intron or exon-intergenic overhang
20 | Noncanonical Repeat          | Multiple sites, multiple genes; all sites have exon-intron or exon-intergenic overhang
21 | Genic                        | Single site, single gene; tag the same if spliced, or runs into intron / off gene end
22 | Genic Ambiguous              | Single site, 2+ genes; one site is exonic, the other intronic
23 | Genic Silent Repeat          | Multiple sites, single gene; some sites are exonic, others intronic
24 | Genic Repeat                 | Multiple sites, multiple genes; some sites are exonic, others intronic
25 | Intergenic                   | Single site, no gene
26 | Intergenic Repeat            | Multiple sites, no genes
27 | Genic/Intergenic Repeat      | Multiple sites, some have genes

The pattern should be immediately visible. There are seven categories:
  1. Exonic
  2. Junction/PolyA
  3. Canonical
  4. Intronic
  5. Noncanonical
  6. Genic
  7. Intergenic
Each category (except Intergenic) has four distributions:
  1. Unambiguous (just the category name)
  2. Ambiguous
  3. Silent Repeat
  4. Repeat
Certain classes only exist at certain exclusion levels. For instance, 27 only exists at X0, and 21-24 cannot exist at X2.
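
Because the first six categories each carry the same four distributions in the same order, the class number follows directly from the two names. The lookup below is just an illustration of that pattern in the table; `class_number` is a hypothetical helper, and the pipeline may well store classes differently.

```python
CATEGORIES = [
    "Exonic", "Junction/PolyA", "Canonical",
    "Intronic", "Noncanonical", "Genic",
]
DISTRIBUTIONS = ["Unambiguous", "Ambiguous", "Silent Repeat", "Repeat"]

# The intergenic classes do not follow the four-per-category pattern.
SPECIAL_CLASSES = {
    ("Intergenic", "Unambiguous"): 25,
    ("Intergenic", "Repeat"): 26,
    ("Genic/Intergenic", "Repeat"): 27,
}


def class_number(category, distribution):
    """Return the class number (1-27) for a category/distribution pair."""
    if (category, distribution) in SPECIAL_CLASSES:
        return SPECIAL_CLASSES[(category, distribution)]
    # Classes 1-24: four consecutive numbers per category.
    # Unknown names raise ValueError from list.index.
    return 4 * CATEGORIES.index(category) + DISTRIBUTIONS.index(distribution) + 1
```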

Some classes almost never occur. Class 9, for instance, is impossible, but there is a good chance that will change in the future, so it is being left in. Classes 10-12 are extremely rare, but not actually impossible. Class 21 holds oddities where a single exonic site can generate the same tag by running across a splice junction OR running into the intron.

Anyway, classes are just a means to an end, and may be replaced in a later version if something more efficient is developed.


The Tag Table Creation Process

The process occurs in four stages:
  1. Scanning
  2. Classification
  3. Repeat Rescue
  4. Sorting



Scanning:

Chromosomes are sorted into standard and nonstandard/scaffold groups. Standards are processed first. For each chromosome/scaffold, the scanning process is as follows:
  1. The golden path sequence is retrieved from the database.

  2. Transcripts for all genes on the chromosome/scaffold are scanned for restriction sites. Transcript sequences are assembled from scratch, which enables tracking of the true genomic positions of each tag. The gene id, transcript id, transcript start site, and tag are recorded and stored by genomic start site. This way, an exonic tag will not be classified as a repeat, even though it is seen multiple times. Also, tags are only cut in the 3' direction of the transcript; tags cannot run off the end of a transcript, because transcripts are given poly-A tails which are longer than the tag length.

  3. The chromosomal sequence is scanned for restriction sites. Since the sites are palindromes, one tag is cut 3' on the W strand and another is cut 5' on the C strand, then reverse-complemented. If a tag runs off the end of the chromosome, it is discarded. Coordinates and tags are recorded for each site and stored in two separate objects, one for each strand.

  4. Once all sites for a chromosome and its transcriptome have been identified, they are sent to the gene matching subroutine for "featurizing":

    1. Each site is compared to the list of all chromosomal genes ON THE SAME STRAND. No opposite-strand matching is performed.

    2. If the site is within a gene, the gene id is recorded, and the site goes to the exon matching subroutine.

    3. The exon matching works the same as gene matching. If the site is within an exon, it is assigned an "E". If not, it is assigned an "N".

    4. If the site only partially overlaps a gene (hanging off its end), then it gets an "E" and an "I" (and goes to exon matching). If a site overlaps a gene by even one bp, it will be assigned to that gene.

    5. Likewise, if the site only partially overlaps an exon, then it gets both an "E" and an "N" (unless it overlaps the outer edge of a terminal exon). It is even possible, though very rare, for a site to completely eclipse a small terminal exon and get "N", "E", and "I" all together.

    6. If the site does not overlap any gene, it is assigned the "I" feature (intergenic).

  5. As sites get "featurized", their features and associated genes are stored in new objects, one for each tag. This will enable the classifier to take each site, look at all the associated features, and ignore any which do not meet the exclusion criteria (if there are any). The tag is then classified according to the feature composition of the remaining sites.
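
The chromosomal scan in step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's code: the default enzyme site (CATG) and tag length (17 bp past the restriction site) are example values, the stored string is the whole footprint (restriction site plus tag), and `scan_sites` and `revcomp` are hypothetical names. Coordinates are 1-based positions of the first base of the restriction site on the W strand, as described under Feature Types.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_sites(seq, site="CATG", tag_len=17):
    """Scan one W-strand sequence and return (w_tags, c_tags).

    Each dict maps the site coordinate to the footprint read on that strand.
    Footprints that would run off either end of the sequence are discarded.
    """
    w_tags, c_tags = {}, {}
    pos = seq.find(site)
    while pos != -1:
        coord = pos + 1
        # W strand: restriction site followed by the tag, reading 3'.
        end = pos + len(site) + tag_len
        if end <= len(seq):
            w_tags[coord] = seq[pos:end]
        # C strand: the tag lies 5' of the site in W coordinates, so the
        # span is taken upstream and then reverse-complemented.
        start = pos - tag_len
        if start >= 0:
            c_tags[coord] = revcomp(seq[start:pos + len(site)])
        pos = seq.find(site, pos + 1)
    return w_tags, c_tags
```

Keeping one dict per strand mirrors the "two separate objects" mentioned in step 3, and lets the gene matching step compare each site only against genes on the same strand.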



Classification:

Once all chromosomes and/or scaffolds have been scanned, and the feature compositions and tags from all sites have been catalogued, each tag is given a class. Tag annotations are determined by the tag type, e.g. canonical, ambiguous, intergenic, etc. Processing varies slightly depending on the level of hierarchical exclusion. For each tag, the process is as follows:
  1. Purely intergenic sites are tallied and separated.

  2. Each gene-associated site is processed:

    1. All genes associated with the site are tallied (there may be more than one; genes can overlap on the same strand).

    2. Each gene, at this site, is processed:

      1. All features associated with this gene, at this site, are recalled.

      2. Feature combinations are converted into a single value, called a "primary". Each gene-site combination may have only one primary. This immediately determines if the site is canonical for this gene, or not.

      3. The gene-site combination is stored by its primary, in a new object.

  3. Once all gene-site combinations have been assigned primaries, processing returns to the site level.

  4. Once all sites are processed, primaries "compete" against each other, according to the rules of the exclusion level.

    1. Exclusion is applied to each gene at each site; only the genes and sites for "winning" primaries are retained.

    2. The 'sites' variable = 1 or 'many', depending on how many sites were retained.

    3. The 'genes' variable = 1 or 'many', depending on how many genes were retained.

    4. The remaining unique primaries are consolidated into a "secondary", according to the rules of the exclusion level.

    5. Given 'sites', 'genes', and the secondary, a class is assigned.

  5. The tag is now classified.

  6. Note that, at exclusion level X0, intergenic sites may compete with genic sites. At X1 and X2, intergenic sites only factor in if there are no genic sites.
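
Steps 4.2 and 4.3 above reduce the retained gene-site pairs to the two binary variables, which in turn fix the distribution part of the class (per the Table of Classes). A minimal sketch for gene-associated tags, assuming the retained pairs are available as `(site, gene)` tuples; the function name `distribution` is hypothetical:

```python
def distribution(retained_pairs):
    """Return the distribution name for the pairs that survived exclusion.

    retained_pairs is an iterable of (site, gene) tuples for one tag.
    """
    pairs = list(retained_pairs)
    many_sites = len({site for site, _ in pairs}) > 1
    many_genes = len({gene for _, gene in pairs}) > 1
    if many_sites:
        return "Repeat" if many_genes else "Silent Repeat"
    return "Ambiguous" if many_genes else "Unambiguous"
```

The category part of the class (Exonic, Intronic, Genic, and so on) comes from the secondary, which is not modelled here.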



Repeat Rescue:

If a tag is gene-associated and is classified as a repeat, and the repeat-rescue feature is enabled, then the tag will immediately go to repeat rescue.

The idea is to use gene families to rescue repeat tags, if it happens that all instances of a repeat come from members of the same family. This is particularly useful when dealing with sets of highly homologous genes which may all generate the same tag, e.g. immunoglobulins, HOX loci, olfactory receptors, ribosomal proteins, tubulins, etc.

As few as 2 genes can be used, so long as all genes associated with the tag belong to the same family. Rescued tags will be treated as singular and annotated to one family (as a silent repeat), instead of many genes. The only issue with the process is that Ensembl's "stable" family IDs are not stable, but version-specific. This is why the Compara version is included with all repeat-rescue data.
  1. All winning gene-site pairs for the tag are recalled.

  2. Compara is queried, and all families for each gene are returned.

  3. If all genes happen to associate with one family, then the tag is rescued, and associated with this family.

  4. If any sites have ambiguous gene associations, which interfere with a singular family assignment, deconvolution will be attempted. This only applies if there are 4+ genes, and at least 75% associate with the dominant family; otherwise, the tag is skipped. If deconvolution shows that all sites associate with one family, then the tag is rescued.

  5. Rescued tags are re-classed as silent repeats (in their original category) and annotated to the family, instead of a gene.

  6. Unrescued tags are returned to sorting, unaltered.
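
The family test in steps 2-4 can be sketched as below. This is an illustration only: the input is assumed to be a dict from gene ID to the set of Compara family IDs returned for that gene, the function names are hypothetical, and treating "exactly one family shared by every gene" as the rescue condition is an assumption. The deconvolution step itself is not described in enough detail here to reproduce, so only its entry thresholds (4+ genes, 75% in the dominant family) are shown.

```python
from collections import Counter


def rescue_family(gene_families):
    """Return the single family shared by all genes, or None."""
    if len(gene_families) < 2:
        return None
    shared = set.intersection(*gene_families.values())
    if len(shared) == 1:
        return next(iter(shared))
    return None


def deconvolution_eligible(gene_families, min_genes=4, min_fraction=0.75):
    """True if the tag meets the thresholds for attempting deconvolution."""
    # Each gene counts at most once per family because the values are sets.
    counts = Counter(f for fams in gene_families.values() for f in fams)
    if len(gene_families) < min_genes or not counts:
        return False
    dominant_count = counts.most_common(1)[0][1]
    return dominant_count / len(gene_families) >= min_fraction
```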



Sorting:

Sorting is coupled to classification; as soon as a tag is classified it gets sorted into a group and annotated accordingly:
  1. Unambiguous tags are annotated with gene and site.

  2. Ambiguous tags are annotated with the site, a list of candidate genes, and several frequencies:

    1. Original Frequency: The original number of sites the tag was found at, before the exclusion process.
    2. Final Frequency: The reduced number of sites the tag is found at, after exclusion.
    3. Gene Frequency: The number of winning genes associated with the tag. Usually 2 for ambiguous tags, but occasionally higher.
    4. These frequencies are applied to some other classes as well. In reports, they are found with headers "OFreq", "FFreq", and "GFreq".

  3. Silent Repeat tags are annotated with gene, original frequency, and final frequency.

  4. Repeat tags (genic or intergenic) are annotated with original frequency and final frequency.

  5. Intergenic tags (non-repeat) are annotated with site only.

Annotated tags are sorted in two ways: by category and by distribution (described in the Tag Classes section above).
  1. Intergenic, Ambiguous, and Repeat tags all go to their respective tables.

    NOTE: in the tag tables, intergenic takes precedence over repeat. In tag annotation (pipeline step 2), repeat takes precedence over intergenic. Keep this in mind if tracking intergenic repeats.

  2. The remaining tags are all associated with single genes, and are divided up in this way:

    1. If the gene is mitochondrion-associated, the tag goes to the mitochondrial table.

    2. Otherwise, if the gene is chloroplast-associated, the tag goes to the chloroplast table.

    3. Otherwise, if the gene is ribosome- or tRNA-associated, the tag goes to the ribosomal table.

    4. Otherwise, if the gene is any kind of noncoding RNA, the tag goes to the ncRNA table.

    5. Otherwise, the tag is not associated with a special gene type, and goes to a "generic" table:

      1. Exonic, Junction/PolyA, and Canonical classes go to the canonical table.

      2. Intronic and Noncanonical classes go to the noncanonical table.

      3. Genic class goes to the genic table.
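
The sorting cascade above amounts to an ordered series of checks. The sketch below is one way to express it; the table names, the `gene_labels` set, and its label strings are all invented for illustration, and the real pipeline presumably derives gene type from the Ensembl annotation rather than from ready-made labels.

```python
# Checked in order; the first matching label wins.
SPECIAL_GENE_TABLES = ["mitochondrial", "chloroplast", "ribosomal", "ncRNA"]

GENERIC_TABLES = {
    "Exonic": "canonical", "Junction/PolyA": "canonical",
    "Canonical": "canonical",
    "Intronic": "noncanonical", "Noncanonical": "noncanonical",
    "Genic": "genic",
}


def target_table(category, distribution, gene_labels=frozenset()):
    """Return the name of the tag table a classified tag is sorted into."""
    # In the tag tables, intergenic takes precedence over repeat.
    if category == "Intergenic":
        return "intergenic"
    if distribution == "Ambiguous":
        return "ambiguous"
    if distribution == "Repeat":
        return "repeat"
    # Remaining tags (unambiguous or silent repeat) belong to a single gene.
    for label in SPECIAL_GENE_TABLES:
        if label in gene_labels:
            return label
    return GENERIC_TABLES[category]
```

Writing the special gene types as an ordered list keeps the "otherwise, if" precedence explicit: a gene labelled both mitochondrial and ncRNA lands in the mitochondrial table.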

Tags are sent to pre-fasta reports sorted by class. Unambiguous tags go first, then silent repeats. Each type reports different data, and gets a different header within the report.



Step 2 Operations


Coming Soon