Ensembl Pipeline - Operations Details

"X0": No exclusions -- all tag sources treated equally.
"X1": If a tag has a genic source, then ignore any intergenic sources.
"X2": X1 stringency, and if a tag has a canonical source, then also ignore any noncanonical sources.

Exclusion levels do not change the number of tags in the tables, only how they are annotated.

At X0, there will be many genic/intergenic repeats, where a tag sequence may be found both in genes and in intergenic space. X1 and X2 eliminate this; if the tag sequence is found in a gene, the classifier will ignore all non-genic features for that tag. X0 is best for badly underannotated genomes, where intergenic space is rife with undiscovered genes, and one does not want to overestimate the accuracy of any given assignment.

X1 is better for decently-annotated genomes, where most genes have been found, but the transcriptome is not well established. X1 prevents genic/intergenic repeat tags, but it will not distinguish between canonical and noncanonical sources. If a tag sequence is found at two sites -- one is the exon of gene A, the other is the intron of gene B, and the two do not overlap -- it will be classified as a repeat. If a tag comes from the exon of gene A, but gene A is floating in the intron of gene B, the tag will be classified as ambiguous.

X2 is the most restrictive and is best for skimming only the most likely loci for any tag. X2 resolves the previous cases: exons are canonical and take priority over introns, which are noncanonical. In both cases the tag will be assigned to gene A alone. However, if a tag is still found in two unrelated exons, or two unrelated introns (i.e. multiple features of the same priority), it will still be classified as a repeat. X2 is good for organisms with well-characterized transcriptomes, where overestimating the accuracy of assignments is less likely.
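
The three exclusion levels can be sketched as a small filter over the sources of one tag. This is only an illustration of the rules described above, not the pipeline's own code: the function name, the three source-kind labels, and the (kind, locus) pair layout are all assumptions made for the example.

```python
# Hypothetical labels; the pipeline's internal names are not documented here.
CANONICAL, NONCANONICAL, INTERGENIC = "canonical", "noncanonical", "intergenic"

def apply_exclusion(sources, level):
    """Return the sources of one tag that survive exclusion level 0, 1, or 2.

    `sources` is a list of (kind, locus) pairs, where kind is one of the
    three labels above.
    """
    kept = list(sources)
    if level >= 1 and any(kind != INTERGENIC for kind, _ in kept):
        # X1: a genic source hides every intergenic source.
        kept = [s for s in kept if s[0] != INTERGENIC]
    if level >= 2 and any(kind == CANONICAL for kind, _ in kept):
        # X2: a canonical source also hides every noncanonical source.
        kept = [s for s in kept if s[0] == CANONICAL]
    return kept

# The two-site example above: exon of gene A, intron of gene B.
sources = [(CANONICAL, "geneA"), (NONCANONICAL, "geneB")]
print(apply_exclusion(sources, 1))  # both sources kept, so still a repeat
print(apply_exclusion(sources, 2))  # only gene A remains
```

Note that the number of tags never changes here; only the list of sources used to annotate each tag shrinks.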

Tag Classes

Tag classes speed up tag annotation and assignment to tables, and enable the hierarchical exclusion process. They also assist in track file creation and pipeline troubleshooting. They are not very useful outside the pipeline, although they do provide a little context about each tag.

Tags are classified using three pieces of information:

'sites': one or many: number of sites where the tag occurs
'genes': one or many: number of genes associated with these sites
'INET': various combinations: feature composition of each site

All possible combinations of the INET variables, plus the two binary variables 'sites' and 'genes', result in 56 combinations. However, some of these are biologically impossible, and others are practically redundant. The short list has 27 combinations, which become classes.

Table of Classes:

#	Name	Explanation
01	Exonic	Single site, single gene; exon
02	Exonic Ambiguous	Single site, 2+ genes with exon overlap
03	Exonic Silent Repeat	Multiple sites, single gene; all from exons
04	Exonic Repeat	Multiple sites, multiple genes; all from exons
05	Junction/PolyA	Single site, single gene, but transcript-only
06	Junction/PolyA Ambiguous	Single site, 2+ gene overlap; 2+ transcripts cutting same tag
07	Junction/PolyA Silent Repeat	Multiple sites, single gene; all from transcripts only
08	Junction/PolyA Repeat	Multiple sites, multiple genes; all from transcripts only
09	Canonical	Currently impossible, but that could change in a later version
10	Canonical Ambiguous	Single site, 2+ gene overlap; one gene is exon-only, the other is transcript-only
11	Canonical Silent Repeat	Multiple sites, single gene; some are exon hits, some from splice junctions
12	Canonical Repeat	Multiple sites, multiple genes; all are either exon or transcript-only hits
13	Intronic	Single site, single gene; intron
14	Intronic Ambiguous	Single site, 2+ genes with intron overlap
15	Intronic Silent Repeat	Multiple sites, single gene; all from introns
16	Intronic Repeat	Multiple sites, multiple genes; all from introns
17	Noncanonical	Single site, single gene; exon-intron / exon-intergenic (or intron-exon-intergenic) overhang
18	Noncanonical Ambiguous	Single site, 2+ genes with exon-intron / exon-intergenic overhangs overlapping tag site
19	Noncanonical Silent Repeat	Multiple sites, single gene; all sites have exon-intron or exon-intergenic overhang
20	Noncanonical Repeat	Multiple sites, multiple genes; all sites have exon-intron or exon-intergenic overhang
21	Genic	Single site, single gene; tag the same if spliced, or runs into intron / off gene end
22	Genic Ambiguous	Single site, 2+ genes; one site is exonic, the other intronic
23	Genic Silent Repeat	Multiple sites, single gene; some sites are exonic, others intronic
24	Genic Repeat	Multiple sites, multiple genes; some sites are exonic, others intronic
25	Intergenic	Single site, no gene
26	Intergenic Repeat	Multiple sites, no genes
27	Genic/Intergenic Repeat	Multiple sites, some have genes

The pattern should be immediately visible. There are seven categories:

Exonic
Junction/PolyA
Canonical
Intronic
Noncanonical
Genic
Intergenic

Each category except Intergenic has four distributions:

Unambiguous (just the category name)
Ambiguous
Silent Repeat
Repeat
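
Because the table is laid out as six categories of four distributions each, followed by the three intergenic classes, the class number can be computed from the two names. The sketch below only reproduces the numbering in the table; the function and constant names are made up for the example and are not taken from the pipeline.

```python
CATEGORIES = ["Exonic", "Junction/PolyA", "Canonical",
              "Intronic", "Noncanonical", "Genic"]
DISTRIBUTIONS = ["Unambiguous", "Ambiguous", "Silent Repeat", "Repeat"]
# The intergenic classes do not follow the 4-per-category pattern.
INTERGENIC = {"Intergenic": 25, "Intergenic Repeat": 26,
              "Genic/Intergenic Repeat": 27}

def class_number(category, distribution="Unambiguous"):
    """Map a category and distribution to its class number (1-27)."""
    if category in INTERGENIC:
        return INTERGENIC[category]
    row = CATEGORIES.index(category)
    col = DISTRIBUTIONS.index(distribution)
    return 4 * row + col + 1

print(class_number("Canonical", "Ambiguous"))  # 10
print(class_number("Genic", "Repeat"))         # 24
```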

Certain classes only exist at certain exclusion levels. For instance, 27 only exists at X0, and 21-24 cannot exist at X2.

Some classes never or almost never occur. Class 9, for instance, is currently impossible, but there is a good chance that will change in the future, so it is being left in. Classes 10-12 are extremely rare, but not actually impossible. Class 21 holds oddities where a single exonic site can generate the same tag by running across a splice junction OR running into the intron.

Anyway, classes are just a means to an end, and may be replaced in a later version if something more efficient is developed.

The Tag Table Creation Process

The process occurs in four stages:

Scanning
Classifying
Repeat Rescue
Sorting

Scanning:

Chromosomes are sorted into standard and nonstandard/scaffold groups. Standards are processed first. For each chromosome/scaffold, the scanning process is as follows:

The golden path sequence is retrieved from the database.

Transcripts for all genes on the chromosome/scaffold are scanned for restriction sites. Transcript sequences are assembled from scratch, which enables tracking of the true genomic positions of each tag. The gene id, transcript id, transcript start site, and tag are recorded and stored by genomic start site. This way, an exonic tag will not be classified as a repeat, even though it is seen multiple times. Also, tags are only cut in the 3' direction of the transcript; tags cannot run off the end of a transcript, because transcripts are given poly-A tails which are longer than the tag length.

The chromosomal sequence is scanned for restriction sites. Since the sites are palindromes, one tag is cut 3' on the W strand and another is cut 5' on the C strand, then reverse-complemented. If a tag runs off the end of the chromosome, it is discarded. Coordinates and tags are recorded for each site and stored in two separate objects, one for each strand.
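
A minimal sketch of the chromosomal scan follows. Several details are assumptions, since this section does not state them: the recognition site (CATG is used as an example of a 4-bp palindrome), the tag length, 0-based coordinates, and the choice to cut the tag immediately next to the site rather than including it. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def scan_chromosome(seq, site="CATG", tag_len=17):
    """Cut one tag per strand at every occurrence of a palindromic site.

    Returns (watson, crick): two dicts, one per strand, mapping the 0-based
    start of the restriction site to the tag cut on that strand.
    """
    watson, crick = {}, {}
    pos = seq.find(site)
    while pos != -1:
        end = pos + len(site)
        # W strand: the tag lies immediately 3' of the site.
        # Tags that would run off the chromosome end are discarded.
        if end + tag_len <= len(seq):
            watson[pos] = seq[end:end + tag_len]
        # C strand: the tag lies 5' of the site in W coordinates and is
        # reverse-complemented so that it reads 5'->3' on the C strand.
        if pos - tag_len >= 0:
            crick[pos] = reverse_complement(seq[pos - tag_len:pos])
        pos = seq.find(site, pos + 1)
    return watson, crick

w, c = scan_chromosome("AAAAACATGCCCCC", tag_len=5)
print(w)  # {5: 'CCCCC'}
print(c)  # {5: 'TTTTT'}
```

Keeping one dictionary per strand mirrors the "two separate objects" described above, and makes the later same-strand gene matching straightforward.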

Once all sites for a chromosome and its transcriptome have been identified, they are sent to the gene matching subroutine for "featurizing":
1. Each site is compared to the list of all chromosomal genes ON THE SAME STRAND. No opposite-strand matching is performed.
2. If the site is within a gene, the gene id is recorded, and the site goes to the exon matching subroutine.
3. The exon matching works the same as gene matching. If the site is within an exon, it is assigned an "E". If not, it is assigned an "N".
4. If the site only partially overlaps a gene, then it gets an "E" and an "I" (and goes to exon matching). If a site overlaps a gene by even one bp, it will be assigned to that gene.
5. Likewise, if the site only partially overlaps an exon, then it gets both an "E" and an "N" (unless it overlaps the outer edge of a terminal exon). It is even possible, though very rare, for a site to completely eclipse a small terminal exon and get "N", "E", and "I" all together.
6. If the site does not overlap any gene, it is assigned the "I" feature (intergenic).
As sites get "featurized", their features and associated genes are stored in new objects, one for each tag. This will enable the classifier to take each site, look at all the associated features, and ignore any which do not meet the exclusion criteria (if there are any). The tag is then classified according to the feature composition of the remaining sites.
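
The featurizing rules above amount to asking which kinds of sequence a site covers. The sketch below is one way to express that for a single site and a single same-strand gene; it is not the pipeline's subroutine. Inclusive coordinates, the argument layout, and the function name are assumptions for the example.

```python
def featurize(site_start, site_end, gene_start, gene_end, exons):
    """Return the feature letters for one site against one same-strand gene.

    Coordinates are inclusive. `exons` is a list of (start, end) pairs for
    the gene. The result is a set drawn from "E" (exon), "N" (intron), and
    "I" (intergenic); it is {"I"} when the site misses the gene entirely.
    """
    site = range(site_start, site_end + 1)
    in_gene = [p for p in site if gene_start <= p <= gene_end]
    if not in_gene:
        return {"I"}
    features = set()
    if len(in_gene) != len(site):
        features.add("I")  # part of the site hangs off the gene
    in_exon = [p for p in in_gene
               if any(start <= p <= end for start, end in exons)]
    if in_exon:
        features.add("E")
    if len(in_exon) != len(in_gene):
        features.add("N")  # part of the site lies in an intron
    return features

exons = [(100, 200), (300, 500)]
print(featurize(150, 160, 100, 500, exons))  # fully inside an exon
print(featurize(195, 205, 100, 500, exons))  # exon-intron overhang
print(featurize(95, 105, 100, 500, exons))   # hangs off the gene end
```

Checking base by base is affordable here because sites are only as long as a tag; a production version would more likely use interval arithmetic.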

Classification:

Once all chromosomes and/or scaffolds have been scanned, and the feature compositions and tags from all sites have been catalogued, each tag is given a class. Tag annotations are determined by the tag type (canonical, ambiguous, intergenic, and so on). Processing varies slightly depending on the level of hierarchical exclusion. For each tag, the process is as follows:

Purely intergenic sites are tallied and separated.

Each gene-associated site is processed:

All genes associated with the site are tallied (there may be more than one; genes can overlap on the same strand).

Each gene, at this site, is processed:

All features associated with this gene, at this site, are recalled.

Feature combinations are converted into a single value, called a "primary". Each gene-site combination may have only one primary. This immediately determines if the site is canonical for this gene, or not.

The gene-site combination is stored by its primary, in a new object.

Once all gene-site combinations have been assigned primaries, processing returns to the site level.

Once all sites are processed, primaries "compete" against each other, according to the rules of the exclusion level.
1. Exclusion is applied to each gene at each site; only the genes and sites for "winning" primaries are retained.
2. The 'sites' variable = 1 or 'many', depending on how many sites were retained.
3. The 'genes' variable = 1 or 'many', depending on how many genes were retained.
4. The remaining unique primaries are consolidated into a "secondary", according to the rules of the exclusion level.
5. Given 'sites', 'genes', and the secondary, a class is assigned.
The tag is now classified.
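
Steps 2 and 3 above, together with the class table, determine the distribution of a gene-associated tag. The sketch below assumes the retained gene-site combinations are available as (site, gene) pairs; that layout and the function name are illustrative only, and the consolidation of primaries into a secondary is not modeled.

```python
def distribution(retained_pairs):
    """Name the distribution from the (site, gene) pairs that won exclusion."""
    if not retained_pairs:
        raise ValueError("a gene-associated tag needs at least one pair")
    sites = {site for site, _ in retained_pairs}
    genes = {gene for _, gene in retained_pairs}
    if len(sites) == 1:
        # 'sites' = 1: one gene is unambiguous, several genes are ambiguous.
        return "Unambiguous" if len(genes) == 1 else "Ambiguous"
    # 'sites' = many: one gene is a silent repeat, several genes a repeat.
    return "Silent Repeat" if len(genes) == 1 else "Repeat"

print(distribution([("chr1:1000", "geneA"), ("chr1:1000", "geneB")]))
print(distribution([("chr1:1000", "geneA"), ("chr2:5000", "geneA")]))
```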

Note that at exclusion level X0, intergenic sites may compete with genic sites. At X1 and X2, intergenic sites only factor in if there are no genic sites.

Repeat Rescue:

If a tag is gene-associated and is classified as a repeat, and the repeat-rescue feature is enabled, then the tag will immediately go to repeat rescue.

The idea is to use gene families to rescue repeat tags, if it happens that all instances of a repeat come from members of the same family. This is particularly useful when dealing with sets of highly homologous genes which may all generate the same tag, e.g. immunoglobulins, HOX loci, olfactory receptors, ribosomal proteins, tubulins, etc.

As few as 2 genes can be used, so long as all genes associated with the tag belong to the same family. Rescued tags will be treated as singular and annotated to one family (as a silent repeat), instead of many genes. The only issue with the process is that Ensembl's "stable" family IDs are not stable, but version-specific. This is why the Compara version is included with all repeat-rescue data.

All winning gene-site pairs for the tag are recalled.

Compara is queried, and all families for each gene are returned.

If all genes happen to associate with one family, then the tag is rescued, and associated with this family.

If any sites have ambiguous gene associations, which interfere with a singular family assignment, deconvolution will be attempted. This only applies if there are 4+ genes, and at least 75% associate with the dominant family; otherwise, the tag is skipped. If deconvolution shows that all tags associate with one family, then the tag is rescued.

Rescued tags are re-classed as silent repeats (in their original category) and annotated to the family, instead of a gene.

Unrescued tags are returned to sorting, unaltered.
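
The rescue decision can be sketched as follows. The Compara query is replaced by a plain dictionary, and the deconvolution rule shown (every site must keep at least one gene from the dominant family) is an assumption, since the exact procedure is not spelled out here. All names are hypothetical, and the Compara version bookkeeping is left out.

```python
from collections import Counter

def rescue_family(site_genes, gene_families, min_genes=4, min_fraction=0.75):
    """Return the family a repeat tag can be rescued to, or None.

    site_genes: dict mapping each winning site to the list of genes at it.
    gene_families: dict mapping gene id to a set of family ids; a stand-in
    for the Compara query.
    """
    genes = sorted({g for gs in site_genes.values() for g in gs})
    if len(genes) < 2:
        return None
    # Simple case: exactly one family is shared by every gene.
    shared = set.intersection(*(set(gene_families.get(g, ())) for g in genes))
    if len(shared) == 1:
        return next(iter(shared))
    # Deconvolution (assumed rule): needs 4+ genes and a dominant family
    # covering at least 75% of them.
    if len(genes) < min_genes:
        return None
    counts = Counter(f for g in genes for f in sorted(gene_families.get(g, ())))
    if not counts:
        return None
    family, hits = counts.most_common(1)[0]
    if hits / len(genes) < min_fraction:
        return None
    # Every site must still have a gene from the dominant family.
    for gs in site_genes.values():
        if not any(family in gene_families.get(g, ()) for g in gs):
            return None
    return family

families = {"g1": {"F1"}, "g2": {"F1"}, "g3": {"F1"}, "gX": {"F9"}}
sites = {"s1": ["g1"], "s2": ["g2"], "s3": ["g3", "gX"]}
print(rescue_family(sites, families))  # F1
```

In the usage example, site s3 is ambiguous between g3 and gX; three of the four genes belong to F1, and every site has an F1 gene, so the tag is rescued to F1.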

Sorting:

Sorting is coupled to classification; as soon as a tag is classified it gets sorted into a group and annotated accordingly:

Unambiguous tags are annotated with gene and site.

Ambiguous tags are annotated with the site, a list of candidate genes, and several frequencies:

Original Frequency: The original number of sites the tag was found at, before the exclusion process.
Final Frequency: The reduced number of sites the tag is found at, after exclusion.
Gene Frequency: The number of winning genes associated with the tag. Usually 2 for ambiguous tags, but occasionally goes higher.
These frequencies are applied to some other classes as well. In reports, they are found with headers "OFreq", "FFreq", and "GFreq".

Silent Repeat tags are annotated with gene, original frequency, and final frequency.

Repeat tags (genic or intergenic) are annotated with original frequency and final frequency.

Intergenic tags (non-repeat) are annotated with site only.
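
The three frequencies described above can be computed from the site list before exclusion and the gene-site pairs that remain afterwards. The sketch below assumes those two inputs exist in this form; the function name and data layout are illustrative, and only the report headers (OFreq, FFreq, GFreq) come from the pipeline.

```python
def frequencies(original_sites, retained_pairs):
    """Compute OFreq, FFreq, and GFreq for one tag.

    original_sites: every site where the tag was found, before exclusion.
    retained_pairs: (site, gene) pairs kept after exclusion; gene is None
    for an intergenic site.
    """
    return {
        "OFreq": len(set(original_sites)),
        "FFreq": len({site for site, _ in retained_pairs}),
        "GFreq": len({gene for _, gene in retained_pairs if gene is not None}),
    }

# An ambiguous tag: three sites originally, one site with two genes retained.
print(frequencies(["s1", "s2", "s3"], [("s1", "geneA"), ("s1", "geneB")]))
```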

Annotated tags are sorted in two ways: by category and by distribution (described in the Tag Classes section above).

Intergenic, Ambiguous, and Repeat tags all go to their respective tables.

NOTE: in the tag tables, intergenic takes precedence over repeat. In tag annotation (pipeline step 2), repeat takes precedence over intergenic. Keep this in mind if tracking intergenic repeats.

The remaining tags are all associated with single genes, and are divided up in this way:

If the gene is mitochondrion-associated, the tag goes to the mitochondrial table.

Otherwise, if the gene is chloroplast-associated, the tag goes to the chloroplast table.

Otherwise, if the gene is ribosome- or tRNA-associated, the tag goes to the ribosomal table.

Otherwise, if the gene is any kind of noncoding RNA, the tag goes to the ncRNA table.

Otherwise, the tag is not associated with a special gene type, and goes to a "generic" table:

Exonic, Junction/PolyA, and Canonical classes go to the canonical table.

Intronic and Noncanonical classes go to the noncanonical table.

Genic class goes to the genic table.
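
The table assignment for single-gene tags is a fixed priority chain, which can be sketched as below. The gene-type flags are hypothetical labels; how the pipeline actually decides that a gene is mitochondrial, ribosomal, and so on is not described in this section.

```python
GENERIC_TABLE = {
    "Exonic": "canonical",
    "Junction/PolyA": "canonical",
    "Canonical": "canonical",
    "Intronic": "noncanonical",
    "Noncanonical": "noncanonical",
    "Genic": "genic",
}

# Checked in order; the first matching flag wins.
SPECIAL_TABLES = [
    ("mitochondrial", "mitochondrial"),
    ("chloroplast", "chloroplast"),
    ("ribosomal", "ribosomal"),
    ("tRNA", "ribosomal"),
    ("ncRNA", "ncRNA"),
]

def choose_table(category, gene_flags):
    """Pick the table for a single-gene tag.

    category: one of the six genic categories.
    gene_flags: set of (hypothetical) flags describing the tag's gene.
    """
    for flag, table in SPECIAL_TABLES:
        if flag in gene_flags:
            return table
    return GENERIC_TABLE[category]

print(choose_table("Exonic", {"ncRNA", "mitochondrial"}))  # mitochondrial
print(choose_table("Intronic", set()))                     # noncanonical
```

The order of SPECIAL_TABLES matters: a mitochondrial ncRNA gene lands in the mitochondrial table, not the ncRNA table, matching the "otherwise" chain above.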

Tags are sent to pre-fasta reports sorted by class. Unambiguous tags go first, then silent repeats. Each type reports different data, and gets a different header within the report.

Step 2 Operations

Coming Soon

