Statistical Summary of CAMDA '04 Contest Datasets

"Complete" Dataset
7091
Oligos in "Complete" dataset
-216
Oligo_ID is "empty"
6875
Unique Oligo_IDs
-1265
"NULL" Gene ID (No "annotations" are present but 690 had some comments)
[Note:  "NULL" string here is not the same as "null" in database terminology]
-2
GeneID was missing (E9299_3 and N1)
["null" in database terminology]
5608
Oligos with associated GeneIDs.
[Annotation field missing for Oligo_IDs: F23846_3, oPFF72466]

4314 unique GeneIDs.

35 genes have 5 or more oligos;
Gene MAL6P1.147 represented by 15 oligos.
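
The tally above could be reproduced along the following lines. This is only a sketch: the `(oligo_id, gene_id)` record layout, the `summarize` helper, and the toy rows are hypothetical and are not the contest file format. It also assumes an "empty" Oligo_ID may appear either as a blank field or as the literal string "empty". The literal string "NULL" and a truly missing GeneID (`None`) are counted separately, as in the summary.

```python
from collections import Counter

def summarize(records):
    """Tally oligo and GeneID counts for a list of (oligo_id, gene_id) pairs.

    The record layout is an assumption made for illustration only.
    """
    total = len(records)
    # Drop rows whose Oligo_ID is blank or the literal string "empty" (assumed).
    named = [(o, g) for o, g in records if o and o != "empty"]
    # Keep the first GeneID seen for each distinct Oligo_ID.
    by_oligo = {}
    for oligo, gene in named:
        by_oligo.setdefault(oligo, gene)
    # "NULL" is a string value; None stands for a missing (database null) value.
    null_string = sum(1 for g in by_oligo.values() if g == "NULL")
    missing = sum(1 for g in by_oligo.values() if g is None)
    genes = Counter(g for g in by_oligo.values() if g is not None and g != "NULL")
    return {
        "total": total,
        "empty_oligo_id": total - len(named),
        "unique_oligos": len(by_oligo),
        "null_string": null_string,
        "missing": missing,
        "with_gene": sum(genes.values()),
        "unique_genes": len(genes),
        "genes_with_5_plus": sum(1 for n in genes.values() if n >= 5),
    }

# Hypothetical toy rows, not contest data.
toy = [
    ("a1", "G1"), ("a2", "G1"), ("a3", "G2"),
    ("a4", "NULL"), ("a5", None),
    ("", "G3"), ("empty", "G4"),
    ("a1", "G1"),  # repeated Oligo_ID, counted once
]
stats = summarize(toy)
print(stats)
```

Applied to the real "Complete" dataset, the same steps would be expected to yield the figures listed above, provided the file is first parsed into this record shape.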

"Quality Control Set"
5071
Oligos also present in the "Complete" set of 6875 unique oligos
+9
"New" oligos not in "Complete" set:
opfrr0006, ptrgln, ptrgly, ptrgly2, ptrphe, ptrpro, ptrthr, ptrtrp, snr1

5080
Unique oligos in "Quality Control Set"

Note:  Bozdech reports:  "Fourier analysis was performed on each profile in the quality-controlled set (5081 oligonucleotides)."  That is one more than the 5080 unique oligos counted here.

-555
"NULL" Gene ID (but 87 had some comments)
4525
Oligos with associated GeneIDs.
[Annotation field missing for 30 Oligos]

3533 unique Gene IDs.

3529 unique Gene IDs are in the list of unique Gene IDs from the "Complete" set.  The four new Gene IDs are: ITS2, P, Plastid Genome, snRNA?

24 genes have 5 or more oligos; Gene "Plastid Genome" represented by 27 oligos; Gene MAL6P1.147 represented by 15 oligos; Gene PFI1475w represented by 10 oligos.
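
The running totals for the "Complete" and "Quality Control" sets can be checked with a few lines of arithmetic. The sketch below uses only the numbers listed above; the helper name `running_tally` is made up for illustration.

```python
def running_tally(start, adjustments):
    """Apply signed adjustments in order and return every intermediate total."""
    totals = [start]
    for delta in adjustments:
        totals.append(totals[-1] + delta)
    return totals

# "Complete": 7091 oligos, minus empty IDs, minus "NULL" GeneIDs, minus missing GeneIDs.
complete = running_tally(7091, [-216, -1265, -2])
# "Quality Control Set": oligos shared with "Complete", plus new oligos, minus "NULL" GeneIDs.
qc = running_tally(5071, [+9, -555])

print(complete)  # [7091, 6875, 5610, 5608]
print(qc)        # [5071, 5080, 4525]

# Unique Gene IDs in the QC set: those shared with "Complete" plus the four new ones.
qc_unique_genes = 3529 + 4
print(qc_unique_genes)  # 3533
```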

"Overview" Dataset
3711
Oligos also present in the "Complete" set of 6875 unique oligos
+8
"New" oligos not in "Complete" set and introduced in the QC set:  ptrgln, ptrgly, ptrgly2, ptrphe, ptrpro, ptrthr, ptrtrp, snr1
3719
Unique oligos in "Overview" dataset.  All "Overview" oligos are in the QC dataset.

-335
"NULL" Gene ID (but 56 had some comments)
3384
Oligos with associated GeneIDs.

2687 unique Gene IDs

Gene "P" in the QC Dataset was renamed to Gene "PF14_0338" in the Overview set.

The annotation information for these three genes changed slightly between the QC dataset and the Overview dataset:  PFE0040c, PF11_0358, PF14_0451.
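
The "new" oligo lists given for the QC and Overview sets can be compared with simple set operations. The sketch below uses only the IDs listed in this summary and shows that the single new QC oligo absent from the Overview list is opfrr0006.

```python
# "New" oligos listed for the Quality Control Set (not in the "Complete" set).
qc_new = {
    "opfrr0006", "ptrgln", "ptrgly", "ptrgly2", "ptrphe",
    "ptrpro", "ptrthr", "ptrtrp", "snr1",
}
# "New" oligos listed for the "Overview" dataset.
overview_new = {
    "ptrgln", "ptrgly", "ptrgly2", "ptrphe",
    "ptrpro", "ptrthr", "ptrtrp", "snr1",
}

# Every new Overview oligo was introduced in the QC set.
all_from_qc = overview_new <= qc_new
# New QC oligos that did not carry over to the Overview set.
only_in_qc = qc_new - overview_new

print(all_from_qc)         # True
print(sorted(only_in_qc))  # ['opfrr0006']
```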

Note:  Bozdech reports in the P. falciparum Transcriptome Overview:  "The overview set represents 2714 unique ORFs (3395 oligonucleotides).  An additional 324 oligonucleotides represent ORFs that are not currently part of the manually annotated collection."  The caption to Figure 2 shows "transcriptional profiles for 2712 genes".
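
The figures Bozdech reports can be set against the counts in this summary. The sketch below is a rough reconciliation. It assumes that "oligos with associated GeneIDs" corresponds to the oligonucleotides Bozdech counts as representing ORFs, and that the "NULL" Gene ID oligos correspond to the "additional" oligonucleotides; the source does not state this mapping. Under that assumption the overall oligo total agrees (3395 + 324 = 3719), while the split differs by 11 oligos and the unique gene count by 27.

```python
# Counts from the "Overview" section of this summary.
summary = {"oligos": 3719, "with_gene": 3384, "null_gene": 335, "unique_genes": 2687}
# Figures quoted from Bozdech for the overview set.
reported = {"orf_oligos": 3395, "other_oligos": 324, "unique_orfs": 2714}

# The two reported oligo groups add up to the total counted here.
total_matches = reported["orf_oligos"] + reported["other_oligos"] == summary["oligos"]

# Differences under the assumed mapping between the two sets of categories.
oligo_gap = reported["orf_oligos"] - summary["with_gene"]
null_gap = summary["null_gene"] - reported["other_oligos"]
orf_gap = reported["unique_orfs"] - summary["unique_genes"]

# Quality-controlled set: 5081 oligonucleotides reported versus 5080 counted here.
qc_gap = 5081 - 5080

print(total_matches)                         # True
print(oligo_gap, null_gap, orf_gap, qc_gap)  # 11 11 27 1
```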