Notes about Bozdech '03 Plasmodium Raw Data

Zip files were downloaded from the DeRisi Lab Malaria Transcriptome Database page:

  • Complete Set
  • QC Set
  • Overview

The files of interest were these tab-delimited text files:

Filename                Size      Contents
Complete_Dataset.txt    3731 KB   all raw data for every oligonucleotide at each time point
QC_Dataset.txt          3055 KB   set of oligonucleotides that passed all quality control filters
Overview_Dataset.txt    2388 KB   dataset used to generate the IDC phaseogram

The data were manipulated with several tools, including an Access database, Excel spreadsheets, and the "R" analysis language. Access and Excel could read these tab-delimited files directly with their default import settings.

In R I had problems parsing Complete_Dataset.txt, and the trouble appeared to be related to the Name field. Rather than devise a more complicated way to parse the file in R, I created a new CSV file for use with R.

To make analysis in R a bit easier, Excel was used to create a new file, Complete_Dataset.csv, that had these modifications from the original file:

  • Deleted "Name" column
  • Inserted empty columns "TP 23" (between "TP 22" and "TP 24") and "TP 29" (between "TP 28" and "TP 30") to complete the time sequence
  • Removed the blank in the "Time Point" column names, e.g., "TP 1" became "TP1"

The empty "TP23" and "TP29" columns were added so that time points 1 through 48 could be treated uniformly. The dataset already contained many other missing data points, so two all-missing columns posed no new problem.
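
The three modifications listed above could also be scripted. What follows is a minimal sketch in Python, not the procedure used for these notes (that was done by hand in Excel). It assumes the header row uses the column names "Name" and "TP 1" through "TP 48" as described above, and that every row has as many fields as the header. The names normalize_table and convert are hypothetical. Quote characters are read literally (csv.QUOTE_NONE) on the guess that free-text fields may contain them; that is an assumption, not a diagnosis of the R parsing problem.

```python
import csv
import re

# Matches "TP 1" as well as "TP1"; the number is the time point.
TP_PATTERN = re.compile(r"TP ?(\d+)")

def normalize_table(header, rows, last_tp=48, drop=("Name",)):
    """Drop unwanted columns, rename "TP n" to "TPn", fill absent time points."""
    tp_index = {}   # time point number -> column position in the original
    other = []      # positions of kept columns that are not time points
    for pos, name in enumerate(header):
        match = TP_PATTERN.fullmatch(name.strip())
        if match:
            tp_index[int(match.group(1))] = pos
        elif name not in drop:
            other.append(pos)

    # Non-time-point columns come first, then TP1..TP<last_tp> in order.
    new_header = [header[p] for p in other]
    new_header += [f"TP{t}" for t in range(1, last_tp + 1)]

    new_rows = []
    for row in rows:
        fixed = [row[p] for p in other]
        # Time points absent from the original (e.g. 23 and 29) become empty cells.
        fixed += [row[tp_index[t]] if t in tp_index else ""
                  for t in range(1, last_tp + 1)]
        new_rows.append(fixed)
    return new_header, new_rows

def convert(src="Complete_Dataset.txt", dst="Complete_Dataset.csv"):
    """Read the tab-delimited file and write the reshaped CSV."""
    with open(src, newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        rows = list(reader)
    new_header, new_rows = normalize_table(header, rows)
    with open(dst, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(new_header)
        writer.writerows(new_rows)
```

Calling convert() with its default file names would produce Complete_Dataset.csv next to the original text file. Empty cells in the added columns read as missing values in most tools.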

In the Access database the "Name" field was further parsed into separate fields of GeneID, Annotation, and Comments.


Contents of Datasets

"Complete" Dataset

   7091   Oligos in "Complete" dataset
   -216   Oligo_ID is "empty" (unclear why some data is present for these)
   6875   Unique Oligo_IDs
  -1265   "NULL" Gene ID (No "annotations" are present but 690 had some comments)
          [Note: "NULL" string here is not the same as "null" in database terminology]
     -2   GeneID was missing (E9299_3 and N1)
          ["null" in database terminology]
   5608   Oligos with associated GeneIDs
          [Annotation field missing for Oligo_IDs: F23846_3, oPFF72466]

4314 unique GeneIDs.

35 genes have 5 or more oligos; gene MAL6P1.147 is represented by 15 oligos.
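
The tallies above amount to sorting each row into one of four categories and then counting oligos per GeneID. A rough Python sketch of that bookkeeping follows. The record layout (pairs of Oligo_ID and GeneID, with None standing in for a database null) and all sample IDs are made up for illustration; the actual counts in these notes came from the Access database.

```python
from collections import Counter

def tally(records):
    """Classify (oligo_id, gene_id) pairs and count oligos per gene."""
    counts = Counter()
    oligos_per_gene = Counter()
    for oligo_id, gene_id in records:
        if not oligo_id:
            counts["empty Oligo_ID"] += 1
        elif gene_id is None:
            # A true database null, distinct from the literal string "NULL".
            counts["GeneID missing"] += 1
        elif gene_id == "NULL":
            counts['"NULL" Gene ID'] += 1
        else:
            counts["with GeneID"] += 1
            oligos_per_gene[gene_id] += 1
    return counts, oligos_per_gene

def genes_with_at_least(oligos_per_gene, n):
    """Return the genes represented by n or more oligos, sorted by name."""
    return sorted(g for g, k in oligos_per_gene.items() if k >= n)

# Made-up sample records, one per category plus a gene with two oligos.
sample = [
    ("oligo_a", "GENE_1"),
    ("oligo_b", "GENE_1"),
    ("oligo_c", "NULL"),
    ("oligo_d", None),
    ("", "GENE_2"),
]
counts, per_gene = tally(sample)
```

With the real data, genes_with_at_least(per_gene, 5) would give the list of genes having 5 or more oligos, and len(per_gene) the number of unique GeneIDs.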

"Quality Control Set"

   5071   Oligos in "Complete" set of 6875 unique oligos
     +9   "New" oligos not in "Complete" set:
          opfrr0006, ptrgln, ptrgly, ptrgly2, ptrphe, ptrpro, ptrthr, ptrtrp, snr1
   5080   Unique oligos in "Quality Control Set"

Bozdech:  "Fourier analysis was performed on each profile in the quality-controlled set (5081 oligonucleotides)."

   -555   "NULL" Gene ID (but 87 had some comments)
   4525   Oligos with associated GeneIDs
          [Annotation field missing for 30 Oligos]

3533 unique Gene IDs.

3529 unique Gene IDs are in the list of unique Gene IDs from the "Complete" set.  The four new Gene IDs are: ITS2, P, Plastid Genome, snRNA?
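
Comparisons like this one (which IDs in one dataset also occur in another, and which are new) reduce to set intersection and difference. A small Python sketch, assuming the unique Gene IDs of each dataset have already been collected into lists; the function name compare_ids is hypothetical, and the two "Complete" IDs shown are only a tiny excerpt chosen for illustration.

```python
def compare_ids(subset_ids, reference_ids):
    """Split subset_ids into those found in reference_ids and those that are new."""
    subset = set(subset_ids)
    reference = set(reference_ids)
    return {
        "shared": subset & reference,   # present in both datasets
        "new": subset - reference,      # present only in the subset
    }

# Tiny illustrative excerpt, not the full lists.
complete_genes = ["MAL6P1.147", "PFI1475w"]
qc_genes = ["MAL6P1.147", "PFI1475w",
            "ITS2", "P", "Plastid Genome", "snRNA?"]

result = compare_ids(qc_genes, complete_genes)
```

Here sorted(result["new"]) lists the four Gene IDs named above. The same function applies to Oligo_IDs, for example to check that every "Overview" oligo is in the QC dataset (the "new" set would then be empty).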

24 genes have 5 or more oligos; Gene "Plastid Genome" represented by 27 oligos; Gene MAL6P1.147 represented by 15 oligos; Gene PFI1475w represented by 10 oligos.

"Overview" Dataset

   3711   Oligos in "Complete" set of 6875 unique oligos
     +8   "New" oligos not in "Complete" set and introduced in the QC set:
          ptrgln, ptrgly, ptrgly2, ptrphe, ptrpro, ptrthr, ptrtrp, snr1
   3719   Unique oligos in "Overview" dataset

All "Overview" oligos are in the QC dataset.

   -335   "NULL" Gene ID (but 56 had some comments)
   3384   Oligos with associated GeneIDs

2687 unique Gene IDs

Gene "P" in the QC Dataset was renamed to Gene "PF14_0338" in the Overview set.

The annotation information for these three genes changed slightly between the QC dataset and the Overview dataset:  PFE0040c, PF11_0358, PF14_0451.

Note:  Bozdech reports in the P. falciparum Transcriptome Overview:  "The overview set represents 2714 unique ORFs (3395 oligonucleotides).  An additional 324 oligonucleotides represent ORFs that are not currently part of the manually annotated collection."  The caption to Figure 2 shows "transcriptional profiles for 2712 genes".

Updated
25 May 2005