Notes about Bozdech '03 Plasmodium Raw Data
Zip files were downloaded from the DeRisi Lab Malaria Transcriptome Database page:
The files of interest were these tab-delimited text files:
The data were manipulated with several tools, including an Access database, Excel spreadsheets, and the "R" analysis language. Access and Excel could read these tab-delimited files directly using their defaults. R had problems parsing the file, apparently related to the Name field in the Complete_Dataset.txt file. Rather than work out a more complicated way to parse the file in R, Excel was used to create a new file for use with R, Complete_Dataset.csv, that had these modifications from the original file:
Empty "TP23" and "TP29" columns were added so that time points 1 to 48 could be treated in a consistent manner, since the dataset already had many other missing data points. In the Access database, the "Name" field was further parsed into separate fields: GeneID, Annotation, and Comments.
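The column padding described above can be sketched in a few lines. This is a minimal Python illustration, not the Excel and R workflow actually used for these notes. The `TP1` to `TP48` column names are extrapolated from the "TP23" and "TP29" labels, the sample row is invented, and the function names are hypothetical. Reading with `csv.QUOTE_NONE` keeps quote characters in a free-text field such as Name as ordinary data. Stray quote characters are one plausible cause of the R parsing problem, but these notes do not confirm it.

```python
import csv
import io

# Assumed column names: the notes mention "TP23" and "TP29" and time points 1 to 48.
TIME_POINTS = [f"TP{i}" for i in range(1, 49)]


def read_tab_delimited(text):
    """Parse tab-delimited text, treating quote characters as plain data."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return list(reader)


def pad_time_points(rows, time_points=TIME_POINTS):
    """Return rows in which every time point column exists (empty if absent)."""
    padded = []
    for row in rows:
        # Keep the non-time-point columns first, in their original order.
        new_row = {k: v for k, v in row.items() if k not in time_points}
        for tp in time_points:
            new_row[tp] = row.get(tp, "")
        padded.append(new_row)
    return padded


def to_csv(rows):
    """Serialize rows as CSV text; fields containing quotes are escaped."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Invented sample: a header with a gap at TP23 and one data row.
sample = ('Oligo_ID\tName\tTP22\tTP24\n'
          'x1\tsome "quoted" gene\'s name\t0.5\t0.7\n')
rows = pad_time_points(read_tab_delimited(sample))
csv_text = to_csv(rows)
```

After padding, every row has the same 48 time point columns, so a missing time point looks like any other missing value in downstream analysis.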
Contents of Datasets
| "Complete" Dataset | |
|---|---|
| 7091 | Oligos in "Complete" dataset |
| -216 | Oligo_ID is "empty" (unclear why some data is present for these) |
| 6875 | Unique Oligo_IDs |
| -1265 | "NULL" Gene ID (No "annotations" are present but 690 had some comments) [Note: "NULL" string here is not the same as "null" in database terminology] |
| -2 | GeneID was missing (E9299_3 and N1) ["null" in database terminology] |
| 5608 | Oligos with associated GeneIDs. 4314 unique GeneIDs. 35 genes have 5 or more oligos; |
| "Quality Control Set" | |
| 5071 | Oligos in "Complete" set of 6875 unique oligos |
| +9 | "New" oligos not in "Complete" set: |
| 5080 | Unique oligos in "Quality Control Set". Bozdech: "Fourier analysis was performed on each profile in the quality-controlled set (5081 oligonucleotides)." |
| -555 | "NULL" Gene ID (but 87 had some comments) |
| 4525 | Oligos with associated GeneIDs. 3533 unique Gene IDs. 3529 unique Gene IDs are in the list of unique Gene IDs from the "Complete" set. The four new Gene IDs are: ITS2, P, Plastid Genome, snRNA? 24 genes have 5 or more oligos; Gene "Plastid Genome" represented by 27 oligos; Gene MAL6P1.147 represented by 15 oligos; Gene PFI1475w represented by 10 oligos. |
| "Overview" Dataset | |
| 3711 | Oligos in "Complete" set of 6875 unique oligos |
| +8 | "New" oligos not in "Complete" set and introduced in the QC set: ptrgln, ptrgly, ptrgly2, ptrphe, ptrpro, ptrthr, ptrtrp, snr1 |
| 3719 | Unique oligos in "Overview" dataset. All "Overview" oligos are in the QC dataset. |
| -335 | "NULL" Gene ID (but 56 had some comments) |
| 3384 | Oligos with associated GeneIDs. 2687 unique Gene IDs. Gene "P" in the QC Dataset was renamed to Gene "PF14_0338" in the Overview set. The annotation information for these three genes changed slightly between the QC dataset and the Overview dataset: PFE0040c, PF11_0358, PF14_0451. Note: Bozdech reports in the P. falciparum Transcriptome Overview: "The overview set represents 2714 unique ORFs (3395 oligonucleotides). An additional 324 oligonucleotides represent ORFs that are not currently part of the manually annotated collection." The caption to Figure 2 shows "transcriptional profiles for 2712 genes". |
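The bookkeeping in the tables above follows one pattern: set aside rows with an empty Oligo_ID, separate the literal "NULL" gene IDs from truly missing ones, then count oligos per gene. The Python sketch below illustrates that pattern under stated assumptions. The counts in these notes were produced with Access, Excel, and R, not with this code; the function names and the toy records are hypothetical. The `CHECKS` list simply replays the arithmetic reported in the tables.

```python
from collections import Counter


def summarize(records, min_oligos=5):
    """Tally (oligo_id, gene_id) pairs in the same order as the tables."""
    records = list(records)
    # Rows with an empty Oligo_ID are set aside first.
    named = {oligo: gene for oligo, gene in records if oligo}
    # The string "NULL" is a placeholder gene ID; None or "" is truly missing.
    null_string = sum(1 for gene in named.values() if gene == "NULL")
    missing = sum(1 for gene in named.values() if not gene)
    per_gene = Counter(g for g in named.values() if g and g != "NULL")
    return {
        "rows": len(records),
        "empty_oligo_id": sum(1 for oligo, _ in records if not oligo),
        "unique_oligos": len(named),
        "null_gene_id": null_string,
        "missing_gene_id": missing,
        "oligos_with_gene": sum(per_gene.values()),
        "unique_genes": len(per_gene),
        "genes_with_many_oligos": sorted(
            g for g, n in per_gene.items() if n >= min_oligos),
    }


def tallies_consistent(checks):
    """True if every start value plus its adjustments equals the result."""
    return all(start + sum(adjust) == result
               for start, adjust, result in checks)


# (start, adjustments, result) triples copied from the tables above.
CHECKS = [
    (7091, [-216], 6875),       # "Complete": rows to unique Oligo_IDs
    (6875, [-1265, -2], 5608),  # "Complete": oligos with GeneIDs
    (5071, [9], 5080),          # QC set: unique oligos
    (5080, [-555], 4525),       # QC set: oligos with GeneIDs
    (3711, [8], 3719),          # Overview: unique oligos
    (3719, [-335], 3384),       # Overview: oligos with GeneIDs
]

# Invented toy records, for illustration only.
toy = [("o1", "geneA"), ("o2", "geneA"), ("o3", "geneB"), ("o4", "NULL"),
       ("o5", None), ("", "geneC"), ("o6", "geneA")]
summary = summarize(toy, min_oligos=3)
all_ok = tallies_consistent(CHECKS)
```

Keeping the "NULL" string and the missing value as separate counts mirrors the distinction the "Complete" table draws between the two.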
Updated
25 May 2005