I have included information here from E-mails exchanged with Suzanne Vernon, Centers for Disease Control and Prevention. Quotations from Suzanne are in italics and are tagged with her initials, "sdv," and the date of the E-mail. My E-mail questions or comments to Suzanne are tagged "efg".
sdv, 1 Dec 2005 (via an E-mail to Chris Bausch at Stowers)
sdv, 4 Dec 2005:
efg, 6 Dec 2005:
sdv, 6 Dec 2005:
efg, 6 Dec 2005:
sdv, 6 Dec 2005:
ArrayVision and ArrayVision RLS
A demo version of ArrayVision 8.0 software can be downloaded from this link and used for 28 days. In mid-December 2005 the "automatic" E-mail containing the password to install ArrayVision was not sent, so you'll need to use the "Contact Us" link on that page and request the password. I obtained the password by phone in just a few minutes. This demo version of the software provides an ArrayVision Version 8.0 Reference Manual as a PDF file (after installation, see C:\Program Files\ArrayVision Evaluation 8.0\Manual\Reference Manual.pdf).
However, if one explores the Invitrogen link given above by sdv, there is a software link to a page about ArrayVision RLS software, which is the software that was used (see sdv comment below). This page says "ArrayVision™ RLS (Resonance Light Scattering) is available as a stand-alone software package, or as part of the GeniconRLS™ System." ArrayVision RLS is slightly different from the regular ArrayVision. Unfortunately, ArrayVision RLS is being phased out by Invitrogen and a demo version is not available (see below).
sdv, 16 Dec 2005
Invitrogen provides a Technical Documentation page, which provides the technical manuals about ArrayVision RLS, as well as several papers about the RLS technology, including:
Invitrogen sells ArrayVision RLS as part of their GeniconRLS™ Detection and Imaging System. A single-user license of ArrayVision RLS Image Analysis Software costs $7,038. ArrayVision is a trademark of GE Healthcare but, according to a 16 Dec 2005 E-mail from Peter Chiang of GE Healthcare, ArrayVision RLS belongs to Invitrogen:
The online Startup Guide for ArrayVision RLS mentions the software can be installed from a CD and used for 14 days. Wendy Price, Invitrogen's Business Area Manager of Gene Expression Profiling, said in a 22 Dec 2005 E-mail:
The description of Linear Normalization in the Training Companion (p. 37) is a bit weak, since it mostly addresses the mechanics of how to enable the calculation and says little about the computation itself:
The Startup Guide gives a few clues about this normalization:
gene_expression_data.zip
The Gene Expression directory contains 352 files: 175 TIF files and 177 TXT files (it's not clear why there are two more TXT files than TIFs):

Each TIF file is about 21.8 MB. Additional technical information about the TIFF files can be viewed here. While I'm not inclined to re-analyze the images, the ArrayVision Evaluation Software can be used to view and analyze the image files (Use Image | Retrieve .. | select files of type .tif):

It's unclear what analysis can be performed using "plain" ArrayVision instead of "ArrayVision RLS" since "ArrayVisionRLS ... has some added features designed specifically for quantifying resonance light scattering (RLS)." See comments below about the geometry of the array deduced from the PosX and PosY fields in the .txt files.
gene_expression_values.zip
The 177 TXT files in this directory match the 177 TXT files in the previous section by name and content. (The CRC32s of all files were identical.)
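A check of this kind can be sketched in R. This is only an illustration, not the method used above: base R has no CRC32 function, so `tools::md5sum()` is swapped in, and the helper name `compare_dirs` and the demo file contents are made up for the example.

```r
# Hypothetical helper: compare same-named .txt files in two directories.
# tools::md5sum() stands in for the CRC32 comparison mentioned in the notes.
compare_dirs <- function(dir_a, dir_b, pattern = "\\.txt$") {
  files_a <- list.files(dir_a, pattern = pattern)
  files_b <- list.files(dir_b, pattern = pattern)
  common  <- intersect(files_a, files_b)
  sum_a   <- unname(tools::md5sum(file.path(dir_a, common)))
  sum_b   <- unname(tools::md5sum(file.path(dir_b, common)))
  list(
    only_in_a = setdiff(files_a, files_b),
    only_in_b = setdiff(files_b, files_a),
    checks    = data.frame(file = common, identical = sum_a == sum_b,
                           stringsAsFactors = FALSE)
  )
}

# Demo on throwaway directories with invented contents.
dir_a <- tempfile("data_");   dir.create(dir_a)
dir_b <- tempfile("values_"); dir.create(dir_b)
for (f in c("one.txt", "two.txt")) {
  writeLines(c("id\tvalue", "x\t1"), file.path(dir_a, f))
  writeLines(c("id\tvalue", "x\t1"), file.path(dir_b, f))
}
writeLines("extra", file.path(dir_a, "three.txt"))  # present in dir_a only

res <- compare_dirs(dir_a, dir_b)
print(res$checks)
print(res$only_in_a)
```

With the real data, `dir_a` and `dir_b` would point at the unzipped gene_expression_data and gene_expression_values directories.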

Each of the TXT files is a tab-delimited file. The black rectangles below show the location of tabs in a representative file.

These .txt files all have 20,160 probes and have "headers" above each column. Unfortunately, the R language does not "like" many of the characters in the "header" and converts them to periods. Also, read.delim in the R version used here treated anything after a "#" as a comment by default. Since some of the control probes, e.g., those shown above with a mwghuman10K prefix, have a "#" as part of their ID fields, comment.char="" must be specified with read.delim to read the files successfully in R.
All but one of the GeneExpression files have a tab separator at the end of each line (note the black box, i.e., a tab, at the end of each line above). Since R treats tabs as separators, this final tab resulted in an extra column of NAs when using read.delim. This empty column was deleted.
In addition to lacking the final tab on every line, the file 22149604A-03 apparently was edited, perhaps using Excel, and resaved. This file has many "0" values instead of the "0.000" values found in the other files. Overall this file is over half a megabyte smaller than the others, which is a bit suspicious since it's unclear why it differs from the rest of the files.
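The reading steps described above can be sketched as follows. The three-line demo file and its column headers are invented stand-ins for a real ArrayVision RLS export, and `read_array_file` is a hypothetical helper name. Current versions of read.delim already default to comment.char = "", so passing it explicitly is harmless and keeps the behavior the same on older versions.

```r
# Invented stand-in for one export file: tab-delimited, awkward header
# characters, a "#" inside a probe ID, and a trailing tab on every line.
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Spot labels\tPosX\tPosY\tDens - Levels\t",
  "NM_001533\t10.5\t20.5\t0.000\t",
  "mwghuman10K#1\t11.5\t20.5\t123.456\t"
), demo_file)

# Hypothetical helper: read one file and drop columns that are entirely NA
# (the trailing tab produces one such column).
read_array_file <- function(path) {
  d <- read.delim(path, comment.char = "", quote = "",
                  stringsAsFactors = FALSE)
  all_na <- vapply(d, function(col) all(is.na(col)), logical(1))
  d[, !all_na, drop = FALSE]
}

arr <- read_array_file(demo_file)
print(names(arr))  # header characters R does not accept become periods
print(arr)
```

Passing `check.names = FALSE` through to read.delim would keep the original header text instead of the period-substituted names, at the cost of needing backticks or `[[ ]]` to address those columns.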
Here's information about the fields in the .txt files:
Definition of Fields in .txt files Created by ArrayVision RLS
The fields PosX and PosY give the location of the spots. A short R script was used to plot the locations of the spots from the PosX and PosY values in columns 6 and 7 for one of the datasets and save the plot as a PDF.
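The original script is not reproduced here. A minimal sketch of the same idea follows; it uses synthetic coordinates (12 × 4 blocks of 20 × 21 spots, with arbitrary made-up spacing) in place of the real columns 6 and 7, and the output file name is hypothetical.

```r
# Synthetic spot positions: 12 block rows x 4 block columns,
# each block holding 20 rows x 21 columns of spots. Spacing is arbitrary.
grid <- expand.grid(col = 1:21, row = 1:20, block_col = 1:4, block_row = 1:12)
spots <- data.frame(
  PosX = (grid$block_col - 1) * 23 + grid$col,
  PosY = (grid$block_row - 1) * 22 + grid$row
)

# With a real file, the coordinates would instead come from columns 6 and 7:
#   d <- read.delim("somefile.txt", comment.char = "")
#   spots <- data.frame(PosX = d[[6]], PosY = d[[7]])

pdf_file <- file.path(tempdir(), "spot-layout.pdf")
pdf(pdf_file, width = 4, height = 10)
plot(spots$PosX, spots$PosY, pch = ".",
     ylim = rev(range(spots$PosY)),   # put the first row at the top
     xlab = "PosX", ylab = "PosY", main = "Spot locations")
dev.off()

nrow(spots)  # number of spots plotted
```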
The resulting PDF shows the geometry of the slide: 12 rows of 4 blocks, with each block containing 20 rows and 21 columns of spots. This means (12 × 4 blocks) × (20 × 21 spots per block) = 48 × 420 = 20,160 spots. This was verified with two other datasets, 21656101A.txt and 29430601A_027A.txt (roughly the first, middle and last from the list of file names). The same number of spots, 20,160, was observed in all three.
The first 8 characters of the filenames match the "ABTID" field in the clinical and blood work datasets. The Clinical.R program was used to match these filenames with the ABTID fields, and the following Venn diagram was constructed:

156 Gene Expression datasets have complete blood evaluation data and corresponding clinical data. There are replicates for 8 of the patients (16 microarray datasets):
Here is a list of the IDs (i.e., ABTIDs) for these eight patients with replicates.
The raw IDs for these 8*2 replicates look like this:
Five datasets of microarray data do not match any clinical data:
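The filename-to-ABTID matching behind these counts can be sketched as below. This is not Clinical.R itself: the file names and clinical IDs here are invented for illustration, and only the "first 8 characters" rule comes from the notes above.

```r
# Invented file names and clinical IDs, for illustration only.
array_files    <- c("10000001A.txt", "10000002A.txt", "10000002repA.txt",
                    "10000009A.txt")
clinical_abtid <- c("10000001", "10000002", "10000003")

# The first 8 characters of each file name are taken as the ABTID.
array_ids <- substr(basename(array_files), 1, 8)

matched    <- array_files[array_ids %in% clinical_abtid]    # have clinical data
unmatched  <- array_files[!array_ids %in% clinical_abtid]   # no clinical data
replicated <- unique(array_ids[duplicated(array_ids)])      # patients with replicates
no_array   <- setdiff(clinical_abtid, array_ids)            # clinical data, no array

print(matched); print(unmatched); print(replicated); print(no_array)
```

The three sets (`matched`, `unmatched`, `no_array`) correspond to the regions of a two-circle Venn diagram of array files versus clinical records.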
11 Jan 2006
A recently published paper explained several exclusion criteria and helped explain the CLUSTER field in the clinical dataset:
This paper explained that of the 227 patients enrolled in the original study, only 164 participants with no medical or psychiatric exclusionary conditions were considered relevant. These 164 patients were further segregated, as described in this paper, into clusters: Least (Least Severe), Middle (Intermediate), or Worst (Most Severe). (The words Least, Middle, and Worst are in the data file. The words Least Severe, Intermediate, and Most Severe were used in the paper.) The NoExclusions.csv file, which was described on the Clinical Data page, and an R script, AssignArrayToCluster.R, were used to connect the cluster assignment (i.e., Least, Middle, Worst) from this paper to the 177 filenames of microarray data. When a patient was excluded from the study, the word "EXCLUDED" was used instead of the cluster assignment in the resulting file, ArrayAssignCluster.csv. The first five lines of this file show representative information:
AssignArrayToCluster.R shows this summary of the cluster assignments to the 177 files:
Lastly, AssignArrayToCluster.R creates four separate files that are lists of IDs (i.e., ABTIDs): Least.csv, Middle.csv, Worst.csv, and EXCLUDED.csv. This summary table explains why only 123 microarray datasets need to be analyzed, since they correspond to the 164 non-excluded patients from the paper:
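The logic described for AssignArrayToCluster.R can be approximated as follows. This is a sketch under assumptions, not the script itself: the IDs and file names are invented, and the column names `ABTID`, `CLUSTER`, `File` and `Cluster` are guesses at the layout rather than the actual headers of NoExclusions.csv or ArrayAssignCluster.csv.

```r
# Invented stand-in for NoExclusions.csv (non-excluded patients only).
no_exclusions <- data.frame(
  ABTID   = c("10000001", "10000002", "10000003"),
  CLUSTER = c("Least", "Middle", "Worst"),
  stringsAsFactors = FALSE
)
# Invented array file names; the last patient is not in no_exclusions.
array_files <- c("10000001A.txt", "10000002A.txt",
                 "10000003A.txt", "10000004A.txt")

abtid   <- substr(array_files, 1, 8)
cluster <- no_exclusions$CLUSTER[match(abtid, no_exclusions$ABTID)]
cluster[is.na(cluster)] <- "EXCLUDED"   # no match means the patient was excluded

assignment <- data.frame(File = array_files, ABTID = abtid, Cluster = cluster,
                         stringsAsFactors = FALSE)
summary_table <- table(assignment$Cluster)
print(summary_table)

# One ID list per group, written to a temporary directory for the demo.
out_dir <- tempdir()
for (grp in c("Least", "Middle", "Worst", "EXCLUDED")) {
  ids <- unique(assignment$ABTID[assignment$Cluster == grp])
  write.csv(data.frame(ABTID = ids),
            file.path(out_dir, paste0(grp, ".csv")), row.names = FALSE)
}
```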
Files with "rep" in the filename (e.g., 2071790repA.text) are the remaining replicates. Analysis of the 123 microarray datasets would correspond to the same patients analyzed in the paper by Reeves, et al.
30 Jan 2006
The Bioconductor package biomaRt can be used to obtain Gene and Gene Ontology IDs for many of the probes in the microarray data. [Also see 5 Feb 2006 notes below for needed modification.] The R program Probe-to-Gene.R (see ArrayProbe-to_GeneID.R on 5 Feb 2006) uses biomaRt to connect the probe IDs with many Gene names. You can run or modify the program locally on your PC (or UNIX box) if biomaRt is installed and you have a local copy of the file, gene_id_human_40k_a.txt, which can be obtained here (see MWG Biotech info below). Change the R statement defining filename to point to your local file. It's not clear why multiple Gene names were given for some probes. The following shows the connections between probe IDs and genes for the first four probes.
| probe | gene | band | chromosome | start | end | martID | description |
| NM_001533 | HNRPL | q13.2 | 19 | 44018883 | 44032452 | ENSG00000104824 | Heterogeneous nuclear ribonucleoprotein L (hnRNP L). [Source:Uniprot/SWISSPROT;Acc:P14866] |
| NM_031990 | PTBP1 | p13.3 | 19 | 748411 | 763327 | ENSG00000011304 | Polypyrimidine tract-binding protein 1 (PTB) (Heterogeneous nuclear ribonucleoprotein I) (hnRNP I) (57 kDa RNA-binding protein PPTB-1). [Source:Uniprot/SWISSPROT;Acc:P26599] |
| S76822 | FDFT1 | p23.1 | 8 | 11697664 | 11734215 | ENSG00000079459 | Squalene synthetase (EC 2.5.1.21) (SQS) (SS) (Farnesyl-diphosphate farnesyltransferase) (FPP:FPP farnesyltransferase). [Source:Uniprot/SWISSPROT;Acc:P37268] |
| AF232742 | KLKB1 | q35.2 | 4 | 187523815 | 187554773 | ENSG00000164344 | Plasma kallikrein precursor (EC 3.4.21.34) (Plasma prekallikrein) (Kininogenin) (Fletcher factor) [Contains: Plasma kallikrein heavy chain; Plasma kallikrein light chain]. [Source:Uniprot/SWISSPROT;Acc:P03952] |
These "connections" can be verified at NCBI by looking for the Homo sapiens "hit" for a gene name and finding the probe ID, usually under "Related Sequences." (Original results removed; see 5 Feb 2006 results below instead.) (Note: 9 of the 10 SNP genes can be "connected" using this dataset; TPH2 is missing.)
I posted a question to the Bioconductor mailing list about the IDs, and why some were type=refseq and others were type=embl. The answer there was to ask the manufacturer!
Be careful using this file in Excel because of inexcusable Microsoft defaults: about 24 genes will appear formatted incorrectly by default. For example, Probe AK000675 is associated with the March1 gene, which Excel will treat as a date field by default. (See Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.)
A very similar program, Probe-to-GO.R (see ArrayProbe-to_GO.R on 5 Feb 2006), also uses biomaRt to connect the probe IDs with Gene Ontology information. Again, this program can be run locally if you define filename appropriately. (Original results removed; see 5 Feb 2006 results below instead.) Please validate these results yourself.
Sample output from GONums.txt (one line per probe with all GO IDs):
NM_001533 GO:0000166 GO:0003723 GO:0006397 GO:0005654 GO:0030530 GO:0005634
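A line in this format can be produced from a long table of (probe, GO ID) pairs with base R alone. The sketch below is not taken from Probe-to-GO.R; the input table layout is an assumption, and the second probe, `EXAMPLE_PROBE`, is a placeholder with GO IDs reused purely for illustration.

```r
# Assumed long-format input: one row per (probe, GO ID) pair, possibly with
# duplicates. NM_001533's IDs are from the sample line above; EXAMPLE_PROBE
# is a made-up placeholder.
go_long <- data.frame(
  probe = c(rep("NM_001533", 7), rep("EXAMPLE_PROBE", 2)),
  go    = c("GO:0000166", "GO:0003723", "GO:0006397", "GO:0005654",
            "GO:0030530", "GO:0005634", "GO:0000166",   # last one is a duplicate
            "GO:0005634", "GO:0005634"),
  stringsAsFactors = FALSE
)

# Collapse to one space-separated string of unique GO IDs per probe.
go_by_probe <- vapply(split(go_long$go, go_long$probe),
                      function(ids) paste(unique(ids), collapse = " "),
                      character(1))
go_lines <- paste(names(go_by_probe), go_by_probe)

gonums_file <- file.path(tempdir(), "GONums-demo.txt")
writeLines(go_lines, gonums_file)
print(go_lines)
```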
Sample output from GOinfo.csv:
Please give me feedback if you find problems with, or have suggestions about, these R programs. Again, use at your own risk and please validate the connections with NCBI.
2 Feb 2006
The GeneInfo.csv and GOinfo.csv files have duplicates, which should be removed. The original "master file" of probes from the manufacturer (which can be obtained here) had 20,160 probes. Of these, 460 probes were ignored (192 were "empty*", 208 were "mwgaracontrol*", and 60 were either "mwghuman*" or "mwghumcontrol*"), leaving 19,700. The GO and gene matches above were made using the probe name, which was an accession number. All 19,700 probe names of the form ProbeName_index were unique. The suffix on the original probe name was removed, e.g., NG_000016_23 became NG_000016. The 19,700 unique ProbeName_index IDs became 19,508 unique ProbeNames after removal of this suffix.
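The filtering and suffix removal described above can be sketched as follows. Only the prefixes and the NG_000016_23 example come from the notes; the other IDs in the demo vector are invented, and the exact spelling of the control names in the real master file may differ.

```r
# Toy ProbeName_index values; only NG_000016_23 is from the notes.
probe_index <- c("NG_000016_23", "NM_001533_1", "NM_001533_2",
                 "empty_7", "mwgaracontrol_3", "mwghuman10K#1_4",
                 "mwghumcontrol_9")

# Drop blank and control probes by prefix.
is_control <- grepl("^(empty|mwgaracontrol|mwghuman|mwghumcontrol)", probe_index)
kept <- probe_index[!is_control]

# Remove the trailing "_index" suffix, then de-duplicate.
probe_names <- unique(sub("_[0-9]+$", "", kept))
print(probe_names)

# Arithmetic from the notes: controls and the remaining probe count.
n_controls  <- 192 + 208 + 60
n_remaining <- 20160 - n_controls
c(controls = n_controls, remaining = n_remaining)
```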
5 Feb 2006
The "master file" of probes doesn't match the probes in the data files. The comments from 2 Feb summarize the info from the "master file" of probes (which can be obtained here). There were 19,700 probe names that were unique, as described above. I verified that the "spot names" from all the microarray expression datasets match one another, but the list of spot names doesn't match the list of probes from the master file.
Each array has 20,160 spots (a 12-by-4 matrix of 20-by-21 blocks of spots = 20,160). As in the master file, 460 probes are control or blank (192 were "blank*" instead of "empty*", 208 were "mwgaracontrol*", and 60 were either "mwghuman*" or "mwghumcontrol*"). Of the remaining 19,700 probe names, some have a suffix of "(1)" or "(2)" to make them unique. When these suffixes are removed, there are 19,529 unique probe names.
Is there good agreement between the 19,508 unique probes from the "master file" and the 19,529 probes actually on each array? Not exactly. There are only 18,711 probes in both sets. The "master file" contains an additional 797 probes, and the actual array files have 818 additional probes. So the Probe-to-GO and Probe-to-Gene scripts should be modified to select information for the real 19,529 probes on the chip, not the 19,508 in the "master file." The list of 19,529 unique array probe IDs is in the file arrayprobes.csv.
The programs ArrayProbe-to-GeneID.R and ArrayProbe-to-GO.R used biomaRt to connect the probe IDs with gene IDs and gene ontology information. The output file from ArrayProbe-to-GeneID.R contains matches to 20,601 gene IDs. However, 3,208 of these were "NA" (not available) and can be deleted, leaving 17,393 gene IDs. These 17,393 gene IDs, of which 12,958 are unique, correspond to 16,321 array probes. The output file from ArrayProbe-to-GO.R has not yet been analyzed.
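The comparison between the two probe lists is a plain set operation. A minimal sketch follows, with obviously fake probe IDs standing in for the real lists; the suffix pattern assumes the "(1)"/"(2)" marker sits at the end of the spot name, which is a guess at the exact format.

```r
# Fake IDs for illustration only.
master_probes <- c("P001", "P002", "P003", "P004")
array_spots   <- c("P002", "P003(1)", "P003(2)", "P004", "P005",
                   "blank1", "mwgaracontrol2")

# Drop blank/control spots, strip a trailing "(n)" suffix, de-duplicate.
is_control   <- grepl("^(blank|mwgaracontrol|mwghuman|mwghumcontrol)", array_spots)
array_probes <- unique(sub("[[:space:]]*\\([0-9]+\\)$", "", array_spots[!is_control]))

in_both     <- intersect(master_probes, array_probes)
master_only <- setdiff(master_probes, array_probes)
array_only  <- setdiff(array_probes, master_probes)
print(list(in_both = in_both, master_only = master_only, array_only = array_only))

# The counts reported in the notes are internally consistent:
stopifnot(19508 - 18711 == 797,    # probes only in the master file
          19529 - 18711 == 818,    # probes only on the arrays
          20601 - 3208  == 17393)  # gene ID matches after dropping NAs
```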
Files at MWG Biotech (specifically the Excel file or the TXT file) have the expected number of probes, 20,160, as found in the CFS microarray datasets, and information identifying the probes. See notes above, especially for 5 Feb 2006.
Last Updated
5 Feb 2006