Understanding CAMDA '06 Data



I have included information here from E-mails exchanged with Suzanne Vernon, Centers for Disease Control and Prevention.  Quotations from Suzanne are tagged with her initials, "sdv," and the date of the E-mail, and are in italics.  My E-mail questions to Suzanne, or my comments, are tagged "efg".


sdv, 1 Dec 2005 (via an E-mail to Chris Bausch at Stowers)

The paper [Whistler, et al, "Integration of gene expression, ..."] is only provided as background information on the illness and to demonstrate that gene expression profiling of the peripheral blood has utility in describing the heterogeneity of this illness. 

The microarrays we used were from a company called MWG Biotech, the 40K microarray which consists of 2 glass slides (A and B), each with 20,000 features.  We provided only gene expression data for the A microarray only.  The spreadsheet of csv file have microarrays/samples in columns and features/genes in rows.  The data is raw (so [you] can transform, normalize, ...) so there does not seem to be any real good reason to go back to the actual images. 

There are many challenges to this particular dataset; including determining ways to integrate the various types of data to derive new insight into the biology, and new computational approaches for dealing with signal/noise issues in various data types. 

sdv, 4 Dec 2005:

Each data table has the subjects identified with and ABTID.  There were 227 people/subjects eligible for the study.  Of these 227 people, only 177 samples from these people gave satisfactory microarray results.

This is the first study where [we] used the MWG 40K, which was the most recent and comprehensive array.  We only used the A slide from this microarray set as the B slide failed all our qc measures.  Each subject's blood sample, as identified by ABTID, is hybridized to a microarray slide.  This is similar to the Affymetrix approach, one chip:one subject.  The detection was not with fluorescence, as Affy uses, but with a resonance light scattering, resulting in a single intensity readout.  The microarray slides are imaged with a CCD camera - versus a laser - to generate the tiff images you have on your website.  To read more about this approach, see http://www.invitrogen.com/content.cfm?pageid=9912.  The image is acquired and intensities quantified using ArrayVision (can also find a link to this software at this site).  Once the intensities are captured, we use the csv to import into various analysis tools.

this data is about as new and fresh to us as it is to you


efg, 6 Dec 2005:

The paper by Whistler et al 2005 ("Exercise response genes ...") describes a microarray experiment using peripheral blood before and after exercise. Was the state of [the patients with] the 177 microarrays controlled sufficiently to say these were taken in a "resting" state?

sdv, 6 Dec 2005:

Yes, these subjects were in a resting state of sorts. Blood was collected after being recumbent for 30 minutes

efg, 6 Dec 2005:

Were all the microarray slides processed by the same person, using the same process at the same facility?

sdv, 6 Dec 2005:

Microarray data were processed consistently. The same person did the labeling for all, arrays were run by a single person, all on the same batch of reagents with the same equipment. The arrays were all printed in the same batch.


ArrayVision and ArrayVision RLS

A demo version of ArrayVision 8.0 software can be downloaded from this link and used for 28 days. In mid-December 2005, the "automatic" E-mail containing the password to install ArrayVision was not sent.  You'll need to use the "Contact Us" link on that page and request the password.  I obtained the password by phone in just a few minutes.  This demo version of the software provides an ArrayVision Version 8.0 Reference Manual as a PDF file (after installation, see C:\Program Files\ArrayVision Evaluation 8.0\Manual\Reference Manual.pdf).

However, if one explores the Invitrogen link given above by sdv, there is a software link to a page about ArrayVision RLS software, which was used (see sdv comment below).  This page says "ArrayVision™ RLS (Resonance Light Scattering) is available as a stand-alone software package, or as part of the GeniconRLS™ System." ArrayVision RLS is slightly different from the regular ArrayVision.  Unfortunately, ArrayVision RLS is being phased out by Invitrogen and a demo version is not available (see below).

sdv, 16 Dec 2005

We did use ArrayVisionRLS that has some added features designed specifically for quantifying resonance light scattering (RLS) versus the dual intensity emphasis on the ArrayVision version

Invitrogen provides a Technical Documentation page with the technical manuals for ArrayVision RLS, as well as several papers about the RLS technology.

Invitrogen sells ArrayVision RLS as part of their GeniconRLS™ Detection and Imaging System.  A single-user license of ArrayVision RLS Image Analysis Software costs $7,038. 

ArrayVision is a trademark of GE Healthcare, but according to Peter Chiang from GE Healthcare, in a 16 Dec 2005 E-mail, ArrayVision RLS belongs to Invitrogen:

We (GEHC Bio-Sciences) do not have a demo version nor carry/sell the ArrayVision RLS. You’d have to contact Invitrogen directly to see what they can provide you. Even though it is licensed/developed from us, the ArrayVision RLS is Invitrogen’s product (not ours) from a sales & customer support standpoint.

The online Startup Guide for ArrayVision RLS mentions the software can be installed from a CD and used for 14 days.  Wendy Price, Invitrogen's Business Area Manager of Gene Expression Profiling, said in a 22 Dec 2005 E-mail:

Unfortunately we do not provide demo versions of the [ArrayVision RLS] software as we are in the process of discontinuing the RLS product line as a Catalog product.  ... I can explain the differences between the regular ArrayVision and the RLS version:

GeniconRLS products needed the artifact rejection capability that came with ArrayVision 6.0 and this was a standard feature of the RLS version of ArrayVision that became standard in all ArrayVision versions later than 8.0, I think. Additionally there was a Linear Normalization technique that you referred to, which is described in the Training Companion ....

The latest version of ArrayVision that was provided to customers was 8.0, so the attached documentation should help understand how ArrayVision was applied in the studies you mentioned. [see ArrayVision Evaluation software for this document]

The description of Linear Normalization in the Training Companion (p. 37) is a bit weak, since it mostly addresses the mechanics of how to enable the calculation and says little about the computation itself:

  • Select the Linear Normalization page of the Protocol Editor and check Enable Linear Normalization.
  • Indicate whether to base the normalization on data from
    All spots or Selected spots (linear reference spots).
  • Define Selected Spots – Linear Reference Spots can be defined via Spot Label or Spot Location – [Note: Spot Labels can be imported or defined manually]
  • When all linear reference spot labels have been identified, press [OK] to exit.  ArrayVision will base the Linear Normalization on the expression values obtained from these spots. Linear Normalization is only available when you select a Comparative Expression Study Type.

The Startup Guide gives a few clues about this normalization:

LINEAR NORMALIZATION OVERVIEW
The Linear Normalization feature allows array data to be normalized to compensate for the assay and system variations in comparative expression array studies (i.e., between Control and Data conditions) that are unrelated to the biological variations under investigation.  This advanced normalization tool applies a closed-form algorithm resulting in a greater level of accuracy and repeatability between comparisons.

Linear Normalization requires a set of 'linear reference spots' present in both the Control and Data conditions. These reference spots should be known to exhibit equivalent expression levels for both conditions and include high and low values within the dynamic range of the experiment.  A least-squares linear regression is first applied to the reference spot values (Density or Volume) for both the Control and Data conditions. The slope and y-intercept of the observed regression line are then used to generate a linear transform equation which will force the data values of the chosen reference spots to fit a regression line with a slope of 1 and a y-intercept of 0 (i.e., where Data values are roughly equivalent to Control values). The linear transformation is then applied to the remaining array data. Adjusted values are expressed as Linear Normalized Density (lDens, lMedianDens, lMTMDens, lARMDens) or Linear Normalized Volume (lVOL, lARVOL). These measures must be selected from the Measures page of the Protocol Editor or Protocol Wizard.
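The Startup Guide's description can be read as a small computation. The following is only a sketch of one plausible reading, not ArrayVision's actual code: the function name linear_normalize is hypothetical, the densities are simulated, and I am assuming the regression is of the Data values on the Control values.

```r
# Hypothetical helper: one reading of the Startup Guide's description,
# not ArrayVision's implementation.
linear_normalize <- function(control, data, is_ref) {
  # Least-squares fit of Data on Control, using only the reference spots
  fit <- lm(data[is_ref] ~ control[is_ref])
  intercept <- unname(coef(fit)[1])
  slope     <- unname(coef(fit)[2])
  # Invert the fitted line so that the reference spots then regress
  # on Control with slope 1 and intercept 0
  (data - intercept) / slope
}

set.seed(1)
control <- runif(200, min = 100, max = 10000)        # simulated Control densities
data    <- 50 + 1.3 * control + rnorm(200, sd = 20)  # simulated Data densities
is_ref  <- rep(c(TRUE, FALSE), length.out = 200)     # pretend every other spot is a reference spot

normalized <- linear_normalize(control, data, is_ref)

# Refit on the reference spots after normalization
check <- coef(lm(normalized[is_ref] ~ control[is_ref]))
print(round(unname(check), 6))
```

The same transform is applied to every spot, but only the reference spots determine the intercept and slope, which is why the choice of reference spots matters so much.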


gene_expression_data.zip

The Gene Expression directory contains 352 files: 175 TIFs and 177 TXT files (it's not clear why there are two more TXT files than TIFs):

txt files without tifs
[1] "10860201" "20465002"

Each TIF file is about 21.8 MB.  Additional technical information about the TIFF files can be viewed here.

While I'm not inclined to re-analyze the images, the ArrayVision Evaluation Software can be used to view and analyze the image files (Use Image | Retrieve .. | select files of type  .tif):

It's unclear what analysis can be performed using "plain" ArrayVision instead of "ArrayVision RLS" since "ArrayVisionRLS ... has some added features designed specifically for quantifying resonance light scattering (RLS)."

See comments below about the geometry of the array deduced from the PosX and PosY fields in the .txt files.


gene_expression_values.zip

The 177 TXT files in this directory match the 177 TXT files in the previous section by name and content. (The CRC32s of all files were identical.)

Each of the TXT files is a tab-delimited file. The black rectangles below show the location of tabs in a representative file.

These .txt files all have 20,160 probes and have "headers" above each column.  Unfortunately, the R language does not "like" many of the characters in the "header" and converts them to periods.  Also, R's read.delim function by default treats anything after a "#" as a comment.  Since some of the control probes, e.g., those shown above with a mwghuman10K prefix, have a "#" as part of their ID fields, the argument comment.char="" must be specified with read.delim to read the files successfully in R.

All but one of the GeneExpression files have a tab separator at the end of each line -- note the black box (tab) at the end of each line above.  Since the R language treats tabs as separators, this final tab resulted in an extra column of NAs using read.delim in R.  This empty column was deleted.
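Here is a minimal sketch of reading such a file, using a tiny made-up stand-in rather than a real CAMDA file. The "Spot labels" header and the ID "mwghuman10K#1" are assumptions for illustration only; the real headers and IDs may differ.

```r
# Build a tiny stand-in file: odd header characters, a "#" in an ID,
# and a trailing tab on every line (all values here are made up).
lines <- c("Spot labels\tARM Dens - Levels\t% Removed (ARM - Levels)\t",
           "mwghuman10K#1\t123.456\t0.000\t",
           "NM_001533\t789.012\t1.250\t")
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# comment.char = "" keeps "#" from starting a comment
expression <- read.delim(tmp, comment.char = "", stringsAsFactors = FALSE)

# Drop any column that is entirely NA (the one created by the trailing tab)
expression <- expression[, colSums(!is.na(expression)) > 0, drop = FALSE]

print(names(expression))   # header characters R does not "like" become periods
print(dim(expression))
```

Dropping all-NA columns, rather than always deleting the last column, also copes with the one file that has no trailing tabs.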

In addition to not having the final tab on each and every line, the file 22149604A-03 apparently was edited, perhaps using Excel, and resaved.  This file has many "0" values instead of the "0.000" values found in the other files.  Overall this file is over half a megabyte smaller, and a bit suspicious since it's unclear why it's different from the rest of the files.

sdv, 16 Dec 2005:
As for the strange file, we have it too and unfortunately we are not sure how it got there.  It was obviously overlooked when we submitted the final data to CAMDA - and we are very sorry about that.  You should ignore or delete this file and you can note that on your website too.

Here's information about the fields in the .txt files:

Definition of Fields in .txt files Created by ArrayVision RLS

"All ArrayVision measures are defined as follows. Values are expressed in units that correspond to the image calibration."  The text in italics in this block is from the "Help" inside ArrayVision Evaluation Software:  "Measurement Definitions," or various "Measures" information (pp. 55-58) in the ArrayVision Reference Manual (Version 8.0).

"Please note that the terms “density” and “intensity” are used interchangeably."

I find the ArrayVision notation "value - units" confusing since the "-" implies subtraction.  Something like "value [units]" would have been less confusing.

Field, Name, Description:

Field 1: ARM Dens - Levels
ARM Density (ARM Dens): Artifact-removed density value for each spot. Value represents the average of all the pixels remaining in the spot, after first removing pixels with density values that exceed four median absolute deviations (MADs) above the median. This measure removes the influence of image artifacts (e.g., dust particles) on density estimation.

Field 2: % Removed (ARM - Levels)
% Removed: Percentage of pixels excluded in the calculation of ARM Density and MTM Density, where
MTM Density (MTM Dens): Median-based Trimmed Mean density value for each spot. The reported value represents the mean of all the pixels remaining in a target, after first removing pixels with density values that exceed four median absolute deviations (MADs) above or below the median. This measure removes the influence of image artifacts (e.g., dust particles) on density estimation.

Field 3: MAD - Levels
Median of Absolute Deviation (MAD): Median of the absolute values of deviations from the median density (i.e., the absolute values of pixel densities – median density). It is a measure of the variation around the median density value of the spot.

Field 4: SD - Levels
SD: Standard Deviation. Standard deviation of the pixel density values.

Field 5: Pos X - µm
Field 6: Pos Y - µm
XY Position: Report horizontal and vertical coordinates of each spot.

Field 7: Area - µm²
Area: Area of the spot, background and/or reference region.

Field 8: Bkgd
Bkgd: Background. Background density or volume. Background volume value is adjusted to the size of spot, where applicable.

Field 9: sARMDens
sARMDens: Subtracted ARM Density value. ARM Density value of the spot, minus the background Density value.

Field 10: S/N
S/N: Signal-to-Noise Ratio. Spot density minus Background density, divided by the SD of the Background density.

Field 11: Flag
Flag: Select this measure to manually flag data in the data table.
Note: 0s in all arrays.  Can be deleted/ignored.

Field 12: % At Floor
% at Floor: Proportion of pixels at the lower limit of the density scale. Actual signal intensity may be below the imaging device’s threshold of detection.

Field 13: % At Ceiling
% at Ceiling: Proportion of pixels at the upper limit of the density scale (i.e., saturated). Actual signal intensity may be higher than the recorded value.

Field 14: % AT Floor - Bkgd
Field 15: % AT Ceiling - Bkgd
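To make the ARM, MTM and MAD definitions above concrete, here is a small sketch on made-up pixel values for a single spot. It follows the wording of the definitions only; whether ArrayVision scales the MAD by a constant (as R's own mad function does by default) is something I don't know, so the unscaled MAD is an assumption.

```r
# Made-up pixel densities for one spot, with one bright artifact (e.g., dust)
pixels <- c(100, 102, 98, 101, 99, 103, 97, 100, 5000)

med     <- median(pixels)
mad_raw <- median(abs(pixels - med))   # MAD as defined above (no scaling constant)

# ARM Density: drop pixels more than 4 MADs above the median, then average
keep_arm <- pixels <= med + 4 * mad_raw
arm_dens <- mean(pixels[keep_arm])

# MTM Density: drop pixels more than 4 MADs above or below the median
keep_mtm <- abs(pixels - med) <= 4 * mad_raw
mtm_dens <- mean(pixels[keep_mtm])

# % Removed, here for the ARM calculation
pct_removed <- 100 * mean(!keep_arm)

print(c(mean = mean(pixels), ARM = arm_dens, MTM = mtm_dens, pct_removed = pct_removed))
```

The plain mean is pulled far upward by the single artifact pixel, while the ARM and MTM values stay near the bulk of the spot.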

The fields PosX and PosY give the location of the spots.  The following short R script shows the locations of the spots from the PosX and PosY values in columns 6 and 7 for one of the datasets:

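# comment.char="" is needed because some control probe IDs contain "#"
expression <- read.delim("U:/camda/2006/gene_expression_values/10043905-A-03R.txt",
                         comment.char="")
dim(expression)
# Plot spot positions (PosX vs. PosY) to reveal the slide layout
pdf("AtlasGlassHuman3.8I.pdf", width=8, height=10.5)
plot(expression[,6], expression[,7], type="p", pch=".", xlab="PosX", ylab="PosY",
     main="Atlas Human 3.8I Microarray: 10043905-A-03R.txt")
dev.off()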

The resulting PDF shows the geometry of the slide:  12 rows of 4 blocks, with each block containing 20 rows and 21 columns. This means (12 * 4 blocks) * (20 * 21 spots per block) = 20,160 spots.  This was verified with two other datasets, 21656101A.txt and 29430601A_027A.txt (roughly, first, middle and last from the list of file names). The same number of spots, 20,160, was observed in all three.
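The spot count can be checked with a few lines of R. This sketch assumes "12 rows of 4 blocks" means 12 block rows by 4 block columns, and it builds an idealized grid in spot units; real PosX and PosY values are in µm and may not fall on a perfectly regular grid.

```r
block_rows <- 12; block_cols <- 4    # blocks on the slide
spot_rows  <- 20; spot_cols  <- 21   # spots within each block

n_spots <- (block_rows * block_cols) * (spot_rows * spot_cols)

# Ideal (jitter-free) grid of spot positions, in spot units rather than µm
grid <- expand.grid(col = seq_len(block_cols * spot_cols),
                    row = seq_len(block_rows * spot_rows))

print(c(n_spots = n_spots,
        grid_columns = block_cols * spot_cols,
        grid_rows = block_rows * spot_rows))
```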

The first 8 characters of filenames match the "ABTID" field in the clinical and blood work datasets.  The Clinical.R program was used to match these filenames with the ABTID fields, and the following Venn diagram was constructed:

156 Gene Expression datasets have complete blood evaluation data and corresponding clinical data. 
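The matching itself is simple. This is a sketch of the idea only, not the actual Clinical.R program: the file names come from this page, but the clinical ID vector is made up (the ID "99999999" is deliberately fake).

```r
# File names taken from this page; the clinical IDs are a made-up stand-in
filenames      <- c("10043905-A-03R.txt", "10081101A.txt", "22149604A-03.txt")
clinical_abtid <- c("10043905", "10081101", "99999999")

array_abtid <- substr(filenames, 1, 8)   # first 8 characters = ABTID

in_both       <- intersect(array_abtid, clinical_abtid)   # overlap in the Venn diagram
array_only    <- setdiff(array_abtid, clinical_abtid)     # arrays with no clinical data
clinical_only <- setdiff(clinical_abtid, array_abtid)     # clinical data with no array

print(list(in_both = in_both, array_only = array_only, clinical_only = clinical_only))
```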

There are replicates for 8 of the patients (16 microarray datasets):

sdv, 6 Dec 2005:
Replicates:
These are the same specimen RNA that was labelled and hybridized at different times during the study. So they are considered technical replicates. These are designated by a 'rep' in the file name.

Here is a list of the IDs (i.e., ABTIDs) for these eight patients with replicates.  

> replicates
[1] "10243501" "20052705" "20129103" "20717901" "20866603" "22803203" "26153406"
[8] "28542303"

The raw IDs for these 8*2 replicates look like this:

txt duplicates
[1] "Gene Expression Replicates"
[1] "10243501A-017A" "10243501Arep" "20052705-A-025-R" "20052705repA"
[5] "20129103-A-03-R" "20129103repA" "20717901A-03" "20717901repA"
[9] "20866603-A-01-A" "20866603rep-A" "22803203A-03-R1" "22803203repA"
[13] "26153406-A-016-A1 600nm" "26153406A-rep" "28542303-A-014-A1 600nm" "28542303A-rep"
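A list like this can be produced from the raw IDs by looking for ABTIDs that occur more than once. The sketch below uses the 16 raw IDs shown above; it is my reconstruction of the idea, not the original script.

```r
raw_ids <- c("10243501A-017A", "10243501Arep", "20052705-A-025-R", "20052705repA",
             "20129103-A-03-R", "20129103repA", "20717901A-03", "20717901repA",
             "20866603-A-01-A", "20866603rep-A", "22803203A-03-R1", "22803203repA",
             "26153406-A-016-A1 600nm", "26153406A-rep",
             "28542303-A-014-A1 600nm", "28542303A-rep")

abtid      <- substr(raw_ids, 1, 8)              # first 8 characters = ABTID
replicates <- unique(abtid[duplicated(abtid)])   # ABTIDs that appear more than once
is_rep     <- grepl("rep", raw_ids, fixed = TRUE)  # files marked as replicates by name

print(replicates)
print(sum(is_rep))
```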

Five datasets of microarray data do not match any clinical data:

> GeneExpressionNoClinical
[1] "22149604" "22340204" "23366103" "24120805" "28293201"

sdv, 6 Dec 2005:
The arrays which do not match Patient IDs were subjects that were excluded from the study, or withdrew consent for some portion - we included them as they may be of use in class discovery analysis.


11 Jan 2006

A recently published paper explained several exclusion criteria and helped explain the CLUSTER field in the clinical dataset:

Chronic fatigue syndrome -
a clinically empirical approach to its definition and study

William C. Reeves, Dieter Wagner, Rosane Nisenbaum, James F. Jones, Brian Gurbaxani, Laura Solomon, Dimitris A. Papanicolaou, Elizabeth R. Unger, Suzanne D Vernon, Christine Heim
BMC Medicine 2005, 3:19 (15 December 2005)

This paper explained that of the 227 patients enrolled in the original study, only 164 participants with no medical or psychiatric exclusionary conditions were considered relevant.  These 164 patients were further segregated, as described in this paper, into clusters:  Least (Least Severe), Middle (Intermediate), or Worst (Most Severe).  (The words Least, Middle, Worst are in the data file.  The words Least Severe, Intermediate, Most Severe were used in the paper.)

The NoExclusions.csv file, which was described on the Clinical Data page, and an R script, AssignArrayToCluster.R, were used to connect the cluster assignment (i.e., Least, Middle, Worst) from this paper to the 177 filenames of microarray data.  When a patient was excluded from the study, the word "EXCLUDED" was used instead of the cluster assignment in the resulting file, ArrayAssignCluster.csv.  The first five lines of this file show representative information:

Filename with ABTID Prefix     CLUSTER
10043905-A-03R.txt             Least
10081101A.txt                  Middle
10103103-A-015-A.txt           Middle
10193601A.txt                  Middle
10203401-A1-01-A1 600nm.txt    EXCLUDED

AssignArrayToCluster.R shows this summary of the cluster assignments to the 177 files:

EXCLUDED Least Middle Worst
      54    44     53    26

Lastly, AssignArrayToCluster.R creates four separate files that are lists of IDs (i.e., ABTIDs):  Least.csv, Middle.csv, Worst.csv, and EXCLUDED.csv.
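The core of the assignment is a lookup by ABTID. This is a sketch of that step only, not AssignArrayToCluster.R itself: the small data frame stands in for NoExclusions.csv, and its column names ABTID and CLUSTER are assumptions.

```r
# Hypothetical stand-in for NoExclusions.csv (non-excluded patients only)
no_exclusions <- data.frame(
  ABTID   = c("10043905", "10081101", "10103103", "10193601"),
  CLUSTER = c("Least", "Middle", "Middle", "Middle"),
  stringsAsFactors = FALSE)

filenames <- c("10043905-A-03R.txt", "10081101A.txt", "10103103-A-015-A.txt",
               "10193601A.txt", "10203401-A1-01-A1 600nm.txt")

abtid   <- substr(filenames, 1, 8)   # first 8 characters = ABTID
cluster <- no_exclusions$CLUSTER[match(abtid, no_exclusions$ABTID)]
cluster[is.na(cluster)] <- "EXCLUDED"   # no match means the patient was excluded

assignments <- data.frame(Filename = filenames, CLUSTER = cluster,
                          stringsAsFactors = FALSE)
print(table(assignments$CLUSTER))

# One list of ABTIDs per cluster, ready to be written out with write.csv
ids_by_cluster <- split(abtid, cluster)
```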

This summary table explains why only 123 microarray datasets need to be analyzed, since they correspond to the 164 non-excluded patients from the paper:

Cluster Name            Count   Missing Microarray Data   Replicates   Microarray Datasets
Least Severe (Least)       67                       -25           +2                    44
Intermediate (Middle)      67                       -16           +2                    53
Most Severe (Worst)        30                        -5           +1                    26
Subtotal                  164                       -46           +5                   123
Excluded                                                                                54
Total                                                                                  177
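The arithmetic in the summary table can be checked directly. The numbers below are copied from the table; only the variable names are mine.

```r
summary_tab <- data.frame(
  cluster    = c("Least", "Middle", "Worst"),
  patients   = c(67, 67, 30),   # Count
  missing    = c(25, 16, 5),    # patients with no microarray data
  replicates = c(2, 2, 1))      # extra replicate arrays

# Datasets per cluster = patients - missing + replicates
summary_tab$datasets <- with(summary_tab, patients - missing + replicates)

n_analyzed <- sum(summary_tab$datasets)   # non-excluded datasets
n_total    <- n_analyzed + 54             # plus the EXCLUDED datasets

print(summary_tab)
print(c(analyzed = n_analyzed, total = n_total))
```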

Files with "rep" in the filename (e.g., 20717901repA.txt) are the remaining replicates.

Analysis of the 123 microarray datasets would correspond to the same patients analyzed in the paper by Reeves, et al.


30 Jan 2006

The Bioconductor package biomaRt can be used to obtain Gene and Gene Ontology IDs for many of the probes in the microarray data.  [Also see 5 Feb 2006 notes below for needed modification.]

The R program Probe-to-Gene.R (see ArrayProbe-to_GeneID.R on 5 Feb 2006) uses biomaRt to connect the probe IDs with many Gene names. You can run/modify the program locally on your PC (or UNIX box) if biomaRt is installed and you have a local copy of the file, gene_id_human_40k_a.txt, which can be obtained here (see MWG Biotech info below).  Change the R statement defining filename to point to your local file.

It's not clear why multiple Gene names were given for some probes. The following shows the connections between probe IDs and genes for the first four probes.

 

 

probe gene band chromosome start end martID description
NM_001533 HNRPL q13.2 19 44018883 44032452 ENSG00000104824 Heterogeneous nuclear ribonucleoprotein L (hnRNP L). [Source:Uniprot/SWISSPROT;Acc:P14866]
NM_031990 PTBP1 p13.3 19 748411 763327 ENSG00000011304 Polypyrimidine tract-binding protein 1 (PTB) (Heterogeneous nuclear ribonucleoprotein I) (hnRNP I) (57 kDa RNA-binding protein PPTB-1). [Source:Uniprot/SWISSPROT;Acc:P26599]
S76822 FDFT1 p23.1 8 11697664 11734215 ENSG00000079459 Squalene synthetase (EC 2.5.1.21) (SQS) (SS) (Farnesyl-diphosphate farnesyltransferase) (FPP:FPP farnesyltransferase). [Source:Uniprot/SWISSPROT;Acc:P37268]
AF232742 KLKB1 q35.2 4 187523815 187554773 ENSG00000164344 Plasma kallikrein precursor (EC 3.4.21.34) (Plasma prekallikrein) (Kininogenin) (Fletcher factor) [Contains: Plasma kallikrein heavy chain; Plasma kallikrein light chain]. [Source:Uniprot/SWISSPROT;Acc:P03952]

 

These "connections" can be verified at NCBI by looking for the Homo sapiens "hit" for a gene name and finding the probe ID, usually under "Related Sequences." (Original results removed -- see Feb 5, 2006 results below instead).   (Note:  9 of the 10 SNP genes can be "connected" using this dataset -- TPH2 is missing.) 

I posted a question to the Bioconductor mailing list about the IDs, and why some were type=refseq and others were type=embl.  The answer there was to ask the manufacturer!

Be careful using this file in Excel because of inexcusable Microsoft defaults -- about 24 genes will appear formatted incorrectly by default. For example, Probe AK000675 is associated with the March1 gene, which Excel will treat as a date field by default. (See Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics.)

A very similar program, Probe-to-GO.R (see ArrayProbe-to_GO.R on 5 Feb 2006), also uses biomaRt to connect the probe IDs with Gene Ontology information.  Again, this program can be run locally if you define filename appropriately. (Original results removed -- see Feb 5, 2006 results below instead).  Please validate these results yourself.

Sample output from GONums.txt (one line per probe with all GO IDs):

NM_001533 GO:0000166 GO:0003723 GO:0006397 GO:0005654 GO:0030530 GO:0005634
NM_031990 GO:0000166 GO:0005515 GO:0008187 GO:0000398 GO:0008380 GO:0005654 GO:0005730 GO:0030530 GO:0003676 GO:0003723 GO:0006397 GO:0005634
S76822 GO:0000287 GO:0004310 GO:0016491 GO:0016740 GO:0006695 GO:0008299 GO:0005783 GO:0016021
AF232742 GO:0003807 GO:0004263 GO:0004295 GO:0008233 GO:0006508 GO:0006954 GO:0007596 GO:0042730 GO:0005615

 

Sample output from GOinfo.csv:

"id","GOID","description","evidence","martID"
"NM_001533","GO:0000166","nucleotide binding","IEA","ENSG00000104824"
"NM_001533","GO:0003723","RNA binding","TAS","ENSG00000104824"
"NM_001533","GO:0006397","mRNA processing","IEA","ENSG00000104824"
"NM_001533","GO:0005654","nucleoplasm","TAS","ENSG00000104824"
"NM_001533","GO:0030530","heterogeneous nuclear ribonucleoprotein complex","TAS","ENSG00000104824"
"NM_001533","GO:0005634","nucleus","IEA","ENSG00000104824"

Please give me feedback if you find problems with, or have suggestions about, these R programs. Again, use at your own risk and please validate the connections with NCBI.

2 Feb 2006

The GeneInfo.csv and GOinfo.csv files have duplicates which should be removed.

The original "master file" of probes from the manufacturer (which can be obtained here) had 20,160 probes.  460 probes were ignored from this file (192 were "empty*", 208 were "mwgaracontrol*", and 60 were either "mwghuman*" or "mwghumcontrol*").  The GO and gene match above was made using the probe name, which was an accession number.  All 19,700 probe names of the form ProbeName_index were unique.

The suffix on the original probe name was removed, e.g., NG_000016_23 became NG_000016.  The 19,700 unique ProbeName_index IDs became 19,508 unique ProbeNames after removal of this suffix.
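The filtering and suffix removal can be sketched as follows. Apart from NG_000016_23, the IDs are made up for illustration, and the exact spelling of the control names in the master file is an assumption based on the prefixes listed above.

```r
# Mostly made-up master-file IDs (only NG_000016_23 appears in the notes above)
probe_ids <- c("NG_000016_23", "NM_001533_1", "NM_001533_2",
               "empty_5", "mwgaracontrol_7", "mwghumcontrol_3")

# Ignore blanks and controls, identified by their prefixes
is_control <- grepl("^(empty|mwgaracontrol|mwghuman|mwghumcontrol)", probe_ids)
probes     <- probe_ids[!is_control]

# Drop the trailing "_index" to recover the accession number
probe_names  <- sub("_[0-9]+$", "", probes)
unique_names <- unique(probe_names)

print(probe_names)
print(length(unique_names))
```

Anchoring the pattern at the end of the string matters, because accession numbers such as NM_001533 contain an underscore followed by digits themselves.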


5 Feb 2006

The "master file" of probes doesn't match the probes in the data files.

The comments from Feb 2 summarize the info from the "master file" of probes (which can be obtained here).  There were 19,700 probe names that were unique as described above.

I verified the "spot names" from all the microarray expression datasets match, but the list of spot names doesn't match the list of probes from the master file.

Each array has 20,160 spots (a 12-by-4 matrix of 20-by-21 blocks of spots = 20,160).  As on the master file, 460 probes are control or blank (192 were "blank*" instead of "empty*", 208 were "mwgaracontrol*", and 60 were either "mwghuman*" or "mwghumcontrol*").  Of the remaining 19,700 probe names, some have a suffix of "(1)" or "(2)" to make them unique.  When these suffixes are removed, there are 19,529 unique probe names.

Is there good agreement between the 19,508 unique probes from the "master file" and the 19,529 probes actually on each array?  Not exactly.  There are only 18,711 probes in both sets.  The "master file" contains an additional 797 probes, and the actual array files have 818 additional probes.  So the Probe-to-GO and Probe-to-Gene scripts should be modified to select information for the real 19,529 probes on the chip, not the 19,508 in the "master file."
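The comparison boils down to set operations on the two lists of probe names. In this sketch the two short vectors are made up (the accession numbers are borrowed from examples on this page, but their membership in either list is invented), and I am guessing at the exact form of the "(1)" and "(2)" suffixes.

```r
# Made-up stand-ins for the array spot names and the master-file probe names
array_spots  <- c("NM_001533", "NM_031990(1)", "NM_031990(2)", "S76822")
master_names <- c("NM_001533", "NM_031990", "AF232742")

# Remove a trailing "(n)" suffix (with optional space before it), then de-duplicate
array_probes <- unique(sub("\\s*\\([0-9]+\\)$", "", array_spots))

common      <- intersect(master_names, array_probes)   # in both sets
master_only <- setdiff(master_names, array_probes)     # only in the master file
array_only  <- setdiff(array_probes, master_names)     # only on the array

print(c(common = length(common), master_only = length(master_only),
        array_only = length(array_only)))

# The counts reported above are consistent with each other
stopifnot(19508 - 18711 == 797, 19529 - 18711 == 818)
```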

The list of 19,529 unique array probe IDs is in the file arrayprobes.csv.  The programs, ArrayProbe-to-GeneID.R and ArrayProbe-to-GO.R, used biomaRt to connect the probe IDs with gene IDs and gene ontology information.

The output file from ArrayProbe-to-GeneID.R contains matches to 20,601 gene IDs.  However, 3,208 of these were "NA" (not available) and can be deleted, leaving 17,393 gene IDs.  These 17,393 gene IDs, of which 12,958 are unique, correspond to 16,321 array probes.
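These counts come from dropping the NA rows and then counting unique values. Here is a sketch on a made-up table; the column names probe and gene are assumptions, and the XX_ probe IDs are hypothetical.

```r
# Made-up stand-in for the script's output: one row per probe/gene match
gene_info <- data.frame(
  probe = c("NM_001533", "NM_031990", "XX_000001", "XX_000001", "XX_000002"),
  gene  = c("HNRPL", "PTBP1", "PTBP1", "FDFT1", NA),
  stringsAsFactors = FALSE)

matched <- gene_info[!is.na(gene_info$gene), ]   # delete the "NA" matches

n_gene_ids     <- nrow(matched)                  # gene IDs left after removing NAs
n_unique_genes <- length(unique(matched$gene))   # distinct genes
n_probes       <- length(unique(matched$probe))  # array probes with at least one gene

print(c(gene_ids = n_gene_ids, unique_genes = n_unique_genes, probes = n_probes))

# The reported totals are consistent: 20,601 matches minus 3,208 NAs
stopifnot(20601 - 3208 == 17393)
```

A probe can match more than one gene and a gene can be matched by more than one probe, which is why the three counts differ.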

The output file from ArrayProbe-to-GO.R has not yet been analyzed.

 


Files at MWG Biotech (specifically the Excel file or the TXT file) have the expected number of probes, 20,160, as found in the CFS microarray datasets, and information identifying the probes.

See notes above, especially for 5 Feb 2006.


Last Updated
5 Feb 2006