Working with Affy GeneChip Sequence and Annotation Files in Access
Earl F. Glynn
Stowers Institute
Affymetrix provides sequence and annotation information as raw data files for use with a variety of software tools. This page discusses some details that my be helpful in working with these raw data files. Mouse and Chicken GeneChips will be used in the following discussion. Where are the raw data files containing Affy sequence and annotation information? For some time the starting point to get GeneChip information is the Affy products page. The selections on that page recently changed. Look for a dropdown box under "Expression Analysis Arrays" on that page and select an array of interest. I find Affy's naming conventions somewhat confusing for their mouse chips. The older chip was called the "Murine Genome U74Av2" (MGU74Av2). Newer chips are called "Mouse Expression Set 430" and are sold in "A" and "B" chips, MOE430A and MOE430B. The annotations were better for the "A" chip in general, with many ESTs present on the "B" chip.. Then these "A" and "B" chips were combined into a single chip, Mouse430_2. For some reason there is an "A" version of this chip now, Mouse430A_2. Having two chips with similar names, MOE430A and Mouse430A_2, is a confusing convention After you are certain which chip you care about and have found its web page, there should be an area at the lower right labeled "ADDITIONAL SUPPORT". One of the links under this heading is "Technical Notes." To find data, follow the link labeled "All Support Materials...". Here are direct links to these support materials: Mouse430_2 Chicken On each support materials page you'll want the links under "NetAffx Annotation Files" and "Sequence Files." For example, with Mouse430_2 you'll see these files: NetAffx Annotations Files
Sequence Files
|
Affy updates these files roughly quarterly (March, June, Sept, Dec), and they provide no automatic notification when the files are updated. As far as I know, the only way to find out if new files exist is to periodically check for updates. For now, Affy provides some release notes on this page identifying the significant changes. Automating the download process: I never would have found Affy's automatic download tool without calling and asking Affy on the phone: On the main Affy page, select the Support link from the tabsheet horizontal menu bar. Along the left-side, click on the Developers' Network link. Next, select the DevNet Tools link. Look for the section "NetAffx Download Utility." Download and install this program under Windows. This utility is setup to check for changes in the files every day (even though most files change quarterly at most). To automatically detect changes, one needs to check the annotation directory every day and check for changes. Here's what this utility looks like:
DO NOT specify a network drive for the Annotation Directory or the service will just die silently without transferring any files. The NetAffyx Download Utility is not a "normal" program. You control this program via an icon in the Windows Control Panel. |
What information is in the CSV Annotation file? The Mouse430_2 annotation file has significant information for the mouse GeneChip. However, very little information is currently available for the Chick GeneChip. Knowing which fields have information, and which fields should be ignored because they are constant or missing, for a particular genome is useful information. In the past I have loaded the annotation CSV files into Access databases to "connect" Affy results with this annotation information. I found the default Access definitions often truncated a number of data fields. I would need to changed any "text" fields that were longer than 256 bytes to a "memo" to avoid this truncation. To solve both problems (to find which fields have useful data, and to see how long the fields are), I wrote a simple R script to create a summary table. Copy and paste this line into R and select a file to process an Affy xxx_annot.csv file: Here are the results from processing the file Mouse430_2_annot.csv: |
GeneChip.Array Species.Scientific.Name Annotation.Date 1 Mouse Genome 430 2.0 Array Mus musculus Sep 16, 2005 name MinLength MaxLength UniqueCount MissingCount 1 Probe.Set.ID 10 29 45101 0 2 GeneChip.Array 26 26 1 0 3 Species.Scientific.Name 12 23 5 0 4 Annotation.Date 12 12 1 0 5 Sequence.Type 16 18 2 0 6 Sequence.Source 7 31 2 0 7 Transcript.ID.Array.Design. 6 26 39079 0 8 Target.Description 76 814 42783 0 9 Representative.Public.ID 6 26 39079 0 10 Archival.UniGene.Cluster 3 9 32169 2812 11 UniGene.ID 3 23 20579 5919 12 Genome.Version 20 20 1 0 13 Alignments 3 996 37658 2472 14 Gene.Title 3 6259 24385 2770 15 Gene.Symbol 1 2234 21842 4607 16 Chromosomal.Location 3 1752 4304 10091 17 Unigene.Cluster.Type 3 19 4 11330 18 Ensembl 3 2847 15932 12127 19 Entrez.Gene 3 1777 21727 4475 20 SwissProt 3 4142 16969 10308 21 EC 3 169 483 43472 22 OMIM 3 3 1 45101 23 RefSeq.Protein.ID 3 2423 17922 9332 24 RefSeq.Transcript.ID 3 2121 17941 9300 25 FlyBase 3 3 1 45101 26 AGI 3 3 1 45101 27 WormBase 3 3 1 45101 28 MGI.Name 3 2047 13558 23541 29 RGD.Name 3 4 2 45100 30 SGD.accession.number 3 3 1 45101 31 Gene.Ontology.Biological.Process 3 10734 5022 24729 32 Gene.Ontology.Cellular.Component 3 5147 2847 24266 33 Gene.Ontology.Molecular.Function 3 10091 5365 22245 34 Pathway 3 923 438 39949 35 Protein.Families 3 1056 423 44355 36 Protein.Domains 3 1267 618 43936 37 InterPro 3 2401 4050 21957 38 Trans.Membrane 3 2205 4994 37696 39 QTL 3 40 2 45100 40 Annotation.Description 146 189 200 0 41 Annotation.Transcript.Cluster 6 6000 41566 0 42 Transcript.Assignments 48 29972 43111 0 43 Annotation.Notes 3 29972 19485 25510
Notes about Mouse430_2 Data Summary:
Here are the results from processing the file Chicken_annot.csv: |
GeneChip.Array Species.Scientific.Name Annotation.Date 1 Chicken Array Gallus gallus Sep 16, 2005 name MinLength MaxLength UniqueCount MissingCount 1 Probe.Set.ID 12 25 38535 0 2 GeneChip.Array 13 13 1 0 3 Species.Scientific.Name 13 13 1 0 4 Annotation.Date 12 12 1 0 5 Sequence.Type 18 18 1 0 6 Sequence.Source 3 31 3 12784 7 Transcript.ID.Array.Design. 6 16 33534 0 8 Target.Description 35 708 34597 0 9 Representative.Public.ID 3 21 33920 689 10 Archival.UniGene.Cluster 3 9 16829 17361 11 UniGene.ID 3 46 14995 16335 12 Genome.Version 39 39 1 0 13 Alignments 3 903 33864 2415 14 Gene.Title 2 4248 21758 4298 15 Gene.Symbol 1 1241 13804 15857 16 Chromosomal.Location 3 7 31 38267 17 Unigene.Cluster.Type 3 11 3 33028 18 Ensembl 3 3 1 38535 19 Entrez.Gene 3 974 13736 15644 20 SwissProt 3 589 3443 32543 21 EC 3 29 73 38373 22 OMIM 3 3 1 38535 23 RefSeq.Protein.ID 3 1419 13596 15833 24 RefSeq.Transcript.ID 3 1241 13597 15832 25 FlyBase 3 3 1 38535 26 AGI 3 3 1 38535 27 WormBase 3 3 1 38535 28 MGI.Name 3 3 1 38535 29 RGD.Name 3 4 2 38534 30 SGD.accession.number 3 3 1 38535 31 Gene.Ontology.Biological.Process 3 561 2 38534 32 Gene.Ontology.Cellular.Component 3 184 2 38534 33 Gene.Ontology.Molecular.Function 3 522 2 38534 34 Pathway 3 606 41 38446 35 Protein.Families 3 734 485 37492 36 Protein.Domains 3 1174 745 36663 37 InterPro 3 374 3689 18341 38 Trans.Membrane 3 7417 3537 32115 39 QTL 3 40 2 38534 40 Annotation.Description 133 190 148 0 41 Annotation.Transcript.Cluster 0 3853 32917 0 42 Transcript.Assignments 32 19852 36945 0 43 Annotation.Notes 3 5581 14663 23763
Notes about Chicken GeneChip Data Summary:
Importing Annotation CSV Files into Access. A few additional notes from Affymetrix would make this process much easier. In particular, Affy's Probe Set Annotations in Tabular Format document should include some of the following notes. Why not Excel instead of Access? Unbelievably, Microsoft still has a limit of 65536 (i.e., 64K) rows in Excel 2002. If you try to load more rows than that, like with the Orthologs file, Excel will show this message |
You won't see records past row 65536 (The Orthologs file for the older MOE430A has over 200,000 rows). |
Access provides database "joins" and other ways to manipulate the data not done easily in Excel. (The same functionality is available in R using the "merge" statement. I hope to replace some work I've been doing in Access and use R instead. |
Loading the Annotation Table in Access. The Microsoft Access defaults are not always helpful in importing this file. Start with an empty Access database:
|
Next | In a New Table | Next | Next | No Primary Key | Next
Select the Advanced button.
Microsoft's Import Specification dialog box is a bit rigid and doesn't allow resizing so all the fields can be viewed at one time.
As shown below, none of the fields need to be indexed for now.
Several of the Data Types need to be changed from the usual default Text data type (that only works with strings with 255 or fewer characters) to a Memo data type since the strings are fairly long:
Check the MaxLength column in the above AffyStats summary table to determine which fields must be a "memo" type to avoid truncation. Many of these "Memo" fields are so long because they contain "denormalized data" from a database perspective. These fields contain "repeating groups," such as multiple Gene Ontology identifiers within a single field. |
I named the new table "Annotation" to be consistent with the column name on the Affy web page. |

Specify the name ("Annotation" as shown above) | Finish |

If all goes well, you'll see the above message. Select OK. When you have problems, you're likely to see a message such as this: |

Fields that are truncated should be changed from "Text" to "Memo". The Probe Set ID can be used as a primary key since it's unique, which can be changed, if desired, in table design mode. Here's what the table looks like in design mode: |

Here are observations about the various data fields for MOE430A: |
Comments about Annotation Data File for MOE430A GeneChip
| Field | Comments |
| Probe Set ID | This field is unique, so it can be used as a primary key. |
| GeneChip Array | Constant: Mouse Genome MOE430A Array |
| Species Scientific Name | Mus musculus (count = 22,645)
Bacillius subtilis
(count = 24) All non-mouse species are part of the "Control Sequence" probesets. See Sequence Type. |
| Annotation Date | Mar 21, 2005 |
| Sequence Type | Consensus (count = 22,626) or Control (count = 64) [Note: "Control" sequences have a prefix of "AFFX" in their Probe Set ID.] |
| Sequence Source | GenBank (count = 22,626) or "---" (count = 64) |
| Target Description | Denomalized field that could be parsed into additional fields. For example: gb:BC024686.1 /DB_XREF=gi:19354080 /FEA=FLmRNA |
| Genome Version | Constant: May 2004 (NCBI 33) |
| Alignments | Denomalized field that could be parsed into additional fields. chr6:88224568-88250219 (+) // 97.36 // qD1 |
| Unigene Cluster Type | "full length" (count=18643), "est" (count=33), "est /// full length" (count=4), "---" (count=4010) |
| LocusLink | Usually a single ID, but can show repeating groups: 14017 /// 216984 |
| SwissProt | Often shows repeating groups in this single field: P41230 /// Q6ZQF8 /// Q80XQ9 /// Q8CGG4 |
| RefSeq Protein ID | Can show repeating groups: NP_034291 /// NP_666135 |
| RefSeq Transcript ID | Can show repeating groups: NM_010161 /// NM_146023 |
FlyBase |
constant: "---" |
| Gene Ontology Biological Process | Denomalized field that could be parsed into additional fields. 6886 // intracellular protein transport // inferred from sequence or structural similarity /// 15031 // protein transport // inferred from sequence or structural similarity /// 6810 // transport // inferred from sequence or structural similarity |
| Gene Ontology Cellular Component | Denomalized field that could be parsed into additional fields. 5794 // Golgi apparatus // inferred from sequence or structural similarity /// 16020 // membrane // inferred from sequence or structural similarity |
| Gene Ontology Molecular Function | Denomalized field that could be parsed into additional fields. 5525 // GTP binding // inferred from electronic annotation /// 3743 // translation initiation factor activity // inferred from electronic annotation |
| Pathway | Sparse information, but is a denormalized field: Example: Nucleotide_Metabolism // GenMAPP |
| Protein Families | Sparse information, but is a denormalized field: Example: |
| Protein Domains | Sparse information, but is a denormalized field: Example: scop // a.4.3.ARID domain // All alpha proteins; DNA/RNA-binding 3-helical bundle; ARID-like; ARID domain // 2.0E-31 |
| InterPro |
Very sparse information. Defined for only 10 probesets. Example: IPR002586 // Cobyrinic acid a,c-diamide synthase |
| Trans Membrane QTL |
Constant: "---" |
| Annotaton Description | 78 possible values. Most common (count=14249) This probe set was annotated using the Matching Probes based pipeline to a Locus Link identifier using 1 transcripts. // false // Matching Probes // A |
| Annotation Transcript Cluster | Field contains repeating groups: Example: AB017026(11),AK050641(11),BC046466(9),NM_207530(11) |
| Transcript Assignments | Field contains subfields that could be parsed: Example: |
| Annotation Notes | "---" for over half (count=11820) Field contains subfields that could be parsed: Example: AB054987 // gb // 3 // Cross Hyb Matching Probes |
Loading the BLASTP and BLASTX tables. These CSV files also cannot be loaded into Access databases using default parameters. The "Hit Description" must be made into a Memo field. The "E-value" can be stored as a "double" instead of a character string. The Probe Set IDs are not unique, so this field cannot be made a primary key, but making it an index is probably a good idea to speed future processing. |




Loading the Orthologs table. This one is easy since the Microsoft Access defaults work fine. |


The Probe Set IDs are not unique in this table, but an index can be added to this field to speed joins with other tables. This table has 18,663 unique Probe Set IDs. |
| Ortholog Chip | Total |
Curated Ortholog |
Putative Ortholog |
| AtGenome1 | 715 | 715 | |
| ATH1-121501 | 1,525 | 1,525 | |
| C. elegans | 2,062 | 2,062 | |
| DrosGenome1 | 4,293 | 4,293 | |
| HC-G110 | 2,613 | 2,544 | 69 |
| HG-Focus | 11,844 | 11,205 | 639 |
| HG-U133A | 27,617 | 25,519 | 2,098 |
| HG-U133B | 13,680 | 11,502 | 2,178 |
| HG-U133-PLUS | 44,992 | 40,150 | 4,842 |
| HG-U95Av2 | 16,040 | 14,984 | 1,056 |
| HG-U95B | 8,514 | 7,314 | 1,200 |
| HG-U95C | 6,665 | 5,679 | 986 |
| HG-U95D | 4,762 | 4,149 | 613 |
| HG-U95E | 7,294 | 6,330 | 964 |
| Hu35KsubA | 9,542 | 8,452 | 1,090 |
| Hu35KsubB | 4,974 | 4,272 | 702 |
| Hu35KsubC | 6,056 | 5,372 | 684 |
| Hu35KsubD | 6,070 | 5,390 | 680 |
| HuGeneFL | 8,493 | 8,039 | 454 |
| RAE230A | 8,054 | 550 | 7,504 |
| RAE230B | 610 | 47 | 563 |
| RG-U34A | 7,433 | 520 | 6,913 |
| RG-U34B | 1,631 | 125 | 1,506 |
| RG-U34C | 1,932 | 109 | 1,823 |
| RN-U34 | 1,453 | 79 | 1,374 |
| RT-U34 | 939 | 44 | 895 |
| TOTAL | 209,803 | 162,375 | 47,428 |
Loading Lists of Probe Set IDs for Consensus, Control and Target Sequences. The following files contain FASTA sequence information, so they cannot be loaded into Access tables directly:
A short UNIX script can be used to extract the probeset IDs from these files, which can then be loaded into Access:
The consensus.txt file can be loaded into an Access table, named Consensus, like this:
Repeat the above procedure for the control.txt and target.txt files and change the table names accordingly. This process would also work with the MOE430A_probe_fasta file, which is in FASTA format. However, the MOE430A_probe_tab file is already tab delimited delimited, so that file can be loaded directly into Access. To get Access to recognize the file a little easier, let's add a ".txt" suffix to MOE430_probe_tab. Use these similar steps to load the file:
In Design Mode, change the field "Probe Set Name" to "Probe Set ID" for consistency with other tables. This field is not unique so it cannot be a primary key, but make it an indexed field with "duplicates OK". Here's what's in the Probe table: |


Each probe set has a number of rows in this table. For example the first probe set, 1415670_at, has eleven probe pairs. Here is the distribution of the number of probe pairs per probe set for MOE430A: |
| Number of Probe Pairs Per Probeset |
Frequency | Comments |
| 8 | 1 | 1424262_at |
| 9 | 4 | 1439244_a_at, 1420464_s_at, 1434127_a_at, 1437835_a_at |
| 10 | 11 | |
| 11 | 22,631 | |
| 20 | 40 | controls |
| 21 | 3 | controls |
| Total | 22,690 |
The (Probe X, Probe Y) coordinates in this table specify the location of the probe pair in the Affy .DAT file. |
Load the list of "mask" maintenance gene probeset IDs. From http://www.affymetrix.com/support/technical/mask_files.affx
Download the file http://www.affymetrix.com/Auth/support/downloads/mask_files/moe430anorm.zip. Unzip this file to create the file moe430anorm.msk. Copy this file to moe430A_maintenance.txt. Replace the first three lines ("MOE430A", "[Call]", "[Comp]") with "Probe Set ID". Load this file into an Access table and call it "Maintenance": File | Get External Data | Import | moe4380Aa_maintenance.txt | Import | Delimited | Next | First Row Contains Data | Next | Next | Next | select Probe Set ID as primary key | Next | Maintenance | Finish |

A "join" of the Annotation and Maintenance tables can be used to see additional info about these maintenance genes. |

For MOE430A, all of the maintenance probe sets had exactly 11 probes. NOTE: The "Control" genes have a Probe Set ID starting with "AFFX" so there is no overlap between the "Control" genes and the "Maintenance" genes. |
Summary. The notes above show how to load these MOE430A sequence and annotation tables into an Access database: |
Sequence Files
Table Records Unique ProbeSet IDs Comments Consensus 26,626 26,626 ID List only Control 64 64 ID List only Target 22,690 22,690 ID List only Probe 249,958 22,690 Annotation Files
Table Records Unique ProbeSet IDs Comments Annotation 22,690 22,690 blastp 36,115 17,743 blastx 33,336 9,861 Orthologs 209,803 18,663 Other Files
Table Records Unique ProbeSet IDs Comments Maintenance 100 100 ID List only
e f g @ s t o w e r s - i n s t i t u t e . o r g
Updated
7 Oct 2005