Affymetrix

Affymetrix® CDF Data File Format


CDF FILE

 

Description
The CDF file describes the layout for an Affymetrix GeneChip array. An array may contain Expression, Genotyping, CustomSeq, Copy Number and/or Tag probe sets. All probe set names within an array are unique. Multiple copies of a probe set may exist on a single array as long as each copy has a unique name.

The information below will describe the following versions:

  • ASCII text format is used by the MAS and GCOS 1.0 software. This was also known as the ASCII version.
  • XDA format is used by the GCOS 1.2 and above software. This was also known as the binary or XDA version.

ASCII Text Format
This CDF file is an ASCII text file with a layout similar to the Windows INI format.

The file is divided into sections. The start of each section is defined by a line containing a section name enclosed in square brackets. The section names are: "CDF", "Chip", "QCI" (where I ranges from 1 to the number of QC probe sets), "UnitJ" (where J is an internal index to uniquely distinguish probe sets), and "UnitJ_BlockK" (where J and K are internal indices used to distinguish subsets of a probe set). The data in each section is of the format TAG=VALUE.
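
The section layout described above can be read with a few lines of Python. This is a minimal sketch, not the Affymetrix implementation: the function name parse_cdf_sections and the sample text are hypothetical, and the sketch assumes that each TAG appears at most once per section.

```python
def parse_cdf_sections(text):
    """Split ASCII CDF text into {section_name: {tag: value}}.

    Assumption: a TAG is not repeated within one section.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines between sections
        if stripped.startswith("[") and stripped.endswith("]"):
            # A section name enclosed in square brackets starts a new section.
            current = stripped[1:-1]
            sections[current] = {}
        elif current is not None and "=" in line:
            # Split on the first "=" only; the value is kept unstripped
            # because tab-separated values may end in a blank field.
            tag, value = line.split("=", 1)
            sections[current][tag.strip()] = value
    return sections


# Invented sample text, for illustration only.
sample = "\n".join([
    "[CDF]",
    "Version=GC3.0",
    "",
    "[Chip]",
    "Name=Example",
    "Rows=2",
    "Cols=2",
    "NumberOfUnits=1",
])
sections = parse_cdf_sections(sample)
print(sections["Chip"]["Rows"])
```

All values are returned as strings; converting them (for example Rows and Cols to integers) is left to the caller.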

The "CDF" section contains the version number of the file. The TAGS are:

TAG Description
Version The version number. Should always be set to "GC1.0", "GC2.0", "GC3.0" or "GC4.0". This document describes GC3.0 and GC4.0 version CDF files.

The "Chip" section contains the following TAGS:

TAG Description
Name The name of the array. This item is not used by the software.
Rows The number of rows of cells on the array.
Cols The number of columns of cells on the array.
NumberOfUnits The number of units in the array not including QC units. For CustomSeq arrays, there are 2 units: Unit1 contains the probes interrogating a sense target and Unit2 contains the probes interrogating an anti-sense target. For all other array types, there exists one unit per probe set.
MaxUnit Each unit is given a unique number. This value is the maximum of the unit numbers of all the units in the array (not including QC units).
NumQCUnits The number of QC units. QC units are defined in version 2 and above. CustomSeq arrays do not contain any QC units.
ChipReference Used for CustomSeq, HIV and P53 arrays only. This is the reference sequence displayed by the Affymetrix software. The sequence may contain spaces. This value is defined for version 2 and above.

The next set of sections, whose names begin with "QC", define the QC units or probe sets in the array. There are NumQCUnits (from the Chip section) QC sections. Each section name is a combination of "QC" and an index ranging from 1 to NumQCUnits, and the sections are listed sequentially. QC units are defined for version 2 and above. Each section contains the following TAGS:

TAG Description
Type Defines the type of QC probe set. The defined types are:

0 - Unknown
1 - Checkerboard Negative
2 - Checkerboard Positive
3 - Hybridization Negative
4 - Hybridization Positive
5 - Text Features Negative
6 - Text Features Positive
7 - Central Negative
8 - Central Positive
9 - Gene Expression Negative
10 - Gene Expression Positive
11 - Cycle Fidelity Negative
12 - Cycle Fidelity Positive
13 - Central Cross Negative
14 - Central Cross Positive
15 - Cross Hyb Negative
16 - Cross Hyb Positive

NumberCells The number of cells in the probe set.
CellHeader Defines the data contained in the subsequent lines, separated by tabs.

For all QC probe set types:
X - The X coordinate of the cell.
Y - The Y coordinate of the cell.
PROBE - The probe sequence of the cell. Typically set to "N".
PLEN - The number of bases in the probe sequence.
ATOM - An index used to group multiple cells.
INDEX - An index used to look up the corresponding cell data in the CEL file.

The final data items are dependent on the type of the QC probe set:
MATCH - A boolean flag indicating a perfect match probe. For types: 7 - Central Negative, 8 - Central Positive, 9 - Gene Expression Negative, 10 - Gene Expression Positive
BG - A boolean flag indicating a background (blank) cell. For types: 9 - Gene Expression Negative, 10 - Gene Expression Positive
CYCLES - This item is always a list of 0's separated by a tab. There are as many 0's as number of bases in the probe sequence (PLEN). For types: 11 - Cycle Fidelity Negative, 12 - Cycle Fidelity Positive

Celli This contains the information about a cell that belongs to the probe set. The value of i in the tag ranges from 1 to the number of cells in the probe set and will be listed sequentially. The values in each line depend on the CellHeader. The values are separated by tabs.
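
Because the CellHeader names the columns of every Celli line, the cells of a QC section can be turned into records by pairing the two. The following Python sketch illustrates the idea under stated assumptions: the function name parse_cells is hypothetical, the section is assumed to be already available as a dictionary of TAG to VALUE strings, and the sample values are invented.

```python
def parse_cells(section):
    """Yield one dict per Celli line, keyed by the CellHeader column names."""
    columns = section["CellHeader"].split("\t")
    count = int(section["NumberCells"])
    # Cells are numbered from 1 to NumberCells and listed sequentially.
    for i in range(1, count + 1):
        values = section[f"Cell{i}"].split("\t")
        yield dict(zip(columns, values))


# Invented QC section of type 9 (Gene Expression Negative), which carries
# the MATCH and BG items after the columns common to all QC types.
qc = {
    "Type": "9",
    "NumberCells": "2",
    "CellHeader": "X\tY\tPROBE\tPLEN\tATOM\tINDEX\tMATCH\tBG",
    "Cell1": "0\t0\tN\t25\t0\t0\t1\t0",
    "Cell2": "1\t0\tN\t25\t1\t1\t0\t1",
}
cells = list(parse_cells(qc))
print(cells[0]["X"], cells[0]["Y"], cells[0]["MATCH"])
```

Reading the column names from CellHeader, rather than hard-coding them, lets the same function handle QC types with or without the MATCH, BG and CYCLES items.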

The next set of sections, whose names begin with "Unit", define the probes that are members of the unit (probe set). Each unit is divided into subsections termed "Blocks", which are referred to as "groups" in the Files SDK documentation.

Each section name is a combination of "Unit" and an index. There is no meaning to the index value. Immediately following the "Unit" section there will be the "Block" sections for that unit before the next unit is defined.

Each "Unit" section contains the following TAGS:

TAG Description
Name The name of the unit. The probe set name for Genotyping, Copy Number and Polymorphic Marker units or "NONE" for all other unit types.
Direction Defines if the probes are interrogating a sense target or anti-sense target (1 - sense, 2 - anti-sense, 3 - both).
NumAtoms The number of atoms in the entire probe set. This TAG contains two values after the equal sign. The first is the number of atoms and the second (if found) is the number of cells in each atom. An atom is a probe quartet for CustomSeq units and a probe pair for all other unit types.
NumCells The number of cells in the entire probe set. Probe pairs contain 2 cells and probe quartets contain 4 cells.
UnitNumber An arbitrary index value for the probe set.
UnitType Defines the type of unit (0 - Unknown, 1 - CustomSeq, 2 - Genotyping, 3 - Expression, 7 - Tag/GenFlex, 8 - Copy Number, 9 - Genotyping Control, 10 - Expression Control, 11 - Polymorphic Marker). An array may contain units of varying types.
NumberBlocks The number of blocks or groups in the probe set.
MutationType Used for Genotyping units only in defining the type of polymorphism (0 - substitution, 1 - insertion, 2 - deletion). This value is available in version 2 and above.
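
The numeric codes in a "Unit" section can be mapped back to the names listed above. The Python sketch below is illustrative only: the function name describe_unit and the sample unit are hypothetical, and the code tables are copied from the ASCII text format TAG descriptions in this section.

```python
# Codes as listed for the ASCII text format "Unit" section.
ASCII_UNIT_TYPES = {
    0: "Unknown",
    1: "CustomSeq",
    2: "Genotyping",
    3: "Expression",
    7: "Tag/GenFlex",
    8: "Copy Number",
    9: "Genotyping Control",
    10: "Expression Control",
    11: "Polymorphic Marker",
}
UNIT_DIRECTIONS = {1: "sense", 2: "anti-sense", 3: "both"}


def describe_unit(unit):
    """Translate the numeric TAG values of a Unit section into names."""
    unit_type = int(unit["UnitType"])
    direction = int(unit["Direction"])
    return {
        "name": unit["Name"],
        "type": ASCII_UNIT_TYPES.get(unit_type, f"undocumented ({unit_type})"),
        "direction": UNIT_DIRECTIONS.get(direction, f"undocumented ({direction})"),
        "blocks": int(unit["NumberBlocks"]),
    }


# Invented Expression unit, for illustration only.
unit = {"Name": "NONE", "Direction": "2", "UnitType": "3", "NumberBlocks": "1"}
info = describe_unit(unit)
print(info["type"], info["direction"])
```

Codes that do not appear in the tables are reported as undocumented instead of raising an error, since an array may contain units of varying types.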

The "Unit_Block" sections follow the "Unit" section. There are as many "Unit_Block" sections as defined by NumberBlocks. Each block lists the probes that are its members.

The TAGS are:

TAG Description
Name The name of the block. For Genotyping units this is the allele. For Polymorphic Marker units this is "None". For all other unit types this is the name of the probe set.
BlockNumber An index to the block.
Wobble The wobble situation for Polymorphic Marker units in the block. Only available in version 4.
Allele The allele code for Polymorphic Marker units in the block. Only available in version 4.
NumAtoms The number of atoms in the block.
NumCells The number of cells in the block.
StartPosition The position of the first atom.
StopPosition The position of the last atom.
Direction Used for Genotyping units only in defining whether the probes are interrogating a sense target or anti-sense target (0 - no direction, 1 - sense, 2 - anti-sense). This value is available in version 3 and above.
CellHeader Defines the data contained in the subsequent lines, separated by tabs. The values are:

X - The X coordinate of the cell.
Y - The Y coordinate of the cell.
PROBE - The probe sequence of the cell. Typically set to "N".
FEAT - Unused string.
QUAL - The probe set name plus the allele for Genotyping units. The probe set name for all other unit types.
EXPOS - Ranges from 0 to NumAtoms - 1 for Expression units. For all other unit types, provides relative positional information for the probe.
PLEN - The length of the probe sequence. Only available in version 4.
POS - An index to the base position within the probe where the mismatch occurs.
CBASE - Not used.
PBASE - The probe base at the substitution position.
TBASE - The base of the target where the probe interrogates at the substitution position.
ATOM - An index used to group probe pairs or quartets. For Expression, identical to EXPOS.
INDEX - An index used to look up the corresponding cell data in the CEL file.
GROUP - The physical grouping of the probe on the array. Only available in version 4.

The following are only available in version 2 and above:
CODONIND - Always set to -1
CODON - Always set to -1
REGIONTYPE - Always set to 99
REGION - Always set to a blank character

Celli This contains the information about a cell that belongs to the block. The value of i in the tag ranges from 1 to the number of cells in the block. The values in each line depend on the CellHeader. The values are separated by tabs.
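
The ATOM column ties the cells of a block together into probe pairs or quartets. As a rough Python sketch of how this grouping might be done, assuming the block section is already available as a dictionary of TAG to VALUE strings: the function name group_cells_by_atom is hypothetical and every sample value is invented.

```python
from collections import defaultdict


def group_cells_by_atom(block):
    """Return {atom_index: [(x, y), ...]} for the cells of one block."""
    columns = block["CellHeader"].split("\t")
    atoms = defaultdict(list)
    # Cells are numbered from 1 to NumCells within the block.
    for i in range(1, int(block["NumCells"]) + 1):
        cell = dict(zip(columns, block[f"Cell{i}"].split("\t")))
        atoms[int(cell["ATOM"])].append((int(cell["X"]), int(cell["Y"])))
    return dict(atoms)


# Invented block with two atoms (probe pairs) of two cells each.
header = ["X", "Y", "PROBE", "FEAT", "QUAL", "EXPOS", "POS",
          "CBASE", "PBASE", "TBASE", "ATOM", "INDEX"]
rows = [
    ["5", "10", "N", "f", "example_at", "0", "13", "A", "A", "A", "0", "105"],
    ["5", "11", "N", "f", "example_at", "0", "13", "A", "T", "A", "0", "115"],
    ["7", "20", "N", "f", "example_at", "1", "13", "C", "C", "C", "1", "207"],
    ["7", "21", "N", "f", "example_at", "1", "13", "C", "G", "C", "1", "217"],
]
block = {"NumAtoms": "2", "NumCells": "4", "CellHeader": "\t".join(header)}
for n, row in enumerate(rows, start=1):
    block[f"Cell{n}"] = "\t".join(row)

pairs = group_cells_by_atom(block)
print(pairs)
```

The column positions are taken from the CellHeader line, so the same function applies to version 4 files that add the PLEN and GROUP columns.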

XDA Format
The format of this CDF file is a binary file created for faster access and smaller file size. The values in the file are stored in little-endian format.

The file contents are defined by:

Item Description Type
1 Magic number. Always set to 67. integer
2 Version number. Should be set to 1 or 2. integer
3 The number of columns of cells on the array. unsigned short
4 The number of rows of cells on the array. unsigned short
5 The number of units in the array not including QC units. The term unit is an internal term which means probe set. integer
6 The number of QC units. integer
7 The length of the CustomSeq reference sequence. integer
8 The CustomSeq reference sequence. char[length defined above]
9 The probe set name. The UNIT name for CustomSeq and Genotyping. The BLOCK name for Expression. char[64] * (# of units)
10 File position for the start of each QC unit information block. integer * (# of QC units)
11 File position for the start of each unit information block. integer * (# of units)
12 QC information, repeated for each QC unit:

Type - unsigned short
Number of probes - integer

Probe information, repeated for each probe in the QC unit:

X coordinate - unsigned short
Y coordinate - unsigned short
Probe length - unsigned char
Perfect match flag - unsigned char
Background probe flag - unsigned char
see description
13 Unit information, repeated for each unit:

UnitType - unsigned short (1 - Expression, 2 - Genotyping, 3 - CustomSeq, 4 - Tag, 5 - Copy Number, 6 - Genotyping Control, 7 - Expression Control, 8 - Polymorphic Marker)
Direction - unsigned char
Number of atoms - integer
Number of blocks - integer (always 1 for Expression units)
Number of cells - integer
Unit number (probe set number) - integer
Number of cells per atom - unsigned char

Block information, repeated for each block in the unit:

Number of atoms - integer
Number of cells - integer
Number of cells per atom - unsigned char
Direction - unsigned char
The position of the first atom - integer
<unused integer value> - integer
The block name - char[64]
Wobble situation - unsigned short (only available in version 2)
Allele code - unsigned short (only available in version 2)

Cell information, repeated for each cell in the block:

Atom number - integer
X coordinate - unsigned short
Y coordinate - unsigned short
Index position (relative to the sequence for CustomSeq, Genotyping, and Copy Number units; for Expression units this value is the atom number) - integer
Base of probe at substitution position - char
Base of target at interrogation position - char
Length of probe sequence - unsigned short (only available in version 2)
Physical grouping of probe - unsigned short (only available in version 2)

see description
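
Items 1 through 11 of the XDA layout can be read with Python's struct module. This is a minimal sketch under stated assumptions, not the Affymetrix implementation: it assumes that "integer" means a 32-bit signed value, that "unsigned short" is 16 bits, and that the char[64] names are padded with NUL bytes. The function name read_xda_header and the sample bytes are hypothetical.

```python
import io
import struct

# Items 1-7: magic, version, columns, rows, units, QC units, sequence length.
# "<" selects little-endian with no padding (assumed 32-bit integers).
HEADER = struct.Struct("<iiHHiii")


def read_xda_header(stream):
    """Read items 1-11 of an XDA CDF file from a binary stream."""
    (magic, version, cols, rows,
     n_units, n_qc, ref_len) = HEADER.unpack(stream.read(HEADER.size))
    if magic != 67:
        raise ValueError("magic number is not 67; not an XDA CDF file")
    ref_seq = stream.read(ref_len).decode("ascii")
    # Item 9: one char[64] name per unit (NUL padding is an assumption).
    names = [stream.read(64).split(b"\0", 1)[0].decode("ascii")
             for _ in range(n_units)]
    # Items 10 and 11: file positions of the QC unit and unit blocks.
    qc_positions = struct.unpack(f"<{n_qc}i", stream.read(4 * n_qc))
    unit_positions = struct.unpack(f"<{n_units}i", stream.read(4 * n_units))
    return {
        "version": version, "cols": cols, "rows": rows,
        "reference": ref_seq, "names": names,
        "qc_positions": qc_positions, "unit_positions": unit_positions,
    }


# Invented header: 2x2 array, one unit, no QC units, no reference sequence.
sample = (HEADER.pack(67, 1, 2, 2, 1, 0, 0)
          + b"example_at".ljust(64, b"\0")
          + struct.pack("<i", 92))
header = read_xda_header(io.BytesIO(sample))
print(header["names"], header["unit_positions"])
```

The file positions in items 10 and 11 allow a reader to seek directly to a single QC unit or unit information block without scanning the units before it.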