The CDF file describes the layout for an Affymetrix GeneChip array. An
array may contain Expression, Genotyping, CustomSeq, Copy Number and/or
Tag probe sets. All probe set names within an array are unique.
Multiple copies of a probe set may exist on a single array as long as
each copy has a unique name.
This document describes the following
versions:
- The ASCII text format is
used by the MAS and GCOS 1.0 software. It is also known as the ASCII
version.
- The XDA format is used by
the GCOS 1.2 and later software. It is also known as the binary or
XDA version.
The ASCII format of the CDF file is a text file similar to the
Windows INI format.
The file is divided up into sections. The start of
each section is defined by a line containing a section name enclosed in
square brackets. The section names are: "CDF", "Chip", "QCI"
(where I ranges from 1 to the number of QC probe sets), "UnitJ"
(where J is an internal index to uniquely distinguish probe
sets), and "UnitJ_BlockK" (where J and K are
internal indices used to distinguish subsets of a probe set). The data
in each section is of the format TAG=VALUE.
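The section and TAG=VALUE layout described above can be read with a few lines of Python. This is a minimal sketch, not the Affymetrix implementation: the function name `parse_cdf_text` and the sample contents are made up for illustration, and it assumes every data line belongs to the most recent bracketed section name.

```python
def parse_cdf_text(text):
    """Split ASCII CDF text into {section name: {TAG: VALUE}}."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines between sections
        if stripped.startswith("[") and stripped.endswith("]"):
            # A section name enclosed in square brackets starts a new section.
            current = sections.setdefault(stripped[1:-1], {})
        elif current is not None and "=" in line:
            # Split on the first "=" only and keep the value untouched so
            # that tab-separated fields (including blank ones) survive.
            tag, value = line.split("=", 1)
            current[tag.strip()] = value
    return sections


# Hypothetical file contents, made up for illustration.
sample = (
    "[CDF]\nVersion=GC3.0\n\n"
    "[Chip]\nName=Example\nRows=4\nCols=6\n"
    "NumberOfUnits=1\nMaxUnit=1\nNumQCUnits=0\n"
)
sections = parse_cdf_text(sample)
print(sections["CDF"]["Version"])
print(sorted(sections))
```

Values are kept as raw strings because some TAGS carry tab-separated fields whose leading or trailing blanks are significant.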
The "CDF" section contains the version number of
the file. The TAGS are:
| TAG |
Description |
| Version |
The version number.
Should always be set to "GC1.0", "GC2.0", "GC3.0" or "GC4.0". This
document describes GC3.0 and GC4.0 version CDF files. |
The "Chip" section contains the following TAGS:
| TAG |
Description |
| Name |
The name of the
array. This item is not used by the software. |
| Rows |
The number of rows
of cells on the array. |
| Cols |
The number of
columns of cells on the array. |
| NumberOfUnits |
The number of units
in the array not including QC units. For CustomSeq arrays, there are
2 units: Unit1 contains the probes interrogating a sense target and
Unit2 contains the probes interrogating an anti-sense target. For all
other array types, there exists one unit per probe set. |
| MaxUnit |
Each unit is given
a unique number. This value is the maximum of the unit numbers of all
the units in the array (not including QC units). |
| NumQCUnits |
The number of QC
units. QC units are defined in version 2 and above. CustomSeq arrays do
not contain any QC units. |
| ChipReference |
Used for CustomSeq,
HIV and P53 arrays only. This is the reference sequence displayed by
the Affymetrix software. The sequence may contain spaces. This value is
defined for version 2 and above. |
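As an illustration of the "Chip" TAGS, the sketch below converts a dict of raw TAG strings into typed fields. The class `ChipInfo` and the function `chip_info_from_tags` are hypothetical names. Treating NumQCUnits and ChipReference as optional is an assumption based on the note that they are defined for version 2 and above.

```python
from dataclasses import dataclass


@dataclass
class ChipInfo:
    name: str
    rows: int
    cols: int
    number_of_units: int
    max_unit: int
    num_qc_units: int = 0     # defined for version 2 and above
    chip_reference: str = ""  # CustomSeq, HIV and P53 arrays only


def chip_info_from_tags(tags):
    """Build a ChipInfo from the TAG=VALUE strings of a "Chip" section."""
    return ChipInfo(
        name=tags.get("Name", ""),
        rows=int(tags["Rows"]),
        cols=int(tags["Cols"]),
        number_of_units=int(tags["NumberOfUnits"]),
        max_unit=int(tags["MaxUnit"]),
        num_qc_units=int(tags.get("NumQCUnits", 0)),
        chip_reference=tags.get("ChipReference", ""),
    )


# Hypothetical values, made up for illustration.
chip = chip_info_from_tags({
    "Name": "Example", "Rows": "4", "Cols": "6",
    "NumberOfUnits": "2", "MaxUnit": "2", "NumQCUnits": "1",
})
print(chip.rows * chip.cols)  # total number of cells on the array
```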
The next set of sections, whose names begin
with "QC", define the QC units or probe sets in the array. There are
NumQCUnits (from the Chip section) QC sections. Each section name is a
combination of "QC" and an index ranging from 1 to NumQCUnits, and the
sections are listed sequentially. QC units are defined for
version 2 and above. Each section contains the following TAGS:
| TAG |
Description |
| Type |
Defines the type of
QC probe set. The defined types are:
0 - Unknown
1 - Checkerboard Negative
2 - Checkerboard Positive
3 - Hybridization Negative
4 - Hybridization Positive
5 - Text Features Negative
6 - Text Features Positive
7 - Central Negative
8 - Central Positive
9 - Gene Expression Negative
10 - Gene Expression Positive
11 - Cycle Fidelity Negative
12 - Cycle Fidelity Positive
13 - Central Cross Negative
14 - Central Cross Positive
15 - Cross Hyb Negative
16 - Cross Hyb Positive
|
| NumberCells |
The number of cells
in the probe set. |
| CellHeader |
Defines the data
contained in the subsequent lines, separated by tabs.
For all QC probe set types:
X - The X coordinate of the
cell.
Y - The Y coordinate of the
cell.
PROBE - The probe sequence of
the cell. Typically set to "N".
PLEN - The number of bases in
the probe sequence.
ATOM - An index used to group
multiple cells.
INDEX - An index used to look
up the corresponding cell data in the CEL file.
The final data items are dependent on the
type of the QC probe set:
MATCH - A boolean flag
indicating a perfect match probe. For types: 7 - Central Negative, 8 -
Central Positive, 9 - Gene Expression Negative,
10 - Gene Expression Positive
BG - A boolean flag indicating
a background (blank) cell. For types: 9 - Gene Expression Negative, 10
- Gene Expression Positive
CYCLES - Always a tab-separated
list of 0's. There are as many 0's as there are
bases in the probe sequence (PLEN). For types: 11 - Cycle Fidelity
Negative, 12 - Cycle
Fidelity Positive
|
| Celli |
This contains the
information about a cell that belongs to the probe set. The value of i
in the tag ranges from 1 to the number of cells in the probe set
and will be listed sequentially. The values in each line depend on the
CellHeader.
The values are separated by
tabs. |
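The relationship between CellHeader and the Celli lines can be sketched as follows: the header supplies the column names and each cell line supplies the matching tab-separated values. The function name `parse_qc_cells` and the sample values are hypothetical. The sketch assumes that CYCLES, when present, is the last column and absorbs all remaining values.

```python
def parse_qc_cells(tags):
    """Return one dict per Celli line, keyed by the CellHeader columns."""
    columns = tags["CellHeader"].split("\t")
    cells = []
    for i in range(1, int(tags["NumberCells"]) + 1):
        values = tags[f"Cell{i}"].split("\t")
        cell = dict(zip(columns, values))
        if columns[-1] == "CYCLES":
            # Assumption: CYCLES is a list with one entry per probe base,
            # so it takes every value from its column onward.
            cell["CYCLES"] = values[len(columns) - 1:]
        cells.append(cell)
    return cells


# Hypothetical QC section of type 9 (Gene Expression Negative).
qc = {
    "Type": "9",
    "NumberCells": "2",
    "CellHeader": "X\tY\tPROBE\tPLEN\tATOM\tINDEX\tMATCH\tBG",
    "Cell1": "0\t0\tN\t25\t0\t0\t1\t0",
    "Cell2": "1\t0\tN\t25\t0\t1\t0\t0",
}
cells = parse_qc_cells(qc)
print(cells[0]["X"], cells[0]["Y"], cells[0]["MATCH"])
```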
The next set of sections, whose names begin
with "Unit", define the probes that are members of the unit (probe
set). Each unit is
divided into subsections termed "Blocks", which are referred to as
"groups" in the Files SDK documentation.
Each section name is a combination of "Unit" and
an index. There is no meaning to the index value. Immediately following
the "Unit" section there will be the "Block" sections for that unit
before the next unit is defined.
Each "Unit" section contains the following TAGS:
| TAG |
Description |
| Name |
The name of the
unit. The probe set name for Genotyping, Copy Number and Polymorphic
Marker units or "NONE" for all other unit types. |
| Direction |
Defines if the
probes are interrogating a sense target or anti-sense target (1 -
sense, 2 - anti-sense, 3 - both). |
| NumAtoms |
The number of atoms
in the entire probe set. This TAG may contain two values after the
equal sign. The first is the number of atoms and the second (if present)
is the number of cells
in each atom. An atom is a probe quartet for CustomSeq units and a
probe pair for all other unit types. |
| NumCells |
The number of cells
in the entire probe set. Probe pairs contain 2 cells and probe quartets
contain 4 cells. |
| UnitNumber |
An arbitrary index
value for the probe set. |
| UnitType |
Defines the type of
unit (0 - Unknown, 1 - CustomSeq, 2 - Genotyping, 3 - Expression, 7 -
Tag/GenFlex, 8 - Copy Number, 9 - Genotyping Control, 10 - Expression
Control, 11 - Polymorphic Marker). An array may contain units of
varying types. |
| NumberBlocks |
The number of
blocks or groups in the probe set. |
| MutationType |
Used for Genotyping
units only in defining the type of polymorphism (0 - substitution, 1 -
insertion, 2 - deletion). This value is available in version 2 and
above. |
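Two of the "Unit" TAGS need a little interpretation: NumAtoms may carry a second value, and UnitType is a numeric code. A minimal sketch of both follows. The names `ASCII_UNIT_TYPES` and `parse_num_atoms` are hypothetical. Because the separator between the two NumAtoms values is not specified here, the sketch simply collects every run of digits.

```python
import re

# UnitType codes as listed for the ASCII format.
ASCII_UNIT_TYPES = {
    0: "Unknown", 1: "CustomSeq", 2: "Genotyping", 3: "Expression",
    7: "Tag/GenFlex", 8: "Copy Number", 9: "Genotyping Control",
    10: "Expression Control", 11: "Polymorphic Marker",
}


def parse_num_atoms(value):
    """Return (number of atoms, cells per atom or None) from a NumAtoms value."""
    numbers = [int(n) for n in re.findall(r"\d+", value)]
    atoms = numbers[0]
    cells_per_atom = numbers[1] if len(numbers) > 1 else None
    return atoms, cells_per_atom


# Hypothetical values, made up for illustration.
print(parse_num_atoms("11"))
print(parse_num_atoms("20 4"))
print(ASCII_UNIT_TYPES.get(int("3"), "Unknown"))
```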
After the "Unit" section follows the
"Unit_Block" sections. There are as many "Unit_Block" sections as
defined
by NumberBlocks. A block will list the probes as its members.
The TAGS are:
| TAG |
Description |
| Name |
The name of the
block. For Genotyping units this is the allele. For Polymorphic Marker
units this is "None". For all other unit types this is the name of the
probe set. |
| BlockNumber |
An index to the
block. |
| Wobble |
The wobble
situation for Polymorphic Marker units in the block. Only available in
version 4. |
| Allele |
The allele
code for Polymorphic Marker units in the block. Only available in
version 4. |
| NumAtoms |
The number of atoms
in the block. |
| NumCells |
The number of cells
in the block. |
| StartPosition |
The position of the
first atom. |
| StopPosition |
The position of the
last atom. |
| Direction |
Used for Genotyping
units only in defining whether the probes are interrogating a sense
target or anti-sense target (0 - no direction, 1 - sense, 2 -
anti-sense). This value is available in version 3 and above. |
| CellHeader |
Defines the data
contained in the subsequent lines, separated by tabs. The values are:
X - The X coordinate of the
cell.
Y - The Y coordinate of the
cell.
PROBE - The probe sequence of
the cell. Typically set to "N".
FEAT - Unused string.
QUAL - The probe set name plus
the allele for Genotyping units. The probe set name for all other unit
types.
EXPOS - Ranges from 0 to
NumAtoms - 1 for Expression units. For all other unit types, provides
relative positional information for the probe.
PLEN - The length of the probe
sequence. Only available in version 4.
POS - An
index to the base position within the probe where the mismatch occurs.
CBASE - Not used.
PBASE - The probe base at the
substitution position.
TBASE - The base of the target
where the probe interrogates at the substitution position.
ATOM - An index used to group
probe pairs or quartets. For Expression, identical to EXPOS.
INDEX - An index used to look
up the corresponding cell data in the CEL file.
GROUP - The physical grouping
of the probe on the array. Only available in version 4.
The following are only available in version
2 and above:
CODONIND - Always set to -1
CODON - Always set to -1
REGIONTYPE - Always set to 99
REGION - Always set to a blank
character
|
| Celli |
This contains the
information about a cell that belongs to the block. The value of i
in the tag ranges from 1 to the number of cells in the block. The
values in each line depend on the CellHeader. The values are separated
by tabs. |
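Since ATOM is the index that groups probe pairs or quartets, the cells of a block can be collected per atom. The sketch below shows one way to do that. The function name `group_block_cells_by_atom` and all cell values are hypothetical. It assumes a version 3 style header without the version 4 columns PLEN and GROUP.

```python
from collections import defaultdict


def group_block_cells_by_atom(cell_header, cell_lines):
    """Map each ATOM index to the (X, Y) coordinates of its cells."""
    columns = cell_header.split("\t")
    atoms = defaultdict(list)
    for line in cell_lines:
        cell = dict(zip(columns, line.split("\t")))
        atoms[int(cell["ATOM"])].append((int(cell["X"]), int(cell["Y"])))
    return dict(atoms)


header = "\t".join([
    "X", "Y", "PROBE", "FEAT", "QUAL", "EXPOS", "POS", "CBASE", "PBASE",
    "TBASE", "ATOM", "INDEX", "CODONIND", "CODON", "REGIONTYPE", "REGION",
])


def make_line(x, y, atom, index):
    # Builds a made-up cell line; only X, Y, ATOM and INDEX vary here.
    return "\t".join([
        str(x), str(y), "N", "control", "probeset_a", str(atom), "13",
        "A", "A", "A", str(atom), str(index), "-1", "-1", "99", " ",
    ])


lines = [make_line(3, 0, 0, 3), make_line(3, 1, 0, 9),
         make_line(4, 0, 1, 4), make_line(4, 1, 1, 10)]
pairs = group_block_cells_by_atom(header, lines)
print(pairs)
```

For an Expression block, each resulting list would be expected to hold the 2 cells of one probe pair.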
The XDA format of the CDF file is a binary file created for faster access
and smaller file size. The values in the file are stored in
little-endian format.
The file contents are defined by:
| Item |
Description |
Type |
| 1 |
Magic number.
Always set to 67. |
integer |
| 2 |
Version number.
Should be set to 1 or 2.
|
integer |
| 3 |
The number of
columns of cells on the array. |
unsigned short |
| 4 |
The number of rows
of cells on the array. |
unsigned short |
| 5 |
The number of units
in the array not including QC units. The term unit is an internal term
which means probe set. |
integer |
| 6 |
The number of QC
units. |
integer |
| 7 |
The length of the
CustomSeq reference sequence. |
integer |
| 8 |
The CustomSeq
reference sequence. |
char[ length
defined above] |
| 9 |
The probe set name.
The UNIT name for CustomSeq and Genotyping. The BLOCK name for
Expression. |
char[64] * (# of
units) |
| 10 |
File position for
the start of each QC unit information block. |
integer * (# of QC
units) |
| 11 |
File position for
the start of each unit information block. |
integer * (# of
units) |
| 12 |
QC information,
repeated for each QC unit:
Type - unsigned short
Number of probes - integer
Probe information, repeated for each probe
in the QC unit:
X coordinate - unsigned short
Y coordinate - unsigned short
Probe length - unsigned char
Perfect match flag - unsigned char
Background probe flag - unsigned char
|
see description |
| 13 |
Unit information,
repeated for each unit:
UnitType - unsigned short (1 - Expression, 2
- Genotyping, 3 - CustomSeq, 4 - Tag, 5 - Copy Number, 6 - Genotyping
Control, 7 - Expression Control, 8 - Polymorphic Marker)
Direction - unsigned char
Number of atoms - integer
Number of blocks - integer (always 1 for Expression units)
Number of cells - integer
Unit number (probe set number) - integer
Number of cells per atom - unsigned char
Block information, repeated for each block
in the unit:
Number of atoms - integer
Number of cells - integer
Number of cells per atom - unsigned char
Direction - unsigned char
The position of the first atom - integer
<unused integer value> - integer
The block name - char[64]
Wobble situation - unsigned short (only available in version 2)
Allele code - unsigned short (only available in version 2)
Cell information, repeated for each cell in
the block:
Atom number - integer
X coordinate - unsigned short
Y coordinate - unsigned short
Index position (relative to sequence for CustomSeq, Genotyping, and
Copy Number units, for Expression units this value is the atom number)
- integer
Base of probe at substitution position - char
Base of target at interrogation position - char
Length of probe sequence - unsigned short (only available in version 2)
Physical grouping of probe - unsigned short (only available in version
2)
|
see description |
|
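Items 1 through 11 of the table above can be read with Python's `struct` module. This is a sketch under stated assumptions, not the SDK's reader: it assumes "integer" means a 4-byte signed value and "unsigned short" a 2-byte value, and that probe set names are NUL-padded to 64 bytes. The function name `read_xda_header` and the sample bytes are made up for illustration.

```python
import io
import struct


def read_xda_header(stream):
    """Read items 1-11 of an XDA CDF file from a binary stream."""
    magic, version = struct.unpack("<ii", stream.read(8))
    if magic != 67:
        raise ValueError("magic number is not 67; not an XDA CDF file")
    cols, rows = struct.unpack("<HH", stream.read(4))  # columns come first
    num_units, num_qc, ref_len = struct.unpack("<iii", stream.read(12))
    reference = stream.read(ref_len).decode("ascii")
    # Assumption: each name occupies 64 bytes and is padded with NUL bytes.
    names = [stream.read(64).split(b"\0", 1)[0].decode("ascii")
             for _ in range(num_units)]
    qc_positions = struct.unpack(f"<{num_qc}i", stream.read(4 * num_qc))
    unit_positions = struct.unpack(f"<{num_units}i", stream.read(4 * num_units))
    return {
        "version": version, "cols": cols, "rows": rows,
        "reference": reference, "names": names,
        "qc_positions": qc_positions, "unit_positions": unit_positions,
    }


# Hypothetical header: 6 columns, 4 rows, 1 unit, no QC units, no reference.
data = struct.pack("<iiHHiii", 67, 1, 6, 4, 1, 0, 0)
data += b"probeset_a".ljust(64, b"\0")
data += struct.pack("<i", 92)  # made-up file position of the unit block
header = read_xda_header(io.BytesIO(data))
print(header["names"], header["unit_positions"])
```

The `<` prefix in each format string selects little-endian byte order with no padding, which matches the statement that values are stored in little-endian format.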