Affymetrix Calvin Generic Data File Format

COMMAND CONSOLE GENERIC DATA FILE

Description
The Command Console generic data file format was developed by Affymetrix to store a variety of Affymetrix data and results, including scanner acquisition data, intensity data, and probe array analysis results. Unlike previous Affymetrix file formats, each of which stores only one type of information, this format is designed to store multiple data types, and the contents of a file are self-describing through its header.

This format was also developed to support localization: strings that may contain localized text are stored as 2-byte UNICODE characters (the WSTRING type described below).

Another design criterion was the ability to uniquely identify a file and its parentage independently of the file name. This is accomplished through unique identifiers that are part of the file's generic data header.

Format
The file is a binary file with data stored in network byte order (big-endian format). It is divided into "file header", "generic data header" and "data" sections. Each section is described below.

The sections appear in the following order:

File Header
Generic Data Header (for the file)
    Generic Data Header (for the file's 1st parent)
        Generic Data Header (for the file's 1st parent's 1st parent)
        Generic Data Header (for the file's 1st parent's 2nd parent)
        ...
        Generic Data Header (for the file's 1st parent's Mth parent)
        (assuming there are M parents for the file's 1st parent)
    Generic Data Header (for the file's 2nd parent)
    ...
    Generic Data Header (for the file's Nth parent)
    (assuming the file was created from N parent files)
Data Group #1
    Data Set #1
        Parameters
        Column definitions
        Matrix of data
    Data Set #2
    ...
    Data Set #L
    (assuming there are L data sets in the group)
Data Group #2
...
Data Group #K
(assuming there are K groups in the file)

Data Types
The following table defines the data types used in the file format description below.

Type Description

BYTE 8 bit signed integral number

UBYTE 8 bit unsigned integral number

SHORT 16 bit signed integral number

USHORT 16 bit unsigned integral number

INT 32 bit signed integral number

UINT 32 bit unsigned integral number

FLOAT 32 bit signed floating point number

DOUBLE 64 bit signed floating point number

GUID STRING (see below)

[ ] Indicates array of data

DATETIME ISO 8601 date and time in WSTRING format, expressed in Coordinated Universal Time (UTC, also known as GMT or Greenwich Mean Time).
E.g. "2005-11-23T13:45:53Z"

LOCALE ISO 639 (two-character code) and ISO 3166 (two-character code). Use only the language part of the specification.

PARAMETER BYTE (value type) / INT (size) / value object (depending on the data type and size).

STRING A 1 byte character string. A string object is stored as an INT (to store the string length) followed by the CHAR array (to store the string contents).

WSTRING A UNICODE string. A string object is stored as an INT (to store the string length) followed by the WCHAR array (to store the string contents).

WCHAR 2 byte character.

CHAR 1 byte character.

CONTROLLEDLIST An array of WSTRINGs.

TYPE A MIME type stored in a WSTRING.
The possible MIME types used are:

text/x-calvin-integer-8

text/x-calvin-unsigned-integer-8

text/x-calvin-integer-16

text/x-calvin-unsigned-integer-16

text/x-calvin-integer-32

text/x-calvin-unsigned-integer-32

text/x-calvin-float

text/plain

VALUE A MIME-encoded string stored in a STRING.

ROW An array of data type values that make up a data set row. The data types in a row are defined in the data set header.
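
As an illustration of the string and date types above, the following Python sketch reads STRING, WSTRING and DATETIME values from a byte stream. It assumes that the INT length prefix counts characters rather than bytes, that CHAR data can be decoded as Latin-1, and that WCHAR data is big-endian UTF-16. None of these details are stated explicitly in this description, and the function names are illustrative only.

```python
import io
import struct
from datetime import datetime, timezone


def read_int(stream):
    """Read an INT: 32-bit signed integer in network byte order."""
    return struct.unpack(">i", stream.read(4))[0]


def read_string(stream):
    """Read a STRING: INT length followed by that many 1-byte characters."""
    length = read_int(stream)
    # Assumption: 1-byte characters are decoded as Latin-1.
    return stream.read(length).decode("latin-1")


def read_wstring(stream):
    """Read a WSTRING: INT length followed by that many 2-byte characters."""
    length = read_int(stream)  # assumption: counts characters, not bytes
    # Assumption: WCHARs are big-endian UTF-16 code units.
    return stream.read(2 * length).decode("utf-16-be")


def parse_datetime(text):
    """Parse a DATETIME such as "2005-11-23T13:45:53Z" into an aware datetime."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)
```

A DATETIME field would be read with read_wstring and then passed to parse_datetime.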

Value Types
The following table defines the numeric values for the value types. The value type is used to represent the type of a value stored in the file.

Value	Type
0	BYTE
1	UBYTE
2	SHORT
3	USHORT
4	INT
5	UINT
6	FLOAT
7	STRING
8	WSTRING
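
A reader will typically turn this table into a lookup. The sketch below maps each value type code to a format character of Python's struct module and reports the size in bytes of the fixed-size types. The names VALUE_TYPES and fixed_size are hypothetical, and STRING and WSTRING are treated as variable-length.

```python
import struct

# Value type code -> (type name, struct format character).
# STRING and WSTRING are length-prefixed, so they have no fixed format.
VALUE_TYPES = {
    0: ("BYTE", "b"),
    1: ("UBYTE", "B"),
    2: ("SHORT", "h"),
    3: ("USHORT", "H"),
    4: ("INT", "i"),
    5: ("UINT", "I"),
    6: ("FLOAT", "f"),
    7: ("STRING", None),
    8: ("WSTRING", None),
}


def fixed_size(code):
    """Return the size in bytes of a fixed-size value type, or None for strings."""
    _name, fmt = VALUE_TYPES[code]
    # ">" selects big-endian (network) byte order with no padding.
    return struct.calcsize(">" + fmt) if fmt else None
```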

File Header
The file header section is the first section of the file. This section is used to identify the type of file (i.e. Command Console data file), its version number (for the file format) and the number of data groups stored within the file. Information about the contents of the file such as the data type identifier, the parameters used to create the file and its parentage is stored within the generic data header section.

Item Description Type

1 Magic number. A value to identify that this is a Command Console data file. The value will be fixed to 59. UBYTE

2 The version number of the file. This is the version of the file format. It is currently fixed to 1. UBYTE

3 The number of data groups. INT

4 File position of the first data group. UINT

Following this section in the file is the generic data header section.
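
The four file header items occupy 10 bytes in network byte order. A minimal Python sketch of reading and validating them follows; the names FileHeader and parse_file_header are illustrative and not part of any Affymetrix library.

```python
import struct
from typing import NamedTuple

# UBYTE magic, UBYTE version, INT group count, UINT first group position.
# ">" selects big-endian byte order with no padding between fields.
FILE_HEADER = struct.Struct(">BBiI")


class FileHeader(NamedTuple):
    magic: int
    version: int
    num_data_groups: int
    first_group_pos: int


def parse_file_header(data: bytes) -> FileHeader:
    """Decode the file header from the first bytes of a file."""
    header = FileHeader(*FILE_HEADER.unpack_from(data, 0))
    if header.magic != 59:
        raise ValueError("not a Command Console data file")
    if header.version != 1:
        raise ValueError("unsupported file format version")
    return header
```

The generic data header then starts at byte offset FILE_HEADER.size.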

Generic Data Header
This section stores the file and file type identifiers, data describing the contents of the file, the parameters used to create it, and information about its parentage. The section is recursive: each generic data header contains the generic data headers of its parent files, so the entire parentage of a file can be traversed. This provides the complete history of how a file came to be.

The first data header section immediately follows the file header section.

Item Description Type

1 The data type identifier. This is used to identify the type of data stored in the file. For example:

acquisition data (affymetrix-calvin-scan-acquisition)

intensity data (affymetrix-calvin-intensity)

expression results generated by MAS5 (affymetrix-probeset-analysis)

expression results generated by RMA or PLIER (affymetrix-quantification-analysis)

expression results generated by RMA or PLIER with DABG (affymetrix-quantification-detection-analysis)

genotyping, copy number, copy number variation, DMET results (affymetrix-multi-data-type-analysis)

STRING

2 Unique file identifier. This is the identifier to use to link the file with parent files. This identifier will be updated whenever the contents of the file change.
Example: When a user manually aligns the grid in a DAT file the grid coordinates are updated in the DAT file and the file is given a new file identifier.
GUID

3 Date and time of file creation. DATETIME

4 The locale of the operating system that the file was created on. LOCALE

5 The number of name/value/type parameters. INT

6 Array of parameters stored as name/value/type triplets. (WSTRING / VALUE / TYPE ) [ ]

7 Number of parent file headers. INT

8 Array of parent file headers. Generic Data Header [ ]
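
Because item 8 is itself an array of generic data headers, a reader is naturally recursive. The following Python sketch walks the items in table order. It assumes that LOCALE is stored as a WSTRING (the description does not say how it is serialized), that string lengths count characters, and that WCHAR data is big-endian UTF-16. Parameter values are kept as raw bytes because their decoding depends on the MIME type. All function names are hypothetical.

```python
import io
import struct


def read_int(s):
    return struct.unpack(">i", s.read(4))[0]


def read_string(s):
    """STRING: INT length, then 1-byte characters (returned as raw bytes)."""
    return s.read(read_int(s))


def read_wstring(s):
    """WSTRING: INT length, then 2-byte characters (assumed UTF-16 big-endian)."""
    return s.read(2 * read_int(s)).decode("utf-16-be")


def read_parameters(s):
    """Read a counted array of name/value/type triplets."""
    params = []
    for _ in range(read_int(s)):
        name = read_wstring(s)
        value = read_string(s)       # VALUE: MIME-encoded, left undecoded here
        mime_type = read_wstring(s)  # TYPE
        params.append((name, value, mime_type))
    return params


def read_generic_header(s):
    """Read one generic data header, including all of its ancestors."""
    header = {
        "data_type_id": read_string(s).decode("latin-1"),
        "file_id": read_string(s).decode("latin-1"),  # GUID
        "created": read_wstring(s),                   # DATETIME
        "locale": read_wstring(s),                    # assumption: WSTRING
        "parameters": read_parameters(s),
    }
    num_parents = read_int(s)
    header["parents"] = [read_generic_header(s) for _ in range(num_parents)]
    return header
```

The recursion depth equals the depth of the file's lineage, so the parents of the first parent are read before the second parent, matching the section order given under Format.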

Data Group
This section describes the data group. A data group is a group of data sets. A file contains one or more data groups.

Item Description Type

1 File position of the next data group. When this is the last data group in the file, the value should be 0. UINT

2 File position of the first data set within the data group. UINT

3 The number of data sets within the data group. INT

4 The data group name. WSTRING
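
The "next data group" position makes the data groups a linked list that starts at the position given in the file header and ends at a value of 0. A Python sketch of walking that list follows; the function names are illustrative, and the group name is assumed to be big-endian UTF-16 with a length prefix counting characters.

```python
import io
import struct


def read_data_group_header(stream, position):
    """Read the four data group items starting at the given file position."""
    stream.seek(position)
    # UINT next group position, UINT first data set position, INT data set count.
    next_pos, first_set_pos, num_sets = struct.unpack(">IIi", stream.read(12))
    (name_len,) = struct.unpack(">i", stream.read(4))
    name = stream.read(2 * name_len).decode("utf-16-be")  # assumption: UTF-16BE
    return {
        "next_group_pos": next_pos,
        "first_set_pos": first_set_pos,
        "num_data_sets": num_sets,
        "name": name,
    }


def iter_data_groups(stream, first_group_pos):
    """Yield every data group header, following the next-group positions."""
    position = first_group_pos
    while position != 0:  # 0 marks the last data group
        group = read_data_group_header(stream, position)
        yield group
        position = group["next_group_pos"]
```

Following the stored positions, rather than assuming the groups are contiguous, lets a reader skip directly over data it does not need.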

Data Set
This section describes the data for a single data set item (probe set, sequence, allele, etc.). The file supports one or more data sets within a data group.

Item Description Type

1 The file position of the first data element in the data set. This is the first byte after the data set header. UINT

2 The file position of the next data set within the data group. When this is the last data set in the data group the value shall be 1 byte past the end of the data set. This way the size of the data set may be determined. UINT

3 The data set name. WSTRING

4 The number of name/value/type parameters. INT

5 Array of name/value/type parameters. (WSTRING / VALUE / TYPE) [ ]

6 Number of columns in the data set.
Example: For expression arrays, columns may include signal, p-value, and detection call; for genotyping arrays, columns may include allele call and confidence value. For universal arrays, columns may include probe set intensities and background. UINT

7 An array of column names, column value types and column type sizes (one per column).
The value type shall be represented by the value from the value type table. The size shall be the size of the type in bytes. For strings, this value shall be the size of the string in bytes plus 4 bytes for the string length written before the string in the file. (WSTRING / BYTE / INT) [ ]

8 The number of rows in the data set. UINT

9 The data set table, consisting of rows of columns (data values). The specific type and size of each column is described by the column value types and sizes in item 7. ROW [ ]
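
The column definitions are enough to decode every row. The Python sketch below reads a data set header and its table. It makes several assumptions that go beyond this description: each cell occupies exactly the column size in bytes, string cells hold an INT length followed by the characters and any padding, WCHAR data is big-endian UTF-16, and parameter values are left as raw bytes. All names are hypothetical.

```python
import io
import struct

# Value type code -> struct format character for the fixed-size types.
FIXED = {0: "b", 1: "B", 2: "h", 3: "H", 4: "i", 5: "I", 6: "f"}


def _int(s):
    return struct.unpack(">i", s.read(4))[0]


def _uint(s):
    return struct.unpack(">I", s.read(4))[0]


def _raw_string(s):
    return s.read(_int(s))


def _wstring(s):
    return s.read(2 * _int(s)).decode("utf-16-be")


def read_cell(s, value_type, size):
    """Decode one cell; assumes the cell occupies exactly `size` bytes."""
    cell = s.read(size)
    if value_type in FIXED:
        return struct.unpack(">" + FIXED[value_type], cell)[0]
    length = struct.unpack_from(">i", cell, 0)[0]
    if value_type == 7:  # STRING
        return cell[4:4 + length].decode("latin-1")
    if value_type == 8:  # WSTRING
        return cell[4:4 + 2 * length].decode("utf-16-be")
    raise ValueError("unknown value type %d" % value_type)


def read_data_set(s):
    """Read a data set header and all of its rows from the current position."""
    first_element_pos = _uint(s)
    next_set_pos = _uint(s)
    name = _wstring(s)
    parameters = []
    for _ in range(_int(s)):
        parameters.append((_wstring(s), _raw_string(s), _wstring(s)))
    columns = []
    for _ in range(_uint(s)):
        col_name = _wstring(s)
        value_type, size = struct.unpack(">bi", s.read(5))  # BYTE type, INT size
        columns.append((col_name, value_type, size))
    num_rows = _uint(s)
    s.seek(first_element_pos)  # item 1: first byte after the data set header
    rows = [
        tuple(read_cell(s, vt, size) for (_name, vt, size) in columns)
        for _row in range(num_rows)
    ]
    return {"name": name, "parameters": parameters, "columns": columns,
            "rows": rows, "next_set_pos": next_set_pos}
```

Since every column has a known size, a reader that needs only one row can also compute its offset from the first element position instead of reading the whole table.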

Affymetrix GUIDs
Affymetrix GUIDs are universally unique identifiers (UUIDs) used to identify files and retain relationships between files. For example, "lineage GUIDs" are used to establish parent-child relationships between files. "Execution GUIDs" are used to identify CHP files generated during the same analysis run.

To allow flexibility with our software, Affymetrix does not require GUIDs to be compliant with an established format such as RFC 4122. It is the responsibility of the users of our software to ensure that UUIDs are unique.
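
Because the format leaves the GUID scheme open, any generator that makes collisions sufficiently unlikely can be used. The sketch below shows one option, not a scheme prescribed by Affymetrix: it uses Python's standard uuid module to produce a random (RFC 4122 version 4) identifier and serializes it as a STRING, that is, an INT length followed by 1-byte characters. The function names are hypothetical.

```python
import struct
import uuid


def new_file_guid() -> str:
    """Create a new identifier; a random UUID is one possible choice."""
    return str(uuid.uuid4())


def encode_guid(guid: str) -> bytes:
    """Serialize a GUID as a STRING: big-endian INT length, then the characters."""
    raw = guid.encode("ascii")
    return struct.pack(">i", len(raw)) + raw
```

A writer would call new_file_guid again whenever the contents of a file change, as described for item 2 of the generic data header.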