Variant file format

File format description

Introduction

The data files used by programs in the Agile suite to store variant and read depth data are simple tab-delimited plain text files. The data for each variant occupy at least two lines: the first line contains data on the sequence variant, and the subsequent lines contain data on the effect the variant has on the gene's transcripts (one line per transcript). Single base changes and deletions use a common line format, whereas insertion variants use a different format. Lines describing a single base variant start with an 'S' and lines describing an insert begin with an 'I'. Each format is described below.
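The two-level layout described above (one variant line followed by its transcript lines) could be read along the following lines. This is only a minimal sketch: the function name read_variant_records is hypothetical, and it assumes that the second tab-separated field of each variant line (cell B in the descriptions below) gives the number of transcript lines that follow. The transcript lines are kept as raw fields, and the sample text is a made-up placeholder rather than real Agile output.

```python
import io


def read_variant_records(handle):
    """Yield (variant_fields, transcript_rows) for each variant in the file.

    Assumes the first field is 'S' or 'I' and the second field (cell B)
    is the number of transcript lines that follow the variant line.
    """
    lines = iter(handle)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines between records
        fields = line.split("\t")
        if fields[0] not in ("S", "I"):
            raise ValueError(f"unexpected line type: {fields[0]!r}")
        transcript_count = int(fields[1])
        transcripts = []
        for _ in range(transcript_count):
            row = next(lines, None)
            if row is None:
                raise ValueError("file ended before all transcript lines were read")
            transcripts.append(row.rstrip("\r\n").split("\t"))
        yield fields, transcripts


# Placeholder text: only the first two fields of each variant line matter here.
sample = "S\t2\t1\nrow-a\nrow-b\nI\t1\t1\nrow-c\n"
records = list(read_variant_records(io.StringIO(sample)))
```

Reading the transcript lines by count, rather than by inspecting their first cell, avoids relying on any detail of the transcript line layout that is not stated in this description.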

Single base variants

When opened in a spreadsheet program, the data for a single base variant occupy a number of cells, labelled A to U in Figure 1. Data describing a single base sequence variant are identified by an 'S' in the first cell of the first data line of the variant (see A in Figure 1A).

Figure 1: The file format for a single base sequence variant.

A: This cell contains the row's data format type, with 'S' referring to a single base variant format and 'I' indicating an insertion variant. All data in this format starts with an 'S'.
B: This value identifies the number of novel transcript variants created. For example, suppose a gene has three transcripts, where transcript A has a different start sequence to the other two, transcript C has a different end sequence to the others, and the variant is present in all three transcripts. The annotation for the variant will then be the same in transcripts B and C, but will differ for transcript A. There will therefore be two novel transcript variants, and so two lines of transcript data after this line (Figure 1B shows a variant with two transcript data lines).
C: This identifies the chromosome's number.
D: If this cell contains the word 'TRUE' the gene is on the forward strand, while 'FALSE' indicates the gene is on the reverse strand.
E: This number indicates if the variant is a substitution (0) or deletion (1).
F: This is the variant's chromosomal position.
G: The name of the gene linked to the variant is in this cell.
H: This value identifies the variant nucleotide, with possible values of A, C, G or T for substitutions and B, D, H or U for del A, del C, del G or del T respectively. Figure 1C shows the data for a deletion (del C).
I: The reference sequence nucleotide.
J, K, L and M: The number of reads mapping to each nucleotide, in the order A, C, G and T.
N: The number of reads suggesting a deletion.
O: The variant's status. If the variant has an RS number it is shown; otherwise the value is 'U' (not found in the 1000 Genomes Project), 'T' (found in the 1000 Genomes Project, but has no RS number) or 'N' (the data have not been filtered by AgileKnownSNPFilter).
P: The number of the transcript variant, 0 = first variant, 1 = 2nd variant etc. (see item B for details).
Q: Type and position of the mutation with reference to the protein's sequence. WT = wild type, In = intronic, Sp = splice site, KS = Kozak consensus sequence. If the variant changes the amino acid sequence, the substitution is shown, e.g. I>G is an isoleucine to glycine substitution and I>FS is a frameshift mutation in a codon coding for isoleucine.
R: A list of the CCDS transcript IDs.
S: A number indicating the location of the variant, with possible values of 0 (Intron (5')), 1 (Intron (3')), 2 (Splice site (3')), 3 (Splice site (5')), 4 (Exon) and 5 (Kozak site).
T: The variant's distance from the start codon for transcripts on the chromosome's forward strand, or from the stop codon for transcripts on the chromosome's reverse strand. If the variant is not in the coding sequence this value is set to -1.
U: The number of amino acids between the affected codon and the start codon for transcripts on the chromosome's forward strand, or the stop codon for transcripts on the chromosome's reverse strand. If the variant is not in the coding sequence this value is set to -1.
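Several of the cells listed above hold short codes (items E, H, O and S). A minimal sketch of how those codes might be turned into readable text is given below. The dictionary and function names are hypothetical, and the descriptive strings are paraphrased from the list above rather than taken from the Agile programs themselves.

```python
# Item E: variant type.
VARIANT_TYPES = {0: "substitution", 1: "deletion"}

# Item H: variant nucleotide; B, D, H and U stand for deletions.
VARIANT_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "B": "del A", "D": "del C", "H": "del G", "U": "del T",
}

# Item S: location of the variant.
LOCATION_CODES = {
    0: "Intron (5')",
    1: "Intron (3')",
    2: "Splice site (3')",
    3: "Splice site (5')",
    4: "Exon",
    5: "Kozak site",
}

# Item O: any value other than these is taken to be an RS number.
STATUS_CODES = {
    "U": "not found in the 1000 Genomes Project",
    "T": "found in the 1000 Genomes Project, no RS number",
    "N": "not filtered by AgileKnownSNPFilter",
}


def describe_variant(reference, code):
    """Combine the reference nucleotide (item I) with the variant code (item H)."""
    variant = VARIANT_CODES[code]
    if variant.startswith("del"):
        return variant
    return f"{reference}>{variant}"


def describe_status(value):
    """Return the meaning of item O, or the value itself if it is an RS number."""
    return STATUS_CODES.get(value, value)
```

For example, describe_variant("C", "D") gives "del C", matching the deletion shown in Figure 1C, while describe_variant("A", "G") gives "A>G".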

Insert variants

When opened in a spreadsheet program, the data for a DNA insert variant occupy a number of cells, labelled A to U in Figure 2. Data describing an insert variant are identified by an 'I' in the first cell of the first data line of the variant (see A in Figure 2).

Figure 2: The file format for an insert sequence variant.

A: This cell contains the row's data format type, with 'S' referring to a single base variant format and 'I' indicating an insertion variant. Data for insert variants always start with an 'I'.
B: This identifies the number of novel transcript variants created. For example, suppose a gene has three transcripts, where transcript A has a different start sequence to the other two, transcript C has a different end sequence to the others, and the variant is present in all three transcripts. The annotation for the variant will then be the same in transcripts B and C, but will differ for transcript A. There will therefore be two novel transcript variants, and so two lines of transcript data after this line (Figure 1B shows a variant with two transcript data lines).
C: This identifies the chromosome's number.
D: If this cell contains the word 'TRUE' the gene is on the forward strand while 'FALSE' indicates the gene is on the reverse strand.
E: This number indicates that the variant is an insertion (2); this format always has a value of 2.
F: This is the variant's chromosomal position.
G: This cell contains the name of the gene linked to the variant.
H: This lists the different inserts identified at this position and the number of reads each insert was found in. In Figure 2 the value is C:23-N:1 which indicates that 23 reads suggested that a C was inserted and a single read suggested a single base was inserted, but the quality score was too low to call the nucleotide, which was set to N.
I, J, K and L: The read depths of each nucleotide, in the order A, C, G and T.
M: The number of reads suggesting a deletion at this location.
N: The number of reads suggesting an insert (of any sequence) at this position.
O: If 'TRUE' the variant is homozygous, while if 'FALSE' the variant is heterozygous. These values are used to set the initial state of the variant and are overridden when the read depth and/or allele frequency parameters are changed.
P: The variant's status. If the variant has an RS number it is shown; otherwise the value is 'U' (not found in the 1000 Genomes Project), 'T' (found in the 1000 Genomes Project, but has no RS number) or 'N' (the data have not been filtered by AgileKnownSNPFilter).
Q: The number of the transcript variant, 0 = first variant, 1 = 2nd variant etc. (see item B for details).
R: A list of the CCDS transcript IDs.
S: A number indicating the location of the variant, with possible values of 0 (Intron (5')), 1 (Intron (3')), 2 (Splice site (3')), 3 (Splice site (5')), 4 (Exon) and 5 (Kozak site).
T: The variant's distance from the start codon for transcripts on the chromosome's forward strand, or from the stop codon for transcripts on the chromosome's reverse strand. If the variant is not in the coding sequence this value is set to -1.
U: The number of amino acids between the affected codon and the start codon for transcripts on the chromosome's forward strand, or the stop codon for transcripts on the chromosome's reverse strand. If the variant is not in the coding sequence this value is set to -1.
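The value in cell H of an insert line packs several inserts and their read counts into one string, such as the C:23-N:1 example described for Figure 2. A minimal sketch of splitting such a value is shown below. The function name parse_insert_counts is hypothetical, and the separators ('-' between inserts, ':' between the inserted sequence and its read count) are inferred from that single example, so other values may need different handling.

```python
def parse_insert_counts(cell):
    """Split a value such as 'C:23-N:1' into {inserted sequence: read count}."""
    counts = {}
    for item in cell.split("-"):
        # Split on the last ':' so the read count is always the final part.
        sequence, reads = item.rsplit(":", 1)
        counts[sequence] = int(reads)
    return counts


inserts = parse_insert_counts("C:23-N:1")
# The insert supported by the most reads.
best_supported = max(inserts, key=inserts.get)
```

Here inserts holds 23 reads for an inserted C and 1 read for an uncalled base (N), and best_supported is "C".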

Read depth file format

The read depth file format is shown in Table 1 below. The file is a tab-delimited plain text file, with each line containing the read depth information for a single exon. When opened in a spreadsheet application, the first column identifies the chromosome that contains the gene named in the second column. The third column identifies the exon, with numbering starting at 0 rather than 1. Exons are numbered from the p telomere end of the gene, so the exons of genes encoded on the reverse strand of a chromosome are numbered in the opposite direction to that expected. The remaining three columns contain the read depth values that 95%, 90% and 50% of the positions in each exon meet or exceed. For example, row one of Table 1 relates to the first exon (as judged by its closeness to the p telomere) of SAMD11: 95% of the coding positions have a read depth of 62 reads or more, 90% of the positions have a read depth of 66 reads or more and 50% of the positions have a read depth of 78 reads or more. The last value is equivalent to the median read depth of the coding sequence of the exon. If a gene has no reads mapped to its exons, the gene will not appear in this list and all of its exon read depth values are treated as 0.

Chromosome	Gene name	Exon number	95% read depth	90% read depth	50% read depth
1	SAMD11	0	62	66	78
1	SAMD11	1	3	3	11
1	SAMD11	2	13	14	17
1	SAMD11	3	6	8	35
1	SAMD11	4	33	34	48
1	SAMD11	5	6	6	10
1	SAMD11	6	5	6	8
1	SAMD11	7	0	0	0
1	SAMD11	8	0	0	0
1	SAMD11	9	0	0	1
1	SAMD11	10	15	16	21
1	SAMD11	11	7	8	11
1	SAMD11	12	3	3	18
1	NOC2L	0	0	1	5
1	NOC2L	1	311	330	400
1	NOC2L	2	238	263	422
1	NOC2L	3	42	48	57
1	NOC2L	4	32	33	39
1	NOC2L	5	9	9	27
1	NOC2L	6	132	144	184
1	NOC2L	7	14	18	26
1	NOC2L	8	44	56	102
1	NOC2L	9	32	40	78
1	NOC2L	10	52	54	110
1	NOC2L	11	9	12	15
1	NOC2L	12	37	37	49
1	NOC2L	13	107	126	319
1	NOC2L	14	114	118	161
1	NOC2L	15	367	414	542
1	NOC2L	16	19	31	71
1	NOC2L	17	9	12	21
1	NOC2L	18	0	0	0

Table 1
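The rows of Table 1 could be read, and the meaning of the three depth columns illustrated, with a sketch like the one below. The names ExonDepth, parse_read_depth_line and depth_met_by are hypothetical. The parser assumes six tab-separated fields per data row and does not handle a header row, since it is not stated whether the file itself contains one. The depth_met_by function is only one plausible reading of "the read depth that a given percentage of positions meet or exceed"; it is not taken from the Agile programs.

```python
from typing import NamedTuple


class ExonDepth(NamedTuple):
    chromosome: str
    gene: str
    exon: int        # numbered from 0, starting at the p telomere end
    depth_95: int    # depth met or exceeded by 95% of positions
    depth_90: int    # depth met or exceeded by 90% of positions
    depth_50: int    # depth met or exceeded by 50% of positions


def parse_read_depth_line(line):
    """Parse one tab-delimited data row of the read depth file."""
    chromosome, gene, exon, d95, d90, d50 = line.rstrip("\r\n").split("\t")
    return ExonDepth(chromosome, gene, int(exon), int(d95), int(d90), int(d50))


def depth_met_by(depths, percent):
    """Largest depth that at least `percent` % of positions meet or exceed.

    Assumption: this mirrors the description of the three depth columns.
    """
    if not depths:
        return 0
    ranked = sorted(depths, reverse=True)
    # Integer ceiling of percent * n / 100, with at least one position.
    needed = max(1, (percent * len(ranked) + 99) // 100)
    return ranked[needed - 1]


row = parse_read_depth_line("1\tSAMD11\t0\t62\t66\t78")
# Made-up per-position depths, used only to illustrate depth_met_by.
example_depths = [10, 20, 30, 40, 50]
```

With the made-up depths above, depth_met_by(example_depths, 50) returns 30, which is also the median of the five values, in line with the statement that the 50% column is equivalent to the median read depth.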