File Type | File Name | Description |
---|---|---|
FASTQ file | Sample_1.fastq.gz | Raw read1 sequence data |
FASTQ file | Sample_2.fastq.gz | Raw read2 sequence data |
BAM file | Sample.recal.bam | BWA alignment file |
BAM file | Sample.recal.bam.bai | BWA alignment index file |
Variant Call Result | Sample.final.vcf | SNP/INDEL file (VCF format) |
Variant Call Result | Sample.g.vcf | Genomic VCF |
Variant Call Result | Sample_SNP_Indel_ANNO.xlsx | Annotated variant list file (Excel file) |
Summary | All_samples_stats.xlsx | Analysis stats report of all samples (Excel file) |
Example:
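Each read in a FASTQ file consists of four lines: a sequence identifier, the base sequence, a separator line beginning with "+", and a quality string.
Each character in the quality string encodes the quality score of the base at the same position, using Phred+33 encoding (the character's ASCII code minus 33).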
Q = -10 log10(error rate)
Phred Quality Score | Probability of Incorrect Base Call | Base Call Accuracy |
---|---|---|
10 | 1 in 10 | 90% |
20 | 1 in 100 | 99% |
30 | 1 in 1000 | 99.9% |
40 | 1 in 10000 | 99.99% |
50 | 1 in 100000 | 99.999% |
60 | 1 in 1000000 | 99.9999% |
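The relationship between quality characters, Phred scores, and error rates can be sketched in a few lines of Python. This is a minimal illustration, not part of the delivered pipeline; the function names and the quality string used in the usage note are made up for the example.

```python
import math


def phred33_to_scores(quality):
    """Decode a Phred+33 quality string into a list of integer Q scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q):
    """Probability of an incorrect base call for a Phred score q."""
    return 10 ** (-q / 10)


def phred_score(error_rate):
    """Phred score for a given error rate: Q = -10 log10(error rate)."""
    return -10 * math.log10(error_rate)
```

For instance, `phred33_to_scores("I5+!")` decodes the characters `I`, `5`, `+`, and `!` (ASCII 73, 53, 43, 33) to the scores 40, 20, 10, and 0, and `error_probability(30)` corresponds to the "1 in 1000" row of the table above.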
HiSeq4000 and NovaSeq6000 group quality scores into specific ranges, or bins, and assign a single value to each range.
For example, the original quality scores 20-24 may form one bin and all be mapped to a new value of 22. Q-score binning significantly reduces storage space requirements without affecting the accuracy or performance of downstream applications. The table below lists the criteria used to bin Q-scores for HiSeq4000.
Q-Score Bins | Example of Empirically Mapped Q-Scores |
---|---|
N(no call) | N(no call) |
2-9 | 7 |
10-19 | 11 |
20-24 | 22 |
25-29 | 27 |
30-34 | 32 |
35-39 | 37 |
40-45 | 42 |
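As an illustration, the binning table above can be expressed as a small lookup. This is only a sketch: the names `Q_BINS` and `bin_q_score` are hypothetical, the mapped values are the example values from the table, and the "N (no call)" row is not modeled because it describes a base call rather than a numeric score.

```python
# (lowest score, highest score, mapped score) for each bin in the table above.
Q_BINS = [
    (2, 9, 7),
    (10, 19, 11),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 32),
    (35, 39, 37),
    (40, 45, 42),
]


def bin_q_score(q):
    """Return the binned value for an original Q score.

    Scores outside every bin raise ValueError (an assumption of this sketch).
    """
    for low, high, mapped in Q_BINS:
        if low <= q <= high:
            return mapped
    raise ValueError("Q score %d is outside the binning table" % q)
```

With this table, `bin_q_score(23)` returns 22, matching the 20-24 example in the text.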
The Variant Call Format (VCF) is a text file format that contains information about variants found at specific positions in a reference genome. The file format consists of meta-information lines, a header line, and data lines. Each data line contains information about a single variant.
Example :
Header | Description |
---|---|
#CHROM | Chromosome |
POS | Position (with the 1st base having position 1) |
ID | The dbSNP rs identifier of the SNP |
REF | Reference base(s) |
ALT | Comma separated list of alternate non-reference alleles called on at least one of the samples |
QUAL | A Phred-scaled quality score assigned by the variant caller. Higher scores indicate higher confidence in the variant (and lower probability of errors). |
FILTER | Filter status: PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, a semicolon-separated list of codes for the filters that failed. See FILTER tag table for possible entries. |
INFO | Additional information: INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format `<key>=<data>[,data]`. |
FORMAT | See FORMAT tag table for possible entries. |
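To make the column layout concrete, the following sketch splits one tab-separated VCF data line into the columns described above. It is a simplified illustration, not a full VCF parser; the function names and the sample record in the usage note are invented, and values are kept as strings except where noted.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info):
    """Split an INFO string into a dict; keys without a value become True."""
    result = {}
    if info == ".":
        return result
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        result[key] = value.split(",") if sep else True
    return result


def parse_vcf_line(line):
    """Parse one VCF data line (not a '#' header line) into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields))
    record["POS"] = int(record["POS"])            # 1-based position
    record["ALT"] = record["ALT"].split(",")      # one entry per ALT allele
    record["FILTER"] = record["FILTER"].split(";")
    record["INFO"] = parse_info(record["INFO"])
    # Per-sample columns follow FORMAT and use the same colon-separated keys.
    format_keys = fields[8].split(":") if len(fields) > 8 else []
    record["SAMPLES"] = [dict(zip(format_keys, s.split(":"))) for s in fields[9:]]
    return record
```

For a made-up line such as `"chr1\t12345\t.\tA\tG\t50.0\tPASS\tAC=1;AF=0.5;AN=2;DB;DP=30\tGT:AD:DP:GQ:PL\t0/1:15,15:30:99:500,0,480"`, the result has `POS` equal to 12345, `INFO["DB"]` set to `True`, and `SAMPLES[0]["GT"]` equal to `"0/1"`.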
FILTER Tag | Description |
---|---|
LowQual | Low quality |
MG_INDEL_Filter | `QD < 2.0 \|\| FS > 200.0 \|\| ReadPosRankSum < -20.0` |
MG_SNP_Filter | `QD < 2.0 \|\| FS > 60.0 \|\| MQ < 40.0 \|\| MQRankSum < -12.5 \|\| ReadPosRankSum < -8.0` |
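The two filter expressions are logical ORs of threshold tests on INFO annotations, so a variant is tagged as soon as any one condition holds. The sketch below reproduces that logic in Python for illustration only. It is not the variant caller's own filtering code, the names are hypothetical, and skipping annotations that are absent from a record is an assumption of this sketch.

```python
# Threshold tests taken from the filter table above: (annotation, operator, limit).
MG_SNP_FILTER = [
    ("QD", "<", 2.0),
    ("FS", ">", 60.0),
    ("MQ", "<", 40.0),
    ("MQRankSum", "<", -12.5),
    ("ReadPosRankSum", "<", -8.0),
]
MG_INDEL_FILTER = [
    ("QD", "<", 2.0),
    ("FS", ">", 200.0),
    ("ReadPosRankSum", "<", -20.0),
]


def fails_filter(info, thresholds):
    """Return True if any threshold test matches the numeric INFO values.

    Annotations missing from `info` are skipped (assumption of this sketch).
    """
    for key, op, limit in thresholds:
        value = info.get(key)
        if value is None:
            continue
        if op == "<" and value < limit:
            return True
        if op == ">" and value > limit:
            return True
    return False
```

A record with `FS = 100.0` would match the SNP expression (FS > 60.0) but not the INDEL expression (FS > 200.0), which shows why the two variant classes are filtered separately.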
INFO Tag | Description |
---|---|
AC | Allele count in genotypes, for each ALT allele, in the same order as listed |
AF | Allele Frequency, for each ALT allele, in the same order as listed |
AN | Total number of alleles in called genotypes |
BaseQRankSum | Z-score from Wilcoxon rank sum test of Alt vs. Ref base qualities |
ClippingRankSum | Z-score from Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases |
DB | dbSNP Membership |
DP | Approximate read depth; some reads may have been filtered |
FS | Phred-scaled p-value using Fisher’s exact test to detect strand bias |
HaplotypeScore | Consistency of the site with at most two segregating haplotypes |
InbreedingCoeff | Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation |
MLEAC | Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed |
MLEAF | Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed |
MQ | RMS Mapping Quality |
MQ0 | Total Mapping Quality Zero Reads |
MQRankSum | Z-score from Wilcoxon rank sum test of Alt vs. Ref read mapping qualities |
QD | Variant Confidence/Quality by Depth |
ReadPosRankSum | Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias |
SOR | Symmetric Odds Ratio of 2x2 contingency table to detect strand bias |
set | Source VCF for the merged record in CombineVariants |
SNP | Variant is a SNP |
MNP | Variant is an MNP |
INS | Variant is an insertion |
DEL | Variant is a deletion |
MIXED | Variant is a mixture of INS/DEL/SNP/MNP |
HOM | Variant is homozygous |
HET | Variant is heterozygous |
VARTYPE | Comma separated list of variant types. One per allele. |
FORMAT Tag | Description |
---|---|
GT | Genotype: 0/0 = the sample is homozygous reference; 0/1 = the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles; 1/1 = the sample is homozygous alternate |
AD | Allelic depths for the ref and alt alleles in the order listed. |
DP | Read depth at this position for this sample |
GQ | Conditional genotype quality, encoded as a Phred quality |
PL | The normalized, Phred-scaled likelihoods for each of the 0/0, 0/1, and 1/1 genotypes, without priors. The most likely genotype (given in the GT field) is scaled so that its P = 1.0 (0 when Phred-scaled), and the other likelihoods reflect their Phred-scaled likelihoods relative to this most likely genotype. |
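The GT and PL fields can be interpreted with a short sketch like the one below. It assumes a diploid, biallelic site as in the table above; the function names and the PL values in the usage note are illustrative only.

```python
def describe_genotype(gt):
    """Translate a diploid GT string such as '0/1' into a description."""
    alleles = gt.replace("|", "/").split("/")  # '|' marks a phased genotype
    if "." in alleles:
        return "no call"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"


def pl_to_relative_likelihoods(pl):
    """Convert Phred-scaled PL values back to likelihoods relative to the best genotype."""
    return [10 ** (-value / 10) for value in pl]


def most_likely_genotype_index(pl):
    """Index (0 = 0/0, 1 = 0/1, 2 = 1/1) of the genotype with the smallest PL."""
    return pl.index(min(pl))
```

For PL values of 500, 0, and 480, the heterozygous genotype (index 1) has PL 0 and therefore a relative likelihood of 1.0, while the two homozygous genotypes are many orders of magnitude less likely.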