Multisample SNP Calling (original) (raw)

See http://htslib.org/ for the new 1.x releases of SAMtools, BCFtools, and HTSlib. This website contains information pertaining to the old 0.1.19 samtools release, and so is useful but somewhat out of date. As time permits, this information will be updated for the new samtools/bcftools versions and moved to the new website.

Calling SNPs/INDELs with SAMtools/BCFtools

The basic Command line

Suppose we have reference sequences in ref.fa, indexed by samtools faidx, and position sorted alignment files aln1.bam and aln2.bam, the following command lines call SNPs and short INDELs: where the -D option sets the maximum read depth to call a SNP. SAMtools acquires sample information from the SM tag in the @RG header lines. One alignment file can contain multiple samples; reads from one sample can also be distributed in different alignment files. SAMtools will regroup the reads anyway. In addition, if no @RG lines are present, each alignment file is taken as one sample.

Since r865, it is possible to generate the consensus sequence with

Understanding the command line

In the command line above, samtools collects summary information in the input BAMs, computes the likelihood of data given each possible genotype and stores the likelihoods in the BCF format (see below). It does not call variants.

Bcftools applies the prior and does the actual calling. It can also concatenate BCF files, index BCFs for fast random access and convert BCF to VCF. In addition, bcftools can operate on some VCFs (e.g. calling SNPs from GL-tagged VCFs), but not for all VCFs; VCF to BCF conversion is not working at the moment, either.

Tuning the parameters

One should consider to apply the following parameters to mpileup in different scenarios:

Apply -C50 to reduce the effect of reads with excessive mismatches. This aims to fix overestimated mapping quality and appears to be preferred for BWA-short.
Given multiple technologies, apply -P to specify which technologies to use for collecting initial INDEL candidates. It is recommended to find INDEL candidates from technologies with low INDEL error rate, such as Illumina. When this option is in use, the value(s) following the option must appear in the PL tag in the @RG header lines.
Apply -D and -S to keep per-sample read depth and strand bias. This is preferred if there are more than one samples at high coverage.
Adjust -m and -F to control when to initiate indel realignment (requiring r877+). Samtools only finds INDELs where there are sufficient reads containing the INDEL at the same position. It does this to avoid excessive realignment that is computationally demanding. The default works well for many low-coverage samples but not for, say, 500 exomes. In the latter case, using -m 3 -F 0.0002 (3 supporting reads at minimum 0.02% frequency) is necessary to find singletons.
Use `-BQ0 -d10000000 -f ref.fa' if the purpose is to get the precise depth of coverage rather than call SNPs. Under this setting, mpileup will count low-quality bases, process all reads (by default the depth is capped at 8000), and skip the time-demanding BAQ calculation.
Apply -A to use anomalous read pairs in mpileup, which are not used by default (requring r874+).

Understanding the output: the VCF/BCF format

The VCF format

The Variant Call Format (VCF) is the emerging standard for storing variant data. Originally designed for SNPs and short INDELs, it also works for structural variations.

VCF consists of a header section and a data section. The header must contain a line starting with one '#', showing the name of each field, and then the sample names starting at the 10th column. The data section is TAB delimited with each line consisting of at least 8 mandatory fields (the first 8 fields in the table below). The FORMAT field and sample information are allowed to be absent. We refer to the official VCF spec for a more rigorous description of the format.

Col	Field	Description
1	CHROM	Chromosome name
2	POS	1-based position. For an indel, this is the position preceding the indel.
3	ID	Variant identifier. Usually the dbSNP rsID.
4	REF	Reference sequence at POS involved in the variant. For a SNP, it is a single base.
5	ALT	Comma delimited list of alternative seuqence(s).
6	QUAL	Phred-scaled probability of all samples being homozygous reference.
7	FILTER	Semicolon delimited list of filters that the variant fails to pass.
8	INFO	Semicolon delimited list of variant information.
9	FORMAT	Colon delimited list of the format of individual genotypes in the following fields.
10+	Sample(s)	Individual genotype information defined by FORMAT.

The BCF Format

BCF, or the binary variant call format, is the binary version of VCF. It keeps the same information in VCF, while much more efficient to process especially for many samples. The relationship between BCF and VCF is similar to that between BAM and SAM. The detailed format description of the BCF format can be found in bcf.tex included in the samtools source code package.

SAMtools/BCFtools specific information

SAMtools/BCFtools may write the following tags in the INFO field in VCF/BCF.

Tag	Description
I16	16 integers: 1#reference Q13 bases on the forward strand2#reference Q13 bases on the reverse strand3#non-ref Q13 bases on the forward strand4#non-ref Q13 bases on the reverse strand5sum of reference base qualities6sum of squares of reference base qualities7sum of non-ref base qualities8sum of squares of non-ref base qualities9sum of ref mapping qualities10sum of squares of ref mapping qualities11sum of non-ref mapping qualities12sum of squares of non-ref mapping qualities13sum of tail distance for ref bases14sum of squares of tail distance for ref bases15sum of tail distance for non-ref bases16sum of squares of tail distance for non-ref
INDEL	Indicating the variant is an INDEL.
DP	The number of reads covering or bridging POS.
DP4	Number of 1) forward ref alleles; 2) reverse ref; 3) forward non-ref; 4) reverse non-ref alleles, used in variant calling. Sum can be smaller than DP because low-quality bases are not counted.
PV4	P-values for 1) strand bias (exact test); 2) baseQ bias (t-test); 3) mapQ bias (t); 4) tail distance bias (t)
FQ	Consensus quality. If positive, FQ equals the phred-scaled probability of there being two or more different alleles. If negative, FQ equals the minus phred-scaled probability of all chromosomes being identical. Notably, given one sample, FQ is positive at hets and negative at homs.
AF1	EM estimate of the site allele frequency of the strongest non-reference allele.
CI95	Equal-tail (Bayesian) credible interval of the site allele frequency at the 95% level.
PC2	Phred-scaled probability of the alternate allele frequency of group1 samples being larger (,smaller) than of group2 samples.
PCHI2	Posterior weighted chi^2 P-value between group1 and group2 samples. This P-value is conservative.
QCHI2	Phred-scaled PCHI2
RP	Number of permutations yeilding a smaller PCHI2

For a read, suppose the k-th base is overlapping the reference base we are looking at. The tail distance for this read equals min{k-1,l-k} where l is the length of the read. If tail distance is small, the difference on the read may be caused by undetected indels.
In the BCF generated by SAMtools, an non-ref base 'X' represents an base has not been seen from the alignment data. Such a base is necessary to evaluate the probability of missing a non-ref allele due to sampling fluctuation.
FQ is computed by taking a folded prior (i.e. p'(k)=[p(k)+p(n-k)]/2). It measures if the locus is polymophic among samples. The vcf2fq command replies on this annotation.
SAMtools/BCFtools writes genotype likelihoods in the PL format which is a comma delimited list of phred-scaled data likelihoods of each possible genotype. For example, suppose REF=C and ALT=A,G, PL=7,0,37,13,40,49 means for the sample we are looking at, P(D|CC)=10^{-0.7}, P(D|CA)=1, P(D|AA)=10^{-3.7}, P(D|CG)=10^{-1.3}, P(D|AG)=1e-4 and P(D|GG)=10^{-4.9}. This ordering has been changed since r921.
PC2, PCHI2 and QCHI2 are only present with `view -1' is set.

Performing Association Test

The following command performs association test: where `xxx' is a file containing the list of samples with the first `yyy' samples being cases (or controls) and the rest being controls (or cases). In the output, the PCHI2 INFO field gives the P-value of association. This P-value is known to be conservative, in the sense that the true P-value is probably smaller (more significant). Nonetheless, in practice PCHI2 appears to be a good indicator of associations. On a QQ-plot for negative controls, expected P-value and observed PCHI2 fall on a straight line, though not on the diagonal.

To further calibrate P-value, one may perform a permutation test with the `-U' option. The `RP' INFO field will give the number of permutations which yield a smaller PCHI2 test statistics. Permutation test is very slow, due to the genotype ambiguity in sequencing data.

Estimating the Allele Frequency Spectrum

While calling SNPs, bcftools will print the estimated AFS to the error output at lines starting with [afs]. However, to accurately estimate AFS, we need to iterate the procedure. In most applications, we are only interested in the AFS conditional on a list of loci. Suppose we have the list of loci in file cond.txt with the first two columns in the file giving the coordinates, the procedure to estimate AFS is: until the AFS converges, which usually takes less than 10 rounds of EM iterations. The first command line above extracts sites in cond.txt for efficiency in later steps. Option -P specifies the initial AFS (in SNP calling, this is prior), which can be a file (as in the 3rd and 4th command lines) or 'full', 'cond2' or 'flat' (as in the 2nd command line). Choosing the right initial AFS helps accuracy and reduces iterations and potential overfitting.

Another way to estimate AFS is to get the list of site allele frequency by and then derive the histogram. It should be noted that AF1 does not use the initial AFS. No iterations are needed. However, the histogram of AF1 usually has a lot of noises. It is up to the users to decide which method to use.

Base Alignment Quality (BAQ)

Base Alignment Quality (BAQ) is a new concept deployed in samtools-0.1.9+. It aims to provide an efficient and effective way to rule out false SNPs caused by nearby INDELs. The following shows the alignments of 6 reads by a typical read mapper in the presence of a 4bp homozygous INDEL:

coor 12345678901234 5678901234567890123456 ref aggttttataaaac----aattaagtctacagagcaacta sample aggttttataaaacAAATaattaagtctacagagcaacta read1 aggttttataaaacaaAtaa read2 ggttttataaaacaaAtaaTt read3 ttataaaacAAATaattaagtctaca read4 CaaaTaattaagtctacagagcaac read5 aaTaattaagtctacagagcaact read6 T****aattaagtctacagagcaacta

where capital bases represent differences from the reference and underlined bases are the inserted bases. The alignments except for read3 are wrong because the 4bp insertion is misplaced. The mapper produces such alignments because when doing a pairwise alignment, the mapper prefers one or two mismatches over a 4bp insertion. What is hurting more is that the wrong alignments lead to recurrent mismatches, which are likely to deceive most site-independent SNP callers into calling false SNPs.

One way to avoid such an artifact is to do multi-sequence realignment, but the current implementations are very computationally demanding. SAMtools seeks another solution. It assigns each base a BAQ which is the Phred-scaled probability of the base being misaligned. BAQ is low if the base is aligned to a different reference base in a suboptimal alignment, and in this case a mismatch should contribute little to SNP calling even if the base quality is high. With BAQ, the mismatches in the example above are significantly downweighted. SAMtools will not call SNPs from that.

The BAQ strategy is invoked by default in mpileup. To make other SNP callers take advantage of BAQ, one should run to cap base quality by BAQ and then give aln.baq.bam to the SNP callers as the input. For high-coverage single-sample SNP calling, BAQ appears to be as effective as multi-sequence realignment, while being much faster and easier to use. Currently the BAQ strategy is the only practical way to avoid the INDEL artifact in low-coverage multi-sample SNP calling.

Limitations

BCFtools does not properly handle multi-allelic variants. It only takes the strongest non-reference allele.
SAMtools does not properly compute the likelihoods of multi-allelic INDELs.
The VCF file produced by BCFtools does not strictly conform the VCF spec. For example, the GT genotype information is not always present because for the purpose of BCF, GT is unnecessary and takes disk space. In addition, GT is not the first as is required by the VCF spec. This can be fixed by the bcf-fix.pl script that comes with the source code package, and will be fixed in future (fixed in r880+).

Appendix: Use Cases

SNP/INDEL calling for hundreds of exomes

The following shows the detailed procedure on how to call SNPs/INDELs for hundreds of exomes. It only aims to provide an overview of how to handle huge data sets. Some command lines given below may not work for all systems. Advanced users may also want to modify based on their own system configurations.

In the following, the key and the most difficult part is the command line calling samtools mpileup. Once that is done, one can use 3rd party tools or write their own scsripts to achieve the rest.

Input and preparation

Obtain auxiliary scripts/programs: array job submission script asub, basic BED processing utility bedutils.pl,alphanumeric sort and tabix:
Check out a more recent version of samtools
The reference sequence file ref.fa. To avoid potential mistakes, ref.fa is better the one that is used for read mapping. File ref.fa should be indexed by samtools' faidx, if this has not been done.
File bam.list which gives the paths of indexed BAM files, one line per file. These files should be better tagged with read groups; if not, one BAM per sample.
File target.bed which gives the target regions in the BED format.

Procedure

Calling SNPs/INDELs in small regions
Merge regions To make sure merge.bcf is correctly generated, one should use bcftools view to retrieve data towards the end of the last chromosome. If something goes wrong in data processing, no data will be retrieved. When we confirm merge.bcf is correct, we may delete intermediate files part-*.bcf and chr-*.bcf.
Extract target regions Here we use tabix to retrieve target regions, which is a little complicated. In future new functionality may be added to bcftools to simplify this step.
Filtering. For now, an effective way of filtering is unknown. For indels, one may consider the following: although more effective filters may be existing. For INDELs, one may use the fraction of frameshift indels as a proxy to specificity.