A framework for variation discovery and genotyping using next-generation DNA sequencing data

Comparative Study

doi: 10.1038/ng.806. Epub 2011 Apr 10.

Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire, Christopher Hartl, Anthony A Philippakis, Guillermo del Angel, Manuel A Rivas, Matt Hanna, Aaron McKenna, Tim J Fennell, Andrew M Kernytsky, Andrey Y Sivachenko, Kristian Cibulskis, Stacey B Gabriel, David Altshuler, Mark J Daly

Mark A DePristo et al. Nat Genet. 2011 May.

Abstract

Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.

Figures

Figure 1

Framework for variation discovery and genotyping from next-generation DNA sequencing. See text for a detailed description.

Figure 2

IGV visualization of alignments in region chr1:1,510,446–1,510,622 from the (a) Trio NA12878 Illumina reads from 1000 Genomes and (b) NA12878 HiSeq reads before (left) and after (right) multiple sequence realignment. Reads are depicted as arrows oriented by increasing machine cycle; highlighted bases indicate mismatches to the reference: A is green, G is orange, T is red, and deleted bases are dashes; a coverage histogram per base is shown above the reads. Both the 4bp indel (rs34877486) and the C/T polymorphism (rs28788874) are present in dbSNP, as are the artifactual A/G polymorphisms (rs28782535 and rs28783181) resulting from the mis-modeled indel, indicating that these sites are common misalignment errors.

Figure 3

Raw (violet) and recalibrated (blue) base quality scores for NGS paired-end read sets of NA12878 from (a) Illumina/GA, (b) Life/SOLiD, and (c) Roche/454 lanes from 1000 Genomes, and (d) Illumina/HiSeq. For each technology, the top panel shows reported base quality scores compared to the empirical estimates (Methods); the middle panel shows the difference between the average reported and empirical quality score for each machine cycle, with positive and negative cycle values given for the first and second read in the pair, respectively; and the bottom panel shows the difference between reported and empirical quality scores for each of the 16 genomic dinucleotide contexts. For example, the AG context occurs at all sites in a read where G is the current nucleotide and A is the preceding one in the read. Root-mean-square errors (RMSE) are given for the pre- and post-recalibration curves.
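The comparison of reported and empirical quality in this caption can be illustrated with a minimal sketch. It assumes that an empirical quality is the Phred-scaled mismatch rate of all bases sharing a reported quality, and that add-one smoothing is applied to the counts; the smoothing, the function names, and the example counts are illustrative assumptions, not taken from the paper's Methods.

```python
import math


def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(error_rate)


def empirical_quality(mismatches, bases):
    """Empirical quality of a bin of bases (add-one smoothing is an assumption)."""
    return phred((mismatches + 1) / (bases + 1))


def rmse(reported, empirical):
    """Root-mean-square error between paired reported and empirical qualities."""
    squared = [(r - e) ** 2 for r, e in zip(reported, empirical)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical bins: reported quality -> (mismatches to reference, bases observed)
bins = {20: (99, 9999), 30: (9, 999)}
reported = sorted(bins)
empirical = [empirical_quality(*bins[q]) for q in reported]
error = rmse(reported, empirical)
```

In this made-up example the bases reported at Q30 mismatch the reference as often as those reported at Q20, so the Q30 bin is overconfident and dominates the RMSE. The same binning could be applied per machine cycle or per dinucleotide context instead of per reported quality.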

Figure 4

(a) Relationship in the HiSeq call set between strand bias and quality by depth, for genomic locations in HapMap3 (red) and dbSNP (yellow) used for training the variant quality score recalibrator (left) and the same annotations applied to differentiate likely true positive (green) from false positive (purple) novel SNPs. (b,c,d) Quality tranches in the recalibrated HiSeq (b), exome (c), and low-pass CEU (d) calls beginning with (top) the highest-quality but smallest call set with an estimated false positive rate among novel SNP calls of <1/1000 to a more comprehensive call set (bottom) that includes effectively all true positives in the raw call set along with more false positive calls for a cumulative false positive rate of 10%. Each successive call set contains within it the previous tranche’s true and false positive calls (shaded bars) as well as tranche-specific calls of both classes (solid bars). The tranche selected for further analyses here is indicated.

Figure 5

Variation discovered among 60 individuals from the CEPH population from 1000 Genomes pilot phase plus low-pass NA12878. (a) Discovered SNPs by non-reference allele count in the 61-sample CEPH cohort, colored by known (light blue, striped) and novel (dark blue, filled) variation, along with non-reference sensitivity to CEU HapMap3 and 1000 Genomes low-pass variants. (b) Quality and certainty of discovered SNPs by non-reference allele count. The histogram depicts the certainty of called variation broken out into 0.1, 1, and 10% novel FDR tranches. The Ti/Tv ratio is shown for known and novel variation for each allele count, aggregating the novel calls with allele count > 74 due to their limited numbers. (c,d) Genotyping accuracy for NA12878 from reads alone (blue circles) and following genotype-likelihood-based imputation (pink squares) called in the 61-sample call set, as assessed by the NRD rate to HiSeq genotypes, as a function of allele count (c) and sequencing depth (d).
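The two summary statistics named in this caption, the Ti/Tv ratio and the NRD rate, can be sketched as follows. The sketch assumes one common definition of non-reference discordance (the fraction of disagreeing genotypes among sites where at least one of the two call sets is non-reference); the paper's exact definition may differ, and all function names and data are hypothetical.

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """Transition/transversion ratio for an iterable of (ref, alt) base pairs."""
    snps = list(snps)
    ti = sum(1 for pair in snps if pair in TRANSITIONS)
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")


def non_reference_discordance(calls, truth):
    """NRD under the assumed definition.

    Genotypes are alt-allele counts (0, 1, 2) keyed by site. Sites where
    both call sets are homozygous reference are excluded.
    """
    shared = calls.keys() & truth.keys()
    informative = [s for s in shared if calls[s] or truth[s]]
    if not informative:
        return 0.0
    discordant = sum(1 for s in informative if calls[s] != truth[s])
    return discordant / len(informative)


# Hypothetical data for illustration only.
ratio = ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])
nrd = non_reference_discordance(
    {1: 0, 2: 1, 3: 2, 4: 1},
    {1: 0, 2: 1, 3: 1, 4: 0},
)
```

Excluding concordant homozygous-reference sites keeps the large number of trivially matching reference genotypes from masking errors at variant sites.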

Figure 6

Sensitivity and specificity of multi-sample discovery of variation in NA12878 with increasing cohort size for low-pass NA12878 read sets processed with N additional CEPH samples. (a) Receiver operating characteristic (ROC) curves for SNP calls relating specificity and sensitivity to discover non-reference sites from the NA12878 HiSeq call set. The maximum callable sensitivity, 66%, is the percent of sites from the HiSeq call set where at least one read carries the alternate allele in the low-pass data for NA12878; it reflects both differences in the sequencing technologies (36–76bp GAII for the low-pass NA12878 sample vs. 101bp HiSeq) as well as the vagaries of sampling at 4× coverage. Because most of these missed sites are common and are consequently called in the other samples, imputation recovers ~50% of these sites. (b,c) Increasing power to identify strand-biased, likely false positive SNP calls with additional samples. Histograms of the Strand Bias annotation at raw variant calls discovered in the low-pass CEU data using NA12878 at 4× combined with one other CEU individual (b) and with 60 other individuals (c) stratified into sites present (green) and not (purple) in the 1000 Genomes CEU trio.
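The maximum callable sensitivity described in this caption is the share of sites in a reference call set at which the low-pass data contain at least one read carrying the alternate allele. A minimal sketch of that calculation follows; the function name, the site keys, and the counts are hypothetical.

```python
def max_callable_sensitivity(truth_sites, alt_read_counts):
    """Fraction of truth sites with at least one alternate-allele read.

    truth_sites: iterable of site keys from the high-coverage call set.
    alt_read_counts: mapping of site key -> number of low-pass reads
    carrying the alternate allele (missing sites count as zero).
    """
    truth_sites = set(truth_sites)
    if not truth_sites:
        return 0.0
    callable_sites = sum(
        1 for site in truth_sites if alt_read_counts.get(site, 0) >= 1
    )
    return callable_sites / len(truth_sites)


# Hypothetical example: three truth sites, only one with alt-allele support.
truth = {("chr1", 100), ("chr1", 200), ("chr1", 300)}
counts = {("chr1", 100): 2, ("chr1", 200): 0}
ceiling = max_callable_sensitivity(truth, counts)
```

This value is an upper bound on what a caller using only that sample's reads could detect. The caption notes that imputation across the cohort can recover sites beyond it.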
