A framework for variation discovery and genotyping using next-generation DNA sequencing data
Comparative Study
doi: 10.1038/ng.806. Epub 2011 Apr 10.
Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire, Christopher Hartl, Anthony A Philippakis, Guillermo del Angel, Manuel A Rivas, Matt Hanna, Aaron McKenna, Tim J Fennell, Andrew M Kernytsky, Andrey Y Sivachenko, Kristian Cibulskis, Stacey B Gabriel, David Altshuler, Mark J Daly
- PMID: 21478889
- PMCID: PMC3083463
- DOI: 10.1038/ng.806
Mark A DePristo et al. Nat Genet. 2011 May.
Abstract
Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.
Figures
Figure 1
Framework for variation discovery and genotyping from next-generation DNA sequencing. See text for a detailed description.
Figure 2
IGV visualization of alignments in region chr1:1,510,446–1,510,622 from the (a) Trio NA12878 Illumina reads from 1000 Genomes and (b) NA12878 HiSeq reads before (left) and after (right) multiple sequence realignment. Reads are depicted as arrows oriented by increasing machine cycle; highlighted bases indicate mismatches to the reference: A is green, G is orange, T is red, and deleted bases are dashes; a coverage histogram per base is shown above the reads. Both the 4bp indel (rs34877486) and the C/T polymorphism (rs28788874) are present in dbSNP, as are the artifactual A/G polymorphisms (rs28782535 and rs28783181) resulting from the mis-modeled indel, indicating that these sites are common misalignment errors.
Figure 3
Raw (violet) and recalibrated (blue) base quality scores for NGS paired-end read sets of NA12878 from (a) Illumina/GA, (b) Life/SOLiD and (c) Roche/454 lanes from 1000 Genomes, and (d) Illumina/HiSeq. For each technology, the top panel shows reported base quality scores compared to the empirical estimates (Methods); the middle panel shows the difference between the average reported and empirical quality score for each machine cycle, with positive and negative cycle values given for the first and second read in the pair, respectively; and the bottom panel shows the difference between reported and empirical quality scores for each of the 16 genomic dinucleotide contexts. For example, the AG context occurs at all sites in a read where G is the current nucleotide and A is the preceding one in the read. Root-mean-square errors (RMSE) are given for the pre- and post-recalibration curves.
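The comparison of reported and empirical base qualities in this caption can be illustrated with a minimal sketch. It assumes the standard Phred scale (Q = -10 log10 of the error probability) and bins bases by reported quality only; the function names, the +1/+2 smoothing and the base-weighted RMSE are assumptions of this sketch, not details taken from the paper. The same tabulation could be keyed by machine cycle or dinucleotide context instead.

```python
import math
from collections import defaultdict


def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(error_rate)


def empirical_qualities(observations):
    """Tabulate empirical quality per reported quality bin.

    observations: iterable of (reported_q, is_mismatch) pairs, one per base.
    Returns {reported_q: (empirical_q, n_bases)}.
    The +1/+2 smoothing is an assumption of this sketch (it avoids log10(0)).
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [mismatches, bases]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1
    return {
        q: (phred((mismatches + 1) / (bases + 2)), bases)
        for q, (mismatches, bases) in counts.items()
    }


def rmse(table):
    """Base-weighted root-mean-square error between reported and empirical quality."""
    total = sum(bases for _, bases in table.values())
    squared = sum(bases * (q - emp) ** 2 for q, (emp, bases) in table.items())
    return math.sqrt(squared / total)


# Toy data: 998 bases reported as Q30, 9 of which mismatch the reference.
toy = [(30, True)] * 9 + [(30, False)] * 989
table = empirical_qualities(toy)
print(table[30], rmse(table))
```

In this toy case the bases reported as Q30 behave like roughly Q20 bases, so the reported scores overstate accuracy by about 10 Phred units; recalibration aims to shrink exactly this kind of gap.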
Figure 4
(a) Relationship in the HiSeq call set between strand bias and quality by depth, for genomic locations in HapMap3 (red) and dbSNP (yellow) used for training the variant quality score recalibrator (left) and the same annotations applied to differentiate likely true positive (green) from false positive (purple) novel SNPs. (b,c,d) Quality tranches in the recalibrated HiSeq (b), exome (c), and low-pass CEU (d) calls beginning with (top) the highest-quality but smallest call set with an estimated false positive rate among novel SNP calls of <1/1000 to a more comprehensive call set (bottom) that includes effectively all true positives in the raw call set along with more false positive calls for a cumulative false positive rate of 10%. Each successive call set contains within it the previous tranche’s true and false positive calls (shaded bars) as well as tranche-specific calls of both classes (solid bars). The tranche selected for further analyses here is indicated.
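The nested tranche structure described in this caption can be sketched as follows. This is only an illustration: it assumes each novel call carries a recalibrated score and a known true/false label, whereas in practice the false positive status is unknown and must be estimated. The function name `tranche_sizes` and the toy targets are hypothetical.

```python
def tranche_sizes(calls, targets):
    """Find nested tranches of score-ranked calls.

    calls: list of (score, is_false_positive) tuples for novel SNP calls.
    targets: cumulative false positive rates, e.g. (0.001, 0.01, 0.10).
    Returns {target: n_calls}, the largest top-ranked prefix whose cumulative
    false positive rate does not exceed each target.
    """
    ranked = sorted(calls, key=lambda call: call[0], reverse=True)
    sizes = {target: 0 for target in targets}
    false_positives = 0
    for n, (_, is_fp) in enumerate(ranked, start=1):
        false_positives += int(is_fp)
        for target in targets:
            if false_positives / n <= target:
                sizes[target] = n
    return sizes


# Toy call set: ten calls with scores 10..1; ranks 5, 9 and 10 are false positives.
labels = [False, False, False, False, True, False, False, False, True, True]
toy_calls = [(10 - i, is_fp) for i, is_fp in enumerate(labels)]
print(tranche_sizes(toy_calls, targets=(0.1, 0.25, 0.5)))
```

Because every tranche is a prefix of the same ranking, a looser target always yields a call set that contains the stricter one, which is the nesting shown by the shaded bars.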
Figure 5
Variation discovered among 60 individuals from the CEPH population from the 1000 Genomes pilot phase plus low-pass NA12878. (a) Discovered SNPs by non-reference allele count in the 61-sample CEPH cohort, colored by known (light blue, striped) and novel (dark blue, filled) variation, along with non-reference sensitivity to CEU HapMap3 and 1000 Genomes low-pass variants. (b) Quality and certainty of discovered SNPs by non-reference allele count. The histogram depicts the certainty of called variation broken out into 0.1, 1, and 10% novel FDR tranches. The Ti/Tv ratio is shown for known and novel variation for each allele count, aggregating the novel calls with allele count > 74 due to their limited numbers. (c,d) Genotyping accuracy for NA12878 from reads alone (blue circles) and following genotype-likelihood-based imputation (pink squares) called in the 61-sample call set, as assessed by the non-reference discrepancy (NRD) rate to HiSeq genotypes, as a function of allele count (c) and sequencing depth (d).
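Two metrics in this caption, the Ti/Tv ratio and the NRD rate, can be sketched as below. The sketch uses the usual definitions: transitions are A<->G and C<->T substitutions, and NRD (taken here to mean non-reference discrepancy) is the fraction of discordant genotypes among sites where at least one of the two genotypes is non-reference. That the paper computes them exactly this way is an assumption, and the function names are hypothetical.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snps):
    """Transition/transversion ratio for a collection of (ref, alt) base pairs."""
    snps = [(ref, alt) for ref, alt in snps if ref != alt]
    transitions = sum(1 for snp in snps if snp in TRANSITIONS)
    transversions = len(snps) - transitions
    return transitions / transversions if transversions else float("inf")


def nrd_rate(called, truth):
    """Non-reference discrepancy between two genotype lists.

    Genotypes are coded as alternate-allele counts (0, 1 or 2). Sites where
    both genotypes are homozygous reference are excluded, so agreement on
    reference calls cannot mask errors at variant sites.
    """
    pairs = [(c, t) for c, t in zip(called, truth) if not (c == 0 and t == 0)]
    if not pairs:
        return 0.0
    discordant = sum(1 for c, t in pairs if c != t)
    return discordant / len(pairs)


print(titv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]))
print(nrd_rate([0, 1, 2, 1, 0], [0, 1, 1, 1, 1]))
```

Excluding concordant homozygous-reference sites matters because most sites in a cohort call set are reference in any single sample, so a plain concordance rate would look high even for a poor genotyper.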
Figure 6
Sensitivity and specificity of multi-sample discovery of variation in NA12878 with increasing cohort size for low-pass NA12878 read sets processed with N additional CEPH samples. (a) Receiver operating characteristic (ROC) curves for SNP calls relating specificity and sensitivity to discover non-reference sites from the NA12878 HiSeq call set. The maximum callable sensitivity, 66%, is the percent of sites from the HiSeq call set where at least one read carries the alternate allele in the low-pass data for NA12878; it reflects both differences in the sequencing technologies (36–76bp GAII for the low-pass NA12878 sample vs. 101bp HiSeq) as well as the vagaries of sampling at 4× coverage. Because most of these missed sites are common and are consequently called in the other samples, imputation recovers ~50% of these sites. (b,c) Increasing power to identify strand-biased, likely false positive SNP calls with additional samples. Histograms of the Strand Bias annotation at raw variant calls discovered in the low-pass CEU data using NA12878 at 4× combined with one other CEU individual (b) and with 60 other individuals (c) stratified into sites present (green) and not (purple) in the 1000 Genomes CEU trio.
Similar articles
- An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data.
Jun G, Wing MK, Abecasis GR, Kang HM. Genome Res. 2015 Jun;25(6):918-25. doi: 10.1101/gr.176552.114. Epub 2015 Apr 16. PMID: 25883319 Free PMC article.
- A probabilistic method for the detection and genotyping of small indels from population-scale sequence data.
Bansal V, Libiger O. Bioinformatics. 2011 Aug 1;27(15):2047-53. doi: 10.1093/bioinformatics/btr344. Epub 2011 Jun 7. PMID: 21653520 Free PMC article.
- A map of human genome variation from population-scale sequencing.
1000 Genomes Project Consortium; Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. Nature. 2010 Oct 28;467(7319):1061-73. doi: 10.1038/nature09534. PMID: 20981092 Free PMC article.
- Model-based quality assessment and base-calling for second-generation sequencing data.
Bravo HC, Irizarry RA. Biometrics. 2010 Sep;66(3):665-74. doi: 10.1111/j.1541-0420.2009.01353.x. PMID: 19912177 Free PMC article. Review.
- Genome structural variation discovery and genotyping.
Alkan C, Coe BP, Eichler EE. Nat Rev Genet. 2011 May;12(5):363-76. doi: 10.1038/nrg2958. Epub 2011 Mar 1. PMID: 21358748 Free PMC article. Review.
Cited by
- A multi-regional human brain atlas of chromatin accessibility and gene expression facilitates promoter-isoform resolution genetic fine-mapping.
Dong P, Song L, Bendl J, Misir R, Shao Z, Edelstien J, Davis DA, Haroutunian V, Scott WK, Acker S, Lawless N, Hoffman GE, Fullard JF, Roussos P. Nat Commun. 2024 Nov 22;15(1):10113. doi: 10.1038/s41467-024-54448-y. PMID: 39578476 Free PMC article.
- The 1000 Chinese Indigenous Pig Genomes Project provides insights into the genomic architecture of pigs.
Du H, Zhou L, Liu Z, Zhuo Y, Zhang M, Huang Q, Lu S, Xing K, Jiang L, Liu JF. Nat Commun. 2024 Nov 22;15(1):10137. doi: 10.1038/s41467-024-54471-z. PMID: 39578420 Free PMC article.
- Whole-genome sequencing to identify rare variants in East Asian patients with dementia with Lewy bodies.
Kimura T, Fujita K, Sakurai T, Niida S, Ozaki K, Shigemizu D. NPJ Aging. 2024 Nov 21;10(1):52. doi: 10.1038/s41514-024-00180-2. PMID: 39572598 Free PMC article.
- A combination of upstream alleles involved in rice heading hastens natural long-day responses.
Kim MS, Kim JS, Song SI, Jun KM, Shim SH, Jeon JS, Lee TH, Lee SB, Lee GS, Kim YK. Genes Genomics. 2024 Nov 20. doi: 10.1007/s13258-024-01597-5. Online ahead of print. PMID: 39567417
- Genome-wide association study reveals novel QTLs and candidate genes for panicle number in rice.
Guo J, Wang W, Li W. Front Genet. 2024 Nov 5;15:1470294. doi: 10.3389/fgene.2024.1470294. eCollection 2024. PMID: 39563736 Free PMC article.
Grants and funding
- U54 HG003067-01/HG/NHGRI NIH HHS/United States
- U01 HG005208-01/HG/NHGRI NIH HHS/United States
- P30 DK043351/DK/NIDDK NIH HHS/United States
- 54 HG003067/HG/NHGRI NIH HHS/United States
- U54 HG003067/HG/NHGRI NIH HHS/United States
- U01 HG005208/HG/NHGRI NIH HHS/United States