Single molecule molecular inversion probes for targeted, high-accuracy detection of low-frequency variation

Joseph B Hiatt et al. Genome Res. 2013 May.

Abstract

The detection and quantification of genetic heterogeneity in populations of cells is fundamentally important to diverse fields, ranging from microbial evolution to human cancer genetics. However, despite the cost and throughput advances associated with massively parallel sequencing, it remains challenging to reliably detect mutations that are present at a low relative abundance in a given DNA sample. Here we describe smMIP, an assay that combines single molecule tagging with multiplex targeted capture to enable practical and highly sensitive detection of low-frequency or subclonal variation. To demonstrate the potential of the method, we simultaneously resequenced 33 clinically informative cancer genes in eight cell line and 45 clinical cancer samples. Single molecule tagging facilitated extremely accurate consensus calling, with an estimated per-base error rate of 8.4 × 10⁻⁶ in cell lines and 2.6 × 10⁻⁵ in clinical specimens. False-positive mutations in the single molecule consensus base-calls exhibited patterns predominantly consistent with DNA damage, including 8-oxo-guanine and spontaneous deamination of cytosine. Based on mixing experiments with cell line samples, sensitivity for mutations above 1% frequency was 83% with no false positives. At clinically informative sites, we identified seven low-frequency point mutations (0.2%-4.7%), including BRAF p.V600E (melanoma, 0.2% alternate allele frequency), KRAS p.G12V (lung, 0.6%), JAK2 p.V617F (melanoma, colon, two lung, 0.3%-1.4%), and NRAS p.Q61R (colon, 4.7%). We anticipate that smMIP will be broadly adoptable as a practical and effective method for accurately detecting low-frequency mutations in both research and clinical settings.

Figures

Figure 1.

Schematic of smMIP method. (A) Molecular inversion probes (MIPs) consisting of two 16–24 nt “targeting arms” (dark gray) joined by a constant 28-nt “backbone” sequence (light gray) and a 12-nt degenerate “molecular tag” (red) were designed for the coding exons (light-blue rectangle) of 33 cancer-related genes. Targeting arms were complementary to sequences flanking individual regions of interest, each 112 nt in length. (B) Probes are pooled and hybridized to genomic DNA, and polymerase and ligase are added to “gap-fill” the reverse complement of the genomic DNA to which the probe is hybridized (light blue) and ligate the probe into a single-stranded circle. (C) After exonuclease treatment and PCR, sequencing library molecules consist of platform compatibility (black), probe backbone (light gray), targeting arm (dark gray), copied target (light blue), molecular tag (red), and sample-specific index introduced during PCR (green). Massively parallel sequencing is used to collect three reads (dark blue). (D) Overlapping read-pairs are reconciled to form “fr-reads” (dark blue), assigned to samples via the sample-specific index sequence (green) and individual capture events via the molecular tag (red). (E) Groups of fr-reads assigned to the same probe via alignment to the reference genome and sharing the same molecular tag and sample index form a “tag-defined read group” (TDRG). Random errors (yellow) that occur during library construction and sequencing may be present in some members of the TDRG at some positions. (F) TDRGs are used to call a single molecule consensus sequence (“smc-read”) for the captured target sequence that is robust to such errors.
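Steps (D) through (F) can be illustrated with a minimal Python sketch. It groups fr-reads by sample index, probe, and molecular tag into TDRGs, then takes a per-position majority vote. The record layout (`sample`, `probe`, `tag`, `seq`), the function names, and the thresholds `min_reads` and `min_fraction` are assumptions made for illustration; the caption does not state the consensus rule the authors used, and the sketch assumes all reads in a group have equal length.

```python
from collections import Counter, defaultdict


def group_reads(fr_reads):
    """Group fr-reads into tag-defined read groups (TDRGs).

    Each fr-read is a dict with the hypothetical keys
    'sample', 'probe', 'tag', and 'seq'.
    """
    groups = defaultdict(list)
    for read in fr_reads:
        key = (read["sample"], read["probe"], read["tag"])
        groups[key].append(read["seq"])
    return groups


def call_consensus(seqs, min_reads=2, min_fraction=0.6):
    """Majority-vote consensus over equal-length sequences.

    Returns None when the group has fewer than min_reads members.
    Positions where the most common base falls below min_fraction
    are reported as 'N'. Both thresholds are illustrative only.
    """
    if len(seqs) < min_reads:
        return None
    consensus = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_fraction else "N")
    return "".join(consensus)


# Toy input: three reads share one tag (one carries a random error),
# and a fourth read has a different tag.
example_reads = [
    {"sample": "S1", "probe": "probe_1", "tag": "ACGTACGTACGT", "seq": "ACGTAC"},
    {"sample": "S1", "probe": "probe_1", "tag": "ACGTACGTACGT", "seq": "ACGTAC"},
    {"sample": "S1", "probe": "probe_1", "tag": "ACGTACGTACGT", "seq": "ACCTAC"},
    {"sample": "S1", "probe": "probe_1", "tag": "TTGTACGTACGA", "seq": "ACGTAC"},
]
smc_reads = {key: call_consensus(seqs) for key, seqs in group_reads(example_reads).items()}
```

In this toy input the isolated error in the third read is outvoted by the other two members of its TDRG, while the singleton group yields no consensus because it cannot be checked against a second read.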

Figure 2.

smMIP capture performance and detection of low-frequency variation. (A) Distributions of minimum coverage in a given percentile of total targeted coding positions, rank-ordered by smc-read coverage, for eight HapMap cell line (red) and 45 clinical cancer (blue and green) samples (box plot center line: median; top and bottom edges: quartiles; whiskers: farthest data point within 150% of interquartile range; dots: outliers). Zeroth-percentile indicates maximum coverage. (B) Distributions of fraction of coding positions above a given smc-read coverage cutoff. (C) Observed versus expected variant frequency in smc-read base-calls from mixtures of HapMap genomic DNA samples at known ratios for positions with at least 100× coverage (R = 0.94). Ideal performance is shown as gray line (y = x).

Figure 3.

Substitution error rates as a function of expected and observed nucleotide during gap-fill. (A) Schematic illustrating mononucleotide and dinucleotide substitution dependencies being considered. All rates are shown for a given expected gap-fill mono- or dinucleotide, which is the complementary nucleotide(s) to the nucleotide(s) present in the target genomic DNA, considering only ≥Q41 fr-read base-calls and Q60 smc-read base-calls at putative homozygous positions based on GATK calls. (B) Distributions of substitution error rates for eight HapMap cell line and 45 clinical cancer samples, comparing fr-reads and smc-reads, and all substitutions other than C>A or G>A (W>N + N>B, left) to only C>A (middle), or G>A (right). (C) Distributions of substitution error rates comparing fr-reads and smc-reads, and all G>A substitutions occurring in the non-CG dinucleotide context (DG>DA + GN>AN, left) to G>A substitutions occurring only in the CG dinucleotide context (CG>CA, right).
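The mononucleotide case in panel (A) can be sketched as a simple tally: the expected gap-fill base is the complement of the target base, and a substitution rate is the number of times a given wrong base was observed divided by the number of times that expected base was interrogated. The function names and the input format (pairs of target base and observed gap-fill base) are assumptions for illustration; quality filtering and the dinucleotide contexts described in the caption are not modeled.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def expected_gap_fill(target_base):
    """Base the polymerase should incorporate opposite the target base."""
    return COMPLEMENT[target_base]


def substitution_rates(observations):
    """Tally substitution rates keyed by (expected, observed) gap-fill base.

    observations: iterable of (target_base, observed_gap_fill_base) pairs,
    a hypothetical input format. Each rate is the error count divided by
    the number of observations sharing the same expected base.
    """
    totals = Counter()
    errors = Counter()
    for target_base, observed in observations:
        expected = expected_gap_fill(target_base)
        totals[expected] += 1
        if observed != expected:
            errors[(expected, observed)] += 1
    return {sub: n / totals[sub[0]] for sub, n in errors.items()}
```

For example, a target C calls for a gap-fill G, so an observed A at that position is counted as a G>A substitution.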

Figure 4.

Sensitivity and false discovery rates for subclonal variation in synthetic mixtures. Sensitivity versus false discovery rate for low-frequency variants (0.1%–40%) in synthetically mixed HapMap samples for variant calls from fr-reads (red) and smc-reads (blue), for coding positions that were adequately genotyped in both unmixed HapMap samples and for which there was no substantial (binomial adjusted P < 10⁻¹⁰) subclonality in the predominant HapMap sample. Expected subclonal variant frequencies are listed at the top of each panel. Area beneath the curve is shown as an inset in each panel. Candidate subclonal variants occurring in coding sequence and at a frequency of at least 0.1% were prioritized using multiple testing-adjusted binomial P-values that were calculated from substitution error rates.
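The prioritization step can be sketched as a one-sided binomial test: given the coverage at a position and a background substitution error rate, how likely is it to see at least the observed number of variant reads by error alone? The sketch below computes that tail probability with the standard library and applies a Bonferroni correction. The caption does not name the adjustment method, so Bonferroni is an assumption here, as are the function names.

```python
import math


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    Each term is computed in log space via lgamma so that large
    coverage values do not overflow.
    """
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


def bonferroni(p_values):
    """Bonferroni adjustment (an assumed choice of correction method)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

As an illustration, `binomial_tail(10, 1000, 1e-5)` gives the chance of seeing 10 or more variant reads among 1000 at a background error rate of 10⁻⁵. The lower the background error rate of the reads being tested, the fewer supporting reads a true low-frequency variant needs to stand out.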
