Identification of genetic variants using bar-coded multiplexed sequencing

doi: 10.1038/nmeth.1251. Epub 2008 Sep 14.

John V Pearson, Szabolcs Szelinger, Aswin Sekar, Margot Redman, Jason J Corneveaux, Traci L Pawlowski, Trisha Laub, Gary Nunn, Dietrich A Stephan, Nils Homer, Matthew J Huentelman

David W Craig et al. Nat Methods. 2008 Oct.

Abstract

We developed a generalized framework for multiplexed resequencing of targeted human genome regions on the Illumina Genome Analyzer using degenerate indexed DNA bar codes ligated to fragmented DNA before sequencing. Using this method, we simultaneously sequenced the DNA of multiple HapMap individuals at several Encyclopedia of DNA Elements (ENCODE) regions. We then evaluated the use of Bayes factors for discovering and genotyping polymorphisms. For polymorphisms that were either previously identified within the Single Nucleotide Polymorphism database (dbSNP) or visually evident upon re-inspection of archived ENCODE traces, we observed a false positive rate of 11.3% using strict thresholds for predicting variants and 69.6% for lax thresholds. Conversely, false negative rates were 10.8–90.8%, with false negatives at stricter cut-offs occurring at lower coverage (<10 aligned reads). These results suggest that >90% of genetic variants are discoverable using multiplexed sequencing provided sufficient coverage at the polymorphic base.

Figures

Figure 1

Schematic describing the preparation of indexed libraries. The red box indicates the indexing step, where for each person a unique indexed adapter was ligated to the fragmented genomic DNA.

Figure 2. Comparison of index performance

Index variability in the initial sequencing runs (Library A), used for evaluating index performance, is shown (top graph). Percentages of reads aligning to the reference sequence are listed by index, without normalization. A total of 30 indexes were present in >0.05% of all aligned reads. Highlighted in the blue box are the 19 indexes with less than a 5-fold difference in index frequencies, which were used in subsequent studies. Indexes matching with 0 errors are shown as blue bars and indexes with 1 error as magenta bars. The bottom graph shows the location of errors by base for each index.
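The distinction between indexes matching with 0 errors and with 1 error can be illustrated with a minimal sketch of read-to-index assignment. This is not the authors' pipeline: the index sequences, the 6-base index length, the function names, and the rule of discarding ambiguous matches are all assumptions made for illustration.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_index(read_prefix, indexes, max_errors=1):
    """Return (index, n_errors) for the unique closest index, or None.

    A read is left unassigned when no index is within max_errors
    mismatches, or when two indexes tie for the closest match.
    """
    best = None
    ambiguous = False
    for idx in indexes:
        d = hamming(read_prefix, idx)
        if d > max_errors:
            continue
        if best is None or d < best[1]:
            best = (idx, d)
            ambiguous = False
        elif d == best[1]:
            ambiguous = True
    if best is None or ambiguous:
        return None
    return best


# Hypothetical 6-base indexes, not taken from the paper.
EXAMPLE_INDEXES = ["ACGTAC", "TGCATG", "GATCCA"]

exact = assign_index("ACGTAC", EXAMPLE_INDEXES)      # matches with 0 errors
one_off = assign_index("ACGTAA", EXAMPLE_INDEXES)    # matches with 1 error
unmatched = assign_index("AAAAAA", EXAMPLE_INDEXES)  # too far from every index
```

Tallying the second element of each returned tuple per index would give counts comparable to the 0-error and 1-error bars described in the caption.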

Figure 3. Relationship between mean and local coverage

Example coverage of 4 individuals sequenced within a single lane of an 8-lane flow cell for 10 pooled amplicons as part of Library A. Amplicons are shown consecutively for each individual by the alternating shaded background. Index sequence and mean coverage for that individual are shown above each graph. The maximum and minimum coverage are shown for each amplicon at the top of the graph. Overlaying pie charts show the observed distribution of bases across all amplicons and the expected distribution determined from a Poisson distribution of the mean coverage, binned by 0 reads, 1–4 reads, 5–9 reads, 10–19 reads, and >20 reads.
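The expected distribution in the pie charts can be approximated by summing Poisson probabilities over each coverage bin. The sketch below is an illustration rather than the authors' code: the function names and the example mean of 10 reads are assumptions, and the last bin is treated as 20 or more reads so that the five bins cover every possible depth.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing exactly k reads when the mean depth is `mean`."""
    if mean <= 0:
        raise ValueError("mean coverage must be positive")
    # Work in log space to stay numerically stable for large means.
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


# Closed bins from the caption: (label, lowest depth, highest depth).
CLOSED_BINS = [("0", 0, 0), ("1-4", 1, 4), ("5-9", 5, 9), ("10-19", 10, 19)]


def expected_bin_fractions(mean):
    """Expected fraction of bases in each coverage bin under a Poisson model."""
    fractions = {}
    for label, lo, hi in CLOSED_BINS:
        fractions[label] = sum(poisson_pmf(k, mean) for k in range(lo, hi + 1))
    # Open-ended top bin: whatever probability mass remains (assumed >= 20 reads).
    fractions["20+"] = max(0.0, 1.0 - sum(fractions.values()))
    return fractions


# Hypothetical mean coverage of 10 reads per base.
example = expected_bin_fractions(10.0)
```

Comparing these expected fractions with the observed fraction of bases in each bin shows how far local coverage departs from what the mean coverage alone would predict.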

Figure 4. Discovery of variant bases by simultaneous analysis of all individuals

(a.) The Bayes-factor for polymorphism discovery (Ks) is plotted for each of the 10 sequenced 5-kb amplicons from Library A. Exact positions matching known polymorphisms are colored as red spheres, and the dbSNP identifier is provided for the most significant SNPs. Black bars at top indicate locations of documented SNPs. Magnified views of amplicon 1 (b.) and amplicon 6 (c.) are provided to compare variants predicted by indexed-multiplexed sequencing to previous deep capillary sequencing results for the same individuals as part of the ENCODE project. (d–e.) Examples of false positives arising from sequence homology to elsewhere in the genome. (f–i.) Examples of sequence traces validating the discovery of novel SNPs not previously annotated in ENCODE capillary sequencing traces. A similar analysis was conducted on Library B (shown in Supplementary Figure 1).

Figure 5. Relationship between base-level coverage and Bayes-factor for polymorphism discovery and variant genotyping

(a.) The y-axis is Log(Ks) and the x-axis is the total coverage across only those individuals with a non-reference genotype at a known polymorphism (AB or BB). (b.) Same, zoomed to lower Ks and lower coverage. (c.) The percentage of genotypes determined correctly is plotted against the coverage of the variant within the individual. Plots contain cumulative statistics from variant discovery and genotyping in both Library A and Library B.
