Quality scores and SNP detection in sequencing-by-synthesis systems

William Brockman et al. Genome Res. 2008 May.

Abstract

Promising new sequencing technologies, based on sequencing-by-synthesis (SBS), are starting to deliver large amounts of DNA sequence at very low cost. Polymorphism detection is a key application. We describe general methods for improved quality scores and accurate automated polymorphism detection, and apply them to data from the Roche (454) Genome Sequencer 20. We assess our methods using known-truth data sets, which is critical to the validity of the assessments. We developed informative, base-by-base error predictors for this sequencer and used a variant of the phred binning algorithm to combine them into a single empirically derived quality score. These quality scores are more useful than those produced by the system software: They both better predict actual error rates and identify many more high-quality bases. We developed a SNP detection method, with variants for low coverage, high coverage, and PCR amplicon applications, and evaluated it on known-truth data sets. We demonstrate good specificity in single reads, and excellent specificity (no false positives in 215 kb of genome) in high-coverage data.
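As background for the quality scores discussed above, here is a minimal sketch of the standard phred scale, in which a quality score Q corresponds to an error probability p through Q = -10 log10(p). The function names are illustrative assumptions and are not taken from the paper or from the 454 software.

```python
import math


def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a phred-scaled quality."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def phred_to_error_prob(q):
    """Convert a phred-scaled quality back to an error probability."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Quality 30 corresponds to one expected error per 1,000 bases.
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
```

On this scale, the quality ≥30 threshold used in the figures corresponds to an expected error rate of at most 1 in 1,000 bases.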

Figures

Figure 1.

New quality scores for 454 reads. Old: quality scores from 454 software v.1.0.52. New: quality scores developed for this work. Data for panels A, B, and C come from 13 different runs on DNA from five different species. (A) Predicted vs. observed quality for old and new quality scores. New quality scores are much closer to the ideal, 1:1 line. (B) Proportion of bases greater than a given actual quality. The new quality scores accurately identify many more bases at quality ≥30 (63% vs. 23%). (C) Error prediction by error type. New quality scores accurately predict different types of errors: predicted vs. actual quality when errors are separated into overcalls, undercalls, and miscalls. (D) New quality scores are stable across machines. Predicted vs. actual quality for three runs of the same human BAC library on three different machines. (E) New quality scores are stable across genomes. Predicted vs. actual quality for five genomes varying in GC content and proportion of bases in homopolymers.
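A predicted-versus-observed comparison like the one in panel A can be sketched as follows, assuming each base call from a known-truth data set is available as a pair of (predicted quality, whether the call was wrong). The input layout, the function names, and the add-one smoothing of the error rate are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import defaultdict


def observed_quality_by_prediction(calls):
    """Map each predicted quality to the quality actually observed.

    calls: iterable of (predicted_q, is_error) pairs, where is_error is True
    when the base call disagrees with the known-truth sequence.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for predicted_q, is_error in calls:
        totals[predicted_q] += 1
        if is_error:
            errors[predicted_q] += 1

    table = {}
    for q in sorted(totals):
        # Assumption: add-one smoothing so an error-free bin stays finite.
        rate = (errors[q] + 1) / (totals[q] + 1)
        table[q] = -10.0 * math.log10(rate)
    return table


def fraction_at_or_above(calls, threshold=30):
    """Fraction of bases whose predicted quality is >= threshold."""
    calls = list(calls)
    return sum(1 for q, _ in calls if q >= threshold) / len(calls)
```

A well-calibrated score gives observed values close to the predicted ones, which is the 1:1 line in panel A. Note that `fraction_at_or_above` counts bases by predicted quality, so it is only meaningful alongside the calibration table.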

Figure 2.

Sensitivity of SNP calling as a function of coverage. Coverage is counted by bases accepted for SNP calling. At all coverages, the fraction of the reference that could be correctly called in haploid DNA (a BAC) exceeds the sensitivity for heterozygous SNPs that can be called in diploid DNA (a mixture of two different, overlapping BACs). Sensitivity is generally lower for heterozygous SNPs in PCR amplicons, due to pooling variation. No false positives were found at any coverage level in genomic data: ∼545 kb haploid, ∼220 kb diploid. For PCR amplicons approximately one false positive was found in ∼27 kb.
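Sensitivity as a function of coverage can be tabulated from a known-truth set in a few lines. The sketch below assumes each known variant position is summarized as (coverage, whether the caller reported it); this representation and the function name are hypothetical and are not the evaluation code used in the paper.

```python
from collections import defaultdict


def sensitivity_by_coverage(sites):
    """Return {coverage: fraction of known sites that were called}.

    sites: iterable of (coverage, called) pairs, one per known-truth position,
    where coverage counts only the bases accepted for SNP calling.
    """
    known = defaultdict(int)
    found = defaultdict(int)
    for coverage, called in sites:
        known[coverage] += 1
        if called:
            found[coverage] += 1
    return {cov: found[cov] / known[cov] for cov in sorted(known)}


if __name__ == "__main__":
    example = [(2, True), (2, False), (5, True), (5, True)]
    print(sensitivity_by_coverage(example))
```

Specificity would be assessed separately, by counting calls at positions where the known-truth sequence has no variant.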
