A general approach to single-nucleotide polymorphism discovery
- Letter
- Published: December 1999
- Gabor T. Marth1,
- Ian Korf1,
- Mark D. Yandell1,
- Raymond T. Yeh1,
- Zhijie Gu2,
- Hamideh Zakeri2,
- Nathan O. Stitziel1,
- LaDeana Hillier1,
- Pui-Yan Kwok2 &
- Warren R. Gish1
Nature Genetics volume 23, pages 452–456 (1999)
Abstract
Single-nucleotide polymorphisms (SNPs) are the most abundant form of human genetic variation and a resource for mapping complex genetic traits (ref. 1). The large volume of data produced by high-throughput sequencing projects is a rich and largely untapped source of SNPs (refs 2, 3, 4, 5). We present here a unified approach to the discovery of variations in genetic sequence data of arbitrary DNA sources. We propose to use the rapidly emerging genomic sequence (refs 6, 7) as a template on which to layer often unmapped, fragmentary sequence data (refs 8, 9, 10, 11) and to use base quality values (ref. 12) to discern true allelic variations from sequencing errors. By taking advantage of the genomic sequence we are able to use simpler yet more accurate methods for sequence organization: fragment clustering, paralogue identification and multiple alignment. We analyse these sequences with a novel, Bayesian inference engine, POLYBAYES, to calculate the probability that a given site is polymorphic. Rigorous treatment of base quality permits completely automated evaluation of the full length of all sequences, without limitations on alignment depth. We demonstrate this approach by accurate SNP predictions in human ESTs aligned to finished and working-draft quality genomic sequences, a data set representative of the typical challenges of sequence-based SNP discovery.
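To make the idea of weighing base quality values in a Bayesian test concrete, the following is a minimal Python sketch. It is not the POLYBAYES formulation, which the abstract does not spell out. It assumes a deliberately simplified model: each read in an alignment column reports one base with a Phred quality value Q, converted to an error probability with the standard relation P = 10^(-Q/10); miscalls are spread evenly over the three other bases; a polymorphic site has exactly two alleles, each read sampling either one with probability 1/2. The function names, the prior of 0.001 and the biallelic assumption are all illustrative choices, not values taken from the paper.

```python
from itertools import combinations
from math import prod

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred quality value to an error probability: P = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def base_likelihood(observed, true_base, error):
    """P(observed call | true base); errors are split evenly over the 3 other bases."""
    return 1.0 - error if observed == true_base else error / 3.0


def snp_posterior(column, prior_snp=0.001):
    """Posterior probability that one alignment column is polymorphic.

    column: list of (base, phred_quality) pairs, one per aligned read.
    prior_snp: assumed prior probability of a polymorphic site (illustrative).
    """
    calls = [(b, phred_to_error(q)) for b, q in column]

    # Monomorphic hypotheses: every read derives from the same true base.
    mono = sum(
        (1.0 - prior_snp) / len(BASES)
        * prod(base_likelihood(b, t, e) for b, e in calls)
        for t in BASES
    )

    # Biallelic hypotheses: each read derives from either allele with probability 1/2.
    pairs = list(combinations(BASES, 2))
    poly = sum(
        prior_snp / len(pairs)
        * prod(
            0.5 * base_likelihood(b, x, e) + 0.5 * base_likelihood(b, y, e)
            for b, e in calls
        )
        for x, y in pairs
    )

    # Bayes' rule: weight of the polymorphic hypotheses over all hypotheses.
    return poly / (mono + poly)


if __name__ == "__main__":
    confident_mismatch = [("A", 40), ("A", 40), ("G", 40), ("G", 40)]
    doubtful_mismatch = [("A", 40), ("A", 40), ("A", 40), ("G", 5)]
    print(snp_posterior(confident_mismatch))
    print(snp_posterior(doubtful_mismatch))
```

Under these assumptions, two high-quality reads supporting each of two bases give a posterior close to 1, while a single low-quality discrepant call leaves the posterior below the prior-scale threshold one would use to call a SNP. This is the behaviour the abstract describes: quality values let a sequencing error be discounted without discarding the read.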
References
1. Collins, F.S., Guyer, M.S. & Chakravarti, A. Variations on a theme: cataloging human DNA sequence variation. Science 278, 1580–1581 (1997).
2. Wang, D.G. et al. Large-scale identification, mapping, and genotyping of single nucleotide polymorphisms in the human genome. Science 280, 1077–1082 (1998).
3. Taillon-Miller, P., Gu, Z., Hillier, L. & Kwok, P.-Y. Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res. 8, 748–754 (1998).
4. Picoult-Newberg, L. et al. Mining SNPs from EST databases. Genome Res. 9, 167–174 (1999).
5. Buetow, K.H., Edmondson, M.N. & Cassidy, A.B. Reliable identification of large numbers of candidate SNPs from public EST data. Nature Genet. 21, 323–325 (1999).
6. The Sanger Centre & The Washington University Genome Sequencing Center. Toward a complete human genome sequence. Genome Res. 8, 1097–1108 (1998).
7. Venter, J.C. et al. Shotgun sequencing of the human genome. Science 280, 1540–1542 (1998).
8. Hillier, L. et al. Generation and analysis of 280,000 human expressed sequence tags. Genome Res. 6, 807–828 (1996).
9. Adams, M.D., Soares, M.B., Kerlavage, A.R., Fields, C. & Venter, J.C. Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nature Genet. 4, 373–380 (1993).
10. Hudson, T.J. et al. An STS-based map of the human genome. Science 270, 1945–1954 (1995).
11. Marra, M., Weinstock, L.A. & Mardis, E.R. End sequence determination from large insert clones using energy transfer fluorescent primers. Genome Res. 6, 1118–1122 (1996).
12. Durbin, R. & Dear, S. Base qualities help sequencing software. Genome Res. 8, 161–162 (1998).
13. Ewing, B., Hillier, L., Wendl, M.C. & Green, P. Base-calling of automated traces using Phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998).
14. Ewing, B. & Green, P. Base-calling of automated traces using Phred. II. Error probabilities. Genome Res. 8, 186–194 (1998).
15. Bayes, T. An essay towards solving a problem in the doctrine of chances. Philos. Trans. R. Soc. 53, 370–418 (1763). Reprinted in Biometrika 45, 293–315 (1958).
16. Aaronson, J. et al. Toward the development of a gene index to the human genome: an assessment of the nature of high-throughput EST sequence data. Genome Res. 6, 829–845 (1996).
17. Kwok, P.-Y., Carlson, C., Yager, T., Ankener, W. & Nickerson, D.A. Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products. Genomics 23, 138–144 (1994).
18. Taillon-Miller, P. et al. The homozygous complete hydatidiform mole: a unique resource for genome studies. Genomics 46, 307–310 (1997).
19. Collins, F.S. et al. New goals for the U.S. Human Genome Project: 1998–2003. Science 282, 682–689 (1998).
20. Nickerson, D.A. et al. DNA sequence diversity in a 9.7-kb region of the human lipoprotein lipase gene. Nature Genet. 19, 233–240 (1998).
21. Cargill, M. et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genet. 22, 231–238 (1999).
22. Halushka, M.K. et al. Patterns of single-nucleotide polymorphisms in candidate genes regulating blood-pressure homeostasis. Nature Genet. 22, 239–247 (1999).
23. Gordon, D., Abajian, C. & Green, P. Consed: a graphical tool for sequence finishing. Genome Res. 8, 195–202 (1998).
Acknowledgements
We thank T. Blackwell and S. Eddy for informative discussions during the development of the mathematical framework of the technique. This work was supported by NIH grants P50HG01458 (L.H. and W.R.G.), R01HG1720 (P.-Y.K.) and T32AR07284 (Z.G.), and an equipment loan from Compaq Computer Corporation.
Author information
Authors and Affiliations
- Washington University Department of Genetics and Genome Sequencing Center, St. Louis, Missouri, USA
Gabor T. Marth, Ian Korf, Mark D. Yandell, Raymond T. Yeh, Nathan O. Stitziel, LaDeana Hillier & Warren R. Gish
- Washington University Division of Dermatology, St. Louis, Missouri, USA
Zhijie Gu, Hamideh Zakeri & Pui-Yan Kwok
Corresponding authors
Correspondence to Gabor T. Marth or Pui-Yan Kwok.
About this article
Cite this article
Marth, G., Korf, I., Yandell, M. et al. A general approach to single-nucleotide polymorphism discovery. Nat Genet 23, 452–456 (1999). https://doi.org/10.1038/70570
- Received: 17 August 1999
- Accepted: 18 October 1999
- Issue Date: December 1999
- DOI: https://doi.org/10.1038/70570