SNP detection for massively parallel whole-genome resequencing
- Ruiqiang Li1,2,3,
- Yingrui Li1,3,
- Xiaodong Fang1,
- Huanming Yang1,
- Jian Wang1,
- Karsten Kristiansen1,2 and
- Jun Wang1,2,4
- 1 Beijing Genomics Institute at Shenzhen, Shenzhen 518000, China;
- 2 Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense M DK-5230, Denmark
- 3 These authors contributed equally to this work.
Abstract
Next-generation massively parallel sequencing technologies provide ultrahigh throughput at two orders of magnitude lower unit cost than capillary Sanger sequencing technology. One of the key applications of next-generation sequencing is studying genetic variation between individuals using whole-genome or target region resequencing. Here, we have developed a consensus-calling and SNP-detection method for sequencing-by-synthesis Illumina Genome Analyzer technology. We designed this method by carefully considering the data quality, alignment, and experimental errors common to this technology. All of this information was integrated into a single quality score for each base under Bayesian theory to measure the accuracy of consensus calling. We tested this methodology using a large-scale human resequencing data set of 36× coverage and assembled a high-quality nonrepetitive consensus sequence for 92.25% of the diploid autosomes and 88.07% of the haploid X chromosome. Comparison of the consensus sequence with Illumina human 1M BeadChip genotyped alleles from the same DNA sample showed that 98.6% of the 37,933 genotyped alleles on the X chromosome and 98% of 999,981 genotyped alleles on autosomes were covered at 99.97% and 99.84% consistency, respectively. At a low sequencing depth, we used prior probability of dbSNP alleles and were able to improve coverage of the dbSNP sites significantly as compared to that obtained using a nonimputation model. Our analyses demonstrate that our method has a very low false call rate at any sequencing depth and excellent genome coverage at a high sequencing depth.
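The abstract describes integrating base quality and other error sources into a single Bayesian quality score per consensus base. As a rough illustration of that general idea only, and not of the SOAPsnp model itself, the Python sketch below computes posterior probabilities over the ten diploid genotypes from the observed bases and their Phred qualities, then reports a Phred-scaled consensus quality. The function names, the flat default prior, the uniform `error / 3` substitution model, and the quality cap of 99 are all assumptions made for this example; the alignment and platform-specific errors mentioned in the abstract are not modeled.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The ten unordered diploid genotypes: AA, AC, AG, AT, CC, ...
GENOTYPES = ["".join(g) for g in combinations_with_replacement(BASES, 2)]


def base_likelihood(observed, allele, error):
    """P(observed base | true allele), assuming errors are spread evenly
    over the three other bases (a simplifying assumption)."""
    return 1.0 - error if observed == allele else error / 3.0


def genotype_posteriors(observations, prior=None):
    """Posterior over genotypes given a list of (base, phred_quality) pairs.

    `prior` maps each genotype to a positive probability; a flat prior is
    used when none is given (hypothetical default, not from the paper).
    """
    if prior is None:
        prior = {g: 1.0 / len(GENOTYPES) for g in GENOTYPES}
    log_post = {}
    for g in GENOTYPES:
        lp = math.log(prior[g])
        for base, qual in observations:
            err = 10 ** (-qual / 10.0)  # Phred score -> error probability
            # Each read samples one of the two alleles with probability 1/2.
            lp += math.log(0.5 * base_likelihood(base, g[0], err)
                           + 0.5 * base_likelihood(base, g[1], err))
        log_post[g] = lp
    # Normalize in log space to avoid underflow at high depth.
    top = max(log_post.values())
    unnorm = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}


def call_consensus(observations, prior=None, max_quality=99):
    """Return (best genotype, Phred-scaled quality of that call)."""
    post = genotype_posteriors(observations, prior)
    best = max(post, key=post.get)
    # Sum the alternatives directly rather than computing 1 - P(best),
    # which loses precision when P(best) is very close to 1.
    p_wrong = sum(v for g, v in post.items() if g != best)
    if p_wrong <= 0.0:
        return best, max_quality
    quality = min(max_quality, int(round(-10.0 * math.log10(p_wrong))))
    return best, quality


if __name__ == "__main__":
    homozygous = [("A", 30)] * 10
    heterozygous = [("A", 30)] * 5 + [("G", 30)] * 5
    print(call_consensus(homozygous))
    print(call_consensus(heterozygous))
```

In a sketch like this, prior knowledge about a site (for example, a known dbSNP allele) would enter through a non-flat `prior` that gives more weight to genotypes containing that allele.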
Footnotes
4 Corresponding author.
E-mail wangj{at}genomics.org.cn; fax 86-755-2527-4247. [SOAPsnp is freely available from http://soap.genomics.org.cn under GPL license. The raw sequence data used in this report have been deposited in the EBI/NCBI Short Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) under accession no. ERA000005, and the SNP set has been deposited in dbSNP (release 130). These data are also available at http://yh.genomics.org.cn.]
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.088013.108.
- Received October 15, 2008.
- Accepted March 11, 2009.
Copyright © 2009 by Cold Spring Harbor Laboratory Press