Comparison of sequencing platforms for single nucleotide variant calls in a human sample

Aakrosh Ratan et al. PLoS One. 2013.

Abstract

Next-generation sequencing platforms coupled with advanced bioinformatic tools enable re-sequencing of the human genome at high speed and with large cost savings. We compare sequencing platforms from Roche/454 (GS FLX), Illumina/HiSeq (HiSeq 2000), and Life Technologies/SOLiD (SOLiD 3 ECC) for their ability to identify single nucleotide substitutions in whole genome sequences from the same human sample. We report on significant GC-related bias observed in the data sequenced on Illumina and SOLiD platforms. The differences in the variant calls were investigated with regard to coverage and sequencing error. Some of the variants called by only one or two of the platforms were experimentally tested using mass spectrometry, a method that is independent of DNA sequencing. We establish several platform-specific causes why variants remained unreported. We report the indels called using the three sequencing technologies, and from the obtained results we conclude that sequencing human genomes with more than a single platform and multiple libraries is beneficial when a high level of accuracy is required.

Conflict of interest statement

Competing Interests: Some of the authors are employed by a commercial company, Genentech Inc. This does not alter the authors’ adherence to all the PLOS ONE policies on sharing data and materials.

Figures

Figure 1. Depth of coverage distribution for the three platforms.

The y-axis indicates the fraction of the bases in the reference sequence that have a particular coverage. Secondary alignments and potential PCR duplicates are excluded. The dashed lighter curves depict the coverage distribution as calculated using a Poisson model for each sequencing technology.
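
As an illustration of the Poisson model mentioned in the caption, the following is a minimal sketch of how an expected coverage distribution could be computed from a mean depth. The function name, its parameters, and the example mean depth of 30 are hypothetical and are not taken from the paper; the authors' actual procedure may differ.

```python
import math


def poisson_coverage_fraction(mean_depth, max_depth):
    """Expected fraction of reference bases at each depth 0..max_depth.

    Assumes reads land uniformly at random, so per-base depth follows a
    Poisson distribution with rate `mean_depth` (an idealised model).
    """
    fractions = []
    p = math.exp(-mean_depth)  # P(depth = 0)
    for k in range(max_depth + 1):
        fractions.append(p)
        # Recurrence: P(k + 1) = P(k) * mean_depth / (k + 1)
        p *= mean_depth / (k + 1)
    return fractions


# Hypothetical example: a genome sequenced to a mean depth of 30x.
expected = poisson_coverage_fraction(30.0, 200)
```

An observed distribution that is wider than this curve suggests that coverage is not uniform, for example because of GC-related bias.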

Figure 2. Variation of coverage with GC content in the three sequencing technologies.

The red line shows the mean coverage across the whole genome. Each point on the plot reflects the mean coverage and the GC fraction in a 50 kbp non-overlapping window. The y-axis shows the coverage, whereas the x-axis shows the fraction of C and G nucleotides in the window. Secondary alignments and potential PCR duplicates are excluded.
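
The windowing described in this caption can be sketched as follows. This is an illustrative sketch only: the function name, the in-memory inputs (a sequence string and a per-base depth list), the decision to drop a trailing partial window, and the choice to count ambiguous bases such as N as non-GC are all assumptions, not details from the paper.

```python
def windowed_gc_and_coverage(sequence, depths, window=50_000):
    """Return (gc_fraction, mean_coverage) for each full non-overlapping window.

    `sequence` is the reference sequence of one chromosome and `depths`
    holds the per-base coverage at the same positions.
    """
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    points = []
    # Step through full windows only; a trailing partial window is dropped.
    for start in range(0, len(sequence) - window + 1, window):
        chunk = sequence[start:start + window].upper()
        # Assumption: bases other than G and C (including N) count as non-GC.
        gc_fraction = (chunk.count("G") + chunk.count("C")) / window
        mean_coverage = sum(depths[start:start + window]) / window
        points.append((gc_fraction, mean_coverage))
    return points
```

Each returned pair corresponds to one point of a GC-versus-coverage scatter plot; the genome-wide mean would be drawn as a horizontal line for comparison.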

Figure 3. Venn diagram showing the overlap in the SNP calls made using data from the three sequencing technologies.

We display the sizes of each of the seven categories of overlaps among the variant calls in the three technologies. (a) depicts the overlaps when all substitution calls are used, (b) depicts the overlaps when all calls from Illumina and SOLiD are used but only the high-confidence subset of the 454 dataset is used, and (c) depicts the overlaps when only the variants in the uniquely alignable regions of the reference sequence are used.
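
The seven overlap categories of a three-set Venn diagram can be counted as in the sketch below. It assumes each call set is a collection of hashable site keys, here hypothetical (chromosome, position, alternate allele) tuples; the paper may compare calls on different criteria, such as the computed genotype.

```python
from itertools import combinations

PLATFORMS = ("454", "Illumina", "SOLiD")


def three_way_overlap(calls):
    """Count sites in each of the seven Venn categories.

    `calls` maps each platform name to an iterable of site keys.
    Keys of the result are tuples naming exactly the platforms that
    reported a site, e.g. ("454", "SOLiD").
    """
    sets = {name: set(calls[name]) for name in PLATFORMS}
    # Start all seven categories at zero so empty ones are still reported.
    counts = {
        combo: 0
        for size in (1, 2, 3)
        for combo in combinations(PLATFORMS, size)
    }
    for site in set().union(*sets.values()):
        combo = tuple(name for name in PLATFORMS if site in sets[name])
        counts[combo] += 1
    return counts
```

Restricting the input sets first, for instance to a high-confidence subset or to uniquely alignable regions, would give counts analogous to panels (b) and (c).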

Figure 4. Discrepant SNP calls from each platform.

The categories on the x-axis are (1) no coverage at location, (2) not enough coverage at location, (3) more than expected coverage, (4) alternate allele not seen, (5) alternate allele seen just once, (6) too many SNPs around location, (7) close to a high-quality indel, (8) low RMS mapping quality, and (9) low SNP quality. The y-axis depicts the number of locations (frequency) in each category.
a) Comparison of SOLiD generated sequences with other sequences based on SNP calls and alignments. (i) SNPs called using 454 and Illumina sequences but not called using SOLiD reads. (ii) SNPs called only by SOLiD sequences. We investigate why they were not called using Illumina alignments. (iii) SNPs called only by SOLiD sequences. We investigate why they were not called using 454 alignments.
b) Comparison of Illumina generated sequences with other sequences based on SNP calls and alignments. (i) SNPs called using 454 and SOLiD reads but not called using Illumina reads. (ii) SNPs called only by Illumina sequences. We investigate why they were not called using SOLiD alignments. (iii) SNPs called only by Illumina sequences. We investigate why they were not called using 454 alignments.
c) Comparison of 454 generated sequences with other sequences based on SNP calls and alignments. (i) SNPs called using SOLiD and Illumina reads but not called using 454 reads. (ii) SNPs called only by 454 sequences. We investigate why they were not called using SOLiD alignments. (iii) SNPs called only by 454 sequences. We investigate why they were not called using Illumina alignments.

Figure 5. SNP validation using mass spectrometry.

Validation of 300 putative SNP locations from each of the six sets of SNP calls in Figure 3a, where not all three technologies agree on the computed genotype. The categories on the x-axis are “454” (SNPs called by 454 only), “Illumina” (SNPs called by Illumina only), “SOLiD” (SNPs called by SOLiD only), “454 & Illumina” (SNPs called by 454 and Illumina), “454 & SOLiD” (SNPs called by 454 and SOLiD), and “Illumina & SOLiD” (SNPs called by Illumina and SOLiD). The color categories are “Primer Failure” (primer extension failure), “Assay Failure”, “Validated”, and “Not Validated”.

References

    1. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. (2001) The sequence of the human genome. Science 291: 1304–1351. doi:10.1126/science.1058040.
    2. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, et al. (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921. doi:10.1038/35057062.
    3. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, et al. (2008) The complete genome of an individual by massively parallel DNA sequencing. Nature 452: 872–876. doi:10.1038/nature06884.
    4. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, et al. (2007) The diploid genome sequence of an individual human. PLoS Biology 5: e254. doi:10.1371/journal.pbio.0050254.
    5. Schuster SC, Miller W, Ratan A, Tomsho LP, Giardine B, et al. (2010) Complete Khoisan and Bantu genomes from southern Africa. Nature 463: 943–947. doi:10.1038/nature08795.

Grants and funding

This study is based on sequencing data generated for the Southern African Genomes Project [5]. The 454 data were generated at Penn State University and supported by Roche/454 via the donation of sequencing reagents. We would like to thank Timothy Harkins and Kevin McCarren for making SOLiD data for the KB1 genome available to this group of investigators. The sequence data for Illumina GA IIx and HiSeq 2000 were generated at Penn State University and supported through internal funds. The funders had no role in study design, data collection (other than what has been stated above) and analysis, decision to publish, or preparation of the manuscript. This project is funded, in part, under a grant by the Pennsylvania Department of Health using Tobacco CURE Funds to AR. The Pennsylvania Department of Health specifically disclaims responsibility for any analyses, interpretations or conclusions.
