Single-molecule sequencing of an individual human genome

Dmitry Pushkarev et al. Nat Biotechnol. 2009 Sep.

Abstract

Recent advances in high-throughput DNA sequencing technologies have enabled order-of-magnitude improvements in both cost and throughput. Here we report the use of single-molecule methods to sequence an individual human genome. We aligned billions of 24- to 70-bp reads (32 bp average) to approximately 90% of the National Center for Biotechnology Information (NCBI) reference genome, with 28× average coverage. Our results were obtained on one sequencing instrument by a single operator with four data collection runs. Single-molecule sequencing enabled analysis of human genomic information without the need for cloning, amplification or ligation. We determined approximately 2.8 million single nucleotide polymorphisms (SNPs) with a false-positive rate of less than 1% as validated by Sanger sequencing and 99.8% concordance with SNP genotyping arrays. We identified 752 regions of copy number variation by analyzing coverage depth alone and validated 27 of these using digital PCR. This milestone should allow widespread application of genome sequencing to many aspects of genetics and human health, including personal genomics.
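As a rough consistency check on the figures quoted in the abstract, the number of aligned reads implied by a given mean coverage can be estimated as coverage × covered bases / mean read length. The sketch below assumes a reference genome of about 3.1 × 10^9 bases; that value is not stated in the abstract and is used here only for illustration, as is the helper name `reads_needed`.

```python
def reads_needed(coverage, covered_bases, mean_read_len):
    """Number of aligned reads implied by a mean coverage depth."""
    return coverage * covered_bases / mean_read_len

# Assumption: reference genome size of ~3.1e9 bases (not given in the abstract).
GENOME_BASES = 3.1e9

# From the abstract: ~90% of the reference covered, 28x mean depth, 32 bp mean read.
covered = 0.90 * GENOME_BASES
n_reads = reads_needed(28, covered, 32)

print(f"{n_reads:.2e} aligned reads")
```

Under these assumptions the estimate comes out in the low billions of reads, which agrees with the abstract's statement that billions of reads were aligned.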

Figures

Figure 1

P0 genome sequencing metrics. (a) Read length distributions for raw reads (blue) and uniquely aligned reads (red) from Helicos single-molecule sequencing of the genome of Patient Zero (P0). Filtered reads tend to be shorter because a larger proportion of the long reads are instrument artifacts related to the base addition order. (b) Coverage depth for sequence data of the P0 genome, computed over repeat-masked regions (ENSEMBL, blue) compared to the theoretical Poisson limit (red). (c) Error rate as a function of sequence coverage depth. Above 30× coverage, sampling noise from the limited number of BeadArray results begins to dominate the error rate, and error rate measurements are not accurate. Error rates are defined relative to an independent measurement of SNPs using the Illumina Human610-Quad SNP BeadArray (see Online Methods for details). (d) Quality score (QS) tradeoffs between sensitivity and accuracy. High sensitivity is obtained by using a QS threshold of 0, which results in calls for all comparison BeadArray locations, with an accuracy of 98.3%. Raising the QS threshold to 1 results in 97% of comparison BeadArray locations being called, thereby lowering the sensitivity but increasing the accuracy of those calls to 99.2%. Numbers next to each data point indicate accuracy (percentages) and cutoff score (in brackets).

Figure 2

SNP discovery in P0. (a) SNP distribution in the P0 genome as a function of quality score. Putative SNPs are ‘validated’ or ‘nonvalidated’ if they are annotated as such in dbSNP. Putative SNPs not found in dbSNP are ‘novel’. SNPs with larger quality scores are called with higher confidence. A substantial decrease in the proportion of validated SNPs is seen as the quality score drops below 2.8, suggesting that 2.8 is a reasonable threshold for identifying high-quality SNPs. (b) Distribution of high-quality SNP calls (quality score >2.8) for the P0 human genome. Validated, nonvalidated and novel SNPs are defined as in a. (c) Overlap in SNP locations between the genomes of P0, James Watson and Craig Venter (in thousands). In this figure the quality-score cutoff was moved to the second plateau in a (QS >1.9), increasing the sensitivity and resulting in a total of 3,263,470 SNPs in the P0 genome. This increase is due to a further 389,736 novel SNPs, 18,495 nonvalidated SNPs and 49,768 validated SNPs. The ratio of validated to novel SNPs can be used to estimate that this improvement in sensitivity comes at a cost of an increased overall false-positive rate (from 1% to 10%). Even with this less restrictive cutoff, the SNP proportions shared with Venter and Watson remain consistent.
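The counts in panel c can be cross-checked with simple arithmetic. The sketch below uses only the numbers in the caption. It does not reproduce the authors' false-positive estimate. It only shows that subtracting the additional calls from the relaxed total recovers the roughly 2.8 million high-quality SNPs reported in the abstract, and that most of the additional calls are novel.

```python
# Counts taken from the Figure 2 caption (QS > 1.9 cutoff).
total_relaxed = 3_263_470
added = {
    "novel": 389_736,
    "nonvalidated": 18_495,
    "validated": 49_768,
}

# SNPs gained by lowering the cutoff from 2.8 to 1.9.
added_total = sum(added.values())

# Implied number of SNPs at the stricter QS > 2.8 cutoff.
strict_total = total_relaxed - added_total

# Share of the additional calls that are absent from dbSNP.
novel_share = added["novel"] / added_total

print(f"added SNPs:        {added_total:,}")
print(f"SNPs at QS > 2.8:  {strict_total:,}")
print(f"novel share added: {novel_share:.1%}")
```

The large novel share among the added calls is the kind of signal the caption refers to when it says the validated-to-novel ratio can be used to estimate the false-positive rate.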

Figure 3

Copy number variation in the P0 human genome. Blue, signal from simulated dataset (simulated reads per 1 kb bin). Magenta, CNV estimate. Green, raw signal (actual reads mapped per 1 kb bin). (a) Heterozygous deletion. (b) Homozygous deletion. (c) Homozygous duplication. (d) Heterozygous deletion.
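The idea of calling copy number from coverage depth alone can be sketched by comparing actual reads per bin with the reads expected from a simulated dataset. This is a simplified illustration, not the authors' algorithm: it assumes a diploid baseline, rounds the scaled ratio to the nearest integer, and applies no smoothing or segmentation across bins. The function name `copy_number` and the toy read counts are hypothetical.

```python
def copy_number(observed, expected, ploidy=2):
    """Estimate per-bin copy number from observed vs. expected read counts.

    Assumes the expected (simulated) counts correspond to `ploidy` copies.
    Bins with no expected reads are unmappable and return None.
    """
    calls = []
    for obs, exp in zip(observed, expected):
        if exp == 0:
            calls.append(None)
        else:
            calls.append(round(ploidy * obs / exp))
    return calls

# Toy data: reads per 1 kb bin (hypothetical values, not from the paper).
observed = [30, 14, 0, 61, 29]
expected = [30, 30, 30, 30, 30]

labels = {
    0: "homozygous deletion",
    1: "heterozygous deletion",
    2: "normal",
    3: "heterozygous duplication",
    4: "homozygous duplication",
}

calls = copy_number(observed, expected)
for cn in calls:
    print(cn, labels.get(cn, "other"))
```

Normalizing against simulated reads rather than a flat average means that bins which are hard to map uniquely are not mistaken for deletions.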
