Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding
Genome Res. 2009 Sep;19(9):1527-41.
doi: 10.1101/gr.091868.109. Epub 2009 Jun 22.
Kevin Judd McKernan, Heather E Peckham, Gina L Costa, Stephen F McLaughlin, Yutao Fu, Eric F Tsung, Christopher R Clouser, Cisyla Duncan, Jeffrey K Ichikawa, Clarence C Lee, Zheng Zhang, Swati S Ranade, Eileen T Dimalanta, Fiona C Hyland, Tanya D Sokolsky, Lei Zhang, Andrew Sheridan, Haoning Fu, Cynthia L Hendrickson, Bin Li, Lev Kotler, Jeremy R Stuart, Joel A Malek, Jonathan M Manning, Alena A Antipova, Damon S Perez, Michael P Moore, Kathleen C Hayashibara, Michael R Lyons, Robert E Beaudoin, Brittany E Coleman, Michael W Laptewicz, Adam E Sannicandro, Michael D Rhodes, Rajesh K Gottimukkala, Shan Yang, Vineet Bafna, Ali Bashir, Andrew MacBride, Can Alkan, Jeffrey M Kidd, Evan E Eichler, Martin G Reese, Francisco M De La Vega, Alan P Blanchard
- PMID: 19546169
- PMCID: PMC2752135
- DOI: 10.1101/gr.091868.109
Kevin Judd McKernan et al. Genome Res. 2009 Sep.
Abstract
We describe the genome sequencing of an anonymous individual of African origin using a novel ligation-based sequencing assay that enables a unique form of error correction that improves the raw accuracy of the aligned reads to >99.9%, allowing us to accurately call SNPs with as few as two reads per allele. We collected several billion mate-paired reads yielding approximately 18x haploid coverage of aligned sequence and close to 300x clone coverage. Over 98% of the reference genome is covered with at least one uniquely placed read, and 99.65% is spanned by at least one uniquely placed mate-paired clone. We identify over 3.8 million SNPs, 19% of which are novel. Mate-paired data are used to physically resolve haplotype phases of nearly two-thirds of the genotypes obtained and produce phased segments of up to 215 kb. We detect 226,529 intra-read indels, 5590 indels between mate-paired reads, 91 inversions, and four gene fusions. We use a novel approach for detecting indels between mate-paired reads that are smaller than the standard deviation of the insert size of the library and discover deletions in common with those detected with our intra-read approach. Dozens of mutations previously described in OMIM and hundreds of nonsynonymous single-nucleotide and structural variants in genes previously implicated in disease are identified in this individual. There is more genetic variation in the human genome still to be uncovered, and we provide guidance for future surveys in populations and cancer biopsies.
Figures
Figure 1.
Cumulative plot of sequence and clone coverage from uniquely placed fragments and uniquely placed mate pairs. The sequence coverage is derived from the fragment, 2 × 25 mate-paired, and 2 × 50 mate-paired libraries while the clone coverage is from only the mate-paired libraries (2 × 25 and 2 × 50).
Figure 2.
Uniquely placed mate pairs provide a more comprehensive sampling of the human genome than the unique placement of each of the tags independently. The coverage is separated by mate-paired data treated as single tags before pairing (mate pairs, unpaired; blue) and mate-paired data treated as mate pairs (mate pairs, paired; pink).
Figure 3.
Dependence of genotype calling on depth of sequence coverage. The NA18507 genotypes called by SOLiD at all HapMap loci are compared with the HapMap genotypes by SOLiD coverage per genome position (average 18× coverage). Coverage includes alleles representing the reference or a valid base change; i.e., alleles with single or invalid adjacent mismatches are not included. No prior information about SNP presence or SNP alleles was used in making SOLiD genotype calls. The number of HapMap loci with a given level of SOLiD coverage (“Count”) is shown, and the percentage of these loci for which SOLiD gives the same genotype as HapMap for homozygotes and heterozygotes is represented by the colored lines (graphed using the left-hand _y_-axis and referred to as “% Concordance”) using two genotyping algorithms: Consensus Caller and diBayes. diBayes is more sensitive at heterozygous SNP detection and yields a lower false-negative rate than Consensus Caller, but we did not attempt to estimate the false-positive rate of diBayes with validation data. SOLiD genotypes that differ from HapMap genotypes are nearly always heterozygous undercalls (i.e., the position is called homozygous for one of the two alleles) or called as N (insufficient evidence to make a confident genotype call).
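The distinction between a "valid base change" and "single or invalid adjacent mismatches" follows from two-base encoding, and can be illustrated with a minimal sketch. It assumes the conventional color-space mapping (A=0, C=1, G=2, T=3, with each color the XOR of two neighboring base codes); the function names `encode_colors` and `classify_mismatches` are hypothetical, and this is a simplified illustration rather than the Consensus Caller or diBayes logic.

```python
# Illustrative sketch of two-base (color-space) encoding. The mapping
# A=0, C=1, G=2, T=3 is assumed; function names are hypothetical.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colors(seq):
    """Encode each overlapping pair of bases as one color (0-3)."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def classify_mismatches(ref_colors, read_colors):
    """Classify color differences between a reference and an aligned read.

    A single base change alters exactly two adjacent colors while leaving
    their XOR unchanged, because that XOR depends only on the two flanking
    bases. An isolated color mismatch cannot be explained by a base change.
    """
    diffs = [i for i, (r, q) in enumerate(zip(ref_colors, read_colors)) if r != q]
    if not diffs:
        return "match"
    if len(diffs) == 1:
        return "single"  # likely a measurement error
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i = diffs[0]
        ref_xor = ref_colors[i] ^ ref_colors[i + 1]
        read_xor = read_colors[i] ^ read_colors[i + 1]
        return "valid_adjacent" if ref_xor == read_xor else "invalid_adjacent"
    return "other"  # more complex patterns are not handled in this sketch


if __name__ == "__main__":
    ref = encode_colors("ACGT")
    snp = encode_colors("ATGT")  # middle base C changed to T
    print(ref, snp, classify_mismatches(ref, snp))
```

Under this scheme only the "valid_adjacent" pattern supports a candidate SNP allele; isolated or inconsistent color mismatches are treated as errors and excluded from the coverage counts.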
Figure 4.
Length distributions of small and medium insertions and deletions under sequencing reads with respective concordances. Deletions are detected up to 500 bp and insertions up to 20 bp. A high prevalence of small indels, even-sized indels, and _Alu_-sized deletions (300–350 bp) is found in this genome. Larger indels (deletions of 12 bp or more and insertions of 4 bp or more) are called with more restrictive settings (see Methods) than smaller ones.
Figure 5.
Length distributions of large insertions and deletions identified between mate-paired tags. There is an abundance of insertions and deletions in the size range of _Alu_s as well as a spike in the number of deletions in the size range of LINEs (6000 bp).
Figure 6.
The distribution of the 193 deletions identified in NA18507 with SOLiD by both the intra-read and inter-read approaches. (Inset) A 328-bp deletion detected using both the inter- and intra-read approaches. Four nonredundant molecules identify the deletion with the intra-read approach while 81 clones identify the deletion with the inter-read approach. This deletion has also been found in the Venter, Watson, and YH genomes.
Figure 7.
Copy number variations detected with SOLiD mate-paired reads in NA18507. (A) The size distribution of CNVs detected with SOLiD mate-paired reads. (B) Overlap of copy numbers computed from normalized SOLiD coverage and from Affymetrix array CGH (aCGH) (McCarroll and Altshuler 2007). Colors indicate CNV calls from aCGH. On the top of the figure are the numbers of SOLiD CNV calls that overlap with aCGH data at each copy number.
Figure 8.
Theoretical and actual detection of SNPs and indels at various levels of average sequence coverage. (A) The upper bound on the number of SNPs and intra-read indels that can be detected at various levels of coverage. This is calculated by assessing how much of the genome meets the coverage requirements for each type of variant, 2× coverage for homozygous SNPs, 4× coverage for heterozygous SNPs, and 6× coverage without considering the 3 bp on each end of the reads for intra-read indels. For small indels, two split reads are required to make a call, but due to the more restrictive manner of these calls, only about one in three reads (as found in simulations) can be used for this. (B) The actual number of SNPs and intra-read indels detected at various levels of average sequence coverage. (C) The number of insertions and deletions ≥200 bp detected between mate-paired reads at various average levels of sequence coverage.
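The coverage thresholds in panel A (2× for homozygous SNPs, 4× for heterozygous SNPs, 6× for intra-read indels) can be explored with a rough sketch. The caption describes assessing how much of the genome actually meets each requirement; the code below instead assumes, purely for illustration, that per-position depth is Poisson-distributed around the mean coverage, and it does not model the exclusion of the 3 bp at each read end. The names `poisson_fraction_at_least`, `REQUIRED_DEPTH`, and `callable_fractions` are hypothetical.

```python
import math

# Minimum depth per variant type, taken from the Figure 8 caption.
REQUIRED_DEPTH = {"hom_snp": 2, "het_snp": 4, "intra_read_indel": 6}


def poisson_fraction_at_least(mean_cov, min_depth):
    """Return P(X >= min_depth) for X ~ Poisson(mean_cov).

    Assumption: depth is Poisson-distributed. Real coverage is typically
    more variable, so this tends to overestimate the callable fraction.
    """
    below = sum(
        math.exp(-mean_cov) * mean_cov ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - below


def callable_fractions(mean_cov):
    """Fraction of positions meeting each depth requirement at a mean coverage."""
    return {
        name: poisson_fraction_at_least(mean_cov, depth)
        for name, depth in REQUIRED_DEPTH.items()
    }


if __name__ == "__main__":
    for cov in (2, 6, 12, 18):
        fractions = callable_fractions(cov)
        print(cov, {name: round(frac, 4) for name, frac in fractions.items()})
```

Multiplying each fraction by an estimate of the total number of variants of that type would give an upper bound of the kind plotted in panel A; that total is not stated in the caption and would have to be supplied separately.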
Similar articles
- U87MG decoded: the genomic sequence of a cytogenetically aberrant human cancer cell line.
Clark MJ, Homer N, O'Connor BD, Chen Z, Eskin A, Lee H, Merriman B, Nelson SF. PLoS Genet. 2010 Jan 29;6(1):e1000832. doi: 10.1371/journal.pgen.1000832. PMID: 20126413 Free PMC article.
- De novo fragment assembly with short mate-paired reads: Does the read length matter?
Chaisson MJ, Brinza D, Pevzner PA. Genome Res. 2009 Feb;19(2):336-46. doi: 10.1101/gr.079053.108. Epub 2008 Dec 3. PMID: 19056694 Free PMC article.
- Coverage-based consensus calling (CbCC) of short sequence reads and comparison of CbCC results to identify SNPs in chickpea (Cicer arietinum; Fabaceae), a crop species without a reference genome.
Azam S, Thakur V, Ruperao P, Shah T, Balaji J, Amindala B, Farmer AD, Studholme DJ, May GD, Edwards D, Jones JD, Varshney RK. Am J Bot. 2012 Feb;99(2):186-92. doi: 10.3732/ajb.1100419. Epub 2012 Feb 1. PMID: 22301893
- Computational methods for discovering structural variation with next-generation sequencing.
Medvedev P, Stanciu M, Brudno M. Nat Methods. 2009 Nov;6(11 Suppl):S13-20. doi: 10.1038/nmeth.1374. PMID: 19844226 Review.
- Whole genome sequencing.
Ng PC, Kirkness EF. Methods Mol Biol. 2010;628:215-26. doi: 10.1007/978-1-60327-367-1_12. PMID: 20238084 Review.
Cited by
- On the core segmentation algorithms of copy number variation detection tools.
Zhang Y, Liu W, Duan J. Brief Bioinform. 2024 Jan 22;25(2):bbae022. doi: 10.1093/bib/bbae022. PMID: 38340093 Free PMC article.
- The Development of Plant Genome Sequencing Technology and Its Conservation and Application in Endangered Gymnosperms.
Hong K, Radian Y, Manda T, Xu H, Luo Y. Plants (Basel). 2023 Nov 28;12(23):4006. doi: 10.3390/plants12234006. PMID: 38068641 Free PMC article. Review.
- Short-read aligner performance in germline variant identification.
Wilton R, Szalay AS. Bioinformatics. 2023 Aug 1;39(8):btad480. doi: 10.1093/bioinformatics/btad480. PMID: 37527006 Free PMC article. Review.
Grants and funding
- R01 HG004962-02/HG/NHGRI NIH HHS/United States
- R01 HG004962/HG/NHGRI NIH HHS/United States
- HG002993/HG/NHGRI NIH HHS/United States
- R44 HG002993/HG/NHGRI NIH HHS/United States
- HG004120/HG/NHGRI NIH HHS/United States
- R43 HG002993/HG/NHGRI NIH HHS/United States
- R01 HG004962-01/HG/NHGRI NIH HHS/United States
- P01 HG004120/HG/NHGRI NIH HHS/United States