Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding

Genome Res. 2009 Sep;19(9):1527-41.

doi: 10.1101/gr.091868.109. Epub 2009 Jun 22.

Kevin Judd McKernan, Heather E Peckham, Gina L Costa, Stephen F McLaughlin, Yutao Fu, Eric F Tsung, Christopher R Clouser, Cisyla Duncan, Jeffrey K Ichikawa, Clarence C Lee, Zheng Zhang, Swati S Ranade, Eileen T Dimalanta, Fiona C Hyland, Tanya D Sokolsky, Lei Zhang, Andrew Sheridan, Haoning Fu, Cynthia L Hendrickson, Bin Li, Lev Kotler, Jeremy R Stuart, Joel A Malek, Jonathan M Manning, Alena A Antipova, Damon S Perez, Michael P Moore, Kathleen C Hayashibara, Michael R Lyons, Robert E Beaudoin, Brittany E Coleman, Michael W Laptewicz, Adam E Sannicandro, Michael D Rhodes, Rajesh K Gottimukkala, Shan Yang, Vineet Bafna, Ali Bashir, Andrew MacBride, Can Alkan, Jeffrey M Kidd, Evan E Eichler, Martin G Reese, Francisco M De La Vega, Alan P Blanchard

PMID: 19546169
PMCID: PMC2752135
DOI: 10.1101/gr.091868.109

Kevin Judd McKernan et al. Genome Res. 2009 Sep.

Abstract

We describe the genome sequencing of an anonymous individual of African origin using a novel ligation-based sequencing assay that enables a unique form of error correction that improves the raw accuracy of the aligned reads to >99.9%, allowing us to accurately call SNPs with as few as two reads per allele. We collected several billion mate-paired reads yielding approximately 18x haploid coverage of aligned sequence and close to 300x clone coverage. Over 98% of the reference genome is covered with at least one uniquely placed read, and 99.65% is spanned by at least one uniquely placed mate-paired clone. We identify over 3.8 million SNPs, 19% of which are novel. Mate-paired data are used to physically resolve haplotype phases of nearly two-thirds of the genotypes obtained and produce phased segments of up to 215 kb. We detect 226,529 intra-read indels, 5590 indels between mate-paired reads, 91 inversions, and four gene fusions. We use a novel approach for detecting indels between mate-paired reads that are smaller than the standard deviation of the insert size of the library and discover deletions in common with those detected with our intra-read approach. Dozens of mutations previously described in OMIM and hundreds of nonsynonymous single-nucleotide and structural variants in genes previously implicated in disease are identified in this individual. There is more genetic variation in the human genome still to be uncovered, and we provide guidance for future surveys in populations and cancer biopsies.

Figures

Figure 1.

Cumulative plot of sequence and clone coverage from uniquely placed fragments and uniquely placed mate pairs. The sequence coverage is derived from the fragment, 2 × 25 mate-paired, and 2 × 50 mate-paired libraries while the clone coverage is from only the mate-paired libraries (2 × 25 and 2 × 50).

Figure 2.

Uniquely placed mate pairs provide a more comprehensive sampling of the human genome than the unique placement of each of the tags independently. The coverage is separated by mate-paired data treated as single tags before pairing (mate pairs, unpaired; blue) and mate-paired data treated as mate pairs (mate pairs, paired; pink).

Figure 3.

Dependence of genotype calling on depth of sequence coverage. The NA18507 genotypes called by SOLiD at all HapMap loci are compared with the HapMap genotypes by SOLiD coverage per genome position (average 18× coverage). Coverage includes alleles representing the reference or a valid base change; i.e., alleles with single or invalid adjacent mismatches are not included. No prior information about SNP presence or SNP alleles was used in making SOLiD genotype calls. The number of HapMap loci with a given level of SOLiD coverage (“Count”) is shown, and the percentage of these loci for which SOLiD gives the same genotype as HapMap for homozygotes and heterozygotes is represented by the colored lines (graphed using the left-hand _y_-axis and referred to as “% Concordance”) using two genotyping algorithms: Consensus Caller and diBayes. diBayes is more sensitive at heterozygous SNP detection and yields a lower false-negative rate than Consensus Caller, but we did not attempt to estimate the false-positive rate of diBayes with validation data. SOLiD genotypes that differ from HapMap genotypes are nearly always heterozygous undercalls (i.e., the position is called homozygous for one of the two alleles) or called as N (insufficient evidence to make a confident genotype call).
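The distinction between a "valid base change" and "single or invalid adjacent mismatches" follows from two-base encoding, in which each color reports on a pair of adjacent bases. The sketch below is only an illustration, not the authors' pipeline. It assumes the conventional 2-bit base codes (A=0, C=1, G=2, T=3) with each color taken as the XOR of neighboring codes, and the function names `encode_colors` and `classify_mismatches` are hypothetical. Under that assumption, a substitution of an interior base changes exactly two adjacent colors while leaving their combined XOR unchanged, whereas a single isolated color difference is more likely a measurement error.

```python
# Two-base ("color space") encoding sketch.
# Assumption: conventional 2-bit base codes; function names are hypothetical.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_colors(seq):
    """Return one color (0-3) per adjacent base pair: the XOR of the two base codes."""
    codes = [BASE_CODE[b] for b in seq.upper()]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def classify_mismatches(ref_colors, read_colors):
    """Classify the color differences between a read and the reference.

    'match'            : no color differs
    'single'           : exactly one color differs (likely a measurement error)
    'valid_adjacent'   : two adjacent colors differ and their combined XOR is
                         conserved, as a single-base substitution requires
    'invalid_adjacent' : two adjacent colors differ but the XOR is not conserved
    'other'            : any other pattern
    """
    diffs = [i for i, (r, q) in enumerate(zip(ref_colors, read_colors)) if r != q]
    if not diffs:
        return "match"
    if len(diffs) == 1:
        return "single"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i, j = diffs
        conserved = (ref_colors[i] ^ ref_colors[j]) == (read_colors[i] ^ read_colors[j])
        return "valid_adjacent" if conserved else "invalid_adjacent"
    return "other"


if __name__ == "__main__":
    ref = encode_colors("ACGTACGT")
    snp = encode_colors("ACGAACGT")  # interior T -> A substitution
    print(ref, snp, classify_mismatches(ref, snp))
```

Only interior substitutions are covered here; a change at the first or last base of the string alters a single color in this simplified model, which ignores the known primer base that anchors a real read.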

Figure 4.

Length distributions of small and medium insertions and deletions under sequencing reads with respective concordances. Deletions are detected up to 500 bp and insertions up to 20 bp. A high prevalence of small indels, even-sized indels, and _Alu_-sized deletions (300–350 bp) is found in this genome. Larger indels (deletions of 12 bp or more and insertions of 4 bp or more) are called with more restrictive settings (see Methods) than smaller ones.

Figure 5.

Length distributions of large insertions and deletions identified between mate-paired tags. There is an abundance of insertions and deletions in the size range of _Alu_s as well as a spike in the number of deletions in the size range of LINEs (6000 bp).

Figure 6.

The distribution of the 193 deletions identified in NA18507 with SOLiD by both the intra-read and inter-read approaches. (Inset) A 328-bp deletion detected using both the inter- and intra-read approaches. Four nonredundant molecules identify the deletion with the intra-read approach while 81 clones identify the deletion with the inter-read approach. This deletion has also been found in the Venter, Watson, and YH genomes.
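One plausible reason a deletion smaller than the insert-size standard deviation can still be detected is that many clones spanning the same site are averaged, so the uncertainty of the mean shrinks with the square root of the clone count. The sketch below illustrates that arithmetic only; it is not the authors' method. The function name `span_shift_z` and the library mean and standard deviation (1500 bp and 400 bp) are hypothetical, while the 81 clones and the 328-bp deletion are taken from the caption above.

```python
import math
from statistics import mean


def span_shift_z(observed_spans, library_mean, library_sd):
    """Return (shift, z) for a cluster of clones spanning one site.

    shift: mean mapped span minus the library's expected insert size; a deletion
           in the sample makes clones map farther apart on the reference.
    z:     shift divided by the standard error of the mean, library_sd / sqrt(n).
    """
    n = len(observed_spans)
    shift = mean(observed_spans) - library_mean
    standard_error = library_sd / math.sqrt(n)
    return shift, shift / standard_error


if __name__ == "__main__":
    # Hypothetical library: mean 1500 bp, SD 400 bp. Idealized clones that all
    # map 328 bp farther apart than expected.
    spans = [1500 + 328] * 81
    shift, z = span_shift_z(spans, library_mean=1500, library_sd=400)
    print(f"shift = {shift} bp, z = {z:.2f}")
```

With these assumed library parameters a single clone would be uninformative (328 bp is less than one standard deviation), but 81 clones reduce the standard error to about 44 bp, which makes the shift clearly distinguishable from zero.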

Figure 7.

Copy number variations detected with SOLiD mate-paired reads in NA18507. (A) The size distribution of CNVs detected with SOLiD mate-paired reads. (B) Overlap of copy numbers computed from normalized SOLiD coverage and from Affymetrix array CGH (aCGH) (McCarroll and Altshuler 2007). Colors indicate CNV calls from aCGH. On the top of the figure are the numbers of SOLiD CNV calls that overlap with aCGH data at each copy number.

Figure 8.

Theoretical and actual detection of SNPs and indels at various levels of average sequence coverage. (A) The upper bound on the number of SNPs and intra-read indels that can be detected at various levels of coverage. This is calculated by assessing how much of the genome meets the coverage requirements for each type of variant, 2× coverage for homozygous SNPs, 4× coverage for heterozygous SNPs, and 6× coverage without considering the 3 bp on each end of the reads for intra-read indels. For small indels, two split reads are required to make a call, but due to the more restrictive manner of these calls, only about one in three reads (as found in simulations) can be used for this. (B) The actual number of SNPs and intra-read indels detected at various levels of average sequence coverage. (C) The number of insertions and deletions ≥200 bp detected between mate-paired reads at various average levels of sequence coverage.
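The upper bound in panel A depends on what fraction of the genome reaches a minimum read depth. The authors derive it from the observed coverage; the sketch below instead assumes, purely for illustration, that per-base depth follows a Poisson distribution around the average coverage, which real data typically violate because of mappability and sequence-composition biases. The function name `fraction_at_least` is hypothetical, and the thresholds (2×, 4×, 6×) are those quoted in the caption.

```python
import math


def fraction_at_least(depth, mean_coverage):
    """Fraction of positions with read depth >= depth, assuming Poisson-distributed depth."""
    below = sum(
        math.exp(-mean_coverage) * mean_coverage**i / math.factorial(i)
        for i in range(depth)
    )
    return 1.0 - below


# Minimum depths quoted in the caption for each variant type.
THRESHOLDS = {"homozygous SNP": 2, "heterozygous SNP": 4, "intra-read indel": 6}

if __name__ == "__main__":
    for coverage in (2, 6, 12, 18):
        row = ", ".join(
            f"{name}: {fraction_at_least(k, coverage):.3f}"
            for name, k in THRESHOLDS.items()
        )
        print(f"{coverage}x -> {row}")
```

This idealized model ignores the caption's further restrictions for indels (excluding 3 bp at each read end and the roughly one-in-three usable reads), so it would overestimate indel detectability.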

Cited by

On the core segmentation algorithms of copy number variation detection tools.
Zhang Y, Liu W, Duan J. Brief Bioinform. 2024 Jan 22;25(2):bbae022. doi: 10.1093/bib/bbae022. PMID: 38340093. Free PMC article.
The Development of Plant Genome Sequencing Technology and Its Conservation and Application in Endangered Gymnosperms.
Hong K, Radian Y, Manda T, Xu H, Luo Y. Plants (Basel). 2023 Nov 28;12(23):4006. doi: 10.3390/plants12234006. PMID: 38068641. Free PMC article. Review.
Short-read aligner performance in germline variant identification.
Wilton R, Szalay AS. Bioinformatics. 2023 Aug 1;39(8):btad480. doi: 10.1093/bioinformatics/btad480. PMID: 37527006. Free PMC article. Review.

