Identification of genomic indels and structural variations using split reads

Zhengdong D Zhang et al. BMC Genomics. 2011.

Abstract

Background: Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of the next-generation sequencing technologies, high-throughput surveys of SVs on the whole-genome level have become possible. Here we present split-read identification, calibrated (SRiC), a sequence-based method for SV detection.

Results: We start by mapping each read to the reference genome in standard fashion using gapped alignment. Then to identify SVs, we score each of the many initial mappings with an assessment strategy designed to take into account both sequencing and alignment errors (e.g. scoring more highly events gapped in the center of a read). All current SV calling methods have multilevel biases in their identifications due to both experimental and computational limitations (e.g. calling more deletions than insertions). A key aspect of our approach is that we calibrate all our calls against synthetic data sets generated from simulations of high-throughput sequencing (with realistic error models). This allows us to calculate sensitivity and the positive predictive value under different parameter-value scenarios and for different classes of events (e.g. long deletions vs. short insertions). We run our calculations on representative data from the 1000 Genomes Project. Coupling the observed numbers of events on chromosome 1 with the calibrations gleaned from the simulations (for different length events) allows us to construct a relatively unbiased estimate for the total number of SVs in the human genome across a wide range of length scales. We estimate in particular that an individual genome contains ~670,000 indels/SVs.
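The calibration step described above can be illustrated with a minimal sketch. It assumes that, for one class of events, the simulation yields counts of true positives, false negatives, and false positives, and that an observed call count is corrected by multiplying by the positive predictive value and dividing by the sensitivity. The class and function names are hypothetical, the numbers are invented for illustration, and the actual correction used by SRiC may differ.

```python
from dataclasses import dataclass


@dataclass
class Calibration:
    """Simulation outcome for one event class (e.g. short insertions)."""
    true_positives: int
    false_negatives: int
    false_positives: int

    @property
    def sensitivity(self) -> float:
        # Fraction of simulated events that were recovered.
        return self.true_positives / (self.true_positives + self.false_negatives)

    @property
    def ppv(self) -> float:
        # Fraction of calls that correspond to a simulated event.
        return self.true_positives / (self.true_positives + self.false_positives)


def corrected_count(observed_calls: int, cal: Calibration) -> float:
    """Estimate the true number of events from an observed call count.

    Assumption: observed * PPV removes the expected false calls, and
    dividing by the sensitivity restores the expected missed events.
    """
    return observed_calls * cal.ppv / cal.sensitivity


if __name__ == "__main__":
    # Invented example counts, not values from the paper.
    cal = Calibration(true_positives=80, false_negatives=20, false_positives=10)
    print(round(cal.sensitivity, 3), round(cal.ppv, 3))
    print(round(corrected_count(900, cal), 1))
```

Applying such a correction separately to each length class, and then summing, is one way to arrive at a genome-wide total that is less sensitive to class-specific biases such as calling more deletions than insertions.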

Conclusions: Compared with the existing read-depth and read-pair approaches for SV identification, our method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions. Moreover, with the advent of the third-generation sequencing technologies that produce longer reads, we expect our method to be even more useful.

Figures

Figure 1

The size spectrum of SVs identifiable by different methods. No single method can identify SVs of all sizes. The black bars indicate the size ranges of SVs discoverable by each method: the dbSNP database, high-resolution array CGH (hr-aCGH), the read-pair (RP) method with fosmid, 454, and Solexa sequencing, and the split-read analysis. The range of indels detectable by RP depends on three values: the mean and the standard deviation of the distances between mapped read pairs, and the multiple of the s.d. required for significance. These three values are (40 kb, 2.8 kb, 3), (1 kb, 0.8 kb, 3), and (250 bp, 25 bp, 6) for fosmid, 454, and Solexa sequencing, respectively.
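The dependence of the RP range on these three values can be sketched as follows. The sketch assumes a simplified model in which a read pair is flagged only when its mapped distance deviates from the mean by more than the chosen multiple of the s.d., so the smallest detectable event is that multiple times the s.d., and an insertion cannot exceed the mean distance because the mapped distance cannot fall below zero. The function names are hypothetical, and the exact bounds plotted in Figure 1 may have been derived differently.

```python
from typing import Optional, Tuple


def min_event_size(sd_bp: int, k: int) -> int:
    """Smallest size shift that counts as significant: k standard deviations."""
    return k * sd_bp


def insertion_range(mean_bp: int, sd_bp: int, k: int) -> Optional[Tuple[int, int]]:
    """Approximate insertion sizes visible to read pairs, or None if there are none.

    Assumption: an insertion shortens the mapped distance, which cannot
    drop below zero, so the insertion must be smaller than the mean.
    """
    lower = min_event_size(sd_bp, k)
    return (lower, mean_bp) if lower < mean_bp else None


# (mean, s.d., multiple) as given in the Figure 1 caption, in base pairs.
LIBRARIES = {
    "fosmid": (40_000, 2_800, 3),
    "454": (1_000, 800, 3),
    "Solexa": (250, 25, 6),
}

if __name__ == "__main__":
    for name, (mean_bp, sd_bp, k) in LIBRARIES.items():
        print(name, min_event_size(sd_bp, k), insertion_range(mean_bp, sd_bp, k))
```

Under this simplified model the 454 library has a cutoff larger than its mean distance, so it would detect deletions but no insertions, which illustrates why no single library covers the whole size spectrum.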

Figure 2

Effect of different thresholds on SV identification. Different sets of indels are called at combinations of different values for the thresholds t_r, t_n, and t_c. Each bar shows the percentages of true positives, false negatives, and false positives in each call set, represented by the colored, white, and gray portions, respectively. Bars in different shades of green and red show the true positive calls of deletions and insertions of different lengths. (A-B) The alignment score ratio threshold, t_r. SVs are called for a set of simulated reads using different values of t_r while t_n = 5 and t_c = 0.1 are kept unchanged. (C-D) The threshold on the number of supportive reads, t_n. SVs are called for the same set of simulated reads using different values of t_n while t_r = 1 and t_c = 0.1 are kept unchanged. (E-F) The maximum centeredness threshold, t_c. SVs are called for the same set of simulated reads using different values of t_c while t_n = 5 and t_r = 1 are kept unchanged.
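How three such thresholds might jointly filter candidate calls can be sketched as below. Everything here is an assumption made for illustration: the field names, the direction of each comparison (a call passes when each quantity is at least its threshold), and in particular the definition of centeredness, which is taken here to be 1 when the gap sits in the middle of a read and 0 when it sits at an end. The source does not give these definitions.

```python
from dataclasses import dataclass


def centeredness(left_flank: int, right_flank: int) -> float:
    """Hypothetical centeredness of a gap within a read, in [0, 1].

    Assumption: twice the shorter aligned flank divided by the total
    aligned length, so equal flanks give 1.0 and a gap at the end gives 0.0.
    """
    total = left_flank + right_flank
    return 2 * min(left_flank, right_flank) / total if total else 0.0


@dataclass
class CandidateCall:
    score_ratio: float       # alignment score ratio (assumed precomputed)
    supporting_reads: int    # number of reads supporting the event
    max_centeredness: float  # best centeredness among supporting reads


def passes(call: CandidateCall, t_r: float, t_n: int, t_c: float) -> bool:
    """Keep a call only if it meets all three thresholds (assumed direction)."""
    return (
        call.score_ratio >= t_r
        and call.supporting_reads >= t_n
        and call.max_centeredness >= t_c
    )


if __name__ == "__main__":
    call = CandidateCall(
        score_ratio=1.2,
        supporting_reads=6,
        max_centeredness=centeredness(180, 220),
    )
    print(passes(call, t_r=1.0, t_n=5, t_c=0.1))
```

Varying one threshold while holding the other two fixed, as in panels A-F, then amounts to calling `passes` over the same candidate set with a different value for a single argument.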

Figure 3

Effect of sequencing on SV identification. The lengths of the colored, white, and gray portions of each bar give the percentages of true positives, false negatives, and false positives, respectively. Bars in different shades of green and red show the true positive calls of deletions and insertions of different lengths. (A-B) Different read lengths. SVs are called for sets of simulated reads of different lengths with the same coverage (5×). (C-D) Different coverage. SVs are called for sets of simulated reads of the same length (~400 bp) with different coverage.

Figure 4

Discoverable simulated SVs. Not all SVs are identifiable, as some of them are not covered by any or enough sequence reads. The lengths of the gray and the colored portions of each bar signify the log-number of indels covered by only one and more than one read, respectively. The bars in different shades of green and red are used for the true positive calls of deletion and insertion of different length. A missing bar indicates a zero count. The counts of simulated deletions (A) and insertions (B) that are covered by at least two reads and by only one read are plotted as colored and gray bars.
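The point that some SVs are covered by no read, or by too few, can be made concrete with a standard approximation. The sketch below assumes that the number of reads spanning a given breakpoint follows a Poisson distribution whose mean equals the sequencing coverage; this is a textbook idealization rather than the model used in the paper, and it ignores the fact that a read must extend some distance on both sides of a breakpoint to be aligned as a split read. The function name is hypothetical.

```python
import math


def prob_at_least(n_reads: int, coverage: float) -> float:
    """P(at least n_reads span a breakpoint) under a Poisson(coverage) model."""
    p_fewer = sum(
        math.exp(-coverage) * coverage**k / math.factorial(k)
        for k in range(n_reads)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    for coverage in (1.0, 2.0, 5.0):
        # Probability that a breakpoint is spanned by two or more reads.
        print(coverage, round(prob_at_least(2, coverage), 4))
```

Because usable split reads need sufficient flanking sequence on each side, the effective coverage at a breakpoint is lower than the nominal coverage, so these probabilities should be read as upper bounds.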

Figure 5

The flowchart of the split-read analysis pipeline.

Figure 6

Conceptual diagrams of the split-read analysis. SVs can be detected by sequence reads that span their breakpoints. The split-read analysis can directly identify deletions, small insertions, and the boundaries of large insertions. After SVs are identified, duplications and translocations can be singled out by matching insertions with deletions. Breaks in the blue genomic lines denote different chromosomes.

Figure 7

The curves of the threshold functions. Each SV call is scored by the number of supportive reads and the maximum centeredness among those reads. The thresholds on these two quantities are determined by two threshold functions, plotted as the red and the blue curves, respectively. The gray dashed curve is the threshold function for the number of supportive reads before rounding. The parameter values used for the curves shown are λ = 1, t_n = 8, and t_c = 0.7.

