Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library

Hugo Y K Lam et al. Nat Biotechnol. 2010 Jan.

Abstract

Structural variants (SVs) are a major source of human genomic variation; however, characterizing them at nucleotide resolution remains challenging. Here we assemble a library of breakpoints at nucleotide resolution from collating and standardizing ~2,000 published SVs. For each breakpoint, we infer its ancestral state (through comparison to primate genomes) and its mechanism of formation (e.g., nonallelic homologous recombination, NAHR). We characterize breakpoint sequences with respect to genomic landmarks, chromosomal location, sequence motifs and physical properties, finding that the occurrence of insertions and deletions is more balanced than previously reported and that NAHR-formed breakpoints are associated with relatively rigid, stable DNA helices. Finally, we demonstrate an approach, BreakSeq, for scanning the reads from short-read sequenced genomes against our breakpoint library to accurately identify previously overlooked SVs, which we then validate by PCR. As new data become available, we expect our BreakSeq approach will become more sensitive and facilitate rapid SV genotyping of personal genomes.

Figures

Figure 1

Composition of the SV breakpoint library. SVs in the library were based on different SV-mapping and breakpoint-sequencing strategies. A large fraction (44%) of the breakpoints were based on data generated using 454/Roche sequencing, including resequencing of an individual human genome (Wheeler, 602 SVs) and sequencing of breakpoints in two individuals following high-resolution and massive paired-end mapping (Korbel and Kim, 264 SVs). The remaining 56% of the breakpoints were identified using other approaches, including Sanger capillary sequencing of breakpoints identified by whole-genome shotgun sequencing and assembly of an individual human genome (Levy, 694 SVs), fosmid paired-end sequencing carried out in multiple individuals (Tuzun and Kidd, 281 SVs), breakpoints mined from SNP discovery DNA resequencing traces (Mills, 98 SVs), and tiling-array-based comparative genomic hybridization followed by breakpoint sequencing (Perry, 22 SVs). Fewer than five breakpoints were reported in two genomes sequenced using short 36 bp reads (Illumina/Solexa), presumably owing to the complex DNA sequence patterns frequently associated with breakpoints.

Figure 2

Mapping breakpoints using the library. (a) Overview of the BreakSeq approach. Breakpoints are used to generate junction sequences (upper): the 30 bp of sequence flanking each side of the breakpoint is extracted to form a 60 bp junction sequence. DNA reads are then aligned to the junction sequences (lower). Alignment results are interpreted as follows. In the case of insertions relative to the reference genome (left), sequences A and B represent the left and right breakpoint junction sequences of the non-reference SV allele, respectively. In the case of deletions (right), sequence C represents the junction sequence of the non-reference SV allele. Solid lines with arrows indicate successful alignments; dashed lines with crosses indicate no proper alignment. (b) Representative PCR validation of detected SVs in NA12891. Primers flanking each SV were used to amplify 41 different genomic regions (see Supplementary Table 3 for genomic coordinates and primer sequences). Expected band sizes for the reference and non-reference SV alleles are given at the top of each lane. The difference in size of the products for the reference and non-reference alleles confirmed the presence of the SVs for all loci except 6, 13 (confirmed by LongAmp Taq in a separate experiment), 21, 25 and 36. M1 is a 100 bp marker and M2 is a 1 kb marker. (c) A subset of SVs, which were confirmed by sequencing, was analyzed in nine additional genomic DNA samples (HapMap individuals with ancestry in Europe) to test for SV frequency within the CEPH population. An asterisk indicates that the SV is polymorphic.
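
The junction construction described in panel (a) can be sketched in a few lines of Python. This is an illustration only, not the BreakSeq implementation: the function names, the 0-based half-open coordinates, and the requirement that an inserted sequence be at least 30 bp long are all assumptions made for the example.

```python
from typing import Tuple

FLANK = 30  # bp taken from each side of a breakpoint, as stated in the caption


def deletion_junction(ref: str, start: int, end: int, flank: int = FLANK) -> str:
    """Junction C of the non-reference allele for a deletion of ref[start:end].

    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    if start > end or start < flank or end + flank > len(ref):
        raise ValueError("not enough flanking sequence around the deletion")
    # Join the 30 bp before the deleted segment to the 30 bp after it.
    return ref[start - flank:start] + ref[end:end + flank]


def insertion_junctions(
    ref: str, pos: int, inserted: str, flank: int = FLANK
) -> Tuple[str, str]:
    """Junctions A (left) and B (right) of the non-reference allele for an
    insertion of `inserted` immediately before ref[pos]."""
    if pos < flank or pos + flank > len(ref) or len(inserted) < flank:
        raise ValueError("not enough sequence to build 2 x 30 bp junctions")
    left = ref[pos - flank:pos] + inserted[:flank]    # reference | insertion
    right = inserted[-flank:] + ref[pos:pos + flank]  # insertion | reference
    return left, right
```

Each returned junction is 60 bp long, so a short read that spans the midpoint of a junction supports the corresponding allele, while a read that aligns only to the reference supports the alternative.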

Figure 3

Ancestral state classification. (a) Junction sequences are aligned onto syntenic regions of a non-human primate genome to infer SV ancestral states. For rectifying an SV insertion event (from deletion) according to ancestral state (left), sequences A and B represent the junction sequences of the reference SV allele, whereas sequence C represents the junction sequence of the non-reference SV allele. For rectifying an SV deletion event (from insertion) according to ancestral state (right), sequence C represents the junction sequence of the reference SV allele and sequences A and B represent the junction sequences of the non-reference SV allele. Solid lines with arrows indicate successful alignments and dashed lines with crosses indicate no proper alignment. (b) Results of classifying SVs as insertions or deletions according to ancestral state. An SV event is defined as ‘rectifiable’ (indicated by darker color) if unambiguous high-quality alignments to putative ancestral regions could be constructed for the locus in any primate genome (regardless of whether the classification is changed according to the ancestral state), and as ‘unrectifiable’ (represented by lighter color) if not.
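
One way to read the decision logic in panel (a) is as a small rule over alignment outcomes. The sketch below is an interpretation, not the authors' code: it assumes that the allele whose junction sequence(s) align to the primate genome is the ancestral one, and the function name and boolean inputs are hypothetical.

```python
def rectify_event(a_aligns: bool, b_aligns: bool, c_aligns: bool) -> str:
    """Classify an SV by ancestral state from junction alignment outcomes.

    A and B are the two junctions of the allele that contains the segment;
    C is the single junction of the allele that lacks it (assumed reading
    of the figure). Returns "insertion", "deletion", or "unrectifiable".
    """
    with_segment = a_aligns and b_aligns
    without_segment = c_aligns
    if without_segment and not with_segment:
        # The ancestor lacks the segment, so the segment was gained.
        return "insertion"
    if with_segment and not without_segment:
        # The ancestor carries the segment, so the segment was lost.
        return "deletion"
    # Neither allele, or both alleles, aligned: no unambiguous call.
    return "unrectifiable"
```

Under this reading, an event reported as a deletion relative to the reference is relabeled as an insertion whenever only junction C aligns to the primate genome, which corresponds to the left-hand case in the caption.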

Figure 4

Inferring mechanisms of SV formation. (a) Pipeline for classifying SV-formation mechanisms. TE, transposable element. TSD, target site duplication. (b) Mechanisms of formation inferred for SVs in the library (larger circle on right). For NAHR (red) and MTEI/STEI (green), darker wedges represent high-confidence classification subsets, and lighter wedges are extended subsets. STEI is further subdivided in the left circle according to the fraction of previously reported L1 insertions, novel L1 insertions and processed pseudogene insertions in our dataset. (c) SV-indel distribution for all rectifiable events, broken down by formation mechanism. (d) Distribution of inter- vs. intra-chromosomal events for all consistently rectifiable insertions, broken down by formation mechanism. (e) Distances of putative ancestral loci to insertion sites for all consistently rectifiable intra-chromosomal insertions, showing that intra-chromosomal NAHR insertions usually involve nearby sequences, whereas TEIs and NHR-associated insertions usually involve distant sequences. (f) Genome-wide view of insertion traces. The outermost circle represents chromosomal ideograms; the second circle represents the SV formation mechanisms of 1,554 events in a stacked histogram. The lines in the innermost circle indicate the origin of the insertion sequences in the human genome for all 321 consistently rectifiable insertions.

Figure 5

Analysis of breakpoint features. (a) Distance to chromosomal landmarks. Brackets indicate significantly different classes (_P_-value < 0.05 in a Wilcoxon rank sum test after multiple hypothesis test correction by the Holm method). NAHR events are found to be significantly closer to telomeres and human-chimpanzee synteny block boundaries than the other mechanistic classes; VNTRs are significantly enriched in centromeric and pericentromeric regions. (b) DNA flexibility (dashed lines and left y-axis) and helix stability (solid lines and right y-axis) around NAHR and NHR breakpoints. (c) Distribution of NHR events with different lengths of microhomologies at the breakpoints. Microhomologies are significantly enriched in NHR breakpoints compared to a random background (KS test _P_-value = 2.43E-11).
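
For readers unfamiliar with the Holm correction mentioned in panel (a), the step-down adjustment can be written directly. The sketch below is a generic textbook implementation in plain Python, not code from the paper, and the example P-values are made up for illustration.

```python
from typing import List, Sequence


def holm_adjust(pvalues: Sequence[float]) -> List[float]:
    """Return Holm-adjusted P-values in the same order as the input."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest P-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Illustrative values only: three pairwise comparisons between classes.
raw = [0.01, 0.04, 0.03]
significant = [p < 0.05 for p in holm_adjust(raw)]
```

With these made-up inputs only the first comparison stays below 0.05 after adjustment, which shows how the correction can remove brackets that would appear significant on raw P-values.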

References

    1. Sebat J, et al. Large-scale copy number polymorphism in the human genome. Science. 2004;305:525–528.
    2. Iafrate AJ, et al. Detection of large-scale variation in the human genome. Nat Genet. 2004;36:949–951.
    3. Tuzun E, et al. Fine-scale structural variation of the human genome. Nat Genet. 2005;37:727–732.
    4. Redon R, et al. Global variation in copy number in the human genome. Nature. 2006;444:444–454.
    5. Korbel JO, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007;318:420–426.
