Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing
Bioinformatics. 2014 Mar 15;30(6):815-22.
doi: 10.1093/bioinformatics/btt647. Epub 2013 Nov 8.
Koichiro Doi, Taku Monjo, Pham H Hoang, Jun Yoshimura, Hideaki Yurino, Jun Mitsui, Hiroyuki Ishiura, Yuji Takahashi, Yaeko Ichikawa, Jun Goto, Shoji Tsuji, Shinichi Morishita
- PMID: 24215022
- PMCID: PMC3957077
- DOI: 10.1093/bioinformatics/btt647
Koichiro Doi et al. Bioinformatics. 2014.
Abstract
Motivation: Long expansions of short tandem repeats (STRs), i.e. DNA repeats of 2-6 nt, are associated with some genetic diseases. Cost-efficient high-throughput sequencing can quickly produce billions of short reads that would be useful for uncovering disease-associated STRs. However, enumerating STRs in short reads remains largely unexplored because of the difficulty in elucidating STRs much longer than 100 bp, the typical length of short reads.
Results: We propose ab initio procedures for sensing and locating long STRs promptly by using the frequency distribution of all STRs and paired-end read information. We validated the reproducibility of this method using biological replicates and used it to locate an STR associated with a brain disease (SCA31). Subsequently, we sequenced this STR site in 11 SCA31 samples using SMRT(TM) sequencing (Pacific Biosciences), determined 2.3-3.1 kb sequences at nucleotide resolution and revealed that (TGGAA)- and (TAAAATAGAA)-repeat expansions determined the instability of the repeat expansions associated with SCA31. Our method could also identify common STRs, (AAAG)- and (AAAAG)-repeat expansions, which are remarkably expanded at four positions in an SCA31 sample. This is the first proposed method for rapidly finding disease-associated long STRs in personal genomes using hybrid sequencing of short and long reads.
Availability and implementation: Our TRhist software is available at http://trhist.gi.k.u-tokyo.ac.jp/.
Contact: moris@cb.k.u-tokyo.ac.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
Figures
Fig. 1.
Sensing and locating STRs in short reads. (A) An original short read. (B) An approximate STR (AGAGGC)n (n = 6) in the short read. The central four copies of AGAGGC form an exact STR with no mutations, whereas the flanking copies contain the mutations shown in bold letters. If one of the regions (black) surrounding the STR aligns to a unique position, the STR can be located in the genome. (C) A read occupied by an approximate STR. (D) Sensing STRs from frequency distributions of (AGAGGC)n in NA12877 (father of the HapMap CEU trio), NA12878 (mother) and NA18507 (an African male). The _x_-axis shows the lengths of STR occurrences detected in a read, and the _y_-axis shows the frequency of reads containing STR occurrences of the length indicated on the _x_-axis. Note that 100-bp long STR occurrences are frequent in NA12877, whereas no STR occurrences of length >70 bp are observed in samples NA12878 and NA18507. (E) When a read is filled with an STR (red), we attempt to anchor the read at the other end of the pair (blue) unambiguously to a unique position. (F and G) An STR is located easily if its location can be sandwiched using information on paired-end reads. The length of an STR shorter than 100 bp is easily estimated (F), whereas determining the length of a much longer STR is non-trivial (G). The latter case requires third-generation sequencers, such as PacBio RS, that can read DNA fragments thousands of bases long.
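As a rough illustration of panels (B) to (D), the sketch below finds the longest exact run of a motif in each read and tallies those run lengths across reads. It is a minimal sketch under stated assumptions, not the TRhist implementation: the function names `longest_exact_str` and `length_histogram` are hypothetical, the inputs are plain strings, and only exact whole copies are counted, whereas the caption describes approximate STRs that tolerate mutations.

```python
from collections import Counter


def longest_exact_str(read, motif):
    """Return the length in bp of the longest exact tandem run of `motif` in `read`.

    Only whole, mutation-free copies are counted (an assumption of this sketch).
    """
    k = len(motif)
    if k == 0:
        raise ValueError("motif must be non-empty")
    best = 0
    # Try every start position; a run may begin at any offset in the read.
    for start in range(len(read) - k + 1):
        end = start
        while read.startswith(motif, end):
            end += k
        best = max(best, end - start)
    return best


def length_histogram(reads, motif):
    """Count reads by the length of their longest exact run of `motif`."""
    hist = Counter()
    for read in reads:
        run = longest_exact_str(read, motif)
        if run > 0:
            hist[run] += 1
    return hist
```

Because a run can never be longer than the read that contains it, any STR longer than the read length piles up in the last bin of such a histogram, which is the signal panel (D) relies on.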
Fig. 2.
Selecting positions where STR occurrences are expanded significantly. (A) We generate the frequency distribution of lengths of STR occurrences in paired-end reads. This panel shows the case of a 70-bp long STR. The histogram of the frequency distribution peaks at 70 bp. (B) When the STR is 160-bp long, the distribution has a significant peak at 100 bp, the read length. We test whether the peak is a significant outlier in the frequency distribution using the Smirnov–Grubbs’ test.
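The outlier check in panel (B) can be sketched with the standard one-sided Grubbs statistic, G = (max - mean) / s, compared against a critical value derived from Student's t distribution. This is an assumption-laden sketch: the caption does not say exactly which values enter the test, so here the per-length read counts are tested; the function names are hypothetical; and the caller must supply the t critical value (upper alpha/N point with N - 2 degrees of freedom) from a table or a statistics library.

```python
import math
import statistics


def grubbs_statistic(values):
    """One-sided Grubbs statistic for the maximum: (max - mean) / sample sd."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return 0.0  # all values identical, so there is no outlier
    return (max(values) - mean) / sd


def grubbs_critical(n, t_crit):
    """Critical value of G for n observations.

    `t_crit` is the upper alpha/n critical value of Student's t with n - 2
    degrees of freedom; it is supplied by the caller (assumption of this sketch).
    """
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))


def peak_is_outlier(counts, t_crit):
    """True if the largest count exceeds the Grubbs critical value."""
    return grubbs_statistic(counts) > grubbs_critical(len(counts), t_crit)
```

Passing the critical t value in as a parameter keeps the sketch within the standard library; in practice it would come from a t-distribution quantile function.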
Fig. 3.
Sensing expanded STRs associated with SCA31. (A) Frequencies of 100-bp STRs that have >10 occurrences in one of SCA31, NA12877, NA12878 or NA18507. For example, the arrow in the second lowest row shows that the (AAAATAGAAT) repeat is expanded only in SCA31. Our ab initio procedure analyzes this bar chart and selects STRs that are significantly abundant in the case sample (e.g., SCA31) but absent in all of the control samples. The bar chart is also useful for confirming the abundance of (AATGG) and (AACCCT) repeats, equivalent to the (GGGTTA) repeat, where the former and latter motifs are known to be enriched in centromeres and telomeres, respectively. (B) Frequency distributions of the (AAAATAGAAT) repeat. SCA31 has many 100-bp occurrences, whereas no occurrences of length >55 bp were observed in NA12877, NA12878 and NA18507
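The caption treats (AACCCT) and (GGGTTA) as the same repeat. One way to make such equivalences explicit is to map every motif to a canonical form. The sketch below assumes that two motifs are equivalent when one is a rotation of the other or of its reverse complement, and picks the lexicographically smallest candidate; this convention and the name `canonical_motif` are assumptions for illustration, not necessarily what TRhist does.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotations(motif):
    """All cyclic rotations of a motif."""
    return [motif[i:] + motif[:i] for i in range(len(motif))]


def canonical_motif(motif):
    """Smallest rotation of the motif or of its reverse complement."""
    return min(rotations(motif) + rotations(reverse_complement(motif)))
```

Under this convention `canonical_motif("GGGTTA")` and `canonical_motif("AACCCT")` both give `"AACCCT"`, so counts for either spelling fall into the same bar of a chart like panel (A).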
Fig. 4.
Locating and sequencing expanded STRs associated with SCA31. (A) A real example from SCA31. One haplotype contains a ∼2.5–3.8 kb insertion at Chr.16 66 524 303 in hg19 in an intron of BEAN1 and TK2. The right boundary of the insertion could be identified using paired-end reads with AAAATAGAAT repeats at their left ends and uniquely mapped reads at their right ends. The lower bar illustrates the reference genome (hg19) with an AAAAT repeat. (B) A form of expanded repeat associated with SCA31 samples. The values of i, j, l and m vary in the individual SCA31 samples. (C) We determined the values of i, j, l and m in 11 SCA31 samples using SMRT(TM) sequencing. This shows that ∼90% of the repeat expansion consists of (TGGAA)j and (TAAAA TAGAA)m. (D) The values of j and m are positively correlated (r = 0.70). These two values are the determinants of the instability of the repeat expansions in SCA31.
Fig. 5.
Sizes of the common STRs, (AAAG)n and (AAAAG)n, at four genomic positions in the SCA31 sample and reference genome. Note that individual STR occurrences are significantly expanded in the SCA31 sample. The PCR primers used for amplifying individual regions and the sequences of amplicons can be found in Supplementary Figure S6.