TopHat-Fusion: an algorithm for discovery of novel fusion transcripts

Daehwan Kim et al. Genome Biol. 2011.

Abstract

TopHat-Fusion is an algorithm designed to discover transcripts representing fusion gene products, which result from the breakage and re-joining of two different chromosomes, or from rearrangements within a chromosome. TopHat-Fusion is an enhanced version of TopHat, an efficient program that aligns RNA-seq reads without relying on existing annotation. Because it is independent of gene annotation, TopHat-Fusion can discover fusion products deriving from known genes, unknown genes and unannotated splice variants of known genes. Using RNA-seq data from breast and prostate cancer cell lines, we detected both previously reported and novel fusions with solid supporting evidence. TopHat-Fusion is available at http://tophat-fusion.sourceforge.net/.

Figures

Figure 1

Read distributions around two fusions: BCAS4-BCAS3 and TOB1-SYNRG. (a) Sixty reads aligned by TopHat-Fusion that identify a fusion product formed by the BCAS4 gene on chromosome 20 and the BCAS3 gene on chromosome 17. The data contained more reads than shown; they are collapsed to illustrate how well they are distributed. The inset figures show the coverage depth in 600-bp windows around each fusion. (b) TOB1 (ENSG00000141232)-SYNRG is a novel fusion gene found by TopHat-Fusion, shown here with 70 reads mapping across the fusion point. Note that some of the reads in green span an intron (indicated by thin horizontal lines extending to the right), a feature that can be detected by TopHat's spliced alignment procedure.

Figure 2

TopHat-Fusion pipeline. TopHat-Fusion consists of two main modules: (1) finding candidate fusions and aligning reads across them; and (2) filtering out false fusions using a series of post-processing routines.

Figure 3

Aligning a read that spans a fusion point. (a) An initially unmapped read of 75 bp is split into three segments of 25 bp, each of which is mapped separately. As shown here, the left (red) and right (blue) segments are mapped to two different chromosomes, i and j. (b) The unmapped green segment is used to find the precise fusion point between i and j. This is done by aligning the green segment to the sequences just to the right of the red segment on chromosome i and just to the left of the blue segment on chromosome j.
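The segment-based search in this caption can be illustrated with a minimal Python sketch. It is not TopHat-Fusion's implementation: the function names, the Hamming-distance criterion, and the assumption that both flanking sequences have the same length as the middle segment are all hypothetical simplifications chosen for illustration.

```python
def split_read(read, seg_len=25):
    """Split a read into consecutive segments of seg_len bases."""
    return [read[i:i + seg_len] for i in range(0, len(read), seg_len)]


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_fusion_offset(middle, after_left, before_right):
    """Locate the fusion point inside the unmapped middle segment.

    after_left   : bases just to the right of the left segment on chromosome i
    before_right : bases just to the left of the right segment on chromosome j
    Both flanks are assumed (for this sketch) to be as long as `middle`.

    Returns (k, mismatches): the first k bases of `middle` are taken from
    chromosome i and the remaining bases from chromosome j. Ties are
    resolved in favor of the smallest k.
    """
    n = len(middle)
    assert len(after_left) == n and len(before_right) == n
    best = None
    for k in range(n + 1):
        mism = (hamming(middle[:k], after_left[:k])
                + hamming(middle[k:], before_right[k:]))
        if best is None or mism < best[1]:
            best = (k, mism)
    return best
```

For a 75-bp read, `split_read` yields the three 25-bp segments; once the outer two are mapped, `find_fusion_offset` scans every possible breakpoint in the middle segment and keeps the one with the fewest mismatches.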

Figure 4

Mapping against fusion points and selecting best read alignments. (a) Bowtie is used to align all segments from the initially unmapped (IUM) reads against spliced fusion contigs, shown in gray on the right. For example, the brown read on the top left aligns to the first spliced fusion contig on the top right. (b) IUM reads 1 and 2 each have multiple alignments. Read 1 has a gap-free alignment, shown in dark blue, which is preferred over the other two alignments shown in lighter shades of blue. The gap-free alignment with three mismatches is preferred over the fusion alignment with one mismatch. If all alignments have gaps and mismatches, then the algorithm prefers those with fewer mismatches, as shown by the dark green alignment for IUM read 2. Full details of the scoring function that determines these preferences are described in the Materials and methods.
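The preference order described in this caption can be sketched as a sort key. This is only a simplified illustration: the actual scoring function is given in the paper's Materials and methods, and the `Alignment` class, its fields, and `pick_best` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    label: str
    has_gap: bool    # True for spliced or fusion alignments (assumed meaning)
    mismatches: int


def pick_best(alignments):
    """Prefer gap-free alignments; among equals, prefer fewer mismatches."""
    # False sorts before True, so gap-free alignments win regardless of
    # their mismatch count, matching the example for IUM read 1.
    return min(alignments, key=lambda a: (a.has_gap, a.mismatches))


read1 = [
    Alignment("gap-free", has_gap=False, mismatches=3),
    Alignment("fusion", has_gap=True, mismatches=1),
]
print(pick_best(read1).label)
```

With this key, a gap-free alignment with three mismatches still outranks a fusion alignment with one mismatch, and when every candidate has a gap the choice falls to the mismatch count alone.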

Figure 5

Supporting and contradicting evidence for fusion transcripts. (a) Given a fusion point and the chromosomes (gray) spanning it, single-end and paired-end reads (blue) support the fusion. Other reads (red) contradict the fusion by mapping entirely to either of the two chromosomes. (b) TopHat-Fusion prefers reads that uniformly cover a 600-bp window centered on any fusion point. On the upper left, blue reads cover the entire window. On the lower left, red reads cover only a narrow window around the fusion. On the lower right, reads do not cover part of the 600-bp window. The cases shown in orange will be rejected by TopHat-Fusion.
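One way to picture the window-coverage idea in panel (b) is to measure what fraction of the 600-bp window is touched by at least one read. The sketch below is an assumption-laden illustration, not the filter TopHat-Fusion applies: the function name, the coordinate convention (half-open intervals relative to a fusion point at position 0), and the use of a simple covered fraction are all hypothetical.

```python
def covered_fraction(read_intervals, window=600):
    """Fraction of a window centered on the fusion point covered by reads.

    read_intervals: iterable of (start, end) half-open intervals, with
    coordinates relative to the fusion point at position 0.
    """
    half = window // 2
    covered = [False] * window
    for start, end in read_intervals:
        # Clip each read to the window before marking covered positions.
        lo, hi = max(start, -half), min(end, half)
        for pos in range(lo, hi):
            covered[pos + half] = True
    return sum(covered) / window
```

Under this measure, reads tiling the whole window give a fraction of 1.0, whereas reads piled up in a narrow band around the fusion point, or leaving one side of the window empty, give a low fraction and would be candidates for rejection.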

Figure 6

TopHat-Fusion's scoring scheme for read distributions. The scheme measures how well reads are distributed around a fusion point; the resulting scores are used to sort the list of candidate fusions. Variables are defined in the main text.
