TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions

Daehwan Kim et al. Genome Biol. 2013.

Abstract

TopHat is a popular spliced aligner for RNA-sequence (RNA-seq) experiments. In this paper, we describe TopHat2, which incorporates many significant enhancements to TopHat. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which can occur after genomic translocations. TopHat2 combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes. TopHat2 is available at http://ccb.jhu.edu/software/tophat.

Figures

Figure 1

Two possible incorrect alignments of spliced reads. 1) A read extending a few bases into the flanking exon can be aligned to the intron instead of the exon. 2) A read spanning multiple exons from genes with processed pseudogene copies can be aligned to the pseudogene copies instead of the gene from which it originates.
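The first failure mode in this caption can be illustrated with a toy sketch. The sequences, the 3-base overhang, and the use of Hamming distance as a stand-in for an aligner's mismatch count are all assumptions made for illustration; this is not TopHat2's algorithm. It only shows why a read that reaches a few bases past a splice junction can look acceptable as an unspliced alignment running into the intron.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy gene: two exons separated by one intron (GT...AG).
exon1 = "ACGTTGCAAGGCTTAACGGA"
intron = "GTAAGTCCCTTTTCTTGCAG"
exon2 = "CTGATCGGATACCATGCAAT"
genome = exon1 + intron + exon2

# A read covering the last 12 bases of exon 1 plus a short overhang into exon 2.
anchor_len = 12
overhang = 3
read = exon1[-anchor_len:] + exon2[:overhang]

# Unspliced placement: the read runs straight from exon 1 into the intron.
start = len(exon1) - anchor_len
unspliced_target = genome[start:start + len(read)]
unspliced_mismatches = hamming(read, unspliced_target)

# Spliced placement: the overhang is taken from the start of exon 2 instead.
exon2_start = len(exon1) + len(intron)
spliced_target = genome[start:len(exon1)] + genome[exon2_start:exon2_start + overhang]
spliced_mismatches = hamming(read, spliced_target)

print(f"unspliced mismatches: {unspliced_mismatches}")
print(f"spliced mismatches:   {spliced_mismatches}")
```

Because the overhang is short, the unspliced placement can never cost more mismatches than the overhang length, so an aligner that tolerates a few mismatches may report it in place of the correct spliced alignment.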

Figure 2

The number of read alignments from TopHat2, GSNAP, RUM, MapSplice, and STAR. The RNA-seq reads are from Chen et al. [11]. TopHat2 was run with and without realignment (realignment edit distance of 0). TopHat2, GSNAP, and STAR were run in both de novo and gene-mapping modes, while MapSplice was run only in de novo mode and RUM was run only in gene-mapping mode. The number of alignments at each edit distance is cumulative; for instance, the number of alignments at an edit distance of 2 includes all the alignments with edit distance of 0, 1, or 2.
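The cumulative counting described in this caption can be sketched as follows. The function name and the example edit distances are hypothetical and are not taken from the paper's data; the sketch only shows how a per-distance tally turns into the cumulative totals that the figure plots.

```python
from collections import Counter
from itertools import accumulate


def cumulative_alignment_counts(edit_distances, max_distance):
    """Return, for each d in 0..max_distance, the number of alignments
    whose edit distance is less than or equal to d."""
    per_distance = Counter(edit_distances)
    counts = [per_distance.get(d, 0) for d in range(max_distance + 1)]
    return list(accumulate(counts))


# Hypothetical edit distances, one per aligned read.
example = [0, 0, 1, 2, 2, 2, 3]
cumulative = cumulative_alignment_counts(example, max_distance=3)

for distance, total in enumerate(cumulative):
    print(f"edit distance <= {distance}: {total} alignments")
```

With these made-up values, the total at an edit distance of 2 counts every read with distance 0, 1, or 2, matching the convention stated in the caption.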

Figure 3

The number of spliced-read alignments from TopHat2, GSNAP, RUM, MapSplice, and STAR. The RNA-seq reads are from Chen et al. [11]. TopHat2, GSNAP, and STAR were run in both de novo and gene-mapping modes while MapSplice was run only in de novo mode and RUM was run only in gene-mapping mode. For each mapping mode, the two panels on the left show the number of spliced alignments whose splice sites were found in the gene annotations, and the two panels on the right show the number of all spliced alignments including novel splice sites.

Figure 4

The number of read and spliced-read alignments from TopHat2, using different realignment edit distances and no realignment. Realignment edit distances of 0, 1, and 2 were used. As TopHat2 realigns more reads (moving from no realignment to a realignment edit distance of 2, then 1, then 0), the numbers of read alignments and spliced-read alignments both increase. The differences in the number of read alignments between TopHat2 runs with different realignment edit distances are therefore mostly explained by the increase in the number of spliced-read alignments.

Figure 5

The number of spliced-read alignments from TopHat2, GSNAP, STAR, and MapSplice without using gene annotation. The number of read alignments whose splice sites were found in the gene annotations is shown in brown, and the number of all spliced-read alignments, including those with novel splice sites, is shown in green.

Figure 6

TopHat2 pipeline. Details are given in the main text.

References

    1. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621–628. doi: 10.1038/nmeth.1226.
    1. Pei B, Sisu C, Frankish A, Howald C, Habegger L, Mu XJ, Harte R, Balasubramanian S, Tanzer A, Diekhans M, Reymond A, Hubbard TJ, Harrow J, Gerstein MB. The GENCODE pseudogene resource. Genome Biol. 2012;13:R51. doi: 10.1186/gb-2012-13-9-r51.
    1. Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 2011;12:R22. doi: 10.1186/gb-2011-12-3-r22.
    1. Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105–1111. doi: 10.1093/bioinformatics/btp120.
    1. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010;26:873–881. doi: 10.1093/bioinformatics/btq057.
