TopHat: discovering splice junctions with RNA-Seq - PubMed

TopHat: discovering splice junctions with RNA-Seq

Cole Trapnell et al. Bioinformatics. 2009.

Abstract

Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or 'reads', can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.

Results: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20,000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development.

Availability: TopHat is free, open-source software available from http://tophat.cbcb.umd.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.

The TopHat pipeline. RNA-Seq reads are mapped against the whole reference genome, and reads that do not map are set aside as initially unmapped (IUM) reads. An initial consensus of mapped regions is computed by Maq. Sequences flanking potential donor/acceptor splice sites within neighboring regions are joined to form potential splice junctions. The IUM reads are then indexed and aligned to these splice junction sequences.
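The junction-enumeration step in this caption can be illustrated with a minimal Python sketch. It is not TopHat's implementation. It assumes canonical GT donor and AG acceptor dinucleotides on the forward strand only, islands given as sorted half-open `(start, end)` intervals, and illustrative intron-length bounds. The function names and the `min_intron`/`max_intron` parameters are hypothetical.

```python
from typing import List, Tuple

def find_motif(seq: str, motif: str, start: int, end: int) -> List[int]:
    """Return every position p in [start, end) at which motif begins."""
    hits = []
    p = seq.find(motif, start)
    while p != -1 and p < end:
        hits.append(p)
        p = seq.find(motif, p + 1)
    return hits

def candidate_junctions(
    genome: str,
    islands: List[Tuple[int, int]],
    min_intron: int = 70,       # illustrative bound, an assumption
    max_intron: int = 20000,    # illustrative bound, an assumption
) -> List[Tuple[int, int]]:
    """Pair GT donors in one island with AG acceptors in later islands.

    Each result (d, a) describes a candidate intron genome[d:a], which
    starts with GT and ends with AG. Islands must be sorted by start.
    """
    junctions = []
    for i, (_, e1) in enumerate(islands):
        donors = find_motif(genome, "GT", islands[i][0], e1)
        for s2, e2 in islands[i + 1:]:
            if s2 - e1 > max_intron:
                break  # later islands are even farther away
            for d in donors:
                for a in find_motif(genome, "AG", s2, e2):
                    intron_len = a + 2 - d
                    if min_intron <= intron_len <= max_intron:
                        junctions.append((d, a + 2))
    return junctions
```

Each candidate pair would then be turned into a junction sequence by joining the exon sequence just upstream of the donor to the sequence just downstream of the acceptor, which is what the IUM reads are aligned against.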

Fig. 2.

An intron entirely overlapped by the 5′-UTR of another transcript. Both isoforms are present in the brain tissue RNA sample. The top track is the normalized uniquely mappable read coverage reported by ERANGE for this region (Mortazavi et al., 2008). The lack of a large coverage gap causes TopHat to report a single island containing both exons. TopHat looks for introns within single islands in order to detect this junction.

Fig. 3.

The seed and extend alignment used to match reads to possible splice sites. For each possible splice site, a seed is formed by combining a small amount of sequence upstream of the donor and downstream of the acceptor. This seed, shown in dark gray, is used to query the index of reads that were not initially mapped by Bowtie. Any read containing the seed is checked for a complete alignment to the exons on either side of the possible splice. In the light gray portion of the alignment, TopHat allows a user-specified number of mismatches. Because reads typically contain low-quality base calls on their 3′ ends, TopHat only examines the first 28 bp on the 5′ end of each read by default.
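A minimal Python sketch of the seed-and-extend idea described in this caption follows. It is an illustration under stated assumptions, not TopHat's code: the index is a plain dictionary of k-mers rather than TopHat's actual read index, the seed takes `half_seed` bases from each side of the junction, and mismatches are counted as Hamming distance over the examined prefix. All function names and the `half_seed` default are hypothetical. The 28 bp prefix comes from the caption.

```python
from collections import defaultdict

def build_seed_index(reads, seed_len, check_len=28):
    """Map each k-mer in the first check_len bases of a read to (read_id, offset)."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        prefix = read[:check_len]
        for off in range(len(prefix) - seed_len + 1):
            index[prefix[off:off + seed_len]].append((rid, off))
    return index

def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def match_junction(reads, index, left_exon, right_exon,
                   half_seed=4, max_mismatches=2, check_len=28):
    """Return (read_id, split, mismatches) for reads spanning the junction.

    left_exon ends at the donor and right_exon starts just after the
    acceptor. The index must have been built with seed_len == 2 * half_seed
    and the same check_len.
    """
    seed = left_exon[-half_seed:] + right_exon[:half_seed]
    hits = []
    for rid, off in index.get(seed, []):
        prefix = reads[rid][:check_len]
        split = off + half_seed          # junction position within the read
        left_part, right_part = prefix[:split], prefix[split:]
        if len(left_part) > len(left_exon) or len(right_part) > len(right_exon):
            continue                     # read runs past the supplied flanks
        mismatches = (
            hamming(left_part, left_exon[len(left_exon) - len(left_part):])
            + hamming(right_part, right_exon[:len(right_part)])
        )
        if mismatches <= max_mismatches:
            hits.append((rid, split, mismatches))
    return hits
```

The seed must match exactly because it is looked up directly in the index. The mismatch budget therefore applies only to the bases outside the seed, which corresponds to the light gray portion of the alignment.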

Fig. 4.

TopHat sensitivity as RPKM varies. For genes transcribed above 15.0 RPKM, TopHat detects more than 80% of the junctions reported by ERANGE in the M. musculus brain tissue study. TopHat detects more than 72% of all junctions observed by ERANGE, including those in genes expressed at only a single transcript per cell. A de novo assembly of the RNA-Seq reads, followed by spliced alignment of the assembled transcripts, produces markedly poorer sensitivity, detecting around 40% of junctions in genes transcribed above 25.0 RPKM, but comparatively few junctions in more highly transcribed genes.
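For readers unfamiliar with the quantities in this caption, the Python sketch below shows how RPKM (reads per kilobase of transcript per million mapped reads) and sensitivity above an RPKM threshold could be computed. The RPKM formula is the standard definition. The input layout (a list of tuples holding a gene's RPKM, its reference junctions, and its detected junctions) and the function names are assumptions made for illustration, not the paper's evaluation code.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

def sensitivity_above(genes, threshold):
    """Fraction of reference junctions recovered in genes above threshold.

    genes: iterable of (rpkm_value, reference_junctions, detected_junctions),
    where the junction collections are sets (a hypothetical layout).
    """
    reference_total = 0
    recovered = 0
    for value, reference, detected in genes:
        if value > threshold:
            reference_total += len(reference)
            recovered += len(reference & detected)
    if reference_total == 0:
        return float("nan")  # no genes above the threshold
    return recovered / reference_total
```

For example, a 2,000 bp gene with 500 mapped reads in a run of 10 million mapped reads has an RPKM of 25.0.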

Fig. 5.

The BLAT E-value distribution of known, previously unreported, and randomly generated splice junction sequences when searched against GenBank mouse ESTs. As expected, known junctions have high-quality BLAT hits to the EST database, whereas randomly generated junction sequences do not. High-quality BLAT hits for more than 11% of the junctions identified by TopHat suggest that the UCSC gene models for mouse are incomplete. These junctions are almost certainly genuine, and because the mouse EST database is not complete, 11% is only a lower bound on the specificity of TopHat.

Fig. 6.

TopHat detects junctions in genes transcribed at very low levels. The gene Pnlip was transcribed at only 7.88 RPKM in the brain tissue according to ERANGE, and yet TopHat reports the complete known gene model.

Fig. 7.

A previously unreported splice junction detected by TopHat is shown as the topmost horizontal line. This junction skips two exons in the ADP-ribosylation gene Arfgef1. As explained in Section 2, islands of read coverage in the Bowtie mapping are extended by 45 bp on either side.
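The island extension mentioned in this caption can be sketched in Python as follows. This is only an illustration: it assumes per-base coverage is available as a list of depths, treats any position with nonzero depth as covered, and merges islands that overlap after extension. The merge rule and the function name are assumptions. Only the 45 bp extension comes from the caption.

```python
def coverage_islands(coverage, extend=45):
    """Find runs of nonzero coverage and extend each by `extend` bp per side.

    Returns sorted half-open (start, end) intervals. Islands that overlap
    after extension are merged (an assumption made for this sketch).
    """
    raw = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth > 0 and start is None:
            start = pos                      # island opens
        elif depth == 0 and start is not None:
            raw.append((start, pos))         # island closes
            start = None
    if start is not None:
        raw.append((start, len(coverage)))   # island runs to the end

    merged = []
    for s, e in raw:
        s = max(0, s - extend)
        e = min(len(coverage), e + extend)
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged
```

Extending islands in this way lets splice sites that sit just past the edge of the observed coverage fall inside the region that is searched for donors and acceptors.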

Cited by

Novornabreak: Local Assembly for Novel Splice Junction and Fusion Transcript Detection from RNA-Seq Data.
Tan Y, Mohanty V, Liang S, Dou J, Ma J, Kim KH, Bonder MJ, Shi X, Lee C; Human Genome Structural Variation Consortium; Chong Z, Chen K. J Bioinform Syst Biol. 2023;6(2):74-81. doi: 10.26502/jbsb.5107050. Epub 2023 Apr 4. PMID: 39301431 Free PMC article.
Recruitment of the m6A/m6Am demethylase FTO to target RNAs by the telomeric zinc finger protein ZBTB48.
Nabeel-Shah S, Pu S, Burke GL, Ahmed N, Braunschweig U, Farhangmehr S, Lee H, Wu M, Ni Z, Tang H, Zhong G, Marcon E, Zhang Z, Blencowe BJ, Greenblatt JF. Genome Biol. 2024 Sep 19;25(1):246. doi: 10.1186/s13059-024-03392-7. PMID: 39300486 Free PMC article.
Splam: a deep-learning-based splice site predictor that improves spliced alignments.
Chao KH, Mao A, Salzberg SL, Pertea M. Genome Biol. 2024 Sep 16;25(1):243. doi: 10.1186/s13059-024-03379-4. PMID: 39285451 Free PMC article.
Heat Stress Induces Alterations in Gene Expression of Actin Cytoskeleton and Filament of Cellular Components Causing Gut Disruption in Growing-Finishing Pigs.
Choi Y, Park H, Kim J, Lee H, Kim M. Animals (Basel). 2024 Aug 26;14(17):2476. doi: 10.3390/ani14172476. PMID: 39272260 Free PMC article.
Transcriptome analysis of the effect of HERV-K env gene knockout in ovarian cancer cell lines.
Ko EJ, Suh DS, Kim H, Lee JY, Eo WK, Kim H, Kim KH, Cha HJ. Genes Genomics. 2024 Sep 13. doi: 10.1007/s13258-024-01544-4. Online ahead of print. PMID: 39271536

References

Abouelhoda M, et al. Replacing suffix trees with enhanced suffix arrays. J. Discrete Alg. 2004;2:53–86.
Adams MD, et al. Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat. Genet. 1993;4:373–380.
Burrows M, Wheeler D. A block sorting lossless data compression algorithm. Technical Report 124. Palo Alto, California: DEC, Digital Systems Research Center; 1994.
Cloonan N, et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat. Meth. 2008;5:613–619.
De Bona F, et al. Optimal spliced alignments of short sequence reads. Bioinformatics. 2008;24:i174–i180.
