TopHat: discovering splice junctions with RNA-Seq
Cole Trapnell et al. Bioinformatics. 2009.
Abstract
Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or 'reads', can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.
Results: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20,000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development.
Availability: TopHat is free, open-source software available from http://tophat.cbcb.umd.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
Figures
Fig. 1.
The TopHat pipeline. RNA-Seq reads are mapped against the whole reference genome, and those reads that do not map are set aside. An initial consensus of mapped regions is computed by Maq. Sequences flanking potential donor/acceptor splice sites within neighboring regions are joined to form potential splice junctions. The initially unmapped (IUM) reads set aside in the first step are then indexed and aligned to these splice junction sequences.
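The junction-forming step in this caption can be sketched as follows. This is a minimal illustration, not TopHat's implementation: the function names, the half-open `(start, end)` island representation, the `flank` length, and the restriction to canonical GT donors and AG acceptors are all assumptions made for the example.

```python
from typing import List, Tuple

def find_motif(seq: str, motif: str) -> List[int]:
    """Return every 0-based start offset of motif in seq."""
    hits, i = [], seq.find(motif)
    while i != -1:
        hits.append(i)
        i = seq.find(motif, i + 1)
    return hits

def candidate_junctions(genome: str,
                        left: Tuple[int, int],
                        right: Tuple[int, int],
                        flank: int = 4) -> List[Tuple[int, int, str]]:
    """Pair each GT donor in the left island with each AG acceptor in the
    right island and join the exonic sequence flanking the implied intron.

    Islands are half-open (start, end) intervals on the genome string.
    Returns (intron_start, intron_end, junction_sequence) tuples, where the
    intron occupies genome[intron_start:intron_end].
    """
    l_start, l_end = left
    r_start, r_end = right
    donors = [l_start + i for i in find_motif(genome[l_start:l_end], "GT")]
    acceptors = [r_start + i for i in find_motif(genome[r_start:r_end], "AG")]
    out = []
    for d in donors:
        for a in acceptors:
            intron_end = a + 2  # first base of the downstream exon
            if intron_end <= d + 2:
                continue  # acceptor must lie downstream of the donor
            if d - flank < 0 or intron_end + flank > len(genome):
                continue  # not enough flanking sequence
            seq = genome[d - flank:d] + genome[intron_end:intron_end + flank]
            out.append((d, intron_end, seq))
    return out

# Toy genome (hypothetical): exon, intron starting GT and ending AG, exon.
genome = "CCATCA" + "GTTTTTTTAG" + "CTCCAA"
print(candidate_junctions(genome, (0, 10), (12, 22)))
```

Passing the same interval as both `left` and `right` would enumerate introns inside a single island, which is the situation described in Fig. 2.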
Fig. 2.
An intron entirely overlapped by the 5′-UTR of another transcript. Both isoforms are present in the brain tissue RNA sample. The top track is the normalized uniquely mappable read coverage reported by ERANGE for this region (Mortazavi et al., 2008). The lack of a large coverage gap causes TopHat to report a single island containing both exons. TopHat looks for introns within single islands in order to detect this junction.
Fig. 3.
The seed and extend alignment used to match reads to possible splice sites. For each possible splice site, a seed is formed by combining a small amount of sequence upstream of the donor and downstream of the acceptor. This seed, shown in dark gray, is used to query the index of reads that were not initially mapped by Bowtie. Any read containing the seed is checked for a complete alignment to the exons on either side of the possible splice. In the light gray portion of the alignment, TopHat allows a user-specified number of mismatches. Because reads typically contain low-quality base calls on their 3′ ends, TopHat only examines the first 28 bp on the 5′ end of each read by default.
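A rough sketch of the seed-and-extend check described in this caption is given below. It assumes a simple k-mer dictionary in place of TopHat's actual read index, and the names `build_kmer_index`, `seed_and_extend`, `seed_half`, and `max_mismatches` are hypothetical. The 28 bp restriction on the 5′ end is not modeled.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def build_kmer_index(reads: List[str], k: int) -> Dict[str, Set[Tuple[int, int]]]:
    """Map every k-mer to the (read_id, offset) pairs where it occurs."""
    index: Dict[str, Set[Tuple[int, int]]] = defaultdict(set)
    for rid, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add((rid, pos))
    return index

def seed_and_extend(reads: List[str], left_exon: str, right_exon: str,
                    seed_half: int = 2,
                    max_mismatches: int = 1) -> List[Tuple[int, int, int]]:
    """Find reads that span the junction between left_exon and right_exon.

    left_exon ends at the donor site; right_exon begins just after the
    acceptor site. Returns (read_id, bases_on_left_exon, mismatches).
    """
    # Seed: a few bases upstream of the donor plus a few downstream of the acceptor.
    seed = left_exon[-seed_half:] + right_exon[:seed_half]
    index = build_kmer_index(reads, len(seed))
    hits = []
    for rid, pos in sorted(index.get(seed, ())):
        read = reads[rid]
        left_len = pos + seed_half          # read bases on the upstream exon
        right_len = len(read) - left_len    # read bases on the downstream exon
        if left_len > len(left_exon) or right_len > len(right_exon):
            continue  # read would run off the available exon sequence
        ref = left_exon[len(left_exon) - left_len:] + right_exon[:right_len]
        # The seed matches exactly, so every mismatch lies in the extension.
        mismatches = sum(1 for a, b in zip(read, ref) if a != b)
        if mismatches <= max_mismatches:
            hits.append((rid, left_len, mismatches))
    return hits

reads = ["ATCACTCC", "ATCACTGC", "GGGGGGGG", "TTCACTGG"]
print(seed_and_extend(reads, "CCATCA", "CTCCAA"))
```

Only reads that contain the seed are ever compared base by base, which is what keeps the check cheap when most candidate junctions have no supporting reads.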
Fig. 4.
TopHat sensitivity as RPKM varies. For genes transcribed above 15.0 RPKM, TopHat detects more than 80% of the junctions reported by ERANGE in the M. musculus brain tissue study. TopHat detects more than 72% of all junctions observed by ERANGE, including those in genes expressed at only a single transcript per cell. A de novo assembly of the RNA-Seq reads, followed by spliced alignment of the assembled transcripts, produces markedly poorer sensitivity, detecting around 40% of junctions in genes transcribed above 25.0 RPKM, but comparatively few junctions in less highly transcribed genes.
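For readers unfamiliar with the unit, RPKM is reads per kilobase of exon model per million mapped reads. A small helper, with hypothetical argument names and illustrative numbers that are not taken from the study, makes the arithmetic concrete:

```python
def rpkm(reads_in_gene: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon model per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return reads_in_gene * 1e9 / (exon_length_bp * total_mapped_reads)

# Illustrative values only: 300 reads on a 2 kb transcript in a
# 10-million-read experiment.
print(rpkm(300, 2000, 10_000_000))
```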
Fig. 5.
The BLAT E-value distribution of known, previously unreported, and randomly generated splice junction sequences when searched against GenBank mouse ESTs. As expected, known junctions have high-quality BLAT hits to the EST database. Randomly generated junction sequences do not. High-quality BLAT hits for more than 11% of the previously unreported junctions identified by TopHat suggest that the UCSC gene models for mouse are incomplete. These junctions are almost certainly genuine, and because the mouse EST database is not complete, 11% is only a lower bound on the specificity of TopHat.
Fig. 6.
TopHat detects junctions in genes transcribed at very low levels. The gene Pnlip was transcribed at only 7.88 RPKM in the brain tissue according to ERANGE, and yet TopHat reports the complete known gene model.
Fig. 7.
A previously unreported splice junction detected by TopHat is shown as the topmost horizontal line. This junction skips two exons in the ADP-ribosylation gene Arfgef1. As explained in Section 2, islands of read coverage in the Bowtie mapping are extended by 45 bp on either side.
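The island extension mentioned in this caption can be illustrated with a short sketch. It assumes a per-base coverage list as input and merges islands that overlap after extension. The merging rule, the `min_cov` threshold, and the function name are assumptions for the example, not details given in the caption.

```python
from typing import List, Tuple

def coverage_islands(coverage: List[int], extend: int = 45,
                     min_cov: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) islands of covered bases, each widened
    by `extend` bp on both sides and merged where they overlap or touch."""
    raw, start = [], None
    for pos, depth in enumerate(coverage):
        if depth >= min_cov and start is None:
            start = pos                      # island opens
        elif depth < min_cov and start is not None:
            raw.append((start, pos))         # island closes
            start = None
    if start is not None:
        raw.append((start, len(coverage)))   # island runs to the end

    merged: List[Tuple[int, int]] = []
    for s, e in raw:
        s = max(0, s - extend)
        e = min(len(coverage), e + extend)
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

# Toy coverage track: two covered stretches separated by a 200 bp gap.
cov = [0] * 100 + [3] * 50 + [0] * 200 + [2] * 30 + [0] * 100
print(coverage_islands(cov))
```

Widening the islands gives the junction search some sequence beyond the observed coverage, so that splice sites lying just past the edge of a covered region can still be considered.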
Grants and funding
- R01 GM083873/GM/NIGMS NIH HHS/United States
- R01 LM006845-10/LM/NLM NIH HHS/United States
- R01 LM006845-09/LM/NLM NIH HHS/United States
- R01 LM006845/LM/NLM NIH HHS/United States
- R01-LM06845/LM/NLM NIH HHS/United States
- R01-GM083873/GM/NIGMS NIH HHS/United States
- R01 GM083873-06/GM/NIGMS NIH HHS/United States