Full-length transcriptome assembly from RNA-Seq data without a reference genome
Nat Biotechnol. 2011 May 15;29(7):644-52.
doi: 10.1038/nbt.1883.
Manfred G Grabherr, Brian J Haas, Moran Yassour, Joshua Z Levin, Dawn A Thompson, Ido Amit, Xian Adiconis, Lin Fan, Raktima Raychowdhury, Qiandong Zeng, Zehua Chen, Evan Mauceli, Nir Hacohen, Andreas Gnirke, Nicholas Rhind, Federica di Palma, Bruce W Birren, Chad Nusbaum, Kerstin Lindblad-Toh, Nir Friedman, Aviv Regev
- PMID: 21572440
- PMCID: PMC3571712
- DOI: 10.1038/nbt.1883
Manfred G Grabherr et al. Nat Biotechnol. 2011.
Abstract
Massively parallel sequencing of cDNA has enabled deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here we present the Trinity method for de novo assembly of full-length transcripts and evaluate it on samples from fission yeast, mouse and whitefly, whose reference genome is not yet available. By efficiently constructing and analyzing sets of de Bruijn graphs, Trinity fully reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. Compared with other de novo transcriptome assemblers, Trinity recovers more full-length transcripts across a broad range of expression levels, with a sensitivity similar to methods that rely on genome alignments. Our approach provides a unified solution for transcriptome reconstruction in any sample, especially in the absence of a reference genome.
Conflict of interest statement
The authors declare no competing financial interest.
Figures
Figure 1. Overview of Trinity
(a) Inchworm assembles the read data set (short black line, top) by greedily searching for paths in a k-mer graph (middle), resulting in a collection of linear contigs (color lines, bottom), with each k-mer present only once in the contigs. (b) Chrysalis pools contigs if they share at least one k-1-mer and reads span the join, and builds individual de Bruijn graphs from each pool (colored lines). (c) Butterfly takes each de Bruijn graph from Chrysalis (top), and trims spurious edges and compacts linear paths (middle). It then reconciles the graph with reads (dashed colored arrows, bottom) and pairs (not shown), and outputs one linear sequence for each splice form and/or paralogous transcript reflected in the graph (bottom, colored sequences).
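The greedy contig construction described for Inchworm in panel (a) can be illustrated with a minimal sketch. This is not Trinity's implementation: the function names `count_kmers` and `greedy_contigs` are hypothetical, and the sketch assumes error-free, single-stranded reads, ignoring reverse complements, error filtering and tie-breaking rules that a real assembler needs. It only shows the core idea of seeding from the most abundant k-mer, extending in both directions through overlapping k-mers, and using each k-mer at most once.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def greedy_contigs(reads, k):
    """Greedily build linear contigs; each k-mer is used at most once."""
    if k < 2:
        raise ValueError("k must be at least 2")
    counts = count_kmers(reads, k)
    used = set()
    contigs = []
    # Seed from the most abundant k-mer that has not been used yet.
    for seed, _ in counts.most_common():
        if seed in used:
            continue
        used.add(seed)
        contig = seed
        # Extend to the right: pick the most abundant unused k-mer
        # that overlaps the last k-1 bases of the contig.
        while True:
            suffix = contig[-(k - 1):]
            candidates = [suffix + b for b in "ACGT"
                          if suffix + b in counts and suffix + b not in used]
            if not candidates:
                break
            best = max(candidates, key=lambda kmer: counts[kmer])
            used.add(best)
            contig += best[-1]
        # Extend to the left in the same way, using the first k-1 bases.
        while True:
            prefix = contig[:k - 1]
            candidates = [b + prefix for b in "ACGT"
                          if b + prefix in counts and b + prefix not in used]
            if not candidates:
                break
            best = max(candidates, key=lambda kmer: counts[kmer])
            used.add(best)
            contig = best[0] + contig
        contigs.append(contig)
    return contigs


if __name__ == "__main__":
    # Two overlapping toy reads from one made-up transcript.
    print(greedy_contigs(["ATGGCGT", "GCGTACC"], 4))
```

Because every k-mer is consumed once, a branch in the k-mer graph (for example at an alternative splice junction) ends one contig and leaves the other arm for a later seed, which is why a subsequent pooling step such as Chrysalis is needed to reunite contigs that share a (k-1)-mer.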
Figure 2. Trinity correctly reconstructs the majority of full-length transcripts in fission yeast and mouse
(a,c) Shown is the fraction of Oracle genes fully reconstructed in different expression quintiles (5% increments) in fission yeast (50M pairs assembly) (a) and the fraction of Oracle genes with at least one transcript fully reconstructed in different expression quintiles in mouse (53M pairs assembly) (c). Each bar represents a 5% quintile of read coverage for genes expressed. Bar height is the fraction of annotated genes in that quintile and among the Oracle set (grey) or the subset of the Oracle set that are fully reconstructed by Trinity (blue). For example, ~36% of the S. pombe transcripts at the bottom 5% of expression levels are fully reconstructed by Trinity; ~45% of the transcripts in this quintile are in the Oracle set. (b, d) Shown are the median values for coverage by length of reference transcripts by the longest corresponding Trinity-assembled transcript, according to expression quintiles in yeast (b) and mouse (d), depending on the number of read pairs that went into each assembly.
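The per-bin fractions plotted in panels (a) and (c) can be sketched as follows. This is an illustration only: the function name `fraction_reconstructed_by_bin`, the input layout and the use of equal-count bins by expression rank are assumptions, since the caption does not spell out how the 5% bins were computed.

```python
def fraction_reconstructed_by_bin(genes, n_bins=20):
    """Return the fraction of fully reconstructed genes per expression bin.

    genes:  list of (expression_level, fully_reconstructed) pairs
            (hypothetical input format).
    n_bins: number of equal-count bins; 20 bins gives 5% increments.
    Bins that receive no genes are reported as None.
    """
    ranked = sorted(genes, key=lambda g: g[0])  # lowest expression first
    n = len(ranked)
    fractions = []
    for b in range(n_bins):
        members = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        if not members:
            fractions.append(None)
            continue
        reconstructed = sum(1 for _, ok in members if ok)
        fractions.append(reconstructed / len(members))
    return fractions


if __name__ == "__main__":
    # Toy data: 20 genes, the 10 most highly expressed are reconstructed.
    toy = [(level, level >= 10) for level in range(20)]
    print(fraction_reconstructed_by_bin(toy, n_bins=4))
```

The same function applied to an Oracle-membership flag instead of the reconstruction flag would give the grey bars, so the two series share one denominator (all annotated genes in the bin).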
Figure 3. Trinity improves the yeast annotation
Shown are examples of Trinity assemblies (red) along with the corresponding annotated transcripts (blue) and underlying reads (grey) all aligned to the S. pombe genome (read alignment shown for graphical clarity; no alignments were used to generate the assemblies). (a) Trinity identifies a new multi-exonic transcript (left) and extends the 5′ and 3′ UTRs of the Coq9 gene (right). (b) Trinity extends the UTRs of two convergently transcribed and overlapping genes.
Figure 4. Trinity resolves closely paralogous genes
(a) Shown is the compacted component graph for two paralogous mouse genes, Ddx19a and Ddx19b (93% identity), highlighting the two paths (red and blue) chosen by Trinity out of the 64 possible paths in this portion alone. (b) Shown are the alignments between the transcripts represented by the red and blue paths in (a) and the paralogous genes Ddx19a and Ddx19b relative to the mouse reference genome (genome alignment shown for graphical clarity only; no alignments were used to generate the assemblies).
Figure 5. Comparison of Trinity to other mapping-first and assembly-first methods
(a,b) Evaluation based on the number of full-length annotated transcripts reconstructed by each method in S. pombe (50M read pair assemblies) (a) and mouse (53M read pair assemblies) (b). Shown is the number of genes reconstructed in full length (blue) or as fusions of two full-length genes (green, yeast only) and the number of full-length reconstructed transcript isoforms (red, mouse only) in each of four ‘assembly first’ (de novo) and two ‘mapping first’ approaches. (c,d) Evaluation based on the number of introns defined by the transcripts from each method for S. pombe (c) and mouse (d). Shown is the number of distinct introns consistent with the reference annotation (y axis) versus the number of uniquely predicted introns (x axis), based on mapping to the genome of the transcripts reconstructed by each of Trinity (red), Trans-ABySS (yellow), ABySS (blue), SOAPdenovo (green), Scripture (purple) and Cufflinks (grey). (e,f) Evaluation based on the number of splicing patterns (complete sets of introns in multi-intronic transcripts) defined by the transcripts from each method for S. pombe (e) and mouse (f). Shown are the numbers of distinct splicing patterns (y axis) consistent with the reference annotation versus the number of unique splicing patterns (x axis), for each method (methods are colored as above).
Figure 6. Trinity reconstructs polymorphic transcripts in whitefly
(a) Allelic variation evident from mapping RNA-Seq reads to a Trinity-reconstructed full-length whitefly transcript. Top: Shown is a single transcript (top, red bar), orthologous to the D. melanogaster Lamin gene, determined by grouping of allelic variant transcripts generated by Trinity. SNPs: yellow bars. Middle: Cumulative read coverage along the transcripts; colored bars: SNPs; bar height: relative proportions of SNP variants. Blue: C, red: T, orange: G, green: A. Bottom: Individual read coverage. (b) Example of two alternatively spliced transcripts resolved even in the absence of a reference genome. Top: Shown are two isoforms of an ELAV-like gene (top) reconstructed by Trinity (grey boxes, alternative exons). Exon structure is determined for visualization by the D. melanogaster ortholog. Bottom: Shown is the protein sequence alignment of the two whitefly isoforms to orthologous proteins from other insects, confirming the splice variants (grey boxes). (c) Comparison of performance in de novo assembly of the whitefly transcriptome. For each of the methods, shown is the number of unique top-matching (blastx) UniRef90 protein sequences aligned across the corresponding minimum percent protein length value at >= 80% (blue), >= 90% (green), >= 95% (orange) and 100% (red).
Comment in
- RNA-Seq unleashed.
Iyer MK, Chinnaiyan AM. Nat Biotechnol. 2011 Jul 11;29(7):599-600. doi: 10.1038/nbt.1915. PMID: 21747384. No abstract available.
Similar articles
- De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis.
Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, Bowden J, Couger MB, Eccles D, Li B, Lieber M, MacManes MD, Ott M, Orvis J, Pochet N, Strozzi F, Weeks N, Westerman R, William T, Dewey CN, Henschel R, LeDuc RD, Friedman N, Regev A. Nat Protoc. 2013 Aug;8(8):1494-512. doi: 10.1038/nprot.2013.084. Epub 2013 Jul 11. PMID: 23845962. Free PMC article.
- Next-generation transcriptome assembly.
Martin JA, Wang Z. Nat Rev Genet. 2011 Sep 7;12(10):671-82. doi: 10.1038/nrg3068. PMID: 21897427. Review.
- Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study.
Zhao QY, Wang Y, Kong YM, Luo D, Li X, Hao P. BMC Bioinformatics. 2011 Dec 14;12 Suppl 14(Suppl 14):S2. doi: 10.1186/1471-2105-12-S14-S2. PMID: 22373417. Free PMC article.
- RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.
Li B, Dewey CN. BMC Bioinformatics. 2011 Aug 4;12:323. doi: 10.1186/1471-2105-12-323. PMID: 21816040. Free PMC article.
- Characterizing and annotating the genome using RNA-seq data.
Chen G, Shi T, Shi L. Sci China Life Sci. 2017 Feb;60(2):116-125. doi: 10.1007/s11427-015-0349-4. Epub 2016 Jun 13. PMID: 27294835. Review.
Cited by
- Integrated transcriptome and endogenous hormone analyses reveal the factors affecting the yield of Camellia oleifera.
Zhu Y, Huo D, Zhang M, Wang G, Xiao F, Xu J, Li F, Zeng Q, Wei Y, Xu J. BMC Genomics. 2024 Sep 20;25(1):887. doi: 10.1186/s12864-024-10795-0. PMID: 39304819. Free PMC article.
- Combinatorial Wnt signaling landscape during brachiopod anteroposterior patterning.
Vellutini BC, Martín-Durán JM, Børve A, Hejnol A. BMC Biol. 2024 Sep 19;22(1):212. doi: 10.1186/s12915-024-01988-w. PMID: 39300453. Free PMC article.
- Chromosome-level genome assembly of cotton thrips Thrips tabaci (Thysanoptera: Thripidae).
Gao Y, Ji J, Xu C, Wang L, Zhang K, Li D, Wang X, Xin M, Hua H, Chen L, Gao X, Zhu X, Cui J, Luo J. Sci Data. 2024 Sep 16;11(1):1003. doi: 10.1038/s41597-024-03737-8. PMID: 39294155. Free PMC article.
- Decoupling of strain- and intrastrain-level interactions of microbiomes in a sponge holobiont.
Wang W, Song W, Majzoub ME, Feng X, Xu B, Tao J, Zhu Y, Li Z, Qian PY, Webster NS, Thomas T, Fan L. Nat Commun. 2024 Sep 18;15(1):8205. doi: 10.1038/s41467-024-52464-6. PMID: 39294150. Free PMC article.
- Haplotype-resolved genome assembly of the upas tree (Antiaris toxicaria).
Miao K, Wang Y, Hou L, Liu Y, Liu H, Ji Y. Sci Data. 2024 Sep 18;11(1):1011. doi: 10.1038/s41597-024-03860-6. PMID: 39294147. Free PMC article.
Grants and funding
- U54 HG003067/HG/NHGRI NIH HHS/United States
- DP1 OD003958-03/OD/NIH HHS/United States
- R01 GM069957/GM/NIGMS NIH HHS/United States
- 1 U54 HG03067/HG/NHGRI NIH HHS/United States
- HHSN27220090018C/PHS HHS/United States
- DP1 OD003958/OD/NIH HHS/United States
- U54 HG003067-06/HG/NHGRI NIH HHS/United States
- HHMI_/Howard Hughes Medical Institute/United States