Comparative Study
Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study
Qiong-Yi Zhao et al. BMC Bioinformatics. 2011.
Abstract
Background: With rapid advances in next-generation sequencing technology, high-throughput RNA sequencing has emerged as a powerful and cost-effective approach to transcriptome studies. De novo assembly of transcripts provides an important solution for transcriptome analysis in organisms with no reference genome. However, it was poorly understood how the different variables affected assembly outcomes, and there was no consensus on how to approach an optimal solution by selecting a software tool and a suitable strategy based on the properties of the RNA-Seq data.
Results: To assess the performance of different programs for transcriptome assembly, this work analyzed several important factors, including k-mer values, genome complexity, coverage depth and directional reads. Seven program conditions were tested: four single k-mer assemblers (SK: SOAPdenovo, ABySS, Oases and Trinity) and three multiple k-mer methods (MK: SOAPdenovo-MK, trans-ABySS and Oases-MK). While small and large k-mer values performed better for reconstructing lowly and highly expressed transcripts, respectively, the MK strategy worked well across almost all expression quintiles. Among SK tools, Trinity performed well across various conditions but had the longest runtime. Oases consumed the most memory, whereas SOAPdenovo had the shortest runtime but performed poorly at reconstructing full-length CDS. ABySS offered a good balance between resource usage and assembly quality.
Conclusions: Our work compared the performance of publicly available transcriptome assemblers and analyzed important factors affecting de novo assembly. Some practical guidelines for transcript reconstruction from short-read RNA-Seq data were proposed. De novo assembly of the C. sinensis transcriptome was greatly improved using the optimized methods.
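The multiple k-mer (MK) idea summarized in the abstract can be illustrated with a minimal sketch: contigs assembled at several k-mer values are pooled, and redundant contigs are dropped. This is only an illustration under simplifying assumptions (exact substring and reverse-complement containment), not the merging procedure actually used by SOAPdenovo-MK, trans-ABySS or Oases-MK. The function names `kmers`, `reverse_complement` and `merge_assemblies` are hypothetical.

```python
def kmers(read, k):
    """Return all overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merge_assemblies(contigs_by_k):
    """Pool contigs from several single-k assemblies and drop redundancy.

    contigs_by_k maps a k-mer value to the list of contigs assembled
    with that k. A contig is dropped if it, or its reverse complement,
    is contained in a longer (or equal-length) contig already kept.
    Assumption: exact containment only; real MK pipelines are more
    tolerant of mismatches.
    """
    pooled = {c for contigs in contigs_by_k.values() for c in contigs}
    # Longest first, so shorter contigs are tested against longer ones.
    ordered = sorted(pooled, key=lambda s: (-len(s), s))
    kept = []
    for contig in ordered:
        rc = reverse_complement(contig)
        if any(contig in longer or rc in longer for longer in kept):
            continue
        kept.append(contig)
    return kept


if __name__ == "__main__":
    toy = {21: ["ACGTACGTTT", "ACGTA"], 31: ["AAACGTACGT", "GGGCCC"]}
    print(kmers("ACGTAC", 4))
    print(merge_assemblies(toy))
```

In this toy input, the two 10-base contigs are reverse complements of each other, so only one is kept, and the short contig is dropped because it is contained in a longer one.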
Figures
Figure 1
Runtime and RAM usage performance for each assembler. Runtime and RAM usage for each assembler: Oases, SOAPdenovo, ABySS and Trinity. (a) Real-time monitored runtime and RAM usage of each method using the Dme-13g data set. The maximum RAM usage is marked with an asterisk for each assembler, and the three stages of Oases and Trinity are shown in different colors (red: Velveth and Inchworm; green: Velvetg and Chrysalis; blue: Oases and Butterfly). RAM usage (b) and runtime (d) of each method using different amounts of input with a _k_-mer value of 25. RAM usage (c) and runtime (e) of each method using the Dme-13g data set with different _k_-mer values. # Alternatively, jobs from the Butterfly module could be distributed across a cluster using a job array, which could greatly reduce the running time for this step.
Figure 2
Number of transcripts that could be aligned to the genome. Shown are the percentages of transcripts that could be successfully aligned to their corresponding genomes with the Dme-13g (a) and Spo-6.8g (c) data sets. (b) Percentage of unique and shared transcripts that could be successfully aligned to the genome using the Dme-13g data set by each of SOAPdenovo-MK, trans-ABySS, Oases-MK and Trinity. (d) The percentage of unique unmapped transcripts produced by each assembly method using the Spo-6.8g data set. Numbers above the histogram are the number of unique unmapped transcripts (left) and the number of unique unmapped transcripts that had BLASTX top hits (E ≤ 10^-10) against the UniProt database (right, in brackets).
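The count of unmapped transcripts with a BLASTX top hit below the E-value cutoff could be computed along the following lines. This is a sketch that assumes BLAST tabular output (`-outfmt 6`, where the E-value is the 11th tab-separated column); the output format actually used in the study is not stated here, and the function name `count_confident_top_hits` is hypothetical.

```python
def count_confident_top_hits(tabular_lines, max_evalue=1e-10):
    """Count query sequences whose best hit has E-value <= max_evalue.

    Assumes standard 12-column BLAST tabular lines:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    The "top hit" is taken here as the hit with the lowest E-value
    for each query.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, evalue = fields[0], float(fields[10])
        if query not in best or evalue < best[query]:
            best[query] = evalue
    return sum(1 for e in best.values() if e <= max_evalue)


if __name__ == "__main__":
    hits = [
        "t1\tP1\t90.0\t100\t5\t0\t1\t300\t1\t100\t1e-50\t200",
        "t1\tP2\t60.0\t80\t30\t2\t1\t240\t1\t80\t1e-5\t50",
        "t2\tP3\t40.0\t50\t30\t1\t1\t150\t1\t50\t1e-3\t30",
    ]
    print(count_confident_top_hits(hits))
```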
Figure 3
Number of reconstructed protein coding genes. Number of full-length protein coding genes reconstructed by each method using inputs with different depths of coverage: D. melanogaster data sets (a), S. pombe data sets (b). Numbers of reconstructed genes are shown for the Dme-3g (c), Dme-13g (d), Spo-1g (e) and Spo-6.8g (f) data sets, including full-length reconstructed genes with 100% (blue) and at least 95% identity (reddish brown), and partial-length reconstructed genes: 80% (green) and 50% (purple). Trinity assembly with the strand-specific option “--SS_lib_type RF” is marked with an asterisk. The assessment of Trinity without the “--jaccard_clip” option is shown as “Trinity#” using the Spo-6.8g data set (f).
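The four categories in this caption can be expressed as a small classification rule. The sketch below is one possible reading: it assumes that "80%" and "50%" refer to the fraction of the coding sequence length covered by the assembled transcript, which the caption does not state explicitly. The function name `classify_reconstruction` and its inputs are hypothetical.

```python
def classify_reconstruction(covered_fraction, identity):
    """Assign a gene to one of the reconstruction categories.

    covered_fraction: fraction of the coding sequence covered by the
        best-matching assembled transcript (0.0 to 1.0).
    identity: sequence identity over the aligned region (0.0 to 1.0).

    Assumptions (not confirmed by the caption): the 80% and 50%
    classes are length-coverage thresholds, and a fully covered gene
    with identity below 95% falls back to the partial classes.
    """
    if covered_fraction >= 1.0:
        if identity >= 1.0:
            return "full-length, 100% identity"
        if identity >= 0.95:
            return "full-length, >=95% identity"
    if covered_fraction >= 0.8:
        return "partial, >=80% of length"
    if covered_fraction >= 0.5:
        return "partial, >=50% of length"
    return "not reconstructed"


if __name__ == "__main__":
    for cov, ident in [(1.0, 1.0), (1.0, 0.97), (0.85, 0.99), (0.6, 1.0), (0.3, 1.0)]:
        print(cov, ident, classify_reconstruction(cov, ident))
```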
Figure 4
Full-length genes reconstructed by each method at different expression quintiles. Shown are the percentages of reconstructed full-length genes (Y axis) at different expression quintiles (X axis, 10% increment) by Oases with different _k_-mer values using Dme-3g (a) and Spo-1g (b) or by each assembler using the Dme-3g (c) and Dme-13g (d) data sets. (e) An example of a D. melanogaster transcript assembled by different methods. NM_079795 is a highly expressed gene in the highest expression quintile that was completely reconstructed by Trinity (red) but not by the other methods, which reconstructed only incomplete transcripts (green) missing both ends of the coding region. An incomplete transcript with a 1 bp deletion assembled by Oases-MK is shown below its gene model. Read coverage is shown at the bottom.
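The per-quantile percentages plotted in panels (a) to (d) can be reproduced in outline by ranking genes by expression, splitting them into equal-sized bins, and computing the share of full-length reconstructions in each bin. The sketch below assumes each gene is given as an (expression, reconstructed) pair; the function name `full_length_rate_by_quantile` and the input layout are hypothetical, and how expression was estimated in the study is not covered here.

```python
def full_length_rate_by_quantile(genes, n_bins=10):
    """Percentage of full-length reconstructed genes per expression bin.

    genes: iterable of (expression_level, is_full_length) pairs.
    n_bins: number of equal-sized bins; 10 matches a 10% increment.
    Returns a list of percentages from the lowest- to the
    highest-expressed bin; a bin with no genes yields None.
    """
    ranked = sorted(genes, key=lambda g: g[0])
    n = len(ranked)
    rates = []
    for b in range(n_bins):
        # Integer arithmetic keeps bin sizes within one gene of each other.
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:
            hits = sum(1 for _, full in chunk if full)
            rates.append(100.0 * hits / len(chunk))
        else:
            rates.append(None)
    return rates


if __name__ == "__main__":
    # Toy data: the five most highly expressed genes are reconstructed.
    toy = [(level, level >= 5) for level in range(10)]
    print(full_length_rate_by_quantile(toy, n_bins=5))
```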
Similar articles
- Comparison of De Novo Transcriptome Assemblers and k-mer Strategies Using the Killifish, Fundulus heteroclitus.
Rana SB, Zadlock FJ 4th, Zhang Z, Murphy WR, Bentivegna CS. PLoS One. 2016 Apr 7;11(4):e0153104. doi: 10.1371/journal.pone.0153104. eCollection 2016. PMID: 27054874 Free PMC article.
- Comprehensive evaluation of de novo transcriptome assembly programs and their effects on differential gene expression analysis.
Wang S, Gribskov M. Bioinformatics. 2017 Feb 1;33(3):327-333. doi: 10.1093/bioinformatics/btw625. PMID: 28172640
- Optimizing de novo assembly of short-read RNA-seq data for phylogenomics.
Yang Y, Smith SA. BMC Genomics. 2013 May 14;14:328. doi: 10.1186/1471-2164-14-328. PMID: 23672450 Free PMC article.
- Current and Future Methods for mRNA Analysis: A Drive Toward Single Molecule Sequencing.
Bayega A, Fahiminiya S, Oikonomopoulos S, Ragoussis J. Methods Mol Biol. 2018;1783:209-241. doi: 10.1007/978-1-4939-7834-2_11. PMID: 29767365 Review.
- A simple guide to de novo transcriptome assembly and annotation.
Raghavan V, Kraft L, Mesny F, Rigerte L. Brief Bioinform. 2022 Mar 10;23(2):bbab563. doi: 10.1093/bib/bbab563. PMID: 35076693 Free PMC article. Review.
Cited by
- Effects of cell morphology, physiology, biochemistry and CHS genes on four flower colors of Impatiens uliginosa.
Zhao LQ, Liu Y, Huang Q, Gao S, Huang MJ, Huang HQ. Front Plant Sci. 2024 Mar 1;15:1343830. doi: 10.3389/fpls.2024.1343830. eCollection 2024. PMID: 38495370 Free PMC article.
- Comparative Analysis and Phylogenetic Study of Dawkinsia filamentosa and Pethia nigrofasciata Mitochondrial Genomes.
Sun CH, Lu CH. Int J Mol Sci. 2024 Mar 5;25(5):3004. doi: 10.3390/ijms25053004. PMID: 38474250 Free PMC article.
- Improved meta-analysis pipeline ameliorates distinctive gene regulators of diabetic vasculopathy in human endothelial cell (hECs) RNA-Seq data.
Pandey D, Perumal P O. PLoS One. 2023 Nov 9;18(11):e0293939. doi: 10.1371/journal.pone.0293939. eCollection 2023. PMID: 37943808 Free PMC article.
- Optimizing an efficient ensemble approach for high-quality de novo transcriptome assembly of Thymus daenensis.
Ahmadi H, Sheikh-Assadi M, Fatahi R, Zamani Z, Shokrpour M. Sci Rep. 2023 Jul 31;13(1):12415. doi: 10.1038/s41598-023-39620-6. PMID: 37524806 Free PMC article.
- Comparative Transcriptomics of Multi-Stress Responses in Pachycladon cheesemanii and Arabidopsis thaliana.
Dong Y, Gupta S, Wargent JJ, Putterill J, Macknight RC, Gechev TS, Mueller-Roeber B, Dijkwel PP. Int J Mol Sci. 2023 Jul 11;24(14):11323. doi: 10.3390/ijms241411323. PMID: 37511083 Free PMC article.