Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study - PubMed

Comparative Study

Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study

Qiong-Yi Zhao et al. BMC Bioinformatics. 2011.

Abstract

Background: With rapid advances in next-generation sequencing technology, high-throughput RNA sequencing (RNA-Seq) has emerged as a powerful and cost-effective approach to transcriptome studies. De novo assembly of transcripts provides an important solution for transcriptome analysis in organisms with no reference genome. However, it was poorly understood how different variables affect assembly outcomes, and there was no consensus on how to approach an optimal solution by selecting a software tool and a suitable strategy based on the properties of the RNA-Seq data.

Results: To assess the performance of different programs for transcriptome assembly, this work analyzed several important factors, including k-mer values, genome complexity, coverage depth and directional reads. Seven program conditions were tested: four single k-mer (SK) assemblers (SOAPdenovo, ABySS, Oases and Trinity) and three multiple k-mer (MK) methods (SOAPdenovo-MK, trans-ABySS and Oases-MK). While small and large k-mer values performed better for reconstructing lowly and highly expressed transcripts, respectively, the MK strategy worked well across almost all expression quintiles. Among the SK tools, Trinity performed well across various conditions but had the longest runtime. Oases consumed the most memory, whereas SOAPdenovo had the shortest runtime but performed poorly at reconstructing full-length CDS. ABySS offered a good balance between resource usage and assembly quality.
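One commonly given explanation for the k-mer result above is that lowly expressed transcripts are sampled sparsely, so neighbouring reads overlap by only a few bases and a long k-mer spanning the junction never appears in any read. The following is a minimal sketch of that idea, not the procedure used in the paper: the sequence, the read layout and the function names `kmers` and `fraction_recoverable` are all made up for illustration, and the check only tests k-mer presence rather than building a de Bruijn graph as the assemblers do.

```python
def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fraction_recoverable(transcript, reads, k):
    """Fraction of the transcript's k-mers that occur in at least one read."""
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    positions = range(len(transcript) - k + 1)
    found = sum(transcript[i:i + k] in read_kmers for i in positions)
    return found / len(positions)


# Hypothetical 28 bp transcript sampled sparsely (mimicking low expression):
# 10 bp reads starting every 6 bp, so neighbouring reads overlap by only 4 bp.
transcript = "ATGGCGTACCTGAAGTTCGACCATGGAT"
reads = [transcript[s:s + 10] for s in range(0, 19, 6)]

small_k = fraction_recoverable(transcript, reads, 5)
large_k = fraction_recoverable(transcript, reads, 9)
print(f"k=5: {small_k:.2f}  k=9: {large_k:.2f}")
```

In this toy layout every 5-mer of the transcript is present in some read, while many 9-mers that span read junctions are missing, which would break the assembly path at a large k. Running several k values and merging the results is the intuition behind the MK strategy.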

Conclusions: Our work compared the performance of publicly available transcriptome assemblers and analyzed important factors affecting de novo assembly. Practical guidelines for transcript reconstruction from short-read RNA-Seq data were proposed. De novo assembly of the C. sinensis transcriptome was greatly improved using the optimized methods.

Figures

Figure 1

Runtime and RAM usage for each assembler: Oases, SOAPdenovo, ABySS and Trinity. (a) Real-time monitored runtime and RAM usage of each method using the Dme-13g data set. The maximum RAM usage is marked with an asterisk for each assembler, and the three stages of Oases and Trinity are shown in different colors (red: Velveth and Inchworm; green: Velvetg and Chrysalis; blue: Oases and Butterfly). RAM usage (b) and runtime (d) of each method using different amounts of input with a _k_-mer value of 25. RAM usage (c) and runtime (e) of each method using the Dme-13g data set with different _k_-mer values. # Alternatively, jobs from the Butterfly module could be distributed across a cluster using a job array, which could greatly reduce the running time of this step.

Figure 2

Number of transcripts that could be aligned to the genome. Shown are the percentages of transcripts that could be successfully aligned to their corresponding genome with the Dme-13g (a) and Spo-6.8g (c) data sets. (b) Percentage of unique and shared transcripts that could be successfully aligned to the genome using the Dme-13g data set by each of SOAPdenovo-MK, trans-ABySS, Oases-MK and Trinity. (d) Percentage of unique unmapped transcripts produced by each assembly method using the Spo-6.8g data set. Numbers above the histogram are the number of unique unmapped transcripts (left) and the number of unique unmapped transcripts that had BLASTX top hits (E ≤ 10^-10) to the UniProt database (right, in brackets).

Figure 3

Number of reconstructed protein coding genes. Number of full-length protein coding genes reconstructed by each method using inputs with different depths of coverage: D. melanogaster data sets (a), S. pombe data sets (b). Numbers of reconstructed genes are shown for the Dme-3g (c), Dme-13g (d), Spo-1g (e) and Spo-6.8g (f) data sets, which include full-length reconstructed genes with 100% (blue) and at least 95% identity (reddish brown), and partial-length reconstructed genes: 80% (green) and 50% (purple). Trinity assembly with the strand-specific option “--SS_lib_type RF” is marked with an asterisk. The assessment of Trinity without the “--jaccard_clip” option is shown as “Trinity#” for the Spo-6.8g data set (f).

Figure 4

Full-length genes reconstructed by each method at different expression quintiles. Shown are the percentages of reconstructed full-length genes (Y axis) at different expression quintiles (X axis, 10% increment) by Oases with different _k_-mer values using Dme-3g (a) and Spo-1g (b) or by each assembler using the Dme-3g (c) and Dme-13g (d) data sets. (e) An example of a transcript in D. melanogaster as assembled by the different methods. NM_079795 is a highly expressed gene in the highest expression quintile, which was completely reconstructed by Trinity (red) but not by the other methods. The other methods reconstructed only incomplete transcripts (green), losing both ends of the coding region. The incomplete transcript with a 1 bp deletion assembled by Oases-MK is shown below its gene model. Read coverage is shown at the bottom.
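The per-bin evaluation described in the Figure 4 caption can be sketched roughly as follows: genes are ranked by expression, split into equal-sized bins (ten bins for a 10% increment), and the percentage of genes reconstructed to full length is reported per bin. This is only an illustrative sketch under stated assumptions; the function name `full_length_rate_by_bin`, the input layout and the toy data are hypothetical and are not taken from the paper.

```python
def full_length_rate_by_bin(genes, n_bins=10):
    """Percentage of full-length reconstructed genes per expression bin.

    genes: list of (expression_level, is_full_length) tuples (assumed layout).
    Returns one percentage per bin, ordered from lowest to highest expression;
    a bin with no genes yields None.
    """
    ranked = sorted(genes, key=lambda g: g[0])
    n = len(ranked)
    rates = []
    for b in range(n_bins):
        # Integer arithmetic keeps bin boundaries contiguous and non-overlapping.
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        chunk = ranked[lo:hi]
        if not chunk:
            rates.append(None)
            continue
        hits = sum(1 for _, full_length in chunk if full_length)
        rates.append(100.0 * hits / len(chunk))
    return rates


# Made-up toy data: 20 genes with expression 1..20, where only the more
# highly expressed half was reconstructed to full length.
toy_genes = [(expr, expr >= 11) for expr in range(1, 21)]
rates = full_length_rate_by_bin(toy_genes)
print(rates)
```

Comparing such per-bin curves between assemblers, or between k-mer values for one assembler, is the kind of comparison panels (a) to (d) present.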

Cited by

Effects of cell morphology, physiology, biochemistry and CHS genes on four flower colors of Impatiens uliginosa.
Zhao LQ, Liu Y, Huang Q, Gao S, Huang MJ, Huang HQ. Front Plant Sci. 2024 Mar 1;15:1343830. doi: 10.3389/fpls.2024.1343830. eCollection 2024. PMID: 38495370. Free PMC article.
Comparative Analysis and Phylogenetic Study of Dawkinsia filamentosa and Pethia nigrofasciata Mitochondrial Genomes.
Sun CH, Lu CH. Int J Mol Sci. 2024 Mar 5;25(5):3004. doi: 10.3390/ijms25053004. PMID: 38474250. Free PMC article.
Improved meta-analysis pipeline ameliorates distinctive gene regulators of diabetic vasculopathy in human endothelial cell (hECs) RNA-Seq data.
Pandey D, Perumal P O. PLoS One. 2023 Nov 9;18(11):e0293939. doi: 10.1371/journal.pone.0293939. eCollection 2023. PMID: 37943808. Free PMC article.
Optimizing an efficient ensemble approach for high-quality de novo transcriptome assembly of Thymus daenensis.
Ahmadi H, Sheikh-Assadi M, Fatahi R, Zamani Z, Shokrpour M. Sci Rep. 2023 Jul 31;13(1):12415. doi: 10.1038/s41598-023-39620-6. PMID: 37524806. Free PMC article.
Comparative Transcriptomics of Multi-Stress Responses in Pachycladon cheesemanii and Arabidopsis thaliana.
Dong Y, Gupta S, Wargent JJ, Putterill J, Macknight RC, Gechev TS, Mueller-Roeber B, Dijkwel PP. Int J Mol Sci. 2023 Jul 11;24(14):11323. doi: 10.3390/ijms241411323. PMID: 37511083. Free PMC article.
