Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study

Comparative Study

Qiong-Yi Zhao et al. BMC Bioinformatics. 2011.

Abstract

Background: With the rapid advances in next-generation sequencing technology, high-throughput RNA sequencing has emerged as a powerful and cost-effective way to study transcriptomes. De novo assembly of transcripts provides an important solution for transcriptome analysis in organisms with no reference genome. However, it was poorly understood how different variables affected assembly outcomes, and there was no consensus on how to approach an optimal solution by selecting a software tool and a suitable strategy based on the properties of the RNA-Seq data.

Results: To assess the performance of different programs for transcriptome assembly, this work analyzed several important factors, including k-mer values, genome complexity, coverage depth and directional reads. Seven program conditions were tested: four single k-mer assemblers (SK: SOAPdenovo, ABySS, Oases and Trinity) and three multiple k-mer methods (MK: SOAPdenovo-MK, trans-ABySS and Oases-MK). While small and large k-mer values performed better for reconstructing lowly and highly expressed transcripts, respectively, the MK strategy worked well across almost all expression quintiles. Among the SK tools, Trinity performed well across various conditions but had the longest running time. Oases consumed the most memory, whereas SOAPdenovo had the shortest runtime but performed poorly at reconstructing full-length CDS. ABySS struck a good balance between resource usage and assembly quality.
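The idea behind a multiple k-mer (MK) strategy can be illustrated with a toy sketch: contigs from several single k-mer assemblies are pooled, and any contig that is an exact copy of, or wholly contained in, a longer contig on either strand is dropped. The function names and the example sequences are hypothetical, and this is only an assumption-laden illustration; the MK tools named above use their own merging procedures, which are not reproduced here.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def merge_assemblies(assemblies):
    """Pool contigs from several single k-mer assemblies (toy illustration).

    `assemblies` maps a k-mer value to a list of contig sequences.
    A contig is dropped when it, or its reverse complement, occurs
    inside a longer (or identical) contig that has already been kept.
    """
    pooled = [c.upper() for k in sorted(assemblies) for c in assemblies[k]]
    # Longest first, so shorter contigs are checked against longer ones.
    pooled.sort(key=len, reverse=True)
    kept = []
    for contig in pooled:
        rc = reverse_complement(contig)
        if not any(contig in longer or rc in longer for longer in kept):
            kept.append(contig)
    return kept


# Hypothetical contigs from two assemblies with different k-mer values.
assemblies = {
    21: ["ACGTACGTAA", "ACGT"],
    31: ["ACGTACGTAA", "TTACGTACGT", "GGGGCCCCAT"],
}
merged = merge_assemblies(assemblies)
```

In this example the duplicate, the reverse-complement copy and the short contained fragment are all removed, leaving two contigs. The pairwise containment check is quadratic in the number of contigs, so it suits small illustrations only.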

Conclusions: Our work compared the performance of publicly available transcriptome assemblers and analyzed important factors affecting de novo assembly. Practical guidelines for transcript reconstruction from short-read RNA-Seq data were proposed. De novo assembly of the C. sinensis transcriptome was greatly improved using the optimized methods.

Figures

Figure 1

Runtime and RAM usage performance for each assembler. Shown are the runtime and RAM usage of Oases, SOAPdenovo, ABySS and Trinity. (a) Runtime and RAM usage of each method, monitored in real time, using the Dme-13g data set. The maximum RAM usage is marked with an asterisk for each assembler, and the three stages of Oases and Trinity are shown in different colors (red: Velveth and Inchworm; green: Velvetg and Chrysalis; blue: Oases and Butterfly). RAM usage (b) and runtime (d) of each method using different amounts of input with a _k_-mer value of 25. RAM usage (c) and runtime (e) of each method using the Dme-13g data set with different _k_-mer values. # Alternatively, jobs from the Butterfly module could be distributed across a cluster using a job array, which could greatly reduce the running time of this step.

Figure 2

Number of transcripts that could be aligned to the genome. Shown are the percentages of transcripts that could be successfully aligned to their corresponding genome with the Dme-13g (a) and Spo-6.8g (c) data sets. (b) Percentage of unique and shared transcripts that could be successfully aligned to the genome using the Dme-13g data set by each of SOAPdenovo-MK, trans-ABySS, Oases-MK and Trinity. (d) The percentage of unique unmapped transcripts produced by each assembly method using the Spo-6.8g data set. Numbers above the histogram are the number of unique unmapped transcripts (left) and the number of unique unmapped transcripts that had BLASTX top hits (E ≤ 10^-10) to the Uniprot database (right, in brackets).

Figure 3

Number of reconstructed protein coding genes. Number of full-length protein coding genes reconstructed by each method using inputs with different depths of coverage: D. melanogaster data sets (a), S. pombe data sets (b). Numbers of reconstructed genes are shown for the Dme-3g (c), Dme-13g (d), Spo-1g (e) and Spo-6.8g (f) data sets, including full-length reconstructed genes with 100% (blue) and at least 95% identity (reddish brown), and partial-length reconstructed genes at 80% (green) and 50% (purple). Trinity assembly with the strand-specific option “--SS_lib_type RF” is marked with an asterisk. The assessment of Trinity without the “--jaccard_clip” option is shown as “Trinity#” for the Spo-6.8g data set (f).

Figure 4

Full-length genes reconstructed by each method at different expression quintiles. Shown are the percentages of reconstructed full-length genes (Y axis) at different expression quintiles (X axis, 10% increment) by Oases with different _k_-mer values using Dme-3g (a) and Spo-1g (b), or by each assembler using the Dme-3g (c) and Dme-13g (d) data sets. (e) An example of a D. melanogaster transcript assembled by the different methods. NM_079795, a highly expressed gene in the highest expression quintile, was completely reconstructed by Trinity (red) but not by the other methods, which reconstructed only incomplete transcripts (green) that lost both ends of the coding region. The incomplete transcript with a 1 bp deletion assembled by Oases-MK is shown below its gene model. Read coverage is shown at the bottom.
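The per-bin percentages described in this caption could be computed along the following lines. This is a minimal sketch, assuming each gene has a single expression value and a yes/no flag for full-length reconstruction; the function name, the gene identifiers and the example values are hypothetical and are not taken from the study.

```python
def full_length_rate_by_expression_bin(expression, full_length, n_bins=10):
    """Percentage of full-length reconstructed genes per expression bin.

    `expression` maps gene ID to an expression value, and `full_length`
    is the set of gene IDs reconstructed to full length. Genes are ranked
    from lowest to highest expression and split into `n_bins` equal-sized
    bins (10 bins corresponds to a 10% increment on the X axis).
    """
    genes = sorted(expression, key=expression.get)
    rates = []
    for i in range(n_bins):
        start = i * len(genes) // n_bins
        end = (i + 1) * len(genes) // n_bins
        bin_genes = genes[start:end]
        if not bin_genes:
            rates.append(None)  # too few genes to fill this bin
            continue
        hits = sum(1 for gene in bin_genes if gene in full_length)
        rates.append(100.0 * hits / len(bin_genes))
    return rates


# Hypothetical data: 20 genes with increasing expression values.
expression = {f"g{i}": float(i) for i in range(20)}
full_length = {"g0", "g18", "g19"}
rates = full_length_rate_by_expression_bin(expression, full_length)
```

Ranking genes and cutting by position gives bins of equal size, which keeps the percentages comparable between bins; ties in expression are split arbitrarily in this sketch.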
