Optimization of de novo transcriptome assembly from next-generation sequencing data
Yann Surget-Groba et al. Genome Res. 2010 Oct.
Abstract
Transcriptome analysis has important applications in many biological fields. However, assembling a transcriptome without a known reference remains a challenging task requiring algorithmic improvements. We present two methods for substantially improving transcriptome de novo assembly. The first method relies on the observation that the use of a single k-mer length by current de novo assemblers is suboptimal to assemble transcriptomes where the sequence coverage of transcripts is highly heterogeneous. We present the Multiple-k method in which various k-mer lengths are used for de novo transcriptome assembly. We demonstrate its good performance by assembling de novo a published next-generation transcriptome sequence data set of Aedes aegypti, using the existing genome to check the accuracy of our method. The second method relies on the use of a reference proteome to improve the de novo assembly. We developed the Scaffolding using Translation Mapping (STM) method that uses mapping against the closest available reference proteome for scaffolding contigs that map onto the same protein. In a controlled experiment using simulated data, we show that the STM method considerably improves the assembly, with few errors. We applied these two methods to assemble the transcriptome of the non-model catfish Loricaria gr. cataphracta. Using the Multiple-k and STM methods, the assembly increases in contiguity and in gene identification, showing that our methods clearly improve quality and can be widely used. The new methods were used to assemble successfully the transcripts of the core set of genes regulating tooth development in vertebrates, while classic de novo assembly failed.
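The reasoning behind the Multiple-k method can be illustrated with a toy sketch in Python. This is not the authors' pipeline. It assumes reads tiled at regular intervals from a made-up transcript, treats two reads as joinable when they share at least one k-mer, and merges per-k assemblies by dropping contigs contained in longer ones. The names `tile_reads`, `chain_is_linked`, and `pool_contigs`, and the merging rule itself, are hypothetical and chosen only for illustration.

```python
def kmers(seq, k):
    """Return the set of all k-mers found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def tile_reads(transcript, read_len, step):
    """Cut reads of length read_len every `step` bases.

    A smaller step means more reads per base, i.e. higher coverage
    (an idealized stand-in for real, randomly placed reads).
    """
    return [transcript[i:i + read_len]
            for i in range(0, len(transcript) - read_len + 1, step)]


def chain_is_linked(reads, k):
    """True if every pair of consecutive reads shares at least one k-mer."""
    return all(kmers(a, k) & kmers(b, k) for a, b in zip(reads, reads[1:]))


def pool_contigs(assemblies):
    """Pool contigs from several per-k assemblies (dict: k -> list of contigs).

    Assumption: a contig that is an exact substring of a longer kept contig
    is redundant. Reverse complements and near-duplicates are ignored here.
    """
    pooled = sorted({c for contigs in assemblies.values() for c in contigs},
                    key=lambda c: (-len(c), c))
    kept = []
    for contig in pooled:
        if not any(contig in longer for longer in kept):
            kept.append(contig)
    return kept


# Made-up 36-base "transcript", used only for this illustration.
TRANSCRIPT = "ATGGCGTACCTGAAGTCGATTCCGGAACTGTTAGCA"

if __name__ == "__main__":
    low_cov = tile_reads(TRANSCRIPT, read_len=12, step=8)   # 4-base overlaps
    high_cov = tile_reads(TRANSCRIPT, read_len=12, step=2)  # 10-base overlaps
    for k in (4, 5, 10):
        print(k, chain_is_linked(low_cov, k), chain_is_linked(high_cov, k))
    print(pool_contigs({21: ["ACGTACGTAA", "GGGCCC"],
                        31: ["ACGTACGTAAGG", "TTTT"]}))
```

In this toy, the sparsely covered reads stay linked only up to k = 4, while the densely covered reads are still linked at k = 10. No single k suits both, which is the motivation for assembling at several k-mer lengths and pooling the results.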
Figures
Figure 1.
Diagrammatic representation of the STM method. This pipeline can use either contigs only (STM- method) or, if reads are long enough, contigs plus unassembled reads (STM+ method). These contigs/reads are mapped onto the reference proteome using BLASTX. When a contig has no significant hit, or is the only one to map onto a given reference protein, it cannot be further assembled and is directed into the final assembly. When there are several hits on the same reference protein (Box 1: an example with 5 hits), their relative positions are recorded on the reference scale. If several hits overlap in position (here hits 2, 3, and 4 form an overlap group), their consensus sequence is computed, and when the number of ambiguities is below a user-defined threshold, the consensus is accepted and a scaffold is constructed (Box 2: dashed line represents N's added to join the contigs). Otherwise, the consensus is rejected and the contigs of the overlap group are assembled using CAP. If the result of this assembly step is a single “super-contig,” it is accepted and a scaffold is constructed (Box 3). If more than one super-contig is obtained (Box 4), the overlap group assembly is rejected and the contigs are placed as independent transcripts in the final assembly. If present, the other nonoverlapping hits (or nonambiguous overlap groups) are joined into a scaffold, which is incorporated into the final assembly.
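The decision flow in this caption can be sketched in Python for the hits on a single reference protein. This is a simplified illustration, not the published implementation. It assumes BLASTX hit coordinates are already available in amino acids. It also does not implement the consensus and CAP steps; they are represented by a hypothetical `resolve` callable that returns one sequence when the overlap group is accepted and `None` when it is rejected. Sizing each run of N's at three per missing reference residue, with a minimum of one codon, is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    seq: str    # contig (or read) sequence, assumed already in coding orientation
    start: int  # first matched residue on the reference protein (1-based)
    end: int    # last matched residue on the reference protein (inclusive)


def overlap_groups(hits):
    """Group hits whose positions on the reference protein overlap."""
    groups = []
    for hit in sorted(hits, key=lambda h: (h.start, h.end)):
        if groups and hit.start <= max(h.end for h in groups[-1]):
            groups[-1].append(hit)   # overlaps the current group
        else:
            groups.append([hit])     # starts a new group
    return groups


def build_scaffold(hits, resolve=None):
    """Return (scaffold, independent_contigs) for one reference protein.

    `resolve` is a hypothetical stand-in for the consensus / CAP steps.
    Without it, every multi-hit overlap group is treated as rejected.
    """
    pieces = []       # (start, end, sequence) of accepted groups
    independent = []  # contigs from rejected overlap groups
    for group in overlap_groups(hits):
        if len(group) == 1:
            seq = group[0].seq
        else:
            seq = resolve(group) if resolve else None
        if seq is None:
            independent.extend(h.seq for h in group)
            continue
        pieces.append((group[0].start, max(h.end for h in group), seq))

    scaffold, prev_end = "", None
    for start, end, seq in pieces:
        if prev_end is not None:
            # Assumption: 3 N's per uncovered reference residue, at least one codon.
            scaffold += "N" * (3 * max(start - prev_end - 1, 1))
        scaffold += seq
        prev_end = end
    return scaffold, independent


if __name__ == "__main__":
    # Five hits as in Box 1: hits 2, 3, and 4 form an overlap group.
    hits = [Hit("ATGGCC", 1, 20), Hit("CCGGTT", 30, 50), Hit("GGTTAC", 45, 70),
            Hit("TTACGA", 65, 80), Hit("GGCTAA", 90, 100)]
    print([len(g) for g in overlap_groups(hits)])
    print(build_scaffold(hits))
```

Passing a real consensus or assembly routine as `resolve` would let accepted overlap groups join the scaffold. Without one, their contigs are returned as independent transcripts, mirroring the rejection path of Box 4.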