Optimization of de novo transcriptome assembly from next-generation sequencing data

Yann Surget-Groba et al. Genome Res. 2010 Oct.

Abstract

Transcriptome analysis has important applications in many biological fields. However, assembling a transcriptome without a known reference remains a challenging task requiring algorithmic improvements. We present two methods for substantially improving transcriptome de novo assembly. The first method relies on the observation that the use of a single k-mer length by current de novo assemblers is suboptimal to assemble transcriptomes where the sequence coverage of transcripts is highly heterogeneous. We present the Multiple-k method in which various k-mer lengths are used for de novo transcriptome assembly. We demonstrate its good performance by assembling de novo a published next-generation transcriptome sequence data set of Aedes aegypti, using the existing genome to check the accuracy of our method. The second method relies on the use of a reference proteome to improve the de novo assembly. We developed the Scaffolding using Translation Mapping (STM) method that uses mapping against the closest available reference proteome for scaffolding contigs that map onto the same protein. In a controlled experiment using simulated data, we show that the STM method considerably improves the assembly, with few errors. We applied these two methods to assemble the transcriptome of the non-model catfish Loricaria gr. cataphracta. Using the Multiple-k and STM methods, the assembly increases in contiguity and in gene identification, showing that our methods clearly improve quality and can be widely used. The new methods were used to assemble successfully the transcripts of the core set of genes regulating tooth development in vertebrates, while classic de novo assembly failed.
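One way to see why a single k-mer length suits heterogeneous coverage poorly is the standard relation between base coverage C, read length L, and k-mer coverage, C_k = C (L - k + 1) / L. The abstract does not give this formula. The sketch below is only an illustration under that assumption, and the read length, coverage values, and k values are hypothetical rather than taken from the study.

```python
def kmer_coverage(base_coverage: float, read_length: int, k: int) -> float:
    """Expected k-mer coverage C_k = C * (L - k + 1) / L.

    A read of length L contains L - k + 1 k-mers, so longer k values
    leave fewer k-mers per read and therefore lower k-mer coverage.
    """
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    return base_coverage * (read_length - k + 1) / read_length


if __name__ == "__main__":
    read_length = 36  # hypothetical short-read length
    transcripts = [("weakly expressed", 5.0), ("highly expressed", 500.0)]
    for label, coverage in transcripts:
        for k in (19, 25, 31):  # hypothetical k-mer lengths
            c_k = kmer_coverage(coverage, read_length, k)
            print(f"{label:17s} k={k:2d} k-mer coverage={c_k:8.2f}")
```

Under these assumed numbers, a weakly expressed transcript keeps usable k-mer coverage only at small k, whereas a highly expressed transcript tolerates large k, which is consistent with the motivation for combining assemblies built with several k-mer lengths.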


Figures

Figure 1.


Diagrammatic representation of the STM method. This pipeline can use either contigs only (STM- method) or, if reads are long enough, contigs plus unassembled reads (STM+ method). These contigs/reads are mapped onto the reference proteome using BLASTX. When a contig has no significant hit, or is the only one to map onto a given reference protein, it cannot be further assembled and is directed into the final assembly. When there are several hits on the same reference protein (Box 1: an example with 5 hits), their relative positions are recorded on the reference scale. If the positions of several hits overlap (here hits 2, 3, and 4 form an overlap group), their consensus sequence is computed, and when the number of ambiguities is below a user-defined threshold, the consensus is accepted and a scaffold is constructed (Box 2: dashed line represents N's added to join the contigs). Otherwise, the consensus is rejected and the contigs of the overlap group are assembled using CAP. If the result of this assembly step is a single “super-contig,” it is accepted and a scaffold is constructed (Box 3). If more than one super-contig is obtained (Box 4), the overlap group assembly is rejected and the contigs are placed as independent transcripts in the final assembly. If present, the other nonoverlapping hits (or nonambiguous overlap groups) are joined into a scaffold, which is incorporated into the final assembly.
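The step in which hits on one reference protein are sorted into overlap groups can be sketched as an interval-merging pass. This is a minimal illustration, not the authors' implementation: the `Hit` class, the `overlap_groups` function, and the coordinates are hypothetical, and the sketch assumes each hit is reduced to inclusive start and end positions on the reference protein. The consensus, ambiguity threshold, and CAP steps are not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    """A contig (or read) aligned to one reference protein (hypothetical type)."""
    contig_id: str
    start: int  # first aligned residue on the reference protein, inclusive
    end: int    # last aligned residue on the reference protein, inclusive


def overlap_groups(hits: List[Hit]) -> List[List[Hit]]:
    """Group hits whose positions on the reference protein overlap.

    Hits are sorted by start position; a hit joins the current group if it
    starts at or before the furthest end seen in that group, otherwise it
    opens a new group. Groups of size one are nonoverlapping hits.
    """
    groups: List[List[Hit]] = []
    group_end = 0
    for hit in sorted(hits, key=lambda h: (h.start, h.end)):
        if groups and hit.start <= group_end:
            groups[-1].append(hit)
            group_end = max(group_end, hit.end)
        else:
            groups.append([hit])
            group_end = hit.end
    return groups


if __name__ == "__main__":
    # Five hypothetical hits laid out like Box 1: hits 2, 3, and 4 overlap.
    hits = [
        Hit("hit1", 1, 40),
        Hit("hit2", 60, 120),
        Hit("hit3", 100, 170),
        Hit("hit4", 150, 210),
        Hit("hit5", 250, 300),
    ]
    for group in overlap_groups(hits):
        print([h.contig_id for h in group])
```

In a fuller pipeline, each multi-hit group would go on to the consensus check described in the caption, while single-hit groups would be candidates for joining into a scaffold with N's.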


