Velvet: algorithms for de novo short read assembly using de Bruijn graphs
Daniel R Zerbino et al. Genome Res. 2008 May.
Abstract
We have developed a new set of algorithms, collectively called "Velvet," to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25-50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of approximately 8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.
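To make the k-mer representation concrete, here is a minimal Python sketch of a plain de Bruijn graph built from reads. It is only an illustration of the general idea, not Velvet's data structure: the function names `kmers` and `build_de_bruijn` are hypothetical, and the sketch ignores reverse complements, coverage counts, and node merging.

```python
def kmers(read, k):
    """Return every overlapping word of length k in a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read.

    Assumption for illustration: one node per distinct k-mer, one arc per
    observed adjacency (consecutive k-mers overlap by k - 1 nucleotides).
    """
    graph = {}
    for read in reads:
        words = kmers(read, k)
        for word in words:
            graph.setdefault(word, set())
        for left, right in zip(words, words[1:]):
            graph[left].add(right)
    return graph


# Two overlapping toy reads share k-mers, so they land on the same path.
toy_graph = build_de_bruijn(["ACGTAC", "GTACG"], k=3)
for kmer, successors in sorted(toy_graph.items()):
    print(kmer, "->", sorted(successors))
```

Because overlapping reads map onto the same k-mers, high coverage adds evidence to existing nodes rather than adding new ones, which is why the representation suits deep, very short read sets.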
Figures
Figure 1.
Schematic representation of our implementation of the de Bruijn graph. Each node, represented by a single rectangle, represents a series of overlapping _k_-mers (in this case, k = 5), listed directly above or below. (Red) The last nucleotide of each _k_-mer. The sequence of those final nucleotides, copied in large letters in the rectangle, is the sequence of the node. The twin node, directly attached to the node, either below or above, represents the reverse series of reverse complement _k_-mers. Arcs are represented as arrows between nodes. The last _k_-mer of an arc’s origin overlaps with the first of its destination. Each arc has a symmetric arc. Note that the two nodes on the left could be merged into one without loss of information, because they form a chain.
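The node layout described in this caption can be sketched in a few lines of Python. The helper names (`reverse_complement`, `node_sequence`, `twin_kmers`) and the toy k-mers are assumptions made for illustration; they are not taken from the Velvet source code.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def node_sequence(kmers):
    """A node's sequence is the last nucleotide of each of its k-mers, in order."""
    return "".join(kmer[-1] for kmer in kmers)


def twin_kmers(kmers):
    """The twin node holds the reverse series of reverse-complement k-mers."""
    return [reverse_complement(kmer) for kmer in reversed(kmers)]


# A toy node with k = 5: consecutive k-mers overlap by k - 1 = 4 nucleotides.
node = ["GATTA", "ATTAC", "TTACA", "TACAG"]
twin = twin_kmers(node)

print(node_sequence(node))  # ACAG
print(twin)                 # ['CTGTA', 'TGTAA', 'GTAAT', 'TAATC']
print(node_sequence(twin))  # AATC
```

In this toy case the twin's k-mers still overlap by k - 1 nucleotides, so the twin is itself a valid chain read off the opposite strand.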
Figure 2.
Example of Tour Bus error correction. (A) Original graph. (B) The search starts from A and spreads toward the right. The progression of the top path (through B′ and C′) is stopped because D was previously visited. The nucleotide sequences corresponding to the alternate paths B′C′ and BC are extracted from the graph, aligned, and compared. (C) The two paths are judged similar, so the longer one, B′C′, is merged into the shorter one, BC. The merging is directed by the alignment of the consensus sequences, indicated in red lines in B. Note that node X, which was connected to node B′, is now connected to node B. The search progresses, and the bottom path (through C′ and D′) arrives second in E. Once again, the corresponding paths, C′D′ and CD are compared. (D) CD and C′D′ are judged similar enough. The longer path is merged into the shorter one.
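The detection half of this procedure can be sketched as follows. This is a simplified illustration under several assumptions, not the Velvet implementation: it uses a plain breadth-first search with unit steps, it scores similarity with `difflib.SequenceMatcher` as a stand-in for a real sequence alignment, the 0.8 threshold is arbitrary, and it only reports candidate bubbles without merging them. The names `find_bubbles` and `path_from_start` are hypothetical.

```python
from collections import deque
from difflib import SequenceMatcher


def path_from_start(node, parent):
    """Follow parent pointers back to the start; return nodes in start-to-node order."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def find_bubbles(arcs, seqs, start, min_similarity=0.8):
    """Report pairs of alternate paths that leave a common node and reconverge.

    arcs: dict node -> list of successor nodes
    seqs: dict node -> nucleotide sequence of that node
    Returns a list of (first_path, second_path, judged_similar) tuples.
    """
    parent = {start: None}
    queue = deque([start])
    bubbles = []
    while queue:
        node = queue.popleft()
        for nxt in arcs.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
                continue
            # nxt was already visited, so a second path has reached it.
            new_path = path_from_start(node, parent)
            if nxt in new_path:
                continue  # the arc closes a cycle; this is not a bubble
            old_path = path_from_start(parent[nxt], parent)
            # Both paths begin at start; the shared prefix ends at their
            # closest common ancestor.
            fork = 0
            while (fork < min(len(old_path), len(new_path))
                   and old_path[fork] == new_path[fork]):
                fork += 1
            old_branch, new_branch = old_path[fork:], new_path[fork:]
            if not old_branch and not new_branch:
                continue  # duplicate arc, no alternate path
            old_seq = "".join(seqs[n] for n in old_branch)
            new_seq = "".join(seqs[n] for n in new_branch)
            score = SequenceMatcher(None, old_seq, new_seq).ratio()
            bubbles.append((old_branch, new_branch, score >= min_similarity))
    return bubbles


# Toy graph: two paths from A to D that differ by a single base.
arcs = {"A": ["B", "B2"], "B": ["C"], "B2": ["C2"],
        "C": ["D"], "C2": ["D"], "D": ["E"]}
seqs = {"A": "TT", "B": "ACGTT", "C": "GCA",
        "B2": "ACGAT", "C2": "GCA", "D": "GG", "E": "CC"}
print(find_bubbles(arcs, seqs, "A"))
```

A full implementation would go on to merge the two paths node by node, guided by the alignment, and re-attach side branches such as node X in the figure.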
Figure 3.
Simulations of Tour Bus. The genome of E. coli and 5-Mb samples of DNA from three other species (S. cerevisiae, C. elegans, and H. sapiens, respectively) were used to generate 35-bp read sets of varying read depths (_X_-axis of each plot). We measured the contig length N50 (_Y_-axis, log scale) after tip-clipping (black curve) then after the subsequent bubble smoothing (red curve). In the first column are the results for perfect, error-free reads. In the second column, we inserted errors in the reads at a rate of 1%. In the third column, we generated a slightly variant genome from the original by inserting random SNPs at a rate of 1 in 500. The reads were then generated with errors from both variants, thus simulating a diploid assembly.
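For reference, the N50 statistic plotted here can be computed as in the sketch below. It assumes the common definition (the length L such that contigs of length at least L contain at least half of the total assembled bases); the function name `n50` is hypothetical, and the paper's exact convention may differ in details.

```python
def n50(lengths):
    """Return the contig length N50 under the common definition.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Nine toy contigs totalling 54 bases: 10 + 9 + 8 = 27 reaches the halfway mark.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```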
Figure 4.
Effect of coverage on contig length with experimental Streptococcus data.
Figure 5.
Breadcrumb algorithm. Two long contigs produced after error correction, A and B, are joined by several paired reads (red and blue arcs). The path between the two can be broken up because of a repeat internal to the connecting sequence, because of an overlap with a distinct part of the genome, or because of some unresolved errors. The small square nodes represent either nodes of the path between A and B, or other nodes of the graph connected to the former. Finding the exact path in the graph from A to B is not straightforward because of all the alternate paths that need to be explored. However, if we mark all the nodes that are paired up to either A or B (with a blue circle), we can define a subgraph much simpler to explore. Ideally, only a linear path connects both nodes.
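The idea of restricting the search to marked nodes can be sketched as below. This is a minimal illustration under stated assumptions, not the Velvet implementation: `breadcrumb_path` is a hypothetical name, the set of marked nodes is assumed to have been computed from read pairs beforehand, and the function gives up (returns `None`) whenever the marked subgraph is not a single linear path.

```python
def breadcrumb_path(arcs, a, b, marked):
    """Walk from contig a to contig b using only marked intermediate nodes.

    arcs:   dict node -> list of successor nodes in the full graph
    marked: set of nodes paired up to either a or b (assumed precomputed)
    Returns the node list from a to b if exactly one marked successor exists
    at every step, otherwise None.
    """
    path = [a]
    seen = {a}
    node = a
    while node != b:
        candidates = {
            nxt for nxt in arcs.get(node, ())
            if (nxt == b or nxt in marked) and nxt not in seen
        }
        if len(candidates) != 1:
            return None  # dead end or ambiguous branch: no unique linear path
        (node,) = candidates
        seen.add(node)
        path.append(node)
    return path


# Toy graph: x1, x2, x3 belong to other parts of the genome and are unmarked.
arcs = {"A": ["n1", "x1"], "n1": ["n2", "x2"], "n2": ["B"], "x1": ["x3"]}
print(breadcrumb_path(arcs, "A", "B", marked={"n1", "n2"}))
```

Ignoring unmarked nodes removes most of the alternate paths that make an unrestricted search from A to B expensive.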
Figure 6.
Breadcrumb performance on simulated data sets. As in Figure 3, we sampled 5-Mb DNA sequences from four different species (E. coli, S. cerevisiae, C. elegans, and H. sapiens, respectively) and generated 50× read sets. The horizontal lines represent the N50 reached at the end of Tour Bus (see Fig. 3) (broken black line) and after applying a 4× coverage cutoff (broken red line). Note how the difference in N50 between the graph of perfect reads and that of erroneous reads is significantly reduced by this last cutoff. (Black curves) The results after the basic Breadcrumb algorithm; (red curves) the results after super-contigging.