Assembly algorithms for next-generation sequencing data - PubMed

Review

Assembly algorithms for next-generation sequencing data

Jason R Miller et al. Genomics. 2010 Jun.

Abstract

The emergence of next-generation sequencing platforms led to a resurgence of research in whole-genome shotgun assembly algorithms and software. DNA sequencing data from the Roche 454, Illumina/Solexa, and ABI SOLiD platforms typically present shorter read lengths, higher coverage, and different error profiles compared with Sanger sequencing data. Since 2005, several assembly software packages have been created or revised specifically for de novo assembly of next-generation sequencing data. This review summarizes and compares the published descriptions of packages named SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. More generally, it compares the two standard methods known as the de Bruijn graph approach and the overlap/layout/consensus approach to assembly.

Conflict of interest statement

CONFLICT OF INTEREST: None.

Figures

Figure 1

A read represented by K-mer graphs. (a) The read is represented by two types of K-mer graph with K=4. Larger values of K are used for real data. (b) The graph has a node for every K-mer in the read plus a directed edge for every pair of K-mers that overlap by K-1 bases in the read. (c) An equivalent graph has an edge for every K-mer in the read and the nodes implicitly represent overlaps of K-1 bases. In these examples, the paths are simple because the value K=4 is larger than the 2bp repeats in the read. The read sequence is easily reconstructed from the path in either graph.
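
The two representations in this caption can be sketched in a few lines of Python. This is a minimal illustration, not code from any assembler in the review: the function names and the example read are hypothetical, and the read is not the one drawn in the figure.

```python
def kmers(read, k):
    """Return the K-mers of a read in left-to-right order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def node_graph(read, k):
    """K-mers are nodes; an edge joins K-mers adjacent in the read (K-1 overlap)."""
    nodes = kmers(read, k)
    edges = list(zip(nodes, nodes[1:]))
    return nodes, edges


def edge_graph(read, k):
    """(K-1)-mers are nodes; each K-mer is an edge from its prefix to its suffix."""
    return [(kmer[:-1], kmer[1:]) for kmer in kmers(read, k)]


def spell_path(path):
    """Rebuild the sequence from a path of K-mers that overlap by K-1 bases."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


read = "AACCGGTTA"  # hypothetical read whose repeats are shorter than K
nodes, edges = node_graph(read, 4)
rebuilt = spell_path(nodes)
```

Because no K-mer occurs twice in this read, the node list is already a simple path, and spelling it out returns the original read.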

Figure 2

A pair-wise overlap represented by a K-mer graph. (a) Two reads have an error-free overlap of 4 bases. (b) One K-mer graph, with K=4, represents both reads. The pair-wise alignment is a by-product of the graph construction. (c) The simple path through the graph implies a contig whose consensus sequence is easily reconstructed from the path.
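
As a rough sketch of the idea in this caption, the Python below builds one K-mer graph from two reads and walks its simple path to obtain a contig. The reads, function names, and the single-start-node assumption are illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def read_kmers(read, k):
    """Return the K-mers of a read in left-to-right order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_graph(reads, k):
    """Merge all reads into one graph: K-mer nodes, edges between adjacent K-mers."""
    nodes = set()
    succ = defaultdict(set)
    for read in reads:
        ks = read_kmers(read, k)
        nodes.update(ks)
        for a, b in zip(ks, ks[1:]):
            succ[a].add(b)
    return nodes, succ


def simple_path(nodes, succ):
    """Walk from the unique start node while each node has exactly one successor."""
    targets = {b for nexts in succ.values() for b in nexts}
    starts = sorted(n for n in nodes if n not in targets)
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    path = [starts[0]]
    seen = {starts[0]}
    while len(succ.get(path[-1], ())) == 1:
        nxt = next(iter(succ[path[-1]]))
        if nxt in seen:  # stop rather than loop forever on a cycle
            break
        path.append(nxt)
        seen.add(nxt)
    return path


# Hypothetical reads with an error-free 4-base overlap ("TGCA").
r1, r2 = "ACGTTGCA", "TGCAGGCT"
k = 4
nodes, succ = build_graph([r1, r2], k)
shared = set(read_kmers(r1, k)) & set(read_kmers(r2, k))  # overlap as a by-product
path = simple_path(nodes, succ)
contig = path[0] + "".join(kmer[-1] for kmer in path[1:])
```

No explicit alignment step is run here: the shared K-mer emerges when both reads are inserted into the same graph.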

Figure 3

Complexity in K-mer graphs can be diagnosed with read multiplicity information. In these graphs, edges represented in more reads are drawn with thicker arrows. (a) An errant base call toward the end of a read causes a “spur” or short dead-end branch. The same pattern could be induced by coincidence of zero coverage after polymorphism near a repeat. (b) An errant base call near a read middle causes a “bubble” or alternate path. Polymorphisms between donor chromosomes would be expected to induce a bubble with parity of read multiplicity on the divergent paths. (c) Repeat sequences lead to the “frayed rope” pattern of convergent and divergent paths.
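
The spur case in panel (a) can be illustrated with a small Python sketch that counts how many reads support each K-mer and flags short, weakly supported dead-end branches. The reads, the `max_len` cutoff, and the rule that a spur must have lower multiplicity than its branch point are all assumptions made for this example; real assemblers use their own thresholds.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Return K-mer multiplicities and the successor sets of the K-mer graph."""
    counts = Counter()
    succ = defaultdict(set)
    for read in reads:
        ks = [read[i:i + k] for i in range(len(read) - k + 1)]
        counts.update(ks)
        for a, b in zip(ks, ks[1:]):
            succ[a].add(b)
    return counts, succ


def find_spurs(counts, succ, max_len):
    """Flag short dead-end branches with lower multiplicity than their branch point."""
    spurs = []
    for node, nexts in succ.items():
        if len(nexts) < 2:
            continue  # not a branch point
        for first in nexts:
            branch = [first]
            # Extend along non-branching nodes, but only a little past max_len.
            while len(succ.get(branch[-1], ())) == 1 and len(branch) <= max_len:
                branch.append(next(iter(succ[branch[-1]])))
            dead_end = len(succ.get(branch[-1], ())) == 0
            weak = max(counts[n] for n in branch) < counts[node]
            if dead_end and weak and len(branch) <= max_len:
                spurs.append(branch)
    return spurs


# Three error-free reads plus one short read with a wrong final base (T -> A).
reads = ["ACGGTCATTGCA"] * 3 + ["ACGGTCAA"]
counts, succ = count_kmers(reads, 4)
spurs = find_spurs(counts, succ, max_len=2)
```

In this toy data the branch point `GTCA` is seen in four reads, the correct continuation in three, and the erroneous K-mer `TCAA` in one, so only the `TCAA` branch is reported.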

Figure 4

Three methods to resolve graph complexity. (a) Read threading joins paths across collapsed repeats that are shorter than the read lengths. (b) Mate threading joins paths across collapsed repeats that are shorter than the paired-end distances. (c) Path following chooses one path if its length fits the paired-end constraint. Reads and mates are shown as patterned lines. Not all tangles can be resolved by reads and mates. The non-branching paths are illustrative; they could be simplified to single edges or nodes.
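
Path following, as in panel (c), might be sketched as follows. The example assumes a simplified graph whose nodes are already merged into contigs of known length, and it defines the paired-end distance as the number of bases spelled from the start of the source node to the end of the target node. The node names, lengths, `tolerance`, and `max_nodes` are hypothetical.

```python
def paths_between(succ, source, target, max_nodes):
    """Enumerate simple paths from source to target with at most max_nodes nodes."""
    stack = [[source]]
    found = []
    while stack:
        path = stack.pop()
        if path[-1] == target:
            found.append(path)
            continue
        if len(path) >= max_nodes:
            continue
        for nxt in succ.get(path[-1], ()):
            if nxt not in path:  # keep paths simple (no repeated nodes)
                stack.append(path + [nxt])
    return found


def path_length(path, node_len, k):
    """Bases spelled by a path whose adjacent nodes overlap by k-1 bases."""
    return sum(node_len[n] for n in path) - (k - 1) * (len(path) - 1)


def follow_path(succ, node_len, k, source, target, distance, tolerance, max_nodes=50):
    """Return the single path whose length fits the mate constraint, else None."""
    fits = [
        p for p in paths_between(succ, source, target, max_nodes)
        if abs(path_length(p, node_len, k) - distance) <= tolerance
    ]
    return fits[0] if len(fits) == 1 else None


# Hypothetical tangle: two alternative routes from A to D.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
node_len = {"A": 100, "B": 50, "C": 300, "D": 100}
chosen = follow_path(succ, node_len, 31, "A", "D", distance=200, tolerance=30)
unresolved = follow_path(succ, node_len, 31, "A", "D", distance=300, tolerance=30)
```

Returning `None` when zero or several paths fit reflects the caption's point that not every tangle can be resolved by mates.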

Cited by

Intraisolate mitochondrial genetic polymorphism and gene variants coexpression in arbuscular mycorrhizal fungi.
Beaudet D, de la Providencia IE, Labridy M, Roy-Bolduc A, Daubois L, Hijri M. Genome Biol Evol. 2014 Dec 19;7(1):218-27. doi: 10.1093/gbe/evu275. PMID: 25527836. Free PMC article.
Genetic basis of a violation of Dollo's Law: re-evolution of rotating sex combs in Drosophila bipectinata.
Seher TD, Ng CS, Signor SA, Podlaha O, Barmina O, Kopp A. Genetics. 2012 Dec;192(4):1465-75. doi: 10.1534/genetics.112.145524. Epub 2012 Oct 19. PMID: 23086218. Free PMC article.
Assessment of metagenomic assembly using simulated next generation sequencing data.
Mende DR, Waller AS, Sunagawa S, Järvelin AI, Chan MM, Arumugam M, Raes J, Bork P. PLoS One. 2012;7(2):e31386. doi: 10.1371/journal.pone.0031386. Epub 2012 Feb 23. PMID: 22384016. Free PMC article.
Complete genome and characteristics of cluster BC bacteriophage SoJo, isolated using Streptomyces mirabilis NRRL B-2400 in Columbia, MD.
Kumar SV, Schaffer N, Bharmal Z, Mood Q; 2022 UMBC Phage Hunters; Erill I, Caruso SM. Microbiol Resour Announc. 2024 Apr 11;13(4):e0006824. doi: 10.1128/mra.00068-24. Epub 2024 Feb 23. PMID: 38394246. Free PMC article.
DIME: a novel framework for de novo metagenomic sequence assembly.
Guo X, Yu N, Ding X, Wang J, Pan Y. J Comput Biol. 2015 Feb;22(2):159-77. doi: 10.1089/cmb.2014.0251. PMID: 25684202. Free PMC article.
