Assembly of large genomes using second-generation sequencing

Michael C Schatz et al. Genome Res. 2010 Sep.

Abstract

Second-generation sequencing technology can now be used to sequence an entire human genome in a matter of days and at low cost. Sequence read lengths, initially very short, have rapidly increased since the technology first appeared, and we now are seeing a growing number of efforts to sequence large genomes de novo from these short reads. In this Perspective, we describe the issues associated with short-read assembly, the different types of data produced by second-gen sequencers, and the latest assembly algorithms designed for these data. We also review the genomes that have been assembled recently from short reads and make recommendations for sequencing strategies that will yield a high-quality assembly.

Figures

Figure 1.

The _k_-mer uniqueness ratio for five well-known organisms and one single-celled human parasite. The ratio is defined here as the percentage of the genome that is covered by unique sequences of length _k_ or longer. The horizontal axis shows the length in base pairs of the sequences. For example, ∼92.5% of the grapevine genome is contained in unique sequences of 100 bp or longer.
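
The definition in this caption can be made concrete with a small sketch. It assumes one plausible reading of the ratio: a position counts as covered if it lies inside at least one _k_-mer that occurs exactly once in the sequence. Only the forward strand is counted, and the function name `kmer_uniqueness_ratio` and the toy sequence are illustrative, not taken from the paper.

```python
from collections import Counter


def kmer_uniqueness_ratio(genome: str, k: int) -> float:
    """Percentage of positions covered by at least one k-mer that occurs once.

    Assumption: forward strand only; a real analysis would likely also
    count reverse complements.
    """
    n = len(genome)
    if k <= 0 or k > n:
        return 0.0

    # Count every k-mer in the sequence.
    counts = Counter(genome[i:i + k] for i in range(n - k + 1))

    covered = 0      # number of positions covered so far
    covered_end = 0  # first position not yet counted as covered
    for i in range(n - k + 1):
        if counts[genome[i:i + k]] == 1:
            # Add only the part of this k-mer not already counted.
            start = max(i, covered_end)
            covered += i + k - start
            covered_end = i + k
    return 100.0 * covered / n


if __name__ == "__main__":
    # Toy sequence: "ACGT" occurs twice, so the first and last bases
    # are not covered by any unique 4-mer.
    print(kmer_uniqueness_ratio("ACGTACGT", 4))
```

Increasing `k` generally raises the ratio, because longer substrings are less likely to repeat, which matches the upward trend the caption describes along the horizontal axis.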

Figure 2.

Differences between an overlap graph and a de Bruijn graph for assembly. Based on a set of ten 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruijn graph (C), a node is created for every _k_-mer in all the reads; here the _k_-mer size is 3. Edges are drawn between every pair of successive _k_-mers in a read, where the _k_-mers overlap by _k_ − 1 bases. In both approaches, repeat sequences create a fork in the graph. Note that only the forward orientation of each sequence is considered here, to simplify the figure.
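
The two constructions in this caption can be sketched in a few lines. The reads below are made up for illustration (the figure's actual reads are not reproduced in this text), the function names are hypothetical, and `min_overlap` is treated as an inclusive minimum, so a caller would pass 6 to express "overlaps >5 bp". As in the figure, only the forward orientation is considered.

```python
def overlap_graph(reads, min_overlap):
    """Map (read_a, read_b) -> length of the longest suffix/prefix overlap.

    Each read is a node; an edge a -> b exists when a suffix of a equals
    a prefix of b and the overlap is at least min_overlap bases.
    """
    edges = {}
    for a in reads:
        for b in reads:
            if a == b:
                continue
            # Try the longest candidate overlap first.
            for olen in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
                if a[-olen:] == b[:olen]:
                    edges[(a, b)] = olen
                    break
    return edges


def de_bruijn_graph(reads, k):
    """Set of directed edges between successive k-mers within each read.

    Nodes are k-mers; successive k-mers overlap by k - 1 bases.
    """
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + k + 1]))
    return edges


if __name__ == "__main__":
    reads = ["ATGGCGTG", "GGCGTGCA", "CGTGCAAT"]  # illustrative 8-bp reads
    for (a, b), olen in overlap_graph(reads, 5).items():
        print(f"{a} -> {b} (overlap {olen})")
    for left, right in sorted(de_bruijn_graph(reads, 3)):
        print(f"{left} -> {right}")
```

Lowering `min_overlap` to 4 in this toy example adds an edge from the first read to the third, which is the kind of transitive overlap the figure draws as a dotted edge.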

Figure 3.

Expected average contig length for a range of different read lengths and coverage values. Also shown are the average contig lengths and N50 lengths for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with reads averaging 52 bp in length.
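
A rough sense of how read length and coverage interact can be had from the Lander-Waterman model, sketched below. This is an assumption: the caption does not state which model produced the curves, and the minimum detectable overlap (`min_overlap`) is a hypothetical parameter chosen for illustration. Under that model the expected contig length is L·((e^(cσ) − 1)/c + (1 − σ)), where L is the read length, c the coverage, and σ = 1 − T/L for minimum overlap T.

```python
import math


def expected_contig_length(read_len: float, coverage: float,
                           min_overlap: float) -> float:
    """Expected contig length in bp under the Lander-Waterman model.

    Assumes uniformly sampled, error-free reads and a repeat-free genome,
    so real assemblies will usually fall short of this value.
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    sigma = 1.0 - min_overlap / read_len
    # expm1 keeps the result accurate when coverage is very small.
    return read_len * (math.expm1(coverage * sigma) / coverage + (1.0 - sigma))


if __name__ == "__main__":
    # Illustrative comparison of the two read lengths named in the caption,
    # with a hypothetical 20-bp minimum overlap.
    for read_len in (52, 710):
        for coverage in (5, 10, 20):
            length = expected_contig_length(read_len, coverage, 20)
            print(f"L={read_len} bp, c={coverage}x: {length:,.0f} bp")
```

The model grows exponentially in c·σ, so at equal coverage, short reads (small σ for a fixed minimum overlap) yield much shorter expected contigs than long reads.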

Cited by

Diversity and evolution of transposable elements in the plant-parasitic nematodes.
Dayi M. BMC Genomics. 2024 May 23;25(1):511. doi: 10.1186/s12864-024-10435-7. PMID: 38783171 Free PMC article.
Haplotype-resolved assembly of diploid and polyploid genomes using quantum computing.
Chen Y, Huang JH, Sun Y, Zhang Y, Li Y, Xu X. Cell Rep Methods. 2024 May 20;4(5):100754. doi: 10.1016/j.crmeth.2024.100754. Epub 2024 Apr 12. PMID: 38614089 Free PMC article.
Closing the genome of unculturable cable bacteria using a combined metagenomic assembly of long and short sequencing reads.
Hiralal A, Geelhoed JS, Hidalgo-Martinez S, Smets B, van Dijk JR, Meysman FJR. Microb Genom. 2024 Feb;10(2):001197. doi: 10.1099/mgen.0.001197. PMID: 38376381 Free PMC article.
Draft Genome Sequence of Alternaria alternata JS-1623, a Fungal Endophyte of Abies koreana.
Park SY, Jeon J, Kim JA, Jeon MJ, Jeong MH, Kim Y, Lee Y, Chung H, Lee YH, Kim S. Mycobiology. 2020 May 7;48(3):240-244. doi: 10.1080/12298093.2020.1756134. eCollection 2020. PMID: 37970559 Free PMC article.
Exploiting Potential Probiotic Lactic Acid Bacteria Isolated from Chlorella vulgaris Photobioreactors as Promising Vitamin B12 Producers.
Ribeiro M, Maciel C, Cruz P, Darmancier H, Nogueira T, Costa M, Laranjeira J, Morais RMSC, Teixeira P. Foods. 2023 Sep 1;12(17):3277. doi: 10.3390/foods12173277. PMID: 37685210 Free PMC article.