Assembly of large genomes using second-generation sequencing
Michael C Schatz et al. Genome Res. 2010 Sep.
Abstract
Second-generation sequencing technology can now be used to sequence an entire human genome in a matter of days and at low cost. Sequence read lengths, initially very short, have rapidly increased since the technology first appeared, and we now are seeing a growing number of efforts to sequence large genomes de novo from these short reads. In this Perspective, we describe the issues associated with short-read assembly, the different types of data produced by second-gen sequencers, and the latest assembly algorithms designed for these data. We also review the genomes that have been assembled recently from short reads and make recommendations for sequencing strategies that will yield a high-quality assembly.
Figures
Figure 1.
The _k_-mer uniqueness ratio for five well-known organisms and one single-celled human parasite. The ratio is defined here as the percentage of the genome that is covered by unique sequences of length _k_ or longer. The horizontal axis shows the length in base pairs of the sequences. For example, ∼92.5% of the grapevine genome is contained in unique sequences of 100 bp or longer.
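The definition in this caption can be made concrete with a small sketch. It assumes one particular reading of the definition: a position counts as covered if it lies inside at least one _k_-mer that occurs exactly once in the genome. It also considers only the forward strand and holds every _k_-mer in memory, which would not scale to the genomes in the figure. The function name `kmer_uniqueness_ratio` and the toy sequences are hypothetical and do not come from the paper.

```python
from collections import Counter


def kmer_uniqueness_ratio(genome: str, k: int) -> float:
    """Percentage of positions covered by at least one k-mer that occurs once.

    Assumption: forward strand only; reverse complements are ignored.
    """
    n = len(genome)
    if k <= 0 or k > n:
        return 0.0

    # Count every k-mer in the sequence.
    counts = Counter(genome[i:i + k] for i in range(n - k + 1))

    covered = 0        # number of positions inside some unique k-mer
    covered_until = 0  # first position not yet counted as covered
    for i in range(n - k + 1):
        if counts[genome[i:i + k]] == 1:
            start = max(i, covered_until)  # skip positions already counted
            end = i + k
            covered += end - start
            covered_until = end
    return 100.0 * covered / n


if __name__ == "__main__":
    toy = "ACGTAC"  # hypothetical toy sequence, "AC" occurs twice
    print(kmer_uniqueness_ratio(toy, 2))  # about 66.7
    print(kmer_uniqueness_ratio(toy, 3))  # 100.0
```

On the toy sequence the ratio rises as _k_ grows, which is the trend the figure plots for real genomes.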
Figure 2.
Differences between an overlap graph and a de Bruijn graph for assembly. Based on the set of 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruijn graph (C), a node is created for every _k_-mer in all the reads; here the _k_-mer size is 3. Edges are drawn between every pair of successive _k_-mers in a read, where the _k_-mers overlap by _k_ − 1 bases. In both approaches, repeat sequences create a fork in the graph. Note here we have only considered the forward orientation of each sequence to simplify the figure.
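The two constructions described in this caption can be sketched in a few lines. This is a minimal illustration, not the implementation of any assembler discussed in the paper. The reads are hypothetical (they are not the 10 reads of panel A), only the forward orientation is considered as in the figure, and the function names `overlap_graph` and `de_bruijn_graph` are chosen here for illustration. The overlap threshold is inclusive (`min_overlap` or more), and transitive edges are not marked.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest proper suffix of a that is a prefix of b.

    Returns 0 if no overlap of at least min_len exists.
    """
    for length in range(min(len(a), len(b)) - 1, max(min_len, 1) - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def overlap_graph(reads, min_overlap):
    """Each read is a node; a directed edge a -> b carries the overlap length."""
    graph = {read: {} for read in reads}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            length = suffix_prefix_overlap(a, b, min_overlap)
            if length > 0:
                graph[a][b] = length
    return graph


def de_bruijn_graph(reads, k):
    """Each k-mer is a node; edges join successive k-mers within a read."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())
        # Successive k-mers overlap by k - 1 bases by construction.
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return {node: sorted(successors) for node, successors in graph.items()}


if __name__ == "__main__":
    reads = ["ACGTC", "GTCAA", "GTCTT"]  # hypothetical reads sharing "GTC"
    print(overlap_graph(reads, min_overlap=3))
    print(de_bruijn_graph(reads, k=3)["GTC"])  # ['TCA', 'TCT']
```

In the toy input, the shared sequence `GTC` produces a fork in both graphs: the read `ACGTC` has two outgoing overlap edges, and the _k_-mer node `GTC` has two successors.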
Figure 3.
Expected average contig length for a range of different read lengths and coverage values. Also shown are the average contig lengths and N50 lengths for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with reads averaging 52 bp in length.
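The caption does not say which model produced the expected lengths. As an assumption, the sketch below uses the classical Lander-Waterman formula, in which the expected contig length for reads of length L, coverage c, and minimum detectable overlap T is L·((e^(cσ) − 1)/c + (1 − σ)) with σ = 1 − T/L. The parameter values in the usage lines are hypothetical and are not taken from the figure or from the dog and panda assemblies.

```python
import math


def expected_contig_length(read_len: float, coverage: float, min_overlap: float) -> float:
    """Expected contig length in bp under the Lander-Waterman model (assumed here).

    read_len    -- read length L in bp
    coverage    -- average depth of coverage c (must be > 0)
    min_overlap -- minimum overlap T in bp needed to join two reads
    """
    if coverage <= 0 or read_len <= 0:
        raise ValueError("read_len and coverage must be positive")
    if not 0 <= min_overlap < read_len:
        raise ValueError("min_overlap must be in [0, read_len)")
    sigma = 1.0 - min_overlap / read_len
    # expm1 keeps precision when coverage * sigma is very small.
    return read_len * (math.expm1(coverage * sigma) / coverage + (1.0 - sigma))


if __name__ == "__main__":
    # Hypothetical settings: same coverage and overlap, two read lengths.
    for read_len in (52, 710):
        length = expected_contig_length(read_len, coverage=8, min_overlap=40)
        print(f"read length {read_len} bp: expected contig {length:.0f} bp")
```

Under this model, at a fixed coverage and overlap requirement, longer reads give much longer expected contigs because a larger fraction of each read is available for detecting overlaps.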
Grants and funding
- R01 GM083873/GM/NIGMS NIH HHS/United States
- R01-LM006845/LM/NLM NIH HHS/United States
- R01 LM006845-10/LM/NLM NIH HHS/United States
- R01 LM006845-11/LM/NLM NIH HHS/United States
- R01-GM083873/GM/NIGMS NIH HHS/United States
- R01 LM006845/LM/NLM NIH HHS/United States
- R01 GM083873-07/GM/NIGMS NIH HHS/United States
- R01 GM083873-08/GM/NIGMS NIH HHS/United States