Identification of optimum sequencing depth especially for de novo genome assembly of small genomes using next generation sequencing data

Aarti Desai et al. PLoS One. 2013.

Abstract

Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing have encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads, and de novo genome assembly from these short reads is computationally very intensive. Because of the lower cost and higher throughput, NGS systems can now sequence genomes at high depth. However, no report is currently available on the impact of high sequence depth on genome assembly using real data sets and multiple assembly algorithms. Some recent studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms; however, these evaluations were performed on simulated datasets. One limitation of simulated datasets is that variables known to impact genome assembly, such as error rate, read length and coverage, are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly of genomes of different sizes using graph-based assembly algorithms and real datasets. Illumina reads for E. coli (4.6 Mb), S. kudriavzevii (11.18 Mb) and C. elegans (100 Mb) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes with all assemblers except Meraculous, which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6-40 GB of RAM, depending on the genome size and the assembly algorithm used. We believe that this information can be extremely valuable to researchers in designing experiments and multiplexing, enabling optimum utilization of sequencing and analysis resources.
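
To make the depth figures concrete, here is a minimal sketch of the usual relation between depth and read count, depth = (number of reads × read length) / genome size, applied to the three genome sizes given in the abstract. The 100 bp read length and the function name `reads_for_depth` are assumptions chosen for illustration; the abstract does not state the read lengths used in the study.

```python
import math


def reads_for_depth(genome_size_bp: int, depth: float, read_length_bp: int) -> int:
    """Smallest read count n such that n * read_length / genome_size >= depth."""
    return math.ceil(genome_size_bp * depth / read_length_bp)


# Genome sizes as quoted in the abstract, in base pairs.
GENOMES_BP = {
    "E. coli": 4_600_000,
    "S. kudriavzevii": 11_180_000,
    "C. elegans": 100_000_000,
}

# Assumption for illustration only; not taken from the study.
READ_LENGTH_BP = 100

if __name__ == "__main__":
    for name, size_bp in GENOMES_BP.items():
        for depth in (50, 100):
            n_reads = reads_for_depth(size_bp, depth, READ_LENGTH_BP)
            print(f"{name}: {depth}X needs about {n_reads:,} reads")
```

Under these assumptions the required read count scales linearly with both genome size and target depth, which is why a 50X rather than a 100X target matters when planning multiplexed runs.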

Conflict of interest statement

Competing Interests: All authors are affiliated with Persistent LABS and/or Persistent Systems, Pune. There are no patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials, as detailed online in the guide for authors.

Figures

Figure 1. N50 value for the genomes assembled by Velvet, SOAPdenovo, ABySS, Meraculous and IDBA-UD.

A) N50 for the assembled E. coli genome: N50 is the length of the smallest contig that, when added to the set of all larger contigs, yields at least 50% of the genome. The N50 values for IDBA-UD, Velvet and SOAPdenovo appeared to plateau at 35X, and that for ABySS at 50X depth of coverage. In contrast, the N50 value of the Meraculous assembly increased until 150X depth of coverage. B) N50 for the assembled S. kudriavzevii genome: IDBA-UD and SOAPdenovo attained peak N50 values at 35X and 100X depth of coverage, respectively, whereas the N50 values of the Velvet, ABySS and Meraculous assemblies increased until 150X depth of coverage. C) N50 for the assembled C. elegans genome: SOAPdenovo, ABySS and IDBA-UD reached peak N50 values at 100X depth of coverage, whereas the N50 value of the Velvet assembly increased approximately 1.5-fold until 150X, with no change thereafter. The Velvet assembly had the best N50 values of the 4 assemblers.
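
The N50 definition in the caption can be sketched as a short function. This is an illustration, not the authors' code: the function name `n50` and the optional `total_length` parameter are assumptions. By default the threshold is half of the summed contig lengths, which is the common convention; passing a genome size as `total_length` follows the caption's wording of "50% of the genome".

```python
from typing import Iterable, Optional


def n50(contig_lengths: Iterable[int], total_length: Optional[int] = None) -> int:
    """Length of the smallest contig in the set of largest contigs that
    together cover at least half of total_length.

    total_length defaults to the sum of all contig lengths. Returns 0 when
    the contigs cover less than half of an explicitly given total_length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    reference = total_length if total_length is not None else sum(lengths)
    target = reference / 2
    running = 0
    for length in lengths:
        running += length  # add contigs from largest to smallest
        if running >= target:
            return length
    return 0


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20]  # made-up lengths, for illustration
    print(n50(contigs))                    # threshold: half the assembly length
    print(n50(contigs, total_length=400))  # threshold: half a stated genome size
```

Because the two thresholds can give different values for the same contig set, it is worth stating which one is used whenever N50 values are compared across assemblers.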

Figure 2. Memory requirement for genome assembly.

The memory required to assemble the E. coli (A), S. kudriavzevii (B) and C. elegans (C) genomes increased with increasing depth of sequencing, although not proportionately.

Grants and funding

The authors have no support or funding to report.
