Use of simulated data sets to evaluate the fidelity of metagenomic processing methods

Nature Methods volume 4, pages 495–500 (2007)

Abstract

Metagenomics is a rapidly emerging field of research for studying microbial communities. To evaluate methods presently used to process metagenomic sequences, we constructed three simulated data sets of varying complexity by combining sequencing reads randomly selected from 113 isolate genomes. These data sets were designed to model real metagenomes in terms of complexity and phylogenetic composition. We assembled sampled reads using three commonly used genome assemblers (Phrap, Arachne and JAZZ), and predicted genes using two popular gene-finding pipelines (fgenesb and CRITICA/GLIMMER). The phylogenetic origins of the assembled contigs were predicted using one sequence similarity–based (blast hit distribution) and two sequence composition–based (PhyloPythia, oligonucleotide frequencies) binning methods. We explored the effects of the simulated community structure and method combinations on the fidelity of each processing step by comparison to the corresponding isolate genomes. The simulated data sets are available online to facilitate standardized benchmarking of tools for metagenomic analysis.
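The construction step described above, in which reads are randomly selected from isolate genome projects and pooled into one data set, can be illustrated with a minimal sketch. It is not the authors' pipeline. The function name `build_simulated_dataset`, the per-genome read counts and the toy read pools are all hypothetical and chosen only for illustration. Each read keeps a label for its source genome, so that later assembly, gene-calling or binning results could be compared against the genome it came from.

```python
import random


def build_simulated_dataset(reads_by_genome, reads_to_draw, seed=0):
    """Pool randomly selected reads from several isolate genomes.

    reads_by_genome: dict mapping a genome name to its list of reads.
    reads_to_draw:   dict mapping a genome name to how many reads to select;
                     larger counts mimic more abundant community members.
    Returns a shuffled list of (genome_name, read) pairs.
    (Hypothetical helper for illustration only.)
    """
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    dataset = []
    for genome, n in reads_to_draw.items():
        pool = reads_by_genome[genome]
        if n > len(pool):
            raise ValueError(f"requested {n} reads from {genome}, only {len(pool)} available")
        # Sample without replacement so no read is selected twice.
        for read in rng.sample(pool, n):
            dataset.append((genome, read))
    rng.shuffle(dataset)  # remove the per-genome ordering
    return dataset


# Toy read pools; real pools would hold the sequencing reads of each isolate.
pools = {
    "genome_A": ["ACGTACGT", "CGTACGTA", "GTACGTAC"],
    "genome_B": ["TTTTGGGG", "GGGGCCCC"],
}
simulated = build_simulated_dataset(pools, {"genome_A": 2, "genome_B": 1}, seed=1)
```

Changing the counts in `reads_to_draw` is one simple way to vary community structure, for example from a data set dominated by a single genome to one with evenly represented members.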

Acknowledgements

We thank A. Lykidis and I. Anderson from the Genome Biology Program at DOE-JGI for their feedback and comments on this manuscript. This work was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program, and the University of California, Lawrence Livermore National Laboratory under contract number W-7405-Eng-48, Lawrence Berkeley National Laboratory under contract number DE-AC02-05CH11231 and Los Alamos National Laboratory under contract number W-7405-ENG-36.

Author information

Authors and Affiliations

  1. Department of Energy Joint Genome Institute (DOE-JGI), 2800 Mitchell Drive, Walnut Creek, 94598, California, USA
    Konstantinos Mavromatis, Natalia Ivanova, Kerrie Barry, Harris Shapiro, Eugene Goltsman, Asaf Salamov, Frank Korzeniewski, Alla Lapidus, Igor Grigoriev, Paul Richardson, Philip Hugenholtz & Nikos C Kyrpides
  2. Bioinformatics and Pattern Discovery Group, IBM T.J. Watson Research Center, 1101 Kitchawan Rd., Yorktown Heights, 10598, New York, USA
    Alice C McHardy & Isidore Rigoutsos
  3. Oak Ridge National Laboratory, Oak Ridge, 37831, Tennessee, USA
    Miriam Land

Authors

  1. Konstantinos Mavromatis
  2. Natalia Ivanova
  3. Kerrie Barry
  4. Harris Shapiro
  5. Eugene Goltsman
  6. Alice C McHardy
  7. Isidore Rigoutsos
  8. Asaf Salamov
  9. Frank Korzeniewski
  10. Miriam Land
  11. Alla Lapidus
  12. Igor Grigoriev
  13. Paul Richardson
  14. Philip Hugenholtz
  15. Nikos C Kyrpides

Contributions

K.M. and N.I. performed the analysis. K.B., H.S. and E.G. performed assemblies with Phrap, JAZZ and Arachne, respectively. A.C.M. performed binning with PhyloPythia. A.S. performed gene predictions with fgenesb and developed and performed binning with BLAST distr. F.K. developed and performed binning with k-mer. M.L. performed gene prediction with the GLIMMER/CRITICA pipeline. A.L., I.G., P.R. and I.R. supported the project. P.H. and N.C.K. supported the project and contributed conceptually. K.M., P.H. and N.C.K. wrote the manuscript.

Corresponding author

Correspondence to Konstantinos Mavromatis.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Cite this article

Mavromatis, K., Ivanova, N., Barry, K. et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods 4, 495–500 (2007). https://doi.org/10.1038/nmeth1043
