Use of simulated data sets to evaluate the fidelity of metagenomic processing methods (original) (raw)

Article
Published: 29 April 2007
Natalia Ivanova 1,
Kerrie Barry 1,
Harris Shapiro 1,
Eugene Goltsman 1,
Alice C McHardy 2,
Isidore Rigoutsos 2,
Asaf Salamov 1,
Frank Korzeniewski 1,
Miriam Land 3,
Alla Lapidus 1,
Igor Grigoriev 1,
Paul Richardson 1,
Philip Hugenholtz 1 &
…
Nikos C Kyrpides 1

Nature Methods volume 4, pages 495–500 (2007)Cite this article

2732 Accesses
299 Citations
26 Altmetric
Metrics details

Abstract

Metagenomics is a rapidly emerging field of research for studying microbial communities. To evaluate methods presently used to process metagenomic sequences, we constructed three simulated data sets of varying complexity by combining sequencing reads randomly selected from 113 isolate genomes. These data sets were designed to model real metagenomes in terms of complexity and phylogenetic composition. We assembled sampled reads using three commonly used genome assemblers (Phrap, Arachne and JAZZ), and predicted genes using two popular gene-finding pipelines (fgenesb and CRITICA/GLIMMER). The phylogenetic origins of the assembled contigs were predicted using one sequence similarity–based (blast hit distribution) and two sequence composition–based (PhyloPythia, oligonucleotide frequencies) binning methods. We explored the effects of the simulated community structure and method combinations on the fidelity of each processing step by comparison to the corresponding isolate genomes. The simulated data sets are available online to facilitate standardized benchmarking of tools for metagenomic analysis.

Please visit methagora to view and post comments on this article

This is a preview of subscription content, access via your institution

Access options

Subscribe to this journal

Receive 12 print issues and online access

$259.00 per year

only $21.58 per issue

Buy this article

Purchase on SpringerLink
Instant access to the full article PDF.

USD 39.95

Prices may be subject to local taxes which are calculated during checkout

Additional access options:

References

Tyson, G.W. et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428, 37–43 (2004).
Article CAS Google Scholar
Venter, J.C. et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science 304, 66–74 (2004).
Article CAS Google Scholar
Garcia Martin, H. et al. Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat. Biotechnol. 24, 1263–1269 (2006).
Article Google Scholar
Hallam, S.J. et al. Genomic analysis of the uncultivated marine crenarchaeote Cenarchaeum symbiosum . Proc. Natl. Acad. Sci. USA 103, 18296–18301 (2006).
Article CAS Google Scholar
Delcher, A.L., Harmon, D., Kasif, S., White, O. & Salzberg, S.L. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 27, 4636–4641 (1999).
Article CAS Google Scholar
Lukashin, A.V. & Borodovsky, M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 26, 1107–1115 (1998).
Article CAS Google Scholar
Deschavanne, P.J., Giron, A., Vilain, J., Fagot, G. & Fertil, B. Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol. Biol. Evol. 16, 1391–1399 (1999).
Article CAS Google Scholar
Karlin, S. & Burge, C. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 11, 283–290 (1995).
Article CAS Google Scholar
Teeling, H., Waldmann, J., Lombardot, T., Bauer, M. & Glockner, F.O. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 5, 163 (2004).
Article Google Scholar
McHardy, A.C., Martin, H.G., Tsirigos, A., Hugenholtz, P. & Rigoutsos, I. Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods 4, 63–72 (2006).
Article Google Scholar
Hugenholtz, P. Exploring prokaryotic diversity in the genomic era. Genome Biol. 3, 0003 (2002).
Article Google Scholar
Liolios, K., Tavernarakis, N., Hugenholtz, P. & Kyrpides, N.C. The genomes on line database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res. 34, D332–D334 (2006).
Article CAS Google Scholar
Markowitz, V.M. et al. The integrated microbial genomes (IMG) system. Nucleic Acids Res. 34, D344–D348 (2006).
Article CAS Google Scholar
Strous, M. et al. Deciphering the evolution and metabolism of an anammox bacterium from a community genome. Nature 440, 790–794 (2006).
Article Google Scholar
Woyke, T. et al. Symbiosis insights through metagenomic analysis of a microbial consortium. Nature 443, 950–955 (2006).
Article CAS Google Scholar
Tringe, S.G. et al. Comparative metagenomics of microbial communities. Science 308, 554–557 (2005).
Article CAS Google Scholar
Jaffe, D.B. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003).
Article CAS Google Scholar
Aparicio, S. et al. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes . Science 297, 1301–1310 (2002).
Article CAS Google Scholar
Chain, P. et al. Complete genome sequence of the ammonia-oxidizing bacterium and obligate chemolithoautotroph Nitrosomonas europaea . J. Bacteriol. 185, 2759–2773 (2003).
Article CAS Google Scholar
Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Article CAS Google Scholar
DeLong, E.F. et al. Community genomics among stratified microbial assemblages in the ocean's interior. Science 311, 496–503 (2006).
Article CAS Google Scholar
Tringe, S.G. & Rubin, E.M. Metagenomics: DNA sequencing of environmental samples. Nat. Rev. Genet. 6, 805–814 (2005).
Article CAS Google Scholar
Tatusov, R.L. et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41 (2003).
Article Google Scholar
Markowitz, V.M. et al. An experimental metagenome data management and analysis system. Bioinformatics 22, e359–e367 (2006).
Article CAS Google Scholar

Download references

Acknowledgements

We thank A. Lykidis and I. Anderson from the Genome Biology Program at DOE-JGI for their feedback and comments on this manuscript. This work was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program, and the University of California, Lawrence Livermore National Laboratory under contract number W-7405-Eng-48, Lawrence Berkeley National Laboratory under contract number DE-AC02-05CH11231 and Los Alamos National Laboratory under contract number W-7405-ENG-36.

Author information

Authors and Affiliations

Department of Energy Joint Genome Institute (DOE-JGI), 2800 Mitchell Drive, Walnut Creek, 94598, California, USA
Konstantinos Mavromatis, Natalia Ivanova, Kerrie Barry, Harris Shapiro, Eugene Goltsman, Asaf Salamov, Frank Korzeniewski, Alla Lapidus, Igor Grigoriev, Paul Richardson, Philip Hugenholtz & Nikos C Kyrpides
Bioinformatics and Pattern Discovery Group, IBM T.J. Watson Research Center, 1101 Kitchawan Rd., Yorktown Heights, 10598, New York, USA
Alice C McHardy & Isidore Rigoutsos
Oak Ridge National Laboratory, Oak Ridge, 37831, Tennessee, USA
Miriam Land

Authors

Konstantinos Mavromatis
Natalia Ivanova
Kerrie Barry
Harris Shapiro
Eugene Goltsman
Alice C McHardy
Isidore Rigoutsos
Asaf Salamov
Frank Korzeniewski
Miriam Land
Alla Lapidus
Igor Grigoriev
Paul Richardson
Philip Hugenholtz
Nikos C Kyrpides

Contributions

K.M. and N.I. performed the analysis, K.B., H.S. and E.G. performed assemblies with Phrap, JAZZ and Arachne respectively, A.C.M. performed binning with PhyloPythia, A.S. performed gene predictions with fgenesb and developed and performed binning with BLAST distr, F.K. developed and performed binning with _k_mer, M.L. performed gene prediction with the GLIMMER/CRITICA pipeline, A.L., I.G., P.R. and I.R. supported the project, P.H. and N.C.K. supported the project and contributed conceptually. K.M., P.H. and N.C.K. wrote the manuscript.

Corresponding author

Correspondence toKonstantinos Mavromatis.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Rights and permissions

About this article

Cite this article

Mavromatis, K., Ivanova, N., Barry, K. et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods.Nat Methods 4, 495–500 (2007). https://doi.org/10.1038/nmeth1043

Download citation

Received: 04 December 2006
Accepted: 21 March 2007
Published: 29 April 2007
Issue date: June 2007
DOI: https://doi.org/10.1038/nmeth1043