Use of simulated data sets to evaluate the fidelity of metagenomic processing methods
- Article
- Published: 29 April 2007
- Konstantinos Mavromatis1,
- Natalia Ivanova1,
- Kerrie Barry1,
- Harris Shapiro1,
- Eugene Goltsman1,
- Alice C McHardy2,
- Isidore Rigoutsos2,
- Asaf Salamov1,
- Frank Korzeniewski1,
- Miriam Land3,
- Alla Lapidus1,
- Igor Grigoriev1,
- Paul Richardson1,
- Philip Hugenholtz1 &
- …
- Nikos C Kyrpides1
Nature Methods volume 4, pages 495–500 (2007)
Abstract
Metagenomics is a rapidly emerging field of research for studying microbial communities. To evaluate methods presently used to process metagenomic sequences, we constructed three simulated data sets of varying complexity by combining sequencing reads randomly selected from 113 isolate genomes. These data sets were designed to model real metagenomes in terms of complexity and phylogenetic composition. We assembled sampled reads using three commonly used genome assemblers (Phrap, Arachne and JAZZ), and predicted genes using two popular gene-finding pipelines (fgenesb and CRITICA/GLIMMER). The phylogenetic origins of the assembled contigs were predicted using one sequence similarity–based (blast hit distribution) and two sequence composition–based (PhyloPythia, oligonucleotide frequencies) binning methods. We explored the effects of the simulated community structure and method combinations on the fidelity of each processing step by comparison to the corresponding isolate genomes. The simulated data sets are available online to facilitate standardized benchmarking of tools for metagenomic analysis.
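The construction described in the abstract, pooling reads drawn at random from isolate genomes, can be illustrated with a minimal sketch. This is not the authors' procedure: the function name `build_simulated_dataset`, the `read_pools` and `reads_per_genome` inputs, and the toy read identifiers are all hypothetical, and the read counts per genome are chosen only for illustration. The sketch assumes each isolate genome comes with a pool of its own sequencing reads, and it keeps the source genome attached to every sampled read so that later assembly, gene prediction, or binning results could be checked against the genome of origin.

```python
import random


def build_simulated_dataset(read_pools, reads_per_genome, seed=0):
    """Combine randomly selected reads from several isolate genomes.

    read_pools: dict mapping genome name -> list of reads (hypothetical input)
    reads_per_genome: dict mapping genome name -> number of reads to draw
    Returns a shuffled list of (genome_name, read) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the data set is reproducible
    dataset = []
    for genome in sorted(read_pools):
        pool = read_pools[genome]
        # Never request more reads than the pool holds.
        n = min(reads_per_genome.get(genome, 0), len(pool))
        # Sample without replacement so no read is used twice.
        for read in rng.sample(pool, n):
            dataset.append((genome, read))
    # Shuffle so downstream tools see reads in no genome-specific order.
    rng.shuffle(dataset)
    return dataset


if __name__ == "__main__":
    # Toy pools: read identifiers stand in for real sequences.
    pools = {
        "genome_A": ["A_read_%d" % i for i in range(100)],
        "genome_B": ["B_read_%d" % i for i in range(100)],
        "genome_C": ["C_read_%d" % i for i in range(100)],
    }
    # Uneven counts mimic a community with one dominant member.
    counts = {"genome_A": 60, "genome_B": 10, "genome_C": 10}
    simulated = build_simulated_dataset(pools, counts, seed=42)
    print(len(simulated), "reads in the simulated data set")
```

Changing the per-genome counts is one simple way to vary community structure, for example by making one genome dominant or by drawing evenly from all genomes.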
Acknowledgements
We thank A. Lykidis and I. Anderson from the Genome Biology Program at DOE-JGI for their feedback and comments on this manuscript. This work was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program, and the University of California, Lawrence Livermore National Laboratory under contract number W-7405-Eng-48, Lawrence Berkeley National Laboratory under contract number DE-AC02-05CH11231 and Los Alamos National Laboratory under contract number W-7405-ENG-36.
Author information
Authors and Affiliations
- Department of Energy Joint Genome Institute (DOE-JGI), 2800 Mitchell Drive, Walnut Creek, 94598, California, USA
Konstantinos Mavromatis, Natalia Ivanova, Kerrie Barry, Harris Shapiro, Eugene Goltsman, Asaf Salamov, Frank Korzeniewski, Alla Lapidus, Igor Grigoriev, Paul Richardson, Philip Hugenholtz & Nikos C Kyrpides
- Bioinformatics and Pattern Discovery Group, IBM T.J. Watson Research Center, 1101 Kitchawan Rd., Yorktown Heights, 10598, New York, USA
Alice C McHardy & Isidore Rigoutsos
- Oak Ridge National Laboratory, Oak Ridge, 37831, Tennessee, USA
Miriam Land
Authors
- Konstantinos Mavromatis
- Natalia Ivanova
- Kerrie Barry
- Harris Shapiro
- Eugene Goltsman
- Alice C McHardy
- Isidore Rigoutsos
- Asaf Salamov
- Frank Korzeniewski
- Miriam Land
- Alla Lapidus
- Igor Grigoriev
- Paul Richardson
- Philip Hugenholtz
- Nikos C Kyrpides
Contributions
K.M. and N.I. performed the analysis; K.B., H.S. and E.G. performed assemblies with Phrap, JAZZ and Arachne, respectively; A.C.M. performed binning with PhyloPythia; A.S. performed gene predictions with fgenesb and developed and performed binning with BLAST distr; F.K. developed and performed binning with kmer; M.L. performed gene prediction with the GLIMMER/CRITICA pipeline; A.L., I.G., P.R. and I.R. supported the project; P.H. and N.C.K. supported the project and contributed conceptually. K.M., P.H. and N.C.K. wrote the manuscript.
Corresponding author
Correspondence to Konstantinos Mavromatis.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Cite this article
Mavromatis, K., Ivanova, N., Barry, K. et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods 4, 495–500 (2007). https://doi.org/10.1038/nmeth1043
- Received: 04 December 2006
- Accepted: 21 March 2007
- Published: 29 April 2007
- Issue date: June 2007
- DOI: https://doi.org/10.1038/nmeth1043