Unsupervised discovery of microbial population structure within metagenomes using nucleotide base composition

Isaam Saeed et al. Nucleic Acids Res. 2012 Mar.

Abstract

One approach to inferring the unknown microbial population structure within a metagenome is binning: clustering nucleotide sequences by shared patterns in base composition. Assigning functional roles to the identified populations gives a deeper understanding of microbial communities than gene-centric approaches that explore overall functionality. In this study, we propose an unsupervised, model-based binning method with two clustering tiers. The first tier uses a novel transformation of the oligonucleotide frequency-derived error gradient (OFDEG) and GC content to generate coarse groups; the second tier uses tetranucleotide frequency to refine these groups. The proposed method improves on PhyloPythia, S-GSOM, TACOA and TaxSOM on all three benchmarks used for evaluation in this study. The method is then applied to a pyrosequenced metagenomic library of mud volcano sediment sampled in southwestern Taiwan, with the inferred population structure validated against complementary sequencing of 16S ribosomal RNA marker genes. Finally, the method is further validated against four publicly available metagenomes, including a highly complex Antarctic whale-fall bone sample that was previously assumed to be too complex for binning prior to functional analysis.
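Two of the base-composition features named in the abstract, GC content and tetranucleotide frequency, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, the OFDEG transformation is not reproduced here, and the sketch assumes plain single-strand counting without merging reverse complements.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides in a fixed order (illustrative choice).
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def gc_content(seq):
    """Fraction of G and C among the unambiguous (A/C/G/T) bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def tetranucleotide_frequency(seq):
    """Relative frequency of each tetramer over overlapping windows.

    Windows containing characters other than A/C/G/T (e.g. N) are ignored.
    Returns a list of 256 floats in the order given by TETRAMERS.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[k] / total for k in TETRAMERS]

# Hypothetical fragment, used only to show the call pattern.
fragment = "ATGCGCGTTAACCGGTTAGC"
features = [gc_content(fragment)] + tetranucleotide_frequency(fragment)
```

Each fragment is thereby mapped to a fixed-length numeric vector, which is the form a clustering method needs as input.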

Figures

Figure 1.

The motivation for the two-tiered clustering framework and the features used therein: (A) The PCA projection of the tetranucleotide frequency of random fragments of nine genomes discriminates poorly between genome types; the first two principal components are shown for visualization, and the same holds when the first three are considered. (B) However, the nine genome types form two coarse groups in the OFDEG and GC content space. (C and D) When the tetranucleotide frequency of fragments is computed with respect to each group, the discrimination between genome types is more clearly evident.
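The kind of PCA projection described in panel (A) can be sketched as follows. This assumes NumPy is available and uses a randomly generated matrix as a stand-in for real tetranucleotide frequencies. The function name `pca_project` and the input data are hypothetical, and the sketch does not reproduce the figure.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project the rows of `features` onto their top principal components."""
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)  # centre each feature column
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

# Hypothetical stand-in for a (fragments x 256) tetranucleotide frequency
# matrix: each row is non-negative and sums to 1.
rng = np.random.default_rng(0)
freqs = rng.dirichlet(np.ones(256), size=50)

coords = pca_project(freqs, n_components=2)  # one 2-D point per fragment
```

Plotting `coords` and colouring each point by its source genome gives a view comparable in kind to panel (A).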

Figure 2.

Comparison of the proposed framework against the two next-best binning methods, PhyloPythia and TaxSOM, on the low complexity (simLC), medium complexity (simMC) and medium–high complexity (sim-BG) benchmark data sets. On the sim-BG benchmark in particular, the proposed framework improves on PhyloPythia and TaxSOM by 78.41% and 17.55% in sensitivity, respectively, and by 0.13% and 9.47% in specificity, respectively.
