Microbial community resemblance methods differ in their ability to detect biologically relevant patterns

Justin Kuczynski et al. Nat Methods. 2010 Oct.

Abstract

High-throughput sequencing methods enable characterization of microbial communities in a wide range of environments on an unprecedented scale. However, insight into microbial community composition is limited by our ability to detect patterns in this flood of sequences. Here we compare the performance of 51 analysis techniques using real and simulated bacterial 16S rRNA pyrosequencing datasets containing either clustered samples or samples arrayed across environmental gradients. We found that many diversity patterns were evident with severely undersampled communities and that methods varied widely in their ability to detect gradients and clusters. Chi-squared distances and Pearson correlation distances performed especially well for detecting gradients, whereas Gower and Canberra distances performed especially well for detecting clusters. These results also provide a basis for understanding tradeoffs between number of samples and depth of coverage, tradeoffs that are important to consider when designing studies to characterize microbial communities.
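Two of the distances named above can be sketched in a few lines. This is a minimal illustration using the common textbook definitions (Pearson distance as 1 minus the correlation coefficient, Canberra as a sum of normalized absolute differences); the exact normalizations used in the study may differ, and the function names and toy count vectors are hypothetical.

```python
import math

def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two abundance vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return 1.0 - cov / math.sqrt(var_x * var_y)

def canberra_distance(x, y):
    """Sum of |a - b| / (|a| + |b|), skipping taxa absent from both samples."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total

# Hypothetical taxon counts for three samples (not data from the study).
sample_a = [10, 0, 5, 1]
sample_b = [20, 0, 10, 2]   # same proportions as sample_a, twice the depth
sample_c = [0, 8, 1, 7]     # a different community

print(pearson_distance(sample_a, sample_b))
print(pearson_distance(sample_a, sample_c))
print(canberra_distance(sample_a, sample_b))
```

In this toy case the Pearson distance ignores the difference in sequencing depth between `sample_a` and `sample_b`, whereas the Canberra distance on raw counts does not, which is one reason counts are often converted to relative abundances first.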

Figures

Figure 1

Schematic of simulations and analysis of data. (a) The 6 stages for the analysis of a simulated environmental gradient. (b) Clustered samples. A hypothetical sample is formed at the root of a hierarchy that defines the relatedness of samples both between and within clusters (d1 and d2; stage 1). The species abundances at the root node (stage 2) are perturbed by an amount proportional to d1, and the results are renormalized to form the species abundances at each cluster (stage 3). The cluster nodes are then perturbed by d2 to produce species abundances at each sample (stage 4). Sample data are generated and analyzed as in (a), and the analysis methods are then evaluated based on their ability to reveal the underlying cluster structure of the samples (stages 5–8).
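The root-to-cluster-to-sample hierarchy in panel (b) can be sketched as follows. This is an illustrative sketch only: the caption does not specify the form of the perturbation, so multiplicative log-normal noise scaled by d1 or d2 is an assumption, as are the function names, the root abundances, and the parameter values.

```python
import math
import random

def perturb(abundances, scale, rng):
    """Multiply each abundance by log-normal noise, then renormalize to sum to 1.

    The log-normal noise model is an assumption made for illustration.
    """
    noisy = [a * math.exp(rng.gauss(0.0, scale)) for a in abundances]
    total = sum(noisy)
    return [v / total for v in noisy]

def simulate_clusters(root, n_clusters, samples_per_cluster, d1, d2, depth, seed=0):
    """Return (cluster_label, counts) pairs for a root -> cluster -> sample hierarchy."""
    rng = random.Random(seed)
    species = range(len(root))
    samples = []
    for cluster in range(n_clusters):
        # Cluster-level abundances: root perturbed in proportion to d1.
        cluster_abund = perturb(root, d1, rng)
        for _ in range(samples_per_cluster):
            # Sample-level abundances: cluster perturbed in proportion to d2.
            sample_abund = perturb(cluster_abund, d2, rng)
            # Draw `depth` simulated sequences from the sample's abundances.
            draws = rng.choices(species, weights=sample_abund, k=depth)
            counts = [0] * len(root)
            for s in draws:
                counts[s] += 1
            samples.append((cluster, counts))
    return samples

# Hypothetical root community with five species.
root = [0.5, 0.25, 0.15, 0.07, 0.03]
data = simulate_clusters(root, n_clusters=3, samples_per_cluster=30,
                         d1=1.0, d2=0.2, depth=1000)
print(len(data), data[0])
```

Choosing d1 large relative to d2 yields well-separated clusters; bringing the two values closer together makes the clusters more subtle.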

Figure 2

Comparison of different gradient methods on the soil dataset and on a simulated gradient dataset with or without noise. Axes represent the first two principal coordinates maximizing the variance in the data, obtained via PCoA (the percentage of the total variance explained by each axis is shown in parentheses). Each data point is a microbial community sample, colored according to either a real gradient (soil pH) or a simulated gradient (arbitrary units). For simulated data, sequencing depth was 1,000 sequences per sample, and species rank-abundance distributions were fit from empirical data.
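For readers unfamiliar with how the axes and their percentages are obtained, here is a minimal sketch of classical PCoA with NumPy. It assumes the standard double-centering formulation and reports the percentage of variance relative to the positive eigenvalues only; the software used in the study may handle negative eigenvalues differently. The function name and the toy distance matrix are hypothetical.

```python
import numpy as np

def pcoa(dist):
    """Classical PCoA: return coordinates and percent variance explained per axis."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared distances to obtain a Gram matrix.
    b = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1e-10                     # drop zero and negative axes
    coords = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    percent = 100.0 * eigvals[keep] / eigvals[keep].sum()
    return coords, percent

# Toy example: four samples at positions 0, 1, 2, 4 along a single gradient.
positions = np.array([0.0, 1.0, 2.0, 4.0])
toy_dist = np.abs(positions[:, None] - positions[None, :])
coords, percent = pcoa(toy_dist)
print(coords.shape, percent)
```

Because the toy distances come from a single underlying gradient, one axis captures essentially all of the variance; real community data typically spread variance over many axes.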

Figure 3

Choice of analysis method reveals or obscures clusters. Keyboard data, simulated data resembling the keyboard data (distinct clusters), and simulated data representing less prominent sample clusters (subtle clusters) were analyzed by the indicated techniques. All simulated data shown in this figure had 90 samples divided into 3 clusters, with 1,000 sequences per sample. Axes are labeled as in Figure 2.

Figure 4

Deep sequencing is superfluous when clusters are prominent, but critical when clusters are subtle. Data representing either prominent or subtle clusters were generated (see Methods) with varying sequencing depths. (a–c) Jaccard distance followed by PCoA was applied to prominent cluster data with 10, 1,000, or 100,000 sequences per sample. No substantial improvement in the effectiveness of the method was found above 1,000 sequences per sample. (d–f) Gower distance followed by PCoA was applied to the same data. (g–i) Gower distance was applied to more subtle clusters. (j–l) Morisita-Horn distance followed by PCoA was applied to the subtle clusters. Although this method explains substantially more of the variance, the clusters are not easily interpretable; this situation persists even with 10 million sequences per sample (data not shown).
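The effect of sequencing depth on a presence/absence measure such as Jaccard distance can be explored with a small sketch. The caption describes data simulated at different depths; the code below instead subsamples a deeper sample without replacement, which is a swapped-in simplification. The function names and count vectors are hypothetical.

```python
import random

def rarefy(counts, depth, rng):
    """Subsample a count vector to `depth` sequences without replacement."""
    pool = [species for species, c in enumerate(counts) for _ in range(c)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of sequences in the sample")
    picked = rng.sample(pool, depth)
    sub = [0] * len(counts)
    for species in picked:
        sub[species] += 1
    return sub

def jaccard_distance(x, y):
    """1 - (shared taxa / taxa present in either sample), presence/absence only."""
    present_x = {i for i, c in enumerate(x) if c > 0}
    present_y = {i for i, c in enumerate(y) if c > 0}
    union = present_x | present_y
    if not union:
        return 0.0
    return 1.0 - len(present_x & present_y) / len(union)

rng = random.Random(42)
# Hypothetical samples of 1,000 sequences each over five taxa.
deep_a = [900, 80, 15, 5, 0]
deep_b = [850, 0, 100, 45, 5]
shallow_a = rarefy(deep_a, 10, rng)
shallow_b = rarefy(deep_b, 10, rng)

print(jaccard_distance(deep_a, deep_b))
print(jaccard_distance(shallow_a, shallow_b))
```

At 10 sequences per sample, rare taxa are often missed entirely, so the shallow distance depends mostly on the dominant taxa and varies from draw to draw.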

Figure 5

Tradeoff between number of samples and number of sequences per sample with prominent and subtle gradients and clusters. Panels show (a) subtle clusters, (b) prominent clusters, (c) subtle gradients, and (d) prominent gradients, with a survey budget of 500,000 sequences allocated to varying numbers of samples, and thus an inversely varying number of sequences per sample. Insets show examples of the gradients and clusters at 5, 100, and 2,000 samples, corresponding to 100,000, 5,000, and 250 sequences per sample, respectively (arranged right to left in each panel). All comparisons use the Pearson distance + PCoA ordination method. Note that the fraction of the variance explained by the PCoA decreases as the number of samples increases, even when the patterns are clearer with more samples. Error bars represent ± s.e.m. of 12 simulations.
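The inverse relationship between sample count and depth under a fixed budget is simple arithmetic, shown here as a short sketch. The function name is hypothetical, and an even split across samples is assumed.

```python
def sequences_per_sample(budget, n_samples):
    """Split a fixed sequencing budget evenly across samples."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return budget // n_samples

BUDGET = 500_000  # total sequences in the survey, as in the caption
for n in (5, 100, 2_000):
    print(n, "samples ->", sequences_per_sample(BUDGET, n), "sequences per sample")
```

These three allocations reproduce the inset depths quoted in the caption: 100,000, 5,000, and 250 sequences per sample.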
