Dirichlet multinomial mixtures: generative models for microbial metagenomics

Ian Holmes et al. PLoS One. 2012.

Abstract

We introduce Dirichlet multinomial mixtures (DMM) for the probabilistic modelling of microbial metagenomics data. These data can be represented as a frequency matrix giving the number of times each taxon is observed in each sample. The samples differ in size, and the matrix is sparse, as communities are diverse and skewed towards rare taxa. Most methods previously used to classify or cluster samples have ignored these features. We describe each community by a vector of taxa probabilities. These vectors are generated from one of a finite number of Dirichlet mixture components, each with different hyperparameters. Observed samples are generated through multinomial sampling. The mixture components cluster communities into distinct 'metacommunities' and hence determine envirotypes or enterotypes, groups of communities with a similar composition. The model can also deduce the impact of a treatment and be used for classification. We wrote software for fitting DMM models using the 'evidence framework' (http://code.google.com/p/microbedmm/). This includes the Laplace approximation of the model evidence. We applied the DMM model to human gut microbe genera frequencies from Obese and Lean twins. According to the model evidence, four clusters fit these data best. Two clusters were dominated by Bacteroides and were homogeneous; two had a more variable community composition. We could not find a significant impact of body mass on community structure. However, Obese twins were more likely to derive from the high-variance clusters. We propose that obesity is not associated with a distinct microbiota but increases the chance that an individual derives from a disturbed enterotype. This is an example of the 'Anna Karenina principle (AKP)' applied to microbial communities: disturbed states have many more configurations than undisturbed ones. We verify this by showing that in a study of inflammatory bowel disease (IBD) phenotypes, ileal Crohn's disease (ICD) is associated with a more variable community.
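The generative process described in the abstract (pick a mixture component, draw a vector of taxa probabilities from that component's Dirichlet, then draw read counts by multinomial sampling) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' software; the function names, mixture weights, hyperparameters and read depths below are all hypothetical values chosen for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw taxa probabilities from Dirichlet(alpha) by normalising gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_dmm(mix_weights, alphas, read_depths, seed=0):
    """Simulate one count vector per entry of read_depths.

    mix_weights : mixture weights, one per component
    alphas      : one Dirichlet hyperparameter vector per component
    read_depths : total reads per sample (samples may differ in size)
    """
    rng = random.Random(seed)
    n_taxa = len(alphas[0])
    counts, labels = [], []
    for depth in read_depths:
        # 1. choose the metacommunity (mixture component) for this sample
        k = rng.choices(range(len(alphas)), weights=mix_weights)[0]
        # 2. draw the community's taxa probabilities from that component
        p = sample_dirichlet(alphas[k], rng)
        # 3. multinomial sampling: assign each read to a taxon
        row = [0] * n_taxa
        for taxon in rng.choices(range(n_taxa), weights=p, k=depth):
            row[taxon] += 1
        counts.append(row)
        labels.append(k)
    return counts, labels

# Hypothetical example: one concentrated component dominated by the first
# taxon, and one low-concentration component giving variable communities.
example_alphas = [[50.0, 5.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
example_depths = [1000, 500, 2000, 800]
example_counts, example_labels = sample_dmm([0.5, 0.5], example_alphas, example_depths)
```

In this sketch, a component with small hyperparameters yields communities that vary widely from sample to sample, which is one way to picture the high-variance clusters mentioned in the abstract.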

Conflict of interest statement

Competing Interests: KH is directly funded through a Unilever research grant to develop bioinformatics tools. All tools developed under this grant are being released open source. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials.

Figures

Figure 1. Model fit for mixture of Dirichlets prior to Twins dataset.

Evaluates model fit for an increasing number of Dirichlet mixture components, using the Laplace approximation to the negative log model evidence.

Figure 2. NMDS plot of Twins dataset with hierarchical cluster labellings.

Samples arising from each of the four components are shown in red, green, blue and magenta, respectively. The black crosses indicate the Dirichlet means of each component.

Figure 3. Heat map of the Twins data and hierarchical clustering.

Heat map showing the Twins data with samples grouped according to the cluster most likely to have generated them. Only 30 out of 131 genera are shown, namely those with the greatest variability across clusters (see Table 1). To the right of each cluster, the mean of the Dirichlet component for that mixture is shown. The data are square-root transformed, so values must be squared to convert the scale to relative abundance.
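Grouping samples by "the cluster most likely to have generated them" can be illustrated with the standard Dirichlet-multinomial marginal likelihood, in which the taxa probabilities are integrated out. The sketch below assumes the mixture weights and hyperparameters are already fitted; it does not reproduce the authors' fitting procedure, and the function names and numeric values are hypothetical.

```python
import math

def dm_log_likelihood(counts, alpha):
    """Log probability of a count vector under a Dirichlet-multinomial
    with hyperparameters alpha (taxa probabilities integrated out)."""
    n = sum(counts)
    a = sum(alpha)
    # multinomial coefficient
    ll = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # ratio of Dirichlet normalising constants
    ll += math.lgamma(a) - math.lgamma(n + a)
    ll += sum(math.lgamma(x + al) - math.lgamma(al)
              for x, al in zip(counts, alpha))
    return ll

def responsibilities(counts, mix_weights, alphas):
    """Posterior probability of each component for one sample."""
    log_post = [math.log(w) + dm_log_likelihood(counts, al)
                for w, al in zip(mix_weights, alphas)]
    m = max(log_post)  # subtract the maximum for numerical stability
    unnorm = [math.exp(lp - m) for lp in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical fitted parameters and one hypothetical sample
fitted_alphas = [[50.0, 5.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
sample = [90, 5, 3, 2]
post = responsibilities(sample, [0.5, 0.5], fitted_alphas)
best_cluster = max(range(len(post)), key=post.__getitem__)
```

Working in log space and subtracting the maximum before exponentiating avoids underflow, which matters because likelihoods of deep samples are extremely small numbers.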

Figure 4. NMDS plot of Twins dataset with class labels.

Samples from Lean individuals are shown in magenta and those from Obese individuals in cyan; Overweight individuals are shown in grey. The black crosses indicate the Dirichlet means of each of the three components for the Obese class, and the black asterisk the mean of the single component for the Lean class. The posterior mean of the entire Obese class is shown as a black circle.

Figure 5. Receiver operating characteristic (ROC) curves for the Twins Dirichlet multinomial and random forests classifiers.

The y-axis gives the true positive percentage (Obese individuals correctly identified) and the x-axis the false positive percentage (Lean individuals flagged as Obese).
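For readers unfamiliar with ROC curves, the following sketch shows how the true positive and false positive rates are traced out as a classification threshold is lowered. It uses fractions rather than percentages, treats Obese as the positive class (label 1) and Lean as the negative class (label 0), and runs on made-up scores that are not taken from the paper.

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists for decreasing score thresholds.

    scores : classifier scores, higher means more likely positive
    labels : 1 for the positive class, 0 for the negative class
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # tied scores share one threshold, so emit a point only
        # when the next score differs (or at the end of the list)
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((fpr[j] - fpr[j - 1]) * (tpr[j] + tpr[j - 1]) / 2.0
               for j in range(1, len(fpr)))

# Made-up scores for six individuals (illustrative only)
demo_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
demo_labels = [1, 1, 0, 1, 0, 0]
demo_fpr, demo_tpr = roc_curve(demo_scores, demo_labels)
demo_auc = auc(demo_fpr, demo_tpr)
```

Multiplying both rates by 100 gives the percentage scale used in the figure.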

Figure 6. NMDS plot of IBD dataset with class labels.

Samples from Healthy individuals (black) and three IBD phenotypes are shown: colonic Crohn's disease (CCD, red), ileal Crohn's disease (ICD, green), and ulcerative colitis (UC, blue). The Dirichlet means of single-component fits to each type are shown by the correspondingly coloured crosses.

Figure 7. Heat map of the IBD data divided by phenotype together with phenotype means.

Heat map showing the IBD data with samples grouped according to IBD phenotype. The means of the four single-component Dirichlet models, fitted to the healthy, colonic Crohn's disease (CCD), ileal Crohn's disease (ICD), and ulcerative colitis (UC) phenotypes, are also shown. Only 25 out of 95 genera are shown, namely those with the greatest variability across phenotypes (see Table 3). The data are square-root transformed, so values must be squared to convert the scale to relative abundance.
