Supervised classification of microbiota mitigates mislabeling errors

Dan Knights et al. ISME J. 2011 Apr.

No abstract available

Figures

Figure 1

Resequenced 454 16S rRNA genes from an infant time series experiment. The data are 60 fecal samples obtained over 2.5 years from a single individual. (a) Principal coordinates analysis of unweighted UniFrac distances derived from sequences from the initial sequencing run. (b) Corrected data. (c) Taxonomic discrepancies between the initial run (a) and the corrected run (b). Sample points are colored by collection time: dark blue points were collected early in the experiment, and light gray points are later samples. Note that the time points from days 19, 55 and 85 are misplaced in panel a (too dark for their position); after resequencing, they cluster with the other dark blue samples (early time points).

Figure 2

(a, b) Metadata error correction using random forests for the forensic identification task (a) and the general body habitat classification task (b). The horizontal axes show the proportion of labels that have been intentionally perturbed, and the vertical axes show the proportion of error in the predictions of the random forest classifier when trained on the full dataset with the perturbed labels. Each point represents the average error over 10 random perturbations of the metadata, with standard error bars. The solid black line shows the amount of error in the metadata itself and serves as a reference for the other curves. The ‘Classifier's reported error’ reflects how well the model ‘thinks’ it is doing based on the partially incorrect metadata, whereas the ‘Classifier's true error’ reflects a ‘god's-eye view’ of how well the model is actually doing based on the true metadata. If the model learns the differences between categories well, it will often assign a mislabeled sample to its true category, but because that prediction disagrees with the perturbed label, it is still reported as an error. Hence the true error is generally lower than the reported error.

(c, d) Principal coordinates analysis plots of the UniFrac distances between samples in the Fierer et al. (2010) dataset; the first two axes (shown) explain 18.0 and 6.3% of the total variation. Panel c shows the data with 40 randomly chosen, intentionally confused labels circled in red, and panel d shows the labels predicted by the random forest classifier (trained with 2000 trees and otherwise default settings using the confused labels). This classifier recovered all of the true class labels for those samples, while introducing only two new incorrect labels. Confused labels that were corrected by the model are indicated with a black square; remaining errors are indicated with a red circle.
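The distinction between reported and true error in panels a and b can be illustrated with a small simulation. The sketch below is not the authors' pipeline. To stay dependency-free it swaps the random forest for a leave-one-out k-nearest-neighbour vote, and it uses synthetic Gaussian data in place of microbiota profiles. All function names, the class layout, the 20% perturbation rate and k=15 are assumptions chosen for illustration.

```python
import random

def make_samples(rng, n_per_class=60, n_features=5):
    """Synthetic stand-in data (assumption): three well-separated Gaussian classes."""
    X, y = [], []
    for label in range(3):
        for _ in range(n_per_class):
            X.append([rng.gauss(4.0 * label, 1.0) for _ in range(n_features)])
            y.append(label)
    return X, y

def perturb_labels(rng, y, fraction, n_classes=3):
    """Replace a fraction of the labels with a different, randomly chosen class."""
    noisy = list(y)
    n_flip = round(fraction * len(y))
    for i in rng.sample(range(len(y)), n_flip):
        noisy[i] = rng.choice([c for c in range(n_classes) if c != y[i]])
    return noisy

def loo_knn_predict(X, labels, k=15):
    """Leave-one-out k-NN: predict each sample from the other samples' labels.

    This is a stand-in for the random forest, not the method in the source.
    """
    preds = []
    for i, xi in enumerate(X):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(X) if j != i
        )
        votes = [labels[j] for _, j in dists[:k]]
        preds.append(max(set(votes), key=votes.count))
    return preds

def error_rate(preds, labels):
    """Proportion of positions where the two label lists disagree."""
    return sum(p != t for p, t in zip(preds, labels)) / len(labels)

rng = random.Random(0)
X, y_true = make_samples(rng)
y_noisy = perturb_labels(rng, y_true, fraction=0.2)

# The classifier only ever sees the perturbed labels.
preds = loo_knn_predict(X, y_noisy)

label_error = error_rate(y_noisy, y_true)    # error in the metadata (reference line)
reported_error = error_rate(preds, y_noisy)  # scored against perturbed labels
true_error = error_rate(preds, y_true)       # scored against true labels

print(f"label error:    {label_error:.3f}")
print(f"reported error: {reported_error:.3f}")
print(f"true error:     {true_error:.3f}")
```

On cleanly separable data like this, predictions that disagree with a perturbed label are mostly corrections rather than mistakes, so the true error should fall below both the reported error and the label error. Real microbiota data will separate less cleanly, and the gap will narrow accordingly.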

References

    1. Breiman L. Random forests. Machine Learning. 2001;45:5–32.
    2. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7:335–336.
    3. Costello EK, Lauber CL, Hamady M, Fierer N, Gordon JI, Knight R. Bacterial community variation in human body habitats across space and time. Science. 2009;326:1694–1697.
    4. Fierer N, Lauber CL, Zhou N, McDonald D, Costello EK, Knight R. Forensic identification using skin bacterial communities. Proc Natl Acad Sci USA. 2010;107:6477–6481.
    5. Hugenholtz P, Tyson GW. Microbiology: metagenomics. Nature. 2008;455:481–483.
