MSClust: A Multi-Seeds based Clustering algorithm for microbiome profiling using 16S rRNA sequence

Wei Chen et al. J Microbiol Methods. 2013 Sep.

Abstract

Recent developments in next-generation sequencing technologies have led to the rapid accumulation of 16S rRNA sequences for microbiome profiling. One key step in data processing is to cluster short sequences into operational taxonomic units (OTUs). Although many methods have been proposed for OTU inference, a major challenge is balancing inference accuracy against computational efficiency: accuracy is often sacrificed to accommodate the need to analyze large numbers of sequences. Inspired by the hierarchical clustering method and a modified greedy network clustering algorithm, we propose a novel multi-seeds based heuristic clustering method, named MSClust, for OTU inference. MSClust first adaptively selects multiple seeds, instead of one seed, for each candidate cluster; the reads are then processed using a greedy clustering strategy. Through many numerical examples, we demonstrate that MSClust uses less memory and achieves better biological accuracy than existing heuristic clustering methods while preserving efficiency and scalability.
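The two-stage idea described in the abstract (several seeds per candidate cluster, then greedy assignment of reads) can be illustrated with a minimal Python sketch. This is not the published MSClust procedure: the abstract does not specify the adaptive seed-selection rule, so the sketch simply promotes the first few reads that join a cluster to seeds. The function names `identity` and `multi_seed_greedy`, the position-wise similarity measure, and the parameter `num_seeds` are all assumptions made for illustration.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Toy similarity (assumption): fraction of matching positions.

    A real pipeline would use an alignment-based or k-mer-based score.
    """
    n = max(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / n


def multi_seed_greedy(reads: List[str], threshold: float,
                      num_seeds: int = 3) -> List[int]:
    """Greedy clustering in which each cluster keeps up to `num_seeds` seeds.

    Simplified seed rule (assumption, not the paper's adaptive rule):
    the first reads to join a cluster become its seeds.
    Returns one cluster label per read, in input order.
    """
    seeds: List[List[str]] = []   # seeds[c] holds the seed reads of cluster c
    labels: List[int] = []
    for read in reads:
        best_cluster, best_sim = -1, 0.0
        for cid, cluster_seeds in enumerate(seeds):
            # A read is scored against its closest seed in the cluster.
            sim = max(identity(read, s) for s in cluster_seeds)
            if sim > best_sim:
                best_cluster, best_sim = cid, sim
        if best_cluster >= 0 and best_sim >= threshold:
            labels.append(best_cluster)
            chosen = seeds[best_cluster]
            if len(chosen) < num_seeds and read not in chosen:
                chosen.append(read)
        else:
            # No cluster is close enough: the read founds a new cluster.
            seeds.append([read])
            labels.append(len(seeds) - 1)
    return labels


reads = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "CCCCCCCCCC"]
multi = multi_seed_greedy(reads, threshold=0.9, num_seeds=3)
single = multi_seed_greedy(reads, threshold=0.9, num_seeds=1)
```

On this toy input, the third read is within the threshold of the second seed but not of the first, so it joins the first cluster only when more than one seed is kept. With `num_seeds=1` the sketch reduces to ordinary single-seed greedy clustering and opens an extra cluster.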

Keywords: 16S rRNA reads; Clustering algorithms; Next-generation sequencing; Operational taxonomic unit (OTU); Seeds-selection.

© 2013 Elsevier B.V. All rights reserved.

Figures

Figure 1

Flow chart of the MSClust algorithm

Figure 2

The pseudo-code for the Seeds Selection procedure.

Figure 3

Comparison of the number of OTUs predicted by ESPRIT, CDHIT, Uclust, GramCluster, MSClust, DNA Clust and ESPRIT-Tree using the V2 benchmark at sequence similarity thresholds ranging from 0.98 to 0.9.

Figure 4

NMI and NID scores for different algorithms based on the species ground truth. (a) NID scores across different algorithms at different thresholds. (b) NMI scores across different algorithms at different thresholds.

Figure 5

The effect of the parameter NumS (number of seeds) on clustering results. (a) NID scores of MSClust at various similarity levels with different NumS. (b) NMI scores of MSClust for various NumS.

Figure 6

The effect of the parameter Tsize (size of the temp cluster) on clustering results. (a) NID scores of MSClust at various similarity levels with different Tsize. (b) NMI scores of MSClust for various Tsize with the same NumS=3.

Figure 7

Phylogenetic tree for the 30 species.

Figure 8

NID and NMI scores for the simulated data set. (a) NID scores for the compared algorithms at different thresholds. (b) NMI scores for the compared algorithms at different thresholds.

Figure 9

NMI and NID scores for different algorithms based on the benchmark dataset Stacked_60 with species ground truth. (a) NID scores across different algorithms at different thresholds. (b) NMI scores across different algorithms at different thresholds.

Figure 10
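The NMI and NID scores named in the figure captions compare a predicted clustering with a ground-truth labeling. The sketch below shows one common way to compute them in Python. The exact normalizations used in the paper are not given on this page, so the choices here are assumptions: NMI is normalized by the geometric mean of the two entropies, and NID (normalized information distance) is taken as one minus mutual information divided by the larger entropy. The function names are illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Mutual information between two labelings of the same items."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())


def nmi(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """NMI with geometric-mean normalization (assumed variant)."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0 or hb == 0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if ha == hb else 0.0
    return mutual_information(a, b) / math.sqrt(ha * hb)


def nid(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """NID = 1 - I(a; b) / max(H(a), H(b)) (assumed variant); lower is better."""
    hmax = max(entropy(a), entropy(b))
    if hmax == 0:
        return 0.0
    return 1.0 - mutual_information(a, b) / hmax


truth = [0, 0, 1, 1]
same_up_to_renaming = [1, 1, 0, 0]
unrelated = [0, 1, 0, 1]
```

Under these definitions, a clustering that matches the ground truth up to a renaming of cluster labels gets NMI 1 and NID 0, while a clustering that is statistically independent of the ground truth gets NMI 0 and NID 1.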
