Model-based clustering and data transformations for gene expression data
K Y Yeung et al. Bioinformatics. 2001 Oct.
Abstract
Motivation: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications.
Results: We benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external evaluation criteria were available. The model-based approach had superior performance on our synthetic data sets, consistently selecting the correct model and the number of clusters. On real expression data, the model-based approach produced clusters of quality comparable to a leading heuristic clustering algorithm, but with the key advantage of suggesting the number of clusters and an appropriate model. We also explored the validity of the Gaussian mixture assumption by assessing how well these real gene expression data sets fit multivariate Gaussian distributions both before and after subjecting them to commonly used data transformations. Suitably chosen transformations seem to result in reasonable fits.
Availability: MCLUST is available at http://www.stat.washington.edu/fraley/mclust. The software for the diagonal model is under development.
Contact: kayee@cs.washington.edu.
Supplementary information: http://www.cs.washington.edu/homes/kayee/model.
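Illustrative sketch (not the authors' MCLUST code): the abstract's idea that choosing the number of clusters becomes a model selection problem can be shown on toy data. The example below fits one-dimensional Gaussian mixtures by EM for several candidate numbers of components and scores each fit. It makes several assumptions that are not in the abstract: the data are univariate rather than multivariate, BIC is used as the selection score (a common choice for mixture models, written here in the "larger is better" form), and the function names fit_mixture, bic, and select_k, the quantile-based starting values, and the synthetic data are all hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(data, k, iters=200):
    """Fit a k-component 1-D Gaussian mixture by EM.

    Returns (weights, means, variances, log_likelihood).
    """
    n = len(data)
    ordered = sorted(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    var_floor = 1e-3 * grand_var  # keeps a component from collapsing onto one point

    # Deterministic start (a rough heuristic): means at evenly spaced quantiles,
    # equal weights, and a shared variance that shrinks as k grows.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [grand_var / k ** 2] * k
    weights = [1.0 / k] * k

    for _ in range(iters):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = max(sum(dens), 1e-300)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)

    log_lik = 0.0
    for x in data:
        mix = sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
        log_lik += math.log(max(mix, 1e-300))
    return weights, means, variances, log_lik


def bic(log_lik, k, n):
    """BIC in the 'larger is better' form: 2*logL - (free parameters)*log(n)."""
    n_params = 3 * k - 1  # (k - 1) mixing weights + k means + k variances
    return 2.0 * log_lik - n_params * math.log(n)


def select_k(data, max_k=4):
    """Fit mixtures with 1..max_k components and return (best_k, scores by k)."""
    scores = {}
    for k in range(1, max_k + 1):
        *_, log_lik = fit_mixture(data, k)
        scores[k] = bic(log_lik, k, len(data))
    return max(scores, key=scores.get), scores


# Toy data: two well-separated synthetic groups (made up for illustration).
rng = random.Random(0)
toy = [rng.gauss(0.0, 1.0) for _ in range(150)] + [rng.gauss(8.0, 1.0) for _ in range(150)]
best_k, scores = select_k(toy)
print("BIC by number of components:", scores)
print("selected number of components:", best_k)
```

In this sketch the candidate with the highest score is taken as the suggested number of clusters. A real analysis of expression data would need multivariate components and a choice among covariance structures, which this toy version does not attempt.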