DiscoVars: A New Data Analysis Perspective -- Application in Variable Selection for Clustering

Variable Selection in Model-Based Clustering: To Do or To Facilitate

ICML, 2010

Variable selection for cluster analysis is a difficult problem. The difficulty originates not only from the lack of class information but also from the fact that high-dimensional data are often multifaceted and can be meaningfully clustered in multiple ways. In such cases the effort to find one subset of attributes that presumably gives the "best" clustering may be misguided. It makes more sense to facilitate variable selection by domain experts, that is, to systematically identify various facets of a data set (each being based on a subset of attributes), cluster the data along each one, and present the results to the domain experts for appraisal and selection. In this paper, we propose a generalization of the Gaussian mixture model, show its ability to cluster data along multiple facets, and demonstrate it is often more reasonable to facilitate variable selection than to perform it.
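
As a rough illustration of the facet idea (not the paper's generalized mixture model), one can fit a separate Gaussian mixture on each candidate subset of variables and present every resulting partition to the domain experts; the column subsets and scikit-learn calls below are assumptions made for the sketch.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 6))                           # toy data with 6 variables
    facets = {"facet_A": [0, 1, 2], "facet_B": [3, 4, 5]}   # hypothetical variable subsets

    for name, cols in facets.items():
        gm = GaussianMixture(n_components=3, random_state=0).fit(X[:, cols])
        labels = gm.predict(X[:, cols])
        # each facet yields its own partition; experts appraise them side by side
        print(name, "BIC:", round(gm.bic(X[:, cols]), 1), "cluster sizes:", np.bincount(labels))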

Assessing variable importance in clustering: a new method based on unsupervised binary decision trees

Computational Statistics, 2019

We consider different approaches for assessing variable importance in clustering. We focus on clustering using binary decision trees (CUBT), a non-parametric top-down hierarchical clustering method designed for both continuous and nominal data. We suggest a measure of variable importance for this method similar to the one used in Breiman's classification and regression trees. This score is useful for ranking the variables in a dataset, determining which variables are the most important, and detecting the irrelevant ones. We analyze both the stability and the efficiency of this score on different data simulation models in the presence of noise, and compare it to other classical variable importance measures. Our experiments show that variable importance based on CUBT is much more efficient than other approaches in a large variety of situations.
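
CUBT itself is not available in standard Python libraries; the sketch below is only a loose stand-in for tree-based importance in a clustering setting (cluster with k-means, then fit a decision tree to the cluster labels and read impurity-based importances), not the paper's CUBT score. The simulated data and scikit-learn calls are assumptions.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(1)
    # 2 informative variables (shifted means) plus 3 pure-noise variables
    X = np.hstack([rng.normal(loc=rng.choice([0, 4], size=(500, 1)), size=(500, 2)),
                   rng.normal(size=(500, 3))])

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)
    # higher importance ~ variable drives the cluster structure; noise columns should rank low
    print(dict(enumerate(tree.feature_importances_.round(3))))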

Selection of Variables for Cluster Analysis and Classification Rules

Journal of the American Statistical Association, 2008

In this paper we introduce two procedures for variable selection in cluster analysis and classification rules. One is mainly oriented toward detecting "noisy", non-informative variables, while the other also deals with multicollinearity. A forward-backward algorithm is also proposed to make these procedures feasible for large data sets. A small simulation study is performed and some real data examples are analyzed.
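
A minimal sketch of the forward step such a procedure might take, with the silhouette coefficient standing in for the paper's criterion (an assumption, as are the k-means and scikit-learn machinery):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def forward_select(X, k=3, max_vars=None):
        """Greedy forward pass: repeatedly add the variable that most improves the silhouette."""
        remaining, chosen = list(range(X.shape[1])), []
        best_score = -1.0
        while remaining and (max_vars is None or len(chosen) < max_vars):
            scores = []
            for j in remaining:
                cols = chosen + [j]
                lab = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[:, cols])
                scores.append((silhouette_score(X[:, cols], lab), j))
            score, j = max(scores)
            if score <= best_score:          # stop when no candidate improves the criterion
                break
            best_score, chosen = score, chosen + [j]
            remaining.remove(j)
        return chosen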

Variable Selection for Model-Based Clustering

Journal of the American Statistical Association, 2006

We consider the problem of variable or feature selection for model-based clustering. We recast the problem of comparing two nested subsets of variables as a model comparison problem, and address it using approximate Bayes factors. We develop a greedy search algorithm for finding a local optimum in model space. The resulting method selects variables (or features), the number of clusters, and the clustering model simultaneously. We applied the method to several simulated and real examples, and found that removing irrelevant variables often improved performance. Compared to methods based on all the variables, our variable selection method consistently yielded more accurate estimates of the number of clusters, and lower classification error rates, as well as more parsimonious clustering models and easier visualization of results.
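
The BIC-difference idea behind the approximate Bayes factors can be sketched as follows; this crude version omits the regression of the candidate variable on the already-selected ones that the actual method uses, and the scikit-learn GaussianMixture calls are an assumed stand-in for the paper's model fitting.

    from sklearn.mixture import GaussianMixture

    def bic_evidence_gain(X, selected, j, k=3):
        """Crude BIC proxy for the Bayes factor: does adding column j as a clustering
        variable beat modeling it with a single Gaussian component?
        `selected` must be a non-empty list of column indices."""
        with_j = GaussianMixture(n_components=k, random_state=0).fit(X[:, selected + [j]])
        without = GaussianMixture(n_components=k, random_state=0).fit(X[:, selected])
        indep_j = GaussianMixture(n_components=1).fit(X[:, [j]])
        # lower BIC is better; a positive gain favors treating j as a clustering variable
        return (without.bic(X[:, selected]) + indep_j.bic(X[:, [j]])) - with_j.bic(X[:, selected + [j]])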

A Mathematical Programming Approach for Selection of Variables in Cluster Analysis

2015

Data clustering is a common technique for statistical data analysis; it is a class of statistical techniques for classifying a set of observations into distinct groups. Cluster analysis seeks to minimize within-group variance and maximize between-group variance. In this study we formulate a mathematical programming model that chooses the most important variables in cluster analysis. A nonlinear binary model is suggested to select the most important variables in clustering a set of data. The idea of the suggested model is to cluster data by minimizing the distance between observations within groups. Indicator variables are used to select the most important variables in the cluster analysis.
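
For very small numbers of variables, the indicator-variable idea can be illustrated by brute-force enumeration rather than by solving a nonlinear binary program; the within-cluster sum of squares used below is an assumed stand-in for the paper's objective.

    import itertools
    import numpy as np
    from sklearn.cluster import KMeans

    def best_indicator_vector(X, k=3, n_select=2):
        """Enumerate binary indicator vectors z (with sum z = n_select) and keep the one
        whose selected variables give the smallest within-cluster sum of squares.
        Assumes standardized columns so distances are comparable across subsets."""
        p = X.shape[1]
        best = (np.inf, None)
        for cols in itertools.combinations(range(p), n_select):
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[:, list(cols)])
            if km.inertia_ < best[0]:
                best = (km.inertia_, cols)
        return best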

Cluster Density Properties Define a Graph for Effective Pattern Feature Selection

IEEE Access, 2020

Feature selection is a challenging problem that occurs in the high-dimensional data analysis of many major applications. It addresses the curse of dimensionality by determining a small set of features to represent high-dimensional data without significant or noticeable loss of information. The purpose of this study is to develop and investigate a new unsupervised feature selection method which uses the k-influence space concept and subspace learning to map features onto a weighted graph and rank them by importance according to the PageRank graph centrality measure. The graph design in this method promotes feature relevance, downgrades redundancy, and is robust to outliers and cluster imbalances. In K-Means classification experiments using the ASU feature selection testing datasets, the method produces better accuracy and normalized mutual information results than state-of-the-art unsupervised feature selection algorithms. In a further evaluation, using a dataset of over 14,000 tweets, conventional classification of features selected by the method gave better sentiment analysis results than deep learning feature selection and classification.
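
A heavily simplified sketch of the graph-plus-PageRank idea: here features are connected by absolute correlation, whereas the paper's k-influence-space construction is designed to reward relevance and downgrade redundancy, so this is only a conceptual stand-in (the use of networkx and the correlation-based weights are assumptions).

    import numpy as np
    import networkx as nx

    def pagerank_feature_ranking(X):
        """Build a weighted feature graph from pairwise |correlation| and rank the
        features with PageRank; the paper's actual graph design is more involved."""
        corr = np.abs(np.corrcoef(X, rowvar=False))
        np.fill_diagonal(corr, 0.0)
        g = nx.from_numpy_array(corr)            # weighted, undirected feature graph
        scores = nx.pagerank(g, weight="weight")
        return sorted(scores, key=scores.get, reverse=True)   # feature indices, best first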

The use of CART and multivariate regression trees for supervised and unsupervised feature selection

Chemometrics and Intelligent Laboratory Systems, 2005

Feature selection is a valuable technique in data analysis for information-preserving data reduction. This paper describes approaches based on Classification and Regression Trees (CART) and Multivariate Regression Trees (MRT) for both supervised and unsupervised feature selection. The well-known CART method performs supervised feature selection by modeling one response variable (y) with explanatory variables (x). The recently proposed CART extension, MRT, can handle more than one response variable (y), which allows supervised feature selection in the presence of multiple response variables. For unsupervised feature selection, where no response variables are available, we propose Auto-Associative Multivariate Regression Trees (AAMRT), in which the original variables (x) are used not only as explanatory variables but also as response variables (y = x). Since (AA)MRT groups the objects into groups with similar response values using the explanatory variables, the variables most responsible for the cluster structure in the data are identified. We demonstrate how these approaches can improve (the detection of) the cluster structure in data and how they can be used for knowledge discovery.
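
The auto-associative idea (y = x) can be approximated with a multi-output regression tree from scikit-learn; this is not the original MRT/AAMRT implementation, only a sketch of the principle.

    from sklearn.tree import DecisionTreeRegressor

    def aamrt_style_importance(X, max_depth=4):
        """Use the variables both as inputs and as the multivariate response (y = x)
        and read variable importances from the fitted tree."""
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0).fit(X, X)
        return tree.feature_importances_     # higher = more responsible for group structure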

Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures

Psychometrika, 2008

Eight different variable selection techniques for model-based and non-model-based clustering are evaluated across a wide range of cluster structures. It is shown that several methods have difficulties when non-informative variables (i.e., random noise) are included in the model. Furthermore, the distribution of the random noise greatly impacts the performance of nearly all of the variable selection procedures. Overall, a variable

A Supervised Methodology to Measure the Variables Contribution to a Clustering

Lecture Notes in Computer Science, 2014

This article proposes a supervised approach to evaluate the contribution of explanatory variables to a clustering. The main idea is to learn to predict the instances' membership in the clusters using each individual variable. All variables are then sorted with respect to their predictive power, which is measured using one of two evaluation criteria: accuracy (ACC) or the Adjusted Rand Index (ARI). Once the relevant variables which contribute to the clustering discrimination have been determined, we filter out the redundant ones using a supervised method. The aim of this work is to help end-users easily understand a clustering of high-dimensional data. Experimental results show that our proposed method is competitive with existing methods from the literature.
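
A minimal sketch of the per-variable contribution measure under assumed choices (k-means clusters, a logistic classifier per variable, ARI as the criterion):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import adjusted_rand_score
    from sklearn.model_selection import cross_val_predict

    def variable_contributions(X, k=3):
        """For each variable, learn to predict cluster membership from that variable
        alone and score it with ARI (accuracy could be used instead)."""
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores = {}
        for j in range(X.shape[1]):
            pred = cross_val_predict(LogisticRegression(max_iter=1000), X[:, [j]], labels, cv=5)
            scores[j] = adjusted_rand_score(labels, pred)
        return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))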

Combining clustering of variables and feature selection using random forests

Communications in Statistics - Simulation and Computation

Standard approaches to tackle high-dimensional supervised classification problems often include variable selection and dimension reduction procedures. The novel methodology proposed in this paper combines clustering of variables and feature selection. More precisely, a hierarchical clustering of variables builds groups of correlated variables in order to reduce the redundancy of information, and each group is summarized by a synthetic numerical variable. The originality is that the groups of variables (and the number of groups) are unknown a priori. Moreover, the clustering approach used can deal with both numerical and categorical variables (i.e., mixed datasets). Among all the possible partitions resulting from dendrogram cuts, the most relevant synthetic variables (i.e., groups of variables) are selected with a variable selection procedure using random forests. Numerical performances of the proposed approach are compared with direct applications of random forests and of variable selection using random forests on the original p variables. Improvements obtained with the proposed methodology are illustrated on two simulated mixed datasets (cases n > p and n < p, where n is the sample size) and on a real proteomic dataset. Via the selection of groups of variables (based on the synthetic variables), interpretation of the results becomes easier.
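
A rough analogue for purely numerical data, under assumed substitutes (hierarchical clustering on 1 - |correlation| instead of the paper's variable-clustering method, first principal components as the synthetic variables, plain random-forest importances instead of the paper's selection procedure):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier

    def grouped_rf_importance(X, y, n_groups=4):
        """Cluster the variables hierarchically, summarize each group by its first
        principal component (a synthetic variable), then rank the synthetic
        variables with a random forest trained on the class labels y."""
        d = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
        Z = linkage(d[np.triu_indices_from(d, k=1)], method="average")   # condensed distances
        groups = fcluster(Z, t=n_groups, criterion="maxclust")
        synth = np.column_stack([PCA(n_components=1).fit_transform(X[:, groups == g])
                                 for g in np.unique(groups)])
        rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(synth, y)
        return dict(zip(np.unique(groups), rf.feature_importances_.round(3)))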