MCExplorer: Interactive Exploration of Multiple (Subspace) Clustering Solutions
Related papers
Discovering Multiple Clustering Solutions: Grouping Objects in Different Views of the Data
2010 IEEE International Conference on Data Mining, 2010
Cluster definition: "Group similar objects in one group, separating dissimilar objects in different groups." Individual instances of this definition differ in their similarity functions, cluster characteristics, data types, and so on, but most provide only a single clustering solution. K-means, for example, aims at a single partitioning of the data: each object is assigned to exactly one cluster, and one set of K clusters forms the resulting groups of objects. In contrast, we focus on multiple clustering solutions...
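To make the contrast between a single partitioning and multiple clustering solutions concrete, here is a minimal sketch in plain Python. It is not the method of the paper above: the toy objects, the two single-attribute "views", and the helper names `kmeans` and `cluster_in_view` are all assumptions chosen for illustration. The same four objects are grouped differently depending on which attribute is considered.

```python
from math import dist

def kmeans(points, init_centroids, iterations=10):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centroids = list(init_centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each object goes to exactly one (nearest) centroid.
        labels = [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def cluster_in_view(objects, dims):
    """Run 2-means on the projection of the objects onto the given dimensions."""
    projected = [tuple(o[d] for d in dims) for o in objects]
    ordered = sorted(set(projected))
    init = [ordered[0], ordered[-1]]  # hypothetical, deterministic initialisation
    return kmeans(projected, init)

# Hypothetical objects with two attributes each.
objects = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0), (10.0, 10.0)]
labels_view_a = cluster_in_view(objects, dims=(0,))  # group by first attribute
labels_view_b = cluster_in_view(objects, dims=(1,))  # group by second attribute
print(labels_view_a, labels_view_b)
```

Each call still returns a single partitioning; the two label lists together form two alternative clustering solutions over the same objects.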
Enhanced Visual Analysis for Cluster Tendency Assessment and Data Partitioning
IEEE Transactions on Knowledge and Data Engineering, 2010
Visual methods have been widely studied and used in data cluster analysis. Given a pairwise dissimilarity matrix $D$ of a set of $n$ objects, visual methods such as the VAT algorithm generally represent $D$ as an $n \times n$ image $I(D)$ where the objects are reordered to reveal hidden cluster structure as dark blocks along the diagonal of the image. A major limitation of such methods is their inability to highlight cluster structure when $D$ contains highly complex clusters. This paper addresses this limitation by proposing a Spectral VAT algorithm, where $D$ is mapped to $D'$ in a graph embedding space and then reordered to $\tilde{D}'$ using the VAT algorithm. A strategy for automatic determination of the number of clusters in $I(\tilde{D}')$ is then proposed, as well as a visual method for cluster formation from $I(\tilde{D}')$ based on the difference between diagonal blocks and off-diagonal blocks. A sampling-based extended scheme is also proposed to enable visual cluster analysis for large data sets. Extensive experimental results on several synthetic and real-world data sets validate our algorithms.
Index Terms: Clustering, VAT, cluster tendency, spectral embedding, out-of-sample extension.
1 INTRODUCTION
A general question in the data mining community is how to organize observed data into meaningful structures (or taxonomies). As a tool of exploratory data analysis [36], cluster analysis aims at grouping objects of a similar kind into their respective categories. Given a data set $O$ comprising $n$ objects $\{o_1, o_2, \ldots, o_n\}$ (e.g., fish, flowers, beers, etc.), (crisp) clustering partitions the data into $c$ groups $C_1, C_2, \ldots, C_c$, so that $C_i \cap C_j = \emptyset$ if $i \neq j$ and $C_1 \cup C_2 \cup \cdots \cup C_c = O$. There have been a large number of data clustering algorithms in the recent literature [24].
In general, clustering of unlabeled data poses three major problems: 1) assessing cluster tendency, i.e., how many clusters to seek or what is the value of $c$?, 2) partitioning the data into $c$ groups, and 3) validating the $c$ clusters discovered. Given "only" a pairwise dissimilarity matrix $D \in \mathbb{R}^{n \times n}$ representing a data set of $n$ objects (i.e., the original object data is not necessarily available), this paper addresses the first two problems, i.e., determining the number of clusters $c$ prior to clustering and partitioning the data into $c$ clusters. Most clustering algorithms require the number of clusters $c$ as an input parameter, so the quality of the resulting clusters is largely dependent on the estimation of
ClustNails: Visual analysis of subspace clusters
Tsinghua Science and Technology, 2012
Subspace clustering addresses an important problem in clustering multi-dimensional data. In sparse multi-dimensional data, many dimensions are irrelevant and obscure the cluster boundaries. Subspace clustering helps by mining the clusters present in only locally relevant subsets of dimensions. However, it is not trivial for analysts to understand the result of subspace clustering. In addition to the grouping information, relevant sets of dimensions and overlaps between groups, both in terms of dimensions and records, need to be analyzed. We introduce a visual subspace cluster analysis system called ClustNails. It integrates several novel visualization techniques with various user interaction facilities to support navigating and interpreting the result of subspace clustering. We demonstrate the effectiveness of the proposed system by applying it to the analysis of real world data and comparing it with existing visual subspace cluster analysis systems.
New exploratory clustering tool
Journal of Chemometrics, 2008
This paper describes a clustering method on three-way arrays making use of an exploratory visualization approach. The aim of this study is to cluster samples in the object mode of a three-way array, which is done using the scores (sample loadings) of a three-way factor model, for example, a Tucker3 or a PARAFAC model. Further, tools are developed to explore and identify reasons for particular clusters by visually mining the data using the clustering results as guidance. We introduce a three-way clustering tool and demonstrate our results on a metabolite profiling dataset. We explore how high performance liquid chromatography (HPLC) measurements of commercial extracts of St. John's wort (natural remedies for the treatment of mild to moderate depression) differ and which chemical compounds account for those differences. Using common distance measures, for example, Euclidean or Mahalanobis, on the scores of a three-way model, we verify that we can capture the underlying clustering structure in the data. Besides this, by making use of the visualization approach, we are able to identify the variables playing a significant role in the extracted cluster structure. The suggested approach generalizes straightforwardly to higher-order data and also to two-way data.
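The two distance measures named in the abstract above can be compared on score vectors with a short sketch. This is only an illustration under assumptions: the scores are restricted to two components so the covariance matrix can be inverted by hand, the covariance triples are hypothetical, and the function names are not taken from the paper or from any library.

```python
from math import sqrt

def covariance_2d(scores):
    """Sample covariance of 2-component score vectors as (sxx, sxy, syy)."""
    n = len(scores)
    mx = sum(s[0] for s in scores) / n
    my = sum(s[1] for s in scores) / n
    sxx = sum((s[0] - mx) ** 2 for s in scores) / (n - 1)
    syy = sum((s[1] - my) ** 2 for s in scores) / (n - 1)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in scores) / (n - 1)
    return sxx, sxy, syy

def euclidean(a, b):
    return sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def mahalanobis(a, b, cov):
    """sqrt(d^T S^-1 d) for a 2x2 covariance S given as (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = a[0] - b[0], a[1] - b[1]
    # Closed-form inverse of a 2x2 matrix: S^-1 = [[syy, -sxy], [-sxy, sxx]] / det
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(q)

# With an identity covariance the two measures coincide.
d_identity = mahalanobis((0.0, 0.0), (3.0, 4.0), (1.0, 0.0, 1.0))
# With variance 4 along the first component, a gap of 2 counts as one unit.
d_stretched = mahalanobis((0.0, 0.0), (2.0, 0.0), (4.0, 0.0, 1.0))
print(d_identity, d_stretched)
```

The Mahalanobis distance rescales each direction by the spread of the scores, so a gap along a high-variance component counts for less than the same gap along a low-variance one.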
Integrating cluster formation and cluster evaluation in interactive visual analysis
2011
Cluster analysis is a popular method for data investigation where data items are structured into groups called clusters. This analysis involves two sequential steps, namely cluster formation and cluster evaluation. In this paper, we propose the tight integration of cluster formation and cluster evaluation in interactive visual analysis in order to overcome the challenges that relate to the black-box nature of clustering algorithms. We present our conceptual framework in the form of an interactive visual environment.
iVIBRATE: Interactive visualization-based framework for clustering large datasets
ACM Transactions on Information Systems, 2006
With continued advances in communication network technology and sensing technology, there is an astounding growth in the amount of data produced and made available through cyberspace. Efficient and high-quality clustering of large datasets continues to be one of the most important problems in large-scale data analysis. A commonly used methodology for cluster analysis on large datasets is the three-phase framework of "sampling/summarization, iterative cluster analysis, and disk-labeling". There are three known problems with this framework, which demand effective solutions. The first problem is how to effectively define and validate irregularly shaped clusters, especially in large datasets. Automated algorithms and statistical methods are typically not effective in handling such particular clusters. The second problem is how to effectively label the entire data on disk (disk-labeling) without introducing additional errors, including the solutions for dealing with outliers, irregular clusters, and cluster boundary extension. The third problem is the lack of research about the issues for effectively integrating the three phases. In this paper, we describe iVIBRATE, an interactive-visualization based three-phase framework for clustering large datasets.
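The generic three-phase framework mentioned in the abstract above can be outlined as follows. This is a rough sketch, not iVIBRATE itself: the visual, interactive parts are replaced by a plain automated k-means-style refinement, and the toy data, the initial centroids, and the names `refine_centroids` and `three_phase` are assumptions made for illustration.

```python
import random
from math import dist

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: dist(point, centroids[c]))

def refine_centroids(points, centroids, iterations=10):
    """Phase 2 stand-in: iterative cluster analysis on the sample only."""
    centroids = list(centroids)
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids

def three_phase(data, init_centroids, sample_size, seed=0):
    # Phase 1: sampling (a summarization step could be used instead).
    sample = random.Random(seed).sample(data, sample_size)
    # Phase 2: cluster analysis restricted to the sample.
    centroids = refine_centroids(sample, init_centroids)
    # Phase 3: label every record of the full data set.
    return [nearest(p, centroids) for p in data]

# Hypothetical data: two well-separated groups of five points each.
data = [(float(i), float(i)) for i in range(5)] + \
       [(100.0 + i, 100.0 + i) for i in range(5)]
labels = three_phase(data, init_centroids=[(0.0, 0.0), (100.0, 100.0)], sample_size=4)
print(labels)
```

Nearest-centroid labeling works here only because the groups are compact and well separated; the irregular clusters, outliers, and boundary extension named in the abstract are exactly the cases where such a simple labeling phase introduces errors.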
Subspace search and visualization to make sense of alternative clusterings in high-dimensional data
2012
In explorative data analysis, the data under consideration often resides in a high-dimensional (HD) data space. Currently many methods are available to analyze this type of data. So far, proposed automatic approaches include dimensionality reduction and cluster analysis, whereas visual-interactive methods aim to provide effective visual mappings to show, relate, and navigate HD data. Furthermore, almost all of these methods conduct the analysis from a singular perspective, meaning that they consider the data in either the original HD data space, or a reduced version thereof. Additionally, HD data spaces often consist of combined features that measure different properties, in which case the particular relationships between the various properties may not be clear to the analysts a priori since they can only be revealed if appropriate feature combinations (subspaces) of the data are taken into consideration. Considering just a single subspace is, however, often not sufficient since different subspaces may show complementary, conjoint, or contradicting relations between data items. Useful information may consequently remain embedded in sets of subspaces of a given HD input data space. Relying on the notion of subspaces, we propose a novel method for the visual analysis of HD data in which we employ an interestingness-guided subspace search algorithm to detect a candidate set of subspaces. Based on appropriately defined subspace similarity functions, we visualize the subspaces and provide navigation facilities to interactively explore large sets of subspaces. Our approach allows users to effectively compare and relate subspaces with respect to involved dimensions and clusters of objects. We apply our approach to synthetic and real data sets. We thereby demonstrate its support for understanding HD data from different perspectives, effectively yielding a more complete view on HD data.
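The abstract above does not spell out its subspace similarity functions, so the following sketch only illustrates the general idea with two common stand-ins, both assumptions: Jaccard similarity over the dimensions involved in two subspaces, and the Rand index over the clusterings found in them. The dimension names and label lists are hypothetical.

```python
from itertools import combinations

def dimension_similarity(dims_a, dims_b):
    """Jaccard similarity of the dimension sets spanning two subspaces."""
    a, b = set(dims_a), set(dims_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def clustering_similarity(labels_a, labels_b):
    """Rand index: share of object pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Two hypothetical subspaces sharing one of three dimensions overall.
sim_dims = dimension_similarity({"height", "weight"}, {"weight", "age"})
# Two hypothetical clusterings of the same four objects.
sim_clusters = clustering_similarity([0, 0, 1, 1], [0, 1, 0, 1])
print(sim_dims, sim_clusters)
```

Two subspaces can score high on one measure and low on the other, which is one way to separate redundant subspaces from those that offer a complementary or contradicting grouping.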
SpecVAT: Enhanced Visual Cluster Analysis
2008 Eighth IEEE International Conference on Data Mining, 2008
Given a pairwise dissimilarity matrix $D$ of a set of objects, visual methods such as the VAT algorithm (for visual analysis of cluster tendency) represent $D$ as an image $I(D)$ where the objects are reordered to highlight cluster structure as dark blocks along the diagonal of the image. A major limitation of such visual methods is their inability to highlight cluster structure in $I(D)$ when $D$ contains clusters with highly complex structure. In this paper, we address this limitation by proposing a Spectral VAT (SpecVAT) algorithm, where $D$ is mapped to $D'$ in an embedding space by spectral decomposition of the Laplacian matrix, and then reordered to $\tilde{D}'$ using the VAT algorithm. We also propose a strategy to automatically determine the number of clusters in $I(\tilde{D}')$, as well as a method for cluster formation from $I(\tilde{D}')$ based on the difference between diagonal blocks and off-diagonal blocks. We demonstrate the effectiveness of our algorithms on several synthetic and real-world data sets that are not amenable to analysis via traditional VAT.
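The reordering step that VAT applies to a dissimilarity matrix can be sketched in a few lines. The code below covers only the basic VAT ordering as commonly described (start at one end of the most dissimilar pair, then repeatedly append the unselected object closest to the selected set); it does not include the spectral embedding that SpecVAT adds, and the toy one-dimensional data and function names are assumptions for illustration.

```python
def vat_order(D):
    """Return a VAT-style ordering of object indices for dissimilarity matrix D."""
    n = len(D)
    # Start from one end of the most dissimilar pair of objects.
    i, _ = max(((p, q) for p in range(n) for q in range(n)),
               key=lambda pq: D[pq[0]][pq[1]])
    order = [i]
    remaining = set(range(n)) - {i}
    while remaining:
        # Append the unselected object closest to any already selected object.
        _, j = min(((p, q) for p in order for q in sorted(remaining)),
                   key=lambda pq: D[pq[0]][pq[1]])
        order.append(j)
        remaining.remove(j)
    return order

def reorder(D, order):
    """Permute rows and columns of D consistently with the ordering."""
    return [[D[p][q] for q in order] for p in order]

# Hypothetical 1-D objects whose two groups are interleaved in the input order.
values = [0.0, 10.0, 1.0, 11.0]
D = [[abs(a - b) for b in values] for a in values]
order = vat_order(D)
D_reordered = reorder(D, order)
print(order)
for row in D_reordered:
    print(row)
```

After reordering, small dissimilarities collect in two 2 x 2 blocks along the diagonal, which is the structure that appears as dark blocks when the matrix is rendered as an image.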
ClusterSculptor: A Visual Analytics Tool for High-Dimensional Data
2007 IEEE Symposium on Visual Analytics Science and Technology, 2007
Cluster analysis (CA) is a powerful strategy for the exploration of high-dimensional data in the absence of a-priori hypotheses or data classification models, and the results of CA can then be used to form such models. But even though formal models and classification rules may not exist in these data exploration scenarios, domain scientists and experts generally have a vast amount of non-compiled knowledge and intuition that they can bring to bear in this effort. In CA, there are various popular mechanisms to generate the clusters; however, the results from their nonsupervised deployment rarely fully agree with this expert knowledge and intuition. To this end, our paper describes a comprehensive and intuitive framework to aid scientists in the derivation of classification hierarchies in CA, using k-means as the overall clustering engine, but allowing them to tune its parameters interactively based on a non-distorted compact visual presentation of the inherent characteristics of the data in high-dimensional space. These include cluster geometry, composition, spatial relations to neighbors, and others. In essence, we provide all the tools necessary for a high-dimensional activity we call cluster sculpting, and the evolving hierarchy can then be viewed in a space-efficient radial dendrogram. We demonstrate our system in the context of the mining and classification of a large collection of millions of data items of aerosol mass spectra, but our framework readily applies to any high-dimensional CA scenario.