Hierarchical Clustering Based on Mutual Information


A new algorithm that extracts clusters in a single step, based on a new information-theoretic notion, is described. The method employs similarity-based sample entropy and probability descriptions to express the scatter in a given dataset. From these quantities, a new information-theoretic association measure called the mutual irrelevance metric is defined to model a (dis)connectivity rule between samples. This metric is used to determine candidate cluster-representative samples, termed cluster indicators. Candidate clusters are then established in a single iteration from an association quantity between samples and cluster indicators. The clustering capability of the new approach is demonstrated on a non-convex dataset that most well-known counterparts find hard to cluster. The approach is also tested against major algorithms on publicly available real datasets, and the experimental results show that it outperforms the methods it is compared to.
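The abstract leaves the exact definitions to the paper, but the pipeline it describes, similarity-based sample probabilities and entropies followed by a pairwise irrelevance score, might be sketched as follows. This is a minimal illustration assuming a Gaussian-kernel similarity; the function names (`pairwise_similarity`, `sample_probabilities`, `sample_entropies`, `mutual_irrelevance`) and the concrete formulas are assumptions, not the paper's definitions.

```python
import numpy as np

def pairwise_similarity(X, sigma=1.0):
    """Gaussian-kernel similarity between all sample pairs (an assumed choice)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def sample_probabilities(S):
    """Similarity-based probability of each sample: its share of the total similarity mass."""
    mass = S.sum(axis=1)
    return mass / mass.sum()

def sample_entropies(S):
    """Similarity-based entropy per sample, from its row of normalized similarities."""
    P = S / S.sum(axis=1, keepdims=True)
    return -np.sum(P * np.log(P + 1e-12), axis=1)

def mutual_irrelevance(S):
    """Pairwise (dis)connectivity score: low similarity -> high irrelevance (assumed form)."""
    return -np.log(S / S.max() + 1e-12)
```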

In an age of increasingly large data sets, investigators in many different disciplines have turned to clustering as a tool for data analysis and exploration. Existing clustering methods, however, typically depend on several nontrivial assumptions about the structure of the data. ...

A single-step information-theoretic algorithm that can identify possible clusters in a dataset is presented. The proposed algorithm represents data scatter in terms of similarity-based data-point entropy and probability descriptions. Using these quantities, an information-theoretic association metric called mutual ambiguity between data points is defined and then employed to determine particular data points called cluster identifiers. A cluster relevance rule is defined to form the individual clusters corresponding to the cluster identifiers determined this way. Because cluster identifiers and their associated cluster-member data points can be identified without recursive or iterative search, the algorithm runs in a single step. The algorithm is tested and validated with experiments on synthetic and anonymized real datasets. Simulation results demonstrate that the proposed algorithm also exhibits statistically more reliable performance than the major algorithms it is compared to.
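To make the single-step structure concrete, choose cluster identifiers from a pairwise ambiguity matrix and then assign every point in one pass via a relevance rule, a rough sketch follows. The greedy threshold test and the argmin assignment are illustrative stand-ins for the paper's actual criteria; the `ambiguity` matrix could be produced by a score such as the `mutual_irrelevance` sketch above.

```python
import numpy as np

def pick_identifiers(ambiguity, threshold):
    """Greedily pick cluster identifiers: a point becomes an identifier if its
    mutual ambiguity with every identifier chosen so far exceeds the threshold.
    (Illustrative selection rule; the paper's exact criterion may differ.)"""
    identifiers = [0]
    for i in range(1, ambiguity.shape[0]):
        if all(ambiguity[i, j] > threshold for j in identifiers):
            identifiers.append(i)
    return identifiers

def assign_clusters(ambiguity, identifiers):
    """Single pass: each point joins the identifier it is least ambiguous with
    (a stand-in for the paper's cluster relevance rule)."""
    return np.argmin(ambiguity[:, identifiers], axis=1)
```

Because both steps make one pass over the data without revisiting earlier decisions, the procedure reflects the non-iterative, single-step character claimed above.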

Clustering ensembles are a relatively new topic in machine learning. A clustering ensemble can find a combined clustering of better quality from multiple partitions, but how to find that combined clustering is a difficult problem. In this paper, we extend the objective function proposed by Strehl & Ghosh, which is based on mutual information, and we present a new algorithm similar to information ...
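For context, Strehl & Ghosh's objective evaluates a candidate combined clustering by its average normalized mutual information (NMI) with the base partitions. A minimal sketch of that criterion is shown below; the `consensus_score` helper name is illustrative, and this reproduces the original objective rather than the extension proposed here.

```python
import numpy as np

def mutual_info(a, b):
    """Mutual information (in nats) between two label vectors, via their contingency counts."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    mi = 0.0
    for la in np.unique(a):
        for lb in np.unique(b):
            n_ab = np.sum((a == la) & (b == lb))
            if n_ab == 0:
                continue
            n_a, n_b = np.sum(a == la), np.sum(b == lb)
            mi += (n_ab / n) * np.log(n * n_ab / (n_a * n_b))
    return mi

def entropy(a):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(a, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def nmi(a, b):
    """NMI with geometric-mean normalization, as in Strehl & Ghosh."""
    denom = np.sqrt(entropy(a) * entropy(b))
    return mutual_info(a, b) / denom if denom > 0 else 0.0

def consensus_score(candidate, partitions):
    """Average NMI of a candidate combined clustering against all base partitions."""
    return np.mean([nmi(candidate, p) for p in partitions])
```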