Graph Based Clustering with Constraints and Active Learning

Active Learning for Semi-Supervised Clustering Framework for High Dimensional Data

Isara Solutions, 2019

In certain clustering tasks it is possible to obtain limited supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to the same or different clusters. The resulting problem is known as semi-supervised clustering, an instance of semi-supervised learning stemming from a traditional unsupervised learning setting. Several algorithms exist for enhancing clustering quality by using supervision in the form of constraints [2]. These algorithms typically utilize the pairwise constraints either to modify the clustering objective function or to learn the clustering distortion measure. Such limited supervision, whether labeled instances or pairwise instance constraints, often significantly improves clustering performance. Despite the vast amount of expert knowledge spent on this problem, most existing work is not designed for handling high-dimensional sparse data [4]. One typical approach specifies a limited number of must-link and cannot-link constraints between pairs of examples. This work presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to obtain improved clustering performance [6]. The clustering and active learning methods are both easily scalable to large datasets and can handle very high-dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves clustering accuracy when given a relatively small amount of supervision [5].
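The constraint-sensitive assignment idea behind pairwise constrained clustering can be sketched as follows. This is a minimal illustration in the spirit of pairwise-constrained k-means, not the exact algorithm from the paper; the function name `pckmeans_assign`, the single penalty weight `w`, and the greedy one-pass assignment order are assumptions made for illustration.

```python
import numpy as np

def pckmeans_assign(X, centroids, must_link, cannot_link, w=1.0):
    """Greedy constraint-sensitive assignment: each point picks the
    centroid minimizing its squared distance plus penalties for
    must-link / cannot-link constraints it would violate with
    already-assigned points."""
    n, k = len(X), len(centroids)
    labels = np.full(n, -1)
    for i in range(n):
        costs = ((X[i] - centroids) ** 2).sum(axis=1)  # shape (k,)
        for c in range(k):
            for (a, b) in must_link:
                j = b if a == i else a if b == i else None
                if j is not None and labels[j] != -1 and labels[j] != c:
                    costs[c] += w  # must-link partner is elsewhere
            for (a, b) in cannot_link:
                j = b if a == i else a if b == i else None
                if j is not None and labels[j] == c:
                    costs[c] += w  # cannot-link partner is already here
        labels[i] = int(np.argmin(costs))
    return labels
```

A full algorithm would alternate this assignment step with a centroid-update step; only the assignment step is shown because that is where the constraints enter the objective.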

Active Selection Constraints for Semi-supervised Clustering Algorithms

International Journal of Information Technology and Computer Science

Semi-supervised clustering algorithms aim to enhance clustering performance using pairwise constraints. However, selecting these constraints randomly or improperly can degrade clustering performance in certain situations and applications. In this paper, we select the most informative constraints to improve semi-supervised clustering algorithms. We present an active selection of constraints, including active must-link (AML) and active cannot-link (ACL) constraints. Based on the Radial Basis Function, we compute a lower bound and an upper bound between data points to select the constraints that improve performance. We test the proposed algorithm against baseline methods and show that our proposed active pairwise constraints outperform other algorithms.
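One plausible reading of the AML/ACL selection above can be sketched as follows: pairs whose RBF similarity exceeds an upper bound become must-link query candidates, and pairs below a lower bound become cannot-link candidates. The threshold values, the function name `active_constraint_candidates`, and the exact form of the bounds here are illustrative assumptions, not the paper's published procedure.

```python
import numpy as np

def active_constraint_candidates(X, gamma=1.0, upper=0.9, lower=0.1):
    """Sketch of AML/ACL candidate selection via RBF similarity:
    s(i, j) = exp(-gamma * ||x_i - x_j||^2).  Pairs above `upper`
    are proposed as must-link queries; pairs below `lower` as
    cannot-link queries."""
    n = len(X)
    aml, acl = [], []
    for i in range(n):
        for j in range(i + 1, n):
            s = np.exp(-gamma * np.sum((X[i] - X[j]) ** 2))
            if s >= upper:
                aml.append((i, j))
            elif s <= lower:
                acl.append((i, j))
    return aml, acl
```

In an active setting, the candidates would then be presented to the oracle rather than assumed correct; the thresholds control how confident the similarity must be before a pair is worth a query.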

Active Learning of constraints using incremental approach in semi-supervised clustering

Semi-supervised clustering aims to improve clustering performance by considering user-provided side information in the form of pairwise constraints. We study the active learning problem of selecting must-link and cannot-link pairwise constraints for semi-supervised clustering. We consider active learning in an iterative framework: in each iteration, queries are selected based on the current clustering outcome and the constraints already available. We use the neighborhood framework, in which point pairs with a must-link constraint belong to the same neighborhood and point pairs with a cannot-link constraint belong to different neighborhoods. If two points belong to the same neighborhood then they belong to the same cluster, and vice versa. We use the Glass Identification Data Set from the UCI Machine Learning Repository and investigate the improvement in clustering time using Incremental Clustering.
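The neighborhood framework described above can be sketched with a union-find structure: must-link pairs merge into one neighborhood, and cannot-link pairs record separations between the resulting neighborhoods. The function name `build_neighborhoods` and this particular data representation are assumptions for illustration, not the paper's implementation.

```python
def build_neighborhoods(n, must_link, cannot_link):
    """Union-find sketch of the neighborhood framework over n points:
    must-link pairs are merged into the same neighborhood; cannot-link
    pairs become separations between the merged neighborhoods."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for a, b in must_link:
        union(a, b)

    # Group points by their root -> neighborhoods.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    neighborhoods = list(groups.values())

    # Cannot-link constraints separate whole neighborhoods, so they
    # are recorded at the level of neighborhood roots.
    separations = {(find(a), find(b)) for a, b in cannot_link}
    return neighborhoods, separations
```

Note how transitivity falls out for free: a must-link between (0, 1) and between (1, 2) puts all three points in one neighborhood, and a cannot-link against any of them separates the whole neighborhood.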

Tackling Noise in Active Semi-supervised Clustering

2020

Constraint-based clustering leverages user-provided constraints to produce a clustering that matches the user’s expectation. In active constraint-based clustering, the algorithm selects the most informative constraints to query in order to produce good clusterings with as few constraints as possible. A major challenge in constraint-based clustering is handling noise: the majority of existing approaches assume that the provided constraints are correct, while that might not be the case. In this paper, we propose a method to identify and correct noisy constraints in active constraint-based clustering. Our approach reasons probabilistically about the correctness of the user’s answers and asks additional constraints to corroborate or correct the suspicious answers. We demonstrate the method’s effectiveness by incorporating it into COBRAS, a state-of-the-art method for active constraint-based clustering. Compared to COBRAS and other active constraint-based clustering algorithms, the resul...

Semi-supervised clustering in graphs

2017

Nowadays, decision processes in various areas (marketing, biology, etc.) require the processing of increasing amounts of more and more complex data. Because of this, there is a growing interest in machine learning techniques, among them clustering. Clustering is the task of finding a partition of items such that items in the same cluster are more similar than items in different clusters. This is a data-driven technique. Data come from different sources and take different forms. One challenge consists in designing a system capable of taking benefit of the different sources of data, even when they come in different forms. Among the different forms a piece of data can take, the description of an object can be a feature vector: a list of attributes, each of which takes a value. Objects can also be described by a graph which captures the relationships objects have with each other. In addition to this, some constraints can be known about the data. It can be know...

Semi-Supervised Clustering with Partial Background Information

Proceedings of the 2006 SIAM International Conference on Data Mining, 2006

Incorporating background knowledge into unsupervised clustering algorithms has been the subject of extensive research in recent years. Nevertheless, existing algorithms implicitly assume that the background information, typically specified in the form of labeled examples or pairwise constraints, has the same feature space as the unlabeled data to be clustered. In this paper, we are concerned with a new problem of incorporating partial background knowledge into clustering, where the labeled examples have moderate overlapping features with the unlabeled data. We formulate this as a constrained optimization problem, and propose two learning algorithms to solve the problem, based on hard and fuzzy clustering methods. An empirical study performed on a variety of real data sets shows that our proposed algorithms improve the quality of clustering results with limited labeled examples.

Active Learning for Semi-Supervised K-Means Clustering

2010

K-Means is one of the most widely used clustering algorithms for Knowledge Discovery in Data Mining. Seed-based K-Means integrates a small set of labeled data (called seeds) into the K-Means algorithm to improve its performance and overcome its sensitivity to initial centers, which are most of the time generated at random, or assumed to be available for each cluster. This paper introduces a new efficient algorithm for active seed selection which relies on a Min-Max approach that favors coverage of the whole dataset. Experiments conducted on artificial and real datasets show that, using our active seed selection algorithm, the seeds can be collected such that, for each data set, each cluster has at least one seed after a very small number of queries, and that using the collected seeds reduces the number of iterations K-Means needs to converge, which is crucial in many KDD applications.
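A Min-Max selection of the kind described above is commonly realized as farthest-first traversal: each query point is the one whose distance to its nearest already-selected point is largest, which spreads queries over the whole dataset and tends to hit every cluster quickly. The function name `min_max_seed_queries` and the random choice of the first point are assumptions for illustration; the paper's exact criterion may differ.

```python
import numpy as np

def min_max_seed_queries(X, n_queries, rng=None):
    """Farthest-first (Min-Max) query selection sketch: repeatedly pick
    the point maximizing the distance to its nearest selected point."""
    rng = np.random.default_rng(rng)
    selected = [int(rng.integers(len(X)))]   # start from a random point
    d = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < n_queries:
        nxt = int(np.argmax(d))              # farthest from current set
        selected.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return selected
```

With well-separated clusters, k such queries typically land one per cluster, which matches the abstract's claim that each cluster obtains a seed after very few queries.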

An efficient method for active semi-supervised density based clustering

Semi-supervised clustering algorithms rely on side information, either labeled data (seeds) or pairwise constraints (must-link or cannot-link) between data objects, to improve the quality of clustering. This paper proposes to extend an existing seed-based clustering algorithm with an active learning mechanism to collect pairwise constraints. My new semi-supervised algorithm can deal with both seeds and constraints. Experimental results on real data sets show the efficiency of my algorithm compared to the initial seed-based clustering algorithm.

An Efficient Learning of Constraints For Semi-Supervised Clustering using Neighbour Clustering Algorithm

Data mining is the process of finding previously unknown and potentially interesting patterns and relations in a database. Data mining is a step in the knowledge discovery in databases (KDD) process. The structures that are the outcome of the data mining process must meet certain conditions to be considered knowledge: validity, understandability, utility, novelty, and interestingness. Researchers identify two fundamental goals of data mining: prediction and description. The proposed research work addresses the semi-supervised clustering problem, in which it is known (with varying degrees of certainty) that some sample pairs are (or are not) in the same class. It presents a probabilistic model for semi-supervised clustering based on Shared Semi-supervised Neighbor Clustering (SSNC) that provides a principled framework for incorporating supervision into prototype-based clustering, combining the constraint-based and fitness-based approaches in a unified model. The proposed method performs a constraint-sensitive assignment of instances to clusters: points are assigned to clusters so that the overall distortion of the points from the cluster centroids is minimized while a minimum number of must-link and cannot-link constraints are violated. Experimental results on UCI Machine Learning semi-supervised datasets show that the proposed method achieves higher F-Measures than many existing semi-supervised clustering methods.

Active seed selection for constrained clustering

Intelligent Data Analysis, 2017

Active learning for semi-supervised clustering allows algorithms to solicit a domain expert for side information in the form of instance constraints, for example a set of labeled instances called seeds. The problem consists in selecting the queries to the expert that are likely to improve either the relevance or the quality of the proposed clustering. However, these active methods suffer from several limitations: (i) they are generally tailored for only one specific clustering paradigm or cluster shape and size, (ii) they may be counter-productive if the seeds are not selected in an appropriate manner, and (iii) they have to work efficiently with minimal expert supervision. In this paper, we propose a new active seed selection algorithm that relies on a k-nearest-neighbors structure to locate dense potential clusters and to efficiently query and propagate expert information. Our approach makes no hypothesis on the underlying data distribution and can be paired with any clustering algorithm. Comparative experiments conducted on real data sets show the efficiency of this new approach compared to existing ones.
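A k-nearest-neighbors structure for locating dense seed candidates, as described above, can be sketched as follows: score each point by the inverse of its mean distance to its k nearest neighbors, then greedily pick high-density points whose neighborhoods do not overlap an already-chosen seed. The function name `knn_density_seeds`, the density score, and the coverage rule are illustrative assumptions, not the algorithm published in the paper.

```python
import numpy as np

def knn_density_seeds(X, k, n_seeds):
    """k-NN density sketch for seed candidates: points with the smallest
    mean k-NN distance are densest; greedily select them while skipping
    points already covered by a chosen seed's neighborhood."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    knn_idx = np.argsort(D, axis=1)[:, 1:k + 1]            # skip self
    density = 1.0 / D[np.arange(n)[:, None], knn_idx].mean(axis=1)
    order = np.argsort(-density)                           # densest first
    seeds, covered = [], set()
    for i in order:
        if i in covered:
            continue
        seeds.append(int(i))
        covered.update(knn_idx[i].tolist())
        covered.add(int(i))
        if len(seeds) == n_seeds:
            break
    return seeds
```

The coverage set is what spreads the queries: once a dense region has a seed, its neighbors are skipped, so subsequent queries probe other potential clusters rather than re-sampling the same dense area.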