Deep learning vs spectral clustering into an active clustering with pairwise constraints propagation

Semi-supervised Spectral Clustering with automatic propagation of pairwise constraints

In our data-driven world, clustering is of major importance in helping end-users and decision makers understand information structures. Supervised learning techniques rely on ground truth to perform classification and are usually subject to overfitting. Unsupervised clustering techniques, on the other hand, study the structure of the data without access to any training data. Given the difficulty of the task, unsupervised learning tends to provide inferior results to supervised learning. To boost performance, a compromise is to use supervision only for some of the ambiguous classes. In this context, this paper studies the impact of pairwise constraints on unsupervised spectral clustering. We introduce a new generalization of constraint propagation which maximizes partitioning quality while reducing annotation costs. Experiments show the efficiency of the proposed scheme.
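
The abstract does not spell out the propagation scheme, so the snippet below is only a minimal sketch of the general idea, under our own assumptions (function names, the RBF affinity, and the transitive-closure propagation are ours, not the authors'): inject must-link and cannot-link pairs into an affinity matrix, propagate must-links transitively, and hand the result to off-the-shelf spectral clustering.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

def constrained_affinity(X, must_link, cannot_link, gamma=1.0):
    """Bake pairwise constraints into an RBF affinity matrix."""
    W = rbf_kernel(X, gamma=gamma)
    # Transitive closure of must-links: points in the same connected
    # component of the must-link graph get full similarity.
    n = len(X)
    M = np.zeros((n, n))
    for i, j in must_link:
        M[i, j] = M[j, i] = 1
    _, comp = connected_components(M, directed=False)
    W[comp[:, None] == comp[None, :]] = 1.0
    for i, j in cannot_link:
        W[i, j] = W[j, i] = 0.0
    return W

# Usage: two noisy blobs with a handful of annotated pairs.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 3])
W = constrained_affinity(X, must_link=[(0, 1), (1, 2)],
                         cannot_link=[(0, 60)], gamma=0.5)
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(W)
```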

Semi-Supervised Clustering with Neural Networks

2020

Clustering using neural networks has recently demonstrated promising performance in machine learning and computer vision applications. However, the performance of current approaches is limited either by unsupervised learning or by their dependence on large sets of labeled data samples. In this paper, we propose ClusterNet, which uses pairwise semantic constraints from very few labeled data samples (< 5% of the total data) and exploits the abundant unlabeled data to drive the clustering. We define a new loss function that combines pairwise semantic similarity between objects with constrained k-means clustering to efficiently utilize both labeled and unlabeled data in the same framework. The proposed network uses a convolutional autoencoder to learn a latent representation that groups data into k specified clusters, while simultaneously learning the cluster centers. We evaluate ClusterNet on several datasets and compare it against state-of-the-art deep clustering approaches.
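
As a hedged illustration only (the authors' implementation is not shown in the abstract; the weighting lam, the margin, and all function names below are our assumptions, and the constrained k-means step on the cluster centers is omitted), a ClusterNet-style objective can be sketched as an autoencoder reconstruction term plus a pairwise penalty on the latent codes:

```python
import torch
import torch.nn.functional as F

def pairwise_constraint_loss(z, must_link, cannot_link, margin=1.0):
    """z: (n, d) latent codes; constraints are non-empty lists of index pairs."""
    ml = torch.stack([(z[i] - z[j]).pow(2).sum() for i, j in must_link]).mean()
    d_cl = torch.stack([(z[i] - z[j]).pow(2).sum().sqrt() for i, j in cannot_link])
    cl = F.relu(margin - d_cl).pow(2).mean()  # hinge: push cannot-links apart
    return ml + cl

def clusternet_style_loss(x, x_hat, z, must_link, cannot_link, lam=0.1):
    # Reconstruction keeps the latent space faithful to the data;
    # the constraint term shapes it for clustering.
    return F.mse_loss(x_hat, x) + lam * pairwise_constraint_loss(
        z, must_link, cannot_link)
```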

Semi-supervised Clustering for High-dimensional and Sparse Features: A Dissertation in Information Sciences and Technology

Clustering is one of the most common data mining tasks, used frequently for data organization and analysis in various application domains. Traditional machine learning approaches to clustering are fully automated and unsupervised, where class labels are unknown a priori. In real application domains, however, some "weak" form of side information about the domain or data sets is often available or derivable. In particular, information in the form of instance-level pairwise constraints is general and relatively easy to derive. The problem with traditional clustering techniques is that they cannot benefit from side information even when it is available. I study the problem of semi-supervised clustering, which aims to partition a set of unlabeled data items into coherent groups given a collection of constraints. Because semi-supervised clustering promises higher quality with little extra human effort, it is of great interest both in theory and in practice. Semi-supervised clu...

Active Learning for Semi-Supervised Clustering Framework for High Dimensional Data

isara solutions, 2019

In certain clustering tasks it is possible to obtain limited supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to the same or different clusters. The resulting problem is known as semi-supervised clustering, an instance of semi-supervised learning stemming from a traditional unsupervised learning setting. Several algorithms exist for enhancing clustering quality by using supervision in the form of constraints [2]. These algorithms typically utilize the pairwise constraints either to modify the clustering objective function or to learn the clustering distortion measure. Semi-supervised clustering thus employs limited supervision, in the form of labeled instances or pairwise instance constraints, to aid unsupervised clustering, and often significantly improves clustering performance. Despite the vast amount of effort spent on this problem, most existing work is not designed for handling high-dimensional sparse data [4]. One typical approach specifies a limited number of must-link and cannot-link constraints between pairs of examples, within a pairwise constrained clustering framework that also actively selects informative pairwise constraints to obtain improved clustering performance [6]. The clustering and active learning methods are both easily scalable to large datasets and can handle very high dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision [5].
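
As a rough sketch of the kind of objective such a framework optimizes (a simplification in the spirit of [6], not its actual code; the penalty weights w and w_bar and the function name are illustrative): k-means distortion plus a penalty for each violated constraint.

```python
import numpy as np

def constrained_kmeans_objective(X, labels, centers, must_link, cannot_link,
                                 w=1.0, w_bar=1.0):
    """k-means distortion plus penalties for violated pairwise constraints."""
    distortion = ((X - centers[labels]) ** 2).sum()
    ml_violations = sum(labels[i] != labels[j] for i, j in must_link)
    cl_violations = sum(labels[i] == labels[j] for i, j in cannot_link)
    return distortion + w * ml_violations + w_bar * cl_violations
```

In this line of work, the pairs themselves are chosen actively (for instance by a farthest-first traversal over the data) before being shown to the annotator, so each queried constraint is as informative as possible.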

Deep Clustering: A Comprehensive Survey

arXiv (Cornell University), 2022

Cluster analysis plays an indispensable role in machine learning and data mining. Learning a good data representation is crucial for clustering algorithms. Recently, deep clustering, which can learn clustering-friendly representations using deep neural networks, has been broadly applied in a wide range of clustering tasks. Existing surveys of deep clustering mainly focus on single-view settings and network architectures, ignoring the complex application scenarios of clustering. To address this issue, this paper provides a comprehensive survey of deep clustering from the perspective of data sources. Considering different data sources and initial conditions, we systematically distinguish clustering methods in terms of methodology, prior knowledge, and architecture. Concretely, deep clustering methods are introduced according to four categories: traditional single-view deep clustering, semi-supervised deep clustering, deep multi-view clustering, and deep transfer clustering. Finally, we discuss the open challenges and potential future opportunities in different fields of deep clustering.

A classification-based approach to semi-supervised clustering with pairwise constraints

Neural Networks, 2020

In this paper, we introduce a neural network framework for semi-supervised clustering (SSC) with pairwise (must-link or cannot-link) constraints. In contrast to existing approaches, we decompose SSC into two simpler classification tasks/stages: the first stage uses a pair of Siamese neural networks to label the unlabeled pairs of points as must-link or cannot-link; the second stage uses the fully pairwise-labeled dataset produced by the first stage in a supervised neural-network-based clustering method. The proposed approach, S³C² (Semi-Supervised Siamese Classifiers for Clustering), is motivated by the observation that binary classification (such as assigning pairwise relations) is usually easier than multi-class clustering with partial supervision. Moreover, being classification-based, our method solves only well-defined classification problems, rather than less well-specified clustering tasks. Extensive experiments on various datasets demonstrate the high performance of the proposed method.
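
The abstract does not give the architecture, so the sketch below is only a schematic of the first stage under our own assumptions (layer sizes, the |za - zb| pair representation, and the sigmoid head are illustrative): a shared Siamese encoder whose symmetric pair representation feeds a binary must-link/cannot-link classifier.

```python
import torch
import torch.nn as nn

class SiamesePairClassifier(nn.Module):
    """Stage 1: predict must-link (1) vs cannot-link (0) for a pair."""
    def __init__(self, in_dim, hid=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hid), nn.ReLU(), nn.Linear(hid, hid))
        self.head = nn.Linear(hid, 1)

    def forward(self, xa, xb):
        za, zb = self.encoder(xa), self.encoder(xb)
        # |za - zb| makes the score invariant to the order of the pair.
        return torch.sigmoid(self.head((za - zb).abs())).squeeze(-1)

model = SiamesePairClassifier(in_dim=10)
loss_fn = nn.BCELoss()  # trained on the few labeled pairs, then used to
                        # pairwise-label the rest of the dataset
```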


Semi-supervised clustering with deep metric learning and graph embedding

World Wide Web, 2019

As a common technique in social network analysis, clustering has attracted much research interest due to its high performance, and many clustering methods have been presented. Most existing clustering methods are based on unsupervised learning, yet in real applications we can usually obtain a few labeled samples. Recently, several semi-supervised clustering methods have been proposed, but there is still much room for improvement. In this paper, we aim to tackle two research questions in the process of semi-supervised clustering: (i) how to learn more discriminative feature representations to boost the clustering process; (ii) how to effectively make use of both the labeled and unlabeled data to enhance clustering performance. To address these two issues, we propose a novel semi-supervised clustering approach based on deep metric learning (SCDML) which leverages deep metric learning and semi-supervised learning effectively in a novel way. To make the extracted features more representative and the label propagation network more suitable for real applications, we further improve our approach by adopting a triplet loss in the deep metric learning network and combining graph embedding with a label propagation strategy that dynamically promotes unlabeled data to labeled data; this variant is named semi-supervised clustering with deep metric learning and graph embedding (SCDMLGE). SCDMLGE enhances the robustness of the metric learning network and improves the accuracy of clustering. Substantial experimental results on the MNIST, CIFAR-10, YaleB, and 20-Newsgroups benchmarks demonstrate the effectiveness of our proposed approaches.
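
As a minimal sketch of the two named ingredients (variable names are ours, and scikit-learn's generic LabelSpreading stands in for the paper's own propagation network): a triplet loss for the metric network, and label spreading over a kNN graph of the learned embeddings.

```python
import torch
import torch.nn.functional as F
from sklearn.semi_supervised import LabelSpreading

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull the anchor toward the positive, push it from the negative."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()

def propagate_labels(embeddings, labels):
    """Spread the few known labels over a kNN graph of the learned
    embeddings; unlabeled points are marked with -1."""
    model = LabelSpreading(kernel="knn", n_neighbors=7)
    return model.fit(embeddings, labels).transduction_
```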

Scalable semi-supervised clustering by spectral kernel learning

Pattern Recognition Letters, 2014

Kernel learning is one of the most important recent approaches to constrained clustering. Many kernel learning methods have been introduced for clustering when side information in the form of pairwise constraints is available. However, almost all existing methods either learn a whole kernel matrix or learn only a limited number of parameters. Although non-parametric methods that learn a whole kernel matrix can find clusters of arbitrary structure, they are computationally expensive and feasible only on small data sets. In this paper, we propose a kernel learning method whose number of variables lies between these two extremes of degrees of freedom. The proposed method uses a spectral embedding to learn a square matrix whose number of rows equals the number of dimensions in the embedded space, and therefore scales much better than methods that learn a full kernel matrix. Experimental results on synthetic and real-world data sets show that the performance of the proposed method is generally close to that of learning a whole kernel matrix, while its time cost is much lower.
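
Schematically (in our notation, not the paper's): with E the top-d eigenvectors of the graph Laplacian, the kernel can be parameterized as K = E A Aᵀ Eᵀ for a learnable d x d matrix A, so the number of free parameters scales with the embedding dimension d rather than with n².

```python
import numpy as np

def spectral_embedding(W, d):
    """Top-d smallest eigenvectors of the unnormalized graph Laplacian."""
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
    return vecs[:, :d]            # n x d

def parameterized_kernel(E, A):
    # A @ A.T keeps K symmetric positive semidefinite by construction,
    # while only d*d entries of A need to be learned.
    return E @ (A @ A.T) @ E.T
```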

Joint Deep Clustering: Classification and Review

International Journal of Advanced Computer Science and Applications

Clustering is a fundamental problem in machine learning, and a large number of algorithms have been developed to address it. Some of these algorithms, such as K-means, handle the original data directly, while others, such as spectral clustering, apply a linear transformation to the data; still others, such as kernel-based algorithms, use nonlinear transformations. Since clustering performance depends strongly on the quality of the data representation, representation learning approaches have been extensively researched. With the recent advances in deep learning, deep neural networks are increasingly utilized to learn clustering-friendly representations. We provide here a review of existing algorithms that jointly optimize deep neural networks and clustering methods.
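
As one concrete instance of such joint optimization (a common DEC-style pattern in the reviewed literature, not a method this review itself proposes): a Student-t soft assignment between embeddings and cluster centers, and a sharpened target distribution whose KL divergence trains both the encoder and the centers.

```python
import torch

def soft_assign(z, centers, alpha=1.0):
    """Student-t similarity between embeddings z (n x d) and centers (k x d),
    normalized per point into a soft cluster assignment q."""
    d2 = torch.cdist(z, centers).pow(2)
    q = (1.0 + d2 / alpha).pow(-(alpha + 1) / 2)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened assignments; minimizing KL(p || q) refines both the
    encoder and the cluster centers."""
    p = q.pow(2) / q.sum(dim=0)
    return p / p.sum(dim=1, keepdim=True)
```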