Fast Spectral Clustering via the Nyström Method
Related papers
Sampling with Minimum Sum of Squared Similarities for Nystrom-Based Large Scale Spectral Clustering
Nyström sampling provides an efficient approach to large-scale clustering problems by generating a low-rank matrix approximation. However, existing sampling methods are limited in accuracy and computing time. This paper proposes a scalable Nyström-based clustering algorithm with a new sampling procedure, Minimum Sum of Squared Similarities (MSSS). We provide a theoretical analysis of the upper error bound of the algorithm, and demonstrate its performance in comparison with the leading spectral clustering methods that use Nyström sampling.
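For reference, the basic Nyström approximation that all of these methods build on can be sketched in a few lines: pick m landmark points, evaluate the kernel against them, and reconstruct the full kernel matrix from those blocks. The RBF kernel, the uniform sampling placeholder, and all names below are illustrative assumptions rather than the paper's MSSS procedure, which would instead select landmarks by minimizing the sum of squared similarities among them.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF similarities between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_approximation(X, m, gamma=1.0, seed=0):
    """Low-rank reconstruction K ~= C @ pinv(W) @ C.T from m landmarks.

    Uniform sampling is a placeholder; a criterion such as MSSS would
    replace the rng.choice call below."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    C = rbf_kernel(X, X[idx], gamma)        # n x m cross-kernel block
    W = C[idx]                              # m x m landmark block
    return C @ np.linalg.pinv(W) @ C.T      # rank-m approximation of K
```

Only C and W ever need to be stored, which is what makes the scheme attractive at scale; the full product is formed here only for clarity.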
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence
In this paper we present a framework for spectral clustering based on the following simple scheme: sample a subset of the input points, compute the clusters for the sampled subset using weighted kernel k-means (Dhillon et al. 2004), and use the resulting centers to compute a clustering for the remaining data points. For the case where the points are sampled uniformly at random without replacement, we show that the number of samples required depends mainly on the number of clusters and the diameter of the set of points in the kernel space. Experiments show that the proposed framework outperforms approaches based on the Nyström approximation in both accuracy and computation time.
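A rough sketch of this sample-then-extend scheme is given below. For brevity it runs plain (unweighted) kernel k-means on the sample using the kernel-trick distance, then assigns every remaining point to its nearest cluster center in kernel space; the RBF kernel, the function names, and the omission of the weighted variant from Dhillon et al. are all simplifying assumptions.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_kmeans(K, k, n_iter=20, seed=0):
    # Unweighted kernel k-means on a precomputed m x m kernel matrix K.
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(K))
    for _ in range(n_iter):
        D = np.full((len(K), k), np.inf)
        for j in range(k):
            mask = labels == j
            if mask.any():
                # d(i, C_j)^2 = K_ii - 2 mean_l K_il + mean_{l,l'} K_ll'
                D[:, j] = (np.diag(K) - 2 * K[:, mask].mean(1)
                           + K[np.ix_(mask, mask)].mean())
        labels = D.argmin(1)
    return labels

def sample_and_extend(X, k, m, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    S = X[idx]
    labels_S = kernel_kmeans(rbf(S, S, gamma), k, seed=seed)
    # Assign all points to the nearest kernel-space center of the sample;
    # k(x, x) = 1 for the RBF kernel, hence the leading 1.0.
    K_xS = rbf(X, S, gamma)
    D = np.full((len(X), k), np.inf)
    for j in range(k):
        mask = labels_S == j
        if mask.any():
            D[:, j] = (1.0 - 2 * K_xS[:, mask].mean(1)
                       + rbf(S[mask], S[mask], gamma).mean())
    return D.argmin(1)
```

The cost is dominated by the m x m kernel on the sample and a single n x m cross-kernel pass, which is what makes the scheme cheap relative to full spectral clustering.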
Low-Rank Approximation-Based Spectral Clustering for Large Datasets
Rachana Jakkula & Jyothi P. Abstract: Spectral clustering is a well-known graph-theoretic approach to finding natural groupings in a given dataset. Nowadays, digital data accumulate at an ever-faster pace in various fields, such as the Web, science, engineering, biomedicine, and real-world sensing, and it is not uncommon for a dataset to contain tens of thousands of samples and/or features. Spectral clustering has become one of the most popular modern clustering algorithms: it is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm; yet it generally becomes infeasible for analysing such big data. Spectral clustering algorithms form a particular class of graph clustering algorithms, mostly based on the eigen-decomposition of Laplacian matrices of either weighted or unweighted graphs. This survey presents different graph clustering formulations, most of which are based on graph cut and partitioning problems, and describes the main spectral clustering algorithms found in the literature that solve these problems. In this paper, we propose Low-rank Approximation-based Spectral (LAS) clustering for big data analytics. By integrating low-rank matrix approximations, i.e., approximations to the affinity matrix and its subspace as well as to the Laplacian matrix and the Laplacian subspace, LAS gains great computational and spatial efficiency for processing big data. In addition, we propose various fast sampling strategies to efficiently select data samples. From a theoretical perspective, we mathematically prove the correctness of LAS, and provide an analysis of its approximation error and computational complexity.
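The abstract does not spell out the exact LAS construction, but the general recipe such methods rely on, recovering an approximate eigenspace of a large affinity matrix from its low-rank Nyström factors, can be sketched as follows. The one-shot SVD route and all names here are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def nystrom_spectral_embedding(C, W, k):
    """Approximate top-k eigenpairs of K ~= C @ pinv(W) @ C.T.

    C is the n x m cross-kernel block, W the m x m landmark block.
    Writing K ~= G @ G.T with G = C @ W^{-1/2}, the top left singular
    vectors of G approximate the top eigenvectors of K."""
    evals, evecs = np.linalg.eigh(W)
    evals = np.maximum(evals, 1e-12)              # guard tiny eigenvalues
    G = C @ (evecs / np.sqrt(evals)) @ evecs.T    # G = C @ W^{-1/2}
    U, s, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :k], s[:k] ** 2                   # approx. eigvecs / eigvals
```

Feeding the row-normalized columns of U to k-means then yields a spectral clustering that never materializes the full n x n affinity or Laplacian matrix.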
Large Scale Spectral Clustering Using Resistance Distance and Spielman-Teng Solvers
Lecture Notes in Computer Science, 2012
Spectral clustering is a clustering method that can detect clusters with complex shapes. However, it requires the eigen-decomposition of the graph Laplacian matrix, which costs O(n³) and is thus not suitable for large-scale systems. Recently, many methods have been proposed to reduce the computational time of spectral clustering. These approximate methods usually involve sampling techniques, through which much of the information in the original data may be lost. In this work, we propose a fast and accurate spectral clustering approach using an approximate commute-time embedding, which is similar to the spectral embedding. The method requires neither sampling nor the computation of any eigenvector. Instead, it uses random projection and a linear-time solver to find the approximate embedding. Experiments on several synthetic and real datasets show that the proposed approach has better clustering quality and is faster than state-of-the-art approximate spectral clustering methods.
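The commute-time idea can be made concrete with a short sketch: project the weighted edge-incidence matrix through a handful of random ±1 vectors and solve one Laplacian linear system per projection, so that Euclidean distances in the resulting embedding approximate effective resistances. A dense least-squares solve stands in below for the near-linear-time Spielman-Teng-style solver the paper uses, and all names are assumptions.

```python
import numpy as np

def commute_time_embedding(A, t=25, seed=0):
    """Approximate commute-time embedding from a dense symmetric
    adjacency matrix A (no self-loops), following the random-projection
    plus Laplacian-solve recipe; no eigenvectors are computed."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    iu, ju = np.nonzero(np.triu(A))          # edge list (upper triangle)
    w = A[iu, ju]
    m = len(w)
    # Weighted incidence matrix B: one row sqrt(w)*(e_i - e_j) per edge.
    B = np.zeros((m, n))
    B[np.arange(m), iu] = np.sqrt(w)
    B[np.arange(m), ju] = -np.sqrt(w)
    L = B.T @ B                              # graph Laplacian D - A
    Q = rng.choice([-1.0, 1.0], size=(t, m)) / np.sqrt(t)
    Y = Q @ B                                # t random projections of B
    # Minimum-norm solve of L Z^T = Y^T (L is singular; the rows of Y are
    # orthogonal to its null space, so this applies the pseudo-inverse).
    Z = np.linalg.lstsq(L, Y.T, rcond=None)[0].T
    return Z   # ||Z[:, u] - Z[:, v]||^2 ~ effective resistance R(u, v)
```

Running k-means on the columns of Z (one t-dimensional vector per vertex) then clusters the graph much as spectral clustering would, without any eigen-decomposition.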
Clustered Nyström Method for Large Scale Manifold Learning and Dimension Reduction
2010
The kernel (or similarity) matrix plays a key role in many machine learning algorithms such as kernel methods, manifold learning, and dimension reduction. However, the cost of storing and manipulating the complete kernel matrix makes it infeasible for large problems. The Nyström method is a popular sampling-based low-rank approximation scheme for reducing the computational burden of handling large kernel matrices. In this paper, we analyze how the approximation quality of the Nyström method depends on the choice of landmark points, and in particular on the power of the landmark points to summarize the data. Our (non-probabilistic) error analysis justifies a "clustered Nyström method" that uses the k-means cluster centers as landmark points. Our algorithm can be applied to scale up a wide variety of algorithms that depend on the eigenvalue decomposition of the kernel matrix (or its variants), such as kernel principal component analysis, Laplacian eigenmaps, and spectral clustering, as well as those involving the kernel matrix inverse, such as least-squares support vector machines and Gaussian process regression. Extensive experiments demonstrate the competitive performance of our algorithm in both accuracy and efficiency.
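In code, the paper's central idea amounts to swapping the landmark-selection step: run k-means on the data first and use the resulting centers as the Nyström landmarks. The sketch below uses scikit-learn's KMeans and an RBF kernel as illustrative choices; all parameter values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustered_nystrom(X, m, gamma=1.0, seed=0):
    """Nyström factors with k-means centers as landmark points."""
    centers = KMeans(n_clusters=m, n_init=10,
                     random_state=seed).fit(X).cluster_centers_
    d2c = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    C = np.exp(-gamma * d2c)                 # n x m cross-kernel block
    d2w = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    W = np.exp(-gamma * d2w)                 # m x m landmark block
    return C, W                              # K ~= C @ pinv(W) @ C.T
```

The intuition matching the paper's analysis is that centers with high encoding power (low quantization error) reduce the reconstruction error of the approximation.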
A tutorial on spectral clustering
In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance, spectral clustering appears slightly mysterious, and it is not obvious why it works at all or what it really does. The goal of this tutorial is to give some intuition on these questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
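As a companion to the tutorial's derivations, the most common normalized spectral clustering algorithm fits in a dozen lines; the RBF affinity and all parameter choices below are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, gamma=1.0, seed=0):
    # RBF affinity matrix with zeroed self-similarity.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-gamma * d2)
    np.fill_diagonal(W, 0.0)
    # Symmetric normalized Laplacian L_sym = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(1), 1e-12))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Eigenvectors of the k smallest eigenvalues, rows re-normalized.
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k].copy()
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)
```

The O(n³) eigen-decomposition in the middle is exactly the step that every paper in this list tries to avoid or approximate.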
Fast approximate spectral clustering
Proceedings of the 15th ACM SIGKDD …, 2009
Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but has limited applicability to large-scale problems due to its computational complexity of O(n³), with n the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nyström method, with comparable accuracy and a significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectrally cluster data sets with a million observations within several minutes.
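A minimal sketch of the KASP idea follows: compress the data with k-means into a few hundred representatives, spectrally cluster the representatives, and let every point inherit the label of its representative. It reuses the hypothetical spectral_clustering sketch above; the number of representatives and other parameters are assumptions.

```python
from sklearn.cluster import KMeans

def kasp(X, k, n_reps=200, gamma=1.0, seed=0):
    """KASP-style clustering: distortion-minimizing k-means preprocessing,
    spectral clustering on the representatives, then label propagation
    back to the original points via representative membership."""
    km = KMeans(n_clusters=n_reps, n_init=10, random_state=seed).fit(X)
    rep_labels = spectral_clustering(km.cluster_centers_, k, gamma, seed)
    return rep_labels[km.labels_]
```

Since the expensive eigen-decomposition now runs on n_reps points rather than n, the speedup is roughly cubic in the compression ratio, at the cost of whatever distortion the preprocessing introduces.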
Studies in big data, 2017
This chapter discusses clustering methods based on similarities between pairs of objects. Such knowledge does not require that the objects be embedded in a metric space; instead, this local knowledge supports a graphical representation displaying relationships among the objects from a given data set. The problem of data clustering then transforms into the problem of graph partitioning, and this partitioning is obtained by analysing the eigenvectors of the graph Laplacian, a basic tool of spectral graph theory. We explain how various forms of the graph Laplacian are used in various graph partitioning criteria, and how these translate into particular algorithms. There is a strong and fascinating relationship between the graph Laplacian and random walks on a graph; in particular, it allows one to formulate a number of other clustering criteria and further data clustering algorithms. We briefly review these problems. It should be noted that the eigenvectors deliver a so-called spectral representation of the data items. Unfortunately, this representation is fixed for a given data set, and adding or deleting items destroys it; thus we discuss recently developed out-of-sample spectral clustering methods that overcome this disadvantage. Although spectral methods are successful in extracting non-convex groups in data, forming the graph Laplacian is memory-consuming and computing its eigenvectors is time-consuming. We therefore discuss various local methods in which only the relevant parts of the graph are considered, and we mention a number of methods that allow fast approximate computation of the eigenvectors.
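The Laplacian and random-walk connection mentioned above is standard and compact enough to state here for convenience:

```latex
W \in \mathbb{R}^{n \times n}_{\ge 0}, \qquad
D = \operatorname{diag}(W\mathbf{1}), \qquad
L = D - W, \\
L_{\mathrm{sym}} = D^{-1/2} L\, D^{-1/2}, \qquad
L_{\mathrm{rw}} = D^{-1} L = I - P, \qquad P = D^{-1} W .
```

Here P is the transition matrix of the natural random walk on the graph, and v is an eigenvector of L_rw with eigenvalue λ exactly when D^{1/2} v is an eigenvector of L_sym with the same eigenvalue, which is why partitioning criteria built on either form lead to essentially the same algorithms.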
Clusterability Analysis and Incremental Sampling for Nyström Extension Based Spectral Clustering
2011 IEEE 11th International Conference on Data Mining, 2011
To alleviate the memory and computational burdens of spectral clustering for large-scale problems, some form of low-rank matrix approximation is usually employed. The Nyström method is an efficient technique for generating low-rank matrix approximations, and its most important aspect is sampling. The matrix approximation errors of several sampling schemes have been theoretically analyzed for a number of learning tasks; however, the impact of the matrix approximation error on the clustering performance of spectral clustering has not been studied. In this paper, we first analyze the performance of the Nyström method in terms of clusterability, thereby answering how the matrix approximation error affects the clustering performance of spectral clustering. Our analysis immediately suggests an incremental sampling scheme for Nyström-based spectral clustering. Experimental results show that the proposed incremental sampling scheme outperforms existing sampling schemes on various clustering tasks and image segmentation applications, while its efficiency is comparable with that of existing schemes.
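Since the abstract does not spell out the incremental scheme itself, the sketch below shows only the generic shape of incremental sampling for the Nyström method: grow the landmark set one point at a time according to some criterion evaluated against the landmarks chosen so far. The specific criterion used here (pick the point least similar to the current landmarks) is a placeholder assumption, not the paper's clusterability-driven rule.

```python
import numpy as np

def incremental_landmarks(K, m, seed=0):
    """Generic incremental landmark selection on a precomputed kernel K.

    Placeholder criterion: repeatedly add the point whose maximum
    similarity to the already-chosen landmarks is smallest."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(K)))]
    while len(chosen) < m:
        max_sim = K[:, chosen].max(axis=1)
        max_sim[chosen] = np.inf             # never re-pick a landmark
        chosen.append(int(max_sim.argmin()))
    return np.array(chosen)
```

Any such scheme plugs directly into the Nyström factor construction shown earlier, replacing the uniform rng.choice call.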