Approximate clustering in very large relational data

Exact Algorithms and Lower Bounds for Stable Instances of Euclidean k-MEANS

Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2019

We investigate the complexity of solving stable or perturbation-resilient instances of k-means and k-median clustering in fixed-dimension Euclidean metrics (or, more generally, doubling metrics). The notion of stable or perturbation-resilient instances was introduced by Bilu and Linial [2010] and Awasthi, Blum, and Sheffet [2012]. In our context, we say a k-means instance is α-stable if there is a unique optimum solution which remains unchanged if distances are (non-uniformly) stretched by a factor of at most α. Stable clustering instances have been studied to explain why heuristics such as Lloyd's algorithm perform well in practice. In this work we show that for any fixed ε > 0, (1 + ε)-stable instances of k-means in doubling metrics, which include fixed-dimensional Euclidean metrics, can be solved in polynomial time. More precisely, we show that a natural multi-swap local-search algorithm in fact finds the optimum solution for (1 + ε)-stable instances of k-means and k-median in a polynomial number of iterations. We complement this result by showing that under a plausible PCP hypothesis this is essentially tight: when the dimension d is part of the input, there is a fixed ε_0 > 0 such that there is not even a PTAS for (1 + ε_0)-stable k-means in R^d unless NP = RP. To do this, we consider a robust property of CSPs; call an instance stable if there is a unique optimum solution x* and, for any other solution x′, the number of unsatisfied clauses is proportional to the Hamming distance between x* and x′. Dinur, Goldreich, and Gur have already shown that stable QSAT is hard to approximate for some constant Q [16]; our hypothesis is simply that stable QSAT with bounded variable occurrence is also hard (there is in fact work in progress to prove this hypothesis). Given this hypothesis, we consider "stability-preserving" reductions to prove our hardness for stable k-means. Such reductions seem to be more fragile and intricate than standard L-reductions and may be of further use in demonstrating that other stable optimization problems are hard to solve.
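As a rough illustration of what a swap-based local search looks like, the sketch below runs on the discrete variant in which centers are restricted to input points. This is not the paper's algorithm or its analysis: the function names, the `swap_size` parameter, the arbitrary starting solution, and the first-improvement rule are all assumptions made for the example.

```python
from itertools import combinations

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_cost(points, center_idx):
    """k-means cost when the centers are the input points at center_idx."""
    return sum(min(sq_dist(p, points[i]) for i in center_idx) for p in points)

def improving_swap(points, centers, best, swap_size):
    """Return the first swap of at most swap_size centers that lowers the cost."""
    outside = [i for i in range(len(points)) if i not in centers]
    for t in range(1, swap_size + 1):
        for out in combinations(centers, t):
            for inn in combinations(outside, t):
                cand = [c for c in centers if c not in out] + list(inn)
                cost = kmeans_cost(points, cand)
                if cost < best:
                    return cand, cost
    return None

def local_search(points, k, swap_size=1):
    """Hypothetical sketch: repeat improving swaps until none exists."""
    centers = list(range(k))  # arbitrary start: the first k input points
    best = kmeans_cost(points, centers)
    while True:
        step = improving_swap(points, centers, best, swap_size)
        if step is None:
            return sorted(centers), best
        centers, best = step
```

For example, `local_search([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)` ends with one center in each of the two well-separated groups. The cost strictly decreases at every step, so the loop terminates, but each step enumerates all candidate swaps and is only practical for small inputs.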

On Euclidean k-Means Clustering with α-Center Proximity

2019

k-means clustering is NP-hard in the worst case but previous work has shown efficient algorithms assuming the optimal k-means clusters are stable under additive or multiplicative perturbation of data. This has two caveats. First, we do not know how to efficiently verify this property of optimal solutions that are NP-hard to compute in the first place. Second, the stability assumptions required for polynomial time k-means algorithms are often unreasonable when compared to the ground-truth clusters in real-world data. A consequence of multiplicative perturbation resilience is center proximity, that is, every point is closer to the center of its own cluster than the center of any other cluster, by some multiplicative factor α > 1. We study the problem of minimizing the Euclidean k-means objective only over clusterings that satisfy α-center proximity. We give a simple algorithm to find the optimal α-center-proximal k-means clustering in running time exponential in k and 1/(α− 1) but ...

Linear-time approximation schemes for clustering problems in any dimensions

Journal of The ACM, 2010

We present a general approach for designing approximation algorithms for a fundamental class of geometric clustering problems in arbitrary dimensions. More specifically, our approach leads to simple randomized algorithms for the k-means, k-median and discrete k-means problems that yield (1 + ε) approximations with probability ≥ 1/2 and running times of O(2^{(k/ε)^{O(1)}} dn). These are the first algorithms for these problems whose running times are linear in the size of the input (nd for n points in d dimensions) assuming k and ε are fixed. Our method is general enough to be applicable to clustering problems satisfying certain simple properties and is likely to have further applications. Interestingly, the center in the optimal solution to the 1-mean problem is the same as the center of mass of the points. However, in the case of the 1-median problem, also known as the Fermat-Weber problem, no such closed form is known. We show that despite the lack of such a closed form, we can obtain an approximation to the optimal 1-median in O(1) time (independent of the number of points). There are many useful variations of these clustering problems; for example, in the discrete versions of these problems, the centers that we seek should belong to the input set of points.
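The statement that the optimal 1-mean center coincides with the center of mass follows from a standard identity. The symbols x_i (input points), μ (their center of mass) and c (an arbitrary candidate center) are notation chosen here, not taken from the abstract.

```latex
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
\sum_{i=1}^{n} \lVert x_i - c \rVert^2
  = \sum_{i=1}^{n} \lVert x_i - \mu \rVert^2 + n \,\lVert \mu - c \rVert^2
  \quad \text{for every } c \in \mathbb{R}^d .
```

The cross term 2 Σ_i ⟨x_i − μ, μ − c⟩ vanishes because Σ_i (x_i − μ) = 0, so the cost is minimized exactly at c = μ.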

On Euclidean k-Means Clustering with alpha-Center Proximity

2019

The k-means is a popular clustering objective that is NP-hard in the worst case but often solved efficiently by simple heuristics in practice. The implicit assumption behind using the k-means (or many other objectives) is that an optimal solution would recover the underlying ground-truth clustering. In most real-world datasets, the underlying ground-truth clustering is unambiguous and stable under small perturbations of data. As a consequence, the ground-truth clustering satisfies center proximity, that is, every point is closer to the center of its own cluster than to the center of any other cluster, by some multiplicative factor α > 1. We study the problem of minimizing the Euclidean k-means objective only over clusterings that satisfy α-center proximity. We give a simple algorithm to find an exact optimal clustering for the above objective with running time exponential in k and 1/(α − 1) but linear in the number of points and the dimension. We define an analogous α-center proximity condition for outliers, and give similar algorithmic guarantees for k-means with outliers and α-center proximity. On the hardness side we show that for any α′ > 1, there exists an α ≤ α′ (α > 1) and an ε_0 > 0 such that minimizing the k-means objective over clusterings that satisfy α-center proximity is NP-hard to approximate within a multiplicative (1 + ε_0) factor.
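The center proximity condition is easy to verify for a given clustering, even though finding the best such clustering is hard. The following is a minimal sketch of such a check, assuming cluster centers are the means of the clusters and that the inequality is strict; the function names are hypothetical and not taken from the paper.

```python
import math

def centroid(cluster):
    """Mean of a non-empty list of points given as equal-length tuples."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def satisfies_center_proximity(clusters, alpha):
    """True if every point p in cluster j satisfies
    dist(p, center_l) > alpha * dist(p, center_j) for every other cluster l.
    Assumption: strict inequality, centers taken as cluster means."""
    centers = [centroid(c) for c in clusters]
    for j, cluster in enumerate(clusters):
        for p in cluster:
            own = math.dist(p, centers[j])
            for l, c in enumerate(centers):
                if l != j and math.dist(p, c) <= alpha * own:
                    return False
    return True
```

With two clusters `[(0, 0), (0, 2)]` and `[(10, 0), (10, 2)]`, every point is at distance 1 from its own center and about 10.05 from the other, so the check passes for α = 2 and fails for α = 11.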

Approximation Algorithms for Clustering

Agglomerative hierarchical clustering is an important clustering algorithm with many practical applications, for example market segmentation. Its biggest disadvantage is its high time complexity of O(n^3). The aim of this thesis is to describe and analyze algorithms that approximate agglomerative hierarchical clustering. These algorithms have lower time complexity and produce results comparable to those of exact methods. Experiments showed that the approximation algorithm LSHlink is significantly faster on large data than the exact MSTlinkage algorithm.

Parameterized Approximation Algorithms for K-center Clustering and Variants

Proceedings of the AAAI Conference on Artificial Intelligence

k-center is one of the most popular clustering models. While it admits a simple 2-approximation in polynomial time in general metrics, the Euclidean version is NP-hard to approximate within a factor of 1.93, even in the plane, if one insists the dependence on k in the running time be polynomial. Without this restriction, a classic algorithm yields a 2^{O((k log k)/ε)} dn-time (1+ε)-approximation for Euclidean k-center, where d is the dimension. In this work, we give a faster algorithm for small dimensions: roughly speaking, an O*(2^{O((1/ε)^{O(d)} k^{1−1/d} log k)})-time (1+ε)-approximation. In particular, the running time is roughly O*(2^{O((1/ε)^{O(1)} √k log k)}) in the plane. We complement our algorithmic result with a matching hardness lower bound. We also consider a well-studied generalization of k-center, called Non-uniform k-center (NUkC), where we allow clusters of different radii. NUkC is NP-hard to approximate within any factor, even in the ...
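One well-known polynomial-time 2-approximation for k-center is the farthest-first traversal, usually attributed to Gonzalez. The abstract does not say which algorithm it has in mind, so the sketch below is only an illustration of that technique; the function name and the choice of the first input point as the starting center are assumptions.

```python
import math

def farthest_first_centers(points, k):
    """Greedy farthest-first traversal for k-center (1 <= k <= len(points)).
    Returns the chosen centers and the resulting covering radius."""
    centers = [points[0]]  # assumption: start from the first input point
    # dist[i] is the distance from points[i] to its nearest chosen center.
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=dist.__getitem__)  # farthest point
        centers.append(points[i])
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return centers, max(dist)
```

Each round costs O(n) distance evaluations, so the whole procedure uses O(nk) of them. It works with any metric if `math.dist` is replaced by the appropriate distance function.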

Structural parameters, tight bounds, and approximation for (k,r)-center

Discrete Applied Mathematics, 2018

In (k, r)-Center we are given a (possibly edge-weighted) graph and are asked to select at most k vertices (centers), so that all other vertices are at distance at most r from a center. In this paper we provide a number of tight fine-grained bounds on the complexity of this problem with respect to various standard graph parameters. Specifically:
• For any r ≥ 1, we show an algorithm that solves the problem in O*((3r + 1)^{cw}) time, where cw is the clique-width of the input graph, as well as a tight SETH lower bound matching this algorithm's performance. As a corollary, for r = 1, this closes the gap that previously existed on the complexity of Dominating Set parameterized by cw.
• We strengthen previously known FPT lower bounds, by showing that (k, r)-Center is W[1]-hard parameterized by the input graph's vertex cover (if edge weights are allowed), or feedback vertex set, even if k is an additional parameter. Our reductions imply tight ETH-based lower bounds. Finally, we devise an algorithm parameterized by vertex cover for unweighted graphs.
• We show that the complexity of the problem parameterized by tree-depth is 2^{Θ(td^2)}, by showing an algorithm of this complexity and a tight ETH-based lower bound.
We complement these mostly negative results by providing FPT approximation schemes parameterized by clique-width or treewidth, which work efficiently independently of the values of k, r. In particular, we give algorithms which, for any ε > 0, run in time O*((tw/ε)^{O(tw)}), O*((cw/ε)^{O(cw)}) and return a (k, (1 + ε)r)-center if a (k, r)-center exists, thus circumventing the problem's W-hardness.
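While finding a (k, r)-center is hard, checking a proposed one is straightforward. The sketch below does this for the unweighted case with a multi-source breadth-first search; the adjacency-dictionary representation and the function name are assumptions made for the example, and the edge-weighted case would need a shortest-path routine instead.

```python
from collections import deque

def is_kr_center(adj, centers, k, r):
    """Check whether `centers` is a (k, r)-center of an unweighted graph.
    adj maps each vertex to an iterable of its neighbours (undirected)."""
    centers = set(centers)
    if len(centers) > k:
        return False
    dist = {c: 0 for c in centers}   # hop distance to the nearest center
    queue = deque(centers)
    while queue:
        u = queue.popleft()
        if dist[u] == r:             # do not explore beyond radius r
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return len(dist) == len(adj)     # every vertex was reached within r hops
```

On the path 0-1-2-3-4, the single center 2 covers every vertex for r = 2 but not for r = 1, whereas the pair {1, 3} is a valid (2, 1)-center.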

Approximation clustering: A mine of semidefinite programming problems

Topics in Semidefinite and Interior-Point Methods, 1998

Clustering is a discipline devoted to finding homogeneous groups of data entities. In contrast to conventional clustering, which involves data processing in terms of either entities or variables, approximation clustering is aimed at processing the data matrices as they are. Currently, approximation clustering is a set of clustering models and methods based on approximate decomposition of the data table into scalar product matrices representing weighted subsets, partitions or hierarchies as the sought clustering structures. Some of the problems involved are semidefinite programming problems; the others seem quite similar.