Faster Algorithms for the Constrained k-means Problem
2017, Theory of Computing Systems
The classical center-based clustering problems, such as k-means, k-median, and k-center, assume that the optimal clusters satisfy the locality property: points in the same cluster are close to each other. A number of clustering problems arise in machine learning where the optimal clusters do not follow such a locality property. For instance, consider the r-gather clustering problem, where there is an additional constraint that each cluster should have at least r points, or the capacitated clustering problem, where there is an upper bound on the cluster sizes. Consider a variant of the k-means problem that may be regarded as a general version of such problems. Here, the optimal clusters $O_1, \ldots, O_k$ are an arbitrary partition of the dataset, and the goal is to output k centers $c_1, \ldots, c_k$ such that the objective function $\sum_{i=1}^{k} \sum_{x \in O_i} \|x - c_i\|^2$ is minimized. It is not difficult to argue that any algorithm that outputs a single set of k centers without knowing the optimal clusters will not behave well with respect to this objective function. However, this does not rule out the existence of algorithms that output a list of such k-centers such that at least one of the k-centers in the list behaves well. Given an error parameter $\varepsilon > 0$, let $\ell$ denote the size of the smallest list of k-centers such that at least one of the k-centers gives a $(1 + \varepsilon)$-approximation with respect to the objective function above. In this paper, we show an upper bound on $\ell$ by giving a randomized algorithm that outputs a list of $2^{\tilde{O}(k/\varepsilon)}$ k-centers. We also give a closely matching lower bound of $2^{\Omega(k/\sqrt{\varepsilon})}$. Moreover, our algorithm runs in time $O(nd \cdot 2^{\tilde{O}(k/\varepsilon)})$. This is a significant improvement over the previous result of Ding and Xu [DX15], who gave an algorithm with running time $O(nd \cdot (\log n)^k \cdot 2^{\mathrm{poly}(k/\varepsilon)})$ that outputs a list of size $O((\log n)^k \cdot 2^{\mathrm{poly}(k/\varepsilon)})$.
Our techniques generalize to the k-median problem and to many other settings where non-Euclidean distance measures are involved.
[...] and a set $C$, where each element of $C$ is a partitioning of $X$ into k disjoint subsets (or clusters). Since the set $C$ may be exponentially large, we will assume that it is specified succinctly by an efficient algorithm that decides membership in this set. A solution needs to output an element $O = \{O_1, \ldots, O_k\}$ of $C$ and a set of k centers $c_1, \ldots, c_k$, one for each cluster in $O$. The goal is to minimize $\sum_{i=1}^{k} \sum_{x \in O_i} \|x - c_i\|^2$. It is easy to check that the center $c_i$ must be the mean of the corresponding cluster $O_i$. Note that the k-means problem is a special case of this problem in which the set $C$ contains all possible ways of partitioning $X$ into k subsets. The constrained k-median problem can be defined similarly.
We will make the natural assumption (which is made by Ding and Xu as well) that it suffices to find a set of k centers. In other words, there is an (efficient) algorithm $A_C$ which, given a set of k centers $c_1, \ldots, c_k$, outputs the clustering $\{O_1, \ldots, O_k\} \in C$ such that $\sum_{i=1}^{k} \sum_{x \in O_i} \|c_i - x\|^2$ is minimized. Such an algorithm is called a partition algorithm by Ding and Xu [DX15]. For the case of the k-means problem, this algorithm will just give the Voronoi partition with respect to $c_1, \ldots, c_k$, whereas in the case of the r-gather k-means clustering problem, the algorithm $A_C$ will be given by a suitable min-cost flow computation (see Section 4.1 in [DX15]).
Ding and Xu [DX15] considered several natural problems arising in diverse areas, e.g., machine learning, which can be stated in this framework. These included the so-called r-gather k-means, r-capacity k-means, and l-diversity k-means problems. Their approach for solving such problems was to output a list of candidate sets of centers (each of size k) such that at least one of these was close to the optimal centers.
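
The list-based approach described above can be illustrated with a minimal Python sketch. It is not the algorithm of this paper or of [DX15]: the candidate list is hand-picked, and the helper names (`sq_dist`, `cost`, `mean`, `voronoi_partition`, `best_from_list`) are hypothetical. The sketch only shows the objective, the Voronoi partition algorithm for the unconstrained k-means case, and how one call to a partition algorithm per candidate selects the best set of centers from a list.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Centers = List[Point]
Clustering = List[List[Point]]


def sq_dist(x: Point, c: Point) -> float:
    """Squared Euclidean distance ||x - c||^2."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def cost(clusters: Clustering, centers: Centers) -> float:
    """Objective: sum over clusters O_i and points x in O_i of ||x - c_i||^2."""
    return sum(sq_dist(x, c) for O, c in zip(clusters, centers) for x in O)


def mean(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean, the optimal center for one fixed cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def voronoi_partition(X: Sequence[Point], centers: Centers) -> Clustering:
    """Partition algorithm for unconstrained k-means: nearest-center assignment.

    Assumption: a constrained variant (e.g. r-gather) would replace this
    function with its own partition algorithm, such as a min-cost flow.
    """
    clusters: Clustering = [[] for _ in centers]
    for x in X:
        i = min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
        clusters[i].append(x)
    return clusters


def best_from_list(
    X: Sequence[Point],
    candidates: List[Centers],
    partition: Callable[[Sequence[Point], Centers], Clustering],
) -> Tuple[float, Centers, Clustering]:
    """Call the partition algorithm once per candidate and keep the cheapest."""
    best = None
    for centers in candidates:
        clusters = partition(X, centers)
        value = cost(clusters, centers)
        if best is None or value < best[0]:
            best = (value, centers, clusters)
    return best


# Four points on a line and two hand-picked candidate sets of k = 2 centers.
X = [(0.0,), (1.0,), (10.0,), (11.0,)]
candidates = [[(5.0,), (6.0,)], [(0.5,), (10.5,)]]
value, centers, clusters = best_from_list(X, candidates, voronoi_partition)
print(value, centers, clusters)
```

On this toy input the second candidate is selected, since its Voronoi clusters are {0, 1} and {10, 11} and each center is the mean of its cluster. The same centers can be far from optimal for a different target partition: for the clusters {0, 10} and {1, 11} the means are 5 and 6, which is why a single set of centers cannot serve every partition while a list can.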
We formalize this approach and show that if k is a constant, then one can obtain a PTAS for the constrained k-means (and the constrained k-median) problems whose running time is linear plus a constant number of calls to $A_C$. We define the list k-means problem as follows. Given a set of points $X$ and parameters k and $\varepsilon$, we want to output a list $L$ of sets of k points (or centers). The list $L$ should have the following property: for any partitioning $O = \{O_1, \ldots, O_k\}$ of $X$ into k clusters, there exists a set $c_1, \ldots, c_k$ in the list $L$ such that (up to reordering of these centers)