Towards optimal lower bounds for k-median and k-means coresets
Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing
The $(k, z)$-clustering problem consists of finding a set of $k$ points called centers, such that the sum of distances raised to the power of $z$ of every data point to its closest center is minimized. Among the most commonly encountered special cases are the $k$-median problem ($z = 1$) and the $k$-means problem ($z = 2$). The $k$-median and $k$-means problems are at the heart of modern data analysis, and massive data applications have given rise to the notion of a coreset: a small (weighted) subset of the input point set that preserves the cost of any solution to the problem up to a multiplicative $(1 \pm \varepsilon)$ factor, hence reducing the input to the problem from large to small scale. While there has been an intensive effort to understand the best possible coreset size for both problems in various metric spaces, there is still a significant gap between the state-of-the-art upper and lower bounds. In this paper, we make progress on both upper and lower bounds, obtaining tight bounds for several cases, namely:
• In finite $n$-point general metrics, any coreset must consist of $\Omega(k \log n / \varepsilon^2)$ points. This improves on the $\Omega(k \log n / \varepsilon)$ lower bound of Braverman, Jiang, Krauthgamer, and Wu [ICML'19] and matches the upper bounds proposed for $k$-median by Feldman and Langberg [STOC'11] and for $k$-means by Cohen-Addad, Saulpic, and Schwiegelshohn [STOC'21] up to polylog factors.
• For doubling metrics with doubling constant $D$, any coreset must consist of $\Omega(k D / \varepsilon^2)$ points. This matches the $k$-median and $k$-means upper bounds by Cohen-Addad, Saulpic, and Schwiegelshohn [STOC'21] up to polylog factors.
• In $d$-dimensional Euclidean space, any coreset for $(k, z)$-clustering requires $\Omega(k / \varepsilon^2)$ points. This improves on the $\Omega(k / \sqrt{\varepsilon})$ lower bound of Baker, Braverman, Huang, Jiang, Krauthgamer, and Wu [ICML'20] for $k$-median and complements the $\Omega(k \min(d, 2^{z/20}))$ lower bound of Huang and Vishnoi [STOC'20].
We complement our lower bound for $d$-dimensional Euclidean space with the construction of a coreset of size $\tilde{O}(k / \varepsilon^2 \cdot \min(\varepsilon^{-z}, k))$.
This improves over the $\tilde{O}(k^2 \varepsilon^{-4})$ upper bound for general power of $z$.
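To make the two definitions in the abstract concrete, the following is a minimal sketch of the weighted clustering cost and of the (1 ± ε) cost-preservation check for a single candidate solution. The function names `kz_cost` and `preserves_cost`, the toy points, and the hand-picked candidate coreset are illustrative assumptions and are not taken from the paper. A true coreset must satisfy the check for every set of centers, which this sketch does not verify.

```python
import math


def kz_cost(points, weights, centers, z):
    """Weighted clustering cost: sum over points of w * dist(p, nearest center)^z."""
    return sum(
        w * min(math.dist(p, c) for c in centers) ** z
        for p, w in zip(points, weights)
    )


def preserves_cost(full_cost, coreset_cost, eps):
    """True if coreset_cost lies within a (1 +/- eps) factor of full_cost."""
    return (1 - eps) * full_cost <= coreset_cost <= (1 + eps) * full_cost


# Toy input (assumption): four points on a line, each with unit weight.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
unit_weights = [1.0] * len(points)

# Hypothetical candidate coreset: two points, each standing in for two inputs.
coreset_points = [(0.5,), (10.5,)]
coreset_weights = [2.0, 2.0]

# One candidate solution with k = 2 centers.
centers = [(0.0,), (10.0,)]

for z in (1, 2):
    full = kz_cost(points, unit_weights, centers, z)
    small = kz_cost(coreset_points, coreset_weights, centers, z)
    print(z, full, small, preserves_cost(full, small, eps=0.1))
```

On this toy input the candidate reproduces the cost of the chosen centers exactly for z = 1 but halves it for z = 2, and it fails for z = 1 as soon as the centers are moved to (0.5,) and (10.5,). This illustrates why the guarantee is required to hold for all solutions simultaneously, which is what makes small coresets hard to obtain.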