A Triangle Inequality for Cosine Similarity
Abstract
Similarity search is a fundamental problem for many data analysis techniques. Many efficient search techniques rely on the triangle inequality of metrics, which allows pruning parts of the search space based on transitive bounds on distances. Recently, cosine similarity has become a popular alternative to the standard Euclidean metric, in particular in the context of textual data and neural network embeddings. Unfortunately, cosine similarity is not a metric and does not satisfy the standard triangle inequality. Instead, many search techniques for cosine similarity rely on approximation techniques such as locality sensitive hashing. In this paper, we derive a triangle inequality for cosine similarity that is suitable for efficient similarity search with many standard search structures (such as the VP-tree, Cover-tree, and M-tree), show that this bound is tight, and discuss fast approximations for it. We hope that this spurs new research on accelerating exact similarity search for cosine similarity, and possibly other similarity measures, beyond the existing work for distance metrics.
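The abstract does not state the inequality itself, so the following is only a minimal sketch of a bound of this kind, not necessarily the exact formulation used in the paper. It assumes the standard fact that the Gram matrix of three unit vectors is positive semidefinite, which implies that sim(x,y) lies within sim(x,z)·sim(z,y) ± sqrt((1 − sim(x,z)²)(1 − sim(z,y)²)). The function names (`cosine_similarity`, `cosine_bounds`, `check_random_triples`) are hypothetical and chosen for illustration. The sketch also shows a small example in which the dissimilarity 1 − sim violates the ordinary triangle inequality.

```python
import math
import random


def cosine_similarity(x, y):
    """Cosine similarity of two non-zero vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def cosine_bounds(sim_xz, sim_zy):
    """Lower and upper bound on sim(x, y), given sim(x, z) and sim(z, y).

    Assumption: this is the bound implied by the positive semidefinite
    Gram matrix of three unit vectors; it is an illustration only.
    """
    # Clamp to [-1, 1] to guard against floating-point rounding.
    a = max(-1.0, min(1.0, sim_xz))
    b = max(-1.0, min(1.0, sim_zy))
    root = math.sqrt((1.0 - a * a) * (1.0 - b * b))
    return a * b - root, a * b + root


def check_random_triples(trials=1000, dim=5, seed=0, tol=1e-9):
    """Empirically check the bounds on random Gaussian vectors."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y, z = ([rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3))
        lower, upper = cosine_bounds(cosine_similarity(x, z), cosine_similarity(z, y))
        sim_xy = cosine_similarity(x, y)
        if not (lower - tol <= sim_xy <= upper + tol):
            return False
    return True


# The dissimilarity 1 - sim is not a metric: with z halfway between x and y,
# the direct "distance" exceeds the sum of the two detours.
x, z, y = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)
d_xy = 1.0 - cosine_similarity(x, y)
d_xz = 1.0 - cosine_similarity(x, z)
d_zy = 1.0 - cosine_similarity(z, y)
violates_triangle_inequality = d_xy > d_xz + d_zy
```

In a pivot-based index, `z` would play the role of a pivot or routing object: if the upper bound returned by `cosine_bounds` falls below the similarity threshold of a query, the candidate can be discarded without computing `sim(x, y)` directly.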
Part of the work on this paper has been supported by Deutsche Forschungsgemeinschaft (DFG), project number 124020371, within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Analysis”, project A2.
Author information
Authors and Affiliations
- TU Dortmund University, Dortmund, Germany
Erich Schubert
Corresponding author
Correspondence to Erich Schubert.
Editor information
Editors and Affiliations
- National University of San Luis, San Luis, Argentina
Nora Reyes
- University of St Andrews, St Andrews, UK
Richard Connor
- University of Vienna, Vienna, Austria
Nils Kriege
- Kiel University, Kiel, Germany
Daniyal Kazempour
- University of Bologna, Bologna, Italy
Ilaria Bartolini
- TU Dortmund University, Dortmund, Germany
Erich Schubert
- TU Dortmund University, Dortmund, Germany
Jian-Jia Chen
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Schubert, E. (2021). A Triangle Inequality for Cosine Similarity. In: Reyes, N., et al. Similarity Search and Applications. SISAP 2021. Lecture Notes in Computer Science, vol 13058. Springer, Cham. https://doi.org/10.1007/978-3-030-89657-7_3
- DOI: https://doi.org/10.1007/978-3-030-89657-7_3
- Published: 22 October 2021
- Publisher Name: Springer, Cham
- Print ISBN: 978-3-030-89656-0
- Online ISBN: 978-3-030-89657-7
- eBook Packages: Computer Science; Computer Science (R0); Springer Nature Proceedings Computer Science