A Triangle Inequality for Cosine Similarity

Abstract

Similarity search is a fundamental problem for many data analysis techniques. Many efficient search techniques rely on the triangle inequality of metrics, which allows pruning parts of the search space based on transitive bounds on distances. Recently, cosine similarity has become a popular alternative to the standard Euclidean metric, in particular for textual data and neural network embeddings. Unfortunately, cosine similarity is not a metric and does not satisfy the standard triangle inequality. Instead, many search techniques for cosine similarity rely on approximation techniques such as locality-sensitive hashing. In this paper, we derive a triangle inequality for cosine similarity that is suitable for efficient similarity search with many standard search structures (such as the VP-tree, Cover-tree, and M-tree), show that this bound is tight, and discuss fast approximations for it. We hope that this spurs new research on accelerating exact similarity search for cosine similarity, and possibly for other similarity measures, beyond the existing work for distance metrics.
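The abstract does not state the inequality itself, so the following is only a minimal sketch of the kind of bound it describes, under a stated assumption: the angle between vectors, arccos(sim), obeys the ordinary triangle inequality, and the cosine addition theorem then gives sim(x,z)·sim(z,y) − r ≤ sim(x,y) ≤ sim(x,z)·sim(z,y) + r with r = sqrt((1 − sim(x,z)²)(1 − sim(z,y)²)). The function names `cosine_similarity`, `cosine_bounds`, and `bounds_hold` are hypothetical and are not taken from the paper or from any library.

```python
import math
import random


def cosine_similarity(x, y):
    """Cosine similarity of two non-zero vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def cosine_bounds(sim_xz, sim_zy):
    """Lower and upper bound on sim(x, y), given sim(x, z) and sim(z, y).

    Assumption (not quoted from the paper): the bounds follow from the
    triangle inequality on angles plus the cosine addition theorem.
    """
    # Clamp at zero so rounding errors (|sim| slightly above 1) cannot
    # produce a negative argument for the square root.
    sin_sq_xz = max(0.0, 1.0 - sim_xz * sim_xz)
    sin_sq_zy = max(0.0, 1.0 - sim_zy * sim_zy)
    root = math.sqrt(sin_sq_xz * sin_sq_zy)
    prod = sim_xz * sim_zy
    return prod - root, prod + root


def bounds_hold(trials=1000, dim=5, seed=0, tol=1e-9):
    """Empirically check the bounds on random Gaussian vectors."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y, z = ([rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3))
        lower, upper = cosine_bounds(cosine_similarity(x, z),
                                     cosine_similarity(z, y))
        if not lower - tol <= cosine_similarity(x, y) <= upper + tol:
            return False
    return True
```

In a pivot-based index, a bound of this shape could be used for pruning: if the similarities of a query x and a stored object y to a pivot z are known, and the upper bound on sim(x, y) already falls below the similarity required by the query, y can be skipped without computing sim(x, y) exactly.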

Part of the work on this paper has been supported by Deutsche Forschungsgemeinschaft (DFG), project number 124020371, within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Analysis”, project A2.

Author information

Authors and Affiliations

  1. TU Dortmund University, Dortmund, Germany
    Erich Schubert

Corresponding author

Correspondence to Erich Schubert.

Editor information

Editors and Affiliations

  1. National University of San Luis, San Luis, Argentina
    Nora Reyes
  2. University of St Andrews, St Andrews, UK
    Richard Connor
  3. University of Vienna, Vienna, Austria
    Nils Kriege
  4. Kiel University, Kiel, Germany
    Daniyal Kazempour
  5. University of Bologna, Bologna, Italy
    Ilaria Bartolini
  6. TU Dortmund University, Dortmund, Germany
    Erich Schubert
  7. TU Dortmund University, Dortmund, Germany
    Jian-Jia Chen

Rights and permissions

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Schubert, E. (2021). A Triangle Inequality for Cosine Similarity. In: Reyes, N., et al. Similarity Search and Applications. SISAP 2021. Lecture Notes in Computer Science, vol 13058. Springer, Cham. https://doi.org/10.1007/978-3-030-89657-7_3
