Locality-Sensitive Hashing for Chi2 Distance

Fast locality-sensitive hashing

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '11, 2011

Locality-sensitive hashing (LSH) is a basic primitive in several large-scale data processing applications, including nearest-neighbor search, de-duplication, clustering, etc. In this paper we propose a new and simple method to speed up the widely-used Euclidean realization of LSH. At the heart of our method is a fast way to estimate the Euclidean distance between two d-dimensional vectors; this is achieved by the use of randomized Hadamard transforms in a non-linear setting. This decreases the running time of a (k, L)-parameterized LSH from O(dkL) to O(d log d + kL). Our experiments show that using the new LSH in nearest-neighbor applications can improve their running times by significant amounts. To the best of our knowledge, this is the first running time improvement to LSH that is both provable and practical.
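The O(dkL) baseline cost quoted in this abstract comes from evaluating k random projections in each of L tables. Below is a minimal sketch of that baseline only, assuming the standard random-projection hash h(x) = floor((a·x + b) / w) with Gaussian a. It is not the paper's Hadamard-based method, and the names and values used (`make_hash_family`, `hash_point`, `shared_tables`, w = 4.0) are illustrative assumptions.

```python
import math
import random


def make_hash_family(d, k, L, w, seed=0):
    """Draw L tables, each with k functions h(x) = floor((a.x + b) / w)."""
    rng = random.Random(seed)
    return [
        [([rng.gauss(0.0, 1.0) for _ in range(d)], rng.uniform(0.0, w))
         for _ in range(k)]
        for _ in range(L)
    ]


def hash_point(family, x, w):
    """Return one k-tuple bucket key per table.

    Each of the k * L functions needs a d-dimensional dot product,
    so hashing one point costs O(d * k * L).
    """
    keys = []
    for table in family:
        key = tuple(
            math.floor((sum(a_i * x_i for a_i, x_i in zip(a, x)) + b) / w)
            for a, b in table
        )
        keys.append(key)
    return keys


def shared_tables(family, p, q, w):
    """Count the tables in which p and q land in the same bucket."""
    return sum(kp == kq
               for kp, kq in zip(hash_point(family, p, w),
                                 hash_point(family, q, w)))


d, k, L, w = 16, 4, 8, 4.0
family = make_hash_family(d, k, L, w)

rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
near = [x_i + 0.01 * rng.gauss(0.0, 1.0) for x_i in x]   # small perturbation
far = [x_i + 10.0 * rng.gauss(0.0, 1.0) for x_i in x]    # large perturbation

print("tables shared with near point:", shared_tables(family, x, near, w))
print("tables shared with far point: ", shared_tables(family, x, far, w))
```

A nearby point should typically share a bucket with x in most tables and a distant point in few or none, though the exact counts depend on the random draw.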

Beyond Locality-Sensitive Hashing

Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, 2013

We present a new data structure for the c-approximate near neighbor problem (ANN) in the Euclidean space. For n points in R^d, our algorithm achieves O_c(n^ρ + d log n) query time and O_c(n^{1+ρ} + d log n) space, where ρ ≤ 7/(8c^2) + O(1/c^3) + o_c(1). This is the first improvement over the result by Andoni and Indyk (FOCS 2006) and the first data structure that bypasses a locality-sensitive hashing lower bound proved by O'Donnell, Wu and Zhou (ICS 2011). By a standard reduction we obtain a data structure for the Hamming space and ℓ_1 norm with ρ ≤ 7/(8c) + O(1/c^{3/2}) + o_c(1), which is the first improvement over the result of Indyk and Motwani (STOC 1998).
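As a rough worked comparison, the leading terms of these exponents can be set against the earlier ones at a sample approximation factor, ignoring the lower-order terms. The choice c = 2 is an arbitrary illustration. The baselines assumed here are the exponent 1/c^2 of Andoni and Indyk for the Euclidean case and 1/c of Indyk and Motwani for the Hamming case.

```latex
\begin{align*}
\text{Euclidean: } & \frac{7}{8c^2}\Big|_{c=2} = \frac{7}{32} \approx 0.219
  \quad\text{versus}\quad \frac{1}{c^2}\Big|_{c=2} = 0.25, \\
\text{Hamming, } \ell_1\text{: } & \frac{7}{8c}\Big|_{c=2} = \frac{7}{16} \approx 0.438
  \quad\text{versus}\quad \frac{1}{c}\Big|_{c=2} = 0.5.
\end{align*}
```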

Optimal Load Factor for Approximate Nearest Neighbor Search under Exact Euclidean Locality Sensitive Hashing

International Journal of Computer Applications, 2013

Locality Sensitive Hashing (LSH) is an index-based data structure that allows spatial item retrieval over a large dataset. The performance measure, ρ, has a significant effect on the computational complexity and memory space required to create and store items in this data structure, respectively. The minimization of ρ at a specific approximation factor c depends on the load factor, α. Over the years, α = 4 has been used by researchers. In this paper, we demonstrate that the choice of α = 4 does not guarantee low computational complexity and low memory space for the data structure under the LSH scheme. To guarantee both, we propose α = 5. Experiments on the Defense Meteorological Satellite Program imagery dataset have shown that α = 5 saves more than 75% of memory space, cuts the computational complexity by more than 70%, and answers queries two times faster on average compared to α = 4.

Query adaptative locality sensitive hashing

2008

It is well known that high-dimensional nearest-neighbor retrieval is very expensive. Many signal processing methods suffer from this computing cost. Dramatic performance gains can be obtained by using approximate search, such as the popular Locality-Sensitive Hashing. This paper improves LSH by performing an on-line selection of the most appropriate hash functions from a pool of functions. An additional improvement originates from the use of E8 lattices for geometric hashing instead of one-dimensional random projections. A performance study based on state-of-the-art high-dimensional descriptors computed on real images shows that our improvements to LSH greatly reduce the search complexity for a given level of accuracy.

SC-LSH: An Efficient Indexing Method for Approximate Similarity Search in High Dimensional Space

2014

Locality Sensitive Hashing (LSH) is one of the most promising techniques for solving the nearest neighbour search problem in high-dimensional space. Euclidean LSH is the most popular variation of LSH and has been successfully applied in many multimedia applications. However, Euclidean LSH has limitations that affect its structure and query performance. Its main limitation is large memory consumption: achieving good accuracy requires a large number of hash tables. In this paper, we propose a new hashing algorithm that overcomes the storage space problem and improves query time while keeping accuracy similar to that achieved by the original Euclidean LSH. Experimental results on a real large-scale dataset show that the proposed approach achieves good performance and consumes less memory than Euclidean LSH. Keywords: Approximate Nearest Neighbor Search, Content based image retrieval (CBIR), Curse of dimensionality, Locality sen...

Drawbacks and Proposed Solutions for Real-Time Processing on Existing State-of-the-Art Locality Sensitive Hashing Techniques

Nearest-neighbor query processing is a fundamental operation for many image retrieval applications. Often, images are stored and represented by high-dimensional vectors that are generated by feature-extraction algorithms. Since tree-based index structures are shown to be ineffective for high dimensional processing due to the well-known "Curse of Dimensionality", approximate nearest neighbor techniques are used for faster query processing. Locality Sensitive Hashing (LSH) is a very popular and efficient approximate nearest neighbor technique that is known for its sublinear query processing complexity and theoretical guarantees. Nowadays, with the emergence of technology, several diverse application domains require real-time high-dimensional data storing and processing capacity. Existing LSH techniques are not suitable to handle real-time data and queries. In this paper, we discuss the challenges and drawbacks of existing LSH techniques for processing real-time high-dimensional image data. Additionally, through experimental analysis, we propose improvements for existing state-of-the-art LSH techniques for efficient processing of high-dimensional image data.

A Survey: Over Various Hashing Techniques Which Provide Nearest Neighbor Search in Large Scale Data

Hashing is the most popular technique for efficient and accurate nearest neighbor search in large-scale data. In large-scale image retrieval, data is represented in the form of semantic similarity given by labeled pairs of images. Unsupervised techniques are efficient at providing solutions for these problems, but a supervised hashing technique is required to provide the desired solution. This paper presents a survey of these techniques. It covers a multi-view alignment based hashing technique, which uses regularized kernel nonnegative matrix factorization (RKNMF) to enhance the performance of nearest neighbor search, and a composite hashing technique for multiple information search. Several other techniques are also covered, giving an overview of the hashing techniques used for large-scale image search.

Fast image similarity search by distributed locality sensitive hashing

Pattern Recognition Letters, 2019

Approximate Nearest Neighbor (ANN) search approaches, which return possible neighbors instead of exact neighbors, have been widely investigated in recent years. ANN approaches are usually applied in a centralized manner. However, in real-world applications data is usually stored in a distributed manner, which creates the need to implement ANN methods in a distributed way. In this study, our goal is to perform fast and accurate search on large image datasets using distributed environments. For this purpose, we propose an approach called Randomized Distributed Hashing (RDH), which uses Locality Sensitive Hashing (LSH) in a distributed scheme. In this approach, we randomly distribute data to different nodes on a cluster. After the data is distributed, each node uses the same set of randomized hash functions to index its local data. Then, at the query stage, the query sample is searched locally on the different nodes. By exploiting parallelism, query time is significantly reduced: we obtain a speedup of 8 for query performance in the distributed scheme with 10 nodes. The Mean Average Precision (MAP) scores are quite high and comparable to those of other methods. We have also investigated using different, selected randomized hash functions on different nodes rather than the same indexing on every node. We create the selected hash functions according to their data division property before indexing. Since LSH is a data-independent method, we obtained results similar to those with the same hash functions. We compared our experimental results with state-of-the-art methods reported in a recent study. The proposed distributed scheme is promising for searching images in large datasets with multiple nodes.
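The workflow described in this abstract (random placement of data on nodes, the same hash function set on every node, local search per node, then a merge) can be sketched in a single process. This is a hedged illustration and not the authors' RDH implementation. The `Node` class, the single-table Euclidean hash, the parameter values, and the sequential loop standing in for parallel nodes are all assumptions.

```python
import math
import random
from collections import defaultdict


def make_hashes(d, k, w, seed):
    """Draw k random-projection hash functions (a, b)."""
    rng = random.Random(seed)
    return [([rng.gauss(0.0, 1.0) for _ in range(d)], rng.uniform(0.0, w))
            for _ in range(k)]


def bucket_key(hashes, x, w):
    """Concatenate the k hash values floor((a.x + b) / w) into one key."""
    return tuple(
        math.floor((sum(a_i * x_i for a_i, x_i in zip(a, x)) + b) / w)
        for a, b in hashes
    )


class Node:
    """One simulated cluster node holding a local LSH table."""

    def __init__(self, hashes, w):
        self.hashes, self.w = hashes, w
        self.table = defaultdict(list)

    def index(self, item_id, x):
        self.table[bucket_key(self.hashes, x, self.w)].append((item_id, x))

    def query(self, q):
        # Only the bucket that the query falls into is scanned on this node.
        candidates = self.table.get(bucket_key(self.hashes, q, self.w), [])
        return [(math.dist(q, x), item_id) for item_id, x in candidates]


def build_cluster(data, num_nodes, d, k=2, w=4.0, seed=0):
    rng = random.Random(seed)
    hashes = make_hashes(d, k, w, seed)        # same function set on every node
    nodes = [Node(hashes, w) for _ in range(num_nodes)]
    for item_id, x in enumerate(data):
        rng.choice(nodes).index(item_id, x)    # random data placement
    return nodes


def search(nodes, q, top=3):
    # On a real cluster each node.query call would run in parallel.
    hits = [hit for node in nodes for hit in node.query(q)]
    return sorted(hits)[:top]


rng = random.Random(42)
d = 8
data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(200)]
nodes = build_cluster(data, num_nodes=10, d=d)
results = search(nodes, data[17])
print(results)
```

Because every node uses the same hash functions, a query needs to be hashed only once in principle; the sketch recomputes the key per node to keep each node self-contained.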

Lattice-based Locality Sensitive Hashing is Optimal

ArXiv, 2018

Locality sensitive hashing (LSH) was introduced by Indyk and Motwani (STOC '98) to give the first sublinear time algorithm for the c-approximate nearest neighbor (ANN) problem using only polynomial space. At a high level, an LSH family hashes "nearby" points to the same bucket and "far away" points to different buckets. The measure of quality of an LSH family is its LSH exponent, which helps determine both query time and space usage. In a seminal work, Andoni and Indyk (FOCS '06) constructed an LSH family based on random ball partitioning of space that achieves an LSH exponent of 1/c^2 for the l_2 norm, which was later shown to be optimal by Motwani, Naor and Panigrahy (SIDMA '07) and O'Donnell, Wu and Zhou (TOCT '14). Although optimal in the LSH exponent, the ball partitioning approach is computationally expensive. So, in the same work, Andoni and Indyk proposed a simpler and more practical hashing scheme based on Euclidean lattices and provided computationa...
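For reference, the LSH exponent mentioned in this abstract is usually defined as follows in the LSH literature. The symbols r, p_1, p_2 and the family H are notation assumed here; they do not appear in the abstract.

```latex
% A family H is (r, cr, p_1, p_2)-sensitive, with p_1 > p_2,
% if for h drawn at random from H:
\begin{align*}
\|x - y\| \le r \;&\Rightarrow\; \Pr[h(x) = h(y)] \ge p_1, \\
\|x - y\| \ge cr \;&\Rightarrow\; \Pr[h(x) = h(y)] \le p_2, \\
\rho \;&=\; \frac{\ln(1/p_1)}{\ln(1/p_2)}.
\end{align*}
```

Under this definition the classical construction gives query time of roughly n^ρ and space of roughly n^{1+ρ}, so a smaller exponent improves both.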
