sadhan sood | Texas A&M University
Papers by sadhan sood
Probabilistic Simhash Matching. (August 2011) Sadhan Sood, B.Tech., Cochin University of Science and Technology-Kochi. Chair of Advisory Committee: Dr. Dmitri Loguinov. Finding near-duplicate documents is an interesting problem, but the existing methods are not suitable for large-scale datasets and memory-constrained systems. In this work, we developed approaches that tackle the problem of finding near-duplicates while improving query performance and using less memory. We then carried out an evaluation of our method on a dataset of 70M web documents and showed that it performs well. The results indicated that our method could achieve a reduction in space by a factor of 5 while improving the query time by a factor of 4, with a recall of 0.95 for finding all near-duplicates when the dataset is in memory. With the same recall and the same reduction in space, we could achieve an improvement in query time by a factor of 4.5 when finding only the first near-duplicate for an in-memory dataset.
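For context, the query model described in this abstract amounts to finding stored fingerprints within a small Hamming distance of a query fingerprint. The sketch below is an illustrative baseline only, not the thesis implementation: a linear in-memory scan over 64-bit simhashes, with the distance threshold k = 3 and the toy fingerprints chosen as assumptions for the example.

```python
# Illustrative baseline only (not the thesis implementation): a linear
# in-memory scan that reports every stored 64-bit simhash within k bits
# of the query fingerprint. The threshold k = 3 is an assumption made
# for the example.
from typing import List

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def near_duplicates(query_fp: int, fingerprints: List[int], k: int = 3) -> List[int]:
    """Indices of all fingerprints within Hamming distance k of query_fp."""
    return [i for i, fp in enumerate(fingerprints) if hamming(query_fp, fp) <= k]

# Toy usage: the first two fingerprints differ in a single bit.
collection = [0x1234_5678_9ABC_DEF0, 0x1234_5678_9ABC_DEF1, 0xFFFF_0000_FFFF_0000]
print(near_duplicates(0x1234_5678_9ABC_DEF0, collection))  # -> [0, 1]
```

A scan like this is simple but touches every fingerprint per query, which is the kind of cost the faster, more memory-efficient lookup described in the abstract is meant to avoid.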
Proceedings of the 20th ACM international …, Jan 1, 2011
This paper offers a novel look at using a dimensionality-reduction technique called simhash [8] to detect similar document pairs in large-scale collections. We show that this algorithm produces interesting intermediate data, which is normally discarded, that can be used to predict which of the bits in the final hash are more susceptible to being flipped in similar documents. This paves the way for a probabilistic search technique in the Hamming space of simhashes that can be significantly faster and more space-efficient than existing simhash approaches. We show that with 95% recall compared to the deterministic search of prior work [16], our method exhibits 4-14 times faster lookup and requires 2-10 times less RAM on our collection of 70M web pages.
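As a rough illustration of the idea, the sketch below computes a 64-bit simhash while keeping the per-bit vote counters that are normally thrown away; bits decided by a small margin are the ones most likely to flip between near-duplicate documents. This is a minimal sketch under assumptions (unit-weight word features, MD5-based feature hashing), not the paper's implementation.

```python
# Minimal sketch of simhash with its per-bit counters retained. This is an
# illustration of the idea described above, not the paper's implementation:
# the featurization (lowercase word tokens) and the MD5-based feature hash
# are assumptions made for the example.
import hashlib
from typing import Dict, List, Tuple

def features(text: str) -> Dict[str, int]:
    # Assumed featurization: lowercase word tokens with frequency weights.
    weights: Dict[str, int] = {}
    for tok in text.lower().split():
        weights[tok] = weights.get(tok, 0) + 1
    return weights

def simhash64(text: str) -> Tuple[int, List[int]]:
    """Return (fingerprint, per-bit counters).

    The counters are the intermediate data that is normally discarded:
    bit i of the fingerprint was decided by margin |counters[i]|, so bits
    with small margins are the most likely to flip in a near-duplicate.
    """
    counters = [0] * 64
    for feat, w in features(text).items():
        h = int.from_bytes(hashlib.md5(feat.encode("utf-8")).digest()[:8], "big")
        for i in range(64):
            counters[i] += w if (h >> i) & 1 else -w
    fp = 0
    for i, c in enumerate(counters):
        if c > 0:
            fp |= 1 << i
    return fp, counters

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit fingerprints."""
    return bin(a ^ b).count("1")

# Toy usage: near-duplicate texts tend to produce fingerprints at small Hamming distance.
fp1, margins1 = simhash64("the quick brown fox jumps over the lazy dog")
fp2, _ = simhash64("the quick brown fox jumped over the lazy dog")
print(hamming(fp1, fp2))                                       # typically small for similar texts
print(sorted(range(64), key=lambda i: abs(margins1[i]))[:5])   # lowest-margin (most flippable) bit positions
```

In a full system, the low-margin bit positions of a query's fingerprint are natural candidates to probe with flipped values, which is roughly the intuition behind the probabilistic Hamming-space search described above.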