Hashing Representations (original) (raw)
Hashing Representations for Machine Learning
I've discovered that there are many 'hashing tricks' in machine learning. Some of these are like the count-min sketch in that they rely on an explicit bloom filterstyle datastructure. The ones here take a radical conceptual step: they only use one hashing function. This turns out to work out remarkably well when learning, because the learning algorithm can learn to deal with collisions. In several applications, it allows radical compressions of the size of the learned model. The closest similar prior work we've found is in the CRM114 spam filter which generates several features for each n-gram of words.
This hashing trick is implemented in Vowpal Wabbit as the core representation, as well as Torch and Stream, part of Elefant. It is particularly useful for online learning algorithms consuming large amounts of data.
Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and SVN Vishwanathan, Hash Kernels for Structured Data, AISTAT 2009 and JMLR 2009.
This paper defines a 'hash kernel', shows that it can be handy for several problems, and makes an observation that hashing can work well asymptotically due to internal feature redundancy.
Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, Josh Attenberg, Feature Hashing for Large Scale Multitask Learning, ICML 2009 Theorem 3 in this paper was wrong as originally published, but fixed now.
Here, we provide a large scale application of the hashing trick to mass personalized spam filtering, define an unbiased hash, and prove that the unbiased hash is length preserving when the magnitudes of individual elements are not large. This implies that hashing can be thought of as a sparsity preserving projection.
See also Small Statistical Models by Random Feature Mixing by Kuzman Ganchev andMark Dredze which shows this approach can work for NLP applications.
And Extremely Fast Text Feature Extraction for Classification and Indexing by George Forman and Evan Kirshenbaum which shows that a very fast combined parser/hasher works well on a variety of problems.