dask_ml.feature_extraction.text.HashingVectorizer — dask-ml 2025.1.1 documentation (original) (raw)

`dask_ml.feature_extraction.text`.HashingVectorizer

class dask_ml.feature_extraction.text.HashingVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float64'>)

Convert a collection of text documents to a matrix of token occurrences.

It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm='l1' or projected on the Euclidean unit sphere if norm='l2'.

This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.

This strategy has several advantages:

it uses very little memory and scales to large datasets, as there is no need to store a vocabulary dictionary in memory.
it is fast to pickle and un-pickle as it holds no state besides the constructor parameters.
it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
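The streaming and parallel property listed above follows from the token-to-index mapping being a pure function of the token. The following is a minimal sketch of that idea using only the standard library. It assumes whitespace tokenization and uses CRC32 as a stand-in hash (not the MurmurHash3 used by the library); `hash_index` and `vectorize` are hypothetical helper names, not part of dask-ml or scikit-learn.

```python
import zlib
from collections import Counter

N_FEATURES = 2 ** 4  # deliberately tiny so the result is easy to inspect


def hash_index(token, n_features=N_FEATURES):
    # Stand-in hash: CRC32 from the standard library, NOT MurmurHash3.
    return zlib.crc32(token.encode("utf-8")) % n_features


def vectorize(doc, n_features=N_FEATURES):
    # Map one document to {feature index: count}; no vocabulary is consulted.
    tokens = doc.lower().split()
    return dict(Counter(hash_index(t, n_features) for t in tokens))


corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
]

# Vectorize the whole corpus at once, then as two independent chunks.
whole = [vectorize(d) for d in corpus]
chunked = [vectorize(d) for d in corpus[:1]] + [vectorize(d) for d in corpus[1:]]

# Nothing was learned from the data, so both routes give identical rows.
print(whole == chunked)
```

Because no state is shared between calls, the chunks could just as well be processed on different workers or at different times.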

There are also some drawbacks (compared with using a CountVectorizer with an in-memory vocabulary):

there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
there can be collisions: distinct tokens can be mapped to the same feature index. However, in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
there is no IDF weighting, as this would render the transformer stateful.
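To make the collision trade-off above concrete, here is a rough sketch that counts how many distinct tokens end up sharing a feature index for a small and a large n_features. It again uses CRC32 as a stand-in hash and synthetic token names, so the exact counts say nothing about MurmurHash3 or real text; `hash_index` and `count_collisions` are hypothetical helpers.

```python
import zlib


def hash_index(token, n_features):
    # Stand-in hash (CRC32), used only for illustration.
    return zlib.crc32(token.encode("utf-8")) % n_features


def count_collisions(tokens, n_features):
    # Distinct tokens minus distinct indices = tokens that share an index
    # with some other token.
    distinct = set(tokens)
    used = {hash_index(t, n_features) for t in distinct}
    return len(distinct) - len(used)


# 1000 synthetic, distinct tokens.
tokens = [f"token{i}" for i in range(1000)]

small = count_collisions(tokens, 2 ** 4)   # 16 buckets: collisions are unavoidable
large = count_collisions(tokens, 2 ** 18)  # 262144 buckets: collisions become rare
print(small, large)
```

With 16 buckets at least 984 of the 1000 tokens must collide by the pigeonhole principle, while the larger table leaves most tokens with an index of their own.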

The hash function employed is the signed 32-bit version of Murmurhash3.
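A signed hash is convenient because one value can supply both a feature index and a sign. The sketch below shows one way this can work, under two labeled assumptions: the index is taken from the absolute value of the hash modulo n_features, and the sign of the hash decides whether the contribution is added or subtracted when alternate_sign is enabled. CRC32 reinterpreted as a signed 32-bit integer stands in for MurmurHash3, and `signed_hash32` and `hashed_feature` are hypothetical names; the library's actual implementation may differ.

```python
import zlib


def signed_hash32(token):
    # Stand-in: CRC32 reinterpreted as a signed 32-bit integer (not MurmurHash3).
    h = zlib.crc32(token.encode("utf-8"))
    return h - 2 ** 32 if h >= 2 ** 31 else h


def hashed_feature(token, n_features, alternate_sign=True):
    # Assumption: index from |hash|, sign from the hash's sign.
    h = signed_hash32(token)
    index = abs(h) % n_features
    sign = -1 if (alternate_sign and h < 0) else 1
    return index, sign


for token in ["first", "second", "third"]:
    print(token, hashed_feature(token, n_features=2 ** 4))
```

Under this scheme, colliding tokens with opposite signs partially cancel rather than always accumulating, which is the usual motivation for a sign-alternating option.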

For an efficiency comparison of the different feature extractors, see FeatureHasher and DictVectorizer Comparison.

For an example of document clustering and comparison with TfidfVectorizer, see Clustering text documents using k-means.

See also

CountVectorizer

Convert a collection of text documents to a matrix of token counts.

TfidfVectorizer

Convert a collection of raw documents to a matrix of TF-IDF features.

Notes

This estimator is stateless and does not need to be fitted. However, we recommend calling fit_transform() instead of transform(), as parameter validation is only performed in fit().

Examples

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = HashingVectorizer(n_features=2**4)
>>> X = vectorizer.fit_transform(corpus)
>>> print(X.shape)
(4, 16)

Methods

build_analyzer()	Return a callable to process input data.
build_preprocessor()	Return a function to preprocess the text before tokenization.
build_tokenizer()	Return a function that splits a string into a sequence of tokens.
decode(doc)	Decode the input into a string of unicode symbols.
fit(X[, y])	Only validates estimator's parameters.
fit_transform(X[, y])	Transform a sequence of documents to a document-term matrix.
get_metadata_routing()	Get metadata routing of this object.
get_params([deep])	Get parameters for this estimator.
get_stop_words()	Build or fetch the effective stop words list.
partial_fit(X[, y])	Only validates estimator's parameters.
set_output(*[, transform])	Set output container.
set_params(**params)	Set the parameters of this estimator.
set_transform_request(*[, raw_X])	Request metadata passed to the transform method.
transform(raw_X)	Transform a sequence of documents to a document-term matrix.

__init__(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float64'>)
