dask_ml.feature_extraction.text.CountVectorizer — dask-ml 2025.1.1 documentation
When a vocabulary isn’t provided, fit_transform requires two passes over the dataset: one to learn the vocabulary and a second to transform the data. If the data fits in (distributed) memory, consider persisting it prior to calling fit or transform when not providing a vocabulary.
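As a sketch of that advice (using a hypothetical four-document toy corpus; assumes the data fits in memory), persisting the bag means both passes read cached partitions rather than re-running the upstream computation:

```python
import dask.bag as db

# Hypothetical toy corpus; any bag of documents works the same way.
corpus = db.from_sequence(
    ['This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?'],
    npartitions=2,
)

# persist() materializes the partitions in (distributed) memory, so the
# vocabulary-learning pass and the transform pass both read cached data
# instead of re-executing the upstream graph twice.
corpus = corpus.persist()
print(corpus.count().compute())  # number of cached documents
```

The same pattern applies when the bag comes from `db.read_text` or another source: persist once, then call `fit_transform`.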
Additionally, this implementation benefits from having an active dask.distributed.Client, even on a single machine. When a client is present, the learned vocabulary is persisted in distributed memory, which avoids some recomputation and redundant communication.
The Dask-ML implementation currently requires that raw_documents be a dask.bag.Bag of documents (lists of strings).
>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> import dask.bag as db
>>> from distributed import Client
>>> client = Client()
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> corpus = db.from_sequence(corpus, npartitions=2)
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> X
dask.array<concatenate, shape=(nan, 9), dtype=int64, chunksize=(nan, 9), ...
    chunktype=scipy.csr_matrix>
>>> X.compute().toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
>>> vectorizer.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']