dask_ml.feature_extraction.text.CountVectorizer — dask-ml 2025.1.1 documentation

When a vocabulary isn’t provided, fit_transform requires two passes over the dataset: one to learn the vocabulary and a second to transform the data. When not providing a vocabulary, consider persisting the data prior to calling fit or transform if it fits in (distributed) memory.
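Persisting a bag before the two-pass fit_transform can be sketched with plain dask.bag (a minimal example; the three short documents here are placeholders):

```python
import dask.bag as db

# A small corpus split across two partitions.
docs = db.from_sequence(
    ["first document", "second document", "third one"], npartitions=2
)

# persist() materializes the partitions in (distributed) memory, so a
# subsequent two-pass fit_transform does not re-read or recompute the input.
docs = docs.persist()

print(docs.count().compute())  # 3
```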

Additionally, this implementation benefits from having an active dask.distributed.Client, even on a single machine. When a client is present, the learned vocabulary is persisted in distributed memory, which saves some recomputation and redundant communication.

The Dask-ML implementation currently requires that raw_documents is a dask.bag.Bag of documents (lists of strings).
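One common way to build such a bag is dask.bag.read_text, which yields one element per line of the matched files. A minimal sketch (the temporary directory and file names are illustrative only):

```python
import os
import tempfile
import dask.bag as db

# Write two single-line text files to a temporary directory.
tmpdir = tempfile.mkdtemp()
for i, text in enumerate(
    ["And this is the third one.", "Is this the first document?"]
):
    with open(os.path.join(tmpdir, f"doc{i}.txt"), "w") as f:
        f.write(text)

# read_text produces a dask.bag.Bag with one string per line,
# suitable as raw_documents for CountVectorizer.
documents = db.read_text(os.path.join(tmpdir, "*.txt"))

print(documents.count().compute())  # 2
```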

```python
>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> import dask.bag as db
>>> from distributed import Client
>>> client = Client()
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> corpus = db.from_sequence(corpus, npartitions=2)
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> X
dask.array<concatenate, shape=(nan, 9), dtype=int64, chunksize=(nan, 9), ...
    chunktype=scipy.csr_matrix>
>>> X.compute().toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
>>> vectorizer.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
```
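To avoid the extra vocabulary-learning pass altogether, a fixed vocabulary can be supplied up front. Dask-ML's CountVectorizer follows scikit-learn's API, so the mechanics can be shown with scikit-learn's CountVectorizer (a sketch; the three-word vocabulary is illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A fixed mapping from term to column index.
vocab = {"document": 0, "first": 1, "second": 2}
vectorizer = CountVectorizer(vocabulary=vocab)

# With a vocabulary provided, transform can run directly:
# no fitting pass over the data is required.
X = vectorizer.transform(["This is the first document."])
print(X.toarray())  # [[1 1 0]]
```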