dask_ml.feature_extraction.text.CountVectorizer

class dask_ml.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

Convert a collection of text documents to a matrix of token counts.

Notes

When a vocabulary isn’t provided, fit_transform requires two passes over the dataset: one to learn the vocabulary and a second to transform the data. If the data fits in (distributed) memory, consider persisting it before calling fit or transform without a vocabulary, so that the second pass does not recompute the input.

Additionally, this implementation benefits from having an active dask.distributed.Client, even on a single machine. When a client is present, the learned vocabulary is persisted in distributed memory, which saves some recomputation and redundant communication.
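The two passes described in these notes can be illustrated with a small standard-library sketch. This is not the Dask-ML implementation: the helper names (tokenize, learn_vocabulary, count_rows) are hypothetical, and the code runs on a plain Python list rather than a dask.bag.Bag. It only shows why the documents are read twice when no vocabulary is supplied, using the default token_pattern and lowercase=True from the signature above.

```python
import re
from collections import Counter

# Default token pattern from the class signature: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Lowercase a document and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(doc.lower())


def learn_vocabulary(docs):
    """Pass 1: read every document to collect the sorted set of terms."""
    terms = set()
    for doc in docs:
        terms.update(tokenize(doc))
    return {term: index for index, term in enumerate(sorted(terms))}


def count_rows(docs, vocabulary):
    """Pass 2: read every document again and count terms per column."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term, count in Counter(tokenize(doc)).items():
            if term in vocabulary:  # terms outside the vocabulary are ignored
                row[vocabulary[term]] = count
        rows.append(row)
    return rows


corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vocabulary = learn_vocabulary(corpus)    # first pass over the data
matrix = count_rows(corpus, vocabulary)  # second pass over the data
```

If a vocabulary is passed in up front, only the second step is needed, which is why supplying the vocabulary parameter avoids the extra pass.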

Examples

The Dask-ML implementation currently requires that raw_documents is a dask.bag.Bag of documents, where each document is a string.

>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> import dask.bag as db
>>> from distributed import Client
>>> client = Client()
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> corpus = db.from_sequence(corpus, npartitions=2)
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> X
dask.array<concatenate, shape=(nan, 9), dtype=int64, chunksize=(nan, 9), ...
           chunktype=scipy.csr_matrix>
>>> X.compute().toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
>>> vectorizer.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

Methods

build_analyzer()

Return a callable to process input data.

build_preprocessor()

Return a function to preprocess the text before tokenization.

build_tokenizer()

Return a function that splits a string into a sequence of tokens.

decode(doc)

Decode the input into a string of unicode symbols.

fit(raw_documents[, y])

Learn a vocabulary dictionary of all tokens in the raw documents.

fit_transform(raw_documents[, y])

Learn the vocabulary dictionary and return document-term matrix.

get_feature_names()

DEPRECATED: get_feature_names is deprecated in scikit-learn 1.0 and will be removed in 1.2; use get_feature_names_out instead.

get_feature_names_out([input_features])

Get output feature names for transformation.

get_params([deep])

Get parameters for this estimator.

get_stop_words()

Build or fetch the effective stop words list.

inverse_transform(X)

Return terms per document with nonzero entries in X.

set_params(**params)

Set the parameters of this estimator.

transform(raw_documents)

Transform documents to document-term matrix.

__init__(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)