dask_ml.feature_extraction.text.CountVectorizer

class dask_ml.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

Convert a collection of text documents to a matrix of token counts

Notes

When a vocabulary isn’t provided, fit_transform requires two passes over the dataset: one to learn the vocabulary and a second to transform the data. In that case, consider persisting the data prior to calling fit or transform if it fits in (distributed) memory.

Additionally, this implementation benefits from having an active dask.distributed.Client, even on a single machine. When a client is present, the learned vocabulary is persisted in distributed memory, which saves some recomputation and redundant communication.
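For example, when no vocabulary is given, persisting the bag keeps the documents in (distributed) memory across both passes, while supplying a vocabulary up front avoids the learning pass entirely. A minimal sketch (the toy corpus and vocabulary here are illustrative, not part of the API):

>>> import dask.bag as db
>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> corpus = db.from_sequence(
...     ['the cat sat', 'the dog ran', 'the cat ran'], npartitions=2
... ).persist()  # cache documents across the vocabulary and transform passes
>>> X = CountVectorizer().fit_transform(corpus)
>>> vocab = {'cat': 0, 'dog': 1, 'ran': 2, 'sat': 3, 'the': 4}
>>> one_pass = CountVectorizer(vocabulary=vocab)  # fixed vocabulary: single pass
>>> X = one_pass.fit_transform(corpus)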

Examples

The Dask-ML implementation currently requires that raw_documents be a dask.bag.Bag of documents (each document a string).

>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> import dask.bag as db
>>> from distributed import Client
>>> client = Client()
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> corpus = db.from_sequence(corpus, npartitions=2)
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> X
dask.array<concatenate, shape=(nan, 9), dtype=int64, chunksize=(nan, 9), ...
           chunktype=scipy.csr_matrix>
>>> X.compute().toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
>>> vectorizer.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

Methods

build_analyzer() Return a callable that handles preprocessing, tokenization and n-grams generation.
build_preprocessor() Return a function to preprocess the text before tokenization.
build_tokenizer() Return a function that splits a string into a sequence of tokens.
decode(doc) Decode the input into a string of unicode symbols.
fit(raw_documents[, y]) Learn a vocabulary dictionary of all tokens in the raw documents.
fit_transform(raw_documents[, y]) Learn the vocabulary dictionary and return document-term matrix.
get_feature_names() Array mapping from feature integer indices to feature name.
get_params([deep]) Get parameters for this estimator.
get_stop_words() Build or fetch the effective stop words list.
inverse_transform(X) Return terms per document with nonzero entries in X.
set_params(**params) Set the parameters of this estimator.
transform(raw_documents) Transform documents to document-term matrix.
__init__(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

Initialize self. See help(type(self)) for accurate signature.

build_analyzer()

Return a callable that handles preprocessing, tokenization and n-grams generation.

Returns:
analyzer : callable

A function to handle preprocessing, tokenization and n-grams generation.
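The analyzer applies the full preprocessing, tokenization, and n-gram pipeline to a single document. A sketch relying on the behavior inherited from scikit-learn’s CountVectorizer:

>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
>>> analyzer('Bi-grams are cool!')
['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']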

build_preprocessor()

Return a function to preprocess the text before tokenization.

Returns:
preprocessor : callable

A function to preprocess the text before tokenization.

build_tokenizer()

Return a function that splits a string into a sequence of tokens.

Returns:
tokenizer : callable

A function to split a string into a sequence of tokens.

decode(doc)

Decode the input into a string of unicode symbols.

The decoding strategy depends on the vectorizer parameters.

Parameters:
doc : str

The string to decode.

Returns:
doc : str

A string of unicode symbols.

fit(raw_documents, y=None)

Learn a vocabulary dictionary of all tokens in the raw documents.

Parameters:
raw_documents : iterable

An iterable which yields str, unicode, or file objects. The Dask-ML implementation currently requires a dask.bag.Bag of documents (see Examples).

Returns:
self
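A sketch of fitting on a bag; this assumes the fitted estimator exposes the learned vocabulary_ mapping as in scikit-learn (the toy corpus is illustrative):

>>> import dask.bag as db
>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> corpus = db.from_sequence(['the cat sat', 'the dog ran'], npartitions=2)
>>> vectorizer = CountVectorizer().fit(corpus)
>>> sorted(vectorizer.vocabulary_)  # assumption: vocabulary_ is set as in scikit-learn
['cat', 'dog', 'ran', 'sat', 'the']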
fit_transform(raw_documents, y=None)

Learn the vocabulary dictionary and return document-term matrix.

This is equivalent to fit followed by transform, but more efficiently implemented.

Parameters:
raw_documents : iterable

An iterable which yields str, unicode, or file objects. The Dask-ML implementation currently requires a dask.bag.Bag of documents (see Examples).

Returns:
X : dask.array.Array of shape (n_samples, n_features)

Document-term matrix, backed by scipy.sparse chunks (see Examples).
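As the Examples above show, the returned matrix is lazy; calling compute() materializes it. A sketch reusing the vectorizer and corpus bag from the Examples:

>>> X = vectorizer.fit_transform(corpus)  # lazy dask.array.Array with sparse chunks
>>> result = X.compute()                  # concrete scipy.sparse matrix
>>> dense = result.toarray()              # densify only when the result is small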

get_feature_names()

Array mapping from feature integer indices to feature name.

Returns:
feature_names : list

A list of feature names.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

get_stop_words()

Build or fetch the effective stop words list.

Returns:
stop_words : list or None

A list of stop words.

inverse_transform(X)

Return terms per document with nonzero entries in X.

Parameters:
X : {array-like, sparse matrix} of shape (n_samples, n_features)

Document-term matrix.

Returns:
X_inv : list of arrays of shape (n_samples,)

List of arrays of terms.
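This method is inherited from scikit-learn and operates on an in-memory matrix, so one reasonable pattern (an assumption of this sketch, reusing the fitted vectorizer and corpus bag from the Examples) is to compute the dask array first:

>>> X = vectorizer.fit_transform(corpus)
>>> terms_per_doc = vectorizer.inverse_transform(X.compute())  # assumes an in-memory matrix is expected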

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : object

Estimator instance.
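For a plain estimator the parameters are set directly; the <component>__<parameter> form applies when the vectorizer is nested, for example inside a scikit-learn Pipeline. An illustrative sketch (the step name 'vect' is arbitrary):

>>> from sklearn.pipeline import Pipeline
>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> vectorizer = CountVectorizer().set_params(binary=True, lowercase=False)
>>> pipe = Pipeline([('vect', CountVectorizer())])
>>> pipe = pipe.set_params(vect__ngram_range=(1, 2))  # reaches the nested estimator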

transform(raw_documents)

Transform documents to document-term matrix.

Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.

Parameters:
raw_documents : iterable

An iterable which yields str, unicode, or file objects. The Dask-ML implementation currently requires a dask.bag.Bag of documents (see Examples).

Returns:
X : sparse matrix of shape (n_samples, n_features)

Document-term matrix. With Dask-ML this is a lazy dask.array.Array backed by scipy.sparse chunks (see Examples).
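A sketch of transforming new documents with an already-fitted vocabulary (the toy bags are illustrative):

>>> import dask.bag as db
>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> train = db.from_sequence(['the cat sat', 'the dog ran'], npartitions=2)
>>> test = db.from_sequence(['the cat ran'], npartitions=1)
>>> vectorizer = CountVectorizer().fit(train)
>>> X_test = vectorizer.transform(test)  # lazy; uses only the fitted vocabulary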