dask_ml.feature_extraction.text.CountVectorizer

class dask_ml.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

Convert a collection of text documents to a matrix of token counts.

See also

sklearn.feature_extraction.text.CountVectorizer

Notes

When a vocabulary isn’t provided, fit_transform requires two passes over the dataset: one to learn the vocabulary and a second to transform the data. If you are not providing a vocabulary and the data fits in (distributed) memory, consider persisting it before calling fit or transform.

Additionally, this implementation benefits from having an active dask.distributed.Client, even on a single machine. When a client is present, the learned vocabulary is persisted in distributed memory, which saves some recomputation and redundant communication.
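The two passes described in these notes can be illustrated with a minimal, standard-library-only sketch. This is not the Dask-ML implementation; the helper names (tokenize, learn_vocabulary, count_tokens) are hypothetical, and the sketch assumes the default lowercase=True, the default token_pattern, and an alphabetically sorted vocabulary. It only shows why the data is read twice when no vocabulary is supplied.

```python
import re
from collections import Counter

# Default token_pattern from the class signature: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Lowercase a document and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(doc.lower())


def learn_vocabulary(docs):
    """Pass 1: read every document and map each distinct token to a column."""
    tokens = set()
    for doc in docs:
        tokens.update(tokenize(doc))
    # Assumption: columns follow alphabetical token order.
    return {tok: col for col, tok in enumerate(sorted(tokens))}


def count_tokens(docs, vocabulary):
    """Pass 2: read every document again and build one row of counts each."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok, n in Counter(tokenize(doc)).items():
            if tok in vocabulary:  # tokens outside the vocabulary are ignored
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows


corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
vocabulary = learn_vocabulary(corpus)      # first pass over the data
matrix = count_tokens(corpus, vocabulary)  # second pass over the data
```

If a vocabulary is supplied up front, only the second function is needed, which is why passing vocabulary avoids the extra pass.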

Examples

The Dask-ML implementation currently requires that raw_documents is a dask.bag.Bag of documents, where each document is a string.

>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> import dask.bag as db
>>> from distributed import Client
>>> client = Client()
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> corpus = db.from_sequence(corpus, npartitions=2)
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> X
dask.array<concatenate, shape=(nan, 9), dtype=int64, chunksize=(nan, 9), ...
           chunktype=scipy.csr_matrix>
>>> X.compute().toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

Methods

`build_analyzer`()	Return a callable to process input data.
`build_preprocessor`()	Return a function to preprocess the text before tokenization.
`build_tokenizer`()	Return a function that splits a string into a sequence of tokens.
`decode`(doc)	Decode the input into a string of unicode symbols.
`fit`(raw_documents[, y])	Learn a vocabulary dictionary of all tokens in the raw documents.
`fit_transform`(raw_documents[, y])	Learn the vocabulary dictionary and return document-term matrix.
`get_feature_names_out`([input_features])	Get output feature names for transformation.
`get_metadata_routing`()	Get metadata routing of this object.
`get_params`([deep])	Get parameters for this estimator.
`get_stop_words`()	Build or fetch the effective stop words list.
`inverse_transform`(X)	Return terms per document with nonzero entries in X.
`set_params`(**params)	Set the parameters of this estimator.
`transform`(raw_documents)	Transform documents to document-term matrix.

__init__(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
