dask_ml.feature_extraction.text.CountVectorizer
- class dask_ml.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
Convert a collection of text documents to a matrix of token counts.
Notes

When a vocabulary isn't provided, fit_transform requires two passes over the dataset: one to learn the vocabulary and a second to transform the data. Consider persisting the data if it fits in (distributed) memory prior to calling fit or transform when not providing a vocabulary.

Additionally, this implementation benefits from having an active dask.distributed.Client, even on a single machine. When a client is present, the learned vocabulary is persisted in distributed memory, which saves some recomputation and redundant communication.
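A minimal sketch of that advice (the file pattern "data/*.txt" and the local Client() are assumptions for illustration):

    import dask.bag as db
    from distributed import Client
    from dask_ml.feature_extraction.text import CountVectorizer

    client = Client()                     # keeps the learned vocabulary in distributed memory
    corpus = db.read_text("data/*.txt")   # hypothetical input files, one document per line
    corpus = corpus.persist()             # persist before fitting if the data fits in (distributed) memory

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)  # one pass to learn the vocabulary, a second to build the counts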
Examples

The Dask-ML implementation currently requires that raw_documents is a dask.bag.Bag of documents (lists of strings).

>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> import dask.bag as db
>>> from distributed import Client
>>> client = Client()
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> corpus = db.from_sequence(corpus, npartitions=2)
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> X
dask.array<concatenate, shape=(nan, 9), dtype=int64, chunksize=(nan, 9), ...
    chunktype=scipy.csr_matrix>
>>> X.compute().toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)
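When a fixed vocabulary is known up front, passing it to the constructor avoids the vocabulary-learning pass described in the Notes. The sketch below reuses the corpus bag from the example above with a hand-built vocabulary mapping (an assumption for illustration):

    vocabulary = {'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4,
                  'second': 5, 'the': 6, 'third': 7, 'this': 8}
    vectorizer = CountVectorizer(vocabulary=vocabulary)
    X = vectorizer.transform(corpus)   # single pass; no vocabulary is learned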
Methods

build_analyzer()
    Return a callable to process input data.
build_preprocessor()
    Return a function to preprocess the text before tokenization.
build_tokenizer()
    Return a function that splits a string into a sequence of tokens.
decode(doc)
    Decode the input into a string of unicode symbols.
fit(raw_documents[, y])
    Learn a vocabulary dictionary of all tokens in the raw documents.
fit_transform(raw_documents[, y])
    Learn the vocabulary dictionary and return document-term matrix.
get_feature_names_out([input_features])
    Get output feature names for transformation.
get_metadata_routing()
    Get metadata routing of this object.
get_params([deep])
    Get parameters for this estimator.
get_stop_words()
    Build or fetch the effective stop words list.
inverse_transform(X)
    Return terms per document with nonzero entries in X.
set_fit_request(*[, raw_documents])
    Request metadata passed to the fit method.
set_params(**params)
    Set the parameters of this estimator.
set_transform_request(*[, raw_documents])
    Request metadata passed to the transform method.
transform(raw_documents)
    Transform documents to document-term matrix.
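A typical split between fit and transform looks like the following sketch, where train_bag and new_bag stand in for two hypothetical dask.bag.Bag collections of documents:

    vectorizer = CountVectorizer()
    vectorizer.fit(train_bag)                            # learn the vocabulary from the training documents
    X_new = vectorizer.transform(new_bag)                # reuse the learned vocabulary on new documents
    feature_names = vectorizer.get_feature_names_out()   # column labels of the document-term matrix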
- __init__(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
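The keyword arguments mirror those of scikit-learn's CountVectorizer, as shown in the signature above. A minimal sketch of a non-default configuration (the argument values are illustrative assumptions, not recommendations):

    vectorizer = CountVectorizer(
        ngram_range=(1, 2),       # count unigrams and bigrams
        stop_words='english',     # drop common English stop words
        binary=True,              # record presence/absence instead of raw counts
    )
    X = vectorizer.fit_transform(corpus)   # corpus: a dask.bag.Bag of documents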