dask_ml.feature_extraction.text.CountVectorizer
class dask_ml.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
Convert a collection of text documents to a matrix of token counts.

Notes

When a vocabulary isn't provided, fit_transform requires two passes over the
dataset: one to learn the vocabulary and a second to transform the data.
Consider persisting the data if it fits in (distributed) memory prior to
calling fit or transform when not providing a vocabulary.

Additionally, this implementation benefits from having an active
dask.distributed.Client, even on a single machine. When a client is present,
the learned vocabulary is persisted in distributed memory, which saves some
recomputation and redundant communication. Sketches of both points follow the
method list below.

Examples

The Dask-ML implementation currently requires that raw_documents is a
dask.bag.Bag of documents (lists of strings).

>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> import dask.bag as db
>>> from distributed import Client
>>> client = Client()
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> corpus = db.from_sequence(corpus, npartitions=2)
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> X
dask.array<concatenate, shape=(nan, 9), dtype=int64, chunksize=(nan, 9), ...
           chunktype=scipy.csr_matrix>
>>> X.compute().toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
>>> vectorizer.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

Methods

build_analyzer()
    Return a callable to process input data.
build_preprocessor()
    Return a function to preprocess the text before tokenization.
build_tokenizer()
    Return a function that splits a string into a sequence of tokens.
decode(doc)
    Decode the input into a string of unicode symbols.
fit(raw_documents[, y])
    Learn a vocabulary dictionary of all tokens in the raw documents.
fit_transform(raw_documents[, y])
    Learn the vocabulary dictionary and return document-term matrix.
get_feature_names_out([input_features])
    Get output feature names for transformation.
get_metadata_routing()
    Get metadata routing of this object.
get_params([deep])
    Get parameters for this estimator.
get_stop_words()
    Build or fetch the effective stop words list.
inverse_transform(X)
    Return terms per document with nonzero entries in X.
set_params(**params)
    Set the parameters of this estimator.
transform(raw_documents)
    Transform documents to document-term matrix.

__init__(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
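Because fit_transform makes two passes over the data when no vocabulary is
supplied, persisting a bag that fits in (distributed) memory avoids reading or
recomputing the input twice. The following is a minimal sketch (not part of the
upstream example) that reuses a small corpus like the one above and simply adds
a persist() call before fitting:

>>> import dask.bag as db
>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> corpus = db.from_sequence([
...     'This is the first document.',
...     'This document is the second document.',
... ], npartitions=2)
>>> corpus = corpus.persist()  # keep the bag in (distributed) memory
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)  # both passes now read the persisted bag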
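Alternatively, if the token-to-index mapping is known ahead of time, passing it
through the vocabulary parameter removes the learning pass entirely, so only a
single pass over the documents is needed. A hedged sketch, assuming the small
corpus bag from the previous snippet and an illustrative hand-written
vocabulary dict; the printed counts assume the default word analyzer and
lowercasing:

>>> vocabulary = {'document': 0, 'first': 1, 'is': 2, 'this': 3}
>>> vectorizer = CountVectorizer(vocabulary=vocabulary)
>>> X = vectorizer.fit_transform(corpus)  # one pass: the vocabulary is fixed, nothing to learn
>>> X.compute().toarray()[0]  # counts for 'This is the first document.'
array([1, 1, 1, 1])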