dask_ml.wrappers.Incremental¶
- class dask_ml.wrappers.Incremental(estimator=None, scoring=None, shuffle_blocks=True, random_state=None, assume_equal_chunks=True, predict_meta=None, predict_proba_meta=None, transform_meta=None)¶
Metaestimator for feeding Dask Arrays to an estimator blockwise.

This wrapper provides a bridge between Dask objects and estimators implementing the partial_fit API. These incremental learners can train on batches of data, which fits well with Dask’s blocked data structures.

Note

This meta-estimator is not appropriate for hyperparameter optimization on larger-than-memory datasets. For that, see IncrementalSearchCV or HyperbandSearchCV.

See the list of incremental learners in the scikit-learn documentation for estimators that implement the partial_fit API. Note that Incremental is not limited to these classes; it will work with any estimator implementing partial_fit, including those defined outside of scikit-learn itself.

Calling Incremental.fit() with a Dask Array will pass each block of the Dask array or arrays to estimator.partial_fit sequentially.

Like ParallelPostFit, the methods available after fitting (e.g. Incremental.predict()) are all parallel and delayed.

The estimator_ attribute is a clone of estimator that was actually used during the call to fit. All attributes learned during training are available on Incremental directly.

- Parameters
- estimator : Estimator
Any object supporting the scikit-learn partial_fit API.
- scoring : string or callable, optional
A single string (see The scoring parameter: defining model evaluation rules) or a callable (see Defining your scoring strategy from metric functions) to evaluate the predictions on the test set.
For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.
Note that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.
See Specifying multiple metrics for evaluation for an example.
Warning
If None, the estimator’s default scorer (if available) is used. Most scikit-learn estimators will convert large Dask arrays to a single NumPy array, which may exhaust the memory of your worker. You probably want to always specify scoring.
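As a hedged illustration of the "single value" requirement above, the sketch below builds a custom scorer with scikit-learn's make_scorer. The metric function fraction_correct is a hypothetical example, not part of dask-ml; the resulting scorer object is the kind of callable that scoring= accepts.

```python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.linear_model import SGDClassifier

# A custom scorer must return a single value.  make_scorer wraps a
# metric function (y_true, y_pred) -> float into the scorer interface
# that scoring= accepts.
def fraction_correct(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

scorer = make_scorer(fraction_correct)

# Tiny illustrative dataset and estimator (not from the dask-ml docs).
X = np.array([[0.0], [1.0], [2.0], [3.0]] * 10)
y = (X[:, 0] > 1.5).astype(int)
est = SGDClassifier(random_state=0).fit(X, y)

# Scorers are called as scorer(estimator, X, y) and return one float.
score = scorer(est, X, y)
print(score)
```

A metric that returns an array (e.g. per-class precision) would need to be split into one such scorer per value before being passed as scoring.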
- random_state : int or numpy.random.RandomState, optional
Random object that determines how to shuffle blocks.
- shuffle_blocks : bool, default True
Determines whether to call partial_fit on a randomly selected chunk of the Dask arrays (default), or to fit in sequential order. This does not control shuffling between blocks or within each block.
- predict_meta : pd.Series, pd.DataFrame, or np.array, default: None (infer)
An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator’s predict call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.
- predict_proba_meta : pd.Series, pd.DataFrame, or np.array, default: None (infer)
An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator’s predict_proba call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.
- transform_meta : pd.Series, pd.DataFrame, or np.array, default: None (infer)
An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator’s transform call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.
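To make the meta parameters above concrete, here is a minimal sketch (plain NumPy/pandas, no dask-ml required) of what such empty containers look like. The shapes and dtypes are assumptions for a hypothetical binary classifier whose predict returns integer labels and whose predict_proba returns one float column per class.

```python
import numpy as np
import pandas as pd

# A "meta" is just an empty container whose type and dtype match what
# the wrapped estimator's method will return.

# For a predict call that returns integer labels:
predict_meta = np.empty((0,), dtype=np.int64)

# For a predict_proba call returning one float column per class
# (two classes assumed here):
predict_proba_meta = np.empty((0, 2), dtype=np.float64)

# If the estimator's method returns a pandas object instead, use an
# empty Series or DataFrame with matching dtypes:
series_meta = pd.Series([], dtype=np.int64)
```

These objects carry no data; they only tell dask the type and dtype to expect when it builds the task graph.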
- Attributes
- estimator_ : Estimator
A clone of estimator that was actually fit during the .fit call.
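The "clone" semantics of estimator_ can be illustrated with scikit-learn's sklearn.base.clone, which copies hyperparameters but not fitted state; this is a sketch of the general cloning behavior, not dask-ml's internal code.

```python
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier

# A clone is a fresh estimator with the same hyperparameters but
# none of the fitted attributes of the original.
est = SGDClassifier(alpha=0.5)
est_clone = clone(est)

print(est_clone is est)                     # a distinct object
print(est_clone.get_params()["alpha"])      # same hyperparameters
```

Because estimator_ is such a clone, the original estimator you pass in is left unmodified by fitting.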
Examples
>>> from dask_ml.wrappers import Incremental
>>> from dask_ml.datasets import make_classification
>>> import sklearn.linear_model
>>> X, y = make_classification(chunks=25)
>>> est = sklearn.linear_model.SGDClassifier()
>>> clf = Incremental(est, scoring='accuracy')
>>> clf.fit(X, y, classes=[0, 1])
When used inside a grid search, prefix the underlying estimator’s parameter names with estimator__.
>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = {"estimator__alpha": [0.1, 1.0, 10.0]}
>>> gs = GridSearchCV(clf, param_grid)
>>> gs.fit(X, y, classes=[0, 1])
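For intuition about what fit does with each block, here is a minimal sketch of the sequential partial_fit loop described above, written against plain NumPy blocks in place of a Dask array (so it runs without dask-ml installed); the data and block count are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Illustrative data: 100 rows, 5 features, binary labels.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

est = SGDClassifier(random_state=0)
classes = np.unique(y)

# Incremental.fit passes each block of the Dask array to
# estimator.partial_fit sequentially.  With plain NumPy, four
# 25-row "blocks" (mimicking chunks=25) give the same loop:
for X_block, y_block in zip(np.array_split(X, 4), np.array_split(y, 4)):
    # classes is forwarded so every partial_fit call knows all labels,
    # even if a block happens to contain only one class.
    est.partial_fit(X_block, y_block, classes=classes)

print(est.coef_.shape)  # one coefficient row per binary decision
```

This also shows why classes=[0, 1] must be passed to fit in the doctest above: partial_fit needs the full label set on its first call.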
Methods
fit(X[, y]): Fit the underlying estimator.
get_metadata_routing(): Get metadata routing of this object.
get_params([deep]): Get parameters for this estimator.
partial_fit(X[, y]): Fit the underlying estimator.
predict(X): Predict for X.
predict_log_proba(X): Log of probability estimates.
predict_proba(X): Probability estimates.
score(X, y[, compute]): Returns the score on the given data.
set_params(**params): Set the parameters of this estimator.
set_score_request(*[, compute]): Request metadata passed to the score method.
transform(X): Transform block or partition-wise for dask inputs.
- __init__(estimator=None, scoring=None, shuffle_blocks=True, random_state=None, assume_equal_chunks=True, predict_meta=None, predict_proba_meta=None, transform_meta=None)¶