dask_ml.wrappers.Incremental

class dask_ml.wrappers.Incremental(estimator=None, scoring=None, shuffle_blocks=True, random_state=None)

Metaestimator for feeding Dask Arrays to an estimator blockwise.

This wrapper provides a bridge between Dask objects and estimators implementing the partial_fit API. These incremental learners can train on batches of data. This fits well with Dask’s blocked data structures.

Note

This meta-estimator is not appropriate for hyperparameter optimization on larger-than-memory datasets. For that, see dask_ml.model_selection.IncrementalSearch.

See the list of incremental learners in the scikit-learn documentation for the estimators that implement the partial_fit API. Note that Incremental is not limited to just these classes; it will work on any estimator implementing partial_fit, including those defined outside of scikit-learn itself.

Calling Incremental.fit() with a Dask Array will pass each block of the Dask array or arrays to estimator.partial_fit sequentially.
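
Conceptually, ignoring block shuffling and scheduling details, the training loop is roughly equivalent to the following sketch (X, y, and est are the names used in the Examples section below, where classes=[0, 1] is required by SGDClassifier's partial_fit):

>>> for i in range(X.numblocks[0]):
...     est.partial_fit(X.blocks[i].compute(), y.blocks[i].compute(), classes=[0, 1])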

Like ParallelPostFit, the methods available after fitting (e.g. Incremental.predict(), etc.) are all parallel and delayed.
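
For instance (a sketch, assuming a fitted Incremental named clf and a Dask array X as in the Examples section below), predictions come back as a lazy Dask collection:

>>> preds = clf.predict(X)   # a lazy dask array; nothing is computed yet
>>> preds.compute()          # materialize the predictions as a NumPy array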

The estimator_ attribute is a clone of estimator that was actually used during the call to fit. All attributes learned during training are available on Incremental directly.
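
As a sketch (assuming the fitted clf from the Examples section below, which wraps an SGDClassifier), learned attributes can be read from either place:

>>> clf.estimator_.coef_   # coefficients learned on the fitted clone
>>> clf.coef_              # the same attribute, available directly on the wrapper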

Parameters:
estimator : Estimator

Any object supporting the scikit-learn partial_fit API.

scoring : string or callable, optional

A single string (see The scoring parameter: defining model evaluation rules) or a callable (see Defining your scoring strategy from metric functions) to evaluate the predictions on the test set.

For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.

Note that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.

See Specifying multiple metrics for evaluation for an example.

Warning

If None, the estimator’s default scorer (if available) is used. Most scikit-learn estimators will convert large Dask arrays to a single NumPy array, which may exhaust the memory of your worker. You probably want to always specify scoring.
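
A sketch of passing an explicit scorer instead of relying on the default (make_scorer and accuracy_score are from sklearn.metrics):

>>> from sklearn.metrics import make_scorer, accuracy_score
>>> from sklearn.linear_model import SGDClassifier
>>> from dask_ml.wrappers import Incremental
>>> clf = Incremental(SGDClassifier(), scoring=make_scorer(accuracy_score))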

shuffle_blocks : bool, default True

Determines whether to call partial_fit on a randomly selected chunk of the Dask arrays (default), or to fit in sequential order. This controls only the order in which whole blocks are passed to partial_fit; it does not shuffle data across block boundaries or within each block.

random_state : int or numpy.random.RandomState, optional

Seed or random state determining the order in which blocks are shuffled. Only used when shuffle_blocks is True.
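
For a reproducible block order, fix the seed (a short sketch; random_state has no effect when shuffle_blocks=False):

>>> from sklearn.linear_model import SGDClassifier
>>> from dask_ml.wrappers import Incremental
>>> inc = Incremental(SGDClassifier(), shuffle_blocks=True, random_state=0)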

Attributes:
estimator_ : Estimator

A clone of estimator that was actually fit during the .fit call.

See also

ParallelPostFit, dask_ml.model_selection.IncrementalSearch

Examples

>>> from dask_ml.wrappers import Incremental
>>> from dask_ml.datasets import make_classification
>>> import sklearn.linear_model
>>> X, y = make_classification(chunks=25)
>>> est = sklearn.linear_model.SGDClassifier()
>>> clf = Incremental(est, scoring='accuracy')
>>> clf.fit(X, y, classes=[0, 1])

When used inside a grid search, prefix the underlying estimator’s parameter names with estimator__.

>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = {"estimator__alpha": [0.1, 1.0, 10.0]}
>>> gs = GridSearchCV(clf, param_grid)
>>> gs.fit(X, y, classes=[0, 1])

Methods

fit(X[, y]) Fit the underlying estimator.
get_params([deep]) Get parameters for this estimator.
partial_fit(X[, y]) Fit the underlying estimator.
predict(X) Predict for X.
predict_proba(X) Predict class probabilities for X.
score(X, y[, compute]) Return the score on the given data.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform block or partition-wise for dask inputs.
__init__(estimator=None, scoring=None, shuffle_blocks=True, random_state=None)

Initialize self. See help(type(self)) for accurate signature.

fit(X, y=None, **fit_kwargs)

Fit the underlying estimator.

Parameters:
X, y : array-like
**fit_kwargs

Additional fit-kwargs for the underlying estimator.

Returns:
self : object
get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

partial_fit(X, y=None, **fit_kwargs)

Fit the underlying estimator.

If this estimator has not been previously fit, this is identical to Incremental.fit(). If it has been previously fit, self.estimator_ is used as the starting point.

Parameters:
X, y : array-like
**fit_kwargs

Additional fit-kwargs for the underlying estimator.

Returns:
self : object
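
A minimal sketch of training over successive batches (X2 and y2 are hypothetical new Dask arrays with the same columns and classes as the X and y from the Examples section):

>>> clf.fit(X, y, classes=[0, 1])   # first batch
>>> clf.partial_fit(X2, y2)         # continue training from clf.estimator_
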
predict(X)

Predict for X.

For dask inputs, a dask array or dataframe is returned. For other inputs (NumPy array, pandas dataframe, scipy sparse matrix), the regular return value is returned.

Parameters:
X : array-like
Returns:
y : array-like
predict_proba(X)

Predict class probabilities for X.

For dask inputs, a dask array or dataframe is returned. For other inputs (NumPy array, pandas dataframe, scipy sparse matrix), the regular return value is returned.

If the underlying estimator does not have a predict_proba method, then an AttributeError is raised.

Parameters:
X : array or dataframe
Returns:
y : array-like
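
A sketch with an underlying estimator that implements both partial_fit and predict_proba (sklearn.naive_bayes.GaussianNB is one such estimator; X and y are the Dask arrays from the Examples section):

>>> import sklearn.naive_bayes
>>> clf = Incremental(sklearn.naive_bayes.GaussianNB())
>>> clf.fit(X, y, classes=[0, 1])
>>> proba = clf.predict_proba(X)   # lazy dask array, shape (n_samples, n_classes)
>>> proba.compute()
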
score(X, y, compute=True)

Return the score on the given data.

Parameters:
X : array-like, shape = [n_samples, n_features]

Input data, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples] or [n_samples, n_outputs], optional

Target relative to X for classification or regression; None for unsupervised learning.

compute : bool, optional

Whether to eagerly compute the score as a concrete float (default), or to return a lazy Dask object to be computed later.

Returns:
score : float

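A sketch (assuming the fitted clf from the Examples section); with compute=False the score is expected to come back as a lazy Dask object rather than a concrete float:

>>> clf.score(X, y)                        # eagerly computed float
>>> lazy = clf.score(X, y, compute=False)  # lazy Dask result
>>> lazy.compute()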

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
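
For example (a sketch, reusing the clf wrapping an SGDClassifier from the Examples section), parameters of the wrapped estimator take the estimator__ prefix described above:

>>> clf.set_params(estimator__alpha=0.01)
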
transform(X)

Transform block or partition-wise for dask inputs.

For dask inputs, a dask array or dataframe is returned. For other inputs (NumPy array, pandas dataframe, scipy sparse matrix), the regular return value is returned.

If the underlying estimator does not have a transform method, then an AttributeError is raised.

Parameters:
X : array-like
Returns:
transformed : array-like
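
A minimal sketch with an incremental transformer (sklearn's StandardScaler implements partial_fit, so it can be wrapped directly; the array X here is illustrative):

>>> import dask.array as da
>>> from sklearn.preprocessing import StandardScaler
>>> from dask_ml.wrappers import Incremental
>>> X = da.random.random((100, 4), chunks=25)
>>> scaler = Incremental(StandardScaler())
>>> scaler.fit(X)
>>> scaler.transform(X)   # lazy dask array with the same chunking as X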