dask_ml.wrappers.Incremental

class dask_ml.wrappers.Incremental(estimator=None, scoring=None, shuffle_blocks=True, random_state=None, assume_equal_chunks=True, predict_meta=None, predict_proba_meta=None, transform_meta=None)

Metaestimator for feeding Dask Arrays to an estimator blockwise.

This wrapper provides a bridge between Dask objects and estimators implementing the partial_fit API. These incremental learners can train on batches of data. This fits well with Dask’s blocked data structures.

Note

This meta-estimator is not appropriate for hyperparameter optimization on larger-than-memory datasets. For that, see IncrementalSearchCV or HyperbandSearchCV.

See the list of incremental learners in the scikit-learn documentation for estimators that implement the partial_fit API. Incremental is not limited to these classes: it works with any estimator implementing partial_fit, including those defined outside of scikit-learn.

Calling Incremental.fit() with a Dask Array will pass each block of the Dask array or arrays to estimator.partial_fit sequentially.
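As a rough mental model, blockwise fitting amounts to looping over row blocks and calling partial_fit on each one. The sketch below is not dask-ml's implementation: it uses plain NumPy arrays split with np.array_split as a stand-in for Dask blocks, and the data, block count, and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Illustrative data: 100 samples, 5 features, binary labels.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

est = SGDClassifier(random_state=0)

# Stand-in for Dask blocks: four row-wise chunks of 25 samples each.
X_blocks = np.array_split(X, 4)
y_blocks = np.array_split(y, 4)

n_calls = 0
for X_block, y_block in zip(X_blocks, y_blocks):
    # classes is passed explicitly because a single block
    # may not contain every label.
    est.partial_fit(X_block, y_block, classes=[0, 1])
    n_calls += 1
```

Each call updates the same estimator in place, so only one block needs to be in memory at a time.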

Like ParallelPostFit, the methods available after fitting (e.g. Incremental.predict()) are all parallel and delayed.

The estimator_ attribute is a clone of estimator that was actually used during the call to fit. All attributes learned during training are available on Incremental directly.

Parameters
estimator: Estimator

Any object supporting the scikit-learn partial_fit API.

scoring: string or callable, optional

A single string (see The scoring parameter: defining model evaluation rules) or a callable (see Defining your scoring strategy from metric functions) to evaluate the predictions on the test set.

For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.

NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.

See Specifying multiple metrics for evaluation for an example.

Warning

If None, the estimator’s default scorer (if available) is used. Most scikit-learn estimators will convert large Dask arrays to a single NumPy array, which may exhaust the memory of your worker. You probably want to always specify scoring.

random_state: int or numpy.random.RandomState, optional

Random object that determines how to shuffle blocks.

shuffle_blocks: bool, default True

Determines whether to call partial_fit on the blocks of the Dask arrays in a random order (default) or in sequential order. This only affects the order in which whole blocks are visited; it does not move rows between blocks or shuffle rows within a block.
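To illustrate what shuffling at the block level means, here is a minimal sketch. The helper block_order is hypothetical and is not part of dask-ml; it accepts only an int or None as the seed, and it shows nothing more than the visiting order of whole blocks.

```python
import numpy as np

def block_order(n_blocks, shuffle_blocks=True, random_state=None):
    """Hypothetical helper: return the order in which blocks are visited."""
    order = np.arange(n_blocks)
    if shuffle_blocks:
        # Seeded generator so the permutation is reproducible.
        rng = np.random.RandomState(random_state)
        rng.shuffle(order)
    return order.tolist()

sequential = block_order(4, shuffle_blocks=False)  # [0, 1, 2, 3]
shuffled = block_order(4, shuffle_blocks=True, random_state=0)
```

With a fixed seed the shuffled order is the same on every run, which is the role random_state plays here.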

predict_meta: pd.Series, pd.DataFrame, np.array, default: None (infer)

An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator’s predict call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.

predict_proba_meta: pd.Series, pd.DataFrame, np.array, default: None (infer)

An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator’s predict_proba call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.

transform_meta: pd.Series, pd.DataFrame, np.array, default: None (infer)

An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator’s transform call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.

Attributes
estimator_: Estimator

A clone of estimator that was actually fit during the .fit call.

Examples

>>> from dask_ml.wrappers import Incremental
>>> from dask_ml.datasets import make_classification
>>> import sklearn.linear_model
>>> X, y = make_classification(chunks=25)
>>> est = sklearn.linear_model.SGDClassifier()
>>> clf = Incremental(est, scoring='accuracy')
>>> clf.fit(X, y, classes=[0, 1])

When used inside a grid search, prefix the underlying estimator’s parameter names with estimator__.

>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = {"estimator__alpha": [0.1, 1.0, 10.0]}
>>> gs = GridSearchCV(clf, param_grid)
>>> gs.fit(X, y, classes=[0, 1])

Methods

fit(X[, y])

Fit the underlying estimator.

get_params([deep])

Get parameters for this estimator.

partial_fit(X[, y])

Fit the underlying estimator.

predict(X)

Predict for X.

predict_log_proba(X)

Log of probability estimates.

predict_proba(X)

Probability estimates.

score(X, y[, compute])

Returns the score on the given data.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform block or partition-wise for dask inputs.

__init__(estimator=None, scoring=None, shuffle_blocks=True, random_state=None, assume_equal_chunks=True, predict_meta=None, predict_proba_meta=None, transform_meta=None)