dask_ml.wrappers.ParallelPostFit

class dask_ml.wrappers.ParallelPostFit(estimator=None, scoring=None)

Meta-estimator for parallel predict and transform.

Parameters:
estimator : Estimator

The underlying estimator that is fit.

scoring : string or callable, optional

A single string (see The scoring parameter: defining model evaluation rules) or a callable (see Defining your scoring strategy from metric functions) to evaluate the predictions on the test set.

For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.

NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.

See Specifying multiple metrics for evaluation for an example.

Warning

If None, the estimator’s default scorer (if available) is used. Most scikit-learn estimators will convert large Dask arrays to a single NumPy array, which may exhaust the memory of your workers. You probably want to always specify scoring.
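
As a minimal sketch of passing scoring explicitly (the estimator and metric here are arbitrary choices for illustration, not from the original docs):

>>> from sklearn.linear_model import LogisticRegression
>>> from dask_ml.wrappers import ParallelPostFit
>>> clf = ParallelPostFit(estimator=LogisticRegression(),
...                       scoring='accuracy')  # explicit scorer, not the estimator default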

See also

Incremental, dask_ml.model_selection.IncrementalSearch

Notes

Warning

This class is not appropriate for parallel or distributed training on large datasets. For that, see Incremental, which provides distributed (but sequential) training. If you’re doing distributed hyperparameter optimization on larger-than-memory datasets, see dask_ml.model_selection.IncrementalSearch.

This estimator does not parallelize the training step. It simply calls the underlying estimator’s fit method and copies the learned attributes over to self afterwards.

It is helpful for situations where your training dataset is relatively small (fits on a single machine) but you need to predict or transform a much larger dataset. predict, predict_proba and transform will be done in parallel (potentially distributed if you’ve connected to a dask.distributed.Client).

Note that many scikit-learn estimators already predict and transform in parallel. This meta-estimator may still be useful in those cases when your dataset is larger than memory, as the distributed scheduler will ensure the data isn’t all read into memory at once.

Examples

>>> from sklearn.ensemble import GradientBoostingClassifier
>>> import sklearn.datasets
>>> import dask_ml.datasets
>>> from dask_ml.wrappers import ParallelPostFit

Make a small 1,000-sample training dataset and fit normally.

>>> X, y = sklearn.datasets.make_classification(n_samples=1000,
...                                             random_state=0)
>>> clf = ParallelPostFit(estimator=GradientBoostingClassifier(),
...                       scoring='accuracy')
>>> clf.fit(X, y)
ParallelPostFit(estimator=GradientBoostingClassifier(...))
>>> clf.classes_
array([0, 1])

Transform and predict return dask outputs for dask inputs.

>>> X_big, y_big = dask_ml.datasets.make_classification(n_samples=100000,
...                                                     random_state=0)
>>> clf.predict(X_big)
dask.array<predict, shape=(100000,), dtype=int64, chunksize=(1000,)>

Which can be computed in parallel.

>>> clf.predict_proba(X_big).compute()
array([[0.99141094, 0.00858906],
       [0.93178389, 0.06821611],
       [0.99129105, 0.00870895],
       ...,
       [0.97996652, 0.02003348],
       [0.98087444, 0.01912556],
       [0.99407016, 0.00592984]])
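
Scoring works the same way. As a sketch reusing the objects above (the call is eager by default, so a plain float comes back):

>>> acc = clf.score(X_big, y_big)  # computed with dask, returns a float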

Methods

fit(X[, y]) Fit the underlying estimator.
get_params([deep]) Get parameters for this estimator.
predict(X) Predict for X.
predict_proba(X) Predict class probabilities for X.
score(X, y[, compute]) Returns the score on the given data.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform block or partition-wise for dask inputs.
partial_fit  
__init__(estimator=None, scoring=None)

Initialize self. See help(type(self)) for accurate signature.

fit(X, y=None, **kwargs)

Fit the underlying estimator.

Parameters:
X, y : array-like
**kwargs

Additional fit-kwargs for the underlying estimator.

Returns:
self : object

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

predict(X)

Predict for X.

For dask inputs, a dask array or dataframe is returned. For other inputs (NumPy array, pandas dataframe, scipy sparse matrix), the regular return value is returned.

Parameters:
X : array-like
Returns:
y : array-like
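
For example, a minimal sketch reusing the fitted clf and the datasets from the Examples section above:

>>> import numpy as np
>>> isinstance(clf.predict(X), np.ndarray)   # NumPy in, NumPy out
True
>>> y_lazy = clf.predict(X_big)              # dask in, lazy dask array out
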
predict_proba(X)

Predict class probabilities for X.

For dask inputs, a dask array or dataframe is returned. For other inputs (NumPy array, pandas dataframe, scipy sparse matrix), the regular return value is returned.

If the underlying estimator does not have a predict_proba method, then an AttributeError is raised.

Parameters:
X : array or dataframe
Returns:
y : array-like
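
A sketch of guarding against estimators that lack predict_proba (LinearSVC is an arbitrary choice for illustration, not from the original docs):

>>> from sklearn.svm import LinearSVC
>>> svc = ParallelPostFit(estimator=LinearSVC()).fit(X, y)
>>> try:
...     svc.predict_proba(X)
... except AttributeError:
...     print("no predict_proba on the underlying estimator")
no predict_proba on the underlying estimator
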
score(X, y, compute=True)

Returns the score on the given data.

Parameters:
X : array-like, shape = [n_samples, n_features]

Input data, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples] or [n_samples, n_output], optional

Target relative to X for classification or regression; None for unsupervised learning.

compute : bool, optional

Whether to eagerly compute the score (default True). If False and the inputs are dask collections, a lazy dask result is returned instead of a plain float.

Returns:
score : float

The score on the given data, computed with the scoring parameter if one was provided, and otherwise with the underlying estimator’s score method.
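
As a sketch of the compute flag (assuming, per the signature above, that compute=False defers evaluation to a lazy dask result):

>>> lazy = clf.score(X_big, y_big, compute=False)  # lazy dask result
>>> lazy.compute()  # doctest: +SKIP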

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
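
For example, a minimal sketch using the fitted clf from the Examples section:

>>> clf = clf.set_params(scoring='accuracy')           # top-level parameter
>>> clf = clf.set_params(estimator__n_estimators=200)  # nested: <component>__<parameter>
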
transform(X)

Transform block or partition-wise for dask inputs.

For dask inputs, a dask array or dataframe is returned. For other inputs (NumPy array, pandas dataframe, scipy sparse matrix), the regular return value is returned.

If the underlying estimator does not have a transform method, then an AttributeError is raised.

Parameters:
X : array-like
Returns:
transformed : array-like
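
For example, a minimal sketch wrapping a transformer (PCA is an arbitrary choice; X and X_big are the small and large datasets from the Examples section):

>>> from sklearn.decomposition import PCA
>>> pca = ParallelPostFit(estimator=PCA(n_components=2)).fit(X)
>>> Xt = pca.transform(X_big)   # lazy, block-wise transform of the dask array
>>> Xt.compute()  # doctest: +SKIP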