dask_ml.wrappers.ParallelPostFit

class dask_ml.wrappers.ParallelPostFit(estimator=None, scoring=None, predict_meta=None, predict_proba_meta=None, transform_meta=None)

Meta-estimator for parallel predict and transform.

Parameters
estimator: Estimator

The underlying estimator that is fit.

scoring: string or callable, optional

A single string (see The scoring parameter: defining model evaluation rules) or a callable (see Defining your scoring strategy from metric functions) to evaluate the predictions on the test set.

For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.

NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.

See Specifying multiple metrics for evaluation for an example.

Warning

If None, the estimator’s default scorer (if available) is used. Most scikit-learn estimators will convert large Dask arrays to a single NumPy array, which may exhaust the memory of your worker. You probably want to always specify scoring.

predict_meta: pd.Series, pd.DataFrame, np.array, default: None (infer)

An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator's predict call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.

predict_proba_meta: pd.Series, pd.DataFrame, np.array, default: None (infer)

An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator's predict_proba call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.

transform_meta: pd.Series, pd.DataFrame, np.array, default: None (infer)

An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator's transform call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.

See also

Incremental
dask_ml.model_selection.IncrementalSearch

Notes

Warning

This class is not appropriate for parallel or distributed training on large datasets. For that, see Incremental, which provides distributed (but sequential) training. If you’re doing distributed hyperparameter optimization on larger-than-memory datasets, see dask_ml.model_selection.IncrementalSearch.

This estimator does not parallelize the training step. It simply calls the underlying estimator's fit method and copies the learned attributes over to self afterwards.

It is helpful for situations where your training dataset is relatively small (fits on a single machine) but you need to predict or transform a much larger dataset. predict, predict_proba and transform will be done in parallel (potentially distributed if you’ve connected to a dask.distributed.Client).

Note that many scikit-learn estimators already predict and transform in parallel. This meta-estimator may still be useful in those cases when your dataset is larger than memory, as the distributed scheduler will ensure the data isn’t all read into memory at once.
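The block-wise idea behind these notes can be sketched with plain dask.array and scikit-learn, without the wrapper. This is only an illustration of the general technique (fit once in memory, then apply predict to each chunk independently); it is not a description of how ParallelPostFit is implemented, and the helper predict_block and all sizes are hypothetical.

```python
import numpy as np
import dask.array as da
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Training happens once, in memory, on a small dataset.
X_small, y_small = make_classification(n_samples=500, n_features=10, random_state=0)
est = LogisticRegression(max_iter=1000).fit(X_small, y_small)

# A much larger input, split into chunks along the sample axis.
X_large = da.random.random((20_000, 10), chunks=(2_000, 10))


def predict_block(block, estimator):
    # Hypothetical helper: called once per chunk with a NumPy array.
    return estimator.predict(block)


lazy_pred = X_large.map_blocks(
    predict_block,
    estimator=est,
    drop_axis=1,  # predictions are 1-D, so the feature axis is dropped
    meta=np.empty((0,), dtype=y_small.dtype),  # describes the output type
)

# Nothing has run yet; compute() evaluates the chunks, possibly in parallel.
pred = lazy_pred.compute()
```

Because each chunk is predicted independently, only a few chunks need to be in memory at a time, which is why this pattern suits inputs that are larger than memory.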

Examples

>>> from sklearn.ensemble import GradientBoostingClassifier
>>> import sklearn.datasets
>>> import dask_ml.datasets
>>> from dask_ml.wrappers import ParallelPostFit

Make a small, 1,000-sample, two-class training dataset and fit normally.

>>> X, y = sklearn.datasets.make_classification(n_samples=1000,
...                                             random_state=0)
>>> clf = ParallelPostFit(estimator=GradientBoostingClassifier(),
...                       scoring='accuracy')
>>> clf.fit(X, y)
ParallelPostFit(estimator=GradientBoostingClassifier(...))
>>> clf.classes_
array([0, 1])

Transform and predict return dask outputs for dask inputs.

>>> X_big, y_big = dask_ml.datasets.make_classification(n_samples=100000,
...                                                     chunks=1000,
...                                                     random_state=0)
>>> clf.predict(X_big)
dask.array<predict, shape=(100000,), dtype=int64, chunksize=(1000,)>

The result can be computed in parallel.

>>> clf.predict_proba(X_big).compute()
array([[0.99141094, 0.00858906],
       [0.93178389, 0.06821611],
       [0.99129105, 0.00870895],
       ...,
       [0.97996652, 0.02003348],
       [0.98087444, 0.01912556],
       [0.99407016, 0.00592984]])

Methods

fit(X[, y])

Fit the underlying estimator.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

predict(X)

Predict for X.

predict_log_proba(X)

Log of probability estimates.

predict_proba(X)

Probability estimates.

score(X, y[, compute])

Returns the score on the given data.

set_params(**params)

Set the parameters of this estimator.

set_score_request(*[, compute])

Request metadata passed to the score method.

transform(X)

Transform block or partition-wise for dask inputs.

partial_fit

__init__(estimator=None, scoring=None, predict_meta=None, predict_proba_meta=None, transform_meta=None)