dask_ml.wrappers.ParallelPostFit
- class dask_ml.wrappers.ParallelPostFit(estimator=None, scoring=None, predict_meta=None, predict_proba_meta=None, transform_meta=None)
Meta-estimator for parallel predict and transform.
- Parameters
- estimator: Estimator
The underlying estimator that is fit.
- scoring: string or callable, optional
A single string (see The scoring parameter: defining model evaluation rules) or a callable (see Defining your scoring strategy from metric functions) to evaluate the predictions on the test set.
For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.
NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.
See Specifying multiple metrics for evaluation for an example.
Warning
If None, the estimator’s default scorer (if available) is used. Most scikit-learn estimators will convert large Dask arrays to a single NumPy array, which may exhaust the memory of your worker. You probably want to always specify scoring.
- predict_meta: pd.Series, pd.DataFrame, np.array, default: None (infer)
An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator's predict call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.
- predict_proba_meta: pd.Series, pd.DataFrame, np.array, default: None (infer)
An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator's predict_proba call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.
- transform_meta: pd.Series, pd.DataFrame, np.array, default: None (infer)
An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator's transform call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.
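As a minimal sketch of what such "empty" meta objects can look like, the snippet below builds zero-length NumPy and pandas objects that carry only type information. The dtypes, the two probability columns, and the column names `feature_a` and `feature_b` are illustrative assumptions, not values required by the library; match them to what your own estimator returns.

```python
import numpy as np
import pandas as pd

# Zero-length objects: they hold no data, only the type information
# (dtype, dimensionality, column names) that dask can use up front.

# 1-D integer array, e.g. class labels returned by predict.
predict_meta = np.empty((0,), dtype="int64")

# 2-D float array, e.g. one probability column per class from predict_proba
# (two columns is an assumption for a binary classifier).
predict_proba_meta = np.empty((0, 2), dtype="float64")

# Empty DataFrame with hypothetical column names, for a transformer
# whose transform returns a pandas DataFrame.
transform_meta = pd.DataFrame(
    {
        "feature_a": pd.Series(dtype="float64"),
        "feature_b": pd.Series(dtype="float64"),
    }
)
```

Objects like these would then be passed through the constructor keywords shown in the class signature, for example `ParallelPostFit(estimator=..., predict_meta=predict_meta)`.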
See also
Incremental
dask_ml.model_selection.IncrementalSearch
Notes
Warning
This class is not appropriate for parallel or distributed training on large datasets. For that, see Incremental, which provides distributed (but sequential) training. If you’re doing distributed hyperparameter optimization on larger-than-memory datasets, see dask_ml.model_selection.IncrementalSearch.
This estimator does not parallelize the training step. It simply calls the underlying estimator’s fit method and copies the learned attributes over to self afterwards.
It is helpful for situations where your training dataset is relatively small (fits on a single machine) but you need to predict or transform a much larger dataset. predict, predict_proba, and transform will be done in parallel (potentially distributed if you’ve connected to a dask.distributed.Client).
Note that many scikit-learn estimators already predict and transform in parallel. This meta-estimator may still be useful in those cases when your dataset is larger than memory, as the distributed scheduler will ensure the data isn’t all read into memory at once.
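The "train small, predict large" pattern described in these notes can be illustrated with a conceptual sketch that uses only NumPy and scikit-learn. This is not how ParallelPostFit is implemented; it only shows why post-fit prediction splits cleanly into independent blocks. The choice of LogisticRegression, the dataset sizes, and the block count are arbitrary assumptions for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small training set that fits comfortably in memory.
X_train, y_train = make_classification(n_samples=1000, random_state=0)
estimator = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Stand-in for a much larger dataset, split into row blocks.
X_large, _ = make_classification(n_samples=10_000, random_state=1)
blocks = np.array_split(X_large, 10)

# Predict each block independently; no block depends on another, which is
# what makes this step easy to run in parallel or on a cluster.
block_predictions = [estimator.predict(block) for block in blocks]
predictions = np.concatenate(block_predictions)
```

With dask inputs, the wrapper takes care of this block-wise dispatch and returns a lazy dask collection rather than looping eagerly as the sketch does.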
Examples
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> import sklearn.datasets
>>> import dask_ml.datasets
>>> from dask_ml.wrappers import ParallelPostFit
Make a small, 1,000-sample, two-class training dataset and fit normally.
>>> X, y = sklearn.datasets.make_classification(n_samples=1000,
...                                             random_state=0)
>>> clf = ParallelPostFit(estimator=GradientBoostingClassifier(),
...                       scoring='accuracy')
>>> clf.fit(X, y)
ParallelPostFit(estimator=GradientBoostingClassifier(...))
>>> clf.classes_
array([0, 1])
Transform and predict return dask outputs for dask inputs.
>>> X_big, y_big = dask_ml.datasets.make_classification(n_samples=100000,
...                                                     random_state=0)
>>> clf.predict(X_big)
dask.array<predict, shape=(100000,), dtype=int64, chunksize=(1000,)>
Which can be computed in parallel.
>>> clf.predict_proba(X_big).compute()
array([[0.99141094, 0.00858906],
       [0.93178389, 0.06821611],
       [0.99129105, 0.00870895],
       ...,
       [0.97996652, 0.02003348],
       [0.98087444, 0.01912556],
       [0.99407016, 0.00592984]])
Methods
fit(X[, y])                      Fit the underlying estimator.
get_metadata_routing()           Get metadata routing of this object.
get_params([deep])               Get parameters for this estimator.
predict(X)                       Predict for X.
predict_log_proba(X)             Log of probability estimates.
predict_proba(X)                 Probability estimates.
score(X, y[, compute])           Returns the score on the given data.
set_params(**params)             Set the parameters of this estimator.
set_score_request(*[, compute])  Request metadata passed to the score method.
transform(X)                     Transform block or partition-wise for dask inputs.
partial_fit
- __init__(estimator=None, scoring=None, predict_meta=None, predict_proba_meta=None, transform_meta=None)