dask_ml.wrappers.ParallelPostFit
class dask_ml.wrappers.ParallelPostFit(estimator=None, scoring=None, predict_meta=None, predict_proba_meta=None, transform_meta=None)
 Meta-estimator for parallel predict and transform.
- Parameters
 - estimator : Estimator
 The underlying estimator that is fit.
- scoring : string or callable, optional
 A single string (see The scoring parameter: defining model evaluation rules) or a callable (see scoring) to evaluate the predictions on the test set.
For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.
NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.
See Specifying multiple metrics for evaluation for an example.
Warning
If None, the estimator’s default scorer (if available) is used. Most scikit-learn estimators will convert large Dask arrays to a single NumPy array, which may exhaust the memory of your worker. You probably want to always specify scoring.
- predict_meta : pd.Series, pd.DataFrame, or np.array, default: None (infer)
 An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator’s predict call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.
- predict_proba_meta : pd.Series, pd.DataFrame, or np.array, default: None (infer)
 An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator’s predict_proba call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.
- transform_meta : pd.Series, pd.DataFrame, or np.array, default: None (infer)
 An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator’s transform call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.
See also
Incremental
dask_ml.model_selection.IncrementalSearch
Notes
Warning
This class is not appropriate for parallel or distributed training on large datasets. For that, see Incremental, which provides distributed (but sequential) training. If you’re doing distributed hyperparameter optimization on larger-than-memory datasets, see dask_ml.model_selection.IncrementalSearch.
This estimator does not parallelize the training step. It simply calls the underlying estimator’s fit method and copies the learned attributes over to self afterwards.
It is helpful for situations where your training dataset is relatively small (fits on a single machine) but you need to predict or transform a much larger dataset. predict, predict_proba and transform will be done in parallel (potentially distributed if you’ve connected to a dask.distributed.Client).
Note that many scikit-learn estimators already predict and transform in parallel. This meta-estimator may still be useful in those cases when your dataset is larger than memory, as the distributed scheduler will ensure the data isn’t all read into memory at once.
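The idea behind the notes above can be sketched with plain dask.array and scikit-learn: fit on a small in-memory dataset, then apply the fitted model to each chunk of a larger array. This is a conceptual sketch only, not dask-ml's actual implementation; the helper name predict_block and the data sizes are hypothetical.

```python
import numpy as np
import dask.array as da
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train on a small dataset that fits in memory.
X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# A larger, chunked array (here just the training data repeated 10 times).
X_big = da.from_array(np.tile(X, (10, 1)), chunks=(1000, X.shape[1]))


def predict_block(block, estimator):
    """Hypothetical helper: run the fitted estimator on one chunk."""
    return estimator.predict(block)


# Each chunk is predicted independently; the feature axis is dropped
# because predict returns one label per row.
lazy_labels = X_big.map_blocks(
    predict_block, estimator=model, drop_axis=1, dtype=y.dtype
)

result = lazy_labels.compute()
print(result.shape)
```

Because each chunk is processed independently, only the chunks currently being predicted need to be in memory, which is why the approach suits prediction on data much larger than the training set.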
Examples
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> import sklearn.datasets
>>> import dask_ml.datasets
>>> from dask_ml.wrappers import ParallelPostFit
Make a small 1,000-sample training dataset and fit normally.
>>> X, y = sklearn.datasets.make_classification(n_samples=1000,
...                                             random_state=0)
>>> clf = ParallelPostFit(estimator=GradientBoostingClassifier(),
...                       scoring='accuracy')
>>> clf.fit(X, y)
ParallelPostFit(estimator=GradientBoostingClassifier(...))
>>> clf.classes_
array([0, 1])
Transform and predict return dask outputs for dask inputs.
>>> X_big, y_big = dask_ml.datasets.make_classification(n_samples=100000,
...                                                     chunks=1000,
...                                                     random_state=0)
>>> clf.predict(X_big)
dask.array<predict, shape=(100000,), dtype=int64, chunksize=(1000,)>
Which can be computed in parallel.
>>> clf.predict_proba(X_big).compute()
array([[0.99141094, 0.00858906],
       [0.93178389, 0.06821611],
       [0.99129105, 0.00870895],
       ...,
       [0.97996652, 0.02003348],
       [0.98087444, 0.01912556],
       [0.99407016, 0.00592984]])
Methods
fit(X[, y]): Fit the underlying estimator.
get_metadata_routing(): Get metadata routing of this object.
get_params([deep]): Get parameters for this estimator.
predict(X): Predict for X.
predict_log_proba(X): Log of probability estimates.
predict_proba(X): Probability estimates.
score(X, y[, compute]): Returns the score on the given data.
set_params(**params): Set the parameters of this estimator.
set_score_request(*[, compute]): Configure whether metadata should be requested to be passed to the score method.
transform(X): Transform block or partition-wise for dask inputs.
partial_fit
__init__(estimator=None, scoring=None, predict_meta=None, predict_proba_meta=None, transform_meta=None)