Parallel Meta-estimators
dask-ml provides some meta-estimators that parallelize and scale out certain tasks that may not be parallelized within scikit-learn itself. For example, ParallelPostFit will parallelize the predict, predict_proba, and transform methods, enabling them to work on large (possibly larger-than-memory) datasets.
Parallel Prediction and Transformation
wrappers.ParallelPostFit is a meta-estimator for parallelizing post-fit tasks like prediction and transformation. It can wrap any scikit-learn estimator to provide parallel predict, predict_proba, and transform methods.
Warning
ParallelPostFit does not parallelize the training step. The underlying estimator’s .fit method is called normally.
Since just the predict, predict_proba, and transform methods are wrapped, wrappers.ParallelPostFit is most useful in situations where your training dataset is relatively small (fits in a single machine’s memory), and prediction or transformation must be done on a much larger dataset (perhaps larger than a single machine’s memory).
In [1]: from sklearn.ensemble import GradientBoostingClassifier
In [2]: import sklearn.datasets
In [3]: import dask_ml.datasets
In [4]: from dask_ml.wrappers import ParallelPostFit
In this example, we’ll make a small 1,000-sample training dataset.
In [5]: X, y = sklearn.datasets.make_classification(n_samples=1000,
...: random_state=0)
...:
Training is identical to just calling estimator.fit(X, y). Aside from copying over learned attributes, that’s all that ParallelPostFit does.
In [6]: clf = ParallelPostFit(estimator=GradientBoostingClassifier())
In [7]: clf.fit(X, y)
Out[7]: ParallelPostFit(estimator=GradientBoostingClassifier())
This class is useful for predicting on or transforming large datasets. We’ll make a larger dask array X_big with 10,000 samples per block.
In [8]: X_big, _ = dask_ml.datasets.make_classification(n_samples=100000,
...: chunks=10000,
...: random_state=0)
...:
In [9]: clf.predict(X_big)
Out[9]: dask.array<_predict, shape=(100000,), dtype=int64, chunksize=(10000,), chunktype=numpy.ndarray>
This returned a dask.array. Like any dask array, the result is lazy; calling compute will cause the scheduler to execute the tasks in parallel. If you’ve connected to a dask.distributed.Client, the computation will be parallelized across your cluster of machines.
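A minimal sketch of that distributed setup, assuming a cluster is available (the Client call and scheduler address below are illustrative and not part of this example's session):

# Sketch: run the block-wise predictions on a Dask cluster.
from dask.distributed import Client

client = Client()                  # or Client("scheduler-address:8786") for an existing cluster
predictions = clf.predict(X_big)   # still lazy
result = predictions.compute()     # blocks are computed in parallel on the cluster's workers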
In [10]: clf.predict_proba(X_big).compute()[:10]
Out[10]:
array([[0.24519988, 0.75480012],
[0.00464304, 0.99535696],
[0.00902734, 0.99097266],
[0.01299259, 0.98700741],
[0.95094461, 0.04905539],
[0.75588097, 0.24411903],
[0.93088993, 0.06911007],
[0.0203811 , 0.9796189 ],
[0.021179 , 0.978821 ],
[0.96243287, 0.03756713]])
See parallelizing prediction for an example of how this scales for a support vector classifier.
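The transform method works the same way. As a rough sketch (the PCA transformer and the n_components value are assumptions chosen for illustration; any scikit-learn transformer could be wrapped):

# Sketch: wrapping a transformer so .transform returns a lazy dask array.
from sklearn.decomposition import PCA
from dask_ml.wrappers import ParallelPostFit

pca = ParallelPostFit(estimator=PCA(n_components=2))
pca.fit(X)                       # the small in-memory training data from above
X_big_2d = pca.transform(X_big)  # lazy dask array with shape (100000, 2)
X_big_2d.compute()               # evaluated block by block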
Comparison to other Estimators in dask-ml
dask-ml re-implements some estimators from scikit-learn, for example dask_ml.cluster.KMeans or dask_ml.preprocessing.QuantileTransformer. This raises the question: should I use the re-implemented dask-ml version, or should I wrap the scikit-learn version in a meta-estimator? The answer varies from estimator to estimator, and depends on your tolerance for approximate solutions and the size of your training data. In general, if your training data is small, you should be fine wrapping the scikit-learn version with a dask-ml meta-estimator.
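To make the trade-off concrete, here is a rough sketch of the two options for clustering; the array shape, sample size, and n_clusters value are assumptions for illustration only.

# Sketch: two ways to cluster a large dask array.
import dask.array as da
from dask_ml.wrappers import ParallelPostFit

X_large = da.random.random((1_000_000, 10), chunks=(100_000, 10))

# Option 1: the dask-ml re-implementation trains directly on the dask array.
from dask_ml.cluster import KMeans as DaskKMeans
km = DaskKMeans(n_clusters=8).fit(X_large)

# Option 2: wrap scikit-learn's KMeans, train on an in-memory sample,
# then assign clusters for the full array in parallel.
from sklearn.cluster import KMeans as SklearnKMeans
wrapped = ParallelPostFit(estimator=SklearnKMeans(n_clusters=8))
wrapped.fit(X_large[:50_000].compute())
labels = wrapped.predict(X_large)  # lazy dask array of cluster labels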