Parallel Meta-estimators

dask-ml provides some meta-estimators that parallelize and scale out certain tasks that may not be parallelizable within scikit-learn itself. For example, ParallelPostFit parallelizes the predict, predict_proba, and transform methods, enabling them to work on large (possibly larger-than-memory) datasets.

Parallel Prediction and Transformation

wrappers.ParallelPostFit is a meta-estimator for parallelizing post-fit tasks like prediction and transformation. It can wrap any scikit-learn estimator to provide parallel predict, predict_proba, and transform methods.

Warning

ParallelPostFit does not parallelize the training step. The underlying estimator’s .fit method is called normally.

Since only the predict, predict_proba, and transform methods are wrapped, wrappers.ParallelPostFit is most useful when your training dataset is relatively small (fits in a single machine’s memory) and prediction or transformation must be done on a much larger dataset (perhaps larger than a single machine’s memory).

In [1]: from sklearn.ensemble import GradientBoostingClassifier

In [2]: import sklearn.datasets

In [3]: import dask_ml.datasets

In [4]: from dask_ml.wrappers import ParallelPostFit

In this example, we’ll make a small 1,000-sample training dataset.

In [5]: X, y = sklearn.datasets.make_classification(n_samples=1000,
   ...:                                             random_state=0)
   ...: 

Training is identical to just calling estimator.fit(X, y). Aside from copying over learned attributes, that’s all that ParallelPostFit does.

In [6]: clf = ParallelPostFit(estimator=GradientBoostingClassifier())

In [7]: clf.fit(X, y)
Out[7]: ParallelPostFit(estimator=GradientBoostingClassifier())

This class is useful for predicting on or transforming large datasets. We’ll make a larger dask array, X_big, with 10,000 samples per block.

In [8]: X_big, _ = dask_ml.datasets.make_classification(n_samples=100000,
   ...:                                                 chunks=10000,
   ...:                                                 random_state=0)
   ...: 

In [9]: clf.predict(X_big)
Out[9]: dask.array<_predict, shape=(100000,), dtype=int64, chunksize=(10000,), chunktype=numpy.ndarray>

This returned a dask.array. Like any dask array, the result is lazy; calling .compute() causes the scheduler to execute the tasks in parallel. If you’ve connected to a dask.distributed.Client, the computation will be parallelized across your cluster of machines.

In [10]: clf.predict_proba(X_big).compute()[:10]
Out[10]: 
array([[0.1893683 , 0.8106317 ],
       [0.00464304, 0.99535696],
       [0.00902734, 0.99097266],
       [0.01299259, 0.98700741],
       [0.92958233, 0.07041767],
       [0.8115279 , 0.1884721 ],
       [0.8941099 , 0.1058901 ],
       [0.0203811 , 0.9796189 ],
       [0.021179  , 0.978821  ],
       [0.97098636, 0.02901364]])

See parallelizing prediction for an example of how this scales for a support vector classifier.

Comparison to other Estimators in dask-ml

dask-ml re-implements some estimators from scikit-learn, for example dask_ml.cluster.KMeans or dask_ml.preprocessing.QuantileTransformer. This raises the question: should you use the re-implemented dask-ml version, or wrap the scikit-learn version in a meta-estimator? The answer varies from estimator to estimator, and depends on your tolerance for approximate solutions and the size of your training data. In general, if your training data is small, you should be fine wrapping the scikit-learn version with a dask-ml meta-estimator.
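
As a rough sketch of the two approaches, using QuantileTransformer as an example (X is the small in-memory training set and X_big the large dask array from above):

from sklearn.preprocessing import QuantileTransformer
import dask_ml.preprocessing
from dask_ml.wrappers import ParallelPostFit

# Option 1: wrap the scikit-learn estimator. Fitting happens in memory on
# the small training set; only transform is parallelized over X_big.
wrapped = ParallelPostFit(estimator=QuantileTransformer())
wrapped.fit(X)
X_big_t = wrapped.transform(X_big)  # lazy dask array

# Option 2: use the dask-ml re-implementation, which can also fit on a
# dask array when the training data itself is large (possibly trading
# exactness for scalability).
qt = dask_ml.preprocessing.QuantileTransformer()
X_big_t = qt.fit_transform(X_big)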