.. _parallel-meta-estimators:

.. currentmodule:: dask_ml

Parallel Meta-estimators
========================

dask-ml provides some meta-estimators that parallelize and scale out
certain tasks that may not be parallelized within scikit-learn itself.
For example, :class:`~wrappers.ParallelPostFit` will parallelize the
``predict``, ``predict_proba``, and ``transform`` methods, enabling them
to work on large (possibly larger-than-memory) datasets.

Parallel Prediction and Transformation
''''''''''''''''''''''''''''''''''''''

:class:`wrappers.ParallelPostFit` is a meta-estimator for parallelizing
post-fit tasks like prediction and transformation. It can wrap any
scikit-learn estimator to provide parallel ``predict``, ``predict_proba``,
and ``transform`` methods.

.. warning::

   ``ParallelPostFit`` does *not* parallelize the training step. The
   underlying estimator's ``.fit`` method is called normally.

Since just the ``predict``, ``predict_proba``, and ``transform`` methods are
wrapped, :class:`wrappers.ParallelPostFit` is most useful in situations where
your training dataset is relatively small (it fits in a single machine's
memory) but prediction or transformation must be done on a much larger
dataset (perhaps larger than a single machine's memory).

.. ipython:: python

   from sklearn.ensemble import GradientBoostingClassifier
   import sklearn.datasets
   import dask_ml.datasets
   from dask_ml.wrappers import ParallelPostFit

In this example, we'll make a small 1,000-sample training dataset.

.. ipython:: python

   X, y = sklearn.datasets.make_classification(n_samples=1000, random_state=0)

Training is identical to just calling ``estimator.fit(X, y)``. Aside from
copying over learned attributes, that's all that ``ParallelPostFit`` does.

.. ipython:: python

   clf = ParallelPostFit(estimator=GradientBoostingClassifier())
   clf.fit(X, y)

This class is useful for predicting on or transforming large datasets.
We'll make a larger dask array ``X_big`` with 10,000 samples per block.

.. ipython:: python
   :okwarning:

   X_big, _ = dask_ml.datasets.make_classification(n_samples=100000, chunks=10000, random_state=0)
   clf.predict(X_big)

This returned a ``dask.array``. As with any dask array, calling ``compute``
causes the scheduler to execute the tasks in parallel. If you've connected
to a ``dask.distributed.Client``, the computation will be parallelized
across your cluster of machines.

.. ipython:: python
   :okexcept:

   clf.predict_proba(X_big).compute()[:10]

See `parallelizing prediction`_ for an example of how this scales for a
support vector classifier.

Comparison to other Estimators in dask-ml
'''''''''''''''''''''''''''''''''''''''''

``dask-ml`` re-implements some estimators from scikit-learn, for example
:class:`dask_ml.cluster.KMeans` or
:class:`dask_ml.preprocessing.QuantileTransformer`. This raises the question:
should you use the reimplemented dask-ml version, or wrap the scikit-learn
version in a meta-estimator? It varies from estimator to estimator, and
depends on your tolerance for approximate solutions and the size of your
training data. In general, if your training data is small, you should be
fine wrapping the scikit-learn version with a ``dask-ml`` meta-estimator.

.. _learning curve: http://scikit-learn.org/stable/modules/learning_curve.html
.. _parallelizing prediction: http://dask-ml-benchmarks.readthedocs.io/en/latest/auto_examples/plot_parallel_post_fit_scaling.html
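
To make that trade-off concrete, here is a minimal sketch (not executed as
part of these docs) contrasting the two approaches on the same data. It
assumes dask-ml and scikit-learn are installed; the variable names
(``X``, ``wrapped``, ``dqt``) are illustrative only.

.. code-block:: python

   import dask.array as da
   from sklearn.preprocessing import QuantileTransformer
   from dask_ml.preprocessing import QuantileTransformer as DaskQuantileTransformer
   from dask_ml.wrappers import ParallelPostFit

   # An illustrative chunked dataset (kept small enough to run anywhere).
   X = da.random.uniform(size=(100000, 5), chunks=(10000, 5))

   # Option 1: wrap the scikit-learn estimator. ``fit`` happens in memory
   # on a single machine; ``transform`` is then applied block-wise in parallel.
   wrapped = ParallelPostFit(estimator=QuantileTransformer())
   wrapped.fit(X.compute())            # training data must fit in memory
   Xt_wrapped = wrapped.transform(X)   # lazy dask array, computed block-wise

   # Option 2: use the dask-ml re-implementation, which fits on the dask
   # array directly without pulling it into a single machine's memory.
   dqt = DaskQuantileTransformer()
   Xt_dask = dqt.fit_transform(X)      # also a lazy dask array

The wrapped version gives exactly the scikit-learn result but requires the
training data to fit in memory; the dask-ml version can train on the dask
array directly, typically at the cost of an approximate fit.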