Pipelines and Composite Estimators

Dask-ML estimators follow the scikit-learn API, so estimators such as dask_ml.decomposition.PCA can be placed inside a regular sklearn.pipeline.Pipeline.

See http://scikit-learn.org/dev/modules/compose.html for more on using pipelines in general.

In [1]: from sklearn.pipeline import Pipeline  # regular scikit-learn pipeline

In [2]: from dask_ml.cluster import KMeans

In [3]: from dask_ml.decomposition import PCA

In [4]: estimators = [('reduce_dim', PCA()), ('cluster', KMeans())]

In [5]: pipe = Pipeline(estimators)

In [6]: pipe
Out[6]: Pipeline(steps=[('reduce_dim', PCA()), ('cluster', KMeans())])

The pipeline pipe can now be fit on and applied to Dask arrays, using the same fit and predict calls as any other scikit-learn pipeline.

ColumnTransformer for Heterogeneous Data

dask_ml.compose.ColumnTransformer is a clone of sklearn.compose.ColumnTransformer that also works with Dask objects such as Dask DataFrames and Dask arrays.

See http://scikit-learn.org/dev/modules/compose.html#columntransformer-for-heterogeneous-data for an introduction to ColumnTransformer.