Pipelines and Composite Estimators

Dask-ML estimators follow the scikit-learn API. This means Dask-ML estimators like dask_ml.decomposition.PCA can be placed inside a regular sklearn.pipeline.Pipeline.

See http://scikit-learn.org/dev/modules/compose.html for more on using pipelines in general.

In [1]: from sklearn.pipeline import Pipeline  # regular scikit-learn pipeline

In [2]: from dask_ml.cluster import KMeans

In [3]: from dask_ml.decomposition import PCA

In [4]: estimators = [('reduce_dim', PCA()), ('cluster', KMeans())]

In [5]: pipe = Pipeline(estimators)

In [6]: pipe
Out[6]: 
Pipeline(memory=None,
         steps=[('reduce_dim',
                 PCA(copy=True, iterated_power=0, n_components=None,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('cluster',
                 KMeans(algorithm='full', copy_x=True, init='k-means||',
                        init_max_iter=None, max_iter=300, n_clusters=8,
                        n_jobs=1, oversampling_factor=2,
                        precompute_distances='auto', random_state=None,
                        tol=0.0001))])

The pipeline pipe can now be fit and used for prediction on Dask arrays, just like a single Dask-ML estimator.
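As a sketch of how fitting the pipeline looks, the example below uses scikit-learn's own PCA and KMeans on a NumPy array as stand-ins; with the Dask-ML estimators from the session above you would pass a dask.array instead (for example `da.random.random((10000, 20), chunks=(1000, 20))`), and the data, shapes, and parameter values here are illustrative only.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Illustrative data: a NumPy array standing in for a Dask array.
rng = np.random.RandomState(0)
X = rng.random_sample((100, 20))

# Same (name, estimator) step structure as the session above.
pipe = Pipeline([
    ('reduce_dim', PCA(n_components=3)),
    ('cluster', KMeans(n_clusters=8, n_init=10, random_state=0)),
])

# fit() runs PCA.fit_transform on X, then fits KMeans on the reduced data;
# predict() applies the same PCA projection before assigning clusters.
pipe.fit(X)
labels = pipe.predict(X)
print(labels.shape)  # one cluster label per row: (100,)
```

Step names like `'reduce_dim'` also let you address nested parameters in a grid search, e.g. `reduce_dim__n_components`.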

ColumnTransformer for Heterogeneous Data

dask_ml.compose.ColumnTransformer is a port of the scikit-learn estimator of the same name that also works with Dask DataFrames and arrays.

See http://scikit-learn.org/dev/modules/compose.html#columntransformer-for-heterogeneous-data for an introduction to ColumnTransformer.
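To sketch the usage pattern, the example below applies scikit-learn's own ColumnTransformer to a small pandas DataFrame; dask_ml.compose.ColumnTransformer takes the same (name, transformer, columns) specification and accepts a Dask DataFrame. The column names and data here are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Heterogeneous data: one categorical column, one numeric column.
df = pd.DataFrame({
    'city': ['NYC', 'SF', 'NYC', 'LA'],
    'temperature': [12.0, 18.5, 11.0, 22.3],
})

# Each transformer is applied only to its listed columns.
ct = ColumnTransformer([
    ('onehot', OneHotEncoder(), ['city']),
    ('scale', StandardScaler(), ['temperature']),
])

X = ct.fit_transform(df)
# Three one-hot city columns plus one scaled temperature column.
print(X.shape)  # (4, 4)
```

A ColumnTransformer built this way can itself be a step in a Pipeline, so per-column preprocessing and a final estimator are fit together.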