Dask-ML provides scalable machine learning in Python using Dask alongside popular machine learning libraries like Scikit-Learn.

You can try Dask-ML on a small cloud instance by clicking the following button:

import dask.dataframe as dd
data = df[['age', 'income', 'married']]
labels = df['outcome']

lr = LogisticRegression()
lr.fit(data, labels)


What does this offer?¶

The overarching goal of Dask-ML is to enable scalable machine learning. See the navigation pane to the left for details on specific features.

How does this work?¶

Modern machine learning algorithms employ a wide variety of techniques. Scaling these requires a similarly wide variety of different approaches. Generally solutions fall into the following three categories:

Parallelize Scikit-Learn Directly¶

Scikit-Learn already provides parallel computing on a single machine with Joblib. Dask extends this parallelism to many machines in a cluster. This works well for modest data sizes but large computations, such as random forests, hyper-parameter optimization, and more.

from dask.distributed import Client
import joblib

client = Client()  # Connect to a Dask Cluster

# Your normal scikit-learn code here


Note that this is an active collaboration with the Scikit-Learn development team. This functionality is progressing quickly but is in a state of rapid change.

Reimplement Scalable Algorithms with Dask Array¶

Some machine learning algorithms are easy to write down as Numpy algorithms. In these cases we can replace Numpy arrays with Dask arrays to achieve scalable algorithms easily. This is employed for linear models, pre-processing, and clustering.

from dask_ml.preprocessing import Categorizer, DummyEncoder

lr = LogisticRegression()
lr.fit(data, labels)


Partner with other distributed libraries¶

Other machine learning libraries like XGBoost and TensorFlow already have distributed solutions that work quite well. Dask-ML makes no attempt to re-implement these systems. Instead, Dask-ML makes it easy to use normal Dask workflows to prepare and set up data, then it deploys XGBoost or Tensorflow alongside Dask, and hands the data over.

from dask_ml.xgboost import XGBRegressor

est = XGBRegressor(...)
est.fit(train, train_labels)