.. _joblib:

Scikit-Learn & Joblib
=====================

Many Scikit-Learn algorithms are written for parallel execution using
`Joblib <https://joblib.readthedocs.io/>`__, which natively provides
thread-based and process-based parallelism. Joblib is what backs the
``n_jobs=`` parameter in normal use of Scikit-Learn.

Dask can scale these Joblib-backed algorithms out to a cluster of machines by
providing an alternative Joblib backend. The following video demonstrates how
to use Dask to parallelize a grid search across a cluster.

.. raw:: html

To use the Dask backend to Joblib you have to create a Client and wrap your
code with ``joblib.parallel_backend('dask')``:

.. code-block:: python

    from dask.distributed import Client
    import joblib

    client = Client(processes=False)             # create local cluster
    # client = Client("scheduler-address:8786")  # or connect to remote cluster

    with joblib.parallel_backend('dask'):
        # Your scikit-learn code

As an example, you might distribute a randomized, cross-validated parameter
search as follows:

.. code-block:: python

    import numpy as np
    from dask.distributed import Client

    import joblib
    from sklearn.datasets import load_digits
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.svm import SVC

    client = Client(processes=False)             # create local cluster

    digits = load_digits()

    param_space = {
        'C': np.logspace(-6, 6, 13),
        'gamma': np.logspace(-8, 8, 17),
        'tol': np.logspace(-4, -1, 4),
        'class_weight': [None, 'balanced'],
    }

    model = SVC(kernel='rbf')
    search = RandomizedSearchCV(model, param_space, cv=3, n_iter=50, verbose=10)

    with joblib.parallel_backend('dask'):
        search.fit(digits.data, digits.target)

Note that the Dask joblib backend is useful for scaling out CPU-bound
workloads: workloads with datasets that fit in RAM but that have many
individual operations that can be done in parallel. To scale out to RAM-bound
workloads (larger-than-memory datasets), use one of the following
alternatives:

* :ref:`parallel-meta-estimators`
* :ref:`hyperparameter.incremental`
* or one of the estimators from the :ref:`api`
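The pages linked above cover these larger-than-memory approaches in detail. As
a rough sketch only (assuming ``dask-ml`` is installed and the input is a
chunked Dask array rather than this page's in-memory digits data), the
``Incremental`` meta-estimator streams each chunk through an estimator's
``partial_fit`` instead of routing Joblib tasks to the cluster:

.. code-block:: python

    # Sketch of the incremental-learning alternative, not the Joblib backend
    # shown above; assumes dask-ml is installed.
    from dask_ml.datasets import make_classification
    from dask_ml.wrappers import Incremental
    from sklearn.linear_model import SGDClassifier

    # A chunked (potentially larger-than-memory) Dask array dataset
    X, y = make_classification(n_samples=10_000, n_features=20, chunks=1_000)

    # Incremental calls SGDClassifier.partial_fit on each chunk in turn
    model = Incremental(SGDClassifier(), scoring='accuracy')
    model.fit(X, y, classes=[0, 1])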