Cross Validation

See the scikit-learn cross validation documentation for a fuller discussion of cross validation. This document only describes the extensions made to support Dask arrays.

The simplest way to split one or more Dask arrays is with dask_ml.model_selection.train_test_split():

In [1]: import dask.array as da

In [2]: from dask_ml.datasets import make_regression

In [3]: from dask_ml.model_selection import train_test_split

In [4]: X, y = make_regression(n_samples=125, n_features=4, random_state=0, chunks=50)

In [5]: X
Out[5]: dask.array<normal, shape=(125, 4), dtype=float64, chunksize=(50, 4), chunktype=numpy.ndarray>

The interface for splitting Dask arrays is the same as scikit-learn’s version.

In [6]: X_train, X_test, y_train, y_test = train_test_split(X, y)

In [7]: X_train  # A dask Array
Out[7]: dask.array<concatenate, shape=(112, 4), dtype=float64, chunksize=(45, 4), chunktype=numpy.ndarray>

In [8]: X_train.compute()[:3]
Out[8]: 
array([[ 0.46698325,  0.14795648,  0.00401895, -0.25118474],
       [ 0.83154369,  0.17831678, -0.11855249,  0.11225947],
       [ 1.4136866 , -0.46491099, -0.21560393,  0.38860571]])

While it’s possible to pass Dask arrays to sklearn.model_selection.train_test_split(), we recommend using the Dask version, which is faster for two reasons:

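First, the Dask version shuffles blockwise: rows are shuffled within each block rather than across blocks. In a distributed setting, shuffling between blocks may require sending large amounts of data between machines, which can be slow. However, if there’s a strong pattern in your data (for example, rows sorted by the target), a blockwise shuffle will not break it up, and you’ll want to perform a full shuffle.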
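To make the idea of a blockwise shuffle concrete, here is a minimal pure-Python sketch. It is not dask-ml's implementation: the function name blockwise_train_test_split, its parameters, and the rounding rule (rounding the test count up, as scikit-learn does for a fractional test_size) are assumptions made for illustration. The chunk layout (50, 50, 25) and test_size=0.1 are chosen to match the 125-row example above.

```python
import math
import random


def blockwise_train_test_split(chunk_sizes, test_size=0.1, seed=0):
    """Split global row indices into train/test sets one block at a time.

    Hypothetical helper for illustration only. Returns two lists of lists
    of global row indices, with one inner list per block.
    """
    rng = random.Random(seed)
    train_blocks, test_blocks = [], []
    offset = 0
    for size in chunk_sizes:
        # Shuffle only the rows that live in this block; no row moves
        # to another block, so no data crosses block boundaries.
        local = list(range(offset, offset + size))
        rng.shuffle(local)
        # Assumed rounding rule: round the per-block test count up.
        n_test = math.ceil(size * test_size)
        test_blocks.append(local[:n_test])
        train_blocks.append(local[n_test:])
        offset += size
    return train_blocks, test_blocks


# 125 rows stored in chunks of 50, 50, and 25 rows.
train_blocks, test_blocks = blockwise_train_test_split([50, 50, 25])
print([len(b) for b in train_blocks])
print([len(b) for b in test_blocks])
```

Every index in a given train or test block comes from the same input block, which is why no data has to move between blocks. It is also why a pattern that spans blocks (such as globally sorted rows) survives this kind of split.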

Second, the Dask version avoids allocating large intermediate NumPy arrays storing the indexes for slicing. For very large datasets, creating and transmitting np.arange(n_samples) can be expensive.
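As a rough back-of-the-envelope check on that cost, the sketch below estimates the size of a full index array. The helper name index_array_bytes is hypothetical, and the 8-byte item size assumes 64-bit integer indices, which is typical but platform-dependent.

```python
def index_array_bytes(n_samples, itemsize=8):
    """Bytes needed to hold one integer index per sample.

    Hypothetical helper; itemsize=8 assumes 64-bit integer indices.
    """
    return n_samples * itemsize


for n in (125, 10**6, 10**9):
    # Report the estimated size of an index array with n entries.
    print(n, index_array_bytes(n) / 1e9, "GB")
```

Under that assumption, a billion-row dataset needs about 8 GB for the index array alone, before any slicing or shuffling takes place.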
