Cross Validation
================

See the `scikit-learn cross validation documentation`_ for a fuller discussion
of cross validation. This document only describes the extensions made to
support Dask arrays.

The simplest way to split one or more Dask arrays is with
:func:`dask_ml.model_selection.train_test_split`:

.. ipython:: python

   import dask.array as da
   from dask_ml.datasets import make_regression
   from dask_ml.model_selection import train_test_split

   X, y = make_regression(n_samples=125, n_features=4, random_state=0, chunks=50)
   X

The interface for splitting Dask arrays is the same as scikit-learn's version.

.. ipython:: python

   X_train, X_test, y_train, y_test = train_test_split(X, y)
   X_train  # A dask Array
   X_train.compute()[:3]

While it's possible to pass Dask arrays to
:func:`sklearn.model_selection.train_test_split`, we recommend the Dask
version, which is faster for two reasons.

First, **the Dask version shuffles blockwise**. In a distributed setting,
shuffling *between* blocks may require sending large amounts of data between
machines, which can be slow. Note, however, that if your data has a strong
ordering or pattern, a blockwise shuffle may not randomize it sufficiently,
and you'll want to perform a full shuffle.

Second, the Dask version avoids allocating large intermediate NumPy arrays
holding the indices used for slicing. For very large datasets, creating and
transmitting ``np.arange(n_samples)`` can be expensive.

.. _scikit-learn cross validation documentation: https://scikit-learn.org/stable/modules/cross_validation.html
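
Because the interface mirrors scikit-learn's, the familiar keywords carry over.
Below is a minimal sketch of a reproducible 80/20 split; it assumes the
scikit-learn-style ``test_size`` and ``random_state`` keywords, and the
variable names are only illustrative.

.. ipython:: python

   # Reproducible 80/20 split; the outputs remain lazy Dask arrays
   X_train, X_test, y_train, y_test = train_test_split(
       X, y, test_size=0.2, random_state=0
   )
   X_test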
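
If a blockwise shuffle isn't enough (for example, because the rows are sorted),
one option is to ask for shuffling across blocks. The sketch below assumes a
dask-ml version whose :func:`~dask_ml.model_selection.train_test_split`
forwards a ``blockwise`` keyword to the underlying
:class:`~dask_ml.model_selection.ShuffleSplit`; check your installed version's
API reference before relying on it.

.. code-block:: python

   # Assumed keyword: blockwise=False requests a full (cross-block) shuffle,
   # which randomizes more thoroughly but may move data between machines.
   X_train, X_test, y_train, y_test = train_test_split(
       X, y, blockwise=False, random_state=0
   )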