See the scikit-learn cross validation documentation for a fuller discussion of cross validation. This document only describes the extensions made to support Dask arrays.
The simplest way to split one or more Dask arrays is with dask_ml.model_selection.train_test_split():
   In : import dask.array as da

   In : from dask_ml.datasets import make_regression

   In : from dask_ml.model_selection import train_test_split

   In : X, y = make_regression(n_samples=125, n_features=4, random_state=0, chunks=50)

   In : X
   Out: dask.array<normal, shape=(125, 4), dtype=float64, chunksize=(50, 4), chunktype=numpy.ndarray>
The interface for splitting Dask arrays is the same as scikit-learn’s version.
   In : X_train, X_test, y_train, y_test = train_test_split(X, y)

   In : X_train  # A dask Array
   Out: dask.array<concatenate, shape=(112, 4), dtype=float64, chunksize=(45, 4), chunktype=numpy.ndarray>

   In : X_train.compute()[:3]
   Out:
   array([[ 1.4746071 ,  0.99089734,  0.19177484, -0.25725069],
          [ 1.4136866 , -0.46491099, -0.21560393,  0.38860571],
          [ 1.32715466, -0.6589019 ,  0.88253167,  0.53379402]])
While it’s possible to pass Dask arrays to sklearn.model_selection.train_test_split(), we recommend using the Dask version for performance reasons. There are two major differences that make the Dask version faster.
First, the Dask version shuffles blockwise. In a distributed setting, shuffling rows between blocks may require sending large amounts of data between machines, which can be slow; shuffling only within each block avoids that cost. Note, however, that if there’s a strong pattern in your data (for example, the rows are sorted), a blockwise shuffle may not be sufficient and you’ll want to perform a full shuffle.
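The distinction can be illustrated with a small NumPy sketch (the chunk size and variable names here are illustrative, not dask-ml internals): a blockwise shuffle permutes rows only within each chunk, so no row ever crosses a chunk boundary, while a full shuffle is a single global permutation that may move any row to any chunk.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(12)         # 12 "rows"
chunks = X.reshape(3, 4)  # illustrative chunking: 3 chunks of 4 rows

# Blockwise shuffle: permute rows *within* each chunk only.
# In a distributed setting, no data moves between machines.
blockwise = np.concatenate([rng.permutation(c) for c in chunks])

# Full shuffle: one global permutation; rows may land in any chunk,
# which distributed can mean large transfers between machines.
full = rng.permutation(X)

# Every row of a blockwise-shuffled chunk stays in its original chunk.
for i in range(3):
    assert set(blockwise[i * 4:(i + 1) * 4]) == set(range(i * 4, (i + 1) * 4))
```

The trade-off named in the text follows directly: the blockwise version is cheap, but if the rows were sorted to begin with, each chunk still contains only its original (sorted) range of rows.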
Second, the Dask version avoids allocating large intermediate NumPy arrays to store the indices used for slicing. For very large datasets, creating and transmitting np.arange(n_samples) can be expensive.
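To see why this matters, here is a rough NumPy sketch of the index-based approach (variable names are illustrative): a global permutation of all n_samples indices is materialized and then used to slice, so the index array alone costs O(n_samples) memory, and slicing a distributed array with it would also trigger a shuffle of data between chunks.

```python
import numpy as np

n_samples = 1_000_000
rng = np.random.default_rng(0)

# The full index array grows linearly with the dataset
# (several MB here) -- this allocation is what the Dask version avoids.
indices = rng.permutation(np.arange(n_samples))

test_size = int(0.25 * n_samples)
test_idx, train_idx = indices[:test_size], indices[test_size:]

# Slicing a distributed array with a global index array like this
# would move rows arbitrarily between chunks.
X_test_rows, X_train_rows = test_idx, train_idx
```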