dask_ml.datasets.make_blobs

dask_ml.datasets.make_blobs(n_samples=100, n_features=2, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, chunks=None)

Generate isotropic Gaussian blobs for clustering.

This can be used to generate very large Dask arrays on a cluster of machines. When using Dask in distributed mode, the client machine only needs to allocate a single block’s worth of data.
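
To get a feel for what "a single block's worth of data" means, the arithmetic below estimates the in-memory size of one block and of the whole array. It is only a back-of-the-envelope sketch: `block_nbytes` is a hypothetical helper, not part of dask-ml, and it assumes dense float64 data (8 bytes per element), as in the Examples section.

```python
# Hypothetical helper (not part of dask-ml): estimate the size of a dense block.
FLOAT64_BYTES = 8  # assumption: samples are stored as float64


def block_nbytes(rows_per_block, n_features, itemsize=FLOAT64_BYTES):
    """Bytes needed to hold one block of shape (rows_per_block, n_features)."""
    return rows_per_block * n_features * itemsize


# Sizes taken from the Examples section: 100000 samples in blocks of 10000 rows.
n_samples, n_features, rows_per_block = 100_000, 2, 10_000

one_block = block_nbytes(rows_per_block, n_features)
whole_array = block_nbytes(n_samples, n_features)
print(one_block, whole_array)
```

Under these assumptions a block is one tenth the size of the full array, which is why the choice of chunks matters more as n_samples grows.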

Parameters
n_samples : int or array-like, optional (default=100)

If int, it is the total number of points equally divided among clusters. If array-like, each element of the sequence indicates the number of samples per cluster.
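
The two accepted forms of n_samples can be illustrated with a small plain-Python sketch. `split_samples` is a hypothetical helper written for this illustration, not dask-ml's implementation. In particular, giving the leftover points to the first clusters when the total does not divide evenly is an assumption.

```python
# Hypothetical helper (not part of dask-ml): per-cluster sample counts.
def split_samples(n_samples, n_centers):
    """Return the number of samples per cluster for an int or array-like n_samples."""
    if isinstance(n_samples, int):
        base, extra = divmod(n_samples, n_centers)
        # Assumption: the first `extra` clusters each receive one additional point.
        return [base + 1 if i < extra else base for i in range(n_centers)]
    counts = list(n_samples)
    if len(counts) != n_centers:
        raise ValueError("n_samples must have one entry per cluster")
    return counts


print(split_samples(100, 3))           # int: total divided among 3 clusters
print(split_samples([10, 20, 30], 3))  # array-like: counts used as given
```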

n_features : int, optional (default=2)

The number of features for each sample.

centers : int or array of shape [n_centers, n_features], optional (default=None)

The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array of length equal to the length of n_samples.

cluster_std : float or sequence of floats, optional (default=1.0)

The standard deviation of the clusters.

center_box : pair of floats (min, max), optional (default=(-10.0, 10.0))

The bounding box for each cluster center when centers are generated at random.
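
As a rough illustration of what the bounding box constrains, the sketch below draws each coordinate of each center independently within (min, max). `random_centers` is a hypothetical helper, and the uniform distribution is an assumption made for this example. The text above only says that centers are generated at random inside the box.

```python
import random


# Hypothetical helper (not part of dask-ml): draw centers inside a bounding box.
def random_centers(n_centers, n_features, center_box=(-10.0, 10.0), seed=None):
    """Return n_centers points whose coordinates all lie within center_box."""
    rng = random.Random(seed)
    low, high = center_box
    # Assumption: each coordinate is drawn uniformly and independently.
    return [
        [rng.uniform(low, high) for _ in range(n_features)]
        for _ in range(n_centers)
    ]


centers = random_centers(n_centers=3, n_features=2, seed=0)
for center in centers:
    print(center)
```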

shuffle : boolean, optional (default=True)

Shuffle the samples.

random_state : int, RandomState instance or None, optional (default=None)

Determines random number generation for dataset creation. Pass an int for reproducible output across multiple function calls. See Glossary.

chunks : int, tuple

How to chunk the array. Must be one of the following forms:

- A blocksize like 1000.
- A blockshape like (1000, 1000).
- Explicit sizes of all blocks along all dimensions like ((1000, 1000, 500), (400, 400)).
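
The three forms describe the same thing at different levels of detail. The sketch below expands each of them into explicit per-dimension block sizes for a given array shape. `expand_chunks` is a hypothetical stand-in written in plain Python for illustration. Dask performs its own normalization internally, and this is not that code.

```python
# Hypothetical helpers (not part of dask or dask-ml).
def _split(length, blocksize):
    """Split `length` into blocks of `blocksize`, with a smaller final block if needed."""
    full, rest = divmod(length, blocksize)
    return (blocksize,) * full + ((rest,) if rest else ())


def expand_chunks(chunks, shape):
    """Expand a blocksize, blockshape, or explicit sizes into explicit block sizes."""
    if isinstance(chunks, int):
        # Blocksize: the same size is applied along every dimension.
        chunks = (chunks,) * len(shape)
    if len(chunks) != len(shape):
        raise ValueError("chunks must have one entry per dimension")
    explicit = tuple(
        _split(n, c) if isinstance(c, int) else tuple(c)
        for n, c in zip(shape, chunks)
    )
    for n, sizes in zip(shape, explicit):
        if sum(sizes) != n:
            raise ValueError("block sizes must sum to the dimension length")
    return explicit


shape = (2500, 800)
print(expand_chunks(1000, shape))                               # blocksize
print(expand_chunks((1000, 400), shape))                        # blockshape
print(expand_chunks(((1000, 1000, 500), (400, 400)), shape))    # explicit sizes
```

For a (2500, 800) array, the blockshape (1000, 400) and the explicit form ((1000, 1000, 500), (400, 400)) describe the same layout in this sketch.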

Returns
X : array of shape [n_samples, n_features]

The generated samples.

y : array of shape [n_samples]

The integer labels for cluster membership of each sample.

See also

make_classification : a more intricate variant

Examples

>>> from dask_ml.datasets import make_blobs
>>> X, y = make_blobs(n_samples=100000, chunks=10000)
>>> X
dask.array<..., shape=(100000, 2), dtype=float64, chunksize=(10000, 2)>
>>> y
dask.array<concatenate, shape=(100000,), dtype=int64, chunksize=(10000,)>