dask_ml.cluster.KMeans

class dask_ml.cluster.KMeans(n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, n_jobs=1, algorithm='full', init_max_iter=None)

Scalable KMeans for clustering

Parameters
n_clusters : int, default 8

Number of clusters to end up with.

init : {‘k-means||’, ‘k-means++’ or ndarray}

Method for center initialization, defaults to ‘k-means||’.

‘k-means||’ : selects the initial cluster centers in a scalable, parallel fashion using the k-means|| algorithm.

‘k-means++’ : selects the initial cluster centers in a smart way to speed up convergence. Uses scikit-learn’s implementation.

Warning

If using 'k-means++', the entire dataset will be read into memory at once.

An array of shape (n_clusters, n_features) can be used to give an explicit starting point.

oversampling_factor : int, default 2

Oversampling factor for use in the k-means|| algorithm.

max_iter : int

Maximum number of EM iterations to attempt.

init_max_iter : int

Number of iterations for the initialization step.

tol : float

Relative tolerance with regard to inertia to declare convergence.

algorithm : ‘full’

The algorithm to use for the EM step. Only “full” (Lloyd’s algorithm) is allowed.

random_state : int, RandomState instance or None, optional, default: None

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Attributes
cluster_centers_ : np.ndarray [n_clusters, n_features]

A NumPy array with the cluster centers.

labels_ : da.array [n_samples,]

A dask array with the index position in cluster_centers_ this sample belongs to.

inertia_ : float

Sum of distances of samples to their closest cluster center.

n_iter_ : int

Number of EM steps to reach convergence.

Notes

This class implements a parallel and distributed version of k-Means.

Initialization with k-means||

The default initializer for KMeans is k-means||, compared to k-means++ from scikit-learn. This is the algorithm described in Scalable K-Means++ (2012).

k-means|| is designed to work well in a distributed environment. It’s a variant of k-means++ that’s designed to work in parallel (k-means++ is inherently sequential). Currently, the k-means|| implementation here is slower than scikit-learn’s k-means++ if your entire dataset fits in memory on a single machine. If that’s the case, consider using init='k-means++'.

Parallel Lloyd’s Algorithm

Lloyd’s algorithm (the default expectation-maximization algorithm used in scikit-learn) is naturally parallelizable. In naive benchmarks, the implementation here achieves 2-3x speedups over scikit-learn.

Both the initialization step and the EM steps make multiple passes over the data. If possible, persist your dask collections in (distributed) memory before running .fit.

References

Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012). Scalable K-Means++. Proceedings of the VLDB Endowment, 5(7), 622–633.

Methods

fit_transform(self, X[, y])

Fit to data, then transform it.

get_params(self[, deep])

Get parameters for this estimator.

predict(self, X)

Predict the closest cluster each sample in X belongs to.

set_params(self, **params)

Set the parameters of this estimator.

fit(self, X[, y])

Compute k-means clustering.

transform(self, X)

Transform X to a cluster-distance space.

__init__(self, n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, n_jobs=1, algorithm='full', init_max_iter=None)

Initialize self. See help(type(self)) for accurate signature.

fit_transform(self, X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : mapping of string to any

Parameter names mapped to their values.

predict(self, X)

Predict the closest cluster each sample in X belongs to. In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters
X : array-like, shape = [n_samples, n_features]

New data to predict.

Returns
labels : array, shape [n_samples,]

Index of the cluster each sample belongs to.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns
self