dask_ml.cluster.KMeans

class dask_ml.cluster.KMeans(n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, n_jobs=1, algorithm='full', init_max_iter=None)

Scalable KMeans for clustering

Parameters:
n_clusters : int, default 8

Number of clusters to end up with

init : {‘k-means||’, ‘k-means++’ or ndarray}

Method for center initialization, defaults to ‘k-means||’.

‘k-means||’ : selects the initial cluster centers in a parallel, distributed-friendly fashion, following the Scalable K-Means++ (2012) algorithm described in the Notes below.

‘k-means++’ : selects the initial cluster centers in a smart way to speed up convergence. Uses scikit-learn’s implementation.

Warning

If using 'k-means++', the entire dataset will be read into memory at once.

An array of shape (n_clusters, n_features) can be used to give an explicit starting point (see the sketch after this parameter list).

oversampling_factor : int, default 2

Oversampling factor for use in the k-means|| algorithm.

max_iter : int

Maximum number of EM iterations to attempt.

init_max_iter : int

Number of iterations for the initialization step.

tol : float

Relative tolerance with regard to inertia to declare convergence

algorithm : ‘full’

The algorithm to use for the EM step. Only “full” (Lloyd’s algorithm) is allowed.

random_state : int, RandomState instance or None, optional, default: None

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
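
As an illustration, the three forms of init might be passed like this; the array below is a placeholder, not a recommended starting point:

>>> import numpy as np
>>> from dask_ml.cluster import KMeans
>>> km1 = KMeans(init='k-means||')   # scalable, parallel initialization (the default)
>>> km2 = KMeans(init='k-means++')   # scikit-learn's initializer; reads the data into memory
>>> centers = np.random.uniform(size=(8, 4))     # explicit (n_clusters, n_features) centers
>>> km3 = KMeans(n_clusters=8, init=centers)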

Attributes:
cluster_centers_ : np.ndarray [n_clusters, n_features]

A NumPy array with the cluster centers

labels_ : da.array [n_samples,]

A dask array with the index position in cluster_centers_ that each sample belongs to.

inertia_ : float

Sum of distances of samples to their closest cluster center.

n_iter_ : int

Number of EM steps to reach convergence
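
A sketch of inspecting these attributes after fitting; the data and shapes here are synthetic and illustrative:

>>> import dask.array as da
>>> from dask_ml.cluster import KMeans
>>> X = da.random.random((1000, 4), chunks=(250, 4))
>>> km = KMeans(n_clusters=3).fit(X)
>>> km.cluster_centers_.shape            # (3, 4): one row per cluster center
>>> labels = km.labels_.compute()        # per-sample index into cluster_centers_
>>> inertia, n_iter = km.inertia_, km.n_iter_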

See also

PartialMiniBatchKMeans, sklearn.cluster.MiniBatchKMeans, sklearn.cluster.KMeans

Notes

This class implements a parallel and distributed version of k-Means.

Initialization with k-means||

The default initializer for KMeans is k-means||, compared to k-means++ from scikit-learn. This is the algorithm described in Scalable K-Means++ (2012).

k-means|| is designed to work well in a distributed environment. It’s a variant of k-means++ that’s designed to work in parallel (k-means++ is inherently sequential). Currently, the k-means|| implementation here is slower than scikit-learn’s k-means++ if your entire dataset fits in memory on a single machine. If that’s the case, consider using init='k-means++'.
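
For example, switching initializers for an in-memory dataset is a one-argument change:

>>> from dask_ml.cluster import KMeans
>>> km = KMeans(init='k-means++')    # often faster when the data fits in memory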

Parallel Lloyd’s Algorithm

Lloyd’s algorithm (the default Expectation Maximization algorithm used in scikit-learn) is naturally parallelizable. In naive benchmarks, the implementation here achieves 2-3x speedups over scikit-learn.

Both the initialization step and the EM steps make multiple passes over the data. If possible, persist your dask collections in (distributed) memory before running .fit.
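
Putting these notes together, a minimal end-to-end sketch might look like the following; the shapes and chunk sizes are illustrative, not recommendations:

>>> import dask.array as da
>>> from dask_ml.cluster import KMeans
>>> X = da.random.random((100000, 10), chunks=(10000, 10))
>>> X = X.persist()                      # keep blocks in (distributed) memory across passes
>>> km = KMeans(n_clusters=8).fit(X)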

References

Bahmani, B., Moseley, B., Vattani, A., Kumar, R., and Vassilvitskii, S. (2012). Scalable K-Means++. Proceedings of the VLDB Endowment, 5(7), 622-633.

Methods

fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
predict(X) Predict the closest cluster each sample in X belongs to.
set_params(**params) Set the parameters of this estimator.
fit(X[, y]) Compute k-means clustering.
transform(X) Transform X to a cluster-distance space.
__init__(n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, n_jobs=1, algorithm='full', init_max_iter=None)

Initialize self. See help(type(self)) for accurate signature.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.
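
A sketch, assuming transform maps samples to cluster-distance space as in scikit-learn (so X_new would have one column per cluster):

>>> import dask.array as da
>>> from dask_ml.cluster import KMeans
>>> X = da.random.random((1000, 4), chunks=(250, 4))
>>> X_new = KMeans(n_clusters=3).fit_transform(X)
>>> X_new.shape        # expected (1000, 3): distance to each of the 3 centers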

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.
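
For example:

>>> from dask_ml.cluster import KMeans
>>> params = KMeans(n_clusters=3).get_params()
>>> params['n_clusters']
3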

predict(X)

Predict the closest cluster each sample in X belongs to. In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters:
X : array-like, shape = [n_samples, n_features]

New data to predict.

Returns:
labels : array, shape [n_samples,]

Index of the cluster each sample belongs to.
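
A sketch of predicting labels for new data with a fitted estimator; the arrays are illustrative:

>>> import dask.array as da
>>> from dask_ml.cluster import KMeans
>>> X = da.random.random((1000, 4), chunks=(250, 4))
>>> km = KMeans(n_clusters=3).fit(X)
>>> new = da.random.random((100, 4), chunks=(50, 4))
>>> labels = km.predict(new)     # index of the closest center for each sample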

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
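
For example (set_params returns the estimator itself, so calls can be chained):

>>> from dask_ml.cluster import KMeans
>>> km = KMeans().set_params(n_clusters=10, tol=0.001)
>>> km.n_clusters
10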