dask_ml.cluster.KMeans

class dask_ml.cluster.KMeans(n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, n_jobs=1, algorithm='full', init_max_iter=None)

Scalable KMeans for clustering

Parameters
n_clusters : int, default 8

Number of clusters to end up with

init : {'k-means||', 'k-means++' or ndarray}

Method for center initialization, defaults to ‘k-means||’.

'k-means||' : selects the initial cluster centers using the scalable k-means|| algorithm, a parallel variant of k-means++ (see the Notes section).

‘k-means++’ : selects the initial cluster centers in a smart way to speed up convergence. Uses scikit-learn’s implementation.

Warning

If using 'k-means++', the entire dataset will be read into memory at once.

An array of shape (n_clusters, n_features) can be used to give an explicit starting point.

oversampling_factor : int, default 2

Oversampling factor for use in the k-means|| algorithm.

max_iter : int

Maximum number of EM iterations to attempt.

init_max_iter : int

Number of iterations for init step.

tol : float

Relative tolerance with regard to inertia to declare convergence.

algorithm : 'full'

The algorithm to use for the EM step. Only 'full' (Lloyd's algorithm) is allowed.

random_state : int, RandomState instance or None, optional, default: None

If an int, random_state is the seed used by the random number generator; if a RandomState instance, it is used as the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Attributes
cluster_centers_ : np.ndarray [n_clusters, n_features]

A NumPy array with the cluster centers

labels_ : da.array [n_samples,]

A dask array giving, for each sample, the index of the entry in cluster_centers_ it belongs to.

inertia_ : float

Sum of distances of samples to their closest cluster center.

n_iter_ : int

Number of EM steps to reach convergence

Notes

This class implements a parallel and distributed version of k-Means.

Initialization with k-means||

The default initializer for KMeans is k-means||, compared to k-means++ from scikit-learn. This is the algorithm described in Scalable K-Means++ (2012).

k-means|| is designed to work well in a distributed environment. It’s a variant of k-means++ that’s designed to work in parallel (k-means++ is inherently sequential). Currently, the k-means|| implementation here is slower than scikit-learn’s k-means++ if your entire dataset fits in memory on a single machine. If that’s the case, consider using init='k-means++'.
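The core of k-means|| can be sketched in plain NumPy. This is an illustrative, single-machine sketch of the oversampling idea from the 2012 paper, not the dask-ml implementation: each round samples several candidate centers at once (in expectation, oversampling_factor * n_clusters of them), and the weighted candidate set is then reduced to n_clusters centers. The function name and the final reduction step (keeping the heaviest candidates rather than reclustering them with k-means++ as the paper does) are simplifications of my own.

```python
import numpy as np

def kmeans_parallel_init(X, n_clusters, oversampling_factor=2,
                         n_rounds=5, rng=None):
    """Illustrative sketch of k-means|| initialization (Bahmani et al., 2012)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    # Step 1: pick one center uniformly at random.
    centers = X[rng.integers(n)][None, :]
    l = oversampling_factor * n_clusters  # expected samples per round
    for _ in range(n_rounds):
        # Squared distance from each point to its nearest current center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
        cost = d2.sum()
        if cost == 0:
            break
        # Step 2: sample each point independently, with probability
        # proportional to its contribution to the cost. Unlike k-means++,
        # this adds many candidates per pass, so few passes are needed.
        mask = rng.random(n) < np.minimum(1.0, l * d2 / cost)
        centers = np.concatenate([centers, X[mask]])
    # Step 3: weight each candidate by the number of points it attracts,
    # then reduce the small candidate set down to n_clusters centers.
    d2_all = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    weights = np.bincount(d2_all.argmin(axis=1), minlength=len(centers))
    # Simplified reduction: keep the heaviest candidates. (The paper
    # reclusters the weighted candidates, e.g. with k-means++.)
    top = np.argsort(weights)[::-1][:n_clusters]
    return centers[top]
```

Because each round only needs per-point distances and a global cost sum, steps 1–2 map naturally onto chunked dask arrays.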

Parallel Lloyd’s Algorithm

Lloyd's algorithm (the default Expectation Maximization algorithm used in scikit-learn) is naturally parallelizable. In naive benchmarks, the implementation here achieves 2-3x speedups over scikit-learn.
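To make the parallelism concrete, here is a minimal NumPy sketch of one EM step of Lloyd's algorithm (the helper name is my own, not part of the dask-ml API). Both halves reduce to sums and counts over the data, which are associative and therefore easy to compute chunk-by-chunk:

```python
import numpy as np

def lloyd_step(X, centers):
    """One EM step of Lloyd's algorithm (illustrative sketch).

    E-step: assign each point to its nearest center.
    M-step: recompute each center as the mean of its assigned points.
    """
    # E-step: pairwise squared distances, then nearest-center assignment.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = d2.argmin(axis=1)
    # M-step: per-cluster means. Per-cluster sums and counts can be
    # computed independently on each data chunk and combined, which is
    # what makes this step parallelize well.
    new_centers = np.array([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return new_centers, labels
```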

Both the initialization step and the EM steps make multiple passes over the data. If possible, persist your dask collections in (distributed) memory before running .fit.

References

Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012). Scalable K-Means++. Proceedings of the VLDB Endowment, 5(7), 622–633.
Methods

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

predict(X)

Predict the closest cluster each sample in X belongs to.

set_params(**params)

Set the parameters of this estimator.

fit(X[, y])

Compute k-means clustering.

transform(X)

Transform X to a cluster-distance space.

__init__(n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, n_jobs=1, algorithm='full', init_max_iter=None)