Clustering

Clustering

KMeans([n_clusters, init, ...])

Scalable KMeans for clustering

SpectralClustering([n_clusters, ...])

Apply parallel Spectral Clustering

The dask_ml.cluster module implements several algorithms for clustering unlabeled data.

Spectral Clustering

Spectral Clustering finds a low-dimensional embedding on the affinity matrix between samples. The embedded dataset is then clustered, typically with KMeans.

Typically, spectral clustering algorithms do not scale well. Computing the n_samples×n_samples affinity matrix becomes prohibitively expensive when the number of samples is large. Several algorithms have been proposed to work around this limitation.

In dask-ml, we use the Nyström method to approximate the large affinity matrix. This involves sampling n_components rows from the entire training set. The exact affinity is computed for this subset ( n_components×n_components ), and between this small subset and the rest of the data ( n_components×(n_samplesn_components) ). We avoid the direct computation of the rest of the affinity matrix.

Let S be our n×n affinity matrix. We can rewrite that as

Sd=[ABBTC]

Where A is the n×n affinity matrix of the n_components that we sampled, and B is the n×(nn_components) affinity matrix between the sample and the rest of the dataset. Instead of computing C directly, we approximate it with BTA1B.

See the spectral clustering benchmark for an example showing how dask_ml.cluster.SpectralClustering scales in the number of samples.