# Clustering

| Class | Description |
| --- | --- |
| `KMeans([n_clusters, init, …])` | Scalable KMeans for clustering |
| `PartialMiniBatchKMeans(**kwargs)` | Mini-Batch K-Means clustering |
| `SpectralClustering([n_clusters, …])` | Apply parallel Spectral Clustering |

The `dask_ml.cluster` module implements several algorithms for clustering unlabeled data.

## Spectral Clustering

Spectral Clustering finds a low-dimensional embedding of the affinity matrix between samples. The embedded dataset is then clustered, typically with KMeans.

Typically, spectral clustering algorithms do not scale well. Computing the \(n\_samples \times n\_samples\) affinity matrix becomes prohibitively expensive when the number of samples is large. Several algorithms have been proposed to work around this limitation.

In `dask-ml`, we use the Nyström method to approximate the large affinity matrix. This involves sampling `n_components` rows from the entire training set. The exact affinity is computed for this subset ( \(n\_components \times n\_components\) ), and between this small subset and the rest of the data ( \(n\_components \times (n\_samples - n\_components)\) ). We avoid the direct computation of the rest of the affinity matrix.

Let \(S\) be our \(n \times n\) affinity matrix. We can rewrite that as

\[S = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}\]

Where \(A\) is the \(n\_components \times n\_components\) affinity matrix of the rows we sampled, and \(B\) is the \(n\_components \times (n - n\_components)\) affinity matrix between the sampled rows and the rest of the dataset. Instead of computing \(C\) directly, we approximate it with \(B^T A^{-1} B\).
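The approximation above can be sketched in plain NumPy. This is not the library's implementation: the RBF affinity and the use of a pseudoinverse for \(A^{-1}\) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 40  # n samples, m = n_components sampled rows
X = rng.normal(size=(n, 3))

def rbf(U, V, gamma=1.0):
    # Pairwise RBF affinity between the rows of U and V.
    d2 = ((U[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

sample, rest = X[:m], X[m:]
A = rbf(sample, sample)  # (m, m): exact affinity of the sample
B = rbf(sample, rest)    # (m, n - m): sample vs. the rest

# Approximate the (n - m, n - m) block C without ever forming it exactly.
# pinv is used instead of a plain inverse in case A is ill-conditioned.
C_approx = B.T @ np.linalg.pinv(A) @ B

print(C_approx.shape)  # (160, 160)
```

Only \(A\) and \(B\) are ever computed exactly, so the cost is driven by `n_components` rather than by the full \(n \times n\) matrix.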

See the spectral clustering benchmark for an example showing how `dask_ml.cluster.SpectralClustering` scales in the number of samples.