Clustering
Contents
Clustering¶
|
Scalable KMeans for clustering |
|
Apply parallel Spectral Clustering |
The dask_ml.cluster
module implements several algorithms for clustering unlabeled data.
Spectral Clustering¶
Spectral Clustering finds a low-dimensional embedding on the affinity matrix between samples. The embedded dataset is then clustered, typically with KMeans.
Typically, spectral clustering algorithms do not scale well. Computing the \(n\_samples \times n\_samples\) affinity matrix becomes prohibitively expensive when the number of samples is large. Several algorithms have been proposed to work around this limitation.
In dask-ml
, we use the Nyström method to approximate the large affinity
matrix. This involves sampling n_components
rows from the entire training
set. The exact affinity is computed for this subset
( \(n\_components \times n\_components\) ), and between this small subset and
the rest of the data ( \(n\_components \times (n\_samples - n\_components)\) ).
We avoid the direct computation of the rest of the affinity matrix.
Let \(S\) be our \(n \times n\) affinity matrix. We can rewrite that as
Where \(A\) is the \(n \times n\) affinity matrix of the \(n\_components\) that we sampled, and \(B\) is the \(n \times (n - n\_components)\) affinity matrix between the sample and the rest of the dataset. Instead of computing \(C\) directly, we approximate it with \(B^T A^{-1} B\).
See the spectral clustering benchmark for an example showing how
dask_ml.cluster.SpectralClustering
scales in the number of samples.