dask_ml.cluster.SpectralClustering
- class dask_ml.cluster.SpectralClustering(n_clusters=8, eigen_solver=None, random_state=None, n_init='auto', gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol=0.0, assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=1, n_components=100, persist_embedding=False, kmeans_params=None)
Apply parallel Spectral Clustering
This implementation avoids the expensive computation of the N x N affinity matrix. Instead, the Nyström Method is used as an approximation.
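To make the idea concrete, here is a minimal NumPy sketch of a Nyström approximation of an RBF affinity matrix. It is an illustration under stated assumptions, not dask-ml's implementation: the helper names `rbf_kernel` and `nystrom_affinity` are hypothetical, the rows are sampled uniformly at random, and the full N x N approximation is materialized only so that it can be compared against the exact matrix (a real implementation would never form it).

```python
import numpy as np


def rbf_kernel(X, Y, gamma=1.0):
    """Dense RBF kernel: exp(-gamma * ||x - y||^2) for every pair of rows."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def nystrom_affinity(X, n_components, gamma=1.0, seed=0):
    """Approximate the full affinity matrix from n_components sampled rows."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    keep = rng.choice(n_samples, size=n_components, replace=False)
    rest = np.setdiff1d(np.arange(n_samples), keep)

    A = rbf_kernel(X[keep], X[keep], gamma)  # (m, m) block, computed exactly
    B = rbf_kernel(X[keep], X[rest], gamma)  # (m, n - m) block, computed exactly

    # Nystrom extension: the block between unsampled rows, which is never
    # computed directly, is approximated by B^T A^+ B.
    C = np.vstack([A, B.T])  # (n, m), rows ordered as [keep, rest]
    approx = C @ np.linalg.pinv(A) @ C.T
    order = np.concatenate([keep, rest])
    return approx, order


rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
approx, order = nystrom_affinity(X, n_components=50, gamma=0.5)

# Compare against the exact matrix, using the same row ordering.
exact = rbf_kernel(X[order], X[order], gamma=0.5)
rel_error = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
```

Only the two blocks `A` and `B` require kernel evaluations, so the cost grows with `n_samples * n_components` rather than `n_samples ** 2`.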
- Parameters
- n_clusters : integer, optional
The number of clusters to form, which is also the dimension of the projection subspace.
- eigen_solver : None
Ignored.
- random_state : int, RandomState instance or None, optional, default: None
A pseudo-random number generator used for the initialization of the lobpcg eigenvector decomposition when eigen_solver == ‘amg’ and by the K-Means initialization. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- n_init : int or ‘auto’, optional, default: ‘auto’
Ignored.
- gamma : float, default=1.0
Kernel coefficient for rbf, poly, sigmoid, laplacian and chi2 kernels. Ignored for affinity='nearest_neighbors'.
- affinity : string, array-like or callable, default ‘rbf’
It may be ‘precomputed’ or one of the kernels supported by metrics.pairwise.PAIRWISE_KERNEL_FUNCTIONS.
Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm.
Callables should expect arguments similar to sklearn.metrics.pairwise_kernels: a required X, an optional Y, and gamma, degree, coef0, and any keywords passed in kernel_params.
- n_neighbors : integer
Number of neighbors to use when constructing the affinity matrix using the nearest neighbors method. Ignored for affinity='rbf'.
- eigen_tol : float, optional, default: 0.0
Stopping criterion for eigendecomposition of the Laplacian matrix when using arpack eigen_solver.
- assign_labels : ‘kmeans’ or Estimator, default: ‘kmeans’
The strategy to use to assign labels in the embedding space. By default creates an instance of dask_ml.cluster.KMeans and sets n_clusters to 2. For further control over the hyperparameters of the final label assignment, pass an instance of a KMeans estimator (either scikit-learn or dask-ml).
- degree : float, default=3
Degree of the polynomial kernel. Ignored by other kernels.
- coef0 : float, default=1
Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels.
- kernel_params : dictionary of string to any, optional
Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels.
- n_jobs : int, optional (default = 1)
The number of parallel jobs to run. If -1, then the number of jobs is set to the number of CPU cores.
- n_components : int, default 100
Number of rows from X to use for the Nyström approximation. Larger n_components will improve the accuracy of the approximation, at the cost of a longer training time.
- persist_embedding : bool
Whether to persist the intermediate n_samples x n_components array used for clustering.
- kmeans_params : dictionary of string to any, optional
Keyword arguments for the KMeans clustering used for the final clustering.
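As an illustration of the callable form of affinity described above, the sketch below defines a kernel with the documented argument shape: a required X, an optional Y, then gamma, degree, coef0, and any extra keywords from kernel_params. The function name `laplacian_affinity` and the `weights` keyword are hypothetical, and the sketch assumes NumPy inputs for simplicity; how dask-ml chunks the arrays it passes in is not shown here.

```python
import numpy as np


def laplacian_affinity(X, Y=None, gamma=1.0, degree=3, coef0=1, **kernel_params):
    """Hypothetical affinity: exp(-gamma * weighted L1 distance) per row pair.

    degree and coef0 are accepted only to match the documented call
    signature; this kernel does not use them.
    """
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    # Optional per-feature weights: a made-up keyword for this sketch that
    # would be supplied through kernel_params.
    weights = kernel_params.get("weights", np.ones(X.shape[1]))
    l1 = np.sum(np.abs(X[:, None, :] - Y[None, :, :]) * weights, axis=2)
    # Values lie in (0, 1] and grow with similarity, as required above.
    return np.exp(-gamma * l1)


points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = laplacian_affinity(points, gamma=0.5)
```

The result is non-negative and increases as points get closer, which is the similarity property the affinity parameter asks for but does not check.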
- Attributes
- assign_labels_ : Estimator
The instance of the KMeans estimator used to assign labels.
- labels_ : dask.array.Array, size (n_samples,)
The cluster labels assigned.
- eigenvalues_ : numpy.ndarray
The eigenvalues from the SVD of the sampled points.
Notes
Using persist_embedding=True can be an important optimization to avoid some redundant computations. This persists the array being fed to the clustering algorithm in (distributed) memory. The array has shape n_samples x n_components.
References
- Parallel Spectral Clustering in Distributed Systems (2010). Chen, Song, Bai, Lin, and Chang. IEEE Transactions on Pattern Analysis and Machine Intelligence. http://ieeexplore.ieee.org/document/5444877/
- Spectral Grouping Using the Nyström Method (2004). Fowlkes, Belongie, Chung, and Malik. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://people.cs.umass.edu/~mahadeva/cs791bb/reading/fowlkes-nystrom.pdf
Methods
- fit_predict(X[, y])
Perform clustering on X and return cluster labels.
- get_metadata_routing()
Get metadata routing of this object.
- get_params([deep])
Get parameters for this estimator.
- set_params(**params)
Set the parameters of this estimator.
- fit
- __init__(n_clusters=8, eigen_solver=None, random_state=None, n_init='auto', gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol=0.0, assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=1, n_components=100, persist_embedding=False, kmeans_params=None)