dask_ml.cluster.SpectralClustering

class dask_ml.cluster.SpectralClustering(n_clusters=8, eigen_solver=None, random_state=None, n_init='auto', gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol=0.0, assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=1, n_components=100, persist_embedding=False, kmeans_params=None)

Apply parallel Spectral Clustering

This implementation avoids the expensive computation of the N x N affinity matrix. Instead, the Nyström Method is used as an approximation.
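The idea behind the Nyström method can be illustrated with a minimal NumPy sketch. It is not dask-ml's implementation: the helper names `rbf_kernel` and `nystrom_affinity` are hypothetical, the RBF kernel is assumed, and the full N x N product is formed only so the approximation error can be measured.

```python
import numpy as np


def rbf_kernel(X, Y, gamma=1.0):
    """RBF similarities exp(-gamma * ||x - y||^2) between rows of X and Y."""
    sq_dists = (
        (X ** 2).sum(axis=1)[:, None]
        + (Y ** 2).sum(axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def nystrom_affinity(X, n_components, gamma=1.0, seed=0):
    """Approximate the RBF affinity matrix from n_components sampled rows."""
    rng = np.random.default_rng(seed)
    keep = np.sort(rng.choice(len(X), size=n_components, replace=False))
    C = rbf_kernel(X, X[keep], gamma)  # (n_samples, n_components)
    W = C[keep]                        # (n_components, n_components)
    # Materialized here only to measure the error; a practical implementation
    # would work with the factors C and W and never form the N x N product.
    return C @ np.linalg.pinv(W, rcond=1e-10) @ C.T


rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
exact = rbf_kernel(X, X)
for m in (10, 50, 100):
    approx = nystrom_affinity(X, n_components=m)
    rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
    print(f"n_components={m:3d}  relative error={rel_err:.2e}")
```

Only an n_samples x n_components block of kernel values is ever evaluated, which is where the savings over the exact affinity matrix come from.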

Parameters
n_clusters : integer, optional

The number of clusters to form. This is also the dimension of the projection subspace.

eigen_solver : None

ignored

random_state : int, RandomState instance or None, optional, default: None

A pseudo-random number generator used for the initialization of the lobpcg eigenvector decomposition when eigen_solver == ‘amg’ and for the K-Means initialization. If an int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

n_init : 'auto' or int, optional, default: 'auto'

ignored

gamma : float, default=1.0

Kernel coefficient for rbf, poly, sigmoid, laplacian and chi2 kernels. Ignored for affinity='nearest_neighbors'.

affinity : string, array-like or callable, default ‘rbf’

It may be ‘precomputed’ or one of the kernels supported by metrics.pairwise.PAIRWISE_KERNEL_FUNCTIONS.

Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm.

Callables should expect arguments similar to sklearn.metrics.pairwise_kernels: a required X, an optional Y, and gamma, degree, coef0, and any keywords passed in kernel_params.
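The call signature described above can be sketched as follows. The function `laplacian_affinity` is a hypothetical example, not part of dask-ml or scikit-learn, and it assumes the inputs can be converted to NumPy arrays. It accepts `degree` and `coef0` only because they are passed along, and it swallows any extra `kernel_params` keywords.

```python
import numpy as np


def laplacian_affinity(X, Y=None, gamma=1.0, degree=3, coef0=1, **kwargs):
    """Similarity exp(-gamma * L1 distance) between rows of X and Y.

    degree, coef0 and any extra keyword arguments are accepted for
    signature compatibility and are not used by this kernel.
    """
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    # Pairwise L1 distances via broadcasting: result has shape (len(X), len(Y)).
    l1 = np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=-1)
    # Values lie in (0, 1] and grow with similarity, as required of an affinity.
    return np.exp(-gamma * l1)


points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(laplacian_affinity(points, gamma=0.5))
```

A function of this shape could then be supplied as `affinity=laplacian_affinity`, with `gamma` taken from the estimator's own parameter.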

n_neighbors : integer

Number of neighbors to use when constructing the affinity matrix using the nearest neighbors method. Ignored for affinity='rbf'.

eigen_tol : float, optional, default: 0.0

Stopping criterion for eigendecomposition of the Laplacian matrix when using arpack eigen_solver.

assign_labels : ‘kmeans’ or Estimator, default: ‘kmeans’

The strategy used to assign labels in the embedding space. By default, an instance of dask_ml.cluster.KMeans is created with its n_clusters set to the value of n_clusters. For further control over the hyperparameters of the final label assignment, pass an instance of a KMeans estimator (either scikit-learn or dask-ml).

degree : float, default=3

Degree of the polynomial kernel. Ignored by other kernels.

coef0 : float, default=1

Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels.

kernel_params : dictionary of string to any, optional

Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels.

n_jobs : int, optional (default = 1)

The number of parallel jobs to run. If -1, then the number of jobs is set to the number of CPU cores.

n_components : int, default 100

Number of rows from X to use for the Nyström approximation. Larger n_components will improve the accuracy of the approximation, at the cost of a longer training time.

persist_embedding : bool

Whether to persist the intermediate n_samples x n_components array used for clustering.

kmeans_params : dictionary of string to any, optional

Keyword arguments for the KMeans clustering used for the final clustering.

Attributes
assign_labels_ : Estimator

The instance of the KMeans estimator used to assign labels

labels_ : dask.array.Array, size (n_samples,)

The cluster labels assigned

eigenvalues_ : numpy.ndarray

The eigenvalues from the SVD of the sampled points

Notes

Using persist_embedding=True can be an important optimization because it avoids some redundant computations. It persists the array fed to the clustering algorithm in (distributed) memory. The array has shape n_samples x n_components.

Methods

fit_predict(X[, y])

Perform clustering on X and return cluster labels.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

fit

__init__(n_clusters=8, eigen_solver=None, random_state=None, n_init='auto', gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol=0.0, assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=1, n_components=100, persist_embedding=False, kmeans_params=None)