dask_ml.cluster.SpectralClustering

class dask_ml.cluster.SpectralClustering(n_clusters=8, eigen_solver=None, random_state=None, n_init=10, gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol=0.0, assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=1, n_components=100, persist_embedding=False, kmeans_params=None)

Apply parallel spectral clustering.

This implementation avoids the expensive computation of the N x N affinity matrix. Instead, the Nyström Method is used as an approximation.
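The idea behind the approximation can be illustrated with a small NumPy sketch. This is not the dask-ml implementation: the helper names `rbf_kernel` and `nystrom_approximation` are made up for this example, and the full matrices are built here only so the approximation error can be measured on a toy dataset.

```python
import numpy as np


def rbf_kernel(X, Y, gamma=1.0):
    """Dense RBF kernel exp(-gamma * ||x - y||^2) between rows of X and Y."""
    sq_dists = (
        (X ** 2).sum(axis=1)[:, None]
        + (Y ** 2).sum(axis=1)[None, :]
        - 2.0 * (X @ Y.T)
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def nystrom_approximation(X, n_components, gamma=1.0, seed=0):
    """Approximate the N x N kernel matrix from n_components sampled rows."""
    rng = np.random.RandomState(seed)
    keep = rng.choice(len(X), n_components, replace=False)
    C = rbf_kernel(X, X[keep], gamma)  # (n_samples, n_components)
    A = C[keep]                        # (n_components, n_components)
    # rcond drops near-zero singular values for numerical stability.
    return C @ np.linalg.pinv(A, rcond=1e-10) @ C.T


rng = np.random.RandomState(42)
X = np.vstack([
    rng.normal(-3.0, 0.3, size=(150, 2)),
    rng.normal(3.0, 0.3, size=(150, 2)),
])

K_exact = rbf_kernel(X, X)  # computed only to check the approximation
K_approx = nystrom_approximation(X, n_components=50)
rel_error = np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact)
print(f"relative Frobenius error: {rel_error:.2e}")
```

Only the `n_samples x n_components` block `C` has to be evaluated against the data; a real implementation would work with `C` and `A` directly rather than materializing the product.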

Parameters
n_clusters : integer, optional

The number of clusters to form. This is also the dimension of the projection subspace.

eigen_solver : None

Ignored.

random_state : int, RandomState instance or None, optional, default: None

A pseudo-random number generator used to initialize the lobpcg eigenvector decomposition when eigen_solver == ‘amg’ and by the K-Means initialization. If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

n_init : int, optional, default: 10

Ignored.

gamma : float, default=1.0

Kernel coefficient for rbf, poly, sigmoid, laplacian and chi2 kernels. Ignored for affinity='nearest_neighbors'.

affinity : string, array-like or callable, default ‘rbf’

If a string, this may be one of ‘nearest_neighbors’, ‘precomputed’, ‘rbf’ or one of the kernels supported by sklearn.metrics.pairwise_kernels.

Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm.

Callables should expect arguments similar to sklearn.metrics.pairwise_kernels: a required X, an optional Y, and gamma, degree, coef0, and any keywords passed in kernel_params.
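As an illustration of that calling convention, here is a minimal sketch of a custom similarity kernel. The function name `scaled_rbf` and its `scale` keyword are hypothetical; `scale` stands in for an extra argument that would be supplied through kernel_params. The sketch is exercised on NumPy arrays only; which array type the estimator actually passes to the callable is not stated here.

```python
import numpy as np


def scaled_rbf(X, Y=None, gamma=1.0, degree=3, coef0=1, scale=1.0):
    """Hypothetical kernel: scale * exp(-gamma * ||x - y||^2).

    degree and coef0 are accepted to match the documented calling
    convention but are not used by this particular kernel.
    """
    if Y is None:
        Y = X
    sq_dists = (
        (X ** 2).sum(axis=1)[:, None]
        + (Y ** 2).sum(axis=1)[None, :]
        - 2.0 * (X @ Y.T)
    )
    # Non-negative output that grows as points get closer, as required.
    return scale * np.exp(-gamma * np.maximum(sq_dists, 0.0))


X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
K = scaled_rbf(X, gamma=0.5, scale=2.0)
print(K.shape)
```

Under these assumptions, the callable would be passed as `affinity=scaled_rbf` together with `kernel_params={"scale": 2.0}`.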

n_neighbors : integer

Number of neighbors to use when constructing the affinity matrix using the nearest neighbors method. Ignored for affinity='rbf'.

eigen_tol : float, optional, default: 0.0

Stopping criterion for eigendecomposition of the Laplacian matrix when using arpack eigen_solver.

assign_labels : ‘kmeans’ or Estimator, default: ‘kmeans’

The strategy to use to assign labels in the embedding space. By default, creates an instance of dask_ml.cluster.KMeans and sets its n_clusters to the value of n_clusters given above. For further control over the hyperparameters of the final label assignment, pass an instance of a KMeans estimator (either scikit-learn or dask-ml).

degree : float, default=3

Degree of the polynomial kernel. Ignored by other kernels.

coef0 : float, default=1

Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels.

kernel_params : dictionary of string to any, optional

Keyword arguments passed to the kernel when affinity is a callable. Ignored by other kernels.

n_jobs : int, optional (default = 1)

The number of parallel jobs to run. If -1, then the number of jobs is set to the number of CPU cores.

n_components : int, default 100

Number of rows from X to use for the Nyström approximation. Larger n_components will improve the accuracy of the approximation, at the cost of a longer training time.

persist_embedding : bool

Whether to persist the intermediate n_samples x n_components array used for clustering.

kmeans_params : dictionary of string to any, optional

Keyword arguments for the KMeans clustering used for the final clustering.

Attributes
assign_labels_ : Estimator

The instance of the KMeans estimator used to assign labels.

labels_ : dask.array.Array, size (n_samples,)

The cluster labels assigned.

eigenvalues_ : numpy.ndarray

The eigenvalues from the SVD of the sampled points.

Notes

Using persist_embedding=True can be an important optimization because it avoids some redundant computations. It persists the array being fed to the clustering algorithm in (distributed) memory. The array has shape n_samples x n_components.

Methods

fit_predict(self, X[, y])

Performs clustering on X and returns cluster labels.

get_params(self[, deep])

Get parameters for this estimator.

set_params(self, **params)

Set the parameters of this estimator.

fit

__init__(self, n_clusters=8, eigen_solver=None, random_state=None, n_init=10, gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol=0.0, assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=1, n_components=100, persist_embedding=False, kmeans_params=None)

Initialize self. See help(type(self)) for accurate signature.

fit_predict(self, X, y=None)

Performs clustering on X and returns cluster labels.

Parameters
X : ndarray, shape (n_samples, n_features)

Input data.

y : Ignored

Not used; present for API consistency by convention.

Returns
labels : ndarray, shape (n_samples,)

Cluster labels.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : mapping of string to any

Parameter names mapped to their values.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns
self