dask_ml.cluster.SpectralClustering

class dask_ml.cluster.SpectralClustering(n_clusters=8, eigen_solver=None, random_state=None, n_init=10, gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol=0.0, assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=1, n_components=100, persist_embedding=False, kmeans_params=None)

Apply parallel spectral clustering.

This implementation avoids the expensive computation of the N x N affinity matrix. Instead, the Nyström method approximates it from a sample of n_components rows of X.
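The idea behind the approximation can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the code dask-ml runs: the helper names rbf_kernel and nystrom_factors are hypothetical, the data is synthetic, and the full N x N matrix is rebuilt at the end only so the approximation can be inspected.

```python
import numpy as np


def rbf_kernel(X, Y, gamma=1.0):
    """Dense RBF affinity exp(-gamma * ||x - y||^2) for all pairs of rows."""
    sq_dists = (
        (X ** 2).sum(axis=1)[:, None]
        + (Y ** 2).sum(axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def nystrom_factors(X, n_components, gamma=1.0, seed=0):
    """Evaluate the kernel only against n_components sampled rows of X."""
    rng = np.random.RandomState(seed)
    keep = np.sort(rng.choice(X.shape[0], size=n_components, replace=False))
    C = rbf_kernel(X, X[keep], gamma)  # shape (n_samples, n_components)
    W = C[keep]                        # shape (n_components, n_components)
    return C, W, keep


rng = np.random.RandomState(42)
X = rng.normal(size=(200, 5))  # synthetic data, for illustration only

C, W, keep = nystrom_factors(X, n_components=20)

# Nystrom reconstruction K ~= C W^+ C^T. Materialized here for inspection;
# avoiding this N x N array is the point of the approximation.
K_approx = C @ np.linalg.pinv(W) @ C.T
```

Only 200 x 20 kernel evaluations are needed to build C and W, compared with 200 x 200 for the exact affinity matrix. The block of K_approx corresponding to the sampled rows reproduces W.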

Parameters:
n_clusters : integer, optional

The number of clusters to form, which is also the dimension of the projection subspace.

eigen_solver : None

Ignored.

random_state : int, RandomState instance or None, optional, default: None

A pseudo-random number generator used to initialize the lobpcg eigenvector decomposition when eigen_solver == ‘amg’, and for the K-Means initialization. If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

n_init : int, optional, default: 10

Ignored.

gamma : float, default=1.0

Kernel coefficient for rbf, poly, sigmoid, laplacian and chi2 kernels. Ignored for affinity='nearest_neighbors'.

affinity : string, array-like or callable, default ‘rbf’

If a string, this may be one of ‘nearest_neighbors’, ‘precomputed’, ‘rbf’ or one of the kernels supported by sklearn.metrics.pairwise_kernels.

Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm.

Callables should expect arguments similar to sklearn.metrics.pairwise_kernels: a required X, an optional Y, and gamma, degree, coef0, and any keywords passed in kernel_params.

n_neighbors : integer

Number of neighbors to use when constructing the affinity matrix using the nearest neighbors method. Ignored for affinity='rbf'.

eigen_tol : float, optional, default: 0.0

Stopping criterion for eigendecomposition of the Laplacian matrix when using arpack eigen_solver.

assign_labels : ‘kmeans’ or Estimator, default: ‘kmeans’

The strategy to use to assign labels in the embedding space. By default creates an instance of dask_ml.cluster.KMeans and sets n_clusters to 2. For further control over the hyperparameters of the final label assignment, pass an instance of a KMeans estimator (either scikit-learn or dask-ml).

degree : float, default=3

Degree of the polynomial kernel. Ignored by other kernels.

coef0 : float, default=1

Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels.

kernel_params : dictionary of string to any, optional

Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels.

n_jobs : int, optional (default = 1)

The number of parallel jobs to run. If -1, then the number of jobs is set to the number of CPU cores.

n_components : int, default 100

Number of rows from X to use for the Nyström approximation. Larger n_components will improve the accuracy of the approximation, at the cost of a longer training time.

persist_embedding : bool

Whether to persist the intermediate n_samples x n_components array used for clustering.

kmeans_params : dictionary of string to any, optional

Keyword arguments for the KMeans clustering used for the final clustering.
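For the callable form of affinity described above, the expected signature can be sketched as follows. This is a hypothetical kernel written for illustration: the name laplacian_affinity and the extra keyword scale are assumptions, with scale standing in for a value supplied through kernel_params. It is exercised on NumPy arrays only.

```python
import numpy as np


def laplacian_affinity(X, Y=None, gamma=1.0, degree=3, coef0=1, scale=1.0):
    """Hypothetical similarity kernel: scale * exp(-gamma * L1 distance).

    degree and coef0 are accepted because the documented calling
    convention passes them, but this kernel does not use them.
    """
    if Y is None:
        Y = X
    # Pairwise L1 distances between rows of X and rows of Y.
    l1 = np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=-1)
    return scale * np.exp(-gamma * l1)


# Extra keywords such as "scale" would be supplied via kernel_params.
kernel_params = {"scale": 2.0}

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = laplacian_affinity(X, gamma=0.5, degree=3, coef0=1, **kernel_params)
```

The result is non-negative and grows as points get closer, which is the similarity property the affinity parameter asks for but does not check.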

Attributes:
assign_labels_ : Estimator

The instance of the KMeans estimator used to assign labels

labels_ : dask.array.Array, size (n_samples,)

The cluster labels assigned

eigenvalues_ : numpy.ndarray

The eigenvalues from the SVD of the sampled points

Notes

Using persist_embedding=True can be an important optimization to avoid some redundant computations. This persists the array being fed to the clustering algorithm in (distributed) memory. The array is shape n_samples x n_components.

Methods

fit_predict(X[, y])    Performs clustering on X and returns cluster labels.
get_params([deep])     Get parameters for this estimator.
set_params(**params)   Set the parameters of this estimator.
fit
__init__(n_clusters=8, eigen_solver=None, random_state=None, n_init=10, gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol=0.0, assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=1, n_components=100, persist_embedding=False, kmeans_params=None)

fit_predict(X, y=None)

Performs clustering on X and returns cluster labels.

Parameters:
X : ndarray, shape (n_samples, n_features)

Input data.

y : Ignored

Not used; present for API consistency by convention.

Returns:
labels : ndarray, shape (n_samples,)

Cluster labels.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
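The <component>__<parameter> naming convention can be illustrated with a small pure-Python sketch. This shows how such keys separate into the estimator's own parameters and per-component parameters. It is not the library's implementation, and the helper split_nested_params and the example keys are hypothetical.

```python
def split_nested_params(params):
    """Group '<component>__<parameter>' keys by component name."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only.
        component, sep, sub_key = key.partition("__")
        if sep:
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"n_clusters": 4, "assign_labels__max_iter": 50}
)
```

Here n_clusters applies to the estimator itself, while assign_labels__max_iter is routed to the component named assign_labels as max_iter.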