dask_ml.cluster.KMeans¶
- class dask_ml.cluster.KMeans(n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, n_jobs=1, algorithm='full', init_max_iter=None, n_init='auto')¶
Scalable KMeans for clustering
- Parameters
- n_clusters : int, default 8
Number of clusters to end up with.
- init : {'k-means||', 'k-means++' or ndarray}
Method for center initialization, defaults to 'k-means||'.
'k-means||' : selects the initial cluster centers in a scalable, parallel fashion (a distributed-friendly variant of k-means++; see the Notes below).
'k-means++' : selects the initial cluster centers in a smart way to speed up convergence. Uses scikit-learn's implementation.
Warning
If using 'k-means++', the entire dataset will be read into memory at once.
An array of shape (n_clusters, n_features) can be used to give an explicit starting point.
- oversampling_factor : int, default 2
Oversampling factor for use in the k-means|| algorithm.
- max_iter : int
Maximum number of EM iterations to attempt.
- init_max_iter : int
Number of iterations for the initialization step.
- tol : float
Relative tolerance with regard to inertia to declare convergence.
- algorithm : 'full'
The algorithm to use for the EM step. Only "full" (Lloyd's algorithm) is allowed.
- random_state : int, RandomState instance or None, optional, default: None
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- n_init : 'auto' or int, default=10
Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia. When n_init='auto', the number of runs will be 10 if using init='random', and 1 if using init='k-means++'.
New in version 1.2: added the 'auto' option for n_init.
Changed in version 1.4: the default value of n_init changed from 10 to 'auto'.
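The max_iter and tol parameters above govern the EM loop: each iteration assigns samples to their nearest center and then recomputes each center as the mean of its assigned samples, stopping once the improvement in inertia is small relative to tol. As a rough illustration only (this is a single-machine NumPy sketch, not dask-ml's actual implementation, and the exact stopping rule is an assumption):

```python
import numpy as np

def lloyd(X, centers, max_iter=300, tol=1e-4):
    """Illustrative single-machine Lloyd's (EM) loop for k-means."""
    prev_inertia = np.inf
    for n_iter in range(1, max_iter + 1):
        # E step: squared distance of every sample to every center,
        # then assign each sample to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        inertia = d2[np.arange(len(X)), labels].sum()
        # M step: move each center to the mean of its assigned samples
        for j in range(len(centers)):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
        # simple relative-improvement stopping rule (illustrative)
        if prev_inertia - inertia <= tol * inertia:
            break
        prev_inertia = inertia
    return centers, labels, inertia, n_iter

rng = np.random.default_rng(0)
# two tight, well-separated blobs
X = np.concatenate([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
centers, labels, inertia, n_iter = lloyd(X, X[[0, -1]].copy())
```

With well-separated blobs and one seed point per blob, the loop converges in a handful of iterations, well under max_iter.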
- Attributes
- cluster_centers_ : np.ndarray [n_clusters, n_features]
A NumPy array with the cluster centers.
- labels_ : da.array [n_samples,]
A dask array with the index position in cluster_centers_ this sample belongs to.
- inertia_ : float
Sum of distances of samples to their closest cluster center.
- n_iter_ : int
Number of EM steps to reach convergence.
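For intuition, the relationship between these attributes can be reproduced with plain NumPy on a tiny made-up dataset (in practice labels_ is a dask array and the values come from fitting; the data and centers below are hypothetical):

```python
import numpy as np

# hypothetical data and fitted centers, for illustration only
X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
cluster_centers_ = np.array([[0.1, 0.0], [5.1, 5.0]])  # (n_clusters, n_features)

# labels_: for each sample, the index into cluster_centers_ of its nearest center
d2 = ((X[:, None, :] - cluster_centers_[None, :, :]) ** 2).sum(axis=2)
labels_ = d2.argmin(axis=1)  # -> array([0, 0, 1, 1])

# inertia_: sum of squared distances of samples to their closest center
inertia_ = d2[np.arange(len(X)), labels_].sum()
```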
Notes
This class implements a parallel and distributed version of k-means.
Initialization with k-means||
The default initializer for KMeans is k-means||, as opposed to k-means++ from scikit-learn. This is the algorithm described in Scalable K-Means++ (2012). k-means|| is designed to work well in a distributed environment: it is a variant of k-means++ that works in parallel, whereas k-means++ is inherently sequential. Currently, the k-means|| implementation here is slower than scikit-learn's k-means++ if your entire dataset fits in memory on a single machine. If that's the case, consider using init='k-means++'.
Parallel Lloyd's Algorithm
Lloyd's algorithm (the default expectation-maximization algorithm used in scikit-learn) is naturally parallelizable. In naive benchmarks, the implementation here achieves 2-3x speedups over scikit-learn.
Both the initialization step and the EM steps make multiple passes over the data. If possible, persist your dask collections in (distributed) memory before running .fit.
References
Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, Sergei Vassilvitskii. "Scalable K-Means++" (2012). https://arxiv.org/abs/1203.6402
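The core idea of k-means|| from the paper above can be sketched briefly: rather than adding one center per pass over the data as k-means++ does, each pass independently samples roughly oversampling_factor * n_clusters candidate points with probability proportional to their squared distance from the current candidate set; the oversampled candidates are later reclustered down to n_clusters final centers (that final reduction is omitted here). A NumPy sketch of the candidate-selection rounds, illustrative only and not dask-ml's implementation:

```python
import numpy as np

def kmeans_parallel_candidates(X, k, oversampling_factor=2, n_rounds=5, seed=0):
    """Sketch of k-means|| candidate selection (Bahmani et al., 2012)."""
    rng = np.random.default_rng(seed)
    # start from one uniformly chosen data point
    candidates = X[rng.integers(len(X))][None, :]
    for _ in range(n_rounds):
        # squared distance from each sample to its nearest current candidate
        d2 = ((X[:, None, :] - candidates[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        cost = d2.sum()
        if cost == 0:
            break
        # pick each point independently with prob ~ (l * k) * d2 / cost,
        # where l is the oversampling factor
        p = np.minimum(1.0, oversampling_factor * k * d2 / cost)
        chosen = rng.random(len(X)) < p
        candidates = np.concatenate([candidates, X[chosen]])
    # in the real algorithm these candidates are reclustered down to k centers
    return candidates

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
cands = kmeans_parallel_candidates(X, k=4)
```

Because each round's distance computation and sampling is row-wise, every round maps cleanly over the chunks of a distributed dataset, which is what makes this initializer distributed-friendly.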
Methods
fit_transform(X[, y]): Fit to data, then transform it.
get_metadata_routing(): Get metadata routing of this object.
get_params([deep]): Get parameters for this estimator.
predict(X): Predict the closest cluster each sample in X belongs to.
set_output(*[, transform]): Set output container.
set_params(**params): Set the parameters of this estimator.
fit
transform
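predict assigns each sample to its closest fitted center. Because that assignment is row-wise, it runs independently per chunk, which is how it parallelizes over a dask array's blocks. A NumPy sketch of blockwise prediction (illustrative; dask-ml maps the equivalent operation over dask array blocks):

```python
import numpy as np

def predict_block(block, centers):
    # nearest-center assignment for one chunk of rows
    d2 = ((block[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
X = np.array([[1.0, 0.5], [9.0, 9.5], [0.2, -0.1], [10.5, 9.8]])

# emulate chunked execution: map over blocks, then concatenate the results
blocks = np.array_split(X, 2)
labels = np.concatenate([predict_block(b, centers) for b in blocks])
print(labels)  # -> [0 1 0 1]
```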
- __init__(n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, n_jobs=1, algorithm='full', init_max_iter=None, n_init='auto')¶