dask_ml.cluster.KMeans

class dask_ml.cluster.KMeans(n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, n_jobs=1, algorithm='full', init_max_iter=None)

Scalable KMeans for clustering.
Parameters:
- n_clusters : int, default 8
  Number of clusters to end up with.
- init : {'k-means||', 'k-means++', or ndarray}
  Method for center initialization, defaults to 'k-means||'.
  'k-means||' : selects the initial cluster centers in a scalable, parallel fashion using the k-means|| algorithm (see Notes).
  'k-means++' : selects the initial cluster centers in a smart way to speed up convergence. Uses scikit-learn's implementation.
  Warning: if using 'k-means++', the entire dataset will be read into memory at once.
  An array of shape (n_clusters, n_features) can be used to give an explicit starting point.
- oversampling_factor : int, default 2
  Oversampling factor for use in the k-means|| algorithm.
- max_iter : int
  Maximum number of EM iterations to attempt.
- init_max_iter : int
  Number of iterations for the initialization step.
- tol : float
  Relative tolerance with regard to inertia to declare convergence.
- algorithm : 'full'
  The algorithm to use for the EM step. Only 'full' (Lloyd's algorithm) is allowed.
- random_state : int, RandomState instance or None, optional, default: None
  If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
Attributes:
- cluster_centers_ : np.ndarray [n_clusters, n_features]
  A NumPy array with the cluster centers.
- labels_ : da.array [n_samples,]
  A dask array with the index position in cluster_centers_ this sample belongs to.
- inertia_ : float
  Sum of distances of samples to their closest cluster center.
- n_iter_ : int
  Number of EM steps to reach convergence.
Notes
This class implements a parallel and distributed version of k-Means.
Initialization with k-means||

The default initializer for KMeans is k-means||, as compared to k-means++ from scikit-learn. This is the algorithm described in Scalable K-Means++ (2012).

k-means|| is designed to work well in a distributed environment. It's a variant of k-means++ that's designed to work in parallel (k-means++ is inherently sequential). Currently, the k-means|| implementation here is slower than scikit-learn's k-means++ if your entire dataset fits in memory on a single machine. If that's the case, consider using init='k-means++'.
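The core idea of k-means|| can be sketched in plain NumPy. This is an illustrative toy, not dask-ml's implementation: each round keeps every point independently with probability proportional to its current cost (that sampling step is embarrassingly parallel, which is why the algorithm distributes well), and the small candidate set is then reduced to n_clusters centers. The function name kmeans_parallel_init and the final reduction (picking the heaviest-weighted candidates rather than running weighted k-means++ on them) are simplifications of my own:

```python
import numpy as np

def kmeans_parallel_init(X, n_clusters, oversampling_factor=2, n_rounds=5, rng=None):
    """Toy k-means|| seeding (not dask-ml's code): sample roughly
    oversampling_factor * n_clusters candidates per round, in parallel,
    then shrink the candidate set down to n_clusters centers."""
    rng = np.random.default_rng(rng)
    # Start from one uniformly chosen point.
    centers = X[rng.integers(len(X))][None, :]
    for _ in range(n_rounds):
        # Squared distance from every point to its nearest current center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        # Keep each point independently with probability proportional to
        # its cost -- this is the step that parallelizes across chunks.
        p = np.minimum(oversampling_factor * n_clusters * d2 / d2.sum(), 1.0)
        new = X[rng.random(len(X)) < p]
        if len(new):
            centers = np.vstack([centers, new])
    # Reduce the (small) candidate set: weight each candidate by how many
    # points it serves; here we simply keep the n_clusters heaviest
    # candidates (the real algorithm runs weighted k-means++ on them).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    weights = np.bincount(d2.argmin(1), minlength=len(centers))
    return centers[np.argsort(weights)[::-1][:n_clusters]]
```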
Parallel Lloyd's Algorithm
Lloyd's Algorithm (the default Expectation Maximization algorithm used in scikit-learn) is naturally parallelizable. In naive benchmarks, the implementation here achieves 2-3x speedups over scikit-learn.
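The reason Lloyd's algorithm parallelizes is that each block of the data only needs to contribute per-cluster sums and counts (the sufficient statistics), which a tiny reduction then combines into new centers. A minimal NumPy sketch of one EM step over chunked data, using a hypothetical lloyd_step helper rather than anything dask-ml exposes:

```python
import numpy as np

def lloyd_step(chunks, centers):
    """One EM step of Lloyd's algorithm over a list of data chunks.
    Each chunk contributes only per-cluster sums and counts, so the
    expensive part parallelizes across chunks; the reduction is tiny."""
    k, d = centers.shape
    sums = np.zeros((k, d))
    counts = np.zeros(k, dtype=int)
    for chunk in chunks:  # in a distributed setting this loop runs in parallel
        # E-step: assign each point in the chunk to its nearest center.
        d2 = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # Per-chunk sufficient statistics.
        for j in range(k):
            mask = labels == j
            sums[j] += chunk[mask].sum(0)
            counts[j] += mask.sum()
    # M-step: new centers from the reduced statistics (keep the old
    # center if a cluster received no points).
    new = centers.copy()
    nonempty = counts > 0
    new[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new
```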
Both the initialization step and the EM steps make multiple passes over the data. If possible, persist your dask collections in (distributed) memory before running .fit.

References
- Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, Sergei Vassilvitskii. Scalable K-Means++ (2012). https://arxiv.org/abs/1203.6402
Methods

- fit(X[, y]) : Compute k-means clustering.
- fit_transform(X[, y]) : Fit to data, then transform it.
- get_params([deep]) : Get parameters for this estimator.
- predict(X) : Predict the closest cluster each sample in X belongs to.
- set_params(**params) : Set the parameters of this estimator.
- transform(X) : Transform X to a cluster-distance space.

-
__init__(n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, n_jobs=1, algorithm='full', init_max_iter=None)

Initialize self. See help(type(self)) for accurate signature.
-
fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
- X : {array-like, sparse matrix, dataframe} of shape (n_samples, n_features)
- y : ndarray of shape (n_samples,), default=None
  Target values.
- **fit_params : dict
  Additional fit parameters.

Returns:
- X_new : ndarray of shape (n_samples, n_features_new)
  Transformed array.
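For KMeans specifically, the transformed space is the cluster-distance space, so n_features_new equals n_clusters: column j holds each sample's distance to cluster center j. A NumPy sketch of these semantics (kmeans_transform is a hypothetical stand-in for illustration, not the estimator's actual method):

```python
import numpy as np

def kmeans_transform(X, centers):
    # Euclidean distance of every sample to every center, giving an
    # output of shape (n_samples, n_clusters).
    return np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1))
```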
-
get_params(deep=True)

Get parameters for this estimator.

Parameters:
- deep : bool, default=True
  If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
- params : mapping of string to any
  Parameter names mapped to their values.
-
predict(X)

Predict the closest cluster each sample in X belongs to. In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters:
- X : array-like, shape = [n_samples, n_features]
  New data to predict.

Returns:
- labels : array, shape [n_samples,]
  Index of the cluster each sample belongs to.
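The code-book lookup described above reduces to a nearest-center search. A NumPy sketch of the semantics (kmeans_predict is a hypothetical stand-in for the method, not dask-ml's code):

```python
import numpy as np

def kmeans_predict(X, cluster_centers_):
    # Squared distance of every sample to every center; each sample's
    # label is the index of its closest "code" in the code book.
    d2 = ((X[:, None, :] - cluster_centers_[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)
```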
-
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters:
- **params : dict
  Estimator parameters.

Returns:
- self : object
  Estimator instance.
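The <component>__<parameter> naming convention can be illustrated with a small, hypothetical parsing helper (not part of the dask-ml or scikit-learn API): the name is split on double underscores, the leading parts name the nested component(s), and the final part is the parameter to set on that component:

```python
def split_param_name(name):
    # e.g. 'cluster__kmeans__n_clusters' -> (['cluster', 'kmeans'], 'n_clusters');
    # a plain name like 'tol' has an empty component path.
    *path, param = name.split("__")
    return path, param
```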