dask_ml.preprocessing.LabelEncoder

class dask_ml.preprocessing.LabelEncoder(use_categorical=True)

Encode labels with value between 0 and n_classes-1.
Note

This differs from the scikit-learn version for Categorical data. When passed a categorical y, this implementation will use the categorical information for the label encoding and transformation. You will receive different answers when

- Your categories are not monotonically increasing
- You have unobserved categories

Specify use_categorical=False to recover the scikit-learn behavior (a short illustration appears at the end of the Examples below).

Parameters:
    use_categorical : bool, default True
        Whether to use the categorical dtype information when y is a dask or pandas Series with a categorical dtype.
Attributes:
    classes_ : array of shape (n_classes,)
        Holds the label for each class.
    dtype_ : Optional[CategoricalDtype]
        For Categorical y, the dtype is stored here.
Examples
LabelEncoder can be used to normalize labels.
>>> from dask_ml import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])  # doctest: +ELLIPSIS
array([0, 0, 1, 2]...)
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])
It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])  # doctest: +ELLIPSIS
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
When using Dask, we strongly recommend using a Categorical dask Series if possible. This avoids a (potentially expensive) scan of the values and enables a faster transform algorithm.
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> data = dd.from_pandas(pd.Series(['a', 'a', 'b'], dtype='category'),
...                       npartitions=2)
>>> le.fit_transform(data)
dask.array<values, shape=(nan,), dtype=int8, chunksize=(nan,)>
>>> le.fit_transform(data).compute()
array([0, 0, 1], dtype=int8)
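To make the Note above concrete, here is a minimal sketch of how a declared-but-unobserved category changes the fitted classes; the exact repr of classes_ may vary between versions, so the comments only describe the expected behavior.

import pandas as pd
from dask_ml import preprocessing

# 'c' is declared by the categorical dtype but never appears in the data.
y = pd.Series(['a', 'b', 'a'], dtype=pd.CategoricalDtype(['a', 'b', 'c']))

le = preprocessing.LabelEncoder()             # use_categorical=True (default)
print(le.fit(y).classes_)                     # uses the dtype's categories, so 'c' is kept

le_sk = preprocessing.LabelEncoder(use_categorical=False)
print(le_sk.fit(y).classes_)                  # only the observed values, matching scikit-learn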
Methods

fit(self, y)
    Fit label encoder.
fit_transform(self, y)
    Fit label encoder and return encoded labels.
get_params(self[, deep])
    Get parameters for this estimator.
inverse_transform(self, y)
    Transform labels back to original encoding.
set_params(self, **params)
    Set the parameters of this estimator.
transform(self, y)
    Transform labels to normalized encoding.
__init__(self, use_categorical=True)

Initialize self. See help(type(self)) for accurate signature.
fit(self, y)

Fit label encoder.

Parameters:
    y : array-like of shape (n_samples,)
        Target values.

Returns:
    self : returns an instance of self.
fit_transform(self, y)

Fit label encoder and return encoded labels.

Parameters:
    y : array-like of shape (n_samples,)
        Target values.

Returns:
    y : array-like of shape (n_samples,)
        Encoded labels.
get_params(self, deep=True)

Get parameters for this estimator.

Parameters:
    deep : boolean, optional
        If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
    params : mapping of string to any
        Parameter names mapped to their values.
inverse_transform(self, y)

Transform labels back to original encoding.

Parameters:
    y : numpy array of shape (n_samples,)
        Target values.

Returns:
    y : numpy array of shape (n_samples,)
        Original labels.
set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
    self
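As an illustrative sketch (LabelEncoder itself has only the single use_categorical parameter, so the nested <component>__<parameter> form does not apply here), set_params pairs with get_params as follows:

from dask_ml.preprocessing import LabelEncoder

le = LabelEncoder()
le.set_params(use_categorical=False)   # returns the estimator itself, so calls can be chained
print(le.get_params())                 # {'use_categorical': False}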
transform(self, y)

Transform labels to normalized encoding.

Parameters:
    y : array-like of shape (n_samples,)
        Target values.

Returns:
    y : array-like of shape (n_samples,)
        Encoded labels.