dask_ml.preprocessing.LabelEncoder

class dask_ml.preprocessing.LabelEncoder(use_categorical=True)

Encode labels with values between 0 and n_classes-1.

Note

This differs from the scikit-learn version for Categorical data. When passed a categorical y, this implementation uses the categorical dtype information for the label encoding and transformation. The results will differ from scikit-learn's when

  1. Your categories are not monotonically increasing
  2. You have unobserved categories

Specify use_categorical=False to recover the scikit-learn behavior.
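To see why the two cases above matter, here is a small sketch using plain pandas (not dask_ml itself): a categorical dtype assigns codes in the dtype's category order, including unobserved categories, whereas the scikit-learn behavior encodes the sorted observed values. The category order and data below are hypothetical, chosen to trigger both differences at once.

```python
import pandas as pd

# The category order is deliberately not sorted, and "c" is unobserved.
dtype = pd.CategoricalDtype(categories=["b", "a", "c"])
y = pd.Series(["a", "a", "b"], dtype=dtype)

# Categorical path: codes follow the dtype's category order (b=0, a=1, c=2).
print(y.cat.codes.tolist())  # [1, 1, 0]

# scikit-learn-style path: classes are the sorted observed values (a=0, b=1).
classes = sorted(y.unique())
print([classes.index(v) for v in y])  # [0, 0, 1]
```

The same input thus encodes differently depending on whether the dtype's categories or the sorted observed values define the classes.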

Parameters:
use_categorical : bool, default True

Whether to use the categorical dtype information when y is a dask or pandas Series with a categorical dtype.

Attributes:
classes_ : array of shape (n_classes,)

Holds the label for each class.

dtype_ : Optional CategoricalDtype

For Categorical y, the dtype is stored here.

Examples

LabelEncoder can be used to normalize labels.

>>> from dask_ml import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6]) #doctest: +ELLIPSIS
array([0, 0, 1, 2]...)
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"]) #doctest: +ELLIPSIS
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

When using Dask, we strongly recommend using a Categorical dask Series if possible. This avoids a (potentially expensive) scan of the values and enables a faster transform algorithm.

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> data = dd.from_pandas(pd.Series(['a', 'a', 'b'], dtype='category'),
...                       npartitions=2)
>>> le.fit_transform(data)
dask.array<values, shape=(nan,), dtype=int8, chunksize=(nan,)>
>>> le.fit_transform(data).compute()
array([0, 0, 1], dtype=int8)

Methods

fit(y) Fit label encoder.
fit_transform(y) Fit label encoder and return encoded labels.
get_params([deep]) Get parameters for this estimator.
inverse_transform(y) Transform labels back to original encoding.
set_params(**params) Set the parameters of this estimator.
transform(y) Transform labels to normalized encoding.
__init__(use_categorical=True)

Initialize self. See help(type(self)) for accurate signature.

fit(y)

Fit label encoder.

Parameters:
y : array-like of shape (n_samples,)

Target values.

Returns:
self : returns an instance of self.

fit_transform(y)

Fit label encoder and return encoded labels.

Parameters:
y : array-like of shape (n_samples,)

Target values.

Returns:
y : array-like of shape (n_samples,)

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

inverse_transform(y)

Transform labels back to original encoding.

Parameters:
y : numpy array of shape (n_samples,)

Target values.

Returns:
y : numpy array of shape (n_samples,)

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self

transform(y)

Transform labels to normalized encoding.

Parameters:
y : array-like of shape (n_samples,)

Target values.

Returns:
y : array-like of shape (n_samples,)