dask_ml.preprocessing.LabelEncoder

class dask_ml.preprocessing.LabelEncoder(use_categorical: bool = True)

Encode labels as integer values between 0 and n_classes-1.

Note

This implementation differs from the scikit-learn version for categorical data. When passed a categorical y, it uses the categorical dtype information for label encoding and transformation. The results differ from scikit-learn's when:

  1. Your categories are not monotonically increasing

  2. You have unobserved categories

Specify use_categorical=False to recover the scikit-learn behavior.
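
The two cases above can be illustrated without the estimator itself. The following is a minimal sketch that uses only pandas and NumPy to mimic the two strategies: codes taken from the categorical dtype versus codes derived from the sorted unique observed values. It does not call LabelEncoder, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Categories declared in non-sorted order, with "c" never observed.
dtype = pd.CategoricalDtype(categories=["c", "b", "a"])
y = pd.Series(["b", "a", "b"], dtype=dtype)

# Category-based encoding: classes and codes follow the declared categories.
categorical_classes = np.asarray(y.cat.categories)
categorical_codes = y.cat.codes.to_numpy()

# Value-based encoding: classes are the sorted unique observed values.
sorted_classes, sorted_codes = np.unique(y.to_numpy(), return_inverse=True)

print(list(categorical_classes), categorical_codes.tolist())
print(list(sorted_classes), sorted_codes.tolist())
```

The category-based classes keep the declared order and include the unobserved "c", so the same label can map to a different integer than under the value-based encoding.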

Parameters
use_categorical : bool, default True

Whether to use the categorical dtype information when y is a dask or pandas Series with a categorical dtype.

Attributes
classes_ : array of shape (n_class,)

Holds the label for each class.

dtype_ : Optional CategoricalDtype

For Categorical y, the dtype is stored here.

Examples

LabelEncoder can be used to normalize labels.

>>> from dask_ml import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6]) 
array([0, 0, 1, 2]...)
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"]) 
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

When using Dask, we strongly recommend using a Categorical dask Series if possible. This avoids a (potentially expensive) scan of the values and enables a faster transform algorithm.

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> data = dd.from_pandas(pd.Series(['a', 'a', 'b'], dtype='category'),
...                       npartitions=2)
>>> le.fit_transform(data)
dask.array<values, shape=(nan,), dtype=int8, chunksize=(nan,)>
>>> le.fit_transform(data).compute()
array([0, 0, 1], dtype=int8)

Methods

fit(y)

Fit label encoder.

fit_transform(y)

Fit label encoder and return encoded labels.

get_params([deep])

Get parameters for this estimator.

inverse_transform(y)

Transform labels back to original encoding.

set_params(**params)

Set the parameters of this estimator.

transform(y)

Transform labels to normalized encoding.

__init__(use_categorical: bool = True)