API Reference

This page lists all of the estimators and top-level functions in dask_ml. Unless otherwise noted, the estimators implemented in dask-ml are appropriate for parallel and distributed training.

dask_ml.model_selection: Model Selection

Utilities for hyperparameter optimization.

These estimators will operate in parallel. Their scalability depends on the underlying estimators being used.

Dask-ML has a few cross validation utilities.

model_selection.train_test_split(*arrays, …) Split arrays into random train and test matrices.

model_selection.train_test_split() is a simple helper that uses model_selection.ShuffleSplit internally.

model_selection.ShuffleSplit([n_splits, …]) Random permutation cross-validator.
model_selection.KFold([n_splits, shuffle, …]) K-Folds cross-validator
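
For example, a minimal sketch of splitting Dask arrays with train_test_split (the array sizes and chunking below are purely illustrative):

import dask.array as da
from dask_ml.model_selection import train_test_split

# Random data stands in for a real dataset; chunk sizes are illustrative.
X = da.random.random((1000, 4), chunks=100)
y = da.random.randint(0, 2, size=(1000,), chunks=100)

# Returns lazy dask arrays; nothing is computed until you ask for it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)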

Dask-ML provides drop-in replacements for grid and randomized search. These are appropriate for datasets where the CV splits fit in memory.

model_selection.GridSearchCV(estimator, …) Exhaustive search over specified parameter values for an estimator.
model_selection.RandomizedSearchCV(…[, …]) Randomized search on hyper parameters.
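
As a rough sketch, the drop-in GridSearchCV can wrap any scikit-learn estimator; the estimator and parameter grid below are illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from dask_ml.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)
param_grid = {"alpha": [1e-4, 1e-3, 1e-2], "penalty": ["l1", "l2"]}

# The search is expressed as a Dask graph, so candidate fits run in parallel.
search = GridSearchCV(SGDClassifier(max_iter=1000, tol=1e-3), param_grid)
search.fit(X, y)
print(search.best_params_)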

For hyperparameter optimization on larger-than-memory datasets, Dask-ML provides the following:

model_selection.IncrementalSearchCV(…[, …]) Incrementally search for hyper-parameters on models that support partial_fit.
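
A hedged sketch of IncrementalSearchCV with an estimator implementing partial_fit; it assumes a dask.distributed Client is available, and the data and parameter ranges are illustrative:

import numpy as np
import dask.array as da
from dask.distributed import Client
from sklearn.linear_model import SGDClassifier
from dask_ml.model_selection import IncrementalSearchCV

client = Client()  # IncrementalSearchCV relies on the distributed scheduler

X = da.random.random((10000, 10), chunks=1000)
y = da.random.randint(0, 2, size=(10000,), chunks=1000)

params = {"alpha": np.logspace(-5, -1, 5)}
search = IncrementalSearchCV(SGDClassifier(tol=1e-3), params, n_initial_parameters=5)
search.fit(X, y, classes=[0, 1])  # classes is forwarded to partial_fit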

dask_ml.linear_model: Generalized Linear Models

The dask_ml.linear_model module implements linear models for classification and regression.

linear_model.LinearRegression([penalty, …]) Estimator for linear regression.
linear_model.LogisticRegression([penalty, …]) Estimator for logistic regression.
linear_model.PoissonRegression([penalty, …]) Estimator for Poisson regression.
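
A minimal sketch of fitting one of these estimators on Dask arrays (the random data below is purely illustrative):

import dask.array as da
from dask_ml.linear_model import LogisticRegression

X = da.random.random((10000, 5), chunks=1000)
y = (da.random.random(10000, chunks=1000) > 0.5).astype(int)

lr = LogisticRegression()
lr.fit(X, y)       # training is distributed across the chunks of X and y
print(lr.coef_)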

dask_ml.wrappers: Meta-Estimators

Dask-ML provides meta-estimators that wrap regular estimators following the scikit-learn API. These meta-estimators make the underlying estimator work well with Dask Arrays or DataFrames (see the sketch following the list).

wrappers.ParallelPostFit([estimator, scoring]) Meta-estimator for parallel predict and transform.
wrappers.Incremental([estimator, scoring, …]) Meta-estimator for feeding Dask Arrays to an estimator blockwise.
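
For instance, ParallelPostFit trains on a small in-memory dataset and then predicts blockwise over a large Dask array; the data below is illustrative (Incremental is used similarly, but calls partial_fit on each block during training):

import dask.array as da
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from dask_ml.wrappers import ParallelPostFit

X, y = make_classification(n_samples=1000, random_state=0)  # 20 features

clf = ParallelPostFit(LogisticRegression())
clf.fit(X, y)  # training happens on the small in-memory data

X_big = da.random.random((100000, 20), chunks=10000)
predictions = clf.predict(X_big)  # lazy dask array, evaluated blockwise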

dask_ml.cluster: Clustering

Unsupervised Clustering Algorithms

cluster.KMeans([n_clusters, init, …]) Scalable KMeans for clustering
cluster.SpectralClustering([n_clusters, …]) Apply parallel Spectral Clustering
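
A short sketch of KMeans on a Dask array (the cluster count and data are illustrative):

import dask.array as da
from dask_ml.cluster import KMeans

X = da.random.random((10000, 3), chunks=1000)

km = KMeans(n_clusters=4)
km.fit(X)
labels = km.labels_              # dask array of cluster assignments
centers = km.cluster_centers_    # numpy array of shape (4, 3)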

dask_ml.decomposition: Matrix Decomposition

decomposition.PCA([n_components, copy, …]) Principal component analysis (PCA)
decomposition.TruncatedSVD([n_components, …])
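
A minimal sketch of PCA on a tall-and-skinny Dask array (shapes are illustrative; note the chunking is along rows only):

import dask.array as da
from dask_ml.decomposition import PCA

X = da.random.random((10000, 20), chunks=(1000, 20))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)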

dask_ml.preprocessing: Preprocessing Data

Utilities for preprocessing data.

class dask_ml.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)

Standardize features by removing the mean and scaling to unit variance

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using the transform method.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.

Read more in the User Guide.

Parameters:
copy : boolean, optional, default True

If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

with_mean : boolean, True by default

If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

with_std : boolean, True by default

If True, scale the data to unit variance (or equivalently, unit standard deviation).

Attributes:
scale_ : ndarray or None, shape (n_features,)

Per feature relative scaling of the data. Equal to None when with_std=False.

New in version 0.17: scale_

mean_ : ndarray or None, shape (n_features,)

The mean value for each feature in the training set. Equal to None when with_mean=False.

var_ : ndarray or None, shape (n_features,)

The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.

n_samples_seen_ : int or array, shape (n_features,)

The number of samples processed by the estimator for each feature. If there are no missing samples, n_samples_seen_ will be an integer, otherwise it will be an array. Will be reset on new calls to fit, but increments across partial_fit calls.

See also

scale
Equivalent function without the estimator API.
sklearn.decomposition.PCA
Further removes the linear correlation across features with ‘whiten=True’.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import StandardScaler
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[3. 3.]]
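
The example above uses the scikit-learn import with in-memory data; dask_ml.preprocessing.StandardScaler follows the same pattern on Dask collections. A minimal sketch with an illustrative random array:

import dask.array as da
from dask_ml.preprocessing import StandardScaler

X = da.random.random((10000, 2), chunks=1000)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # transform returns a lazy dask array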

Methods

fit(X[, y]) Compute the mean and std to be used for later scaling.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
inverse_transform(X[, copy]) Scale back the data to the original representation
partial_fit(X[, y]) Online computation of mean and std on X for later scaling.
set_params(**params) Set the parameters of this estimator.
transform(X[, y, copy]) Perform standardization by centering and scaling
fit(X, y=None)

Compute the mean and std to be used for later scaling.

Parameters:
X : {array-like, sparse matrix}, shape [n_samples, n_features]

The data used to compute the mean and standard deviation used for later scaling along the features axis.

y

Ignored

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

inverse_transform(X, copy=None)

Scale back the data to the original representation

Parameters:
X : array-like, shape [n_samples, n_features]

The data used to scale along the features axis.

copy : bool, optional (default: None)

Copy the input X or not.

Returns:
X_tr : array-like, shape [n_samples, n_features]

Transformed array.

partial_fit(X, y=None)

Online computation of mean and std on X for later scaling. All of X is processed as a single batch. This is intended for cases when fit is not feasible due to a very large number of n_samples or because X is read from a continuous stream.

The algorithm for incremental mean and std is given in Equation 1.5a,b in Chan, Tony F., Gene H. Golub, and Randall J. LeVeque. “Algorithms for computing the sample variance: Analysis and recommendations.” The American Statistician 37.3 (1983): 242-247:

Parameters:
X : {array-like, sparse matrix}, shape [n_samples, n_features]

The data used to compute the mean and standard deviation used for later scaling along the features axis.

y

Ignored

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
transform(X, y=None, copy=None)

Perform standardization by centering and scaling

Parameters:
X : array-like, shape [n_samples, n_features]

The data used to scale along the features axis.

y : (ignored)

Deprecated since version 0.19: This parameter will be removed in 0.21.

copy : bool, optional (default: None)

Copy the input X or not.

class dask_ml.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)

Transforms features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, i.e. between zero and one.

The transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

where min, max = feature_range.

This transformation is often used as an alternative to zero mean, unit variance scaling.

Read more in the User Guide.

Parameters:
feature_range : tuple (min, max), default=(0, 1)

Desired range of transformed data.

copy : boolean, optional, default True

Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array).

Attributes:
min_ : ndarray, shape (n_features,)

Per feature adjustment for minimum.

scale_ : ndarray, shape (n_features,)

Per feature relative scaling of the data.

New in version 0.17: scale_ attribute.

data_min_ : ndarray, shape (n_features,)

Per feature minimum seen in the data

New in version 0.17: data_min_

data_max_ : ndarray, shape (n_features,)

Per feature maximum seen in the data

New in version 0.17: data_max_

data_range_ : ndarray, shape (n_features,)

Per feature range (data_max_ - data_min_) seen in the data

New in version 0.17: data_range_

See also

minmax_scale
Equivalent function without the estimator API.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import MinMaxScaler
>>>
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler(copy=True, feature_range=(0, 1))
>>> print(scaler.data_max_)
[ 1. 18.]
>>> print(scaler.transform(data))
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
>>> print(scaler.transform([[2, 2]]))
[[1.5 0. ]]
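
As with the other scalers, the dask-ml version accepts Dask collections; a brief sketch with illustrative data:

import dask.array as da
from dask_ml.preprocessing import MinMaxScaler

X = da.random.random((1000, 3), chunks=100)

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
X_back = scaler.inverse_transform(X_scaled)  # undo the scaling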

Methods

fit(X[, y]) Compute the minimum and maximum to be used for later scaling.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
inverse_transform(X[, y, copy]) Undo the scaling of X according to feature_range.
partial_fit(X[, y]) Online computation of min and max on X for later scaling.
set_params(**params) Set the parameters of this estimator.
transform(X[, y, copy]) Scaling features of X according to feature_range.
fit(X, y=None)

Compute the minimum and maximum to be used for later scaling.

Parameters:
X : array-like, shape [n_samples, n_features]

The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

inverse_transform(X, y=None, copy=None)

Undo the scaling of X according to feature_range.

Parameters:
X : array-like, shape [n_samples, n_features]

Input data that will be transformed. It cannot be sparse.

partial_fit(X, y=None)

Online computation of min and max on X for later scaling. All of X is processed as a single batch. This is intended for cases when fit is not feasible due to a very large number of n_samples or because X is read from a continuous stream.

Parameters:
X : array-like, shape [n_samples, n_features]

The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.

y

Ignored

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
transform(X, y=None, copy=None)

Scaling features of X according to feature_range.

Parameters:
X : array-like, shape [n_samples, n_features]

Input data that will be transformed.

class dask_ml.preprocessing.RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)

Scale features using statistics that are robust to outliers.

This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform method.

Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.

New in version 0.17.

Read more in the User Guide.

Parameters:
with_centering : boolean, True by default

If True, center the data before scaling. This will cause transform to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

with_scaling : boolean, True by default

If True, scale the data to interquartile range.

quantile_range : tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0

Default: (25.0, 75.0) = (1st quantile, 3rd quantile) = IQR. Quantile range used to calculate scale_.

New in version 0.18.

copy : boolean, optional, default is True

If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

Attributes:
center_ : array of floats

The median value for each feature in the training set.

scale_ : array of floats

The (scaled) interquartile range for each feature in the training set.

New in version 0.17: scale_ attribute.

See also

robust_scale
Equivalent function without the estimator API.
sklearn.decomposition.PCA
Further removes the linear correlation across features with ‘whiten=True’.

Notes

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

https://en.wikipedia.org/wiki/Median https://en.wikipedia.org/wiki/Interquartile_range

Examples

>>> from sklearn.preprocessing import RobustScaler
>>> X = [[ 1., -2.,  2.],
...      [ -2.,  1.,  3.],
...      [ 4.,  1., -2.]]
>>> transformer = RobustScaler().fit(X)
>>> transformer
RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)
>>> transformer.transform(X)
array([[ 0. , -2. ,  0. ],
       [-1. ,  0. ,  0.4],
       [ 1. ,  0. , -1.6]])
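
A hedged sketch of the dask-ml RobustScaler on a Dask array (random data, illustrative quantile range):

import dask.array as da
from dask_ml.preprocessing import RobustScaler

X = da.random.random((1000, 3), chunks=100)

scaler = RobustScaler(quantile_range=(25.0, 75.0))
X_scaled = scaler.fit_transform(X)  # median and IQR are estimated per feature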

Methods

fit(X[, y]) Compute the median and quantiles to be used for scaling.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
inverse_transform(X) Scale back the data to the original representation
set_params(**params) Set the parameters of this estimator.
transform(X) Center and scale the data.
fit(X, y=None)

Compute the median and quantiles to be used for scaling.

Parameters:
X : array-like, shape [n_samples, n_features]

The data used to compute the median and quantiles used for later scaling along the features axis.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

inverse_transform(X)

Scale back the data to the original representation

Parameters:
X : array-like

The data used to scale along the specified axis.

This implementation was copied and modified from Scikit-Learn.
See License information here:
https://github.com/scikit-learn/scikit-learn/blob/master/README.rst
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
transform(X)

Center and scale the data.

Can be called on sparse input, provided that RobustScaler has been fitted to dense input and with_centering=False.

Parameters:
X : {array-like, sparse matrix}

The data used to scale along the specified axis.

This implementation was copied and modified from Scikit-Learn.
See License information here:
https://github.com/scikit-learn/scikit-learn/blob/master/README.rst
class dask_ml.preprocessing.QuantileTransformer(n_quantiles=1000, output_distribution='uniform', ignore_implicit_zeros=False, subsample=100000, random_state=None, copy=True)

Transforms features using quantile information.

This implementation differs from the scikit-learn implementation by using approximate quantiles. The scikit-learn docstring follows.

This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.

The transformation is applied on each feature independently. The cumulative distribution function of a feature is used to project the original values. Feature values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.

Read more in the User Guide.

Parameters:
n_quantiles : int, optional (default=1000)

Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function.

output_distribution : str, optional (default=’uniform’)

Marginal distribution for the transformed data. The choices are ‘uniform’ (default) or ‘normal’.

ignore_implicit_zeros : bool, optional (default=False)

Only applies to sparse matrices. If True, the sparse entries of the matrix are discarded to compute the quantile statistics. If False, these entries are treated as zeros.

subsample : int, optional (default=1e5)

Maximum number of samples used to estimate the quantiles for computational efficiency. Note that the subsampling procedure may differ for value-identical sparse and dense matrices.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Note that this is used by subsampling and smoothing noise.

copy : boolean, optional, (default=True)

Set to False to perform inplace transformation and avoid a copy (if the input is already a numpy array).

Attributes:
quantiles_ : ndarray, shape (n_quantiles, n_features)

The values corresponding to the quantiles of reference.

references_ : ndarray, shape (n_quantiles,)

Quantiles of references.

See also

quantile_transform
Equivalent function without the estimator API.
PowerTransformer
Perform mapping to a normal distribution using a power transform.
StandardScaler
Perform standardization that is faster, but less robust to outliers.
RobustScaler
Perform robust standardization that removes the influence of outliers but does not put outliers and inliers on the same scale.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> import numpy as np
>>> from sklearn.preprocessing import QuantileTransformer
>>> rng = np.random.RandomState(0)
>>> X = np.sort(rng.normal(loc=0.5, scale=0.25, size=(25, 1)), axis=0)
>>> qt = QuantileTransformer(n_quantiles=10, random_state=0)
>>> qt.fit_transform(X) # doctest: +ELLIPSIS
array([...])
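
Since this implementation uses approximate quantiles, a Dask array can be transformed directly; a small sketch with illustrative data:

import dask.array as da
from dask_ml.preprocessing import QuantileTransformer

X = da.random.random((10000, 2), chunks=1000)

qt = QuantileTransformer(n_quantiles=100, output_distribution='uniform')
X_trn = qt.fit_transform(X)  # lazy dask array with approximately uniform marginals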

Methods

fit(X[, y]) Compute the quantiles used for transforming.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
inverse_transform(X) Back-projection to the original space.
set_params(**params) Set the parameters of this estimator.
transform(X) Feature-wise transformation of the data.
fit(X, y=None)

Compute the quantiles used for transforming.

Parameters:
X : ndarray or sparse matrix, shape (n_samples, n_features)

The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.

Returns:
self : object
fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

inverse_transform(X)

Back-projection to the original space.

Parameters:
X : ndarray or sparse matrix, shape (n_samples, n_features)

The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.

Returns:
Xt : ndarray or sparse matrix, shape (n_samples, n_features)

The projected data.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
transform(X)

Feature-wise transformation of the data.

Parameters:
X : ndarray or sparse matrix, shape (n_samples, n_features)

The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.

Returns:
Xt : ndarray or sparse matrix, shape (n_samples, n_features)

The projected data.

class dask_ml.preprocessing.Categorizer(categories=None, columns=None)

Transform columns of a DataFrame to categorical dtype.

This is a useful pre-processing step for dummy, one-hot, or categorical encoding.

Parameters:
categories : mapping, optional

A dictionary mapping column name to instances of pandas.api.types.CategoricalDtype. Alternatively, a mapping of column name to (categories, ordered) tuples.

columns : sequence, optional

A sequence of column names to limit the categorization to. This argument is ignored when categories is specified.

Attributes:
columns_ : pandas.Index

The columns that were categorized. Useful when categories is None, in which case the categorical and object columns are detected automatically.

categories_ : dict

A dictionary mapping column names to dtypes. For pandas>=0.21.0, the values are instances of pandas.api.types.CategoricalDtype. For older pandas, the values are tuples of (categories, ordered).

Notes

This transformer only applies to dask.DataFrame and pandas.DataFrame. By default, all object-type columns are converted to categoricals. The set of categories will be the values present in the column and the categoricals will be unordered. Pass dtypes to control this behavior.

All other columns are included in the transformed output untouched.

For dask.DataFrame, any unknown categoricals will become known.

Examples

>>> import pandas as pd
>>> from dask_ml.preprocessing import Categorizer
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']})
>>> ce = Categorizer()
>>> ce.fit_transform(df).dtypes
A       int64
B    category
dtype: object
>>> ce.categories_
{'B': CategoricalDtype(categories=['a', 'b'], ordered=False)}

Using CategoricalDtypes for specifying the categories:

>>> from pandas.api.types import CategoricalDtype
>>> ce = Categorizer(categories={"B": CategoricalDtype(['a', 'b', 'c'])})
>>> ce.fit_transform(df).B.dtype
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
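
A brief sketch with a Dask DataFrame, where unknown categoricals become known after fitting (the frame below is illustrative):

import pandas as pd
import dask.dataframe as dd
from dask_ml.preprocessing import Categorizer

df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']})
ddf = dd.from_pandas(df, npartitions=2)

ce = Categorizer()
transformed = ce.fit_transform(ddf)  # 'B' is now a known categorical column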

Methods

fit(X[, y]) Find the categorical columns.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X[, y]) Transform the columns in X according to self.categories_.
fit(X, y=None)

Find the categorical columns.

Parameters:
X : pandas.DataFrame or dask.DataFrame
y : ignored
Returns:
self
fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
transform(X, y=None)

Transform the columns in X according to self.categories_.

Parameters:
X : pandas.DataFrame or dask.DataFrame
y : ignored
Returns:
X_trn : pandas.DataFrame or dask.DataFrame

Same type as the input. The columns in self.categories_ will be converted to categorical dtype.

class dask_ml.preprocessing.DummyEncoder(columns=None, drop_first=False)

Dummy (one-hot) encode categorical columns.

Parameters:
columns : sequence, optional

The columns to dummy encode. Must be categorical dtype. Dummy encodes all categorical dtype columns by default.

drop_first : bool, default False

Whether to drop the first category in each column.

Attributes:
columns_ : Index

The columns in the training data before dummy encoding

transformed_columns_ : Index

The columns in the training data after dummy encoding

categorical_columns_ : Index

The categorical columns in the training data

noncategorical_columns_ : Index

The rest of the columns in the training data

categorical_blocks_ : dict

Mapping from column names to slice objects. The slices represent the positions in the transformed array that the categorical column ends up at

dtypes_ : dict

Dictionary mapping column name to either

  • instances of CategoricalDtype (pandas >= 0.21.0)
  • tuples of (categories, ordered)

Notes

This transformer only applies to dask and pandas DataFrames. For dask DataFrames, all of your categoricals should be known.

The inverse transformation can be used on a dataframe or array.

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask_ml.preprocessing import DummyEncoder
>>> data = pd.DataFrame({"A": [1, 2, 3, 4],
...                      "B": pd.Categorical(['a', 'a', 'a', 'b'])})
>>> de = DummyEncoder()
>>> trn = de.fit_transform(data)
>>> trn
   A  B_a  B_b
0  1    1    0
1  2    1    0
2  3    1    0
3  4    0    1
>>> de.columns_
Index(['A', 'B'], dtype='object')
>>> de.non_categorical_columns_
Index(['A'], dtype='object')
>>> de.categorical_columns_
Index(['B'], dtype='object')
>>> de.dtypes_
{'B': CategoricalDtype(categories=['a', 'b'], ordered=False)}
>>> de.categorical_blocks_
{'B': slice(1, 3, None)}
>>> de.fit_transform(dd.from_pandas(data, 2))
Dask DataFrame Structure:
                A    B_a    B_b
npartitions=2
0              int64  uint8  uint8
2                ...    ...    ...
3                ...    ...    ...
Dask Name: get_dummies, 4 tasks

Methods

fit(X[, y]) Determine the categorical columns to be dummy encoded.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
inverse_transform(X) Inverse dummy-encode the columns in X
set_params(**params) Set the parameters of this estimator.
transform(X[, y]) Dummy encode the categorical columns in X
fit(X, y=None)

Determine the categorical columns to be dummy encoded.

Parameters:
X : pandas.DataFrame or dask.dataframe.DataFrame
y : ignored
Returns:
self
fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

inverse_transform(X)

Inverse dummy-encode the columns in X

Parameters:
X : array or dataframe

Either the NumPy, dask, or pandas version

Returns:
data : DataFrame

Dask array or dataframe will return a Dask DataFrame. Numpy array or pandas dataframe will return a pandas DataFrame

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
transform(X, y=None)

Dummy encode the categorical columns in X

Parameters:
X : pd.DataFrame or dd.DataFrame
y : ignored
Returns:
transformed : pd.DataFrame or dd.DataFrame

Same type as the input

class dask_ml.preprocessing.OrdinalEncoder(columns=None)

Ordinal (integer) encode categorical columns.

Parameters:
columns : sequence, optional

The columns to encode. Must be categorical dtype. Encodes all categorical dtype columns by default.

Attributes:
columns_ : Index

The columns in the training data before/after encoding

categorical_columns_ : Index

The categorical columns in the training data

noncategorical_columns_ : Index

The rest of the columns in the training data

dtypes_ : dict

Dictionary mapping column name to either

  • instances of CategoricalDtype (pandas >= 0.21.0)
  • tuples of (categories, ordered)

Notes

This transformer only applies to dask and pandas DataFrames. For dask DataFrames, all of your categoricals should be known.

The inverse transformation can be used on a dataframe or array.

Examples

>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask_ml.preprocessing import OrdinalEncoder
>>> data = pd.DataFrame({"A": [1, 2, 3, 4],
...                      "B": pd.Categorical(['a', 'a', 'a', 'b'])})
>>> enc = OrdinalEncoder()
>>> trn = enc.fit_transform(data)
>>> trn
   A  B
0  1  0
1  2  0
2  3  0
3  4  1
>>> enc.columns_
Index(['A', 'B'], dtype='object')
>>> enc.non_categorical_columns_
Index(['A'], dtype='object')
>>> enc.categorical_columns_
Index(['B'], dtype='object')
>>> enc.dtypes_
{'B': CategoricalDtype(categories=['a', 'b'], ordered=False)}
>>> enc.fit_transform(dd.from_pandas(data, 2))
Dask DataFrame Structure:
                   A     B
npartitions=2
0              int64  int8
2                ...   ...
3                ...   ...
Dask Name: assign, 8 tasks

Methods

fit(X[, y]) Determine the categorical columns to be encoded.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
inverse_transform(X) Inverse ordinal-encode the columns in X
set_params(**params) Set the parameters of this estimator.
transform(X[, y]) Ordinal encode the categorical columns in X
fit(X, y=None)

Determine the categorical columns to be encoded.

Parameters:
X : pandas.DataFrame or dask.dataframe.DataFrame
y : ignored
Returns:
self
fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

inverse_transform(X)

Inverse ordinal-encode the columns in X

Parameters:
X : array or dataframe

Either the NumPy, dask, or pandas version

Returns:
data : DataFrame

Dask array or dataframe will return a Dask DataFrame. Numpy array or pandas dataframe will return a pandas DataFrame

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
transform(X, y=None)

Ordinal encode the categorical columns in X

Parameters:
X : pd.DataFrame or dd.DataFrame
y : ignored
Returns:
transformed : pd.DataFrame or dd.DataFrame

Same type as the input

class dask_ml.preprocessing.LabelEncoder(use_categorical=True)

Encode labels with value between 0 and n_classes-1.

Note

This differs from the scikit-learn version for Categorical data. When passed a categorical y, this implementation will use the categorical information for the label encoding and transformation. You will receive different answers when:

  1. Your categories are not monotonically increasing
  2. You have unobserved categories

Specify use_categorical=False to recover the scikit-learn behavior.

Parameters:
use_categorical : bool, default True

Whether to use the categorical dtype information when y is a dask or pandas Series with a categorical dtype.

Attributes:
classes_ : array of shape (n_class,)

Holds the label for each class.

dtype_ : Optional CategoricalDtype

For Categorical y, the dtype is stored here.

Examples

LabelEncoder can be used to normalize labels.

>>> from dask_ml import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6]) #doctest: +ELLIPSIS
array([0, 0, 1, 2]...)
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

It can also be used to transform non-numerical labels (as long as they are hashable and comparable) to numerical labels.

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"]) #doctest: +ELLIPSIS
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

When using Dask, we strongly recommend using a Categorical dask Series if possible. This avoids a (potentially expensive) scan of the values and enables a faster transform algorithm.

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> data = dd.from_pandas(pd.Series(['a', 'a', 'b'], dtype='category'),
...                       npartitions=2)
>>> le.fit_transform(data)
dask.array<values, shape=(nan,), dtype=int8, chunksize=(nan,)>
>>> le.fit_transform(data).compute()
array([0, 0, 1], dtype=int8)

Methods

fit(y) Fit label encoder
fit_transform(y) Fit label encoder and return encoded labels
get_params([deep]) Get parameters for this estimator.
inverse_transform(y) Transform labels back to original encoding.
set_params(**params) Set the parameters of this estimator.
transform(y) Transform labels to normalized encoding.
fit(y)

Fit label encoder

Parameters:
y : array-like of shape (n_samples,)

Target values.

Returns:
self : returns an instance of self.
fit_transform(y)

Fit label encoder and return encoded labels

Parameters:
y : array-like of shape [n_samples]

Target values.

Returns:
y : array-like of shape [n_samples]
get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

inverse_transform(y)

Transform labels back to original encoding.

Parameters:
y : numpy array of shape [n_samples]

Target values.

Returns:
y : numpy array of shape [n_samples]
set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
transform(y)

Transform labels to normalized encoding.

Parameters:
y : array-like of shape [n_samples]

Target values.

Returns:
y : array-like of shape [n_samples]
class dask_ml.preprocessing.OneHotEncoder(n_values=None, categorical_features=None, categories='auto', sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')

Encode categorical integer features as a one-hot numeric array.

New in version 0.8.0.

Note

This requires scikit-learn 0.20.0 or newer.

The input to this transformer should be an array-like of integers, strings, or categoricals, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array.

By default, the encoder derives the categories based on

  1. For arrays, the unique values in each feature
  2. For DataFrames, the CategoricalDtype information for each feature

Alternatively, for arrays, you can also specify the categories manually.

This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.

Note: a one-hot encoding of y labels should use a LabelBinarizer instead.

Read more in the User Guide.

Parameters:
categories : ‘auto’ or a list of lists/arrays of values.

Categories (unique values) per feature:

  • ‘auto’ : Determine categories automatically from the training data.
  • list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.

The used categories can be found in the categories_ attribute.

sparse : boolean, default=True

Will return sparse matrix if set True else will return an array.

dtype : number type, default=np.float

Desired dtype of output.

handle_unknown : ‘error’

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). The option to ignore unknown categories is not currently implemented.

Attributes:
categories_ : list of arrays

The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform).

dtypes_ : list of dtypes

For DataFrame input, the CategoricalDtype information associated with each feature. For arrays, this is a list of Nones.

Notes

There are a few differences from scikit-learn.

Examples

Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to a binary one-hot encoding.

>>> from dask_ml.preprocessing import OneHotEncoder
>>> import numpy as np
>>> import dask.array as da
>>> enc = OneHotEncoder()
>>> X = da.from_array(np.array([['A'], ['B'], ['A'], ['C']]), chunks=2)
>>> enc.fit(X)
... # doctest: +ELLIPSIS
OneHotEncoder(categorical_features=None, categories=None,
       dtype=<... 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)
>>> enc.categories_
[array(['A', 'B', 'C'], dtype='<U1')]
>>> enc.transform(X)
dask.array<concatenate, shape=(4, 3), dtype=float64, chunksize=(2, 3)>
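
For DataFrame input the categories are taken from each column's CategoricalDtype; a hedged sketch with an illustrative Dask DataFrame of known categoricals:

import pandas as pd
import dask.dataframe as dd
from dask_ml.preprocessing import OneHotEncoder

ddf = dd.from_pandas(
    pd.DataFrame({"A": pd.Categorical(['a', 'b', 'a'])}), npartitions=2
)

enc = OneHotEncoder()
result = enc.fit_transform(ddf)  # dask DataFrame with one dummy column per category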

Methods

fit(X[, y]) Fit OneHotEncoder to X.
fit_transform(X[, y]) Fit OneHotEncoder to X, then transform X.
get_feature_names([input_features]) Return feature names for output features.
get_params([deep]) Get parameters for this estimator.
inverse_transform(X) Convert the data back to the original representation.
set_params(**params) Set the parameters of this estimator.
transform(X) Transform X using one-hot encoding.
active_features_

DEPRECATED: The active_features_ attribute was deprecated in version 0.20 and will be removed 0.22.

feature_indices_

DEPRECATED: The feature_indices_ attribute was deprecated in version 0.20 and will be removed 0.22.

fit(X, y=None)

Fit OneHotEncoder to X.

Parameters:
X : array-like, shape [n_samples, n_features]

The data to determine the categories of each feature.

Returns:
self
fit_transform(X, y=None)

Fit OneHotEncoder to X, then transform X.

Equivalent to fit(X).transform(X) but more convenient.

Parameters:
X : array-like, shape [n_samples, n_features]

The data to encode.

Returns:
X_out : sparse matrix if sparse=True else a 2-d array

Transformed input.

get_feature_names(input_features=None)

Return feature names for output features.

Parameters:
input_features : list of string, length n_features, optional

String names for input features if available. By default, “x0”, “x1”, … “xn_features” is used.

Returns:
output_feature_names : array of string, length n_output_features
get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

inverse_transform(X)

Convert the data back to the original representation.

In case unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category.

Parameters:
X : array-like or sparse matrix, shape [n_samples, n_encoded_features]

The transformed data.

Returns:
X_tr : array-like, shape [n_samples, n_features]

Inverse transformed array.

n_values_

DEPRECATED: The n_values_ attribute was deprecated in version 0.20 and will be removed 0.22.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
transform(X)

Transform X using one-hot encoding.

Parameters:
X : array-like, shape [n_samples, n_features]

The data to encode.

Returns:
X_out : sparse matrix if sparse=True else a 2-d array

Transformed input.

class dask_ml.preprocessing.PolynomialFeatures(degree=2, interaction_only=False, include_bias=True, preserve_dataframe=False)

Generate polynomial and interaction features.

Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

Parameters:
degree : integer

The degree of the polynomial features. Default = 2.

interaction_only : boolean, default = False

If true, only interaction features are produced: features that are products of at most degree distinct input features (so not x[1] ** 2, x[0] * x[2] ** 3, etc.).

include_bias : boolean

If True (default), then include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones - acts as an intercept term in a linear model).

preserve_dataframe : boolean

If True, preserve pandas and dask dataframes after transforming. Using False (default) returns numpy or dask arrays and mimics sklearn’s default behaviour

Attributes:
powers_ : array, shape (n_output_features, n_input_features)

powers_[i, j] is the exponent of the jth input in the ith output.

n_input_features_ : int

The total number of input features.

n_output_features_ : int

The total number of polynomial output features. The number of output features is computed by iterating over all suitably sized combinations of input features.

Notes

Be aware that the number of features in the output array scales polynomially in the number of features of the input array, and exponentially in the degree. High degrees can cause overfitting.

See examples/linear_model/plot_polynomial_interpolation.py

Examples

>>> import numpy as np
>>> from dask_ml.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])
>>> poly = PolynomialFeatures(interaction_only=True)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.],
       [ 1.,  2.,  3.,  6.],
       [ 1.,  4.,  5., 20.]])
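
The same estimator accepts Dask arrays (and, with preserve_dataframe=True, keeps dask DataFrames); a minimal sketch with illustrative data:

import dask.array as da
from dask_ml.preprocessing import PolynomialFeatures

X = da.random.random((100, 2), chunks=50)

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)  # dask array with bias, linear, and degree-2 terms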

Methods

fit(X[, y]) Compute number of output features.
fit_transform(X[, y]) Fit to data, then transform it.
get_feature_names([input_features]) Return feature names for output features
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X[, y]) Transform data to polynomial features
fit(X, y=None)

Compute number of output features.

Parameters:
X : array-like, shape (n_samples, n_features)

The data.

Returns:
self : instance
fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_feature_names(input_features=None)

Return feature names for output features

Parameters:
input_features : list of string, length n_features, optional

String names for input features if available. By default, “x0”, “x1”, … “xn_features” is used.

Returns:
output_feature_names : list of string, length n_output_features
get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self
transform(X, y=None)

Transform data to polynomial features

Parameters:
X : array-like or sparse matrix, shape [n_samples, n_features]

The data to transform, row by row. Sparse input should preferably be in CSC format.

Returns:
XP : np.ndarray or CSC sparse matrix, shape [n_samples, NP]

The matrix of features, where NP is the number of polynomial features generated from the combination of inputs.

preprocessing.StandardScaler([copy, …]) Standardize features by removing the mean and scaling to unit variance
preprocessing.RobustScaler([with_centering, …]) Scale features using statistics that are robust to outliers.
preprocessing.MinMaxScaler([feature_range, copy]) Transforms features by scaling each feature to a given range.
preprocessing.QuantileTransformer([…]) Transforms features using quantile information.
preprocessing.Categorizer([categories, columns]) Transform columns of a DataFrame to categorical dtype.
preprocessing.DummyEncoder([columns, drop_first]) Dummy (one-hot) encode categorical columns.
preprocessing.OrdinalEncoder([columns]) Ordinal (integer) encode categorical columns.
preprocessing.LabelEncoder([use_categorical]) Encode labels with value between 0 and n_classes-1.
preprocessing.PolynomialFeatures([degree, …]) Generate polynomial and interaction features.

dask_ml.compose: Composite Estimators

Meta-estimators for building composite models with multiple transformers.

These estimators are useful for working with heterogeneous tabular data.

class dask_ml.compose.ColumnTransformer(transformers, remainder='drop', sparse_threshold=0.3, n_jobs=1, transformer_weights=None, preserve_dataframe=True)

Applies transformers to columns of an array or pandas DataFrame.

EXPERIMENTAL: some behaviors may change between releases without deprecation.

This estimator allows different columns or column subsets of the input to be transformed separately and the results combined into a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

Read more in the User Guide.

New in version 0.9.0.

Note

This requires scikit-learn 0.20.0 or newer.

Parameters:
transformers : list of tuples

List of (name, transformer, column(s)) tuples specifying the transformer objects to be applied to subsets of the data.

name : string

Like in Pipeline and FeatureUnion, this allows the transformer and its parameters to be set using set_params and searched in grid search.

transformer : estimator or {‘passthrough’, ‘drop’}

Estimator must support fit and transform. Special-cased strings ‘drop’ and ‘passthrough’ are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively.

column(s) : string or int, array-like of string or int, slice, boolean mask array or callable

Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above.

remainder : {‘drop’, ‘passthrough’} or estimator, default ‘drop’

By default, only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped. (default of 'drop'). By specifying remainder='passthrough', all remaining columns that were not specified in transformers will be automatically passed through. This subset of columns is concatenated with the output of the transformers. By setting remainder to be an estimator, the remaining non-specified columns will use the remainder estimator. The estimator must support fit and transform.

sparse_threshold : float, default = 0.3

If the transformed output consists of a mix of sparse and dense data, it will be stacked as a sparse matrix if the density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all sparse or all dense data, the stacked result will be sparse or dense, respectively, and this keyword will be ignored.

n_jobs : int or None, optional (default=None)

Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

transformer_weights : dict, optional

Multiplicative weights for features per transformer. The output of the transformer is multiplied by these weights. Keys are transformer names, values the weights.

preserve_dataframe : bool, (default=True)

Whether to preserve pandas DataFrames when concatenating the results.

Warning

The default behavior of keeping DataFrames differs from scikit-learn’s current behavior. Set preserve_dataframe=False if you need to ensure that the output matches scikit-learn’s ColumnTransformer.

Attributes:
transformers_ : list

The collection of fitted transformers as tuples of (name, fitted_transformer, column). fitted_transformer can be an estimator, ‘drop’, or ‘passthrough’. If there are remaining columns, the final element is a tuple of the form: (‘remainder’, transformer, remaining_columns) corresponding to the remainder parameter. If there are remaining columns, then len(transformers_)==len(transformers)+1, otherwise len(transformers_)==len(transformers).

named_transformers_ : Bunch object, a dictionary with attribute access

Access the fitted transformer by name.

sparse_output_ : boolean

Boolean flag indicating whether the output of transform is a sparse matrix or a dense numpy array, which depends on the output of the individual transformers and the sparse_threshold keyword.

See also

dask_ml.compose.make_column_transformer
convenience function for combining the outputs of multiple transformer objects applied to column subsets of the original feature space.

Notes

The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.

Examples

>>> import numpy as np
>>> from dask_ml.compose import ColumnTransformer
>>> from sklearn.preprocessing import Normalizer
>>> ct = ColumnTransformer(
...     [("norm1", Normalizer(norm='l1'), [0, 1]),
...      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
>>> X = np.array([[0., 1., 2., 2.],
...               [1., 1., 0., 1.]])
>>> # Normalizer scales each row of X to unit norm. A separate scaling
>>> # is applied for the two first and two last elements of each
>>> # row independently.
>>> ct.fit_transform(X)    # doctest: +NORMALIZE_WHITESPACE
array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])

Methods

fit(X[, y]) Fit all transformers using X.
fit_transform(X[, y]) Fit all transformers, transform the data and concatenate results.
get_feature_names() Get feature names from all transformers.
get_params([deep]) Get parameters for this estimator.
set_params(**kwargs) Set the parameters of this estimator.
transform(X) Transform X separately by each transformer, concatenate results.
fit(X, y=None)

Fit all transformers using X.

Parameters:
X : array-like or DataFrame of shape [n_samples, n_features]

Input data, of which specified subsets are used to fit the transformers.

y : array-like, shape (n_samples, …), optional

Targets for supervised learning.

Returns:
self : ColumnTransformer

This estimator

fit_transform(X, y=None)

Fit all transformers, transform the data and concatenate results.

Parameters:
X : array-like or DataFrame of shape [n_samples, n_features]

Input data, of which specified subsets are used to fit the transformers.

y : array-like, shape (n_samples, …), optional

Targets for supervised learning.

Returns:
X_t : array-like or sparse matrix, shape (n_samples, sum_n_components)

hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

get_feature_names()

Get feature names from all transformers.

Returns:
feature_names : list of strings

Names of the features produced by transform.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

named_transformers_

Access the fitted transformer by name.

Read-only attribute to access any transformer by given name. Keys are transformer names and values are the fitted transformer objects.

set_params(**kwargs)

Set the parameters of this estimator.

Valid parameter keys can be listed with get_params().

Returns:
self

transform(X)

Transform X separately by each transformer, concatenate results.

Parameters:
X : array-like or DataFrame of shape [n_samples, n_features]

The data to be transformed by subset.

Returns:
X_t : array-like or sparse matrix, shape (n_samples, sum_n_components)

hstack of results of transformers. sum_n_components is the sum of n_components (output dimension) over transformers. If any result is a sparse matrix, everything will be converted to sparse matrices.

dask_ml.compose.make_column_transformer(*transformers, **kwargs)

Construct a ColumnTransformer from the given transformers.

This is a shorthand for the ColumnTransformer constructor; it does not require, and does not permit, naming the transformers. Instead, they will be given names automatically based on their types. It also does not allow weighting.

Parameters:
*transformers : tuples of column selections and transformers
remainder : {‘drop’, ‘passthrough’} or estimator, default ‘drop’

By default, only the columns specified in transformers are transformed and combined in the output; columns that are not specified are dropped (remainder='drop'). With remainder='passthrough', all remaining columns that were not specified in transformers are passed through unchanged and concatenated with the output of the transformers. If remainder is set to an estimator, that estimator is fitted on, and used to transform, the remaining non-specified columns; it must support fit and transform.

sparse_threshold : float, default = 0.3

If the transformed output consists of a mix of sparse and dense data, it will be stacked as a sparse matrix if the density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all sparse or all dense data, the stacked result will be sparse or dense, respectively, and this keyword will be ignored.

n_jobs : int or None, optional (default=None)

Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

Returns:
ct : ColumnTransformer

See also

sklearn.compose.ColumnTransformer
Class that allows combining the outputs of multiple transformer objects used on column subsets of the data into a single feature space.

Examples

>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>> from sklearn.compose import make_column_transformer
>>> make_column_transformer(
...     (['numerical_column'], StandardScaler()),
...     (['categorical_column'], OneHotEncoder()))
...     # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('standardscaler',
                        StandardScaler(...),
                        ['numerical_column']),
                       ('onehotencoder',
                        OneHotEncoder(...),
                        ['categorical_column'])])
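
The same construction is available from dask_ml.compose; a brief sketch (using the (columns, transformer) tuple order documented above and the remainder keyword):

>>> from dask_ml.compose import make_column_transformer
>>> from dask_ml.preprocessing import StandardScaler
>>> ct = make_column_transformer(
...     (['numerical_column'], StandardScaler()),
...     remainder='passthrough')   # unspecified columns are passed through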
compose.ColumnTransformer(transformers[, …]) Applies transformers to columns of an array or pandas DataFrame.
compose.make_column_transformer(…) Construct a ColumnTransformer from the given transformers.

dask_ml.impute: Imputing Missing Data

class dask_ml.impute.SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True)

Methods

fit(X[, y]) Fit the imputer on X.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
set_params(**params) Set the parameters of this estimator.
transform(X) Impute all missing values in X.
fit(X, y=None)

Fit the imputer on X.

Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)

Input data, where n_samples is the number of samples and n_features is the number of features.

Returns:
self : SimpleImputer

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:
X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
self

transform(X)

Impute all missing values in X.

Parameters:
X : {array-like, sparse matrix}, shape (n_samples, n_features)

The input data to complete.
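
A minimal usage sketch (not part of the original reference; the data below is invented), imputing missing values in a dask DataFrame with the default mean strategy:

>>> import numpy as np
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask_ml.impute import SimpleImputer
>>> df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0]})
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> imputer = SimpleImputer(strategy="mean")
>>> filled = imputer.fit_transform(ddf)   # NaNs in "a" replaced by the column mean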

impute.SimpleImputer([missing_values, …])


dask_ml.metrics: Metrics

Score functions, performance metrics, and pairwise distance computations.

Regression Metrics

metrics.mean_absolute_error(y_true, y_pred) Mean absolute error regression loss
metrics.mean_squared_error(y_true, y_pred[, …]) Mean squared error regression loss
metrics.r2_score(y_true, y_pred[, …]) R^2 (coefficient of determination) regression score function.

Classification Metrics

metrics.accuracy_score(y_true, y_pred[, …]) Accuracy classification score.
metrics.log_loss(y_true, y_pred[, eps, …]) Log loss, aka logistic loss or cross-entropy loss.
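
These functions accept dask arrays; a minimal sketch with invented data:

>>> import numpy as np
>>> import dask.array as da
>>> from dask_ml.metrics import accuracy_score
>>> y_true = da.from_array(np.array([0, 1, 1, 0]), chunks=2)
>>> y_pred = da.from_array(np.array([0, 1, 0, 0]), chunks=2)
>>> acc = accuracy_score(y_true, y_pred)   # comparison is evaluated blockwise across the chunks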

dask_ml.tensorflow: TensorFlow

Interoperate with a TensorFlow cluster.

start_tensorflow(client, **kwargs) Start TensorFlow on a Dask cluster
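
A rough sketch of the intended usage (the ps/worker keyword names are an assumption borrowed from the dask-tensorflow project, not confirmed by this reference):

>>> from dask.distributed import Client
>>> from dask_ml.tensorflow import start_tensorflow
>>> client = Client()   # connect to (or start) a dask.distributed cluster
>>> tf_spec, dask_spec = start_tensorflow(client, ps=1, worker=2)   # assumed keyword names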

dask_ml.xgboost: XGBoost

Train an XGBoost model on dask arrays or dataframes.

This may be used to train an XGBoost model on a cluster. XGBoost will be set up in distributed mode alongside your existing dask.distributed cluster.

XGBClassifier([max_depth, learning_rate, …])
XGBRegressor([max_depth, learning_rate, …])
train(client, params, data, labels[, …]) Train an XGBoost model on a Dask Cluster
predict(client, model, data) Distributed prediction with XGBoost
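
A minimal usage sketch (X_train, y_train, and X_test below are placeholders for dask arrays or dataframes of your own; this example is not part of the original reference):

>>> from dask.distributed import Client
>>> from dask_ml.xgboost import XGBRegressor
>>> client = Client()                  # XGBoost is started alongside this cluster
>>> est = XGBRegressor()
>>> est.fit(X_train, y_train)          # placeholders: your training data as dask collections
>>> predictions = est.predict(X_test)  # placeholder: new data to score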