dask_ml.compose.ColumnTransformer
class dask_ml.compose.ColumnTransformer(transformers, remainder='drop', sparse_threshold=0.3, n_jobs=1, transformer_weights=None, preserve_dataframe=True)
Applies transformers to columns of an array or pandas DataFrame.
EXPERIMENTAL: some behaviors may change between releases without deprecation.
This estimator allows different columns or column subsets of the input to be transformed separately and the results combined into a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
Read more in the User Guide.
New in version 0.9.0.
Note
This requires scikit-learn 0.20.0 or newer.
Parameters

- transformers : list of tuples
List of (name, transformer, column(s)) tuples specifying the transformer objects to be applied to subsets of the data.

- name : string
Like in Pipeline and FeatureUnion, this allows the transformer and its parameters to be set using set_params and searched in grid search (a grid-search sketch appears at the end of this page).

- transformer : estimator or {'passthrough', 'drop'}
Estimator must support fit and transform. The special-cased strings 'drop' and 'passthrough' are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively.

- column(s) : string or int, array-like of string or int, slice, boolean mask array or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector); otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. See the construction sketch after this parameter list.
- remainder : {'drop', 'passthrough'} or estimator, default 'drop'
By default, only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped. By specifying remainder='passthrough', all remaining columns that were not specified in transformers will be automatically passed through; this subset of columns is concatenated with the output of the transformers. By setting remainder to an estimator, the remaining non-specified columns will use the remainder estimator. The estimator must support fit and transform.

- sparse_threshold : float, default 0.3
If the transformed output consists of a mix of sparse and dense data, it will be stacked as a sparse matrix if the density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all sparse or all dense data, the stacked result will be sparse or dense, respectively, and this keyword will be ignored.

- n_jobs : int or None, optional (default=1)
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

- transformer_weights : dict, optional
Multiplicative weights for features per transformer. The output of the transformer is multiplied by these weights. Keys are transformer names, values the weights.

- preserve_dataframe : bool, default True
Whether to preserve pandas DataFrames when concatenating the results.
Warning
The default behavior of keeping DataFrames differs from scikit-learn's current behavior. Set preserve_dataframe=False if you need to ensure that the output matches scikit-learn's ColumnTransformer.
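To make the column selection and remainder options concrete, here is a minimal construction sketch; the frame, the column names (age, income, city, id), and the choice of scikit-learn transformers are hypothetical:

import pandas as pd
from dask_ml.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical heterogeneous frame: two numeric columns, one categorical,
# and an 'id' column that is not listed in any transformer.
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [40000.0, 55000.0, 72000.0],
    "city": ["NYC", "SF", "NYC"],
    "id": [1, 2, 3],
})

ct = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age", "income"]),  # list of names -> 2d subset
        ("cat", OneHotEncoder(), ["city"]),            # one-element list stays 2d
    ],
    remainder="passthrough",     # keep the unlisted 'id' column untransformed
    preserve_dataframe=False,    # return an array, matching scikit-learn
)

result = ct.fit_transform(df)
# Column order: scaled age/income, one-hot city, then the passed-through id.
# Whether `result` is sparse or dense follows the sparse_threshold rule above.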
Attributes

- transformers_ : list
The collection of fitted transformers as tuples of (name, fitted_transformer, column). fitted_transformer can be an estimator, 'drop', or 'passthrough'. If there are remaining columns, the final element is a tuple of the form ('remainder', transformer, remaining_columns) corresponding to the remainder parameter. In that case len(transformers_) == len(transformers) + 1; otherwise len(transformers_) == len(transformers).

- named_transformers_ : Bunch object, a dictionary with attribute access
Access the fitted transformer by name.

- sparse_output_ : boolean
Boolean flag indicating whether the output of transform is a sparse matrix or a dense numpy array, which depends on the output of the individual transformers and the sparse_threshold keyword.
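A short sketch of how these attributes look after fitting; the frame and the transformer name are hypothetical, and the commented values are approximate:

import pandas as pd
from dask_ml.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0], "c": [7, 8, 9]})
ct = ColumnTransformer([("num", StandardScaler(), ["a", "b"])])  # 'c' is dropped
ct.fit(df)

ct.transformers_                # roughly [('num', StandardScaler(), ['a', 'b']),
                                #          ('remainder', 'drop', [2])]
ct.named_transformers_["num"]   # the fitted StandardScaler, looked up by name
ct.sparse_output_               # False here: every transformer produced dense output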
See also
dask_ml.compose.make_column_transformer
convenience function for combining the outputs of multiple transformer objects applied to column subsets of the original feature space.
Notes
The order of the columns in the transformed feature matrix follows the order in which the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless remainder='passthrough' is specified. Passed-through columns are appended to the right of the output of the transformers.
Examples
>>> import numpy as np
>>> from dask_ml.compose import ColumnTransformer
>>> from sklearn.preprocessing import Normalizer
>>> ct = ColumnTransformer(
...     [("norm1", Normalizer(norm='l1'), [0, 1]),
...      ("norm2", Normalizer(norm='l1'), slice(2, 4))])
>>> X = np.array([[0., 1., 2., 2.],
...               [1., 1., 0., 1.]])
>>> # Normalizer scales each row of X to unit norm. A separate scaling
>>> # is applied for the two first and two last elements of each
>>> # row independently.
>>> ct.fit_transform(X)
array([[0. , 1. , 0.5, 0.5],
       [0.5, 0.5, 0. , 1. ]])
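Since this is the dask-ml variant, the same pattern applies to dask collections. A minimal sketch, assuming dask_ml.preprocessing.StandardScaler and a small dask DataFrame (column names hypothetical); the exact output container follows the preserve_dataframe setting described above:

import dask.dataframe as dd
import pandas as pd
from dask_ml.compose import ColumnTransformer
from dask_ml.preprocessing import StandardScaler

pdf = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 1.0, 0.0, 1.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

# With the default preserve_dataframe=True, the transformed output of a
# dask DataFrame stays a (lazy) dask collection rather than an array.
ct = ColumnTransformer([("scale", StandardScaler(), ["x", "y"])])
out = ct.fit_transform(ddf)
out.compute()   # materialize the scaled columns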
Methods

fit(X[, y])
Fit all transformers using X.

fit_transform(X[, y])
Fit all transformers, transform the data and concatenate results.

get_feature_names_out([input_features])
Get output feature names for transformation.

get_metadata_routing()
Get metadata routing of this object.

get_params([deep])
Get parameters for this estimator.

set_output(*[, transform])
Set the output container when "transform" and "fit_transform" are called.

set_params(**kwargs)
Set the parameters of this estimator.

transform(X, **params)
Transform X separately by each transformer, concatenate results.
__init__(transformers, remainder='drop', sparse_threshold=0.3, n_jobs=1, transformer_weights=None, preserve_dataframe=True)
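Because transformer names double as parameter prefixes for get_params and set_params, nested parameters can also be tuned with a grid search. A minimal sketch using scikit-learn's Pipeline and GridSearchCV; the pipeline layout and data are hypothetical:

import numpy as np
from dask_ml.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

ct = ColumnTransformer([("norm1", Normalizer(norm="l1"), [0, 1])],
                       remainder="passthrough")
ct.set_params(norm1__norm="l2")   # nested parameters use <name>__<parameter>

pipe = Pipeline([("ct", ct), ("clf", LogisticRegression())])
grid = GridSearchCV(pipe, {"ct__norm1__norm": ["l1", "l2"]}, cv=2)

X = np.array([[0., 1., 2.], [1., 1., 0.], [2., 0., 1.], [0., 2., 2.]])
y = np.array([0, 1, 0, 1])
grid.fit(X, y)   # the grid search reaches inside the named transformer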