dask_ml.wrappers.Incremental

class dask_ml.wrappers.Incremental(estimator=None, scoring=None, shuffle_blocks=True, random_state=None, assume_equal_chunks=True)

Meta-estimator for feeding Dask Arrays to an estimator blockwise.
This wrapper provides a bridge between Dask objects and estimators implementing the partial_fit API. These incremental learners can train on batches of data, which fits well with Dask's blocked data structures.

Note
This meta-estimator is not appropriate for hyperparameter optimization on larger-than-memory datasets. For that, see IncrementalSearchCV or HyperbandSearchCV.

See the list of incremental learners in the scikit-learn documentation for estimators that implement the partial_fit API. Incremental is not limited to these classes; it will work with any estimator implementing partial_fit, including estimators defined outside of scikit-learn itself.

Calling Incremental.fit() with a Dask Array will pass each block of the Dask array or arrays to estimator.partial_fit sequentially.

Like ParallelPostFit, the methods available after fitting (e.g. Incremental.predict()) are all parallel and delayed.

The estimator_ attribute is a clone of estimator that was actually used during the call to fit. All attributes learned during training are available on Incremental directly.

Parameters:
- estimator : Estimator
  Any object supporting the scikit-learn partial_fit API.
- scoring : string or callable, optional
A single string (see The scoring parameter: defining model evaluation rules) or a callable (see Defining your scoring strategy from metric functions) to evaluate the predictions on the test set.
For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.
NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.
See Specifying multiple metrics for evaluation for an example.
Warning
If None, the estimator's default scorer (if available) is used. Most scikit-learn estimators will convert a large Dask array to a single NumPy array, which may exhaust the memory of your worker. You should almost always specify scoring explicitly.
- random_state : int or numpy.random.RandomState, optional
Random object that determines how to shuffle blocks.
- shuffle_blocks : bool, default True
  Determines whether to call partial_fit on a randomly selected chunk of the Dask arrays (default) or to fit the chunks in sequential order. Note that this only shuffles the order in which blocks are fit; it does not shuffle rows between blocks or within a block.
Attributes:
- estimator_ : Estimator
  A clone of estimator that was actually fit during the .fit call.
Examples
>>> from dask_ml.wrappers import Incremental
>>> from dask_ml.datasets import make_classification
>>> import sklearn.linear_model
>>> X, y = make_classification(chunks=25)
>>> est = sklearn.linear_model.SGDClassifier()
>>> clf = Incremental(est, scoring='accuracy')
>>> clf.fit(X, y, classes=[0, 1])
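Conceptually, Incremental.fit() reduces to a loop over the array's blocks, calling partial_fit on each. The following is a minimal pure-Python sketch of that mechanism, using a toy estimator and plain lists in place of a real Dask array and scikit-learn model (ToyEstimator and blockwise_fit are hypothetical names, not part of dask-ml):

```python
import random

class ToyEstimator:
    """Toy stand-in for a scikit-learn estimator with partial_fit."""
    def __init__(self):
        self.n_samples_seen_ = 0

    def partial_fit(self, X_block, y_block):
        # A real estimator would update its weights here; we just count rows.
        self.n_samples_seen_ += len(X_block)
        return self

def blockwise_fit(estimator, X_blocks, y_blocks, shuffle_blocks=True, seed=None):
    """Feed blocks to estimator.partial_fit sequentially, one block at a time."""
    order = list(range(len(X_blocks)))
    if shuffle_blocks:
        # Shuffle the order in which blocks are visited, not the rows inside them.
        random.Random(seed).shuffle(order)
    for i in order:
        estimator.partial_fit(X_blocks[i], y_blocks[i])
    return estimator

# Three "blocks" of 25 rows each, mimicking chunks=25 above.
X_blocks = [[[0.0]] * 25 for _ in range(3)]
y_blocks = [[0] * 25 for _ in range(3)]
est = blockwise_fit(ToyEstimator(), X_blocks, y_blocks, seed=0)
print(est.n_samples_seen_)  # 75
```

Regardless of the visiting order, every block is passed to partial_fit exactly once per call to fit, so the estimator sees all 75 samples.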
When used inside a grid search, prefix the underlying estimator's parameter names with estimator__.

>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = {"estimator__alpha": [0.1, 1.0, 10.0]}
>>> gs = GridSearchCV(clf, param_grid)
>>> gs.fit(X, y, classes=[0, 1])
Methods

fit(X[, y])	Fit the underlying estimator.
get_params([deep])	Get parameters for this estimator.
partial_fit(X[, y])	Fit the underlying estimator.
predict(X)	Predict for X.
predict_log_proba(X)	Log of probability estimates.
predict_proba(X)	Probability estimates.
score(X, y[, compute])	Returns the score on the given data.
set_params(**params)	Set the parameters of this estimator.
transform(X)	Transform block or partition-wise for dask inputs.
__init__(estimator=None, scoring=None, shuffle_blocks=True, random_state=None, assume_equal_chunks=True)
Initialize self. See help(type(self)) for accurate signature.
fit(X, y=None, **fit_kwargs)
Fit the underlying estimator.

Parameters:
- X, y : array-like
- **fit_kwargs
  Additional fit-kwargs for the underlying estimator.

Returns:
- self : object
get_params(deep=True)
Get parameters for this estimator.

Parameters:
- deep : bool, default=True
  If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
- params : mapping of string to any
  Parameter names mapped to their values.
partial_fit(X, y=None, **fit_kwargs)
Fit the underlying estimator.

If this estimator has not been previously fit, this is identical to Incremental.fit(). If it has been previously fit, self.estimator_ is used as the starting point.

Parameters:
- X, y : array-like
- **fit_kwargs
  Additional fit-kwargs for the underlying estimator.

Returns:
- self : object
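The essential difference between fit and partial_fit is whether training starts from a fresh clone of the template estimator or continues from the previously fitted self.estimator_. A simplified sketch of that warm-start dispatch (the Wrapper and Counter classes are illustrative stand-ins, not dask-ml's actual implementation):

```python
import copy

class Wrapper:
    """Minimal stand-in for Incremental's fit/partial_fit dispatch."""
    def __init__(self, estimator):
        self.estimator = estimator   # template, never mutated
        self.estimator_ = None       # the fitted clone

    def fit(self, blocks):
        # fit always starts over from a fresh clone of the template.
        self.estimator_ = copy.deepcopy(self.estimator)
        return self._train(blocks)

    def partial_fit(self, blocks):
        # partial_fit continues from estimator_ if one already exists.
        if self.estimator_ is None:
            self.estimator_ = copy.deepcopy(self.estimator)
        return self._train(blocks)

    def _train(self, blocks):
        for block in blocks:
            self.estimator_.partial_fit(block)
        return self

class Counter:
    """Toy estimator that just counts the samples it has seen."""
    def __init__(self):
        self.n_ = 0
    def partial_fit(self, block):
        self.n_ += len(block)

w = Wrapper(Counter())
w.fit([[1, 2], [3, 4]])           # sees 4 samples
w.partial_fit([[5, 6]])           # warm start: total is now 6
n_after_partial = w.estimator_.n_
w.fit([[7, 8]])                   # fresh clone: count resets to 2
n_after_refit = w.estimator_.n_
print(n_after_partial, n_after_refit)  # 6 2
```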
predict(X)
Predict for X.

For dask inputs, a dask array or dataframe is returned. For other inputs (NumPy array, pandas dataframe, scipy sparse matrix), the regular return value is returned.

Parameters:
- X : array-like

Returns:
- y : array-like
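For dask inputs, prediction is applied block-by-block and the per-block results are stitched back together, lazily in the real implementation. A rough pure-Python sketch of that blockwise mapping, with laziness modelled by zero-argument closures (ThresholdModel and blockwise_predict are hypothetical illustrations):

```python
class ThresholdModel:
    """Toy fitted model: predicts 1 for values above a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, block):
        return [1 if x > self.threshold else 0 for x in block]

def blockwise_predict(model, blocks):
    # Build one deferred task per block; nothing runs until compute() is called,
    # mirroring how dask builds a graph of per-block predict calls.
    tasks = [lambda b=b: model.predict(b) for b in blocks]
    def compute():
        out = []
        for task in tasks:
            out.extend(task())   # run each per-block task and concatenate
        return out
    return compute

model = ThresholdModel(threshold=0.5)
lazy = blockwise_predict(model, [[0.1, 0.9], [0.7, 0.2]])
print(lazy())  # [0, 1, 1, 0]
```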
predict_log_proba(X)
Log of probability estimates.

For dask inputs, a dask array or dataframe is returned. For other inputs (NumPy array, pandas dataframe, scipy sparse matrix), the regular return value is returned.

If the underlying estimator does not have a predict_log_proba method, then an AttributeError is raised.

Parameters:
- X : array or dataframe

Returns:
- y : array-like
predict_proba(X)
Probability estimates.

For dask inputs, a dask array or dataframe is returned. For other inputs (NumPy array, pandas dataframe, scipy sparse matrix), the regular return value is returned.

If the underlying estimator does not have a predict_proba method, then an AttributeError is raised.

Parameters:
- X : array or dataframe

Returns:
- y : array-like
score(X, y, compute=True)
Returns the score on the given data.

Parameters:
- X : array-like, shape = [n_samples, n_features]
  Input data, where n_samples is the number of samples and n_features is the number of features.
- y : array-like, shape = [n_samples] or [n_samples, n_output], optional
  Target relative to X for classification or regression; None for unsupervised learning.

Returns:
- score : float
  The score of the underlying estimator, as returned by self.estimator_.score(X, y).
set_params(**params)
Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters:
- **params : dict
  Estimator parameters.

Returns:
- self : object
  Estimator instance.
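The <component>__<parameter> convention is what lets a name like estimator__alpha reach the wrapped estimator inside a grid search. A simplified sketch of how such prefixed names can be split and routed (route_params is a hypothetical helper, not dask-ml's or scikit-learn's actual code):

```python
def route_params(params):
    """Split params into top-level and nested per-component dicts."""
    own, nested = {}, {}
    for key, value in params.items():
        if "__" in key:
            # "estimator__alpha" -> component "estimator", sub-parameter "alpha"
            component, _, sub_key = key.partition("__")
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested

own, nested = route_params({
    "scoring": "accuracy",       # stays on the wrapper itself
    "estimator__alpha": 0.1,     # forwarded to the wrapped estimator
    "estimator__penalty": "l2",
})
print(own)     # {'scoring': 'accuracy'}
print(nested)  # {'estimator': {'alpha': 0.1, 'penalty': 'l2'}}
```

In scikit-learn compatible estimators, the nested dict would then be applied by calling set_params on the matching sub-estimator.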
transform(X)
Transform block or partition-wise for dask inputs.

For dask inputs, a dask array or dataframe is returned. For other inputs (NumPy array, pandas dataframe, scipy sparse matrix), the regular return value is returned.

If the underlying estimator does not have a transform method, then an AttributeError is raised.

Parameters:
- X : array-like

Returns:
- transformed : array-like