dask_ml.model_selection.IncrementalSearchCV
- class dask_ml.model_selection.IncrementalSearchCV(estimator, parameters, n_initial_parameters=10, decay_rate=<object object>, test_size=None, patience=False, tol=0.001, fits_per_score=1, max_iter=100, random_state=None, scoring=None, verbose=False, prefix='', scores_per_fit=None, predict_meta=None, predict_proba_meta=None, transform_meta=None)
Incrementally search for hyper-parameters on models that support partial_fit.
This incremental hyper-parameter optimization class starts training the model on many hyper-parameters on a small amount of data, and then only continues training those models that seem to be performing well.
See the User Guide for more.
- Parameters
- estimator : estimator object
An object of that type is instantiated for each initial hyperparameter combination. This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed. The estimator must implement partial_fit, set_params, and work well with clone.
- parameters : dict
Dictionary with parameter names (string) as keys and distributions or lists of parameters to try. Distributions must provide an rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly.
- n_initial_parameters : int, default=10
Number of parameter settings that are sampled. This trades off runtime vs quality of the solution.
Alternatively, you can set this to "grid" to do a full grid search.
- decay_rate : float, default 1.0
How quickly to decrease the number of future partial_fit calls.
Deprecated since version v1.4.0: This implementation of an adaptive algorithm that uses decay_rate has moved to InverseDecaySearchCV.
- patience : int, default False
If specified, training stops when the score does not increase by tol after patience calls to partial_fit. Off by default.
- fits_per_score : int, optional, default=1
If patience is used, the maximum number of partial_fit calls between score calls.
- scores_per_fit : int, default 1
If patience is used, the maximum number of partial_fit calls between score calls.
Deprecated since version v1.4.0: Renamed to fits_per_score.
- tol : float, default 0.001
The required level of improvement to consider stopping training on that model. The most recent score must be at most tol better than all of the previous patience scores for that model. Increasing tol will tend to reduce training time, at the cost of worse models.
- max_iter : int, default 100
Maximum number of partial fit calls per model.
- test_size : float
Fraction of the dataset to hold out for computing test scores. Defaults to the size of a single partition of the input training set.
Note
The training dataset should fit in memory on a single machine. Adjust the test_size parameter as necessary to achieve this.
- random_state : int, RandomState instance or None, optional, default: None
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
- scoring : string, callable, list/tuple, dict or None, default: None
A single string (see The scoring parameter: defining model evaluation rules) or a callable (see scoring) to evaluate the predictions on the test set.
For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.
NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.
See Specifying multiple metrics for evaluation for an example.
If None, the estimator’s default scorer (if available) is used.
- verbose : bool, float, int, optional, default: False
If False (default), don’t print logs (or pipe them to stdout). However, standard logging will still be used.
If True, print logs and use standard logging.
If float, print/log approximately verbose fraction of the time.
- prefix : str, optional, default=""
While logging, add prefix to each message.
- predict_meta : pd.Series, pd.DataFrame, np.array, default: None (infer)
An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator's predict call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.
- predict_proba_meta : pd.Series, pd.DataFrame, np.array, default: None (infer)
An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator's predict_proba call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.
- transform_meta : pd.Series, pd.DataFrame, np.array, default: None (infer)
An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator's transform call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.
- Attributes
- cv_results_ : dict of np.ndarrays
This dictionary has keys mean_partial_fit_time, mean_score_time, std_partial_fit_time, std_score_time, test_score, rank_test_score, model_id, partial_fit_calls, params, and param_{key}, where key is every key in params.
The values in the test_score key correspond to the last score a model received on the hold-out dataset. The key model_id corresponds with history_. This dictionary can be imported into Pandas.
- model_history_ : dict of lists of dict
A dictionary of each model's history. This is a reorganization of history_: the same information is present but organized per model.
This data has the structure {model_id: hist} where hist is a subset of history_ and model_id are model identifiers.
- history_ : list of dicts
Information about each model after each partial_fit call. Each dict has the keys partial_fit_time, score_time, score, model_id, params, partial_fit_calls, and elapsed_wall_time.
The key model_id corresponds to the model_id in cv_results_. This list of dicts can be imported into Pandas.
- best_estimator_ : BaseEstimator
The model with the highest validation score among all the models retained by the “inverse decay” algorithm.
- best_score_ : float
Score achieved by best_estimator_ on the validation set after the final call to partial_fit.
- best_index_ : int
Index indicating which estimator in cv_results_ corresponds to the highest score.
- best_params_ : dict
Dictionary of best parameters found on the hold-out data.
- scorer_
The function used to score models, which has a call signature of scorer_(estimator, X, y).
- n_splits_ : int
Number of cross validation splits.
- multimetric_ : bool
Whether this cross validation search uses multiple metrics.
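The relationship between history_ and model_history_ described above can be pictured as a group-by on model_id. The sketch below is a minimal illustration in plain Python, not dask-ml's own code: the records and model identifiers are made-up values that use only a few of the documented keys, and group_by_model is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up records in the shape documented for history_ (only some keys shown).
history = [
    {"model_id": "model-0", "partial_fit_calls": 1, "score": 0.61},
    {"model_id": "model-1", "partial_fit_calls": 1, "score": 0.58},
    {"model_id": "model-0", "partial_fit_calls": 2, "score": 0.66},
]


def group_by_model(records):
    """Reorganize a flat list of records into {model_id: [records, ...]}."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["model_id"]].append(record)
    return dict(grouped)


model_history = group_by_model(history)

# Last recorded score per model, analogous to what test_score is said to hold.
final_scores = {mid: hist[-1]["score"] for mid, hist in model_history.items()}
```

Because both structures carry model_id, either one can be matched back to the corresponding entries of cv_results_.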
Examples
Connect to the client and create the data
>>> from dask.distributed import Client
>>> client = Client()
>>> import numpy as np
>>> from dask_ml.datasets import make_classification
>>> X, y = make_classification(n_samples=5000000, n_features=20,
...                            chunks=100000, random_state=0)
Our underlying estimator is an SGDClassifier. We specify a few parameters common to each clone of the estimator.
>>> from sklearn.linear_model import SGDClassifier
>>> model = SGDClassifier(tol=1e-3, penalty='elasticnet', random_state=0)
The distribution of parameters we’ll sample from.
>>> params = {'alpha': np.logspace(-2, 1, num=1000),
...           'l1_ratio': np.linspace(0, 1, num=1000),
...           'average': [True, False]}
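As noted for the parameters argument, list values are sampled uniformly and distributions are sampled through their rvs method. To make that convention concrete, the sketch below draws settings from a small dictionary using only the standard library. It is an illustration under stated assumptions, not the sampler dask-ml uses: sample_parameters and LogUniform are hypothetical names, and the rvs signature here is simplified compared with scipy's.

```python
import math
import random


class LogUniform:
    """Hypothetical distribution with an rvs method (log-uniform on [low, high])."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def rvs(self, random_state=None):
        # Simplification: accepts a random.Random instead of a NumPy RandomState.
        rng = random_state if random_state is not None else random
        return math.exp(rng.uniform(math.log(self.low), math.log(self.high)))


def sample_parameters(parameters, n, seed=None):
    """Draw n settings: uniform choice from lists, rvs() from distributions."""
    rng = random.Random(seed)
    settings = []
    for _ in range(n):
        setting = {}
        for name, values in parameters.items():
            if hasattr(values, "rvs"):
                setting[name] = values.rvs(random_state=rng)
            else:
                setting[name] = rng.choice(list(values))
        settings.append(setting)
    return settings


space = {"alpha": LogUniform(1e-2, 1e1), "average": [True, False]}
samples = sample_parameters(space, n=10, seed=0)
```

Each element of samples is one candidate setting, so n plays the role that n_initial_parameters plays in the search.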
>>> from dask_ml.model_selection import IncrementalSearchCV
>>> search = IncrementalSearchCV(model, params, random_state=0)
>>> search.fit(X, y, classes=[0, 1])
IncrementalSearchCV(...)
Alternatively you can provide keywords to start with more hyper-parameters, but stop those that don’t seem to improve with more data.
>>> search = IncrementalSearchCV(model, params, random_state=0,
...                              n_initial_parameters=1000,
...                              patience=20, max_iter=100)
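The patience and max_iter keywords in this call bound how long each model keeps training. As a rough illustration of how such a plateau check could be written, here is a plain-Python sketch. It is an assumption about the rule's shape, not dask-ml's implementation: should_stop is a hypothetical helper, and comparing the recent scores against the score just before the window is only one possible reading of the documented behavior.

```python
def should_stop(scores, patience, tol, max_iter):
    """Decide whether to stop training one model.

    scores holds one validation score per partial_fit call, oldest first.
    """
    calls = len(scores)
    if calls >= max_iter:
        return True  # budget of partial_fit calls is exhausted
    if not patience or calls <= patience:
        return False  # early stopping disabled, or not enough history yet
    baseline = scores[-patience - 1]  # score just before the recent window
    recent = scores[-patience:]       # the last `patience` scores
    # Keep training only if some recent score beats the baseline by more than tol.
    return all(score <= baseline + tol for score in recent)


# Two calls without improvement: stop when tol=0 and patience=2.
stalled = should_stop([0.5, 0.6, 0.6, 0.59], patience=2, tol=0, max_iter=100)
# Scores still climbing: keep training.
improving = should_stop([0.5, 0.6, 0.7], patience=2, tol=0, max_iter=100)
```

Under this reading, a larger tol makes the check easier to satisfy, so models stop sooner.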
Often, additional training leads to little or no gain in scores at the end of training. In these cases, stopping training is beneficial because there’s no gain from more training and less computation is required. Two parameters control detecting “little or no gain”: patience and tol. Training continues if at least one score is more than tol above the other scores in the most recent patience calls to model.partial_fit.
For example, setting tol=0 and patience=2 means training will stop after two consecutive calls to model.partial_fit without improvement, or when max_iter total calls to model.partial_fit are reached.
Methods
- decision_function(X)
- fit(X[, y]): Find the best parameters for a particular model.
- get_metadata_routing(): Get metadata routing of this object.
- get_params([deep]): Get parameters for this estimator.
- inverse_transform(Xt)
- predict(X): Predict for X.
- predict_log_proba(X): Log of probability estimates.
- predict_proba(X): Probability estimates.
- score(X[, y]): Returns the score on the given data.
- set_params(**params): Set the parameters of this estimator.
- set_score_request(*[, compute]): Configure whether metadata should be requested to be passed to the score method.
- transform(X): Transform block or partition-wise for dask inputs.
- partial_fit
- __init__(estimator, parameters, n_initial_parameters=10, decay_rate=<object object>, test_size=None, patience=False, tol=0.001, fits_per_score=1, max_iter=100, random_state=None, scoring=None, verbose=False, prefix='', scores_per_fit=None, predict_meta=None, predict_proba_meta=None, transform_meta=None)