dask_ml.model_selection.SuccessiveHalvingSearchCV

class dask_ml.model_selection.SuccessiveHalvingSearchCV(estimator, parameters, n_initial_parameters=10, n_initial_iter=None, max_iter=None, aggressiveness=3, test_size=None, patience=False, tol=0.001, random_state=None, scoring=None)

Perform the successive halving algorithm [1].

This algorithm trains estimators for a certain number of calls to partial_fit, then kills the worst performing half. It trains the surviving estimators for twice as long, and repeats this until one estimator survives.

The value of 1/2 above is used for clarity of explanation. By default, this class kills the worst performing 1 - 1/aggressiveness fraction of models, trains the survivors aggressiveness times longer, and stops once fewer than aggressiveness models remain.
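
A minimal usage sketch, assuming a Dask scheduler is available and an estimator that implements partial_fit (scikit-learn's SGDClassifier here); all parameter values are illustrative:

    import numpy as np
    from dask.distributed import Client
    from sklearn.linear_model import SGDClassifier

    from dask_ml.datasets import make_classification
    from dask_ml.model_selection import SuccessiveHalvingSearchCV

    client = Client()  # connect to (or start) a Dask cluster

    X, y = make_classification(n_samples=10_000, chunks=1_000, random_state=0)
    model = SGDClassifier(tol=1e-3, penalty="elasticnet", random_state=0)
    params = {
        "alpha": np.logspace(-4, 0, num=1000),
        "l1_ratio": np.linspace(0, 1, num=1000),
    }

    search = SuccessiveHalvingSearchCV(model, params, n_initial_iter=3, max_iter=81)
    search.fit(X, y, classes=[0, 1])  # classes is forwarded to partial_fit
    print(search.best_params_, search.best_score_)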

Parameters:
estimator : estimator object.

An object of that type is instantiated for each initial hyperparameter combination. This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed. The estimator must implement partial_fit and set_params, and work well with clone.

parameters : dict

Dictionary with parameter names (strings) as keys and distributions or lists of parameter values to try. Distributions must provide an rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, values are sampled uniformly from it.
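
For example, a sketch of a search space mixing distributions and lists (loguniform assumes scipy >= 1.4):

    from scipy.stats import loguniform, uniform

    parameters = {
        "alpha": loguniform(1e-4, 1e0),  # distribution: sampled via .rvs
        "l1_ratio": uniform(0, 1),       # distribution: sampled via .rvs
        "average": [True, False],        # list: sampled uniformly
    }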

aggressiveness : float, default=3

How aggressively to cull estimators. Higher values imply higher confidence in scoring (or that the hyperparameters influence estimator.score more than the data do).

n_initial_parameters : int, default=10

Number of parameter settings that are sampled. This trades off runtime vs quality of the solution.

n_initial_iter : int

Number of times to call partial_fit initially, before any scoring. Estimators are trained for n_initial_iter calls to partial_fit at first. Higher values of n_initial_iter train the estimators longer before a decision is made. Metadata on the number of calls to partial_fit is in metadata (and metadata_).

max_iter : int, default None

Maximum number of partial_fit calls per model. If None, SuccessiveHalvingSearchCV runs until (about) one model survives. If specified, models stop being trained once max_iter calls to partial_fit have been made.
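
To make the schedule concrete, a back-of-the-envelope sketch (illustrative only, not dask-ml's internal code) of how many models survive each round and how long they train:

    import math

    def sketch_schedule(n_models, n_initial_iter, aggressiveness=3):
        """Rough sketch of the successive halving schedule."""
        rounds = []
        n, iters = n_models, n_initial_iter
        while n >= aggressiveness:
            rounds.append((n, iters))
            n = math.ceil(n / aggressiveness)  # keep ~1/aggressiveness of the models
            iters *= aggressiveness            # train survivors aggressiveness times longer
        rounds.append((n, iters))
        return rounds

    sketch_schedule(10, 3)  # [(10, 3), (4, 9), (2, 27)]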

test_size : float

Fraction of the dataset to hold out for computing test scores. Defaults to the size of a single partition of the input training set.

Note

The testing dataset should fit in memory on a single machine. Adjust the test_size parameter as necessary to achieve this.

patience : int, default False

If specified, training stops when the score does not increase by at least tol after patience calls to partial_fit. Off by default.

tol : float, default 0.001

The required level of improvement to consider stopping training on that model: training stops when the most recent score is at most tol better than all of the previous patience scores for that model. Increasing tol tends to reduce training time, at the cost of worse models.
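
A rough sketch of this stopping rule (illustrative, not dask-ml's internal implementation):

    def should_stop(scores, patience, tol):
        """True when the newest score is at most `tol` better than each of
        the previous `patience` scores."""
        if len(scores) <= patience:
            return False
        recent, prior = scores[-1], scores[-patience - 1:-1]
        return all(recent < s + tol for s in prior)

    should_stop([0.80, 0.81, 0.81, 0.81], patience=2, tol=0.001)  # True: plateaued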

scoring : string, callable, or None, default: None

A single string (see The scoring parameter: defining model evaluation rules) or a callable (see Defining your scoring strategy from metric functions) to evaluate the predictions on the test set.

If None, the estimator’s default scorer (if available) is used.

random_state : int, RandomState instance or None, optional, default: None

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Attributes:
cv_results_ : dict of np.ndarrays

This dictionary has keys

  • mean_partial_fit_time
  • mean_score_time
  • std_partial_fit_time
  • std_score_time
  • test_score
  • rank_test_score
  • model_id
  • partial_fit_calls
  • params
  • param_{key}, where key is every key in params.

The values in the test_score key correspond to the last score a model received on the hold out dataset. The key model_id corresponds with history_. This dictionary can be imported into Pandas.
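
For example (assuming a fitted search object named search):

    import pandas as pd

    df = pd.DataFrame(search.cv_results_)
    df.sort_values("rank_test_score").head()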

metadata and metadata_ : dict[key, int]

Dictionary describing the computation. metadata describes the computation that will be performed, and metadata_ describes the computation that has been performed. Both dictionaries have keys

  • n_models: the number of models for this run of successive halving
  • max_iter: the maximum number of times partial_fit is called. At least one model will have this many partial_fit calls.
  • partial_fit_calls: the total number of partial_fit calls. All models together will receive this many partial_fit calls.

When patience is specified, the reduced computation will be reflected in metadata_ but not metadata.
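
For example, to compare planned and performed work (assuming a fitted search object named search):

    search.metadata["partial_fit_calls"]   # total calls planned before fitting
    search.metadata_["partial_fit_calls"]  # total calls actually made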

model_history_ : dict of lists of dict

A dictionary of each model's history. This is a reorganization of history_: the same information is present, but organized per model.

This data has the structure {model_id: hist} where hist is a subset of history_ and model_id are model identifiers.

history_ : list of dicts

Information about each model after each partial_fit call. Each dict has the keys

  • partial_fit_time
  • score_time
  • score
  • model_id
  • params
  • partial_fit_calls

The key model_id corresponds to the model_id in cv_results_. This list of dicts can be imported into Pandas.
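
For example, to view one learning curve per model (assuming a fitted search object named search):

    import pandas as pd

    hist = pd.DataFrame(search.history_)
    # score as a function of partial_fit calls, one column per model
    hist.pivot_table(index="partial_fit_calls", columns="model_id", values="score")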

best_estimator_ : BaseEstimator

The model with the highest validation score among all the models retained by the successive halving algorithm.

best_score_ : float

Score achieved by best_estimator_ on the validation set after the final call to partial_fit.

best_index_ : int

Index indicating which estimator in cv_results_ corresponds to the highest score.

best_params_ : dict

Dictionary of best parameters found on the hold-out data.

scorer_ :

The function used to score models, which has a call signature of scorer_(estimator, X, y).
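
For example (assuming a fitted search and illustrative hold-out arrays X_test, y_test):

    validation_score = search.scorer_(search.best_estimator_, X_test, y_test)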

n_splits_ : int

Number of cross validation splits.

multimetric_ : bool

Whether this cross validation search uses multiple metrics.

References

[1] Jamieson, Kevin and Talwalkar, Ameet. “Non-stochastic Best Arm Identification and Hyperparameter Optimization”. 2016. https://arxiv.org/abs/1502.07943

Methods

decision_function(self, X)
fit(self, X[, y]) Find the best parameters for a particular model.
get_params(self[, deep]) Get parameters for this estimator.
inverse_transform(self, Xt)
predict(self, X) Predict for X.
predict_log_proba(self, X) Log of probability estimates.
predict_proba(self, X) Probability estimates.
score(self, X[, y]) Returns the score on the given data.
set_params(self, **params) Set the parameters of this estimator.
transform(self, X)
partial_fit  
__init__(self, estimator, parameters, n_initial_parameters=10, n_initial_iter=None, max_iter=None, aggressiveness=3, test_size=None, patience=False, tol=0.001, random_state=None, scoring=None)

Initialize self. See help(type(self)) for accurate signature.

fit(self, X, y=None, **fit_params)

Find the best parameters for a particular model.

Parameters:
X, y : array-like
**fit_params

Additional partial_fit keyword arguments for the estimator.
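
For example, scikit-learn's SGDClassifier requires classes on its first partial_fit call (assuming binary labels and arrays X, y):

    search.fit(X, y, classes=[0, 1])  # passed to every partial_fit call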

get_params(self, deep=True)

Get parameters for this estimator.

Parameters:
deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : mapping of string to any

Parameter names mapped to their values.

predict(self, X)

Predict for X.

For dask inputs, a dask array or dataframe is returned. For other inputs (NumPy array, pandas dataframe, scipy sparse matrix), the regular return value is returned.

Parameters:
X : array-like
Returns:
y : array-like
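
A short sketch of the two return types (assuming a fitted search object named search; array shapes are illustrative):

    import dask.array as da

    X_lazy = da.random.random((1000, 20), chunks=100)  # n_features must match training
    y_lazy = search.predict(X_lazy)             # lazy dask array
    y_eager = search.predict(X_lazy.compute())  # regular NumPy array
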
predict_log_proba(self, X)

Log of probability estimates.

For dask inputs, a dask array or dataframe is returned. For other inputs (NumPy array, pandas dataframe, scipy sparse matrix), the regular return value is returned.

If the underlying estimator does not have a predict_proba method, then an AttributeError is raised.

Parameters:
X : array or dataframe
Returns:
y : array-like

predict_proba(self, X)

Probability estimates.

For dask inputs, a dask array or dataframe is returned. For other inputs (NumPy array, pandas dataframe, scipy sparse matrix), the regular return value is returned.

If the underlying estimator does not have a predict_proba method, then an AttributeError is raised.

Parameters:
X : array or dataframe
Returns:
y : array-like

score(self, X, y=None)

Returns the score on the given data.

Parameters:
X : array-like, shape = [n_samples, n_features]

Input data, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples] or [n_samples, n_output], optional

Target relative to X for classification or regression; None for unsupervised learning.

Returns:
score : float

The score of the best estimator on the given data.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
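
For example, with a scikit-learn pipeline (illustrative component names):

    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    pipe = Pipeline([("scale", StandardScaler()), ("sgd", SGDClassifier())])
    pipe.set_params(sgd__alpha=1e-3)  # <component>__<parameter>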

Returns:
self