dask_ml.model_selection.HyperbandSearchCV

class dask_ml.model_selection.HyperbandSearchCV(estimator, parameters, max_iter=81, aggressiveness=3, patience=False, tol=0.001, test_size=None, random_state=None, scoring=None, verbose=False, prefix='', predict_meta=None, predict_proba_meta=None, transform_meta=None)

Find the best parameters for a particular model with an adaptive cross-validation algorithm.

Hyperband will find close to the best possible parameters with the given computational budget* by spending more time training high-performing estimators [1]. This means that Hyperband stops training estimators that perform poorly; at its core, Hyperband is an early stopping scheme for RandomizedSearchCV.

Unlike RandomizedSearchCV, Hyperband does not require a trade-off between “evaluate many parameters for a short time” and “train a few parameters for a long time”.

Hyperband requires one explicit input, max_iter, which requires knowing how long to train the best performing estimator. The other, implicit input (the Dask array chunk size) requires a rough estimate of how many parameters to sample. Specification details are in Notes.

* After \(N\) partial_fit calls, the estimator Hyperband produces will be close to the best possible estimator that \(N\) partial_fit calls could ever produce with high probability (where “close” means “within log terms of the expected best possible score”).
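The early-stopping idea can be illustrated with a toy successive-halving loop, the kind of procedure Hyperband runs with different settings in each bracket. This is a minimal sketch and not dask-ml's implementation: `successive_halving`, `train_step`, and the dictionary "models" are hypothetical stand-ins, and each toy model simply gains a fixed amount of score per training step.

```python
def successive_halving(models, train_step, score, n_rounds, eta=3):
    """Toy successive halving: train the survivors, then keep the top 1/eta."""
    survivors = list(models)
    calls = 1  # training calls given to each survivor in the current round
    for _round in range(n_rounds):
        for model in survivors:
            for _call in range(calls):
                train_step(model)
        # Keep the best-scoring 1/eta of the survivors (always at least one).
        survivors = sorted(survivors, key=score, reverse=True)
        survivors = survivors[: max(1, len(survivors) // eta)]
        calls *= eta  # the remaining models earn a larger training budget
    return survivors


# Hypothetical "models": each one gains `rate` score per training step.
models = [{"id": i, "rate": 0.01 * (i + 1), "score": 0.0, "steps": 0} for i in range(9)]


def train_step(model):
    model["score"] += model["rate"]
    model["steps"] += 1


best = successive_halving(models, train_step, lambda m: m["score"], n_rounds=3)
print(best[0]["id"], best[0]["steps"])  # 8 13
```

The weakest toy models are trained for a single step, while the eventual winner receives 13 steps, which is the sense in which more time is spent on high-performing estimators.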

Parameters
estimator: estimator object

An object of that type is instantiated for each hyperparameter combination. This is assumed to implement the scikit-learn estimator interface. Either estimator needs to provide a score function, or scoring must be passed. The estimator must implement partial_fit, set_params, and work well with clone.

parameters: dict

Dictionary with parameter names (string) as keys and distributions or lists of parameters to try. Distributions must provide an rvs method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly.
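A search space mixing distributions and lists might look like the following. This is only a sketch: the parameter names are a hypothetical choice for an SGD-style estimator, and the sampling loop is an illustration of "call rvs or pick uniformly from the list", not the code dask-ml runs internally.

```python
import numpy as np
from scipy.stats import loguniform, uniform

# Hypothetical search space for an SGD-style estimator.
parameters = {
    "alpha": loguniform(1e-4, 1e0),       # distribution: sampled through .rvs()
    "l1_ratio": uniform(0, 1),            # distribution on [0, 1]
    "loss": ["hinge", "modified_huber"],  # list: sampled uniformly
}

rng = np.random.RandomState(0)
sample = {}
for name, candidates in parameters.items():
    if hasattr(candidates, "rvs"):
        sample[name] = candidates.rvs(random_state=rng)
    else:
        sample[name] = candidates[rng.randint(len(candidates))]
print(sample)
```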

max_iter: int, default=81

The maximum number of partial_fit calls to any one model. This should be the number of partial_fit calls required for the model to converge. See Notes for details on setting this parameter.

aggressiveness: int, default=3

How aggressive to be in culling off the different estimators. Higher values imply higher confidence in scoring (or that the hyperparameters influence estimator.score more than the data does). Theory suggests aggressiveness=3 is close to optimal. aggressiveness=4 assumes higher confidence in the scores and is likely suitable for initial exploration.

patience: int, default=False

If specified, training stops when the score does not increase by tol after patience calls to partial_fit. Off by default. A patience value is automatically selected if patience=True to work well with the Hyperband model selection algorithm.

tol: float, default=0.001

The required level of improvement to consider stopping training on that model when patience is specified. Increasing tol will tend to reduce training time at the cost of (potentially) worse estimators.
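One plausible reading of the patience and tol rule is sketched below. The helper `should_stop` is hypothetical and the exact bookkeeping inside dask-ml may differ; the sketch only shows how a plateau check of this kind behaves.

```python
def should_stop(scores, patience, tol=0.001):
    """Return True if none of the last `patience` scores improves on the
    score recorded just before them by at least `tol`."""
    if not patience or len(scores) <= patience:
        return False  # early stopping disabled, or not enough history yet
    baseline = scores[-patience - 1]
    return all(score < baseline + tol for score in scores[-patience:])


print(should_stop([0.5, 0.6, 0.6001, 0.6002, 0.6003], patience=3))  # True
print(should_stop([0.5, 0.6, 0.7, 0.8], patience=3))                # False
```

With a larger tol the first sequence would be stopped even sooner, which matches the trade-off described above: less training time, potentially worse estimators.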

test_size: float

Fraction of the dataset to hold out for computing test/validation scores. Defaults to the size of a single partition of the input training set.

Note

The testing dataset should fit in memory on a single machine. Adjust the test_size parameter as necessary to achieve this.

random_state: int, RandomState instance or None, optional, default=None

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

scoring: string, callable, list/tuple, dict or None, default=None

A single string (see The scoring parameter: defining model evaluation rules) or a callable (see Defining your scoring strategy from metric functions) to evaluate the predictions on the test set.

If None, the estimator’s default scorer (if available) is used.

verbose: bool, float, int, optional, default=False

If False (default), don’t print logs (or pipe them to stdout). However, standard logging will still be used.

If True, print logs and use standard logging.

If float, print/log approximately verbose fraction of the time.

prefix: str, optional, default=""

While logging, add prefix to each message.

predict_meta: pd.Series, pd.DataFrame, np.array, default=None (infer)

An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator's predict call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.

predict_proba_meta: pd.Series, pd.DataFrame, np.array, default=None (infer)

An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator's predict_proba call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.

transform_meta: pd.Series, pd.DataFrame, np.array, default=None (infer)

An empty pd.Series, pd.DataFrame, or np.array that matches the output type of the estimator's transform call. This meta is necessary for some estimators to work with dask.dataframe and dask.array.
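As a rough illustration of what such meta objects can look like, the sketch below builds empty containers with explicit dtypes. The shapes, dtypes, and the column name are assumptions chosen for a hypothetical two-class classifier and a one-column transformer; the right values depend on what the wrapped estimator actually returns.

```python
import numpy as np
import pandas as pd

# Empty containers whose type and dtype mirror the expected outputs.
predict_meta = np.array([], dtype="int64")              # assumed: integer class labels
predict_proba_meta = np.empty((0, 2), dtype="float64")  # assumed: two probability columns
transform_meta = pd.DataFrame({"x0": pd.Series(dtype="float64")})  # assumed column name

print(predict_meta.dtype, predict_proba_meta.shape, list(transform_meta.columns))
```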

Attributes
metadata and metadata_: dict[str, Union(int, dict)]

These dictionaries describe the computation performed, either before computation happens with metadata or after computation happens with metadata_. These dictionaries both have keys

  • n_models, an int representing how many models will be/is created.

  • partial_fit_calls, an int representing how many times partial_fit will be/is called.

  • brackets, a list of the brackets that Hyperband runs. Each bracket has different values for training time importance and hyperparameter importance. In addition to n_models and partial_fit_calls, each element in this list has keys

    • bracket, an int, the bracket ID. Each bracket corresponds to a different level of training time importance. For bracket 0, training time is important. For the highest bracket, training time is not important and models are killed aggressively.

    • SuccessiveHalvingSearchCV params, a dictionary used to create the different brackets. It does not include the estimator or parameters parameters.

    • decisions, the number of partial_fit calls Hyperband makes before making decisions.

These dictionaries are the same if patience is not specified. If patience is specified, it’s possible that less training is performed, and metadata_ will reflect that (though metadata won’t).
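To get a feel for how the brackets differ, the sketch below computes the number of models and the initial number of partial_fit calls per bracket using the formulas from Algorithm 1 of the Hyperband paper [1]. The helper `hyperband_brackets` is hypothetical, and it is an assumption that these figures line up with what dask-ml reports; the metadata attribute remains the authoritative source.

```python
def hyperband_brackets(max_iter, aggressiveness=3):
    """Return {bracket: (n_models, initial partial_fit calls)} following
    Algorithm 1 of the Hyperband paper."""
    eta = aggressiveness
    # Largest s with eta**s <= max_iter; integer arithmetic avoids log round-off.
    s_max = 0
    while eta ** (s_max + 1) <= max_iter:
        s_max += 1
    brackets = {}
    for s in range(s_max, -1, -1):
        # ceil((s_max + 1) * eta**s / (s + 1)) models start in this bracket.
        n_models = -(-(s_max + 1) * eta**s // (s + 1))
        # Each of them first receives about max_iter / eta**s partial_fit calls.
        first_calls = max_iter // eta**s
        brackets[s] = (n_models, first_calls)
    return brackets


print(hyperband_brackets(81))
# {4: (81, 1), 3: (34, 3), 2: (15, 9), 1: (8, 27), 0: (5, 81)}
```

Bracket 0 trains a few models for the full max_iter calls, while the highest bracket starts many models with a single call each and culls them quickly.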

cv_results_: Dict[str, np.ndarray]

A dictionary that describes how well each model has performed. It contains information about every model, regardless of whether it reached max_iter. It has the keys

  • mean_partial_fit_time

  • mean_score_time

  • std_partial_fit_time

  • std_score_time

  • test_score

  • rank_test_score

  • model_id

  • partial_fit_calls

  • params

  • param_{key}, where {key} is every key in params.

  • bracket

The values in the test_score key correspond to the last score a model received on the hold out dataset. The key model_id corresponds with history_. This dictionary can be imported into a Pandas DataFrame.

In the model_id, the bracket ID prefix corresponds to the bracket in metadata. Bracket 0 doesn’t adapt to previous training at all; higher values correspond to more adaptation.
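Loading the results into Pandas could look like the sketch below. The dictionary is a made-up stand-in for cv_results_ with only a few of the keys listed above; the model IDs and scores are placeholders, not real output.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for search.cv_results_ (three models, a few keys only).
cv_results = {
    "model_id": np.array(["m0", "m1", "m2"]),  # placeholder IDs
    "bracket": np.array([0, 1, 1]),
    "partial_fit_calls": np.array([81, 27, 3]),
    "test_score": np.array([0.91, 0.88, 0.62]),
    "rank_test_score": np.array([1, 2, 3]),
}

df = pd.DataFrame(cv_results)
# Models that were stopped early show fewer partial_fit calls.
print(df.sort_values("rank_test_score"))
best_row = df.loc[df["rank_test_score"].idxmin()]
print(best_row["model_id"])
```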

history_: list of dicts

Information about each model after each partial_fit call. Each dict has the keys

  • partial_fit_time

  • score_time

  • score

  • model_id

  • params

  • partial_fit_calls

  • elapsed_wall_time

The key model_id corresponds to the model_id in cv_results_. This list of dicts can be imported into Pandas.

model_history_: dict of lists of dict

A dictionary of each model's history. This is a reorganization of history_: the same information is present but organized per model.

This data has the structure {model_id: [h1, h2, h3, ...]} where h1, h2 and h3 are elements of history_ and model_id is the model ID as in cv_results_.
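The relationship between the two attributes can be sketched as a simple group-by on model_id. The records below are made up and carry only some of the keys of history_; the grouping loop illustrates the structure and is not taken from dask-ml's source.

```python
from collections import defaultdict

# Made-up history_ records (placeholder IDs and scores, subset of keys).
history = [
    {"model_id": "a", "partial_fit_calls": 1, "score": 0.61},
    {"model_id": "b", "partial_fit_calls": 1, "score": 0.55},
    {"model_id": "a", "partial_fit_calls": 2, "score": 0.67},
]

# Group the flat list per model, preserving the order of the records.
model_history = defaultdict(list)
for record in history:
    model_history[record["model_id"]].append(record)

print({k: [h["score"] for h in v] for k, v in model_history.items()})
# {'a': [0.61, 0.67], 'b': [0.55]}
```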

best_estimator_: BaseEstimator

The model with the highest validation score as selected by the Hyperband model selection algorithm.

best_score_: float

Score achieved by best_estimator_ on the validation set after the final call to partial_fit.

best_index_: int

Index indicating which estimator in cv_results_ corresponds to the highest score.

best_params_: dict

Dictionary of best parameters found on the hold-out data.

scorer_

The function used to score models, which has a call signature of scorer_(estimator, X, y).

Notes

To set max_iter and the chunk size for X and y, it is required to estimate

  • the number of examples at least one model will see (n_examples). If 10 passes through the data are needed for the longest trained model, n_examples = 10 * len(X).

  • how many hyper-parameter combinations to sample (n_params)

These can be rough guesses. To determine the chunk size and max_iter,

  1. Let the chunk size be chunk_size = n_examples / n_params

  2. Let max_iter = n_params

Then, every estimator sees no more than max_iter * chunk_size = n_examples examples. Hyperband will actually sample some more hyper-parameter combinations than n_params (which is why rough guesses are adequate). For example, let’s say

  • about 200 or 300 hyper-parameters need to be tested to effectively search the possible hyper-parameters

  • models need more than 50 * len(X) examples but less than 100 * len(X) examples.

Let’s decide to provide 81 * len(X) examples and to sample 243 parameters. Then each chunk will be 1/3rd the dataset and max_iter=243.
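The arithmetic of this rule of thumb can be written out as below. The helper `hyperband_inputs` and the dataset size of 30,000 rows are hypothetical; only the two formulas come from the steps above.

```python
def hyperband_inputs(n_samples, n_passes, n_params):
    """Apply the rule of thumb: chunk_size = n_examples / n_params, max_iter = n_params."""
    n_examples = n_passes * n_samples    # examples the longest-trained model sees
    chunk_size = n_examples // n_params  # rows per Dask chunk
    max_iter = n_params
    return chunk_size, max_iter


# Hypothetical dataset with 30,000 rows, 81 passes, 243 sampled parameters.
chunk_size, max_iter = hyperband_inputs(n_samples=30_000, n_passes=81, n_params=243)
print(chunk_size, max_iter)  # 10000 243
```

Here each chunk holds 10,000 rows, one third of the dataset, as in the example above. For a Dask array, the chunk size could then be applied with something like X.rechunk((chunk_size, -1)) before fitting.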

If you use HyperbandSearchCV, please use the citation for [2]

@InProceedings{sievert2019better,
    author    = {Scott Sievert and Tom Augspurger and Matthew Rocklin},
    title     = {{B}etter and faster hyperparameter optimization with {D}ask},
    booktitle = {{P}roceedings of the 18th {P}ython in {S}cience {C}onference},
    pages     = {118 - 125},
    year      = {2019},
    editor    = {Chris Calloway and David Lippa and Dillon Niederhut and David Shupe},
    doi       = {10.25080/Majora-7ddc1dd1-011}
}

References

[1] “Hyperband: A novel bandit-based approach to hyperparameter optimization”, 2016 by L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. https://arxiv.org/abs/1603.06560

[2] “Better and faster hyperparameter optimization with Dask”, 2019 by S. Sievert, T. Augspurger, and M. Rocklin. https://doi.org/10.25080/Majora-7ddc1dd1-011

Examples

>>> import numpy as np
>>> from dask_ml.model_selection import HyperbandSearchCV
>>> from dask_ml.datasets import make_classification
>>> from sklearn.linear_model import SGDClassifier
>>>
>>> X, y = make_classification(chunks=20)
>>> est = SGDClassifier(tol=1e-3)
>>> param_dist = {'alpha': np.logspace(-4, 0, num=1000),
...               'loss': ['hinge', 'log_loss', 'modified_huber', 'squared_hinge'],
...               'average': [True, False]}
>>>
>>> search = HyperbandSearchCV(est, param_dist)
>>> search.fit(X, y, classes=np.unique(y))
>>> search.best_params_
{'loss': 'log_loss', 'average': False, 'alpha': 0.0080502}

Methods

decision_function(X)

fit(X[, y])

Find the best parameters for a particular model.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

inverse_transform(Xt)

predict(X)

Predict for X.

predict_log_proba(X)

Log of probability estimates.

predict_proba(X)

Probability estimates.

score(X[, y])

Returns the score on the given data.

set_params(**params)

Set the parameters of this estimator.

set_score_request(*[, compute])

Request metadata passed to the score method.

transform(X)

Transform block or partition-wise for dask inputs.

partial_fit

__init__(estimator, parameters, max_iter=81, aggressiveness=3, patience=False, tol=0.001, test_size=None, random_state=None, scoring=None, verbose=False, prefix='', predict_meta=None, predict_proba_meta=None, transform_meta=None)