Hyper Parameter Search
Contents
Hyper Parameter Search¶
Tools to perform hyperparameter optimization of ScikitLearn APIcompatible models using Dask, and to scale hyperparameter optimization to larger data and/or larger searches.
Hyperparameter searches are a required process in machine learning. Briefly, machine learning models require certain “hyperparameters”, model parameters that can be learned from the data. Finding these good values for these parameters is a “hyperparameter search” or an “hyperparameter optimization.” For more detail, see “Tuning the hyperparameters of an estimator.”
These searches can take an ample time (days or weeks), especially when good performance is desired and/or with massive datasets, which is common when preparing for production or a paper publication. The following section clarifies the issues that can occur:
“Scaling hyperparameter searches” mentions problems that often occur in hyperparameter optimization searches.
Tools that address these problems are expanded upon in these sections:
“DropIn Replacements for ScikitLearn” details classes that mirror the Scikitlearn estimators but work nicely with Dask objects and can offer better performance.
“Incremental Hyperparameter Optimization” details classes that work well with large datasets.
“Adaptive Hyperparameter Optimization” details classes that avoid extra computation and find highperforming hyperparameters more quickly.
Scaling hyperparameter searches¶
DaskML provides classes to avoid the two most common issues in hyperparameter optimization, when the hyperparameter search is…
“memory constrained”. This happens when the dataset size is too large to fit in memory. This typically happens when a model needs to be tuned for a largerthanmemory dataset after local development.
“compute constrained”. This happen when the computation takes too long even with data that can fit in memory. This typically happens when many hyperparameters need to be tuned or the model requires a specialized hardware (e.g., GPUs).
“Memory constrained” searches happen when the data doesn’t fit in the memory of a single machine:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>>
>>> ## not memory constrained
>>> df = pd.read_csv("data/0.parquet")
>>> df.shape
(30000, 200) # => 23MB
>>>
>>> ## memory constrained
>>> # Read 1000 of the above dataframes (=> 22GB of data)
>>> ddf = dd.read_parquet("data/*.parquet")
“Compute constrained” is when the hyperparameter search takes too long even if the data fits in memory. There might a lot of hyperparameters to search, or the model may require specialized hardware like GPUs:
>>> import pandas as pd
>>> from scipy.stats import uniform, loguniform
>>> from sklearn.linear_model import SGDClasifier
>>>
>>> df = pd.read_parquet("data/0.parquet") # data to train on; 23MB as above
>>>
>>> model = SGDClasifier()
>>>
>>> # not compute constrained
>>> params = {"l1_ratio": uniform(0, 1)}
>>>
>>> # compute constrained
>>> params = {
... "l1_ratio": uniform(0, 1),
... "alpha": loguniform(1e5, 1e1),
... "penalty": ["l2", "l1", "elasticnet"],
... "learning_rate": ["invscaling", "adaptive"],
... "power_t": uniform(0, 1),
... "average": [True, False],
... }
>>>
These issues are independent and both can happen the same time. DaskML has tools to address all 4 combinations. Let’s look at each case.
Neither compute nor memory constrained¶
This case happens when there aren’t many hyperparameters to tune and the data fits in memory. This is common when the search doesn’t take too long to run.
Scikitlearn can handle this case:

Exhaustive search over specified parameter values for an estimator. 
Randomized search on hyper parameters. 
DaskML also has some drop in replacements for the Scikitlearn versions that works well with Dask collections (like Dask Arrays and Dask DataFrames):

Exhaustive search over specified parameter values for an estimator. 
Randomized search on hyper parameters. 
By default, these estimators will efficiently pass the entire dataset to
fit
if a Dask Array/DataFrame is passed. More detail is in
“Works Well With Dask Collections”.
These estimators above work especially well with models that have expensive preprocessing, which is common in natural language processing (NLP). More detail is in “Compute constrained, but not memory constrained” and “Avoid Repeated Work”.
Memory constrained, but not compute constrained¶
This case happens when the data doesn’t fit in memory but there aren’t many
hyperparameters to search over. The data doesn’t fit in memory, so it makes
sense to call partial_fit
on each chunk of a Dask Array/Dataframe. This
estimator does that:
Incrementally search for hyperparameters on models that support partial_fit 
More detail on IncrementalSearchCV
is in
“Incremental Hyperparameter Optimization”.
Dask’s implementation of GridSearchCV
and
RandomizedSearchCV
can to also call
partial_fit
on each chunk of a Dask array, as long as the model passed is
wrapped with Incremental
.
Compute constrained, but not memory constrained¶
This case happens when the data fits in the memory of one machine but when
there are a lot of hyperparameters to search, or the models require specialized
hardware like GPUs. The best class for this case is
HyperbandSearchCV
:
Find the best parameters for a particular model with an adaptive crossvalidation algorithm. 
Briefly, this estimator is easy to use, has strong mathematical motivation and performs remarkably well. For more detail, see “Hyperband parameters: ruleofthumb” and “Hyperband Performance”.
Two other adaptive hyperparameter optimization algorithms are implemented in these classes:
Perform the successive halving algorithm [R424ea1a907b11]. 

Incrementally search for hyperparameters on models that support partial_fit 
The input parameters for these classes are more difficult to configure.
All of these searches can reduce time to solution by (cleverly) deciding which
parameters to evaluate. That is, these searches adapt to history to decide
which parameters to continue evaluating. All of these estimators support
ignoring models models with decreasing score via the patience
and tol
parameters.
Another way to limit computation is to avoid repeated work during during the searches. This is especially useful with expensive preprocessing, which is common in natural language processing (NLP).
Randomized search on hyper parameters. 


Exhaustive search over specified parameter values for an estimator. 
Avoiding repeated work with this class relies on the model being an instance of
Scikitlearn’s Pipeline
. See
“Avoid Repeated Work” for more detail.
Compute and memory constrained¶
This case happens when the dataset is larger than memory and there are many parameters to search. In this case, it’s useful to have strong support for Dask Arrays/DataFrames and to decide which models to continue training.
Find the best parameters for a particular model with an adaptive crossvalidation algorithm. 

Perform the successive halving algorithm [R424ea1a907b11]. 

Incrementally search for hyperparameters on models that support partial_fit 
These classes work well with data that does not fit in memory. They also reduce the computation required as described in “Compute constrained, but not memory constrained.”
Now, let’s look at these classes indepth.
“DropIn Replacements for ScikitLearn” details
RandomizedSearchCV
andGridSearchCV
.“Incremental Hyperparameter Optimization” details
IncrementalSearchCV
and all it’s subclasses (one of which isHyperbandSearchCV
).“Adaptive Hyperparameter Optimization” details usage and performance of
HyperbandSearchCV
.
DropIn Replacements for ScikitLearn¶
DaskML implements dropin replacements for
GridSearchCV
and
RandomizedSearchCV
.

Exhaustive search over specified parameter values for an estimator. 
Randomized search on hyper parameters. 
The variants in DaskML implement many (but not all) of the same parameters, and should be a dropin replacement for the subset that they do implement. In that case, why use DaskML’s versions?
Flexible Backends: Hyperparameter optimization can be done in parallel using threads, processes, or distributed across a cluster.
Works well with Dask collections. Dask arrays, dataframes, and delayed can be passed to
fit
.Avoid repeated work. Candidate models with identical parameters and inputs will only be fit once. For compositemodels such as
Pipeline
this can be significantly more efficient as it can avoid expensive repeated computations.
Both Scikitlearn’s and DaskML’s model selection metaestimators can be used with Dask’s joblib backend.
Flexible Backends¶
DaskML can use any of the dask schedulers. By default the threaded scheduler is used, but this can easily be swapped out for the multiprocessing or distributed scheduler:
# Distribute gridsearch across a cluster
from dask.distributed import Client
scheduler_address = '127.0.0.1:8786'
client = Client(scheduler_address)
search.fit(digits.data, digits.target)
Works Well With Dask Collections¶
Dask collections such as dask.array
, dask.dataframe
and
dask.delayed
can be passed to fit
. This means you can use dask to do
your data loading and preprocessing as well, allowing for a clean workflow.
This also allows you to work with remote data on a cluster without ever having
to pull it locally to your computer:
import dask.dataframe as dd
# Load data from s3
df = dd.read_csv('s3://bucketname/mydata*.csv')
# Do some preprocessing steps
df['x2'] = df.x  df.x.mean()
# ...
# Pass to fit without ever leaving the cluster
search.fit(df[['x', 'x2']], df['y'])
This example will compute each CV split and store it on a single machine so
fit
can be called.
Avoid Repeated Work¶
When searching over composite models like sklearn.pipeline.Pipeline
or
sklearn.pipeline.FeatureUnion
, DaskML will avoid fitting the same
model + parameter + data combination more than once. For pipelines with
expensive early steps this can be faster, as repeated work is avoided.
For example, given the following 3stage pipeline and grid (modified from this Scikitlearn example).
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
pipeline = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SGDClassifier())])
grid = {'vect__ngram_range': [(1, 1)],
'tfidf__norm': ['l1', 'l2'],
'clf__alpha': [1e3, 1e4, 1e5]}
the ScikitLearn gridsearch implementation looks something like (simplified):
scores = []
for ngram_range in parameters['vect__ngram_range']:
for norm in parameters['tfidf__norm']:
for alpha in parameters['clf__alpha']:
vect = CountVectorizer(ngram_range=ngram_range)
X2 = vect.fit_transform(X, y)
tfidf = TfidfTransformer(norm=norm)
X3 = tfidf.fit_transform(X2, y)
clf = SGDClassifier(alpha=alpha)
clf.fit(X3, y)
scores.append(clf.score(X3, y))
best = choose_best_parameters(scores, parameters)
As a directed acyclic graph, this might look like:
In contrast, the dask version looks more like:
scores = []
for ngram_range in parameters['vect__ngram_range']:
vect = CountVectorizer(ngram_range=ngram_range)
X2 = vect.fit_transform(X, y)
for norm in parameters['tfidf__norm']:
tfidf = TfidfTransformer(norm=norm)
X3 = tfidf.fit_transform(X2, y)
for alpha in parameters['clf__alpha']:
clf = SGDClassifier(alpha=alpha)
clf.fit(X3, y)
scores.append(clf.score(X3, y))
best = choose_best_parameters(scores, parameters)
With a corresponding directed acyclic graph:
Looking closely, you can see that the ScikitLearn version ends up fitting earlier steps in the pipeline multiple times with the same parameters and data. Due to the increased flexibility of Dask over Joblib, we’re able to merge these tasks in the graph and only perform the fit step once for any parameter/data/model combination. For pipelines that have relatively expensive early steps, this can be a big win when performing a grid search.
Incremental Hyperparameter Optimization¶
Incrementally search for hyperparameters on models that support partial_fit 

Find the best parameters for a particular model with an adaptive crossvalidation algorithm. 

Perform the successive halving algorithm [R424ea1a907b11]. 

Incrementally search for hyperparameters on models that support partial_fit 
These estimators all handle Dask arrays/dataframe identically. The example will
use HyperbandSearchCV
, but it can easily be
generalized to any of the above estimators.
Note
These estimators require that the model implement partial_fit
.
By default, these class will call partial_fit
on each chunk of the data.
These classes can stop training any models if their score stops increasing
(via patience
and tol
). They even get one step fancier, and can choose
which models to call partial_fit
on.
First, let’s look at basic usage. “Adaptive Hyperparameter Optimization” details estimators that reduce the amount of computation required.
Basic use¶
This section uses HyperbandSearchCV
, but it can
also be applied to to IncrementalSearchCV
too.
In [1]: from dask.distributed import Client
In [2]: from dask_ml.datasets import make_classification
In [3]: from dask_ml.model_selection import train_test_split
In [4]: client = Client()
In [5]: X, y = make_classification(chunks=20, random_state=0)
In [6]: X_train, X_test, y_train, y_test = train_test_split(X, y)
Our underlying model is an sklearn.linear_model.SGDClasifier
. We
specify a few parameters common to each clone of the model:
In [7]: from sklearn.linear_model import SGDClassifier
In [8]: clf = SGDClassifier(tol=1e3, penalty='elasticnet', random_state=0)
We also define the distribution of parameters from which we will sample:
In [9]: from scipy.stats import uniform, loguniform
In [10]: params = {'alpha': loguniform(1e2, 1e0), # or np.logspace
....: 'l1_ratio': uniform(0, 1)} # or np.linspace
....:
Finally we create many random models in this parameter space and trainandscore them until we find the best one.
In [11]: from dask_ml.model_selection import HyperbandSearchCV
In [12]: search = HyperbandSearchCV(clf, params, max_iter=81, random_state=0)
In [13]: search.fit(X_train, y_train, classes=[0, 1]);
In [14]: search.best_params_
Out[14]: {'alpha': 0.08887680304436768, 'l1_ratio': 0.05813789374393907}
In [15]: search.best_score_
Out[15]: 0.65
In [16]: search.score(X_test, y_test)
Out[16]: 0.4
Note that when you do postfit tasks like search.score
, the underlying
model’s score method is used. If that is unable to handle a
largerthanmemory Dask Array, you’ll exhaust your machine’s memory. If you plan
to use postestimation features like scoring or prediction, we recommend using
dask_ml.wrappers.ParallelPostFit
.
In [17]: from dask_ml.wrappers import ParallelPostFit
In [18]: params = {'estimator__alpha': loguniform(1e2, 1e0),
....: 'estimator__l1_ratio': uniform(0, 1)}
....:
In [19]: est = ParallelPostFit(SGDClassifier(tol=1e3, random_state=0))
In [20]: search = HyperbandSearchCV(est, params, max_iter=9, random_state=0)
In [21]: search.fit(X_train, y_train, classes=[0, 1]);
In [22]: search.score(X_test, y_test)
Out[22]: 0.6
Note that the parameter names include the estimator__
prefix, as we’re
tuning the hyperparameters of the sklearn.linear_model.SGDClasifier
that’s underlying the dask_ml.wrappers.ParallelPostFit
.
Adaptive Hyperparameter Optimization¶
DaskML has these estimators that adapt to historical data to determine which
models to continue training. This means high scoring models can be found with
fewer cumulative calls to partial_fit
.
Find the best parameters for a particular model with an adaptive crossvalidation algorithm. 

Perform the successive halving algorithm [R424ea1a907b11]. 
IncrementalSearchCV
also fits in this class
when decay_rate=1
. All of these estimators require an implementation of
partial_fit
, and they all work with largerthanmemory datasets as
mentioned in “Incremental Hyperparameter Optimization”.
HyperbandSearchCV
has several niceties
mentioned in the following sections:
Hyperband parameters: ruleofthumb: a good ruleofthumb to determine
HyperbandSearchCV
’s input parameters.Hyperband Performance: how quickly
HyperbandSearchCV
will find high performing models.
Let’s see how well Hyperband does when the inputs are chosen with the provided ruleofthumb.
Hyperband parameters: ruleofthumb¶
HyperbandSearchCV
has two inputs:
max_iter
, which determines how many times to callpartial_fit
the chunk size of the Dask array, which determines how many data each
partial_fit
call receives.
These fall out pretty naturally once it’s known how long to train the best model and very approximately how many parameters to sample:
In [23]: n_examples = 20 * len(X_train) # 20 passes through dataset for best model
In [24]: n_params = 94 # sample approximately 100 parameters; more than 94 will be sampled
With this, it’s easy use a ruleofthumb to compute the inputs to Hyperband:
In [25]: max_iter = n_params
In [26]: chunk_size = n_examples // n_params # implicit
Now that we’ve determined the inputs, let’s create our search object and rechunk the Dask array:
In [27]: clf = SGDClassifier(tol=1e3, penalty='elasticnet', random_state=0)
In [28]: params = {'alpha': loguniform(1e2, 1e0), # or np.logspace
....: 'l1_ratio': uniform(0, 1)} # or np.linspace
....:
In [29]: search = HyperbandSearchCV(clf, params, max_iter=max_iter, aggressiveness=4, random_state=0)
In [30]: X_train = X_train.rechunk((chunk_size, 1))
In [31]: y_train = y_train.rechunk(chunk_size)
We used aggressiveness=4
because this is an initial search. I don’t know
much about the data, model or hyperparameters. If I had at least some sense of
what hyperparameters to use, I would specify aggressiveness=3
, the default.
The inputs to this ruleofthumb are exactly what the user cares about:
A measure of how complex the search space is (via
n_params
)How long to train the best model (via
n_examples
)How confident they are in the hyperparameters (via
aggressiveness
).
Notably, there’s no tradeoff between n_examples
and n_params
like with
RandomizedSearchCV
because n_examples
is
only for some models, not for all models. There’s more details on this
ruleofthumb in the “Notes” section of
HyperbandSearchCV
However, this does not explicitly mention the amount of computation performed – it’s only an approximation. The amount of computation can be viewed like so:
In [32]: search.metadata["partial_fit_calls"] # best model will see `max_iter` chunks
Out[32]: 1151
In [33]: search.metadata["n_models"] # actual number of parameters to sample
Out[33]: 98
This samples many more hyperparameters than RandomizedSearchCV
, which would
only sample about 12 hyperparameters (or initialize 12 models) for the same
amount of computation. Let’s fit
HyperbandSearchCV
with these different
chunks:
In [34]: search.fit(X_train, y_train, classes=[0, 1]);
In [35]: search.best_params_
Out[35]: {'alpha': 0.36518440093040994, 'l1_ratio': 0.5062738685222655}
To be clear, this is a very small toy example: there are only 100 observations and 20 features. Let’s see how the performance scales with a more realistic example.
Hyperband Performance¶
This performance comparison will briefly summarize an experiment to find performance results. This is similar to the case above. Complete details can be found in the Dask blog post “Better and faster hyperparameter optimization with Dask”.
It will use these estimators with the following inputs:
Model: Scikitlearn’s
MLPClassifier
with 12 neuronsDataset: A simple synthetic dataset with 4 classes and 6 features (2 meaningful features and 4 random features):
Let’s search for the best model to classify this dataset. Let’s search over these parameters:
One hyperparameters that control optimal model architecture:
hidden_layer_sizes
. This can take values that have 12 neurons; for example, 6 neurons in two layers or 4 neurons in 3 layers.Six hyperparameters that control finding the optimal model of a particular architecture. This includes hyperparameters such as weight decay and various optimization parameters (including batch size, learning rate and momentum).
Here’s how we’ll configure the two different estimators:
“Hyperband” is configured with ruleofthumb above with
n_params = 299
1 andn_examples = 50 * len(X_train)
.“Incremental” is configured to do the same amount of work as Hyperband with
IncrementalSearchCV(..., n_initial_parameters=19, decay_rate=0)
These two estimators are configured do the same amount of computation, the equivalent of fitting about 19 models. With this amount of computation, how do the final accuracies look?
This is great – HyperbandSearchCV
looks to
be a lot more confident than
IncrementalSearchCV
. But how fast do these
searches find models of (say) 85% accuracy? Experimentally, Hyperband reaches
84% accuracy at about 350 passes through the dataset, and Incremental requires
900 passes through the dataset:
“Passes through the dataset” is a good proxy for “time to solution” in this case because only 4 Dask workers are used, and they’re all busy for the vast majority of the search. How does this change with the number of workers?
To see this, let’s analyze how the timetocompletion for Hyperband varies with the number of Dask workers in a separate experiment.
It looks like the speedup starts to saturate around 24 Dask workers. This number will increase if the search space becomes larger or if model evaluation takes longer.
 1
Approximately 300 parameters were desired; 299 was chosen to make the Dask array chunk evenly