Applying Model-Based Optimization to Hyperparameter Optimization in Machine Learning
Janek Thomas
Computational Statistics, LMU Munich
Jun 7th, 2018

Sequential model-based optimization
Parallel batch proposals
Multicriteria SMBO
Interesting Challenges
ML Model Selection and Hyperparameter Optimization

Sequential model-based optimization

Expensive Black-Box Optimization
$y = f(x), \quad f : \mathcal{X} \to \mathbb{R}$ (1)
$x^* = \arg\min_{x \in \mathcal{X}} f(x)$ (2)
$y$: target value
$x \in \mathcal{X} \subset \mathbb{R}^d$: domain
f (x) function with considerably long runtime
Goal: Find optimum x∗

Sequential model-based optimization
Setting: Expensive black-box problem f : x → R = min!
Classical problems: a computer simulation with several control parameters and a performance output, or algorithmic performance on one or more problem instances; we often optimize ML pipelines
Idea: Let's approximate f via regression!
Generic MBO Pseudo Code
Create initial space filling design and evaluate with f
In each iteration:
Fit regression model on all evaluated points to predict $\hat f(x)$ and uncertainty $\hat s(x)$
Propose point via infill criterion: $EI(x) \uparrow \iff \hat f(x) \downarrow \,\wedge\, \hat s(x) \uparrow$
Evaluate proposed point and add to design
EGO proposes kriging (aka Gaussian Process) and EI
Jones 1998, Efficient Global Opt. of Exp. Black-Box Functions
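The generic loop above can be sketched in a few lines of Python. This is a minimal illustration, not mlrMBO's implementation: it assumes a 1D problem, a random (not space-filling) initial design, a squared-exponential kernel with a fixed length scale `ell`, a plug-in constant mean, and infill optimization on a fixed candidate grid. All function names are made up for this sketch.

```python
import numpy as np
from math import erf, pi, sqrt

def kernel(a, b, ell=1.0):
    # Squared-exponential correlation on 1-D inputs (assumed kernel choice).
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def fit_predict(x, y, x_new, ell=1.0, nugget=1e-6):
    # Surrogate: GP with plug-in constant mean; returns mean and sd at x_new.
    K = kernel(x, x, ell) + nugget * np.eye(len(x))
    ks = kernel(x, x_new, ell)
    mu = y.mean()
    mean = mu + ks.T @ np.linalg.solve(K, y - mu)
    rel_var = 1.0 - np.sum(ks * np.linalg.solve(K, ks), axis=0)
    return mean, np.sqrt(np.maximum(y.var() * rel_var, 1e-12))

def expected_improvement(mean, sd, f_min):
    # Closed-form EI for minimization under a normal predictive distribution.
    z = (f_min - mean) / sd
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2.0 * pi)
    return (f_min - mean) * cdf + sd * pdf

def smbo(f, lower, upper, n_init=5, n_iter=15, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lower, upper, n_init)          # initial design
    y = np.array([f(v) for v in x])                # evaluate with f
    cand = np.linspace(lower, upper, 501)          # candidates for the infill search
    for _ in range(n_iter):
        mean, sd = fit_predict(x, y, cand)         # fit surrogate on all points
        ei = expected_improvement(mean, sd, y.min())
        x_new = cand[np.argmax(ei)]                # propose point via EI
        x = np.append(x, x_new)                    # evaluate and add to design
        y = np.append(y, f(x_new))
    best = np.argmin(y)
    return x[best], y[best]

if __name__ == "__main__":
    print(smbo(lambda v: (v - 2.0) ** 2, 0.0, 5.0))
```

In practice the kernel parameters would be estimated from the data and the infill criterion optimized with a proper optimizer rather than a grid.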

Latin Hypercube Designs
Initial design to train first regression model
Not too small, not too large
LHS / maximin designs: Min dist between points is maximized
But: the type of design usually does not have the largest effect on MBO, and unequal distances between points could even be beneficial
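A Latin hypercube design on the unit cube can be sketched as follows. This is a plain-Python illustration; the "maximin" variant here simply keeps the best of several random LHS designs, which is a crude stand-in for a real maximin optimizer. The function names and the `tries` parameter are assumptions for this sketch.

```python
import random

def latin_hypercube(n, d, seed=0):
    """Return n points in [0, 1]^d with exactly one point per stratum in every dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(d):
        # One uniform draw inside each of the n equal-width strata, then shuffle.
        col = [(i + rng.random()) / n for i in range(n)]
        rng.shuffle(col)
        columns.append(col)
    return [tuple(col[i] for col in columns) for i in range(n)]

def min_distance(points):
    """Smallest pairwise Euclidean distance (the quantity a maximin design maximizes)."""
    return min(
        sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )

def maximin_lhs(n, d, tries=50):
    # Crude maximin: generate several LHS designs and keep the most spread-out one.
    return max((latin_hypercube(n, d, seed=s) for s in range(tries)), key=min_distance)
```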

Kriging and local uncertainty prediction
Model: Zero-mean GP $Y(x)$ with const. trend and cov. kernel $k_\theta(x_1, x_2)$.
$y = (y_1, \ldots, y_n)^T$, $K = (k(x_i, x_j))_{i,j=1,\ldots,n}$
$k_*(x) = (k(x_1, x), \ldots, k(x_n, x))^T$
$\hat\mu = \mathbf{1}^T K^{-1} y / \mathbf{1}^T K^{-1} \mathbf{1}$ (BLUE)
Prediction: $\hat f(x) = E[Y(x) \mid Y(x_i) = y_i, i = 1, \ldots, n] = \hat\mu + k_*(x)^T K^{-1} (y - \hat\mu \mathbf{1})$
Uncertainty: $\hat s^2(x) = Var[Y(x) \mid Y(x_i) = y_i, i = 1, \ldots, n] = \sigma^2 - k_*(x)^T K^{-1} k_*(x) + \frac{(1 - \mathbf{1}^T K^{-1} k_*(x))^2}{\mathbf{1}^T K^{-1} \mathbf{1}}$
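The prediction and uncertainty formulas translate almost literally into NumPy. The sketch below assumes 1D inputs, fixed kernel parameters `sigma2` and `ell` (in practice these are estimated), and a squared-exponential kernel; the function names are illustrative.

```python
import numpy as np

def gauss_kernel(a, b, sigma2=1.0, ell=1.0):
    # Assumed kernel: sigma2 * exp(-(a - b)^2 / (2 ell^2)) on 1-D inputs.
    return sigma2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def kriging_predict(x, y, x_new, sigma2=1.0, ell=1.0):
    one = np.ones(len(x))
    K = gauss_kernel(x, x, sigma2, ell)
    k_star = gauss_kernel(x, x_new, sigma2, ell)      # shape (n, n_new)
    K_inv_one = np.linalg.solve(K, one)
    # BLUE of the constant trend: 1' K^-1 y / 1' K^-1 1
    mu_hat = (K_inv_one @ y) / (K_inv_one @ one)
    K_inv_k = np.linalg.solve(K, k_star)
    # Prediction: mu_hat + k*' K^-1 (y - mu_hat 1)
    f_hat = mu_hat + K_inv_k.T @ (y - mu_hat)
    # Uncertainty: sigma2 - k*' K^-1 k* + (1 - 1' K^-1 k*)^2 / (1' K^-1 1)
    s2_hat = (sigma2 - np.sum(k_star * K_inv_k, axis=0)
              + (1.0 - one @ K_inv_k) ** 2 / (one @ K_inv_one))
    return f_hat, np.maximum(s2_hat, 0.0)
```

At a design point the model interpolates and the variance vanishes; far from all design points the prediction falls back to the estimated trend and the variance is at least `sigma2`.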

Kriging / GP is a spatial model
Correlation between outcomes (y1, y2 ) depends on dist of x1, x2
E.g. Gaussian covar kernel $k(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|}{2\sigma}\right)$
Useful smoothness assumption for optimization
Posterior uncertainty at new x increases with dist to design points
This allows us to enforce exploration

Infill Criteria: Expected Improvement
Define improvement at $x$ over best visited point with $y = f_{\min}$ as random variable $I(x) = |f_{\min} - Y(x)|^+$
For kriging $Y(x) \sim N(\hat f(x), \hat s^2(x))$ (given $\mathbf{x} = x$)
Now define $EI(x) = E[I(x) \mid \mathbf{x} = x]$
Expectation is integral over normal density starting at $f_{\min}$
Alternative: Lower confidence bound (LCB) $\hat f(x) - \lambda \hat s(x)$
Result: $EI(x) = \left(f_{\min} - \hat f(x)\right) \Phi\left(\frac{f_{\min} - \hat f(x)}{\hat s(x)}\right) + \hat s(x)\, \phi\left(\frac{f_{\min} - \hat f(x)}{\hat s(x)}\right)$
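Both infill criteria on this slide are cheap to compute once the surrogate returns a mean and a standard deviation. A minimal sketch using only the standard library (the handling of a zero standard deviation is an assumption of this sketch):

```python
from math import erf, exp, pi, sqrt

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def expected_improvement(f_hat, s_hat, f_min):
    """EI for minimization, given predicted mean f_hat and sd s_hat at one point."""
    if s_hat <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(f_min - f_hat, 0.0)
    z = (f_min - f_hat) / s_hat
    return (f_min - f_hat) * normal_cdf(z) + s_hat * normal_pdf(z)

def lower_confidence_bound(f_hat, s_hat, lam=1.0):
    """LCB (to be minimized); lam trades off exploitation and exploration."""
    return f_hat - lam * s_hat
```

EI grows both when the predicted mean drops below the current best and when the predicted uncertainty increases, which matches the trade-off stated in the pseudo code.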

Focussearch
EI optimization is multimodal and not that simple
But objective is now cheap to evaluate
Many different algorithms exist, from gradient-based methods with restarts to evolutionary algorithms
We use an iterated, focusing random search coined "focus search"
In each iteration a random search is performed
We then shrink the constraints of the feasible region towards the best point in the current iteration (focusing) and iterate, to enforce local convergence
Whole process is restarted a few times
Also works for categorical and hierarchical params
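The idea of focus search can be sketched for purely numeric box constraints. This is an illustration of the principle only: the shrink factor (halving each side), the number of points, and all parameter names are assumptions, and the actual mlrMBO routine may differ in its details.

```python
import random

def focus_search(crit, lower, upper, n_restarts=3, n_shrinks=5, n_points=200, seed=0):
    """Maximize crit over the box [lower, upper] by iterated, shrinking random search."""
    rng = random.Random(seed)
    best_x, best_val = None, float("-inf")
    for _ in range(n_restarts):                      # whole process is restarted
        lo, hi = list(lower), list(upper)
        for _ in range(n_shrinks):
            # Random search inside the current box.
            cands = [[rng.uniform(a, b) for a, b in zip(lo, hi)] for _ in range(n_points)]
            x_it = max(cands, key=crit)
            val_it = crit(x_it)
            if val_it > best_val:
                best_x, best_val = x_it, val_it
            # Focus: halve each side and centre the box on this iteration's best
            # point, clipped to the original bounds.
            for j, (a, b) in enumerate(zip(lo, hi)):
                quarter = (b - a) / 4.0
                lo[j] = max(lower[j], x_it[j] - quarter)
                hi[j] = min(upper[j], x_it[j] + quarter)
    return best_x, best_val
```

Because each step is only a random search, the same scheme extends to categorical parameters by sampling categories instead of intervals.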

[Figure: MBO run on a 1D example. Panels: y over x (legend: y, yhat; point types: init, prop, seq) and ei over x. Frames: Iter = 1, Gap = 2.0795e-01; Iter = 2, Gap = 5.5410e-02; Iter = 3, Gap = 5.5410e-02; Iter = 4, Gap = 2.2202e-05; Iter = 5, Gap = 2.2202e-05; Iter = 15, Gap = 9.0305e-06]

mlrMBO: Model-Based Optimization Toolbox
Any regression from mlr
Arbitrary infill
Single- or multi-crit
Multi-point proposal
Via parallelMap and batchtools runs on many parallel backends and clusters
Algorithm configuration
Active research
mlr:
mlrMBO:
mlrMBO Paper on arXiv (under review)

Benchmark MBO on artificial test functions
Comparison of mlrMBO on multiple different test functions
Multimodal
Smooth
Fully numeric
Well known
We use GPs with LCB with λ = 1
Focussearch
200 iterations
25 point initial design, created by LHS sampling
Comparison with
Random search
CMAES
other MBO implementations in R

MBO GP vs. competitors in 5D
[Figure: y per algorithm (mlrMBO, cmaesr, DiceOptim, rBayesOpt, Random), one panel per test function: Alpine01, DeflectedCorrugatedSpring, Schwefel, Ackley, Griewank, Rosenbrock]

Parallel batch proposals

Motivation for batch proposal
Function evaluations expensive
Often many cores available on a cluster
Underlying f can in many cases not be easily parallelized
Natural to consider batch proposal
Parallel MBO: suggest $q$ promising points to evaluate: $x^*_1, \ldots, x^*_q$
We need to balance exploration and exploitation
Non-trivial to construct infill criterion for this

Review of parallel MBO strategies
Constant liar: (Ginsbourger et al., 2010)
Fit kriging model based on real data and find $x^*_1$ according to EI-criterion.
"Guess" $f(x^*_{i-1})$, update the model and find $x^*_i$, $i = 2, \ldots, q$
Use $f_{\min}$ for "guessing"
q-LCB: (Hutter et al., 2012)
q times: sample $\lambda$ from Exp(1) and optimize single LCB criterion $x^* = \arg\min_{x \in \mathcal{X}} LCB(x) = \arg\min_{x \in \mathcal{X}} \hat f(x) - \lambda \hat s(x)$.
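The q-LCB proposal is easy to sketch once a surrogate is available. Below, `f_hat` and `s_hat` are hypothetical callables standing in for a fitted model, and the LCB is minimized over a finite candidate set instead of with a real optimizer.

```python
import random

def q_lcb_batch(candidates, f_hat, s_hat, q, seed=0):
    """Propose q points; each minimizes an LCB with its own lambda ~ Exp(1)."""
    rng = random.Random(seed)
    batch = []
    for _ in range(q):
        lam = rng.expovariate(1.0)       # small lambda exploits, large lambda explores
        x_star = min(candidates, key=lambda x: f_hat(x) - lam * s_hat(x))
        batch.append((lam, x_star))
    return batch

if __name__ == "__main__":
    # Toy surrogate (assumption): mean and uncertainty both grow away from x = 1.
    grid = [i / 100.0 for i in range(-300, 501)]
    for lam, x in q_lcb_batch(grid, lambda x: (x - 1.0) ** 2, lambda x: abs(x - 1.0), q=4):
        print(round(lam, 3), x)
```

Different draws of lambda spread the batch between points near the predicted optimum and points with high predicted uncertainty.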

Multiobjectivization
Multiobjectivization
Originates from multi-modal optimization
Add distance to neighbors for current set as artificial objective
Use multiobjective optimization
Select by hypervolume or first objective or . . .
Approach
Decouple $\hat f(x)$ and $\hat s(x)$ as objectives (instead of EI) to have different exploration / exploitation trade-offs
Consider distance measure as potential extra objective
Run multiobjective EA to select q well-performing, diverse points
Distance is a possible alternative if there is no or only a bad $\hat s(x)$ estimator
Decoupling $y(x)$, $\hat s(x)$ is a potential alternative when the EI derivation does not hold for other model classes
Bischl, Wessing et al.: MOI-MBO: Multiobjective infill for parallel model-based optimization, LION 2014

Multicriteria SMBO

Model-based multi-objective optimization
[Diagram: in $(x_1, x_2, \ldots, x_d)$ → Black-Box → out $(y_1, y_2)$]
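$\min_{x \in \mathcal{X}} f(x) = y = (y_1, \ldots, y_m)$ with $f : \mathbb{R}^d \to \mathbb{R}^m$ (3)
$y$ dominates $\tilde y$ if $\forall i \in \{1, \ldots, m\} : y_i \le \tilde y_i$ (4)
and $\exists i \in \{1, \ldots, m\} : y_i < \tilde y_i$ (5)
Set of non-dominated solutions: $\mathcal{X}^* := \{x \in \mathcal{X} \mid \nexists\, \tilde x \in \mathcal{X} : f(\tilde x) \text{ dominates } f(x)\}$
Pareto set $\mathcal{X}^*$, Pareto front $f(\mathcal{X}^*)$
Goal: Find $\hat{\mathcal{X}}^*$ of non-dominated points that estimates the true set $\mathcal{X}^*$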
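The dominance relation and the non-dominated set can be written down directly for a finite set of evaluated objective vectors (all objectives minimized). A small sketch with illustrative function names:

```python
def dominates(y, y_tilde):
    """True if y is no worse in every objective and strictly better in at least one."""
    return (all(a <= b for a, b in zip(y, y_tilde))
            and any(a < b for a, b in zip(y, y_tilde)))

def non_dominated(points):
    """Keep every point that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

if __name__ == "__main__":
    ys = [(1, 4), (2, 2), (3, 3), (4, 1)]
    print(non_dominated(ys))
```

This brute-force filter needs a quadratic number of comparisons, which is unproblematic for the few hundred evaluations typical of expensive black-box problems.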

Model-based multi-objective optimization
[Figure: objective space, y1 vs. y2 on [0, 1] × [0, 1], showing dominated points and the Pareto front]

ParEGO
1. Scalarize objectives using the augmented Tchebycheff norm $\max_{i=1,\ldots,m} [w_i f_i(x)] + \rho \sum_{i=1}^{m} w_i f_i(x)$ with uniformly distributed weight vector $w$ ($\sum_i w_i = 1$) and fit surrogate model to the respective scalarization.
2. Single-objective optimization of EI (or LCB?)
Batch proposal: Increase the number and diversity of randomly drawn weight vectors
If N points are desired, cN (c > 1) weight vectors are considered
Greedily reduce set of weight vectors by excluding one vector of the pair with minimum distance
Scalarizations implied by each weight vector are computed
Fit and optimize models for each scalarization
Optima of each model form the batch to be evaluated
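The scalarization and the greedy weight-vector reduction for batch proposals can be sketched as follows. The value `rho=0.05`, the oversampling factor `c=3`, and the choice of which vector of the closest pair to drop are assumptions of this sketch; weights are drawn uniformly from the simplex via normalized exponential variates.

```python
import random

def augmented_tchebycheff(y, w, rho=0.05):
    """Scalarize an objective vector y with weights w (rho is an illustrative default)."""
    weighted = [wi * yi for wi, yi in zip(w, y)]
    return max(weighted) + rho * sum(weighted)

def random_weights(m, rng):
    # Uniform draw from the simplex {w >= 0, sum(w) = 1}.
    e = [rng.expovariate(1.0) for _ in range(m)]
    total = sum(e)
    return [v / total for v in e]

def diverse_weights(n, m, c=3, seed=0):
    """Draw c*n weight vectors, then greedily thin them to n well-separated ones."""
    rng = random.Random(seed)
    ws = [random_weights(m, rng) for _ in range(c * n)]

    def sq_dist(pair):
        i, j = pair
        return sum((a - b) ** 2 for a, b in zip(ws[i], ws[j]))

    while len(ws) > n:
        pairs = ((i, j) for i in range(len(ws)) for j in range(i + 1, len(ws)))
        _, j = min(pairs, key=sq_dist)   # closest pair of weight vectors
        ws.pop(j)                        # drop one vector of that pair
    return ws
```

Each remaining weight vector then defines one scalarized single-objective problem, and the optimum of the infill criterion for each of them contributes one point to the batch.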

[Figure: Animation of ParEGO (10 frames). XSpace panel: ei over (x1, x2); YSpace panel: y_1 vs. y_2 with point types front, init, prop, seq]

Interesting Challenges

Challenge: The correct surrogate?
GPs are very much tailored to what we want to do, due to their spatial structure in the kernel and the uncertainty estimator.
But GPs are rather slow. And (fortunately) due to parallelization (or speed-up tricks like subsampling) we have more design points to train on.
Categorical features are also a problem in GPs (although methods exist, usually by changing the kernel)
Random Forests handle categorical features nicely and are much faster. But they don't rely on a spatial kernel, and the uncertainty estimation is much more heuristic / may not represent what we want.

Challenge: Time Heterogeneity
Complex configuration spaces across many algorithms result in vastly different runtimes across design points.
Actually, just the RBF-SVM tuning is enough to see this

ML Model Selection and Hyperparameter Optimization

Automatic Model Selection
Prior approaches:
Looking for the silver bullet model: Failure
Exhaustive benchmarking / search: Very expensive, often contradicting results
Meta-Learning: Good meta-features are hard to construct
IMHO: Gets more interesting when combined with SMBO
Goal for AutoML:
Data dependent
Automatic
Include every relevant modeling decision
Efficient
Learn on the model-settings level!

From Normal SMBO to Hyperparameter Tuning
Objective function is resampled performance measure
Parameter space $\theta \in \Theta$ might be discrete and dependent / hierarchical
No derivative for $f(\cdot, \theta)$, black-box
Objective is stochastic / noisy
Objective is expensive to evaluate
In general we face a problem of algorithm configuration
Usual approaches: racing or model-based / Bayesian optimization

From Normal SMBO to Hyperparameter Tuning
[Diagram with boxes: Black Box Optimizer; Data Set; Train / Test Data; Resampling; Learning Machine (Preprocessing with Feature Filter, Model Fit, Postprocessing); Features; Hyperparameters; Resampled Performance; Function. Edge labels: selected, defines]

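Complex Parameter Space
Parameter Set
  cl.weights: $2^{[-7, \ldots, 7)}$
  learner: randomForest | L2 LogReg | svm
    randomForest: mtry $\in \{0.1p, \ldots, 0.9p\}$, nodesize $\in \{1, \ldots, 0.5n\}$
    L2 LogReg: cost $\in 2^{[-15, 15]}$
    svm: cost $\in 2^{[-15, 15]}$, kernel: radial | linear
      radial: $\gamma \in 2^{[-15, 15]}$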
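Drawing a random configuration from such a hierarchical space means that child parameters are only sampled when their parent takes the matching value. The sketch below assumes the tree reads as: cl.weights and learner at the top; mtry and nodesize under randomForest; cost under L2 LogReg; cost and kernel under svm; gamma under the radial kernel. The function name and the arguments `p` (number of features) and `n` (number of observations) are illustrative.

```python
import random

def sample_config(p, n, rng):
    """Sample one configuration; dependent parameters exist only if their parent is active."""
    cfg = {
        "cl.weights": 2.0 ** rng.uniform(-7, 7),
        "learner": rng.choice(["randomForest", "L2 LogReg", "svm"]),
    }
    if cfg["learner"] == "randomForest":
        cfg["mtry"] = max(1, round(rng.uniform(0.1, 0.9) * p))
        cfg["nodesize"] = rng.randint(1, max(1, n // 2))
    elif cfg["learner"] == "L2 LogReg":
        cfg["cost"] = 2.0 ** rng.uniform(-15, 15)
    else:  # svm
        cfg["cost"] = 2.0 ** rng.uniform(-15, 15)
        cfg["kernel"] = rng.choice(["radial", "linear"])
        if cfg["kernel"] == "radial":
            cfg["gamma"] = 2.0 ** rng.uniform(-15, 15)
    return cfg

if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(3):
        print(sample_config(p=20, n=100, rng=rng))
```

A surrogate model has to cope with the resulting missing entries, for example by imputation or by using a learner such as a random forest.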

From Normal SMBO to Hyperparameter Tuning
Initial design: LHS principle can be extended, or just use random
Focus search: Can be (easily) extended, as it is based on random search. To zoom in for categorical parameters we randomly drop a category for each param which is not present in the currently best configuration.
Few approaches for GPs with categorical params exist (usually with new covar kernels), not very established
Alternative: Random regression forest (mlrMBO, SMAC)
Estimate uncertainty / confidence interval for mean response by efficient bootstrap technique [1], or jackknife, so we can define $EI(x)$ for the RF
Dependent params in mlrMBO: Imputation
Many of the current techniques to handle these problems are (from a theoretical standpoint) somewhat crude
[1] Sexton et al., "Standard errors for bagged and random forest estimators", 2009.

Hyperparameter Tuning
Still common practice: grid search
For an SVM it might look like:
$C \in (2^{-12}, 2^{-10}, 2^{-8}, \ldots, 2^{8}, 2^{10}, 2^{12})$
$\gamma \in (2^{-12}, 2^{-10}, 2^{-8}, \ldots, 2^{8}, 2^{10}, 2^{12})$
Evaluate all $13^2 = 169$ combinations $C \times \gamma$
Bad because:
optimum might be "off the grid"
lots of evaluations in bad areas
lots of costly evaluations
How bad?
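The size of this grid is easy to verify: exponents from -12 to 12 in steps of 2 give 13 values per parameter. A short check (variable names are illustrative):

```python
from itertools import product

# Exponents -12, -10, ..., 10, 12 for both C and gamma.
exponents = list(range(-12, 13, 2))
grid = [(2.0 ** c, 2.0 ** g) for c, g in product(exponents, repeat=2)]

if __name__ == "__main__":
    print(len(exponents), len(grid))
```

Every one of these configurations requires a full resampled model fit, regardless of how unpromising its region of the space is.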

Hyperparameter Tuning
Because of budget restrictions grid might even be smaller!
Unpromising area quite big!
Lots of costly evaluations!
With mlrMBO it is not hard to do it better!
More interesting applications to time-series regression and cost-sensitive classification [2]
[2] Koch, Bischl et al.: Tuning and evolution of support vector kernels, EI 2012


HPOlib
HPOlib is a set of standard benchmarks for hyperparameter optimizers
Allows comparison with
Spearmint
SMAC
Hyperopt (TPE)
Benchmarks:
Numeric test functions (similar to the ones we've seen before)
Numeric machine learning problems (lda, SVM, logistic regression)
Deep neural networks and deep belief networks with 15 and 35 parameters.
For benchmarks with discrete and dependent parameters (hpnnet, hpdbnet) a random forest with standard error estimation is used.

MBO: HPOlib
[Figure: result per optimizer (mlrMBO, smac, spearmint, TPE), one panel per benchmark: svm_on_grid, branin, michalewicz, camelback, hpnnet/nocv_convex, hpnnet/nocv_mrbi, lda_on_grid, logreg_on_grid, hpdbnet/convex, hpdbnet/mrbi, hpnnet/cv_convex, hpnnet/cv_mrbi]

Deep Learning Configuration Example
Dataset: CIFAR-10 (60000 32x32 images with 3 color channels; 10 classes)
Configuration of a deep neural network (mxnet)
Size of parameter set: 30, including number of hidden layers, activation functions, regularization, convolution layer setting, etc.
Split: 2/3 training set, 1/6 test set, 1/6 validation set
Time budget per tuning run: 4.5h (16200 sec)
Surrogate: Random forest
Acquisition: LCB with $\lambda = 2$


Thanks! Any comments or questions?