
# Applying Model-Based Optimization to Hyperparameter Optimization in Machine Learning

Sequential model-based optimization algorithms represent the state of the art for expensive black-box optimization problems and are becoming increasingly popular for hyperparameter optimization of machine learning algorithms, especially on larger datasets.
The talk will cover their main components, e.g., surrogate regression models like Gaussian processes or random forests, the initialization phase, and point acquisition.
In a second part, some recent extensions with regard to parallel point acquisition and multi-criteria optimization will be covered.
The talk will finish with a brief overview of open questions and challenges.

June 07, 2018

## Transcript

1. ### Applying Model-Based Optimization to Hyperparameter Optimization in Machine Learning

Janek Thomas, Computational Statistics, LMU Munich, Jun 7th, 2018
2. ### Outline

- Sequential model-based optimization
- Parallel batch proposals
- Multicriteria SMBO
- Interesting Challenges
- ML Model Selection and Hyperparameter Optimization

4. ### Expensive Black-Box Optimization

$$y = f(x), \quad f : \mathcal{X} \to \mathbb{R} \tag{1}$$

$$x^* = \arg\min_{x \in \mathcal{X}} f(x) \tag{2}$$

- $y$: target value
- $x \in \mathcal{X} \subset \mathbb{R}^d$: domain
- $f(x)$: function with considerably long runtime
- Goal: Find optimum $x^*$
5. ### Sequential model-based optimization

- Setting: Expensive black-box problem $f : x \to \mathbb{R} = \min!$
- Classical problem: Computer simulation with a bunch of control parameters and performance output; or algorithmic performance on 1 or more problem instances; we often optimize ML pipelines
- Idea: Let's approximate $f$ via regression!

Generic MBO Pseudo Code

- Create initial space-filling design and evaluate with $f$
- In each iteration:
  - Fit regression model on all evaluated points to predict $\hat{f}(x)$ and uncertainty $\hat{s}(x)$
  - Propose point via infill criterion: $EI(x) \uparrow \iff \hat{f}(x) \downarrow \wedge\ \hat{s}(x) \uparrow$
  - Evaluate proposed point and add to design
- EGO proposes kriging (aka Gaussian Process) and EI

Jones 1998, Efficient Global Opt. of Exp. Black-Box Functions
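The generic MBO loop on this slide can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the surrogate below is a crude stand-in (inverse-distance-weighted mean, distance to the nearest evaluated point as uncertainty), not kriging; the infill criterion is the lower confidence bound rather than EI; and the names `surrogate`, `mbo`, `lam`, and `n_cand` are hypothetical.

```python
import math
import random


def surrogate(design, x):
    """Crude stand-in surrogate (NOT kriging): inverse-distance-weighted mean,
    with the distance to the nearest evaluated point as uncertainty."""
    dists = [abs(x - xi) for xi, _ in design]
    nearest = min(dists)
    if nearest == 0.0:
        # x was already evaluated: return its observed value, no uncertainty
        return design[dists.index(0.0)][1], 0.0
    weights = [1.0 / d ** 2 for d in dists]
    mean = sum(w * yi for w, (_, yi) in zip(weights, design)) / sum(weights)
    return mean, nearest


def mbo(f, lower, upper, n_init=5, n_iter=15, lam=1.0, n_cand=500, seed=1):
    rng = random.Random(seed)
    # 1. initial design (evenly spaced here), evaluated with the expensive f
    xs = [lower + (i + 0.5) * (upper - lower) / n_init for i in range(n_init)]
    design = [(x, f(x)) for x in xs]
    for _ in range(n_iter):
        # 2. the surrogate is "refit" on all evaluated points at each call
        def lcb(x):
            f_hat, s_hat = surrogate(design, x)
            return f_hat - lam * s_hat

        # 3. propose the random candidate that minimizes the infill criterion
        cands = [rng.uniform(lower, upper) for _ in range(n_cand)]
        x_new = min(cands, key=lcb)
        # 4. evaluate the proposed point and add it to the design
        design.append((x_new, f(x_new)))
    return min(design, key=lambda p: p[1]), design


best, design = mbo(lambda x: math.sin(x) + 0.1 * x, 0.0, 12.0)
print(best)
```

Swapping in a proper regression model only changes `surrogate`; the loop structure stays the same.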
6. ### Latin Hypercube Designs

- Initial design to train first regression model
- Not too small, not too large
- LHS / maximin designs: Min dist between points is maximized
- But: Type of design usually does not have the largest effect on MBO, and unequal distances between points could even be beneficial
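A plain Latin hypercube sample is short enough to write out. The sketch below assumes the unit cube as domain and does not include the maximin step (one could draw several such designs and keep the one with the largest minimum pairwise distance); the function name `latin_hypercube` is hypothetical.

```python
import random


def latin_hypercube(n, d, seed=0):
    """Return n points in [0, 1)^d with exactly one point per stratum and dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(d):
        # one uniform draw inside each of the n equal-width strata
        col = [(i + rng.random()) / n for i in range(n)]
        rng.shuffle(col)  # decouple the strata across dimensions
        columns.append(col)
    return [tuple(col[i] for col in columns) for i in range(n)]


design = latin_hypercube(n=10, d=2)
for point in design:
    print(point)
```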
7. ### Kriging and local uncertainty prediction

Model: Zero-mean GP $Y(x)$ with const. trend and cov. kernel $k_\theta(x_1, x_2)$.

$$y = (y_1, \dots, y_n)^T, \quad K = (k(x_i, x_j))_{i,j=1,\dots,n}$$

$$k_*(x) = (k(x_1, x), \dots, k(x_n, x))^T$$

$$\hat{\mu} = \frac{\mathbf{1}^T K^{-1} y}{\mathbf{1}^T K^{-1} \mathbf{1}} \quad \text{(BLUE)}$$

Prediction:

$$\hat{f}(x) = E[Y(x) \mid Y(x_i) = y_i,\ i = 1, \dots, n] = \hat{\mu} + k_*(x)^T K^{-1} (y - \hat{\mu} \mathbf{1})$$

Uncertainty:

$$\hat{s}^2(x) = \mathrm{Var}[Y(x) \mid Y(x_i) = y_i,\ i = 1, \dots, n] = \sigma^2 - k_*(x)^T K^{-1} k_*(x) + \frac{(1 - \mathbf{1}^T K^{-1} k_*(x))^2}{\mathbf{1}^T K^{-1} \mathbf{1}}$$
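To make the kriging prediction and uncertainty formulas concrete, here is a minimal pure-Python sketch for one-dimensional inputs. It assumes a squared-exponential kernel with unit variance ($\sigma^2 = 1$) and a fixed length scale, and it solves the linear systems by Gaussian elimination; the names `kernel`, `solve`, and `krige` are hypothetical, and no kernel parameters are estimated.

```python
import math


def kernel(a, b, length=1.0):
    # squared-exponential kernel with unit variance (sigma^2 = 1); an assumption
    return math.exp(-((a - b) ** 2) / (2.0 * length ** 2))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z


def krige(xs, ys, x):
    """Return (f_hat, s2_hat) at x for 1-D inputs xs with observations ys."""
    K = [[kernel(a, b) for b in xs] for a in xs]
    k = [kernel(a, x) for a in xs]
    Kinv_y = solve(K, ys)
    Kinv_1 = solve(K, [1.0] * len(xs))
    Kinv_k = solve(K, k)
    mu = sum(Kinv_y) / sum(Kinv_1)  # 1'K^-1 y / 1'K^-1 1
    # mu + k' K^-1 (y - mu 1), using the symmetry of K
    f_hat = mu + sum(ki * (a - mu * b) for ki, a, b in zip(k, Kinv_y, Kinv_1))
    # sigma^2 - k' K^-1 k + (1 - 1'K^-1 k)^2 / (1'K^-1 1), with sigma^2 = 1
    s2 = (1.0 - sum(ki * a for ki, a in zip(k, Kinv_k))
          + (1.0 - sum(Kinv_k)) ** 2 / sum(Kinv_1))
    return f_hat, max(s2, 0.0)


xs, ys = [0.0, 1.5, 3.0, 4.5], [1.0, -0.5, 0.3, 2.0]
print(krige(xs, ys, 1.5))  # at a design point
print(krige(xs, ys, 2.2))  # between design points
```

At a design point the model interpolates the observation and the predicted variance collapses to zero, which is what lets the infill criterion steer away from already evaluated locations.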
8. ### Kriging / GP is a spatial model

- Correlation between outcomes $(y_1, y_2)$ depends on dist of $x_1, x_2$
- E.g. Gaussian covar kernel $k(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|}{2\sigma}\right)$
- Useful smoothness assumption for optimization
- Posterior uncertainty at new $x$ increases with dist to design points
- Allows to enforce exploration
9. ### Infill Criteria: Expected Improvement

- Define improvement at $x$ over best visited point with $y = f_{min}$ as random variable $I(x) = |f_{min} - Y(x)|^+$
- For kriging $Y(x) \sim N(\hat{f}(x), \hat{s}^2(x))$ (given $x = x$)
- Now define $EI(x) = E[I(x) \mid x = x]$
- Expectation is integral over normal density starting at $f_{min}$
- Alternative: Lower confidence bound (LCB) $\hat{f}(x) - \lambda \hat{s}(x)$

Result:

$$EI(x) = \left(f_{min} - \hat{f}(x)\right) \Phi\left(\frac{f_{min} - \hat{f}(x)}{\hat{s}(x)}\right) + \hat{s}(x)\, \phi\left(\frac{f_{min} - \hat{f}(x)}{\hat{s}(x)}\right)$$
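The closed-form EI and the LCB alternative translate directly into code. The sketch below takes the surrogate's prediction and uncertainty at one point as plain numbers; the function names are hypothetical, and the zero-uncertainty branch is an added convention to avoid division by zero.

```python
import math


def expected_improvement(f_hat, s_hat, f_min):
    """Closed-form EI for minimization under Y(x) ~ N(f_hat, s_hat^2)."""
    if s_hat <= 0.0:
        # no uncertainty: the improvement is deterministic
        return max(f_min - f_hat, 0.0)
    z = (f_min - f_hat) / s_hat
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # phi(z)
    return (f_min - f_hat) * cdf + s_hat * pdf


def lower_confidence_bound(f_hat, s_hat, lam=1.0):
    return f_hat - lam * s_hat


# Same predicted mean, different uncertainty: EI rewards the uncertain point.
print(expected_improvement(f_hat=0.5, s_hat=1.0, f_min=0.0))
print(expected_improvement(f_hat=0.5, s_hat=2.0, f_min=0.0))
```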
10. ### Focussearch

- EI optimization is multimodal and not that simple
- But objective is now cheap to evaluate
- Many different algorithms exist, from gradient-based methods with restarts to evolutionary algorithms
- We use an iterated, focusing random search coined "focus search"
- In each iteration a random search is performed
- We then shrink the constraints of the feasible region towards the best point in the current iteration (focusing) and iterate, to enforce local convergence
- Whole process is restarted a few times
- Works also for categorical and hierarchical params
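A numeric-only version of the focus search described on this slide might look as follows. This is a sketch, not the mlrMBO implementation: halving each side of the box per iteration is an assumed shrink rule, the defaults are arbitrary, and the name `focus_search` is hypothetical.

```python
import random


def focus_search(g, bounds, n_restarts=3, n_iters=5, n_points=100, seed=0):
    """Minimize a cheap function g over a box; bounds is a list of (low, high)."""
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(n_restarts):
        box = list(bounds)  # every restart begins with the full box
        for _ in range(n_iters):
            # plain random search inside the current box
            pts = [tuple(rng.uniform(lo, hi) for lo, hi in box)
                   for _ in range(n_points)]
            x_it = min(pts, key=g)
            if g(x_it) < best_val:
                best_x, best_val = x_it, g(x_it)
            # focus: halve each side length, centered on the iteration's best
            # point, and clip to the original bounds
            box = [
                (max(lo0, c - (hi - lo) / 4.0), min(hi0, c + (hi - lo) / 4.0))
                for c, (lo, hi), (lo0, hi0) in zip(x_it, box, bounds)
            ]
    return best_x, best_val


best_x, best_val = focus_search(
    lambda p: (p[0] - 0.3) ** 2 + (p[1] + 0.2) ** 2,
    [(-1.0, 1.0), (-1.0, 1.0)],
)
print(best_x, best_val)
```

In MBO the function `g` would be the infill criterion (for example negative EI), which is cheap to evaluate, so spending a few thousand evaluations on it is unproblematic.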
11. ### Iter = 1, Gap = 2.0795e-01

[Figure: y and yhat plotted over x with point types init and prop, and ei plotted over x.]

12. ### Iter = 2, Gap = 5.5410e-02

[Figure: y and yhat plotted over x with point types init, prop, and seq, and ei plotted over x.]

13. ### Iter = 3, Gap = 5.5410e-02

[Figure: y and yhat plotted over x with point types init, prop, and seq, and ei plotted over x.]

14. ### Iter = 4, Gap = 2.2202e-05

[Figure: y and yhat plotted over x with point types init, prop, and seq, and ei plotted over x.]

15. ### Iter = 5, Gap = 2.2202e-05

[Figure: y and yhat plotted over x with point types init, prop, and seq, and ei plotted over x.]

16. ### Iter = 15, Gap = 9.0305e-06

[Figure: y and yhat plotted over x with point types init, prop, and seq, and ei plotted over x.]
17. ### mlrMBO: Model-Based Optimization Toolbox

- Any regression from mlr
- Arbitrary infill
- Single- or multi-crit
- Multi-point proposal
- Via parallelMap and batchtools runs on many parallel backends and clusters
- Algorithm configuration
- Active research

[Figure: four panels (y, yhat, ei, se) at Iter 5 with point types init, prop, and seq; x-axis: x1, y-axis: x2.]

- mlr: https://github.com/mlr-org/mlr
- mlrMBO: https://github.com/mlr-org/mlrMBO
- mlrMBO Paper on arXiv (under review): https://arxiv.org/abs/1703.03373
18. ### Benchmark MBO on artificial test functions

- Comparison of mlrMBO on multiple different test functions
  - Multimodal
  - Smooth
  - Fully numeric
  - Well known
- We use GPs with LCB with λ = 1
- Focussearch
- 200 iterations
- 25 point initial design, created by LHS sampling
- Comparison with
  - Random search
  - CMAES
  - other MBO implementations in R
19. ### MBO GP vs. competitors in 5D

[Figure: y per algorithm (mlrMBO, cmaesr, DiceOptim, rBayesOpt, Random) on the test functions Alpine01, DeflectedCurragatedSpring, Schwefel, Ackley, Griewank, and Rosenbrock.]

21. ### Motivation for batch proposal

- Function evaluations expensive
- Often many cores available on a cluster
- Underlying $f$ can in many cases not be easily parallelized
- Natural to consider batch proposal
- Parallel MBO: suggest $q$ promising points to evaluate: $x^*_1, \dots, x^*_q$
- We need to balance exploration and exploitation
- Non-trivial to construct infill criterion for this
22. ### Review of parallel MBO strategies

Constant liar: (Ginsbourger et al., 2010)

- Fit kriging model based on real data and find $x^*_1$ according to EI-criterion.
- "Guess" $f(x^*_{i-1})$, update the model and find $x^*_i$, $i = 2, \dots, q$
- Use $f_{min}$ for "guessing"

q-LCB: (Hutter et al., 2012)

- $q$ times: sample $\lambda$ from $\mathrm{Exp}(1)$ and optimize single LCB criterion

$$x^* = \arg\min_{x \in \mathcal{X}} LCB(x) = \arg\min_{x \in \mathcal{X}} \hat{f}(x) - \lambda \hat{s}(x)$$
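The q-LCB strategy is easy to sketch once a surrogate is available. In the example below the surrogate is a toy function invented for illustration (`toy_predict`), the LCB is minimized over a fixed candidate grid instead of with a real optimizer, and the name `q_lcb` is hypothetical.

```python
import random


def q_lcb(predict, candidates, q, seed=0):
    """Propose q points; predict(x) must return (f_hat, s_hat)."""
    rng = random.Random(seed)
    batch = []
    for _ in range(q):
        lam = rng.expovariate(1.0)  # lambda ~ Exp(1): a new trade-off per point
        batch.append(
            min(candidates, key=lambda x: predict(x)[0] - lam * predict(x)[1])
        )
    return batch


# Toy surrogate (hypothetical): the mean is a parabola around 2 and the
# uncertainty grows with the distance from 0.
def toy_predict(x):
    return (x - 2.0) ** 2, abs(x)


grid = [i / 10.0 for i in range(-50, 51)]
batch = q_lcb(toy_predict, grid, q=4)
print(batch)
```

Small values of `lam` yield exploitative proposals near the surrogate minimum, large values push proposals towards uncertain regions. Nothing prevents two draws from proposing the same point, so a real implementation may want to deduplicate the batch.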
23. ### Multiobjectivization

Multiobjectivization

- Originates from multi-modal optimization
- Add distance to neighbors for current set as artificial objective
- Use multiobjective optimization
- Select by hypervolume or first objective or ...

Approach

- Decouple $\hat{f}(x)$ and $\hat{s}(x)$ as objectives (instead of EI) to have different exploration / exploitation trade-offs
- Consider distance measure as potential extra objective
- Run multiobjective EA to select $q$ well-performing, diverse points
- Distance is possible alternative if no or bad $\hat{s}(x)$ estimator
- Decoupling $y(x)$, $\hat{s}(x)$ potential alternative when EI derivation does not hold for other model classes

Bischl, Wessing et al.: MOI-MBO: Multiobjective infill for parallel model-based optimization, LION 2014

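25. ### Model-based multi-objective optimization

[Figure: black box with inputs $x_1, x_2, \dots, x_d$ and outputs $y_1, y_2$.]

$$\min_{x \in \mathcal{X}} f(x) = y = (y_1, \dots, y_m) \quad \text{with } f : \mathbb{R}^d \to \mathbb{R}^m \tag{3}$$

$$y \text{ dominates } \tilde{y} \text{ if } \forall i \in \{1, \dots, m\} : y_i \le \tilde{y}_i \tag{4}$$

$$\text{and } \exists i \in \{1, \dots, m\} : y_i < \tilde{y}_i \tag{5}$$

- Set of non-dominated solutions: $\mathcal{X}^* := \{x \in \mathcal{X} \mid \nexists\, \tilde{x} \in \mathcal{X} : f(\tilde{x}) \text{ dominates } f(x)\}$
- Pareto set $\mathcal{X}^*$, Pareto front $f(\mathcal{X}^*)$
- Goal: Find $\hat{\mathcal{X}}^*$ of non-dominated points that estimates the true set $\mathcal{X}^*$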
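26. ### Model-based multi-objective optimization

[Figure: dominated points and the Pareto front in objective space, axes y1 and y2.]

$$\min_{x \in \mathcal{X}} f(x) = y = (y_1, \dots, y_m) \quad \text{with } f : \mathbb{R}^n \to \mathbb{R}^m \tag{6}$$

$$y \text{ dominates } \tilde{y} \text{ if } \forall i \in \{1, \dots, m\} : y_i \le \tilde{y}_i \tag{7}$$

$$\text{and } \exists i \in \{1, \dots, m\} : y_i < \tilde{y}_i \tag{8}$$

- Set of non-dominated solutions: $\mathcal{X}^* := \{x \in \mathcal{X} \mid \nexists\, \tilde{x} \in \mathcal{X} : f(\tilde{x}) \text{ dominates } f(x)\}$
- Pareto set $\mathcal{X}^*$, Pareto front $f(\mathcal{X}^*)$
- Goal: Find $\hat{\mathcal{X}}^*$ of non-dominated points that estimates the true set $\mathcal{X}^*$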
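The dominance relation and the non-dominated set can be checked on a handful of points. The following sketch assumes minimization in every objective and works on already evaluated objective vectors; the function names and the sample points are made up for illustration.

```python
def dominates(y, y_other):
    """True if y dominates y_other (minimization in every objective)."""
    return (all(a <= b for a, b in zip(y, y_other))
            and any(a < b for a, b in zip(y, y_other)))


def non_dominated(points):
    """Return the points that no other point in the list dominates."""
    return [p for p in points if not any(dominates(o, p) for o in points)]


ys = [(0.1, 0.9), (0.4, 0.4), (0.5, 0.5), (0.9, 0.1), (0.9, 0.9)]
front = non_dominated(ys)
print(front)  # [(0.1, 0.9), (0.4, 0.4), (0.9, 0.1)]
```

Here `(0.5, 0.5)` and `(0.9, 0.9)` are both dominated by `(0.4, 0.4)`, while the three remaining points are mutually incomparable and form the estimated Pareto front.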
27. ### ParEGO

1. Scalarize objectives using the augmented Tchebycheff norm

   $$\max_{i=1,\dots,d} [w_i f_i(x)] + \rho \sum_{i=1}^{d} w_i f_i(x)$$

   with uniformly distributed weight vector $w$ ($\sum w_i = 1$) and fit surrogate model to the respective scalarization.
2. Single-objective optimization of EI (or LCB?)

Batch proposal: Increase the number and diversity of randomly drawn weight vectors

- If $N$ points are desired, $cN$ ($c > 1$) weight vectors are considered
- Greedily reduce set of weight vectors by excluding one vector of the pair with minimum distance
- Scalarizations implied by each weight vector are computed
- Fit and optimize models for each scalarization
- Optima of each model build the batch to be evaluated
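The scalarization step of ParEGO is a one-liner once a weight vector is drawn. The sketch below assumes $\rho = 0.05$ as an illustrative default, draws weights uniformly from the simplex by normalizing exponential draws, and skips the normalization of the objectives; all function names are hypothetical.

```python
import random


def random_weights(m, rng):
    """Draw a weight vector uniformly from the simplex (non-negative, sums to 1)."""
    draws = [rng.expovariate(1.0) for _ in range(m)]
    total = sum(draws)
    return [d / total for d in draws]


def augmented_tchebycheff(fs, w, rho=0.05):
    """max_i w_i f_i + rho * sum_i w_i f_i for a vector fs of objective values."""
    weighted = [wi * fi for wi, fi in zip(w, fs)]
    return max(weighted) + rho * sum(weighted)


rng = random.Random(0)
w = random_weights(2, rng)
print(w, augmented_tchebycheff([2.0, 4.0], w))
```

For the batch variant one would draw several weight vectors, scalarize the evaluated design once per vector, and fit one surrogate per scalarization.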
28. ### Animation of ParEGO

[Figure, animation frames on slides 28 to 37: the XSpace panel shows ei over x1 and x2; the YSpace panel shows y_1 against y_2 with point types front, init, prop, and (from the second frame on) seq.]

39. ### Challenge: The correct surrogate?

- GPs are very much tailored to what we want to do, due to their spatial structure in the kernel and the uncertainty estimator.
- But GPs are rather slow. And (fortunately) due to parallelization (or speed-up tricks like subsampling) we have more design points to train on.
- Categorical features are also a problem in GPs (although methods exist, usually by changing the kernel).
- Random Forests handle categorical features nicely and are much faster. But they don't rely on a spatial kernel and the uncertainty estimation is much more heuristic / may not represent what we want.
40. ### Challenge: Time Heterogeneity

- Complex configuration spaces across many algorithms result in vastly different runtimes in design points.
- Actually just the RBF-SVM tuning can result in very different runtimes.
- We don't care how many points we evaluate, we care about total walltime of the configuration.
- The option to subsample further complicates things.
- Parallelization further complicates things.
- Option: Estimate runtime as well with a surrogate, integrate it into acquisition function.

42. ### Automatic Model Selection

Prior approaches:

- Looking for the silver bullet model: Failure
- Exhaustive benchmarking / search: Very expensive, often contradicting results
- Meta-Learning: Good meta-features are hard to construct. IMHO: Gets more interesting when combined with SMBO

Goal for AutoML:

- Data dependent
- Automatic
- Include every relevant modeling decision
- Efficient
- Learn on the model-settings level!
43. ### From Normal SMBO to Hyperparameter Tuning

- Objective function is resampled performance measure
- Parameter space $\theta \in \Theta$ might be discrete and dependent / hierarchical
- No derivative for $f(\cdot, \theta)$, black-box
- Objective is stochastic / noisy
- Objective is expensive to evaluate
- In general we face a problem of algorithm configuration
- Usual approaches: racing or model-based / Bayesian optimization
44. ### From Normal SMBO to Hyperparameter Tuning

[Figure: flow diagram with the elements Black Box Optimizer, Data Set, Learning Machine (Preprocessing, Feature Filter, Model Fit, Postprocessing), Train / Test Data, Resampling, Features, Hyperparameters, Resampled Performance, and Function; the edges are labeled "Selected" and "defines".]
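45. ### Complex Parameter Space

[Figure: tree of the parameter set, reconstructed as a nested list.]

- Parameter Set
  - cl.weights: $2^{[-7, \dots, 7)}$
  - learner
    - randomForest
      - mtry: $\{0.1p, \dots, 0.9p\}$
      - nodesize: $\{1, \dots, 0.5n\}$
    - L2 LogReg
      - cost: $2^{[-15, 15]}$
    - svm
      - cost: $2^{[-15, 15]}$
      - kernel: radial, linear
        - γ: $2^{[-15, 15]}$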
46. ### From Normal SMBO to Hyperparameter Tuning

- Initial design: LHS principle can be extended, or just use random
- Focus search: Can be (easily) extended, as it is based on random search. To zoom in for categorical parameters we randomly drop a category for each param which is not present in the currently best configuration.
- Few approaches for GPs with categorical params exist (usually with new covar kernels), not very established
- Alternative: Random regression forest (mlrMBO, SMAC)
- Estimate uncertainty / confidence interval for mean response by efficient bootstrap technique¹, or jackknife, so we can define $EI(x)$ for the RF
- Dependent params in mlrMBO: Imputation
- Many of the current techniques to handle these problems are (from a theoretical standpoint) somewhat crude

¹ Sexton et al., "Standard errors for bagged and random forest estimators", 2009.
47. ### Hyperparameter Tuning

- Still common practice: grid search
- For an SVM it might look like:
  - $C \in (2^{-12}, 2^{-10}, 2^{-8}, \dots, 2^{8}, 2^{10}, 2^{12})$
  - $\gamma \in (2^{-12}, 2^{-10}, 2^{-8}, \dots, 2^{8}, 2^{10}, 2^{12})$
  - Evaluate all $13^2 = 169$ combinations $C \times \gamma$
- Bad because:
  - optimum might be "off the grid"
  - lots of evaluations in bad areas
  - lots of costly evaluations
- How bad?
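The size of this grid is quick to verify by building it. The snippet below only enumerates the grid from the slide; the variable names are arbitrary, and no SVM is trained.

```python
import itertools

exponents = range(-12, 13, 2)  # -12, -10, ..., 10, 12
cost_values = [2.0 ** e for e in exponents]
gamma_values = [2.0 ** e for e in exponents]
grid = list(itertools.product(cost_values, gamma_values))
print(len(cost_values), len(grid))  # 13 169
```

Each of the 169 grid points stands for one full resampled model fit, regardless of how unpromising its region of the parameter space is.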
48. ### Hyperparameter Tuning

- Because of budget restrictions grid might even be smaller!
- Unpromising area quite big!
- Lots of costly evaluations!
- With mlrMBO it is not hard to do it better!
- More interesting applications to time-series regression and cost-sensitive classification²

² Koch, Bischl et al.: Tuning and evolution of support vector kernels, EI 2012

51. ### HPOlib

- HPOlib is a set of standard benchmarks for hyperparameter optimizers
- Allows comparison with
  - Spearmint
  - SMAC
  - Hyperopt (TPE)
- Benchmarks:
  - Numeric test functions (similar to the ones we've seen before)
  - Numeric machine learning problems (lda, SVM, logistic regression)
  - Deep neural networks and deep belief networks with 15 and 35 parameters
- For benchmarks with discrete and dependent parameters (hpnnet, hpdbnet) a random forest with standard error estimation is used.
52. ### MBO: HPOlib

[Figure: result per optimizer (mlrMBO, smac, spearmint, TPE) on the benchmarks svm_on_grid, branin, michalewicz, camelback, hpnnet/nocv_convex, hpnnet/nocv_mrbi, lda_on_grid, logreg_on_grid, hpdbnet/convex, hpdbnet/mrbi, hpnnet/cv_convex, and hpnnet/cv_mrbi.]
53. ### Deep Learning Configuration Example

- Dataset: CIFAR-10 (60000 32x32 images with 3 color channels; 10 classes)
- Configuration of a deep neural network (mxnet)
- Size of parameter set: 30, including number of hidden layers, activation functions, regularization, convolution layer setting, etc.
- Split: 2/3 training set, 1/6 test set, 1/6 validation set
- Time budget per tuning run: 4.5h (16200 sec)
- Surrogate: Random forest
- Acquisition: LCB with λ = 2