Applying Model-Based Optimization to Hyperparameter Optimization in Machine Learning

Janek Thomas
Computational Statistics, LMU Munich
Jun 7th, 2018
1 / 40

1. Sequential model-based optimization
2. Parallel batch proposals
3. Multicriteria SMBO
4. Interesting Challenges
5. ML Model Selection and Hyperparameter Optimization
2 / 40

Section 1 Sequential model-based optimization 3 / 40

Expensive Black-Box Optimization
$$y = f(x), \quad f : X \to \mathbb{R} \tag{1}$$
$$x^* = \arg\min_{x \in X} f(x) \tag{2}$$
- $y$: target value
- $x \in X \subset \mathbb{R}^d$: domain
- $f(x)$: function with considerably long runtime
- Goal: Find optimum $x^*$
4 / 40

Sequential model-based optimization
- Setting: Expensive black-box problem $f : X \to \mathbb{R}$, $f(x) \to \min!$
- Classical problem: Computer simulation with a bunch of control parameters and performance output; or algorithmic performance on 1 or more problem instances; we often optimize ML pipelines
- Idea: Let's approximate $f$ via regression!
Generic MBO Pseudo Code
1. Create initial space-filling design and evaluate with $f$
2. In each iteration:
   - Fit regression model on all evaluated points to predict $\hat f(x)$ and uncertainty $\hat s(x)$
   - Propose point via infill criterion: $EI(x) \uparrow \iff \hat f(x) \downarrow \wedge\ \hat s(x) \uparrow$
   - Evaluate proposed point and add to design
EGO proposes kriging (aka Gaussian Process) and EI
Jones 1998, Efficient Global Optimization of Expensive Black-Box Functions
5 / 40
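
The generic MBO loop above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the objective `f` is a hypothetical cheap stand-in for an expensive function, the surrogate is a crude inverse-distance-weighting model (not kriging), the initial design is plain uniform sampling, and the infill criterion is a simple "mean minus uncertainty" score rather than EI. All function names are made up for this sketch.

```python
import math
import random

def f(x):
    """Hypothetical stand-in for an expensive black-box function on [0, 10]."""
    return math.sin(x) + 0.1 * (x - 5.0) ** 2

def surrogate(design, x):
    """Crude stand-in for kriging: inverse-distance-weighted mean and
    distance to the nearest design point as an uncertainty proxy."""
    dists = [abs(x - xi) for xi, _ in design]
    nearest = min(dists)
    if nearest < 1e-12:                      # x is (almost) a design point
        return design[dists.index(nearest)][1], 0.0
    weights = [1.0 / d ** 2 for d in dists]
    mean = sum(w * yi for w, (_, yi) in zip(weights, design)) / sum(weights)
    return mean, nearest

def infill(design, x, lam=1.0):
    """Lower is better: rewards low predicted mean and high uncertainty."""
    mean, s = surrogate(design, x)
    return mean - lam * s

def smbo(f, lower, upper, n_init=5, n_iter=20, n_cand=500, seed=1):
    rng = random.Random(seed)
    # 1. Initial design (uniform sampling here, for brevity) evaluated with f.
    design = [(x, f(x)) for x in (rng.uniform(lower, upper) for _ in range(n_init))]
    for _ in range(n_iter):
        # 2a. The surrogate is "refit" implicitly: it reads the whole design.
        # 2b. Propose the random candidate with the best infill value.
        cands = [rng.uniform(lower, upper) for _ in range(n_cand)]
        x_new = min(cands, key=lambda x: infill(design, x))
        # 2c. Evaluate the proposed point and add it to the design.
        design.append((x_new, f(x_new)))
    return min(design, key=lambda p: p[1])

best_x, best_y = smbo(f, 0.0, 10.0)
```

Swapping `surrogate` for a kriging model and `infill` for EI would give the EGO variant named on the slide; the loop structure itself stays the same.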

Latin Hypercube Designs
- Initial design to train first regression model
- Not too small, not too large
- LHS / maximin designs: Min dist between points is maximized
- But: The type of design usually does not have the largest effect on MBO, and unequal distances between points could even be beneficial
6 / 40
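
A Latin hypercube design with a simple maximin selection can be sketched as follows. The function names are hypothetical, the design lives in the unit cube, and picking the best of several random draws is just one easy way to approximate a maximin design; dedicated packages use more refined optimizers.

```python
import random

def latin_hypercube(n, d, seed=0):
    """Return n points in [0, 1)^d with exactly one point per stratum
    [k/n, (k+1)/n) in every dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(d):
        strata = list(range(n))
        rng.shuffle(strata)                  # random stratum order per dimension
        columns.append([(k + rng.random()) / n for k in strata])
    return [tuple(col[i] for col in columns) for i in range(n)]

def min_distance(points):
    """Smallest pairwise Euclidean distance (the maximin criterion)."""
    return min(
        sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
        for i, p in enumerate(points) for q in points[i + 1:]
    )

def maximin_lhs(n, d, n_tries=50, seed=0):
    """Keep the best of several random LHS draws according to maximin."""
    designs = [latin_hypercube(n, d, seed=seed + t) for t in range(n_tries)]
    return max(designs, key=min_distance)

design = maximin_lhs(n=10, d=2)
```

To use the design on a real domain, each coordinate would be rescaled from [0, 1) to the parameter's lower and upper bound.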

Kriging and local uncertainty prediction
Model: Zero-mean GP $Y(x)$ with const. trend and cov. kernel $k_\theta(x_1, x_2)$.
$$y = (y_1, \dots, y_n)^T, \quad K = (k(x_i, x_j))_{i,j=1,\dots,n}$$
$$k_*(x) = (k(x_1, x), \dots, k(x_n, x))^T$$
$$\hat\mu = \frac{1^T K^{-1} y}{1^T K^{-1} 1} \quad \text{(BLUE)}$$
Prediction:
$$\hat f(x) = E[Y(x) \mid Y(x_i) = y_i, i = 1, \dots, n] = \hat\mu + k_*(x)^T K^{-1} (y - \hat\mu 1)$$
Uncertainty:
$$\hat s^2(x) = Var[Y(x) \mid Y(x_i) = y_i, i = 1, \dots, n] = \sigma^2 - k_*(x)^T K^{-1} k_*(x) + \frac{(1 - 1^T K^{-1} k_*(x))^2}{1^T K^{-1} 1}$$
7 / 40
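
The prediction and uncertainty formulas above can be checked numerically with a small pure-Python sketch. It assumes 1D inputs, a Gaussian kernel with a fixed length scale (no kernel parameters are estimated), and unit process variance by default; `solve`, `dot` and `krige` are hypothetical helper names.

```python
import math

def kernel(a, b, length=1.0):
    """Assumed Gaussian kernel with a fixed length scale."""
    return math.exp(-((a - b) ** 2) / (2.0 * length ** 2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def krige(xs, ys, x, sigma2=1.0):
    """Kriging mean and variance at x, following the slide's formulas."""
    ones = [1.0] * len(xs)
    K = [[sigma2 * kernel(a, b) for b in xs] for a in xs]
    k_star = [sigma2 * kernel(a, x) for a in xs]
    K_inv_1 = solve(K, ones)
    K_inv_y = solve(K, ys)
    K_inv_k = solve(K, k_star)
    mu = dot(ones, K_inv_y) / dot(ones, K_inv_1)          # BLUE of the trend
    # mu + k*^T K^-1 (y - mu 1), expanded into two dot products
    mean = mu + dot(k_star, K_inv_y) - mu * dot(k_star, K_inv_1)
    var = (sigma2 - dot(k_star, K_inv_k)
           + (1.0 - dot(ones, K_inv_k)) ** 2 / dot(ones, K_inv_1))
    return mean, max(var, 0.0)                            # clip rounding noise

xs, ys = [0.0, 1.0, 2.5], [1.0, -0.5, 2.0]
mean_at_design, var_at_design = krige(xs, ys, 1.0)
mean_between, var_between = krige(xs, ys, 0.5)
```

At a design point the model interpolates the observed value with (numerically) zero variance, while the variance grows as `x` moves away from the design.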

Kriging / GP is a spatial model
- Correlation between outcomes $(y_1, y_2)$ depends on the distance between $x_1$ and $x_2$
- E.g. Gaussian covariance kernel $k(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|^2}{2\sigma^2}\right)$
- Useful smoothness assumption for optimization
- Posterior uncertainty at a new $x$ increases with the distance to the design points
- Allows to enforce exploration
8 / 40

Infill Criteria: Expected Improvement
- Define improvement at $x$ over best visited point with $y = f_{min}$ as random variable $I(x) = |f_{min} - Y(x)|^+$
- For kriging $Y(x) \sim N(\hat f(x), \hat s^2(x))$ (given $\mathbf{x} = x$)
- Now define $EI(x) = E[I(x) \mid \mathbf{x} = x]$
- Expectation is integral over normal density starting at $f_{min}$
- Alternative: Lower confidence bound (LCB) $\hat f(x) - \lambda \hat s(x)$
Result:
$$EI(x) = \left(f_{min} - \hat f(x)\right) \Phi\left(\frac{f_{min} - \hat f(x)}{\hat s(x)}\right) + \hat s(x)\, \phi\left(\frac{f_{min} - \hat f(x)}{\hat s(x)}\right)$$
9 / 40
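
The closed-form EI and the LCB alternative translate directly into code. The sketch below uses only the standard library; the function names are illustrative, and the zero-uncertainty branch is an added convention for numerical safety rather than part of the slide.

```python
import math

def norm_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(f_min, mean, sd):
    """Closed-form EI for minimization under Y(x) ~ N(mean, sd^2)."""
    if sd <= 0.0:
        return max(f_min - mean, 0.0)        # no uncertainty left at this point
    z = (f_min - mean) / sd
    return (f_min - mean) * norm_cdf(z) + sd * norm_pdf(z)

def lower_confidence_bound(mean, sd, lam=1.0):
    """LCB: smaller values are more promising."""
    return mean - lam * sd

# Same predicted mean, more uncertainty: EI goes up.
ei_certain = expected_improvement(f_min=0.0, mean=1.0, sd=1.0)
ei_uncertain = expected_improvement(f_min=0.0, mean=1.0, sd=2.0)
```

EI is maximized while LCB is minimized, so an optimizer that minimizes would be run on `-expected_improvement(...)`.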

Focussearch
- EI optimization is multimodal and not that simple
- But objective is now cheap to evaluate
- Many different algorithms exist, from gradient-based methods with restarts to evolutionary algorithms
- We use an iterated, focusing random search coined "focus search"
- In each iteration a random search is performed
- We then shrink the constraints of the feasible region towards the best point in the current iteration (focusing) and iterate, to enforce local convergence
- Whole process is restarted a few times
- Works also for categorical and hierarchical params
10 / 40
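
A numeric-only version of the focus search idea might look like the sketch below. It is not the mlrMBO implementation: halving the box around the best point of each iteration is an assumed shrinking rule, and all parameter names and defaults are made up for the example.

```python
import random

def focus_search(objective, lower, upper, n_restarts=3, n_focus=5,
                 n_points=200, seed=0):
    """Minimize a cheap objective over a box by iterated, shrinking random search."""
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(n_restarts):              # restart the whole process a few times
        lo, hi = list(lower), list(upper)
        for _ in range(n_focus):
            # Random search inside the current box.
            cands = [[rng.uniform(a, b) for a, b in zip(lo, hi)]
                     for _ in range(n_points)]
            local_best = min(cands, key=objective)
            local_val = objective(local_best)
            if local_val < best_val:
                best_x, best_val = local_best, local_val
            # Focus: halve every side, centered on the local best,
            # without leaving the current box.
            for j in range(len(lo)):
                quarter = (hi[j] - lo[j]) / 4.0
                new_lo = max(lo[j], local_best[j] - quarter)
                new_hi = min(hi[j], local_best[j] + quarter)
                lo[j], hi[j] = new_lo, new_hi
    return best_x, best_val

def toy_criterion(x):
    """Hypothetical cheap criterion with its minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

x_opt, val_opt = focus_search(toy_criterion, [-5.0, -5.0], [5.0, 5.0])
```

In an MBO run the objective passed in would be the (negated) infill criterion evaluated on the surrogate, which is cheap, so several thousand evaluations per proposal are affordable.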

[Figure: animation of SMBO on a 1D example. Upper panel: y and yhat over x, with points marked init, prop or seq. Lower panel: ei over x.]
Iter = 1, Gap = 2.0795e-01
Iter = 2, Gap = 5.5410e-02
Iter = 3, Gap = 5.5410e-02
Iter = 4, Gap = 2.2202e-05
Iter = 5, Gap = 2.2202e-05
Iter = 15, Gap = 9.0305e-06
11 / 40

mlrMBO: Model-Based Optimization Toolbox
- Any regression from mlr
- Arbitrary infill
- Single- or multi-crit
- Multi-point proposal
- Via parallelMap and batchtools runs on many parallel backends and clusters
- Algorithm configuration
- Active research
[Figure: four panels (y, yhat, ei, se) at Iter 5; x-axis: x1, y-axis: x2; points marked init, prop or seq.]
mlr: https://github.com/mlr-org/mlr
mlrMBO: https://github.com/mlr-org/mlrMBO
mlrMBO Paper on arXiv (under review): https://arxiv.org/abs/1703.03373
12 / 40

Benchmark MBO on artificial test functions
Comparison of mlrMBO on multiple different test functions
- Multimodal
- Smooth
- Fully numeric
- Well known
We use GPs with
- LCB with λ = 1
- Focussearch
- 200 iterations
- 25 point initial design, created by LHS sampling
Comparison with
- Random search
- CMAES
- other MBO implementations in R
13 / 40

MBO GP vs. competitors in 5D
[Figure: one panel per test function (Alpine01, DeflectedCorrugatedSpring, Schwefel, Ackley, Griewank, Rosenbrock); x-axis: algorithm (mlrMBO, cmaesr, DiceOptim, rBayesOpt, Random); y-axis: y.]
14 / 40

Section 2 Parallel batch proposals 15 / 40

Motivation for batch proposal
- Function evaluations expensive
- Often many cores available on a cluster
- Underlying f can in many cases not be easily parallelized
- Natural to consider batch proposal
- Parallel MBO: suggest q promising points to evaluate: $x^*_1, \dots, x^*_q$
- We need to balance exploration and exploitation
- Non-trivial to construct infill criterion for this
16 / 40

Review of parallel MBO strategies
Constant liar: (Ginsbourger et al., 2010)
- Fit kriging model based on real data and find $x^*_1$ according to EI-criterion.
- "Guess" $f(x^*_{i-1})$, update the model and find $x^*_i$, $i = 2, \dots, q$
- Use $f_{min}$ for "guessing"
q-LCB: (Hutter et al., 2012)
- q times: sample λ from Exp(1) and optimize single LCB criterion
$$x^* = \arg\min_{x \in X} LCB(x) = \arg\min_{x \in X} \hat f(x) - \lambda \hat s(x)$$
17 / 40
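
The q-LCB rule is short enough to sketch directly. The example assumes the surrogate is available as two callables `mean` and `sd` (hypothetical toy functions here, not a fitted model) and that the criterion is optimized over a finite candidate set instead of with a proper optimizer.

```python
import random

def q_lcb(mean, sd, candidates, q, seed=0):
    """Propose q points; each one minimizes LCB with its own lambda ~ Exp(1)."""
    rng = random.Random(seed)
    batch = []
    for _ in range(q):
        lam = rng.expovariate(1.0)           # small lambda exploits, large explores
        x_best = min(candidates, key=lambda x: mean(x) - lam * sd(x))
        batch.append((lam, x_best))
    return batch

def toy_mean(x):
    """Hypothetical surrogate mean with its minimum at x = 2."""
    return (x - 2.0) ** 2

def toy_sd(x):
    """Hypothetical surrogate uncertainty, growing away from x = 2."""
    return abs(x - 2.0)

grid = [i / 10.0 for i in range(41)]         # candidates 0.0, 0.1, ..., 4.0
batch = q_lcb(toy_mean, toy_sd, grid, q=4, seed=1)
```

Because every proposal uses a different λ, the batch spreads between exploitation and exploration without refitting the model; constant liar, by contrast, refits the model after each "guessed" value.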

Multiobjectivization
- Originates from multi-modal optimization
- Add distance to neighbors for current set as artificial objective
- Use multiobjective optimization
- Select by hypervolume or first objective or ...
Approach
- Decouple $\hat f(x)$ and $\hat s(x)$ as objectives (instead of EI) to have different exploration / exploitation trade-offs
- Consider distance measure as potential extra objective
- Run multiobjective EA to select q well-performing, diverse points
- Distance is a possible alternative if there is no or only a bad $\hat s(x)$ estimator
- Decoupling $\hat f(x)$, $\hat s(x)$ is a potential alternative when the EI derivation does not hold for other model classes
Bischl, Wessing et al.: MOI-MBO: Multiobjective infill for parallel model-based optimization, LION 2014
18 / 40

Section 3 Multicriteria SMBO 19 / 40

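Model-based multi-objective optimization
[Diagram: black box with inputs $x_1, x_2, \dots, x_d$ and outputs $y_1, y_2$.]
$$\min_{x \in X} f(x) = y = (y_1, \dots, y_m) \quad \text{with } f : \mathbb{R}^d \to \mathbb{R}^m \tag{3}$$
$y$ dominates $\tilde y$ if
$$\forall i \in \{1, \dots, m\} : y_i \le \tilde y_i \tag{4}$$
$$\text{and } \exists i \in \{1, \dots, m\} : y_i < \tilde y_i \tag{5}$$
Set of non-dominated solutions:
$$X^* := \{x \in X \mid \nexists\, \tilde x \in X : f(\tilde x) \text{ dominates } f(x)\}$$
Pareto set $X^*$, Pareto front $f(X^*)$
Goal: Find $\hat X^*$ of non-dominated points that estimates the true set $X^*$
20 / 40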

Model-based multi-objective optimization
[Figure: objective space with axes y1 and y2 (both from 0 to 1), showing dominated points and the Pareto front. The problem statement (6), the dominance conditions (7) and (8), and the definitions of the Pareto set and Pareto front are repeated from the previous slide.]
20 / 40
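
The dominance relation and the non-dominated set are easy to state in code. The sketch below works on a finite list of objective vectors (minimization); the example points are invented for illustration.

```python
def dominates(y, y_tilde):
    """True if y is no worse in every objective and strictly better in one."""
    return (all(a <= b for a, b in zip(y, y_tilde))
            and any(a < b for a, b in zip(y, y_tilde)))

def non_dominated(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical objective vectors (y1, y2).
ys = [(0.1, 0.9), (0.4, 0.4), (0.5, 0.5), (0.9, 0.1), (0.4, 0.6)]
front = non_dominated(ys)
```

Here `(0.5, 0.5)` and `(0.4, 0.6)` are both dominated by `(0.4, 0.4)`, so the estimated front consists of the remaining three points. The pairwise check is quadratic in the number of points, which is fine for the small designs typical of expensive optimization.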

ParEGO
1. Scalarize objectives using the augmented Tchebycheff norm
   $$\max_{i=1,\dots,m} [w_i f_i(x)] + \rho \sum_{i=1}^{m} w_i f_i(x)$$
   with uniformly distributed weight vector $w$ ($\sum_i w_i = 1$) and fit surrogate model to the respective scalarization.
2. Single-objective optimization of EI (or LCB?)
Batch proposal: Increase the number and diversity of randomly drawn weight vectors
- If N points are desired, cN (c > 1) weight vectors are considered
- Greedily reduce set of weight vectors by excluding one vector of the pair with minimum distance
- Scalarizations implied by each weight vector are computed
- Fit and optimize models for each scalarization
- Optima of each model form the batch to be evaluated
21 / 40
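
The scalarization step can be illustrated as follows, writing m for the number of objectives. The weight sampler (normalized exponential draws, which gives a uniform distribution on the simplex) and the default `rho=0.05` are assumptions of this sketch, and the objective vectors are made up.

```python
import random

def random_weights(m, rng):
    """Draw a weight vector uniformly from the simplex (weights sum to 1)."""
    draws = [rng.expovariate(1.0) for _ in range(m)]
    total = sum(draws)
    return [v / total for v in draws]

def augmented_tchebycheff(ys, w, rho=0.05):
    """max_i [w_i * y_i] + rho * sum_i w_i * y_i for one objective vector."""
    weighted = [wi * yi for wi, yi in zip(w, ys)]
    return max(weighted) + rho * sum(weighted)

rng = random.Random(0)
w = random_weights(2, rng)

# Hypothetical evaluated objective vectors, scalarized with the same weights;
# a single-objective surrogate would then be fit to `scalarized`.
design_ys = [(0.2, 0.8), (0.5, 0.5), (0.9, 0.1)]
scalarized = [augmented_tchebycheff(y, w) for y in design_ys]
```

Drawing a fresh weight vector in every iteration moves the search to a different part of the Pareto front; for batch proposals, several weight vectors yield several scalarized problems at once.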

Animation of ParEGO
[Figure: six animation frames. Left panel (XSpace): ei over x1 and x2. Right panel (YSpace): y_1 vs. y_2, with points marked front, init, prop or seq.]
22 / 40

Section 4 Interesting Challenges 23 / 40

Challenge: The correct surrogate?
- GPs are very much tailored to what we want to do, due to their spatial structure in the kernel and the uncertainty estimator.
- But GPs are rather slow. And (fortunately) due to parallelization (or speed-up tricks like subsampling) we have more design points to train on.
- Categorical features are also a problem in GPs (although methods exist, usually by changing the kernel).
- Random Forests handle categorical features nicely and are much faster. But they don't rely on a spatial kernel, and the uncertainty estimation is much more heuristic / may not represent what we want.
24 / 40

Challenge: Time Heterogeneity
- Complex configuration spaces across many algorithms result in vastly different runtimes in design points.
- Actually, just the RBF-SVM tuning can result in very different runtimes.
- We don't care how many points we evaluate; we care about the total walltime of the configuration.
- The option to subsample further complicates things.
- Parallelization further complicates things.
- Option: Estimate runtime as well with a surrogate, integrate it into the acquisition function.
25 / 40

Section 5 ML Model Selection and Hyperparameter Optimization 26 / 40

Automatic Model Selection
Prior approaches:
- Looking for the silver bullet model: Failure
- Exhaustive benchmarking / search: Very expensive, often contradicting results
- Meta-Learning: Good meta-features are hard to construct. IMHO: Gets more interesting when combined with SMBO
Goal for AutoML:
- Data dependent
- Automatic
- Include every relevant modeling decision
- Efficient
- Learn on the model-settings level!
27 / 40

From Normal SMBO to Hyperparameter Tuning
- Objective function is resampled performance measure
- Parameter space $\theta \in \Theta$ might be discrete and dependent / hierarchical
- No derivative for $f(\cdot, \theta)$, black-box
- Objective is stochastic / noisy
- Objective is expensive to evaluate
- In general we face a problem of algorithm configuration
- Usual approaches: racing or model-based / Bayesian optimization
28 / 40

From Normal SMBO to Hyperparameter Tuning
[Diagram with the labels: Black Box Optimizer, Data Set, Learning Machine, Preprocessing, Model Fit, Postprocessing, Feature Filter, Train / Test Data, Resampling, Features, Hyperparameters, Resampled Performance, Function, Selected Features; edge label: defines.]
29 / 40

Complex Parameter Space Parameter Set cl.weights learner 2[−7,...,7) randomForest L2
LogReg svm mtry nodesize cost cost kernel radial linear γ {0.1p, ..., 0.9p} {1, ..., 0.5n} 2[−15,15] 2[−15,15] 2[−15,15] 30 / 40
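
Sampling from such a hierarchical space can be sketched as below. The tree structure is read off the slide; everything else is an assumption of this example: p is taken to be the number of features and n the number of observations, the function name is hypothetical, and uniform sampling on the exponent scale is just one reasonable choice.

```python
import random

def sample_config(n, p, rng):
    """Draw one configuration; child parameters exist only on the active branch."""
    config = {
        "cl.weights": 2.0 ** rng.uniform(-7, 7),
        "learner": rng.choice(["randomForest", "L2 LogReg", "svm"]),
    }
    if config["learner"] == "randomForest":
        config["mtry"] = rng.randint(max(1, round(0.1 * p)), max(1, round(0.9 * p)))
        config["nodesize"] = rng.randint(1, max(1, n // 2))
    elif config["learner"] == "L2 LogReg":
        config["cost"] = 2.0 ** rng.uniform(-15, 15)
    else:  # svm
        config["cost"] = 2.0 ** rng.uniform(-15, 15)
        config["kernel"] = rng.choice(["radial", "linear"])
        if config["kernel"] == "radial":
            config["gamma"] = 2.0 ** rng.uniform(-15, 15)   # only for the radial kernel
    return config

rng = random.Random(3)
configs = [sample_config(n=100, p=20, rng=rng) for _ in range(200)]
```

Inactive parameters are simply absent from the dictionary; a surrogate model that needs a fixed-length input would have to encode or impute them in some way.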

From Normal SMBO to Hyperparameter Tuning
- Initial design: LHS principle can be extended, or just use random
- Focus search: Can be (easily) extended, as it is based on random search. To zoom in for categorical parameters we randomly drop a category for each param which is not present in the currently best configuration.
- Few approaches for GPs with categorical params exist (usually with new covar kernels), not very established
- Alternative: Random regression forest (mlrMBO, SMAC)
- Estimate uncertainty / confidence interval for mean response by efficient bootstrap technique [1], or jackknife, so we can define EI(x) for the RF
- Dependent params in mlrMBO: Imputation
- Many of the current techniques to handle these problems are (from a theoretical standpoint) somewhat crude
[1] Sexton et al., "Standard errors for bagged and random forest estimators", 2009.
31 / 40

Hyperparameter Tuning
- Still common practice: grid search
- For an SVM it might look like:
  $C \in (2^{-12}, 2^{-10}, 2^{-8}, \dots, 2^{8}, 2^{10}, 2^{12})$
  $\gamma \in (2^{-12}, 2^{-10}, 2^{-8}, \dots, 2^{8}, 2^{10}, 2^{12})$
- Evaluate all $13^2 = 169$ combinations $C \times \gamma$
- Bad because:
  - optimum might be "off the grid"
  - lots of evaluations in bad areas
  - lots of costly evaluations
- How bad?
32 / 40
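
The size of this grid is quick to verify: the exponents run from -12 to 12 in steps of 2, which gives 13 values per parameter. The variable names below are illustrative.

```python
from itertools import product

exponents = range(-12, 13, 2)                 # -12, -10, ..., 10, 12
C_values = [2.0 ** e for e in exponents]
gamma_values = [2.0 ** e for e in exponents]

# Full Cartesian grid C x gamma: every pair costs one resampled SVM fit.
grid = list(product(C_values, gamma_values))
n_evaluations = len(grid)
```

Adding a third parameter with the same resolution would already multiply the cost by another factor of 13, which is why the grid is often coarsened under a fixed budget.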

Hyperparameter Tuning
- Because of budget restrictions grid might even be smaller!
- Unpromising area quite big!
- Lots of costly evaluations!
With mlrMBO it is not hard to do it better!
More interesting applications to time-series regression and cost-sensitive classification [2]
[2] Koch, Bischl et al.: Tuning and evolution of support vector kernels, EI 2012
33 / 40

Hyperparameter Tuning 34 / 40

Hyperparameter Tuning 35 / 40

HPOlib
- HPOlib is a set of standard benchmarks for hyperparameter optimizers
- Allows comparison with
  - Spearmint
  - SMAC
  - Hyperopt (TPE)
- Benchmarks:
  - Numeric test functions (similar to the ones we've seen before)
  - Numeric machine learning problems (lda, SVM, logistic regression)
  - Deep neural networks and deep belief networks with 15 and 35 parameters
- For benchmarks with discrete and dependent parameters (hpnnet, hpdbnet) a random forest with standard error estimation is used.
36 / 40

MBO: HPOlib
[Figure: one panel per benchmark (svm_on_grid, branin, michalewicz, camelback, hpnnet/nocv_convex, hpnnet/nocv_mrbi, lda_on_grid, logreg_on_grid, hpdbnet/convex, hpdbnet/mrbi, hpnnet/cv_convex, hpnnet/cv_mrbi), comparing mlrMBO, smac, spearmint and TPE; value axis: result.]
37 / 40

Deep Learning Configuration Example
- Dataset: CIFAR-10 (60000 32x32 images with 3 color channels; 10 classes)
- Configuration of a deep neural network (mxnet)
- Size of parameter set: 30, including number of hidden layers, activation functions, regularization, convolution layer setting, etc.
- Split: 2/3 training set, 1/6 test set, 1/6 validation set
- Time budget per tuning run: 4.5h (16200 sec)
- Surrogate: Random forest
- Acquisition: LCB with λ = 2
38 / 40

Deep Learning Configuration Example 39 / 40

Thanks! Any comments or questions? 40 / 40
