Slide 1

Applying Model-Based Optimization to Hyperparameter Optimization in Machine Learning
Janek Thomas
Computational Statistics, LMU Munich
Jun 7th, 2018

Slide 2

Outline:
- Sequential model-based optimization
- Parallel batch proposals
- Multicriteria SMBO
- Interesting Challenges
- ML Model Selection and Hyperparameter Optimization

Slide 3

Section 1: Sequential model-based optimization

Slide 4

Expensive Black-Box Optimization

$$y = f(x), \quad f: \mathcal{X} \to \mathbb{R} \tag{1}$$
$$x^* = \arg\min_{x \in \mathcal{X}} f(x) \tag{2}$$

- $y$: target value
- $x \in \mathcal{X} \subset \mathbb{R}^d$: domain
- $f(x)$: function with considerably long runtime

Goal: find the optimum $x^*$.

Slide 5

Sequential model-based optimization

Setting: expensive black-box problem $f: \mathcal{X} \to \mathbb{R}$, to be minimized.

Classical problem: a computer simulation with a number of control parameters and a performance output, or algorithmic performance on one or more problem instances; we often optimize ML pipelines.

Idea: let's approximate $f$ via regression!

Generic MBO pseudocode (see the sketch below):
- Create an initial space-filling design and evaluate it with $f$
- In each iteration:
  - Fit a regression model on all evaluated points to predict $\hat{f}(x)$ and the uncertainty $\hat{s}(x)$
  - Propose a point via an infill criterion: $EI(x)$ is high where $\hat{f}(x)$ is low and/or $\hat{s}(x)$ is high
  - Evaluate the proposed point and add it to the design

EGO proposes kriging (aka a Gaussian process) and EI (Jones et al. 1998, Efficient Global Optimization of Expensive Black-Box Functions).
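Below is a minimal, self-contained Python sketch of this loop (an illustration only, not mlrMBO, which is an R package): a GP surrogate from scikit-learn, expected improvement as infill criterion, and plain random search as the cheap infill optimizer. The test function and all settings are invented for the example.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):  # stand-in for the expensive black box
    return np.sin(3 * x[0]) + 0.5 * x[0] ** 2

rng = np.random.default_rng(1)
lower, upper, d = -2.0, 2.0, 1
X = rng.uniform(lower, upper, size=(8, d))   # initial design (LHS in practice)
y = np.array([f(x) for x in X])

for _ in range(20):
    # fit the surrogate on all evaluated points
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    # cheap infill optimization by random search (focus search in mlrMBO)
    cand = rng.uniform(lower, upper, size=(1000, d))
    mu, s = gp.predict(cand, return_std=True)
    imp = y.min() - mu
    z = imp / np.maximum(s, 1e-12)
    ei = imp * norm.cdf(z) + s * norm.pdf(z)  # expected improvement
    x_new = cand[np.argmax(ei)]               # propose the best-EI point
    X, y = np.vstack([X, x_new]), np.append(y, f(x_new))  # evaluate, extend design

print("best y:", y.min(), "at x:", X[np.argmin(y)])
```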

Slide 6

Latin Hypercube Designs
- The initial design used to train the first regression model
- Not too small, not too large
- LHS / maximin designs: the minimum distance between points is maximized
- But: the type of design usually does not have the largest effect on MBO, and unequal distances between points can even be beneficial
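A quick sketch of generating such a design with SciPy's QMC module (assumes SciPy >= 1.7; mlrMBO itself builds its designs in R, so this is only an illustration):

```python
from scipy.stats import qmc

d, n = 2, 25                                   # dimension and design size
sampler = qmc.LatinHypercube(d=d, seed=42)
unit_design = sampler.random(n=n)              # points in [0, 1)^d, one per stratum
design = qmc.scale(unit_design, l_bounds=[-5, 0], u_bounds=[10, 15])  # map to domain
print(qmc.discrepancy(unit_design))            # lower discrepancy = more space-filling
```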

Slide 7

Kriging and local uncertainty prediction

Model: GP $Y(x)$ with constant trend $\mu$ and covariance kernel $k_\theta(x_1, x_2)$.

$$\mathbf{y} = (y_1, \ldots, y_n)^T, \quad \mathbf{K} = (k(x_i, x_j))_{i,j=1,\ldots,n}, \quad \mathbf{k}_*(x) = (k(x_1, x), \ldots, k(x_n, x))^T$$

$$\hat{\mu} = \frac{\mathbf{1}^T \mathbf{K}^{-1} \mathbf{y}}{\mathbf{1}^T \mathbf{K}^{-1} \mathbf{1}} \quad \text{(BLUE)}$$

Prediction:
$$\hat{f}(x) = E[Y(x) \mid Y(x_i) = y_i,\ i = 1, \ldots, n] = \hat{\mu} + \mathbf{k}_*(x)^T \mathbf{K}^{-1} (\mathbf{y} - \hat{\mu}\mathbf{1})$$

Uncertainty:
$$\hat{s}^2(x) = \mathrm{Var}[Y(x) \mid Y(x_i) = y_i,\ i = 1, \ldots, n] = \sigma^2 - \mathbf{k}_*(x)^T \mathbf{K}^{-1} \mathbf{k}_*(x) + \frac{\left(1 - \mathbf{1}^T \mathbf{K}^{-1} \mathbf{k}_*(x)\right)^2}{\mathbf{1}^T \mathbf{K}^{-1} \mathbf{1}}$$
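For concreteness, a compact numpy transcription of these formulas (a didactic sketch only: no Cholesky factorization, nugget, or other numerical safeguards, and the kernel and data are invented for the example):

```python
import numpy as np

def kriging_predict(x, X, y, k, sigma2):
    """Ordinary-kriging mean and variance at x, given design X, observations y,
    covariance kernel k, and process variance sigma2 = k(x, x)."""
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    ks = np.array([k(xi, x) for xi in X])                 # k_*(x)
    Kinv, ones = np.linalg.inv(K), np.ones(n)
    mu_hat = ones @ Kinv @ y / (ones @ Kinv @ ones)       # BLUE of the trend
    f_hat = mu_hat + ks @ Kinv @ (y - mu_hat * ones)      # posterior mean
    s2_hat = (sigma2 - ks @ Kinv @ ks
              + (1 - ones @ Kinv @ ks) ** 2 / (ones @ Kinv @ ones))
    return f_hat, max(s2_hat, 0.0)

k = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2 * 0.5 ** 2))  # Gaussian kernel
X = np.array([[0.0], [0.5], [1.0]]); y = np.array([1.0, 0.2, 0.9])
print(kriging_predict(np.array([0.25]), X, y, k, sigma2=1.0))
```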

Slide 8

Kriging / GP is a spatial model
- The correlation between outcomes $(y_1, y_2)$ depends on the distance between $x_1$ and $x_2$
- E.g. the Gaussian covariance kernel $k(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|^2}{2\sigma^2}\right)$
- A useful smoothness assumption for optimization
- The posterior uncertainty at a new $x$ increases with its distance to the design points
- This allows us to enforce exploration

Slide 9

Infill Criteria: Expected Improvement

Define the improvement at $x$ over the best visited point $f_{\min}$ as a random variable:
$$I(x) = \max(f_{\min} - Y(x),\, 0)$$

For kriging, $Y(x) \sim N(\hat{f}(x), \hat{s}^2(x))$ (given $\mathbf{x} = x$). Now define
$$EI(x) = E[I(x) \mid \mathbf{x} = x]$$
The expectation is an integral over the normal density, starting at $f_{\min}$.

Result:
$$EI(x) = (f_{\min} - \hat{f}(x))\, \Phi\!\left(\frac{f_{\min} - \hat{f}(x)}{\hat{s}(x)}\right) + \hat{s}(x)\, \phi\!\left(\frac{f_{\min} - \hat{f}(x)}{\hat{s}(x)}\right)$$

Alternative: the lower confidence bound (LCB) $\hat{f}(x) - \lambda \hat{s}(x)$.
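The slide skips the intermediate step; it is short, so here it is (standard, with $u = (f_{\min} - \hat f(x))/\hat s(x)$):

```latex
\begin{align*}
EI(x) &= \int_{-\infty}^{f_{\min}} (f_{\min} - t)\,
         \frac{1}{\hat s(x)}\,\phi\!\left(\frac{t - \hat f(x)}{\hat s(x)}\right) dt
       = \int_{-\infty}^{u} \big(f_{\min} - \hat f(x) - \hat s(x)\, z\big)\, \phi(z)\, dz \\
      &= (f_{\min} - \hat f(x))\,\Phi(u) + \hat s(x)\,\phi(u),
         \qquad \text{using } \textstyle\int_{-\infty}^{u} z\,\phi(z)\, dz = -\phi(u).
\end{align*}
```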

Slide 10

Focus Search
- EI optimization is multimodal and not that simple
- But the objective is now cheap to evaluate
- Many different algorithms exist, from gradient-based methods with restarts to evolutionary algorithms
- We use an iterated, focusing random search coined "focus search" (sketched below):
  - In each iteration a random search is performed
  - We then shrink the bounds of the feasible region towards the best point of the current iteration (focusing) and iterate, to enforce local convergence
  - The whole process is restarted a few times
- It also works for categorical and hierarchical parameters
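A minimal numeric-only sketch of the idea (the restart, iteration, batch-size, and shrink settings are arbitrary choices for illustration, not mlrMBO's defaults, and categorical handling is omitted):

```python
import numpy as np

def focus_search(obj, lower, upper, restarts=3, iters=5, points=100, shrink=0.5, seed=0):
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    best_x, best_y = None, np.inf
    for _ in range(restarts):
        lo, hi = lower.copy(), upper.copy()
        for _ in range(iters):
            X = rng.uniform(lo, hi, size=(points, len(lo)))  # plain random search
            ys = np.apply_along_axis(obj, 1, X)
            i = np.argmin(ys)
            if ys[i] < best_y:
                best_x, best_y = X[i], ys[i]
            # focusing: shrink the box around this iteration's best point
            half = (hi - lo) * shrink / 2
            lo = np.clip(X[i] - half, lower, upper)
            hi = np.clip(X[i] + half, lower, upper)
    return best_x, best_y

# usage: minimize a cheap multimodal criterion such as negative EI
print(focus_search(lambda x: np.sin(5 * x[0]) + x[0] ** 2, [-2.0], [2.0]))
```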

Slide 11

[Figure: MBO animation, iteration 1: true function y and surrogate ŷ over x (top), EI criterion (bottom); initial and proposed points marked. Gap = 2.0795e-01]

Slides 12-16

[Figure: subsequent frames of the same animation, now also showing sequential points. The optimality gap shrinks over the iterations: Gap = 5.5410e-02 (iterations 2-3), 2.2202e-05 (iterations 4-5), and 9.0305e-06 (iteration 15)]

Slide 17

mlrMBO: Model-Based Optimization Toolbox
- Any regression model from mlr
- Arbitrary infill criteria
- Single- or multi-criteria optimization
- Multi-point proposal
- Runs on many parallel backends and clusters via parallelMap and batchtools
- Algorithm configuration
- Active research

[Figure: four-panel MBO state at iteration 5 (x-axis: x1, y-axis: x2): surfaces of y, ŷ, EI, and SE, with initial, sequential, and proposed points]

mlr: https://github.com/mlr-org/mlr
mlrMBO: https://github.com/mlr-org/mlrMBO
mlrMBO paper on arXiv (under review): https://arxiv.org/abs/1703.03373

Slide 18

Benchmark MBO on artificial test functions

Comparison of mlrMBO on multiple different test functions:
- Multimodal
- Smooth
- Fully numeric
- Well known

Setup: GPs with LCB (λ = 1), focus search, 200 iterations, 25-point initial design created by LHS sampling.

Comparison with:
- Random search
- CMA-ES
- Other MBO implementations in R

Slide 19

MBO GP vs. competitors in 5D

[Figure: boxplots of the final objective value y per optimizer (mlrMBO, cmaesr, DiceOptim, rBayesOpt, random search) on six 5D test functions: Alpine01, DeflectedCorrugatedSpring, Schwefel, Ackley, Griewank, Rosenbrock]

Slide 20

Section 2: Parallel batch proposals

Slide 21

Motivation for batch proposals
- Function evaluations are expensive
- Often many cores are available on a cluster
- The underlying f can in many cases not be easily parallelized
- So it is natural to consider batch proposals
- Parallel MBO: suggest q promising points $x_1^*, \ldots, x_q^*$ to evaluate
- We need to balance exploration and exploitation
- It is non-trivial to construct an infill criterion for this

Slide 22

Review of parallel MBO strategies

Constant liar (Ginsbourger et al., 2010):
- Fit the kriging model on the real data and find $x_1^*$ according to the EI criterion
- "Guess" $f(x_{i-1}^*)$, update the model and find $x_i^*$, for $i = 2, \ldots, q$
- Use $f_{\min}$ for "guessing"

q-LCB (Hutter et al., 2012):
- q times: sample $\lambda$ from Exp(1) and optimize a single LCB criterion
$$x^* = \arg\min_{x \in \mathcal{X}} LCB(x) = \arg\min_{x \in \mathcal{X}} \hat{f}(x) - \lambda \hat{s}(x)$$
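A small sketch of the constant-liar idea (illustrative Python, not mlrMBO's implementation; `propose` is an assumed helper that optimizes EI under a fitted surrogate, e.g. the random-search proposer from the MBO loop sketch above):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def constant_liar_batch(X, y, q, propose, rng):
    """Propose q points: after each proposal, pretend it was evaluated and
    returned f_min (the "lie"), refit, and propose again."""
    Xl, yl = X.copy(), y.copy()
    lie = yl.min()                          # the constant lie: f_min
    batch = []
    for _ in range(q):
        gp = GaussianProcessRegressor(normalize_y=True).fit(Xl, yl)
        x_star = propose(gp, rng)           # maximize EI under current surrogate
        batch.append(x_star)
        Xl = np.vstack([Xl, x_star])        # fake evaluation pushes the next
        yl = np.append(yl, lie)             # proposal away from this point
    return np.array(batch)
```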

Slide 23

Multiobjectivization
- Originates from multi-modal optimization
- Add the distance to the neighbors in the current set as an artificial objective
- Use multiobjective optimization
- Select by hypervolume, or by the first objective, or ...

Approach:
- Decouple $\hat{f}(x)$ and $\hat{s}(x)$ as separate objectives (instead of EI) to obtain different exploration / exploitation trade-offs
- Consider a distance measure as a potential extra objective
- Run a multiobjective EA to select q well-performing, diverse points
- Distance is a possible alternative if there is no, or only a bad, $\hat{s}(x)$ estimator
- Decoupling $\hat{f}(x)$ and $\hat{s}(x)$ is a potential alternative when the EI derivation does not hold for other model classes

Bischl, Wessing et al.: MOI-MBO: Multiobjective infill for parallel model-based optimization, LION 2014

Slide 24

Section 3: Multicriteria SMBO

Slide 25

Model-based multi-objective optimization

[Diagram: a black box maps inputs $x_1, x_2, \ldots, x_d$ to outputs $y_1, y_2$]

$$\min_{x \in \mathcal{X}} f(x) = y = (y_1, \ldots, y_m) \quad \text{with } f: \mathbb{R}^d \to \mathbb{R}^m \tag{3}$$

$y$ dominates $\tilde{y}$ if
$$\forall i \in \{1, \ldots, m\}: y_i \leq \tilde{y}_i \tag{4}$$
$$\text{and} \quad \exists i \in \{1, \ldots, m\}: y_i < \tilde{y}_i \tag{5}$$

Set of non-dominated solutions:
$$\mathcal{X}^* := \{x \in \mathcal{X} \mid \nexists\, \tilde{x} \in \mathcal{X}: f(\tilde{x}) \text{ dominates } f(x)\}$$

Pareto set $\mathcal{X}^*$, Pareto front $f(\mathcal{X}^*)$.

Goal: find a set $\hat{\mathcal{X}}^*$ of non-dominated points that estimates the true set $\mathcal{X}^*$.
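The dominance relation and the non-dominated filter, transcribed directly into Python (a tiny illustration; minimization in every objective):

```python
import numpy as np

def dominates(y, y_tilde):
    return np.all(y <= y_tilde) and np.any(y < y_tilde)

def nondominated(Y):
    """Rows of Y that no other row dominates (the estimated Pareto front)."""
    return np.array([y for i, y in enumerate(Y)
                     if not any(dominates(Y[j], y)
                                for j in range(len(Y)) if j != i)])

Y = np.array([[1.0, 4.0], [2.0, 2.0], [3.0, 3.0], [4.0, 1.0]])
print(nondominated(Y))  # keeps [1,4], [2,2], [4,1]; [3,3] is dominated by [2,2]
```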

Slide 26

[Figure: objective space $(y_1, y_2)$ with dominated points and the Pareto front; definitions as on the previous slide]

Slide 27

ParEGO

1. Scalarize the objectives using the augmented Tchebycheff norm
$$\max_{i=1,\ldots,m} \left[w_i f_i(x)\right] + \rho \sum_{i=1}^{m} w_i f_i(x)$$
with a uniformly distributed weight vector $w$ ($\sum_i w_i = 1$), and fit the surrogate model to the respective scalarization.
2. Single-objective optimization of EI (or LCB?)

Batch proposal: increase the number and diversity of the randomly drawn weight vectors:
- If N points are desired, cN (c > 1) weight vectors are considered
- Greedily reduce the set of weight vectors by excluding one vector of the pair with minimum distance
- The scalarizations implied by each weight vector are computed
- Fit and optimize a model for each scalarization
- The optima of the models build the batch to be evaluated
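A short sketch of the scalarization step (the value ρ = 0.05 follows Knowles' ParEGO paper; everything else here is illustrative):

```python
import numpy as np

def augmented_tchebycheff(Y, w, rho=0.05):
    """Scalarize an (n, m) matrix of objective values with weights w, sum(w) = 1."""
    WY = Y * w
    return WY.max(axis=1) + rho * WY.sum(axis=1)

rng = np.random.default_rng(0)
m = 2
w = rng.dirichlet(np.ones(m))        # uniformly distributed weights on the simplex
Y = rng.uniform(size=(5, m))         # objective values of the current design
print(augmented_tchebycheff(Y, w))   # single-objective targets for the surrogate
```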

Slide 28

Animation of ParEGO

[Figure: animation frame 1; left: X-space (x1, x2) with the EI landscape of the current scalarization and the design points; right: Y-space (y_1, y_2) with the current front, initial, and proposed points]

Slides 29-37

Animation of ParEGO (continued)

[Figure: nine further frames with the same layout; each iteration draws a new weight vector, so the EI landscape in X-space changes from frame to frame while sequential points accumulate along the front in Y-space]

Slide 38

Section 4: Interesting Challenges

Slide 39

Challenge: The correct surrogate?
- GPs are very much tailored to what we want, due to the spatial structure in their kernel and their uncertainty estimator
- But GPs are rather slow, and (fortunately) due to parallelization, or speed-up tricks like subsampling, we have more design points to train on
- Categorical features are also a problem for GPs (although methods exist, usually based on changing the kernel)
- Random forests handle categorical features nicely and are much faster; but they do not rely on a spatial kernel, and their uncertainty estimation is much more heuristic / may not represent what we want

Slide 40

Challenge: Time Heterogeneity
- Complex configuration spaces across many algorithms result in vastly different runtimes across design points
- Even plain RBF-SVM tuning can result in very different runtimes
- We don't care how many points we evaluate; we care about the total walltime of the configuration run
- The option to subsample further complicates things
- Parallelization complicates things further
- Option: estimate the runtime as well with a surrogate and integrate it into the acquisition function (e.g. expected improvement per second, Snoek et al. 2012)

Slide 41

Section 5: ML Model Selection and Hyperparameter Optimization

Slide 42

Automatic Model Selection

Prior approaches:
- Looking for the silver-bullet model: failure
- Exhaustive benchmarking / search: very expensive, often contradictory results
- Meta-learning: good meta-features are hard to construct; IMHO it gets more interesting when combined with SMBO

Goals for AutoML:
- Data dependent
- Automatic
- Include every relevant modeling decision
- Efficient

Learn on the model-settings level!

Slide 43

From Normal SMBO to Hyperparameter Tuning
- The objective function is a resampled performance measure
- The parameter space $\theta \in \Theta$ might be discrete and dependent / hierarchical
- No derivatives for $f(\cdot, \theta)$: a black box
- The objective is stochastic / noisy
- The objective is expensive to evaluate
- In general we face a problem of algorithm configuration
- Usual approaches: racing or model-based / Bayesian optimization

Slide 44

From Normal SMBO to Hyperparameter Tuning

[Diagram: the black-box optimizer proposes features and hyperparameters; these configure a learning machine (preprocessing, feature filter, model fit, postprocessing) that is resampled on train / test data; the resampled performance defines the objective function being optimized]

Slide 45

Complex Parameter Space

Example parameter set (hierarchical / dependent):
- cl.weights: $2^{[-7, \ldots, 7)}$
- learner: {randomForest, L2 LogReg, svm}
  - randomForest: mtry ∈ {0.1p, ..., 0.9p}, nodesize ∈ {1, ..., 0.5n}
  - L2 LogReg: cost ∈ $2^{[-15, 15]}$
  - svm: cost ∈ $2^{[-15, 15]}$, kernel ∈ {radial, linear}
    - radial: γ ∈ $2^{[-15, 15]}$

Slide 46

From Normal SMBO to Hyperparameter Tuning
- Initial design: the LHS principle can be extended, or just use random sampling
- Focus search can be (easily) extended, as it is based on random search: to zoom in for categorical parameters, we randomly drop a category for each parameter that is not present in the currently best configuration
- A few approaches for GPs with categorical parameters exist (usually with new covariance kernels), but they are not very established
- Alternative: a random regression forest (mlrMBO, SMAC)
- Estimate the uncertainty / confidence interval for the mean response by an efficient bootstrap technique [1] or the jackknife, so we can define EI(x) for the RF
- Dependent parameters in mlrMBO: imputation. Many of the current techniques for handling these problems are (from a theoretical standpoint) somewhat crude.

[1] Sexton et al., "Standard errors for bagged and random forest estimators", 2009.
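A rough Python sketch of an RF surrogate with a heuristic uncertainty estimate, namely the spread of the per-tree predictions; this is a crude stand-in for the bootstrap / jackknife estimators cited above, purely to show the shape of the idea:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(40, 2))             # toy design
y = np.sin(X[:, 0]) + X[:, 1] ** 2               # toy responses

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
X_new = rng.uniform(-2, 2, size=(5, 2))
per_tree = np.stack([t.predict(X_new) for t in rf.estimators_])  # (n_trees, n_new)
f_hat = per_tree.mean(axis=0)    # mean response, plays the role of f^(x)
s_hat = per_tree.std(axis=0)     # heuristic s^(x) plugged into EI or LCB
print(np.c_[f_hat, s_hat])
```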

Slide 47

Hyperparameter Tuning

Still common practice: grid search. For an SVM it might look like this:
- $C \in (2^{-12}, 2^{-10}, 2^{-8}, \ldots, 2^{8}, 2^{10}, 2^{12})$
- $\gamma \in (2^{-12}, 2^{-10}, 2^{-8}, \ldots, 2^{8}, 2^{10}, 2^{12})$
- Evaluate all $13^2 = 169$ combinations $C \times \gamma$

This is bad because:
- the optimum might be "off the grid"
- there are lots of evaluations in bad areas
- and lots of costly evaluations overall

How bad?
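The grid above, spelled out with scikit-learn (an illustration on a small built-in dataset; the point of the slide is that all 169 fits are paid for regardless of how unpromising the region is):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid = {"C": 2.0 ** np.arange(-12, 13, 2),       # 13 values
        "gamma": 2.0 ** np.arange(-12, 13, 2)}   # 13 values -> 169 combinations
X, y = load_breast_cancer(return_X_y=True)
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)  # 169 x 5 fits
print(search.best_params_, search.best_score_)
```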

Slide 48

Hyperparameter Tuning
- Because of budget restrictions the grid might be even smaller!
- The unpromising area is quite big!
- Lots of costly evaluations!
- With mlrMBO it is not hard to do better!
- More interesting applications to time-series regression and cost-sensitive classification [2]

[2] Koch, Bischl et al.: Tuning and evolution of support vector kernels, EI 2012

Slide 49

Hyperparameter Tuning

Slide 50

Hyperparameter Tuning

Slide 51

HPOlib
- HPOlib is a set of standard benchmarks for hyperparameter optimizers
- Allows comparison with: Spearmint, SMAC, Hyperopt (TPE)
- Benchmarks:
  - Numeric test functions (similar to the ones we have seen before)
  - Numeric machine learning problems (LDA, SVM, logistic regression)
  - Deep neural networks and deep belief networks with 15 and 35 parameters
- For the benchmarks with discrete and dependent parameters (hpnnet, hpdbnet), a random forest with standard-error estimation is used

Slide 52

MBO: HPOlib

[Figure: final results of mlrMBO, SMAC, Spearmint, and TPE on the HPOlib benchmarks: branin, camelback, michalewicz, svm_on_grid, lda_on_grid, logreg_on_grid, hpnnet/nocv_convex, hpnnet/nocv_mrbi, hpnnet/cv_convex, hpnnet/cv_mrbi, hpdbnet/convex, hpdbnet/mrbi; one boxplot panel per problem]

Slide 53

Deep Learning Configuration Example
- Dataset: CIFAR-10 (60000 32x32 images with 3 color channels; 10 classes)
- Configuration of a deep neural network (mxnet)
- Size of the parameter set: 30, including the number of hidden layers, activation functions, regularization, convolution layer settings, etc.
- Split: 2/3 training set, 1/6 test set, 1/6 validation set
- Time budget per tuning run: 4.5h (16200 sec)
- Surrogate: random forest
- Acquisition: LCB with λ = 2

Slide 54

Deep Learning Configuration Example

Slide 55

Thanks! Any comments or questions?