
Applying Model-Based Optimization to Hyperparameter Optimization in Machine Learning


Sequential model-based optimization algorithms represent the state of the art for expensive black-box optimization problems and are becoming increasingly popular for hyperparameter optimization of machine learning algorithms, especially on larger datasets.
The talk will cover the main components of sequential model-based optimization algorithms, e.g., surrogate regression models like Gaussian processes or random forests, initialization phase and point acquisition.
In a second part, some recent extensions with regard to parallel point acquisition and multi-criteria optimization will be covered.
The talk will finish with a brief overview of open questions and challenges.


MunichDataGeeks

June 07, 2018

Transcript

  1. Applying Model-Based Optimization to Hyperparameter Optimization in Machine Learning

    Janek Thomas, Computational Statistics, LMU Munich, Jun 7th, 2018
  2. Sections of the talk

    Sequential model-based optimization
    Parallel batch proposals
    Multicriteria SMBO
    Interesting Challenges
    ML Model Selection and Hyperparameter Optimization
  3. Section 1: Sequential model-based optimization

  4. Expensive Black-Box Optimization

    $y = f(x), \quad f: \mathcal{X} \to \mathbb{R}$ (1)
    $x^* = \arg\min_{x \in \mathcal{X}} f(x)$ (2)
    $y$: target value
    $x \in \mathcal{X} \subset \mathbb{R}^d$: domain
    $f(x)$: function with considerably long runtime
    Goal: Find the optimum $x^*$
  5. Sequential model-based optimization

    Setting: Expensive black-box problem $f: x \to \mathbb{R}$, to be minimized
    Classical problem: Computer simulation with a bunch of control parameters and a performance output; or algorithmic performance on one or more problem instances; we often optimize ML pipelines
    Idea: Let's approximate $f$ via regression!
    Generic MBO pseudo code:
      Create an initial space-filling design and evaluate it with $f$
      In each iteration:
        Fit a regression model on all evaluated points to predict $\hat{f}(x)$ and uncertainty $\hat{s}(x)$
        Propose a point via an infill criterion: $EI(x) \uparrow \iff \hat{f}(x) \downarrow \wedge \hat{s}(x) \uparrow$
        Evaluate the proposed point and add it to the design
    EGO proposes kriging (aka Gaussian process) and EI
    Jones 1998, Efficient Global Opt. of Exp. Black-Box Functions
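The generic MBO loop on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the surrogate `idw_surrogate` is a crude inverse-distance stand-in (not kriging), the infill criterion is a lower confidence bound minimized over random candidates, and all function names and defaults are hypothetical.

```python
import random


def idw_surrogate(design, x):
    """Crude stand-in for a regression model with uncertainty (an assumption,
    not kriging): inverse-distance-weighted mean of the observed y values,
    and the distance to the nearest design point as the uncertainty."""
    weights, nearest = [], float("inf")
    for xi, yi in design:
        dist = abs(x - xi)
        if dist == 0.0:
            return yi, 0.0  # exactly at an evaluated point: no uncertainty
        weights.append((1.0 / dist ** 2, yi))
        nearest = min(nearest, dist)
    total = sum(w for w, _ in weights)
    mean = sum(w * yi for w, yi in weights) / total
    return mean, nearest


def smbo(f, lower, upper, n_init=5, n_iter=20, lam=1.0, n_cand=500, seed=0):
    """Minimize a 1D function f on [lower, upper]; returns all (x, y) pairs."""
    rng = random.Random(seed)
    # Initial space-filling design: one random point per equal-width stratum.
    width = (upper - lower) / n_init
    design = []
    for i in range(n_init):
        x = lower + (i + rng.random()) * width
        design.append((x, f(x)))
    for _ in range(n_iter):
        # "Fit" the surrogate on all evaluated points and score candidates
        # with the infill criterion (here: LCB = mean - lam * uncertainty).
        def lcb(x):
            mean, sd = idw_surrogate(design, x)
            return mean - lam * sd

        candidates = [rng.uniform(lower, upper) for _ in range(n_cand)]
        x_new = min(candidates, key=lcb)
        # Evaluate the proposed point with the expensive f and add it.
        design.append((x_new, f(x_new)))
    return design


design = smbo(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
best_x, best_y = min(design, key=lambda p: p[1])
```

Swapping `idw_surrogate` for a real model and `lcb` for EI gives the EGO setup named on the slide; the loop structure stays the same.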
  6. Latin Hypercube Designs

    Initial design to train the first regression model
    Not too small, not too large
    LHS / maximin designs: the minimum distance between points is maximized
    But: the type of design usually does not have the largest effect on MBO, and unequal distances between points could even be beneficial
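A Latin hypercube design and a simple maximin selection can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: the function names are hypothetical, the design lives in the unit cube, and "maximin" is approximated by picking the best of several random LHS designs.

```python
import math
import random


def latin_hypercube(n, d, seed=0):
    """Return n points in [0, 1)^d with one point per stratum in every dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(d):
        # One random position inside each of the n equal-width strata ...
        col = [(i + rng.random()) / n for i in range(n)]
        # ... shuffled so that strata are paired randomly across dimensions.
        rng.shuffle(col)
        columns.append(col)
    return [tuple(col[i] for col in columns) for i in range(n)]


def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two points of the design."""
    return min(
        math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]
    )


def maximin_lhs(n, d, tries=50, seed=0):
    """Among `tries` random LHS designs, keep the one whose minimum
    pairwise distance is largest (a simple maximin heuristic)."""
    designs = [latin_hypercube(n, d, seed=seed + t) for t in range(tries)]
    return max(designs, key=min_pairwise_distance)


design = maximin_lhs(10, 3)
```

To use the design on a real parameter space, each coordinate would be rescaled from [0, 1) to the corresponding parameter range.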
  7. Kriging and local uncertainty prediction

    Model: Zero-mean GP $Y(x)$ with const. trend and cov. kernel $k_\theta(x_1, x_2)$.
    $y = (y_1, \ldots, y_n)^T$, $K = (k(x_i, x_j))_{i,j=1,\ldots,n}$
    $k_*(x) = (k(x_1, x), \ldots, k(x_n, x))^T$
    $\hat{\mu} = \mathbf{1}^T K^{-1} y / \mathbf{1}^T K^{-1} \mathbf{1}$ (BLUE)
    Prediction: $\hat{f}(x) = E[Y(x) \mid Y(x_i) = y_i, i = 1, \ldots, n] = \hat{\mu} + k_*(x)^T K^{-1} (y - \hat{\mu} \mathbf{1})$
    Uncertainty: $\hat{s}^2(x) = \mathrm{Var}[Y(x) \mid Y(x_i) = y_i, i = 1, \ldots, n] = \sigma^2 - k_*(x)^T K^{-1} k_*(x) + \frac{(1 - \mathbf{1}^T K^{-1} k_*(x))^2}{\mathbf{1}^T K^{-1} \mathbf{1}}$
  8. Kriging / GP is a spatial model

    Correlation between outcomes $(y_1, y_2)$ depends on the distance between $x_1$ and $x_2$
    E.g. Gaussian covar kernel $k(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|}{2\sigma}\right)$
    Useful smoothness assumption for optimization
    Posterior uncertainty at a new $x$ increases with the distance to the design points
    Allows to enforce exploration
  9. Infill Criteria: Expected Improvement

    Define the improvement at $x$ over the best visited point with $y = f_{\min}$ as a random variable $I(x) = |f_{\min} - Y(x)|^+$
    For kriging, $Y(x) \sim N(\hat{f}(x), \hat{s}^2(x))$ (given $\mathbf{x} = x$)
    Now define $EI(x) = E[I(x) \mid \mathbf{x} = x]$
    The expectation is an integral over the normal density starting at $f_{\min}$
    Result: $EI(x) = (f_{\min} - \hat{f}(x)) \, \Phi\left(\frac{f_{\min} - \hat{f}(x)}{\hat{s}(x)}\right) + \hat{s}(x) \, \phi\left(\frac{f_{\min} - \hat{f}(x)}{\hat{s}(x)}\right)$
    Alternative: Lower confidence bound (LCB) $\hat{f}(x) - \lambda \hat{s}(x)$
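The closed-form EI and the LCB alternative translate directly into code. A minimal sketch using only the standard library; the function names are hypothetical, and the zero-uncertainty branch is an added convention for points where the surrogate reports no uncertainty.

```python
import math


def normal_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(f_min, mean, sd):
    """EI for minimization, given the surrogate prediction `mean`,
    its uncertainty `sd` and the best observed value `f_min`."""
    if sd <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(f_min - mean, 0.0)
    z = (f_min - mean) / sd
    return (f_min - mean) * normal_cdf(z) + sd * normal_pdf(z)


def lower_confidence_bound(mean, sd, lam=1.0):
    """LCB criterion; smaller values are more promising."""
    return mean - lam * sd
```

EI grows when the predicted mean decreases or the uncertainty increases, which is the trade-off between exploitation and exploration described on the slide.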
  10. Focus search

    EI optimization is multimodal and not that simple
    But the objective is now cheap to evaluate
    Many different algorithms exist, from gradient-based methods with restarts to evolutionary algorithms
    We use an iterated, focusing random search coined "focus search"
    In each iteration a random search is performed
    We then shrink the constraints of the feasible region towards the best point in the current iteration (focusing) and iterate, to enforce local convergence
    The whole process is restarted a few times
    Also works for categorical and hierarchical params
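The focus search described on this slide can be sketched for purely numeric parameters as below. The shrinking rule (halving each side of the box and centring it on the iteration's best point) and all names and defaults are assumptions for illustration; the slide does not specify them. The sketch minimizes its objective, so an infill criterion that should be maximized (such as EI) would be passed in negated.

```python
import random


def focus_search(objective, lower, upper, restarts=3, iterations=5,
                 points=100, seed=0):
    """Iterated, focusing random search on a box [lower, upper]."""
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(restarts):
        lo, hi = list(lower), list(upper)  # each restart begins with the full box
        for _ in range(iterations):
            # Random search inside the current box.
            candidates = [
                [rng.uniform(a, b) for a, b in zip(lo, hi)]
                for _ in range(points)
            ]
            local_best = min(candidates, key=objective)
            local_val = objective(local_best)
            if local_val < best_val:
                best_x, best_val = local_best, local_val
            # Focusing: halve every side length and centre the box on the
            # best point of this iteration, clipped to the original bounds.
            for j in range(len(lo)):
                half = (hi[j] - lo[j]) / 4.0
                lo[j] = max(lower[j], local_best[j] - half)
                hi[j] = min(upper[j], local_best[j] + half)
    return best_x, best_val


def toy_criterion(x):
    """Cheap stand-in for an infill criterion (hypothetical example)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


best_x, best_val = focus_search(toy_criterion, [-5.0, -5.0], [5.0, 5.0])
```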
  11. [Figure, six animation frames (slides 11 to 16): MBO on a 1D example. Upper panel: y and yhat over x with init, prop and seq points; lower panel: ei over x.]

    Iter = 1, Gap = 2.0795e-01
    Iter = 2, Gap = 5.5410e-02
    Iter = 3, Gap = 5.5410e-02
    Iter = 4, Gap = 2.2202e-05
    Iter = 5, Gap = 2.2202e-05
    Iter = 15, Gap = 9.0305e-06
  17. mlrMBO: Model-Based Optimization Toolbox

    Any regression from mlr
    Arbitrary infill
    Single- or multi-crit
    Multi-point proposal
    Via parallelMap and batchtools, runs on many parallel backends and clusters
    Algorithm configuration
    Active research
    [Figure: four panels (y, yhat, ei, se) at Iter 5 with init, prop and seq points; x-axis: x1, y-axis: x2]
    mlr: https://github.com/mlr-org/mlr
    mlrMBO: https://github.com/mlr-org/mlrMBO
    mlrMBO paper on arXiv (under review): https://arxiv.org/abs/1703.03373
  18. Benchmark MBO on artificial test functions

    Comparison of mlrMBO on multiple different test functions:
      Multimodal
      Smooth
      Fully numeric
      Well known
    We use GPs with LCB with $\lambda = 1$
    Focus search
    200 iterations
    25-point initial design, created by LHS sampling
    Comparison with:
      Random search
      CMAES
      other MBO implementations in R
  19. MBO GP vs. competitors in 5D

    [Figure: result y per algorithm (mlrMBO, cmaesr, DiceOptim, rBayesOpt, Random) on six test functions: Alpine01, DeflectedCorrugatedSpring, Schwefel, Ackley, Griewank, Rosenbrock]
  20. Section 2: Parallel batch proposals

  21. Motivation for batch proposal

    Function evaluations are expensive
    Often many cores are available on a cluster
    In many cases the underlying $f$ cannot easily be parallelized
    Natural to consider batch proposal
    Parallel MBO: suggest $q$ promising points to evaluate: $x^*_1, \ldots, x^*_q$
    We need to balance exploration and exploitation
    Non-trivial to construct an infill criterion for this
  22. Review of parallel MBO strategies

    Constant liar (Ginsbourger et al., 2010):
      Fit a kriging model based on the real data and find $x^*_1$ according to the EI criterion.
      "Guess" $f(x^*_{i-1})$, update the model and find $x^*_i$, $i = 2, \ldots, q$
      Use $f_{\min}$ for "guessing"
    q-LCB (Hutter et al., 2012):
      $q$ times: sample $\lambda$ from $Exp(1)$ and optimize the single LCB criterion
      $x^* = \arg\min_{x \in \mathcal{X}} LCB(x) = \arg\min_{x \in \mathcal{X}} \hat{f}(x) - \lambda \hat{s}(x)$
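The q-LCB proposal can be sketched as follows. Assumptions for illustration: the criterion is minimized over a finite candidate list rather than over the whole domain, `predict` is any hypothetical function returning the surrogate's mean and uncertainty for a point, and the toy surrogate at the end is made up.

```python
import random


def q_lcb_batch(candidates, predict, q, seed=0):
    """Propose q points: for each one, draw lambda ~ Exp(1) and take the
    candidate that minimizes mean - lambda * sd."""
    rng = random.Random(seed)
    batch = []
    for _ in range(q):
        lam = rng.expovariate(1.0)  # lambda ~ Exp(1)

        def lcb(x):
            mean, sd = predict(x)
            return mean - lam * sd

        batch.append((lam, min(candidates, key=lcb)))
    return batch


def toy_predict(x):
    """Toy surrogate: mean and uncertainty both grow with x, so
    LCB(x) = x * (1 - lambda)."""
    return x, x


batch = q_lcb_batch([0.0, 1.0, 2.0], toy_predict, q=5)
```

Small draws of lambda favour points with a low predicted mean (exploitation), large draws favour uncertain points (exploration). The same point can be proposed more than once, so a real implementation would need to handle duplicates.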
  23. Multiobjectivization

    Multiobjectivization:
      Originates from multi-modal optimization
      Add the distance to neighbors in the current set as an artificial objective
      Use multiobjective optimization
      Select by hypervolume or first objective or . . .
    Approach:
      Decouple $\hat{f}(x)$ and $\hat{s}(x)$ as objectives (instead of EI) to have different exploration / exploitation trade-offs
      Consider a distance measure as a potential extra objective
      Run a multiobjective EA to select $q$ well-performing, diverse points
    Distance is a possible alternative if there is no or only a bad $\hat{s}(x)$ estimator
    Decoupling $\hat{f}(x)$, $\hat{s}(x)$ is a potential alternative when the EI derivation does not hold for other model classes
    Bischl, Wessing et al.: MOI-MBO: Multiobjective infill for parallel model-based optimization, LION 2014
  24. Section 3: Multicriteria SMBO

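  25. Model-based multi-objective optimization

    [Diagram: black box with inputs $x_1, x_2, \ldots, x_d$ and outputs $y_1, y_2$]
    $\min_{x \in \mathcal{X}} f(x) = y = (y_1, \ldots, y_m)$ with $f: \mathbb{R}^d \to \mathbb{R}^m$ (3)
    $y$ dominates $\tilde{y}$ if $\forall i \in \{1, \ldots, m\}: y_i \le \tilde{y}_i$ (4)
    and $\exists i \in \{1, \ldots, m\}: y_i < \tilde{y}_i$ (5)
    Set of non-dominated solutions: $\mathcal{X}^* := \{x \in \mathcal{X} \mid \nexists \tilde{x} \in \mathcal{X}: f(\tilde{x}) \text{ dominates } f(x)\}$
    Pareto set $\mathcal{X}^*$, Pareto front $f(\mathcal{X}^*)$
    Goal: Find a set $\hat{\mathcal{X}}^*$ of non-dominated points that estimates the true set $\mathcal{X}^*$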
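The dominance relation and the set of non-dominated points can be written down directly from definitions (4) and (5). A minimal sketch for minimization with hypothetical function names; it works on already evaluated objective vectors and uses a simple quadratic-time filter.

```python
def dominates(y, y_other):
    """True if y dominates y_other: no worse in every objective
    and strictly better in at least one (minimization)."""
    no_worse = all(a <= b for a, b in zip(y, y_other))
    strictly_better = any(a < b for a, b in zip(y, y_other))
    return no_worse and strictly_better


def non_dominated(points):
    """Keep the objective vectors that no other vector dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


front = non_dominated([(1, 5), (2, 2), (3, 3), (5, 1)])
```

In this example (3, 3) is dominated by (2, 2), while the other three points are mutually non-dominated and form the estimated Pareto front.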
  26. Model-based multi-objective optimization

    [Figure: dominated points and the Pareto front in the ($y_1$, $y_2$) plane, both axes from 0.0 to 1.0]
    The problem statement, the dominance definition (numbered (6) to (8) here), the Pareto set and the goal are the same as on slide 25.
  27. ParEGO

    1. Scalarize the objectives using the augmented Tchebycheff norm
       $\max_{i=1,\ldots,d} [w_i f_i(x)] + \rho \sum_{i=1}^{d} w_i f_i(x)$
       with a uniformly distributed weight vector $w$ ($\sum w_i = 1$) and fit a surrogate model to the respective scalarization.
    2. Single-objective optimization of EI (or LCB?)
    Batch proposal:
      Increase the number and diversity of randomly drawn weight vectors
      If $N$ points are desired, $cN$ ($c > 1$) weight vectors are considered
      Greedily reduce the set of weight vectors by excluding one vector of the pair with minimum distance
      Scalarizations implied by each weight vector are computed
      Fit and optimize models for each scalarization
      The optima of the models form the batch to be evaluated
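The scalarization and the greedy reduction of weight vectors can be sketched as below. Assumptions for illustration: the objective values are already on comparable scales, `rho = 0.05` is just a default picked for the example, `c` is an integer, and all function names are hypothetical.

```python
import math
import random


def augmented_tchebycheff(values, weights, rho=0.05):
    """max_i [w_i * f_i] + rho * sum_i [w_i * f_i] for one point."""
    weighted = [w * v for w, v in zip(weights, values)]
    return max(weighted) + rho * sum(weighted)


def random_weights(m, rng):
    """Weight vector drawn uniformly from the simplex (sum of weights = 1),
    obtained by normalizing independent Exp(1) draws."""
    raw = [rng.expovariate(1.0) for _ in range(m)]
    total = sum(raw)
    return [r / total for r in raw]


def diverse_weights(n, m, c=3, seed=0):
    """Draw c * n weight vectors, then repeatedly drop one vector of the
    closest pair until only n remain."""
    rng = random.Random(seed)
    pool = [random_weights(m, rng) for _ in range(c * n)]
    while len(pool) > n:
        pairs = (
            (i, j) for i in range(len(pool)) for j in range(i + 1, len(pool))
        )
        _, j = min(pairs, key=lambda ij: math.dist(pool[ij[0]], pool[ij[1]]))
        pool.pop(j)
    return pool


weights = diverse_weights(4, 3)
scalarized = [augmented_tchebycheff([1.0, 2.0, 0.5], w) for w in weights]
```

Each remaining weight vector defines one scalarized target; fitting a surrogate per target and optimizing its infill criterion would yield one point of the batch.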
  28. Animation of ParEGO

    [Figure, ten animation frames (slides 28 to 37). Panel "XSpace": ei over (x1, x2), both axes from 0.00 to 1.00. Panel "YSpace": front, init, prop and seq points in (y_1, y_2).]
  38. Section 4: Interesting Challenges

  39. Challenge: The correct surrogate?

    GPs are very much tailored to what we want to do, due to the spatial structure in their kernel and the uncertainty estimator.
    But GPs are rather slow. And (fortunately) due to parallelization (or speed-up tricks like subsampling) we have more design points to train on.
    Categorical features are also a problem for GPs (although methods exist, usually by changing the kernel).
    Random forests handle categorical features nicely and are much faster.
    But they don't rely on a spatial kernel, and their uncertainty estimation is much more heuristic / may not represent what we want.
  40. Challenge: Time Heterogeneity

    Complex configuration spaces across many algorithms result in vastly different runtimes across design points.
    Actually, just tuning the RBF-SVM can result in very different runtimes.
    We don't care how many points we evaluate; we care about the total walltime of the configuration.
    The option to subsample further complicates things.
    Parallelization further complicates things.
    Option: Estimate the runtime as well with a surrogate and integrate it into the acquisition function.
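One possible way to integrate a runtime estimate into the acquisition function is to rank candidates by expected improvement per predicted second. This is only one option, named here as an assumption; the slide does not say which integration is meant, and the function name and the example numbers are made up.

```python
def pick_by_ei_per_second(candidates):
    """candidates: list of (config, predicted_ei, predicted_runtime_seconds).
    Returns the config with the highest EI per predicted second."""
    def score(item):
        _, ei, runtime = item
        return ei / max(runtime, 1e-9)  # guard against a zero runtime estimate

    return max(candidates, key=score)[0]


choice = pick_by_ei_per_second([
    ("slow", 0.10, 100.0),  # slightly higher EI, but ten times the runtime
    ("fast", 0.08, 10.0),
])
```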
  41. Section 5: ML Model Selection and Hyperparameter Optimization
  42. Automatic Model Selection

    Prior approaches:
      Looking for the silver bullet model: Failure
      Exhaustive benchmarking / search: Very expensive, often contradicting results
      Meta-Learning: Good meta-features are hard to construct
      IMHO: Gets more interesting when combined with SMBO
    Goal for AutoML:
      Data dependent
      Automatic
      Include every relevant modeling decision
      Efficient
    Learn on the model-settings level!
  43. From Normal SMBO to Hyperparameter Tuning

    Objective function is a resampled performance measure
    Parameter space $\theta \in \Theta$ might be discrete and dependent / hierarchical
    No derivative for $f(\cdot, \theta)$, black-box
    Objective is stochastic / noisy
    Objective is expensive to evaluate
    In general we face a problem of algorithm configuration
    Usual approaches: racing or model-based / Bayesian optimization
  44. From Normal SMBO to Hyperparameter Tuning

    [Diagram with the elements: Black Box Optimizer; Data Set; Learning Machine (Preprocessing, Model Fit, Postprocessing); Feature Filter; Train / Test Data; Resampling; Features; Hyperparameters; Resampled Performance; Function(Features, Hyperparameters); arrows labeled "Selected" and "defines"]
  45. Complex Parameter Space

    Parameter Set
      cl.weights: $2^{[-7, \ldots, 7)}$
      learner:
        randomForest
          mtry: $\{0.1p, \ldots, 0.9p\}$
          nodesize: $\{1, \ldots, 0.5n\}$
        L2 LogReg
          cost: $2^{[-15, 15]}$
        svm
          cost: $2^{[-15, 15]}$
          kernel:
            radial
              $\gamma$: $2^{[-15, 15]}$
            linear
  46. From Normal SMBO to Hyperparameter Tuning

    Initial design: the LHS principle can be extended, or just use a random design
    Focus search: can be (easily) extended, as it is based on random search. To zoom in for categorical parameters, we randomly drop, for each param, a category which is not present in the currently best configuration.
    A few approaches for GPs with categorical params exist (usually with new covar kernels), but they are not very established
    Alternative: Random regression forest (mlrMBO, SMAC)
    Estimate the uncertainty / confidence interval for the mean response by an efficient bootstrap technique [1], or the jackknife, so we can define $EI(x)$ for the RF
    Dependent params in mlrMBO: Imputation
    Many of the current techniques to handle these problems are (from a theoretical standpoint) somewhat crude
    [1] Sexton et al., "Standard errors for bagged and random forest estimators", 2009.
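The focusing step for categorical parameters can be sketched as follows: for each parameter, one randomly chosen category that is not part of the currently best configuration is removed. The function name and the dictionary representation are hypothetical; the categories in the usage example are taken from the parameter space on slide 45.

```python
import random


def shrink_categories(domains, best_config, rng):
    """Drop, per categorical parameter, one random category that does not
    appear in the currently best configuration."""
    shrunk = {}
    for name, categories in domains.items():
        droppable = [c for c in categories if c != best_config[name]]
        if droppable:
            drop = rng.choice(droppable)
            shrunk[name] = [c for c in categories if c != drop]
        else:
            shrunk[name] = list(categories)  # only the best category is left
    return shrunk


rng = random.Random(0)
domains = {
    "learner": ["randomForest", "L2 LogReg", "svm"],
    "kernel": ["radial", "linear"],
}
best = {"learner": "svm", "kernel": "radial"}
once = shrink_categories(domains, best, rng)
twice = shrink_categories(once, best, rng)
```

Repeating the step shrinks every domain towards the single category used by the best configuration, which mirrors the shrinking of numeric bounds in the numeric case.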
  47. Hyperparameter Tuning

    Still common practice: grid search
    For an SVM it might look like:
      $C \in (2^{-12}, 2^{-10}, 2^{-8}, \ldots, 2^{8}, 2^{10}, 2^{12})$
      $\gamma \in (2^{-12}, 2^{-10}, 2^{-8}, \ldots, 2^{8}, 2^{10}, 2^{12})$
    Evaluate all $13^2 = 169$ combinations $C \times \gamma$
    Bad because:
      the optimum might be "off the grid"
      lots of evaluations in bad areas
      lots of costly evaluations
    How bad?
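The size of this grid is easy to check by enumerating it. A small sketch; the variable names are hypothetical.

```python
from itertools import product

# Exponents -12, -10, ..., 10, 12 give 13 values per parameter.
exponents = range(-12, 13, 2)
cost_grid = [2.0 ** e for e in exponents]
gamma_grid = [2.0 ** e for e in exponents]

# Full Cartesian product C x gamma: every pair is one model fit to evaluate.
grid = list(product(cost_grid, gamma_grid))
n_evaluations = len(grid)
```

Each of the 169 pairs costs one resampled model fit regardless of how unpromising its region is, which is the inefficiency the slide points out.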
  48. Hyperparameter Tuning

    Because of budget restrictions the grid might even be smaller!
    Unpromising area quite big!
    Lots of costly evaluations!
    With mlrMBO it is not hard to do it better!
    More interesting applications to time-series regression and cost-sensitive classification [2]
    [2] Koch, Bischl et al.: Tuning and evolution of support vector kernels, EI 2012
  49. Hyperparameter Tuning

  50. Hyperparameter Tuning

  51. HPOlib

    HPOlib is a set of standard benchmarks for hyperparameter optimizers
    Allows comparison with:
      Spearmint
      SMAC
      Hyperopt (TPE)
    Benchmarks:
      Numeric test functions (similar to the ones we've seen before)
      Numeric machine learning problems (lda, SVM, logistic regression)
      Deep neural networks and deep belief networks with 15 and 35 parameters
    For benchmarks with discrete and dependent parameters (hpnnet, hpdbnet) a random forest with standard error estimation is used.
  52. MBO: HPOlib

    [Figure: result per optimizer (mlrMBO, smac, spearmint, TPE) on twelve benchmarks: svm_on_grid, branin, michalewicz, camelback, hpnnet/nocv_convex, hpnnet/nocv_mrbi, lda_on_grid, logreg_on_grid, hpdbnet/convex, hpdbnet/mrbi, hpnnet/cv_convex, hpnnet/cv_mrbi]
  53. Deep Learning Configuration Example

    Dataset: CIFAR-10 (60000 32x32 images with 3 color channels; 10 classes)
    Configuration of a deep neural network (mxnet)
    Size of parameter set: 30, including number of hidden layers, activation functions, regularization, convolution layer setting, etc.
    Split: 2/3 training set, 1/6 test set, 1/6 validation set
    Time budget per tuning run: 4.5h (16200 sec)
    Surrogate: Random forest
    Acquisition: LCB with $\lambda = 2$
  54. Deep Learning Configuration Example

  55. Thanks! Any comments or questions?