Slide 1

A Modern Guide to Hyperparameter Optimization
Richard Liaw

Slide 2

Deep learning is taking over the world

Slide 3

def train_model():
    model = ConvNet()
    optimizer = Optimizer()
    for batch in Dataset():
        loss, acc = model.train(batch)
        optimizer.update(model, loss)

Slide 5

def train_model():
    model = ConvNet(layers, activations, drop...)
    optimizer = Optimizer(lr, momentum, decay...)
    for batch in Dataset(standardize, shift, ...):
        loss, acc = model.train(batch)
        optimizer.update(model, loss)

Tune this!

Slide 6

Hyperparameters matter!

Slide 7

Goal of hyperparameter tuning:
- Maximize model performance
- Minimize time spent
- Minimize money spent

Slide 8

Overview of hyperparameter tuning techniques:
- Grid Search
- Random Search
- Bayesian Optimization
- HyperBand (bandits)
- Population-Based Training

Definition: "trial" = one configuration evaluation

Slide 9

Grid Search
tl;dr: Evaluate the cross-product of all possible configurations.

for rate in [0.1, 0.01, 0.001]:
    for hidden_layers in [2, 3, 4]:
        for param in ["a", "b", "c"]:
            train_model(rate, hidden_layers, param)

⇐ 27 evaluations!

Benefits:
1. Explainable
2. Easily parallelizable
Problems: Inefficient/expensive
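For reference, a self-contained sketch of the same grid search, using itertools.product to enumerate the 3 x 3 x 3 = 27 configurations; train_model here is a hypothetical stand-in for a real training run.

import itertools
import random

def train_model(rate, hidden_layers, param):
    # Stand-in for a real training run; returns a fake validation score.
    return random.random()

# Cross-product of all configurations: 3 * 3 * 3 = 27 trials.
grid = itertools.product([0.1, 0.01, 0.001], [2, 3, 4], ["a", "b", "c"])
results = {cfg: train_model(*cfg) for cfg in grid}
best = max(results, key=results.get)
print("best config:", best, "score:", results[best])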

Slide 10

[Figure: Grid Search vs. Random Search coverage of the search space]

Slide 11

Random Search
tl;dr: Sample configurations at random.

for i in range(num_samples):
    train_model(
        rate=sample(0.001, 0.1),
        hidden_layers=sample(2, 4),
        param=sample(["a", "b", "c"]))

Benefits:
1. Better coverage of important parameters
2. Easily parallelizable
3. Hard to beat in high dimensions
Problems: Still inefficient/expensive!
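A matching self-contained sketch of random search, again with a hypothetical stand-in for train_model; here the number of trials is chosen by the user rather than fixed by the size of a grid.

import random

def train_model(rate, hidden_layers, param):
    # Stand-in for a real training run; returns a fake validation score.
    return random.random()

num_samples = 27
best_cfg, best_score = None, float("-inf")
for _ in range(num_samples):
    cfg = (random.uniform(0.001, 0.1),      # learning rate
           random.randint(2, 4),            # hidden layers
           random.choice(["a", "b", "c"]))  # other parameter
    score = train_model(*cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score
print("best config:", best_cfg, "score:", best_score)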

Slide 12

What if we used some prior information to guide our tuning process?
(photo from github.com/fmfn/bayesianoptimization)

Slide 13

Bayesian Optimization
Model-based optimization of hyperparameters.

opt = Optimizer(
    lr=(0.01, 0.1),
    layers=(2, 5))
for i in range(9):
    config = opt.ask()
    score = train_model(config)
    opt.tell(config, score)

Libraries: HyperOpt, Scikit-Optimize
Benefits:
1. Can utilize prior information
2. Semi-parallelizable (Kandasamy 2018)
Still can do better!
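The ask/tell loop above, written against scikit-optimize's Optimizer as one concrete example (pip install scikit-optimize); the objective is a toy stand-in, and note that skopt minimizes, so a loss is reported rather than a score.

from skopt import Optimizer

def train_model(lr, layers):
    # Toy loss surface standing in for a real training run.
    return (lr - 0.05) ** 2 + 0.01 * layers

opt = Optimizer(dimensions=[(0.01, 0.1), (2, 5)])  # lr range, layer range
for i in range(9):
    lr, layers = opt.ask()          # model proposes the next configuration
    loss = train_model(lr, layers)  # evaluate it
    opt.tell([lr, layers], loss)    # feed the result back into the model

print(min(zip(opt.yi, opt.Xi)))     # best (loss, configuration) seen so far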

Slide 14

We can do better by exploiting structure! Why waste resources on this?

Slide 15

HyperBand/ASHA (early-stopping algorithms)

trial = sample_from(hyperparameter_space)
while trial.iter < max_epochs:
    trial.run_one_epoch()
    if trial.at_cutoff():
        if is_top_fraction(trial, trial.iter):
            trial.extend_cutoff()
        else:  # allow new trials to start
            trial.pause()
            break

Intuition:
1. Compare relative performance
2. Terminate badly performing trials
3. Continue better trials for a longer period of time
Notes:
1. Can be combined with Bayesian Optimization
2. Can be easily parallelized
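A minimal synchronous successive-halving sketch, the core idea behind HyperBand/ASHA; the "training" is a toy stand-in where each extra epoch adds noisy progress, purely for illustration.

import random

def train_one_epoch(config, prev_score):
    # Toy objective: configurations with lr near 0.01 improve fastest.
    return prev_score + 1.0 - abs(config["lr"] - 0.01) + random.gauss(0, 0.05)

def successive_halving(num_trials=27, reduction_factor=3, num_rungs=3):
    trials = [{"lr": random.uniform(0.001, 0.1), "score": 0.0}
              for _ in range(num_trials)]
    for rung in range(num_rungs):
        # Give every surviving trial one more unit of training budget.
        for t in trials:
            t["score"] = train_one_epoch(t, t["score"])
        # Keep the top 1/reduction_factor of trials; stop the rest early.
        trials.sort(key=lambda t: t["score"], reverse=True)
        trials = trials[: max(1, len(trials) // reduction_factor)]
    return trials[0]

print(successive_halving())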

Slide 16

But what about dynamic hyperparameters? Changed learning rate!

Slide 17

Population-Based Training

[Diagram: a population of trials with hyperparameter values (0.1-0.4) evolving over time]

Main idea: Evaluate a population in parallel. Terminate the lowest performers. Copy the weights of the best performers and mutate their hyperparameters.

Benefits:
1. Easily parallelizable
2. Can search over "schedules"
3. Terminates bad performers
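A minimal population-based training sketch focusing on the exploit/explore step; the per-step "training" is a toy stand-in, and real PBT would copy model weights along with the hyperparameters.

import copy
import random

def train_step(member):
    # Toy objective: reward depends on the current (dynamic) learning rate.
    member["score"] += 1.0 - abs(member["lr"] - 0.05) + random.gauss(0, 0.02)

population = [{"lr": random.uniform(0.001, 0.1), "score": 0.0}
              for _ in range(4)]

for step in range(20):
    for member in population:
        train_step(member)
    if step % 5 == 4:  # periodically exploit/explore
        population.sort(key=lambda m: m["score"])
        worst, best = population[0], population[-1]
        # Exploit: the worst member copies the best one (weights + hyperparameters).
        worst.update(copy.deepcopy(best))
        # Explore: perturb the copied hyperparameter, yielding a schedule over time.
        worst["lr"] *= random.choice([0.8, 1.2])

print(max(population, key=lambda m: m["score"]))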

Slide 18

Does it really work?

Slide 19

OK, but there’s no way I’m going to implement all of these algorithms...

Slide 20

Tune: a library for distributed hyperparameter search (tune.io)

Slide 21

Tune and many others!

Slide 22

Tune handles hyperparameter search execution.

Slide 23

Tune is built with deep learning as a priority:
- Resource-aware scheduling
- Framework agnostic

Slide 24

Tune Algorithm Offerings

Search Algorithms Provided:
● HyperOpt (TPE)
● Bayesian Optimization
● SigOpt
● Nevergrad
● Scikit-Optimize
● Ax/BoTorch (PyTorch BayesOpt)

Trial Schedulers Provided:
● Population-Based Training
● HyperBand
● ASHA
● Median Stopping Rule
● BOHB

Slide 25

Tune Powers Many Open Source Projects

Slide 26

Native Integration with TensorBoard HParams

Slide 27

Resources
- PyData Demo: https://github.com/richardliaw/pydata_demo
- Tune Documentation: http://tune.io
- Tune Tutorial: https://github.com/ray-project/tutorial/

Slide 28

Live demo

Slide 29

ray.readthedocs.io/en/latest/tune.html

def train_model(config={}):
    model = ConvNet(config)
    for i in range(steps):
        loss, acc = model.train()

Slide 30

from ray.tune import run, track

def train_model(config={}):
    model = ConvNet(config)
    for i in range(steps):
        loss, acc = model.train()
        track.log(mean_loss=loss)

Slide 31

def train_model(config={}):
    model = ConvNet(config)
    for i in range(steps):
        loss, acc = model.train()
        track.log(mean_loss=loss)

train_model(config={"learning_rate": 0.1})

Slide 32

def train_model(config={}):
    model = ConvNet(config)
    for i in range(steps):
        loss, acc = model.train()
        track.log(mean_loss=loss)

tune.run(train_model,
         config={"learning_rate": 0.1})

Slide 33

def train_model(config):
    model = ConvNet(config)
    for i in range(steps):
        loss, acc = model.train()
        track.log(mean_loss=loss)

tune.run(train_model,
         config={"learning_rate": 0.1},
         num_samples=100)

Slide 34

def train_model(config):
    model = ConvNet(config)
    for i in range(steps):
        loss, acc = model.train()
        track.log(mean_loss=loss)

tune.run(train_model,
         config={"learning_rate": 0.1},
         num_samples=100,
         upload_dir="s3://my_bucket")

Slide 35

def train_model(config):
    model = ConvNet(config)
    for i in range(steps):
        loss, acc = model.train()
        track.log(mean_loss=loss)

tune.run(train_model,
         config={"learning_rate": tune.uniform(0.001, 0.1)},
         num_samples=100,
         upload_dir="s3://my_bucket",
         scheduler=AsyncHyperBandScheduler())
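Putting the pieces together, a runnable end-to-end sketch, assuming a Ray version from the era of these slides (where track.log is the reporting API; newer releases use tune.report); the ConvNet model is replaced by a toy objective so the example is self-contained, and the metric/mode arguments tell the scheduler what to optimize.

import random

from ray import tune
from ray.tune import track
from ray.tune.schedulers import AsyncHyperBandScheduler

def train_model(config):
    loss = 1.0
    for i in range(10):
        # Toy "training": loss shrinks faster for learning rates near 0.03.
        loss -= 0.1 * (1.0 - abs(config["learning_rate"] - 0.03)) + random.gauss(0, 0.01)
        track.log(mean_loss=loss)

analysis = tune.run(
    train_model,
    config={"learning_rate": tune.uniform(0.001, 0.1)},
    num_samples=20,
    scheduler=AsyncHyperBandScheduler(metric="mean_loss", mode="min"))

print(analysis.get_best_config(metric="mean_loss", mode="min"))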