A Modern Guide to Hyperparameter Optimization

Richard Liaw
December 04, 2019


Modern deep learning model performance depends heavily on the choice of hyperparameters, and the tuning process is a major bottleneck in the machine learning pipeline. In this talk, we give an overview of modern methods for hyperparameter tuning and show how Ray Tune, a scalable open-source hyperparameter tuning library with cutting-edge tuning methods, can be easily incorporated into everyday workflows. Find Ray Tune on GitHub at https://github.com/ray-project/ray.

This talk was originally given at PyData LA 2019.


Transcript

Slides 3–4 (©2017 RISELab):

    def train_model():
        model = ConvNet()
        optimizer = Optimizer()
        for batch in Dataset():
            loss, acc = model.train(batch)
            optimizer.update(model, loss)
Slide 5:

    def train_model():
        model = ConvNet(layers, activations, drop...)
        optimizer = Optimizer(lr, momentum, decay...)
        for batch in Dataset(standardize, shift, ...):
            loss, acc = model.train(batch)
            optimizer.update(model, loss)

    Tune this!
Slide 8: Overview of hyperparameter tuning techniques

    Grid Search, Random Search, Bayesian Optimization, HyperBand (bandits), Population-Based Training

    Definition: "trial" = "one configuration evaluation"
Slide 9: Grid Search

    tl;dr: Cross-product of all possible configurations.

    for rate in [0.1, 0.01, 0.001]:
        for hidden_layers in [2, 3, 4]:
            for param in ["a", "b", "c"]:
                train_model(rate, hidden_layers, param)

    Benefits:
    1. Explainable
    2. Easily parallelizable

    Problems: Inefficient/expensive (27 evaluations!)
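The grid-search loop on this slide can be run end-to-end with a toy objective. The `train_model` below is a hypothetical stand-in that scores a configuration, not the talk's actual training function:

```python
from itertools import product

def train_model(rate, hidden_layers, param):
    # Hypothetical objective: best near rate=0.01 with 3 hidden layers.
    return 1.0 - abs(rate - 0.01) - 0.01 * abs(hidden_layers - 3)

# Cross-product of all possible configurations: 3 * 3 * 3 = 27 evaluations.
grid = product([0.1, 0.01, 0.001], [2, 3, 4], ["a", "b", "c"])
results = [((rate, layers, param), train_model(rate, layers, param))
           for rate, layers, param in grid]
best_config, best_score = max(results, key=lambda r: r[1])
```

Every added hyperparameter multiplies the trial count, which is why the slide calls grid search inefficient.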
Slide 11: Random Search

    tl;dr: Sample configurations at random.

    for i in range(num_samples):
        train_model(
            rate=sample(0.001, 0.1),
            hidden_layers=sample(2, 4),
            param=sample(["a", "b", "c"]))

    Benefits:
    1. Better coverage of important parameters
    2. Easily parallelizable
    3. Hard to beat in high dimensions

    Problems: Still inefficient/expensive!
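The random-search pseudocode can likewise be sketched end-to-end; `train_model` and the toy objective are hypothetical stand-ins:

```python
import random

random.seed(0)  # only so the sketch is reproducible

def train_model(rate, hidden_layers, param):
    # Hypothetical objective, peaked at rate=0.01.
    return 1.0 - abs(rate - 0.01)

num_samples = 27
results = []
for _ in range(num_samples):
    config = {
        "rate": random.uniform(0.001, 0.1),       # continuous range
        "hidden_layers": random.randint(2, 4),    # integer range
        "param": random.choice(["a", "b", "c"]),  # categorical choice
    }
    results.append((config, train_model(**config)))
best_config, best_score = max(results, key=lambda r: r[1])
```

Because every draw picks a fresh continuous `rate`, the important dimension gets 27 distinct values here instead of the grid's 3, which is the "better coverage" benefit on the slide.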
Slide 12: What if we used some prior information to guide our tuning process?

    (photo from github.com/fmfn/BayesianOptimization)
Slide 13: Bayesian Optimization

    Model-based optimization of hyperparameters.

    opt = Optimizer(
        lr=(0.01, 0.1),
        layers=(2, 5))
    for i in range(9):
        config = opt.ask()
        score = train_model(config)
        opt.tell(config, score)

    Libraries: Hyperopt, Scikit-Optimize

    Benefits:
    1. Can utilize prior information
    2. Semi-parallelizable (Kandasamy 2018)

    Still can do better!
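The ask/tell loop above can be mimicked without any library. The `ToyOptimizer` below only illustrates the interface; it perturbs the best configuration seen so far rather than fitting a real probabilistic surrogate, which is the machinery Hyperopt or Scikit-Optimize would supply behind the same ask/tell calls:

```python
import random

random.seed(2)  # only so the sketch is reproducible

class ToyOptimizer:
    """Illustrates the ask/tell interface only; NOT real Bayesian optimization."""

    def __init__(self, lr_bounds, layer_bounds):
        self.lr_bounds = lr_bounds
        self.layer_bounds = layer_bounds
        self.history = []  # (config, score) pairs

    def ask(self):
        if len(self.history) < 3 or random.random() < 0.3:
            # Explore: sample uniformly at random.
            return {"lr": random.uniform(*self.lr_bounds),
                    "layers": random.randint(*self.layer_bounds)}
        # Exploit the prior information: perturb the best config so far.
        best, _ = max(self.history, key=lambda h: h[1])
        lr = min(max(best["lr"] * random.uniform(0.5, 2.0),
                     self.lr_bounds[0]), self.lr_bounds[1])
        return {"lr": lr, "layers": best["layers"]}

    def tell(self, config, score):
        self.history.append((config, score))

def train_model(config):
    # Hypothetical objective: best near lr=0.05 with 3 layers.
    return -abs(config["lr"] - 0.05) - 0.01 * abs(config["layers"] - 3)

opt = ToyOptimizer(lr_bounds=(0.01, 0.1), layer_bounds=(2, 5))
for i in range(9):
    config = opt.ask()
    opt.tell(config, train_model(config))
best_config, best_score = max(opt.history, key=lambda h: h[1])
```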
Slide 15: HyperBand/ASHA (early-stopping algorithms)

    trial = sample_from(hyperparameter_space)
    while trial.iter < max_epochs:
        trial.run_one_epoch()
        if trial.at_cutoff():
            if is_top_fraction(trial, trial.iter):
                trial.extend_cutoff()
            else:
                # allow new trials to start
                trial.pause()
                break

    Intuition:
    1. Compare relative performance
    2. Terminate badly performing trials
    3. Continue better trials for a longer period of time

    Notes:
    1. Can be combined with Bayesian Optimization
    2. Can be easily parallelized
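The early-stopping intuition can be sketched as synchronous successive halving, the core idea behind HyperBand/ASHA. All names and the toy objective below are hypothetical:

```python
import random

random.seed(1)  # only so the sketch is reproducible

def successive_halving(num_trials=27, min_epochs=1,
                       reduction_factor=3, max_epochs=27):
    # Each "trial" is a hyperparameter config; here, just a learning rate.
    trials = [{"lr": random.uniform(0.001, 0.1), "score": 0.0}
              for _ in range(num_trials)]
    epochs = min_epochs
    while len(trials) > 1 and epochs <= max_epochs:
        for t in trials:
            # Toy objective: score grows with epochs, best near lr=0.05.
            t["score"] = epochs * (1.0 - abs(t["lr"] - 0.05))
        # Compare relative performance: keep the top 1/reduction_factor
        # of trials for longer; terminate the rest early.
        trials.sort(key=lambda t: t["score"], reverse=True)
        trials = trials[: max(1, len(trials) // reduction_factor)]
        epochs *= reduction_factor
    return trials[0]

best = successive_halving()
```

With 27 trials and a reduction factor of 3, only one trial survives to the full budget, so most of the compute goes to promising configurations.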
Slide 17: Population-Based Training

    [diagram: a population of trials whose hyperparameter values (0.1–0.4) are copied and mutated over time]

    Main idea: Evaluate a population in parallel. Terminate the lowest performers. Copy the weights of the best performers and mutate their hyperparameters.

    Benefits:
    1. Easily parallelizable
    2. Can search over "schedules"
    3. Terminates bad performers
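The exploit/explore cycle can be sketched in a few lines; the population size, toy objective, and mutation factors below are all made up for illustration:

```python
import random

random.seed(3)  # only so the sketch is reproducible

def step(member):
    # Pretend training: score improves faster when lr is near 0.05.
    member["score"] += 1.0 - abs(member["lr"] - 0.05)

# Evaluate a population in parallel (sequentially here, for simplicity).
population = [{"lr": random.uniform(0.001, 0.1), "score": 0.0}
              for _ in range(8)]
for generation in range(5):
    for member in population:
        step(member)
    population.sort(key=lambda m: m["score"], reverse=True)
    top, bottom = population[:4], population[4:]
    for loser, winner in zip(bottom, top):
        # Exploit: copy the winner's "weights" (here, just its score),
        # then explore: mutate the copied hyperparameter.
        loser["score"] = winner["score"]
        loser["lr"] = winner["lr"] * random.uniform(0.8, 1.2)
best = max(population, key=lambda m: m["score"])
```

Because the learning rate keeps mutating between generations, the population effectively searches over hyperparameter "schedules", not just fixed values.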
Slide 19: OK, but there's no way I'm going to implement all of these algorithms...
Slide 24: Tune Algorithm Offerings

    Search Algorithms Provided:
    • HyperOpt (TPE)
    • Bayesian Optimization
    • SigOpt
    • Nevergrad
    • Scikit-Optimize
    • Ax/BoTorch (PyTorch BayesOpt)

    Trial Schedulers Provided:
    • Population-Based Training
    • HyperBand
    • ASHA
    • Median Stopping Rule
    • BOHB
Slide 30: ray.readthedocs.io/en/latest/tune.html

    from ray.tune import run, track

    def train_model(config={}):
        model = ConvNet(config)
        for i in range(steps):
            loss, acc = model.train()
            track.log(mean_loss=loss)
Slide 32: ray.readthedocs.io/en/latest/tune.html

    def train_model(config={}):
        model = ConvNet(config)
        for i in range(steps):
            loss, acc = model.train()
            track.log(mean_loss=loss)

    tune.run(train_model, config={"learning_rate": 0.1})
Slide 33: ray.readthedocs.io/en/latest/tune.html

    def train_model(config):
        model = ConvNet(config)
        for i in range(steps):
            loss, acc = model.train()
            track.log(mean_loss=loss)

    tune.run(train_model,
             config={"learning_rate": 0.1},
             num_samples=100)
Slide 35: ray.readthedocs.io/en/latest/tune.html

    def train_model(config):
        model = ConvNet(config)
        for i in range(steps):
            loss, acc = model.train()
            track.log(mean_loss=loss)

    tune.run(train_model,
             config={"learning_rate": tune.uniform(0.001, 0.1)},
             num_samples=100,
             upload_dir="s3://my_bucket",
             scheduler=AsyncHyperBandScheduler())