Fast and efficient hyperparameter tuning with Ray Tune

Hyperparameter tuning (or optimization) is used to find the best-performing machine learning (ML) model by exploring and optimizing the model hyperparameters (e.g., learning rate, tree depth). It is a compute-intensive problem that lends itself well to distributed execution.

Ray Tune is a Python library, built on Ray, that allows you to easily run distributed hyperparameter tuning at scale. Ray Tune is framework-agnostic and supports all the popular training frameworks including PyTorch, TensorFlow, XGBoost, LightGBM, and Keras.

Anyscale

October 21, 2021

Transcript

  1. Ray Tune Fast and efficient hyperparameter tuning with Ray Tune

    Will Drevo Amog Kamsetty Xiaowei Jiang
  2. Agenda 1. What brings you here 2. Challenges in hyperparameter tuning

     3. Ray Tune’s approach 4. Demo
  3. About us! Hi! I’m Will Drevo. • PM at Anyscale

    for the open-source ML team (Ray Tune, Train, RLlib) • Previously • an ML engineer at Coinbase • founded a data labeling company for ML teams • founded a pharma SaaS for accelerating clinical trials • BS, MEng CS @ MIT in ML & distributed systems • I like to DJ, make electronic music, travel, and eat Ethiopian food Joining us is Amog Kamsetty, a software engineer at Anyscale and a Ray Tune & Train lead developer
  4. (1) What brings you here? What is your stack?

  5. (2) Why we tune

  6. The future of ML: models with lots of parameters Note

    the log axis!
  7. Hyperparameter tuning “choosing a set of optimal hyperparameters for a

    learning algorithm” How many layers? What kinds of layers? Learning rate schedule? Every number here is a hyperparameter! Example: what network structure is best for your binary classification problem?
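
    Each of those questions can be written down as a dimension of a search space. A minimal sketch, assuming the Ray 1.x-era tune API used later in the deck; the parameter names here are illustrative, not from the slides:

        from ray import tune

        # Hypothetical search space for the binary-classification network above.
        search_space = {
            "num_layers": tune.randint(2, 8),                           # how many layers?
            "layer_type": tune.choice(["conv", "dense"]),               # what kinds of layers?
            "lr": tune.loguniform(1e-4, 1e-1),                          # learning rate...
            "lr_schedule": tune.choice(["constant", "step", "cosine"]), # ...and its schedule
        }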
  8. Why we tune (chart axes: $ Cost, Performance) Achieve performance you might

     otherwise not find • Minimizing fraud loss • Reduce unsold goods (forecasting) • Lower error rate in object detection • etc. Don’t go broke with compute ($$) or developer hours to train! These can also “cost” you!
  9. Pure parallelization isn’t enough

  10. Why we tune (chart labels: Cost, Performance per unit, Performance, Time (Money))

  11. (2) Challenges in hyperparameter tuning

  12. • Scaling memory / compute ◦ Hold at least 1 copy

     of the data in memory, likely many more • Algorithmic (cost) efficiency ◦ Cleverly & quickly search the parameter space • Ease of use ◦ Quick to get started ◦ Local → cluster ◦ Extensible Challenges in tuning
  13. • Parallelism: how do you distribute computation? • Does your

    distributed training framework work with your tuning framework? Hard to get both. Scaling memory
  14. Two ways to accomplish this: 1. Sampling (searching) 2. Scheduling

     (pruning) Sample better: pick the next parameter set to evaluate more intelligently Schedule better: allocate more time to promising parameter sets Algorithmic cost efficiency
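
    Those two levers map onto two arguments of tune.run. A hedged sketch assuming the Ray 1.x API; train_func and the "loss" metric name are placeholders:

        from ray import tune
        from ray.tune.schedulers import ASHAScheduler           # schedule better (pruning)
        from ray.tune.suggest.hyperopt import HyperOptSearch    # sample better (searching)

        analysis = tune.run(
            train_func,                                  # placeholder training function
            config={"lr": tune.loguniform(1e-4, 1e-1)},
            metric="loss",
            mode="min",
            num_samples=50,
            search_alg=HyperOptSearch(),   # picks the next parameter set to evaluate
            scheduler=ASHAScheduler(),     # allocates more time to promising trials
        )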
  15. • Better search algorithms have wildly better performance • Deep

    learning == more complex, more parameters Algorithmic cost efficiency
  16. Algorithmic cost efficiency: scheduler (SHA)

  17. Algorithmic cost efficiency: sampler + searcher (BOHB)

  18. • [2012] Random search • [2015] Successive Halving Algorithm (SHA)

     • [2017] HyperBand (HB / AHB) • [2017] Population Based Training (PBT) • [2017] Google Vizier • [2018] Bayesian Optimization of HyperBand (BOHB) • [2018] Async Successive Halving Algorithm (ASHA). Many algorithms out there... Today, ASHA is likely the best choice for real-world, highly distributed tuning. PBT is also a good choice, specifically for deep learning.
  19. Cost(development) + Cost(tuning compute) < Cost(bad model) • Library interoperability

    • Maintenance cost over time • Ease of getting started, debugging Ease of use A very real cost!
  20. (3) Ray Tune’s approach

  21. • Open source! • Interoperability with Ray ecosystem • Distributed-first,

    GPU-first • Work with as many Python libraries as possible • Constantly updated with state-of-the-art tuning algorithms • Orchestrate other searchers, samplers: HyperOpt, Optuna, etc Why use Ray Tune?
  22. What is Ray? (diagram: Ray Native Libraries, 3rd Party Libraries, Your app here!)

     Universal framework for distributed computing • Run anywhere • Library + app ecosystem
  23. Zooming in: Ray’s native libraries Universal framework for distributed computing

    Run anywhere Library + app ecosystem
  24. • Efficient algorithms that enable running trials in parallel •

    Effective orchestration of distributed trials • Easy to use APIs Ray Tune - distributed HPO Cutting edge optimization algorithms Minimal code changes to work in distributed settings Compatible with ML ecosystem
  25. A universal framework for distributed computing Notable users of Ray

    Ray Tune users!
  26. Tool comparison (Open source | No cloud lock-in | Distributed | SOTA algorithms | Bring your own framework | Also runs):

     Ray Tune            ✅ | ✅ | ✅ | ⭐⭐⭐ | ✅ | HyperOpt, Optuna, SigOpt
     HyperOpt            ✅ | ✅ | ✅ | ⭐⭐ | ✅ |
     Optuna              ✅ | ✅ | ❌ | ⭐⭐ | ✅ |
     SigOpt              ❌ | ❌ | ✅ | ⭐ | ✅ |
     Vertex AI (Vizier)  ❌ | ❌ | ✅ | ⭐ | ❌ |
     Sagemaker           ❌ | ❌ | ✅ | ⭐ | ❌ |
     Azure ML            ❌ | ❌ | ✅ | ⭐ | ❌ |
     Katib               ✅ | ✅ | ✅ | ⭐⭐ | ❌ | HyperOpt, Optuna
     Spark ML            ✅ | ✅ | ✅ | ❌ | ❌ | HyperOpt
  27. • Provides efficient cutting-edge HPO algorithms • Distributes and

     coordinates parallel trials in a fault-tolerant and elastic manner • Saves you time and cost at every step of HPO Ray Tune benefits. Ray Tune benchmark on 2 weeks of production data at Uber: 2X efficiency improvement in terms of GPU/CPU-hours
  28. Case studies: Ray Tune

  29. (4) Code example!

  30. Ray Tune with PyTorch ASHA scheduler with loss specification Simply

     use your train func (i.e. train_cifar()) and just tune.run()! See this example on the PyTorch official docs Easily specify hyperparameter ranges to search over
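
    An abbreviated sketch in the spirit of that PyTorch-docs example, assuming the Ray 1.x API; train_cifar stands in for your existing training function, which reports progress via tune.report(loss=...):

        from ray import tune
        from ray.tune.schedulers import ASHAScheduler

        config = {
            "lr": tune.loguniform(1e-4, 1e-1),
            "batch_size": tune.choice([2, 4, 8, 16]),
        }
        # ASHA prunes trials whose reported loss lags behind the others.
        scheduler = ASHAScheduler(metric="loss", mode="min", max_t=10, grace_period=1)

        analysis = tune.run(
            train_cifar,                       # your existing training function
            config=config,
            num_samples=10,
            scheduler=scheduler,
            resources_per_trial={"cpu": 2, "gpu": 0},
        )
        print("Best config:", analysis.get_best_config(metric="loss", mode="min"))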
  31. • A scikit-learn wrapper for Ray Tune ◦ drop-in replacement

    for scikit-learn model selection module (RandomizedSearchCV and GridSearchCV) • Provides a familiar and simple API for advanced, distributed HPO • https://github.com/ray-project/tune-sklearn Tune-sklearn
  32. Drop-in replacement (use all the cores on a single machine):

     from sklearn.linear_model import SGDClassifier
     from sklearn.model_selection import GridSearchCV

     parameters = {'alpha': [1e-4, 1e-1, 1], 'epsilon': [0.01, 0.1]}
     search = GridSearchCV(SGDClassifier(), parameters, n_jobs=-1)
     search.fit(X_train, y_train)
  33. Drop-in replacement (use all the resources across the entire cluster!):

     from sklearn.linear_model import SGDClassifier
     from tune_sklearn import TuneSearchCV

     parameters = {'alpha': [1e-4, 1e-1, 1], 'epsilon': [0.01, 0.1]}
     search = TuneSearchCV(SGDClassifier(), parameters, n_jobs=-1)
     search.fit(X_train, y_train)
  34. tune-sklearn demo • Driver safety prediction • Jupyter notebook that

    runs on head node
  35. Website (ray.io/ray-tune) One stop for all Ray Tune resources Documentation

    (docs.ray.io) Quick start example, reference guides, etc Forums (discuss.ray.io) Learn / share with broader Ray Tune community, including developers Ray Slack Connect with the Ray team and community Get started with Ray Tune
  36. Report the best model from trying... • Grid search: N

     evenly spaced trials • [2012] Random search: N randomly sampled trials • [2013-14] Bayesian optimization: sample, then train a meta-model to choose the next sample. Repeat • [2015] Successive Halving Algorithm (SHA): N randomly sampled, then keep best N/2, then N/4... • [2017] Population Based Training (PBT): genetic algorithm-like optimizer for starting with a set of parameterizations and spawning off new parameterizations from well-performing ones • [2017] HyperBand: do SHA but more intelligently choose N over time • [2018] Bayesian Optimization and HyperBand (BOHB): Use HyperBand to choose budgets per parameterization, but suggest new examples via Bayesian methods instead of random • [2018] Asynchronous SHA (ASHA): SHA, but don’t wait up for stragglers, and replace failed runs with new parameterizations A short history of techniques
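
    To make the SHA step above concrete, here is a toy, non-Ray sketch of the idea; random numbers stand in for real validation scores obtained from partial training:

        import random

        def successive_halving(configs, rounds=3):
            # Start with N configs; each round, "train" the survivors a bit longer,
            # then keep only the best half.
            survivors = list(configs)
            for _ in range(rounds):
                scores = {c: random.random() for c in survivors}   # fake validation scores
                survivors = sorted(survivors, key=scores.get, reverse=True)
                survivors = survivors[: max(1, len(survivors) // 2)]
            return survivors[0]

        best = successive_halving([f"config_{i}" for i in range(8)])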
  37. A short history of techniques (figures: Grid Search vs. Random Search (2012))
  38. Example How many layers? What kinds of layers? - Every

    number here is a hyperparameter!
  39. Hyperparameters vs. model parameters

     Hyperparameters (set before training): • Model type and architecture • Learning and training related parameters • Pipeline configurations
     Model parameters: learnt during training
  40. Example pipeline (Total: ~15 hyperparameters to tune!)

     • Imputer: type (simple or iterative)? Simple strategy: mean, median, or constant?
     • Categorical encoder: type (one-hot encoding or label encoding)?
     • Under/oversampler: type (SMOTE or random undersampling)? Number of neighbors?
     • XGBoost: 6-10 hyperparameters to tune
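
    All of those pipeline knobs can live in a single Tune search space and be tuned jointly. A sketch assuming the Ray 1.x API; the parameter names are illustrative, not the slide's exact pipeline:

        from ray import tune

        search_space = {
            # Imputer
            "imputer_type": tune.choice(["simple", "iterative"]),
            "imputer_strategy": tune.choice(["mean", "median", "constant"]),
            # Categorical encoder
            "encoder": tune.choice(["onehot", "label"]),
            # Under/oversampler
            "sampler": tune.choice(["smote", "random_undersample"]),
            "smote_k_neighbors": tune.randint(3, 10),
            # XGBoost
            "max_depth": tune.randint(3, 10),
            "learning_rate": tune.loguniform(1e-3, 3e-1),
            "n_estimators": tune.randint(100, 1000),
            "subsample": tune.uniform(0.5, 1.0),
            "colsample_bytree": tune.uniform(0.5, 1.0),
            "min_child_weight": tune.randint(1, 10),
        }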
  41. Ray Tune - HPO algorithms: 01 Exhaustive Search 02 Bayesian Optimization 03 Advanced Scheduling

     • Over 15 algorithms natively provided or integrated • Easy to swap out different algorithms with no code change
  42. Bayesian optimization • Uses results from previous combinations (trials) to decide which

     trial to try next • Inherently sequential • Popular libraries: ◦ hyperopt ◦ optuna (https://www.wikiwand.com/en/Hyperparameter_optimization)
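
    Ray Tune wraps these libraries as search algorithms, so the inherently sequential suggester can still feed many parallel trials. A sketch assuming the Ray 1.x API; train_func is a placeholder training function:

        from ray import tune
        from ray.tune.suggest.optuna import OptunaSearch

        analysis = tune.run(
            train_func,
            config={"lr": tune.loguniform(1e-4, 1e-1)},
            metric="loss",
            mode="min",
            num_samples=50,
            search_alg=OptunaSearch(),   # suggests the next trial based on past results
        )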
  43. Parallel exploration and exploitation Advanced scheduling • Fan out parallel

    trials during the initial exploration phase • Make decisions based on intermediate cross-trial evaluations • Allocate resources to more promising trials • Early stopping • Population based training
  44. • Fan out parallel trials during the initial exploration phase

    • Use intermediate results (epochs, trees) to prune underperforming trials, saving time and computing resources Advanced Scheduling - Early stopping • Median stopping, ASHA/Hyperband • Can be combined with Bayesian Optimization (BOHB)
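
    A sketch of the BOHB combination mentioned above, assuming the Ray 1.x API and the optional hpbandster/ConfigSpace dependencies; train_func is a placeholder:

        from ray import tune
        from ray.tune.suggest.bohb import TuneBOHB
        from ray.tune.schedulers import HyperBandForBOHB

        analysis = tune.run(
            train_func,
            config={"lr": tune.loguniform(1e-4, 1e-1)},
            metric="loss",
            mode="min",
            num_samples=50,
            search_alg=TuneBOHB(),                         # Bayesian suggestions
            scheduler=HyperBandForBOHB(                    # HyperBand-style early stopping
                time_attr="training_iteration", max_t=100),
        )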
  45. Advanced Scheduling - Population Based Training • Evolutionary algorithm •

    Evaluate a population in parallel • Terminate lowest performers • Copy weights of the best performing trials and mutate them
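
    A sketch of the PBT scheduler described above, assuming the Ray 1.x API; the hyperparameters, ranges, and train_func are placeholders:

        import random
        from ray import tune
        from ray.tune.schedulers import PopulationBasedTraining

        pbt = PopulationBasedTraining(
            time_attr="training_iteration",
            metric="loss",
            mode="min",
            perturbation_interval=5,                       # exploit/explore every 5 iterations
            hyperparam_mutations={
                "lr": lambda: random.uniform(1e-4, 1e-1),  # resample or perturb the lr
                "momentum": [0.8, 0.9, 0.99],
            },
        )

        analysis = tune.run(
            train_func,                    # must checkpoint/restore so weights can be copied
            config={"lr": 1e-3, "momentum": 0.9},
            num_samples=8,                 # the "population"
            scheduler=pbt,
        )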
  46. Woohoo! Let’s review what we have talked about.

  47. • There are various HPO algorithms with a trend of

    going parallel • More advanced ones are often hard to implement ◦ Even more so in a distributed setting
  48. Good news! • Ray Tune implements and integrates with all

     these algorithms • Allows users to swap out different algorithms very easily
  49. Architecture requirements • Granular control over when to start, pause,

    early stop, restore, or mutate each trial at specific iterations with little overhead • Master-worker architecture that centralizes decision making • Elasticity and fault tolerance
  50. Ray Tune - distributed HPO (diagram: Head Node, Driver Process running tune.run(train_func), the orchestrator running the HPO algorithm)

     from ray import tune

     def train_func(config):
         model = ConvNet(config)
         for i in range(epochs):
             current_loss = model.train()
             tune.report(loss=current_loss)

     tune.run(
         train_func,
         config={"alpha": tune.uniform(0.001, 0.1)},
         num_samples=100,
         scheduler="asha",
         search_alg="optuna")
  51. Ray Tune - distributed HPO (diagram: the head node’s driver process launches WorkerProcess actors on the worker nodes, each running train_func). Each actor performs one hyperparameter-combination evaluation (a trial).
  52. Ray Tune - distributed HPO (diagram: actors report metrics back to the driver). The orchestrator keeps track of all the trials’ progress and metrics.
  53. Ray Tune - distributed HPO (diagram: the orchestrator signals trials to early stop or continue). Based on the metrics, the orchestrator may stop/pause/mutate trials or launch new trials when resources are available.
  54. Ray Tune - distributed HPO (diagram: a new trial is launched). Resources are repurposed to explore new trials.
  55. Ray Tune - distributed HPO (diagram: trials are checkpointed to cloud storage). The orchestrator also manages checkpoint state.
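
    A sketch of how a training function participates in that checkpointing, assuming the Ray 1.x function API; build_model and run_one_epoch are placeholder helpers:

        import os
        import torch
        from ray import tune

        def train_func(config, checkpoint_dir=None):
            model, optimizer = build_model(config)            # placeholder helper
            start_epoch = 0
            if checkpoint_dir:                                 # restore after a crash or pause
                state = torch.load(os.path.join(checkpoint_dir, "checkpoint.pt"))
                model.load_state_dict(state["model"])
                start_epoch = state["epoch"] + 1
            for epoch in range(start_epoch, 10):
                loss = run_one_epoch(model, optimizer)         # placeholder helper
                with tune.checkpoint_dir(step=epoch) as ckpt_dir:
                    torch.save({"model": model.state_dict(), "epoch": epoch},
                               os.path.join(ckpt_dir, "checkpoint.pt"))
                tune.report(loss=loss)

    Syncing those checkpoints to cloud storage is configured separately on tune.run (e.g. via tune.SyncConfig), which is what lets the restore shown in the next slides happen on a different node.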
  56. Ray Tune - distributed HPO (diagram: one of the worker processes crashes).
  57. Ray Tune - distributed HPO (diagram: a new actor comes up fresh and the crashed trial is restored from the checkpoint loaded from cloud storage).
  58. Woohoo! Let’s review what we have talked about.

  59. • Provides efficient HPO algorithms • Distributes and coordinates parallel

    trials in a fault-tolerant and elastic manner • Integrated with ML ecosystem What makes Ray Tune special
  60. None
  61. Thank You Let’s keep in touch! • https://ray.io/ • https://discuss.ray.io/

    • Ray slack • https://github.com/ray-project/tune-sklearn
  62. Q & A

  63. Appendix

  64. None
  65. 01 02 03

  66. Distributed apps will become the norm Something to highlight

  67. None
  68. Ray and Ray Tune Agenda

  69. Agenda

  70. Thank you for listening!

  71. None
  72. tune-sklearn demo

  73. Thank You

  74. None
  75. Ray: a universal framework for distributed computing

     • Berkeley RiseLab → Ray → Anyscale • Over 500 contributors from a huge number of companies
  76. Ray Native Libraries, 3rd Party Libraries, Your app here! (diagram)

     Universal framework for distributed computing • Run anywhere • Library + app ecosystem
  78. None