
Fast and efficient hyperparameter tuning with Ray Tune

Anyscale
October 21, 2021


Hyperparameter tuning, or optimization, is used to find the best-performing machine learning (ML) model by exploring and optimizing the model hyperparameters (e.g. learning rate, tree depth, etc.). It is a compute-intensive problem that lends itself well to distributed execution.

Ray Tune is a Python library, built on Ray, that allows you to easily run distributed hyperparameter tuning at scale. Ray Tune is framework-agnostic and supports all the popular training frameworks including PyTorch, TensorFlow, XGBoost, LightGBM, and Keras.


Transcript

  1. Agenda: 1. What brings you here 2. Challenges in hyperparameter tuning 3. Ray Tune's approach 4. Demo
  2. About us! Hi! I'm Will Drevo. • PM at Anyscale for the open-source ML team (Ray Tune, Train, RLlib) • Previously: an ML engineer at Coinbase; founded a data labeling company for ML teams; founded a pharma SaaS for accelerating clinical trials • BS, MEng in CS @ MIT, in ML & distributed systems • I like to DJ, make electronic music, travel, and eat Ethiopian food. Joining us is Amog Kamsetty, a software engineer at Anyscale and a lead developer on Ray Tune & Train.
  3. Hyperparameter tuning: "choosing a set of optimal hyperparameters for a learning algorithm." Example: what network structure is best for your binary classification problem? How many layers? What kinds of layers? Learning rate schedule? Every number here is a hyperparameter!
  4. Why we tune: cost vs. performance. Achieve performance you might otherwise not find: • Minimizing fraud loss • Reducing unsold goods (forecasting) • Lowering the error rate in object detection • ...etc. And don't go broke on compute ($$) or developer hours to train — these can also "cost" you!
  5. Challenges in tuning: • Scaling memory / compute ◦ You hold at least one copy of the data in memory, likely many more • Algorithmic (cost) efficiency ◦ Cleverly & quickly search the parameter space • Ease of use ◦ Quick to get started ◦ Local → cluster ◦ Extensible
  6. Scaling memory and compute: • Parallelism: how do you distribute computation? • Does your distributed training framework work with your tuning framework? It is hard to get both.
  7. Algorithmic cost efficiency. Two ways to accomplish this: 1. Sampling (searching): pick the next parameter set to evaluate more intelligently. 2. Scheduling (pruning): allocate more time to promising parameter sets. (See the sketch after this slide.)
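In Ray Tune these two levers map onto two arguments of tune.run: search_alg (how the next parameter set is sampled) and scheduler (how time is allocated across trials). A minimal sketch, assuming the Ray 1.x API; the objective function and the "score" metric are illustrative stand-ins:

      from ray import tune
      from ray.tune.schedulers import ASHAScheduler

      def objective(config):
          # Toy objective: report an intermediate score each "epoch" so the
          # scheduler has something to prune on.
          for step in range(10):
              tune.report(score=(config["x"] - 3) ** 2)

      analysis = tune.run(
          objective,
          config={"x": tune.uniform(0, 10)},                    # sample better: the search space
          num_samples=50,
          scheduler=ASHAScheduler(metric="score", mode="min"),  # schedule better: prune weak trials
      )
      print(analysis.get_best_config(metric="score", mode="min"))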
  8. Algorithmic cost efficiency: • Better search algorithms have wildly better performance • Deep learning means more complex models and more parameters
  9. Many algorithms out there: • [2012] Random search • [2015] Successive Halving Algorithm (SHA) • [2017] HyperBand (HB / AHB) • [2017] Population Based Training (PBT) • [2017] Google Vizier • [2018] Bayesian Optimization of HyperBand (BOHB) • [2018] Asynchronous Successive Halving Algorithm (ASHA). Today, ASHA is likely the best choice for real-world, highly distributed tuning; PBT is also a good choice, specifically for deep learning.
  10. Ease of use — a very real cost! Cost(development) + Cost(tuning compute) < Cost(bad model) • Library interoperability • Maintenance cost over time • Ease of getting started and debugging
  11. Why use Ray Tune? • Open source! • Interoperability with the Ray ecosystem • Distributed-first, GPU-first • Works with as many Python libraries as possible • Constantly updated with state-of-the-art tuning algorithms • Orchestrates other searchers and samplers: HyperOpt, Optuna, etc.
  12. What is Ray? A universal framework for distributed computing that runs anywhere, with a library + app ecosystem: Ray native libraries, 3rd-party libraries, and your app here!
  13. Ray Tune - distributed HPO: • Efficient algorithms that enable running trials in parallel (cutting-edge optimization algorithms) • Effective orchestration of distributed trials (minimal code changes to work in distributed settings) • Easy-to-use APIs (compatible with the ML ecosystem)
  14. How Ray Tune compares (Open source | No cloud lock-in | Distributed | SOTA algorithms | Bring your own framework | Also runs):
      Ray Tune: ✅ | ✅ | ✅ | ⭐⭐⭐ | ✅ | HyperOpt, Optuna, SigOpt
      HyperOpt: ✅ | ✅ | ✅ | ⭐⭐ | ✅ | —
      Optuna: ✅ | ✅ | ❌ | ⭐⭐ | ✅ | —
      SigOpt: ❌ | ❌ | ✅ | ⭐ | ✅ | —
      Vertex AI (Vizier): ❌ | ❌ | ✅ | ⭐ | ❌ | —
      Sagemaker: ❌ | ❌ | ✅ | ⭐ | ❌ | —
      Azure ML: ❌ | ❌ | ✅ | ⭐ | ❌ | —
      Katib: ✅ | ✅ | ✅ | ⭐⭐ | ❌ | HyperOpt, Optuna
      Spark ML: ✅ | ✅ | ✅ | ❌ | ❌ | HyperOpt
  15. Ray Tune benefits: • Provides efficient, cutting-edge HPO algorithms • Distributes and coordinates parallel trials in a fault-tolerant and elastic manner • Saves you time and cost at every step of HPO. A Ray Tune benchmark on 2 weeks of production data at Uber showed a 2x efficiency improvement in terms of GPU/CPU-hours.
  16. Ray Tune with PyTorch: an ASHA scheduler with a loss specification. Simply use your training function (i.e. train_cifar()) and just tune.run()! Easily specify hyperparameter ranges to search over. See this example in the official PyTorch docs, and the sketch below.
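A minimal sketch of that pattern, assuming the Ray 1.x API and the train_cifar(config, ...) training function from the PyTorch tutorial; the specific ranges and per-trial resources below are illustrative:

      from functools import partial
      from ray import tune
      from ray.tune.schedulers import ASHAScheduler

      config = {
          # Easily specify hyperparameter ranges to search over
          "lr": tune.loguniform(1e-4, 1e-1),
          "batch_size": tune.choice([32, 64, 128]),
      }

      scheduler = ASHAScheduler(metric="loss", mode="min", max_t=10, grace_period=1)

      analysis = tune.run(
          partial(train_cifar, data_dir="./data"),  # your existing training function
          config=config,
          num_samples=20,
          scheduler=scheduler,
          resources_per_trial={"cpu": 2, "gpu": 1},
      )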
  17. Tune-sklearn: • A scikit-learn wrapper for Ray Tune ◦ a drop-in replacement for the scikit-learn model selection module (RandomizedSearchCV and GridSearchCV) • Provides a familiar and simple API for advanced, distributed HPO • https://github.com/ray-project/tune-sklearn
  18. Drop-in replacement (scikit-learn): use all the cores on a single machine.
      from sklearn.linear_model import SGDClassifier
      from sklearn.model_selection import GridSearchCV

      parameters = {'alpha': [1e-4, 1e-1, 1], 'epsilon': [0.01, 0.1]}
      search = GridSearchCV(SGDClassifier(), parameters, n_jobs=-1)
      search.fit(X_train, y_train)
  19. Drop-in replacement (tune-sklearn): use all the resources across the entire cluster!
      from sklearn.linear_model import SGDClassifier
      from tune_sklearn import TuneSearchCV

      parameters = {'alpha': [1e-4, 1e-1, 1], 'epsilon': [0.01, 0.1]}
      search = TuneSearchCV(SGDClassifier(), parameters, n_jobs=-1)
      search.fit(X_train, y_train)
  20. Get started with Ray Tune: • Website (ray.io/ray-tune): one stop for all Ray Tune resources • Documentation (docs.ray.io): quick-start example, reference guides, etc. • Forums (discuss.ray.io): learn from and share with the broader Ray Tune community, including its developers • Ray Slack: connect with the Ray team and community
  21. A short history of techniques. Report the best model from trying... • Grid search: N evenly spaced trials • [2012] Random search: N randomly sampled trials • [2013-14] Bayesian optimization: sample, then train a meta-model to choose the next sample; repeat • [2015] Successive Halving Algorithm (SHA): N randomly sampled trials, then keep the best N/2, then N/4... • [2017] Population Based Training (PBT): a genetic-algorithm-like optimizer that starts with a set of parameterizations and spawns new parameterizations from well-performing ones • [2017] HyperBand: do SHA, but choose N more intelligently over time • [2018] Bayesian Optimization and HyperBand (BOHB): use HyperBand to choose budgets per parameterization, but suggest new candidates via Bayesian methods instead of randomly • [2018] Asynchronous SHA (ASHA): SHA, but don't wait for stragglers, and replace failed runs with new parameterizations
  22. Hyperparameters vs. model parameters. Hyperparameters are set before training: • Model type and architecture • Learning- and training-related parameters • Pipeline configurations. Model parameters are learned during training.
  23. Example: an XGBoost pipeline. Imputer (type: simple or iterative? simple strategy: mean, median, or constant?), categorical encoder (type: one-hot encoding or label encoding?), under/oversampler (type: SMOTE or random undersampling? number of neighbors?), plus 6-10 XGBoost hyperparameters to tune. Total: ~15 hyperparameters to tune! (A hypothetical search space is sketched below.)
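A hypothetical Ray Tune search space for a pipeline like this one — every name and range below is illustrative, not from the deck — showing how preprocessing choices and XGBoost parameters add up to roughly 15 hyperparameters:

      from ray import tune

      search_space = {
          # Imputer
          "imputer_type": tune.choice(["simple", "iterative"]),
          "imputer_strategy": tune.choice(["mean", "median", "constant"]),
          # Categorical encoder
          "encoder": tune.choice(["onehot", "label"]),
          # Under/oversampler
          "sampler": tune.choice(["smote", "random_undersample"]),
          "smote_k_neighbors": tune.randint(3, 10),
          # XGBoost
          "max_depth": tune.randint(3, 11),
          "learning_rate": tune.loguniform(1e-3, 3e-1),
          "n_estimators": tune.randint(100, 1000),
          "subsample": tune.uniform(0.5, 1.0),
          "colsample_bytree": tune.uniform(0.5, 1.0),
          "min_child_weight": tune.randint(1, 10),
          "gamma": tune.uniform(0.0, 5.0),
          "reg_lambda": tune.loguniform(1e-2, 10.0),
      }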
  24. Ray Tune - HPO algorithms: 01 Exhaustive search, 02 Bayesian optimization, 03 Advanced scheduling. • Over 15 algorithms natively provided or integrated • Easy to swap out different algorithms with no code change
  25. Bayesian optimization: • Uses results from previous combinations (trials) to decide which trial to try next • Inherently sequential • Popular libraries: hyperopt, optuna. (https://www.wikiwand.com/en/Hyperparameter_optimization) See the sketch below for how these plug into Ray Tune.
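A minimal sketch of plugging a Bayesian-style searcher into Ray Tune via search_alg, assuming the Ray 1.x import path for the Optuna integration (the optuna package must be installed; a HyperOpt searcher is wired in the same way). The objective function is an illustrative stand-in:

      from ray import tune
      from ray.tune.suggest.optuna import OptunaSearch

      def objective(config):
          # Toy objective; in practice this would train and evaluate a model.
          tune.report(loss=(config["lr"] - 0.01) ** 2)

      analysis = tune.run(
          objective,
          config={"lr": tune.loguniform(1e-4, 1e-1)},
          search_alg=OptunaSearch(metric="loss", mode="min"),  # suggests the next trial from past results
          num_samples=30,
      )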
  26. Advanced scheduling: parallel exploration and exploitation. • Fan out parallel trials during the initial exploration phase • Make decisions based on intermediate cross-trial evaluations • Allocate resources to more promising trials • Early stopping • Population Based Training
  27. Advanced scheduling - early stopping: • Fan out parallel trials during the initial exploration phase • Use intermediate results (epochs, trees) to prune underperforming trials, saving time and computing resources • Median stopping, ASHA/HyperBand • Can be combined with Bayesian optimization (BOHB), as sketched below
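A sketch of combining an early-stopping scheduler with Bayesian suggestions (BOHB), assuming the Ray 1.x integration, which requires the hpbandster and ConfigSpace packages; the objective and ranges are illustrative:

      from ray import tune
      from ray.tune.schedulers import HyperBandForBOHB
      from ray.tune.suggest.bohb import TuneBOHB

      def objective(config):
          for epoch in range(10):
              # Report intermediate results so underperforming trials can be pruned early.
              tune.report(loss=(config["lr"] - 0.01) ** 2 / (epoch + 1))

      analysis = tune.run(
          objective,
          config={"lr": tune.loguniform(1e-4, 1e-1)},
          metric="loss",
          mode="min",
          search_alg=TuneBOHB(),                 # Bayesian suggestions
          scheduler=HyperBandForBOHB(max_t=10),  # HyperBand-style early stopping
          num_samples=20,
      )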
  28. Advanced scheduling - Population Based Training: • An evolutionary algorithm • Evaluates a population in parallel • Terminates the lowest performers • Copies the weights of the best-performing trials and mutates them (see the sketch below)
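A minimal sketch of a Population Based Training setup in Ray Tune (Ray 1.x API). Here train_func is assumed to checkpoint its weights and report mean_accuracy each training iteration, and the mutation ranges and interval are illustrative:

      import random
      from ray import tune
      from ray.tune.schedulers import PopulationBasedTraining

      pbt = PopulationBasedTraining(
          time_attr="training_iteration",
          metric="mean_accuracy",
          mode="max",
          perturbation_interval=5,       # exploit/explore every 5 iterations
          hyperparam_mutations={
              # Hyperparameters PBT may mutate when cloning a well-performing trial
              "lr": lambda: random.uniform(1e-4, 1e-1),
              "momentum": [0.8, 0.9, 0.99],
          },
      )

      analysis = tune.run(
          train_func,                    # assumed: checkpoints and reports mean_accuracy
          config={
              "lr": tune.loguniform(1e-4, 1e-1),
              "momentum": tune.choice([0.8, 0.9, 0.99]),
          },
          scheduler=pbt,
          num_samples=8,                 # population size
      )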
  29. • There are various HPO algorithms, with a trend toward going parallel • The more advanced ones are often hard to implement ◦ Even more so in a distributed setting
  30. Good news! • Ray Tune implements and integrates with all of these algorithms • It lets users swap out different algorithms very easily
  31. Architecture requirements: • Granular control over when to start, pause, early stop, restore, or mutate each trial at specific iterations, with little overhead • A master-worker architecture that centralizes decision making • Elasticity and fault tolerance
  32. Ray Tune - distributed HPO. On the head node, a driver process runs tune.run(train_func), with the orchestrator running the HPO algorithm:
      from ray import tune

      def train_func(config):
          model = ConvNet(config)
          for i in range(epochs):
              current_loss = model.train()
              tune.report(loss=current_loss)

      tune.run(
          train_func,
          config={"alpha": tune.uniform(0.001, 0.1)},
          num_samples=100,
          scheduler="asha",
          search_alg="optuna")
  33. Ray Tune - distributed HPO. The driver process on the head node (tune.run(train_func), the orchestrator running the HPO algorithm) launches worker processes (actors) on the worker nodes. Each actor runs train_func and performs the evaluation of one hyperparameter combination (a trial).
  34. Ray Tune - distributed HPO. The actors report metrics back to the orchestrator, which keeps track of all the trials' progress and metrics.
  35. Ray Tune - distributed HPO. Based on the metrics, the orchestrator may stop, pause, or mutate trials (e.g. early-stop one trial while others continue), or launch new trials when resources are available.
  36. Ray Tune - distributed HPO. Resources freed by stopped trials are repurposed to explore new trials: the orchestrator launches a new trial on the freed worker process.
  37. Ray Tune - distributed HPO. Trials are checkpointed to cloud storage; the orchestrator also manages checkpoint state. (A sketch of a checkpointing training function follows.)
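A sketch of how a training function can participate in this checkpointing, assuming the Ray 1.x functional API (tune.checkpoint_dir); ConvNet and train_one_epoch are hypothetical stand-ins for your model and training step:

      import os
      import torch
      from ray import tune

      def train_func(config, checkpoint_dir=None):
          model = ConvNet(config)            # stand-in model
          start_epoch = 0
          if checkpoint_dir:                 # Tune passes a directory when restoring a trial
              state = torch.load(os.path.join(checkpoint_dir, "checkpoint.pt"))
              model.load_state_dict(state["model"])
              start_epoch = state["epoch"] + 1
          for epoch in range(start_epoch, 10):
              current_loss = train_one_epoch(model)   # stand-in training step
              with tune.checkpoint_dir(step=epoch) as ckpt_dir:
                  torch.save({"model": model.state_dict(), "epoch": epoch},
                             os.path.join(ckpt_dir, "checkpoint.pt"))
              tune.report(loss=current_loss)

Syncing these checkpoints to cloud storage is configured on tune.run (for example via tune.SyncConfig(upload_dir=...)), as sketched after slide 39.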
  38. Ray Tune - distributed HPO. Some worker process crashes.
  39. Ray Tune - distributed HPO. A new actor comes up fresh, and the crashed trial is restored by loading its checkpoint from cloud storage. (See the sketch below for the relevant tune.run options.)
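A sketch of the tune.run options involved in this fault tolerance, assuming the Ray 1.x API; the bucket URL is a placeholder, and train_func is the checkpointing function sketched after slide 37:

      from ray import tune

      analysis = tune.run(
          train_func,
          config={"alpha": tune.uniform(0.001, 0.1)},
          num_samples=100,
          max_failures=3,               # retry a crashed trial from its latest checkpoint
          sync_config=tune.SyncConfig(upload_dir="s3://my-bucket/tune-results"),  # placeholder bucket
          resume=True,                  # resume a previously interrupted experiment
      )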
  40. What makes Ray Tune special: • Provides efficient HPO algorithms • Distributes and coordinates parallel trials in a fault-tolerant and elastic manner • Integrated with the ML ecosystem
  41. Thank you! Let's keep in touch: • https://ray.io/ • https://discuss.ray.io/ • Ray Slack • https://github.com/ray-project/tune-sklearn
  42. Ray: a universal framework for distributed computing. • Berkeley RISELab → Ray → Anyscale • Over 500 contributors from a huge number of companies
  43. Ray: a universal framework for distributed computing that runs anywhere, with a library + app ecosystem (Ray native libraries, 3rd-party libraries, your app here!)