Fast and efficient hyperparameter tuning with Ray Tune

Hyperparameter tuning (or optimization) is used to find the best-performing machine learning (ML) model by exploring and optimizing the model hyperparameters (e.g., learning rate, tree depth). It is a compute-intensive problem that lends itself well to distributed execution.

Ray Tune is a Python library, built on Ray, that allows you to easily run distributed hyperparameter tuning at scale. Ray Tune is framework-agnostic and supports all the popular training frameworks including PyTorch, TensorFlow, XGBoost, LightGBM, and Keras.

Anyscale

October 21, 2021

Transcript

  1. Ray Tune Fast and efficient hyperparameter tuning with Ray Tune

    Will Drevo Amog Kamsetty Xiaowei Jiang
  2. Agenda 1. What brings you here 2. Challenges in hyperparameter tuning

     3. Ray Tune’s approach 4. Demo
  3. About us! Hi! I’m Will Drevo. • PM at Anyscale

    for the open-source ML team (Ray Tune, Train, RLlib) • Previously • an ML engineer at Coinbase • founded a data labeling company for ML teams • founded a pharma SaaS for accelerating clinical trials • BS, MEng CS @ MIT in ML & distributed systems • I like to DJ, make electronic music, travel, and eat Ethiopian food Joining us is Amog Kamsetty, a software engineer at Anyscale and a Ray Tune & Train lead developer
  4. (1) What brings you here? What is your stack?

  5. (2) Why we tune

  6. The future of ML: models with lots of parameters Note

    the log axis!
  7. Hyperparameter tuning “choosing a set of optimal hyperparameters for a

    learning algorithm” How many layers? What kinds of layers? Learning rate schedule? Every number here is a hyperparameter! Example: what network structure is best for your binary classification problem?
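
    Each of those questions can be written down as a dimension of a search space. A minimal sketch, assuming the Ray 1.x-era tune API used later in the deck; the parameter names here are illustrative, not from the slides:

        from ray import tune

        # Hypothetical search space for the binary-classification network above.
        search_space = {
            "num_layers": tune.randint(2, 8),                           # how many layers?
            "layer_type": tune.choice(["conv", "dense"]),               # what kinds of layers?
            "lr": tune.loguniform(1e-4, 1e-1),                          # learning rate...
            "lr_schedule": tune.choice(["constant", "step", "cosine"]), # ...and its schedule
        }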
  8. Why we tune (chart axes: $ Cost, Performance) Achieve performance you might

     otherwise not find • Minimizing fraud loss • Reduce unsold goods (forecasting) • Lower error rate in object detection • etc. Don’t go broke with compute ($$) or developer hours to train! These can also “cost” you!
  9. Pure parallelization isn’t enough

  10. Why we tune (chart labels: Cost, Performance per unit, Performance, Time (Money))

  11. (2) Challenges in hyperparameter tuning

  12. • Scaling memory / compute ◦ Hold at least 1 copy

     of the data in memory, likely many more • Algorithmic (cost) efficiency ◦ Cleverly & quickly search the parameter space • Ease of use ◦ Quick to get started ◦ Local → cluster ◦ Extensible Challenges in tuning
  13. • Parallelism: how do you distribute computation? • Does your

    distributed training framework work with your tuning framework? Hard to get both. Scaling memory
  14. Two ways to accomplish this: 1. Sampling (searching) 2. Scheduling

     (pruning) Sample better: pick the next parameter set to evaluate more intelligently Schedule better: allocate more time to promising parameter sets Algorithmic cost efficiency
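
    Those two levers map onto two arguments of tune.run. A hedged sketch assuming the Ray 1.x API; train_func and the "loss" metric name are placeholders:

        from ray import tune
        from ray.tune.schedulers import ASHAScheduler           # schedule better (pruning)
        from ray.tune.suggest.hyperopt import HyperOptSearch    # sample better (searching)

        analysis = tune.run(
            train_func,                                  # placeholder training function
            config={"lr": tune.loguniform(1e-4, 1e-1)},
            metric="loss",
            mode="min",
            num_samples=50,
            search_alg=HyperOptSearch(),   # picks the next parameter set to evaluate
            scheduler=ASHAScheduler(),     # allocates more time to promising trials
        )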
  15. • Better search algorithms have wildly better performance • Deep

    learning == more complex, more parameters Algorithmic cost efficiency
  16. Algorithmic cost efficiency: scheduler (SHA)

  17. Algorithmic cost efficiency: sampler + searcher (BOHB)

  18. • [2012] Random search • [2015] Successive Halving Algorithm (SHA)

     • [2017] HyperBand (HB / AHB) • [2017] Population Based Training (PBT) • [2017] Google Vizier • [2018] Bayesian Optimization of HyperBand (BOHB) • [2018] Async Successive Halving Algorithm (ASHA). Many algorithms out there... Today, ASHA is likely the best choice for real-world, highly distributed tuning. PBT is also a good choice, specifically for deep learning.
  19. Cost(development) + Cost(tuning compute) < Cost(bad model) • Library interoperability

    • Maintenance cost over time • Ease of getting started, debugging Ease of use A very real cost!
  20. (3) Ray Tune’s approach

  21. • Open source! • Interoperability with Ray ecosystem • Distributed-first,

    GPU-first • Work with as many Python libraries as possible • Constantly updated with state-of-the-art tuning algorithms • Orchestrate other searchers, samplers: HyperOpt, Optuna, etc Why use Ray Tune?
  22. What is Ray? (diagram: Ray Native Libraries, 3rd Party Libraries, Your app here!)

     Universal framework for distributed computing • Run anywhere • Library + app ecosystem
  23. Zooming in: Ray’s native libraries Universal framework for distributed computing

    Run anywhere Library + app ecosystem
  24. • Efficient algorithms that enable running trials in parallel •

    Effective orchestration of distributed trials • Easy to use APIs Ray Tune - distributed HPO Cutting edge optimization algorithms Minimal code changes to work in distributed settings Compatible with ML ecosystem
  25. A universal framework for distributed computing Notable users of Ray

    Ray Tune users!
  26. Tool comparison (Open source | No cloud lock-in | Distributed | SOTA algorithms | Bring your own framework | Also runs):

     Ray Tune            ✅ | ✅ | ✅ | ⭐⭐⭐ | ✅ | HyperOpt, Optuna, SigOpt
     HyperOpt            ✅ | ✅ | ✅ | ⭐⭐ | ✅ |
     Optuna              ✅ | ✅ | ❌ | ⭐⭐ | ✅ |
     SigOpt              ❌ | ❌ | ✅ | ⭐ | ✅ |
     Vertex AI (Vizier)  ❌ | ❌ | ✅ | ⭐ | ❌ |
     Sagemaker           ❌ | ❌ | ✅ | ⭐ | ❌ |
     Azure ML            ❌ | ❌ | ✅ | ⭐ | ❌ |
     Katib               ✅ | ✅ | ✅ | ⭐⭐ | ❌ | HyperOpt, Optuna
     Spark ML            ✅ | ✅ | ✅ | ❌ | ❌ | HyperOpt
  27. • Provides efficient cutting-edge HPO algorithms • Distributes and

     coordinates parallel trials in a fault-tolerant and elastic manner • Saves you time and cost at every step of HPO Ray Tune benefits. Ray Tune benchmark on 2 weeks of production data at Uber: 2X efficiency improvement in terms of GPU/CPU-hours
  28. Case studies: Ray Tune

  29. (4) Code example!

  30. Ray Tune with PyTorch ASHA scheduler with loss specification Simply

     use your train func (i.e. train_cifar()) and just tune.run()! See this example on the PyTorch official docs Easily specify hyperparameter ranges to search over
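
    An abbreviated sketch in the spirit of that PyTorch-docs example, assuming the Ray 1.x API; train_cifar stands in for your existing training function, which reports progress via tune.report(loss=...):

        from ray import tune
        from ray.tune.schedulers import ASHAScheduler

        config = {
            "lr": tune.loguniform(1e-4, 1e-1),
            "batch_size": tune.choice([2, 4, 8, 16]),
        }
        # ASHA prunes trials whose reported loss lags behind the others.
        scheduler = ASHAScheduler(metric="loss", mode="min", max_t=10, grace_period=1)

        analysis = tune.run(
            train_cifar,                       # your existing training function
            config=config,
            num_samples=10,
            scheduler=scheduler,
            resources_per_trial={"cpu": 2, "gpu": 0},
        )
        print("Best config:", analysis.get_best_config(metric="loss", mode="min"))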
  31. • A scikit-learn wrapper for Ray Tune ◦ drop-in replacement

    for scikit-learn model selection module (RandomizedSearchCV and GridSearchCV) • Provides a familiar and simple API for advanced, distributed HPO • https://github.com/ray-project/tune-sklearn Tune-sklearn
  32. Drop-in replacement (use all the cores on a single machine):

     from sklearn.linear_model import SGDClassifier
     from sklearn.model_selection import GridSearchCV

     parameters = {'alpha': [1e-4, 1e-1, 1], 'epsilon': [0.01, 0.1]}
     search = GridSearchCV(SGDClassifier(), parameters, n_jobs=-1)
     search.fit(X_train, y_train)
  33. Drop-in replacement (use all the resources across the entire cluster!):

     from sklearn.linear_model import SGDClassifier
     from tune_sklearn import TuneSearchCV

     parameters = {'alpha': [1e-4, 1e-1, 1], 'epsilon': [0.01, 0.1]}
     search = TuneSearchCV(SGDClassifier(), parameters, n_jobs=-1)
     search.fit(X_train, y_train)
  34. tune-sklearn demo • Driver safety prediction • Jupyter notebook that

    runs on head node
  35. Website (ray.io/ray-tune) One stop for all Ray Tune resources Documentation

    (docs.ray.io) Quick start example, reference guides, etc Forums (discuss.ray.io) Learn / share with broader Ray Tune community, including developers Ray Slack Connect with the Ray team and community Get started with Ray Tune
  36. Report the best model from trying... • Grid search: N

     evenly spaced trials • [2012] Random search: N randomly sampled trials • [2013-14] Bayesian optimization: sample, then train a meta-model to choose the next sample. Repeat • [2015] Successive Halving Algorithm (SHA): N randomly sampled, then keep best N/2, then N/4... • [2017] Population Based Training (PBT): genetic algorithm-like optimizer for starting with a set of parameterizations and spawning off new parameterizations from well-performing ones • [2017] HyperBand: do SHA but more intelligently choose N over time • [2018] Bayesian Optimization and HyperBand (BOHB): Use HyperBand to choose budgets per parameterization, but suggest new examples via Bayesian methods instead of random • [2018] Asynchronous SHA (ASHA): SHA, but don’t wait up for stragglers, and replace failed runs with new parameterizations A short history of techniques
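
    To make the SHA step above concrete, here is a toy, non-Ray sketch of the idea; random numbers stand in for real validation scores obtained from partial training:

        import random

        def successive_halving(configs, rounds=3):
            # Start with N configs; each round, "train" the survivors a bit longer,
            # then keep only the best half.
            survivors = list(configs)
            for _ in range(rounds):
                scores = {c: random.random() for c in survivors}   # fake validation scores
                survivors = sorted(survivors, key=scores.get, reverse=True)
                survivors = survivors[: max(1, len(survivors) // 2)]
            return survivors[0]

        best = successive_halving([f"config_{i}" for i in range(8)])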
  37. A short history of techniques (figures: Grid Search vs. Random Search (2012))
  38. Example How many layers? What kinds of layers? - Every

    number here is a hyperparameter!
  39. Hyperparameters vs. model parameters

     Hyperparameters (set before training): • Model type and architecture • Learning and training related parameters • Pipeline configurations
     Model parameters: learnt during training
  40. Example pipeline (Total: ~15 hyperparameters to tune!)

     • Imputer: type (simple or iterative)? Simple strategy: mean, median, or constant?
     • Categorical encoder: type (one-hot encoding or label encoding)?
     • Under/oversampler: type (SMOTE or random undersampling)? Number of neighbors?
     • XGBoost: 6-10 hyperparameters to tune
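
    All of those pipeline knobs can live in a single Tune search space and be tuned jointly. A sketch assuming the Ray 1.x API; the parameter names are illustrative, not the slide's exact pipeline:

        from ray import tune

        search_space = {
            # Imputer
            "imputer_type": tune.choice(["simple", "iterative"]),
            "imputer_strategy": tune.choice(["mean", "median", "constant"]),
            # Categorical encoder
            "encoder": tune.choice(["onehot", "label"]),
            # Under/oversampler
            "sampler": tune.choice(["smote", "random_undersample"]),
            "smote_k_neighbors": tune.randint(3, 10),
            # XGBoost
            "max_depth": tune.randint(3, 10),
            "learning_rate": tune.loguniform(1e-3, 3e-1),
            "n_estimators": tune.randint(100, 1000),
            "subsample": tune.uniform(0.5, 1.0),
            "colsample_bytree": tune.uniform(0.5, 1.0),
            "min_child_weight": tune.randint(1, 10),
        }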
  41. Ray Tune - HPO algorithms: 01 Exhaustive Search 02 Bayesian Optimization 03 Advanced Scheduling

     • Over 15 algorithms natively provided or integrated • Easy to swap out different algorithms with no code change
  42. Bayesian optimization • Uses results from previous combinations (trials) to decide which

     trial to try next • Inherently sequential • Popular libraries: ◦ hyperopt ◦ optuna (https://www.wikiwand.com/en/Hyperparameter_optimization)
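
    Ray Tune wraps these libraries as search algorithms, so the inherently sequential suggester can still feed many parallel trials. A sketch assuming the Ray 1.x API; train_func is a placeholder training function:

        from ray import tune
        from ray.tune.suggest.optuna import OptunaSearch

        analysis = tune.run(
            train_func,
            config={"lr": tune.loguniform(1e-4, 1e-1)},
            metric="loss",
            mode="min",
            num_samples=50,
            search_alg=OptunaSearch(),   # suggests the next trial based on past results
        )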
  43. Parallel exploration and exploitation Advanced scheduling • Fan out parallel

    trials during the initial exploration phase • Make decisions based on intermediate cross-trial evaluations • Allocate resources to more promising trials • Early stopping • Population based training
  44. • Fan out parallel trials during the initial exploration phase

    • Use intermediate results (epochs, trees) to prune underperforming trials, saving time and computing resources Advanced Scheduling - Early stopping • Median stopping, ASHA/Hyperband • Can be combined with Bayesian Optimization (BOHB)
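
    A sketch of the BOHB combination mentioned above, assuming the Ray 1.x API and the optional hpbandster/ConfigSpace dependencies; train_func is a placeholder:

        from ray import tune
        from ray.tune.suggest.bohb import TuneBOHB
        from ray.tune.schedulers import HyperBandForBOHB

        analysis = tune.run(
            train_func,
            config={"lr": tune.loguniform(1e-4, 1e-1)},
            metric="loss",
            mode="min",
            num_samples=50,
            search_alg=TuneBOHB(),                         # Bayesian suggestions
            scheduler=HyperBandForBOHB(                    # HyperBand-style early stopping
                time_attr="training_iteration", max_t=100),
        )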
  45. Advanced Scheduling - Population Based Training • Evolutionary algorithm •

    Evaluate a population in parallel • Terminate lowest performers • Copy weights of the best performing trials and mutate them
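
    A sketch of the PBT scheduler described above, assuming the Ray 1.x API; the hyperparameters, ranges, and train_func are placeholders:

        import random
        from ray import tune
        from ray.tune.schedulers import PopulationBasedTraining

        pbt = PopulationBasedTraining(
            time_attr="training_iteration",
            metric="loss",
            mode="min",
            perturbation_interval=5,                       # exploit/explore every 5 iterations
            hyperparam_mutations={
                "lr": lambda: random.uniform(1e-4, 1e-1),  # resample or perturb the lr
                "momentum": [0.8, 0.9, 0.99],
            },
        )

        analysis = tune.run(
            train_func,                    # must checkpoint/restore so weights can be copied
            config={"lr": 1e-3, "momentum": 0.9},
            num_samples=8,                 # the "population"
            scheduler=pbt,
        )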
  46. Woohoo! Let’s review what we have talked about.

  47. • There are various HPO algorithms with a trend of

    going parallel • More advanced ones are often hard to implement ◦ Even more so in a distributed setting
  48. Good news! • Ray Tune implements and integrates with all

     these algorithms • Allows users to swap out different algorithms very easily
  49. Architecture requirements • Granular control over when to start, pause,

    early stop, restore, or mutate each trial at specific iterations with little overhead • Master-worker architecture that centralizes decision making • Elasticity and fault tolerance
  50. Ray Tune - distributed HPO (diagram: Head Node, Driver Process running tune.run(train_func), the orchestrator running the HPO algorithm)

     from ray import tune

     def train_func(config):
         model = ConvNet(config)
         for i in range(epochs):
             current_loss = model.train()
             tune.report(loss=current_loss)

     tune.run(
         train_func,
         config={"alpha": tune.uniform(0.001, 0.1)},
         num_samples=100,
         scheduler="asha",
         search_alg="optuna")
  51. Ray Tune - distributed HPO (diagram: the head node’s driver process launches WorkerProcess actors on the worker nodes, each running train_func). Each actor performs one hyperparameter-combination evaluation (a trial).
  52. Ray Tune - distributed HPO (diagram: actors report metrics back to the driver). The orchestrator keeps track of all the trials’ progress and metrics.
  53. Ray Tune - distributed HPO (diagram: the orchestrator signals trials to early stop or continue). Based on the metrics, the orchestrator may stop/pause/mutate trials or launch new trials when resources are available.
  54. Ray Tune - distributed HPO (diagram: a new trial is launched). Resources are repurposed to explore new trials.
  55. Ray Tune - distributed HPO (diagram: trials are checkpointed to cloud storage). The orchestrator also manages checkpoint state.
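
    A sketch of how a training function participates in that checkpointing, assuming the Ray 1.x function API; build_model and run_one_epoch are placeholder helpers:

        import os
        import torch
        from ray import tune

        def train_func(config, checkpoint_dir=None):
            model, optimizer = build_model(config)            # placeholder helper
            start_epoch = 0
            if checkpoint_dir:                                 # restore after a crash or pause
                state = torch.load(os.path.join(checkpoint_dir, "checkpoint.pt"))
                model.load_state_dict(state["model"])
                start_epoch = state["epoch"] + 1
            for epoch in range(start_epoch, 10):
                loss = run_one_epoch(model, optimizer)         # placeholder helper
                with tune.checkpoint_dir(step=epoch) as ckpt_dir:
                    torch.save({"model": model.state_dict(), "epoch": epoch},
                               os.path.join(ckpt_dir, "checkpoint.pt"))
                tune.report(loss=loss)

    Syncing those checkpoints to cloud storage is configured separately on tune.run (e.g. via tune.SyncConfig), which is what lets the restore shown in the next slides happen on a different node.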
  56. Ray Tune - distributed HPO (diagram: one of the worker processes crashes).
  57. Ray Tune - distributed HPO (diagram: a new actor comes up fresh and the crashed trial is restored from the checkpoint loaded from cloud storage).
  58. Woohoo! Let’s review what we have talked about.

  59. • Provides efficient HPO algorithms • Distributes and coordinates parallel

    trials in a fault-tolerant and elastic manner • Integrated with ML ecosystem What makes Ray Tune special
  60. None
  61. Thank You Let’s keep in touch! • https://ray.io/ • https://discuss.ray.io/

    • Ray slack • https://github.com/ray-project/tune-sklearn
  62. Q & A

  63. Appendix

  64. None
  65. 01 02 03

  66. Distributed apps will become the norm Something to highlight

  67. None
  68. Ray and Ray Tune Agenda

  69. Agenda

  70. Thank you for listening!

  71. None
  72. tune-sklearn demo

  73. Thank You

  74. None
  75. Ray: a universal framework for distributed computing

     • Berkeley RiseLab → Ray → Anyscale • Over 500 contributors from a huge number of companies
  76. Ray Native Libraries, 3rd Party Libraries, Your app here! (diagram)

     Universal framework for distributed computing • Run anywhere • Library + app ecosystem
  78. None