Slide 1

Ray Tune
Fast and efficient hyperparameter tuning with Ray Tune
Will Drevo, Amog Kamsetty, Xiaowei Jiang

Slide 2

Agenda
1. What brings you here
2. Challenges in hyperparameter tuning
3. Ray Tune’s approach
4. Demo

Slide 3

About us!
Hi! I’m Will Drevo.
● PM at Anyscale for the open-source ML team (Ray Tune, Train, RLlib)
● Previously:
  ○ an ML engineer at Coinbase
  ○ founded a data labeling company for ML teams
  ○ founded a pharma SaaS for accelerating clinical trials
● BS, MEng in CS @ MIT, in ML & distributed systems
● I like to DJ, make electronic music, travel, and eat Ethiopian food
Joining us is Amog Kamsetty, a software engineer at Anyscale and a lead developer on Ray Tune & Train.

Slide 4

(1) What brings you here? What is your stack?

Slide 5

(2) Why we tune

Slide 6

The future of ML: models with lots of parameters
[Chart: model parameter counts over time; note the log axis!]

Slide 7

Hyperparameter tuning: “choosing a set of optimal hyperparameters for a learning algorithm”
Example: what network structure is best for your binary classification problem?
● How many layers?
● What kinds of layers?
● Learning rate schedule?
Every number here is a hyperparameter!

Slide 8

Why we tune: cost vs. performance
● Achieve performance you might otherwise not find:
  ○ minimize fraud loss
  ○ reduce unsold goods (forecasting)
  ○ lower error rates in object detection
  ○ ... etc.
● Don’t go broke on compute ($$) or developer hours to train! These can also “cost” you!

Slide 9

Pure parallelization isn’t enough
[Diagram: running more trials in parallel just multiplies the $$ spent]

Slide 10

Why we tune
[Chart: performance vs. time (money); tuning improves performance per unit cost]

Slide 11

(2) Challenges in hyperparameter tuning

Slide 12

Challenges in tuning
● Scaling memory / compute
  ○ Hold at least one copy of the data in memory, likely many more
● Algorithmic (cost) efficiency
  ○ Cleverly & quickly search the parameter space
● Ease of use
  ○ Quick to get started
  ○ Local → cluster
  ○ Extensible

Slide 13

Scaling memory
● Parallelism: how do you distribute computation?
● Does your distributed training framework work with your tuning framework? Hard to get both.

Slide 14

Algorithmic cost efficiency
Two ways to accomplish this:
1. Sampling (searching): pick the next parameter set to evaluate better
2. Scheduling (pruning): allocate more time to promising parameter sets
Both levers are shown in the sketch below.
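
These two levers map directly onto Ray Tune’s API: a search algorithm decides what to sample next, and a scheduler decides how much budget each trial gets. A minimal sketch, assuming a Ray 1.x-era install with hyperopt available (the toy train_model is illustrative):

from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.suggest.hyperopt import HyperOptSearch

def train_model(config):
    # Stand-in for real training: report a loss at every step.
    for step in range(100):
        loss = (config["lr"] - 0.01) ** 2 / (step + 1)
        tune.report(loss=loss)

tune.run(
    train_model,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    search_alg=HyperOptSearch(metric="loss", mode="min"),  # sample better
    scheduler=ASHAScheduler(metric="loss", mode="min"),    # schedule better: prune weak trials early
    num_samples=100,
)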

Slide 15

Algorithmic cost efficiency
● Better search algorithms have wildly better performance
● Deep learning == more complex models, more parameters

Slide 16

Algorithmic cost efficiency: scheduler (SHA)

Slide 17

Algorithmic cost efficiency: sampler + searcher (BOHB)

Slide 18

Many algorithms out there...
● [2012] Random search
● [2015] Successive Halving Algorithm (SHA)
● [2017] HyperBand (HB / AHB)
● [2017] Population Based Training (PBT)
● [2017] Google Vizier
● [2018] Bayesian Optimization of HyperBand (BOHB)
● [2018] Async Successive Halving Algorithm (ASHA)
Today, ASHA is likely the best choice for real-world, highly distributed tuning. PBT is also a good choice, specifically for deep learning.

Slide 19

Ease of use
Cost(development) + Cost(tuning compute) < Cost(bad model)
● Library interoperability
● Maintenance cost over time
● Ease of getting started, debugging
Development time is a very real cost!

Slide 20

(3) Ray Tune’s approach

Slide 21

Why use Ray Tune?
● Open source!
● Interoperability with the Ray ecosystem
● Distributed-first, GPU-first
● Works with as many Python libraries as possible
● Constantly updated with state-of-the-art tuning algorithms
● Orchestrates other searchers and samplers: HyperOpt, Optuna, etc (see the sketch below)
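
For example, Tune can let Optuna propose configurations while Tune itself distributes and schedules the trials. A minimal sketch, assuming a Ray 1.x-era install with optuna available (the toy objective is illustrative):

from ray import tune
from ray.tune.suggest.optuna import OptunaSearch

def objective(config):
    # Toy objective with its minimum at x = 2.
    tune.report(score=(config["x"] - 2) ** 2)

tune.run(
    objective,
    config={"x": tune.uniform(-10, 10)},
    search_alg=OptunaSearch(metric="score", mode="min"),  # Optuna suggests, Tune orchestrates
    num_samples=50,
)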

Slide 22

What is Ray
[Diagram: Ray core, with native libraries, 3rd-party libraries, and your app on top]
● Universal framework for distributed computing
● Run anywhere
● Library + app ecosystem

Slide 23

Zooming in: Ray’s native libraries
● Universal framework for distributed computing
● Run anywhere
● Library + app ecosystem

Slide 24

Ray Tune - distributed HPO
● Efficient algorithms that enable running trials in parallel: cutting-edge optimization algorithms
● Effective orchestration of distributed trials: minimal code changes to work in distributed settings
● Easy-to-use APIs, compatible with the ML ecosystem

Slide 25

Ray: a universal framework for distributed computing
[Logos: notable users of Ray and of Ray Tune]

Slide 26

Tool               | Open source | No cloud lock-in | Distributed | SOTA algorithms | Bring your own framework | Also runs
Ray Tune           | ✅ | ✅ | ✅ | ⭐⭐⭐ | ✅ | HyperOpt, Optuna, SigOpt
HyperOpt           | ✅ | ✅ | ✅ | ⭐⭐ | ✅ |
Optuna             | ✅ | ✅ | ❌ | ⭐⭐ | ✅ |
SigOpt             | ❌ | ❌ | ✅ | ⭐ | ✅ |
Vertex AI (Vizier) | ❌ | ❌ | ✅ | ⭐ | ❌ |
Sagemaker          | ❌ | ❌ | ✅ | ⭐ | ❌ |
Azure ML           | ❌ | ❌ | ✅ | ⭐ | ❌ |
Katib              | ✅ | ✅ | ✅ | ⭐⭐ | ❌ | HyperOpt, Optuna
Spark ML           | ✅ | ✅ | ✅ | ❌ | ❌ | HyperOpt

Slide 27

Ray Tune benefits
● Provides efficient, cutting-edge HPO algorithms
● Distributes and coordinates parallel trials in a fault-tolerant and elastic manner
● Saves you time and cost at every step of HPO
Ray Tune benchmark on 2 weeks of production data at Uber: 2x efficiency improvement in GPU/CPU-hours.

Slide 28

Case studies: Ray Tune

Slide 29

(4) Code example!

Slide 30

Ray Tune with PyTorch
● ASHA scheduler with loss specification
● Simply pass your training function (e.g. train_cifar()) to tune.run()!
● Easily specify hyperparameter ranges to search over
See this example in the official PyTorch docs.
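
A condensed sketch of that flow, assuming the Ray 1.x-era API and a train_cifar function like the one in the official PyTorch tutorial:

from ray import tune
from ray.tune.schedulers import ASHAScheduler

config = {
    "lr": tune.loguniform(1e-4, 1e-1),         # search the learning rate on a log scale
    "batch_size": tune.choice([32, 64, 128]),  # categorical choice
}

scheduler = ASHAScheduler(metric="loss", mode="min", max_t=10, grace_period=1)

result = tune.run(
    train_cifar,                # training function from the tutorial (reports loss via tune.report)
    config=config,
    num_samples=20,             # 20 trials sampled from the space above
    scheduler=scheduler,        # ASHA stops underperforming trials early
    resources_per_trial={"cpu": 2, "gpu": 1},
)
print(result.get_best_trial("loss", "min").config)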

Slide 31

Tune-sklearn
● A scikit-learn wrapper for Ray Tune
  ○ drop-in replacement for the scikit-learn model selection module (RandomizedSearchCV and GridSearchCV)
● Provides a familiar and simple API for advanced, distributed HPO
● https://github.com/ray-project/tune-sklearn

Slide 32

Drop-in replacement: use all the cores on a single machine

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

parameters = {
    'alpha': [1e-4, 1e-1, 1],
    'epsilon': [0.01, 0.1],
}
search = GridSearchCV(
    SGDClassifier(),
    parameters,
    n_jobs=-1,  # use all the cores on the single machine
)
search.fit(X_train, y_train)

Slide 33

Drop-in replacement: use all the resources across the entire cluster!

from sklearn.linear_model import SGDClassifier
from tune_sklearn import TuneSearchCV

parameters = {
    'alpha': [1e-4, 1e-1, 1],
    'epsilon': [0.01, 0.1],
}
search = TuneSearchCV(
    SGDClassifier(),
    parameters,
    n_jobs=-1,  # use all the resources throughout the entire cluster
)
search.fit(X_train, y_train)

Slide 34

tune-sklearn demo
● Driver safety prediction
● Jupyter notebook that runs on the head node

Slide 35

Get started with Ray Tune
● Website (ray.io/ray-tune): one stop for all Ray Tune resources
● Documentation (docs.ray.io): quick start example, reference guides, etc.
● Forums (discuss.ray.io): learn from / share with the broader Ray Tune community, including developers
● Ray Slack: connect with the Ray team and community

Slide 36

A short history of techniques
Report the best model from trying...
● Grid search: N evenly spaced trials
● [2012] Random search: N randomly sampled trials
● [2013-14] Bayesian optimization: sample, then train a meta-model to choose the next sample; repeat
● [2015] Successive Halving Algorithm (SHA): N randomly sampled trials, then keep the best N/2, then N/4... (sketched in code below)
● [2017] Population Based Training (PBT): genetic-algorithm-like optimizer that starts with a set of parameterizations and spawns new parameterizations from well-performing ones
● [2017] HyperBand: do SHA, but more intelligently choose N over time
● [2018] Bayesian Optimization and HyperBand (BOHB): use HyperBand to choose budgets per parameterization, but suggest new samples via Bayesian methods instead of randomly
● [2018] Asynchronous SHA (ASHA): SHA, but don’t wait up for stragglers, and replace failed runs with new parameterizations
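
To make the SHA idea concrete, here is a toy sketch (plain Python, not Ray Tune code): train every candidate a little, keep the best half, double the budget, and repeat. The evaluate function and configs are hypothetical stand-ins.

import random

def successive_halving(configs, evaluate, budget=1):
    while len(configs) > 1:
        scores = {c: evaluate(c, budget) for c in configs}              # evaluate all survivors
        configs = sorted(configs, key=scores.get)[: len(configs) // 2]  # keep best half (lower is better)
        budget *= 2                                                     # survivors get twice the budget
    return configs[0]

# Hypothetical usage: 8 random learning rates; score = distance from an optimum, improving with budget.
lrs = [10 ** random.uniform(-4, -1) for _ in range(8)]
best = successive_halving(lrs, lambda lr, b: abs(lr - 0.01) / b)
print(best)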

Slide 37

A short history of techniques
[Figure: grid search vs. random search (2012)]

Slide 38

Example: How many layers? What kinds of layers? Every number here is a hyperparameter!

Slide 39

Hyperparameters vs. model parameters
● Hyperparameters (set before training):
  ○ Model type and architecture
  ○ Learning and training related parameters
  ○ Pipeline configurations
● Model parameters: learned during training

Slide 40

Example: an XGBoost pipeline
● Imputer: simple or iterative? If simple: mean, median, or constant?
● Categorical encoder: one-hot encoding or label encoding?
● Under/oversampler: SMOTE or random undersampling? Number of neighbors?
● XGBoost itself: 6-10 hyperparameters to tune
Total: ~15 hyperparameters to tune! (A sketch of the joint search space follows.)
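
A hedged sketch of what that joint search space could look like as a Ray Tune config; the pipeline parameter names here are illustrative, not a fixed API:

from ray import tune

search_space = {
    # Preprocessing (hypothetical parameter names for the pipeline steps)
    "imputer_strategy": tune.choice(["mean", "median", "constant"]),
    "encoder": tune.choice(["onehot", "label"]),
    "sampler": tune.choice(["smote", "random_undersample"]),
    "smote_k_neighbors": tune.randint(3, 10),
    # XGBoost hyperparameters
    "max_depth": tune.randint(3, 10),
    "learning_rate": tune.loguniform(1e-3, 3e-1),
    "n_estimators": tune.randint(100, 1000),
    "subsample": tune.uniform(0.5, 1.0),
    "colsample_bytree": tune.uniform(0.5, 1.0),
    "min_child_weight": tune.randint(1, 10),
}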

Slide 41

Ray Tune - HPO algorithms
01 Exhaustive search
02 Bayesian optimization
03 Advanced scheduling
● Over 15 algorithms natively provided or integrated
● Easy to swap out different algorithms with no code change

Slide 42

Bayesian optimization
● Uses results from previous combinations (trials) to decide which trial to try next
● Inherently sequential
● Popular libraries:
  ○ hyperopt
  ○ optuna
https://www.wikiwand.com/en/Hyperparameter_optimization

Slide 43

Advanced scheduling: parallel exploration and exploitation
● Fan out parallel trials during the initial exploration phase
● Make decisions based on intermediate cross-trial evaluations
● Allocate resources to more promising trials
Two flavors: early stopping and population based training.

Slide 44

Advanced scheduling - early stopping
● Fan out parallel trials during the initial exploration phase
● Use intermediate results (epochs, trees) to prune underperforming trials, saving time and computing resources
● Median stopping, ASHA/HyperBand
● Can be combined with Bayesian optimization (BOHB), as sketched below
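
A hedged sketch of the BOHB combination, with Ray 1.x-era import paths (BOHB additionally requires the hpbandster and ConfigSpace packages; the toy train_model is illustrative):

from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.suggest.bohb import TuneBOHB

def train_model(config):
    # Stand-in for real training: report a loss at every step.
    for step in range(100):
        tune.report(loss=(config["lr"] - 0.01) ** 2 / (step + 1))

tune.run(
    train_model,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    search_alg=TuneBOHB(metric="loss", mode="min"),  # Bayesian suggestions
    scheduler=HyperBandForBOHB(                      # HyperBand-style budget allocation
        time_attr="training_iteration",
        metric="loss",
        mode="min",
        max_t=100,
    ),
    num_samples=50,
)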

Slide 45

Advanced scheduling - Population Based Training
● Evolutionary algorithm
● Evaluate a population in parallel
● Terminate the lowest performers
● Copy the weights of the best performing trials and mutate them
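
A hedged sketch of PBT in Ray Tune (Ray 1.x-era API). The exploit step copies checkpoints, so the training function must save and load them; a checkpoint-aware training function is sketched later, alongside the fault-tolerance slides.

import random
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="loss",
    mode="min",
    perturbation_interval=5,                       # consider exploit/explore every 5 iterations
    hyperparam_mutations={
        "lr": lambda: random.uniform(1e-4, 1e-1),  # resample to explore
        "batch_size": [32, 64, 128],               # or hop to a neighboring value in a list
    },
)

tune.run(
    train_model,                            # assumed: a checkpoint-aware training function
    config={"lr": 1e-3, "batch_size": 64},  # initial values for the population
    scheduler=pbt,
    num_samples=8,                          # population size
)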

Slide 46

Woohoo! Let’s review what we have talked about.

Slide 47

● There are various HPO algorithms, with a trend toward going parallel
● More advanced ones are often hard to implement
  ○ Even more so in a distributed setting

Slide 48

Good news!
● Ray Tune implements and integrates with all these algorithms
● Allows users to swap out different algorithms very easily

Slide 49

Architecture requirements
● Granular control over when to start, pause, early-stop, restore, or mutate each trial at specific iterations, with little overhead
● Master-worker architecture that centralizes decision making
● Elasticity and fault tolerance

Slide 50

Ray Tune - distributed HPO
[Diagram: head node with a driver process; tune.run(train_func) starts the orchestrator running the HPO algorithm]

from ray import tune

def train_func(config):
    model = ConvNet(config)
    for i in range(epochs):
        current_loss = model.train()
        tune.report(loss=current_loss)

tune.run(
    train_func,
    config={"alpha": tune.uniform(0.001, 0.1)},
    num_samples=100,
    scheduler="asha",
    search_alg="optuna",
)

Slide 51

Ray Tune - distributed HPO
[Diagram: the driver process on the head node launches worker processes (actors) across the worker nodes; each actor runs train_func]
Each actor evaluates one hyperparameter combination (a trial).

Slide 52

Ray Tune - distributed HPO
[Diagram: each actor reports metrics back to the orchestrator on the head node]
The orchestrator keeps track of all the trials’ progress and metrics.

Slide 53

Ray Tune - distributed HPO
[Diagram: the orchestrator tells one trial to early-stop while the others continue]
Based on the metrics, the orchestrator may stop/pause/mutate trials or launch new trials when resources are available.

Slide 54

Ray Tune - distributed HPO
[Diagram: the orchestrator launches a new trial on the freed-up worker]
Resources are repurposed to explore new trials.

Slide 55

Ray Tune - distributed HPO
[Diagram: actors write trial checkpoints to cloud storage]
Trials are checkpointed to cloud storage; the orchestrator also manages checkpoint state.
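
What a checkpoint-aware training function looks like with the Ray 1.x-era API used in this talk (tune.checkpoint_dir); ConvNet and train_one_epoch are hypothetical stand-ins:

import os
import torch
from ray import tune

def train_func(config, checkpoint_dir=None):
    model = ConvNet(config)                  # hypothetical model class
    start = 0
    if checkpoint_dir:                       # restored after a crash: resume from saved state
        state = torch.load(os.path.join(checkpoint_dir, "checkpoint.pt"))
        model.load_state_dict(state["model"])
        start = state["epoch"] + 1
    for epoch in range(start, config["epochs"]):
        loss = model.train_one_epoch()       # hypothetical training step
        with tune.checkpoint_dir(step=epoch) as ckpt_dir:  # Tune syncs this to cloud storage
            torch.save(
                {"model": model.state_dict(), "epoch": epoch},
                os.path.join(ckpt_dir, "checkpoint.pt"),
            )
        tune.report(loss=loss)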

Slide 56

Ray Tune - distributed HPO
[Diagram: one of the worker processes crashes]
Some worker process crashes.

Slide 57

Ray Tune - distributed HPO
[Diagram: a new actor loads the checkpoint from cloud storage and restores the trial]
A new actor comes up fresh, and the crashed trial is restored from the remote checkpoint.

Slide 58

Woohoo! Let’s review what we have talked about.

Slide 59

What makes Ray Tune special
● Provides efficient HPO algorithms
● Distributes and coordinates parallel trials in a fault-tolerant and elastic manner
● Integrated with the ML ecosystem

Slide 60

No content

Slide 61

Thank You
Let’s keep in touch!
● https://ray.io/
● https://discuss.ray.io/
● Ray Slack
● https://github.com/ray-project/tune-sklearn

Slide 62

Q & A

Slide 63

Appendix

Slide 64

Slide 65

No content

Slide 66

Slide 67

Something to highlight: distributed apps will become the norm.

Slide 68

No content

Slide 69

Agenda: Ray and Ray Tune

Slide 70

Agenda

Slide 71

Thank you for listening!

Slide 72

No content

Slide 73

tune-sklearn demo

Slide 74

Thank You

Slide 75

No content

Slide 76

Ray: a universal framework for distributed computing
● Berkeley RiseLab → Ray → Anyscale
● Over 500 contributors from a huge number of companies

Slide 79

No content