Slide 1

Slide 1 text

Scaling Machine Learning Workloads with Ray
Jules S. Damji, Lead Developer Advocate, Anyscale
Antoni Baum, Software Engineer, ML, Anyscale

Slide 2

Slide 2 text

Agenda
01 What & Why Ray
02 Ray's Ecosystem & Use Cases
03 Demo

Slide 3

Slide 3 text

Why Ray?
- Machine learning is pervasive in every domain
- Distributed machine learning is becoming a necessity
- Distributed computing is notoriously hard

Slide 4

Slide 4 text

Why Ray?
- Machine learning is pervasive in every domain
- Distributed machine learning is becoming a necessity
- Distributed computing is notoriously hard

Slide 5

Slide 5 text

Apps increasingly incorporate AI/ML

Slide 6

Slide 6 text

Why Ray?
- Machine learning is pervasive in every domain
- Distributed machine learning is becoming a necessity
- Distributed computing is notoriously hard

Slide 7

Slide 7 text

Compute demand growing faster than supply
Chart: compute used by notable AI systems (e.g. GPT-3, 2020) grows ~35x every 18 months, vs. Moore's Law (2x every 18 months) for CPUs.
https://openai.com/blog/ai-and-compute/

Slide 8

Slide 8 text

Specialized hardware is also not enough
Chart: compute demand grows ~35x every 18 months vs. Moore's Law (2x every 18 months); CPU, GPU*, and TPU* trend lines, with GPT-3 marked at 2020.
https://openai.com/blog/ai-and-compute/

Slide 9

Slide 9 text

Specialized hardware is also not enough
Chart: compute demand grows ~35x every 18 months vs. Moore's Law (2x every 18 months); CPU, GPU*, and TPU* trend lines, with GPT-3 marked at 2020.
https://openai.com/blog/ai-and-compute/
No way out but to distribute!

Slide 10

Slide 10 text

Why Ray?
- Machine learning is pervasive in every domain
- Distributed machine learning is becoming a necessity
- Distributed computing is notoriously hard

Slide 11

Slide 11 text

Existing solutions have many tradeoffs
Chart axes: generality vs. ease of development

Slide 12

Slide 12 text

Existing solutions have many tradeoffs
Chart axes: generality vs. ease of development

Slide 13

Slide 13 text

Existing solutions have many tradeoffs
Chart axes: generality vs. ease of development

Slide 14

Slide 14 text

Existing solutions have many tradeoffs
Chart axes: generality vs. ease of development

Slide 15

Slide 15 text

Why Ray?
- Machine learning is pervasive in every domain
- Distributed machine learning is becoming a necessity
- Distributed computing is notoriously hard
Ray's vision: make distributed computing accessible to every developer

Slide 16

Slide 16 text

The Ray Ecosystem
Diagram labels: Datasets, Workflows

Slide 17

Slide 17 text

What is Ray? A Cluster Looks Like…
Cluster diagram annotations: think about RM in YARN; think about the NM daemon (YARN) or a Spark Executor; think about the Spark Driver; plus components unique to Ray.
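A minimal sketch of how a driver script attaches to a Ray cluster (ray.init, address="auto", and the ray start --head CLI are standard Ray; the multi-node cluster itself is assumed to already be running):

import ray

# Start Ray locally: head and worker processes all on this machine.
ray.init()

# Or, if a multi-node cluster is already running (e.g. started with "ray start --head"
# on the head node), attach the driver to it instead of starting a new local instance:
# ray.init(address="auto")

# Show what the cluster offers, e.g. {'CPU': 8.0, 'memory': ..., 'object_store_memory': ...}
print(ray.cluster_resources())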

Slide 18

Slide 18 text

What is Ray - API

def read_array(file):
    # read array a from file
    return a

def add(a, b):
    return np.add(a, b)

a = read_array(file1)
b = read_array(file2)
sum = add(a, b)

Slide 19

Slide 19 text

What is Ray - API

# Plain Python functions…
def read_array(file):
    # read array a from file
    return a

def add(a, b):
    return np.add(a, b)

a = read_array(file1)
b = read_array(file2)
sum = add(a, b)

# …become Ray tasks by adding the @ray.remote decorator
@ray.remote
def read_array(file):
    # read array a from file
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

a = read_array(file1)
b = read_array(file2)
sum = add(a, b)

Slide 20

Slide 20 text

What is Ray - API

@ray.remote
def read_array(file):
    # read array a from file
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

# Invoke tasks with .remote() and fetch the result with ray.get()
ref1 = read_array.remote(file1)
ref2 = read_array.remote(file2)
ref = add.remote(ref1, ref2)
sum = ray.get(ref)
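For reference, a self-contained, runnable variant of this example (a sketch: file1.npy and file2.npy are hypothetical arrays saved with np.save, and ray.init() is added so it runs on a laptop as well as a cluster):

import numpy as np
import ray

ray.init()

@ray.remote
def read_array(file):
    # read array a from file
    return np.load(file)

@ray.remote
def add(a, b):
    return np.add(a, b)

# .remote() returns object refs immediately; the two reads run in parallel.
ref1 = read_array.remote("file1.npy")
ref2 = read_array.remote("file2.npy")
ref = add.remote(ref1, ref2)   # refs can be passed straight into other tasks
total = ray.get(ref)           # block until the final result is ready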

Slide 21

Slide 21 text

What is Ray - API

# A plain Python class…
class Counter(object):
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value

c = Counter()
c.inc()
c.inc()

# …alongside the Ray tasks from the previous slide
@ray.remote
def read_array(file):
    # read array a from file
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

ref1 = read_array.remote(file1)
ref2 = read_array.remote(file2)
ref = add.remote(ref1, ref2)
sum = ray.get(ref)

Slide 22

Slide 22 text

What is Ray - API

# The class becomes a Ray actor by adding the @ray.remote decorator
@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value

# Actors (and tasks) can also request resources, e.g. a GPU:
@ray.remote(num_gpus=1)
class Counter(object):
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value

# Instantiate the actor and call its methods with .remote()
c = Counter.remote()
ref4 = c.inc.remote()
ref5 = c.inc.remote()

# (plus the read_array / add Ray tasks from the previous slides)
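A short, runnable actor sketch along the same lines (num_cpus=1 is used here instead of num_gpus=1 so it schedules on a CPU-only machine; this variation is for illustration, not from the slides):

import ray

ray.init()

@ray.remote(num_cpus=1)
class Counter(object):
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value

# The actor is placed on a node with a free CPU; method calls also return object refs.
c = Counter.remote()
print(ray.get([c.inc.remote(), c.inc.remote()]))   # [1, 2]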

Slide 23

Slide 23 text

What is Ray - API

c = Counter.remote()
increment_refs = [c.inc.remote() for _ in range(5)]

while len(increment_refs) > 0:
    return_n = 2 if len(increment_refs) > 1 else 1
    ready_refs, remaining_refs = ray.wait(increment_refs, num_returns=return_n, timeout=10.0)
    if len(ready_refs) > 0:
        print(ray.get(ready_refs))
    # Update the remaining ones
    increment_refs = remaining_refs

# Output:
# [1, 2]
# [3, 4]
# [5]

Slide 24

Slide 24 text

The Ray Ecosystem
Diagram labels: Datasets, Workflows
https://www.anyscale.com/blog/whats-new-in-the-ray-distributed-library-ecosystem

Slide 25

Slide 25 text

Rich ecosystem for scaling ML workloads
Native libraries - easily scale common bottlenecks in ML workflows
- Examples: Ray Tune for HPO, RLlib for reinforcement learning, Ray Serve for model serving, etc.
Integrations - scale popular frameworks with Ray with minimal changes
- Examples: XGBoost, TensorFlow, JAX, PyTorch, etc.
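As a concrete illustration of the native-library side, a minimal Ray Tune sketch (Ray 1.x-era tune.run API; the objective function and search space are made up for illustration):

from ray import tune

def objective(config):
    # Stand-in for a real training loop.
    score = (config["lr"] * 100) ** 2
    tune.report(score=score)   # report the metric back to Tune

analysis = tune.run(
    objective,
    config={"lr": tune.grid_search([0.001, 0.01, 0.1])},
)
print(analysis.get_best_config(metric="score", mode="min"))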

Slide 26

Slide 26 text

Rich ecosystem for scaling ML workloads
Diagram: Data Processing, Training, Hyperparameter Tuning, Reinforcement Learning, Model Serving, all built on Ray Core + Datasets

Slide 27

Slide 27 text

Rich ecosystem for scaling ML workloads
Diagram: Data Processing, Training, Hyperparameter Tuning, Reinforcement Learning, Model Serving, all built on Ray Core + Datasets
** a small subset of the Ray ecosystem in ML
Integrate Ray only based on your needs!
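As an example of the integration side, a sketch of distributed XGBoost training with the xgboost_ray package (closely following its documented drop-in train() API; the dataset and parameters are illustrative):

from sklearn.datasets import load_breast_cancer
from xgboost_ray import RayDMatrix, RayParams, train

data, labels = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(data, labels)   # Ray-aware replacement for xgb.DMatrix

bst = train(
    {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
    train_set,
    evals=[(train_set, "train")],
    verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=1),  # 2 distributed training workers
)
bst.save_model("model.xgb")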

Slide 28

Slide 28 text

Rich ecosystem for scaling ML workloads
Diagram: Data Processing, Training, Hyperparameter Tuning, Reinforcement Learning, Model Serving, all built on Ray Core + Datasets
Integrate Ray Tune! No need to adopt the entire Ray framework.

Slide 29

Slide 29 text

Challenges in scaling hyperparameter tuning?
Rich ecosystem for scaling ML workloads
Diagram: Data Processing, Training, Hyperparameter Tuning, Reinforcement Learning, Model Serving, all built on Ray Core + Datasets

Slide 30

Slide 30 text

Stitching together different frameworks to go end-to-end?
Chart axes: generality vs. ease of development

Slide 31

Slide 31 text

Rich ecosystem for scaling ML workloads
Diagram: Data Processing, Training, Hyperparameter Tuning, Reinforcement Learning, Model Serving, all built on Ray Core + Datasets
Unified, distributed toolkit to go end-to-end

Slide 32

Slide 32 text

Companies scaling ML with Ray

Slide 33

Slide 33 text

Companies scaling ML with Ray
Diagram: Data Processing, Training, Hyperparameter Tuning, Reinforcement Learning, Model Serving, all built on Ray Core / Datasets
● https://eng.uber.com/horovod-ray/
● https://www.anyscale.com/blog/wildlife-studios-serves-in-game-offers-3x-faster-at-1-10th-the-cost-with-ray
● https://www.ikigailabs.com/blog/how-ikigai-labs-serves-interactive-ai-workflows-at-scale-using-ray-serve

Slide 34

Slide 34 text

Scaling Ecosystem Restoration Dendra Systems

Slide 35

Slide 35 text

Making Boats Fly with AI
McKinsey | QuantumBlack Australia

Slide 36

Slide 36 text

Large Scale ML Platforms Uber, Shopify, Robinhood, and more

Slide 37

Slide 37 text

Demo

Slide 38

Slide 38 text

Start scaling your ML workloads
Getting Started:
- Documentation (docs.ray.io): quick start examples, reference guides, etc.
- Join the Ray Meetup: reviving in Jan 2022; recordings published to members. https://www.meetup.com/Bay-Area-Ray-Meetup/
- Forums (discuss.ray.io): learn and share with the broader Ray community, including the core team
- Ray Slack: connect with the Ray team and community
- GitHub: check out the source, file an issue, become a contributor, give us a star :) https://github.com/ray-project/ray
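A minimal quick-start sketch (assuming pip install ray; the square function is made up for illustration):

import ray

ray.init()

@ray.remote
def square(x):
    return x * x

# Four tasks run in parallel across the available CPUs.
print(ray.get([square.remote(i) for i in range(4)]))   # [0, 1, 4, 9]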

Slide 39

Slide 39 text

Thank you