
Scaling Machine Learning Workloads with Ray

Modern machine learning (ML) workloads, such as deep learning and large-scale model training, are compute-intensive and increasingly require distributed execution. Ray was created in the UC Berkeley RISELab to make distributed programming easy, so that every engineer can scale their applications and ML workloads without any distributed-systems expertise.

Join Jules S. Damji, developer advocate at Anyscale, and Antoni Baum, software engineer at Anyscale, for an introduction to Ray for scaling your ML workloads. Learn how Ray libraries (e.g., Ray Tune and Ray Serve) help you scale every step of your ML pipeline, from model training and hyperparameter search to inference and production serving.
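As a taste of the library ecosystem, here is a minimal sketch of a Ray Tune hyperparameter search. This is not from the talk: it assumes Ray 1.x with Tune installed, and the objective function and its single parameter "x" are toy placeholders:

    from ray import tune

    def objective(config):
        # Toy objective: score one sampled hyperparameter value.
        score = (config["x"] - 3) ** 2
        tune.report(score=score)

    # Tune runs trials in parallel across the CPUs Ray detects.
    analysis = tune.run(
        objective,
        config={"x": tune.uniform(0, 10)},  # search space
        num_samples=20,                     # number of trials
    )
    print(analysis.get_best_config(metric="score", mode="min"))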

Highlights include:
- Ray overview & core concepts
- Library ecosystem and use cases
- Demo: Ray for scaling ML workflows
- Getting started resources


Anyscale

November 18, 2021

Transcript

  1. Scaling Machine Learning Workloads with Ray. Jules S. Damji, Lead Developer Advocate, Anyscale; Antoni Baum, Software Engineer, ML, Anyscale
  2. Agenda: 01 What & Why Ray, 02 Ray's Ecosystem & Use Cases, 03 Demo
  3. Why Ray? Machine learning is pervasive in every domain; distributed machine learning is becoming a necessity; distributed computing is notoriously hard.
  4. Apps increasingly incorporate AI/ML
  5. Compute demand is growing faster than supply: ML compute demand grows 35x every 18 months, while Moore's Law gives only 2x every 18 months for CPUs. (Chart of compute used by notable ML models through GPT-3, 2020; source: https://openai.com/blog/ai-and-compute/)
  6. Specialized hardware is also not enough: even GPU and TPU trend lines cannot close the gap with demand. No way out but to distribute! (Same chart, with GPU and TPU lines added; source: https://openai.com/blog/ai-and-compute/)
  7. Existing solutions may have tradeoffs. (Chart plotting generality against ease of development.)
  8. Why Ray? Machine learning is pervasive in every domain; distributed machine learning is becoming a necessity; distributed computing is notoriously hard. Ray's vision: make distributed computing accessible to every developer.
  9. The Ray Ecosystem (diagram of Ray libraries, including Datasets and Workflows)
  10. What is Ray? What a cluster looks like (diagram): a head node (think of the RM in YARN), a worker process on each node (think of the NM daemon in YARN or a Spark executor), a driver running your program (think of the Spark driver), and the global control store, which is unique to Ray.
  11. What is Ray - API. Plain Python, before Ray:

      def read_array(file):
          # read array a from file
          return a

      def add(a, b):
          return np.add(a, b)

      a = read_array(file1)
      b = read_array(file2)
      sum = add(a, b)

  12. What is Ray - API. The same functions, decorated with @ray.remote to turn them into remote tasks:

      @ray.remote
      def read_array(file):
          # read array a from file
          return a

      @ray.remote
      def add(a, b):
          return np.add(a, b)

  13. What is Ray - API. Remote tasks are invoked with .remote(), which returns an object reference immediately; ray.get blocks and fetches the result:

      ref1 = read_array.remote(file1)
      ref2 = read_array.remote(file2)
      ref = add.remote(ref1, ref2)
      sum = ray.get(ref)

  14. What is Ray - API. A plain Python class, before Ray:

      class Counter(object):
          def __init__(self):
              self.value = 0
          def inc(self):
              self.value += 1
              return self.value

      c = Counter()
      c.inc()
      c.inc()

  15. What is Ray - API. Decorating the class with @ray.remote turns it into an actor; resources can be requested in the decorator, e.g. @ray.remote(num_gpus=1):

      @ray.remote
      class Counter(object):
          def __init__(self):
              self.value = 0
          def inc(self):
              self.value += 1
              return self.value

      c = Counter.remote()
      ref4 = c.inc.remote()
      ref5 = c.inc.remote()

  16. What is Ray - API. ray.wait retrieves results as they complete:

      c = Counter.remote()
      increment_refs = [c.inc.remote() for _ in range(5)]

      while len(increment_refs) > 0:
          return_n = 2 if len(increment_refs) > 1 else 1
          ready_refs, remaining_refs = ray.wait(
              increment_refs, num_returns=return_n, timeout=10.0)
          if len(ready_refs) > 0:
              print(ray.get(ready_refs))
          # Update the remaining ones
          increment_refs = remaining_refs

      # Prints: [1, 2] [3, 4] [5]

  17. The Ray Ecosystem (diagram, including Datasets and Workflows): https://www.anyscale.com/blog/whats-new-in-the-ray-distributed-library-ecosystem
  18. Rich ecosystem for scaling ML workloads. Native libraries easily scale common bottlenecks in ML workflows (examples: Ray Tune for hyperparameter optimization, RLlib for reinforcement learning, Ray Serve for serving). Integrations scale popular frameworks with Ray with minimal changes (examples: XGBoost, TensorFlow, JAX, PyTorch).
  19. Rich ecosystem for scaling ML workloads (diagram: data processing, training, hyperparameter tuning, serving, and reinforcement learning libraries layered on Ray Core + Datasets).
  20. Same diagram; this is only a small subset of the Ray ecosystem in ML. Integrate Ray only based on your needs!
  21. Same diagram; you can integrate Ray Tune alone. No need to adopt the entire Ray framework.
  22. Challenges in scaling hyperparameter tuning? (Same diagram, highlighting hyperparameter tuning.)
  23. Stitching together different frameworks to go end-to-end? (Same generality vs. ease-of-development chart.)
  24. A unified, distributed toolkit to go end-to-end. (Same ecosystem diagram.)
  25. Companies scaling ML with Ray
  26. Companies scaling ML with Ray (same ecosystem diagram):
      • https://eng.uber.com/horovod-ray/
      • https://www.anyscale.com/blog/wildlife-studios-serves-in-game-offers-3x-faster-at-1-10th-the-cost-with-ray
      • https://www.ikigailabs.com/blog/how-ikigai-labs-serves-interactive-ai-workflows-at-scale-using-ray-serve
  27. Scaling Ecosystem Restoration (Dendra Systems)
  28. Making Boats Fly with AI (McKinsey | QuantumBlack Australia)
  29. Large-Scale ML Platforms (Uber, Shopify, Robinhood, and more)
  30. Demo
  31. Start scaling your ML workloads. Getting started:
      - Documentation (docs.ray.io): quick-start examples, reference guides, etc.
      - Ray Meetup (reviving in Jan 2022; recordings published to members): https://www.meetup.com/Bay-Area-Ray-Meetup/
      - Forums (discuss.ray.io): learn and share with the broader Ray community, including the core team
      - Ray Slack: connect with the Ray team and community
      - GitHub: check out the source, file an issue, become a contributor, give us a star :) https://github.com/ray-project/ray
  32. Thank you
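For reference, the task API on slides 11-13 above runs end to end with a few additions. A minimal sketch, assuming Ray 1.x and NumPy are installed, with a hypothetical stub in place of real file reading:

    import numpy as np
    import ray

    ray.init()  # start Ray locally; on a cluster you would pass its address

    @ray.remote
    def read_array(file):
        # Hypothetical stub standing in for real file I/O.
        return np.ones(4)

    @ray.remote
    def add(a, b):
        return np.add(a, b)

    ref1 = read_array.remote("file1")  # returns a reference immediately
    ref2 = read_array.remote("file2")  # both reads run in parallel
    ref = add.remote(ref1, ref2)       # references can be passed to tasks
    print(ray.get(ref))                # blocks and fetches: [2. 2. 2. 2.]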
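To try this and the actor examples yourself: Ray installs from PyPI (pip install ray), and the quick start in the documentation at docs.ray.io walks through the same task and actor APIs shown above.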