
Scaling Machine Learning Workloads with Ray

Modern machine learning (ML) workloads, such as deep learning and large-scale model training, are compute-intensive and increasingly require distributed execution. Ray was created in the UC Berkeley RISELab to make distributed programming easy, so that every engineer can scale their applications and ML workloads without any distributed-systems expertise.

Join Jules S. Damji, developer advocate at Anyscale, and Antoni Baum, software engineer at Anyscale, for an introduction to Ray for scaling your ML workloads. Learn how Ray libraries (e.g., Ray Tune and Ray Serve) help you scale every step of your ML pipeline, from model training and hyperparameter search to inference and production serving.
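As a taste of the library ecosystem, here is a minimal sketch of a Ray Tune hyperparameter search. This is not from the talk: it assumes Ray 1.x with Tune installed, and the objective function and its single parameter "x" are toy placeholders:

    from ray import tune

    def objective(config):
        # Toy objective: score one sampled hyperparameter value.
        score = (config["x"] - 3) ** 2
        tune.report(score=score)

    # Tune runs trials in parallel across the CPUs Ray detects.
    analysis = tune.run(
        objective,
        config={"x": tune.uniform(0, 10)},  # search space
        num_samples=20,                     # number of trials
    )
    print(analysis.get_best_config(metric="score", mode="min"))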

Highlights include:
- Ray overview & core concepts
- Library ecosystem and use cases
- Demo: Ray for scaling ML workflows
- Getting started resources


Anyscale

November 18, 2021

Transcript

  1. Scaling Machine Learning Workloads with Ray. Jules S. Damji, Lead Developer Advocate, Anyscale; Antoni Baum, Software Engineer, ML, Anyscale
  2. Agenda: 01 What & Why Ray, 02 Ray's Ecosystem & Use Cases, 03 Demo
  3. Why Ray? Machine learning is pervasive in every domain; distributed machine learning is becoming a necessity; distributed computing is notoriously hard.
  4. Apps increasingly incorporate AI/ML
  5. Compute demand is growing faster than supply: ML compute demand grows 35x every 18 months, while Moore's Law gives only 2x every 18 months for CPUs. (Chart of compute used by notable ML models through GPT-3, 2020; source: https://openai.com/blog/ai-and-compute/)
  6. Specialized hardware is also not enough: even GPU and TPU trend lines cannot close the gap with demand. No way out but to distribute! (Same chart, with GPU and TPU lines added; source: https://openai.com/blog/ai-and-compute/)
  7. Existing solutions may have tradeoffs. (Chart plotting generality against ease of development.)
  8. Why Ray? Machine learning is pervasive in every domain; distributed machine learning is becoming a necessity; distributed computing is notoriously hard. Ray's vision: make distributed computing accessible to every developer.
  9. The Ray Ecosystem (diagram of Ray libraries, including Datasets and Workflows)
  10. What is Ray? What a cluster looks like (diagram): a head node (think of the RM in YARN), a worker process on each node (think of the NM daemon in YARN or a Spark executor), a driver running your program (think of the Spark driver), and the global control store, which is unique to Ray.
  11. What is Ray - API. Plain Python, before Ray:

      def read_array(file):
          # read array a from file
          return a

      def add(a, b):
          return np.add(a, b)

      a = read_array(file1)
      b = read_array(file2)
      sum = add(a, b)

  12. What is Ray - API. The same functions, decorated with @ray.remote to turn them into remote tasks:

      @ray.remote
      def read_array(file):
          # read array a from file
          return a

      @ray.remote
      def add(a, b):
          return np.add(a, b)

  13. What is Ray - API. Remote tasks are invoked with .remote(), which returns an object reference immediately; ray.get blocks and fetches the result:

      ref1 = read_array.remote(file1)
      ref2 = read_array.remote(file2)
      ref = add.remote(ref1, ref2)
      sum = ray.get(ref)

  14. What is Ray - API. A plain Python class, before Ray:

      class Counter(object):
          def __init__(self):
              self.value = 0
          def inc(self):
              self.value += 1
              return self.value

      c = Counter()
      c.inc()
      c.inc()

  15. What is Ray - API. Decorating the class with @ray.remote turns it into an actor; resources can be requested in the decorator, e.g. @ray.remote(num_gpus=1):

      @ray.remote
      class Counter(object):
          def __init__(self):
              self.value = 0
          def inc(self):
              self.value += 1
              return self.value

      c = Counter.remote()
      ref4 = c.inc.remote()
      ref5 = c.inc.remote()

  16. What is Ray - API. ray.wait retrieves results as they complete:

      c = Counter.remote()
      increment_refs = [c.inc.remote() for _ in range(5)]

      while len(increment_refs) > 0:
          return_n = 2 if len(increment_refs) > 1 else 1
          ready_refs, remaining_refs = ray.wait(
              increment_refs, num_returns=return_n, timeout=10.0)
          if len(ready_refs) > 0:
              print(ray.get(ready_refs))
          # Update the remaining ones
          increment_refs = remaining_refs

      # Prints: [1, 2] [3, 4] [5]

  17. The Ray Ecosystem (diagram, including Datasets and Workflows): https://www.anyscale.com/blog/whats-new-in-the-ray-distributed-library-ecosystem
  18. Rich ecosystem for scaling ML workloads. Native libraries easily scale common bottlenecks in ML workflows (examples: Ray Tune for hyperparameter optimization, RLlib for reinforcement learning, Ray Serve for serving). Integrations scale popular frameworks with Ray with minimal changes (examples: XGBoost, TensorFlow, JAX, PyTorch).
  19. Rich ecosystem for scaling ML workloads (diagram: data processing, training, hyperparameter tuning, serving, and reinforcement learning libraries layered on Ray Core + Datasets).
  20. Same diagram; this is only a small subset of the Ray ecosystem in ML. Integrate Ray only based on your needs!
  21. Same diagram; you can integrate Ray Tune alone. No need to adopt the entire Ray framework.
  22. Challenges in scaling hyperparameter tuning? (Same diagram, highlighting hyperparameter tuning.)
  23. Stitching together different frameworks to go end-to-end? (Same generality vs. ease-of-development chart.)
  24. A unified, distributed toolkit to go end-to-end. (Same ecosystem diagram.)
  25. Companies scaling ML with Ray
  26. Companies scaling ML with Ray (same ecosystem diagram):
      • https://eng.uber.com/horovod-ray/
      • https://www.anyscale.com/blog/wildlife-studios-serves-in-game-offers-3x-faster-at-1-10th-the-cost-with-ray
      • https://www.ikigailabs.com/blog/how-ikigai-labs-serves-interactive-ai-workflows-at-scale-using-ray-serve
  27. Scaling Ecosystem Restoration (Dendra Systems)
  28. Making Boats Fly with AI (McKinsey | QuantumBlack Australia)
  29. Large-Scale ML Platforms (Uber, Shopify, Robinhood, and more)
  30. Demo
  31. Start scaling your ML workloads. Getting started:
      - Documentation (docs.ray.io): quick-start examples, reference guides, etc.
      - Ray Meetup (reviving in Jan 2022; recordings published to members): https://www.meetup.com/Bay-Area-Ray-Meetup/
      - Forums (discuss.ray.io): learn and share with the broader Ray community, including the core team
      - Ray Slack: connect with the Ray team and community
      - GitHub: check out the source, file an issue, become a contributor, give us a star :) https://github.com/ray-project/ray
  32. Thank you
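For reference, the task API on slides 11-13 above runs end to end with a few additions. A minimal sketch, assuming Ray 1.x and NumPy are installed, with a hypothetical stub in place of real file reading:

    import numpy as np
    import ray

    ray.init()  # start Ray locally; on a cluster you would pass its address

    @ray.remote
    def read_array(file):
        # Hypothetical stub standing in for real file I/O.
        return np.ones(4)

    @ray.remote
    def add(a, b):
        return np.add(a, b)

    ref1 = read_array.remote("file1")  # returns a reference immediately
    ref2 = read_array.remote("file2")  # both reads run in parallel
    ref = add.remote(ref1, ref2)       # references can be passed to tasks
    print(ray.get(ref))                # blocks and fetches: [2. 2. 2. 2.]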
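To try this and the actor examples yourself: Ray installs from PyPI (pip install ray), and the quick start in the documentation at docs.ray.io walks through the same task and actor APIs shown above.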