Scaling Machine Learning Workloads with Ray

Anyscale
November 18, 2021

Modern machine learning (ML) workloads, such as deep learning and large-scale model training, are compute-intensive and require distributed execution. Ray was created in the UC Berkeley RISELab to make it easy for every engineer to scale their applications and ML workloads, without requiring any distributed systems expertise.

Join Jules S. Damji, lead developer advocate at Anyscale, and Antoni Baum, software engineer at Anyscale, for an introduction to Ray for scaling your ML workloads. Learn how Ray libraries (e.g., Ray Tune, Ray Serve) help you easily scale every step of your ML pipeline, from model training and hyperparameter search to inference and production serving.

Highlights include:
- Ray overview & core concepts
- Library ecosystem and use cases
- Demo: Ray for scaling ML workflows
- Getting started resources


Transcript

  1. Scaling Machine Learning Workloads with Ray. Jules S. Damji, Lead Developer Advocate, Anyscale; Antoni Baum, Software Engineer (ML), Anyscale
  2. Why Ray?
     - Machine learning is pervasive in every domain
     - Distributed machine learning is becoming a necessity
     - Distributed computing is notoriously hard
  5. Compute demand is growing faster than supply. [Chart: ML compute demand grows 35x every 18 months, with GPT-3 marking 2020, versus Moore's Law at 2x every 18 months for CPUs.] Source: https://openai.com/blog/ai-and-compute/
  6. Specialized hardware is also not enough. [Same chart, with GPU and TPU trend lines added; even they trail the 35x-every-18-months demand curve.] Source: https://openai.com/blog/ai-and-compute/
  7. [Same chart.] No way out but to distribute!
  9. Why Ray?
     - Machine learning is pervasive in every domain
     - Distributed machine learning is becoming a necessity
     - Distributed computing is notoriously hard
     Ray's vision: make distributed computing accessible to every developer.
  10. What is Ray? A cluster looks like... [Diagram of a Ray cluster, with annotations comparing its components to familiar systems: one cluster-level component is analogous to the ResourceManager (RM) in YARN, the per-node daemons are analogous to YARN NodeManager (NM) daemons or Spark executors, the driver is analogous to the Spark driver, and one component is unique to Ray.]
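     To make the cluster picture concrete, here is a minimal sketch (not from the deck) of bringing up Ray. ray.init() with no arguments starts a local, single-node Ray runtime; on a multi-node cluster whose nodes were launched with `ray start`, you would connect with address="auto" instead.

         import ray

         # Start a local, single-node Ray runtime (acts as its own head node).
         ray.init()

         # On an existing multi-node cluster (nodes launched with `ray start`),
         # connect to it instead:
         # ray.init(address="auto")

         # Show the CPUs, GPUs, and memory the cluster exposes to Ray.
         print(ray.cluster_resources())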
  11. What is Ray - API

     def read_array(file):
         # read array a from file
         return a

     def add(a, b):
         return np.add(a, b)

     a = read_array(file1)
     b = read_array(file2)
     sum = add(a, b)
  12. What is Ray - API

     # Plain Python functions...
     def read_array(file):
         # read array a from file
         return a

     def add(a, b):
         return np.add(a, b)

     a = read_array(file1)
     b = read_array(file2)
     sum = add(a, b)

     # ...become remote tasks by adding a decorator
     # (the invocation changes on the next slide):
     @ray.remote
     def read_array(file):
         # read array a from file
         return a

     @ray.remote
     def add(a, b):
         return np.add(a, b)
  13. What is Ray - API

     @ray.remote
     def read_array(file):
         # read array a from file
         return a

     @ray.remote
     def add(a, b):
         return np.add(a, b)

     ref1 = read_array.remote(file1)
     ref2 = read_array.remote(file2)
     ref = add.remote(ref1, ref2)
     sum = ray.get(ref)
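     Assembled into a self-contained, runnable form (the .npy file names and the np.load body are stand-ins, since the slide leaves read_array's implementation unspecified):

         import numpy as np
         import ray

         ray.init()

         @ray.remote
         def read_array(file):
             # Hypothetical loader; the slide elides the body.
             return np.load(file)

         @ray.remote
         def add(a, b):
             return np.add(a, b)

         # .remote() returns object refs immediately, so the two reads run in parallel.
         ref1 = read_array.remote("file1.npy")
         ref2 = read_array.remote("file2.npy")

         # Refs can be passed straight into other tasks without calling ray.get first.
         ref = add.remote(ref1, ref2)

         # ray.get blocks until the final result is available.
         total = ray.get(ref)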
  14. What is Ray - API

     class Counter(object):
         def __init__(self):
             self.value = 0

         def inc(self):
             self.value += 1
             return self.value

     c = Counter()
     c.inc()
     c.inc()
  15. What is Ray - API

     # Classes become actors by adding the same decorator:
     @ray.remote
     class Counter(object):
         def __init__(self):
             self.value = 0

         def inc(self):
             self.value += 1
             return self.value

     # Actors can also request resources, e.g. a GPU:
     @ray.remote(num_gpus=1)
     class Counter(object):
         def __init__(self):
             self.value = 0

         def inc(self):
             self.value += 1
             return self.value

     c = Counter.remote()
     ref4 = c.inc.remote()
     ref5 = c.inc.remote()
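     The same actor example, assembled into a runnable sketch; calling ray.get on the method refs retrieves the counter values in call order, since methods on a single actor execute serially:

         import ray

         ray.init()

         @ray.remote
         class Counter:
             def __init__(self):
                 self.value = 0

             def inc(self):
                 self.value += 1
                 return self.value

         c = Counter.remote()                       # launches the actor process
         refs = [c.inc.remote() for _ in range(3)]  # each call returns a ref
         print(ray.get(refs))                       # [1, 2, 3]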
  16. What is Ray - API

     c = Counter.remote()
     increment_refs = [c.inc.remote() for _ in range(5)]

     while len(increment_refs) > 0:
         return_n = 2 if len(increment_refs) > 1 else 1
         ready_refs, remaining_refs = ray.wait(
             increment_refs, num_returns=return_n, timeout=10.0)
         if len(ready_refs) > 0:
             print(ray.get(ready_refs))
         # Update the remaining ones
         increment_refs = remaining_refs

     # Prints: [1, 2], then [3, 4], then [5]
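     ray.wait returns two lists: refs whose results are ready and refs still pending. It blocks until num_returns results are ready or the timeout (in seconds) expires, which is why the loop above prints results in batches of two ([1, 2], then [3, 4]) before finishing with the final [5].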
  17. Rich ecosystem for scaling ML workloads
     Native libraries: easily scale common bottlenecks in ML workflows. Examples: Ray Tune for hyperparameter optimization, RLlib for reinforcement learning, Ray Serve for model serving, etc.
     Integrations: scale popular frameworks with Ray with minimal changes. Examples: XGBoost, TensorFlow, JAX, PyTorch, etc.
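     As a taste of the native libraries, here is a minimal Ray Tune sketch using the 1.x-era tune.run API; the quadratic objective and the grid over "x" are invented for illustration:

         from ray import tune

         def objective(config):
             # Toy objective: a quadratic in the hyperparameter "x".
             tune.report(score=(config["x"] - 3) ** 2)

         # Each value of "x" becomes its own trial, scheduled in parallel
         # across whatever CPUs/GPUs the Ray cluster provides.
         analysis = tune.run(
             objective,
             config={"x": tune.grid_search([0, 1, 2, 3, 4])},
         )

         print(analysis.get_best_config(metric="score", mode="min"))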
  18. Rich ecosystem for scaling ML workloads. [Diagram: Ray Core + Datasets as the common foundation, with libraries layered on top for data processing, training, hyperparameter tuning, reinforcement learning, and model serving.]
  19. Rich ecosystem for scaling ML workloads. [Same diagram; a small subset of the Ray ecosystem in ML.] Integrate Ray only based on your needs!
  20. Rich ecosystem for scaling ML workloads. [Same diagram, highlighting Ray Tune.] Integrate Ray Tune alone; no need to adopt the entire Ray framework.
  21. Rich ecosystem for scaling ML workloads. [Same diagram.] Challenges in scaling hyperparameter tuning?
  22. Rich ecosystem for scaling ML workloads. [Same diagram.] A unified, distributed toolkit to go end to end.
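     And for the serving end of the pipeline, a minimal Ray Serve sketch using the 1.x deployment API; the greeter deployment and its response text are invented for illustration:

         import ray
         from ray import serve

         ray.init()
         serve.start()  # starts Serve's HTTP proxy (default: 127.0.0.1:8000)

         @serve.deployment
         def greeter(request):
             return "Hello from Ray Serve!"

         greeter.deploy()  # served at http://127.0.0.1:8000/greeter

         # e.g.: requests.get("http://127.0.0.1:8000/greeter").text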
  23. Companies scaling ML with Ray:
     - Uber (Horovod on Ray): https://eng.uber.com/horovod-ray/
     - Wildlife Studios: https://www.anyscale.com/blog/wildlife-studios-serves-in-game-offers-3x-faster-at-1-10th-the-cost-with-ray
     - Ikigai Labs: https://www.ikigailabs.com/blog/how-ikigai-labs-serves-interactive-ai-workflows-at-scale-using-ray-serve
  24. Start scaling your ML workloads. Getting started:
     - Documentation (docs.ray.io): quick-start examples, reference guides, and more
     - Ray Meetup (reviving in January 2022; recordings are published to members): https://www.meetup.com/Bay-Area-Ray-Meetup/
     - Forums (discuss.ray.io): learn from and share with the broader Ray community, including the core team
     - Ray Slack: connect with the Ray team and community
     - GitHub: check out the source, file an issue, become a contributor, give us a star :) https://github.com/ray-project/ray