A Deep Dive into Ray’s Scheduling Policy (Alex Wu, Anyscale)

In Ray, the scheduling subsystem decides which process executes each task. In this talk, we’ll take a look at the different features that form Ray’s scheduling policy, in particular:

* Data locality
* Queuing
* Spillback
* Actor scheduling
* Resource borrowing

Anyscale

July 21, 2021

Transcript

  1. (Getting Started) Evaluating Scheduling (Typical User) What you get for

    free with Ray (Power users) Designing better Ray applications Goals
  2. Where does Ray fit in with other schedulers? What is

    scheduling, why does it matter? Ray’s Scheduler • Leasing • Scheduling Policy • Data Locality Overview
  3. • General purpose • Very low overhead • Resource isolation

    Single node, dynamic scheduler Linux Kernel (CFS)
  4. • Highly robust • Suitable for low throughput, coarse grain

    jobs ◦ e.g. container orchestration • Great for scheduling ray nodes ◦ See the Ray k8s operator Coarse grain resource scheduler Kubernetes
  5. • Common in special purpose systems (e.g. data frames). ◦

    Great when it works. • Built in assumptions allow for more optimizations/query planning. • Can be run on top of Ray with great performance. ◦ Dask on Ray ◦ RayDP Centralized, Static Scheduling Spark
  6. Scheduling in other systems

                 General purpose   Task granularity   Scalable
     Linux       ✅                 microseconds       🚫
     Spark       🚫                 seconds            ✅
     Kubernetes  ✅                 minutes            ✅
     Ray         ✅                 milliseconds       ✅
  7. Where does Ray fit in with other schedulers? What is

    scheduling, why does it matter? Ray’s Scheduler • Leasing • Policy • Data Locality Overview
  8. 01 02 03 Where do we run a task/actor? When

    do we run it? How do we do it at large scale? What is scheduling
  9. • 3 PiB input tensor • Apply sliding-window operations per

    “row” • Write a 14 TiB output https://www.jennakwon.page/2021/03/benchmarks-dask-distributed-vs-ray-for.html Ex: Data processing with Dask on Ray
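For context, a minimal sketch of what "Dask on Ray" looks like in code, assuming the dask package is installed alongside Ray; the array here is a tiny placeholder rather than the 3 PiB tensor from the referenced benchmark, and ray_dask_get is the Dask-on-Ray scheduler hook exposed under ray.util.dask.

```python
import ray
import dask
import dask.array as da
from ray.util.dask import ray_dask_get

ray.init()
dask.config.set(scheduler=ray_dask_get)  # route Dask's task graph onto Ray

# Tiny stand-in array; the benchmark above uses a 3 PiB input tensor.
x = da.random.random((10_000, 1_000), chunks=(1_000, 1_000))
result = x.mean(axis=1).mean().compute()  # stands in for the per-"row" sliding-window ops
print(result)
```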
  10. Where does Ray fit in with other schedulers? What is

    scheduling, why does it matter? Ray’s Scheduler • Leasing • Policy • Data Locality Overview
  11. • One per node • Resource manager • Manages worker

    processes • Centralized Component • Tracks cluster-wide properties • Executes actor/task
  12. • One per node • Resource manager • Manages worker

    processes • Centralized Component • Tracks cluster-wide properties • Executes actor/task
  13. @ray.remote def step(data, weights): gradients = [] for batch in

    data: gradient_ref = calculate_gradient.remote(batch) gradients.append(gradient_ref) combined = ray.get(combine.remote(*gradients)) return weights + combined Add a task graph visualization of this
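The slide's code, flattened by the transcript above, reconstructed as a runnable sketch. calculate_gradient and combine are placeholder implementations added here so the example runs end to end; the slide only defines step().

```python
import ray

ray.init()

@ray.remote
def calculate_gradient(batch):
    return sum(batch)        # stand-in for real gradient computation

@ray.remote
def combine(*gradients):
    return sum(gradients)    # stand-in for e.g. averaging gradients

@ray.remote
def step(data, weights):
    gradients = []
    for batch in data:
        gradient_ref = calculate_gradient.remote(batch)  # one task per batch
        gradients.append(gradient_ref)
    # combine() receives the ObjectRefs; Ray resolves them before execution
    combined = ray.get(combine.remote(*gradients))
    return weights + combined

print(ray.get(step.remote([[1, 2], [3, 4]], weights=0)))
```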
  14. Node 1 Node 2 Driver Calculate Gradient Calculate Gradient Calculate

    Gradient Calculate Gradient Calculate Gradient
  15. Node 1 Node 2 Driver Calculate Gradient Calculate Gradient Calculate

    Gradient Calculate Gradient Calculate Gradient Combine
  16. @ray.remote def step(data, weights): gradients = [] for batch in

    data: gradient_ref = calculate_gradient.remote(batch) gradients.append(gradient_ref) combined = ray.get(combine.remote(*gradients)) return weights + combined How do we run this?
  17. calculate_gradient.remote(batch) 1. Ask raylet for a worker to run the

    task A distributed scheduler Node 1 Node 2 Worker Raylet 1 Raylet Worker Worker Worker Worker
  18. calculate_gradient.remote(batch) 1. Ask raylet for a worker to run the

    task 2. Raylet responds with spillback to node 2 A distributed scheduler Node 1 Node 2 Worker Raylet 1 Raylet Worker Worker Worker Worker 2
  19. calculate_gradient.remote(batch) 1. Ask raylet for a worker to run the

    task 2. Raylet responds with spillback to node 2 3. Ask node 2 raylet where to run the task A distributed scheduler Node 2 Worker Node 1 Raylet 1 Raylet Worker Worker Worker Worker 2 3
  20. calculate_gradient.remote(batch) 1. Ask raylet for a worker to run the

    task 2. Raylet responds with spillback to node 2 3. Ask node 2 raylet where to run the task 4. Raylet responds with a worker address A distributed scheduler Node 1 Node 2 Worker Raylet 1 Raylet Worker Worker Worker Worker 2 3 4
  21. calculate_gradient.remote(batch) 1. Ask raylet for a worker to run the

    task 2. Raylet responds with spillback to node 2 3. Ask node 2 raylet where to run the task 4. Raylet responds with a worker address 5. Workers directly communicate to execute task A distributed scheduler Node 1 Node 2 Worker Raylet 1 Raylet Worker Worker Worker Worker 2 3 4 5
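To make the walkthrough concrete, here is a deliberately simplified Python sketch of the grant-or-spillback decision described in steps 1-5. This is not Ray's actual raylet code; the names and structure are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass

@dataclass
class NodeResources:
    available: dict  # e.g. {"CPU": 4.0, "GPU": 1.0} still free on the node

    def can_fit(self, request: dict) -> bool:
        return all(self.available.get(k, 0.0) >= v for k, v in request.items())

def grant_or_spillback(local, cluster_view, request):
    """Grant locally if the node has room; otherwise spill back to a feasible
    node from the cluster view (which the raylet learns from the GCS)."""
    if local.can_fit(request):
        return ("grant", None)
    for node_id, node in cluster_view.items():
        if node.can_fit(request):
            return ("spillback", node_id)
    return ("queue", None)  # nothing fits right now; keep the request queued

# Example: node 1 has no free CPU, so the request spills back to node 2.
node1 = NodeResources({"CPU": 0.0})
cluster = {"node2": NodeResources({"CPU": 4.0})}
print(grant_or_spillback(node1, cluster, {"CPU": 1.0}))  # ('spillback', 'node2')
```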
  22. @ray.remote def step(data, weights): gradients = [] for batch in

    data: gradient_ref = calculate_gradient.remote(batch) gradients.append(gradient_ref) combined = ray.get(combine.remote(*gradients)) return weights + combined That’s a lot of requests!
  23. Where does Ray fit in with other schedulers? What is

    scheduling, why does it matter? Ray’s Scheduler • Leasing • Policy • Data Locality Overview
  24. for batch in data: gradient_ref = calculate_gradient.remote(batch) gradients.append(gradient_ref) 1. Request

    a 1 CPU Lease Leasing Node 1 Node 2 Worker Raylet Raylet Worker Worker Worker Worker 1
  25. for batch in data: gradient_ref = calculate_gradient.remote(batch) gradients.append(gradient_ref) 1. Request

    a 1 CPU Lease 2. Worker lease is granted Leasing Node 1 Node 2 Worker Raylet Raylet Worker Worker Worker Worker 1 2
  26. for batch in data: gradient_ref = calculate_gradient.remote(batch) gradients.append(gradient_ref) 1. Request

    a 1 CPU Lease 2. Worker lease is granted 3. Workers directly communicate to execute task Leasing Node 1 Node 2 Worker Raylet Raylet Worker Worker Worker Worker 1 2 3
  27. for batch in data: gradient_ref = calculate_gradient.remote(batch) gradients.append(gradient_ref) 1. Request

    a 1 CPU Lease 2. Worker lease is granted 3. Workers directly communicate to execute task 4. Workers reuse the lease to execute another task • Don’t repeat the whole process! Leasing Node 1 Node 2 Worker Raylet Raylet Worker Worker Worker Worker 1 2 3 4
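From the application's point of view, lease reuse is transparent. A hedged sketch of the pattern it helps, assuming the default resource shape of one CPU per task, might look like this:

```python
# Many small tasks with the same resource shape are submitted in a loop;
# granted worker leases are reused for queued tasks of that shape rather
# than renegotiating a worker with the raylet for every single call.
import ray

ray.init()

@ray.remote  # default: 1 CPU per task
def calculate_gradient(batch):
    return sum(batch)  # stand-in for real work

batches = [[1, 2, 3]] * 1000
gradients = [calculate_gradient.remote(b) for b in batches]
print(sum(ray.get(gradients)))
```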
  28. Where does Ray fit in with other schedulers? What is

    scheduling, why does it matter? Ray’s Scheduler • Leasing • Policy • Data Locality Overview
  29. Anatomy of a Raylet CPU 1/1, 0.5/1 GPU 0.5/1, 1/1

    Custom 3/4 Local Resources Worker Pool
  30. Anatomy of a Raylet CPU 1/1, 0.5/1 GPU 0.5/1, 1/1

    Custom 3/4 Local Resources Worker Pool GCS
  31. Anatomy of a Raylet CPU 1/1, 0.5/1 GPU 0.5/1, 1/1

    Custom 3/4 Local Resources Cluster Resources Node 2 GPU: 1/2 Node 3 Cust: 4/4 Worker Pool GCS
  32. Anatomy of a Raylet Lease Request Queue CPU 1/1, 0.5/1

    GPU 0.5/1, 1/1 Custom 3/4 Local Resources Worker Pool GCS Cluster Resources Node 2 GPU: 1/2 Node 3 Cust: 4/4
  33. Anatomy of a Raylet Lease Request Queue Scheduling Policy CPU

    1/1, 0.5/1 GPU 0.5/1, 1/1 Custom 3/4 Local Resources Worker Pool GCS Lease or Spillback Cluster Resources Node 2 GPU: 1/2 Node 3 Cust: 4/4
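The resource bookkeeping on these slides (CPU, GPU, and custom resources tracked per node, plus a cluster-wide view from the GCS) maps onto the user-facing resource arguments of @ray.remote. A small illustrative example follows; the "Custom" resource name and the amounts are made up for the sketch.

```python
import ray

# Advertise resources for this node; num_gpus=1 makes the GPU task feasible
# even on a machine without a physical GPU, purely for illustration.
ray.init(num_gpus=1, resources={"Custom": 4})

@ray.remote(num_cpus=1)
def cpu_task():
    return "runs on any node with a free CPU"

@ray.remote(num_gpus=0.5)                 # fractional GPU requests are allowed
def gpu_task():
    return "runs on a node with half a GPU free"

@ray.remote(resources={"Custom": 1})      # scheduled only where "Custom" is available
class CustomActor:
    def ping(self):
        return "ok"

print(ray.get([cpu_task.remote(), gpu_task.remote()]))
actor = CustomActor.remote()
print(ray.get(actor.ping.remote()))
```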
  34. If we load balance calculate_gradient • Less likely to require

    object spilling • Less interference from other workers But… • combine will have to fetch the results from other nodes Check out the Anyscale blog for more details! https://www.anyscale.com/blog Load balancing vs locality Scheduling Considerations @ray.remote def step(data, weights): gradients = [] for batch in data: gradient_ref = calculate_gradient.remote(batch) gradients.append(gradient_ref) combined = ray.get(combine.remote(*gradients)) return weights + combined
  35. Where does Ray fit in with other schedulers? What is

    scheduling, why does it matter? Ray’s Scheduler • Leasing • Policy • Data Locality Overview
  36. @ray.remote def step(data, weights): gradients = [] for batch in

    data: gradient_ref = calculate_gradient.remote(batch) gradients.append(gradient_ref) combined = ray.get(combine.remote(*gradients)) return weights + combined Where should we run this?
  37. Locality aware leasing • Workers store information about objects they

    “own”. • Workers can send lease requests to any raylet Solution • Send lease requests to nodes which already have parts of combined in their object store Data Locality Learn more about ownership: https://www.usenix.org/system/files/nsdi21-wang.pdf @ray.remote def step(data, weights): gradients = [] for batch in data: gradient_ref = calculate_gradient.remote(batch) gradients.append(gradient_ref) combined = ray.get(combine.remote(*gradients)) return weights + combined
  38. Locality Aware Leasing Core Worker Python combine.remote(*gradients) gradients[0] node 0

    gradients[0] node 0 ... ... gradients[1] node 2 Object Directory calculate_gradient finishes
  39. Locality Aware Leasing Core Worker Python combine.remote(*gradients) Task Queue gradients[0]

    node 0 gradients[0] node 0 ... ... gradients[1] node 2 Object Directory calculate_gradient finishes
  40. Locality Aware Leasing Core Worker Python combine.remote(*gradients) Task Queue gradients[0]

    node 0 gradients[0] node 0 ... ... gradients[1] node 2 Object Directory Lease Request Queue Lease Policy calculate_gradient finishes
  41. Locality Aware Leasing Core Worker Python combine.remote(*gradients) Task Queue gradients[0]

    node 0 gradients[0] node 0 ... ... gradients[1] node 2 Object Directory Lease Request Queue Lease Policy calculate_gradient finishes lease request
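One way to picture the lease policy in these slides is "send the lease request to the node that already holds the most bytes of the task's arguments." The sketch below is a conceptual illustration under that assumption, not Ray's actual lease-policy code; every name in it is hypothetical.

```python
from collections import defaultdict

def pick_lease_target(arg_locations, fallback_node):
    """arg_locations: list of (node_id, object_size_bytes) pairs, one per
    argument, taken from the owner's object directory."""
    bytes_per_node = defaultdict(int)
    for node_id, size in arg_locations:
        bytes_per_node[node_id] += size
    if not bytes_per_node:
        return fallback_node  # no located arguments yet; fall back to the local raylet
    return max(bytes_per_node, key=bytes_per_node.get)

# Example: gradients[0] lives on node 0, gradients[1] (larger) on node 2,
# so the combine lease request is sent to node 2.
print(pick_lease_target([("node0", 2**20), ("node2", 3 * 2**20)], "node1"))
```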
  42. Ray can speed up your distributed applications with many optimizations

    including • Worker reuse • Resource-based scheduling • Data Locality Ray’s architecture allows it to • Schedule tasks with little overhead • Dynamically schedule tasks • Scale to 100,000s of tasks/s Ray can utilize or improve other distributed systems Summary
  43. Can these features improve your workload? You don’t need to

    be a distributed systems expert to reap the benefits of Ray • Just use @ray.remote Understanding Ray’s scheduler can help you improve your distributed applications Takeaways