
A Deep Dive into Ray’s Scheduling Policy (Alex Wu, Anyscale)

In Ray, the scheduling subsystem is responsible for deciding which process executes each task. In this talk, we’ll take a look at the different features that make up Ray’s scheduling policy, in particular:

* Data locality
* Queuing
* Spillback
* Actor scheduling
* Resource borrowing


Anyscale

July 21, 2021

Transcript

  1. A Deep Dive into Ray’s Scheduling Policy. Alex Wu, Ray Core Contributor, SWE@Anyscale
  2. Goals: (Getting started) Evaluating scheduling. (Typical user) What you get for free with Ray. (Power users) Designing better Ray applications.
  3. Overview: Where does Ray fit in with other schedulers? What is scheduling, and why does it matter? Ray’s scheduler: leasing, scheduling policy, data locality.
  4. Scheduling in other systems: Linux kernel (CFS), Kubernetes, Spark.

  5. Linux kernel (CFS): a single-node, dynamic scheduler. General purpose, very low overhead, resource isolation.
  6. Kubernetes: a coarse-grained resource scheduler. Highly robust; suitable for low-throughput, coarse-grained jobs (e.g. container orchestration); great for scheduling Ray nodes (see the Ray k8s operator).
  7. Spark: centralized, static scheduling. Common in special-purpose systems (e.g. data frames); great when it works. Built-in assumptions allow for more optimization/query planning. Can be run on top of Ray with great performance (Dask on Ray, RayDP).
  8. Ray: distributed, dynamic scheduling. General purpose, low overhead, highly scalable.
  9. Scheduling in other systems:

     | System     | General purpose | Task granularity | Scalable |
     |------------|-----------------|------------------|----------|
     | Linux      | ✅              | microseconds     | 🚫       |
     | Spark      | 🚫              | seconds          | ✅       |
     | Kubernetes | ✅              | minutes          | ✅       |
     | Ray        | ✅              | milliseconds     | ✅       |
  10. Overview: Where does Ray fit in with other schedulers? What is scheduling, and why does it matter? Ray’s scheduler: leasing, policy, data locality.
  11. What is scheduling? 01: Where do we run a task/actor? 02: When do we run it? 03: How do we do it at large scale?
  12. Example: data processing with Dask on Ray. 3 PiB input tensor; apply sliding-window operations per “row”; write a 14 TiB output. https://www.jennakwon.page/2021/03/benchmarks-dask-distributed-vs-ray-for.html
  13. Ex: Data processing with Dask on Ray x2 million!!!

  14. Airflow on Ray

  15. Scale: https://github.com/ray-project/ray/tree/master/benchmarks

  16. Overview: Where does Ray fit in with other schedulers? What is scheduling, and why does it matter? Ray’s scheduler: leasing, policy, data locality.
  17. Ray architecture: a brief review

  18. (image-only slide, no transcript text)
  19. Raylet: one per node; the node’s resource manager; manages the node’s worker processes. GCS: centralized component; tracks cluster-wide properties. Worker: executes actors/tasks.
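
Since the GCS is described as the centralized component tracking cluster-wide properties, here is a minimal sketch (not from the talk) of how that view looks from user code. It assumes a local Ray installation; `ray.cluster_resources()` and `ray.available_resources()` are standard Ray APIs, but the exact dictionary contents depend on your machine.

```python
import ray

ray.init()  # start or connect to a Ray cluster; pass an address for multi-node

# All resources registered by every node's raylet, e.g. {'CPU': 8.0, ...}.
print(ray.cluster_resources())

# Resources currently free for new lease requests.
print(ray.available_resources())
```
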
  21. Gradient Descent Example

  22. @ray.remote
      def step(data, weights):
          gradients = []
          for batch in data:
              gradient_ref = calculate_gradient.remote(batch)
              gradients.append(gradient_ref)
          combined = ray.get(combine.remote(*gradients))
          return weights + combined
      Add a task graph visualization of this
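
For readers following along outside the talk, here is a minimal runnable sketch of this gradient-descent example. The deck never shows the bodies of `calculate_gradient` and `combine`, so the NumPy stand-ins below are assumptions purely for illustration.

```python
import numpy as np
import ray

ray.init()

@ray.remote
def calculate_gradient(batch):
    # Stand-in body: treat the per-column mean as the "gradient" for this batch.
    return batch.mean(axis=0)

@ray.remote
def combine(*gradients):
    # Average the per-batch gradients into a single update.
    return np.mean(gradients, axis=0)

@ray.remote
def step(data, weights):
    gradients = []
    for batch in data:
        # Each .remote() call returns an ObjectRef immediately; the scheduler
        # decides where (and when) the task actually runs.
        gradient_ref = calculate_gradient.remote(batch)
        gradients.append(gradient_ref)
    combined = ray.get(combine.remote(*gradients))
    return weights + combined

data = [np.random.rand(32, 4) for _ in range(8)]   # 8 batches of 32 rows
weights = np.zeros(4)
new_weights = ray.get(step.remote(data, weights))
print(new_weights)
```
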
  23.-26. (Task graph built up across four slides: the driver launches Calculate Gradient 0 and Calculate Gradient 1, which produce Gradient 0 and Gradient 1; these feed a Combine task, which produces the final Result.)
  27.-30. (Cluster diagram built up across four slides: the driver on Node 1 launches Calculate Gradient tasks, which are spread across Node 1 and Node 2, and finally a Combine task is added.)
  31. A distributed scheduler

  32. How do we run this? (The step() code from slide 22 is shown again.)
  33.-37. A distributed scheduler: what happens on calculate_gradient.remote(batch), built up across five slides over a diagram of Node 1 and Node 2, each with a raylet and several workers:
      1. Ask the local raylet for a worker to run the task.
      2. The raylet responds with a spillback to Node 2.
      3. Ask Node 2’s raylet where to run the task.
      4. Node 2’s raylet responds with a worker address.
      5. The workers communicate directly to execute the task.
  38. That’s a lot of requests! (The step() code from slide 22 is shown again.)
  39. Overview: Where does Ray fit in with other schedulers? What is scheduling, and why does it matter? Ray’s scheduler: leasing, policy, data locality.
  40. Leasing: worker reuse

  41.-44. Leasing: what happens when the loop `for batch in data: gradients.append(calculate_gradient.remote(batch))` issues tasks, built up across four slides over the same two-node diagram:
      1. Request a 1-CPU worker lease.
      2. The worker lease is granted.
      3. The workers communicate directly to execute the task.
      4. The worker reuses the lease to execute another task, so the whole request process is not repeated.
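
Worker reuse is transparent to application code, but you can observe it indirectly. The sketch below is not from the talk; it records the worker process ID inside each task. Because granted leases are reused, 100 short tasks typically map onto only a handful of distinct worker processes.

```python
import os
import ray

ray.init()

@ray.remote
def which_worker():
    # Each task reports the PID of the worker process that executed it.
    return os.getpid()

pids = ray.get([which_worker.remote() for _ in range(100)])

# Expect roughly one PID per CPU, not one per task, because leases are
# reused to run many tasks on the same worker.
print(f"{len(set(pids))} distinct worker processes served 100 tasks")
```
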
  45. Overview: Where does Ray fit in with other schedulers? What is scheduling, and why does it matter? Ray’s scheduler: leasing, policy, data locality.
  46. Spillback Policy

  47. Spillback. Recall our naive scheduling algorithm from 5 minutes ago: how do we pick the node?
  48.-52. Anatomy of a raylet, built up across five slides: local resources (e.g. CPU 1/1 and 0.5/1, GPU 0.5/1 and 1/1, Custom 3/4) and a worker pool; a view of cluster resources synced via the GCS (e.g. Node 2 GPU: 1/2, Node 3 Cust: 4/4); a lease request queue; and a scheduling policy that answers each request with either a lease or a spillback.
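
From the application side, the local-resource and cluster-resource bookkeeping above is driven by the requirements you attach to tasks and actors. A brief sketch follows, with the caveat that `Custom` is a hypothetical resource name for illustration; it would have to be registered when a node starts (for example with `ray start --resources='{"Custom": 4}'`).

```python
import ray

ray.init()

# Require one whole CPU for each invocation of this task.
@ray.remote(num_cpus=1)
def cpu_task():
    ...

# Fractional GPU requests (like the 0.5/1 GPU entry on the slide) are allowed.
@ray.remote(num_gpus=0.5)
def gpu_task():
    ...

# "Custom" is a hypothetical user-defined resource; if no node offers it,
# the lease request stays queued.
@ray.remote(resources={"Custom": 1})
def custom_task():
    ...

# An actor's resources are held for its lifetime, not per method call.
@ray.remote(num_cpus=2)
class Trainer:
    def train(self):
        ...
```
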
  53. Scheduling considerations: load balancing vs. locality. If we load-balance calculate_gradient, it is less likely to require object spilling and there is less interference from other workers; but combine will have to fetch the results from other nodes. Check out the Anyscale blog for more details: https://www.anyscale.com/blog (The step() code from slide 22 is shown again.)
  54. Overview: Where does Ray fit in with other schedulers? What is scheduling, and why does it matter? Ray’s scheduler: leasing, policy, data locality.
  55. Data locality

  56. Where should we run this? (The step() code from slide 22 is shown again.)
  57. Data locality: locality-aware leasing. Workers store information about the objects they “own”, and workers can send lease requests to any raylet. Solution: send lease requests to nodes which already have parts of combine’s inputs in their object store. Learn more about ownership: https://www.usenix.org/system/files/nsdi21-wang.pdf (The step() code from slide 22 is shown again.)
  58.-61. Locality-aware leasing (built up across four slides). Inside the core worker, as each calculate_gradient finishes, the object directory records where its result lives (e.g. gradients[0] on node 0, gradients[1] on node 2). When the Python code calls combine.remote(*gradients), the task moves through the task queue into the lease request queue, where the lease policy consults the object directory before sending the lease request to a raylet.
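
A small sketch (not from the talk) of what this buys you in application code: keeping the gradients as ObjectRefs when calling `combine.remote()` preserves the locality information, whereas calling `ray.get()` first would pull every object back to the caller and discard it. The stand-in task bodies are assumptions for illustration.

```python
import ray

ray.init()

@ray.remote
def calculate_gradient(batch):
    return sum(batch)          # stand-in body

@ray.remote
def combine(*gradients):
    return sum(gradients)      # stand-in body

batches = [[1, 2, 3], [4, 5, 6]]
gradients = [calculate_gradient.remote(b) for b in batches]

# Pass the ObjectRefs directly: the owner knows which nodes hold each
# gradient, so the lease policy can pick a raylet that already has them.
result_ref = combine.remote(*gradients)

# Avoid ray.get(gradients) here; it would copy every gradient back to this
# process first and throw the locality information away.
print(ray.get(result_ref))
```
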
  62. Summary. Ray can speed up your distributed applications with many optimizations, including worker reuse, resource-based scheduling, and data locality. Ray’s architecture allows it to schedule tasks with little overhead, schedule tasks dynamically, and scale to 100,000s of tasks per second. Ray can utilize or improve other distributed systems.
  63. Takeaways. Can these features improve your workload? You don’t need to be a distributed systems expert to reap the benefits of Ray: just use @ray.remote. Understanding Ray’s scheduler can help you improve your distributed applications.
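
As a reminder of how low the barrier is, the takeaway’s “just use @ray.remote” amounts to something like the snippet below (a generic example, not from the deck):

```python
import ray

ray.init()

@ray.remote
def square(x):
    return x * x

# Launch four tasks; Ray's scheduler decides where each one runs.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))   # [0, 1, 4, 9]
```
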
  64. Thank you! Questions?