
A Deep Dive into Ray’s Scheduling Policy (Alex Wu, Anyscale)

In Ray, the scheduling subsystem is responsible for deciding which process executes each task. In this talk, we’ll take a look at the different features that make up Ray’s scheduling policy, in particular:

* Data locality
* Queuing
* Spillback
* Actor scheduling
* Resource borrowing


Anyscale

July 21, 2021

Transcript

  1. A Deep Dive into Ray’s Scheduling Policy. Alex Wu, Ray Core Contributor, SWE@Anyscale
  2. Goals: (Getting started) Evaluating scheduling. (Typical user) What you get for free with Ray. (Power users) Designing better Ray applications.
  3. Overview: Where does Ray fit in with other schedulers? What is scheduling, and why does it matter? Ray’s scheduler: leasing, scheduling policy, data locality.
  4. Scheduling in other systems: Linux kernel (CFS), Kubernetes, Spark.

  5. Linux kernel (CFS): a single-node, dynamic scheduler. General purpose, very low overhead, resource isolation.
  6. Kubernetes: a coarse-grained resource scheduler. Highly robust; suitable for low-throughput, coarse-grained jobs (e.g. container orchestration); great for scheduling Ray nodes (see the Ray k8s operator).
  7. Spark: centralized, static scheduling. Common in special-purpose systems (e.g. data frames); great when it works. Built-in assumptions allow for more optimization/query planning. Can be run on top of Ray with great performance (Dask on Ray, RayDP).
  8. Ray: distributed, dynamic scheduling. General purpose, low overhead, highly scalable.
  9. Scheduling in other systems:

     | System     | General purpose | Task granularity | Scalable |
     |------------|-----------------|------------------|----------|
     | Linux      | ✅              | microseconds     | 🚫       |
     | Spark      | 🚫              | seconds          | ✅       |
     | Kubernetes | ✅              | minutes          | ✅       |
     | Ray        | ✅              | milliseconds     | ✅       |
  10. Overview: Where does Ray fit in with other schedulers? What is scheduling, and why does it matter? Ray’s scheduler: leasing, policy, data locality.
  11. What is scheduling? 01: Where do we run a task/actor? 02: When do we run it? 03: How do we do it at large scale?
  12. Example: data processing with Dask on Ray. 3 PiB input tensor; apply sliding-window operations per “row”; write a 14 TiB output. https://www.jennakwon.page/2021/03/benchmarks-dask-distributed-vs-ray-for.html
  13. Ex: Data processing with Dask on Ray x2 million!!!

  14. Airflow on Ray

  15. Scale: https://github.com/ray-project/ray/tree/master/benchmarks

  16. Overview: Where does Ray fit in with other schedulers? What is scheduling, and why does it matter? Ray’s scheduler: leasing, policy, data locality.
  17. Ray architecture: a brief review

  18. (image-only slide, no transcript text)
  19. Raylet: one per node; the node’s resource manager; manages the node’s worker processes. GCS: centralized component; tracks cluster-wide properties. Worker: executes actors/tasks.
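
Since the GCS is described as the centralized component tracking cluster-wide properties, here is a minimal sketch (not from the talk) of how that view looks from user code. It assumes a local Ray installation; `ray.cluster_resources()` and `ray.available_resources()` are standard Ray APIs, but the exact dictionary contents depend on your machine.

```python
import ray

ray.init()  # start or connect to a Ray cluster; pass an address for multi-node

# All resources registered by every node's raylet, e.g. {'CPU': 8.0, ...}.
print(ray.cluster_resources())

# Resources currently free for new lease requests.
print(ray.available_resources())
```
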
  21. Gradient Descent Example

  22. @ray.remote
      def step(data, weights):
          gradients = []
          for batch in data:
              gradient_ref = calculate_gradient.remote(batch)
              gradients.append(gradient_ref)
          combined = ray.get(combine.remote(*gradients))
          return weights + combined
      Add a task graph visualization of this
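
For readers following along outside the talk, here is a minimal runnable sketch of this gradient-descent example. The deck never shows the bodies of `calculate_gradient` and `combine`, so the NumPy stand-ins below are assumptions purely for illustration.

```python
import numpy as np
import ray

ray.init()

@ray.remote
def calculate_gradient(batch):
    # Stand-in body: treat the per-column mean as the "gradient" for this batch.
    return batch.mean(axis=0)

@ray.remote
def combine(*gradients):
    # Average the per-batch gradients into a single update.
    return np.mean(gradients, axis=0)

@ray.remote
def step(data, weights):
    gradients = []
    for batch in data:
        # Each .remote() call returns an ObjectRef immediately; the scheduler
        # decides where (and when) the task actually runs.
        gradient_ref = calculate_gradient.remote(batch)
        gradients.append(gradient_ref)
    combined = ray.get(combine.remote(*gradients))
    return weights + combined

data = [np.random.rand(32, 4) for _ in range(8)]   # 8 batches of 32 rows
weights = np.zeros(4)
new_weights = ray.get(step.remote(data, weights))
print(new_weights)
```
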
  23.-26. (Task graph built up across four slides: the driver launches Calculate Gradient 0 and Calculate Gradient 1, which produce Gradient 0 and Gradient 1; these feed a Combine task, which produces the final Result.)
  27.-30. (Cluster diagram built up across four slides: the driver on Node 1 launches Calculate Gradient tasks, which are spread across Node 1 and Node 2, and finally a Combine task is added.)
  31. A distributed scheduler

  32. How do we run this? (The step() code from slide 22 is shown again.)
  33.-37. A distributed scheduler: what happens on calculate_gradient.remote(batch), built up across five slides over a diagram of Node 1 and Node 2, each with a raylet and several workers:
      1. Ask the local raylet for a worker to run the task.
      2. The raylet responds with a spillback to Node 2.
      3. Ask Node 2’s raylet where to run the task.
      4. Node 2’s raylet responds with a worker address.
      5. The workers communicate directly to execute the task.
  38. That’s a lot of requests! (The step() code from slide 22 is shown again.)
  39. Overview: Where does Ray fit in with other schedulers? What is scheduling, and why does it matter? Ray’s scheduler: leasing, policy, data locality.
  40. Leasing: worker reuse

  41.-44. Leasing: what happens when the loop `for batch in data: gradients.append(calculate_gradient.remote(batch))` issues tasks, built up across four slides over the same two-node diagram:
      1. Request a 1-CPU worker lease.
      2. The worker lease is granted.
      3. The workers communicate directly to execute the task.
      4. The worker reuses the lease to execute another task, so the whole request process is not repeated.
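
Worker reuse is transparent to application code, but you can observe it indirectly. The sketch below is not from the talk; it records the worker process ID inside each task. Because granted leases are reused, 100 short tasks typically map onto only a handful of distinct worker processes.

```python
import os
import ray

ray.init()

@ray.remote
def which_worker():
    # Each task reports the PID of the worker process that executed it.
    return os.getpid()

pids = ray.get([which_worker.remote() for _ in range(100)])

# Expect roughly one PID per CPU, not one per task, because leases are
# reused to run many tasks on the same worker.
print(f"{len(set(pids))} distinct worker processes served 100 tasks")
```
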
  45. Overview: Where does Ray fit in with other schedulers? What is scheduling, and why does it matter? Ray’s scheduler: leasing, policy, data locality.
  46. Spillback Policy

  47. Spillback. Recall our naive scheduling algorithm from 5 minutes ago: how do we pick the node?
  48.-52. Anatomy of a raylet, built up across five slides: local resources (e.g. CPU 1/1 and 0.5/1, GPU 0.5/1 and 1/1, Custom 3/4) and a worker pool; a view of cluster resources synced via the GCS (e.g. Node 2 GPU: 1/2, Node 3 Cust: 4/4); a lease request queue; and a scheduling policy that answers each request with either a lease or a spillback.
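
From the application side, the local-resource and cluster-resource bookkeeping above is driven by the requirements you attach to tasks and actors. A brief sketch follows, with the caveat that `Custom` is a hypothetical resource name for illustration; it would have to be registered when a node starts (for example with `ray start --resources='{"Custom": 4}'`).

```python
import ray

ray.init()

# Require one whole CPU for each invocation of this task.
@ray.remote(num_cpus=1)
def cpu_task():
    ...

# Fractional GPU requests (like the 0.5/1 GPU entry on the slide) are allowed.
@ray.remote(num_gpus=0.5)
def gpu_task():
    ...

# "Custom" is a hypothetical user-defined resource; if no node offers it,
# the lease request stays queued.
@ray.remote(resources={"Custom": 1})
def custom_task():
    ...

# An actor's resources are held for its lifetime, not per method call.
@ray.remote(num_cpus=2)
class Trainer:
    def train(self):
        ...
```
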
  53. Scheduling considerations: load balancing vs. locality. If we load-balance calculate_gradient, it is less likely to require object spilling and there is less interference from other workers; but combine will have to fetch the results from other nodes. Check out the Anyscale blog for more details: https://www.anyscale.com/blog (The step() code from slide 22 is shown again.)
  54. Overview: Where does Ray fit in with other schedulers? What is scheduling, and why does it matter? Ray’s scheduler: leasing, policy, data locality.
  55. Data locality

  56. Where should we run this? (The step() code from slide 22 is shown again.)
  57. Data locality: locality-aware leasing. Workers store information about the objects they “own”, and workers can send lease requests to any raylet. Solution: send lease requests to nodes which already have parts of combine’s inputs in their object store. Learn more about ownership: https://www.usenix.org/system/files/nsdi21-wang.pdf (The step() code from slide 22 is shown again.)
  58.-61. Locality-aware leasing (built up across four slides). Inside the core worker, as each calculate_gradient finishes, the object directory records where its result lives (e.g. gradients[0] on node 0, gradients[1] on node 2). When the Python code calls combine.remote(*gradients), the task moves through the task queue into the lease request queue, where the lease policy consults the object directory before sending the lease request to a raylet.
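
A small sketch (not from the talk) of what this buys you in application code: keeping the gradients as ObjectRefs when calling `combine.remote()` preserves the locality information, whereas calling `ray.get()` first would pull every object back to the caller and discard it. The stand-in task bodies are assumptions for illustration.

```python
import ray

ray.init()

@ray.remote
def calculate_gradient(batch):
    return sum(batch)          # stand-in body

@ray.remote
def combine(*gradients):
    return sum(gradients)      # stand-in body

batches = [[1, 2, 3], [4, 5, 6]]
gradients = [calculate_gradient.remote(b) for b in batches]

# Pass the ObjectRefs directly: the owner knows which nodes hold each
# gradient, so the lease policy can pick a raylet that already has them.
result_ref = combine.remote(*gradients)

# Avoid ray.get(gradients) here; it would copy every gradient back to this
# process first and throw the locality information away.
print(ray.get(result_ref))
```
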
  62. Summary. Ray can speed up your distributed applications with many optimizations, including worker reuse, resource-based scheduling, and data locality. Ray’s architecture allows it to schedule tasks with little overhead, schedule tasks dynamically, and scale to 100,000s of tasks per second. Ray can utilize or improve other distributed systems.
  63. Takeaways. Can these features improve your workload? You don’t need to be a distributed systems expert to reap the benefits of Ray: just use @ray.remote. Understanding Ray’s scheduler can help you improve your distributed applications.
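
As a reminder of how low the barrier is, the takeaway’s “just use @ray.remote” amounts to something like the snippet below (a generic example, not from the deck):

```python
import ray

ray.init()

@ray.remote
def square(x):
    return x * x

# Launch four tasks; Ray's scheduler decides where each one runs.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))   # [0, 1, 4, 9]
```
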
  64. Thank you! Questions?