
Ray Internals: Object Management with the Ownership Model (Stephanie Wang & Yi Cheng, Anyscale)

In this talk, we'll do a deep dive into Ray's distributed object management layer. We'll explain the Ray execution model and the basics behind the Ray distributed object store. Next, we'll describe the challenges with achieving both performance and reliability for object management. We'll present our solution to this problem, which is based on a novel concept called ownership that ensures object metadata consistency with low overhead. Finally, we'll present some exciting upcoming work on how to extend ownership to better support recent use cases.

Anyscale

July 16, 2021

Transcript

  1. Ray Internals: Object Management
    with the
    Ownership Model
    Stephanie Wang
    Yi Cheng


  2. Applications
    Why ownership in Ray 1.0?


  3. Application: Model serving
    1. Tasks are dynamically generated
    2. Tasks in the 10s of ms
    3. Requires efficient data movement
    [Diagram: requests arrive at a Router, which preprocesses image
    data and dispatches it to Model workers.]

  4. Application: Model serving
    Router becomes a bottleneck because it
    must copy all preprocessed images to the
    workers.


  5. Application: Model serving
    Passing ObjectRefs instead of data avoids the Router
    bottleneck, but latency still suffers from a centralized
    design.


  6. Application: Model serving
    With ownership, application gets automatic and
    decentralized memory management for image data.
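    To make the ObjectRef variant concrete, here is a minimal Ray sketch
    (our illustration, not the talk's code; `preprocess` and `predict`
    are stand-ins):

    import ray

    ray.init()

    @ray.remote
    def preprocess(request):
        # Stand-in for decoding/preprocessing image bytes.
        return [request] * 1000

    @ray.remote
    def predict(image):
        # `image` is fetched straight from distributed memory on this
        # node, not copied through the router process.
        return len(image)

    # The "router" holds only small ObjectRefs and forwards them to models.
    refs = [preprocess.remote(i) for i in range(4)]
    print(ray.get([predict.remote(r) for r in refs]))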


  7. Evaluation: Online video processing
    1. Tasks in the milliseconds
    2. Complex data dependencies
    3. Pipelined parallelism
    [Diagram: a per-frame task graph. Each frame (frame0, frame1,
    frame2, frame3, ...) flows through Decode, Flow, Cumulative Sum,
    Smooth, and Sink tasks, with transform/transform' tasks and state
    dependencies linking consecutive frames. Legend: invocation,
    task (RPC), state dependency.]
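    As an illustration of this structure (stage names and bodies are
    placeholders, not the benchmark code), a stateful per-frame pipeline
    in Ray looks roughly like:

    import ray

    ray.init()

    @ray.remote
    def decode(i):
        return f"frame{i}"

    @ray.remote
    def flow(frame, prev_state):
        # Consumes the current frame plus state from the previous frame,
        # creating the cross-frame dependencies shown in the diagram.
        return (frame, prev_state)

    state = None
    results = []
    for i in range(60):
        frame = decode.remote(i)           # frames decode in parallel
        state = flow.remote(frame, state)  # stateful stage pipelines across frames
        results.append(state)
    ray.get(results[-1])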


  8. Evaluation: Online video processing (60 videos)
    Centralized = Ray modified with writes to a
    centralized metadata store


  9. Evaluation: Online video processing (60 videos)
    TODO: Add leases plot


  10. Evaluation: Online video processing (60 videos)
    Latency with ownership is lower because each
    video has a different owner.


  11. Evaluation: Online video processing with failures
    Recovery when the owner is intact, with lineage
    reconstruction.


  12. Evaluation: Online video processing with failures
    Recovery from owner failure using
    application-level checkpoints to bound
    re-execution.


  13. [Video demo: "Live input video" alongside "Stabilized video".]


  14. An overview of ObjectRefs in Ray
    Combining distributed memory + futures


  15. RPC model
    o1 = f()
    o2 = f()
    o3 = add(o1, o2)
    [Diagram: the driver calls f() twice and add(o1, o2) as RPCs on
    Worker 1 and Worker 2; every result is shipped back to the driver,
    and o1 and o2 must be re-sent for add.]
    Problems:
    • Data movement
    • Parallelism

  16. Data movement: RPC model + distributed memory
    Distributed memory: Ability to reference data stored in the
    memory of a remote process.
    • Application can pass by reference
    • System manages data movement
    [Diagram: o1 and o2 stay in the workers' memory; only references
    flow through the driver, and o3=add(o1,o2) fetches the values
    directly between workers.]

  17. Parallelism: RPC model + futures
    Futures: Ability to reference data that has not yet been computed.
    • Application can specify parallelism and data dependencies
    • System manages task scheduling
    [Diagram: the driver invokes o1=f() and o2=f() concurrently on
    Worker 1 and Worker 2, then add(o1,o2) once both futures resolve.]

  18. ObjectRefs in Ray: distributed memory + futures
    • Performance: System handles data movement and parallelism
    • Generality: RPC-like interface (data is immutable). Application
      does not specify when or where computation should execute.
    [Diagram: o1=f() and o2=f() run in parallel on two workers;
    add(o1,o2) runs where the system chooses, and o3 is fetched only
    when needed.]
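    The slide's pseudocode maps directly onto the Ray API; a minimal
    runnable version:

    import ray

    ray.init()

    @ray.remote
    def f():
        return 1

    @ray.remote
    def add(a, b):
        return a + b

    o1 = f.remote()          # an ObjectRef: a future for f's result
    o2 = f.remote()          # both f() calls may run in parallel
    o3 = add.remote(o1, o2)  # ObjectRefs passed by reference; Ray moves the data
    assert ray.get(o3) == 2  # dereference: block until computed, then fetch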


  19. System challenges and requirements
    What is the right architecture for
    managing ObjectRefs?


  20. System requirements
    For generality, the system must impose low overhead.
    Analogy: gRPC can execute millions of tasks/s. Can we do the same
    for ObjectRefs in Ray?
    Goal: Build a system that guarantees fault tolerance with low task
    overhead for ObjectRefs.


  21. Challenge: ObjectRefs introduce distributed shared state
    [Diagram: the driver invokes f() twice and add(o1,o2); o1 and o2
    are data dependencies of add. Legend: invocation, task (RPC),
    data dependency.]

  22. Challenge: ObjectRefs introduce distributed shared state
    Multiple processes refer to the same value.
    Dereferencing an ObjectRef requires coordination among:
    1. The process that specifies how the value is created and used.
    2. The process that creates the value.
    3. The process that uses the value.
    4. The physical location of the value.
    [Diagram: copies of the o1 reference held by the driver, f(), and
    add(o1,o2).]

  23. System requirements
    Requirements for dereferencing a value:
    • Retrieval: The location of the value.
    • Garbage collection: Whether the value is referenced.
    Requirements in the presence of failures:
    • Detection: The location of the task that returns the value.
    • Recovery: A description of the task and its dependencies.
    • Persistence: Metadata should survive failures.

  24. System requirements
    Requirements for dereferencing a value:
    • Retrieval: The location of the value.
    • Garbage collection: Whether the value is referenced.
    Requirements in the presence of failures:
    • Detection: The location of the task that returns the value.
    • Recovery: A description of the task and its dependencies.
    • Persistence: Metadata should survive failures.
    Challenge: Recording this metadata while preserving latency and
    throughput for dynamic and fine-grained tasks.
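    The retrieval and garbage-collection requirements surface directly
    in the API. A minimal sketch using standard Ray semantics:

    import ray

    ray.init()

    @ray.remote
    def make():
        return bytes(10_000_000)

    ref = make.remote()
    ray.get(ref)  # retrieval: the system resolves the value's location
    del ref       # garbage collection: once no process holds a reference,
                  # the object can be freed from the distributed object store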


  25. Existing solutions
    Architecture: Leases (decentralized, Ray)
    • Coordination: Workers coordinate. For example, use leases to
      detect a task failure.
    • Performance: Asynchronous metadata updates. Scale by adding more
      worker nodes.
    Architecture: Centralized master
    • Coordination: Master records all metadata updates and handles
      all failures.
    • Performance: Can scale through sharding, but high overhead due
      to synchronous updates.

  26. Option 1: A centralized architecture (Dask, Spark, etc.)
    Coordination: Master records all metadata updates and handles all
    failures.
    Performance: Can scale through sharding, but high overhead for
    short tasks due to synchronous updates.
    [Diagram: Driver and Workers all report to a central Master.]

  27. Option 2: Decentralized leases (Ray)
    Coordination: Workers coordinate (e.g., by acquiring leases on
    task execution).
    Performance: Asynchronous metadata updates. Scale by adding more
    worker nodes.
    [Diagram: Driver and Workers coordinate through a per-node lease
    manager.]

  28. The ownership model
    Achieving fault tolerance without giving
    up performance.


  29. Our approach: Ownership
    Existing solutions do not take advantage of the inherent structure
    of a Ray application.
    1. Task graphs are hierarchical.
    2. An ObjectRef is often passed within the scope of the caller
       ("passing downwards").
    [Diagram: the driver invokes f() twice and add(o1,o2), passing o1
    and o2 down into add.]

  30. Our approach: Ownership
    Existing solutions do not take advantage of the inherent structure
    of a Ray application.
    1. Task graphs are hierarchical.
    2. An ObjectRef is often passed within the scope of the caller
       ("passing downwards").
    Insight: By leveraging the structure of Ray applications, we can
    decentralize without requiring expensive coordination.
    [Diagram: same task graph as the previous slide.]

  31. Our approach: Ownership
    Insight: By leveraging the structure of Ray applications, we can
    decentralize without requiring expensive coordination.
    Architecture: Ownership. The worker that calls a task owns the
    returned ObjectRef.
    • Failure handling: Each worker is a "centralized master" for the
      objects that it owns.
    • Performance: No additional writes on the critical path of task
      execution. Scaling through nested function calls.
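    A small illustrative sketch of "scaling through nested function
    calls" (function names are ours): each branch() worker owns the
    ObjectRefs for its own leaf() calls, so no single process tracks
    every object.

    import ray

    ray.init()

    @ray.remote
    def leaf(i):
        return i

    @ray.remote
    def branch(n):
        # This worker calls leaf(), so it owns those ObjectRefs and
        # their metadata; the driver never sees them.
        return sum(ray.get([leaf.remote(i) for i in range(n)]))

    # The driver owns only these four top-level ObjectRefs.
    totals = ray.get([branch.remote(10) for _ in range(4)])
    print(totals)  # [45, 45, 45, 45]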


  32. Ownership: Challenges
    ● Failure recovery
    ○ Recovering a lost worker
    ○ Recovering a lost owner
    ● Garbage collection and memory safety


  33. Ownership: Challenges
    • Failure recovery
    • Recovering a lost worker
    • Recovering a lost owner
    • Garbage collection and memory safety


  34. Task scheduling
    [Diagram: worker A on Node 1 calls B() -> X and C(X) -> Y. A's
    ownership table has columns Obj / Task / Val / Loc with rows
    X: B() and Y: C(X). Nodes 2 and 3 each run a worker with a local
    object store.]

  35. Task scheduling
    A task's pending location is written locally at the owner.
    [Diagram: (1) the owner schedules B to Node 2; (2) X's pending
    location (N2) is recorded in the owner's local table.]

  36. Distributed memory management
    Owner tracks locations of objects stored in distributed memory.
    [Diagram: (3) B stores X in Node 2's object store; (4) Node 2
    reports "X: N2" to the owner; (5) the owner updates its table
    entry for X to Val = *X, Loc = N2.]

  37. Task scheduling with dependencies
    [Diagram: the owner schedules C(X) to Node 3, passing X's location
    (N2); Node 3 records X's owner (W1) and fetches X's value from
    Node 2's object store.]

  38. Worker failure
    Reference holders only need to check whether the owner is alive.
    [Diagram: the worker executing C(X) -> Y fails; the owner's table
    still records X at N2 and Y pending at N3.]

  39. Worker recovery
    Owner coordinates lineage reconstruction.
    [Diagram: using the lineage in its table (X <- B(), Y <- C(X)), the
    owner resubmits the lost task on a new node (Node 4) and updates
    the object's location (*X at N4).]
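    From the application's point of view, lineage reconstruction is
    transparent. A sketch of the guarantee, with illustrative function
    names:

    import ray

    ray.init()

    @ray.remote
    def b():
        return [0] * 1_000_000  # X: a large value kept in the object store

    @ray.remote
    def c(x):
        return len(x)           # Y: consumes X

    x = b.remote()  # this driver owns X and records its lineage (task b)
    y = c.remote(x)
    # If the node holding X's value dies before c reads it, the owner
    # resubmits b() from the lineage; ray.get(y) still returns 1000000.
    print(ray.get(y))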


  40. Owner failure
    [Diagram: Worker 1, the owner of X and Y, fails. Node 3 still
    holds a reference to X (owner: W1), and X's value still sits in
    Node 2's object store.]

  41. Owner recovery
    References fate-share with the object's owner.
    [Diagram: after the owner dies, X is freed from Node 2's object
    store and Node 3's dangling reference to X (owner: W1) fails.]

  42. Owner recovery
    References fate-share with the object's owner.
    [Diagram: the failed owner A was itself invoked by another process
    ("A's owner"), which holds the lineage for A and therefore for
    A's subgraph (B, C, X, Y).]

  43. Owner recovery
    Leveraging the application's hierarchical structure: the owner of
    A recovers A.
    [Diagram: A's owner resubmits A on a new node (Node 4); the new A
    re-executes B() and C(X), rebuilding its ownership table entries
    for X and Y.]

  44. Ownership transfer
    Addressing the limitations of ownership


  45. Limitations of ownership
    Ownership model: Whoever creates an ObjectRef owns the metadata.
    + Using a single owner guarantees consistency and low latency
    + Decentralize the system according to the application structure
    + Use lineage reconstruction to recover from failures
    BUT:
    - Ray’s flexibility makes it difficult to apply lineage reconstruction in all
    scenarios
    - Objects (and their lineage) fate-share with their owner
    - This is fine if ObjectRefs are only passed downwards, but what if
    they’re not?


  46. Limitations of ownership
    Ownership model: Whoever creates an ObjectRef owns the metadata.
    + Using a single owner guarantees consistency and low latency
    + Decentralize the system according to the application structure
    + Use lineage reconstruction to recover from failures
    BUT:
    - Ray’s flexibility makes it difficult to apply lineage reconstruction in all
    scenarios
    - Objects (and their lineage) fate-share with their owner
    - This is fine if ObjectRefs are only passed downwards, but what if
    they’re not?


  47. Yi’s content starts here
    - Problem statement
    - What happens if the key assumption that ObjectRefs get passed
    downward is broken?
    - Key use cases where this happens
    - Present solution + (briefly, optionally) 1 alternative
    - Finish with “we’re looking for feedback”, summary of other open
    questions (fault tolerance?)


  48. Ownership transfer
    • Where does the issue come from?
    • Real world use cases
    • Designs: when & how


  49. Ownership transfer
    • Where does the issue come from?
    • Real world use cases
    • Designs: when & how


  50. Where does the issue come from?
    The key assumption: an object reference is passed downwards.
    x = f.remote()
    g.remote(x)
    • What if the worker running f fails? OK
    • What if the driver fails? OK
    • What if the assumption is broken?
    [Diagram: the driver invokes f() and g() and owns X. Legend:
    invocation, worker, owner, object.]

  51. Upward pattern
    @ray.remote
    def f():
        x = ray.put("Hello World")  # `x` is owned by the worker
        return [x]

    x_list = f.remote()  # `x_list` is owned by the driver
    x = ray.get(ray.get(x_list)[0])
    • What if the driver exits? OK
    • What if the worker running f exits? NOT OK:
      ○ The driver will fail to get the inner object.
      ○ The inner object cannot be reconstructed.
    [Diagram: f() passes the inner object x upwards to the driver
    inside x_list = [x], while the worker still owns x. Legend:
    invocation, worker, owner, object.]
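    One way to sidestep the upward pattern (our suggestion, not from
    the slides) is to return the value itself, since a task's return
    object is owned by its caller:

    import ray

    ray.init()

    @ray.remote
    def f():
        return "Hello World"  # the return object is owned by the caller

    x_ref = f.remote()  # the driver owns this object outright
    x = ray.get(x_ref)  # survives the worker that ran f() exiting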


  52. Lateral pattern
    x = f.remote()  # `x` is owned by the driver
    actor = Actor.options(lifetime="detached").remote()
    actor.g.remote([x])  # `actor` doesn't own `x`
    • What if the worker running f fails? OK
    • What if the driver exits? NOT OK:
      ○ The actor won't be able to access x any more.
      ○ x can't be reconstructed because the owner died.
    [Diagram: the driver passes x laterally into g() on a detached
    actor that can outlive the driver. Legend: invocation, worker,
    owner, object.]

  53. Ownership transfer
    • Where does the issue come from?
    • Real world use cases
    • Designs: when & how


  54. Why are these cases important?
    Or: what happens if we don't handle them?
    - Scaling down a cluster becomes hard
    - Zombie processes can't be removed
    - Objects can't be reconstructed after failures

  55. Real world use case - RayDP
    [Diagram: data-processing workers each produce a data partition
    that they own; the driver holds a list of those partitions (an
    upward pattern). Legend: invocation, worker, owner, object.]

  56. Real world use case
    [Diagram: the driver creates a tensor object and asks a detached
    actor to cache it; the driver still owns the object (a lateral
    pattern). Legend: invocation, worker, owner, object.]
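    A minimal sketch of this use case (class and function names are
    ours): the detached actor caches a reference it does not own.

    import ray

    ray.init()

    @ray.remote
    class TensorCache:
        def __init__(self):
            self.refs = []

        def add(self, refs):
            self.refs.extend(refs)

    @ray.remote
    def create_tensor():
        return [0.0] * 1024

    cache = TensorCache.options(name="cache", lifetime="detached").remote()
    t = create_tensor.remote()       # the driver owns this ObjectRef
    ray.get(cache.add.remote([t]))   # pass the ref (nested in a list) laterally
    # If the driver now exits, the cached ref in the detached actor
    # becomes unusable: the object fate-shares with its owner.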


  57. Ownership transfer
    • Where does the issue come from?
    • Real world use cases
    • Designs: when and how

    View full-size slide

  58. Two questions to answer
    • When to do the transfer, and how to do the transfer
    • The mechanism needs to be easy to use
    • It should introduce as little overhead as possible
    • It should cover the most common cases

  59. When to transfer
    ● Manually transfer the ownership (see the sketch below)
    ● Automatically transfer by detecting a scope change
    ○ Detect upward transfer
    ○ Detect lateral transfer
    ■ To cover most cases: transfer if either the owner or the
    receiver is a detached actor
    ■ Ideal solution: transfer whenever the reference crosses into a
    different ownership scope
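    For the manual option, here is a hedged sketch of explicitly
    assigning ownership at object creation. Ray's `ray.put` later
    gained an experimental `_owner` argument along these lines, but
    treat the exact signature below as illustrative rather than
    authoritative:

    import ray

    ray.init()

    @ray.remote
    class TensorCache:
        def ping(self):
            return "ok"

    cache = TensorCache.options(name="cache", lifetime="detached").remote()

    # Illustrative/experimental: create the object with the detached
    # actor as its owner, so the reference can outlive the driver.
    x = ray.put([0.0] * 1024, _owner=cache)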


  60. How to transfer
    x = f.remote()  # `x` is owned by the driver
    actor = Actor.options(lifetime="detached").remote()
    actor.g.remote([x])  # `actor` doesn't own `x`
    [Diagram: before transfer, the Driver and the Actor both hold the
    reference (x, driver); only the Driver owns the entry for x in the
    Plasma Store.]

  61. How to transfer
    ● Ownership sharing in the reference counting layer
    ○ Most changes are confined to the reference counting layer
    ○ Avoids a physical copy of the object
    [Diagram: after transfer, the Driver holds (x, driver) and the
    Actor holds (x, actor); both own the entry for x in the Plasma
    Store.]

  62. How to transfer - alternative
    ● Transfer the ownership to the GCS
    ○ Can be made highly available if the GCS supports HA
    ○ The GCS potentially becomes a bottleneck
    [Diagram: the Driver and the Actor both hold (x, GCS); the GCS
    owns the entry for x in the Plasma Store.]

  63. Ownership transfer - summary
    • There are three patterns of object passing:
      upward, downward, and lateral
    • Automatically transfer ownership of an object via pattern
      detection
    • Share ownership among workers to "transfer" ownership

  64. Ownership is a way to decentralize ObjectRef
    management for performance and stability.
    Ray whitepaper:
    tinyurl.com/ray-white-paper
    Thank you!
