
Ray Internals: Object Management with the Ownership Model (Stephanie Wang & Yi Cheng, Anyscale)

In this talk, we'll do a deep dive into Ray's distributed object management layer. We'll explain the Ray execution model and the basics behind the Ray distributed object store. Next, we'll describe the challenges with achieving both performance and reliability for object management. We'll present our solution to this problem, which is based on a novel concept called ownership that ensures object metadata consistency with low overhead. Finally, we'll present some exciting upcoming work on how to extend ownership to better support recent use cases.

Anyscale

July 16, 2021

Transcript

  1. Application: Model serving
    1. Tasks are dynamically generated
    2. Tasks in the 10s of ms
    3. Requires efficient data movement
    (Diagram: requests carrying image data flow through a Router to Model workers.)
  2. Application: Model serving
    The Router becomes a bottleneck because it must copy all preprocessed images to the workers.
  3. Application: Model serving
    Passing ObjectRefs instead of data avoids the Router bottleneck, but latency still suffers from a centralized design. A sketch of the pattern follows.
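    To make this concrete, here is a minimal sketch of the ObjectRef-passing pattern (the `preprocess` task, `Model` actor, and data are illustrative stand-ins, not code from the talk):

      import ray

      ray.init()

      @ray.remote
      def preprocess(image_bytes):
          # Stand-in for real image preprocessing.
          return image_bytes

      @ray.remote
      class Model:
          def predict(self, image):
              # Stand-in for real inference; Ray resolves the ObjectRef
              # argument to its value before this method runs.
              return len(image)

      model = Model.remote()
      ref = preprocess.remote(b"raw image bytes")  # returns an ObjectRef immediately
      result = model.predict.remote(ref)           # pass the ref; the bytes never flow through the caller
      print(ray.get(result))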
  4. Evaluation: Online video processing
    1. Tasks in the milliseconds
    2. Complex data dependencies
    3. Pipelined parallelism
    (Diagram: a per-frame pipeline of Decode, Flow, Cumulative Sum, Smooth, and Sink tasks, with state dependencies carried across frame0, frame1, frame2, ...)
  5. Evaluation: Online video processing (60 videos)
    Latency with ownership is lower because each video has a different owner.
  6. Evaluation: Online video processing with failures
    Recovery from owner failure uses application-level checkpoints to bound re-execution.
  7. RPC model
      o1 = f()
      o2 = f()
      o3 = add(o1, o2)
    (Diagram: the driver calls f() on Worker 1 and Worker 2, then add(o1, o2); every value is copied back through the driver.)
    Problems:
    • Data movement
    • Parallelism
  8. Data movement: RPC model + distributed memory
    Distributed memory: the ability to reference data stored in the memory of a remote process.
    • Application can pass by reference
    • System manages data movement
    (Diagram: the same program, but objects stay in remote memory and the system moves data directly between workers.)
  9. Parallelism: RPC model + futures
    Futures: the ability to reference data that has not yet been computed.
    • Application can specify parallelism and data dependencies
    • System manages task scheduling
    (Diagram: the driver invokes f() on both workers concurrently; add(o1, o2) runs once its inputs are ready.)
  10. ObjectRefs in Ray: distributed memory + futures
    • Performance: the system handles data movement and parallelism
    • Generality: an RPC-like interface (data is immutable); the application does not specify when or where computation should execute
    (Diagram: the same program, with ObjectRefs passed between the driver and workers.)
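    As a minimal sketch in Ray's Python API (assuming trivial task bodies for illustration), the running example looks like:

      import ray

      ray.init()

      @ray.remote
      def f():
          return 1

      @ray.remote
      def add(a, b):
          # Ray resolves ObjectRef arguments to their values before the task runs.
          return a + b

      o1 = f.remote()          # an ObjectRef: a future backed by distributed memory
      o2 = f.remote()
      o3 = add.remote(o1, o2)  # pass by reference; no data moves through the driver
      print(ray.get(o3))       # -> 2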
  11. System requirements
    For generality, the system must impose low overhead.
    Analogy: gRPC can execute millions of tasks/s. Can we do the same for ObjectRefs in Ray?
    Goal: build a system that guarantees fault tolerance with low task overhead for ObjectRefs.
  12. Challenge: ObjectRefs introduce distributed shared state
    (Diagram: the driver invokes f() twice and add(o1, o2); o1 and o2 are data dependencies shared between the tasks.)
  13. Challenge: ObjectRefs introduce distributed shared state
    Multiple processes refer to the same value, so dereferencing an ObjectRef requires coordination across:
    1. The process that specifies how the value is created and used.
    2. The process that creates the value.
    3. The process that uses the value.
    4. The physical location of the value.
  14. System requirements
    Requirements for dereferencing a value:
    • Retrieval: the location of the value
    • Garbage collection: whether the value is referenced
    Requirements in the presence of failures:
    • Detection: the location of the task that returns the value
    • Recovery: a description of the task and its dependencies
    • Persistence: metadata should survive failures
  15. System requirements (continued)
    Challenge: recording this metadata while preserving latency and throughput for dynamic, fine-grained tasks.
  16. Existing solutions
    Architecture                      | Coordination                                                     | Performance
    Leases (decentralized, Ray <v0.8) | Workers coordinate, e.g., using leases to detect a task failure. | Asynchronous metadata updates; scale by adding more worker nodes.
    Centralized master                | Master records all metadata updates and handles all failures.   | Can scale through sharding, but high overhead due to synchronous updates.
  17. Option 1: A centralized architecture (Dask, Spark, etc.)
    Coordination: the master records all metadata updates and handles all failures.
    Performance: can scale through sharding, but high overhead for short tasks due to synchronous updates.
  18. Option 2: Decentralized leases (Ray <v0.8)
    Coordination: workers coordinate directly (e.g., by acquiring leases on task execution).
    Performance: asynchronous metadata updates; scale by adding more worker nodes.
  19. Our approach: Ownership
    Existing solutions do not take advantage of the inherent structure of a Ray application:
    1. Task graphs are hierarchical.
    2. An ObjectRef is often passed within the scope of the caller ("passing downwards").
  20. Our approach: Ownership (continued)
    Insight: by leveraging the structure of Ray applications, we can decentralize without requiring expensive coordination. A sketch of the "passing downwards" pattern follows.
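    A minimal sketch of "passing downwards" (the task names are illustrative): a caller creates an ObjectRef and passes it only to tasks that it invokes itself, so the ref never leaves its owner's scope:

      import ray

      ray.init()

      @ray.remote
      def consume(x):
          return x + 1

      @ray.remote
      def parent():
          ref = ray.put(41)  # `parent` owns `ref`
          # `ref` is passed downwards, within the scope of its owner.
          return ray.get(consume.remote(ref))

      print(ray.get(parent.remote()))  # -> 42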
  21. Our approach: Ownership
    Insight: by leveraging the structure of Ray applications, we can decentralize without requiring expensive coordination.
    Architecture: Ownership. The worker that calls a task owns the returned ObjectRef.
    Failure handling: Each worker is a "centralized master" for the objects that it owns.
    Performance: No additional writes on the critical path of task execution; scaling through nested function calls.
  22. Ownership: Challenges
    • Failure recovery
      ◦ Recovering a lost worker
      ◦ Recovering a lost owner
    • Garbage collection and memory safety
  24. Task scheduling
    (Diagram: three nodes, each with a worker and an object store. The owner on Node 1 keeps a metadata table with columns Obj | Task | Val | Loc; X is created by task B() and Y by task C(X).)
  25. Task scheduling
    A task's pending location is written locally at the owner.
    (Diagram: the owner schedules B() to Node 2 and records N2 as X's pending location in its table.)
  26. Distributed memory management
    The owner tracks the locations of objects stored in distributed memory.
    (Diagram: the worker on Node 2 executes B() and stores X in Node 2's object store; the owner records Loc = N2 for X.)
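    As an illustrative sketch only (Ray's real owner table lives in the C++ core worker, not Python), the per-owner metadata from the diagrams can be pictured as:

      from dataclasses import dataclass, field
      from typing import Optional, Set

      @dataclass
      class ObjectEntry:
          task: str                         # the task that creates the value (kept for lineage reconstruction)
          value: Optional[bytes] = None     # the value itself, if small enough to inline
          locations: Set[str] = field(default_factory=set)  # nodes whose object stores hold a copy

      # Each worker is a "centralized master" for the objects it owns:
      owner_table = {
          "X": ObjectEntry(task="B()", locations={"N2"}),
          "Y": ObjectEntry(task="C(X)"),
      }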
  27. Task scheduling with dependencies
    (Diagram: the owner schedules C(X) to Node 3; Node 3's worker records that X is owned by W1 and fetches a copy of X from Node 2's object store.)
  28. Worker failure
    Reference holders only need to check whether the owner is alive.
    (Diagram: the worker on Node 2 fails; Node 3's table still records that X is owned by W1.)
  29. Worker recovery
    The owner coordinates lineage reconstruction.
    (Diagram: the owner resubmits B() to Node 4 and updates X's pending location to N4.)
  30. Owner failure
    (Diagram: the owner on Node 1 fails while X is stored on Node 2 and C(X) runs on Node 3.)
  31. Owner recovery
    References fate-share with the object's owner.
    (Diagram: when the owner dies, the copy of X in distributed memory is released.)
  32. Owner recovery
    References fate-share with the object's owner.
    (Diagram: the failed owner's task A, with its objects X and Y, fate-shares with it; A's own owner is still alive.)
  33. Owner recovery
    Leveraging the application's hierarchical structure: the owner of A recovers A.
    (Diagram: A's owner resubmits A on Node 4; the new worker rebuilds its table with X: B() and Y: C(X).)
  34. Limitations of ownership
    Ownership model: whoever creates an ObjectRef owns the metadata.
    + A single owner guarantees consistency and low latency
    + Decentralizes the system according to the application structure
    + Uses lineage reconstruction to recover from failures
    BUT:
    - Ray's flexibility makes it difficult to apply lineage reconstruction in all scenarios
    - Objects (and their lineage) fate-share with their owner
    - This is fine if ObjectRefs are only passed downwards, but what if they're not?
  36. Outline for the second half (Yi)
    - Problem statement: what happens if the key assumption that ObjectRefs are passed downwards is broken?
    - Key use cases where this happens
    - The proposed solution, plus (briefly) one alternative
    - A request for feedback and a summary of other open questions (e.g., fault tolerance)
  37. Ownership transfer
    • Where does the issue come from?
    • Real-world use cases
    • Designs: when and how
  39. Where does the issue come from?
    The key assumption: an object reference is passed downwards.
      x = f.remote()
      g.remote(x)
    (Diagram: the driver owns x; a worker runs f() to create it and another worker runs g() to consume it.)
    • What if the worker running f fails? OK
    • What if the driver fails? OK
    But what if the assumption is broken?
  40. Upward pattern
      @ray.remote
      def f():
          x = ray.put("Hello World")  # `x` is owned by the worker
          return [x]

      x_list = f.remote()  # `x_list` is owned by the driver
      x = ray.get(ray.get(x_list)[0])
    • What if the driver exits? OK
    • What if the worker running f exits? NOT OK
      ◦ The driver will fail to get the inner object
      ◦ The inner object cannot be reconstructed
  41. Lateral pattern
      x = f.remote()  # `x` is owned by the driver
      actor = Actor.options(lifetime="detached").remote()
      actor.g.remote([x])  # `actor` doesn't own `x`
    • What if the worker running f fails? OK
    • What if the driver exits? NOT OK
      ◦ The actor will no longer be able to access x
      ◦ x can't be reconstructed because its owner died
  42. Ownership transfer
    • Where does the issue come from?
    • Real-world use cases
    • Designs: when and how
  43. Why are these cases important?
    If we don't handle them:
    - Scaling down a cluster becomes hard
    - Zombie processes can't be removed
    - Objects can't be reconstructed after failures
  44. Real-world use case: RayDP
    (Diagram: data-processing tasks create data partitions and return the list of partitions to the driver, i.e., the upward pattern.)
  45. Real-world use case: caching
    (Diagram: the driver creates a tensor object, and a detached actor caches the created tensor object, i.e., the lateral pattern. A sketch follows.)
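    A minimal sketch of this caching pattern (the `create_tensor` task and `Cache` actor are illustrative names, not from the talk):

      import ray

      ray.init()

      @ray.remote
      def create_tensor():
          return [[0.0] * 4 for _ in range(4)]  # stand-in for a real tensor

      @ray.remote
      class Cache:
          def __init__(self):
              self._refs = {}

          def put(self, key, refs):
              # The ref arrives inside a list, so the actor only borrows it;
              # the driver remains the owner.
              self._refs[key] = refs[0]

      cache = Cache.options(lifetime="detached", name="tensor_cache").remote()
      t = create_tensor.remote()            # `t` is owned by the driver
      ray.get(cache.put.remote("t0", [t]))  # if the driver exits, the cached ref becomes unusable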
  46. Ownership transfer
    • Where does the issue come from?
    • Real-world use cases
    • Designs: when and how
  47. Two questions to answer
    When to do the transfer, and how to do the transfer. The design:
    • needs to be easy to use
    • should introduce as little overhead as possible
    • should cover the most common cases
  48. When to transfer
    • Manually transfer the ownership (sketched below)
    • Automatically transfer by detecting a scope change
      ◦ Detect upward transfer
      ◦ Detect lateral transfer
        ▪ To cover most cases: transfer if either the owner or the receiver is a detached actor
        ▪ Ideal solution: transfer whenever the ref crosses into a different ownership scope
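    As a purely hypothetical sketch of the manual option (no such `transfer_ownership` API exists in Ray; the design is still open, and `f`, `Actor`, and `g` are as in the lateral-pattern slide):

      x = f.remote()  # owned by the driver
      actor = Actor.options(lifetime="detached", name="cache").remote()
      # Hypothetical call, for illustration only: make `actor` the owner of x's metadata.
      ray.experimental.transfer_ownership(x, to=actor)
      actor.g.remote([x])  # would now survive the driver exiting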
  49. How to transfer
      x = f.remote()  # `x` is owned by the driver
      actor = Actor.options(lifetime="detached").remote()
      actor.g.remote([x])  # `actor` doesn't own `x`
    (Diagram: both the driver's and the actor's reference tables record (x, driver); the driver owns the copy of x in the Plasma store.)
  50. How to transfer
    (Diagram: after the transfer, the driver's table records (x, driver) and the actor's table records (x, actor); both own the copy of x in the Plasma store.)
    • Ownership sharing in the reference-counting layer
      ◦ Most changes are confined to the reference-counting layer
      ◦ Avoids a physical copy of the object
  51. How to transfer - alternative
    (Diagram: both the driver's and the actor's tables record (x, GCS); the GCS owns the copy of x in the Plasma store.)
    • Transfer the ownership to the GCS
      ◦ Gains HA support if the GCS supports HA
      ◦ The GCS potentially becomes a bottleneck
  52. Ownership transfer - summary
    • There are three patterns of object passing: downward, upward, and lateral
    • Ownership of an object can be transferred automatically by detecting the passing pattern
    • Ownership is "transferred" by sharing it among workers
  53. Ownership is a way to decentralize ObjectRef management for performance and stability.
    Ray whitepaper: tinyurl.com/ray-white-paper
    Thank you!