
Ray Internals: Object Management with the Ownership Model (Stephanie Wang & Yi Cheng, Anyscale)

In this talk, we'll do a deep dive into Ray's distributed object management layer. We'll explain the Ray execution model and the basics behind the Ray distributed object store. Next, we'll describe the challenges with achieving both performance and reliability for object management. We'll present our solution to this problem, which is based on a novel concept called ownership that ensures object metadata consistency with low overhead. Finally, we'll present some exciting upcoming work on how to extend ownership to better support recent use cases.

Anyscale

July 16, 2021

Transcript

  1. Ray Internals: Object Management
    with the
    Ownership Model
    Stephanie Wang
    Yi Cheng


  2. Applications
    Why ownership in Ray 1.0?


  3. Application: Model serving
    1. Tasks are dynamically generated
    2. Tasks in the 10s of ms
    3. Requires efficient data movement
    [Diagram: requests arrive at a Router, which preprocesses image
    data and dispatches it to Model workers.]

  4. Application: Model serving
    Router becomes a bottleneck because it
    must copy all preprocessed images to the
    workers.


  5. Application: Model serving
    Passing ObjectRefs instead of data avoids the Router
    bottleneck, but latency still suffers from a centralized
    design.


  6. Application: Model serving
    With ownership, application gets automatic and
    decentralized memory management for image data.
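    To make the ObjectRef variant concrete, here is a minimal Ray sketch
    (our illustration, not the talk's code; `preprocess` and `predict`
    are stand-ins):

    import ray

    ray.init()

    @ray.remote
    def preprocess(request):
        # Stand-in for decoding/preprocessing image bytes.
        return [request] * 1000

    @ray.remote
    def predict(image):
        # `image` is fetched straight from distributed memory on this
        # node, not copied through the router process.
        return len(image)

    # The "router" holds only small ObjectRefs and forwards them to models.
    refs = [preprocess.remote(i) for i in range(4)]
    print(ray.get([predict.remote(r) for r in refs]))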


  7. Evaluation: Online video processing
    1. Tasks in the milliseconds
    2. Complex data dependencies
    3. Pipelined parallelism
    [Diagram: a per-frame task graph. Each frame (frame0, frame1,
    frame2, frame3, ...) flows through Decode, Flow, Cumulative Sum,
    Smooth, and Sink tasks, with transform/transform' tasks and state
    dependencies linking consecutive frames. Legend: invocation,
    task (RPC), state dependency.]
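    As an illustration of this structure (stage names and bodies are
    placeholders, not the benchmark code), a stateful per-frame pipeline
    in Ray looks roughly like:

    import ray

    ray.init()

    @ray.remote
    def decode(i):
        return f"frame{i}"

    @ray.remote
    def flow(frame, prev_state):
        # Consumes the current frame plus state from the previous frame,
        # creating the cross-frame dependencies shown in the diagram.
        return (frame, prev_state)

    state = None
    results = []
    for i in range(60):
        frame = decode.remote(i)           # frames decode in parallel
        state = flow.remote(frame, state)  # stateful stage pipelines across frames
        results.append(state)
    ray.get(results[-1])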


  8. Evaluation: Online video processing (60 videos)
    Centralized = Ray modified with writes to a
    centralized metadata store


  9. Evaluation: Online video processing (60 videos)
    TODO: Add leases plot


  10. Evaluation: Online video processing (60 videos)
    Latency with ownership is lower because each
    video has a different owner.


  11. Evaluation: Online video processing with failures
    Recovery when the owner is intact, with lineage
    reconstruction.


  12. Evaluation: Online video processing with failures
    Recovery from owner failure using
    application-level checkpoints to bound
    re-execution.


  13. [Video demo: "Live input video" alongside "Stabilized video".]


  14. An overview of ObjectRefs in Ray
    Combining distributed memory + futures


  15. RPC model
    o1 = f()
    o2 = f()
    o3 = add(o1, o2)
    [Diagram: the driver calls f() twice and add(o1, o2) as RPCs on
    Worker 1 and Worker 2; every result is shipped back to the driver,
    and o1 and o2 must be re-sent for add.]
    Problems:
    • Data movement
    • Parallelism

  16. Data movement: RPC model + distributed memory
    Distributed memory: Ability to reference data stored in the
    memory of a remote process.
    • Application can pass by reference
    • System manages data movement
    [Diagram: o1 and o2 stay in the workers' memory; only references
    flow through the driver, and o3=add(o1,o2) fetches the values
    directly between workers.]

  17. Parallelism: RPC model + futures
    Futures: Ability to reference data that has not yet been computed.
    • Application can specify parallelism and data dependencies
    • System manages task scheduling
    [Diagram: the driver invokes o1=f() and o2=f() concurrently on
    Worker 1 and Worker 2, then add(o1,o2) once both futures resolve.]

  18. ObjectRefs in Ray: distributed memory + futures
    • Performance: System handles data movement and parallelism
    • Generality: RPC-like interface (data is immutable). Application
      does not specify when or where computation should execute.
    [Diagram: o1=f() and o2=f() run in parallel on two workers;
    add(o1,o2) runs where the system chooses, and o3 is fetched only
    when needed.]
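    The slide's pseudocode maps directly onto the Ray API; a minimal
    runnable version:

    import ray

    ray.init()

    @ray.remote
    def f():
        return 1

    @ray.remote
    def add(a, b):
        return a + b

    o1 = f.remote()          # an ObjectRef: a future for f's result
    o2 = f.remote()          # both f() calls may run in parallel
    o3 = add.remote(o1, o2)  # ObjectRefs passed by reference; Ray moves the data
    assert ray.get(o3) == 2  # dereference: block until computed, then fetch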


  19. System challenges and requirements
    What is the right architecture for
    managing ObjectRefs?


  20. System requirements
    For generality, the system must impose low overhead.
    Analogy: gRPC can execute millions of tasks/s. Can we do the same
    for ObjectRefs in Ray?
    Goal: Build a system that guarantees fault tolerance with low task
    overhead for ObjectRefs.


  21. Challenge: ObjectRefs introduce distributed shared state
    [Diagram: the driver invokes f() twice and add(o1,o2); o1 and o2
    are data dependencies of add. Legend: invocation, task (RPC),
    data dependency.]

  22. Challenge: ObjectRefs introduce distributed shared state
    Multiple processes refer to the same value.
    Dereferencing an ObjectRef requires coordination among:
    1. The process that specifies how the value is created and used.
    2. The process that creates the value.
    3. The process that uses the value.
    4. The physical location of the value.
    [Diagram: copies of the o1 reference held by the driver, f(), and
    add(o1,o2).]

  23. System requirements
    Requirements for dereferencing a value:
    • Retrieval: The location of the value.
    • Garbage collection: Whether the value is referenced.
    Requirements in the presence of failures:
    • Detection: The location of the task that returns the value.
    • Recovery: A description of the task and its dependencies.
    • Persistence: Metadata should survive failures.

  24. System requirements
    Requirements for dereferencing a value:
    • Retrieval: The location of the value.
    • Garbage collection: Whether the value is referenced.
    Requirements in the presence of failures:
    • Detection: The location of the task that returns the value.
    • Recovery: A description of the task and its dependencies.
    • Persistence: Metadata should survive failures.
    Challenge: Recording this metadata while preserving latency and
    throughput for dynamic and fine-grained tasks.
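    The retrieval and garbage-collection requirements surface directly
    in the API. A minimal sketch using standard Ray semantics:

    import ray

    ray.init()

    @ray.remote
    def make():
        return bytes(10_000_000)

    ref = make.remote()
    ray.get(ref)  # retrieval: the system resolves the value's location
    del ref       # garbage collection: once no process holds a reference,
                  # the object can be freed from the distributed object store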


  25. Existing solutions
    Architecture: Leases (decentralized, Ray)
    • Coordination: Workers coordinate. For example, use leases to
      detect a task failure.
    • Performance: Asynchronous metadata updates. Scale by adding more
      worker nodes.
    Architecture: Centralized master
    • Coordination: Master records all metadata updates and handles
      all failures.
    • Performance: Can scale through sharding, but high overhead due
      to synchronous updates.

  26. Option 1: A centralized architecture (Dask, Spark, etc.)
    Coordination: Master records all metadata updates and handles all
    failures.
    Performance: Can scale through sharding, but high overhead for
    short tasks due to synchronous updates.
    [Diagram: Driver and Workers all report to a central Master.]

  27. Option 2: Decentralized leases (Ray)
    Coordination: Workers coordinate (e.g., by acquiring leases on
    task execution).
    Performance: Asynchronous metadata updates. Scale by adding more
    worker nodes.
    [Diagram: Driver and Workers coordinate through a per-node lease
    manager.]

  28. The ownership model
    Achieving fault tolerance without giving
    up performance.


  29. Our approach: Ownership
    Existing solutions do not take advantage of the inherent structure
    of a Ray application.
    1. Task graphs are hierarchical.
    2. An ObjectRef is often passed within the scope of the caller
       ("passing downwards").
    [Diagram: the driver invokes f() twice and add(o1,o2), passing o1
    and o2 down into add.]

  30. Our approach: Ownership
    Existing solutions do not take advantage of the inherent structure
    of a Ray application.
    1. Task graphs are hierarchical.
    2. An ObjectRef is often passed within the scope of the caller
       ("passing downwards").
    Insight: By leveraging the structure of Ray applications, we can
    decentralize without requiring expensive coordination.
    [Diagram: same task graph as the previous slide.]

  31. Our approach: Ownership
    Insight: By leveraging the structure of Ray applications, we can
    decentralize without requiring expensive coordination.
    Architecture: Ownership. The worker that calls a task owns the
    returned ObjectRef.
    • Failure handling: Each worker is a "centralized master" for the
      objects that it owns.
    • Performance: No additional writes on the critical path of task
      execution. Scaling through nested function calls.
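    A small illustrative sketch of "scaling through nested function
    calls" (function names are ours): each branch() worker owns the
    ObjectRefs for its own leaf() calls, so no single process tracks
    every object.

    import ray

    ray.init()

    @ray.remote
    def leaf(i):
        return i

    @ray.remote
    def branch(n):
        # This worker calls leaf(), so it owns those ObjectRefs and
        # their metadata; the driver never sees them.
        return sum(ray.get([leaf.remote(i) for i in range(n)]))

    # The driver owns only these four top-level ObjectRefs.
    totals = ray.get([branch.remote(10) for _ in range(4)])
    print(totals)  # [45, 45, 45, 45]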


  32. Ownership: Challenges
    ● Failure recovery
    ○ Recovering a lost worker
    ○ Recovering a lost owner
    ● Garbage collection and memory safety


  33. Ownership: Challenges
    • Failure recovery
    • Recovering a lost worker
    • Recovering a lost owner
    • Garbage collection and memory safety


  34. Task scheduling
    [Diagram: worker A on Node 1 calls B() -> X and C(X) -> Y. A's
    ownership table has columns Obj / Task / Val / Loc with rows
    X: B() and Y: C(X). Nodes 2 and 3 each run a worker with a local
    object store.]

  35. Task scheduling
    A task's pending location is written locally at the owner.
    [Diagram: (1) the owner schedules B to Node 2; (2) X's pending
    location (N2) is recorded in the owner's local table.]

  36. Distributed memory management
    Owner tracks locations of objects stored in distributed memory.
    [Diagram: (3) B stores X in Node 2's object store; (4) Node 2
    reports "X: N2" to the owner; (5) the owner updates its table
    entry for X to Val = *X, Loc = N2.]

  37. Task scheduling with dependencies
    [Diagram: the owner schedules C(X) to Node 3, passing X's location
    (N2); Node 3 records X's owner (W1) and fetches X's value from
    Node 2's object store.]

  38. Worker failure
    Reference holders only need to check whether the owner is alive.
    [Diagram: the worker executing C(X) -> Y fails; the owner's table
    still records X at N2 and Y pending at N3.]

  39. Worker recovery
    Owner coordinates lineage reconstruction.
    [Diagram: using the lineage in its table (X <- B(), Y <- C(X)), the
    owner resubmits the lost task on a new node (Node 4) and updates
    the object's location (*X at N4).]
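    From the application's point of view, lineage reconstruction is
    transparent. A sketch of the guarantee, with illustrative function
    names:

    import ray

    ray.init()

    @ray.remote
    def b():
        return [0] * 1_000_000  # X: a large value kept in the object store

    @ray.remote
    def c(x):
        return len(x)           # Y: consumes X

    x = b.remote()  # this driver owns X and records its lineage (task b)
    y = c.remote(x)
    # If the node holding X's value dies before c reads it, the owner
    # resubmits b() from the lineage; ray.get(y) still returns 1000000.
    print(ray.get(y))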


  40. Owner failure
    [Diagram: Worker 1, the owner of X and Y, fails. Node 3 still
    holds a reference to X (owner: W1), and X's value still sits in
    Node 2's object store.]

  41. Owner recovery
    References fate-share with the object's owner.
    [Diagram: after the owner dies, X is freed from Node 2's object
    store and Node 3's dangling reference to X (owner: W1) fails.]

  42. Owner recovery
    References fate-share with the object's owner.
    [Diagram: the failed owner A was itself invoked by another process
    ("A's owner"), which holds the lineage for A and therefore for
    A's subgraph (B, C, X, Y).]

  43. Owner recovery
    Leveraging the application's hierarchical structure: the owner of
    A recovers A.
    [Diagram: A's owner resubmits A on a new node (Node 4); the new A
    re-executes B() and C(X), rebuilding its ownership table entries
    for X and Y.]

  44. Ownership transfer
    Addressing the limitations of ownership


  45. Limitations of ownership
    Ownership model: Whoever creates an ObjectRef owns the metadata.
    + Using a single owner guarantees consistency and low latency
    + Decentralize the system according to the application structure
    + Use lineage reconstruction to recover from failures
    BUT:
    - Ray’s flexibility makes it difficult to apply lineage reconstruction in all
    scenarios
    - Objects (and their lineage) fate-share with their owner
    - This is fine if ObjectRefs are only passed downwards, but what if
    they’re not?


  46. Limitations of ownership
    Ownership model: Whoever creates an ObjectRef owns the metadata.
    + Using a single owner guarantees consistency and low latency
    + Decentralize the system according to the application structure
    + Use lineage reconstruction to recover from failures
    BUT:
    - Ray’s flexibility makes it difficult to apply lineage reconstruction in all
    scenarios
    - Objects (and their lineage) fate-share with their owner
    - This is fine if ObjectRefs are only passed downwards, but what if
    they’re not?


  47. Yi’s content starts here
    - Problem statement
    - What happens if the key assumption that ObjectRefs get passed
    downward is broken?
    - Key use cases where this happens
    - Present solution + (briefly, optionally) 1 alternative
    - Finish with “we’re looking for feedback”, summary of other open
    questions (fault tolerance?)


  48. Ownership transfer
    • Where does the issue come from?
    • Real world use cases
    • Designs: when & how


  49. Ownership transfer
    • Where does the issue come from?
    • Real world use cases
    • Designs: when & how


  50. Where does the issue come from?
    The key assumption: an object reference is passed downwards.
    x = f.remote()
    g.remote(x)
    • What if the worker running f fails? OK
    • What if the driver fails? OK
    • What if the assumption is broken?
    [Diagram: the driver invokes f() and g() and owns X. Legend:
    invocation, worker, owner, object.]

  51. Upward pattern
    @ray.remote
    def f():
        x = ray.put("Hello World")  # `x` is owned by the worker
        return [x]

    x_list = f.remote()  # `x_list` is owned by the driver
    x = ray.get(ray.get(x_list)[0])
    • What if the driver exits? OK
    • What if the worker running f exits? NOT OK:
      ○ The driver will fail to get the inner object.
      ○ The inner object cannot be reconstructed.
    [Diagram: f() passes the inner object x upwards to the driver
    inside x_list = [x], while the worker still owns x. Legend:
    invocation, worker, owner, object.]
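    One way to sidestep the upward pattern (our suggestion, not from
    the slides) is to return the value itself, since a task's return
    object is owned by its caller:

    import ray

    ray.init()

    @ray.remote
    def f():
        return "Hello World"  # the return object is owned by the caller

    x_ref = f.remote()  # the driver owns this object outright
    x = ray.get(x_ref)  # survives the worker that ran f() exiting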


  52. Lateral pattern
    x = f.remote()  # `x` is owned by the driver
    actor = Actor.options(lifetime="detached").remote()
    actor.g.remote([x])  # `actor` doesn't own `x`
    • What if the worker running f fails? OK
    • What if the driver exits? NOT OK:
      ○ The actor won't be able to access x any more.
      ○ x can't be reconstructed because the owner died.
    [Diagram: the driver passes x laterally into g() on a detached
    actor that can outlive the driver. Legend: invocation, worker,
    owner, object.]

  53. Ownership transfer
    • Where does the issue come from?
    • Real world use cases
    • Designs: when & how


  54. Why are these cases important?
    Or: what happens if we don't handle them?
    - Scaling down a cluster becomes hard
    - Zombie processes can't be removed
    - Objects can't be reconstructed after failures

  55. Real world use case - RayDP
    [Diagram: data-processing workers each produce a data partition
    that they own; the driver holds a list of those partitions (an
    upward pattern). Legend: invocation, worker, owner, object.]

  56. Real world use case
    [Diagram: the driver creates a tensor object and asks a detached
    actor to cache it; the driver still owns the object (a lateral
    pattern). Legend: invocation, worker, owner, object.]
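    A minimal sketch of this use case (class and function names are
    ours): the detached actor caches a reference it does not own.

    import ray

    ray.init()

    @ray.remote
    class TensorCache:
        def __init__(self):
            self.refs = []

        def add(self, refs):
            self.refs.extend(refs)

    @ray.remote
    def create_tensor():
        return [0.0] * 1024

    cache = TensorCache.options(name="cache", lifetime="detached").remote()
    t = create_tensor.remote()       # the driver owns this ObjectRef
    ray.get(cache.add.remote([t]))   # pass the ref (nested in a list) laterally
    # If the driver now exits, the cached ref in the detached actor
    # becomes unusable: the object fate-shares with its owner.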


  57. Ownership transfer
    • Where does the issue come from?
    • Real world use cases
    • Designs: when and how

    View full-size slide

  58. Two questions to answer
    • When to do the transfer, and how to do the transfer
    • The mechanism needs to be easy to use
    • It should introduce as little overhead as possible
    • It should cover the most common cases

  59. When to transfer
    ● Manually transfer the ownership (see the sketch below)
    ● Automatically transfer by detecting a scope change
    ○ Detect upward transfer
    ○ Detect lateral transfer
    ■ To cover most cases: transfer if either the owner or the
    receiver is a detached actor
    ■ Ideal solution: transfer whenever the reference crosses into a
    different ownership scope
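    For the manual option, here is a hedged sketch of explicitly
    assigning ownership at object creation. Ray's `ray.put` later
    gained an experimental `_owner` argument along these lines, but
    treat the exact signature below as illustrative rather than
    authoritative:

    import ray

    ray.init()

    @ray.remote
    class TensorCache:
        def ping(self):
            return "ok"

    cache = TensorCache.options(name="cache", lifetime="detached").remote()

    # Illustrative/experimental: create the object with the detached
    # actor as its owner, so the reference can outlive the driver.
    x = ray.put([0.0] * 1024, _owner=cache)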


  60. How to transfer
    x = f.remote()  # `x` is owned by the driver
    actor = Actor.options(lifetime="detached").remote()
    actor.g.remote([x])  # `actor` doesn't own `x`
    [Diagram: before transfer, the Driver and the Actor both hold the
    reference (x, driver); only the Driver owns the entry for x in the
    Plasma Store.]

  61. How to transfer
    ● Ownership sharing in the reference counting layer
    ○ Most changes are confined to the reference counting layer
    ○ Avoids a physical copy of the object
    [Diagram: after transfer, the Driver holds (x, driver) and the
    Actor holds (x, actor); both own the entry for x in the Plasma
    Store.]

  62. How to transfer - alternative
    ● Transfer the ownership to the GCS
    ○ Can be made highly available if the GCS supports HA
    ○ The GCS potentially becomes a bottleneck
    [Diagram: the Driver and the Actor both hold (x, GCS); the GCS
    owns the entry for x in the Plasma Store.]

  63. Ownership transfer - summary
    • There are three patterns of object passing:
      upward, downward, and lateral
    • Automatically transfer ownership of an object via pattern
      detection
    • Share ownership among workers to "transfer" ownership

  64. Ownership is a way to decentralize ObjectRef
    management for performance and stability.
    Ray whitepaper:
    tinyurl.com/ray-white-paper
    Thank you!
