Ray Observability: Present and Future

Machine learning users, from offline workloads to online serving, leverage Ray in radically different ways, which makes it challenging to provide a common architecture for inspecting and debugging performance problems.

In this talk, we discuss Ray 2.x's current observability architecture, including how to view metrics and logs and how to inspect the state of tasks, actors, and other resources in Ray. We cover features like the newly revamped Ray dashboard and the new Ray metrics, and present a roadmap for where Ray observability is going, built around a unified observability data model.

This talk is for you if you're interested in the following:

* What's new with Ray metrics and the dashboard revamp
* How to debug Ray programs using the CLI, dashboard, and logs
* How to implement observability in a general-purpose distributed system such as Ray

Anyscale

December 07, 2022

Transcript

  1. SangBin Cho, Anyscale
    Ricky Xu, Anyscale
    Ray Observability:
    Present and Future


  2. We are…
    Software engineers at Anyscale
    Ray core developers
    Currently focusing on improving observability


  3. Anyscale
    Who we are: Original creators of Ray, a unified framework
    for scalable computing
    What we do: Scalable compute for AI and Python
    Why we do it: Scaling is a necessity and scaling is hard; make
    distributed computing easy and simple for everyone


  4. If you are interested in
    How to debug Ray Applications?


  5. If you are interested in
    How do I debug Ray Applications?
    What tools are available for debugging?


  6. If you are interested in
    How do I debug Ray Applications?
    What tools are available for debugging?
    What observability features are new
    in recent Ray releases


  7. Outline
    1. Requirements in Ray Observability
    2. Ray Observability before 2.0
    3. What’s new in Ray 2.0?
    4. Demo (Ray 2.1 + 2.2 features)
    5. Future roadmap


  8. Outline
    1. Requirements in Ray Observability
    2. Ray Observability before 2.0
    3. What’s new in Ray 2.0?
    4. Demo (Ray 2.1 + 2.2 features)
    5. Future roadmap


  9. Ray: APIs to build Distributed Systems



  12. Ray Head Node

  13. Ray Head Node, four Ray Worker Nodes

  14. Ray Head Node

  15. Ray Head Node

  16. What if anything goes wrong?


  17. Ray Head Node

  18. Ray Head Node: Actor failed

  19. Ray Head Node: Actor failed

  20. Ray Head Node: Actor failed

  21. How can we still make debugging easy?


  22. How can we still make debugging easy?
    Understand the Workflow


  23. How can we still make debugging easy?
    Understand the Workflow
    Tools


  24. Key Workflow for Debugging
    Monitoring: visibility into applications
    Error: surface problems
    Debugging with tools and data
    Mitigation: fix it!

  25. Tools for the Workflow
    Monitor: State, Dashboard, Time-series, App Logs, Events,
    Progress Report, Integrations (high-level ML), Tracing
    Surface Error: Exceptions, Logs, Time-series, Good error message
    Tools and Data: Debugger, Profiler, Tracing,
    Integrations (metrics, logs, tracing), Logs, Time-series, Events, State


  26. Outline
    1. Requirements in Ray Observability
    2. Ray Observability before 2.0
    3. What’s new in Ray 2.0?
    4. Demo
    5. Future roadmap (post Ray 2.0)


  27. Application Failures (failures from user code)
    Debugging System failures (Hanging, OOM, etc.)
    Case Studies


  28. Application Failures (failures from user code)
    Debugging System failures (Hanging, OOM, etc.)
    Case Studies


  29. Develop Ray Applications Locally
    @ray.remote
    class Actor:
        def print_task(self):
            return "hello"

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())


  30. Monitoring: Logs
    @ray.remote
    class Actor:
        def print_task(self):
            return "hello"

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())


  31. Monitoring: Logs
    @ray.remote
    class Actor:
        def print_task(self):
            print("hello")
            return "hello"

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())


  32. Monitoring: Logs
    @ray.remote
    def print_task():
        print("haha")

    print_task.remote()
    • All logs from tasks/actors are printed to the driver program
    • Supports the `ray logs` API to access the logs of actors, tasks,
      workers, and system logs


  33. Surface Error: Exceptions
    @ray.remote
    class Actor:
        def print_task(self):
            return 4 / 0

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())


  34. Ray Exceptions
    • All Ray primitives (tasks, objects, actors) generate an
      "object reference".
    • If anything goes wrong with the generator of an object reference,
      `ray.get` raises an exception
    @ray.remote
    class Actor:
        def print_task(self):
            return 4 / 0

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)


  35. Ray Exceptions
    • Any failure (application error or system error) should be raised
      as an exception
    • All Ray primitives (tasks, objects, actors) generate an
      "object reference".
    • If anything goes wrong with the generator of an object reference,
      `ray.get` raises an exception
    @ray.remote
    class Actor:
        def print_task(self):
            return 4 / 0

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    RayActorError: The actor died unexpectedly before finishing this task.
    class_name: G
    actor_id: e818d2f0521a334daf03540701000000
    pid: 61251
    namespace: 674a49b2-5b9b-4fcc-b6e1-5a1d4b9400d2
    ip: 127.0.0.1
    The actor is dead because its worker process has died. Worker exit type:
    UNEXPECTED_SYSTEM_EXIT. Worker exit detail: Worker unexpectedly exits with a
    connection error code 2. End of file. There are some potential root causes.
    (1) The process is killed by SIGKILL by the OOM killer due to high memory
    usage. (2) `ray stop --force` is called. (3) The worker crashed unexpectedly
    due to SIGSEGV or other unexpected errors.


  36. Debugging: Debugger
    @ray.remote
    class Actor:
        def print_task(self):
            return 4 / 0

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())


  37. Debugging: Debugger
    @ray.remote
    class Actor:
        def print_task(self):
            breakpoint()
            return 4 / 0

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())


  38. Interactive Debugging: Ray Debugger
    • When developing Ray applications, users want to use a familiar
      debugger such as pdb
    • Ray has its own pdb integration to support "distributed debugging"


  39. Debugging: Tracing
    @ray.remote
    class Actor:
        def print_task(self):
            big_obj = do_busy_work()
            return big_obj

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())


  40. Debugging: Tracing
    @ray.remote
    class Actor:
        def print_task(self):
            big_obj = do_busy_work()
            return big_obj

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())
    Why is actor 1's print_task slow?


  41. Performance: Ray Timeline
    • Sometimes, users want to trace low-level per-worker operations to
      find performance bottlenecks
    • Ray supports the `ray timeline` API for this!


  42. Visualization
    • Metrics
    • Dashboard
    Huge improvement in
    Ray 2.1 & 2.2!


  43. Outline
    1. Requirements in Ray Observability
    2. Ray Observability before 2.0
    3. What’s new in Ray 2.0?
    4. Demo (Ray 2.1 + 2.2 features)
    5. Future roadmap


  44. What's new in Ray 2.0: State APIs + Better Ray Exceptions
    Visibility into applications: State APIs, Dashboard, Time-series,
    Logs, Events, Progress Bar, Integrations (high-level ML), Tracing
    Surface problems: Exceptions, Logs, Time-series, Good error message,
    State APIs
    Debugging with tools and data: Debugger, Profiler, Tracing,
    Integrations (metrics, logs, tracing), Logs, Time-series, Events,
    State APIs


  45. Key Ray 2.0 Feature: State APIs
    CLIs:
    > ray summary actors
    > ray list tasks
    > ray logs
    Python SDKs:
    import ray.experimental.state.api as api
    api.summarize_tasks()
    api.list_nodes()
    api.get_log()


  46. State API example
    ray summary: tasks, actors, objects


  47. State API example
    > ray summary tasks [--address ]


  48. State API example
    > ray summary actors


  49. State API example
    > ray summary objects


  50. State API example
    ray list / get: tasks, actors, objects, workers, nodes, jobs,
    runtime-envs, placement-groups


  51. State API example
    > ray list tasks


  52. State API example
    > ray list tasks -f scheduling_state!=FINISHED -f name=PPO.train


  53. State API example
    > ray list actors -f node_id=


  54. State API example
    > ray get actors


  55. State API example
    > ray get actors


  56. State API example
    > ray get actors


  57. State API example
    ray logs: cluster, actors, workers


  58. State API example
    > ray logs cluster


  59. State API example
    > ray logs worker*


  60. State API example
    > ray logs worker*
> ray logs -f


  61. Key Ray 2.0 Feature: State APIs
    Check out the documentation for more details:
    https://docs.ray.io/en/latest/ray-observability/state/state-api.html


  62. Outline
    1. Requirements in Ray Observability
    2. Ray Observability before 2.0
    3. What’s new in Ray 2.0?
    4. Demo (Ray 2.1 + 2.2 features)
    5. Future roadmap


  63. Demo
    Read Images -> Preprocessing -> Training
    S3 Storage (Images)
    Read -> split -> preprocessing (hundreds of workers)
    16 TrainWorker actors
    Job Supervisor


  64. Outline
    1. Requirements in Ray Observability
    2. Ray Observability before 2.0
    3. What’s new in Ray 2.0?
    4. Demo
    5. Future roadmap


  65. Future Roadmap
    - State API Beta


  66. State API Beta
    More info to come:
    - Historical task state information
    - Resource usage/requirements of actors and tasks
    - Duration, exit code
    - Relationships between different resources
    - What tasks has an actor run?
    - What logical resources are used on a node?
    More stable API
    Better usability in options and formatting


  67. Future Roadmap
    - State API Beta
    - Dashboard Usability Improvement


  68. Dashboard Usability Improvement


  69. Future Roadmap
    - State API Beta
    - Dashboard Usability Improvement
    - Advanced Task Drilldown


  70. Advanced Task Drilldown


  71. Advanced Task Drilldown


  72. Advanced Task Drilldown


  73. Future Roadmap
    - State API Beta
    - Dashboard Usability Improvement
    - Advanced Task Drilldown
    - Advanced Visualization


  74. Advanced Visualization


  75. 💬 Help us shape the future of Ray Observability
    Join #ray_observability in the Ray Slack or scan the QR code


  76. Thank you.
