
Ray Observability: Present and Future

Anyscale
December 07, 2022


Users of Ray for offline machine learning and online serving leverage it in radically different ways, which makes it challenging to provide a common architecture for inspecting and debugging performance problems.

In this talk, we discuss Ray 2.x's current observability architecture, including how to view metrics and logs and how to inspect the state of tasks, actors, and other resources in Ray. We cover features like the newly revamped Ray Dashboard and the new Ray metrics, and present a roadmap for where Ray observability is going in the future: a unified observability data model.

This talk is for you if you're interested in the following:

* What's new with Ray metrics and the Dashboard revamp
* How to debug Ray programs using the CLI, Dashboard, and logs
* How to implement observability in a general-purpose distributed system such as Ray



Transcript

  1. Anyscale

     Who we are: original creators of Ray, a unified framework for scalable computing
     What we do: scalable compute for AI and Python
     Why we do it: scaling is a necessity, and scaling is hard; make distributed computing easy and simple for everyone
  2. This talk is for you if you are interested in:

     • How do I debug Ray applications?
     • What tools are available for debugging?
     • What's new in observability features in recent Ray releases?
  3. Outline: 1. Requirements in Ray Observability 2. Ray Observability before 2.0 3. What's new in Ray 2.0? 4. Demo (Ray 2.1 + 2.2 features) 5. Future roadmap
  4. Outline: 1. Requirements in Ray Observability 2. Ray Observability before 2.0 3. What's new in Ray 2.0? 4. Demo (Ray 2.1 + 2.2 features) 5. Future roadmap
  5. Key Workflow for Debugging

     Monitoring (visibility into applications) -> Surface problems (errors) -> Debugging with tools and data -> Mitigation (fix it!)
  6. Tools for the Workflow

     Monitor: state, dashboard, time-series, app logs, events, progress report, integrations (high-level ML), tracing
     Surface Error: exceptions, logs, time-series, good error messages
     Tools and Data: debugger, profiler, tracing, integrations (metrics, logs, tracing), logs, time-series, events, state
  7. Outline: 1. Requirements in Ray Observability 2. Ray Observability before 2.0 3. What's new in Ray 2.0? 4. Demo 5. Future roadmap (post Ray 2.0)
  8. Develop Ray Applications Locally

     @ray.remote
     class Actor:
         def print_task(self):
             return "hello"

     actors = [Actor.remote() for _ in range(4)]
     refs = [actor.print_task.remote() for actor in actors]
     ray.get(refs)

     (Diagram: the driver calls ref = actor.print_task.remote() on Actors 1-4.)
  9. Monitoring: Logs

     @ray.remote
     class Actor:
         def print_task(self):
             return "hello"

     actors = [Actor.remote() for _ in range(4)]
     refs = [actor.print_task.remote() for actor in actors]
     ray.get(refs)

     (Diagram: the driver calls ref = actor.print_task.remote() on Actors 1-4.)
  10. Monitoring: Logs

     @ray.remote
     class Actor:
         def print_task(self):
             print("hello")
             return "hello"

     actors = [Actor.remote() for _ in range(4)]
     refs = [actor.print_task.remote() for actor in actors]
     ray.get(refs)

     (Diagram: the driver calls ref = actor.print_task.remote() on Actors 1-4.)
  11. Monitoring: Logs

     @ray.remote
     def print_task():
         print("haha")

     print_task.remote()

     • All logs from tasks/actors are printed to the driver program
     • The `ray logs` API gives access to the logs of actors, tasks, workers, and system logs
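Besides streaming to the driver, Ray also writes per-worker log files under its session directory (by default /tmp/ray/session_latest/logs). A minimal stdlib sketch of scanning that directory for worker stdout, assuming Ray's default worker-*.out naming; the helper name is ours, and `ray logs` remains the supported interface:

```python
from pathlib import Path


def collect_worker_logs(log_dir):
    """Return {filename: contents} for each worker stdout log in log_dir.

    Assumes Ray's default naming, where worker stdout goes to worker-*.out
    files; returns an empty dict if the directory does not exist.
    """
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        return {}
    return {p.name: p.read_text() for p in sorted(log_dir.glob("worker-*.out"))}
```

On a live cluster you would point this at /tmp/ray/session_latest/logs on each node; in practice the `ray logs` CLI does this aggregation for you.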
  12. Surface Error: Exceptions

     @ray.remote
     class Actor:
         def print_task(self):
             return 4 / 0

     actors = [Actor.remote() for _ in range(4)]
     refs = [actor.print_task.remote() for actor in actors]
     ray.get(refs)

     (Diagram: the driver calls ref = actor.print_task.remote() on Actors 1-4.)
  13. Ray Exceptions

     • All Ray primitives (tasks, objects, actors) generate an object reference
     • If anything goes wrong with the generator of an object reference, `ray.get` raises an exception

     @ray.remote
     class Actor:
         def print_task(self):
             return 4 / 0

     actors = [Actor.remote() for _ in range(4)]
     refs = [actor.print_task.remote() for actor in actors]
     ray.get(refs)
  14. Ray Exceptions

     • Any failure (application error or system error) should be raised as an exception
     • All Ray primitives (tasks, objects, actors) generate an object reference
     • If anything goes wrong with the generator of an object reference, `ray.get` raises an exception

     @ray.remote
     class Actor:
         def print_task(self):
             return 4 / 0

     actors = [Actor.remote() for _ in range(4)]
     refs = [actor.print_task.remote() for actor in actors]
     ray.get(refs)

     RayActorError: The actor died unexpectedly before finishing this task.
         class_name: G
         actor_id: e818d2f0521a334daf03540701000000
         pid: 61251
         namespace: 674a49b2-5b9b-4fcc-b6e1-5a1d4b9400d2
         ip: 127.0.0.1
     The actor is dead because its worker process has died.
     Worker exit type: UNEXPECTED_SYSTEM_EXIT
     Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file.
     There are some potential root causes:
     (1) The process is killed by SIGKILL by the OOM killer due to high memory usage.
     (2) `ray stop --force` is called.
     (3) The worker crashed unexpectedly due to SIGSEGV or other unexpected errors.
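This "failures surface at get" shape is not unique to Ray. As an analogy (not Ray's implementation), the standard library's concurrent.futures behaves the same way: an exception raised inside a worker is stored in the future and only re-raised when the caller asks for the result, just as `ray.get` re-raises a task's error on the driver:

```python
from concurrent.futures import ThreadPoolExecutor


def print_task():
    # The failure happens inside the worker...
    return 4 / 0


errors = []
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(print_task) for _ in range(4)]
    # ...but it only surfaces when the caller retrieves the result,
    # the same point at which ray.get would re-raise the task's error.
    for f in futures:
        try:
            f.result()
        except ZeroDivisionError as exc:
            errors.append(exc)

print(f"{len(errors)} of 4 tasks failed")  # -> 4 of 4 tasks failed
```

Ray goes further than this sketch: it distinguishes application errors (wrapped and re-raised) from system errors such as a dead worker process (surfaced as errors like RayActorError, as on the slide above).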
  15. Debugging: Debugger

     @ray.remote
     class Actor:
         def print_task(self):
             return 4 / 0

     actors = [Actor.remote() for _ in range(4)]
     refs = [actor.print_task.remote() for actor in actors]
     ray.get(refs)

     (Diagram: the driver calls ref = actor.print_task.remote() on Actors 1-4.)
  16. Debugging: Debugger

     @ray.remote
     class Actor:
         def print_task(self):
             breakpoint()
             return 4 / 0

     actors = [Actor.remote() for _ in range(4)]
     refs = [actor.print_task.remote() for actor in actors]
     ray.get(refs)

     (Diagram: the driver calls ref = actor.print_task.remote() on Actors 1-4.)
  17. Interactive Debugging: Ray Debugger

     • When developing Ray applications, users want to use a familiar debugger such as pdb
     • Ray has its own pdb integration to support "distributed debugging"
  18. Debugging: Tracing

     @ray.remote
     class Actor:
         def print_task(self):
             big_obj = do_busy_work()
             return big_obj

     actors = [Actor.remote() for _ in range(4)]
     refs = [actor.print_task.remote() for actor in actors]
     ray.get(refs)

     (Diagram: the driver calls ref = actor.print_task.remote() on Actors 1-4.)
  19. Debugging: Tracing

     @ray.remote
     class Actor:
         def print_task(self):
             big_obj = do_busy_work()
             return big_obj

     actors = [Actor.remote() for _ in range(4)]
     refs = [actor.print_task.remote() for actor in actors]
     ray.get(refs)

     Why is actor 1's print_task slow?
  20. Performance: Ray Timeline

     • Sometimes users want to trace low-level per-worker operations to find performance bottlenecks
     • Ray supports the `ray timeline` API for this!
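`ray timeline` dumps a trace you can open in chrome://tracing or Perfetto; the file is in the Chrome trace event format, a JSON list of events with a name, process/thread ids, and microsecond timestamps. A minimal stdlib sketch of producing such a file; the event names and values below are made up for illustration:

```python
import json


def write_trace(events, path):
    """Write events in the Chrome trace event format that ray timeline uses.

    Each event is a "complete" event ("ph": "X"): a name, process/thread
    ids, a start timestamp ts and a duration dur, both in microseconds.
    """
    trace = [
        {"name": name, "ph": "X", "pid": pid, "tid": tid, "ts": ts, "dur": dur}
        for (name, pid, tid, ts, dur) in events
    ]
    with open(path, "w") as f:
        json.dump(trace, f)


# Hypothetical per-worker task spans, loosely like what ray timeline records.
write_trace(
    [("Actor.print_task", 1, 1, 0, 1500), ("Actor.print_task", 1, 2, 200, 9000)],
    "timeline.json",
)
```

Loading timeline.json into the trace viewer would show one row per worker thread, which is how you would spot that one actor's print_task span is much longer than the others.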
  21. Outline: 1. Requirements in Ray Observability 2. Ray Observability before 2.0 3. What's new in Ray 2.0? 4. Demo (Ray 2.1 + 2.2 features) 5. Future roadmap
  22. What's new in Ray 2.0

     Visibility into applications: State APIs, dashboard, time-series, logs, events, progress bar, integrations (high-level ML), tracing
     Surface problems: exceptions (State API + better Ray exceptions), logs, time-series, good error messages (State APIs)
     Debugging with tools and data: debugger, profiler, tracing, integrations (metrics, logs, tracing), logs, time-series, events, State APIs
  23. Key Ray 2.0 Feature: State APIs

     CLIs:
       > ray summary actors
       > ray list tasks
       > ray logs

     Python SDKs:
       import ray.experimental.state.api as api
       api.summarize_tasks()
       api.list_nodes()
       api.get_log()
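The State API CLIs can also emit machine-readable output (JSON/YAML), which is convenient to post-process in scripts. A stdlib sketch of filtering such output by task state; the records below are hypothetical and only loosely mirror the real schema, which carries many more fields (task_id, node_id, required resources, ...):

```python
import json

# Hypothetical JSON output of a State API task listing.
raw = json.dumps([
    {"name": "print_task", "state": "FINISHED"},
    {"name": "print_task", "state": "RUNNING"},
    {"name": "do_busy_work", "state": "RUNNING"},
])


def tasks_in_state(payload, state):
    """Return the names of tasks in a JSON task list matching `state`."""
    return [t["name"] for t in json.loads(payload) if t["state"] == state]


print(tasks_in_state(raw, "RUNNING"))  # -> ['print_task', 'do_busy_work']
```

The Python SDK (`ray.experimental.state.api`) returns equivalent structured records directly, so the same kind of filtering works without going through the CLI.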
  24. State API example

     ray list / get: objects, tasks, actors, workers, runtime-envs, placement-groups, nodes, jobs
  25. Outline: 1. Requirements in Ray Observability 2. Ray Observability before 2.0 3. What's new in Ray 2.0? 4. Demo (Ray 2.1 + 2.2 features) 5. Future roadmap
  26. Demo

     Read Images -> Preprocessing -> Training
     • 16 TrainWorker actors, one Job Supervisor
     • S3 storage (images)
     • Read -> split -> preprocessing (100s of workers)
  27. Outline: 1. Requirements in Ray Observability 2. Ray Observability before 2.0 3. What's new in Ray 2.0? 4. Demo 5. Future roadmap
  28. State API Beta

     More info to come:
     • Historical task state information
     • Resource usage/requirements of actors and tasks
     • Duration, exit code
     • Relationships between different resources
       - What tasks has an actor run?
       - What logical resources are used on a node?
     • More stable API
     • Better usability in options and formatting
  29. Future Roadmap

     • State API Beta
     • Dashboard usability improvements
     • Advanced task drill-down
     • Advanced visualization
  30. 💬 Help us shape the future of Ray Observability: join #ray_observability in the Ray Slack or scan the QR code 👇