Ray Observability: Present and Future

Machine learning users, from offline workloads to online serving, leverage Ray in radically different ways, which makes it challenging to provide a common architecture for inspecting and debugging performance problems.

In this talk, we discuss Ray 2.x's current observability architecture, including how to view metrics and logs and how to inspect the state of tasks, actors, and other resources in Ray. We cover features like the newly revamped Ray dashboard and the new Ray metrics, and present a roadmap for where Ray observability is going, built around a unified observability data model.

This talk is for you if you're interested in the following:

* What's new with Ray metrics and the dashboard revamp
* How to debug Ray programs using the CLI, dashboard, and logs
* How to implement observability in a general-purpose distributed system such as Ray

Anyscale

December 07, 2022

Transcript

  1. SangBin Cho, Anyscale
    Ricky Xu, Anyscale
    Ray Observability:
    Present and Future


  2. We are…
    Software engineers at Anyscale
    Ray core developers
    Currently focusing on improving observability


  3. Anyscale
    Who we are: Original creators of Ray, a unified framework
    for scalable computing
    What we do: Scalable compute for AI and Python
    Why we do it: Scaling is a necessity and scaling is hard; make
    distributed computing easy and simple for everyone


  4. If you are interested in
    How to debug Ray Applications?


  5. If you are interested in
    How do I debug Ray Applications?
    What tools are available for debugging?


  6. If you are interested in
    How do I debug Ray Applications?
    What tools are available for debugging?
    What observability features are new
    in recent Ray releases


  7. Outline
    1. Requirements in Ray Observability
    2. Ray Observability before 2.0
    3. What’s new in Ray 2.0?
    4. Demo (Ray 2.1 + 2.2 features)
    5. Future roadmap


  8. Outline
    1. Requirements in Ray Observability
    2. Ray Observability before 2.0
    3. What’s new in Ray 2.0?
    4. Demo (Ray 2.1 + 2.2 features)
    5. Future roadmap


  9. Ray: APIs to build Distributed Systems



  12. Ray Head Node

  13. Ray Head Node, four Ray Worker Nodes

  14. Ray Head Node

  15. Ray Head Node

  16. What if anything goes wrong?


  17. Ray Head Node

  18. Ray Head Node: Actor failed

  19. Ray Head Node: Actor failed

  20. Ray Head Node: Actor failed

  21. How can we still make debugging easy?


  22. How can we still make debugging easy?
    Understand the Workflow


  23. How can we still make debugging easy?
    Understand the Workflow
    Tools


  24. Key Workflow for Debugging
    Monitoring: visibility into applications
    Error: surface problems
    Debugging with tools and data
    Mitigation: fix it!

  25. Tools for the Workflow
    Monitor: State, Dashboard, Time-series, App Logs, Events,
    Progress Report, Integrations (high-level ML), Tracing
    Surface Error: Exceptions, Logs, Time-series, Good error message
    Tools and Data: Debugger, Profiler, Tracing,
    Integrations (metrics, logs, tracing), Logs, Time-series, Events, State


  26. Outline
    1. Requirements in Ray Observability
    2. Ray Observability before 2.0
    3. What’s new in Ray 2.0?
    4. Demo
    5. Future roadmap (post Ray 2.0)


  27. Application Failures (failures from user code)
    Debugging System failures (Hanging, OOM, etc.)
    Case Studies


  28. Application Failures (failures from user code)
    Debugging System failures (Hanging, OOM, etc.)
    Case Studies


  29. Develop Ray Applications Locally
    @ray.remote
    class Actor:
        def print_task(self):
            return "hello"

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())


  30. Monitoring: Logs
    @ray.remote
    class Actor:
        def print_task(self):
            return "hello"

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())


  31. Monitoring: Logs
    @ray.remote
    class Actor:
        def print_task(self):
            print("hello")
            return "hello"

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())


  32. Monitoring: Logs
    @ray.remote
    def print_task():
        print("haha")

    print_task.remote()
    • All logs from tasks/actors are printed to the driver program
    • Supports the `ray logs` API to access the logs of actors, tasks,
      workers, and system logs


  33. Surface Error: Exceptions
    @ray.remote
    class Actor:
        def print_task(self):
            return 4 / 0

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())


  34. Ray Exceptions
    • All Ray primitives (tasks, objects, actors) generate an
      "object reference".
    • If anything goes wrong with the generator of an object reference,
      `ray.get` raises an exception
    @ray.remote
    class Actor:
        def print_task(self):
            return 4 / 0

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)


  35. Ray Exceptions
    • Any failure (application error or system error) should be raised
      as an exception
    • All Ray primitives (tasks, objects, actors) generate an
      "object reference".
    • If anything goes wrong with the generator of an object reference,
      `ray.get` raises an exception
    @ray.remote
    class Actor:
        def print_task(self):
            return 4 / 0

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    RayActorError: The actor died unexpectedly before finishing this task.
    class_name: G
    actor_id: e818d2f0521a334daf03540701000000
    pid: 61251
    namespace: 674a49b2-5b9b-4fcc-b6e1-5a1d4b9400d2
    ip: 127.0.0.1
    The actor is dead because its worker process has died. Worker exit type:
    UNEXPECTED_SYSTEM_EXIT. Worker exit detail: Worker unexpectedly exits with a
    connection error code 2. End of file. There are some potential root causes.
    (1) The process is killed by SIGKILL by the OOM killer due to high memory
    usage. (2) `ray stop --force` is called. (3) The worker crashed unexpectedly
    due to SIGSEGV or other unexpected errors.


  36. Debugging: Debugger
    @ray.remote
    class Actor:
        def print_task(self):
            return 4 / 0

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())


  37. Debugging: Debugger
    @ray.remote
    class Actor:
        def print_task(self):
            breakpoint()
            return 4 / 0

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())


  38. Interactive Debugging: Ray Debugger
    • When developing Ray applications, users want to use a familiar
      debugger such as pdb
    • Ray has its own pdb integration to support "distributed debugging"


  39. Debugging: Tracing
    @ray.remote
    class Actor:
        def print_task(self):
            big_obj = do_busy_work()
            return big_obj

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())


  40. Debugging: Tracing
    @ray.remote
    class Actor:
        def print_task(self):
            big_obj = do_busy_work()
            return big_obj

    actors = [
        Actor.remote()
        for _ in range(4)
    ]
    refs = [
        actor.print_task.remote()
        for actor in actors
    ]
    ray.get(refs)
    (Diagram: Actors 1-4, each invoked via ref = actor.print_task.remote())
    Why is actor 1's print_task slow?


  41. Performance: Ray Timeline
    • Sometimes, users want to trace low-level per-worker operations to
      find performance bottlenecks
    • Ray supports the `ray timeline` API for this!


  42. Visualization
    • Metrics
    • Dashboard
    Huge improvement in
    Ray 2.1 & 2.2!


  43. Outline
    1. Requirements in Ray Observability
    2. Ray Observability before 2.0
    3. What’s new in Ray 2.0?
    4. Demo (Ray 2.1 + 2.2 features)
    5. Future roadmap


  44. What's new in Ray 2.0: State APIs + Better Ray Exceptions
    Visibility into applications: State APIs, Dashboard, Time-series,
    Logs, Events, Progress Bar, Integrations (high-level ML), Tracing
    Surface problems: Exceptions, Logs, Time-series, Good error message,
    State APIs
    Debugging with tools and data: Debugger, Profiler, Tracing,
    Integrations (metrics, logs, tracing), Logs, Time-series, Events,
    State APIs


  45. Key Ray 2.0 Feature: State APIs
    CLIs:
    > ray summary actors
    > ray list tasks
    > ray logs
    Python SDKs:
    import ray.experimental.state.api as api
    api.summarize_tasks()
    api.list_nodes()
    api.get_log()


  46. State API example
    ray summary: tasks, actors, objects


  47. State API example
    > ray summary tasks [--address ]


  48. State API example
    > ray summary actors


  49. State API example
    > ray summary objects


  50. State API example
    ray list / get: tasks, actors, objects, workers, nodes, jobs,
    runtime-envs, placement-groups


  51. State API example
    > ray list tasks


  52. State API example
    > ray list tasks -f scheduling_state!=FINISHED -f name=PPO.train


  53. State API example
    > ray list actors -f node_id=


  54. State API example
    > ray get actors


  55. State API example
    > ray get actors


  56. State API example
    > ray get actors


  57. State API example
    ray logs: cluster, actors, workers


  58. State API example
    > ray logs cluster


  59. State API example
    > ray logs worker*


  60. State API example
    > ray logs worker*
> ray logs -f


  61. Key Ray 2.0 Feature: State APIs
    Check out the documentation for more details:
    https://docs.ray.io/en/latest/ray-observability/state/state-api.html


  62. Outline
    1. Requirements in Ray Observability
    2. Ray Observability before 2.0
    3. What’s new in Ray 2.0?
    4. Demo (Ray 2.1 + 2.2 features)
    5. Future roadmap


  63. Demo
    Read Images -> Preprocessing -> Training
    S3 Storage (Images)
    Read -> split -> preprocessing (hundreds of workers)
    16 TrainWorker actors
    Job Supervisor


  64. Outline
    1. Requirements in Ray Observability
    2. Ray Observability before 2.0
    3. What’s new in Ray 2.0?
    4. Demo
    5. Future roadmap


  65. Future Roadmap
    - State API Beta


  66. State API Beta
    More info to come:
    - Historical task state information
    - Resource usage/requirements of actors and tasks
    - Duration, exit code
    - Relationships between different resources
    - What tasks has an actor run?
    - What logical resources are used on a node?
    More stable API
    Better usability in options and formatting


  67. Future Roadmap
    - State API Beta
    - Dashboard Usability Improvement


  68. Dashboard Usability Improvement


  69. Future Roadmap
    - State API Beta
    - Dashboard Usability Improvement
    - Advanced Task Drilldown


  70. Advanced Task Drilldown


  71. Advanced Task Drilldown


  72. Advanced Task Drilldown


  73. Future Roadmap
    - State API Beta
    - Dashboard Usability Improvement
    - Advanced Task Drilldown
    - Advanced Visualization


  74. Advanced Visualization


  75. 💬 Help us shape the future of Ray Observability
    Join #ray_observability in the Ray Slack or scan the QR code


  76. Thank you.
