Slide 1

Slide 1 text

Ray Observability: Present and Future. SangBin Cho (Anyscale), Ricky Xu (Anyscale)

Slide 2

Slide 2 text

We are… software engineers at Anyscale, Ray core developers, currently focusing on improving observability.

Slide 3

Slide 3 text

Anyscale. Who we are: Original creators of Ray, a unified framework for scalable computing. What we do: Scalable compute for AI and Python. Why we do it: Scaling is a necessity, and scaling is hard; make distributed computing easy and simple for everyone.

Slide 4

Slide 4 text

If you are interested in: How to debug Ray Applications?

Slide 5

Slide 5 text

If you are interested in: How do I debug Ray Applications? What tools are available for debugging?

Slide 6

Slide 6 text

If you are interested in: How do I debug Ray Applications? What tools are available for debugging? What new observability features are in recent Ray releases?

Slide 7

Slide 7 text

Outline 1. Requirements in Ray Observability 2. Ray Observability before 2.0 3. What’s new in Ray 2.0? 4. Demo (Ray 2.1 + 2.2 features) 5. Future roadmap

Slide 8

Slide 8 text

Outline 1. Requirements in Ray Observability 2. Ray Observability before 2.0 3. What’s new in Ray 2.0? 4. Demo (Ray 2.1 + 2.2 features) 5. Future roadmap

Slide 9

Slide 9 text

Ray: APIs to build Distributed Systems

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Ray Head Node

Slide 13

Slide 13 text

Ray Head Node Ray Worker Node Ray Worker Node Ray Worker Node Ray Worker Node

Slide 14

Slide 14 text

Ray Head Node

Slide 15

Slide 15 text

Ray Head Node

Slide 16

Slide 16 text

What if anything goes wrong?

Slide 17

Slide 17 text

Ray Head Node

Slide 18

Slide 18 text

Ray Head Node Actor failed

Slide 19

Slide 19 text

Ray Head Node Actor failed

Slide 20

Slide 20 text

Ray Head Node Actor failed

Slide 21

Slide 21 text

How can we still make debugging easy?

Slide 22

Slide 22 text

How can we still make debugging easy? Understand the Workflow

Slide 23

Slide 23 text

How can we still make debugging easy? Understand the Workflow Tools

Slide 24

Slide 24 text

Key Workflow for Debugging: Monitoring (visibility into applications) -> Error (surface problems) -> Mitigation (debugging with tools and data; fix it!)

Slide 25

Slide 25 text

Tools for the Workflow. Monitor: State, Dashboard, Time-series, App Logs, Events, Progress Report, Integrations (high level ML), Tracing. Surface Error: Exceptions, Logs, Time-series, Good error message. Tools and Data: Debugger, Profiler, Tracing, Integrations (metrics, logs, tracing), Logs, Time-series, Events, State.

Slide 26

Slide 26 text

Outline 1. Requirements in Ray Observability 2. Ray Observability before 2.0 3. What’s new in Ray 2.0? 4. Demo 5. Future roadmap (post Ray 2.0)

Slide 27

Slide 27 text

Application Failures (failures from user code) Debugging System failures (Hanging, OOM, etc.) Case Studies

Slide 28

Slide 28 text

Application Failures (failures from user code) Debugging System failures (Hanging, OOM, etc.) Case Studies

Slide 29

Slide 29 text

Develop Ray Applications Locally @ray.remote class Actor: def print_task(self): return "hello" actors = [ Actor.remote() for _ in range(4) ] refs = [ actor.print_task.remote() for actor in actors ] ray.get(refs) Actor 1 Actor 2 ref = actor.print_task.remote() Actor 3 Actor 4

Slide 30

Slide 30 text

Monitoring: Logs @ray.remote class Actor: def print_task(self): return "hello" actors = [ Actor.remote() for _ in range(4) ] refs = [ actor.print_task.remote() for actor in actors ] ray.get(refs) Actor 1 Actor 2 ref = actor.print_task.remote() Actor 3 Actor 4

Slide 31

Slide 31 text

Monitoring: Logs @ray.remote class Actor: def print_task(self): print("hello") return "hello" actors = [ Actor.remote() for _ in range(4) ] refs = [ actor.print_task.remote() for actor in actors ] ray.get(refs) Actor 1 Actor 2 ref = actor.print_task.remote() Actor 3 Actor 4

Slide 32

Slide 32 text

Monitoring: Logs @ray.remote def print_task(): print("haha") print_task.remote() • All logs from tasks/actors are printed to the driver program • Supports the `ray logs` API to access logs of actors, tasks, workers, and system logs.

Slide 33

Slide 33 text

Surface Error: Exceptions @ray.remote class Actor: def print_task(self): return 4 / 0 actors = [ Actor.remote() for _ in range(4) ] refs = [ actor.print_task.remote() for actor in actors ] ray.get(refs) Actor 1 Actor 2 ref = actor.print_task.remote() Actor 3 Actor 4

Slide 34

Slide 34 text

Ray Exceptions • All Ray primitives (tasks, objects, actors) generate an "object reference". • If anything goes wrong with the generator of an object reference, `ray.get` raises an exception @ray.remote class Actor: def print_task(self): return 4 / 0 actors = [ Actor.remote() for _ in range(4) ] refs = [ actor.print_task.remote() for actor in actors ] ray.get(refs)

Slide 35

Slide 35 text

Ray Exceptions • Any failure (application error or system error) should be raised as an exception • All Ray primitives (tasks, objects, actors) generate an "object reference". • If anything goes wrong with the generator of an object reference, `ray.get` raises an exception @ray.remote class Actor: def print_task(self): return 4 / 0 actors = [ Actor.remote() for _ in range(4) ] refs = [ actor.print_task.remote() for actor in actors ] ray.get(refs) RayActorError: The actor died unexpectedly before finishing this task. class_name: G actor_id: e818d2f0521a334daf03540701000000 pid: 61251 namespace: 674a49b2-5b9b-4fcc-b6e1-5a1d4b9400d2 ip: 127.0.0.1 The actor is dead because its worker process has died. Worker exit type: UNEXPECTED_SYSTEM_EXIT Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Slide 36

Slide 36 text

Debugging: Debugger @ray.remote class Actor: def print_task(self): return 4 / 0 actors = [ Actor.remote() for _ in range(4) ] refs = [ actor.print_task.remote() for actor in actors ] ray.get(refs) Actor 1 Actor 2 ref = actor.print_task.remote() Actor 3 Actor 4

Slide 37

Slide 37 text

Debugging: Debugger @ray.remote class Actor: def print_task(self): breakpoint() return 4 / 0 actors = [ Actor.remote() for _ in range(4) ] refs = [ actor.print_task.remote() for actor in actors ] ray.get(refs) Actor 1 Actor 2 ref = actor.print_task.remote() Actor 3 Actor 4

Slide 38

Slide 38 text

Interactive Debugging: Ray Debugger • When developing Ray applications, users want to use a familiar debugger such as pdb • Ray has its own pdb integration to support "distributed debugging"

Slide 39

Slide 39 text

Debugging: Tracing @ray.remote class Actor: def print_task(self): big_obj = do_busy_work() return big_obj actors = [ Actor.remote() for _ in range(4) ] refs = [ actor.print_task.remote() for actor in actors ] ray.get(refs) Actor 1 Actor 2 ref = actor.print_task.remote() Actor 3 Actor 4

Slide 40

Slide 40 text

Debugging: Tracing @ray.remote class Actor: def print_task(self): big_obj = do_busy_work() return big_obj actors = [ Actor.remote() for _ in range(4) ] refs = [ actor.print_task.remote() for actor in actors ] ray.get(refs) Actor 1 Actor 2 ref = actor.print_task.remote() Actor 3 Actor 4 Why is actor 1's print_task slow?

Slide 41

Slide 41 text

Performance: Ray Timeline • Sometimes, users want to trace low-level per-worker operations to find performance bottlenecks • Ray supports the `ray timeline` API for this!

Slide 42

Slide 42 text

Visualization • Metrics • Dashboard Huge improvement in Ray 2.1 & 2.2!

Slide 43

Slide 43 text

Outline 1. Requirements in Ray Observability 2. Ray Observability before 2.0 3. What’s new in Ray 2.0? 4. Demo (Ray 2.1 + 2.2 features) 5. Future roadmap

Slide 44

Slide 44 text

What's new in Ray 2.0. Visibility into applications: State APIs, Dashboard, Time-series, Logs, Events, Progress Bar, Integrations (high level ML), Tracing. Surface problems: State API + Better Ray Exceptions, Logs, Time-series, Good error message. Debugging with tools and data: Debugger, Profiler, Tracing, Integrations (metrics, logs, tracing), Logs, Time-series, Events, State APIs.

Slide 45

Slide 45 text

CLIs Python SDKs > ray summary actors > ray list tasks > ray logs import ray.experimental.state.api as api api.summarize_tasks() api.list_nodes() api.get_log() Key Ray 2.0 Feature: State APIs

Slide 46

Slide 46 text

State API example ray summary objects tasks actors

Slide 47

Slide 47 text

State API example > ray summary tasks [--address ]

Slide 48

Slide 48 text

State API example > ray summary actors

Slide 49

Slide 49 text

State API example > ray summary objects

Slide 50

Slide 50 text

State API example ray list / get objects tasks actors workers runtime-envs placement groups nodes jobs

Slide 51

Slide 51 text

State API example > ray list tasks

Slide 52

Slide 52 text

State API example > ray list tasks -f scheduling_state!=FINISHED -f name=PPO.train

Slide 53

Slide 53 text

State API example > ray list actors -f node_id=

Slide 54

Slide 54 text

State API example > ray get actors

Slide 55

Slide 55 text

State API example > ray get actors

Slide 56

Slide 56 text

State API example > ray get actors

Slide 57

Slide 57 text

State API example ray logs workers cluster actors

Slide 58

Slide 58 text

State API example > ray logs cluster

Slide 59

Slide 59 text

State API example > ray logs worker*

Slide 60

Slide 60 text

State API example > ray logs worker* > ray logs -f

Slide 61

Slide 61 text

Key Ray 2.0 Feature: State APIs. Check out the documentation for more details: 👇 https://docs.ray.io/en/latest/ray-observability/state/state-api.html

Slide 62

Slide 62 text

Outline 1. Requirements in Ray Observability 2. Ray Observability before 2.0 3. What’s new in Ray 2.0? 4. Demo (Ray 2.1 + 2.2 features) 5. Future roadmap

Slide 63

Slide 63 text

Demo Read Images Preprocessing Training 16 TrainWorker actors Job Supervisor S3 Storage (Images) Read -> split -> preprocessing (100s of workers)

Slide 64

Slide 64 text

Outline 1. Requirements in Ray Observability 2. Ray Observability before 2.0 3. What’s new in Ray 2.0? 4. Demo 5. Future roadmap

Slide 65

Slide 65 text

Future Roadmap - State API Beta

Slide 66

Slide 66 text

State API Beta. More info to come: - Historical task state information - Resource usage/requirements of actors and tasks - Duration, exit code - Relationship between different resources - What tasks has an actor run? - What logical resources are used on a node? More stable API. Better usability in options and formatting.

Slide 67

Slide 67 text

Future Roadmap - State API Beta - Dashboard Usability Improvement

Slide 68

Slide 68 text

Dashboard Usability Improvement

Slide 69

Slide 69 text

Future Roadmap - State API Beta - Dashboard Usability Improvement - Advanced Task Drilldown

Slide 70

Slide 70 text

Advanced Task Drilldown

Slide 71

Slide 71 text

Advanced Task Drilldown

Slide 72

Slide 72 text

Advanced Task Drilldown

Slide 73

Slide 73 text

Future Roadmap - State API Beta - Dashboard Usability Improvement - Advanced Task Drilldown - Advanced Visualization

Slide 74

Slide 74 text

Advanced Visualization

Slide 75

Slide 75 text

💬 Help us shape the future of Ray Observability. Join #ray_observability in the Ray Slack or scan the QR code 👇

Slide 76

Slide 76 text

Thank you.