Slide 1

Slide 1 text

Ray Memory Monitor
Ray team @ Anyscale

Slide 2

Slide 2 text

Agenda
→ Introduction
→ Problem with the existing solution
→ How the Ray memory monitor works
→ Deep dive into the preemption policy
→ Demo

Slide 3

Slide 3 text

Memory capacity Introduction

Slide 4

Slide 4 text

Memory capacity Introduction

Slide 5

Slide 5 text

OS (Linux) OOM killer
→ Triggers when the system runs out of free memory pages
→ Kills the most memory-hungry task
→ Frequency is capped: stalls other processes
→ Ray: sets process priority for the tasks

Slide 6

Slide 6 text

Ray stalling

Slide 7

Slide 7 text

Ray stalling

Slide 8

Slide 8 text

Before: did not finish
(Chart: tasks and memory used; cluster unreachable due to low memory)

Slide 9

Slide 9 text

After
(Chart: tasks and memory used; retried tasks from OOM)

Slide 10

Slide 10 text

Thrashing 90% done 30% done OOM!

Slide 11

Slide 11 text

Thrashing 90% done 30% done

Slide 12

Slide 12 text

Thrashing 30% done

Slide 13

Slide 13 text

Thrashing 90% done 30% done

Slide 14

Slide 14 text

Observability Task A OOM! Infinite retry

Slide 15

Slide 15 text

Ray memory monitor Embedded memory monitor Worker processes Task Operating System Task Actor Actor Raylet Get resources usage Process stats

Slide 16

Slide 16 text

Ray memory monitor Embedded memory monitor Operating System Raylet Using too much Memory? Worker processes Task Task Actor Actor
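The check in this diagram ("Using too much memory?") can be sketched as a comparison of used memory against a fraction of node capacity. This is a minimal illustration, not the raylet's implementation: `MemorySnapshot`, `is_over_threshold`, and the 0.95 threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class MemorySnapshot:
    """Hypothetical node-level reading; all sizes are in bytes."""
    used_bytes: int
    total_bytes: int


def is_over_threshold(snapshot: MemorySnapshot, usage_threshold: float = 0.95) -> bool:
    """Return True when the used fraction of node memory exceeds the threshold.

    The 0.95 default is an assumed value for illustration only.
    """
    if snapshot.total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return snapshot.used_bytes / snapshot.total_bytes > usage_threshold


GIB = 1024 ** 3
healthy = MemorySnapshot(used_bytes=8 * GIB, total_bytes=16 * GIB)
pressured = MemorySnapshot(used_bytes=31 * GIB, total_bytes=32 * GIB)

print(is_over_threshold(healthy))    # False (50% used)
print(is_over_threshold(pressured))  # True (about 97% used)
```

In this sketch the monitor would call the check periodically and start preempting workers only while it returns True.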

Slide 17

Slide 17 text

Ray memory monitor Embedded memory monitor Operating System Raylet preempt Worker processes Task Task Actor Actor

Slide 18

Slide 18 text

Ray memory monitor Embedded memory monitor Operating System Raylet Worker processes Task Actor Actor

Slide 19

Slide 19 text

Preemption policy
Requirements:
→ If the application cannot complete, we should surface that information to simplify debugging and the path to resolution
→ The application will finish even when it tries to overload the cluster
→ It should finish in a reasonable amount of time: the workload shouldn’t hang

Slide 20

Slide 20 text

Preemption policy (Ray 2.2)
→ Prefer killing retriable tasks
→ Prefer killing the newest task
→ Limited retry: could deadlock otherwise
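The first two preferences of the Ray 2.2 policy can be sketched as a single ordering over running tasks. This is an illustration under assumptions, not Ray's code: `RunningTask` and `pick_victim` are hypothetical names, and start times are plain numbers where larger means newer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunningTask:
    """Hypothetical record of a task currently running on the node."""
    name: str
    retriable: bool
    start_time: float  # larger means the task started more recently


def pick_victim(tasks: List[RunningTask]) -> Optional[RunningTask]:
    """Choose the task to preempt: retriable before non-retriable, then newest."""
    if not tasks:
        return None
    # True sorts above False, so retriable tasks win; ties go to the newest task.
    return max(tasks, key=lambda t: (t.retriable, t.start_time))


tasks = [
    RunningTask("trainer", retriable=False, start_time=300.0),
    RunningTask("data-old", retriable=True, start_time=100.0),
    RunningTask("data-new", retriable=True, start_time=200.0),
]
victim = pick_victim(tasks)
print(victim.name)  # data-new
```

The non-retriable trainer is the newest task overall, but it is only chosen once no retriable task remains.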

Slide 21

Slide 21 text

Retriable task Task Crash! max_retries
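A bounded retry budget can be sketched in plain Python. This is a simulation, not Ray's retry mechanism: `OutOfMemoryKill`, `run_with_retries`, and `flaky_task` are hypothetical names, and the only idea taken from the slide is that a killed task is retried at most `max_retries` times before the failure is surfaced.

```python
from typing import Callable


class OutOfMemoryKill(Exception):
    """Hypothetical signal that the memory monitor killed the task."""


def run_with_retries(task: Callable[[], str], max_retries: int) -> str:
    """Run a task, retrying after an OOM kill at most max_retries times."""
    retries = 0
    while True:
        try:
            return task()
        except OutOfMemoryKill:
            if retries >= max_retries:
                # Surface the failure instead of retrying forever.
                raise RuntimeError(
                    f"task failed after {retries} retries due to memory pressure"
                )
            retries += 1


calls = {"count": 0}


def flaky_task() -> str:
    """Simulated task that is killed on its first two attempts."""
    calls["count"] += 1
    if calls["count"] <= 2:
        raise OutOfMemoryKill()
    return "done"


print(run_with_retries(flaky_task, max_retries=3))  # done
```

A task that is killed on every attempt exhausts the budget and raises, which turns an endless retry loop into a visible error.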

Slide 22

Slide 22 text

Newest executed task
Task: start time of execution = 14:38
Task: start time of execution = 14:22
OOM kill

Slide 23

Slide 23 text

Hyperparameter tuning Driver Trial Trainer Trial Trainer Data Data Data Data Data Data

Slide 24

Slide 24 text

Preempt retriable tasks first Driver Trial Trainer Trial Trainer Data Data Data Data Data Data

Slide 25

Slide 25 text

Preempt newest task Driver Trial Trainer Trial Trainer Data Data Data Data Data Data

Slide 26

Slide 26 text

Problem: starvation Driver Trial Trainer Trial Trainer Data Data Data

Slide 27

Slide 27 text

Ideal Driver Trial Trainer Trial Trainer Data Data Data

Slide 28

Slide 28 text

Problem: deadlock Task A OOM!

Slide 29

Slide 29 text

Preemption policy (Ray 2.3)
→ Group retriable tasks that have the same parent
→ Preempt retriable groups
→ Preempt the largest group
→ Preempt the newest task within the group
→ Always retry the task unless it is the last member of the group
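The Ray 2.3 rules can be sketched as a group-then-select step. This is an illustration under assumptions, not Ray's implementation: `RunningTask` and `pick_victim` are hypothetical names, and the sketch assumes each non-retriable task forms a group of its own.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RunningTask:
    """Hypothetical record of a task currently running on the node."""
    name: str
    parent: str
    retriable: bool
    start_time: float  # larger means the task started more recently


def pick_victim(tasks: List[RunningTask]) -> Optional[Tuple[RunningTask, bool]]:
    """Return (victim, should_retry), or None when nothing is running."""
    if not tasks:
        return None
    groups: Dict[str, List[RunningTask]] = defaultdict(list)
    for task in tasks:
        # Retriable siblings share a group; a non-retriable task stands alone (assumption).
        key = f"parent:{task.parent}" if task.retriable else f"solo:{task.name}"
        groups[key].append(task)
    # Prefer retriable groups, then the group with the most members.
    group = max(groups.values(), key=lambda g: (g[0].retriable, len(g)))
    # Within the chosen group, preempt the newest task.
    victim = max(group, key=lambda t: t.start_time)
    # Killing the last member of a group is reported as a failure, not retried.
    should_retry = len(group) > 1
    return victim, should_retry


tasks = [
    RunningTask("trainer-1", "driver", retriable=False, start_time=1.0),
    RunningTask("trainer-2", "driver", retriable=False, start_time=2.0),
    RunningTask("data-1a", "trainer-1", retriable=True, start_time=3.0),
    RunningTask("data-1b", "trainer-1", retriable=True, start_time=4.0),
    RunningTask("data-1c", "trainer-1", retriable=True, start_time=5.0),
    RunningTask("data-2a", "trainer-2", retriable=True, start_time=6.0),
]
victim, should_retry = pick_victim(tasks)
print(victim.name, should_retry)  # data-1c True
```

Although `data-2a` is the newest task, it is the only member of its group, so the sketch takes from the larger `trainer-1` group first. This is how the grouping avoids starving one trial.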

Slide 30

Slide 30 text

Group tasks (non-retriable) Driver Trial Trainer Trial Trainer Data Data Data Data Data Data

Slide 31

Slide 31 text

Group tasks (same parent) Driver Trial Trainer Trial Trainer Data Data Data Data Data Data

Slide 32

Slide 32 text

Kill tasks from the largest group Driver Trial Trainer Trial Trainer Data Data Data Data Data

Slide 33

Slide 33 text

Kill tasks from the largest group Driver Trial Trainer Trial Trainer Data Data Data Data

Slide 34

Slide 34 text

Kill tasks from the largest group Driver Trial Trainer Trial Trainer Data Data Data

Slide 35

Slide 35 text

Kill tasks from the largest group Driver Trial Trainer Trial Trainer Data Data

Slide 36

Slide 36 text

Workload fails (group has no task) Driver Trial Trainer Trial Trainer Data Data

Slide 37

Slide 37 text

Workload fails (group has no task) Task A OOM!

Slide 38

Slide 38 text

Demo

Slide 39

Slide 39 text

Summary
● Ray memory monitor improves cluster stability
● Latest release of Ray (2.2)
  ○ Preemptively kills tasks to prevent the node from failing
  ○ Improved observability for debugging memory issues
● Next release of Ray (2.3)
  ○ Detects when a workload gets stuck and reports the error
  ○ Fairness across tasks to avoid starvation

Slide 40

Slide 40 text

Thank you clarence@anyscale.com

Slide 41

Slide 41 text

Demo: single task Task A OOM!

Slide 42

Slide 42 text

Demo: Two trials (HPO) Driver Trial Trainer Trial Trainer Data Data Data Data Data Data

Slide 43

Slide 43 text

“Deadlock” Task A OOM!

Slide 44

Slide 44 text

“Deadlock” Task A Task B Task C OOM!

Slide 45

Slide 45 text

“Deadlock” Driver Task B Task C Task B Task C

Slide 46

Slide 46 text

“Deadlock” Driver Task B Task C OK

Slide 47

Slide 47 text

“Deadlock” Driver Task B Task B

Slide 48

Slide 48 text

“Deadlock” Driver Task B Task C OOM! Task B

Slide 49

Slide 49 text

Thrashing Driver Task B

Slide 50

Slide 50 text

Thrashing Driver Task B OOM! Task B

Slide 51

Slide 51 text

Thrashing Driver Task B

Slide 52

Slide 52 text

Thrashing Driver Task B OOM! Task B

Slide 53

Slide 53 text

Thrashing Driver Task B
