Ray Community Meetup Jan 25, 2023: Ray OOM Monitor

We are delighted to kick off the New Year with our first Ray Meetup of January, featuring talks from Ray community users and committers. Join us to hear from the Ray team at Anyscale and from Shopify about Ray and its usage.

Agenda:
5:00 p.m. Welcome remarks, Year 2022: Ray in Review & upcoming announcements - Jules Damji, Anyscale
Talk 1 (35-40 mins): Monitor & prevent out-of-memory problems with Ray OOM monitor - Clarence Ng, Anyscale
Q & A (10 mins)
Talk 2 (35-40 mins): How Shopify used Ray<>TensorFlow to build a Product Hierarchical Categorization model to auto-classify billions of products using NLP and Computer Vision - Kshetrajna Raghavan, Shopify

Anyscale

January 27, 2023

Transcript

  1. Ray Memory Monitor
    Ray team @ Anyscale

  2. Agenda
    Introduction
    Problem with the existing solution
    How does Ray memory monitor work
    Deep dive into preemption policy
    Demo

  3. Introduction
    Memory capacity

  5. OS (Linux) OOM killer
    → Triggers when the system runs out of free memory pages
    → Kills the most memory-hungry task
    → Frequency is capped: stalls other processes
    → Ray: sets process priority for the tasks

  6. Ray stalling

  8. Before: did not finish
    Tasks Memory used
    Cluster unreachable due to low memory

  9. After
    Retried tasks from OOM
    Tasks Memory used

  10. Thrashing
    90% done 30% done
    OOM!

  11. Thrashing
    90% done 30% done

  12. Thrashing
    30% done

  13. Thrashing
    90% done
    30% done

  14. Observability
    Task A OOM!
    Infinite retry

  15. Ray memory monitor
    Embedded
    memory
    monitor
    Worker
    processes
    Task
    Operating System
    Task
    Actor
    Actor
    Raylet
    Get resources
    usage
    Process stats
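
The "Get resources usage" step in the diagram can be pictured as reading memory statistics from the operating system. On Linux, `/proc/meminfo` reports `MemTotal` and `MemAvailable` in kB. The following is a minimal sketch under that assumption, not Ray's actual implementation; the function names `parse_meminfo` and `used_fraction` are hypothetical.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token is the number; a trailing "kB" unit is ignored.
            fields[key.strip()] = int(parts[0])
    return fields


def used_fraction(text: str) -> float:
    """Fraction of memory in use, derived from MemTotal and MemAvailable."""
    info = parse_meminfo(text)
    return 1.0 - info["MemAvailable"] / info["MemTotal"]
```

On a Linux machine, passing the contents of `/proc/meminfo` to `used_fraction` gives a node-level usage figure that a monitor could compare against a threshold.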

  16. Ray memory monitor
    Embedded
    memory
    monitor
    Operating System
    Raylet
    Using too much Memory?
    Worker
    processes
    Task
    Task
    Actor
    Actor

  17. Ray memory monitor
    Embedded
    memory
    monitor
    Operating System
    Raylet
    preempt
    Worker
    processes
    Task
    Task
    Actor
    Actor

  18. Ray memory monitor
    Embedded
    memory
    monitor
    Operating System
    Raylet
    Worker
    processes
    Task
    Actor
    Actor
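
The three diagrams above (check usage, ask "using too much memory?", preempt a worker) can be sketched as a single monitor step. This is a toy simulation, not the raylet's code: `monitor_tick`, the `choose_victim` callback, and the 0.95 threshold are all illustrative assumptions.

```python
from typing import Callable, List, Optional

# Illustrative value only; treat the real threshold as a configuration detail.
ASSUMED_THRESHOLD = 0.95


def monitor_tick(
    used_fraction: float,
    workers: List[str],
    choose_victim: Callable[[List[str]], str],
    threshold: float = ASSUMED_THRESHOLD,
) -> Optional[str]:
    """Run one monitoring step.

    If memory usage is above the threshold, pick one worker with the
    supplied policy, remove it from the list, and return it.
    Otherwise return None and leave the workers untouched.
    """
    if used_fraction <= threshold or not workers:
        return None
    victim = choose_victim(workers)
    workers.remove(victim)
    return victim
```

The policy is passed in as a callback because, as the following slides show, most of the design effort goes into deciding which worker to preempt rather than when.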

  19. Preemption policy
    Requirements:
    If the application cannot complete, we should surface that information to simplify debugging and the path to resolution
    The application will finish even when it tries to overload the cluster
    → It should finish in a reasonable amount of time
    → The workload shouldn’t hang

  20. Preemption policy (Ray 2.2)
    → Prefer killing retriable tasks
    → Prefer killing the newest task
    → Limited retries: could deadlock otherwise
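
The first two preferences above can be sketched as a single ordering over running tasks. This is a simplified illustration rather than Ray's implementation; the `Task` record and `pick_victim` function are hypothetical names, and retry accounting is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Task:
    name: str
    retriable: bool
    start_time: float  # seconds since some epoch; larger means newer


def pick_victim(tasks: List[Task]) -> Optional[Task]:
    """Prefer retriable tasks; among equals, prefer the newest one."""
    if not tasks:
        return None
    # Tuples compare left to right: retriable (True > False) first,
    # then the latest start time.
    return max(tasks, key=lambda t: (t.retriable, t.start_time))
```

Killing the newest task sacrifices the least completed work, which is the intuition behind the second preference.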

  21. Retriable task
    Task Crash!
    max_retries

  22. Newest executed task
    Task: start time of execution = 14:38
    Task: start time of execution = 14:22
    OOM kill

  23. Hyperparameter tuning
    Driver
    Trial
    Trainer
    Trial
    Trainer
    Data
    Data
    Data Data Data Data

  24. Preempt retriable tasks first
    Driver
    Trial
    Trainer
    Trial
    Trainer
    Data
    Data
    Data Data Data Data

  25. Preempt newest task
    Driver
    Trial
    Trainer
    Trial
    Trainer
    Data
    Data
    Data Data Data Data

  26. Problem: starvation
    Driver
    Trial
    Trainer
    Trial
    Trainer
    Data
    Data
    Data

  27. Ideal
    Driver
    Trial
    Trainer
    Trial
    Trainer
    Data
    Data Data

  28. Problem: deadlock
    Task A OOM!

  29. Preemption policy (Ray 2.3)
    → Group tasks that have the same parent if it is retriable
    → Preempt retriable groups
    → Preempt largest group
    → Preempt newest task within the group
    → Always retry task unless the task is the last member of the group
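
The grouping rules above can be sketched as follows. This is a toy model, not Ray's code: `Task`, `pick_victim`, and the `parent` field are hypothetical, and pooling all non-retriable tasks into one group is an assumption made here for illustration, since the slide only states how retriable tasks are grouped.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Task:
    name: str
    parent: str
    retriable: bool
    start_time: float  # larger means newer


def pick_victim(tasks: List[Task]) -> Optional[Tuple[Task, bool]]:
    """Return (victim, should_retry), or None if nothing is running."""
    groups: Dict[Tuple[bool, str], List[Task]] = defaultdict(list)
    for t in tasks:
        # Retriable tasks are grouped by parent. Assumption: non-retriable
        # tasks share a single group that is only chosen as a last resort.
        key = (True, t.parent) if t.retriable else (False, "")
        groups[key].append(t)
    if not groups:
        return None
    # Retriable groups first, then the largest group.
    (retriable, _), members = max(
        groups.items(), key=lambda kv: (kv[0][0], len(kv[1]))
    )
    victim = max(members, key=lambda t: t.start_time)  # newest in the group
    # Retry unless the victim is the last member of its group.
    should_retry = retriable and len(members) > 1
    return victim, should_retry
```

Picking from the largest group spreads preemption across parents, which is what keeps one trial from starving another; refusing to retry the last member is what turns an endless kill-and-retry loop into a reported failure.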

  30. Group tasks (non-retriable)
    Driver
    Trial
    Trainer
    Trial
    Trainer
    Data
    Data
    Data Data Data Data

  31. Group tasks (same parent)
    Driver
    Trial
    Trainer
    Trial
    Trainer
    Data
    Data
    Data Data Data Data

  32. Kill tasks from the largest group
    Driver
    Trial
    Trainer
    Trial
    Trainer
    Data
    Data Data Data Data

  33. Kill tasks from the largest group
    Driver
    Trial
    Trainer
    Trial
    Trainer
    Data
    Data Data Data

  34. Kill tasks from the largest group
    Driver
    Trial
    Trainer
    Trial
    Trainer
    Data Data Data

  35. Kill tasks from the largest group
    Driver
    Trial
    Trainer
    Trial
    Trainer
    Data Data

  36. Workload fails (group has no task)
    Driver
    Trial
    Trainer
    Trial
    Trainer
    Data Data

  37. Workload fails (group has no task)
    Task A OOM!

  38. Demo

  39. Summary
    ● Ray memory monitor improves cluster stability
    ● Latest release of Ray (2.2)
    ○ Preemptively kills tasks to prevent the node from failing
    ○ Improved observability for debugging memory issues
    ● Next release of Ray (2.3)
    ○ Detects when a workload gets stuck and reports the error
    ○ Fairness across tasks to avoid starvation

  41. Demo: single task
    Task A OOM!

  42. Demo: Two trials (HPO)
    Driver
    Trial
    Trainer
    Trial
    Trainer
    Data
    Data
    Data Data Data Data

  43. “Deadlock”
    Task A OOM!

  44. “Deadlock”
    Task A
    Task B
    Task C OOM!

  45. “Deadlock”
    Driver
    Task B
    Task C
    Task B
    Task C

  46. “Deadlock”
    Driver
    Task B
    Task C OK

  47. “Deadlock”
    Driver
    Task B
    Task B

  48. “Deadlock”
    Driver
    Task B
    Task C OOM!
    Task B

  49. Thrashing
    Driver
    Task B

  50. Thrashing
    Driver
    Task B OOM!
    Task B

  51. Thrashing
    Driver
    Task B

  52. Thrashing
    Driver
    Task B OOM!
    Task B

  53. Thrashing
    Driver
    Task B

  78. Thank you
    Any questions?
