Slide 1

Slide 1 text

Welcome to the European Ray meetup community talks February 22, 2023 Hosted by Jules S. Damji, Ray Advocacy Team @ Anyscale Email: jules@anyscale.com Twitter: @2twitme

Slide 2

Slide 2 text

Agenda: Virtual Meetup Talk 1: Announcements, Upcoming events, Overview of Ray Serve in 2.x - Jules S. Damji, Anyscale Talk 2: Scaling to/from zero on demand with Serve Handle API - Miha Jenko, Outbrain Talk 3: Smart shortcuts bootstrapping a modern NLP project - David Berenstein, Argilla.io

Slide 3

Slide 3 text

Ray Summit 2023 - CfP ● Open 1/3/2023 - 3/06/2023 ● In-person conference 9/18-20, 2023 ○ Marriott Marquis, San Francisco ● CfP: https://bit.ly/cfp-ray-summit-2023

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

NYC Ray Meetup - March 22

Slide 6

Slide 6 text

Ray books …

Slide 7

Slide 7 text

https://www.ray.io/community

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

State of Ray Serve in 2.x

Slide 10

Slide 10 text

Outline ○ Ray Serve Overview ○ Highlights in Ray 2.x ■ Model Composition API ■ Autoscaling ■ Production Hardening

Slide 11

Slide 11 text

Outline ● Ray Serve Overview ● Highlights in Ray 2.x ○ Model Composition API ○ Autoscaling ○ Production Hardening

Slide 12

Slide 12 text

Ray AI Runtime (AIR) Online Inference

Slide 13

Slide 13 text

Ray Serve: TL;DR 1. Scalable 2. Low latency 3. Efficient First-class support for multi-model inference Python-native: mix business logic & machine learning Flexible, scalable, efficient compute for online inference

Slide 14

Slide 14 text

Working Example: Content understanding End-to-end inference service { “object”: “hummingbird”, “score”: 0.97, “segment_pixels”: [ (200, 403), … ], } Input Output

Slide 15

Slide 15 text

Content understanding architecture End to end flow inference service Downloader Preprocessor Image Detector Dynamic Dispatch Image Classifier #1 Image Classifier #2 Image Classifier #3 Image features Object bounding box { “object”: “hummingbird”, “score”: 0.97, “segment_pixels”: [ (200, 403), … ], }
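The end-to-end flow above can be sketched in plain Python, with trivial stand-ins for the real models (all function names and return values here are illustrative assumptions, not actual Serve code; in the real app each stage would be a Serve deployment):

```python
# Toy stand-ins for the pipeline stages; a real app would wrap actual models.

def downloader(url):
    return b"raw-image-bytes"      # pretend we fetched the image

def preprocessor(raw):
    return {"pixels": raw}         # e.g. decode / resize / normalize

def image_detector(image):
    # Produces image features plus object bounding boxes.
    return {"features": image, "boxes": [(200, 403)]}

def dynamic_dispatch(detection):
    # Routes to one of several downstream classifiers based on the detection.
    classifiers = {
        1: lambda d: {"object": "hummingbird", "score": 0.97},
        2: lambda d: {"object": "cat", "score": 0.88},
    }
    return classifiers[1](detection)

def combine(detection, classification):
    return {**classification, "segment_pixels": detection["boxes"]}

def inference_service(url):
    detection = image_detector(preprocessor(downloader(url)))
    return combine(detection, dynamic_dispatch(detection))

print(inference_service("https://example.com/bird.jpg"))
```

In Ray Serve, each of these functions would become an independently scaled deployment, with the same call structure expressed over deployment handles.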

Slide 16

Slide 16 text

Requirements for online inference → Minimize latency, maximize throughput 🔥 → Fine-grained resources: fractional CPUs & GPUs ➗ → Programmatic API for development and testing 💻 → Rock-solid story for production 🧱

Slide 17

Slide 17 text

Basic solution: multi-model monolith Single container ❌ Unable to scale models independently! 🙁 High latency, low throughput, and costly

Slide 18

Slide 18 text

Complex Solution: Models as Containers ✅ Scale models independently, use different resources ❌ Requires deep Kubernetes knowledge to build an app 🤷

Slide 19

Slide 19 text

Ray Serve is built for multi-model inference Write a single Python program Use your favorite tools & libraries Scale across CPUs and GPUs

Slide 20

Slide 20 text

Content Understanding as a Serve App Downloader Preprocessor Image Detector Dynamic Dispatch Image Classifier #1 Image Classifier #2 Image Classifier #3 Pod Pod GPU: 1, CPU: 4 GPU: 0, CPU: 2 GPU: 0.3 CPU: 1 GPU: 0 CPU: 1 Combine

Slide 21

Slide 21 text

Content Understanding as a Serve App Downloader Preprocessor Image Detector Dynamic Dispatch Image Classifier #1 Image Classifier #2 Image Classifier #3 Combine Pod Pod GPU: 1, CPU: 4 GPU: 0, CPU: 2 GPU: 0.3 CPU: 1 GPU: 0 CPU: 1 ● Single Python program ● Developed and tested locally ● Deployed & updated as a single app

Slide 22

Slide 22 text

Ray Serve in 2.x + Production hardening, focused on Kubernetes → Goal: Make it easy to put scalable ML in production + Great UX for flexible model composition + Improved efficiency and cost savings with advanced autoscaling 💸💸💸

Slide 23

Slide 23 text

Outline ● Ray Serve Overview ● Highlights in Ray 2.x ○ Model Composition API ○ Autoscaling ○ Production Hardening

Slide 24

Slide 24 text

Model composition requirements → Flexible to satisfy diverse use cases + Different models, frameworks, and business logic → Scalable and efficient when running in production → Ability to develop, test, and debug locally 💻 💸💸 💸

Slide 25

Slide 25 text

Solution: Model composition API → First-class API to build graphs of Serve deployments → Full flexibility of Ray Serve + Author, configure, scale each model independently → Orchestrate computation using regular Python code

Slide 26

Slide 26 text

Model composition pattern: Chaining
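The chaining pattern can be sketched in plain Python: one model's output feeds the next. The models here are trivial hypothetical stand-ins; in Serve each would be a separate deployment called through a handle.

```python
# Chaining: each model's output is the next model's input.

def model_a(x):
    return x * 2

def model_b(x):
    return x + 1

def chained(request):
    # In Ray Serve, model_a and model_b would be separate deployments,
    # each configured and scaled independently.
    return model_b(model_a(request))

print(chained(10))  # 21
```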

Slide 27

Slide 27 text

Model composition pattern: Ensemble
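The ensemble pattern fans the same input out to several models and aggregates their outputs. A plain-Python sketch with hypothetical stand-in models (here the aggregation is a simple average):

```python
# Ensemble: same input to every model, then aggregate the outputs.

def model_a(x):
    return x * 2

def model_b(x):
    return x * 3

def model_c(x):
    return x * 4

def ensemble(request):
    outputs = [m(request) for m in (model_a, model_b, model_c)]
    return sum(outputs) / len(outputs)

print(ensemble(10))  # 30.0
```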

Slide 28

Slide 28 text

Model composition pattern: Dynamic selection
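Dynamic selection routes each request to exactly one downstream model based on the request itself, so the models not chosen do no work for that call. A plain-Python sketch (the models and the `premium` field are illustrative assumptions):

```python
# Dynamic selection: inspect the request and pick one model to run.

def cheap_model(request):
    return {"model": "cheap", "score": 0.7}

def accurate_model(request):
    return {"model": "accurate", "score": 0.95}

def dynamic_select(request):
    # Only the chosen model runs; the other consumes no compute for this call.
    model = accurate_model if request.get("premium") else cheap_model
    return model(request)

print(dynamic_select({"premium": True}))
```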

Slide 29

Slide 29 text

Deployment graph API enables flexible model composition Downloader Preprocessor Image Detector Dynamic Dispatch Image Classifier #1 Image Classifier #2 Image Classifier #3 Image features Object Bounding Boxes { “object”: “hummingbird”, “score”: 0.97, “segment_pixels”: [ (200, 403), … ], } Chaining Ensemble Dynamic Selection

Slide 30

Slide 30 text

Ray Serve composition API → Write your models as ordinary classes → Flexibly compose models & logic w/ Python code → Run, test, and debug on your laptop → Deploy to production – configure and scale models independently

Slide 31

Slide 31 text

Outline ● Ray Serve Overview ● Highlights in Ray 2.x ○ Model Composition API ○ Autoscaling ○ Production Hardening

Slide 32

Slide 32 text

Autoscaling for ML models → Problem: ML models are compute intensive -> 💸💸💸 + Not all models are always used + Hard to tune hardware utilization + Needs to work for multi-model → Solution: Advanced autoscaling for Serve 🧠 + Supports scale-to-zero + Uses request queue lengths, no profiling + Fully compatible with model composition API
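The queue-length heuristic mentioned above can be sketched as: compare the average queue size per replica against a target and scale proportionally. This is a simplified illustration of the idea, not Serve's actual implementation; the function name and parameters are assumptions.

```python
import math

def desired_replicas(total_queued, current_replicas, target_queue_size,
                     min_replicas=0, max_replicas=10):
    """Pick a replica count from queue length alone - no model profiling."""
    if current_replicas == 0:
        # Scale up from zero as soon as any request is waiting.
        return 1 if total_queued > 0 else min_replicas
    # Scale proportionally to how far the per-replica queue is from target.
    ratio = (total_queued / current_replicas) / target_queue_size
    return max(min_replicas,
               min(max_replicas, math.ceil(current_replicas * ratio)))

print(desired_replicas(40, 2, target_queue_size=10))  # 4: queues too long
print(desired_replicas(0, 3, target_queue_size=10))   # 0: idle, scale to zero
```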

Slide 33

Slide 33 text

Model composition with autoscaling Downloader Preprocessor Image Detector Dynamic Dispatch Image Classifier #1 Image Classifier #2 Image Classifier #3 Combine

Slide 34

Slide 34 text

Model composition with autoscaling Downloader Preprocessor Image Detector Dynamic Dispatch Image Classifier #1 Image Classifier #2 Image Classifier #3 Combine

Slide 35

Slide 35 text

Model composition with autoscaling Downloader Preprocessor Image Detector Dynamic Dispatch Image Classifier #1 Image Classifier #2 Image Classifier #3 Combine Classifier #1 queue_size >> target_queue_size

Slide 36

Slide 36 text

Model composition with autoscaling Downloader Preprocessor Image Detector Dynamic Dispatch Image Classifier #1 Image Classifier #2 Image Classifier #3 Combine Image Classifier #1 Image Classifier #1 Image Classifier #1 Classifier #1 queue_size >> target_queue_size → add replicas

Slide 37

Slide 37 text

Model composition with autoscaling Downloader Preprocessor Image Detector Dynamic Dispatch Image Classifier #1 Image Classifier #3 Combine Image Classifier #1 Image Classifier #1 Image Classifier #1 Classifier #3 Idle for X min → remove replicas Image Classifier #2

Slide 38

Slide 38 text

Ray Serve autoscaling → Easy to get started + Just set a few basic parameters + No need to profile or update your models → Supports scale-to-zero and integrates fully with model composition
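The "few basic parameters" can be sketched as the kind of autoscaling config a Serve 2.x deployment accepts. Field names follow the Ray Serve docs, but verify them against your Ray version before relying on this:

```python
# Sketch of a Serve-style autoscaling config; check the Ray Serve docs
# for the exact field names supported by your version.
autoscaling_config = {
    "min_replicas": 0,     # 0 enables scale-to-zero when idle
    "max_replicas": 5,
    "target_num_ongoing_requests_per_replica": 10,  # the queue-length target
    "downscale_delay_s": 300,  # wait before removing idle replicas
}

print(sorted(autoscaling_config))
```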

Slide 39

Slide 39 text

Production hardening → Online inference means solving operational problems: + Updates without downtime + Handling failures gracefully + Monitoring, observability, alerting

Slide 40

Slide 40 text

Production Hardening: Kubernetes Operator RayService operator 💻 serve run → serve build → kubectl apply (create, update, monitor) ● Zero-downtime updates ● Health checking and recovery ● Integrate with k8s tooling

Slide 41

Slide 41 text

Ray Summit 2022 Talks https://www.anyscale.com/ray-summit-2022

Slide 42

Slide 42 text

Ray Summit 2022 Talks https://www.anyscale.com/ray-summit-2022

Slide 43

Slide 43 text

Ray Serve Recap + Production hardening, focused on Kubernetes → Goal: Make it easy to put scalable ML in production + Great UX for flexible model composition + Improved efficiency and cost savings with advanced autoscaling 💰💰💰

Slide 44

Slide 44 text

Thank you! & Q & A jules@anyscale.com @2twitme Learn more and get in touch with rayserve.org