Slide 1

State of Ray Serve in 2.0
Simon Mo - [email protected]
Ray Meetup @ 09/2022

Slide 2

Outline
● Ray Serve Overview
● Highlights in Ray 2.0
  ○ Model Composition API
  ○ Autoscaling
  ○ Production Hardening

Slide 3

Outline
● Ray Serve Overview
● Highlights in Ray 2.0
  ○ Model Composition API
  ○ Autoscaling
  ○ Production Hardening

Slide 4

Ray AI Runtime (AIR): Online Inference

Slide 5

Ray Serve TL;DR
Flexible, scalable, efficient compute for online inference:
1. Scalable
2. Low latency
3. Efficient
First-class support for multi-model inference.
Python-native: mix business logic & machine learning.
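To make "Python-native" concrete, here is a minimal sketch of a Serve deployment; the model logic and all names are illustrative placeholders, not code from the talk:

```python
import ray
from ray import serve

@serve.deployment  # each deployment scales out as a set of replicas
class Sentiment:
    def __init__(self):
        # Load the model once per replica; a real app would load weights here.
        self.positive_words = {"great", "good", "love"}

    def __call__(self, text: str) -> dict:
        # Plain business logic can sit right next to model inference.
        score = sum(w in self.positive_words for w in text.lower().split())
        return {"label": "positive" if score else "negative", "score": score}

app = Sentiment.bind()
handle = serve.run(app)  # deploys locally and returns a handle for testing
print(ray.get(handle.remote("I love hummingbirds")))
```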

Slide 6

Working Example: Content Understanding

Input (an image) → Output:
{
  "object": "hummingbird",
  "score": 0.97,
  "segment_pixels": [(200, 403), ...],
}

Slide 7

Content Understanding Architecture

Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifier #1 / #2 / #3

Intermediate data: image features, object bounding box. Final output:
{
  "object": "hummingbird",
  "score": 0.97,
  "segment_pixels": [(200, 403), ...],
}

Slide 8

Requirements for Online Inference
→ Minimize latency, maximize throughput 🔥
→ Fine-grained resources: fractional CPUs & GPUs ➗
→ Programmatic API for development and testing
→ Rock-solid story for production 🧱

Slide 9

Basic Solution: Multi-model Monolith (one HTTP service wrapping all models)
❌ Unable to scale models independently! 🙁
❌ High latency, low throughput, and costly

Slide 10

Complex Solution: Models as Containers (each model behind its own HTTP service)
✅ Scale models independently, use different resources
❌ Requires deep Kubernetes knowledge to build an app 🤷

Slide 11

Ray Serve is built for Multi-model Inference
● Write a single Python program
● Use your favorite tools & libraries
● Scale across CPUs and GPUs

Slide 12

Content Understanding as a Serve App

Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifier #1 / #2 / #3 → Combine

Each deployment declares its own fine-grained resources (from the diagram: GPU: 1, CPU: 4 / GPU: 0, CPU: 2 / GPU: 0.3, CPU: 1 / GPU: 0, CPU: 1), and replicas are scheduled across pods.

Slide 13

Content Understanding as a Serve App

Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifier #1 / #2 / #3 → Combine
(per-deployment resources as above, scheduled across pods)

● Single Python program
● Developed and tested locally
● Deployed & updated as a single app
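A sketch of how such per-model resource numbers are declared (class bodies elided; the mapping of specific numbers to specific models is an assumption based on the diagram):

```python
from ray import serve

@serve.deployment(ray_actor_options={"num_gpus": 1, "num_cpus": 4})
class ImageDetector:
    ...

@serve.deployment(ray_actor_options={"num_gpus": 0.3, "num_cpus": 1})
class ImageClassifier:
    # With num_gpus=0.3, three classifier replicas can share one physical GPU.
    ...

@serve.deployment(ray_actor_options={"num_gpus": 0, "num_cpus": 2})
class Preprocessor:
    ...
```

Fractional `num_gpus` values are how Serve packs several small models onto a single GPU.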

Slide 14

Ray Serve in 2.0
→ Goal: Make it easy to put scalable ML in production
+ Great UX for flexible model composition
+ Improved efficiency and cost savings with advanced autoscaling 💸
+ Production hardening, focused on Kubernetes

Slide 15

Outline
● Ray Serve Overview
● Highlights in Ray 2.0
  ○ Model Composition API
  ○ Autoscaling
  ○ Production Hardening

Slide 16

Model Composition Requirements
→ Flexible to satisfy diverse use cases
  + Different models, frameworks, and business logic
→ Scalable and efficient when running in production 💸
→ Ability to develop, test, and debug locally 💻

Slide 17

Solution: Model Composition API
→ First-class API to build graphs of Serve deployments
→ Full flexibility of Ray Serve
  + Author, configure, scale each model independently
→ Orchestrate computation using regular Python code

Slide 18

Model Composition Pattern: Chaining
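The slide's code does not survive in this transcript; below is a minimal sketch of the chaining pattern using the 2.0 deployment graph API (model logic is a placeholder):

```python
from ray import serve
from ray.serve.dag import InputNode
from ray.serve.drivers import DAGDriver

@serve.deployment
class Preprocessor:
    def run(self, x: float) -> float:
        return x * 2  # placeholder transform

@serve.deployment
class Model:
    def predict(self, x: float) -> float:
        return x + 1  # placeholder inference

preprocessor = Preprocessor.bind()
model = Model.bind()

# Chaining: the output of one deployment is the input of the next.
with InputNode() as inp:
    output = model.predict.bind(preprocessor.run.bind(inp))

graph = DAGDriver.bind(output)  # HTTP ingress for the graph
# serve.run(graph) deploys the whole chain locally for testing.
```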

Slide 19

Model Composition Pattern: Ensemble
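Again the original code is not in the transcript; a minimal ensemble sketch under the same assumptions (two placeholder models, averaged by a combiner):

```python
from ray import serve
from ray.serve.dag import InputNode
from ray.serve.drivers import DAGDriver

@serve.deployment
class ModelA:
    def predict(self, x: float) -> float:
        return x + 1.0  # placeholder

@serve.deployment
class ModelB:
    def predict(self, x: float) -> float:
        return x + 2.0  # placeholder

@serve.deployment
def combine(a: float, b: float) -> float:
    # Ensemble: merge the predictions from both models.
    return (a + b) / 2

model_a, model_b = ModelA.bind(), ModelB.bind()

# The same input fans out to both models; Serve runs them in parallel.
with InputNode() as inp:
    output = combine.bind(model_a.predict.bind(inp), model_b.predict.bind(inp))

graph = DAGDriver.bind(output)
```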

Slide 20

Model Composition Pattern: Dynamic Selection
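Dynamic selection cannot be a fully static graph, so the 2.0 pattern is a dispatcher deployment that holds handles to the downstream models and picks one per request. The sketch below follows the 2.0 async-handle convention (an awaited `.remote()` yields an object ref, which is awaited again); names and routing logic are illustrative:

```python
from ray import serve

@serve.deployment
class BirdClassifier:
    def __call__(self, data):
        return {"model": "bird", "input": data}  # placeholder

@serve.deployment
class DogClassifier:
    def __call__(self, data):
        return {"model": "dog", "input": data}  # placeholder

@serve.deployment
class Dispatcher:
    def __init__(self, bird_classifier, dog_classifier):
        # Bound deployments passed to the constructor arrive as handles.
        self.handles = {"bird": bird_classifier, "dog": dog_classifier}

    async def __call__(self, request: dict):
        # Dynamic selection: choose the downstream model at runtime.
        handle = self.handles[request["kind"]]
        ref = await handle.remote(request["data"])
        return await ref

app = Dispatcher.bind(BirdClassifier.bind(), DogClassifier.bind())
```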

Slide 21

Deployment Graph API Enables Flexible Model Composition

Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifier #1 / #2 / #3
(image features → object bounding boxes → the final JSON output shown earlier)

The content understanding app combines all three patterns: chaining, ensemble, and dynamic selection.

Slide 22

Ray Serve Model Composition API
→ Write your models as ordinary classes
→ Flexibly compose models & logic w/ Python code
→ Run, test, and debug on your laptop
→ Deploy to production: configure and scale models independently
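One concrete reading of "configure and scale models independently", as a sketch (values arbitrary): every deployment carries its own settings, and `.options()` overrides them without touching the model code:

```python
from ray import serve

@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 1})
class Preprocessor:
    def __call__(self, x):
        return x  # placeholder

# Scale this one deployment up for production without editing the class.
ProdPreprocessor = Preprocessor.options(
    num_replicas=8,
    ray_actor_options={"num_cpus": 2},
)
```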

Slide 23

Outline
● Ray Serve Overview
● Highlights in Ray 2.0
  ○ Model Composition API
  ○ Autoscaling
  ○ Production Hardening

Slide 24

Autoscaling for ML Models
→ Problem: ML models are compute intensive → 💸💸💸
  + Not all models are always used
  + Hard to tune hardware utilization
  + Needs to work for multi-model apps
→ Solution: Advanced autoscaling for Serve 🧠
  + Supports scale-to-zero
  + Uses request queue lengths, no profiling
  + Fully compatible with the model composition API
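A sketch of what these autoscaling settings look like on a deployment in 2.0 (values illustrative); note that scaling is driven by per-replica request queues, not by offline profiling:

```python
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 0,   # scale-to-zero when the model is unused
        "max_replicas": 10,
        # Add replicas when queues exceed this target, remove them when below.
        "target_num_ongoing_requests_per_replica": 5,
        "downscale_delay_s": 600,  # stay idle this long before removing replicas
    },
)
class ImageClassifier:
    def __call__(self, image):
        ...  # placeholder inference
```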

Slide 25

Model Composition with Autoscaling

Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifier #1 / #2 / #3 → Combine

Slide 27

Model Composition with Autoscaling

Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifier #1 / #2 / #3 → Combine

Classifier #1: queue_size >> target_queue_size

Slide 28

Model Composition with Autoscaling

Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifier #1 / #2 / #3 → Combine

Classifier #1: queue_size >> target_queue_size → add replicas (Classifier #1 now runs several replicas)

Slide 29

Model Composition with Autoscaling

Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifier #1 / #2 / #3 → Combine

Classifier #3: idle for X min → remove replicas (while Classifier #1 keeps its extra replicas)

Slide 30

Ray Serve Autoscaling
→ Easy to get started
  + Just set a few basic parameters
  + No need to profile or update your models
→ Supports scale-to-zero and integrates fully with model composition

Slide 31

Outline
● Ray Serve Overview
● Highlights in Ray 2.0
  ○ Model Composition API
  ○ Autoscaling
  ○ Production Hardening

Slide 32

Production Hardening
→ Online inference means solving operational problems:
  + Updates without downtime
  + Handling failures gracefully
  + Monitoring, observability, alerting

(Diagram: operational benefits alongside flexibility, user experience, and efficiency.)

Slide 33

Production Hardening: Kubernetes Operator (RayService)

💻 serve run → serve build → kubectl apply; the RayService operator then creates, updates, and monitors the Serve app.
● Zero-downtime updates
● Health checking and recovery
● Integrate with k8s tooling
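A sketch of that workflow with the 2.0 CLI; the file and import names (`app.py` exposing `graph`, a `rayservice.yaml` manifest) are assumptions for illustration:

```shell
# Develop and test locally against the Python definition.
serve run app:graph

# Generate a declarative config file from the same graph.
serve build app:graph -o serve_config.yaml

# Reference the config from a RayService custom resource, then let the
# operator create, update, and monitor the cluster.
kubectl apply -f rayservice.yaml
```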

Slide 34

Production Hardening: GCS Fault Tolerance

Prior to version 2.0, Ray had a single point of failure: the GCS (Global Control Store) runs on the head node, with Ray Actors spread across the worker nodes. If the head node's GCS went down ❌, the whole cluster went down with it.

Slide 35

Production Hardening: GCS Fault Tolerance

Ray can now recover from GCS failures in version 2.0:
→ Tasks and actors continue to run
→ A new GCS is started and the cluster is recovered
  + Handled automatically by the k8s operator
→ Ray Serve applications continue to serve traffic

Slide 36

Chaos Testing: 99.99% uptime

Slide 37

Ray Serve in 2.0 (recap)
→ Goal: Make it easy to put scalable ML in production
+ Great UX for flexible model composition
+ Improved efficiency and cost savings with advanced autoscaling 💸
+ Production hardening, focused on Kubernetes

Slide 38

Ray Serve Deep Dives @ Ray Summit

Slide 39

Ray Serve Community @ Ray Summit

Slide 40

Thank you! Q&A

Learn more and get in touch at rayserve.org