State of Ray Serve in Ray 2.0

Ray Community Meetup Talks

Anyscale

September 29, 2022

Transcript

  1. Outline
     • Ray Serve Overview
     • Highlights in Ray 2.0
       ◦ Model Composition API
       ◦ Autoscaling
       ◦ Production Hardening
  2. Outline
     • Ray Serve Overview
     • Highlights in Ray 2.0
       ◦ Model Composition API
       ◦ Autoscaling
       ◦ Production Hardening
  3. Ray Serve TL;DR
     Flexible, scalable, efficient compute for online inference
     1. Scalable
     2. Low latency
     3. Efficient
     First-class support for multi-model inference
     Python-native: mix business logic & machine learning
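To make "Python-native" concrete, a minimal Ray Serve app in the Ray 2.0 style looks roughly like this; the `Echo` deployment and its body are illustrative, not from the talk:

```python
from starlette.requests import Request
from ray import serve

# A deployment is an ordinary Python class wrapped in a decorator.
@serve.deployment(num_replicas=2)
class Echo:
    async def __call__(self, request: Request) -> dict:
        # Plain Python: mix business logic and model code freely.
        return await request.json()

# serve.run deploys the application and exposes it over HTTP.
serve.run(Echo.bind())
```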
  4. Content Understanding Architecture
     [diagram: Downloader, Preprocessor, Image Detector, and Dynamic Dispatch feeding Image Classifiers #1–#3; produces image features, object bounding boxes, and results like { "object": "hummingbird", "score": 0.97, "segment_pixels": [(200, 403), …] }]
  5. Requirements for Online Inference
     → Minimize latency, maximize throughput 🔥
     → Fine-grained resources: fractional CPUs & GPUs ➗
     → Programmatic API for development and testing
     → Rock-solid story for production 🧱
  6. Basic Solution: Multi-model Monolith
     ❌ Unable to scale models independently! 🙁
     High latency, low throughput, and costly
     [diagram: all models behind a single HTTP endpoint]
  7. Complex Solution: Models as Containers
     ✅ Scale models independently, use different resources
     ❌ Requires deep Kubernetes knowledge to build an app 🤷
     [diagram: one container per model, connected over HTTP]
  8. Ray Serve is built for Multi-model Inference
     • Write a single Python program
     • Use your favorite tools & libraries
     • Scale across CPUs and GPUs
  9. Content Understanding as a Serve App
     [diagram: the same pipeline (Downloader, Preprocessor, Image Detector, Dynamic Dispatch, Image Classifiers #1–#3, Combine) running across pods, with per-deployment resources such as GPU: 1, CPU: 4; GPU: 0, CPU: 2; GPU: 0.3, CPU: 1; GPU: 0, CPU: 1]
  10. Content Understanding as a Serve App
     [diagram: same pipeline and resource allocations as above]
     • Single Python program
     • Developed and tested locally
     • Deployed & updated as a single app
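The resource annotations in the diagram correspond to Serve's per-deployment `ray_actor_options`. A sketch of how they might be declared follows; the mapping of resource values to specific components is my reading of the diagram, not stated in the talk:

```python
from ray import serve

# Each deployment declares its own resource footprint, so models
# scale independently even though they form one Python program.
@serve.deployment(ray_actor_options={"num_gpus": 1, "num_cpus": 4})
class ImageDetector:
    ...

@serve.deployment(ray_actor_options={"num_gpus": 0, "num_cpus": 2})
class Preprocessor:
    ...

# Fractional GPUs let multiple replicas share a single device.
@serve.deployment(ray_actor_options={"num_gpus": 0.3, "num_cpus": 1})
class ImageClassifier:
    ...
```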
  11. Ray Serve in 2.0
     → Goal: Make it easy to put scalable ML in production
     + Great UX for flexible model composition
     + Improved efficiency and cost savings with advanced autoscaling 💸💸💸
     + Production hardening, focused on Kubernetes
  12. Outline
     • Ray Serve Overview
     • Highlights in Ray 2.0
       ◦ Model Composition API
       ◦ Autoscaling
       ◦ Production Hardening
  13. Model Composition Requirements
     → Flexible to satisfy diverse use cases
       + Different models, frameworks, and business logic
     → Scalable and efficient when running in production 💸💸💸
     → Ability to develop, test, and debug locally 💻
  14. Solution: Model Composition API
     → First-class API to build graphs of Serve deployments
     → Full flexibility of Ray Serve
       + Author, configure, and scale each model independently
     → Orchestrate computation using regular Python code
  15. Deployment Graph API Enables Flexible Model Composition
     [diagram: the content-understanding pipeline annotated with the composition patterns Chaining, Ensemble, and Dynamic Selection]
  16. Ray Serve Model Composition API
     → Write your models as ordinary classes
     → Flexibly compose models & logic with Python code
     → Run, test, and debug on your laptop
     → Deploy to production: configure and scale models independently
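A sketch of what this looks like with the Ray 2.0 deployment graph API; the component bodies are placeholders, and `json_request` is one of Serve's built-in HTTP adapters:

```python
from ray import serve
from ray.serve.dag import InputNode
from ray.serve.drivers import DAGDriver
from ray.serve.http_adapters import json_request

@serve.deployment
class Preprocessor:
    def process(self, inp):
        # Placeholder: decode and resize the input here.
        return inp

@serve.deployment
class Classifier:
    def predict(self, features):
        # Placeholder: run the real model here.
        return {"object": "hummingbird", "score": 0.97}

# Orchestrate with regular Python: each .bind() adds a node to the
# graph rather than executing immediately.
with InputNode() as request_payload:
    features = Preprocessor.bind().process.bind(request_payload)
    prediction = Classifier.bind().predict.bind(features)

# DAGDriver exposes the graph over HTTP; each deployment in it can
# then be configured and scaled independently.
serve.run(DAGDriver.bind(prediction, http_adapter=json_request))
```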
  17. Outline
     • Ray Serve Overview
     • Highlights in Ray 2.0
       ◦ Model Composition API
       ◦ Autoscaling
       ◦ Production Hardening
  18. Autoscaling for ML Models
     → Problem: ML models are compute intensive → 💸💸💸
       + Not all models are always used
       + Hard to tune hardware utilization
       + Needs to work for multi-model apps
     → Solution: Advanced autoscaling for Serve 🧠
       + Supports scale-to-zero
       + Uses request queue lengths, no profiling
       + Fully compatible with the model composition API
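The queue-length-based policy can be pictured roughly as follows; this is an illustrative sketch of the idea, not Serve's actual implementation:

```python
import math

def desired_replicas(total_ongoing_requests: int, target_per_replica: int) -> int:
    # Aim for roughly target_per_replica in-flight requests per
    # replica, using only observed queue lengths (no profiling).
    if total_ongoing_requests == 0:
        return 0  # scale-to-zero when a model sees no traffic
    return math.ceil(total_ongoing_requests / target_per_replica)
```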
  19. Model Composition with Autoscaling
     [diagram: the pipeline (Downloader, Preprocessor, Image Detector, Dynamic Dispatch, Image Classifiers #1–#3, Combine)]
  20. Model Composition with Autoscaling
     [diagram: same pipeline; Classifier #1 queue_size >> target_queue_size]
  21. Model Composition with Autoscaling
     [diagram: same pipeline; Classifier #1 queue_size >> target_queue_size → add replicas of Image Classifier #1]
  22. Model Composition with Autoscaling
     [diagram: same pipeline; Classifier #3 idle for X min → remove replicas]
  23. Ray Serve Autoscaling
     → Easy to get started
       + Just set a few basic parameters
       + No need to profile or update your models
     → Supports scale-to-zero and integrates fully with model composition
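Concretely, those parameters are set per deployment through `autoscaling_config`; the option names below are from the Ray 2.0 API, while the specific values and the `ImageClassifier` body are illustrative:

```python
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 0,   # scale-to-zero when the model is idle
        "max_replicas": 10,
        # Scale on observed request queue lengths; no profiling needed.
        "target_num_ongoing_requests_per_replica": 5,
        "downscale_delay_s": 600,  # wait 10 min before removing idle replicas
    }
)
class ImageClassifier:
    def __call__(self, features):
        # Placeholder: run the real model here.
        return {"object": "hummingbird", "score": 0.97}
```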
  24. Outline
     • Ray Serve Overview
     • Highlights in Ray 2.0
       ◦ Model Composition API
       ◦ Autoscaling
       ◦ Production Hardening
  25. Production Hardening
     → Online inference means solving operational problems:
       + Updates without downtime
       + Handling failures gracefully
       + Monitoring, observability, alerting
     [diagram: operational benefits layered on top of flexibility, user experience, and efficiency]
  26. Production Hardening: Kubernetes Operator
     [diagram: developer workflow with the RayService operator 💻 (serve run, serve build, kubectl apply), which creates, updates, and monitors the cluster]
     • Zero-downtime updates
     • Health checking and recovery
     • Integrate with k8s tooling
  27. Production Hardening: GCS Fault Tolerance
     Prior to version 2.0, Ray had a single point of failure: the GCS on the head node ❌
     [diagram: head node running the GCS, plus worker nodes running Ray Actors]
  28. Production Hardening: GCS Fault Tolerance
     In version 2.0, Ray can now recover from GCS failures:
     → Tasks and actors continue to run
     → Ray Serve applications continue to serve traffic
     → A new GCS is started and the cluster is recovered
       + Handled automatically by the k8s operator
  29. Ray Serve in 2.0
     → Goal: Make it easy to put scalable ML in production
     + Great UX for flexible model composition
     + Improved efficiency and cost savings with advanced autoscaling 💸💸💸
     + Production hardening, focused on Kubernetes