State of Ray Serve in 2.0

Ray Serve is a scalable model serving library for building online inference APIs. Serve is framework-agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to scikit-learn models, to arbitrary Python business logic.

Anyscale

February 23, 2023

Transcript

  1. Welcome to the European Ray meetup community talks
    February 22, 2023
    Hosted by Jules S. Damji, Ray Advocacy Team @ Anyscale
    Email: [email protected]
    Twitter: @2twitme

  2. Agenda: Virtual Meetup
    Talk 1: Announcements, upcoming events, and an overview of Ray Serve in 2.x - Jules S. Damji, Anyscale
    Talk 2: Scaling to/from zero on demand with the Serve Handle API - Miha Jenko, Outbrain
    Talk 3: Smart shortcuts for bootstrapping a modern NLP project - David Berenstein, Argilla.io

  3. Ray Summit 2023 - CfP
    ● CfP open 1/3/2023 to 3/6/2023
    ● In-person conference 9/18-20, 2023
    ○ Marriott Marquis, San Francisco
    ● CfP: https://bit.ly/cfp-ray-summit-2023

  5. NYC Ray Meetup - March 22

  6. Ray books …

  7. https://www.ray.io/community

  9. State of Ray Serve in 2.x

  11. Outline
    ● Ray Serve Overview
    ● Highlights in Ray 2.x
    ○ Model Composition API
    ○ Autoscaling
    ○ Production Hardening


  12. Ray AI Runtime (AIR) - Online Inference

  13. Ray Serve: TL;DR
    Flexible, scalable, efficient compute for online inference:
    1. Scalable
    2. Low latency
    3. Efficient
    First-class support for multi-model inference
    Python-native: mix business logic & machine learning (see the sketch below)

  14. Working Example: Content understanding
    End-to-end inference service
    Input: an image
    Output:
    {
      "object": "hummingbird",
      "score": 0.97,
      "segment_pixels": [(200, 403), …]
    }

  15. Content understanding architecture
    End-to-end inference service flow:
    Downloader → Preprocessor → Image Detector (image features → object bounding box) → Dynamic Dispatch → Image Classifier #1 / #2 / #3 → final JSON output (as on the previous slide)

  16. Requirements for online inference
    → Minimize latency, maximize throughput 🔥
    → Fine-grained resources: fractional CPUs & GPUs
    → Programmatic API for development and testing
    → Rock-solid story for production 🧱

  17. Basic solution: multi-model monolith
    All models in a single container behind one HTTP endpoint
    ❌ Unable to scale models independently!
    🙁 High latency, low throughput, and costly

  18. Complex Solution: Models as Containers
    Each model in its own container behind its own HTTP endpoint
    ✅ Scale models independently, use different resources
    ❌ Requires deep Kubernetes knowledge to build an app 🤷

  19. Ray Serve is built for multi-model inference
    Write a single Python program
    Use your favorite tools & libraries
    Scale across CPUs and GPUs


  21. Content Understanding as a Serve App
    (Diagram: Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifiers #1-#3 → Combine, spread across pods, with per-deployment resources annotated: GPU: 1, CPU: 4; GPU: 0, CPU: 2; GPU: 0.3, CPU: 1; GPU: 0, CPU: 1. Sketched in code below.)
    ● Single Python program
    ● Developed and tested locally
    ● Deployed & updated as a single app
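The per-deployment resources in the diagram map directly onto deployment options. A hedged sketch, with class names mirroring the diagram and resource numbers taken from its annotations (the exact pairing of numbers to components is my reading of the diagram, and the method bodies are placeholders):

```python
from ray import serve


@serve.deployment(ray_actor_options={"num_gpus": 1, "num_cpus": 4})
class ImageDetector:
    def __call__(self, image: bytes) -> dict:
        ...  # heavyweight detection model gets a whole GPU


@serve.deployment(ray_actor_options={"num_gpus": 0.3, "num_cpus": 1})
class ImageClassifier:
    def __call__(self, crop: bytes) -> str:
        ...  # fractional GPU: three such replicas can share one device


@serve.deployment(ray_actor_options={"num_gpus": 0, "num_cpus": 2})
class DynamicDispatch:
    def __call__(self, detection: dict) -> str:
        ...  # CPU-only routing logic
```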

  22. Ray Serve in 2.x
    → Goal: Make it easy to put scalable ML in production
    + Production hardening, focused on Kubernetes
    + Great UX for flexible model composition
    + Improved efficiency and cost savings with advanced autoscaling 💸💸💸

  23. Outline
    ● Ray Serve Overview
    ● Highlights in Ray 2.x
    ○ Model Composition API
    ○ Autoscaling
    ○ Production Hardening


  24. Model composition requirements
    → Flexible to satisfy diverse use cases
    + Different models, frameworks, and business logic
    → Scalable and efficient when running in production 💸
    → Ability to develop, test, and debug locally 💻

  25. Solution: Model composition API
    → First-class API to build graphs of Serve deployments
    → Full flexibility of Ray Serve
    + Author, configure, scale each model independently
    → Orchestrate computation using regular Python code
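A minimal sketch of this API as it appears in Ray 2.x, using the deployment graph / InputNode pattern; the toy preprocess function and Model class are illustrative:

```python
import ray
from ray import serve
from ray.serve.deployment_graph import InputNode
from ray.serve.drivers import DAGDriver


@serve.deployment
def preprocess(x: int) -> int:
    return x + 1


@serve.deployment
class Model:
    def __init__(self, weight: int):
        self.weight = weight

    def predict(self, x: int) -> int:
        return x * self.weight


# Author the graph with plain Python: every .bind() call adds a node.
with InputNode() as inp:
    model = Model.bind(2)
    output = model.predict.bind(preprocess.bind(inp))

# DAGDriver exposes the graph over HTTP; serve.run returns a handle
# for programmatic (local) testing.
handle = serve.run(DAGDriver.bind(output))
print(ray.get(handle.predict.remote(5)))  # (5 + 1) * 2 == 12
```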

  26. Model composition pattern: Chaining
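A hedged sketch of chaining with the deployment graph API: each node's output becomes the next node's input. Both steps are placeholders:

```python
from ray import serve
from ray.serve.deployment_graph import InputNode


@serve.deployment
def detect(image: bytes) -> dict:
    return {"bbox": (0, 0, 64, 64)}  # placeholder detector


@serve.deployment
def classify(detection: dict) -> str:
    return "hummingbird"  # placeholder classifier


# Chaining: classify consumes detect's output, forming a linear pipeline.
with InputNode() as image:
    graph = classify.bind(detect.bind(image))
```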

  27. Model composition pattern: Ensemble
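A hedged sketch of the ensemble pattern: the same input fans out to several models, and a combiner merges their outputs. All three functions are placeholders:

```python
from ray import serve
from ray.serve.deployment_graph import InputNode


@serve.deployment
def model_a(x: float) -> float:
    return x + 0.1  # placeholder model


@serve.deployment
def model_b(x: float) -> float:
    return x - 0.1  # placeholder model


@serve.deployment
def combine(a: float, b: float) -> float:
    # Ensemble step: merge member outputs, here by averaging.
    return (a + b) / 2


# Both models receive the same input; combine merges their results.
with InputNode() as inp:
    graph = combine.bind(model_a.bind(inp), model_b.bind(inp))
```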

  28. Model composition pattern: Dynamic selection
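A hedged sketch of dynamic selection, assuming Ray 2.x handle semantics (bound deployments passed into a constructor arrive as handles, and `await handle.remote()` yields an object ref that is awaited for the result); all names are illustrative:

```python
from ray import serve


@serve.deployment
class BirdClassifier:
    def __call__(self, image_url: str) -> str:
        return "hummingbird"  # placeholder


@serve.deployment
class DogClassifier:
    def __call__(self, image_url: str) -> str:
        return "corgi"  # placeholder


@serve.deployment
class DynamicDispatch:
    def __init__(self, bird_handle, dog_handle):
        self.bird = bird_handle
        self.dog = dog_handle

    async def __call__(self, request) -> str:
        body = await request.json()
        # Pick a downstream model per request, at runtime.
        handle = self.bird if body["kind"] == "bird" else self.dog
        ref = await handle.remote(body["image_url"])
        return await ref


app = DynamicDispatch.bind(BirdClassifier.bind(), DogClassifier.bind())
```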

  29. Deployment graph API enables flexible model composition
    (Diagram: the content-understanding pipeline (Downloader, Preprocessor, Image Detector, Dynamic Dispatch, Image Classifiers #1-#3, final JSON output) annotated with the chaining, ensemble, and dynamic selection patterns.)

  30. Ray Serve composition API
    → Write your models as ordinary classes
    → Flexibly compose models & logic w/ Python code
    → Run, test, and debug on your laptop
    → Deploy to production – configure and scale models independently

  31. Outline
    ● Ray Serve Overview
    ● Highlights in Ray 2.x
    ○ Model Composition API
    ○ Autoscaling
    ○ Production Hardening

  32. Autoscaling for ML models
    → Problem: ML models are compute-intensive → 💸💸💸
    + Not all models are always used
    + Hard to tune hardware utilization
    + Needs to work for multi-model
    → Solution: Advanced autoscaling for Serve 🧠 (sketched below)
    + Supports scale-to-zero
    + Uses request queue lengths, no profiling needed
    + Fully compatible with the model composition API
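A hedged sketch of these knobs on a single deployment; the parameter names exist in Ray Serve 2.x, while the values and the ImageClassifier class are illustrative:

```python
from ray import serve


@serve.deployment(
    autoscaling_config={
        "min_replicas": 0,  # scale-to-zero while the model is idle
        "max_replicas": 10,
        # Queue-length target: Serve adds replicas when the number of
        # ongoing requests per replica exceeds this value.
        "target_num_ongoing_requests_per_replica": 2,
        # Wait this long before tearing down idle replicas.
        "downscale_delay_s": 600,
    },
)
class ImageClassifier:
    def __call__(self, request) -> str:
        return "ok"
```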

  33. Model composition with autoscaling
    (Diagram: Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifiers #1-#3 → Combine)

  35. Model composition with autoscaling
    (Same pipeline diagram, now under load.)
    Classifier #1 queue_size >> target_queue_size

  36. Model composition with autoscaling
    (Same pipeline diagram, with Image Classifier #1 fanned out to additional replicas.)
    Classifier #1 queue_size >> target_queue_size → add replicas

  37. Model composition with autoscaling
    (Same pipeline diagram; Image Classifier #1 keeps its extra replicas.)
    Classifier #3 idle for X min → remove replicas

  38. Ray Serve autoscaling
    → Easy to get started
    + Just set a few basic parameters
    + No need to profile or update your models
    → Supports scale-to-zero and integrates fully with model composition

  39. Production hardening
    → Online inference means solving operational problems:
    + Updates without downtime
    + Handling failures gracefully
    + Monitoring, observability, alerting

  40. Production Hardening: Kubernetes Operator
    RayService operator workflow: develop locally with serve run 💻, generate a config with serve build, then kubectl apply it; the operator creates, updates, and monitors the Serve app.
    ● Zero-downtime updates
    ● Health checking and recovery
    ● Integrates with k8s tooling

  41. Ray Summit 2022 Talks
    https://www.anyscale.com/ray-summit-2022

  43. Ray Serve Recap
    → Goal: Make it easy to put scalable ML in production
    + Production hardening, focused on Kubernetes
    + Great UX for flexible model composition
    + Improved efficiency and cost savings with advanced autoscaling 💰💰💰

  44. Thank you! Q & A
    [email protected]
    @2twitme
    Learn more and get in touch at rayserve.org
