
State of Ray Serve in 2.0

Anyscale
February 23, 2023


Ray Serve is a scalable model serving library for building online inference APIs. Serve is framework-agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to scikit-learn models, to arbitrary Python business logic.


Transcript

  1. Welcome to the European Ray meetup community talks February 22,

    2023 Hosted by Jules S. Damji, Ray Advocacy Team @ Anyscale Email: [email protected] Twitter: @2twitme
  2. Agenda: Virtual Meetup Talk 1: Announcements, Upcoming events, Overview of

    Ray Serve in 2.x - Jules S. Damji, Anyscale Talk 2: Scaling to/from zero on demand with Serve Handle API - Miha Jenko, Outbrain Talk 3: Smart shortcuts bootstrapping a modern NLP project - David Berenstein, Argilla.io
  3. Ray Summit 2023 - CfP • Open 1/3/2023 – 3/06/2023

    • In-person conference 9/18-20, 2023 ◦ Marriott Marquis, San Francisco • CfP: https://bit.ly/cfp-ray-summit-2023
  4. Outline ◦ Ray Serve Overview ◦ Highlights in Ray 2.x

    ▪ Model Composition API ▪ Autoscaling ▪ Production Hardening
  6. Ray Serve: TL;DR 1. Scalable 2. Low latency 3. Efficient

    First-class support for multi-model inference Python-native: mix business logic & machine learning Flexible, scalable, efficient compute for online inference
  7. Working Example: Content understanding End-to-end inference service { "object": "hummingbird",

    "score": 0.97, "segment_pixels": [ (200, 403), … ], } Input Output
  8. Content understanding architecture End-to-end inference flow Downloader

    Preprocessor Image Detector Dynamic Dispatch Image Classifier #1 Image Classifier #2 Image Classifier #3 Image features Object bounding box { "object": "hummingbird", "score": 0.97, "segment_pixels": [ (200, 403), … ], }
  9. Requirements for online inference → Minimize latency, maximize throughput 🔥

    → Fine-grained resources: fractional CPUs & GPUs ➗ → Programmatic API for development and testing → Rock-solid story for production 🧱
  10. Basic solution: multi-model monolith Single container ❌ Unable to

    scale models independently! 🙁 High latency, low throughput, and costly HTTP
  11. Complex solution: models as containers ✅ Scale models independently,

    use different resources ❌ Requires deep Kubernetes knowledge to build an app 🤷 HTTP HTTP
  12. Ray Serve is built for multi-model inference Write a single

    Python program Use your favorite tools & libraries Scale across CPUs and GPUs
  13. Content Understanding as a Serve App Downloader Preprocessor Image Detector

    Dynamic Dispatch Image Classifier #1 Image Classifier #2 Image Classifier #3 Pod Pod GPU: 1, CPU: 4 GPU: 0, CPU: 2 GPU: 0.3 CPU: 1 GPU: 0 CPU: 1 Combine
  14. Content Understanding as a Serve App Downloader Preprocessor Image Detector

    Dynamic Dispatch Image Classifier #1 Image Classifier #2 Image Classifier #3 Combine Pod Pod GPU: 1, CPU: 4 GPU: 0, CPU: 2 GPU: 0.3 CPU: 1 • Single Python program • Developed and tested locally • Deployed & updated as a single app
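The per-deployment resource annotations in the diagram (e.g. GPU: 0.3 for a classifier replica) are what let Serve pack several replicas onto one device. A back-of-the-envelope sketch in plain Python (no Serve dependency; `replicas_per_gpu` is an illustrative helper, not a Serve API):

```python
import math


def replicas_per_gpu(gpu_fraction: float) -> int:
    """How many replicas, each requesting `gpu_fraction` of a GPU, fit on one
    device. Illustrative arithmetic only, not a Serve API."""
    return math.floor(1.0 / gpu_fraction)


print(replicas_per_gpu(0.3))  # 3 replicas share one GPU at GPU: 0.3 each
print(replicas_per_gpu(0.5))  # 2
```

At GPU: 0.3 per replica, three classifier replicas share a single GPU instead of each claiming a whole device, which is where the cost savings over one-container-per-model come from.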
  15. Ray Serve in 2.x → Goal: Make it easy to

    put scalable ML in production + Production hardening, focused on Kubernetes + Great UX for flexible model composition + Improved efficiency and cost savings with advanced autoscaling 💸💸💸
  16. Outline • Ray Serve Overview • Highlights in Ray 2.x

    ◦ Model Composition API ◦ Autoscaling ◦ Production Hardening
  17. Model composition requirements → Flexible to satisfy diverse use cases

    + Different models, frameworks, and business logic → Scalable and efficient when running in production → Ability to develop, test, and debug locally 💻
  18. Solution: Model composition API → First-class API to build graphs

    of Serve deployments → Full flexibility of Ray Serve + Author, configure, scale each model independently → Orchestrate computation using regular Python code
  19. Deployment graph API enables flexible model composition Downloader Preprocessor Image

    Detector Dynamic Dispatch Image Classifier #1 Image Classifier #2 Image Classifier #3 Image features Object Bounding Boxes { “object”: “hummingbird”, “score”: 0.97, “segment_pixels”: [ (200, 403), … ], } Chaining Ensemble Dynamic Selection
  20. Ray Serve composition API → Write your models as ordinary

    classes → Flexibly compose models & logic w/ Python code → Run, test, and debug on your laptop → Deploy to production – configure and scale models independently
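The composition pattern above can be sketched framework-free. Plain Python classes stand in for Serve deployments, and ordinary control flow does the chaining and dynamic dispatch. All names below are illustrative stand-ins, not the Serve API; in real Serve each class would be decorated with `@serve.deployment` and called through handles:

```python
class Downloader:
    def __call__(self, url: str) -> bytes:
        # Stand-in: a real deployment would fetch the image over HTTP.
        return b"raw-image-bytes"


class Preprocessor:
    def __call__(self, image: bytes) -> bytes:
        # Stand-in for resizing / normalization.
        return image


class Detector:
    def __call__(self, image: bytes) -> dict:
        # Stand-in detector: a bounding box plus a coarse class hint.
        return {"bbox": (200, 403), "class_hint": "bird"}


class BirdClassifier:
    def __call__(self, image: bytes, bbox: tuple) -> dict:
        return {"object": "hummingbird", "score": 0.97}


class DefaultClassifier:
    def __call__(self, image: bytes, bbox: tuple) -> dict:
        return {"object": "unknown", "score": 0.5}


def app(url: str) -> dict:
    """Chaining + dynamic dispatch expressed as ordinary Python control flow."""
    image = Preprocessor()(Downloader()(url))
    detection = Detector()(image)
    # Dynamic dispatch: route to a downstream model based on the detector's hint.
    classifier = BirdClassifier() if detection["class_hint"] == "bird" else DefaultClassifier()
    result = classifier(image, detection["bbox"])
    result["segment_pixels"] = [detection["bbox"]]
    return result


print(app("https://example.com/hummingbird.jpg"))
# → {'object': 'hummingbird', 'score': 0.97, 'segment_pixels': [(200, 403)]}
```

Because the orchestration is plain Python, the whole graph runs on a laptop for development and testing; Serve then scales each stage independently in production.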
  21. Outline • Ray Serve Overview • Highlights in Ray 2.x

    ◦ Model Composition API ◦ Autoscaling ◦ Production Hardening
  22. Autoscaling for ML models → Problem: ML models are compute

    intensive → 💸💸💸 + Not all models are always used + Hard to tune hardware utilization + Needs to work for multi-model apps → Solution: Advanced autoscaling for Serve 🧠 + Supports scale-to-zero + Uses request queue lengths, no profiling needed + Fully compatible with the model composition API
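A minimal sketch of the queue-length-based scaling decision described above, in plain Python. This is illustrative logic, not Serve's actual implementation; Serve similarly compares observed queue sizes against a target, but applies smoothing and an idle window before scaling down:

```python
import math


def desired_replicas(queue_len: int, target_queue_size: int,
                     min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Illustrative queue-length-driven scaling decision (not Serve's code)."""
    if queue_len == 0:
        # min_replicas == 0 enables scale-to-zero; real Serve waits for an
        # idle window before removing replicas.
        return min_replicas
    wanted = math.ceil(queue_len / target_queue_size)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(0, target_queue_size=5))    # 0: no traffic, scale to zero
print(desired_replicas(42, target_queue_size=5))   # 9: ceil(42 / 5)
print(desired_replicas(500, target_queue_size=5))  # 10: capped at max_replicas
```

Since the signal is the request queue, no offline profiling of the model is needed, and each deployment in a composition graph scales on its own queue.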
  23. Model composition with autoscaling Downloader Preprocessor Image Detector Dynamic Dispatch

    Image Classifier #1 Image Classifier #2 Image Classifier #3 Combine
  24. Model composition with autoscaling Downloader Preprocessor Image Detector Dynamic

    Dispatch Image Classifier #1 Image Classifier #2 Image Classifier #3 Combine
  25. Model composition with autoscaling Downloader Preprocessor Image Detector Dynamic

    Dispatch Image Classifier #1 Image Classifier #2 Image Classifier #3 Combine Classifier #1 queue_size >> target_queue_size
  26. Model composition with autoscaling Downloader Preprocessor Image Detector Dynamic Dispatch

    Image Classifier #1 Image Classifier #2 Image Classifier #3 Combine Image Classifier #1 Image Classifier #1 Image Classifier #1 Classifier #1 queue_size >> target_queue_size → add replicas
  27. Model composition with autoscaling Downloader Preprocessor Image Detector Dynamic Dispatch

    Image Classifier #1 Image Classifier #2 Image Classifier #3 Combine Image Classifier #1 Image Classifier #1 Image Classifier #1 Classifier #3 idle for X min → remove replicas
  28. Ray Serve autoscaling → Easy to get started + Just

    set a few basic parameters + No need to profile or update your models → Supports scale-to-zero and integrates fully with model composition
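Concretely, the "few basic parameters" are set per deployment via `autoscaling_config`. The field names below follow the Ray Serve 2.x docs as of this talk, but verify against your Serve version, since later releases renamed some fields:

```python
# Field names follow Ray Serve 2.x `autoscaling_config` (verify against your
# Serve version; later releases renamed some of these).
autoscaling_config = {
    "min_replicas": 0,   # 0 enables scale-to-zero
    "max_replicas": 5,
    "target_num_ongoing_requests_per_replica": 10,  # the queue-length target
}

# In an app this dict is passed to the deployment decorator, roughly:
#   @serve.deployment(autoscaling_config=autoscaling_config)
#   class ImageClassifier: ...
print(autoscaling_config)
```

No model profiling is required: the autoscaler only needs these bounds and the per-replica queue target.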
  29. Production hardening → Online inference means solving operational problems: +

    Updates without downtime + Handling failures gracefully + Monitoring, observability, alerting
  30. Production Hardening: Kubernetes Operator 💻 serve run → serve build →

    kubectl apply → RayService operator (create / update / monitor) • Zero-downtime updates • Health checking and recovery • Integrates with k8s tooling
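The `serve build` step in this workflow emits a declarative config file, which is then embedded in a RayService resource and applied with `kubectl`. A sketch of its shape follows the Ray Serve 2.x config schema; the import path, deployment name, and values here are illustrative, not taken from the talk:

```yaml
# Sketch of a `serve build`-style config (Ray Serve 2.x schema; names and
# values are illustrative).
import_path: content_app:graph
runtime_env: {}
deployments:
  - name: ImageClassifier1
    ray_actor_options:
      num_gpus: 0.3
    autoscaling_config:
      min_replicas: 0
      max_replicas: 5
```

Keeping deployment settings in this file is what lets the operator reconfigure and scale models without touching application code.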
  31. Ray Serve Recap → Goal: Make it easy to

    put scalable ML in production + Production hardening, focused on Kubernetes + Great UX for flexible model composition + Improved efficiency and cost savings with advanced autoscaling 💰💰💰