
Ray Community Meetup Talks

State of Ray Serve in Ray 2.0

Anyscale
September 29, 2022

Transcript

  1. State of Ray Serve in 2.0. Simon Mo ([email protected]). Ray Meetup, September 2022.
  2. Outline
     • Ray Serve Overview
     • Highlights in Ray 2.0
       ◦ Model Composition API
       ◦ Autoscaling
       ◦ Production Hardening
  3. Outline
     • Ray Serve Overview
     • Highlights in Ray 2.0
       ◦ Model Composition API
       ◦ Autoscaling
       ◦ Production Hardening
  4. Ray AI Runtime (AIR): Online Inference

  5. Ray Serve TL;DR: Flexible, scalable, efficient compute for online inference.
     1. Scalable 2. Low latency 3. Efficient
     First-class support for multi-model inference. Python-native: mix business logic & machine learning.
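     As a rough illustration of the "Python-native" point, a minimal sketch of a
     Serve deployment using the Ray 2.0 API (the Echo class and its reply are
     placeholders, not from the talk):

        # Minimal Ray Serve deployment sketch (Ray 2.0 API).
        from starlette.requests import Request

        from ray import serve

        @serve.deployment(num_replicas=2)  # scale out by adding replicas
        class Echo:
            async def __call__(self, request: Request) -> str:
                # Ordinary Python: mix business logic and model inference here.
                return "hello"

        # Start Serve locally; the deployment answers HTTP on localhost:8000.
        serve.run(Echo.bind())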
  6. Working Example: Content Understanding
     Input → Output: { "object": "hummingbird", "score": 0.97, "segment_pixels": [ (200, 403), … ] }
  7. Content Understanding Architecture
     Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifier #1 / #2 / #3
     Image features → Object bounding box → { "object": "hummingbird", "score": 0.97, "segment_pixels": [ (200, 403), … ] }
  8. Requirements for Online Inference
     → Minimize latency, maximize throughput 🔥
     → Fine-grained resources: fractional CPUs & GPUs ➗
     → Programmatic API for development and testing
     → Rock-solid story for production 🧱
  9. Basic Solution: Multi-model Monolith (all models behind one HTTP service)
     ❌ Unable to scale models independently! 🙁 High latency, low throughput, and costly.
  10. Complex Solution: Models as Containers (HTTP between per-model services)
      ✅ Scale models independently, use different resources
      ❌ Requires deep Kubernetes knowledge to build an app 🤷
  11. Ray Serve is Built for Multi-model Inference
      Write a single Python program. Use your favorite tools & libraries. Scale across CPUs and GPUs.
  12. Content Understanding as a Serve App
      Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifier #1 / #2 / #3 → Combine
      Each deployment declares its own resources (e.g., GPU: 1, CPU: 4; GPU: 0, CPU: 2; GPU: 0.3, CPU: 1; GPU: 0, CPU: 1) and is scheduled across pods.
  13. Content Understanding as a Serve App
      The same pipeline is:
      • A single Python program
      • Developed and tested locally
      • Deployed & updated as a single app
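      In code, each deployment in this diagram declares its own resources via
      Ray 2.0's ray_actor_options. A hedged sketch, where the mapping of the
      slide's resource numbers to components is illustrative:

        from ray import serve

        # Heavy model: a whole GPU plus several CPUs.
        @serve.deployment(ray_actor_options={"num_gpus": 1, "num_cpus": 4})
        class ImageDetector:
            ...

        # CPU-only stage.
        @serve.deployment(ray_actor_options={"num_gpus": 0, "num_cpus": 2})
        class Preprocessor:
            ...

        # Fractional GPUs let several classifier replicas share one device.
        @serve.deployment(ray_actor_options={"num_gpus": 0.3, "num_cpus": 1})
        class ImageClassifier:
            ...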
  14. Ray Serve in 2.0
      → Goal: Make it easy to put scalable ML in production
      + Great UX for flexible model composition
      + Improved efficiency and cost savings 💸 with advanced autoscaling
      + Production hardening, focused on Kubernetes
  15. Outline
      • Ray Serve Overview
      • Highlights in Ray 2.0
        ◦ Model Composition API
        ◦ Autoscaling
        ◦ Production Hardening
  16. Model Composition Requirements
      → Flexible to satisfy diverse use cases 💻
        + Different models, frameworks, and business logic
      → Scalable and efficient when running in production 💸
      → Ability to develop, test, and debug locally
  17. Solution: Model Composition API
      → First-class API to build graphs of Serve deployments
      → Full flexibility of Ray Serve
        + Author, configure, and scale each model independently
      → Orchestrate computation using regular Python code (see the sketch below)
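      A hedged sketch of such a graph with the Ray 2.0 deployment graph API;
      the class names echo the talk's example, but the method bodies are
      placeholders:

        from ray import serve
        from ray.serve.deployment_graph import InputNode
        from ray.serve.drivers import DAGDriver
        from ray.serve.http_adapters import json_request

        @serve.deployment
        class Preprocessor:
            def process(self, inp):
                return inp  # placeholder preprocessing

        @serve.deployment
        class ImageDetector:
            def detect(self, features):
                return {"object": "hummingbird", "score": 0.97}  # placeholder

        # Author the graph in ordinary Python; .bind() builds nodes lazily.
        with InputNode() as image:
            features = Preprocessor.bind().process.bind(image)
            result = ImageDetector.bind().detect.bind(features)

        # DAGDriver exposes the finished graph over HTTP (JSON in, JSON out).
        graph = DAGDriver.bind(result, http_adapter=json_request)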
  18. Model Composition Pattern: Chaining
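      A minimal sketch of the chaining pattern under the same deployment graph
      API; ModelA, ModelB, and their arithmetic are purely illustrative:

        from ray import serve
        from ray.serve.deployment_graph import InputNode

        @serve.deployment
        class ModelA:
            def forward(self, x: float) -> float:
                return x + 1.0  # placeholder

        @serve.deployment
        class ModelB:
            def forward(self, x: float) -> float:
                return x * 2.0  # placeholder

        # Chaining: ModelB consumes ModelA's output.
        with InputNode() as inp:
            a_out = ModelA.bind().forward.bind(inp)
            chained = ModelB.bind().forward.bind(a_out)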

  19. Model Composition Pattern: Ensemble
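      A minimal sketch of the ensemble pattern, with illustrative weights and a
      simple averaging combiner:

        from ray import serve
        from ray.serve.deployment_graph import InputNode

        @serve.deployment
        class Model:
            def __init__(self, weight: float):
                self.weight = weight

            def forward(self, x: float) -> float:
                return x * self.weight  # placeholder

        @serve.deployment
        def combine(*predictions: float) -> float:
            # Any Python logic works here; averaging is just an example.
            return sum(predictions) / len(predictions)

        # Ensemble: run all models in parallel, then merge their outputs.
        with InputNode() as inp:
            preds = [Model.bind(w).forward.bind(inp) for w in (0.5, 1.0, 2.0)]
            ensembled = combine.bind(*preds)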

  20. Model Composition Pattern: Dynamic Selection
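      A minimal sketch of dynamic selection: a dispatcher deployment holds
      handles to downstream models and picks one per request. The routing
      condition and names are illustrative, and the double await follows the
      Ray 2.0 handle pattern (the first await returns an ObjectRef, the second
      its value):

        from ray import serve
        from ray.serve.deployment_graph import InputNode

        @serve.deployment
        class Classifier:
            def __init__(self, name: str):
                self.name = name

            def classify(self, x: float) -> str:
                return self.name  # placeholder

        @serve.deployment
        class Dispatcher:
            def __init__(self, small, large):
                # Bound deployments arrive as handles at runtime.
                self.small, self.large = small, large

            async def route(self, x: float) -> str:
                handle = self.small if x < 0.5 else self.large
                ref = await handle.classify.remote(x)  # ObjectRef to result
                return await ref

        # Dynamic selection: only the chosen classifier sees the request.
        with InputNode() as inp:
            dispatcher = Dispatcher.bind(
                Classifier.bind("small"), Classifier.bind("large")
            )
            out = dispatcher.route.bind(inp)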

  21. The Deployment Graph API Enables Flexible Model Composition
      The content-understanding pipeline (Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifier #1 / #2 / #3) combines chaining, ensemble, and dynamic selection, turning image features and object bounding boxes into { "object": "hummingbird", "score": 0.97, "segment_pixels": [ (200, 403), … ] }.
  22. Ray Serve Model Composition API
      → Write your models as ordinary classes
      → Flexibly compose models & logic with Python code
      → Run, test, and debug on your laptop (see the sketch below)
      → Deploy to production: configure and scale models independently
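      A hedged sketch of the laptop workflow, assuming `graph` is a
      DAGDriver-wrapped deployment graph like the one sketched earlier:

        import ray
        from ray import serve

        handle = serve.run(graph)              # start Serve, deploy the graph
        ref = handle.predict.remote("input")   # call it through its handle
        print(ray.get(ref))                    # block until the result is ready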
  23. Outline
      • Ray Serve Overview
      • Highlights in Ray 2.0
        ◦ Model Composition API
        ◦ Autoscaling
        ◦ Production Hardening
  24. Autoscaling for ML Models
      → Problem: ML models are compute intensive → 💸💸💸
        + Not all models are always used
        + Hard to tune hardware utilization
        + Needs to work for multi-model apps
      → Solution: Advanced autoscaling for Serve 🧠 (see the sketch below)
        + Supports scale-to-zero
        + Uses request queue lengths, so no profiling is required
        + Fully compatible with the model composition API
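      A hedged sketch of what this looks like on a single deployment; the
      values are illustrative, but the field names follow Ray 2.0's
      autoscaling_config:

        from ray import serve

        @serve.deployment(
            autoscaling_config={
                "min_replicas": 0,   # scale-to-zero when idle
                "max_replicas": 10,
                # Add replicas when per-replica request queues exceed this
                # target -- no model profiling required.
                "target_num_ongoing_requests_per_replica": 5,
                "downscale_delay_s": 600,  # remove replicas after idling
            },
        )
        class ImageClassifier:
            def __call__(self, image: bytes) -> dict:
                return {"object": "hummingbird", "score": 0.97}  # placeholder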
  25. Model Composition with Autoscaling
      Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifier #1 / #2 / #3 → Combine
  26. Model Composition with Autoscaling
      (Same graph as above, now serving traffic.)
  27. Model Composition with Autoscaling
      Classifier #1: queue_size >> target_queue_size
  28. Model Composition with Autoscaling
      Classifier #1: queue_size >> target_queue_size → add replicas of Image Classifier #1
  29. Model Composition with Autoscaling
      Classifier #3: idle for X min → remove replicas of Image Classifier #3
  30. Ray Serve Autoscaling
      → Easy to get started
        + Just set a few basic parameters
        + No need to profile or update your models
      → Supports scale-to-zero and integrates fully with model composition
  31. Outline
      • Ray Serve Overview
      • Highlights in Ray 2.0
        ◦ Model Composition API
        ◦ Autoscaling
        ◦ Production Hardening
  32. Production Hardening
      → Online inference means solving operational problems:
      + Updates without downtime
      + Handling failures gracefully
      + Monitoring, observability, alerting
      (Diagram: flexibility, user experience, and efficiency on one side; operational benefits on the other.)
  33. Production Hardening: Kubernetes Operator
      💻 serve run → serve build → kubectl apply; the RayService operator then handles create, update, and monitor.
      • Zero-downtime updates
      • Health checking and recovery
      • Integration with k8s tooling
  34. Production Hardening: GCS Fault Tolerance
      Prior to version 2.0, Ray had a single point of failure: the GCS on the head node ❌, with Ray actors running across the head node and worker nodes.
  35. Production Hardening: GCS Fault Tolerance
      In version 2.0, Ray can recover from GCS failures:
      → Tasks and actors continue to run
      → A new GCS is started and the cluster is recovered
        + Handled automatically by the k8s operator
      Ray Serve applications continue to serve traffic throughout.
  36. Chaos Testing: 99.99% uptime

  37. Ray Serve in 2.0
      → Goal: Make it easy to put scalable ML in production
      + Great UX for flexible model composition
      + Improved efficiency and cost savings 💸 with advanced autoscaling
      + Production hardening, focused on Kubernetes
  38. Ray Serve Deep Dives @ Ray Summit

  39. Ray Serve Community @ Ray Summit

  40. Thank you! Q&A. Learn more and get in touch at rayserve.org