State of Ray Serve in Ray 2.0

Ray Community Meetup Talks

Anyscale

September 29, 2022

Transcript

  1. Outline
     • Ray Serve Overview
     • Highlights in Ray 2.0
       ◦ Model Composition API
       ◦ Autoscaling
       ◦ Production Hardening
  2. Outline
     • Ray Serve Overview
     • Highlights in Ray 2.0
       ◦ Model Composition API
       ◦ Autoscaling
       ◦ Production Hardening
  3. Ray Serve TL;DR
     Flexible, scalable, efficient compute for online inference
     1. Scalable
     2. Low latency
     3. Efficient
     First-class support for multi-model inference
     Python-native: mix business logic & machine learning
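To make "Python-native" concrete, a minimal Ray Serve app in the Ray 2.0 style looks roughly like this; the `Echo` deployment and its body are illustrative, not from the talk:

```python
from starlette.requests import Request
from ray import serve

# A deployment is an ordinary Python class wrapped in a decorator.
@serve.deployment(num_replicas=2)
class Echo:
    async def __call__(self, request: Request) -> dict:
        # Plain Python: mix business logic and model code freely.
        return await request.json()

# serve.run deploys the application and exposes it over HTTP.
serve.run(Echo.bind())
```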
  4. Content Understanding Architecture
     [diagram: Downloader, Preprocessor, Image Detector, and Dynamic Dispatch feeding Image Classifiers #1–#3; produces image features, object bounding boxes, and results like { "object": "hummingbird", "score": 0.97, "segment_pixels": [(200, 403), …] }]
  5. Requirements for Online Inference
     → Minimize latency, maximize throughput 🔥
     → Fine-grained resources: fractional CPUs & GPUs ➗
     → Programmatic API for development and testing
     → Rock-solid story for production 🧱
  6. Basic Solution: Multi-model Monolith
     ❌ Unable to scale models independently! 🙁
     High latency, low throughput, and costly
     [diagram: all models behind a single HTTP endpoint]
  7. Complex Solution: Models as Containers
     ✅ Scale models independently, use different resources
     ❌ Requires deep Kubernetes knowledge to build an app 🤷
     [diagram: one container per model, connected over HTTP]
  8. Ray Serve is built for Multi-model Inference
     • Write a single Python program
     • Use your favorite tools & libraries
     • Scale across CPUs and GPUs
  9. Content Understanding as a Serve App
     [diagram: the same pipeline (Downloader, Preprocessor, Image Detector, Dynamic Dispatch, Image Classifiers #1–#3, Combine) running across pods, with per-deployment resources such as GPU: 1, CPU: 4; GPU: 0, CPU: 2; GPU: 0.3, CPU: 1; GPU: 0, CPU: 1]
  10. Content Understanding as a Serve App
     [diagram: same pipeline and resource allocations as above]
     • Single Python program
     • Developed and tested locally
     • Deployed & updated as a single app
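The resource annotations in the diagram correspond to Serve's per-deployment `ray_actor_options`. A sketch of how they might be declared follows; the mapping of resource values to specific components is my reading of the diagram, not stated in the talk:

```python
from ray import serve

# Each deployment declares its own resource footprint, so models
# scale independently even though they form one Python program.
@serve.deployment(ray_actor_options={"num_gpus": 1, "num_cpus": 4})
class ImageDetector:
    ...

@serve.deployment(ray_actor_options={"num_gpus": 0, "num_cpus": 2})
class Preprocessor:
    ...

# Fractional GPUs let multiple replicas share a single device.
@serve.deployment(ray_actor_options={"num_gpus": 0.3, "num_cpus": 1})
class ImageClassifier:
    ...
```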
  11. Ray Serve in 2.0
     → Goal: Make it easy to put scalable ML in production
     + Great UX for flexible model composition
     + Improved efficiency and cost savings with advanced autoscaling 💸💸💸
     + Production hardening, focused on Kubernetes
  12. Outline
     • Ray Serve Overview
     • Highlights in Ray 2.0
       ◦ Model Composition API
       ◦ Autoscaling
       ◦ Production Hardening
  13. Model Composition Requirements
     → Flexible to satisfy diverse use cases
       + Different models, frameworks, and business logic
     → Scalable and efficient when running in production 💸💸💸
     → Ability to develop, test, and debug locally 💻
  14. Solution: Model Composition API
     → First-class API to build graphs of Serve deployments
     → Full flexibility of Ray Serve
       + Author, configure, and scale each model independently
     → Orchestrate computation using regular Python code
  15. Deployment Graph API Enables Flexible Model Composition
     [diagram: the content-understanding pipeline annotated with the composition patterns Chaining, Ensemble, and Dynamic Selection]
  16. Ray Serve Model Composition API
     → Write your models as ordinary classes
     → Flexibly compose models & logic with Python code
     → Run, test, and debug on your laptop
     → Deploy to production: configure and scale models independently
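A sketch of what this looks like with the Ray 2.0 deployment graph API; the component bodies are placeholders, and `json_request` is one of Serve's built-in HTTP adapters:

```python
from ray import serve
from ray.serve.dag import InputNode
from ray.serve.drivers import DAGDriver
from ray.serve.http_adapters import json_request

@serve.deployment
class Preprocessor:
    def process(self, inp):
        # Placeholder: decode and resize the input here.
        return inp

@serve.deployment
class Classifier:
    def predict(self, features):
        # Placeholder: run the real model here.
        return {"object": "hummingbird", "score": 0.97}

# Orchestrate with regular Python: each .bind() adds a node to the
# graph rather than executing immediately.
with InputNode() as request_payload:
    features = Preprocessor.bind().process.bind(request_payload)
    prediction = Classifier.bind().predict.bind(features)

# DAGDriver exposes the graph over HTTP; each deployment in it can
# then be configured and scaled independently.
serve.run(DAGDriver.bind(prediction, http_adapter=json_request))
```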
  17. Outline
     • Ray Serve Overview
     • Highlights in Ray 2.0
       ◦ Model Composition API
       ◦ Autoscaling
       ◦ Production Hardening
  18. Autoscaling for ML Models
     → Problem: ML models are compute intensive → 💸💸💸
       + Not all models are always used
       + Hard to tune hardware utilization
       + Needs to work for multi-model apps
     → Solution: Advanced autoscaling for Serve 🧠
       + Supports scale-to-zero
       + Uses request queue lengths, no profiling
       + Fully compatible with the model composition API
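The queue-length-based policy can be pictured roughly as follows; this is an illustrative sketch of the idea, not Serve's actual implementation:

```python
import math

def desired_replicas(total_ongoing_requests: int, target_per_replica: int) -> int:
    # Aim for roughly target_per_replica in-flight requests per
    # replica, using only observed queue lengths (no profiling).
    if total_ongoing_requests == 0:
        return 0  # scale-to-zero when a model sees no traffic
    return math.ceil(total_ongoing_requests / target_per_replica)
```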
  19. Model Composition with Autoscaling
     [diagram: the pipeline (Downloader, Preprocessor, Image Detector, Dynamic Dispatch, Image Classifiers #1–#3, Combine)]
  20. Model Composition with Autoscaling
     [diagram: same pipeline; Classifier #1 queue_size >> target_queue_size]
  21. Model Composition with Autoscaling
     [diagram: same pipeline; Classifier #1 queue_size >> target_queue_size → add replicas of Image Classifier #1]
  22. Model Composition with Autoscaling
     [diagram: same pipeline; Classifier #3 idle for X min → remove replicas]
  23. Ray Serve Autoscaling
     → Easy to get started
       + Just set a few basic parameters
       + No need to profile or update your models
     → Supports scale-to-zero and integrates fully with model composition
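Concretely, those parameters are set per deployment through `autoscaling_config`; the option names below are from the Ray 2.0 API, while the specific values and the `ImageClassifier` body are illustrative:

```python
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 0,   # scale-to-zero when the model is idle
        "max_replicas": 10,
        # Scale on observed request queue lengths; no profiling needed.
        "target_num_ongoing_requests_per_replica": 5,
        "downscale_delay_s": 600,  # wait 10 min before removing idle replicas
    }
)
class ImageClassifier:
    def __call__(self, features):
        # Placeholder: run the real model here.
        return {"object": "hummingbird", "score": 0.97}
```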
  24. Outline
     • Ray Serve Overview
     • Highlights in Ray 2.0
       ◦ Model Composition API
       ◦ Autoscaling
       ◦ Production Hardening
  25. Production Hardening
     → Online inference means solving operational problems:
       + Updates without downtime
       + Handling failures gracefully
       + Monitoring, observability, alerting
     [diagram: operational benefits layered on top of flexibility, user experience, and efficiency]
  26. Production Hardening: Kubernetes Operator
     [diagram: developer workflow with the RayService operator 💻 (serve run, serve build, kubectl apply), which creates, updates, and monitors the cluster]
     • Zero-downtime updates
     • Health checking and recovery
     • Integrate with k8s tooling
  27. Production Hardening: GCS Fault Tolerance
     Prior to version 2.0, Ray had a single point of failure: the GCS on the head node ❌
     [diagram: head node running the GCS, plus worker nodes running Ray Actors]
  28. Production Hardening: GCS Fault Tolerance
     In version 2.0, Ray can now recover from GCS failures:
     → Tasks and actors continue to run
     → Ray Serve applications continue to serve traffic
     → A new GCS is started and the cluster is recovered
       + Handled automatically by the k8s operator
  29. Ray Serve in 2.0
     → Goal: Make it easy to put scalable ML in production
     + Great UX for flexible model composition
     + Improved efficiency and cost savings with advanced autoscaling 💸💸💸
     + Production hardening, focused on Kubernetes