State of Ray Serve in 2.0

Ray Serve is a scalable model serving library for building online inference APIs. Serve is framework-agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, TensorFlow, and Keras, to scikit-learn models, to arbitrary Python business logic.

Anyscale

February 23, 2023

Transcript

  1. Welcome to the European Ray meetup community talks
    February 22, 2023
    Hosted by Jules S. Damji, Ray Advocacy Team @ Anyscale
    Email: [email protected]
    Twitter: @2twitme

  2. Agenda: Virtual Meetup
    Talk 1: Announcements, upcoming events, and an overview of Ray Serve in 2.x - Jules S. Damji, Anyscale
    Talk 2: Scaling to/from zero on demand with the Serve Handle API - Miha Jenko, Outbrain
    Talk 3: Smart shortcuts for bootstrapping a modern NLP project - David Berenstein, Argilla.io

  3. Ray Summit 2023 - CfP
    ● CfP open 1/3/2023 to 3/6/2023
    ● In-person conference 9/18-20, 2023
    ○ Marriott Marquis, San Francisco
    ● CfP: https://bit.ly/cfp-ray-summit-2023

  5. NYC Ray Meetup - March 22

  6. Ray books …

  7. https://www.ray.io/community

  9. State of Ray Serve in 2.x

  11. Outline
    ● Ray Serve Overview
    ● Highlights in Ray 2.x
    ○ Model Composition API
    ○ Autoscaling
    ○ Production Hardening


  12. Ray AI Runtime (AIR) - Online Inference

  13. Ray Serve: TL;DR
    Flexible, scalable, efficient compute for online inference:
    1. Scalable
    2. Low latency
    3. Efficient
    First-class support for multi-model inference
    Python-native: mix business logic & machine learning (see the sketch below)

  14. Working Example: Content understanding
    End-to-end inference service
    Input: an image
    Output:
    {
      "object": "hummingbird",
      "score": 0.97,
      "segment_pixels": [(200, 403), …]
    }

  15. Content understanding architecture
    End-to-end inference service flow:
    Downloader → Preprocessor → Image Detector (image features → object bounding box) → Dynamic Dispatch → Image Classifier #1 / #2 / #3 → final JSON output (as on the previous slide)

  16. Requirements for online inference
    → Minimize latency, maximize throughput 🔥
    → Fine-grained resources: fractional CPUs & GPUs
    → Programmatic API for development and testing
    → Rock-solid story for production 🧱

  17. Basic solution: multi-model monolith
    All models in a single container behind one HTTP endpoint
    ❌ Unable to scale models independently!
    🙁 High latency, low throughput, and costly

  18. Complex Solution: Models as Containers
    Each model in its own container behind its own HTTP endpoint
    ✅ Scale models independently, use different resources
    ❌ Requires deep Kubernetes knowledge to build an app 🤷

  19. Ray Serve is built for multi-model inference
    Write a single Python program
    Use your favorite tools & libraries
    Scale across CPUs and GPUs


  21. Content Understanding as a Serve App
    (Diagram: Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifiers #1-#3 → Combine, spread across pods, with per-deployment resources annotated: GPU: 1, CPU: 4; GPU: 0, CPU: 2; GPU: 0.3, CPU: 1; GPU: 0, CPU: 1. Sketched in code below.)
    ● Single Python program
    ● Developed and tested locally
    ● Deployed & updated as a single app
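The per-deployment resources in the diagram map directly onto deployment options. A hedged sketch, with class names mirroring the diagram and resource numbers taken from its annotations (the exact pairing of numbers to components is my reading of the diagram, and the method bodies are placeholders):

```python
from ray import serve


@serve.deployment(ray_actor_options={"num_gpus": 1, "num_cpus": 4})
class ImageDetector:
    def __call__(self, image: bytes) -> dict:
        ...  # heavyweight detection model gets a whole GPU


@serve.deployment(ray_actor_options={"num_gpus": 0.3, "num_cpus": 1})
class ImageClassifier:
    def __call__(self, crop: bytes) -> str:
        ...  # fractional GPU: three such replicas can share one device


@serve.deployment(ray_actor_options={"num_gpus": 0, "num_cpus": 2})
class DynamicDispatch:
    def __call__(self, detection: dict) -> str:
        ...  # CPU-only routing logic
```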

  22. Ray Serve in 2.x
    → Goal: Make it easy to put scalable ML in production
    + Production hardening, focused on Kubernetes
    + Great UX for flexible model composition
    + Improved efficiency and cost savings with advanced autoscaling 💸💸💸

  23. Outline
    ● Ray Serve Overview
    ● Highlights in Ray 2.x
    ○ Model Composition API
    ○ Autoscaling
    ○ Production Hardening


  24. Model composition requirements
    → Flexible to satisfy diverse use cases
    + Different models, frameworks, and business logic
    → Scalable and efficient when running in production 💸
    → Ability to develop, test, and debug locally 💻

  25. Solution: Model composition API
    → First-class API to build graphs of Serve deployments
    → Full flexibility of Ray Serve
    + Author, configure, scale each model independently
    → Orchestrate computation using regular Python code
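A minimal sketch of this API as it appears in Ray 2.x, using the deployment graph / InputNode pattern; the toy preprocess function and Model class are illustrative:

```python
import ray
from ray import serve
from ray.serve.deployment_graph import InputNode
from ray.serve.drivers import DAGDriver


@serve.deployment
def preprocess(x: int) -> int:
    return x + 1


@serve.deployment
class Model:
    def __init__(self, weight: int):
        self.weight = weight

    def predict(self, x: int) -> int:
        return x * self.weight


# Author the graph with plain Python: every .bind() call adds a node.
with InputNode() as inp:
    model = Model.bind(2)
    output = model.predict.bind(preprocess.bind(inp))

# DAGDriver exposes the graph over HTTP; serve.run returns a handle
# for programmatic (local) testing.
handle = serve.run(DAGDriver.bind(output))
print(ray.get(handle.predict.remote(5)))  # (5 + 1) * 2 == 12
```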

  26. Model composition pattern: Chaining
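A hedged sketch of chaining with the deployment graph API: each node's output becomes the next node's input. Both steps are placeholders:

```python
from ray import serve
from ray.serve.deployment_graph import InputNode


@serve.deployment
def detect(image: bytes) -> dict:
    return {"bbox": (0, 0, 64, 64)}  # placeholder detector


@serve.deployment
def classify(detection: dict) -> str:
    return "hummingbird"  # placeholder classifier


# Chaining: classify consumes detect's output, forming a linear pipeline.
with InputNode() as image:
    graph = classify.bind(detect.bind(image))
```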

  27. Model composition pattern: Ensemble
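A hedged sketch of the ensemble pattern: the same input fans out to several models, and a combiner merges their outputs. All three functions are placeholders:

```python
from ray import serve
from ray.serve.deployment_graph import InputNode


@serve.deployment
def model_a(x: float) -> float:
    return x + 0.1  # placeholder model


@serve.deployment
def model_b(x: float) -> float:
    return x - 0.1  # placeholder model


@serve.deployment
def combine(a: float, b: float) -> float:
    # Ensemble step: merge member outputs, here by averaging.
    return (a + b) / 2


# Both models receive the same input; combine merges their results.
with InputNode() as inp:
    graph = combine.bind(model_a.bind(inp), model_b.bind(inp))
```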

  28. Model composition pattern: Dynamic selection
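A hedged sketch of dynamic selection, assuming Ray 2.x handle semantics (bound deployments passed into a constructor arrive as handles, and `await handle.remote()` yields an object ref that is awaited for the result); all names are illustrative:

```python
from ray import serve


@serve.deployment
class BirdClassifier:
    def __call__(self, image_url: str) -> str:
        return "hummingbird"  # placeholder


@serve.deployment
class DogClassifier:
    def __call__(self, image_url: str) -> str:
        return "corgi"  # placeholder


@serve.deployment
class DynamicDispatch:
    def __init__(self, bird_handle, dog_handle):
        self.bird = bird_handle
        self.dog = dog_handle

    async def __call__(self, request) -> str:
        body = await request.json()
        # Pick a downstream model per request, at runtime.
        handle = self.bird if body["kind"] == "bird" else self.dog
        ref = await handle.remote(body["image_url"])
        return await ref


app = DynamicDispatch.bind(BirdClassifier.bind(), DogClassifier.bind())
```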

  29. Deployment graph API enables flexible model composition
    (Diagram: the content-understanding pipeline (Downloader, Preprocessor, Image Detector, Dynamic Dispatch, Image Classifiers #1-#3, final JSON output) annotated with the chaining, ensemble, and dynamic selection patterns.)

  30. Ray Serve composition API
    → Write your models as ordinary classes
    → Flexibly compose models & logic w/ Python code
    → Run, test, and debug on your laptop
    → Deploy to production – configure and scale models independently

  31. Outline
    ● Ray Serve Overview
    ● Highlights in Ray 2.x
    ○ Model Composition API
    ○ Autoscaling
    ○ Production Hardening

  32. Autoscaling for ML models
    → Problem: ML models are compute-intensive → 💸💸💸
    + Not all models are always used
    + Hard to tune hardware utilization
    + Needs to work for multi-model
    → Solution: Advanced autoscaling for Serve 🧠 (sketched below)
    + Supports scale-to-zero
    + Uses request queue lengths, no profiling needed
    + Fully compatible with the model composition API
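A hedged sketch of these knobs on a single deployment; the parameter names exist in Ray Serve 2.x, while the values and the ImageClassifier class are illustrative:

```python
from ray import serve


@serve.deployment(
    autoscaling_config={
        "min_replicas": 0,  # scale-to-zero while the model is idle
        "max_replicas": 10,
        # Queue-length target: Serve adds replicas when the number of
        # ongoing requests per replica exceeds this value.
        "target_num_ongoing_requests_per_replica": 2,
        # Wait this long before tearing down idle replicas.
        "downscale_delay_s": 600,
    },
)
class ImageClassifier:
    def __call__(self, request) -> str:
        return "ok"
```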

  33. Model composition with autoscaling
    (Diagram: Downloader → Preprocessor → Image Detector → Dynamic Dispatch → Image Classifiers #1-#3 → Combine)

  35. Model composition with autoscaling
    (Same pipeline diagram, now under load.)
    Classifier #1 queue_size >> target_queue_size

  36. Model composition with autoscaling
    (Same pipeline diagram, with Image Classifier #1 fanned out to additional replicas.)
    Classifier #1 queue_size >> target_queue_size → add replicas

  37. Model composition with autoscaling
    (Same pipeline diagram; Image Classifier #1 keeps its extra replicas.)
    Classifier #3 idle for X min → remove replicas

  38. Ray Serve autoscaling
    → Easy to get started
    + Just set a few basic parameters
    + No need to profile or update your models
    → Supports scale-to-zero and integrates fully with model composition

  39. Production hardening
    → Online inference means solving operational problems:
    + Updates without downtime
    + Handling failures gracefully
    + Monitoring, observability, alerting

  40. Production Hardening: Kubernetes Operator
    RayService operator workflow: develop locally with serve run 💻, generate a config with serve build, then kubectl apply it; the operator creates, updates, and monitors the Serve app.
    ● Zero-downtime updates
    ● Health checking and recovery
    ● Integrates with k8s tooling

  41. Ray Summit 2022 Talks
    https://www.anyscale.com/ray-summit-2022

  43. Ray Serve Recap
    → Goal: Make it easy to put scalable ML in production
    + Production hardening, focused on Kubernetes
    + Great UX for flexible model composition
    + Improved efficiency and cost savings with advanced autoscaling 💰💰💰

  44. Thank you! Q & A
    [email protected]
    @2twitme
    Learn more and get in touch at rayserve.org
