
State of Ray Serve in 2.0

Anyscale
February 23, 2023

Ray Serve is a scalable model serving library for building online inference APIs. Serve is framework agnostic, so you can use a single toolkit to serve everything from deep learning models built with frameworks like PyTorch, Tensorflow, and Keras, to Scikit-Learn models, to arbitrary Python business logic.

Transcript

  1. Welcome to the European Ray
    meetup community talks
    February 22, 2023
    Hosted by Jules S. Damji, Ray Advocacy Team @ Anyscale
    Email: [email protected]
    Twitter: @2twitme

  2. Agenda: Virtual Meetup
    Talk 1: Announcements, Upcoming events, Overview of Ray Serve in
    2.x - Jules S. Damji, Anyscale
    Talk 2: Scaling to/from zero on demand with Serve Handle API - Miha
    Jenko, Outbrain
    Talk 3: Smart shortcuts bootstrapping a modern NLP project - David
    Berenstein, Argilla.io

  3. Ray Summit 2023 - CfP
    ● Open 1/3/2023 – 3/6/2023
    ● In-person Conference 9/18–20, 2023
    ○ Marriott Marquis, San Francisco
    ● CfP: https://bit.ly/cfp-ray-summit-2023

  4. NYC Ray Meetup - March 22

  5. Ray books …

  6. https://www.ray.io/community

  7. State of Ray Serve in 2.x

  8. Outline
    ● Ray Serve Overview
    ● Highlights in Ray 2.x
    ○ Model Composition API
    ○ Autoscaling
    ○ Production Hardening

  9. Outline
    ● Ray Serve Overview
    ● Highlights in Ray 2.x
    ○ Model Composition API
    ○ Autoscaling
    ○ Production Hardening


  10. Ray AI Runtime (AIR): Online Inference

  11. Ray Serve: TL;DR
    Flexible, scalable, efficient compute for online inference
    ● Scalable, low latency, efficient
    ● First-class support for multi-model inference
    ● Python-native: mix business logic & machine learning

  12. Working Example: Content understanding
    End-to-end inference service
    Input: image → Output:
    {"object": "hummingbird", "score": 0.97, "segment_pixels": [(200, 403), …]}

  13. Content understanding architecture
    End-to-end inference service flow:
    Downloader → Preprocessor → (image features) → Image Detector →
    (object bounding box) → Dynamic Dispatch → Image Classifier #1 / #2 / #3
    Output: {"object": "hummingbird", "score": 0.97, "segment_pixels": [(200, 403), …]}

  14. Requirements for online inference
    → Minimize latency, maximize throughput 🔥
    → Fine-grained resources: fractional CPUs & GPUs
    → Programmatic API for development and testing
    → Rock-solid story for production 🧱

  15. Basic solution: multi-model monolith
    Single container
    ❌ Unable to scale models independently!
    🙁 High latency, low throughput, and costly

  16. Complex Solution: Models as Containers
    ✅ Scale models independently, use different resources
    ❌ Requires deep Kubernetes knowledge to build an app 🤷

  17. Ray Serve is built for multi-model inference
    Write a single Python program
    Use your favorite tools & libraries
    Scale across CPUs and GPUs

  18. Content Understanding as a Serve App
    Downloader → Preprocessor → Image Detector → Dynamic Dispatch →
    Image Classifier #1 / #2 / #3 → Combine
    Deployments spread across pods with per-deployment resources
    (e.g. GPU: 1, CPU: 4; GPU: 0, CPU: 2; GPU: 0.3, CPU: 1; GPU: 0, CPU: 1)

  19. Content Understanding as a Serve App
    Downloader → Preprocessor → Image Detector → Dynamic Dispatch →
    Image Classifier #1 / #2 / #3 → Combine
    Deployments spread across pods with per-deployment resources
    (e.g. GPU: 1, CPU: 4; GPU: 0, CPU: 2; GPU: 0.3, CPU: 1)
    ● Single Python program
    ● Developed and tested locally
    ● Deployed & updated as a single app

  20. Ray Serve in 2.x
    → Goal: Make it easy to put scalable ML in production
    + Great UX for flexible model composition
    + Improved efficiency and cost savings with advanced autoscaling 💸💸💸
    + Production hardening, focused on Kubernetes

  21. Outline
    ● Ray Serve Overview
    ● Highlights in Ray 2.x
    ○ Model Composition API
    ○ Autoscaling
    ○ Production Hardening


  22. Model composition requirements
    → Flexible to satisfy diverse use cases
    + Different models, frameworks, and business logic
    → Scalable and efficient when running in production 💸
    → Ability to develop, test, and debug locally 💻

  23. Solution: Model composition API
    → First-class API to build graphs of Serve deployments
    → Full flexibility of Ray Serve
    + Author, configure, scale each model independently
    → Orchestrate computation using regular Python code

  24. Model composition pattern: Chaining
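    The slide's code sample did not survive the transcript, so here is a minimal, framework-free Python sketch of the chaining pattern; the function names and stand-in bodies are invented for illustration. In Ray Serve, each stage would be a @serve.deployment called through a deployment handle.

    ```python
    # Framework-free sketch of chaining: each stage's output feeds the next.
    # The stage names and return values are illustrative, not from the talk.

    def preprocess(image_bytes: bytes) -> list[float]:
        # Stand-in for resizing/normalizing an image into features.
        return [b / 255 for b in image_bytes]

    def detect(features: list[float]) -> dict:
        # Stand-in for an object-detection model producing a bounding box.
        return {"bbox": (0, 0, 10, 10), "features": features}

    def classify(detection: dict) -> dict:
        # Stand-in for a classifier running on the detected region.
        return {"object": "hummingbird", "score": 0.97}

    def pipeline(image_bytes: bytes) -> dict:
        # Chaining: stage N's output is stage N+1's input.
        return classify(detect(preprocess(image_bytes)))

    print(pipeline(b"\x01\x02"))  # {'object': 'hummingbird', 'score': 0.97}
    ```

    With Serve, each stage scales independently while the composition stays plain Python.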

  25. Model composition pattern: Ensemble
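    Again the slide's code is missing from the transcript; a framework-free sketch of the ensemble pattern follows, with invented model functions and a simple averaging combiner. In Ray Serve, each model would be its own deployment and the fan-out would go through handles.

    ```python
    # Framework-free sketch of an ensemble: send the same input to several
    # models and combine their outputs (here, by averaging). The models are
    # trivial stand-ins for illustration.

    def model_a(x: float) -> float:
        return x * 0.9

    def model_b(x: float) -> float:
        return x * 1.1

    def model_c(x: float) -> float:
        return x * 1.0

    def ensemble(x: float) -> float:
        scores = [m(x) for m in (model_a, model_b, model_c)]
        return sum(scores) / len(scores)

    print(ensemble(2.0))  # ≈ 2.0 (average of ~1.8, ~2.2, 2.0)
    ```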

  26. Model composition pattern: Dynamic selection
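    For the third pattern, here is a framework-free sketch of dynamic selection: lightweight logic inspects the request and routes it to one of several models. The categories and classifiers are hypothetical; in Ray Serve the dispatcher would itself be a deployment holding handles to the model deployments.

    ```python
    # Framework-free sketch of dynamic selection: plain Python decides
    # which model sees each request. Names are illustrative only.

    def bird_classifier(img: dict) -> str:
        return "hummingbird"

    def dog_classifier(img: dict) -> str:
        return "corgi"

    CLASSIFIERS = {"bird": bird_classifier, "dog": dog_classifier}

    def dispatch(img: dict) -> str:
        # Business logic selects the model based on request metadata.
        model = CLASSIFIERS[img["category"]]
        return model(img)

    print(dispatch({"category": "bird"}))  # hummingbird
    ```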

  27. Deployment graph API enables flexible model composition
    Downloader → Preprocessor → (image features) → Image Detector →
    (object bounding boxes) → Dynamic Dispatch → Image Classifier #1 / #2 / #3
    Output: {"object": "hummingbird", "score": 0.97, "segment_pixels": [(200, 403), …]}
    Patterns in play: chaining, ensemble, dynamic selection

  28. Ray Serve composition API
    → Write your models as ordinary classes
    → Flexibly compose models & logic w/ Python code
    → Run, test, and debug on your laptop
    → Deploy to production – configure and scale models
    independently

  29. Outline
    ● Ray Serve Overview
    ● Highlights in Ray 2.x
    ○ Model Composition API
    ○ Autoscaling
    ○ Production Hardening

  30. Autoscaling for ML models
    → Problem: ML models are compute intensive → 💸💸💸
    + Not all models are always used
    + Hard to tune hardware utilization
    + Needs to work for multi-model
    → Solution: Advanced autoscaling for Serve 🧠
    + Supports scale-to-zero
    + Uses request queue lengths, no profiling
    + Fully compatible with model composition API
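    The queue-length idea above can be sketched in a few lines. This is a toy illustration of the decision rule, not Serve's actual implementation; the function name and defaults are invented.

    ```python
    # Toy sketch of queue-length-based autoscaling: compare the observed
    # request queue to a per-replica target and clamp the result between
    # min_replicas and max_replicas. min_replicas=0 allows scale-to-zero.
    import math

    def desired_replicas(total_queued: int, target_per_replica: int,
                         min_replicas: int = 0, max_replicas: int = 10) -> int:
        # Replicas needed so each one stays near the target queue length.
        wanted = math.ceil(total_queued / target_per_replica)
        return min(max(wanted, min_replicas), max_replicas)

    print(desired_replicas(40, 5))    # 8: queue far above target, add replicas
    print(desired_replicas(0, 5))     # 0: idle deployment scales to zero
    print(desired_replicas(1000, 5))  # 10: capped at max_replicas
    ```

    Because the signal is queue length rather than profiled throughput, no model-specific tuning or profiling is required.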

  31. Model composition with autoscaling
    Downloader → Preprocessor → Image Detector → Dynamic Dispatch →
    Image Classifier #1 / #2 / #3 → Combine

  33. Model composition with autoscaling
    Downloader → Preprocessor → Image Detector → Dynamic Dispatch →
    Image Classifier #1 / #2 / #3 → Combine
    Classifier #1: queue_size >> target_queue_size

  34. Model composition with autoscaling
    Downloader → Preprocessor → Image Detector → Dynamic Dispatch →
    Image Classifier #1 (now 4 replicas) / #2 / #3 → Combine
    Classifier #1: queue_size >> target_queue_size → add replicas

  35. Model composition with autoscaling
    Downloader → Preprocessor → Image Detector → Dynamic Dispatch →
    Image Classifier #1 (4 replicas) / #2 / #3 → Combine
    Classifier #3: idle for X min → remove replicas

  36. Ray Serve autoscaling
    → Easy to get started
    + Just set a few basic parameters
    + No need to profile or update your models
    → Supports scale-to-zero and integrates fully with model
    composition
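    The "few basic parameters" are set per deployment. In Ray Serve 2.x these live in a deployment's autoscaling_config; the field names below follow the 2.x docs, but verify them against your Ray version before relying on them.

    ```python
    # Sketch of per-deployment autoscaling parameters. Field names follow
    # Ray Serve 2.x's autoscaling_config (check the docs for your version);
    # min_replicas=0 enables scale-to-zero.
    autoscaling_config = {
        "min_replicas": 0,    # allow scale-to-zero when idle
        "max_replicas": 5,
        "target_num_ongoing_requests_per_replica": 10,
    }

    # With Ray installed, this would be attached to a deployment, e.g.:
    # @serve.deployment(autoscaling_config=autoscaling_config)
    # class ImageClassifier: ...
    print(autoscaling_config["min_replicas"])  # 0
    ```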

  37. Production hardening
    → Online inference means solving operational problems:
    + Updates without downtime
    + Handling failures gracefully
    + Monitoring, observability, alerting

  38. Production Hardening: Kubernetes Operator
    💻 Workflow: develop locally with serve run → generate a config with
    serve build → kubectl apply → the RayService operator handles create,
    update, and monitor
    ● Zero-downtime updates
    ● Health checking and recovery
    ● Integrate with k8s tooling

  39. Ray Summit 2022 Talks
    https://www.anyscale.com/ray-summit-2022

  41. Ray Serve Recap
    → Goal: Make it easy to put scalable ML in production
    + Great UX for flexible model composition
    + Improved efficiency and cost savings with advanced autoscaling 💰💰💰
    + Production hardening, focus on Kubernetes

  42. Thank you! Q & A
    [email protected]
    @2twitme
    Learn more and get in touch at rayserve.org
