In this talk, we discuss the implementation of a REST API controller for on-demand scaling of Serve replicas. The controller updates Serve deployment configurations in the same way as the Serve REST API, which has been available since Ray 2.0.0. We also cover the use case and motivation for this approach, as well as our experience with the Ray developer community.
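The core operation of such a controller can be sketched as a pure function over the Serve config payload. This is a minimal sketch, not the talk's actual implementation: the application and deployment names are illustrative, and the HTTP submission of the updated config back to the Serve REST API is omitted.

```python
import copy

def scale_deployment(serve_config: dict, app_name: str,
                     deployment_name: str, num_replicas: int) -> dict:
    """Return a copy of the Serve config with num_replicas updated
    for one deployment of one application."""
    config = copy.deepcopy(serve_config)
    for app in config.get("applications", []):
        if app["name"] != app_name:
            continue
        for deployment in app.setdefault("deployments", []):
            if deployment["name"] == deployment_name:
                deployment["num_replicas"] = num_replicas
                return config
        # Deployment not listed yet: add an override entry for it.
        app["deployments"].append(
            {"name": deployment_name, "num_replicas": num_replicas})
    return config

# Illustrative config in the shape accepted by the Serve REST API.
config = {
    "applications": [
        {"name": "classifier", "route_prefix": "/classify",
         "import_path": "models.classifier:app",
         "deployments": [{"name": "Classifier", "num_replicas": 0}]}
    ]
}
scaled = scale_deployment(config, "classifier", "Classifier", 4)
print(scaled["applications"][0]["deployments"][0]["num_replicas"])  # 4
```

The updated config would then be submitted back to the cluster via the Serve REST API (an HTTP PUT against the dashboard agent); scaling down is the same call with `num_replicas` set back to 0.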
Scaling Ray Serve 0:N
● Currently: Senior Data Scientist @ Outbrain
● Ex: Machine Learning Engineer @ Heureka Group
● Use case and implementation of a custom Serve scaling controller
● Ray FinOps
● Experience with the Ray Developer Community
We used Ray’s ecosystem of libraries and tools as the foundation for:
● Model inference
● Model training
● Cloud compute provisioning
E-commerce use cases:
● Product classification
● Product matching
● Migrate to the cloud, but keep inference costs as low as possible.
● Simplify cloud GPU provisioning and management.
● Use the same code for batch as for on-demand inference.
● Reduce the tooling surface.
Use case #1
● Serving on weekdays from 7 am to 5 pm (~50 hours / week). QoS guarantee.
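A time-based schedule like this can drive the controller: scale to N replicas inside the serving window, to 0 outside it. A minimal sketch of the window check, assuming weekday serving as implied by the ~50 h / week figure:

```python
from datetime import datetime

SCALE_UP_HOUR = 7     # 7 am
SCALE_DOWN_HOUR = 17  # 5 pm

def in_serving_window(now: datetime) -> bool:
    """True if replicas should be scaled up (weekday, 7 am to 5 pm)."""
    return now.weekday() < 5 and SCALE_UP_HOUR <= now.hour < SCALE_DOWN_HOUR

print(in_serving_window(datetime(2023, 6, 5, 9, 0)))   # True  (Monday, 9 am)
print(in_serving_window(datetime(2023, 6, 10, 9, 0)))  # False (Saturday)
```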
Use case #2
Batch inference jobs
● Batch jobs had to be fully processed before the start of the work week.
● Batch size was estimated at on the order of 100k items per week.
● CPU instances were estimated to be too slow to fit into our constraints.
● GPU instance lease time had to be limited.
Resource usage savings
● Resource usage reduction factor: 0.30
Batch inference jobs
● Sustained peak utilization for 30m / week
● Resource usage (GPU) reduction factor: 0.003
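Both reduction factors follow from the ratio of active hours to a full 168-hour week, assuming that is how the factors above were derived:

```python
HOURS_PER_WEEK = 7 * 24  # 168

# Use case #1: serving ~50 h / week instead of running 24/7.
serving_factor = 50 / HOURS_PER_WEEK
print(round(serving_factor, 2))  # 0.3

# Use case #2: ~30 min of sustained GPU utilization per week.
batch_factor = 0.5 / HOURS_PER_WEEK
print(round(batch_factor, 3))  # 0.003
```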
Serve Ingress as a router
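One way to arrange this is to keep a small, always-on Serve ingress application routing requests, while the external controller flips `num_replicas` on the model deployments between 0 and N. A hedged sketch in the Serve config format accepted by the REST API; names and import paths are illustrative, and whether a literal `num_replicas: 0` is accepted depends on the Ray version (the talk's custom controller exists precisely to cover the 0:N range):

```yaml
applications:
  - name: router
    route_prefix: /
    import_path: ingress:app          # ingress deployment acting as a router
    deployments:
      - name: Router
        num_replicas: 1               # always on; cheap CPU-only replica
  - name: classifier
    route_prefix: /classify
    import_path: models.classifier:app
    deployments:
      - name: Classifier
        num_replicas: 0               # toggled 0:N by the custom controller
```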
Ray Systems Complexity and FinOps
The recommended way of running Ray on Kubernetes requires using multiple layers of resource management:
● RayCluster autoscaler - for managing Serve actors.
● KubeRay - for managing k8s Pods.
● EKS, GKE, AKS - for managing compute instances.
Introducing this level of complexity must be justified:
● Costly hardware.
● Deep inference graphs.
● Multiplicity of served models.
Experience with the Ray Community
1. Hyperparameter search with Ray Tune
2. Our first RayCluster setup in Kubernetes
3. Collaboration with the Serve team (Serve autoscaler)
4. Fixing nasty memory leaks
5. Ray 2.x.x migration
CREDITS: This presentation template was created by
Slidesgo, including icons by Flaticon, and
infographics & images by Freepik
Do you have any questions?
Find Miha Jenko on Ray Slack or LinkedIn.
Email: [email protected]
Please keep this slide for attribution