Scaling to/from zero on demand with Serve Handle API

Anyscale
February 23, 2023

In this talk, we will discuss the implementation of a REST API controller for on-demand scaling of Serve replicas. The controller updates Serve deployment configurations, much as the Serve REST API (available since Ray 2.0.0) does. We will also cover the use case and motivations for this approach, as well as our experience with the Ray developer community.
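
As a rough illustration of the mechanism (a minimal sketch, not the talk's actual controller), the code below patches a deployment's replica count through the Serve REST API exposed by the Ray dashboard. It assumes the Ray 2.x single-application endpoints (`GET`/`PUT /api/serve/deployments/`), a dashboard at `localhost:8265`, and an illustrative deployment name `product-classifier`; whether a plain `num_replicas: 0` is accepted may depend on the Ray version.

```python
# Minimal sketch of a scaling controller (not the talk's implementation).
# Assumes Ray 2.x's single-application Serve REST API on the dashboard.
import requests

DASHBOARD_URL = "http://localhost:8265"  # dashboard address (assumption)

def scale_deployment(name: str, num_replicas: int) -> None:
    """Fetch the live Serve config, patch one deployment's replica
    count, and PUT the config back to trigger an update."""
    resp = requests.get(f"{DASHBOARD_URL}/api/serve/deployments/")
    resp.raise_for_status()
    config = resp.json()  # current ServeApplicationSchema as JSON
    for deployment in config.get("deployments", []):
        if deployment["name"] == name:
            deployment["num_replicas"] = num_replicas
    requests.put(
        f"{DASHBOARD_URL}/api/serve/deployments/", json=config
    ).raise_for_status()

# Scale to zero to release GPU capacity, back up to restore serving.
scale_deployment("product-classifier", 0)
scale_deployment("product-classifier", 2)
```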

Transcript

  1. Speaker: Miha Jenko
    • Currently: Senior Data Scientist @ Outbrain
    • Previously: Machine Learning Engineer @ Heureka Group
  2. Talk
    • Use case and implementation of a custom Serve scaling controller
    • Ray FinOps
    • Experience with the Ray developer community
  3. Context
    We used Ray’s ecosystem of libraries and tools as the foundation for:
    • Model inference
    • Model training
    • Cloud compute provisioning
  4. Project overview
    E-commerce use cases:
    • Product classification
    • Product matching
    Goals:
    • Migrate to the cloud, but keep inference costs as low as possible.
    • Simplify cloud GPU provisioning and management.
    • Use the same code for batch as for on-demand inference (see the sketch after this slide).
    • Reduce the tooling surface.
    • …
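
To illustrate the "same code for batch and on-demand inference" goal, here is a minimal sketch (illustrative names, not the project's code): one Serve deployment handles on-demand HTTP traffic through `__call__`, while batch jobs drive the same `predict` method through a ServeHandle, using the Ray 2.x `serve.run(...)` handle API.

```python
# Sketch: one deployment serving both paths (illustrative names).
import ray
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=1)
class ProductClassifier:
    def predict(self, item: dict) -> str:
        # real model inference would run here
        return "some-category"

    async def __call__(self, request: Request) -> str:
        # on-demand path: HTTP requests routed by Serve
        return self.predict(await request.json())

# serve.run() returns a handle to the deployment (Ray 2.x API)
handle = serve.run(ProductClassifier.bind())

# batch path: drive the same predict() through the handle, bypassing HTTP
refs = [handle.predict.remote({"title": f"product {i}"}) for i in range(3)]
print(ray.get(refs))
```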
  5. Use case #1: On-demand inference
    Serving requirements:
    • Serving from 7 am to 5 pm (~50 hours / week); a scheduling sketch follows this slide.
    • QoS guarantee.
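
A serving window like this can be enforced with a simple loop around the controller. The sketch below is one hypothetical way to do it, reusing the `scale_deployment()` helper from the earlier sketch (imported here from a hypothetical `controller` module); the weekday window and replica count are placeholders chosen to match the ~50 hours / week figure.

```python
# Sketch: enforce the 7 am - 5 pm serving window by scaling to/from zero.
import datetime
import time

from controller import scale_deployment  # helper from the earlier sketch

SERVING_HOURS = range(7, 17)  # 7:00-16:59 local time
DAY_REPLICAS = 2              # placeholder capacity for the QoS target

last_target = None
while True:
    now = datetime.datetime.now()
    # scale up during weekday working hours, down to zero otherwise
    in_window = now.weekday() < 5 and now.hour in SERVING_HOURS
    target = DAY_REPLICAS if in_window else 0
    if target != last_target:  # only act on transitions
        scale_deployment("product-classifier", target)
        last_target = target
    time.sleep(60)
```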
  6. Use case #2: Batch inference jobs
    Serving requirements:
    • Batch jobs had to be fully processed before the start of the work week.
    • Batch size was estimated at on the order of 100k items per week.
    • CPU instances were estimated to be too slow to meet our constraints.
    • GPU instance lease time had to be limited.
  7. Resource usage savings
    On-demand inference:
    • Resource usage reduction factor: 0.30 (consistent with ~50 serving hours out of a 168-hour week).
    Batch inference jobs:
    • Sustained peak utilization for ~30 minutes / week.
    • Resource usage (GPU) reduction factor: 0.003 (roughly 30 GPU-minutes out of the 10,080 minutes in a week).
  8. Ray Systems Complexity and FinOps
    The recommended way of running Ray on Kubernetes requires multiple layers of resource managers:
    • RayCluster autoscaler - for managing Serve actors.
    • KubeRay - for managing Kubernetes Pods.
    • EKS, GKE, AKS - for managing compute instances.
    Introducing this level of complexity must be justified by:
    • Costly hardware.
    • Deep inference graphs.
    • A multiplicity of served models.
  9. Experience with the Ray Community
    Chapters:
    1. Hyperparameter search with Ray Tune
    2. Our first RayCluster setup in Kubernetes
    3. Collaboration with the Serve team (Serve autoscaler)
    4. Fixing nasty memory leaks
    5. Ray 2.x.x migration
  10. Credits: This presentation template was created by Slidesgo, including icons by Flaticon, and infographics & images by Freepik.
    Thanks! Do you have any questions?
    Find Miha Jenko on Ray Slack or LinkedIn. Email: [email protected]
    Please keep this slide for attribution.