In this talk, we discuss the implementation of a REST API controller for on-demand scaling of Serve replicas. The controller updates Serve deployment configurations in the same way as the Serve REST API, which has been available since Ray 2.0.0. We also cover the use case and motivation for this approach, as well as our experience with the Ray developer community.
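The core operation of such a controller can be sketched as a pure function over the Serve config payload. This is a minimal sketch, not the talk's actual implementation: the application and deployment names are illustrative, and the HTTP submission of the updated config back to the Serve REST API is omitted.

```python
import copy

def scale_deployment(serve_config: dict, app_name: str,
                     deployment_name: str, num_replicas: int) -> dict:
    """Return a copy of the Serve config with num_replicas updated
    for one deployment of one application."""
    config = copy.deepcopy(serve_config)
    for app in config.get("applications", []):
        if app["name"] != app_name:
            continue
        for deployment in app.setdefault("deployments", []):
            if deployment["name"] == deployment_name:
                deployment["num_replicas"] = num_replicas
                return config
        # Deployment not listed yet: add an override entry for it.
        app["deployments"].append(
            {"name": deployment_name, "num_replicas": num_replicas})
    return config

# Illustrative config in the shape accepted by the Serve REST API.
config = {
    "applications": [
        {"name": "classifier", "route_prefix": "/classify",
         "import_path": "models.classifier:app",
         "deployments": [{"name": "Classifier", "num_replicas": 0}]}
    ]
}
scaled = scale_deployment(config, "classifier", "Classifier", 4)
print(scaled["applications"][0]["deployments"][0]["num_replicas"])  # 4
```

The updated config would then be submitted back to the cluster via the Serve REST API (an HTTP PUT against the dashboard agent); scaling down is the same call with `num_replicas` set back to 0.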
Scaling Ray Serve 0:N
● Currently: Senior Data Scientist @ Outbrain
● Ex: Machine Learning Engineer @ Heureka Group
● Use case and implementation of a custom Serve scaling controller
● Ray FinOps
● Experience with the Ray Developer Community
We used Ray’s ecosystem of libraries and tools as the foundation for:
● Model inference
● Model training
● Cloud compute provisioning
E-commerce use cases:
● Product classification
● Product matching
● Migrate to the cloud, but keep inference costs as low as possible.
● Simplify cloud GPU provisioning and management.
● Use the same code for batch as for on-demand inference.
● Reduce the tooling surface.
Use case #1
● Serving on weekdays from 7 am to 5 pm (~50 hours / week). QoS guarantee.
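A time-based schedule like this can drive the controller: scale to N replicas inside the serving window, to 0 outside it. A minimal sketch of the window check, assuming weekday serving as implied by the ~50 h / week figure:

```python
from datetime import datetime

SCALE_UP_HOUR = 7     # 7 am
SCALE_DOWN_HOUR = 17  # 5 pm

def in_serving_window(now: datetime) -> bool:
    """True if replicas should be scaled up (weekday, 7 am to 5 pm)."""
    return now.weekday() < 5 and SCALE_UP_HOUR <= now.hour < SCALE_DOWN_HOUR

print(in_serving_window(datetime(2023, 6, 5, 9, 0)))   # True  (Monday, 9 am)
print(in_serving_window(datetime(2023, 6, 10, 9, 0)))  # False (Saturday)
```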
Use case #2
Batch inference jobs
● Batch jobs had to be fully processed before the start of the work week.
● Batch size was estimated at on the order of 100k items per week.
● CPU instances were estimated to be too slow to fit into our constraints.
● GPU instance lease time had to be limited.
Resource usage savings
● Resource usage reduction factor: 0.30
Batch inference jobs
● Sustained peak utilization for 30m / week
● Resource usage (GPU) reduction factor: 0.003
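Both reduction factors follow from the ratio of active hours to a full 168-hour week, assuming that is how the factors above were derived:

```python
HOURS_PER_WEEK = 7 * 24  # 168

# Use case #1: serving ~50 h / week instead of running 24/7.
serving_factor = 50 / HOURS_PER_WEEK
print(round(serving_factor, 2))  # 0.3

# Use case #2: ~30 min of sustained GPU utilization per week.
batch_factor = 0.5 / HOURS_PER_WEEK
print(round(batch_factor, 3))  # 0.003
```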
Serve Ingress as a router
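One way to arrange this is to keep a small, always-on Serve ingress application routing requests, while the external controller flips `num_replicas` on the model deployments between 0 and N. A hedged sketch in the Serve config format accepted by the REST API; names and import paths are illustrative, and whether a literal `num_replicas: 0` is accepted depends on the Ray version (the talk's custom controller exists precisely to cover the 0:N range):

```yaml
applications:
  - name: router
    route_prefix: /
    import_path: ingress:app          # ingress deployment acting as a router
    deployments:
      - name: Router
        num_replicas: 1               # always on; cheap CPU-only replica
  - name: classifier
    route_prefix: /classify
    import_path: models.classifier:app
    deployments:
      - name: Classifier
        num_replicas: 0               # toggled 0:N by the custom controller
```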
Ray Systems Complexity and FinOps
The recommended way of running Ray on Kubernetes requires using multiple layers of resource management:
● RayCluster autoscaler - for managing Serve actors.
● KubeRay - for managing k8s Pods.
● EKS, GKE, AKS - for managing compute instances.
Introducing this level of complexity must be justified:
● Costly hardware.
● Deep inference graphs.
● Multiplicity of served models.
Experience with the Ray Community
1. Hyperparameter search with Ray Tune
2. Our first RayCluster setup in Kubernetes
3. Collaboration with the Serve team (Serve autoscaler)
4. Fixing nasty memory leaks
5. Ray 2.x.x migration
CREDITS: This presentation template was created by
Slidesgo, including icons by Flaticon, and
infographics & images by Freepik
Do you have any questions?
Find Miha Jenko on Ray Slack or LinkedIn.
Email: [email protected]
Please keep this slide for attribution