Slide 1

Scaling Ray Serve 0:N
Miha Jenko

Slide 2

Speaker
Miha Jenko
● Currently: Senior Data Scientist @ Outbrain
● Ex: Machine Learning Engineer @ Heureka Group

Slide 3

Talk
● Use case and implementation of a custom Serve scaling controller
● Ray FinOps
● Experience with the Ray Developer Community

Slide 4

Context
We used Ray’s ecosystem of libraries and tools as the foundation for:
● Model inference
● Model training
● Cloud compute provisioning

Slide 5

Project overview
E-commerce use cases:
● Product classification
● Product matching
Goals:
● Migrate to the cloud while keeping inference costs as low as possible.
● Simplify cloud GPU provisioning and management.
● Use the same code for batch and on-demand inference.
● Reduce the tooling surface.
● …

Slide 6

Implementation: Model Deployment

Slide 7

Use case #1: On-demand inference
Serving requirements:
● Serving from 7 am to 5 pm (~50 hours / week).
● QoS guarantee.
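
A minimal Ray Serve sketch for this use case (not the code from the talk; the model stub, replica counts, and GPU sizing are illustrative assumptions):

```python
from ray import serve
from starlette.requests import Request


@serve.deployment(
    ray_actor_options={"num_gpus": 1},                          # one GPU per replica (assumed sizing)
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},  # illustrative bounds
)
class ProductClassifier:
    def __init__(self):
        # Stand-in for loading real model weights.
        self._predict = lambda text: "some-category"

    def classify(self, text: str) -> str:
        return self._predict(text)

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return {"label": self.classify(payload["text"])}


app = ProductClassifier.bind()
# serve.run(app, name="classifier")  # deploy onto a running Ray cluster
```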

Slide 8

Use case #2: Batch inference jobs
Serving requirements:
● Batch jobs had to be fully processed before the start of the work week.
● Batch size was estimated at on the order of 100k items per week.
● CPU instances were estimated to be too slow to fit our constraints.
● GPU instance lease time had to be limited.
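
A sketch of reusing the same deployment for the weekly batch through its handle, in line with the goal of sharing code between batch and on-demand inference (deployment name, app name, and the toy data are assumptions):

```python
from ray import serve

# Handle to the running deployment (Ray >= 2.7 handle API).
handle = serve.get_deployment_handle("ProductClassifier", app_name="classifier")

# Stand-in for the real weekly batch (~100k items in the talk's estimate).
batch = [f"product {i}" for i in range(1_000)]

# Fan requests out through the handle, then block on the responses.
responses = [handle.classify.remote(text) for text in batch]
results = [r.result() for r in responses]
```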

Slide 9

Implementation: Scaling Controller
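
The controller's actual code is not reproduced in the deck; the sketch below only illustrates the schedule-driven idea (deploy during business hours, tear down after), assuming the `app` graph and app name from the earlier deployment sketch:

```python
import datetime
import time

from ray import serve

from product_app import app  # hypothetical module holding the deployment graph


def in_service_window(now: datetime.datetime) -> bool:
    # On-demand window: weekdays, 7 am to 5 pm (~50 h / week).
    return now.weekday() < 5 and 7 <= now.hour < 17


deployed = False
while True:
    if in_service_window(datetime.datetime.now()):
        if not deployed:
            serve.run(app, name="classifier")  # bring replicas (and GPU nodes) up
            deployed = True
    elif deployed:
        serve.delete("classifier")  # release GPU nodes outside the window
        deployed = False
    time.sleep(300)  # re-evaluate every 5 minutes
```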

Slide 10

Resource usage savings
On-demand inference:
● Resource usage reduction factor: 0.30
Batch inference jobs:
● Sustained peak utilization for 30 min / week
● Resource usage (GPU) reduction factor: 0.003
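
For context, both factors match wall-clock coverage against an always-on baseline: 50 h out of a 168 h week ≈ 0.30, and ~0.5 h out of 168 h ≈ 0.003.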

Slide 11

Implementation: Serve Ingress as a router
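
A sketch of the ingress-as-router pattern, fanning requests out to the two model deployments (ProductMatcher and the routing key are assumptions; ProductClassifier is the deployment from the earlier sketch):

```python
from ray import serve
from ray.serve.handle import DeploymentHandle
from starlette.requests import Request


@serve.deployment
class ProductMatcher:
    def match(self, text: str) -> str:
        return "some-matching-product"  # stand-in for the real matcher


@serve.deployment
class Ingress:
    def __init__(self, classifier: DeploymentHandle, matcher: DeploymentHandle):
        self.classifier = classifier
        self.matcher = matcher

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        if payload.get("task") == "classify":
            return {"result": await self.classifier.classify.remote(payload["text"])}
        return {"result": await self.matcher.match.remote(payload["text"])}


# ProductClassifier as defined in the earlier deployment sketch.
router_app = Ingress.bind(ProductClassifier.bind(), ProductMatcher.bind())
```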

Slide 12

Ray Systems Complexity and FinOps
The recommended way of running Ray on Kubernetes involves multiple layers of resource managers:
● Ray autoscaler - for managing Serve actors.
● KubeRay - for managing Kubernetes Pods.
● EKS, GKE, AKS - for managing compute instances.
Introducing this level of complexity must be justified by:
● Costly hardware.
● Deep inference graphs.
● A multiplicity of served models.

Slide 13

Experience with the Ray Community
Chapters:
1. Hyperparameter search with Ray Tune
2. Our first RayCluster setup in Kubernetes
3. Collaboration with the Serve team (Serve autoscaler)
4. Fixing nasty memory leaks
5. Ray 2.x.x migration

Slide 14

Thanks!
Do you have any questions?
Find Miha Jenko on Ray Slack or LinkedIn.
Email: mjenko@t-2.net

CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon and infographics & images by Freepik. Please keep this slide for attribution.