In this talk, we discuss the implementation of a REST API controller for on-demand scaling of Serve replicas. The controller updates Serve deployment configurations in the same way as the Serve REST API, which has been available since Ray 2.0.0. We also cover the use case and motivation for this approach, as well as our experience with the Ray developer community.
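The core operation of such a controller can be sketched as a pure function over the Serve config payload. This is a minimal sketch, not the talk's actual implementation: the application and deployment names are illustrative, and the HTTP submission of the updated config back to the Serve REST API is omitted.

```python
import copy

def scale_deployment(serve_config: dict, app_name: str,
                     deployment_name: str, num_replicas: int) -> dict:
    """Return a copy of the Serve config with num_replicas updated
    for one deployment of one application."""
    config = copy.deepcopy(serve_config)
    for app in config.get("applications", []):
        if app["name"] != app_name:
            continue
        for deployment in app.setdefault("deployments", []):
            if deployment["name"] == deployment_name:
                deployment["num_replicas"] = num_replicas
                return config
        # Deployment not listed yet: add an override entry for it.
        app["deployments"].append(
            {"name": deployment_name, "num_replicas": num_replicas})
    return config

# Illustrative config in the shape accepted by the Serve REST API.
config = {
    "applications": [
        {"name": "classifier", "route_prefix": "/classify",
         "import_path": "models.classifier:app",
         "deployments": [{"name": "Classifier", "num_replicas": 0}]}
    ]
}
scaled = scale_deployment(config, "classifier", "Classifier", 4)
print(scaled["applications"][0]["deployments"][0]["num_replicas"])  # 4
```

The updated config would then be submitted back to the cluster via the Serve REST API (an HTTP PUT against the dashboard agent); scaling down is the same call with `num_replicas` set back to 0.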
Scaling Ray Serve 0:N
● Currently: Senior Data Scientist @ Outbrain
● Ex: Machine Learning Engineer @ Heureka Group
● Use case and implementation of a custom Serve scaling controller
● Ray FinOps
● Experience with the Ray Developer Community
We used Ray’s ecosystem of libraries and tools as the foundation for:
● Model inference
● Model training
● Cloud compute provisioning
E-commerce use cases:
● Product classification
● Product matching
● Migrate to the cloud, but keep inference costs as low as possible.
● Simplify cloud GPU provisioning and management.
● Use the same code for batch as for on-demand inference.
● Reduce the tooling surface.
Use case #1
● Serving on weekdays from 7 am to 5 pm (~50 hours / week). QoS guarantee.
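A time-based schedule like this can drive the controller: scale to N replicas inside the serving window, to 0 outside it. A minimal sketch of the window check, assuming weekday serving as implied by the ~50 h / week figure:

```python
from datetime import datetime

SCALE_UP_HOUR = 7     # 7 am
SCALE_DOWN_HOUR = 17  # 5 pm

def in_serving_window(now: datetime) -> bool:
    """True if replicas should be scaled up (weekday, 7 am to 5 pm)."""
    return now.weekday() < 5 and SCALE_UP_HOUR <= now.hour < SCALE_DOWN_HOUR

print(in_serving_window(datetime(2023, 6, 5, 9, 0)))   # True  (Monday, 9 am)
print(in_serving_window(datetime(2023, 6, 10, 9, 0)))  # False (Saturday)
```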
Use case #2
Batch inference jobs
● Batch jobs had to be fully processed before the start of the work week.
● Batch size was estimated at on the order of 100k items per week.
● CPU instances were estimated to be too slow to fit into our constraints.
● GPU instance lease time had to be limited.
Resource usage savings
● Resource usage reduction factor: 0.30
Batch inference jobs
● Sustained peak utilization for 30m / week
● Resource usage (GPU) reduction factor: 0.003
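Both reduction factors follow from the ratio of active hours to a full 168-hour week, assuming that is how the factors above were derived:

```python
HOURS_PER_WEEK = 7 * 24  # 168

# Use case #1: serving ~50 h / week instead of running 24/7.
serving_factor = 50 / HOURS_PER_WEEK
print(round(serving_factor, 2))  # 0.3

# Use case #2: ~30 min of sustained GPU utilization per week.
batch_factor = 0.5 / HOURS_PER_WEEK
print(round(batch_factor, 3))  # 0.003
```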
Serve Ingress as a router
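One way to arrange this is to keep a small, always-on Serve ingress application routing requests, while the external controller flips `num_replicas` on the model deployments between 0 and N. A hedged sketch in the Serve config format accepted by the REST API; names and import paths are illustrative, and whether a literal `num_replicas: 0` is accepted depends on the Ray version (the talk's custom controller exists precisely to cover the 0:N range):

```yaml
applications:
  - name: router
    route_prefix: /
    import_path: ingress:app          # ingress deployment acting as a router
    deployments:
      - name: Router
        num_replicas: 1               # always on; cheap CPU-only replica
  - name: classifier
    route_prefix: /classify
    import_path: models.classifier:app
    deployments:
      - name: Classifier
        num_replicas: 0               # toggled 0:N by the custom controller
```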
Ray Systems Complexity and FinOps
The recommended way of running Ray on Kubernetes requires using multiple layers of resource management:
● RayCluster autoscaler - for managing Serve actors.
● KubeRay - for managing k8s Pods.
● EKS, GKE, AKS - for managing compute instances.
Introducing this level of complexity must be justified:
● Costly hardware.
● Deep inference graphs.
● Multiplicity of served models.
Experience with the Ray Community
1. Hyperparameter search with Ray Tune
2. Our first RayCluster setup in Kubernetes
3. Collaboration with the Serve team (Serve autoscaler)
4. Fixing nasty memory leaks
5. Ray 2.x.x migration
CREDITS: This presentation template was created by
Slidesgo, including icons by Flaticon, and
infographics & images by Freepik
Do you have any questions?
Find Miha Jenko on Ray Slack or LinkedIn.
Email: [email protected]
Please keep this slide for attribution