Building a scalable ML model serving API with Ray Serve
Tricia Fu // Product Manager @ Anyscale
Slide 2
What is this webinar about?
You trained a model, now what?
“Put it in production”
Slide 3
Agenda
● Model serving - options and challenges
● Ray Serve overview
● Demo
Slide 4
Challenges with deploying ML models:
● New models are developed over time
● Scaling out a single model
● Composing multiple models
Slide 5
Existing serving solutions have tradeoffs
(compared on two axes: ease of development and generality)
● Custom tooling - hard to develop, deploy, and manage
● Web frameworks - can’t achieve high performance or low cost
● Specialized systems - not flexible or easy to use
Slide 6
Many existing tools run 1 model well…
but struggle with handling multiple models:
● Complex YAML
● Scalability issues
● $$$
Slide 7
Ray Serve is a model serving framework…
built from the learnings of 100s of practitioners:
● Simple to use
● Python API
● Built for putting ML models into production
Slide 8
The Ray Ecosystem
RAY
Universal framework for distributed computing
Slide 9
Ray Serve: Web Framework
Simple to deploy web services on Ray
Slide 10
Ray Serve: Model Serving
Specialized for ML model serving
● GPUs
● Batching
● Scale-out
● Model composition
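As a rough sketch (not from the slides), these features map onto a few decorator options in recent Ray Serve 2.x releases. The model class is hypothetical, and the lambda stands in for real model-loading code:

```python
from ray import serve


@serve.deployment(
    num_replicas=4,                     # scale-out: run 4 replicas of the model
    ray_actor_options={"num_gpus": 1},  # give each replica one GPU
)
class SentimentModel:
    def __init__(self):
        # Stand-in for loading a real model onto the GPU.
        self.model = lambda texts: [0.0 for _ in texts]

    # Batching: Ray Serve buffers concurrent requests and invokes this
    # method once per batch of up to 8 inputs.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def predict(self, texts: list[str]) -> list[float]:
        return self.model(texts)

    async def __call__(self, request) -> float:
        # Callers pass a single input; @serve.batch handles the grouping.
        return await self.predict((await request.body()).decode())


serve.run(SentimentModel.bind())
```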
Slide 11
Ray Serve enables seamless model composition
● Python API
● High-performance calls (no HTTP)
● 1 line to scale to 100 machines
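A minimal sketch of that composition, assuming the deployment-handle API in recent Ray Serve 2.x releases (the model names and bodies are made up). Calls between deployments stay in Python rather than going over HTTP, and scaling a stage is a one-line change to its decorator:

```python
from ray import serve


@serve.deployment  # scaling is 1 line: @serve.deployment(num_replicas=100)
class Featurizer:
    def process(self, text: str) -> str:
        return text.lower()  # stand-in for real preprocessing


@serve.deployment
class Classifier:
    def predict(self, text: str) -> str:
        return "positive"  # stand-in for a real model


@serve.deployment
class Pipeline:
    def __init__(self, featurizer, classifier):
        # These arrive as deployment handles when bound below.
        self.featurizer = featurizer
        self.classifier = classifier

    async def __call__(self, request) -> str:
        text = (await request.body()).decode()
        # handle.method.remote() is a direct call between replicas --
        # no HTTP hop in the middle.
        features = await self.featurizer.process.remote(text)
        return await self.classifier.predict.remote(features)


# Wire the graph together and start serving it.
serve.run(Pipeline.bind(Featurizer.bind(), Classifier.bind()))
```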
Slide 12
Combine ML with custom business logic
● Separate I/O and compute-heavy work
● Native integration with FastAPI
● Easily scale your models and business logic
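For example, the FastAPI integration looks roughly like this (the route and response are invented for illustration):

```python
from fastapi import FastAPI
from ray import serve

app = FastAPI()


@serve.deployment
@serve.ingress(app)  # FastAPI routing, validation, and docs all apply
class APIServer:
    @app.get("/classify")
    async def classify(self, text: str) -> dict:
        # I/O-bound work, model calls, and business logic can mix here.
        return {"text": text, "label": "positive"}  # stand-in result


serve.run(APIServer.bind())
```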
Slide 13
Demo
● Deploy a trained Python model and scale it to a cluster using Ray Serve
● Utilize Ray Serve’s native FastAPI integration
● Compose multiple independent models into a single application and run them in parallel
Slide 14
Let’s say you want to translate tweets to French and filter out negative content.
We need an endpoint that:
○ Fetches a tweet (I/O-intensive work)
○ Analyzes the sentiment (compute-intensive model)
○ Translates it to French (compute-intensive model)
○ Combines & presents the results (business logic)
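A sketch of how those four steps could be wired up as Ray Serve deployments; fetch_tweet() and the two model bodies are hypothetical stand-ins. Note that both compute-intensive calls are started before either is awaited, so the two models run in parallel:

```python
from ray import serve


async def fetch_tweet(tweet_id: str) -> str:
    return f"tweet {tweet_id}"  # stand-in for a real (I/O-intensive) API call


@serve.deployment
class Sentiment:
    def predict(self, text: str) -> float:
        return 0.5  # stand-in for a compute-intensive sentiment model


@serve.deployment
class Translator:
    def translate(self, text: str) -> str:
        return f"(fr) {text}"  # stand-in for a compute-intensive translator


@serve.deployment
class Endpoint:
    def __init__(self, sentiment, translator):
        self.sentiment = sentiment    # deployment handle
        self.translator = translator  # deployment handle

    async def __call__(self, request) -> str:
        tweet = await fetch_tweet(request.query_params["id"])  # I/O work
        # Kick off both model calls at once; they execute in parallel.
        score = self.sentiment.predict.remote(tweet)
        translation = self.translator.translate.remote(tweet)
        # Business logic: filter out negative content.
        if await score < 0:
            return "filtered: negative content"
        return await translation


serve.run(Endpoint.bind(Sentiment.bind(), Translator.bind()))
```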
Slide 15
Let’s break it down
○ Step 1 - Write a Python script that translates tweets to French and filters out negative content (doesn’t use Ray Serve)
○ Step 2 - Put it behind a local Ray Serve endpoint
○ Step 3 - Separate it into multiple local deployments
○ Step 4 - Deploy it onto a cluster
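As a sketch of Step 2, the Step 1 script (here a hypothetical translate_and_filter() stub) only needs a deployment wrapper and a serve.run() call to become a local HTTP endpoint; Steps 3 and 4 then split it into multiple deployments (as in the sketch above) and run the same code against a Ray cluster instead of a local process:

```python
from ray import serve


def translate_and_filter(tweet: str) -> str:
    return f"(fr) {tweet}"  # stand-in for the Step 1 script’s logic


@serve.deployment
class TweetService:
    async def __call__(self, request) -> str:
        tweet = (await request.body()).decode()
        return translate_and_filter(tweet)  # Step 1 logic, unchanged


# Starts Ray locally if needed; serves on http://localhost:8000 by default.
serve.run(TweetService.bind())
```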
Slide 16
Slide 17
Ray Serve: A framework for 1+ models in production
01 Deployments
● Python interface
● Scalable
● Rolling upgrade
02 Ingress
● Fully-featured HTTP
● FastAPI Integration
03 Handle
● Model composition
● Offload computation
● Just Python
Slide 18
Ray Serve is easy to use, easy to deploy, and ready for production:
○ Python interface
○ Multi-model composition
○ Scaling models independently
○ Programmatic deployment API
○ Easy to incorporate custom business logic
Slide 19
Thank you!
To learn more, check out ray.io and rayserve.org