Building a scalable ML model serving API with Ray Serve
Tricia Fu // Product Manager @ Anyscale
Slide 2
What is this webinar about?
You trained a model, now what?
“Put it in production”
Slide 3
Agenda
● Model serving - options and challenges
● Ray Serve overview
● Demo
Slide 4
Challenges with deploying ML models:
● New models are developed over time
● Scaling out a single model
● Composing multiple models
Slide 5
Existing serving solutions have tradeoffs
(compared on two axes: ease of development and generality)
● Custom tooling - hard to develop, deploy, and manage
● Web frameworks - can’t achieve high performance or low cost
● Specialized systems - not flexible or easy to use
Slide 6
Many existing tools run 1 model well…
but struggle with handling multiple models:
● Complex YAML
● Scalability issues
● $$$
Slide 7
Ray Serve is a model serving framework…
built from the learnings of 100s of practitioners:
● Simple to use
● Python API
● Built for putting ML models into production
Slide 8
The Ray Ecosystem
RAY
Universal framework for distributed computing
Slide 9
Ray Serve: Web Framework
Simple to deploy web services on Ray
Slide 10
Ray Serve: Model Serving
Specialized for ML model serving
● GPUs
● Batching
● Scale-out
● Model composition
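As a rough sketch (not from the slides), these features map onto a few decorator options in recent Ray Serve 2.x releases. The model class is hypothetical, and the lambda stands in for real model-loading code:

```python
from ray import serve


@serve.deployment(
    num_replicas=4,                     # scale-out: run 4 replicas of the model
    ray_actor_options={"num_gpus": 1},  # give each replica one GPU
)
class SentimentModel:
    def __init__(self):
        # Stand-in for loading a real model onto the GPU.
        self.model = lambda texts: [0.0 for _ in texts]

    # Batching: Ray Serve buffers concurrent requests and invokes this
    # method once per batch of up to 8 inputs.
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def predict(self, texts: list[str]) -> list[float]:
        return self.model(texts)

    async def __call__(self, request) -> float:
        # Callers pass a single input; @serve.batch handles the grouping.
        return await self.predict((await request.body()).decode())


serve.run(SentimentModel.bind())
```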
Slide 11
Ray Serve enables seamless model composition
● Python API
● High-performance calls (no HTTP)
● 1 line to scale to 100 machines
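A minimal sketch of that composition, assuming the deployment-handle API in recent Ray Serve 2.x releases (the model names and bodies are made up). Calls between deployments stay in Python rather than going over HTTP, and scaling a stage is a one-line change to its decorator:

```python
from ray import serve


@serve.deployment  # scaling is 1 line: @serve.deployment(num_replicas=100)
class Featurizer:
    def process(self, text: str) -> str:
        return text.lower()  # stand-in for real preprocessing


@serve.deployment
class Classifier:
    def predict(self, text: str) -> str:
        return "positive"  # stand-in for a real model


@serve.deployment
class Pipeline:
    def __init__(self, featurizer, classifier):
        # These arrive as deployment handles when bound below.
        self.featurizer = featurizer
        self.classifier = classifier

    async def __call__(self, request) -> str:
        text = (await request.body()).decode()
        # handle.method.remote() is a direct call between replicas --
        # no HTTP hop in the middle.
        features = await self.featurizer.process.remote(text)
        return await self.classifier.predict.remote(features)


# Wire the graph together and start serving it.
serve.run(Pipeline.bind(Featurizer.bind(), Classifier.bind()))
```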
Slide 12
Combine ML with custom business logic
● Separate I/O and compute-heavy work
● Native integration with FastAPI
● Easily scale your models and business logic
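For example, the FastAPI integration looks roughly like this (the route and response are invented for illustration):

```python
from fastapi import FastAPI
from ray import serve

app = FastAPI()


@serve.deployment
@serve.ingress(app)  # FastAPI routing, validation, and docs all apply
class APIServer:
    @app.get("/classify")
    async def classify(self, text: str) -> dict:
        # I/O-bound work, model calls, and business logic can mix here.
        return {"text": text, "label": "positive"}  # stand-in result


serve.run(APIServer.bind())
```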
Slide 13
Demo
● Deploy a trained Python model and scale it to a cluster using Ray Serve
● Utilize Ray Serve’s native FastAPI integration
● Compose multiple independent models into a single application and run them in parallel
Slide 14
Let’s say you want to translate tweets to French and filter out negative content.
We need an endpoint that:
○ Fetches a tweet (I/O-intensive work)
○ Analyzes the sentiment (compute-intensive model)
○ Translates it to French (compute-intensive model)
○ Combines & presents the results (business logic)
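A sketch of how those four steps could be wired up as Ray Serve deployments; fetch_tweet() and the two model bodies are hypothetical stand-ins. Note that both compute-intensive calls are started before either is awaited, so the two models run in parallel:

```python
from ray import serve


async def fetch_tweet(tweet_id: str) -> str:
    return f"tweet {tweet_id}"  # stand-in for a real (I/O-intensive) API call


@serve.deployment
class Sentiment:
    def predict(self, text: str) -> float:
        return 0.5  # stand-in for a compute-intensive sentiment model


@serve.deployment
class Translator:
    def translate(self, text: str) -> str:
        return f"(fr) {text}"  # stand-in for a compute-intensive translator


@serve.deployment
class Endpoint:
    def __init__(self, sentiment, translator):
        self.sentiment = sentiment    # deployment handle
        self.translator = translator  # deployment handle

    async def __call__(self, request) -> str:
        tweet = await fetch_tweet(request.query_params["id"])  # I/O work
        # Kick off both model calls at once; they execute in parallel.
        score = self.sentiment.predict.remote(tweet)
        translation = self.translator.translate.remote(tweet)
        # Business logic: filter out negative content.
        if await score < 0:
            return "filtered: negative content"
        return await translation


serve.run(Endpoint.bind(Sentiment.bind(), Translator.bind()))
```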
Slide 15
Let’s break it down
○ Step 1 - Write a Python script that translates tweets to French and filters out negative content (doesn’t use Ray Serve)
○ Step 2 - Put it behind a local Ray Serve endpoint
○ Step 3 - Separate it into multiple local deployments
○ Step 4 - Deploy it onto a cluster
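As a sketch of Step 2, the Step 1 script (here a hypothetical translate_and_filter() stub) only needs a deployment wrapper and a serve.run() call to become a local HTTP endpoint; Steps 3 and 4 then split it into multiple deployments (as in the sketch above) and run the same code against a Ray cluster instead of a local process:

```python
from ray import serve


def translate_and_filter(tweet: str) -> str:
    return f"(fr) {tweet}"  # stand-in for the Step 1 script’s logic


@serve.deployment
class TweetService:
    async def __call__(self, request) -> str:
        tweet = (await request.body()).decode()
        return translate_and_filter(tweet)  # Step 1 logic, unchanged


# Starts Ray locally if needed; serves on http://localhost:8000 by default.
serve.run(TweetService.bind())
```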
Slide 16
Slide 17
Ray Serve: A framework for 1+ models in production
01 Deployments
● Python interface
● Scalable
● Rolling upgrade
02 Ingress
● Fully-featured HTTP
● FastAPI Integration
03 Handle
● Model composition
● Offload computation
● Just Python
Slide 18
Ray Serve is easy to use, easy to deploy, and ready for production:
○ Python interface
○ Multi-model composition
○ Scaling models independently
○ Programmatic deployment API
○ Easy to incorporate custom business logic
Slide 19
Thank you!
To learn more, check out ray.io and rayserve.org