
Building a scalable ML model serving API with Ray Serve

Anyscale
September 10, 2021

Ray Serve is a framework-agnostic and Python-first model serving library built on Ray. In this introductory webinar on Ray Serve, we will highlight how Ray Serve makes it easy to deploy, operate and scale a machine learning API.

The core of the webinar will be a live demo that shows how to build a scalable API using Natural Language Processing models.

The demo will show how to:
- Deploy a trained Python model and scale it to a cluster using Ray Serve
- Improve the HTTP API using Ray Serve’s native FastAPI integration
- Compose multiple independently-scalable models into a single model, and run them in parallel to minimize latency.


Transcript

  1. Building a scalable ML model serving API with Ray Serve

    Tricia Fu // Product Manager @ Anyscale
  2. What is this webinar about? You trained a model, now

    what? “Put it in production”
  3. Agenda

    • Model serving - options and challenges • Ray Serve overview • Demo
  4. Challenges with deploying ML models: • New models are developed

    over time • Scaling out a single model • Composing multiple models
  5. Existing serving solutions have tradeoffs

    Along the axes of ease of development and generality: • Custom tooling: hard to develop, deploy, and manage • Web frameworks: can’t achieve high performance or low cost • Specialized systems: not flexible or easy to use
  6. Many existing tools run 1 model well …

    but struggle with handling multiple models: • Complex YAML • Scalability issues • $$$
  7. Ray Serve is a model serving framework

    • Simple to use • Python API • Built for putting ML models into production … built from the learnings of 100s of practitioners
  8. The Ray Ecosystem RAY Universal framework for distributed computing

  9. Ray Serve: Web Framework Simple to deploy web services on

    Ray
  10. Ray Serve: Model Serving Specialized for ML model serving •

    GPUs • Batching • Scale-out • Model composition
  11. Ray Serve enables seamless model composition

    • Python API • High-performance calls (no HTTP) • 1 line to scale to 100 machines
  12. Combine ML with custom business logic

    • Separate I/O and compute-heavy work • Native integration with FastAPI • Easily scale your models and business logic
  13. Mobile gaming giant Wildlife Studios is able to serve in-game

    offers 3X faster with Ray Serve.
  14. Mobile gaming giant is able to serve in-game offers 3X

    faster with Ray Serve.
  15. Demo

    • Deploy a trained Python model and scale it to a cluster using Ray Serve • Utilize Ray Serve’s native FastAPI integration • Compose multiple independent models into a single model and run them in parallel
  16. Let’s say you want to translate tweets to French and

    filter out negative content. We need an endpoint that: ◦ Fetches a Tweet (I/O-intensive work) ◦ Analyzes the sentiment (compute-intensive model) ◦ Translates it to French (compute-intensive model) ◦ Combines & presents the results (business logic)
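
The shape of that endpoint — I/O first, then the two independent models in parallel, then business logic — can be sketched in plain Python with `asyncio` (all three components below are stubs standing in for the real Twitter call and models):

```python
import asyncio


async def fetch_tweet(tweet_id):        # I/O-intensive work (stubbed)
    await asyncio.sleep(0)              # placeholder for a network call
    return f"tweet-{tweet_id} text"


def analyze_sentiment(text):            # compute-intensive model (stubbed)
    return 0.9


def translate_to_french(text):          # compute-intensive model (stubbed)
    return f"fr({text})"


async def endpoint(tweet_id):
    text = await fetch_tweet(tweet_id)
    # The two models are independent, so run them concurrently:
    loop = asyncio.get_running_loop()
    sentiment, french = await asyncio.gather(
        loop.run_in_executor(None, analyze_sentiment, text),
        loop.run_in_executor(None, translate_to_french, text),
    )
    # Business logic: combine the results, filtering negative content.
    return {"translation": french} if sentiment >= 0.5 else {"filtered": True}


result = asyncio.run(endpoint(1))
```

In the demo, the same structure is expressed with Serve deployments and handles, so each model can scale out independently instead of sharing one process.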
  17. Let’s break it down ◦ Step 1 - Write a

    Python script that translates Tweets to French and filters out negative content. Doesn’t use Ray Serve ◦ Step 2 - Put it behind a local Ray Serve endpoint ◦ Step 3 - Separate it into multiple local deployments ◦ Step 4 - Deploy it onto a cluster
  18. None
  19. Ray Serve: A framework for 1+ models in production

    01 Deployments: Python interface, scalable, rolling upgrades 02 Ingress: fully-featured HTTP, FastAPI integration 03 Handle: model composition, offload computation, just Python
  20. Ray Serve is easy to use, easy to deploy, and

    ready for production: ◦ Python interface ◦ Multi-model composition ◦ Scaling models independently ◦ Programmatic deployment API ◦ Easy to incorporate custom business logic
  21. Thank you! To learn more, check out ray.io and rayserve.org

  22. Building a scalable ML model serving API with Ray Serve

    Tricia Fu // Product Manager @ Anyscale
  23. Agenda • Ray Serve Overview • Demo

  24. Ray Serve: Web Framework Simple to Deploy Web Services on

    Ray
  25. Ray Serve: Model Serving Specialized for ML Model Serving GPUs

    Batching Scale-out Model Composition
  26. Wildlife Studios & Ray Serve

    Wildlife Studios is using Ray Serve to optimize and serve more relevant and timely in-game offers. With Ray Serve, they have been able to serve in-game offers 3x faster at 1/20th the cost, saving up to $400k annually from better infrastructure utilization.
  27. Ray Core - universal distributed framework

    • Open source • Cluster manager • Integrated ecosystem • Native libraries
  28. Building an ML service

    Along the axes of readiness for production and ease of development: • Web frameworks: can’t achieve high performance or low cost • Custom tooling: hard to develop, deploy, and manage • Specialized systems: lost flexibility and ease of use
  29. What makes Serve Different?

  30. What makes Serve Different?

    Many tools run 1 model well. With 1+ copies of the model: -> Impossible? -> Complex YAML -> Scalability issues -> $$$
  31. Reality

    • New models are developed over time • Scale out a single model • Compose multiple models together for real-world use cases
  32. Ray Serve Enables Seamless Model Composition

    • Pythonic API • High-performance calls (no HTTP) • 1 line to scale to 100 machines
  33. Combine Machine Learning w/ Custom Business Logic

    • Separate I/O and compute-heavy work • Native integration with FastAPI • Easily scale your models and business logic
  34. Ray Serve: A Framework for 1+ Models in Production

    • Deployment: Pythonic interface, scalable deployment, rolling upgrade • Ingress: fully featured HTTP, FastAPI integration • Handle: arbitrary composition, offload computation, just Python
  35. • Deploy a trained Python model and scale it to

    a cluster using Ray Serve • Utilize Ray Serve’s native FastAPI integration • Compose multiple independent models into a single model and run them in parallel Demo
  36. • Let’s say you want to translate tweets to French

    and filter out negative content. • We need an endpoint that: ◦ Fetches a Tweet (I/O-intensive work) ◦ Analyzes the sentiment (compute-intensive model) ◦ Translates it to French
  37. None
  38. Thank you! To learn more, check out ray.io and rayserve.org

  39. None
  40. None