Building a scalable ML model serving API with Ray Serve

September 10, 2021

Ray Serve is a framework-agnostic and Python-first model serving library built on Ray. In this introductory webinar on Ray Serve, we will highlight how Ray Serve makes it easy to deploy, operate and scale a machine learning API.

The core of the webinar will be a live demo that shows how to build a scalable API using Natural Language Processing models.

The demo will show how to:
- Deploy a trained Python model and scale it to a cluster using Ray Serve
- Improve the HTTP API using Ray Serve’s native FastAPI integration
- Compose multiple independently-scalable models into a single model, and run them in parallel to minimize latency.
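
The last point, running composed models in parallel to keep latency low, can be illustrated without Ray Serve at all. The following is a minimal standard-library sketch, not the code from the demo: `sentiment_model` and `translation_model` are hypothetical stand-ins that only simulate inference with a short sleep, and `asyncio.gather` plays the role that parallel calls between deployments would play in Ray Serve.

```python
import asyncio

# Hypothetical stand-ins for two independently scalable models.
# Each one simulates inference time with a short sleep.
async def sentiment_model(text: str) -> str:
    await asyncio.sleep(0.2)
    return "negative" if "hate" in text.lower() else "positive"

async def translation_model(text: str) -> str:
    await asyncio.sleep(0.2)
    # Placeholder output: a real model would return actual French text.
    return f"[fr] {text}"

async def composed_model(text: str) -> dict:
    # Start both models at once and wait for both results, so the total
    # latency is close to the slower model rather than the sum of the two.
    sentiment, translation = await asyncio.gather(
        sentiment_model(text),
        translation_model(text),
    )
    return {"sentiment": sentiment, "translation": translation}

if __name__ == "__main__":
    print(asyncio.run(composed_model("I love Ray")))
```

Calling the two models one after the other would take roughly the sum of both sleeps; with `asyncio.gather` the composed call takes roughly as long as the slower of the two.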



  1. Building a scalable ML model serving API with Ray Serve
    Tricia Fu // Product Manager @ Anyscale
  2. What is this webinar about?
    You trained a model, now what? “Put it in production”
  3. Challenges with deploying ML models:
    • New models are developed over time
    • Scaling out a single model
    • Composing multiple models
  4. Existing serving solutions have tradeoffs
    • Custom Tooling: hard to develop, deploy, manage
    • Web Frameworks: can’t achieve high performance, low cost
    • Specialized Systems: not flexible, not easy
    Diagram labels: Ease of development, Generality
  5. Many existing tools run 1 model well … but struggle with handling multiple models
    • Complex YAML
    • Scalability issue
    • $$$
  6. Ray Serve is a model serving framework … built from the learnings from 100s of practitioners
    • Simple to use
    • Python API
    • Built for putting ML models into production
  7. Ray Serve: Model Serving
    Specialized for ML model serving
    • GPUs
    • Batching
    • Scale-out
    • Model composition
  8. Ray Serve enables seamless model composition
    • Python API
    • High Performance Calls (No HTTP)
    • 1 line to scale to 100 machines
  9. Combine ML with custom business logic
    • Separate I/O and Compute-heavy Work
    • Native integration with FastAPI
    • Easily scale your models and business logic
  10. Demo
    • Deploy a trained Python model and scale it to a cluster using Ray Serve
    • Utilize Ray Serve’s native FastAPI integration
    • Compose multiple independent models into a single model and run them in parallel
  11. Let’s say you want to translate tweets to French and filter out negative content. We need an endpoint that:
    ◦ Fetches a Tweet (I/O-intensive work)
    ◦ Analyzes the sentiment (compute-intensive model)
    ◦ Translates it to French (compute-intensive model)
    ◦ Combines & presents the results (business logic)
  12. Let’s break it down
    ◦ Step 1 - Write a Python script that translates Tweets to French and filters out negative content. Doesn’t use Ray Serve
    ◦ Step 2 - Put it behind a local Ray Serve endpoint
    ◦ Step 3 - Separate it into multiple local deployments
    ◦ Step 4 - Deploy it onto a cluster
  13. Ray Serve: A framework for 1+ models in production
    01 Deployments: Python interface, Scalable, Rolling upgrade
    02 Ingress: Fully-featured HTTP, FastAPI Integration
    03 Handle: Model composition, Offload computation, Just Python
  14. Ray Serve is easy to use, easy to deploy, and ready for production:
    ◦ Python interface
    ◦ Multi-model composition
    ◦ Scaling models independently
    ◦ Programmatic deployment API
    ◦ Easy to incorporate custom business logic
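
Step 1 of the demo outline on slide 12 (a plain Python script that does not use Ray Serve) could look roughly like the sketch below. This is an illustration under stated assumptions, not the code shown in the webinar: `FAKE_TWEETS`, `fetch_tweet`, `analyze_sentiment`, `translate_to_french`, and `process_tweet` are hypothetical names, and the two "models" are trivial placeholders standing in for real NLP models.

```python
from typing import Optional

# Hypothetical in-memory data standing in for a real Tweet-fetching API.
FAKE_TWEETS = {
    1: "I love sunny days",
    2: "I hate traffic",
}

def fetch_tweet(tweet_id: int) -> str:
    """I/O-intensive step (simulated here with a dictionary lookup)."""
    return FAKE_TWEETS[tweet_id]

def analyze_sentiment(text: str) -> str:
    """Compute-intensive model (placeholder keyword rule, not a real model)."""
    return "negative" if "hate" in text.lower() else "positive"

def translate_to_french(text: str) -> str:
    """Compute-intensive model (placeholder that only tags the text)."""
    return f"[fr] {text}"

def process_tweet(tweet_id: int) -> Optional[dict]:
    """Business logic: drop negative Tweets, otherwise return the translation."""
    text = fetch_tweet(tweet_id)
    if analyze_sentiment(text) == "negative":
        return None  # filtered out
    return {"id": tweet_id, "original": text, "french": translate_to_french(text)}

if __name__ == "__main__":
    for tweet_id in FAKE_TWEETS:
        print(tweet_id, process_tweet(tweet_id))
```

Keeping the fetch, sentiment, translation, and combine steps as separate functions mirrors the later steps in the outline, where each piece is moved behind its own deployment so it can be scaled independently.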