
Building a scalable ML model serving API with Ray Serve

Anyscale
September 10, 2021

Ray Serve is a framework-agnostic and Python-first model serving library built on Ray. In this introductory webinar on Ray Serve, we will highlight how Ray Serve makes it easy to deploy, operate and scale a machine learning API.

The core of the webinar will be a live demo that shows how to build a scalable API using Natural Language Processing models.

The demo will show how to:
- Deploy a trained Python model and scale it to a cluster using Ray Serve
- Improve the HTTP API using Ray Serve’s native FastAPI integration
- Compose multiple independently-scalable models into a single model, and run them in parallel to minimize latency.


Transcript

  1. Building a scalable ML model serving API with Ray Serve

    Tricia Fu // Product Manager @ Anyscale
  2. What is this webinar about? You trained a model, now

    what? “Put it in production”
  3. Agenda

    • Model serving - options and challenges • Ray Serve overview • Demo
  4. Challenges with deploying ML models: • New models are developed

    over time • Scaling out a single model • Composing multiple models
  5. Existing serving solutions have tradeoffs

    Along the axes of ease of development and generality: • Custom tooling: hard to develop, deploy, and manage • Web frameworks: can’t achieve high performance or low cost • Specialized systems: not flexible or easy to use
  6. Many existing tools run 1 model well …

    but struggle with handling multiple models: • Complex YAML • Scalability issues • $$$
  7. Ray Serve is a model serving framework

    • Simple to use • Python API • Built for putting ML models into production … built from the learnings of 100s of practitioners
  8. The Ray Ecosystem RAY Universal framework for distributed computing

  9. Ray Serve: Web Framework Simple to deploy web services on

    Ray
  10. Ray Serve: Model Serving Specialized for ML model serving •

    GPUs • Batching • Scale-out • Model composition
  11. Ray Serve enables seamless model composition

    • Python API • High-performance calls (no HTTP) • 1 line to scale to 100 machines
  12. Combine ML with custom business logic

    • Separate I/O and compute-heavy work • Native integration with FastAPI • Easily scale your models and business logic
  13. Mobile gaming giant Wildlife Studios is able to serve in-game

    offers 3X faster with Ray Serve.
  14. Mobile gaming giant is able to serve in-game offers 3X

    faster with Ray Serve.
  15. Demo

    • Deploy a trained Python model and scale it to a cluster using Ray Serve • Utilize Ray Serve’s native FastAPI integration • Compose multiple independent models into a single model and run them in parallel
  16. Let’s say you want to translate tweets to French and

    filter out negative content. We need an endpoint that: ◦ Fetches a Tweet (I/O-intensive work) ◦ Analyzes the sentiment (compute-intensive model) ◦ Translates it to French (compute-intensive model) ◦ Combines & presents the results (business logic)
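
The shape of that endpoint — I/O first, then the two independent models in parallel, then business logic — can be sketched in plain Python with `asyncio` (all three components below are stubs standing in for the real Twitter call and models):

```python
import asyncio


async def fetch_tweet(tweet_id):        # I/O-intensive work (stubbed)
    await asyncio.sleep(0)              # placeholder for a network call
    return f"tweet-{tweet_id} text"


def analyze_sentiment(text):            # compute-intensive model (stubbed)
    return 0.9


def translate_to_french(text):          # compute-intensive model (stubbed)
    return f"fr({text})"


async def endpoint(tweet_id):
    text = await fetch_tweet(tweet_id)
    # The two models are independent, so run them concurrently:
    loop = asyncio.get_running_loop()
    sentiment, french = await asyncio.gather(
        loop.run_in_executor(None, analyze_sentiment, text),
        loop.run_in_executor(None, translate_to_french, text),
    )
    # Business logic: combine the results, filtering negative content.
    return {"translation": french} if sentiment >= 0.5 else {"filtered": True}


result = asyncio.run(endpoint(1))
```

In the demo, the same structure is expressed with Serve deployments and handles, so each model can scale out independently instead of sharing one process.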
  17. Let’s break it down ◦ Step 1 - Write a

    Python script that translates Tweets to French and filters out negative content. Doesn’t use Ray Serve ◦ Step 2 - Put it behind a local Ray Serve endpoint ◦ Step 3 - Separate it into multiple local deployments ◦ Step 4 - Deploy it onto a cluster
  18. None
  19. Ray Serve: A framework for 1+ models in production

    01 Deployments: Python interface, scalable, rolling upgrades 02 Ingress: fully-featured HTTP, FastAPI integration 03 Handle: model composition, offload computation, just Python
  20. Ray Serve is easy to use, easy to deploy, and

    ready for production: ◦ Python interface ◦ Multi-model composition ◦ Scaling models independently ◦ Programmatic deployment API ◦ Easy to incorporate custom business logic
  21. Thank you! To learn more, check out ray.io and rayserve.org

  22. Building a scalable ML model serving API with Ray Serve

    Tricia Fu // Product Manager @ Anyscale
  23. Agenda • Ray Serve Overview • Demo

  24. Ray Serve: Web Framework Simple to Deploy Web Services on

    Ray
  25. Ray Serve: Model Serving Specialized for ML Model Serving GPUs

    Batching Scale-out Model Composition
  26. Wildlife Studios & Ray Serve

    Wildlife Studios is using Ray Serve to optimize and serve more relevant and timely in-game offers. With Ray Serve, they have been able to serve in-game offers 3x faster at 1/20th the cost, saving up to $400k annually from better infrastructure utilization.
  27. Ray Core - universal distributed framework

    • Open source • Cluster manager • Integrated ecosystem • Native libraries
  28. Building an ML service

    Along the axes of readiness for production and ease of development: • Web frameworks: can’t achieve high performance or low cost • Custom tooling: hard to develop, deploy, and manage • Specialized systems: lost flexibility and ease of use
  29. What makes Serve Different?

  30. What makes Serve Different?

    Many tools run 1 model well. With 1+ copies of the model: -> Impossible? -> Complex YAML -> Scalability issues -> $$$
  31. Reality

    • New models are developed over time • Scale out a single model • Compose multiple models together for real-world use cases
  32. Ray Serve Enables Seamless Model Composition

    • Pythonic API • High-performance calls (no HTTP) • 1 line to scale to 100 machines
  33. Combine Machine Learning w/ Custom Business Logic

    • Separate I/O and compute-heavy work • Native integration with FastAPI • Easily scale your models and business logic
  34. Ray Serve: A Framework for 1+ Models in Production

    • Deployment: Pythonic interface, scalable deployment, rolling upgrade • Ingress: fully featured HTTP, FastAPI integration • Handle: arbitrary composition, offload computation, just Python
  35. • Deploy a trained Python model and scale it to

    a cluster using Ray Serve • Utilize Ray Serve’s native FastAPI integration • Compose multiple independent models into a single model and run them in parallel Demo
  36. • Let’s say you want to translate tweets to French

    and filter out negative content. • We need an endpoint that: ◦ Fetches a Tweet (I/O-intensive work) ◦ Analyzes the sentiment (compute-intensive model) ◦ Translates it to French
  37. None
  38. Thank you! To learn more, check out ray.io and rayserve.org

  39. None
  40. None