
Ray Serve: Patterns of ML Models in Production (Simon Mo)


You trained an ML model; now what? The model needs to be deployed for online serving and offline processing. This talk walks through the journey of deploying your ML models in production. I will cover common deployment patterns backed by concrete use cases drawn from 100+ user interviews for Ray and Ray Serve. Lastly, I will cover how we built Ray Serve, a scalable model serving framework, from these learnings.

Anyscale

July 13, 2021

Transcript

  1. Who am I? Building Ray Serve @ Anyscale. Previously: Prediction Serving System @ Berkeley RISELab. Constantly talking to ML practitioners.
  2. Ray Serve: a scalable and programmable serving framework on Ray. Framework agnostic, Python first, and easy to use. Helps you scale in production.
  3. This talk is about: common patterns of ML in production, and Ray Serve, your go-to framework for deploying ML models.
  4. Ray: Ecosystem. A universal framework for distributed computing with the most comprehensive set of distributed libraries (native libraries and 3rd-party libraries).
  5. Ray provides primitives for distributed apps; Ray Serve is a framework for ML serving. By leveraging Ray, Ray Serve is built for scale.
  6. Building an ML service: a trade-off between being ready for production and ease of development. Web frameworks can't achieve high performance or low cost. Custom tooling is hard to develop, deploy, and manage. Specialized systems lose flexibility and ease of use.
  7. What makes Serve different? Many tools run 1 model well. With 1+ copies of the model: impossible? -> complex YAML -> scalability issues -> $$$.
  8. Reality: new models are developed over time; a single model must scale out; multiple models are composed together for real-world use cases.
  9. Pipeline: break tasks into steps. Scikit-Learn pipeline: Pipeline([('scaler', StandardScaler()), ('svc', SVC())]). Recommendation systems: [EmbeddingLookup(), FeatureInteraction(), NearestNeighbors(), Ranking()]. Common preprocessing: [HeavyWeightMLMegaModel(), DecisionTree()/BoostingModel()].
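The pipeline pattern above can be sketched in plain Python: each step is a callable, and the pipeline threads one step's output into the next. This is a minimal sketch; the stage names and toy logic are illustrative, not from the talk.

```python
# Minimal sketch of the pipeline pattern: each stage is a callable,
# and the pipeline feeds one stage's output into the next.

def make_pipeline(*stages):
    """Compose stages left to right into a single callable."""
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run

# Toy stages standing in for StandardScaler / SVC.
scale = lambda xs: [(v - 5.0) / 2.0 for v in xs]
classify = lambda xs: [1 if v > 0 else 0 for v in xs]

pipeline = make_pipeline(scale, classify)
print(pipeline([3.0, 7.0]))  # [0, 1]
```

In Ray Serve, each stage would typically become its own deployment so the heavy step (e.g. a large model) can scale independently of the cheap one.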
  10. Pipeline implementation today: wrap all models in one web server (simple but not performant), or split them into many specialized microservices (complex and hard to operate).
  11. Ray Serve enables seamless model composition: Pythonic API, high-performance calls (no HTTP), 1 line to scale to 100 machines.
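Composition in this style can be sketched with plain classes; in Ray Serve each class would be decorated with @serve.deployment and called through a handle, but the composition logic itself stays just Python. The class names and toy models below are illustrative placeholders.

```python
# Sketch of Pythonic model composition. In Ray Serve, Featurizer and
# Scorer would each be a deployment, and Composed would call them
# through handles instead of direct method calls.

class Featurizer:
    def __call__(self, text):
        return len(text)          # stand-in for real feature extraction

class Scorer:
    def __call__(self, features):
        return features * 0.1     # stand-in for a real model

class Composed:
    """Driver that composes the two 'deployments'."""
    def __init__(self, featurizer, scorer):
        self.featurizer = featurizer
        self.scorer = scorer

    def __call__(self, text):
        return self.scorer(self.featurizer(text))

app = Composed(Featurizer(), Scorer())
print(app("hello"))  # 0.5
```

Because the calls are in-process Python rather than HTTP hops between microservices, composing models this way avoids the serialization and network overhead the slides warn about.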
  12. Business logic: database lookups, web API calls, feature store lookups, feature transformations.
  13. Business logic in Ray Serve: network-bound, I/O-heavy work is offloaded to another deployment. The rest is just Python.
  14. Ray Serve enables arbitrary business logic: separate I/O-heavy and compute-heavy work, native FastAPI ingress, scale out web serving to replicas.
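The I/O-versus-compute split can be sketched with asyncio: the network-bound lookup is awaited so it does not block, while the compute step is ordinary Python. In Ray Serve the lookup would live in its own deployment so it scales independently; the function names below are illustrative.

```python
import asyncio

# Sketch of separating I/O-bound business logic (a feature lookup)
# from the compute-bound model call.

async def feature_lookup(user_id):
    await asyncio.sleep(0)        # stands in for a database / feature store call
    return {"user_id": user_id, "clicks": 3}

def predict(features):
    return 1 if features["clicks"] > 1 else 0   # stand-in model

async def handle_request(user_id):
    features = await feature_lookup(user_id)    # I/O-bound step
    return predict(features)                    # compute step

print(asyncio.run(handle_request(42)))  # 1
```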
  15. Online learning: • Dynamically learn the model weights (personalized models) • Dynamically learn parameters to orchestrate the models (model selection, contextual bandits) • Reinforcement learning (RL): state-of-the-art "learn by interacting with the environment" (AlphaGo).
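The "learn parameters to orchestrate the models" bullet can be sketched as a minimal epsilon-greedy bandit that learns online which of two models to route traffic to. This is a toy sketch under assumed names and rewards, not the talk's implementation.

```python
import random

# Epsilon-greedy bandit: online learning of which "model" (arm) to use.
class EpsilonGreedy:
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms   # running mean reward per arm

    def select(self):
        for arm, c in enumerate(self.counts):
            if c == 0:
                return arm             # pull each arm once first
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))   # explore
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        # incremental mean update of the arm's estimated reward
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

random.seed(0)
bandit = EpsilonGreedy(n_arms=2)
for _ in range(200):
    arm = bandit.select()
    reward = 1.0 if arm == 1 else 0.2   # arm 1 is the better "model"
    bandit.update(arm, reward)
print(bandit.values[1] > bandit.values[0])  # True
```

In a serving context, select() would pick which deployment handles a request and update() would fold in feedback (clicks, conversions) as it arrives, which is exactly the kind of state a serving framework must keep alongside the models.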
  16. Wrap the models in the same handler, or run many microservices to manage? Ray Serve offers the best of both worlds.
  17. Ray Serve: a framework for 1+ models in production. Deployment: Pythonic interface, scalable deployment, rolling upgrades. Ingress: fully featured HTTP, FastAPI integration. Handle: arbitrary composition, offload computation, just Python.
  18. Building an ML service, revisited: web frameworks can't achieve high performance or low cost; custom tooling is hard to develop, deploy, and manage; specialized systems lose flexibility and ease of use. Ray Serve delivers both production readiness and ease of development: Pythonic API, native FastAPI, high performance, scalability.
  19. Ray Serve: production use cases. Leveraging the Possibilities of Ray Serve in Implementing a Scalable, Fully Automated Digital Authentication Service (Widas) [Thurs 12:25-12:55pm]. How Ray and Anyscale Make it Easy to do Massive-scale ML on Aerial Imagery (Dendra) [Wed 1:45-2:15pm]. Achieving Scalability and Interactivity with Ray Serve (Ikigai Labs) [Wed 1:45-2:15pm]. Building High Availability and Scalability Online Computing Applications on Ray (Ant Group) [Wed 1:45-2:15pm]. Ray and Anyscale: An Optimization Journey (OXW.io) [Wed 12:25-12:55pm].
  20. Takeaway: ML in production = many models. 4 patterns: pipeline, ensemble, business logic, online learning. Ray Serve is purpose-built for scalable deployment. More about Ray: ray.io, rayserve.org, @raydistributed. Careers: Anyscale is hiring (anyscale.com). Thank you!