Developing and deploying scalable multi-model inference pipelines

Serve Deployment Graph
Jiao Dong @ Anyscale

Outline
• Motivation for multi-model inference graphs
• Ray and Ray Serve background
• Deployment Graph API walkthrough
• Real live demo: content understanding!

Motivation
Machine learning inference graphs are getting longer, wider, and more dynamic.
Blog: Ray Serve - Patterns of ML Models in Production

Unique strengths of Ray
• Scalable
• Low-latency
• Part of ML ecosystem
• Efficient

Serve deployment - single model
A class decorated with @serve.deployment (here MyModel) can be reached in two ways:
• HTTP endpoint: http://localhost:8000/api?data=A
• Python handle: python_handle.remote("A")
The Python deployment handle facilitates multi-model inference graph composition.

Multi-model inference today
Diagram nodes: Image, Pre-process, Model_1, Model_2, Model_3, combine, Post-process
• User needs to explicitly call and get handle
• Dependency is hidden
• Hard to write efficient graph
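To make the three complaints concrete, here is a minimal plain-Python sketch of manual composition. It does not use Ray; the coroutines stand in for deployment handles, and every function body is an assumption made up for illustration.

```python
import asyncio


async def preprocess(image):
    # Stand-in for image pre-processing: scale values to [0, 1].
    peak = max(image)
    return [x / peak for x in image]


# Hypothetical stand-ins for three deployed models.
async def model_1(features):
    return sum(features)


async def model_2(features):
    return max(features)


async def model_3(features):
    return min(features)


async def pipeline(image):
    # The graph topology exists only inside this function body: nothing
    # outside can see that three models fan out from one pre-process step.
    features = await preprocess(image)
    # Running the models concurrently is the author's responsibility;
    # awaiting them one by one would silently serialize the fan-out.
    outputs = await asyncio.gather(
        model_1(features), model_2(features), model_3(features)
    )
    combined = sum(outputs)    # combine
    return round(combined, 3)  # post-process


result = asyncio.run(pipeline([1.0, 2.0, 4.0]))
print(result)
```

Swapping `asyncio.gather` for three sequential `await` calls gives the same answer but a slower pipeline, which is the kind of inefficiency that is easy to introduce when the graph is written by hand.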

Challenges with manual composition
• Deployment graph topology is hidden
• Hard to operate for production
• Hard to write efficient graph
Solution: Graph building API!

Solution: Serve Deployment Graph API
• Fully Python programmable graph without writing YAML
• Can be developed, instantiated, and tested locally
  ◦ YAML can be auto-generated for production usage
• Each model can be scaled and configured individually
• Uses a unified graph API across the Ray ecosystem

Deployment Graph API in Five Steps
Diagram nodes: Your Input, Preprocessor #1, Preprocessor #2, Model #1, Model #2, Combine, Dynamic Aggregate

Step 1/5: User InputNode and preprocessor
Diagram nodes: Your Input, preprocessor, avg_preprocessor
• InputNode(): your input to the graph
• .bind(): graph building API on decorated body
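The idea behind an input placeholder and `.bind()` is lazy graph building: binding records a call without running it, and the graph executes only when a request arrives. The toy below illustrates that idea in plain Python. It is not Ray's implementation and does not import Ray; `ToyNode`, `ToyInput`, and `bindable` are hypothetical names, and the bodies of `preprocessor` and `avg_preprocessor` are assumptions.

```python
class ToyNode:
    """A recorded call: a function plus the nodes or plain values it needs."""

    def __init__(self, fn, args):
        self.fn = fn
        self.args = args

    def execute(self, request):
        # Resolve upstream nodes first, then call this node's function.
        resolved = [
            a.execute(request) if isinstance(a, ToyNode) else a
            for a in self.args
        ]
        return self.fn(*resolved)


class ToyInput(ToyNode):
    """Placeholder for the request value, known only at execution time."""

    def __init__(self):
        super().__init__(fn=None, args=())

    def execute(self, request):
        return request

    def __getitem__(self, key):
        # Indexing the placeholder is itself recorded as a node.
        return ToyNode(lambda value: value[key], (self,))


def bindable(fn):
    """Attach a .bind() that records a ToyNode instead of calling fn."""
    fn.bind = lambda *args: ToyNode(fn, args)
    return fn


calls = []  # records which functions actually ran


@bindable
def preprocessor(text):
    calls.append("preprocessor")
    return int(text)


@bindable
def avg_preprocessor(values):
    calls.append("avg_preprocessor")
    return sum(values) / len(values)


graph_input = ToyInput()
node_1 = preprocessor.bind(graph_input[0])
node_2 = avg_preprocessor.bind(graph_input[1])
calls_after_bind = list(calls)  # still empty: binding ran nothing

request = ["5", [1, 2, 3]]
out_1 = node_1.execute(request)
out_2 = node_2.execute(request)
```

Because the graph is an ordinary Python object, it can be built and exercised locally before anything is deployed.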

Step 2/5: Model and combiner class
Diagram nodes: Your Input, Preprocessor #1, Preprocessor #2, Model #1, Model #2, Combine

Step 3/5: Dynamic aggregation
Diagram nodes: Your Input, Preprocessor #1, Preprocessor #2, Model #1, Model #2, Combine, Dynamic Aggregate
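A combiner that holds two models and picks its aggregation at request time can be sketched without Ray as follows. The class names, the `weight` parameter, and the "sum" / "max" operations are assumptions for illustration; in a deployed graph each class would be its own deployment and the calls would go through handles.

```python
class Model:
    """Hypothetical model: multiplies its input by a fixed weight."""

    def __init__(self, weight):
        self.weight = weight

    def forward(self, x):
        return x * self.weight


class Combiner:
    """Holds both models and aggregates their outputs."""

    def __init__(self, model_1, model_2):
        self.model_1 = model_1
        self.model_2 = model_2

    def run(self, input_1, input_2, operation):
        outputs = [
            self.model_1.forward(input_1),
            self.model_2.forward(input_2),
        ]
        # Dynamic aggregation: the request decides how outputs are merged.
        if operation == "sum":
            return sum(outputs)
        if operation == "max":
            return max(outputs)
        raise ValueError(f"unsupported operation: {operation}")


combiner = Combiner(Model(weight=1), Model(weight=2))
summed = combiner.run(5, 2.0, "sum")
largest = combiner.run(5, 2.0, "max")
```

Passing the models into the combiner's constructor keeps the dependency visible at graph-building time instead of hiding it inside a method body.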

Step 4/5: Driver for HTTP ingress
Diagram nodes: DAG (Your Input, Preprocessor #1, Preprocessor #2, Model #1, Model #2, Combine, Dynamic Aggregate), Driver
• The Driver exposes an HTTP endpoint and a Python handle
• An input schema adapter sits in front of the graph
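The driver's role can be pictured as a thin wrapper with two entry points that share one graph, where the adapter turns a raw HTTP payload into the Python value the graph expects. The sketch below is plain Python under that assumption; `ToyDriver`, `json_adapter`, and `graph` are hypothetical names, not Ray Serve APIs.

```python
import json


def json_adapter(body):
    """Assumed input schema: the HTTP body is a JSON-encoded list."""
    return json.loads(body)


def graph(request):
    # Stand-in for the full deployment graph from the earlier steps.
    text, values = request
    return int(text) + sum(values) / len(values)


class ToyDriver:
    """Single ingress point for the graph."""

    def __init__(self, graph_fn, adapter):
        self.graph_fn = graph_fn
        self.adapter = adapter

    def handle_http(self, body):
        # HTTP path: adapt the raw payload, then run the graph.
        return self.graph_fn(self.adapter(body))

    def handle_python(self, value):
        # Python-handle path: the caller already passes native objects.
        return self.graph_fn(value)


driver = ToyDriver(graph, json_adapter)
via_http = driver.handle_http(b'["5", [1, 2, 3]]')
via_python = driver.handle_python(["5", [1, 2, 3]])
```

Keeping the adapter separate from the graph means the same graph can be tested from Python without constructing HTTP requests.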

Step 5/5: Running the deployment graph
Developer
• Quick updates
• Few replicas
• Python
Operator
• Consistent updates
• Many replicas
• YAML

Future Improvements
• Improved operational story (see Shreyas’ talk!)
• Automatic performance optimizations
• UX and visualization support

Bonus: Unified Ray DAG API
• DAG will be a first class API in Ray 2.0 across the libraries
Common DAG API (@ray.remote tasks and actors)
• Ray Core: eager execution
• Ray Workflows: durable execution as workflow
• Ray Serve: online serving pipelines
• Ray Datasets: batch inference pipelines

Conclusion
Problem: Multi-model inference increasingly important
• Hard to author and iterate locally
• Performance is critical
Solution: Serve Deployment Graph API
• Enables Python local development and testing
• Efficient and scalable in production

Please get in touch
• Join the community
  ◦ discuss.ray.io
  ◦ github.com/ray-project/ray
  ◦ @raydistributed and @anyscalecompute
• Fill out our survey (QR code) for:
  ◦ Feedback to help shape the future of Ray Serve
  ◦ One-on-one sessions with developers
  ◦ Updates about upcoming features

Demo - High level
Input: image_url, user_id: 5769
• Classification (Model_version: 1): ('hummingbird', 0.9991544485092163), ('bucket', 0.0001098369830287993)
• Image Caption: "a bird sitting on a table with a frisbee"
• Image Segmentation

Demo - Details
Diagram nodes: Your Input, Downloader, Preprocessor, Image Segmentation, Dynamic Dispatch, Image Classifier #1, Image Classifier #2, Image Classifier #3, Image Captioning, Render output
Diagram labels:
• url = "https://bird/image.jepg", user_id = 5769
• Image features
• Hummingbird: 0.9991544, bucket: 0.000109 …
• Object mask
• Description
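Dynamic dispatch means the request itself decides which of the three classifiers runs. The talk does not state the routing rule, so the sketch below assumes a simple one (user_id modulo the number of classifiers) purely for illustration; the classifier stubs and their return values are also made up.

```python
def make_classifier(version):
    """Build a hypothetical classifier stub tagged with its version."""

    def classify(image_features):
        # A real classifier would return labels and scores here.
        return {"model_version": version, "input": image_features}

    return classify


CLASSIFIERS = [make_classifier(v) for v in (1, 2, 3)]


def dispatch(user_id, image_features):
    # Assumed routing rule, not taken from the talk: spread users
    # across classifiers by user_id.
    classifier = CLASSIFIERS[user_id % len(CLASSIFIERS)]
    return classifier(image_features)


routed = dispatch(5769, "features")
```

Any deterministic function of the request would work in place of the modulo rule, for example a per-user lookup table for A/B testing model versions.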

Demo - End to end flow
Diagram steps: Local graph building (run and iterate), Add DAG Driver (configure HTTP), [CLI] serve run, [CLI] serve build, [CLI] serve deploy
Diagram labels: HTTP endpoint, Test, Reconfigure
