
Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems


Pooyan Jamshidi

November 29, 2023

Transcript

  1. Reconciling High Accuracy, Cost-Efficiency,
    and Low Latency of Inference Serving Systems
    Pooyan Jamshidi
    University of South Carolina

  2. Outline
    InfAdapter
    IPA
    Background

  3. Multi-objective
    performance tradeoff

  4. ML in research vs. in production
    Objectives: model performance* (research) vs. different stakeholders having different objectives (production)
    “*” This is actively being worked on. See “Utility is in the Eye of the User: A Critique of NLP Leaderboards” (Ethayarajh and Jurafsky, EMNLP 2020)

  5. Stakeholder objectives
    ML team: highest accuracy

  6. Stakeholder objectives
    ML team: highest accuracy
    Sales: sells more ads

  7. Stakeholder objectives
    ML team: highest accuracy
    Sales: sells more ads
    Product: fastest inference

  8. Stakeholder objectives
    ML team: highest accuracy
    Sales: sells more ads
    Product: fastest inference
    Manager: maximizes profit (= laying off ML teams)

  9. ML in research vs. in production
    Objectives: model performance (research) vs. different stakeholders having different objectives (production)
    Computational priority: fast training, high throughput (research) vs. fast inference (generating predictions), low latency (production)

  10. Latency matters
    Adding 100-400 ms of latency reduces searches by 0.2%-0.6% (2009)
    A 30% increase in latency costs 0.5% in conversion rate (2019)

  11. ● Latency: the time to move one leaf
    ● Throughput: how many leaves are moved in one second

  12. ● Real-time: low latency = high throughput (requests are served one at a time, so throughput ≈ 1/latency)
    ● Batched: high latency, high throughput
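
To make the distinction concrete, a toy calculation (hypothetical numbers, not from the deck): with one-at-a-time serving, throughput is the inverse of latency; batching trades higher per-request latency for more requests served per second.

# Toy illustration of the latency/throughput trade-off (hypothetical numbers).
def throughput_rps(batch_size: int, batch_latency_s: float) -> float:
    """Requests served per second when a whole batch completes in batch_latency_s."""
    return batch_size / batch_latency_s

print(throughput_rps(1, 0.010))   # real-time: 10 ms per request -> 100 req/s
print(throughput_rps(32, 0.100))  # batched: 100 ms per batch -> 320 req/s, at 10x the latency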

  13. System = Software + Middleware + Hardware
    [System-stack diagram: Application layer (frontend, application, lib API, clients), OS/Kernel layer (task scheduler, device drivers, file system, compilers, memory manager, process manager), Hardware layer (CPU, GPU, memory, controller, devices, network), and Deployment targets (SoC, generic hardware, production servers)]

  14. Model Serving
    Abstract level

  15. Model Serving
    TF Serving

  16. Model Serving
    Web app

  17. Model Serving
    Internet of Things

  18. Model Serving
    Stream Processing System

  19. Model Serving
    Pipeline

  20. EuroMLSys ’23, May 8, 2023, Rome, Italy

  21. “More than 90% of data center compute for ML
    workloads is used by inference services”

  22. ML inference services have strict requirements
    Highly Responsive!

  23. ML inference services have strict requirements
    Highly Responsive! Cost-Efficient!

  24. ML inference services have strict requirements
    Highly Accurate!
    Highly Responsive! Cost-Efficient!

  25. ML inference services have strict & conflicting
    requirements
    Highly Accurate!
    Highly Responsive! Cost-Efficient!

  26. Another challenge: dynamic workloads

  27. Existing adaptation mechanisms
    Resource scaling:
    ● Vertical scaling (Autopilot, EuroSys ’20)
    ● Horizontal scaling (MArk, ATC ’19); a minimal sketch follows below
    Quality adaptation:
    ● Multiple model variants (Model-Switching, HotCloud ’20)
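
As a concrete illustration of resource scaling, here is a minimal horizontal-scaling sketch using the official Kubernetes Python client. It is hand-written for illustration, not the Autopilot or MArk implementation, and the deployment name is hypothetical.

# Minimal horizontal-scaling sketch with the Kubernetes Python client.
# Not the Autopilot/MArk implementation; "resnet50-serving" is a hypothetical Deployment.
from kubernetes import client, config

def scale_deployment(name: str, namespace: str, replicas: int) -> None:
    """Set the replica count of a Deployment (horizontal scaling)."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

scale_deployment("resnet50-serving", "default", replicas=4)

Vertical scaling would instead patch the container resource requests/limits, and quality adaptation swaps which model variant serves the traffic.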

  28. Resource allocation
    Over-provisioning vs. under-provisioning

  29. Resource allocation

  30. Resource allocation

  31. Resource allocation

  32. Resource allocation

  33. Resource allocation

  34. Quality adaptation
    ResNet18: Tiger; ResNet152: Dog

  35. Quality adaptation

  36. Solution: InfAdapter
    InfAdapter is a latency SLO-aware, highly accurate, and cost-efficient
    inference serving system.

  37. InfAdapter: Why?
    Different throughputs with different model variants

  38. InfAdapter: Why?
    Higher average accuracy by using multiple model variants
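
One natural way to make "average accuracy" precise when traffic is split across variants (my notation, not necessarily the paper's): weight each variant's accuracy by the share of requests it serves,

\bar{a} \;=\; \sum_{m \in M} \frac{\lambda_m}{\Lambda}\, a_m ,
\qquad \Lambda \;=\; \sum_{m \in M} \lambda_m ,

where a_m is the accuracy of variant m and \lambda_m the portion of the workload routed to it.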

  39. InfAdapter: How?
    Selecting a subset of model variants, each with its own size
    Meeting the latency requirement for the predicted workload while maximizing accuracy and
    minimizing cost
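
A brute-force sketch of that selection step (hypothetical profiling numbers; InfAdapter solves this with an optimizer rather than enumeration): pick the subset of variants, with replica counts, that can serve the predicted load within the latency SLO while maximizing a beta-weighted accuracy-minus-cost score.

# Brute-force sketch of variant-subset selection (not InfAdapter's actual solver).
# Hypothetical profiles: accuracy, p99 latency (s), peak RPS per replica, cores per replica.
from itertools import combinations
from math import ceil

VARIANTS = {
    "resnet18":  {"acc": 0.70, "p99": 0.05, "rps": 120, "cores": 2},
    "resnet50":  {"acc": 0.76, "p99": 0.09, "rps": 60,  "cores": 4},
    "resnet152": {"acc": 0.78, "p99": 0.20, "rps": 25,  "cores": 8},
}

def select(predicted_rps: float, slo_s: float, beta: float = 0.02):
    """Return (subset, replicas) maximizing average accuracy minus beta * total cores."""
    best, best_score = None, float("-inf")
    for r in range(1, len(VARIANTS) + 1):
        for subset in combinations(VARIANTS, r):
            if any(VARIANTS[v]["p99"] > slo_s for v in subset):
                continue                                   # a chosen variant alone breaks the SLO
            share = predicted_rps / len(subset)            # naive equal traffic split
            replicas = {v: ceil(share / VARIANTS[v]["rps"]) for v in subset}
            cores = sum(replicas[v] * VARIANTS[v]["cores"] for v in subset)
            avg_acc = sum(VARIANTS[v]["acc"] for v in subset) / len(subset)
            score = avg_acc - beta * cores
            if score > best_score:
                best, best_score = (subset, replicas), score
    return best

print(select(predicted_rps=300, slo_s=0.15))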

  40. InfAdapter: Design

  41. InfAdapter: Design

  42. InfAdapter: Formulation

  43. InfAdapter: Formulation
    Maximizing Average Accuracy

  44. InfAdapter: Formulation
    Maximizing Average Accuracy; Minimizing Resource and Loading Costs

  45. InfAdapter: Formulation

  46. InfAdapter: Formulation
    Supporting incoming workload

  47. InfAdapter: Formulation
    Supporting incoming workload
    Guaranteeing end-to-end latency
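
Collecting the pieces named on the last few slides into one optimization problem (an illustrative sketch in my own notation, not the paper's exact formulation): x_m in {0,1} selects variant m, n_m is its replica count, a_m its accuracy, λ_m its traffic share, μ_m its per-replica throughput, c_m its per-replica resource cost, ℓ_m its loading cost, d_m its end-to-end latency, Λ the predicted workload, and β the accuracy-vs-cost weight.

\begin{aligned}
\max_{x,\,n}\quad & \sum_{m} \tfrac{\lambda_m}{\Lambda}\, a_m\, x_m
  \;-\; \beta \Big( \sum_{m} c_m\, n_m\, x_m + \sum_{m} \ell_m\, \lvert x_m - x_m^{\mathrm{prev}} \rvert \Big)
  && \text{(average accuracy vs. resource and loading costs)} \\
\text{s.t.}\quad & \sum_{m} \mu_m\, n_m\, x_m \;\ge\; \Lambda
  && \text{(supporting the incoming workload)} \\
 & d_m\, x_m \;\le\; \mathrm{SLO} \quad \forall m
  && \text{(guaranteeing end-to-end latency)}
\end{aligned}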

  48. InfAdapter: Design

  49. InfAdapter: Experimental evaluation setup
    Workload: Twitter-trace sample (2022-08)
    Baselines: Kubernetes VPA and an adapted Model-Switching
    Models: ResNet18, ResNet34, ResNet50, ResNet101, ResNet152
    Adaptation interval: 30 seconds
    Cluster: a Kubernetes cluster of 2 computing nodes (48 cores, 192 GiB RAM)

  50. Workload Pattern

  51. InfAdapter: P99-Latency evaluation

  52. InfAdapter: P99-Latency evaluation

  53. InfAdapter: P99-Latency evaluation

  54. InfAdapter: P99-Latency evaluation

  55. InfAdapter: P99-Latency evaluation

  56. InfAdapter: P99-Latency evaluation

  57. InfAdapter: P99-Latency evaluation

  58. InfAdapter: P99-Latency evaluation

  59. InfAdapter: Accuracy evaluation

  60. InfAdapter: Cost evaluation

  61. InfAdapter: Experimental evaluation
    Compare aggregated metrics of latency SLO violations,
    accuracy, and cost against the baselines for different
    β values, to see how each performs across the
    accuracy-cost trade-off

  62. Takeaway
    Inference Serving Systems should consider
    accuracy, latency, and cost at the same time.

  63. Takeaway
    Model variants provide the opportunity
    to reduce resource costs while adapting
    to the dynamic workload.
    Using a set of model variants
    simultaneously provides higher average
    accuracy compared to having one
    variant.
    Inference Serving Systems should consider
    accuracy, latency, and cost at the same time.

  64. Takeaway
    Model variants provide the opportunity
    to reduce resource costs while adapting
    to the dynamic workload.
    Using a set of model variants
    simultaneously provides higher average
    accuracy compared to having one
    variant.
    Inference Serving Systems should consider
    accuracy, latency, and cost at the same time.
    InfAdapter!

  65. https://github.com/reconfigurable-ml-pipeline/InfAdapter

  66. Inference Pipeline
    Recommender Systems
    Source: https://developer.nvidia.com/blog/optimizing-dlrm-on-nvidia-gpus/
    Video Pipelines
    Source: https://docs.nvidia.com/metropolis/deepstream/5.0/dev-guide/index.html#page/DeepStream_Development_Guide/deepstream_overview.html

  67. Autoscaling
    Previous works have used autoscaling for cost optimization of inference pipelines

  68. Is only scaling enough?
    ?

  69. Effect of Batching

  70. How to navigate the accuracy/latency trade-off? Model
    variants and model switching!
    Previous works, INFaaS and Model-Switching, have
    shown that there is a large latency-accuracy-
    resource-footprint trade-off among models trained
    for the same task
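
A minimal model-switching sketch in the spirit of that idea (hypothetical profiled numbers; not the INFaaS or Model-Switching code): under load, serve the most accurate variant whose latency still fits the remaining SLO budget.

# Minimal model-switching sketch (hypothetical profiles, not the INFaaS/Model-Switching code).
PROFILES = [                      # (variant, accuracy, p99 latency in seconds)
    ("resnet18",  0.70, 0.05),
    ("resnet50",  0.76, 0.09),
    ("resnet152", 0.78, 0.20),
]

def pick_variant(queue_delay_s: float, slo_s: float) -> str:
    """Return the most accurate variant whose latency fits the remaining SLO budget."""
    budget = slo_s - queue_delay_s
    feasible = [p for p in PROFILES if p[2] <= budget]
    if not feasible:
        return PROFILES[0][0]     # overloaded: degrade to the fastest variant
    return max(feasible, key=lambda p: p[1])[0]

print(pick_variant(queue_delay_s=0.02, slo_s=0.15))   # -> resnet50
print(pick_variant(queue_delay_s=0.10, slo_s=0.15))   # -> resnet18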

  71. How to navigate the accuracy/latency trade-off? Model
    variants and model switching!
    Previous works, INFaaS and Model-Switching, have
    shown that there is a large latency-accuracy-
    resource-footprint trade-off among models trained
    for the same task

  72. How to navigate the accuracy/latency trade-off? Model
    variants and model switching!
    Previous works, INFaaS and Model-Switching, have
    shown that there is a large latency-accuracy-
    resource-footprint trade-off among models trained
    for the same task

  73. Search Space

  74. Goal: Providing a flexible inference
    pipeline

  75. Snapshot of the System

  76. System Design

  77. Problem Formulation
    Objective function: accuracy objective and resource objective
    Batch control
    Constraints: latency SLA, throughput constraint, one active model per node
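
A later slide mentions the Gurobi solver, so here is a minimal gurobipy sketch of a per-stage variant-selection ILP of the kind this slide describes. It is an illustrative formulation with hypothetical stages and profiled numbers (and it assumes a local Gurobi installation and license), not IPA's actual model; batch-size and replica decisions are omitted for brevity.

# Sketch of a per-stage variant-selection ILP with gurobipy (not IPA's actual model).
# Hypothetical pipeline stages, variants, and profiled numbers; batching/replicas omitted.
import gurobipy as gp
from gurobipy import GRB

stages = ["detector", "classifier"]
variants = {"detector": ["yolo-s", "yolo-l"], "classifier": ["resnet18", "resnet152"]}
acc   = {("detector", "yolo-s"): 0.60, ("detector", "yolo-l"): 0.70,
         ("classifier", "resnet18"): 0.70, ("classifier", "resnet152"): 0.78}
lat   = {("detector", "yolo-s"): 0.04, ("detector", "yolo-l"): 0.12,
         ("classifier", "resnet18"): 0.05, ("classifier", "resnet152"): 0.20}
cores = {("detector", "yolo-s"): 2, ("detector", "yolo-l"): 8,
         ("classifier", "resnet18"): 2, ("classifier", "resnet152"): 8}

SLA_S, BETA = 0.20, 0.01          # end-to-end latency SLA, accuracy-vs-resource weight

m = gp.Model("pipeline-sketch")
x = m.addVars(list(acc), vtype=GRB.BINARY, name="x")   # x[stage, variant] = 1 if active

# exactly one active model variant per pipeline stage ("one active model per node")
for s in stages:
    m.addConstr(gp.quicksum(x[s, v] for v in variants[s]) == 1)

# end-to-end latency of the chosen variants must respect the SLA
m.addConstr(gp.quicksum(lat[k] * x[k] for k in acc) <= SLA_S)

# maximize summed accuracy minus a beta-weighted resource cost
m.setObjective(gp.quicksum(acc[k] * x[k] for k in acc)
               - BETA * gp.quicksum(cores[k] * x[k] for k in acc), GRB.MAXIMIZE)
m.optimize()
print({k: int(x[k].X) for k in acc if x[k].X > 0.5})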

  78. Implementation and Experimental
    Setup

  79. Kubernetes (orchestration):
    1. Industry standard
    2. Used in recent research
    3. Complete set of autoscaling, scheduling, and
    observability tools (e.g., CPU usage)
    4. APIs for changing the current autoscaling
    algorithms
    ML server:
    1. Industry-standard ML server
    2. Can compose models into inference graphs
    3. REST and gRPC endpoints
    4. Many of the features we need (e.g., a monitoring
    stack) out of the box
    How to navigate Model Variants

  80. Experimental Setup
    ● A six-node Kubernetes cluster

  81. Experimental Results

  82. Video Pipeline

  83. Audio + QA Pipeline

  84. Summarization + QA Pipeline

  85. Summarization + QA Pipeline

  86. NLP Pipeline

  87. Adaptivity to multiple objectives

  88. Effect of predictor

  89. Gurobi solver scalability

  90. Model Serving
    Pipeline
    https://github.com/reconfigurable-ml-pipeline/ipa

  91. Model Serving
    Pipeline
    Is only scaling enough?
    ?
    https://github.com/reconfigurable-ml-pipeline/ipa

  92. Model Serving
    Pipeline
    Is only scaling enough?
    ?
    X
    Snapshot of the System
    https://github.com/reconfigurable-ml-pipeline/ipa

  93. Model Serving
    Pipeline
    Is only scaling enough?
    ?
    X
    Snapshot of the System
    X
    Adaptivity to multiple objectives
    https://github.com/reconfigurable-ml-pipeline/ipa
