Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems

Pooyan Jamshidi

November 29, 2023
Transcript

  1. Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems. Pooyan Jamshidi, University of South Carolina.
  2. ML in research vs. in production: objectives. In research, the objective is model performance*; in production, different stakeholders have different objectives. (* This is actively being worked on; see "Utility is in the Eye of the User: A Critique of NLP Leaderboards", Ethayarajh and Jurafsky, EMNLP 2020.)
  3. Stakeholder objectives: the ML team wants the highest accuracy, Sales wants to sell more ads, Product wants the fastest inference, and the Manager wants to maximize profit (even if that means laying off ML teams).
  4. ML in research vs. in production: computational priority. Research prioritizes fast training and high throughput; production prioritizes generating predictions, i.e., fast inference and low latency.
  5. Latency matters: increasing latency from 100 ms to 400 ms reduces searches by 0.2% to 0.6% (2009), and a 30% increase in latency costs 0.5% in conversion rate (2019).
  6. System = Software + Middleware + Hardware. Application layer: clients, frontend, library APIs. OS/kernel layer: task scheduler, device drivers, file system, compilers, memory manager, process manager. Hardware layer: CPU, GPU, memory controller, devices, network, SoC. Deployment: generic hardware and production servers.
  7. "More than 90% of data center compute for ML workloads is used by inference services."
  8. ML inference services have strict and conflicting requirements: highly accurate, highly responsive, and cost-efficient.
  9. Existing adaptation mechanisms. Resource scaling: vertical scaling (Autopilot, EuroSys '20) and horizontal scaling (MArk, ATC '19). Quality adaptation: multiple model variants (Model-Switching, HotCloud '20).
  10. InfAdapter: how? Select a subset of model variants, each with its own size, meeting the latency requirement for the predicted workload while maximizing accuracy and minimizing cost (a rough sketch of this selection follows).
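A minimal sketch of what such a variant-set selection could look like, assuming per-variant profiles (accuracy, latency, per-replica capacity, per-replica cost), an even traffic split, and a simple accuracy-minus-weighted-cost score; all names and numbers are illustrative and not taken from the InfAdapter paper:

```python
from itertools import combinations

# Hypothetical per-variant profiles: (top-1 accuracy, p99 latency in ms,
# capacity in requests/s per replica, cost per replica). Numbers are illustrative.
VARIANTS = {
    "resnet18":  (0.698, 20, 120, 1.0),
    "resnet50":  (0.761, 45,  60, 2.0),
    "resnet152": (0.783, 95,  25, 4.0),
}

def best_variant_set(predicted_rps, latency_slo_ms, beta=0.5):
    """Pick the variant subset (with an even traffic split) that maximizes
    average accuracy minus a beta-weighted cost, subject to the latency SLO
    and enough replicas to serve the predicted load."""
    best, best_score = None, float("-inf")
    for k in range(1, len(VARIANTS) + 1):
        for subset in combinations(VARIANTS, k):
            profiles = [VARIANTS[v] for v in subset]
            if any(lat > latency_slo_ms for _, lat, _, _ in profiles):
                continue  # any variant violating the SLO disqualifies this set
            share = predicted_rps / k  # even split across variants (simplification)
            replicas = [max(1, int(-(-share // cap))) for _, _, cap, _ in profiles]  # ceil
            cost = sum(r * c for r, (_, _, _, c) in zip(replicas, profiles))
            accuracy = sum(a for a, _, _, _ in profiles) / k
            score = accuracy - beta * cost / 100  # cost normalization is an assumption
            if score > best_score:
                best_score, best = score, dict(zip(subset, replicas))
    return best

# Example: pick a variant set for 300 req/s under an 80 ms latency SLO.
print(best_variant_set(predicted_rps=300, latency_slo_ms=80))
```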
  11. InfAdapter: experimental evaluation setup. Workload: a Twitter-trace sample (2022-08). Baselines: Kubernetes VPA and an adapted Model-Switching. Models: ResNet18, ResNet34, ResNet50, ResNet101, ResNet152. Adaptation interval: 30 seconds. Hardware: a Kubernetes cluster of 2 computing nodes with 48 cores and 192 GiB RAM.
  12. InfAdapter: experimental evaluation. Compare aggregated metrics of latency SLO violations, accuracy, and cost against other works at different β values, to see how they perform across the accuracy-cost trade-off (an illustrative β sweep is sketched below).
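Purely to illustrate how β shifts the trade-off, a tiny sweep over β with made-up configurations; the numbers and the exact scoring form are assumptions, not results from the evaluation:

```python
# Illustrative beta sweep: how an accuracy-vs-cost weighting changes the ranking.
# The configurations and numbers are made up; they are not results from the paper.
configs = {"config_a": (0.77, 18.0), "config_b": (0.74, 12.0), "config_c": (0.76, 25.0)}  # (accuracy, cost)

for beta in (0.2, 0.5, 0.8):
    # Higher score = better; beta controls how strongly cost is penalized (assumed form).
    ranked = sorted(configs, key=lambda c: configs[c][0] - beta * configs[c][1] / 100, reverse=True)
    print(f"beta={beta}: {ranked}")
```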
  13. Takeaway: model variants provide the opportunity to reduce resource costs while adapting to a dynamic workload; using a set of model variants simultaneously provides higher average accuracy than a single variant; and inference serving systems should consider accuracy, latency, and cost at the same time. InfAdapter!
  15. IPA

  17. Inference pipelines. Recommender systems (source: https://developer.nvidia.com/blog/optimizing-dlrm-on-nvidia-gpus/); video pipelines (source: https://docs.nvidia.com/metropolis/deepstream/5.0/dev-guide/index.html#page/DeepStream_Development_Guide/deepstream_overview.html).
  18. How to navigate the accuracy/latency trade-off? Model variants and model switching! Previous works, INFaaS and Model-Switch, have shown that models trained for the same task exhibit a significant latency, accuracy, and resource-footprint trade-off (a rough profiling sketch follows).
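As a rough way to see this spread first-hand, a profiling sketch that times a few ResNet variants on CPU (assumes torch and torchvision >= 0.13 are installed; accuracy is not measured here, only parameter count and latency):

```python
# Rough profiling sketch (illustrative): time a forward pass of a few ResNet
# variants to expose the latency/footprint spread behind the accuracy trade-off.
import time
import torch
from torchvision import models

batch = torch.randn(8, 3, 224, 224)
for name, ctor in [("resnet18", models.resnet18),
                   ("resnet50", models.resnet50),
                   ("resnet152", models.resnet152)]:
    model = ctor(weights=None).eval()  # weights=None skips downloads; accuracy is not evaluated here
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    with torch.no_grad():
        start = time.perf_counter()
        model(batch)
        elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{name}: {params_m:.1f}M params, {elapsed_ms:.0f} ms per batch of 8 (CPU)")
```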
  21. Problem formulation: an objective function combining an accuracy objective and a resource objective, with batch-size control, subject to a latency SLA, a throughput constraint, and one active model variant per node (an illustrative formulation is sketched below).
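One plausible way to write down what this slide lists, with S the set of pipeline stages/nodes, v_s the chosen variant, n_s its replicas, b_s its batch size, μ the per-replica throughput, and λ the predicted arrival rate; the weighted-sum form and the weights α, β are my assumptions, not necessarily the paper's exact formulation:

```latex
% Illustrative sketch only; symbols and the weighted-sum objective are assumptions.
\begin{aligned}
\max_{v_s,\, n_s,\, b_s} \quad & \alpha \sum_{s \in \mathcal{S}} \mathrm{acc}(v_s) \;-\; \beta \sum_{s \in \mathcal{S}} n_s\, \mathrm{cost}(v_s) \\
\text{s.t.} \quad & \textstyle\sum_{s \in \mathcal{S}} \ell(v_s, b_s) \le \mathrm{SLA} && \text{(end-to-end latency)} \\
& n_s\, \mu(v_s, b_s) \ge \lambda \quad \forall s \in \mathcal{S} && \text{(serve the predicted arrival rate)} \\
& \text{one active variant } v_s \text{ per node } s && \text{(placement constraint)}
\end{aligned}
```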
  22. How to navigate model variants. Orchestrator (Kubernetes): (1) industry standard; (2) used in recent research; (3) a complete set of autoscaling, scheduling, and observability tools (e.g., CPU usage); (4) APIs for changing the current autoscaling algorithms (a minimal sketch follows). Model server: (1) an industry-standard ML server; (2) the ability to build inference graphs; (3) REST and gRPC endpoints; (4) many needed features, such as a monitoring stack, out of the box.
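As a hedged illustration of the orchestrator's point (4), programmatically overriding scaling decisions: a minimal sketch using the official Kubernetes Python client, where the deployment and namespace names are placeholders:

```python
# Minimal sketch: override a Deployment's replica count from a custom
# autoscaling loop via the Kubernetes API. Names below are placeholders.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside a pod
apps = client.AppsV1Api()

def set_replicas(deployment: str, replicas: int, namespace: str = "default") -> None:
    """Patch the Deployment's scale subresource, bypassing the built-in HPA decision."""
    body = {"spec": {"replicas": replicas}}
    apps.patch_namespaced_deployment_scale(deployment, namespace, body)

# Example: pin the (hypothetical) resnet50 serving Deployment to 3 replicas.
set_replicas("resnet50-server", 3)
```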
  23. Model serving pipeline: is scaling alone enough? A snapshot of the system suggests it is not; the pipeline also needs adaptivity to multiple objectives. https://github.com/reconfigurable-ml-pipeline/ipa