
Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems


Pooyan Jamshidi

November 29, 2023

Transcript

  1. Reconciling High Accuracy, Cost-Efficiency,
    and Low Latency of Inference Serving Systems
    Pooyan Jamshidi
    University of South Carolina

  2. Outline
    InfAdapter
    IPA
    Background

  3. Multi-objective
    performance tradeoff

  4. ML in research vs. in production
    Objectives: model performance* (research) vs. different stakeholders having different objectives (production)
    “*” This is actively being worked on. See “Utility is in the Eye of the User: A Critique of NLP Leaderboards” (Ethayarajh and Jurafsky, EMNLP 2020)

  5. Stakeholder objectives
    ML team: highest accuracy

  6. Stakeholder objectives
    ML team: highest accuracy
    Sales: sells more ads

  7. Stakeholder objectives
    ML team: highest accuracy
    Sales: sells more ads
    Product: fastest inference

  8. Stakeholder objectives
    ML team: highest accuracy
    Sales: sells more ads
    Product: fastest inference
    Manager: maximizes profit (= laying off ML teams)

  9. ML in research vs. in production
    Objectives: model performance (research) vs. different stakeholders having different objectives (production)
    Computational priority: fast training, high throughput (research) vs. fast inference (generating predictions), low latency (production)

  10. Latency matters
    Adding 100-400 ms of latency reduces searches by 0.2%-0.6% (2009)
    A 30% increase in latency costs 0.5% in conversion rate (2019)

  11. ● Latency: the time to move one leaf
    ● Throughput: how many leaves are moved in one second

  12. ● Real-time: low latency = high throughput (requests are served one at a time, so throughput ≈ 1/latency)
    ● Batched: high latency, high throughput
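
To make the distinction concrete, a toy calculation (hypothetical numbers, not from the deck): with one-at-a-time serving, throughput is the inverse of latency; batching trades higher per-request latency for more requests served per second.

# Toy illustration of the latency/throughput trade-off (hypothetical numbers).
def throughput_rps(batch_size: int, batch_latency_s: float) -> float:
    """Requests served per second when a whole batch completes in batch_latency_s."""
    return batch_size / batch_latency_s

print(throughput_rps(1, 0.010))   # real-time: 10 ms per request -> 100 req/s
print(throughput_rps(32, 0.100))  # batched: 100 ms per batch -> 320 req/s, at 10x the latency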

  13. System = Software + Middleware + Hardware
    [System-stack diagram: Application layer (frontend, application, lib API, clients), OS/Kernel layer (task scheduler, device drivers, file system, compilers, memory manager, process manager), Hardware layer (CPU, GPU, memory, controller, devices, network), and Deployment targets (SoC, generic hardware, production servers)]

  14. Model Serving
    Abstract level

  15. Model Serving
    TF Serving

  16. Model Serving
    Web app

  17. Model Serving
    Internet of Things

  18. Model Serving
    Stream Processing System

  19. Model Serving
    Pipeline

  20. EuroMLSys ’23, May 8, 2023, Rome, Italy

  21. “More than 90% of data center compute for ML
    workloads is used by inference services”

  22. ML inference services have strict requirements
    Highly Responsive!

  23. ML inference services have strict requirements
    Highly Responsive! Cost-Efficient!

  24. ML inference services have strict requirements
    Highly Accurate!
    Highly Responsive! Cost-Efficient!

  25. ML inference services have strict & conflicting
    requirements
    Highly Accurate!
    Highly Responsive! Cost-Efficient!

  26. Another challenge: dynamic workloads

  27. Existing adaptation mechanisms
    Resource scaling:
    ● Vertical scaling (Autopilot, EuroSys ’20)
    ● Horizontal scaling (MArk, ATC ’19); a minimal sketch follows below
    Quality adaptation:
    ● Multiple model variants (Model-Switching, HotCloud ’20)
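
As a concrete illustration of resource scaling, here is a minimal horizontal-scaling sketch using the official Kubernetes Python client. It is hand-written for illustration, not the Autopilot or MArk implementation, and the deployment name is hypothetical.

# Minimal horizontal-scaling sketch with the Kubernetes Python client.
# Not the Autopilot/MArk implementation; "resnet50-serving" is a hypothetical Deployment.
from kubernetes import client, config

def scale_deployment(name: str, namespace: str, replicas: int) -> None:
    """Set the replica count of a Deployment (horizontal scaling)."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

scale_deployment("resnet50-serving", "default", replicas=4)

Vertical scaling would instead patch the container resource requests/limits, and quality adaptation swaps which model variant serves the traffic.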

  28. Resource allocation
    Over-provisioning vs. under-provisioning

  29. Resource allocation

  30. Resource allocation

  31. Resource allocation

  32. Resource allocation

  33. Resource allocation

  34. Quality adaptation
    ResNet18: Tiger; ResNet152: Dog

  35. Quality adaptation

  36. Solution: InfAdapter
    InfAdapter is a latency SLO-aware, highly accurate, and cost-efficient
    inference serving system.

  37. InfAdapter: Why?
    Different throughputs with different model variants

  38. InfAdapter: Why?
    Higher average accuracy by using multiple model variants
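
One natural way to make "average accuracy" precise when traffic is split across variants (my notation, not necessarily the paper's): weight each variant's accuracy by the share of requests it serves,

\bar{a} \;=\; \sum_{m \in M} \frac{\lambda_m}{\Lambda}\, a_m ,
\qquad \Lambda \;=\; \sum_{m \in M} \lambda_m ,

where a_m is the accuracy of variant m and \lambda_m the portion of the workload routed to it.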

  39. InfAdapter: How?
    Selecting a subset of model variants, each with its own size
    Meeting the latency requirement for the predicted workload while maximizing accuracy and
    minimizing cost
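
A brute-force sketch of that selection step (hypothetical profiling numbers; InfAdapter solves this with an optimizer rather than enumeration): pick the subset of variants, with replica counts, that can serve the predicted load within the latency SLO while maximizing a beta-weighted accuracy-minus-cost score.

# Brute-force sketch of variant-subset selection (not InfAdapter's actual solver).
# Hypothetical profiles: accuracy, p99 latency (s), peak RPS per replica, cores per replica.
from itertools import combinations
from math import ceil

VARIANTS = {
    "resnet18":  {"acc": 0.70, "p99": 0.05, "rps": 120, "cores": 2},
    "resnet50":  {"acc": 0.76, "p99": 0.09, "rps": 60,  "cores": 4},
    "resnet152": {"acc": 0.78, "p99": 0.20, "rps": 25,  "cores": 8},
}

def select(predicted_rps: float, slo_s: float, beta: float = 0.02):
    """Return (subset, replicas) maximizing average accuracy minus beta * total cores."""
    best, best_score = None, float("-inf")
    for r in range(1, len(VARIANTS) + 1):
        for subset in combinations(VARIANTS, r):
            if any(VARIANTS[v]["p99"] > slo_s for v in subset):
                continue                                   # a chosen variant alone breaks the SLO
            share = predicted_rps / len(subset)            # naive equal traffic split
            replicas = {v: ceil(share / VARIANTS[v]["rps"]) for v in subset}
            cores = sum(replicas[v] * VARIANTS[v]["cores"] for v in subset)
            avg_acc = sum(VARIANTS[v]["acc"] for v in subset) / len(subset)
            score = avg_acc - beta * cores
            if score > best_score:
                best, best_score = (subset, replicas), score
    return best

print(select(predicted_rps=300, slo_s=0.15))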

  40. InfAdapter: Design

  41. InfAdapter: Design

  42. InfAdapter: Formulation

  43. InfAdapter: Formulation
    Maximizing Average Accuracy

  44. InfAdapter: Formulation
    Maximizing Average Accuracy; Minimizing Resource and Loading Costs

  45. InfAdapter: Formulation

  46. InfAdapter: Formulation
    Supporting incoming workload

  47. InfAdapter: Formulation
    Supporting incoming workload
    Guaranteeing end-to-end latency
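
Collecting the pieces named on the last few slides into one optimization problem (an illustrative sketch in my own notation, not the paper's exact formulation): x_m in {0,1} selects variant m, n_m is its replica count, a_m its accuracy, λ_m its traffic share, μ_m its per-replica throughput, c_m its per-replica resource cost, ℓ_m its loading cost, d_m its end-to-end latency, Λ the predicted workload, and β the accuracy-vs-cost weight.

\begin{aligned}
\max_{x,\,n}\quad & \sum_{m} \tfrac{\lambda_m}{\Lambda}\, a_m\, x_m
  \;-\; \beta \Big( \sum_{m} c_m\, n_m\, x_m + \sum_{m} \ell_m\, \lvert x_m - x_m^{\mathrm{prev}} \rvert \Big)
  && \text{(average accuracy vs. resource and loading costs)} \\
\text{s.t.}\quad & \sum_{m} \mu_m\, n_m\, x_m \;\ge\; \Lambda
  && \text{(supporting the incoming workload)} \\
 & d_m\, x_m \;\le\; \mathrm{SLO} \quad \forall m
  && \text{(guaranteeing end-to-end latency)}
\end{aligned}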

  48. InfAdapter: Design

  49. InfAdapter: Experimental evaluation setup
    Workload: Twitter-trace sample (2022-08)
    Baselines: Kubernetes VPA and an adapted Model-Switching
    Models: ResNet18, ResNet34, ResNet50, ResNet101, ResNet152
    Adaptation interval: 30 seconds
    Cluster: a Kubernetes cluster of 2 computing nodes (48 cores, 192 GiB RAM)

  50. Workload Pattern

  51. InfAdapter: P99-Latency evaluation

  52. InfAdapter: P99-Latency evaluation

  53. InfAdapter: P99-Latency evaluation

  54. InfAdapter: P99-Latency evaluation

  55. InfAdapter: P99-Latency evaluation

  56. InfAdapter: P99-Latency evaluation

  57. InfAdapter: P99-Latency evaluation

  58. InfAdapter: P99-Latency evaluation

  59. InfAdapter: Accuracy evaluation

  60. InfAdapter: Cost evaluation

  61. InfAdapter: Experimental evaluation
    Compare aggregated metrics of latency SLO violations,
    accuracy, and cost against the baselines for different
    β values, to see how each performs across the
    accuracy-cost trade-off

  62. Takeaway
    Inference Serving Systems should consider
    accuracy, latency, and cost at the same time.

  63. Takeaway
    Model variants provide the opportunity
    to reduce resource costs while adapting
    to the dynamic workload.
    Using a set of model variants
    simultaneously provides higher average
    accuracy compared to having one
    variant.
    Inference Serving Systems should consider
    accuracy, latency, and cost at the same time.

  64. Takeaway
    Model variants provide the opportunity
    to reduce resource costs while adapting
    to the dynamic workload.
    Using a set of model variants
    simultaneously provides higher average
    accuracy compared to having one
    variant.
    Inference Serving Systems should consider
    accuracy, latency, and cost at the same time.
    InfAdapter!

  65. https://github.com/reconfigurable-ml-pipeline/InfAdapter

  66. Inference Pipeline
    Recommender Systems
    Source: https://developer.nvidia.com/blog/optimizing-dlrm-on-nvidia-gpus/
    Video Pipelines
    Source: https://docs.nvidia.com/metropolis/deepstream/5.0/dev-guide/index.html#page/DeepStream_Development_Guide/deepstream_overview.html

  67. Autoscaling
    Previous works have used autoscaling for cost optimization of inference pipelines

  68. Is only scaling enough?
    ?

  69. Effect of Batching

  70. How to navigate the accuracy/latency trade-off? Model
    variants and model switching!
    Previous works, INFaaS and Model-Switching, have
    shown that there is a large latency-accuracy-
    resource-footprint trade-off among models trained
    for the same task
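
A minimal model-switching sketch in the spirit of that idea (hypothetical profiled numbers; not the INFaaS or Model-Switching code): under load, serve the most accurate variant whose latency still fits the remaining SLO budget.

# Minimal model-switching sketch (hypothetical profiles, not the INFaaS/Model-Switching code).
PROFILES = [                      # (variant, accuracy, p99 latency in seconds)
    ("resnet18",  0.70, 0.05),
    ("resnet50",  0.76, 0.09),
    ("resnet152", 0.78, 0.20),
]

def pick_variant(queue_delay_s: float, slo_s: float) -> str:
    """Return the most accurate variant whose latency fits the remaining SLO budget."""
    budget = slo_s - queue_delay_s
    feasible = [p for p in PROFILES if p[2] <= budget]
    if not feasible:
        return PROFILES[0][0]     # overloaded: degrade to the fastest variant
    return max(feasible, key=lambda p: p[1])[0]

print(pick_variant(queue_delay_s=0.02, slo_s=0.15))   # -> resnet50
print(pick_variant(queue_delay_s=0.10, slo_s=0.15))   # -> resnet18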

  71. How to navigate the accuracy/latency trade-off? Model
    variants and model switching!
    Previous works, INFaaS and Model-Switching, have
    shown that there is a large latency-accuracy-
    resource-footprint trade-off among models trained
    for the same task

  72. How to navigate the accuracy/latency trade-off? Model
    variants and model switching!
    Previous works, INFaaS and Model-Switching, have
    shown that there is a large latency-accuracy-
    resource-footprint trade-off among models trained
    for the same task

  73. Search Space

  74. Goal: Providing a flexible inference
    pipeline

  75. Snapshot of the System

  76. System Design

  77. Problem Formulation
    Objective function: accuracy objective and resource objective
    Batch control
    Constraints: latency SLA, throughput constraint, one active model per node
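
A later slide mentions the Gurobi solver, so here is a minimal gurobipy sketch of a per-stage variant-selection ILP of the kind this slide describes. It is an illustrative formulation with hypothetical stages and profiled numbers (and it assumes a local Gurobi installation and license), not IPA's actual model; batch-size and replica decisions are omitted for brevity.

# Sketch of a per-stage variant-selection ILP with gurobipy (not IPA's actual model).
# Hypothetical pipeline stages, variants, and profiled numbers; batching/replicas omitted.
import gurobipy as gp
from gurobipy import GRB

stages = ["detector", "classifier"]
variants = {"detector": ["yolo-s", "yolo-l"], "classifier": ["resnet18", "resnet152"]}
acc   = {("detector", "yolo-s"): 0.60, ("detector", "yolo-l"): 0.70,
         ("classifier", "resnet18"): 0.70, ("classifier", "resnet152"): 0.78}
lat   = {("detector", "yolo-s"): 0.04, ("detector", "yolo-l"): 0.12,
         ("classifier", "resnet18"): 0.05, ("classifier", "resnet152"): 0.20}
cores = {("detector", "yolo-s"): 2, ("detector", "yolo-l"): 8,
         ("classifier", "resnet18"): 2, ("classifier", "resnet152"): 8}

SLA_S, BETA = 0.20, 0.01          # end-to-end latency SLA, accuracy-vs-resource weight

m = gp.Model("pipeline-sketch")
x = m.addVars(list(acc), vtype=GRB.BINARY, name="x")   # x[stage, variant] = 1 if active

# exactly one active model variant per pipeline stage ("one active model per node")
for s in stages:
    m.addConstr(gp.quicksum(x[s, v] for v in variants[s]) == 1)

# end-to-end latency of the chosen variants must respect the SLA
m.addConstr(gp.quicksum(lat[k] * x[k] for k in acc) <= SLA_S)

# maximize summed accuracy minus a beta-weighted resource cost
m.setObjective(gp.quicksum(acc[k] * x[k] for k in acc)
               - BETA * gp.quicksum(cores[k] * x[k] for k in acc), GRB.MAXIMIZE)
m.optimize()
print({k: int(x[k].X) for k in acc if x[k].X > 0.5})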

  78. Implementation and Experimental
    Setup

  79. Kubernetes (orchestration):
    1. Industry standard
    2. Used in recent research
    3. Complete set of autoscaling, scheduling, and
    observability tools (e.g., CPU usage)
    4. APIs for changing the current autoscaling
    algorithms
    ML server:
    1. Industry-standard ML server
    2. Can compose models into inference graphs
    3. REST and gRPC endpoints
    4. Many of the features we need (e.g., a monitoring
    stack) out of the box
    How to navigate Model Variants

  80. Experimental Setup
    ● A six-node Kubernetes cluster

  81. Experimental Results

  82. Video Pipeline

  83. Audio + QA Pipeline

  84. Summarization + QA Pipeline

  85. Summarization + QA Pipeline

  86. NLP Pipeline

  87. Adaptivity to multiple objectives

  88. Effect of predictor

  89. Gurobi solver scalability

  90. Model Serving
    Pipeline
    https://github.com/reconfigurable-ml-pipeline/ipa

  91. Model Serving
    Pipeline
    Is only scaling enough?
    ?
    https://github.com/reconfigurable-ml-pipeline/ipa

  92. Model Serving
    Pipeline
    Is only scaling enough?
    ?
    X
    Snapshot of the System
    https://github.com/reconfigurable-ml-pipeline/ipa

  93. Model Serving
    Pipeline
    Is only scaling enough?
    ?
    X
    Snapshot of the System
    X
    Adaptivity to multiple objectives
    https://github.com/reconfigurable-ml-pipeline/ipa
