
Unicorn: Reasoning about Configurable System Performance through the Lens of Causality

This is the EuroSys'22 presentation, delivered by Shahriar Iqbal.
Paper: https://dl.acm.org/doi/abs/10.1145/3492321.3519575
Code + Data + Replication Package: https://github.com/softsys4ai/unicorn

Pooyan Jamshidi

April 06, 2022

Transcript

  1. UNICORN: Reasoning about Configurable System Performance through the Lens of Causality

    Md Shahriar Iqbal, Rahul Krishna, MA Javidian, Baishakhi Ray, Pooyan Jamshidi
  2. Correlation vs Causation

  3. Outline: Motivation, Causal Inference, UNICORN, Results

  4. Consider a data analytics pipeline

    Components and their number of configuration options: Video Decoder (55), Stream Muxer (86), Primary Detector (14), Object Tracker (44), Secondary Classifier (86).
  5. Each component has a plethora of configuration options

    The components together form a composed system. Example options: Compression, Encryption, …
  6. Each component has a plethora of configuration options

    Possible configurations: 2^285. Complex interactions between options (within or across components) give rise to a combinatorially large configuration space.
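
The 2^285 figure can be reproduced from the per-component counts. A minimal sketch, assuming every option is binary (the slide does not state the option types, so this is an assumption that happens to match the exponent):

```python
# Option counts per component, in the order listed on the slide.
OPTION_COUNTS = {
    "Video Decoder": 55,
    "Stream Muxer": 86,
    "Primary Detector": 14,
    "Object Tracker": 44,
    "Secondary Classifier": 86,
}

total_options = sum(OPTION_COUNTS.values())

# Assumption (not stated on the slide): every option is binary, so each
# additional option doubles the number of configurations.
num_configurations = 2 ** total_options

print(f"{total_options} options, about {num_configurations:.3e} configurations")
```

Options with more than two values would make the space larger still.
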
  7. Performance varies significantly when systems are deployed with different configurations

    [Figure: distributions of latency (seconds) and energy consumption (joules) across configurations, annotated ">99.99%", "Varies by 5x", and "Varies by 4.5x".]
    The system is expected to be set to a configuration for which performance is optimal or close to optimal.
  8. Performance varies significantly when systems are deployed with different configurations

    Reaching the desired performance goal is difficult because of the sheer size of the configuration space and the high cost of measuring each configuration.
  9. Computer systems undergo several environmental changes

    [Figure: the pipeline (Decoder, Muxer, Detector, Tracker, Classifier) moves from a source environment to a target environment.]
  10. Real-world example: deployment environment change

    Forum post: "When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases."
    • The user is transferring the code from one hardware platform to another.
    • The code ran 2x slower on the more powerful hardware.
  11. Real-world example: deployment environment change

    An incorrect understanding of performance behavior often leads to misconfiguration.
  12. What is misconfiguration?

    Misconfigurations happen due to unexpected interactions between configuration options in the deployment system stack.
  13. What is misconfiguration?

    The system does not crash but remains operational with degraded performance, e.g., high latency, low throughput, or high energy consumption.
    [Figure: latency (seconds) and energy consumption (joules) distributions with the misconfiguration marked.]
  14. Performance task: Debugging

    Performance debugging aims at finding the root cause of the misconfiguration and fixing it.
    In the TX1-to-TX2 forum post, the user expected a 30-40% improvement, but the code ran 2x slower on the more powerful hardware.
  15. Performance task: Optimization

    Here, the developer aims at finding the optimal configuration, with or without having experienced a misconfiguration.
    [Figure: latency (seconds) and energy consumption (joules) distributions.]
  16. Performance debugging tasks take a significantly long time, and the fixes are typically non-intuitive (changes to seemingly underrated options)

    Forum thread (June 3 to June 5):
    • User: Any suggestions on how to improve my performance? Thanks!
    • Reply: Please do the following and let us know if it works: 1. Install JetPack 3.0; 2. Set nvpmodel=MAX-N; 3. Run jetson_clock.sh
    • User: We have already tried this. We still have high latency. Any other suggestions?
    • Reply: TX2 is pascal architecture. Please update your CMakeLists: + set(CUDA_STATIC_RUNTIME OFF) ... + -gencode=arch=compute_62,code=sm_62
    The user had several misconfigurations.
    In software: ✖ wrong compilation flags, ✖ wrong SDK version.
    In hardware: ✖ wrong power mode, ✖ wrong clock/fan settings.
  17. How to resolve these issues?

    Current approaches: reasoning based on correlation!
    Our key idea: reasoning based on causation :)
  18. Performance Influence Models

    Observational data (one row per configuration):

         Bitrate (bits/s)   Enable Padding   …   Cache Misses   …   Throughput (fps)
    c1   1k                 1                …   42m            …   7
    c2   2k                 1                …   32m            …   22
    …    …                  …                …   …              …   …
    cn   5k                 0                …   12m            …   25

    Black-box models fit a regression equation over options and their interactions:
    Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
    [Figure: latency (ms) as a function of the number of counters and the number of splitters, cubic interpolation over a finer grid.]
    This is a representative work, but many other works use regression models (as well as other statistical models) to build performance models. We have selection bias here ;)
  19. Performance Influence Models

    These methods rely on statistical correlations to extract the information required for performance tasks. The interactions in the regression equation are discovered from the data.
  20. Performance Influence Models suffer from several shortcomings

    • Incorrect explanations and unreliable predictions
    • Non-transferable across environments
  21. Performance Influence Models might be Unreliable

    [Figure: scatter plot of Throughput (FPS) against Cache Misses.]
    In the observed data, increasing Cache Misses increases Throughput.
  22. Performance Influence Models might be Unreliable

    This is counter-intuitive: more Cache Misses should reduce Throughput, not increase it. Purely statistical models built on this data will be unreliable.
  23. Performance Influence Models might be Unreliable

    [Figure: the same scatter plot, grouped by Cache Policy (LRU, FIFO, LIFO, MRU).]
    Segregating the data by Cache Policy shows that, within each group, an increase in Cache Misses results in a decrease in Throughput.
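
The reversal described on these slides can be reproduced with a few lines of code. The sketch below uses made-up numbers (the per-policy centers, base throughputs, and slope are all hypothetical, not the talk's data) to show a positive pooled correlation that turns negative inside every Cache Policy group:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

POLICIES = ["LRU", "FIFO", "LIFO", "MRU"]
OFFSETS = [-10_000, -5_000, 0, 5_000, 10_000]

# Hypothetical data: each policy has its own typical cache-miss level and
# base throughput; within a policy, more misses always lower throughput.
samples = []  # (policy, cache_misses, throughput_fps)
for index, policy in enumerate(POLICIES):
    center_misses = 50_000 + 40_000 * index
    base_fps = 5.0 + 4.0 * index
    for offset in OFFSETS:
        samples.append((policy, center_misses + offset, base_fps - 1e-4 * offset))

# Correlation over the pooled data, ignoring the policy.
overall = pearson([m for _, m, _ in samples], [t for _, _, t in samples])

# Correlation inside each policy group.
within = {
    policy: pearson(
        [m for p, m, _ in samples if p == policy],
        [t for p, _, t in samples if p == policy],
    )
    for policy in POLICIES
}

print("pooled:", round(overall, 3))
print("per policy:", {p: round(r, 3) for p, r in within.items()})
```

A regression fitted on the pooled data would learn the positive slope, which is the failure mode the slides attribute to purely statistical models.
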
  24. Performance Influence Models are not Transferable

    DeepStream (Environment: TX2):
    Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
    DeepStream (Environment: Xavier):
    Throughput = 2 × Bitrate + 1.9 × BatchSize + 1.8 × BufferSize + 0.5 × EnablePadding + 5.9 × Bitrate × BufferSize + 6.2 × Bitrate × EnablePadding + 4.1 × Bitrate × BufferSize × EnablePadding
    Each term in the regression equations is considered a predictor.
  25. Performance Influence Models are not Transferable

    Performance influence models change significantly in new environments, resulting in lower accuracy.
  26. Performance Influence Models are not Transferable

    Performance influence models cannot be reliably used across environments.
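
One way to see how little the two regression equations share is to compare their term sets. This is an illustrative sketch only: the dictionary encoding and the `predict` helper are assumptions, and the environment labels assume the equations appear on the slide in the same order as the environment names.

```python
# Each model maps a term (a tuple of option names) to its coefficient.
TX2_MODEL = {
    ("Bitrate",): 5.1,
    ("BatchSize",): 2.5,
    ("Bitrate", "BatchSize"): 12.3,
}
XAVIER_MODEL = {
    ("Bitrate",): 2.0,
    ("BatchSize",): 1.9,
    ("BufferSize",): 1.8,
    ("EnablePadding",): 0.5,
    ("Bitrate", "BufferSize"): 5.9,
    ("Bitrate", "EnablePadding"): 6.2,
    ("Bitrate", "BufferSize", "EnablePadding"): 4.1,
}

def predict(model, config):
    """Evaluate a regression equation: sum of coefficient * product of options."""
    total = 0.0
    for term, coefficient in model.items():
        value = coefficient
        for option in term:
            value *= config[option]
        total += value
    return total

# Predictors that survive the environment change.
common_terms = set(TX2_MODEL) & set(XAVIER_MODEL)

# A hypothetical configuration in normalized units, for illustration only.
config = {"Bitrate": 1, "BatchSize": 1, "BufferSize": 0, "EnablePadding": 0}
print(sorted(common_terms))
print(predict(TX2_MODEL, config), predict(XAVIER_MODEL, config))
```

Only the two main effects are shared, and even those carry different coefficients, so the same configuration gets very different predictions.
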
  27. Outline: Motivation, Causal Inference, UNICORN, Results

  28. Our Key Idea: Building Causal Performance Models instead of Performance Influence Models

    A causal performance model expresses the relationships between interacting variables (configuration options, system events, and non-functional properties) as a causal graph.
    Example, following the direction of causality: Cache Policy → Cache Misses → Throughput.
  29. Why Causal Performance Models? To build reliable models that produce correct explanations

    Cache Policy affects Throughput via Cache Misses. Causal performance models recover the correct interactions.
    [Figure: Throughput (FPS) against Cache Misses, pooled and grouped by Cache Policy (LRU, FIFO, LIFO, MRU), next to the causal graph over Cache Policy, Cache Misses, and Throughput.]
  30. Why Causal Performance Models? To reuse them when the system environment changes

    Causal models remain relatively stable.
    [Figure: partial causal performance models for Jetson Xavier and for Jetson TX2 over the same nodes: Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, Cycles, FPS, Energy.]
  31. How to use Causal Performance Models?

    • How to generate a causal performance model?
  32. How to use Causal Performance Models?

    • How to generate a causal performance model?
    • How to use the causal performance model for performance tasks?
  33. Outline: Motivation, Causal Inference, UNICORN, Results

  34. UNICORN: End-to-end Pipeline

    Stages:
    I. Specify Performance Query (e.g., QoS: Th > 40/s; observed: Th < 30/s ± 5/s)
    II. Learn Causal Performance Model (from initial performance data)
    III. Iterative Sampling (determine the next configuration)
    IV. Update Causal Performance Model
    V. Estimate Causal Queries (e.g., estimate the probability of satisfying QoS if BufferSize is set to 6k: P(Th > 40/s | do(BufferSize = 6k)))
    System stack: Software: DeepStream; Middleware: TF, TensorRT; Hardware: Nvidia Xavier; Configuration: Default.
    Performance tasks served by the causal inference engine:
    • What is the root cause of the fault?
    • How do I fix the misconfiguration?
    • How do I optimize performance?
    • How do I understand performance?
  35. Stage-I: Specify Performance Query

    Query: What are the root causes of my performance fault and how can I improve performance by 70%?
  36. Stage-I: Specify Performance Query

    The query engine extracts the information that subsequent stages need for a performance task. Extracted information here: 70% gain expected.
  37. Stage-II: Learn Causal Performance Model

    Input: the observational data table (configurations c1 … cn with columns such as Bitrate, Enable Padding, Cache Misses, and Throughput).
    Nodes: Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No of Cycles, FPS, Energy.
    1- Recovering the Skeleton: start from a fully connected graph, given constraints (e.g., no connections between configuration options).
  38. Stage-II: Learn Causal Performance Model

    2- Pruning the Causal Model: statistical independence tests.
  39. Stage-II: Learn Causal Performance Model

    3- Orienting Causal Relations: orientation rules and measures plus constraints (colliders, v-structures). The result is a Partial Ancestral Graph (PAG).
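
The first step, a fully connected graph minus the edges forbidden by the constraints, can be sketched as follows. The split of the nine nodes into options, events, and objectives is an assumption made for illustration, and `initial_skeleton` is a hypothetical helper, not a function from the UNICORN code base.

```python
from itertools import combinations

# Assumed grouping of the nodes shown on the slide.
OPTIONS = ["Bitrate", "Buffer Size", "Batch Size", "Enable Padding"]
EVENTS = ["Branch Misses", "Cache Misses", "No of Cycles"]
OBJECTIVES = ["FPS", "Energy"]

def initial_skeleton(options, events, objectives):
    """Return the undirected edges of a fully connected graph, except that
    two configuration options are never connected to each other."""
    nodes = options + events + objectives
    option_set = set(options)
    return {
        frozenset((a, b))
        for a, b in combinations(nodes, 2)
        if not (a in option_set and b in option_set)
    }

skeleton = initial_skeleton(OPTIONS, EVENTS, OBJECTIVES)
print(len(skeleton), "candidate edges before pruning")
```

With 9 nodes there are 36 possible pairs; dropping the 6 option-to-option pairs leaves 30 candidate edges for the independence tests to prune.
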
  40. Stage-II: Learn Causal Performance Model

    A Partial Ancestral Graph (PAG) can have three types of edges between any nodes X and Y:
    • X → Y: X is a parent of Y.
    • X ↔ Y: a confounder exists between X and Y.
    • A partially oriented edge between X and Y: there is not sufficient data to recover the causal direction.
  43. Stage-II: Learn Causal Performance Model

    From the observational data to the final model:
    1- Recovering the Skeleton: fully connected graph given constraints (e.g., no connections between configuration options).
    2- Pruning the Causal Model: statistical independence tests.
    3- Orienting Causal Relations: orientation rules and measures plus constraints (colliders, v-structures), yielding a Partial Ancestral Graph (PAG).
    4- Refining Causal Directions: latent search and entropy, yielding an Acyclic Directed Mixed Graph (ADMG).
  44. Stage-III: Iterative Sampling

    • Select the top K paths of the causal performance model using the Average Causal Effect (ACE).
    • On the selected subsection of the model (e.g., Bitrate, Branch Misses, FPS), estimate the Individual Causal Effect (ICE) to obtain a recommended configuration.
    • Take an interventional measurement of the recommended configuration and add it to the observational data.
  45. Why Select Top K Paths?

    • In real-world cases, the causal graphs can be very complex.
    • It may be intractable to reason over the entire graph directly.
    [Figure: a real-world causal graph for a data analytics pipeline.]
  46. Extracting Causal Paths from the Causal Model

    Each extracted path:
    • always begins with a configuration option or a system event;
    • always terminates at a performance objective.
    Example over the nodes Bitrate, Cache Misses, Branch Misses, and FPS: Bitrate → Branch Misses → FPS.
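
Path extraction under these two rules amounts to a depth-first walk over the directed graph. The sketch below is an illustration under assumptions: the edge set is a made-up fragment over the slide's four nodes, and `causal_paths` is a hypothetical helper name.

```python
# Hypothetical directed fragment of a causal performance model.
GRAPH = {
    "Bitrate": ["Branch Misses"],
    "Cache Misses": ["Branch Misses"],
    "Branch Misses": ["FPS"],
    "FPS": [],
}
STARTS = ["Bitrate", "Cache Misses", "Branch Misses"]  # options and system events
OBJECTIVES = {"FPS"}

def causal_paths(graph, starts, objectives):
    """Enumerate directed paths that begin at an option or event and end at
    the first performance objective they reach."""
    paths = []

    def walk(node, path):
        if node in objectives:
            paths.append(path)
            return
        for child in graph.get(node, []):
            if child not in path:  # guard against revisiting a node
                walk(child, path + [child])

    for start in starts:
        walk(start, [start])
    return paths

for path in causal_paths(GRAPH, STARTS, OBJECTIVES):
    print(" -> ".join(path))
```
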
  47. Ranking Causal Paths from the Causal Model

    • There may be too many causal paths.
    • We need to select the most useful ones.
    • Compute the Average Causal Effect (ACE) of each pair of neighbors in a path (e.g., Bitrate → Branch Misses → FPS).

    $$\mathrm{ACE}(\mathrm{BranchMisses}, \mathrm{Bitrate}) = \frac{1}{N} \sum_{a, b} \Big( E[\mathrm{BranchMisses} \mid do(\mathrm{Bitrate} = b)] - E[\mathrm{BranchMisses} \mid do(\mathrm{Bitrate} = a)] \Big)$$

    • First term: the expected value of Branch Misses when we artificially intervene by setting Bitrate to the value b.
    • Second term: the expected value of Branch Misses when we artificially intervene by setting Bitrate to the value a.
    • If this difference is large, then small changes to Bitrate will cause large changes to Branch Misses.
    • The sum averages over all permitted values of Bitrate.
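
A small numeric sketch of the ACE computation follows. The slide does not say how the pairs (a, b) are enumerated or what N counts, so this version makes two labeled assumptions: it uses every unordered pair of permitted values, and it takes absolute differences so that effects of opposite sign do not cancel. The interventional expectation is a stand-in function, not something estimated from a real causal model.

```python
from itertools import combinations

def average_causal_effect(expected_under_do, values):
    """Average |E[effect | do(cause=b)] - E[effect | do(cause=a)]| over all
    unordered pairs (a, b) of permitted values (assumed pairing scheme)."""
    pairs = list(combinations(values, 2))
    total = sum(abs(expected_under_do(b) - expected_under_do(a)) for a, b in pairs)
    return total / len(pairs)

# Hypothetical interventional expectation: Branch Misses (in millions) as a
# function of the Bitrate we force the system to use (in thousands).
def expected_branch_misses(bitrate):
    return 2.0 * bitrate

PERMITTED_BITRATES = [1, 2, 3]
ace = average_causal_effect(expected_branch_misses, PERMITTED_BITRATES)
print(round(ace, 3))
```
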
  48. Ranking Causal Paths from the Causal Model

    • Average the ACE of all pairs of adjacent nodes in the path.
    • Rank paths from the highest path ACE (PACE) score to the lowest.
    • Use the top K paths for subsequent analysis.

    For a path through the nodes Z, X, and Y (e.g., Bitrate → Branch Misses → FPS), summing over all pairs of adjacent nodes in the causal path:

    $$\mathrm{PACE}(Z, Y) = \frac{1}{2} \big( \mathrm{ACE}(Z, X) + \mathrm{ACE}(X, Y) \big)$$
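
Ranking then reduces to averaging per-edge ACE values along each path and sorting. In this sketch the ACE table is filled with made-up numbers, the table is keyed by (cause, effect) as a convention of the example only, and `path_ace` and `top_k_paths` are hypothetical helper names.

```python
# Hypothetical per-edge ACE values, keyed as (cause, effect).
ACE = {
    ("Bitrate", "Branch Misses"): 0.8,
    ("Cache Misses", "Branch Misses"): 0.2,
    ("Branch Misses", "FPS"): 0.6,
}

def path_ace(path, ace):
    """Average the ACE over every pair of adjacent nodes in the path."""
    edges = list(zip(path, path[1:]))
    return sum(ace[edge] for edge in edges) / len(edges)

def top_k_paths(paths, ace, k):
    """Rank paths from highest to lowest PACE and keep the first k."""
    return sorted(paths, key=lambda p: path_ace(p, ace), reverse=True)[:k]

PATHS = [
    ["Cache Misses", "Branch Misses", "FPS"],
    ["Bitrate", "Branch Misses", "FPS"],
    ["Branch Misses", "FPS"],
]
for path in top_k_paths(PATHS, ACE, k=2):
    print(" -> ".join(path), round(path_ace(path, ACE), 2))
```
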
  49. How to reason over a path?

    To reason over a path, we evaluate counterfactual queries formulated from the configuration options and performance objectives on that path, in order to resolve a particular performance task.
  50. Counterfactual Queries

    • Counterfactual inference asks “what if” questions about changes to the misconfigurations.
    Example: "Given that my current Bitrate is 6000 and I have low throughput, what is the probability of having low throughput if my Bitrate is increased to 10000?"
    We are interested in the scenario where:
    • we hypothetically have low throughput;
    conditioned on the following events:
    • we hypothetically set the new Bitrate to 10000;
    • Bitrate was initially set to 6000;
    • we observed low throughput when Bitrate was set to 6000;
    • everything else remains the same.
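
In standard counterfactual notation, the example query could be written as below. This is one common way to write such a query, offered as a reading aid; the slide itself gives the query only in words, so the subscript notation is an assumption.

```latex
% Probability that throughput would (still) be low had Bitrate been set to 10000,
% given that we actually observed Bitrate = 6000 together with low throughput.
P\big(\mathrm{Throughput}_{\mathrm{Bitrate} \leftarrow 10000} = \mathrm{low}
      \;\big|\; \mathrm{Bitrate} = 6000,\ \mathrm{Throughput} = \mathrm{low}\big)
```

A small value suggests that raising Bitrate to 10000 is a promising repair for this particular fault.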
  51. Selecting configuration for next intervention

    • For the top K paths (e.g., Bitrate → Branch Misses → FPS), enumerate all possible changes: set every configuration option in the path to all permitted values.
    • Compute the ICE of each change. It is inferred from observational data, so this is very cheap.
    • Select the change with the largest ICE.
  52. Selecting configuration for next intervention

    • Apply the change with the largest ICE and measure performance.
    • Query satisfied? Yes: terminate. No: proceed to the next stage.
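
The select, measure, and check loop can be sketched as follows. Everything here is a simplified stand-in: `iterative_sampling` is a hypothetical name, the ICE estimates and measurements are made-up lookup tables, and the sketch keeps the estimates fixed, whereas the pipeline described in the talk updates the causal model between measurements.

```python
def iterative_sampling(candidates, estimate_ice, measure, satisfied, budget):
    """Repeatedly try the untested change with the largest estimated ICE,
    measure it, and stop once the query is satisfied or the budget runs out."""
    history = []
    remaining = list(candidates)
    for _ in range(budget):
        if not remaining:
            break
        best = max(remaining, key=estimate_ice)  # cheap: no measurement needed
        remaining.remove(best)
        result = measure(best)                   # expensive: real measurement
        history.append((best, result))
        if satisfied(result):
            break
    return history

# Hypothetical numbers: candidate Bitrate values, their estimated ICE, and the
# throughput (frames/s) we would observe after applying each one.
ESTIMATED_ICE = {6000: 0.1, 8000: 0.9, 10000: 0.5}
MEASURED_FPS = {6000: 25, 8000: 35, 10000: 45}

history = iterative_sampling(
    candidates=[6000, 8000, 10000],
    estimate_ice=ESTIMATED_ICE.get,
    measure=MEASURED_FPS.get,
    satisfied=lambda fps: fps > 40,  # QoS from the talk: Th > 40/s
    budget=5,
)
print(history)
```

In this toy run the highest-ranked change misses the QoS target, so a second measurement is taken before the loop terminates.
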
  53. Stage-IV: Update Causal Performance Model

    1- Evaluate candidate interventions (Intervention 1 … Intervention n; interventions on hardware, workload, and kernel options), using the expected change in belief and KL divergence, and the causal effects on objectives.
    2- Determine and perform the next performance measurement. Example measurement:

       Option/Event/Objective   Value
       Bitrate                  1k
       Buffer Size              20k
       Batch Size               10
       Enable Padding           1
       Branch Misses            24m
       Cache Misses             42m
       No of Cycles             73b
       FPS                      31/s
       Energy                   42J

    3- Update the causal performance model with the new performance data (belief update from the prior belief; model averaging).
    4- Replace the causal performance model.
  54. Stage-V: Estimate Causal Queries

    Estimate the probability of satisfying QoS given BufferSize = 20000: P(Throughput > 40/s | do(BufferSize = 20000))
    • Use do-calculus to evaluate causal queries.
    • Estimate budget and additional constraints.
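
To make the do-query concrete, here is a minimal sketch of back-door adjustment, one identification formula that do-calculus licenses when an observed variable Z blocks all back-door paths. The confounder (a workload label), the data rows, and the `backdoor_probability` helper are all hypothetical; the talk does not specify which adjustment UNICORN applies for a given query.

```python
from collections import Counter

def backdoor_probability(rows, x_value, outcome):
    """Estimate P(outcome(Y) | do(X = x_value)) as
    sum over z of P(outcome(Y) | X = x_value, Z = z) * P(Z = z),
    where rows is a list of (x, z, y) observations."""
    n = len(rows)
    z_counts = Counter(z for _, z, _ in rows)
    total = 0.0
    for z, count_z in z_counts.items():
        stratum = [y for x, zz, y in rows if x == x_value and zz == z]
        if not stratum:
            raise ValueError(f"no observations with X={x_value!r} and Z={z!r}")
        p_outcome = sum(1 for y in stratum if outcome(y)) / len(stratum)
        total += p_outcome * count_z / n
    return total

# Hypothetical observations: (BufferSize, workload, throughput in frames/s).
ROWS = [
    (20000, "light", 50), (20000, "light", 45), (20000, "light", 42),
    (20000, "heavy", 35),
    (6000, "light", 38),
    (6000, "heavy", 30), (6000, "heavy", 28), (6000, "heavy", 41),
]

meets_qos = lambda fps: fps > 40
p_do = backdoor_probability(ROWS, 20000, meets_qos)

# Naive conditional estimate, which ignores the confounder.
observed = [y for x, _, y in ROWS if x == 20000]
p_naive = sum(1 for y in observed if meets_qos(y)) / len(observed)
print(p_do, p_naive)
```

The two estimates differ because, in this made-up data, large buffers were mostly used under light workloads.
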
  55. Outline: Motivation, Causal Inference, UNICORN, Results

  56. Experimental Setup: Systems, Workload, Hardware

    Hardware:

                    CPU                 GPU                  Memory
    Nvidia TX1      4 cores, 1.3 GHz    128 cores, 0.9 GHz   4 GB, 25 GB/s
    Nvidia TX2      6 cores, 2.0 GHz    256 cores, 1.3 GHz   8 GB, 58 GB/s
    Nvidia Xavier   8 cores, 2.26 GHz   512 cores, 1.3 GHz   16 GB, 137 GB/s

    Systems and workloads:
    • Xception: image recognition (50,000 test images)
    • DeepSpeech: voice recognition (5 sec. audio clip)
    • BERT: sentiment analysis (10000 IMDb reviews)
    • x264: video encoder (11 Mb, 1080p video)
  57. Experimental Setup: Baselines

    Baselines are grouped by task: Optimization and Debugging.

  58. Results: Efficiency

    Unicorn finds the root causes accurately.
  59. Results: Efficiency

    Unicorn achieves higher gain.
  60. Results: Efficiency

    Unicorn performs these tasks much faster.
    Takeaway: UNICORN achieves higher sample efficiency than other baselines.
  61. Results: Transferability

    [Figure: Gain (%) against workload size (10k, 20k, 50k) for Unicorn (Reuse), Unicorn + 10%, Unicorn + 20%, Smac (Reuse), Smac + 10%, and Smac + 20%.]
    UNICORN finds configurations with higher gain when the workload changes.
  62. Results: Transferability

    Takeaway: UNICORN can be effectively reused in new environments for different performance tasks.
  63. Results: Scalability

    Discovery time, query evaluation time, and total time do not increase exponentially as the number of configuration options and system events increases.
  64. Results: Scalability

    Causal graphs are sparse.
  65. Results: Scalability

    Takeaway: UNICORN is scalable to larger multi-component systems and to systems with large configuration spaces.
  66. Causal reasoning enables more reliable performance analyses and more transferable performance models

    [Figure: the pipeline (Decoder, Muxer, Detector, Tracker, Classifier).]
  67. Causal reasoning enables more reliable performance analyses and more transferable performance models

    [Figure: recap of the Cache Policy, Cache Misses, and Throughput example, pooled and grouped by Cache Policy (LRU, FIFO, LIFO, MRU).]
  68. Causal reasoning enables more reliable performance analyses and more transferable performance models

    [Figure: recap of the UNICORN end-to-end pipeline, from specifying the performance query to estimating causal queries.]
  70. https://github.com/softsys4ai/UNICORN

  71. Supplementary Slides

  72. Maintaining performance in a highly configurable system is challenging

    • The configuration space is combinatorially large, with thousands of configuration options.
  73. Maintaining performance in a highly configurable system is challenging

    • Configuration options from each component interact non-trivially with one another.
  74. Maintaining performance in a highly configurable system is challenging

    • Individual component developers have a localized and limited understanding of the performance behavior of these systems.
  75. Maintaining performance in a highly configurable system is challenging

    • Each deployment needs to be configured correctly, which is prone to misconfigurations.
  76. Maintaining performance in a highly configurable system is challenging

    • Each deployment needs to be configured correctly every time an environmental change occurs.
    An incorrect understanding of performance behavior often leads to misconfiguration.
  77. Building configuration space of a highly configurable system

    C = O1 × O2 × O3 × … × On
    Example options: Batch Size, Interval, Enable Past Frame, Presets.
  78. Modern computer system is composed of multiple components

    Component 1, Component 2, Component 3, ... form the composed system. In the pipeline example: Video Decoder (55 configuration options), Stream Muxer (86), Primary Detector (14), Object Tracker (44), Secondary Classifier (86).
  79. Building configuration space of a highly configurable system

    A configuration c1 ∈ C picks one value per option, e.g., c1 = False × 20 × 5 × … × True.
  80. Building configuration space of a highly configurable system

    Each configuration maps to performance measurements: f1(c1) = 32/second (Throughput), f2(c1) = 63.6 Joules (Energy).
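
The Cartesian-product definition of C maps directly onto `itertools.product`. In this sketch the option names come from the slide, but the permitted values for each option are invented for illustration.

```python
from itertools import product

# Option names from the slide; the value domains are hypothetical.
OPTIONS = {
    "Batch Size": [10, 20, 30],
    "Interval": [0, 5],
    "Enable Past Frame": [False, True],
    "Presets": [0, 1, 2],
}

# C = O1 x O2 x ... x On: every combination of one value per option.
configurations = [
    dict(zip(OPTIONS, values)) for values in product(*OPTIONS.values())
]

print(len(configurations), "configurations")
print(configurations[0])
```

Even four small domains yield 36 configurations; the count multiplies with every option added, which is why exhaustive measurement quickly becomes infeasible.
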
  81. Results: Transferability

    [Figure: Gain (%) against workload size (10k, 20k, 50k) for Unicorn and Smac (Reuse, + 10%, + 20%).]
    [Figure: accuracy, precision, recall, and gain (%), and time (hours), for Unicorn and Bugdoc (Reuse, + 25, Rerun).]
    UNICORN quickly fixes the bug and achieves higher gain, accuracy, precision, and recall when the hardware changes.
  82. Why Causal Inference? - Accurate across Environments

    [Figure: two panels, Performance Influence Model and Causal Performance Model, each plotting over regression models the number of terms (common terms source → target, total terms in source, total terms in target) and the MAPE (%) error (source, target, source → target).]
    Common predictors are large in number for the causal performance model and lower in number for the performance influence model.
  83. Why Causal Inference? - Accurate across Environments

    The causal performance model has low error when reused; the performance influence model has high error when reused.
    Causal models can be reliably reused when environmental changes occur.
  84. Why Causal Inference? - Generalizability

    Causal models are more generalizable than performance influence models.
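
The error metric on these plots, MAPE, is the mean absolute percentage error. A minimal sketch with made-up throughput values (not data from the talk):

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is 0."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical measured vs. predicted throughput (frames/s).
measured = [20.0, 25.0, 40.0]
predicted = [22.0, 20.0, 40.0]
print(round(mape(measured, predicted), 2))
```
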
  85. Future work

    • Determining more accurate causal graphs by incorporating domain knowledge.
    [Figure: a causal performance model learned from observational data is corrected by a domain expert using background knowledge.]
  86. Future work

    • Developing new domain-specific languages for performance query specification.
    [Figure: an end user's unstructured performance queries pass through semantic analysis and the query engine to yield useful information.]