Slide 1

UNICORN: Reasoning about Configurable System Performance through the Lens of Causality. Md Shahriar Iqbal, Rahul Krishna, MA Javidian, Baishakhi Ray, Pooyan Jamshidi

Slide 2

Correlation vs. Causation

Slide 3

Outline: Motivation, Causal Inference, UNICORN, Results

Slide 4

Consider a data analytics pipeline: Video Decoder (55 configuration options), Stream Muxer (86), Primary Detector (14), Object Tracker (44), Secondary Classifier (86).

Slide 5

The composed system: Video Decoder (55 configuration options), Stream Muxer (86), Primary Detector (14), Object Tracker (44), Secondary Classifier (86). Each component has a plethora of configuration options (Compression, Encryption, ...).

Slide 6

Each component has a plethora of configuration options (Compression, Encryption, ...): Video Decoder (55), Stream Muxer (86), Primary Detector (14), Object Tracker (44), Secondary Classifier (86). Complex interactions between options (intra- or inter-component) give rise to a combinatorially large configuration space: 2^285 possible configurations.
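The size of this configuration space can be sketched with a few lines of Python. This assumes, for simplicity, that every option is binary (real options may have larger domains, which only makes the space bigger); the per-component option counts are taken from the slide.

```python
# Sketch: size of the composed configuration space, assuming
# (hypothetically) that every option is binary. Option counts per
# component are taken from the slide.
options_per_component = {
    "video_decoder": 55,
    "stream_muxer": 86,
    "primary_detector": 14,
    "object_tracker": 44,
    "secondary_classifier": 86,
}

total_options = sum(options_per_component.values())  # 285
num_configurations = 2 ** total_options              # 2^285

print(total_options)                 # 285
print(num_configurations > 10**85)   # True: astronomically large
```

Exhaustively measuring such a space is clearly out of the question, which motivates the sampling-based approach later in the talk.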

Slide 7

Performance varies significantly when systems are deployed with different configurations. [Plots: Latency (Seconds, 5-25) varies by 5x; Energy Consumption (Joules, 10-50) varies by 4.5x; >99.99% annotated on each plot.] It is expected to set the system to a configuration for which the performance remains optimal or close to optimal.

Slide 8

Performance varies significantly when systems are deployed with different configurations. [Plots: Latency (Seconds, 5-25) varies by 5x; Energy Consumption (Joules, 10-50) varies by 4.5x; >99.99% annotated on each plot.] It is expected to set the system to a configuration for which the performance remains optimal or close to optimal. Reaching the desired performance goal is difficult due to the sheer size of the configuration space and the high cost of measuring configurations.

Slide 9

Computer systems undergo several environmental changes. [Diagram: the same pipeline (Decoder, Muxer, Detector, Tracker, Classifier) deployed in a source environment and a target environment.]

Slide 10

Real-world example: deployment environment change. The user is porting code from one hardware to another; the code ran 2x slower on the more powerful hardware. Forum quote: "When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2's CUDA API runs much slower than TX1 in many cases."

Slide 11

Real-world example: deployment environment change (same forum quote as before: porting CUDA code from TX1 to TX2, expecting at least a 30-40% speedup, but observing a 2x slowdown on the more powerful hardware). Incorrect understanding of the performance behavior often leads to misconfiguration.

Slide 12

What is misconfiguration? Misconfigurations happen due to unexpected interactions between configuration options in the deployment system stack.

Slide 13

What is misconfiguration? Misconfigurations happen due to unexpected interactions between configuration options in the deployment system stack. The system does not crash but remains operational with degraded performance, e.g., high latency, low throughput, high energy consumption. [Latency and energy-consumption plots with a misconfiguration highlighted.]

Slide 14

Performance task: Debugging. Performance debugging aims at finding the root cause of the misconfiguration and fixing it. (Same forum example as before: the user is porting the code from one hardware to another and expects a 30-40% improvement, yet the code ran 2x slower on the more powerful hardware.)

Slide 15

Performance task: Optimization. Here, the developer aims at finding the optimal configuration, with or without experiencing any misconfiguration. [Latency and energy-consumption plots.]

Slide 16

Performance debugging tasks take a significantly long time, and the fixes are typically non-intuitive (changes to seemingly unrelated options). [Forum thread, June 3-5. User: "Any suggestions on how to improve my performance? Thanks!" Support: "Please do the following and let us know if it works: 1. Install JetPack 3.0; 2. Set nvpmodel=MAX-N; 3. Run jetson_clock.sh." User: "We have already tried this. We still have high latency. Any other suggestions?" Support: "TX2 is Pascal architecture. Please update your CMakeLists: + set(CUDA_STATIC_RUNTIME OFF) ... + -gencode=arch=compute_62,code=sm_62"] The user had several misconfigurations. In software: wrong compilation flags; wrong SDK version. In hardware: wrong power mode; wrong clock/fan settings.

Slide 17

How do we resolve these issues? Current approaches: reasoning based on correlation. Our key idea: reasoning based on causation.

Slide 18

Performance Influence Models. Observational data (configurations with option values, system events, and measured performance):

Config | Bitrate (bits/s) | Enable Padding | ... | Cache Misses | ... | Throughput (fps)
c1     | 1k               | 1              | ... | 42m          | ... | 7
c2     | 2k               | 1              | ... | 32m          | ... | 22
...    | ...              | ...            | ... | ...          | ... | ...
cn     | 5k               | 0              | ... | 12m          | ... | 25

Black-box models fit a regression equation over options and their interactions, e.g.: Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize. [3D plot: latency (ms) vs. number of counters and number of splitters, cubic interpolation over a finer grid.] This is a representative work; there are many other works using regression models (as well as other statistical models) for building performance models. (We have selection bias here ;))

Slide 19

These methods rely on statistical correlations to extract the meaningful information required for performance tasks. [Same figure: observational data table, black-box models, and the regression equation Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize, with single options and discovered interactions as terms.]
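Fitting a performance-influence model of this shape is just a least-squares regression over option values plus an interaction feature. The sketch below is illustrative only: the coefficients 5.1, 2.5, and 12.3 come from the slide's example equation, the data points are synthetic, and the 3x3 normal-equation solve via Cramer's rule stands in for a proper regression library.

```python
# Sketch: fitting a performance-influence model of the shape
#   Throughput = a*Bitrate + b*BatchSize + c*Bitrate*BatchSize
# by least squares on synthetic data generated from the slide's
# example coefficients, so the fit should recover them exactly.

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit(rows):
    """Least-squares fit over features (x, z, x*z); rows = (x, z, y)."""
    feats = [(x, z, x * z) for x, z, _ in rows]
    ys = [y for _, _, y in rows]
    # Normal equations (F^T F) coef = F^T y, solved by Cramer's rule.
    A = [[sum(f[i] * f[j] for f in feats) for j in range(3)] for i in range(3)]
    rhs = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(3)]
    d = det3(A)
    coef = []
    for i in range(3):
        Ai = [row[:] for row in A]
        for r in range(3):
            Ai[r][i] = rhs[r]
        coef.append(det3(Ai) / d)
    return coef

# Synthetic observations generated from the slide's coefficients.
data = [(x, z, 5.1 * x + 2.5 * z + 12.3 * x * z)
        for x, z in [(1, 1), (2, 4), (3, 2), (4, 8), (5, 3), (1.5, 6)]]
a, b, c = fit(data)
print(round(a, 1), round(b, 1), round(c, 1))  # 5.1 2.5 12.3
```

The point of the next slides is that even a perfect fit like this one captures only correlations in the observed data, not the causal mechanism.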

Slide 20

Performance influence models suffer from several shortcomings: • Incorrect explanations and unreliable predictions. • Non-transferable across environments.

Slide 21

Performance influence models might be unreliable. [Scatter plot: Throughput (FPS) vs. Cache Misses, with an upward trend.] Increasing Cache Misses increases Throughput.

Slide 22

Performance influence models might be unreliable. [Same scatter plot.] Increasing Cache Misses increases Throughput. This is counter-intuitive: more cache misses should reduce Throughput, not increase it. Purely statistical models built on this data will be unreliable.

Slide 23

Performance influence models might be unreliable. [Scatter plots: pooled data vs. data segregated by cache policy (LRU, FIFO, LIFO, MRU).] Segregating the data by Cache Policy indicates that, within each group, an increase in Cache Misses results in a decrease in Throughput.
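This reversal is an instance of Simpson's paradox, and it is easy to reproduce on synthetic data. The numbers below are made up purely for illustration: within every cache policy more misses mean lower throughput, yet the pooled regression slope is positive because policies with more misses happen to have higher baseline throughput.

```python
# Sketch (synthetic data): Simpson's paradox with cache policies.
# Within each policy, more cache misses -> lower throughput, but the
# pooled data shows the opposite trend.

def slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

policies = ["LRU", "FIFO", "LIFO", "MRU"]
groups = {}
for g, policy in enumerate(policies):
    xs = [100 * g + 10 * i for i in range(10)]              # cache misses
    ys = [10 * (g + 1) - 0.05 * (x - 100 * g) for x in xs]  # throughput
    groups[policy] = (xs, ys)

within = {p: slope(xs, ys) for p, (xs, ys) in groups.items()}
all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]
pooled = slope(all_x, all_y)

print(all(s < 0 for s in within.values()))  # True: misses hurt FPS per policy
print(pooled > 0)                           # True: pooled trend is reversed
```

A model fit only on the pooled data would learn the wrong sign for the effect of cache misses, which is exactly the unreliability the slide describes.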

Slide 24

Performance influence models are not transferable. DeepStream (environment: TX2): Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize. DeepStream (environment: Xavier): Throughput = 2 × Bitrate + 1.9 × BatchSize + 1.8 × BufferSize + 0.5 × EnablePadding + 5.9 × Bitrate × BufferSize + 6.2 × Bitrate × EnablePadding + 4.1 × Bitrate × BufferSize × EnablePadding. Each term in the regression equations is considered a predictor.

Slide 25

Performance influence models are not transferable: they change significantly in new environments, resulting in lower accuracy. [Same TX2 vs. Xavier regression equations.]

Slide 26

Performance influence models are not transferable: they cannot be reliably used across environments. [Same TX2 vs. Xavier regression equations.]

Slide 27

Outline: Motivation, Causal Inference, UNICORN, Results

Slide 28

Our key idea: build a causal performance model instead of a performance influence model. It expresses the relationships between interacting variables (configuration options, system events, non-functional properties) as a causal graph whose edges give the direction of causality, e.g., Cache Policy → Cache Misses → Throughput. [Scatter plot of Throughput (FPS) vs. Cache Misses.]

Slide 29

Why a causal performance model? To build reliable models that produce correct explanations. [Pooled and cache-policy-segregated scatter plots as before.] Cache Policy affects Throughput via Cache Misses; causal performance models recover the correct interactions: Cache Policy → Cache Misses → Throughput.

Slide 30

Why causal performance models? To reuse them when the system environment changes: causal models remain relatively stable. [Partial causal performance models in Jetson TX2 and Jetson Xavier over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, Cycles, FPS, and Energy.]

Slide 31

How to use causal performance models? First question: how do we generate a causal performance model? [Example graph: Cache Policy → Cache Misses → Throughput.]

Slide 32

How to use causal performance models? First question: how do we generate a causal performance model? Second question: how do we use the causal performance model for performance tasks? [Example graph: Cache Policy → Cache Misses → Throughput.]

Slide 33

Outline: Motivation, Causal Inference, UNICORN, Results

Slide 34

UNICORN: End-to-end Pipeline. Stages: I. Specify performance query; II. Learn causal performance model; III. Iterative sampling (determine next configuration); IV. Update causal performance model; V. Estimate causal queries. [Diagram: a performance fault/issue in the system stack (software: DeepStream; middleware: TF, TensorRT; hardware: Nvidia Xavier; configuration: default) is expressed as a query (QoS: Th > 40/s; observed: Th < 30/s ± 5/s). Initial performance data seeds the causal performance model; the causal inference engine iteratively determines the next configuration and updates the model; causal queries such as P(Th > 40/s | do(BufferSize = 6k)), i.e., the probability of satisfying the QoS if BufferSize is set to 6k, answer the performance tasks: What is the root cause of the fault? How do I fix the misconfiguration? How do I optimize performance? How do I understand performance?]

Slide 35

Stage I: Specify performance query. Example query: "What are the root causes of my performance fault, and how can I improve performance by 70%?"

Slide 36

Stage I: Specify performance query. Example query: "What are the root causes of my performance fault, and how can I improve performance by 70%?" The query engine extracts meaningful information from the query (e.g., a 70% gain is expected) that is useful for the subsequent stages of a performance task.

Slide 37

Stage II: Learn causal performance model. 1. Recovering the skeleton: start from a fully connected graph over the options, events, and objectives (Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No. of Cycles, FPS, Energy), subject to constraints (e.g., no connections between configuration options). [Observational data table as before.]

Slide 38

Stage II: Learn causal performance model. 1. Recovering the skeleton: a fully connected graph given constraints (e.g., no connections between configuration options). 2. Pruning the causal model: statistical independence tests. [Observational data table as before.]

Slide 39

Stage II: Learn causal performance model. 1. Recovering the skeleton: a fully connected graph given constraints (e.g., no connections between configuration options). 2. Pruning the causal model: statistical independence tests. 3. Orienting causal relations: orientation rules and measures plus constraints (colliders, v-structures), yielding a Partial Ancestral Graph (PAG) over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No. of Cycles, FPS, and Energy. [Observational data table as before.]
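The pruning step can be illustrated very roughly as follows. Real constraint-based discovery algorithms (the FCI family) use proper conditional-independence tests; here a plain Pearson correlation with a fixed threshold stands in as a marginal independence test, and the data, variable names, and threshold are all made up for the sketch.

```python
import math
from itertools import combinations

def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic observations: Bitrate drives Cache Misses, which drive FPS;
# Enable Padding is unrelated to the rest (all made up for illustration).
bitrate = list(range(1, 21))
padding = [i % 2 for i in range(20)]
misses = [2 * b for b in bitrate]
fps = [30 - 0.5 * m for m in misses]

data = {"Bitrate": bitrate, "EnablePadding": padding,
        "CacheMisses": misses, "FPS": fps}

# Start from the fully connected skeleton and drop edges between
# variables that look independent. Note this keeps the mediated
# Bitrate--FPS edge; only conditional tests (given CacheMisses)
# would remove it, which this marginal sketch omits.
THRESHOLD = 0.3
kept = {frozenset(pair) for pair in combinations(data, 2)
        if abs(corr(data[pair[0]], data[pair[1]])) >= THRESHOLD}

print(sorted(sorted(e) for e in kept))
```

On this data all three EnablePadding edges are pruned, while the Bitrate, CacheMisses, and FPS edges survive for the orientation step.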

Slide 40

Stage II: Learn causal performance model. The result so far is a Partial Ancestral Graph (PAG). A PAG can have three types of edges between any nodes X and Y: X → Y (X is a parent of Y); X ↔ Y (a confounder exists between X and Y); and a circle-marked edge, meaning there is not sufficient data to recover the causal direction (either X → Y or X ↔ Y).



Slide 43

Stage II: Learn causal performance model. 1. Recovering the skeleton: a fully connected graph given constraints (e.g., no connections between configuration options). 2. Pruning the causal model: statistical independence tests. 3. Orienting causal relations: orientation rules and measures plus constraints (colliders, v-structures), yielding a Partial Ancestral Graph (PAG). 4. Refining causal directions: latent search and entropy, yielding an Acyclic Directed Mixed Graph (ADMG). [Observational data table and graphs over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No. of Cycles, FPS, and Energy.]

Slide 44

Stage III: Iterative sampling. From the causal performance model, select the top K paths using the Average Causal Effect (ACE); estimate the Individual Causal Effect (ICE) on the selected subsection of the model (e.g., Bitrate → Branch Misses → FPS) to recommend a configuration; perform an interventional measurement; and add the result to the observational data.

Slide 45

Why select the top K paths? In real-world cases the causal graphs can be very complex, and it may be intractable to reason over the entire graph directly. [A real-world causal graph for a data analytics pipeline.]

Slide 46

Extracting causal paths from the causal model: extract paths that always begin with a configuration option or a system event and always terminate at a performance objective, e.g., Bitrate → Branch Misses → FPS.

Slide 47

Ranking causal paths from the causal model. There may be too many causal paths; we need to select the most useful ones, so we compute the Average Causal Effect (ACE) of each pair of neighbors in a path, e.g., for Bitrate → Branch Misses → FPS:

ACE(BranchMisses, Bitrate) = (1/N) Σ [ E(BranchMisses | do(Bitrate = b)) − E(BranchMisses | do(Bitrate = a)) ]

Here E(BranchMisses | do(Bitrate = b)) is the expected value of Branch Misses when we artificially intervene by setting Bitrate to the value b (and likewise for a), and the average runs over all permitted values of Bitrate. If this difference is large, then small changes to Bitrate will cause large changes to Branch Misses.

Slide 48

Ranking causal paths from the causal model. Average the ACE of all pairs of adjacent nodes in the path, e.g., PACE(Z, Y) = ½ (ACE(Z, X) + ACE(X, Y)), summing over all pairs of adjacent nodes in the causal path. Rank paths from the highest path ACE (PACE) score to the lowest, and use the top K paths for subsequent analysis. [Example path: Bitrate → Branch Misses → FPS.]
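The ACE and PACE computations can be sketched as below. The `intervene` function is a made-up stand-in for the interventional expectation E(child | do(parent = v)); averaging differences over consecutive permitted values is one reading of the slide's sum, and taking absolute values in PACE (so opposite-signed effects do not cancel) is an added assumption.

```python
# Sketch of ACE/PACE ranking for the path Bitrate -> BranchMisses -> FPS.

def intervene(child, parent, value):
    """Toy stand-in for E(child | do(parent = value)); made-up responses."""
    responses = {
        ("BranchMisses", "Bitrate"): lambda v: 3.0 * v + 5.0,
        ("FPS", "BranchMisses"): lambda v: 40.0 - 0.5 * v,
    }
    return responses[(child, parent)](value)

def ace(child, parent, values):
    """Average Causal Effect: mean difference of interventional
    expectations between consecutive permitted parent values."""
    diffs = [intervene(child, parent, b) - intervene(child, parent, a)
             for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)

def pace(path, permitted):
    """Path ACE: average of the (absolute) ACEs of adjacent node pairs."""
    pairs = list(zip(path, path[1:]))
    return sum(abs(ace(child, parent, permitted[parent]))
               for parent, child in pairs) / len(pairs)

path = ["Bitrate", "BranchMisses", "FPS"]
permitted = {"Bitrate": [1, 2, 4, 8], "BranchMisses": [10, 20, 40]}
print(pace(path, permitted))  # 7.25
```

With several candidate paths, one would compute each PACE this way, sort descending, and keep the top K for the counterfactual reasoning that follows.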

Slide 49

How to reason over a path? To reason, we evaluate counterfactual queries, formulated from the configuration options and performance objectives in a particular path, to resolve a particular performance task.

Slide 50

Counterfactual queries. Counterfactual inference asks "what if" questions about changes to the misconfigurations. Example: "Given that my current Bitrate is 6000 and I have low throughput, what is the probability of having low throughput if my Bitrate is increased to 10000?" We are interested in the scenario where we hypothetically have low throughput, conditioned on the following events: we hypothetically set the new Bitrate to 10000; Bitrate was initially set to 6000; we observed low throughput when Bitrate was set to 6000; and everything else remains the same.
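In standard counterfactual notation, this query can be written as follows (a sketch: the subscript denotes the hypothetical world in which Bitrate is set to the new value, while the conditioning side records what was actually observed):

```latex
P\big( \text{Throughput}_{\,\mathit{Bitrate}=10000} = \text{low}
   \;\big|\; \mathit{Bitrate} = 6000,\; \text{Throughput} = \text{low} \big)
```

A low value of this probability is evidence that increasing Bitrate would fix the observed misconfiguration.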

Slide 51

Selecting the configuration for the next intervention: take the top K paths and enumerate all possible changes (set every configuration option in the path to all permitted values); compute the ICE of each change, inferred from observational data, which is very cheap; and pick the change with the largest ICE. [Example path: Bitrate → Branch Misses → FPS.]
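Enumerating the candidate changes and picking the one with the largest ICE can be sketched as follows. The `ice` function below is a made-up stand-in for the estimate inferred from observational data, and the option names and permitted values are illustrative.

```python
# Sketch: choose the next intervention as the candidate change with
# the largest individual causal effect (ICE) on the objective.

def ice(option, value):
    """Toy stand-in for the ICE on FPS inferred from observational data."""
    effects = {
        "Bitrate": lambda v: 0.002 * v,      # made up
        "BufferSize": lambda v: 0.0005 * v,  # made up
    }
    return effects[option](value)

# Permitted values of every option appearing in the top K paths.
permitted = {"Bitrate": [1000, 6000, 10000], "BufferSize": [2000, 20000]}

candidates = [(opt, val) for opt, vals in permitted.items() for val in vals]
best = max(candidates, key=lambda change: ice(*change))
print(best)  # ('Bitrate', 10000)
```

Only the winning change is then actually measured on the real system, which is what makes the loop sample-efficient: the expensive measurement is spent on the most promising intervention.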

Slide 52

Selecting the configuration for the next intervention: measure performance for the change with the largest ICE. Is the query satisfied? Yes: terminate. No: proceed to the next stage.

Slide 53

Stage IV: Update causal performance model. 1. Evaluate candidate interventions (interventions on hardware, workload, and kernel options: intervention 1 ... intervention n) using the expected change in belief and KL divergence, plus causal effects on the objectives. 2. Determine and perform the next performance measurement (e.g., Bitrate = 1k, Buffer Size = 20k, Batch Size = 10, Enable Padding = 1, yielding Branch Misses = 24m, Cache Misses = 42m, No. of Cycles = 73b, FPS = 31/s, Energy = 42 J). 3. Update the causal performance model with the new performance data (belief update over the prior belief; model averaging). 4. Replace the causal performance model.

Slide 54

Stage V: Estimate causal queries. Use do-calculus to evaluate causal queries, e.g., P(Throughput > 40/s | do(BufferSize = 20000)): estimate the probability of satisfying the QoS given BufferSize = 20000. Budget and additional constraints are also estimated.
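One standard way do-calculus reduces such an interventional query to purely observational quantities is the back-door adjustment (a sketch: Z must be a valid adjustment set in the learned graph, i.e., it blocks all back-door paths from BufferSize to Throughput):

```latex
P\big(Y \mid do(X = x)\big)
  \;=\; \sum_{z} P\big(Y \mid X = x,\, Z = z\big)\, P(Z = z)
```

Here X is BufferSize, Y is the event Throughput > 40/s, and the right-hand side can be estimated from the accumulated performance data without further interventions.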

Slide 55

Outline: Motivation, Causal Inference, UNICORN, Results

Slide 56

Experimental setup: systems, workloads, hardware. Hardware: Nvidia TX1 (CPU: 4 cores, 1.3 GHz; GPU: 128 cores, 0.9 GHz; memory: 4 GB, 25 GB/s); Nvidia TX2 (CPU: 6 cores, 2.0 GHz; GPU: 256 cores, 1.3 GHz; memory: 8 GB, 58 GB/s); Nvidia Xavier (CPU: 8 cores, 2.26 GHz; GPU: 512 cores, 1.3 GHz; memory: 16 GB, 137 GB/s). Systems and workloads: Xception, image recognition (50,000 test images); DeepSpeech, voice recognition (5-second audio clip); BERT, sentiment analysis (10,000 IMDb reviews); x264, video encoder (11 MB, 1080p video).

Slide 57

Experimental setup: baselines. [Tables of optimization baselines and debugging baselines.]

Slide 58

Results: Efficiency. UNICORN finds the root causes accurately.

Slide 59

Results: Efficiency. UNICORN finds the root causes accurately and achieves higher gain.

Slide 60

Results: Efficiency. UNICORN finds the root causes accurately, achieves higher gain, and performs these tasks much faster. Takeaway: UNICORN achieves higher sample efficiency than other baselines.

Slide 61

Results: Transferability. [Plot: gain % (0-90) vs. workload size (10k, 20k, 50k) for Unicorn + 20%, Unicorn + 10%, Unicorn (Reuse), Smac + 20%, Smac + 10%, and Smac (Reuse).] UNICORN finds configurations with higher gain when the workload changes.

Slide 62

Results: Transferability. [Same gain vs. workload-size plot.] Takeaway: UNICORN can be effectively reused in new environments for different performance tasks.

Slide 63

Results: Scalability. Discovery time, query-evaluation time, and total time do not increase exponentially as the number of configuration options and system events increases.

Slide 64

Results: Scalability. Causal graphs are sparse.

Slide 65

Results: Scalability. Takeaway: UNICORN is scalable to larger multi-component systems and systems with large configuration spaces.

Slide 66

Causal reasoning enables more reliable performance analyses and more transferable performance models. [Pipeline: Decoder, Muxer, Detector, Tracker, Classifier.]

Slide 67

Causal reasoning enables more reliable performance analyses and more transferable performance models. [Recap: pooled vs. cache-policy-segregated scatter plots, the causal graph Cache Policy → Cache Misses → Throughput, and the pipeline Decoder, Muxer, Detector, Tracker, Classifier.]

Slide 68

Causal reasoning enables more reliable performance analyses and more transferable performance models. [Recap figures as before, plus the UNICORN end-to-end pipeline diagram.]

Slide 69

Causal reasoning enables more reliable performance analyses and more transferable performance models. [Closing slide repeating the recap figures.]

Slide 70


https://github.com/softsys4ai/UNICORN

Slide 71

Supplementary Slides

Slide 72

Maintaining performance in a highly configurable system is challenging. • The configuration space is combinatorially large, with 1000s of configuration options.

Slide 73

Maintaining performance in a highly configurable system is challenging. • The configuration space is combinatorially large, with 1000s of configuration options. • Configuration options from each component interact non-trivially with one another.

Slide 74

Maintaining performance in a highly configurable system is challenging. • The configuration space is combinatorially large, with 1000s of configuration options. • Configuration options from each component interact non-trivially with one another. • Individual component developers have a localized and limited understanding of the performance behavior of these systems.

Slide 75

Maintaining performance in a highly configurable system is challenging. • The configuration space is combinatorially large, with 1000s of configuration options. • Configuration options from each component interact non-trivially with one another. • Individual component developers have a localized and limited understanding of the performance behavior of these systems. • Each deployment needs to be configured correctly, which is prone to misconfigurations.

Slide 76

Maintaining performance in a highly configurable system is challenging. • The configuration space is combinatorially large, with 1000s of configuration options. • Configuration options from each component interact non-trivially with one another. • Individual component developers have a localized and limited understanding of the performance behavior of these systems. • Each deployment needs to be configured correctly every time an environmental change occurs. Incorrect understanding of the performance behavior often leads to misconfiguration.

Slide 77

Building the configuration space of a highly configurable system: C = O1 × O2 × O3 × ... × On, with options such as Batch Size, Interval, Enable Past Frame, and Presets.

Slide 78

A modern computer system is composed of multiple components: Video Decoder (55 configuration options), Stream Muxer (86), Primary Detector (14), Object Tracker (44), Secondary Classifier (86), ... forming the composed system.

Slide 79

Building the configuration space of a highly configurable system: C = O1 × O2 × O3 × ... × On, with options such as Batch Size, Interval, Enable Past Frame, and Presets. A configuration is one point in this space, e.g., c1 ∈ C, c1 = False × 20 × 5 × ... × True.

Slide 80

Building the configuration space of a highly configurable system: C = O1 × O2 × O3 × ... × On, with options such as Batch Size, Interval, Enable Past Frame, and Presets; a configuration is one point in this space, e.g., c1 ∈ C, c1 = False × 20 × 5 × ... × True. Performance objectives are functions over configurations, e.g., f1(c1) = 32/second (Throughput) and f2(c1) = 63.6 Joules (Energy).
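The cross-product construction C = O1 × ... × On can be sketched with `itertools.product`. The option names and value sets below are illustrative (loosely following the slide), not the real domains of any system:

```python
from itertools import product

# Illustrative option domains (made up, loosely following the slide).
options = {
    "enable_past_frame": [False, True],
    "interval": [10, 20, 30],
    "batch_size": [5, 10],
    "presets": [True, False],
}

# C = O1 x O2 x ... x On: every combination of values is one configuration.
configs = [dict(zip(options, combo)) for combo in product(*options.values())]

print(len(configs))  # 24 = 2 * 3 * 2 * 2
```

Because |C| is the product of the domain sizes, adding options multiplies the space; with the 285 options from the motivating pipeline, materializing `configs` like this would be impossible, which is why UNICORN samples the space instead.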

Slide 81

Results: Transferability (hardware change). [Plots: gain % vs. workload size (10k, 20k, 50k) for Unicorn and Smac variants; accuracy, precision, recall, and gain (%), plus time (hours), for Unicorn (Reuse), Unicorn + 25, Unicorn (Rerun), Bugdoc (Reuse), Bugdoc + 25, and Bugdoc (Rerun).] UNICORN quickly fixes the bug and achieves higher gain, accuracy, precision, and recall when the hardware changes.

Slide 82

Why causal inference? Accurate across environments. [Figure comparing a performance influence model and a causal performance model over 50 regression models: common terms (source → target), total terms (source, target), and error (MAPE %) in source, target, and source → target; annotations: "Common predictors are large", "Common predictors are lower in number".]

Slide 83

Why causal inference? Accurate across environments. [Same figure, annotated with "Low error when reused" and "High error when reused".] Takeaway: causal models can be reliably reused when environmental changes occur.

Slide 84

Why causal inference? Generalizability. [Same figure.] Takeaway: causal models are more generalizable than performance influence models.

Slide 85

Future work: determining more accurate causal graphs by incorporating domain knowledge. [A domain expert uses background knowledge to correct the causal performance model learned from observational data.]

Slide 86

Future work: developing new domain-specific languages for performance query specification. [Pipeline: the end user poses unstructured performance queries; semantic analysis and a query engine extract useful information.]