Slide 1

Slide 1 text

Causal AI for Systems: Learning Causal Performance Models for Conducting Performance Tasks in a Principled and Transferable Fashion
Pooyan Jamshidi

Slide 2

Slide 2 text

It is all about teamwork; I played a very minor role.

Slide 3

Slide 3 text

Artificial Intelligence and Systems Laboratory (AISys Lab)
Research areas: Machine Learning, Computer Systems, Autonomy, AI/ML Systems
https://pooyanjamshidi.github.io/AISys/
Members: Ying Meng (PhD student), Shuge Lei (PhD student), Kimia Noorbakhsh (Undergrad), Shahriar Iqbal (PhD student), Jianhai Su (PhD student), M.A. Javidian (postdoc), Fatemeh Ghofrani (PhD student), Abir Hossen (PhD student), Hamed Damirchi (PhD student), Mahdi Sharifi (PhD student), Mahdi Sharifi (Intern)
Sponsors, thanks!

Slide 4

Slide 4 text

Collaborators (Causal AI)
Research threads: Causal AI for Systems; Causal AI for Robot Learning (Causal RL + Transfer Learning + Robotics); Theory of Causal AI; Causal AI for Adversarial ML; Causal Representation Learning
People: Rahul Krishna (Columbia), Shahriar Iqbal (UofSC), M. A. Javidian (Purdue), Baishakhi Ray (Columbia), Christian Kästner (CMU), Sven Apel (Saarland), Marco Valtorta (UofSC), Madelyn Khoury (REU student), Forest Agostinelli (UofSC), Abir Hossen (UofSC), Ahana Biswas (IIT), Om Pandey (KIIT), Hamed Damirchi (UofSC), Ying Meng (UofSC), Fatemeh Ghofrani (UofSC), Mahdi Sharifi (UofSC), Sugato Basu (Google AdsAI), Garima Pruthi (Google AdsAI)

Slide 5

Slide 5 text

Outline: Motivation, Case Study, Causal AI for Systems, Results, Future Directions (current section: Motivation)

Slide 6

Slide 6 text

Goal: Enable developers/users to find the right quality tradeoff

Slide 7

Slide 7 text

Today’s most popular systems are configurable

Slide 8

Slide 8 text


Slide 9

Slide 9 text

Empirical observations confirm that systems are becoming increasingly configurable
[Figure: number of parameters vs. release time; visible panels: Apache, Hadoop (MapReduce, HDFS)]
[Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]

Slide 10

Slide 10 text

Empirical observations confirm that systems are becoming increasingly configurable
[Figure: number of parameters vs. release time for Storage-A, MySQL, Apache, and Hadoop (MapReduce, HDFS), shown over an excerpt of the paper's first page]
[Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]

Slide 11

Slide 11 text

Today’s most popular systems are complex! multiscale, multi-modal, and multi-stream
Variability Space = Configuration Space + System Architecture + Deployment Environment
# Configuration Options per component: Video Decoder: 55, Stream Muxer: 86, Primary Detector: 14, Object Tracker: 44, Secondary Classifier: 86

Slide 12

Slide 12 text

Configurations determine the performance behavior
    void Parrot_setenv(. . . name,. . . value){
    #ifdef PARROT_HAS_SETENV
      my_setenv(name, value, 1);
    #else
      int name_len=strlen(name);
      int val_len=strlen(value);
      char* envs=glob_env;
      if(envs==NULL){
        return;
      }
      strcpy(envs,name);
      strcpy(envs+name_len,"=");
      strcpy(envs+name_len + 1,value);
      putenv(envs);
    #endif
    }
    #ifdef LINUX
    extern int Parrot_signbit(double x){
[Figure: the compile-time options PARROT_HAS_SETENV and LINUX select which code is built, which affects Speed and Energy]

Slide 13

Slide 13 text

Performance distributions are multi-modal and have long tails
 • Certain configurations can cause performance to take abnormally large values
 • Faulty configurations take the tail values (worse than 99.99th percentile)
 • Certain configurations can cause faults on multiple performance objectives.

Slide 14

Slide 14 text

Misconfiguration and its Effects
● Misconfigurations can elicit unexpected interactions between software and hardware
● These can result in non-functional faults
  ○ Affecting non-functional system properties like latency, throughput, energy consumption, etc.
The system doesn’t crash or exhibit an obvious misbehavior
Systems are still operational but with a degraded performance, e.g., high latency, low throughput, high energy consumption, high heat dissipation, or a combination of several

Slide 15

Slide 15 text

Motivating Example: CUDA performance issue on tx2
“When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases.”
• The user is transferring the code from one hardware to another
• The target hardware is faster than the source hardware. User expects the code to run at least 30-40% faster.
• The code ran 2x slower on the more powerful hardware

Slide 16

Slide 16 text

Motivating Example
Forum thread (date labels on the slide: June 3rd, June 4th, June 4th, June 5th):
• Any suggestions on how to improve my performance? Thanks!
• Please do the following and let us know if it works 1. Install JetPack 3.0 2. Set nvpmodel=MAX-N 3. Run jetson_clock.sh
• We have already tried this. We still have high latency. Any other suggestions?
• TX2 is pascal architecture. Please update your CMakeLists: + set(CUDA_STATIC_RUNTIME OFF) ... + -gencode=arch=compute_62,code=sm_62
The user had several misconfigurations
In Software: ✖ Wrong compilation flags ✖ Wrong SDK version
In Hardware: ✖ Wrong power mode ✖ Wrong clock/fan settings
The discussions took 2 days
How to resolve such issues faster?

Slide 17

Slide 17 text

Users want to understand the effect of configuration options

Slide 18

Slide 18 text

Outline: Motivation, Case Study, Causal AI for Systems, Results, Future Directions (current section: Case Study)

Slide 19

Slide 19 text

SocialSensor
• Identifying trending topics
• Identifying user defined topics
• Social media search

Slide 20

Slide 20 text

SocialSensor
[Figure: pipeline with components Crawling, Content Analysis, Search and Integration, and an Orchestrator; data-flow labels Internet, Fetch, Store, Push, Crawled items; volumes: Tweets: [5k-20k/min]; Every 10 min: [100k tweets]; Tweets: [10M]]

Slide 21

Slide 21 text

Challenges
[Figure: the SocialSensor pipeline (Crawling, Content Analysis, Search and Integration, Orchestrator; Tweets: [5k-20k/min]; Every 10 min: [100k tweets]; Tweets: [10M]) annotated with the challenges 100X, 10X, and Real time]

Slide 22

Slide 22 text

How can we gain better performance without using more resources?

Slide 23

Slide 23 text

Let’s try out different system configurations!

Slide 24

Slide 24 text

Opportunity: Data processing engines in the pipeline were all configurable (> 100 options in each of three engines; 2^300 configurations overall)

Slide 25

Slide 25 text

More combinations than estimated atoms in the universe
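A quick arithmetic check of this claim, as a sketch under two assumptions that are not stated on the slide: three engines with roughly 100 binary (on/off) options each, and the commonly cited rough estimate of 10^80 atoms in the observable universe.

```python
# Assumption: 3 engines, each with about 100 binary (on/off) options.
options_per_engine = 100
engines = 3
num_configurations = 2 ** (options_per_engine * engines)

# Commonly cited order-of-magnitude estimate, used only for comparison.
atoms_in_universe = 10 ** 80

print(len(str(num_configurations)))            # number of decimal digits
print(num_configurations > atoms_in_universe)  # is the space really larger?
```

Real options are often numeric or categorical rather than binary, so this count is, if anything, an underestimate.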

Slide 26

Slide 26 text

The default configuration is typically bad and the optimal configuration is noticeably better than median
[Figure: average write latency vs. throughput (ops/sec), marking the Default Configuration and the Optimal Configuration]
• Default is bad
• 2X-10X faster than worst
• Noticeably faster than median

Slide 27

Slide 27 text

Performance behavior varies in different environments

Slide 28

Slide 28 text

Outcomes: 100X more users; cloud resources reduced by 20%; outperformed expert recommendation

Slide 29

Slide 29 text

Outline: Motivation, Case Study, Causal AI for Systems, Results, Future Directions (current section: Causal AI for Systems)

Slide 30

Slide 30 text

Causal AI in Systems and Software
Areas: Computer Architecture, Database, Operating Systems, Programming Languages, BigData, Software Engineering
https://github.com/y-ding/causal-system-papers

Slide 31

Slide 31 text

Traditional Performance Model VS Causal Performance Model
Traditional performance model: Throughput = 9 × Bitrate + 2.1 × Buffersize − 4.4 × Bitrate × Buffersize × BatchSize
Causal performance model: [Figure: causal graph linking Software Options (Bitrate, Buffer Size, Batch Size, Enable Padding; components Decoder and Muxer) through Intermediate Causal Mechanisms (Branch Misses, Cache Misses, No. of Cycles) to Performance Objectives (Throughput, Energy); functional nodes f, f1, f2, f3, f4 mark Causal Interactions along Causal Paths]
Example mechanism: Branchmisses = 2 × Bitrate + 8.1 × Buffersize + 4.1 × Bitrate × Buffersize × Cachemisses

Slide 32

Slide 32 text

Critical Issues of Correlation-based Performance Analysis
• Performance influence models could produce unreliable predictions.
• Performance influence models could produce unstable predictions across environments and in the presence of measurement noise.
• Performance influence models could produce incorrect explanations.

Slide 33

Slide 33 text

Why Causal Inference? (Simpson’s Paradox)
Observation: Increasing GPU memory increases Latency
More GPU memory usage should reduce latency, not increase it. Counterintuitive!
Any ML/statistical models built on this data will be incorrect!

Slide 34

Slide 34 text

Why Causal Inference? (Simpson’s Paradox)
Segregate data on swap memory: available swap memory is reducing
GPU memory borrows memory from the swap for some intensive workloads. Other host processes may reduce the available swap, so little will be left for the GPU to use.
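The reversal can be reproduced on a few made-up numbers. The sketch below uses synthetic measurements (not data from the talk) grouped by available swap: the pooled trend of latency against GPU memory is positive, while the trend within every swap group is negative.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Synthetic data: available swap (Gb) -> list of (GPU memory %, latency in s).
data = {
    4: [(10, 20), (20, 18), (30, 16)],
    2: [(40, 40), (50, 38), (60, 36)],
    1: [(70, 60), (80, 58), (90, 56)],
}

# Trend when swap is ignored (all points pooled together).
pooled = [point for group in data.values() for point in group]
pooled_slope = slope([g for g, _ in pooled], [l for _, l in pooled])

# Trend inside each swap group.
group_slopes = {
    swap: slope([g for g, _ in pts], [l for _, l in pts])
    for swap, pts in data.items()
}

print(pooled_slope)   # positive: more GPU memory looks harmful
print(group_slopes)   # negative in every group: more GPU memory helps
```

A model fitted to the pooled data would learn the wrong sign; conditioning on swap memory recovers the expected direction.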

Slide 35

Slide 35 text

Why Causal Inference?
Real world problems can have 100s if not 1000s of interacting configuration options!
Manually understanding and evaluating each combination is impractical, if not impossible.

Slide 36

Slide 36 text

Causal Performance Models
Express the relationships between interacting variables as a causal graph
[Figure: causal graph over Load, Swap Mem., GPU Mem., and Latency; legend: Configuration option, System event, Non-functional property, Direction(s) of the causality]
• Latency is affected by GPU Mem., which in turn is influenced by swap memory
• External factors like resource pressure also affect swap memory

Slide 37

Slide 37 text

Causal Performance Models
[Figure: causal graph over Load, Swap Mem., GPU Mem., and Latency]
• How to construct this causal graph?
• If there is a fault in latency, how to diagnose and fix it?

Slide 38

Slide 38 text

UNICORN: Our Causal AI for Systems Method
• Build a Causal Performance Model that captures the interactions among options in the variability space using the observational performance data.
• Iteratively evaluate and update the causal performance model.
• Perform downstream performance tasks such as performance debugging & optimization using Causal Reasoning.

Slide 39

Slide 39 text

UNICORN: Our Causal AI for Systems Method
System under study: Software: DeepStream; Middleware: TF, TensorRT; Hardware: Nvidia Xavier; Configuration: Default
Performance tasks: Performance Debugging, Performance Optimization
Example performance queries:
• What is the root-cause of observed perf. fault?
• How do I fix the misconfig.?
• How can I improve throughput without sacrificing accuracy?
• How do I understand perf behavior?
Steps:
1- Specify Performance Query (QoS: Th > 40/s; Observed: Th < 30/s ± 5/s)
2- Learn Causal Performance Model (Performance Data → Causal Model)
3- Translate Perf. Query to Causal Queries
4- Estimate Causal Queries (Query Engine), e.g., estimate the probability of satisfying QoS if BufferSize is set to 6k: P(Th > 40/s | do(Buffersize = 6k))
5- Update Causal Performance Model: measure performance of the configuration(s) that maximize information gain
Budget Exhausted? Yes: stop. No: continue with the next measurement.
[Figure: latency (ms) vs. number of counters and number of splitters, cubic interpolation over finer grid]
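A minimal sketch of how an interventional query such as P(Th > 40/s | do(Buffersize = 6k)) could be estimated by simulation. The structural equations, coefficients, and noise levels below are invented for illustration; they are not the learned model from the talk, and UNICORN's actual estimator may differ.

```python
import random

def sample_throughput(rng, buffer_size_k):
    """Draw one throughput sample from a toy structural causal model.

    buffer_size_k is set directly, i.e. do(BufferSize = buffer_size_k):
    whatever would normally determine the buffer size is ignored.
    All mechanisms and numbers here are hypothetical.
    """
    load = rng.uniform(0.0, 1.0)                                     # workload pressure
    cache_misses = 50.0 - 4.0 * buffer_size_k + rng.gauss(0.0, 2.0)  # in millions
    return 60.0 - 0.5 * cache_misses - 10.0 * load + rng.gauss(0.0, 1.0)

def prob_qos(rng, n, buffer_size_k, threshold=40.0):
    """Monte Carlo estimate of P(Th > threshold | do(BufferSize = buffer_size_k))."""
    hits = sum(sample_throughput(rng, buffer_size_k) > threshold for _ in range(n))
    return hits / n

rng = random.Random(0)
p_do_6k = prob_qos(rng, 20000, buffer_size_k=6.0)
p_do_2k = prob_qos(rng, 20000, buffer_size_k=2.0)
print(p_do_6k, p_do_2k)
```

Comparing the two estimates answers the performance query: in this toy model a 6k buffer satisfies the QoS far more often than a 2k buffer.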

Slide 40

Slide 40 text

UNICORN: Our Causal AI for Systems Method
[Method overview figure, same content as Slide 39]

Slide 41

Slide 41 text

UNICORN: Our Causal AI for Systems Method
[Method overview figure, same content as Slide 39]

Slide 42

Slide 42 text

Learning Causal Performance Model
Input: performance data
        Bitrate (bits/s)   Enable Padding   …   Cache Misses   …   Throughput (fps)
  c1    1k                 1                …   42m            …   7
  c2    2k                 1                …   32m            …   22
  …     …                  …                …   …              …   …
  cn    5k                 0                …   12m            …   25
Steps:
1- Recovering the Skeleton: fully connected graph given constraints (e.g., no connections btw configuration options)
2- Pruning Causal Structure: statistical independence tests
3- Orienting Causal Relations: orientation rules & measures (entropy) + structural constraints (colliders, v-structures)
[Figure: the graph over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No of Cycles, FPS, and Energy after each step]
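The first two steps can be sketched as follows. This is a simplification under stated assumptions: a marginal Pearson-correlation threshold stands in for the statistical independence tests (constraint-based structure learning normally uses conditional independence tests), the data is made up, and edge orientation (step 3) is not shown.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def learn_skeleton(data, options, threshold=0.3):
    """data: column name -> values; options: names of configuration options."""
    # Step 1: fully connected undirected graph, minus option-option edges.
    edges = {frozenset(p) for p in combinations(data, 2) if not set(p) <= options}
    # Step 2: prune edges whose endpoints look independent (crude stand-in test).
    return {e for e in edges if abs(pearson(*(data[v] for v in e))) >= threshold}

# Hypothetical measurements of six configurations.
data = {
    "bitrate": [1, 2, 3, 4, 5, 6],
    "enable_padding": [0, 1, 0, 1, 0, 1],
    "cache_misses": [42, 38, 33, 29, 24, 20],
    "fps": [7, 10, 14, 17, 22, 25],
}
skeleton = learn_skeleton(data, options={"bitrate", "enable_padding"})
print(sorted(tuple(sorted(e)) for e in skeleton))
```

In this toy data, Enable Padding is only weakly related to the other columns, so its edges are pruned, while the edges among Bitrate, Cache Misses, and FPS survive.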

Slide 43

Slide 43 text

Performance measurement
Configuration Space: ℂ = O1 × O2 × ⋯ × O19 × O20 (options such as dead code removal, constant folding, loop unrolling, function inlining)
Example configuration: c1 = 0 × 0 × ⋯ × 0 × 1, c1 ∈ ℂ
Pipeline: Program → Compiler (e.g., SaC, LLVM), Configure → Compile → Compiled Code → Instrumented Binary → Deploy → Hardware
Non-functional measurable/quantifiable aspects:
• Compile time: fc(c1) = 11.1ms
• Execution time: fe(c1) = 110.3ms
• Energy: fen(c1) = 100mwh
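A small sketch of this setup, assuming all 20 options are binary (which the example c1 = 0 × 0 × ⋯ × 0 × 1 suggests but the slide does not state). The dictionary keys are hypothetical names; the three measured values are the ones shown for c1.

```python
from itertools import product

option_names = [f"O{i}" for i in range(1, 21)]   # O1 ... O20, assumed binary
space_size = 2 ** len(option_names)

def configurations():
    """Lazily enumerate every configuration as a tuple of 0/1 flags."""
    return product((0, 1), repeat=len(option_names))

# c1 = 0 x 0 x ... x 0 x 1: only the last option is enabled.
c1 = (0,) * 19 + (1,)

# Measurements for c1 as given on the slide (key names are illustrative).
measurements = {c1: {"compile_ms": 11.1, "exec_ms": 110.3, "energy_mwh": 100}}

print(space_size)   # size of the configuration space
```

Even this modest space has over a million configurations, which is why measuring every one of them is rarely feasible.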

Slide 44

Slide 44 text

Our setup for performance measurements

Slide 45

Slide 45 text

Hardware platforms in our experiments
The reason behind using different types of hardware platforms is that they exhibit different behaviors due to differences in terms of resources, their microarchitecture, etc.
Platforms: AWS DeepLens (cloud-connected device), System on Chip (SoC), Microcontrollers (MCUs)

Slide 46

Slide 46 text

Measuring performance for systems involves lots of challenges
Each hardware platform requires a different way of instrumentation, and obtaining clean measurements with the least amount of noise is the most challenging part of our experiments.

Slide 47

Slide 47 text

Learning Causal Performance Model
[Figure repeated from Slide 42: 1- Recovering the Skeleton, 2- Pruning Causal Structure, 3- Orienting Causal Relations]

Slide 48

Slide 48 text

Learning Causal Performance Model
[Figure repeated from Slide 42: 1- Recovering the Skeleton, 2- Pruning Causal Structure, 3- Orienting Causal Relations]

Slide 49

Slide 49 text

Learning Causal Performance Model
[Figure repeated from Slide 42: 1- Recovering the Skeleton, 2- Pruning Causal Structure, 3- Orienting Causal Relations]

Slide 50

Slide 50 text

Causal Performance Model
[Figure: causal graph linking Software Options (Bitrate, Buffer Size, Batch Size, Enable Padding; components Decoder and Muxer) through Perf. Events (Branch Misses, Cache Misses, No of Cycles) to Performance Objectives (Throughput, Energy); functional nodes f mark Causal Interactions along Causal Paths]
Example mechanism: Branchmisses = 2 × Bitrate + 8.1 × Buffersize + 4.1 × Bitrate × Buffersize × Cachemisses

Slide 51

Slide 51 text

UNICORN: Our Causal AI for Systems Method
[Method overview figure, same content as Slide 39]

Slide 52

Slide 52 text

Causal Debugging: An example of downstream performance task
Objective: Diagnose and fix the root-cause of misconfigurations that cause non-functional faults
Approach:
• Use causal models to model various cross-stack configuration interactions; and
• Counterfactual reasoning to recommend fixes for these misconfigurations

Slide 53

Slide 53 text

Causal Debugging
• What is the root-cause of my fault?
• How do I fix my misconfigurations to improve performance?
Workflow: Misconfiguration → Observational Data (about 25 sample configurations, training data) → Build Causal Graph → Extract Causal Paths → Rank Paths → Counterfactual Queries (what-if questions, e.g., what if the configuration option X was set to a value ‘x’?) → Best Query → Fault fixed?
• Yes: done
• No: update observational data and repeat

Slide 54

Slide 54 text

Extracting Causal Paths from the Causal Model
[Causal debugging workflow figure repeated from Slide 53]

Slide 55

Slide 55 text

Extracting Causal Paths from the Causal Model
Problem
✕ In real world cases, this causal graph can be very complex
✕ It may be intractable to reason over the entire graph directly
Solution
✓ Extract paths from the causal graph
✓ Rank them based on their Average Causal Effect on latency, etc.
✓ Reason over the top K paths

Slide 56

Slide 56 text

Extracting Causal Paths from the Causal Model
Extract paths from the causal graph over Load, Swap Mem., GPU Mem., and Latency. Each extracted path:
• Always begins with a configuration option or a system event
• Always terminates at a performance objective
[Figure: the causal graph and the paths extracted from it]
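Path extraction can be sketched as a depth-first enumeration of directed paths. The edge set below is an assumption made for illustration (the slide shows the nodes, but the exact edges are not recoverable from the text), and the function name is hypothetical.

```python
def extract_causal_paths(graph, sources, objectives):
    """Enumerate directed paths that start at a configuration option or
    system event (sources) and end at a performance objective."""
    paths = []

    def walk(node, path):
        if node in objectives:
            paths.append(path)
            return
        for child in graph.get(node, ()):
            if child not in path:          # guard against cycles
                walk(child, path + [child])

    for source in sources:
        walk(source, [source])
    return paths

# Assumed edges: parent -> list of children.
graph = {
    "Load": ["Swap Mem.", "GPU Mem."],
    "Swap Mem.": ["GPU Mem."],
    "GPU Mem.": ["Latency"],
}
paths = extract_causal_paths(graph, sources=["Load", "Swap Mem."], objectives={"Latency"})
for p in paths:
    print(" -> ".join(p))
```

Each resulting path begins at an option or event and terminates at Latency, matching the two rules above.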

Slide 57

Slide 57 text

Ranking Causal Paths from the Causal Model
● There may be too many causal paths
● We need to select the most useful ones
● Compute the Average Causal Effect (ACE) of each pair of neighbors in a path
[Figure: path over Swap Mem., GPU Mem., and Latency]
$$\mathrm{ACE}(\text{GPU Mem.}, \text{Swap}) = \frac{1}{N} \sum_{a, b \in Z} \mathbb{E}[\text{GPU Mem.} \mid do(\text{Swap} = b)] - \mathbb{E}[\text{GPU Mem.} \mid do(\text{Swap} = a)]$$
• First term: expected value of GPU Mem. when we artificially intervene by setting Swap to the value b
• Second term: expected value of GPU Mem. when we artificially intervene by setting Swap to the value a
• If this difference is large, then small changes to Swap Mem. will cause large changes to GPU Mem.
• The average is taken over all permitted values of Swap memory.
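A minimal sketch of the ACE computation. Two assumptions are made here and are not taken from the slide: the difference is taken in absolute value over unordered pairs of values (so that terms do not cancel), and N is the number of such pairs. The interventional expectation is a made-up linear function.

```python
from itertools import combinations

def average_causal_effect(expected_under_do, values):
    """expected_under_do(v) must return E[target | do(cause = v)]."""
    pairs = list(combinations(values, 2))
    total = sum(abs(expected_under_do(b) - expected_under_do(a)) for a, b in pairs)
    return total / len(pairs)

# Hypothetical: expected GPU memory use (%) when swap is forced to swap_gb.
def gpu_mem_given_swap(swap_gb):
    return 80.0 - 10.0 * swap_gb

ace = average_causal_effect(gpu_mem_given_swap, [1, 2, 4])
print(ace)
```

In practice the expectations would come from the learned causal performance model rather than from a closed-form function.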

Slide 58

Slide 58 text

Ranking Causal Paths from the Causal Model
● Average the ACE of all pairs of adjacent nodes in the path
● Rank paths from highest path ACE (PACE) score to the lowest
● Use the top K paths for subsequent analysis
$$\mathrm{PACE}(Z, Y) = \frac{1}{2}\left(\mathrm{ACE}(Z, X) + \mathrm{ACE}(X, Y)\right)$$
The sum runs over all pairs of adjacent nodes in the causal path.
[Figure: a path over nodes X, Y, Z, illustrated with GPU Mem., Latency, and Swap Mem.]
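The ranking step can be sketched as below. The ACE values are invented, the helper names are hypothetical, and the dictionary is keyed by (cause, effect) pairs in path order, which is a convention chosen here for readability.

```python
def path_ace(path, ace):
    """Average the ACE over each pair of adjacent nodes in the path."""
    pairs = list(zip(path, path[1:]))
    return sum(ace[pair] for pair in pairs) / len(pairs)

def top_k_paths(paths, ace, k):
    """Rank paths from highest to lowest PACE and keep the first k."""
    return sorted(paths, key=lambda p: path_ace(p, ace), reverse=True)[:k]

# Hypothetical ACE values for adjacent (cause, effect) pairs.
ace = {
    ("Swap Mem.", "GPU Mem."): 20.0,
    ("Load", "GPU Mem."): 4.0,
    ("GPU Mem.", "Latency"): 8.0,
}
paths = [
    ["Load", "GPU Mem.", "Latency"],
    ["Swap Mem.", "GPU Mem.", "Latency"],
]
best = top_k_paths(paths, ace, k=1)
print(best)
```

For a three-node path this reduces to the half-sum of two ACE terms, as in the PACE formula.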

Slide 59

Slide 59 text

Diagnosing and Fixing the Faults
[Causal debugging workflow figure repeated from Slide 53]

Slide 60

Slide 60 text

Diagnosing and Fixing the Faults
● Counterfactual inference asks “what if” questions about changes to the misconfigurations
Example: Given that my current swap memory is 2 Gb and I have high latency, what is the probability of having low latency if swap memory were increased to 4 Gb?
We are interested in the scenario where:
• We hypothetically have low latency;
Conditioned on the following events:
• We hypothetically set the new Swap memory to 4 Gb
• Swap Memory was initially set to 2 Gb
• We observed high latency when Swap was set to 2 Gb
• Everything else remains the same

Slide 61

Slide 61 text

Diagnosing and Fixing the Faults
Original path: Load, Swap, GPU Mem., Latency
Path after proposed change: Load, Swap = 4 Gb, GPU Mem., Latency (Low?)
• Remove incoming edges. Assume no external influence.
• Modify to reflect the hypothetical scenario
• Use both the models to compute the answer to the counterfactual question

Slide 62

Slide 62 text

Diagnosing and Fixing the Faults
Original path: Load, Swap, GPU Mem., Latency
Path after proposed change: Load, Swap = 4 Gb, GPU Mem., Latency
$$\text{Potential} = P\big(\widehat{\text{Latency}} = \text{low} \mid \widehat{\text{Swap}} = 4\,\text{Gb},\ \text{Swap} = 2\,\text{Gb},\ \text{Latency}_{\text{Swap} = 2\,\text{Gb}} = \text{high},\ U\big)$$
• \widehat{Latency} = low: we expect a low latency
• \widehat{Swap} = 4 Gb: the Swap is now 4 Gb
• Swap = 2 Gb: the Swap was initially 2 Gb
• Latency_{Swap = 2 Gb} = high: the latency was high
• U: everything else stays the same
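A deterministic sketch of the counterfactual computation, following the standard abduction, action, prediction steps on a toy structural equation. The equation and its coefficients are made up; the talk's query is probabilistic, whereas this simplified version returns a single counterfactual latency value.

```python
def latency_mechanism(swap_gb, u):
    """Toy structural equation (hypothetical): more swap lowers latency;
    u collects everything else that is unobserved."""
    return 100.0 - 15.0 * swap_gb + u

def counterfactual_latency(observed_swap, observed_latency, new_swap):
    # 1. Abduction: recover the unobserved term U consistent with the observation.
    u = observed_latency - latency_mechanism(observed_swap, 0.0)
    # 2. Action: replace Swap by the hypothetical value, do(Swap = new_swap).
    # 3. Prediction: re-evaluate the mechanism with the same U
    #    ("everything else stays the same").
    return latency_mechanism(new_swap, u)

# Observed: Swap = 2 Gb with latency 90 (arbitrary units). What if Swap had been 4 Gb?
cf = counterfactual_latency(observed_swap=2, observed_latency=90.0, new_swap=4)
print(cf)
```

Keeping U fixed between the observed and hypothetical worlds is what distinguishes this from simply predicting latency at Swap = 4 Gb.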

Slide 63

Slide 63 text

Diagnosing and Fixing the Faults
$$\text{Potential} = P\big(\widehat{\text{outcome}} = \text{good} \mid \widehat{\text{change}},\ \text{outcome}_{\neg \text{change}} = \text{bad},\ \neg \text{change},\ U\big)$$
Probability that the outcome is good after a change, conditioned on the past
$$\text{Control} = P\big(\widehat{\text{outcome}} = \text{bad} \mid \neg \text{change},\ U\big)$$
Probability that the outcome was bad before the change
$$\text{Individual Treatment Effect} = \text{Potential} - \text{Control}$$
If this difference is large, then our change is useful

Slide 64

Slide 64 text

Diagnosing and Fixing the Faults
1. Take the top K paths (e.g., the path over Swap Mem., GPU Mem., and Latency)
2. Enumerate all possible changes: set every configuration option in the path to all permitted values
3. Compute ITE(change) for each change; this is inferred from observed data and is very cheap!
4. Select the change with the largest ITE
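The selection step can be sketched as an exhaustive search over single-option changes. This is a simplification: it changes one option at a time rather than combinations, the option names and ITE scores are invented, and `ite` is a hypothetical callable standing in for the Potential minus Control estimate.

```python
def best_change(path_options, permitted_values, current_config, ite):
    """Try every permitted value of every option on the top-K paths and
    return the single (option, value) change with the largest ITE."""
    candidates = [
        (option, value)
        for option in path_options
        for value in permitted_values[option]
        if current_config[option] != value      # skip the current setting
    ]
    return max(candidates, key=lambda change: ite(*change))

# Hypothetical configuration, permitted values, and ITE estimates.
current_config = {"swap_gb": 2, "power_mode": "DEFAULT"}
permitted_values = {"swap_gb": [1, 2, 4], "power_mode": ["DEFAULT", "MAX-N"]}
scores = {("swap_gb", 1): -0.30, ("swap_gb", 4): 0.55, ("power_mode", "MAX-N"): 0.20}

change = best_change(
    path_options=["swap_gb", "power_mode"],
    permitted_values=permitted_values,
    current_config=current_config,
    ite=lambda option, value: scores[(option, value)],
)
print(change)
```

Because every ITE is estimated from already observed data, no new measurement is needed until the selected change is actually tried.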

Slide 65

Slide 65 text

Diagnosing and Fixing the Faults
Apply the change with the largest ITE, then Measure Performance. Fault fixed?
• Yes: done
• No:
  • Add to observational data
  • Update causal model
  • Repeat…

Slide 66

Slide 66 text

UNICORN: Our Causal AI for Systems Method
[Method overview figure, same content as Slide 39]

Slide 67

Slide 67 text

Active Learning for Updating Causal Performance Model
Steps:
1- Evaluate Candidate Interventions
2- Determine & Perform next Perf Measurement
3- Updating Causal Model
Annotations: Performance Data; Model averaging; Expected change in belief & KL; Causal effects on objectives; Interventions on Hardware, Workload, and Kernel Options
Example measurement:
  Option/Event/Obj   Value
  Bitrate            1k
  Buffer Size        20k
  Batch Size         10
  Enable Padding     1
  Branch Misses      24m
  Cache Misses       42m
  No of Cycles       73b
  FPS                31/s
  Energy             42J
[Figure: causal graph over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No of Cycles, FPS, and Energy, before and after the update]

Slide 68

Slide 68 text

Active Learning for Updating Causal Performance Model
[Figure repeated from Slide 67]

Slide 69

Slide 69 text

Active Learning for Updating Causal Performance Model
[Figure repeated from Slide 67]

Slide 70

Slide 70 text

Benefits of Causal Reasoning for System Performance Analysis

Slide 71

Slide 71 text

There are two fundamental benefits that we get by our “Causal AI for Systems” methodology
1. We learn one central (causal) performance model from the data across different performance tasks:
 • Performance understanding
 • Performance optimization
 • Performance debugging and repair
 • Performance prediction for different environments (e.g., canary -> production)
2. The causal model is transferable across environments.
 • We observed Sparse Mechanism Shift in systems too!
 • Alternative non-causal models (e.g., regression-based models for performance tasks) are not transferable as they rely on the i.i.d. setting.

Slide 72

Slide 72 text

Questions of this nature require precise mathematical language lest they be misleading. Here we are simultaneously conditioning on two values of GPU memory growth (i.e., $\hat{X} = 0.66$ and $X = 0.33$). Traditional machine learning approaches cannot handle such expressions. Instead, we must resort to causal models to compute them.

Slide 73

Slide 73 text

Difference between statistical (left) and causal models (right) on a given set of three variables
While a statistical model specifies a single probability distribution, a causal model represents a set of distributions, one for each possible intervention.

Slide 74

Slide 74 text

Independent Causal Mechanisms (ICM) Principle

Slide 75

Slide 75 text

Sparse Mechanism Shift (SMS) Hypothesis
Example of SMS hypothesis, where an intervention (which may or may not be intentional/observed) changes the position of one finger, and as a consequence, the object falls. The change in pixel space is entangled (or distributed), in contrast to the change in the causal model.

Slide 76

Slide 76 text

NeurIPS 2020 (ML For Systems), Dec 12th, 2020
https://arxiv.org/pdf/2010.06061.pdf
https://github.com/softsys4ai/CADET

Slide 77

Slide 77 text

Outline: Motivation, Case Study, Causal AI for Systems, Results, Future Directions (current section: Results)

Slide 78

Slide 78 text

Results: Case Study
“When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases.”
• The user is transferring the code from one hardware platform to another.
• The target hardware is faster than the source hardware.
• The user expects the code to run at least 30-40% faster.
• The code ran 2x slower on the more powerful hardware.

Slide 79

Slide 79 text

Results: Case Study
Embedded real-time stereo estimation (source code)
Nvidia TX1: CPU 4 cores, 1.3 GHz; GPU 128 cores, 0.9 GHz; Memory 4 GB, 25 GB/s; throughput 17 FPS
Nvidia TX2 (more powerful): CPU 6 cores, 2 GHz; GPU 256 cores, 1.3 GHz; Memory 8 GB, 58 GB/s; throughput 4 FPS
4× slower!

Slide 80

Slide 80 text

Results: Case Study
Configuration options changed, by method (CADET, Decision Tree, Forum):
• Marked for all three methods: CPU Cores, CPU Freq., EMC Freq., GPU Freq., CUDA_STATIC_RT
• Marked for only one method: Sched. Policy, Sched. Runtime, Sched. Child Proc, Dirty Bg. Ratio, Drop Caches, Swap Memory
Throughput (on TX2): CADET 26 FPS; Decision Tree 20 FPS; Forum 23 FPS
Throughput gain (over TX1): CADET 53%; Decision Tree 21%; Forum 39%
Time to resolve: CADET 24 min.; Decision Tree 3 1/2 hrs.; Forum 2 days
The user expected a 30-40% gain.
Results
• Finds the root causes accurately
• No unnecessary changes
• Better improvements than the forum’s recommendation
• Much faster

Slide 81

Slide 81 text

Evaluation: Experimental Setup
Hardware Systems
• Nvidia TX1: CPU 4 cores, 1.3 GHz; GPU 128 cores, 0.9 GHz; Memory 4 GB, 25 GB/s
• Nvidia TX2: CPU 6 cores, 2 GHz; GPU 256 cores, 1.3 GHz; Memory 8 GB, 58 GB/s
• Nvidia Xavier: CPU 8 cores, 2.26 GHz; GPU 512 cores, 1.3 GHz; Memory 32 GB, 137 GB/s
Software Systems
• Xception: image recognition (50,000 test images)
• DeepSpeech: voice recognition (5 sec. audio clip)
• BERT: sentiment analysis (10,000 IMDb reviews)
• x264: video encoder (11 Mb, 1080p video)
Configuration Space
• 30 configurations: 10 software, 10 OS/kernel, 10 hardware
• 17 system events

Slide 82

Slide 82 text

Evaluation: Data Collection
● For each software/hardware combination, create a benchmark dataset
○ Exhaustively set each configuration option to all permitted values.
○ For continuous options (e.g., GPU memory), sample 10 equally spaced values in [min, max]
● Measure the latency, energy consumption, and heat dissipation
○ Repeat 5x and average
Fault types: Multiple Faults, Latency Faults, Energy Faults
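The collection procedure above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' tooling: the option names and ranges in the usage note are hypothetical, the `measure` callback stands in for a real benchmark run, and the grid is assumed to be a full cross product, which the slide does not state explicitly.

```python
from itertools import product
from statistics import mean


def equally_spaced(lo, hi, n=10):
    """Return n equally spaced values covering [lo, hi], endpoints included."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def build_grid(discrete_options, continuous_ranges, n=10):
    """Enumerate configurations as dicts.

    discrete_options: name -> list of permitted values.
    continuous_ranges: name -> (min, max), discretized into n values.
    Assumption: the benchmark is a full cross product of all option values.
    """
    names = list(discrete_options) + list(continuous_ranges)
    values = list(discrete_options.values()) + [
        equally_spaced(lo, hi, n) for lo, hi in continuous_ranges.values()
    ]
    return [dict(zip(names, combo)) for combo in product(*values)]


def measure_averaged(measure, config, repeats=5):
    """Run the (hypothetical) measurement callback repeatedly and average."""
    return mean([measure(config) for _ in range(repeats)])
```

For example, `build_grid({"cpu_cores": [1, 2, 3, 4]}, {"gpu_memory_growth": (0.0, 1.0)})` yields 40 configurations, each of which would then be passed to `measure_averaged` once per metric (latency, energy, heat).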

Slide 83

Slide 83 text

Evaluation: Ground Truth
● For each performance fault:
○ Manually investigate the root cause
○ “Fix” the misconfigurations
● A “fix” implies the configuration no longer has tail performance
○ User-defined benchmark (i.e., 10th percentile)
○ Or some QoS/SLA benchmark
● Record the configurations that were changed
Fault types: Multiple Faults, Latency Faults, Energy Faults

Slide 84

Slide 84 text

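Evaluation: Metrics
Relevance Scores
Repair Quality:
$\mathit{Gain} = \frac{\mathrm{NFP}_{\mathrm{fault}} - \mathrm{NFP}_{\mathrm{repair}}}{\mathrm{NFP}_{\mathrm{fault}}} \times 100$
NFP = Non-Functional Property (e.g., latency, energy, etc.); $\mathrm{NFP}_{\mathrm{fault}}$ is the faulty value and $\mathrm{NFP}_{\mathrm{repair}}$ is the repair value.
The larger the gain, the better the repair.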
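As a small worked example of the repair-quality metric, the sketch below computes the gain for a property where lower is better, such as latency or energy. The function name and the sample numbers are illustrative and do not come from the evaluation.

```python
def repair_gain(nfp_fault, nfp_repair):
    """Percentage gain of a repair: (NFP_fault - NFP_repair) / NFP_fault * 100.

    Intended for non-functional properties where lower is better.
    """
    if nfp_fault == 0:
        raise ValueError("faulty value must be non-zero")
    return (nfp_fault - nfp_repair) / nfp_fault * 100.0


# A made-up repair that reduces latency from 200 ms to 150 ms.
print(repair_gain(200.0, 150.0))  # 25.0
```

A negative result would indicate that the "repair" made the property worse than the faulty configuration.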

Slide 85

Slide 85 text

Results: Research Questions
RQ1: How does CADET perform compared to model-based diagnostics?
RQ2: How does CADET perform compared to search-based optimization?

Slide 86

Slide 86 text

Results: Research Question 1 (single objective)
RQ1: How does CADET perform compared to model-based diagnostics?
Takeaways
• Finds the root causes accurately: more accurate than ML-based methods
• Better gain
• Much faster: up to 20x faster

Slide 87

Slide 87 text

Results: Research Question 1 (multi-objective)
RQ1: How does CADET perform compared to model-based diagnostics?
Multiple faults in latency and energy usage
Takeaways
• No deterioration of other performance objectives

Slide 88

Slide 88 text

Results: Research Questions
RQ1: How does CADET perform compared to model-based diagnostics?
RQ2: How does CADET perform compared to search-based optimization?

Slide 89

Slide 89 text

Results: Research Question 2
RQ2: How does CADET perform compared to search-based optimization?
Takeaways
• Better, with no deterioration of other performance objectives

Slide 90

Slide 90 text

Results: Research Question 2
RQ2: How does CADET perform compared to search-based optimization?
Takeaways
• Considerably faster than search-based optimization

Slide 91

Slide 91 text

Outline: Motivation, Case Study, Causal AI For Systems, Results, Future Directions

Slide 92

Slide 92 text

Causal AI for Serverless
• Evaluating our Causal AI for Systems methodology with serverless systems provides the following opportunities:
1. Dynamic system reconfigurations
• Dynamic placement of functions
• Dynamic reconfigurations of the network of functions
• Dynamic multi-cloud placement of functions
2. Root cause analysis of failures or QoS drops

Slide 93

Slide 93 text

Causal AI for Autonomous Robot Testing
• Testing cyber-physical systems such as robots is difficult. The key reason is that there are additional interactions with the environment and the task that the robot is performing.
• Evaluating our Causal AI for Systems methodology with autonomous robots provides the following opportunities:
1. Identifying difficult-to-catch bugs in robots
2. Identifying the root cause of an observed fault and repairing the issue automatically during mission time

Slide 94

Slide 94 text

Summary: Causal AI for Systems
1. Learning a Functional Causal Model for different downstream systems tasks
2. The learned causal model is transferable across different environments
Example system: Software: DeepStream; Middleware: TF, TensorRT; Hardware: Nvidia Xavier; Configuration: Default
[Figure: latency (ms) as a function of number of counters and number of splitters; cubic interpolation over finer grid]
Workflow
1- Specify Performance Query (QoS: Th > 40/s; Observed: Th < 30/s ± 5/s)
2- Learn Causal Performance Model (from performance data)
3- Translate Perf. Query to Causal Queries
4- Estimate Causal Queries (query engine): “Estimate probability of satisfying QoS if BufferSize is set to 6k?” becomes P(Th > 40/s | do(BufferSize = 6k))
5- Update Causal Performance Model: measure performance of the configuration(s) that maximize information gain
Budget Exhausted? (Yes / No)
Performance tasks: Performance Debugging, Performance Optimization
Example performance queries
• What is the root cause of the observed perf. fault?
• How do I fix the misconfig.?
• How can I improve throughput without sacrificing accuracy?
• How do I understand perf. behavior?
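To illustrate what estimating a query such as P(Th > 40/s | do(BufferSize = 6k)) involves, here is a minimal Monte Carlo sketch over a hand-written toy causal model. The mechanisms, coefficients, and the unobserved "load" variable are all assumptions made for this example. In the methodology summarized above the causal performance model is learned from performance data, which this sketch does not attempt.

```python
import random


def sample_throughput(rng, do_buffer_size=None):
    """Draw one throughput sample (per second) from a hypothetical SCM."""
    load = rng.gauss(0.0, 1.0)  # hypothetical workload intensity (a confounder)
    if do_buffer_size is None:
        # Observational regime: buffer size is itself driven by load.
        buffer_size = 4000.0 + 1000.0 * load
    else:
        # do(BufferSize = b): replace the mechanism that normally sets it.
        buffer_size = float(do_buffer_size)
    noise = rng.gauss(0.0, 5.0)
    # Hypothetical throughput mechanism; it is left untouched by the intervention.
    return 20.0 + 4.0 * buffer_size / 1000.0 - 3.0 * load + noise


def prob_qos_satisfied(do_buffer_size, threshold=40.0, n=20000, seed=0):
    """Estimate P(Th > threshold | do(BufferSize = do_buffer_size))."""
    rng = random.Random(seed)
    hits = sum(sample_throughput(rng, do_buffer_size) > threshold for _ in range(n))
    return hits / n


print(prob_qos_satisfied(6000))  # interventional estimate for BufferSize = 6k
print(prob_qos_satisfied(2000))  # much lower under this toy model
```

Because the intervention fixes the buffer size regardless of load, the estimate reflects the effect of the setting itself rather than the correlation between buffer size and workload seen in observational data.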

Slide 95

Slide 95 text

No content