patient waiting for a heart transplant. On 1 January, he received a new heart. Five days later, he died.
• Imagine that we can somehow know that, had Zeus not received a heart transplant on 1 January, he would have been alive five days later.
• All other things in his life being unchanged.
• Now, what do you think was the cause of Zeus’s death?
• Most people would agree that the transplant caused Zeus’s death.
• The intervention had a causal effect.
heart transplant on 1 January. Five days later she was alive.
• Again, imagine we can somehow know that, had Hera not received the heart on 1 January, she would still have been alive five days later.
• All other things in her life being unchanged.
• The transplant did not have a causal effect on Hera’s five-day survival.
that would have developed the outcome Y had all subjects in the population of interest received exposure value a.
• The exposure has a causal effect in the population if $\Pr[Y^{a=1} = 1] \neq \Pr[Y^{a=0} = 1]$, that is, if $\Pr[Y^{a=1} = 1] - \Pr[Y^{a=0} = 1] \neq 0$.
• Unlike individual causal effects, population causal effects can sometimes be computed or, more rigorously, consistently estimated.
effect measures cannot be directly computed because of missing data. However, effect measures can be computed/estimated in randomized experiments!
• Suppose we have a (near-infinite) population and that we flip a coin for each subject in that population. We assign the subject to group 1 if the coin turns tails, and to group 2 if it turns heads.
• Next we administer the treatment or exposure of interest (A = 1) to subjects in group 1 and placebo (A = 0) to those in group 2. Five days later, at the end of the study, we compute the mortality risks in each group, Pr[Y = 1|A = 1] and Pr[Y = 1|A = 0].
• When subjects are randomly assigned to groups 1 and 2, the proportion of deaths among the exposed, Pr[Y = 1|A = 1], will be the same whether subjects in group 1 receive the exposure and subjects in group 2 receive placebo, or vice versa.
• Because group membership is randomized, both groups are “comparable”: which particular group got the exposure is irrelevant for the value of Pr[Y = 1|A = 1]. (The same reasoning applies to Pr[Y = 1|A = 0].)
• Formally, we say that both groups are exchangeable.
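The coin-flip argument above can be checked with a small simulation. This is only a sketch: the population size, the two risks (0.3 and 0.1), and the function name `simulate_trial` are assumptions made for illustration, not values from the slides. Every simulated subject carries both potential outcomes, the coin decides which one is observed, and the observed risk in each group should land close to the corresponding counterfactual risk.

```python
import random

def simulate_trial(n=200_000, risk_treated=0.3, risk_untreated=0.1, seed=0):
    """Simulate a coin-flip randomized experiment on n subjects.

    risk_treated and risk_untreated are hypothetical values of
    Pr[Y^{a=1} = 1] and Pr[Y^{a=0} = 1], chosen only for illustration.
    Returns the observed risks (Pr[Y=1|A=1], Pr[Y=1|A=0]).
    """
    rng = random.Random(seed)
    deaths = {1: 0, 0: 0}
    counts = {1: 0, 0: 0}
    for _ in range(n):
        # Both potential outcomes exist for every subject...
        y_a1 = rng.random() < risk_treated
        y_a0 = rng.random() < risk_untreated
        # ...but the coin decides which one we get to observe.
        a = 1 if rng.random() < 0.5 else 0
        y = y_a1 if a == 1 else y_a0
        counts[a] += 1
        deaths[a] += y
    return deaths[1] / counts[1], deaths[0] / counts[0]

risk_a1, risk_a0 = simulate_trial()
causal_risk_difference = risk_a1 - risk_a0
```

Because assignment ignores the potential outcomes, swapping which group receives the exposure would leave the estimate of Pr[Y = 1|A = 1] essentially unchanged, which is what exchangeability expresses.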
[Figure: number of configuration parameters plotted against release time, one panel each for Storage-A, MySQL, Apache, and Hadoop (MapReduce and HDFS). Source: Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]
between software and hardware
• These can result in non-functional faults
◦ Affecting non-functional system properties like latency, throughput, energy consumption, etc.
• The system doesn’t crash or exhibit an obvious misbehavior
• Systems are still operational but with a degraded performance, e.g., high latency, low throughput, high energy consumption, high heat dissipation, or a combination of several
Motivating Example
“When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases.”
• The user is transferring the code from one hardware to another
• The target hardware is faster than the source hardware. User expects the code to run at least 30-40% faster.
• The code ran 2x slower on the more powerful hardware
Forum thread (June 4th to June 5th):
• User: Any suggestions on how to improve my performance? Thanks!
• Reply: Please do the following and let us know if it works
  1. Install JetPack 3.0
  2. Set nvpmodel=MAX-N
  3. Run jetson_clock.sh
• User: We still have high latency. Any other suggestions?
• Reply: TX2 is pascal architecture. Please update your CMakeLists:
  + set(CUDA_STATIC_RUNTIME OFF)
  ...
  + -gencode=arch=compute_62,code=sm_62
The user had several misconfigurations
In Software:
✖ Wrong compilation flags
✖ Wrong SDK version
In Hardware:
✖ Wrong power mode
✖ Wrong clock/fan settings
The discussions took 2 days! How to resolve such issues faster?
The default configuration is typically bad and the optimal configuration is noticeably better than median
[Figure: average write latency across configurations, with the Default Configuration and the Optimal Configuration marked]
• Default is bad
• 2X-10X faster than worst
• Noticeably faster than median
Performance influence models could produce unreliable predictions.
• Performance influence models could produce unstable predictions across environments and in the presence of measurement noise.
• Performance influence models could produce incorrect explanations.
Latency
More GPU memory usage should reduce latency, not increase it. Counterintuitive! Any ML or statistical model built on this data will be incorrect!
memory
Available swap memory is reducing
• GPU memory borrows memory from the swap for some intensive workloads.
• Other host processes may reduce the available swap.
• Little will be left for the GPU to use.
interacting variables as a causal graph
Causal Performance Models
• Latency is affected by GPU Mem., which in turn is influenced by swap memory
• External factors like resource pressure also affect swap memory
Legend: Configuration option; System event; Non-functional property; Direction(s) of the causality
UNICORN: Our Causal AI for Systems Method
options in the variability space using the observation performance data.
• Iterative causal performance model evaluation and model update
• Perform downstream performance tasks such as performance debugging & optimization using Causal Reasoning
[Figure: UNICORN pipeline, shown for Middleware: TF, TensorRT; Hardware: Nvidia Xavier; Configuration: Default. Inset plot: latency (ms) against number of counters and number of splitters, cubic interpolation over a finer grid.]
1- Specify Performance Query (QoS: Th > 40/s; Observed: Th < 30/s ± 5/s)
2- Learn Causal Performance Model from the performance data
3- Translate Perf. Query to Causal Queries
4- Estimate Causal Queries with the query engine, e.g., estimate the probability of satisfying QoS if BufferSize is set to 6k: P(Th > 40/s | do(Buffersize = 6k))
5- Update Causal Performance Model: measure performance of the configuration(s) that maximize information gain, and repeat until the budget is exhausted
Downstream tasks: Performance Debugging, Performance Optimization
Example performance queries:
• What is the root-cause of observed perf. fault?
• How do I fix the misconfig.?
• How can I improve throughput without sacrificing accuracy?
• How do I understand perf behavior?
Configuration Space: ⋯ × O19 × O20 (options such as dead code removal, constant folding, loop unrolling, function inlining)
Example configuration: c1 = 0 × 0 × ⋯ × 0 × 1, c1 ∈ ℂ
Pipeline: Program → Compiler (e.g., SaC, LLVM) → Compiled Code → Instrumented Binary → Hardware (steps: Configure, Compile, Deploy)
Non-functional measurable/quantifiable aspects of c1:
• Compile time: fc(c1) = 11.1 ms
• Execution time: fe(c1) = 110.3 ms
• Energy: fen(c1) = 100 mWh
different types of hardware platforms is that they exhibit different behaviors due to differences in terms of resources, their microarchitecture, etc.
Examples: AWS DeepLens (cloud-connected device), System on Chip (SoC), Microcontrollers (MCUs)
requires different ways of instrumentation, and obtaining clean measurements that contain the least amount of noise is the most challenging part of our experiments.
non-functional faults (Objective)
Causal Debugging: An example of downstream performance task
Approach:
• Use causal models to model various cross-stack configuration interactions; and
• Use counterfactual reasoning to recommend fixes for these misconfigurations
• What is the root-cause of my fault?
• How do I fix my misconfigurations to improve performance?
Workflow: Misconfiguration → Observational Data (about 25 sample configurations as training data) → Build Causal Graph → Extract Causal Paths → Rank Paths → Counterfactual Queries (what-if questions, e.g., what if the configuration option X was set to a value ‘x’?) → Best Query → Fault fixed? If yes, stop; if no, update the observational data and repeat.
Extracting Causal Paths from the Causal Model
In real world cases, this causal graph can be very complex
✕ It may be intractable to reason over the entire graph directly
Solution
✓ Extract paths from the causal graph
✓ Rank them based on their Average Causal Effect on latency, etc.
✓ Reason over the top K paths
Extract paths
• A causal path always begins with a configuration option or a system event
• It always terminates at a performance objective
[Figure: causal graph over Load, Swap Mem., GPU Mem., and Latency, and the causal paths extracted from it]
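The two rules above (start at a configuration option or system event, end at a performance objective) can be sketched as a depth-first enumeration over a directed graph. The graph below, the choice of source nodes, and the function name `extract_causal_paths` are assumptions made for illustration; they only loosely follow the running example and are not the tool's actual implementation.

```python
def extract_causal_paths(graph, sources, objectives):
    """Enumerate directed paths that start at a source node (configuration
    option or system event) and end at a performance objective.

    graph: dict mapping a node to the list of its direct effects.
    """
    paths = []

    def walk(node, path):
        if node in objectives:
            paths.append(path)
            return
        for child in graph.get(node, []):
            if child not in path:  # guard against cycles
                walk(child, path + [child])

    for source in sources:
        walk(source, [source])
    return paths

# Hypothetical graph: Load -> Swap Mem. -> GPU Mem. -> Latency.
graph = {
    "Load": ["Swap Mem."],
    "Swap Mem.": ["GPU Mem."],
    "GPU Mem.": ["Latency"],
}
paths = extract_causal_paths(
    graph, sources=["Load", "Swap Mem."], objectives={"Latency"}
)
```

Each returned path is a list of node names, which keeps the later ranking step a plain sort over lists.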
There may be too many causal paths
• We need to select the most useful ones
• Compute the Average Causal Effect (ACE) of each pair of neighbors in a path
[Figure: path over GPU Mem., Swap Mem., and Latency]
$$ACE(\text{GPU Mem.}, \text{Swap}) = \frac{1}{N} \sum_{a, b \in Z} \mathbb{E}(\text{GPU Mem.} \mid do(\text{Swap} = b)) - \mathbb{E}(\text{GPU Mem.} \mid do(\text{Swap} = a))$$
• $\mathbb{E}(\text{GPU Mem.} \mid do(\text{Swap} = b))$: expected value of GPU Mem. when we artificially intervene by setting Swap to the value b
• $\mathbb{E}(\text{GPU Mem.} \mid do(\text{Swap} = a))$: expected value of GPU Mem. when we artificially intervene by setting Swap to the value a
• If this difference is large, then small changes to Swap Mem. will cause large changes to GPU Mem.
• The average is taken over all permitted values of Swap memory.
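A minimal sketch of the ACE computation follows. It makes two assumptions that the slide does not spell out: the sum is taken over unordered pairs a < b of permitted values, and the differences are taken in absolute value so that they do not cancel. The callable `expected_under_do` is a hypothetical stand-in for whatever estimates E(Z | do(X = x)) from the causal performance model.

```python
from itertools import combinations

def average_causal_effect(expected_under_do, permitted_values):
    """Average |E[Z | do(X=b)] - E[Z | do(X=a)]| over all pairs a < b.

    expected_under_do: callable x -> E[Z | do(X = x)] (hypothetical; in
    practice it would be estimated from the causal performance model).
    """
    pairs = list(combinations(sorted(permitted_values), 2))
    if not pairs:
        return 0.0
    total = sum(
        abs(expected_under_do(b) - expected_under_do(a)) for a, b in pairs
    )
    return total / len(pairs)

# Made-up interventional expectation: GPU Mem. grows by 0.5 units per
# unit of swap. The permitted values are illustrative as well.
ace = average_causal_effect(lambda swap: 0.5 * swap, permitted_values=[1, 2, 3, 4])
```

With these toy inputs the six pairwise differences sum to 5.0, so the score is 5/6; a flat expectation function would give 0, matching the reading that a large ACE means Swap strongly drives GPU Mem.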
• Average the ACE of all pairs of adjacent nodes in the path
• Rank paths from highest path ACE (PACE) score to the lowest
• Use the top K paths for subsequent analysis
$$PACE(Z, Y) = \frac{1}{2}\left(ACE(Z, X) + ACE(X, Y)\right)$$
Sum over all pairs of nodes in the causal path.
[Figure: causal path over the nodes X, Y, Z; example nodes: GPU Mem., Latency, Swap Mem.]
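The ranking step can be sketched as follows, assuming the ACE of every adjacent pair has already been computed. The scores in the `ace` dictionary are invented for illustration, the dictionary is keyed by adjacent nodes in path order (a convention chosen here, not taken from the slides), and `path_ace` and `top_k_paths` are hypothetical helper names.

```python
def path_ace(path, ace):
    """Mean ACE over the adjacent node pairs of a path (the PACE score)."""
    pairs = list(zip(path, path[1:]))
    return sum(ace[pair] for pair in pairs) / len(pairs)

def top_k_paths(paths, ace, k):
    """Return the k paths with the highest PACE score, best first."""
    return sorted(paths, key=lambda p: path_ace(p, ace), reverse=True)[:k]

# Hypothetical ACE scores keyed by adjacent nodes in path order.
ace = {
    ("Swap Mem.", "GPU Mem."): 0.8,
    ("GPU Mem.", "Latency"): 0.6,
    ("Load", "Latency"): 0.2,
}
paths = [["Load", "Latency"], ["Swap Mem.", "GPU Mem.", "Latency"]]
ranked = top_k_paths(paths, ace, k=1)
```

Averaging rather than summing keeps long paths from outranking short ones merely because they contain more edges, which agrees with the 1/2 factor in the three-node formula.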
Diagnosing and Fixing the Faults
“what if” questions about changes to the misconfigurations
Example: Given that my current swap memory is 2 Gb, and I have high latency, what is the probability of having low latency if swap memory were increased to 4 Gb?
We are interested in the scenario where:
• We hypothetically have low latency;
Conditioned on the following events:
• We hypothetically set the new Swap memory to 4 Gb
• Swap Memory was initially set to 2 Gb
• We observed high latency when Swap was set to 2 Gb
• Everything else remains the same
Diagnosing and Fixing the Faults
• Original path: Load, Swap, GPU Mem., Latency
• Path after proposed change: Swap = 4 Gb, with Load, GPU Mem., and Latency kept
  ◦ Remove incoming edges. Assume no external influence.
  ◦ Modify to reflect the hypothetical scenario
  ◦ Is latency low?
• Use both the models to compute the answer to the counterfactual question
Individual Treatment Effect = Potential Outcome − Control
$$\text{Potential Outcome} = P(\widehat{outcome} = good \mid \widehat{change},\ outcome_{\neg change} = bad,\ \neg change,\ U)$$
Probability that the outcome is good after a change, conditioned on the past
$$\text{Control} = P(\widehat{outcome} = bad \mid \neg change,\ U)$$
Probability that the outcome was bad before the change
If this difference is large, then our change is useful
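To make "conditioned on the past" concrete, here is a sketch of the standard three-step counterfactual computation (abduction, action, prediction) on a toy structural model. The linear model, its coefficients, the observed latency of 95, and the threshold of 80 are all assumptions invented for this example; the slides do not specify how the probabilities above are estimated.

```python
def counterfactual_latency(observed_swap, observed_latency, new_swap,
                           slope=-10.0, base=100.0):
    """Counterfactual latency under a toy SCM: latency = base + slope * swap + u.

    slope and base are made-up coefficients; u stands for everything else
    about the system, which is held fixed ("everything else remains the same").
    """
    # 1. Abduction: recover the unobserved term u from what was observed.
    u = observed_latency - (base + slope * observed_swap)
    # 2. Action: replace swap by its hypothetical value.
    # 3. Prediction: recompute latency with the same u.
    return base + slope * new_swap + u

# Observed: swap = 2 with high latency (95). Hypothetical: swap = 4.
latency_if_4gb = counterfactual_latency(
    observed_swap=2, observed_latency=95.0, new_swap=4
)
is_low = latency_if_4gb < 80.0  # hypothetical "low latency" threshold
```

In this deterministic toy model the answer is a single value; with noisy or partially observed mechanisms the same three steps would be repeated over samples of u to obtain a probability.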
[Figure: top K paths]
• Enumerate all possible changes: set every configuration option in the path to all permitted values
• Compute ITE(change) for each change; it is inferred from observed data, so this is very cheap
• Pick the change with the largest ITE
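The enumeration can be sketched with a Cartesian product over the permitted values of each option on a path. The option names, their values, and the scoring function `toy_ite` are placeholders; in the method described above the score would come from the counterfactual estimate, not from a closed-form expression.

```python
from itertools import product

def best_change(path_options, ite):
    """Try every combination of permitted values and keep the largest ITE.

    path_options: dict mapping option name -> list of permitted values.
    ite: callable taking a dict {option: value} and returning a score
         (hypothetical stand-in for the counterfactual estimate).
    """
    names = list(path_options)
    candidates = (
        dict(zip(names, values)) for values in product(*path_options.values())
    )
    return max(candidates, key=ite)

# Made-up options and scoring function, for illustration only.
options = {"Swap Mem.": [1, 2, 4], "GPU Freq.": [0.5, 1.0]}

def toy_ite(change):
    return 0.1 * change["Swap Mem."] + 0.2 * change["GPU Freq."]

change = best_change(options, toy_ite)
```

Since no candidate needs to be measured on the real system, the cost of the search is just the number of combinations times the cost of one model query.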
“Causal AI for Systems” methodology
1. We learn one central (causal) performance model from the data across different performance tasks:
• Performance understanding
• Performance optimization
• Performance debugging and repair
• Performance prediction for different environments (e.g., canary → production)
2. The causal model is transferable across environments.
• We observed Sparse Mechanism Shift in systems too!
• Alternative non-causal models (e.g., regression-based models for performance tasks) are not transferable, as they rely on the i.i.d. setting.
will be misleading. Here we are simultaneously conditioning on two values of GPU memory growth (i.e., $\hat{X} = 0.66$ and $X = 0.33$). Traditional machine learning approaches cannot handle such expressions. Instead, we must resort to causal models to compute them.
given set of three variables
While a statistical model specifies a single probability distribution, a causal model represents a set of distributions, one for each possible intervention.
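The distinction can be illustrated with a toy structural causal model over three binary variables, where Z influences both X and Y, and X influences Y. All probabilities and names below are invented for this sketch. Sampling without an intervention gives the observational distribution; passing `do_x` cuts the Z → X edge and gives one of the interventional distributions.

```python
import random

def sample(n, do_x=None, seed=0):
    """Draw n samples from a toy SCM over (Z, X, Y): Z -> X, Z -> Y, X -> Y.

    With do_x set, X is forced to that value and the Z -> X edge is cut.
    All probabilities are made up for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5
        if do_x is None:
            x = rng.random() < (0.9 if z else 0.1)
        else:
            x = bool(do_x)
        y = rng.random() < 0.2 + 0.3 * x + 0.4 * z
        rows.append((z, x, y))
    return rows

def mean_y(rows):
    return sum(y for _, _, y in rows) / len(rows)

observational = [r for r in sample(200_000) if r[1]]
p_y_given_x1 = mean_y(observational)          # estimates Pr[Y=1 | X=1]
p_y_do_x1 = mean_y(sample(200_000, do_x=1))   # estimates Pr[Y=1 | do(X=1)]
```

Under these numbers the conditional probability is about 0.86 while the interventional one is about 0.70: conditioning on X = 1 also selects for Z = 1, whereas intervening does not.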
an intervention (which may or may not be intentional/observed) changes the position of one finger, and as a consequence, the object falls. The change in pixel space is entangled (or distributed), in contrast to the change in the causal model.
Results (the user expected 30-40% gain)

CPU Cores ✓ ✓ ✓
CPU Freq. ✓ ✓ ✓
EMC Freq. ✓ ✓ ✓
GPU Freq. ✓ ✓ ✓
Sched. Policy ✓
Sched. Runtime ✓
Sched. Child Proc ✓
Dirty Bg. Ratio ✓
Drop Caches ✓
CUDA_STATIC_RT ✓ ✓ ✓
Swap Memory ✓

                            UNICORN   Decision Tree   Forum
Throughput (on TX2)         26 FPS    20 FPS          23 FPS
Throughput Gain (over TX1)  53 %      21 %            39 %
Time to resolve             24 min.   3 1/2 Hrs.      2 days

✓ Finds the root-causes accurately
✓ No unnecessary changes
✓ Better improvements than forum’s recommendation
✓ Much faster
benchmark dataset
◦ Exhaustively set each configuration option to all permitted values.
◦ For continuous options (e.g., GPU Mem.), sample 10 equally spaced values in [min, max]
• Measure the latency, energy consumption, and heat dissipation
◦ Repeat 5x and average
Fault types: Latency Faults, Energy Faults, Multiple Faults
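The sampling and averaging steps above can be sketched with the standard library alone. The measurement function `fake_latency` is a placeholder for a real measurement harness, and the range 0.0 to 1.0 is an arbitrary example; only the "10 equally spaced values" and "repeat 5x and average" parts come from the slide.

```python
from statistics import mean

def equally_spaced(lo, hi, count=10):
    """Return `count` equally spaced values covering [lo, hi]."""
    step = (hi - lo) / (count - 1)
    return [lo + i * step for i in range(count)]

def benchmark(values, measure, repeats=5):
    """Measure each value `repeats` times and average the readings."""
    return {v: mean(measure(v) for _ in range(repeats)) for v in values}

# `fake_latency` is a placeholder for a real measurement harness.
def fake_latency(gpu_mem):
    return 100.0 - 20.0 * gpu_mem

grid = equally_spaced(0.0, 1.0)
results = benchmark(grid, fake_latency)
```

With a real harness each call to `measure` would return a noisy reading, and the average over the five repeats is what enters the dataset.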
investigate the root-cause
◦ “Fix” the misconfigurations
• A “fix” implies the configuration no longer has tail performance
◦ User defined benchmark (i.e., 10th percentile)
◦ Or some QoS/SLA benchmark
• Record the configurations that were changed
UNICORN perform compared to Model based Diagnostics
✓ Finds the root-causes accurately
✓ Better gain
✓ Much faster
Takeaways
• More accurate than ML-based methods
• Better gain
• Up to 20x faster
Evaluating our Causal AI for Systems methodology with serverless systems provides the following opportunities:
1. Dynamic system reconfigurations
• Dynamic placement of functions
• Dynamic reconfigurations of the network of functions
• Dynamic multi-cloud placement of functions
2. Root cause analysis of failures or QoS drop
such as robots are difficult. The key reason is that there are additional interactions with the environment and the task that the robot is performing.
• Evaluating our Causal AI for Systems methodology with autonomous robots provides the following opportunities:
1. Identifying difficult-to-catch bugs in robots
2. Identifying the root cause of an observed fault and repairing the issue automatically during mission time.
Model for different downstream systems tasks
2. The learned causal model is transferable across different environments
[Figure: UNICORN pipeline (steps 1 to 5) applied to Software: DeepStream; Middleware: TF, TensorRT; Hardware: Nvidia Xavier; Configuration: Default]
Purdue Baishakhi Ray Columbia Christian Kästner CMU Sven Apel Saarland Marco Valtorta UofSC Madelyn Khoury REU student Forest Agostinelli UofSC Causal AI for Systems Causal AI for Robot Learning (Causal RL + Transfer Learning + Robotics) Abir Hossen UofSC Theory of Causal AI Ahana Biswas IIT Om Pandey KIIT Hamed Damirchi UofSC Causal AI for Adversarial ML Ying Meng UofSC Fatemeh Ghofrani UofSC Mahdi Sharifi UofSC Collaborators (Causal AI) Sugato Basu Google AdsAI Garima Pruthi Google AdsAI Causal Representation Learning