Slide 1

Slide 1 text

Causal AI for Systems A journey from performance optimization to transfer learning all the way to Causal AI Pooyan Jamshidi UofSC & Google

Slide 2

Slide 2 text

It is all about team work I played a very minor role

Slide 3

Slide 3 text

Artificial Intelligence and Systems Laboratory (AISys Lab) Machine Learning Computer Systems Autonomy AI/ML Systems https://pooyanjamshidi.github.io/AISys/ 3 Ying Meng (PhD student) Shuge Lei (PhD student) Kimia Noorbakhsh (Undergrad) Shahriar Iqbal (PhD student) Jianhai Su (PhD student) M.A. Javidian (postdoc) Sponsors, thanks! Fatemeh Ghofrani (PhD student) Abir Hossen (PhD student) Hamed Damirchi (PhD student) Mahdi Sharifi (PhD student) Mahdi Sharifi (Intern)

Slide 4

Slide 4 text

Collaborators (Systems) 4 Rahul Krishna Columbia Shahriar Iqbal UofSC Baishakhi Ray Columbia Christian Kästner CMU Norbert Siegmund Leipzig Miguel Velez CMU Sven Apel Saarland Lars Kotthoff Wyoming Vivek Nair Facebook Tim Menzies NCSU Ramtin Zand UofSC Mohsen Amini UofSC

Slide 5

Slide 5 text

5 Rahul Krishna Columbia Shahriar Iqbal UofSC M. A. Javidian Purdue Baishakhi Ray Columbia Christian Kästner CMU Sven Apel Saarland Marco Valtorta UofSC Madelyn Khoury REU student Forest Agostinelli UofSC Causal AI for Systems Causal AI for Robot Learning (Causal RL + Transfer Learning + Robotics) Abir Hossen UofSC Theory of Causal AI Ahana Biswas IIT Om Pandey KIIT Hamed Damirchi UofSC Causal AI for Adversarial ML Ying Meng UofSC Fatemeh Ghofrani UofSC Mahdi Sharifi UofSC The Causal AI Team! Sugato Basu Google AdsAI Garima Pruthi Google AdsAI Causal Representation Learning

Slide 6

Slide 6 text

Configuration Space (Software, Deployment, Hardware) Program (Code) Performance Modeling Performance Visualization Whitebox Sampling [3D plot: latency (ms) as a function of the number of counters and number of splitters, cubic interpolation over a finer grid] Developer User Transfer Learning Performance Understanding Tradeoff Analysis Hands-off Debugging Performance Debugging Active Learning Q1 Q2 Q3 Q4 Foundation Application Artifacts Techniques Program Analysis Causal Inference Causal-based Documentation Q5 Cause Localization Q3 Baishakhi Ray Columbia Christian Kästner CMU Co-PIs Causal Performance Debugging for Highly-Configurable Systems

Slide 7

Slide 7 text

Causal AI + Representation Learning Causal Representation Learning Learned Representation FCA (Attribution via Causal Inference and Counterfactual Reasoning) Multi-Objective Optimization, RL, Active Learning Visualization Specification (contextual badness, model robustness) Intervention/Update Data - Slices - Groups • Generalization • Robustness • Bias • Explainability Sugato Basu Google AdsAI Garima Pruthi Google AdsAI

Slide 8

Slide 8 text

Outline 8 Case Study Causal AI For Systems CADET Current Results Future Directions

Slide 9

Slide 9 text

9 Goal: Enable developers/users to find the right quality tradeoff

Slide 10

Slide 10 text

Today’s most popular systems are configurable 10 built

Slide 11

Slide 11 text

11

Slide 12

Slide 12 text

Empirical observations confirm that systems are becoming increasingly configurable 12 [Charts: number of configuration parameters vs. release time for Storage-A, MySQL, Apache, and Hadoop (MapReduce/HDFS)] [Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]

Slide 13

Slide 13 text

Empirical observations confirm that systems are becoming increasingly configurable 13 [Excerpt of the paper's first page, with charts of the number of configuration parameters vs. release time for Storage-A, MySQL, Apache, and Hadoop (MapReduce/HDFS)] [Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]

Slide 14

Slide 14 text

Configuration options live across stack 14 CPU Memory Controller GPU Lib API Clients Devices Network Task Scheduler Device Drivers File System Compilers Memory Manager Process Manager Frontend Application Layer OS/Kernel Layer Hardware Layer Deployment SoC Generic hardware Production Servers

Slide 15

Slide 15 text

Today’s most popular systems are also composable! Data analytic pipelines 15

Slide 16

Slide 16 text

Today’s most popular systems are complex! multiscale, multi-modal, and multi-stream 16 Multi-Modal Data (Configurable) Image Processing Voice Recognition Context Extraction ML Models (Configurable) Deployment Environment (Configurable) System Components (Configurable) Multi-Cloud Variability Space = Configuration Space + System Architecture + Deployment Environment

Slide 17

Slide 17 text

Configurations determine the performance behavior 17

void Parrot_setenv(. . . name, . . . value) {
#ifdef PARROT_HAS_SETENV
    my_setenv(name, value, 1);
#else
    int name_len = strlen(name);
    int val_len = strlen(value);
    char* envs = glob_env;
    if (envs == NULL) {
        return;
    }
    strcpy(envs, name);
    strcpy(envs + name_len, "=");
    strcpy(envs + name_len + 1, value);
    putenv(envs);
#endif
}

#ifdef LINUX
extern int Parrot_signbit(double x) {
...
#endif

Annotated configuration options: PARROT_HAS_SETENV, LINUX; annotated objectives: Speed, Energy

Slide 18

Slide 18 text

Performance distributions are multi-modal and have long tails • Certain configurations can cause performance to take abnormally large values
 • Faulty configurations take the tail values (worse than 99.99th percentile)
 • Certain configurations can cause faults on multiple performance objectives. 
 18

Slide 19

Slide 19 text

Identifying the root cause of performance faults is difficult ● An auto-pilot code base was transplanted from TX1 to TX2 ● TX2 is more powerful, but the software ran 2x slower than on TX1 Fig 1. Performance fault on NVIDIA TX2 https://forums.developer.nvidia.com/t/50477 19

Slide 20

Slide 20 text

20 Long conversations in issue trackers are common when finding root causes and possible fixes

Slide 21

Slide 21 text

Users want to understand the effect of configuration options 21

Slide 22

Slide 22 text

Fixing performance faults is difficult and not obvious ● These were not in the default settings ● Took 1 month to fix in the end... ● Three misconfigurations: ○ Wrong compilation flags for compiling CUDA (didn't use 'dynamic' flag) ○ Wrong CPU/GPU modes (didn't use TX2 optimized cores) ○ Wrong Fan mode (didn't change to handle thermal throttling) ● We need to do this better Performance fault on NVIDIA TX2 https://forums.developer.nvidia.com/t/50477 22

Slide 23

Slide 23 text

We performed a systematic study of performance faults in ML systems 1. There are different kinds of performance faults across ML systems as a result of misconfigurations. I. Latency, Thermal, Energy, Throughput II. Combinations of the above faults 2. Configuration options interact with one another across the stack I. e.g., software options with hardware options 3. The interactions between options are usually low degree (2-5). 4. The interactions between options may change across environments; however, such changes are local, confined to a few causal mechanisms. 5. Non-functional faults take a long time to resolve 23

Slide 24

Slide 24 text

We performed a systematic study of performance in different types of systems, with options living across the stack and with different deployment topologies 1. ML Systems 2. Data Analytics Pipelines 3. Big Data Systems 4. Stream Processing Systems 5. Compilers 6. Video Encoders 7. Databases 8. SAT Solvers 24 [Table: similarity metrics (correlations, KL divergence, influential options, interactions) across environment changes spanning hardware (Azure, AWS, TK1, GPU), workloads (Coffee, DiatomSizeReduction, Adiac, ShapesAll), and DNN frameworks (TensorFlow, Theano, CNTK)] DNN system development stack: Network Design, Model, Compiler, Hybrid Deployment, OS/Hardware; Neural Search, Hyper-parameter, Hardware Optimization; Deployment Topology

Slide 25

Slide 25 text

Each system has different performance objectives and configuration options 25
SPEAR (SAT solver): analysis time; 14 options, 16,384 configurations; SAT problems; 3 hardware, 2 versions
X264 (video encoder): encoding time, video quality/size; 16 options, 4,000 configurations; 2 hardware, 3 versions
SQLite (DB engine): query time; 14 options, 1,000 configurations; DB queries; 2 hardware, 2 versions
SaC (compiler): execution time; 50 options, 71,267 configurations; 10 demo programs

Slide 26

Slide 26 text

More information regarding the setup and the gained insights can be found here 26 [Paper: "Transfer Learning for Performance Modeling of Configurable Systems: An Exploratory Analysis", Pooyan Jamshidi (Carnegie Mellon University), Norbert Siegmund (Bauhaus-University Weimar), Miguel Velez, Christian Kästner, Akshay Patel, Yuvraj Agarwal (Carnegie Mellon University). The study varies software configurations and environmental conditions (hardware, workload, software version) across four configurable systems; for small environmental changes a linear transformation of the performance model transfers across environments, while for severe changes only knowledge that makes sampling more efficient (e.g., a reduced dimensionality of the configuration space) transfers.]

Slide 27

Slide 27 text

Outline 27 Case Study Causal AI For Systems CADET Current Results Future Directions

Slide 28

Slide 28 text

SocialSensor •Identifying trending topics •Identifying user defined topics •Social media search 28

Slide 29

Slide 29 text

SocialSensor 29 Content Analysis Orchestrator Crawling Search and Integration Tweets: [5k-20k/min] Every 10 min: [100k tweets] Tweets: [10M] Fetch Store Push Store Crawled items Fetch Internet

Slide 30

Slide 30 text

Challenges 30 Content Analysis Orchestrator Crawling Search and Integration Tweets: [5k-20k/min] Every 10 min: [100k tweets] Tweets: [10M] Fetch Store Push Store Crawled items Fetch Internet 100X 10X Real time

Slide 31

Slide 31 text

31 How can we gain a better performance without using more resources?

Slide 32

Slide 32 text

32 Let’s try out different system configurations!

Slide 33

Slide 33 text

Opportunity: Data processing engines in the pipeline were all configurable 33 > 100 > 100 > 100 2300

Slide 34

Slide 34 text

34 More combinations than estimated atoms in the universe

Slide 35

Slide 35 text

[Scatter plot: throughput (ops/sec) vs. average write latency (µs), marking the default and optimal configurations] The default configuration is typically bad and the optimal configuration is noticeably better than the median 35 Default Configuration Optimal Configuration better better • Default is bad • 2X-10X faster than worst • Noticeably faster than median

Slide 36

Slide 36 text

Performance behavior varies in different environments 36

Slide 37

Slide 37 text

100X more users; cloud resources reduced by 20%; outperformed expert recommendation

Slide 38

Slide 38 text

Outline 39 Case Study CADET Current Results Future Directions Causal AI for Systems

Slide 39

Slide 39 text

Causal AI in Systems and Software 38 Computer Architecture Database Operating Systems Programming Languages BigData Software Engineering https://github.com/y-ding/causal-system-papers

Slide 40

Slide 40 text

• Build a causal model that captures the interactions among options in the variability space using observational performance data. • Iteratively evaluate and update the causal model. • Perform downstream tasks such as performance debugging or performance optimization using causal inference, counterfactual reasoning, causal interactions, causal invariances, and causal representations. Our Causal AI for Systems methodology

Slide 41

Slide 41 text

Our Causal AI for Systems methodology 41

Slide 42

Slide 42 text

Step1: Determining the variability space The larger the variability space, the more difficult the downstream tasks become 42 Multi-Modal Data (Configurable) Image Processing Voice Recognition Context Extraction ML Models (Configurable) Deployment Environment (Configurable) System Components (Configurable) Multi-Cloud Variability Space = Configuration Space + System Architecture + Deployment Environment

Slide 43

Slide 43 text

Determining the variability space 43 Configuration Space: ℂ = O1 × O2 × ⋯ × O19 × O20, where the options include dead code removal, constant folding, loop unrolling, function inlining, etc. A configuration is one point in this space, e.g., c1 = 0 × 0 × ⋯ × 0 × 1, c1 ∈ ℂ. Non-functional, measurable/quantifiable aspects: compile time fc(c1) = 11.1 ms, execution time fe(c1) = 110.3 ms, energy fen(c1) = 100 mWh. Compiler (e.g., SaC, LLVM): Program → Compile → Compiled Code → Deploy → Instrumented Binary → Configure → Hardware
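To make the notation concrete, here is a minimal, hypothetical Python sketch of such a configuration space; the four option names and the measure() hook are illustrative placeholders, not the actual SaC/LLVM setup.

import itertools

# Hypothetical 4-option slice of the compiler configuration space
# C = O1 x O2 x ... x O20; each option here is binary (off/on).
options = {
    "dead_code_removal": [0, 1],
    "constant_folding": [0, 1],
    "loop_unrolling": [0, 1],
    "function_inlining": [0, 1],
}

# Enumerate the Cartesian product; the full 20-option space has 2**20 points.
configs = [dict(zip(options, values))
           for values in itertools.product(*options.values())]

def measure(config):
    # Placeholder for compiling, deploying, and instrumenting the program;
    # a real harness would return f_c (compile time), f_e (execution time),
    # and f_en (energy) for the given configuration.
    raise NotImplementedError("hook up the real benchmark harness here")

print(len(configs), "configurations in this 4-option slice")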

Slide 44

Slide 44 text

Step2: Collecting observational data By instrumenting the system across the stack (software, middleware, hardware) and measuring performance objectives (the objective differs by system, e.g., throughput for data analytics pipelines) for different configurations 44 GPU Mem. Swap Mem. Load Latency c1 0.2 2 Gb 10% 1 sec c2 0.5 1 Gb 20% 2 sec cn 1.0 4 Gb 40% 0.1 sec Multi-Modal Data (Configurable) Image Processing Voice Recognition Context Extraction ML Models (Configurable) Deployment Environment (Configurable) System Components (Configurable) Multi-Cloud Variability Space = Configuration Space + System Architecture + Deployment Environment Measurements
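A hedged sketch of what this collection step might look like, continuing the toy options above; measure() returns fabricated numbers and stands in for a real deployment-and-benchmark run.

import random
import pandas as pd

random.seed(0)

def measure(config):
    # Placeholder: deploy the system with `config`, run the workload, and
    # read back system events and the performance objective. The numbers
    # below are fabricated purely for illustration.
    return {"gpu_mem": config["gpu_mem"],
            "swap_mem_gb": config["swap_mem_gb"],
            "load_pct": round(random.uniform(5, 50), 1),
            "latency_s": round(random.uniform(0.1, 2.0), 2)}

# Sample a small budget of configurations; one row per measurement.
samples = []
for _ in range(25):
    config = {"gpu_mem": random.choice([0.2, 0.5, 1.0]),
              "swap_mem_gb": random.choice([1, 2, 4])}
    samples.append(measure(config))

data = pd.DataFrame(samples)
print(data.head())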

Slide 45

Slide 45 text

Our setup for performance measurements 45

Slide 46

Slide 46 text

Hardware platforms in our experiments The reason behind using different types of hardware platforms is that they exhibit different behaviors due to differences in terms of resources, their microarchitecture, etc. 46 AWS DeepLens: Cloud-connected device System on Chip (SoC) Microcontrollers (MCUs)

Slide 47

Slide 47 text

47 System-on-Module (SoM) Hardware platforms in our experiments The reason behind using different types of hardware platforms is that they exhibit different behaviors due to differences in terms of resources, their microarchitecture, etc.

Slide 48

Slide 48 text

48 Edge TPU devices Hardware platforms in our experiments The reason behind using different types of hardware platforms is that they exhibit different behaviors due to differences in terms of resources, their microarchitecture, etc.

Slide 49

Slide 49 text

49 FPGA Hardware platforms in our experiments The reason behind using different types of hardware platforms is that they exhibit different behaviors due to differences in terms of resources, their microarchitecture, etc.

Slide 50

Slide 50 text

Measuring performance for systems involves lots of challenges Each hardware platform requires different instrumentation, and obtaining clean measurements with the least amount of noise is the most challenging part of our experiments. 50

Slide 51

Slide 51 text

Step3: Learning a Functional Causal Model We developed Perf-SCM, an instantiation of SCM for Performance, which captures causal interactions via functional nodes 51 Batch Size f5 Batch Timeout f1 Memory Growth f7 f16 QoS Interval Cache Pressure f10 f12 Swappiness f9 f14 Cache Size f4 CPU Freq f2 f3 f6 GPU Freq f11 f15 CPU Utilization EMC Freq CPU Cores f13 Context Switches Migrations f8 Num Cycles Cache Misses Branch Misses Num Instructions Scheduler Wait Time Major Faults Cache References Scheduler Sleep Time Minor Faults Scheduler Task Migrations Softirq Entry GPU Utilization Throughput Energy GPU Mem. Swap Mem. Load Latency c1 0.2 2 Gb 10% 1 sec c2 0.5 1 Gb 20% 2 sec cn 1.0 4 Gb 40% 0.1 sec Causal Structure Learning (e.g., CGNN)
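The slide names CGNN as one structure learner; as a hedged alternative sketch, the observational table from the previous step could be fed to the FCI implementation in the causal-learn package (the exact function signature and return types may differ across versions).

import numpy as np
from causallearn.search.ConstraintBased.FCI import fci  # assumed import path

# `data` is the observational table from the previous sketch (columns:
# configuration options, system events, performance objectives).
columns = list(data.columns)
g, edges = fci(data.to_numpy().astype(np.float64), alpha=0.05)

# g is a partial ancestral graph whose nodes are labeled X1..Xn; map them
# back to column names so paths such as swap_mem_gb -> gpu_mem -> latency_s
# can be read off the printed edge list.
for edge in edges:
    print(edge)
print(columns)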

Slide 52

Slide 52 text

Step4: Formulating queries for the downstream tasks E.g., conditional probabilities for performance prediction tasks. 52 Batch Size f5 Batch Timeout f1 Memory Growth f7 f16 QoS Interval Cache Pressure f10 f12 Swappiness f9 f14 Cache Size f4 CPU Freq f2 f3 f6 GPU Freq f11 f15 CPU Utilization EMC Freq CPU Cores f13 Context Switches Migrations f8 Num Cycles Cache Misses Branch Misses Num Instructions Scheduler Wait Time Major Faults Cache References Scheduler Sleep Time Minor Faults Scheduler Task Migrations Softirq Entry GPU Utilization Throughput Energy For performance understanding tasks, one may formulate the following query: P(Throughput | M, Configuration = C1) For performance debugging, one may formulate a counterfactual query that simultaneously conditions on an observed and a hypothetical value (discussed on the next slide).
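A minimal sketch of the prediction query on the toy table from earlier, assuming the hypothetical configuration C1 below; it contrasts the purely observational estimate with the interventional query for which the causal model is needed.

# Hypothetical configuration C1; values must exist in the sampled space.
C1 = {"gpu_mem": 0.5, "swap_mem_gb": 2}

subset = data
for option, value in C1.items():
    subset = subset[subset[option] == value]

# Observational (associational) estimate -- what a plain ML model gives us.
print("E[latency | Configuration = C1] =", subset["latency_s"].mean())

# An interventional query P(latency | do(Configuration = C1)) additionally
# requires adjusting for confounders identified from the causal graph M
# (e.g., via the back-door formula); see the estimation step below.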

Slide 53

Slide 53 text

Questions of this nature require precise mathematical language lest they be misleading. Here we are simultaneously conditioning on two values of GPU memory growth (a hypothetical value, 0.66, and the observed value, 0.33). Traditional machine learning approaches cannot handle such expressions; instead, we must resort to causal models to compute them. 53

Slide 54

Slide 54 text

There are two fundamental benefits that we get from our “Causal AI for Systems” methodology 1. We learn one central (causal) model from the data and use it reliably across different performance tasks: • Performance understanding • Performance optimization • Performance debugging and repair • Performance prediction for different environments where we cannot intervene (e.g., canary → production: we can intervene in the canary environment, but we cannot disturb the production environment, where we may only be able to use measurement data) 2. The causal model is transferable across environments. • We observed the Sparse Mechanism Shift in systems too! • Alternative non-causal models (e.g., regression-based models for performance tasks) are not transferable because they rely on the i.i.d. setting and only capture associations/correlations among variables, resulting in many non-causal terms that may drastically change when the system is deployed in different environments. 54

Slide 55

Slide 55 text

Difference between statistical (left) and causal models (right) on a given set of three variables While a statistical model specifies a single probability distribution, a causal model represents a set of distributions, one for each possible intervention. 55

Slide 56

Slide 56 text

Independent Causal Mechanisms (ICM) Principle

Slide 57

Slide 57 text

Sparse Mechanism Shift (SMS) Hypothesis Example of SMS hypothesis, where an intervention (which may or may not be intentional/observed) changes the position of one finger, and as a consequence, the object falls. The change in pixel space is entangled (or distributed), in contrast to the change in the causal model.

Slide 58

Slide 58 text

Step5: Estimating the queries based on the learned causal model The estimation process involves traversing the causal model: (i) extracting the causal paths by backtracking from the performance objectives, (ii) ranking the causal paths by calculating the average causal effect, and (iii) extracting the required information from the causal paths. 58 Batch Size f5 Batch Timeout f1 Memory Growth f7 f16 QoS Interval Cache Pressure f10 f12 Swappiness f9 f14 Cache Size f4 CPU Freq f2 f3 f6 GPU Freq f11 f15 CPU Utilization EMC Freq CPU Cores f13 Context Switches Migrations f8 Num Cycles Cache Misses Branch Misses Num Instructions Scheduler Wait Time Major Faults Cache References Scheduler Sleep Time Minor Faults Scheduler Task Migrations Softirq Entry GPU Utilization Throughput Energy
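A small sketch of step (i), assuming the learned structure has been converted into a networkx DiGraph with the toy variable names used earlier; ranking by average causal effect is sketched separately under the Ranking Paths slides below.

import networkx as nx

# Toy directed graph standing in for the learned Perf-SCM structure.
G = nx.DiGraph()
G.add_edges_from([
    ("load_pct", "swap_mem_gb"),
    ("swap_mem_gb", "gpu_mem"),
    ("gpu_mem", "latency_s"),
])

objectives = ["latency_s"]
options_and_events = ["load_pct", "swap_mem_gb", "gpu_mem"]

# Step (i): collect every causal path from an option/event to an objective.
paths = []
for src in options_and_events:
    for obj in objectives:
        paths.extend(nx.all_simple_paths(G, source=src, target=obj))

for p in paths:
    print(" -> ".join(p))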

Slide 59

Slide 59 text

Step6: Evaluating and updating the causal model We evaluate ground truth queries to test whether the causal model is accurate enough to estimate queries with certain accuracy. In a typical setting, we have limited sampling budget, say 100 measurements. 59 Batch Size f5 Batch Timeout f1 Memory Growth f7 f16 QoS Interval Cache Pressure f10 f12 Swappiness f9 f14 Cache Size f4 CPU Freq f2 f3 f6 GPU Freq f11 f15 CPU Utilization EMC Freq CPU Cores f13 Context Switches Migrations f8 Num Cycles Cache Misses Branch Misses Num Instructions Scheduler Wait Time Major Faults Cache References Scheduler Sleep Time Minor Faults Scheduler Task Migrations Softirq Entry GPU Utilization Throughput Energy Batch Size f5 Batch Timeout f1 Memory Growth f7 f16 QoS Interval Cache Pressure f10 f12 Swappiness f9 f14 Cache Size f4 CPU Freq f2 f3 f6 GPU Freq f11 f15 CPU Utilization EMC Freq CPU Cores f13 Context Switches Migrations f8 Num Cycles Cache Misses Branch Misses Num Instructions Scheduler Wait Time Major Faults Cache References Scheduler Sleep Time Minor Faults Scheduler Task Migrations Softirq Entry GPU Utilization Throughput Energy Causal Model Update
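A skeleton of this evaluate-and-update loop under a 100-measurement budget; measure_latency() and estimate_latency() are hypothetical stand-ins for the real benchmark harness and for querying the learned causal model.

import random

random.seed(1)

def measure_latency(config):
    # Stub for a real benchmark run; the formula and noise are fabricated.
    return 1.0 / config["gpu_mem"] + 0.1 * config["swap_mem_gb"] + random.gauss(0, 0.05)

def estimate_latency(observations, config):
    # Stub for querying the learned causal model; until enough data exists,
    # it simply returns the mean of the latencies observed so far.
    if not observations:
        return float("inf")
    return sum(lat for _, lat in observations) / len(observations)

budget, observations = 100, []
while budget > 0:
    config = {"gpu_mem": random.choice([0.2, 0.5, 1.0]),
              "swap_mem_gb": random.choice([1, 2, 4])}
    truth = measure_latency(config)                     # ground-truth query
    budget -= 1
    if abs(estimate_latency(observations, config) - truth) < 0.1:
        break                                           # accurate enough
    observations.append((config, truth))                # update the data
    # ...and relearn/refresh the causal structure from `observations` here.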

Slide 60

Slide 60 text

Step7: Calculating quantities of the downstream tasks Depending on the downstream task (the estimated queries), we may need to do some final transformations or calculations; here we assume the downstream task is performance optimization. 60 Batch Size f5 Batch Timeout f1 Memory Growth f7 f16 QoS Interval Cache Pressure f10 f12 Swappiness f9 f14 Cache Size f4 CPU Freq f2 f3 f6 GPU Freq f11 f15 CPU Utilization EMC Freq CPU Cores f13 Context Switches Migrations f8 Num Cycles Cache Misses Branch Misses Num Instructions Scheduler Wait Time Major Faults Cache References Scheduler Sleep Time Minor Faults Scheduler Task Migrations Softirq Entry GPU Utilization Throughput Energy P(Throughput | M, Configuration = C1) 1- Estimation 2- Using this estimation in an optimization loop for performance optimization tasks [Plots: an empirical model of the configuration space with experiments, and the selection criteria driving the sequential design]
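A hedged sketch of plugging the estimated query into a (here, exhaustive) optimization loop over a tiny space; estimate_throughput() is a made-up stand-in for the causal estimate, and a real sequential design would refine the estimate after each experiment.

import itertools

def estimate_throughput(config):
    # Made-up stand-in for E[Throughput | do(Configuration = config)]
    # computed from the learned causal model.
    return config["gpu_mem"] * 100 - config["swap_mem_gb"] * 2

space = {"gpu_mem": [0.2, 0.5, 1.0], "swap_mem_gb": [1, 2, 4]}
candidates = [dict(zip(space, values))
              for values in itertools.product(*space.values())]

# Exhaustive search is feasible only for a toy space; a sequential design
# (e.g., Bayesian optimization) would pick the next experiment from the
# selection criterion and update the estimate after each measurement.
best = max(candidates, key=estimate_throughput)
print("estimated best configuration:", best)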

Slide 61

Slide 61 text

Outline 61 Case Study Current Results Future Directions Causal AI for Systems CADET

Slide 62

Slide 62 text

A Typical Software Lifecycle Write > Test > Deploy > Monitor Vulnerabilities Deep bugs Misconfigurations Poor product metrics Artifact 62 CADET Misconfigurations: Diagnosing and fixing misconfigurations with causal inference TODAY’S TALK

Slide 63

Slide 63 text

Today’s Talk Deploy Artifact Challenge • Each deployment environment must be configured correctly • This is challenging and prone to misconfigurations Software may be deployed in several environments Server Personal Devices Embedded Hardware Autonomous Vehicles Deployment Environments 63

Slide 64

Slide 64 text

Today’s Talk Problem • Each deployment environment must be configured correctly • This is challenging and prone to misconfigurations Why? • The configuration options lie across the software stack • There are several non-trivial interactions with one another • The configuration space is combinatorially large, with 100s of configuration options 64 CPU Memory Controller GPU Lib API Clients Devices Network Task Scheduler Device Drivers File System Compilers Memory Manager Process Manager Frontend Application Layer OS/Kernel Layer Hardware Layer Deployment SoC Generic hardware Production Servers

Slide 65

Slide 65 text

Misconfiguration and its Effects ● Misconfigurations can elicit unexpected interactions between software and hardware ● These can result in non-functional faults ○ Affecting non-functional system properties like latency, throughput, energy consumption, etc. 65 The system doesn’t crash or exhibit an obvious misbehavior Systems are still operational but with a degraded performance, e.g., high latency, low throughput, high energy consumption, high heat dissipation, or a combination of several

Slide 66

Slide 66 text

66 CUDA performance issue on tx2 “When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases.” The user is transferring the code from one hardware to another. The target hardware is faster than the source hardware; the user expects the code to run at least 30-40% faster. The code ran 2x slower on the more powerful hardware. Motivating Example

Slide 67

Slide 67 text

Motivating Example 67 June 3rd We have already tried this. We still have high latency. Any other suggestions? June 4th Please do the following and let us know if it works 1. Install JetPack 3.0 2. Set nvpmodel=MAX-N 3. Run jetson_clock.sh June 5th June 4th TX2 is pascal architecture. Please update your CMakeLists: + set(CUDA_STATIC_RUNTIME OFF) ... + -gencode=arch=compute_62,code=sm_62 The user had several misconfigurations In Software: ✖ Wrong compilation flags ✖ Wrong SDK version In Hardware: ✖ Wrong power mode ✖ Wrong clock/fan settings The discussions took 2 days Any suggestions on how to improve my performance? Thanks! How to resolve such issues faster? ?

Slide 68

Slide 68 text

68 Diagnose and fix the root-cause of misconfigurations that cause non-functional faults Objective Causal Debugging (with CADET) • Use causal models to model various cross-stack configuration interactions; and • Counterfactual reasoning to recommend fixes for these misconfigurations Approach

Slide 69

Slide 69 text

69 NeurIPS 2020 (ML For Systems), Dec 12th, 2020 https://arxiv.org/pdf/2010.06061.pdf https://github.com/rahlk/CADET

Slide 70

Slide 70 text

Why Causal Inference? (Simpson’s Paradox) 70 Increasing GPU memory increases Latency More GPU memory usage should reduce latency not increase it. Counterintuitive! Any ML-/statistical models built on this data will be incorrect !

Slide 71

Slide 71 text

Why Causal Inference? (Simpson’s Paradox) 71 Segregate data on swap memory Available swap memory is reducing GPU memory borrows memory from the swap for some intensive workloads. Other host processes may reduce the available swap. Little will be left for the GPU to use.

Slide 72

Slide 72 text

72 Why Causal Inference? Real world problems can have 100s if not 1000s of interacting configuration options ! Manually understanding and evaluating each combination is impractical, if not impossible.

Slide 73

Slide 73 text

Load GPU Mem. Swap Mem. Latency Express the relationships between interacting variables as a causal graph 73 Causal Models Configuration option Direction(s) of the causality • Latency is affected by GPU Mem. which in turn is influenced by swap memory • External factors like resource pressure also affects swap memory Non-functional property System event

Slide 74

Slide 74 text

74 Causal Models How to construct this causal graph? ? If there is a fault in latency, how to diagnose and fix it? ? Load GPU Mem. Swap Mem. Latency

Slide 75

Slide 75 text

75 CADET: Causal Debugging Tool • What is the root-cause of my fault? • How do I fix my misconfigurations to improve performance? Misconfiguration Fault fixed? Observational Data Build Causal Graph Extract Causal Paths Best Query Yes No update observational data Counterfactual Queries Rank Paths What if questions. E.g., What if the configuration option X was set to a value ‘x’? About 25 sample configurations (training data)

Slide 76

Slide 76 text

Best Query Counterfactual Queries Rank Paths What if questions. E.g., What if the configuration option X was set to a value ‘x’? Extract Causal Paths 76 STEP 1: Generating a Causal Graph • What is the root-cause of my fault? • How do I fix my misconfigurations to improve performance? Misconfiguration Fault fixed? Observational Data Yes No update observational data About 25 sample configurations (training data) Build Causal Graph

Slide 77

Slide 77 text

Generating a Causal Graph: With FCI 77
Observational data (GPU Mem., Swap Mem., Load, Latency): c1: 0.2, 2 Gb, 10%, 1 sec; c2: 0.5, 1 Gb, 20%, 2 sec; … cn: 1.0, 4 Gb, 40%, 0.1 sec
1. Start from a fully connected skeleton over Load, Swap Mem., GPU Mem., Latency
2. Prune away edges between independent variables (using statistical independence tests)
3. Orient the remaining edges (using the standard orientation rules for forks, colliders, v-structures, and cycles)
Result: a directed acyclic graph over Load, Swap Mem., GPU Mem., and Latency
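A rough sketch of the skeleton-pruning idea, assuming the toy observational DataFrame from earlier; the partial-correlation test is a crude stand-in for the conditional-independence tests FCI actually uses, and the orientation phase is omitted.

import itertools
import networkx as nx
import numpy as np

def independent(df, x, y, given, threshold=0.05):
    # Crude stand-in for a proper conditional-independence test:
    # partial correlation of x and y given `given`, via regression residuals.
    def residual(col):
        if not given:
            return df[col] - df[col].mean()
        Z = np.column_stack([df[c] for c in given] + [np.ones(len(df))])
        beta, *_ = np.linalg.lstsq(Z, df[col], rcond=None)
        return df[col] - Z @ beta
    r = np.corrcoef(residual(x), residual(y))[0, 1]
    return abs(r) < threshold

variables = ["load_pct", "swap_mem_gb", "gpu_mem", "latency_s"]
skeleton = nx.complete_graph(variables)          # start fully connected
for x, y in list(skeleton.edges):
    others = [v for v in variables if v not in (x, y)]
    for size in range(len(others) + 1):
        if any(independent(data, x, y, list(s))
               for s in itertools.combinations(others, size)):
            skeleton.remove_edge(x, y)           # independent given some set
            break
# FCI/PC then orient the remaining edges with their standard rules
# (forks, colliders, v-structures) to obtain the causal graph.
print(skeleton.edges)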

Slide 78

Slide 78 text

Best Query Counterfactual Queries Rank Paths What if questions. E.g., What if the configuration option X was set to a value ‘x’? Extract Causal Paths 78 STEP 2: Extracting Paths from the Graph • What is the root-cause of my fault? • How do I fix my misconfigurations to improve performance? Misconfiguration Fault fixed? Observational Data Build Causal Graph Yes No update observational data About 25 sample configurations (training data)

Slide 79

Slide 79 text

Extracting Paths from the Causal Graph Problem ✕ In real world cases, this causal graph can be very complex ✕ It may be intractable to reason over the entire graph directly 79 Solution ✓ Extract paths from the causal graph ✓ Rank them based on their Average Causal Effect on latency, etc. ✓ Reason over the top K paths

Slide 80

Slide 80 text

Extracting Paths from the Causal Graph 80 GPU Mem. Latency Swap Mem. Extract paths Always begins with a configuration option Or a system event Always terminates at a performance objective Load GPU Mem. Latency Swap Mem. Swap Mem. Latency Load GPU Mem.

Slide 81

Slide 81 text

Ranking Paths from the Causal Graph 81 ● There may be too many causal paths ● We need to select the most useful ones ● Compute the Average Causal Effect (ACE) of each pair of neighbors in a path GPU Mem. Swap Mem. Latency ACE(GPU Mem., Swap) = (1/N) Σ_{a,b ∈ Swap} E[GPU Mem. | do(Swap = b)] − E[GPU Mem. | do(Swap = a)] Expected value of GPU Mem. when we artificially intervene by setting Swap to the value b; expected value of GPU Mem. when we artificially intervene by setting Swap to the value a. If this difference is large, then small changes to Swap Mem. will cause large changes to GPU Mem. Average over all permitted values of Swap memory.

Slide 82

Slide 82 text

Ranking Paths from the Causal Graph 82 ● Average the ACE of all pairs of adjacent nodes in the path. For a path X → Y → Z: ACE_path(X → Y → Z) = 1/2 (ACE(Y, X) + ACE(Z, Y)) Sum over all pairs of adjacent nodes in the causal path. GPU Mem. Latency Swap Mem.
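A hedged sketch of this ranking, reusing the toy data and paths from the earlier sketches; the do-expectations are approximated naively by conditioning, whereas a faithful version would adjust for the parents identified in the causal graph.

import itertools
import numpy as np

def expected_under(df, node, parent, value):
    # Naive stand-in for E[node | do(parent = value)]: condition on the
    # observational data; a faithful estimate would adjust for the
    # parents of `parent` identified in the causal graph.
    rows = df[df[parent] == value]
    return rows[node].mean() if len(rows) else np.nan

def ace(df, node, parent):
    values = sorted(df[parent].unique())
    diffs = [abs(expected_under(df, node, parent, b)
                 - expected_under(df, node, parent, a))
             for a, b in itertools.combinations(values, 2)]
    return float(np.nanmean(diffs)) if diffs else 0.0

def path_ace(df, path):
    pair_aces = [ace(df, child, parent) for parent, child in zip(path, path[1:])]
    return float(np.mean(pair_aces))

# Rank the paths extracted earlier by their average causal effect.
ranked = sorted(paths, key=lambda p: path_ace(data, p), reverse=True)
for p in ranked:
    print(round(path_ace(data, p), 3), " -> ".join(p))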

Slide 83

Slide 83 text

Best Query Counterfactual Queries Rank Paths What if questions. E.g., What if the configuration option X was set to a value ‘x’? Extract Causal Paths 83 STEP 3: Diagnosing and Fixing the Faults • What is the root-cause of my fault? • How do I fix my misconfigurations to improve performance? Misconfiguration Fault fixed? Observational Data Build Causal Graph Yes No update observational data About 25 sample configurations (training data)

Slide 84

Slide 84 text

Diagnosing and Fixing the Faults 84 ● Counterfactual inference asks “what if” questions about changes to the misconfigurations We are interested in the scenario where: • We hypothetically have low latency; Conditioned on the following events: • We hypothetically set the new Swap memory to 4 Gb • Swap Memory was initially set to 2 Gb • We observed high latency when Swap was set to 2 Gb • Everything else remains the same Example Given that my current swap memory is 2 Gb, and I have high latency. What is the probability of having low latency if swap memory was increased to 4 Gb?

Slide 85

Slide 85 text

Low? Load GPU Mem. Latency Swap = 4 Gb Diagnosing and Fixing the Faults 85 GPU Mem. Latency Swap Original Path Load GPU Mem. Latency Swap = 4 Gb Path after proposed change Load Remove incoming edges. Assume no external influence. Modify to reflect the hypothetical scenario Low? Load GPU Mem. Latency Swap = 4 Gb Low? Use both the models to compute the answer to the counterfactual question

Slide 86

Slide 86 text

Diagnosing and Fixing the Faults 86 GPU Mem. Latency Swap Original Path Load GPU Mem. Latency Swap = 4 Gb Path after proposed change Load P(Latency_hypothetical = low | Swap_hypothetical = 4 Gb, Swap = 2 Gb, Latency_{Swap = 2 Gb} = high, U) We expect a low latency The latency was high The Swap is now 4 Gb The Swap was initially 2 Gb Everything else (U) stays the same

Slide 87

Slide 87 text

Diagnosing and Fixing the Faults 87 Potential = P(Outcome_hypothetical = good | hypothetical change, the observed outcome without the change was not good, no change was made, U): probability that the outcome is good after a change, conditioned on the past Control = P(Outcome_hypothetical = bad | no change, U): probability that the outcome was bad before the change Individual Treatment Effect = Potential − Control If this difference is large, then our change is useful

Slide 88

Slide 88 text

Diagnosing and Fixing the Faults 88 GPU Mem. Latency Swap Mem. Top K paths ⋮ Enumerate all possible changes Change with the largest ITE Set every configuration option in the path to all permitted values Inferred from observed data. This is very cheap!
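A sketch of this enumerate-score-apply step; permitted values and estimated_ite() are hypothetical placeholders for the options on the top-ranked paths and for the counterfactual ITE computation from the previous slides.

import itertools

# Hypothetical permitted values for the options on the top-ranked paths.
permitted = {"swap_mem_gb": [1, 2, 4], "gpu_mem": [0.2, 0.5, 1.0]}

observed_config = {"swap_mem_gb": 2, "gpu_mem": 0.5}   # current (faulty) setting

def estimated_ite(change):
    # Stand-in for the counterfactual computation sketched above:
    # P(latency low | change, past) - P(latency high | no change, past).
    # The toy scoring below is for illustration only.
    return 0.1 * change["swap_mem_gb"] + 0.2 * change["gpu_mem"]

candidates = [dict(zip(permitted, values))
              for values in itertools.product(*permitted.values())]
best_change = max(candidates, key=estimated_ite)
print("recommended change:", best_change)
# Apply best_change and re-measure; if the fault persists, add the new
# observation to the data, update the causal model, and repeat.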

Slide 89

Slide 89 text

Diagnosing and Fixing the Faults 89 Change with the largest ITE Fault fixed? Yes No • Add to observational data • Update causal model • Repeat… Measure Performance

Slide 90

Slide 90 text

90 CADET: End-to-End Pipeline • What is the root-cause of my fault? • How do I fix my misconfigurations to improve performance? Misconfiguration Fault fixed? Observational Data Build Causal Graph Extract Causal Paths Best Query Yes No update observational data Counterfactual Queries Rank Paths What if questions. E.g., What if the configuration option X was set to a value ‘x’? About 25 sample configurations (training data)

Slide 91

Slide 91 text

Results: Motivating Example 91 “When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases.” The user is transferring the code from one hardware to another. The target hardware is faster than the source hardware; the user expects the code to run at least 30-40% faster. The code ran 2x slower on the more powerful hardware

Slide 92

Slide 92 text

More powerful Results: Motivating Example 92 Nvidia TX1 CPU 4 cores, 1.3 GHz GPU 128 Cores, 0.9 GHz Memory 4 Gb, 25 Gb/s Nvidia TX2 CPU 6 cores, 2 GHz GPU 256 Cores, 1.3 GHz Memory 8 Gb, 58 Gb/s Embedded real-time stereo estimation Source code 17 FPS on TX1 vs. 4 FPS on TX2: 4× slower!

Slide 93

Slide 93 text

Results: Motivating Example 93
Root causes identified (columns: CADET / Decision Tree / Forum): CPU Cores ✓ ✓ ✓ | CPU Freq. ✓ ✓ ✓ | EMC Freq. ✓ ✓ ✓ | GPU Freq. ✓ ✓ ✓ | Sched. Policy ✓ | Sched. Runtime ✓ | Sched. Child Proc ✓ | Dirty Bg. Ratio ✓ | Drop Caches ✓ | CUDA_STATIC_RT ✓ ✓ ✓ | Swap Memory ✓
CADET / Decision Tree / Forum:
Throughput (on TX2): 26 FPS / 20 FPS / 23 FPS
Throughput Gain (over TX1): 53% / 21% / 39%
Time to resolve: 24 min. / 3.5 Hrs. / 2 days
Results: X Finds the root-causes accurately X No unnecessary changes X Better improvements than the forum's recommendation X Much faster
The user expected 30-40% gain

Slide 94

Slide 94 text

Evaluation: Experimental Setup Nvidia TX1 CPU 4 cores, 1.3 GHz GPU 128 Cores, 0.9 GHz Memory 4 Gb, 25 GB/s Nvidia TX2 CPU 6 cores, 2 GHz GPU 256 Cores, 1.3 GHz Memory 8 Gb, 58 GB/s Nvidia Xavier CPU 8 cores, 2.26 GHz GPU 512 cores, 1.3 GHz Memory 32 Gb, 137 GB/s Hardware Systems Software Systems Xception Image recognition (50,000 test images) DeepSpeech Voice recognition (5 sec. audio clip) BERT Sentiment Analysis (10000 IMDb reviews) x264 Video Encoder (11 Mb, 1080p video) Configuration Space X 30 Configurations X 17 System Events • 10 software • 10 OS/Kernel • 10 hardware 94

Slide 95

Slide 95 text

Outline 95 Case Study CADET Future Directions Causal AI for Systems Current Results

Slide 96

Slide 96 text

96 RQ1: How does CADET perform compared to Model based Diagnostics RQ2: How does CADET perform compared to Search-Based Optimization Results: Research Questions

Slide 97

Slide 97 text

97 Results: Research Question 1 (single objective) RQ1: How does CADET perform compared to Model based Diagnostics X Finds the root-causes accurately X Better gain X Much faster Takeaways More accurate than ML-based methods Better Gain Up to 20x faster

Slide 98

Slide 98 text

98 Results: Research Question 1 (multi-objective) RQ1: How does CADET perform compared to Model based Diagnostics X No deterioration of other performance objectives Takeaways Multiple Faults in Latency & Energy usage

Slide 99

Slide 99 text

99 RQ1: How does CADET perform compared to Model based Diagnostics RQ2: How does CADET perform compared to Search-Based Optimization Results: Research Questions

Slide 100

Slide 100 text

Results: Research Question 2 RQ2: How does CADET perform compared to Search-Based Optimization X Better with no deterioration of other performance objectives Takeaways 100

Slide 101

Slide 101 text

101 Results: Research Question 3 RQ2: How does CADET perform compared to Search-Based Optimization X Considerably faster than search-based optimization Takeaways

Slide 102

Slide 102 text

Outline 102 Case Study CADET Causal AI for Systems Current Results Future Directions

Slide 103

Slide 103 text

Opportunities of Causal AI for Serverless • Evaluating our Causal AI for Systems methodology with serverless systems provides the following opportunities: 1. Dynamic system reconfigurations • Dynamic placement of functions • Dynamic reconfiguration of the network of functions • Dynamic multi-cloud placement of functions 2. Root cause analysis of failures or QoS drops 103

Slide 104

Slide 104 text

Opportunities of Causal AI for autonomous robot testing • Testing cyber-physical systems such as robots is difficult. The key reason is that there are additional interactions with the environment and the task that the robot is performing. • Evaluating our Causal AI for Systems methodology with autonomous robots provides the following opportunities: 1. Identifying difficult-to-catch bugs in robots 2. Identifying the root cause of an observed fault and repairing the issue automatically during mission time. 104

Slide 105

Slide 105 text

Summary: Causal AI for Systems 1. Learning a Functional Causal Model for different downstream systems tasks 2. The learned causal model is transferable across different environments 105

Slide 106

Slide 106 text

106