Slide 1

Slide 1 text

Understanding and Explaining the Root Causes of Performance Faults with Causal AI: A Path towards Building Dependable Computer Systems. Pooyan Jamshidi

Slide 2

Slide 2 text

SEAMS’23

Slide 3

Slide 3 text

3 Melbourne, Australia 15-16 May 2023

Slide 4

Slide 4 text

4 Topics of Interest

Slide 5

Slide 5 text

5 Keynote Speaker at SEAMS’20 from NASA

Slide 6

Slide 6 text

Outline 6: Motivation, Causal AI for Systems, UNICORN, Results, Causal AI for Autonomy and Robotics, Autonomy Evaluation at JPL

Slide 7

Slide 7 text

7 Goal: Enable developers/users to find the right quality tradeoff

Slide 8

Slide 8 text

Today’s most popular systems are built to be configurable 8

Slide 9

Slide 9 text

9

Slide 10

Slide 10 text

Empirical observations confirm that systems are becoming increasingly configurable 10
(Figure: number of configuration parameters vs. release time across several systems, e.g., Apache and Hadoop MapReduce/HDFS)
[Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]

Slide 11

Slide 11 text

Empirical observations confirm that systems are becoming increasingly configurable 11
(Figure: excerpt of the FSE’15 paper, with plots of the number of configuration parameters vs. release time for Storage-A, MySQL, Apache, and Hadoop MapReduce/HDFS)
[Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]

Slide 12

Slide 12 text

Configurations determine the performance behavior 12

void Parrot_setenv(... name, ... value) {
#ifdef PARROT_HAS_SETENV
    my_setenv(name, value, 1);
#else
    int name_len = strlen(name);
    int val_len = strlen(value);
    char* envs = glob_env;
    if (envs == NULL) {
        return;
    }
    strcpy(envs, name);
    strcpy(envs + name_len, "=");
    strcpy(envs + name_len + 1, value);
    putenv(envs);
#endif
}

#ifdef LINUX
extern int Parrot_signbit(double x) { ...
#endif

Compile-time options such as PARROT_HAS_SETENV and LINUX affect Speed and Energy.

Slide 13

Slide 13 text

Outline 13 Motivation Causal AI For Systems Results Case Study Causal AI for Autonomy and Robotics Autonomy Evaluation at JPL

Slide 14

Slide 14 text

Case Study 1 SocialSensor

Slide 15

Slide 15 text

SocialSensor 15 Content Analysis Orchestrator Crawling Search and Integration Tweets: [5k-20k/min] Every 10 min: [100k tweets] Tweets: [10M] Fetch Store Push Store Crawled items Fetch Internet

Slide 16

Slide 16 text

Challenges 16 Content Analysis Orchestrator Crawling Search and Integration Tweets: [5k-20k/min] Every 10 min: [100k tweets] Tweets: [10M] Fetch Store Push Store Crawled items Fetch Internet 100X 10X Real time

Slide 17

Slide 17 text

17 How can we gain better performance without using more resources?

Slide 18

Slide 18 text

18 Let’s try out different system configurations!

Slide 19

Slide 19 text

Opportunity: Data processing engines in the pipeline were all configurable 19 > 100 > 100 > 100 2300

Slide 20

Slide 20 text

20 More combinations than estimated atoms in the universe

Slide 21

Slide 21 text

The default configuration is typically bad and the optimal configuration is noticeably better than median 21
(Figure: throughput (ops/sec) vs. average write latency (µs), marking the Default Configuration and the Optimal Configuration; better is toward higher throughput and lower latency)
• Default is bad
• 2X-10X faster than worst
• Noticeably faster than median

Slide 22

Slide 22 text

Performance behavior varies in different environments 22

Slide 23

Slide 23 text

Case Study 2 Robotics

Slide 24

Slide 24 text

CoBot experiment: DARPA BRASS
(Figure: CPU utilization [%] vs. localization error [m] over configurations no_of_particles=x, no_of_refinement=y, showing the energy constraint, the safety constraint, the Pareto front, and a sweet spot; better is toward lower values on both axes)

Slide 25

Slide 25 text

CoBot experiment
(Figure: CPU [%] heatmaps over the configuration space: Source (given), Target (ground truth 6 months), Prediction with 4 samples, Prediction with Transfer learning)

Slide 26

Slide 26 text

Transfer Learning for Improving Model Predictions in Highly Configurable Software. Pooyan Jamshidi, Miguel Velez, Christian Kästner (Carnegie Mellon University, USA), Norbert Siegmund (Bauhaus-University Weimar, Germany), Prasad Kawthekar (Stanford University, USA).
(Slide shows the paper’s first page, including Fig. 1: Transfer learning for performance model learning, where a model is learned with transfer learning from measurements of a simulator (source) and a robot (target) in order to identify the best performing configuration for the robot.)
Details: [SEAMS ’17]

Slide 27

Slide 27 text

Looking further: When transfer learning goes wrong
(Figure: absolute percentage error [%] of predictions for sources s, s1-s6, compared against non-transfer-learning)
Source:        s      s1     s2     s3     s4     s5     s6
noise-level:   0      5      10     15     20     25     30
corr. coeff.:  0.98   0.95   0.89   0.75   0.54   0.34   0.19
µ(pe):         15.34  14.14  17.09  18.71  33.06  40.93  46.75
It worked! It didn’t!
Insight: Predictions become more accurate when the source is more related to the target.

Slide 28

Slide 28 text

(Figure: six CPU usage [%] heatmaps (a)-(f) over the configuration space, number of particles vs. number of refinements, annotated “It worked!” for three environments and “It didn’t!” for the other three)

Slide 29

Slide 29 text

Key question: Can we develop a theory to explain when transfer learning works?
(Diagram: Source (given) and Target (learn), each with Data and a Model learned from it; Transferable Knowledge is extracted from the source and reused for the target.)
Q1: How source and target are “related”?
Q2: What characteristics are preserved?
Q3: What are the actionable insights?
(The slide excerpts the paper’s preliminary concepts: the configuration space is the Cartesian product C = Dom(F1) × ... × Dom(Fd) with Dom(Fi) = {0, 1}; an environment instance is e = [w, h, v] drawn from E = W × H × V (workload, hardware, system version); a performance model is a black-box function f : F × E → R learned from observations y_i = f(x_i) + ε_i with ε_i ~ N(0, σ_i) and training data D_tr = {(x_i, y_i)}; a performance distribution pd : E → Δ(R) assigns a probability distribution over performance measures to each environmental condition.)

Slide 30

Slide 30 text

Transfer Learning for Performance Modeling of Configurable Systems: An Exploratory Analysis. Pooyan Jamshidi (Carnegie Mellon University, USA), Norbert Siegmund (Bauhaus-University Weimar, Germany), Miguel Velez, Christian Kästner, Akshay Patel, Yuvraj Agarwal (Carnegie Mellon University, USA).
Abstract: Modern software systems provide many configuration options which significantly influence their non-functional properties. To understand and predict the effect of configuration options, several sampling and learning strategies have been proposed, albeit often with significant cost to cover the highly dimensional configuration space. Recently, transfer learning has been applied to reduce the effort of constructing performance models by transferring knowledge about performance behavior across environments. While this line of research is promising to learn more accurate models at a lower cost, it is unclear why and when transfer learning works for performance modeling. To shed light on when it is beneficial to apply transfer learning, we conducted an empirical study on four popular software systems, varying software configurations and environmental conditions, such as hardware, workload, and software versions, to identify the key knowledge pieces that can be exploited for transfer learning. Our results show that in small environmental changes (e.g., homogeneous workload change), by applying a linear transformation to the performance model, we can understand the performance behavior of the target environment, while for severe environmental changes (e.g., drastic workload change) we can transfer only knowledge that makes sampling more efficient, e.g., by reducing the dimensionality of the configuration space.
Fig. 1: Transfer learning is a form of machine learning that takes advantage of transferable knowledge from source to learn an accurate, reliable, and less costly model for the target environment.
Details: [ASE ’17]
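The “small environmental change” case above (reusing a source-environment model after a linear transformation) can be illustrated with a short sketch. This is a minimal, hypothetical example, not the paper’s implementation: the option layout, the synthetic data, and the use of scikit-learn are all assumptions.

# Sketch: reuse a source-environment performance model in a target environment by
# fitting only a linear map (slope/offset) on a handful of target measurements.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: rows are configurations (binary options), labels are latency.
X_source = rng.integers(0, 2, size=(200, 10)).astype(float)
y_source = 50 + 20 * X_source[:, 0] + 10 * X_source[:, 1] * X_source[:, 2] + rng.normal(0, 1, 200)

# Target environment shifts the response roughly linearly (the "small change" case).
X_target = rng.integers(0, 2, size=(25, 10)).astype(float)
y_target = 1.8 * (50 + 20 * X_target[:, 0] + 10 * X_target[:, 1] * X_target[:, 2]) + 5 + rng.normal(0, 1, 25)

source_model = RandomForestRegressor(random_state=0).fit(X_source, y_source)

# Fit a 1-D linear transformation from source-model predictions to the few target samples.
transfer = LinearRegression().fit(source_model.predict(X_target).reshape(-1, 1), y_target)

def predict_target(X):
    """Predict target-environment performance via the transferred source model."""
    return transfer.predict(source_model.predict(X).reshape(-1, 1))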

Slide 31

Slide 31 text

Details: [AAAI Spring Symposium ’19]

Slide 32

Slide 32 text

Outline 32: Motivation, Causal AI for Systems, UNICORN, Results, Future Directions

Slide 33

Slide 33 text

Causal AI in Systems and Software 33 Computer Architecture Database Operating Systems Programming Languages BigData Software Engineering https://github.com/y-ding/causal-system-papers

Slide 34

Slide 34 text

Misconfiguration and its Effects 34
● Misconfigurations can elicit unexpected interactions between software and hardware
● These can result in non-functional faults
○ Affecting non-functional system properties like latency, throughput, energy consumption, etc.
The system doesn’t crash or exhibit an obvious misbehavior. Systems are still operational but with degraded performance, e.g., high latency, low throughput, high energy consumption, high heat dissipation, or a combination of these.

Slide 35

Slide 35 text

35 Motivating Example: CUDA performance issue on TX2
“When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases.”
• The user is transferring the code from one hardware platform to another.
• The target hardware is faster than the source hardware; the user expects the code to run at least 30-40% faster.
• The code ran 2x slower on the more powerful hardware.

Slide 36

Slide 36 text

Motivating Example 36
June 3rd: “Any suggestions on how to improve my performance? Thanks!”
June 4th: “Please do the following and let us know if it works: 1. Install JetPack 3.0, 2. Set nvpmodel=MAX-N, 3. Run jetson_clock.sh”
June 4th: “We have already tried this. We still have high latency. Any other suggestions?”
June 5th: “TX2 is pascal architecture. Please update your CMakeLists: + set(CUDA_STATIC_RUNTIME OFF) ... + -gencode=arch=compute_62,code=sm_62”
The user had several misconfigurations. In Software: ✖ Wrong compilation flags, ✖ Wrong SDK version. In Hardware: ✖ Wrong power mode, ✖ Wrong clock/fan settings.
The discussions took 2 days! How to resolve such issues faster?

Slide 37

Slide 37 text

37 How to resolve these issues faster?

Slide 38

Slide 38 text

38 Performance Influence Models
(Figure: a latency (ms) response surface over number of counters and number of splitters, with black-box models fitted to it)
Observational Data:
Config | Bitrate (bits/s) | Enable Padding | … | Cache Misses | … | Throughput (fps)
c1     | 1k               | 1              | … | 42m          | … | 7
c2     | 2k               | 1              | … | 32m          | … | 22
…      | …                | …              | … | …            | … | …
cn     | 5k               | 0              | … | 12m          | … | 25
Regression Equation (discovered interactions among options):
Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize

Slide 39

Slide 39 text

39 Performance Influence Models
These methods rely on statistical correlations to extract meaningful information required for performance tasks.
(Same observational data, black-box model, and regression equation as the previous slide.)
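As a concrete illustration of how such a performance-influence model is typically fitted from observational data, the sketch below regresses throughput on configuration options and their pairwise interactions. The option names and synthetic data mirror the table above and are assumptions, not the exact setup used in the talk.

# Sketch: fit a performance-influence model (options + pairwise interactions) with OLS.
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200
data = pd.DataFrame({
    "Bitrate":       rng.choice([1000, 2000, 5000], n),
    "EnablePadding": rng.integers(0, 2, n),
    "BatchSize":     rng.choice([2, 4, 8, 16], n),
})
# Hypothetical ground truth with an interaction term, plus measurement noise.
data["Throughput"] = (0.005 * data["Bitrate"] + 2.5 * data["BatchSize"]
                      + 0.001 * data["Bitrate"] * data["BatchSize"]
                      + rng.normal(0, 1, n))

X = data[["Bitrate", "EnablePadding", "BatchSize"]]
y = data["Throughput"]

# interaction_only=True adds terms like Bitrate*BatchSize, but no squared terms.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, y)
for name, coef in zip(poly.get_feature_names_out(X.columns), model.coef_):
    print(f"{coef:+.3f} * {name}")   # e.g., "+0.001 * Bitrate BatchSize"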

Slide 40

Slide 40 text

40 Performance Influence Models suffer from several shortcomings:
• Performance influence models could produce incorrect explanations.
• Performance influence models could produce unreliable predictions.
• Performance influence models could produce unstable predictions across environments and in the presence of measurement noise.

Slide 41

Slide 41 text

Performance Influence Models Issue: Incorrect Explanation 41
(Figure: scatter plot of Throughput (FPS) vs. Cache Misses with a rising trend)
Increasing Cache Misses increases Throughput.

Slide 42

Slide 42 text

Performance Influence Models Issue: Incorrect Explanation 42
(Figure: the same scatter plot of Throughput (FPS) vs. Cache Misses)
Increasing Cache Misses increases Throughput. This is counter-intuitive: more Cache Misses should reduce Throughput, not increase it. Any ML/statistical model built on this data will be incorrect.

Slide 43

Slide 43 text

Performance Influence Models Issue: Incorrect Explanation 43
(Figure: the same data colored by Cache Policy: LRU, FIFO, LIFO, MRU)
Segregating the data on Cache Policy shows that, within each group, an increase in Cache Misses results in a decrease in Throughput.
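This sign reversal (a Simpson’s-paradox effect) is easy to reproduce with synthetic data: the pooled correlation between cache misses and throughput is positive, while the correlation within each cache-policy group is negative. The numbers below are made up purely to illustrate the phenomenon.

# Sketch: pooled vs. per-group correlation between cache misses and throughput.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
frames = []
# Each cache policy shifts both cache misses and throughput, creating a positive
# trend across groups even though the within-group trend is negative.
for policy, (miss_base, fps_base) in {"LRU": (60_000, 8), "FIFO": (100_000, 12),
                                      "LIFO": (140_000, 16), "MRU": (180_000, 20)}.items():
    misses = miss_base + rng.normal(0, 10_000, 200)
    fps = fps_base - 0.00005 * (misses - miss_base) + rng.normal(0, 0.5, 200)
    frames.append(pd.DataFrame({"policy": policy, "cache_misses": misses, "throughput": fps}))
data = pd.concat(frames, ignore_index=True)

print("pooled corr:", data["cache_misses"].corr(data["throughput"]))      # positive
print(data.groupby("policy").apply(
    lambda g: g["cache_misses"].corr(g["throughput"])))                    # negative in each group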

Slide 44

Slide 44 text

Performance Influence Models Issue: Unstable Predictors 44
Performance influence models change significantly in new environments, resulting in lower accuracy.
Performance influence model in TX2:
Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
Performance influence model in Xavier:
Throughput = 2 × Bitrate + 1.9 × BatchSize + 1.8 × BufferSize + 0.5 × EnablePadding + 5.9 × Bitrate × BufferSize + 6.2 × Bitrate × EnablePadding + 4.1 × Bitrate × BufferSize × EnablePadding

Slide 45

Slide 45 text

Performance Influence Models Issue: Unstable Predictors 45
Performance influence models cannot be reliably used across environments.
Performance influence model in TX2:
Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
Performance influence model in Xavier:
Throughput = 2 × Bitrate + 1.9 × BatchSize + 1.8 × BufferSize + 0.5 × EnablePadding + 5.9 × Bitrate × BufferSize + 6.2 × Bitrate × EnablePadding + 4.1 × Bitrate × BufferSize × EnablePadding

Slide 46

Slide 46 text

Performance Influence Models Issue: Non-generalizability 46
Performance influence models do not generalize well across deployment environments.
Performance influence model in TX2:
Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
Performance influence model in Xavier:
Throughput = 2 × Bitrate + 1.9 × BatchSize + 1.8 × BufferSize + 0.5 × EnablePadding + 5.9 × Bitrate × BufferSize + 6.2 × Bitrate × EnablePadding + 4.1 × Bitrate × BufferSize × EnablePadding

Slide 47

Slide 47 text

47 Causal Performance Model: expresses the relationships between interacting variables (configuration options, system events, non-functional properties) as a causal graph.
(Figure: the Throughput (FPS) vs. Cache Misses scatter plot next to the causal graph Cache Policy → Cache Misses → Throughput, annotated with the Direction of Causality)

Slide 48

Slide 48 text

Why Causal Inference? - Produces Correct Explanations 48
(Figure: the pooled and per-policy scatter plots of Throughput (FPS) vs. Cache Misses)
Cache Policy affects Throughput via Cache Misses (Cache Policy → Cache Misses → Throughput). Causal performance models recover the correct interactions.

Slide 49

Slide 49 text

Why Causal Inference? - Minimal Structure Change 49
Causal models remain relatively stable.
(Figure: a partial causal performance model in Jetson TX2 and one in Jetson Xavier, both over the nodes Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, Cycles, FPS, and Energy)

Slide 50

Slide 50 text

Why Causal Inference? - Accurate across Environments 50
(Figure: for 50 regression models, the number of terms and the error (MAPE %) of Performance Influence Models vs. Causal Performance Models, showing Common Terms (Source → Target), Total Terms (Source), Total Terms (Target), Error (Source), Error (Target), and Error (Source → Target))
Causal performance models share a large number of common predictors across environments; performance influence models share far fewer.

Slide 51

Slide 51 text

Why Causal Inference? - Accurate across Environments 51
(Figure: same comparison as the previous slide; causal performance models show low error when reused across environments, while performance influence models show high error when reused)
Causal models can be reliably reused when environmental changes occur.

Slide 52

Slide 52 text

Why Causal Inference? - Generalizability 52
(Figure: same comparison of terms and error (MAPE %) as the previous slides)
Causal models are more generalizable than performance influence models.

Slide 53

Slide 53 text

53 How to use Causal Performance Models?
(Causal graph: Cache Policy → Cache Misses → Throughput)
How to generate a causal graph?

Slide 54

Slide 54 text

54 How to use Causal Performance Models?
(Causal graph: Cache Policy → Cache Misses → Throughput)
How to generate a causal graph? How to use the causal graph for performance tasks?

Slide 55

Slide 55 text

Outline 55: Motivation, Causal AI for Systems, UNICORN, Results, Future Directions

Slide 56

Slide 56 text

UNICORN: Our Causal AI for Systems Method
• Build a causal performance model that captures the interactions among options in the variability space, using observational performance data.
• Iteratively evaluate and update the causal performance model.
• Perform downstream performance tasks such as performance debugging & optimization using causal reasoning.

Slide 57

Slide 57 text

UNICORN: Our Causal AI for Systems Method
System under study: Software: DeepStream; Middleware: TF, TensorRT; Hardware: Nvidia Xavier; Configuration: Default.
1- Specify Performance Query (QoS: Th > 40/s; Observed: Th < 30/s ± 5/s)
2- Learn Causal Performance Model from Performance Data
3- Translate Performance Query to Causal Queries. Performance Debugging: What is the root-cause of the observed perf. fault? How do I fix the misconfig.? Performance Optimization: How can I improve throughput without sacrificing accuracy? How do I understand perf behavior?
4- Estimate Causal Queries with the Query Engine, e.g., estimate the probability of satisfying QoS if BufferSize is set to 6k: P(Th > 40/s | do(BufferSize = 6k))
5- Update Causal Performance Model: while the budget is not exhausted, measure the performance of the configuration(s) that maximizes information gain and update the model.

Slide 58

Slide 58 text

UNICORN: Our Causal AI for Systems Method (pipeline repeated from Slide 57)

Slide 59

Slide 59 text

UNICORN: Our Causal AI for Systems Method (pipeline repeated from Slide 57)

Slide 60

Slide 60 text

Learning Causal Performance Model
(From observational data over configuration options, system events, and performance measures such as Bitrate, Enable Padding, Cache Misses, and Throughput, a causal graph over FPS, Energy, Branch Misses, Cache Misses, No of Cycles, Bitrate, Buffer Size, Batch Size, and Enable Padding is learned in three steps.)
1- Recovering the Skeleton: start from a fully connected graph, given constraints (e.g., no connections between configuration options).
2- Pruning Causal Structure: statistical independence tests.
3- Orienting Causal Relations: orientation rules & measures (entropy) + structural constraints (colliders, v-structures).
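The sketch below illustrates the skeleton-recovery and pruning steps with simple (partial-)correlation independence tests (Fisher’s z). It is a simplified stand-in for the constraint-based discovery UNICORN actually uses; the column names, the single-variable conditioning sets, and the significance threshold are illustrative assumptions, and edge orientation is not shown.

# Sketch: recover an undirected causal skeleton by pruning a fully connected graph
# using (partial-)correlation independence tests.
import itertools
import numpy as np
import pandas as pd
from scipy import stats

def partial_corr(df, x, y, z=None):
    """Correlation between x and y, optionally controlling for one variable z."""
    if z is None:
        return df[x].corr(df[y])
    rx = df[x] - np.poly1d(np.polyfit(df[z], df[x], 1))(df[z])  # residualize x on z
    ry = df[y] - np.poly1d(np.polyfit(df[z], df[y], 1))(df[z])  # residualize y on z
    return rx.corr(ry)

def independent(df, x, y, z=None, alpha=0.05):
    """Fisher z-test: True if x and y look independent (given z)."""
    r = float(np.clip(partial_corr(df, x, y, z), -0.999999, 0.999999))
    n = len(df) - (3 if z is None else 4)
    z_stat = np.sqrt(n) * 0.5 * np.log((1 + r) / (1 - r))
    return 2 * (1 - stats.norm.cdf(abs(z_stat))) > alpha

def skeleton(df, forbidden=frozenset()):
    cols = list(df.columns)
    edges = {frozenset(e) for e in itertools.combinations(cols, 2)} - set(forbidden)
    for edge in list(edges):
        x, y = tuple(edge)
        # prune if unconditionally independent, or independent given any single other variable
        if independent(df, x, y) or any(independent(df, x, y, z) for z in cols if z not in edge):
            edges.remove(edge)
    return edges

# Usage (hypothetical column names matching the slide):
# df = pd.read_csv("perf_measurements.csv")
# forbidden = {frozenset(p) for p in itertools.combinations(
#     ["Bitrate", "BufferSize", "BatchSize", "EnablePadding"], 2)}  # no edges between options
# print(skeleton(df, forbidden))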

Slide 61

Slide 61 text

Performance measurement 61
Configuration space: C = O1 × O2 × ⋯ × O19 × O20, with options such as dead code removal, constant folding, loop unrolling, and function inlining.
A configuration c1 = ⟨0, 0, ⋯, 0, 1⟩, c1 ∈ C.
Pipeline: Program → Compiler (e.g., SaC, LLVM) → Compiled Code → Instrumented Binary → Hardware (Compile, Deploy, Configure).
Non-functional, measurable/quantifiable aspects: fc(c1) = 11.1 ms (compile time), fe(c1) = 110.3 ms (execution time), fen(c1) = 100 mWh (energy).
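A tiny sketch of how such a binary configuration space can be enumerated and measured. The option names and the measure() stub are placeholders for the real compile/deploy/measure pipeline, not an actual toolchain.

# Sketch: enumerate a binary configuration space O1 x ... x Od and measure each configuration.
import itertools

OPTIONS = ["dead_code_removal", "constant_folding", "loop_unrolling", "function_inlining"]

def measure(config):
    """Placeholder for the real pipeline (compile -> deploy -> configure -> run):
    should return compile time (ms), execution time (ms), and energy (mWh)."""
    return 11.1, 110.3, 100.0  # dummy values; replace with real measurements

results = []
for values in itertools.product([0, 1], repeat=len(OPTIONS)):
    config = dict(zip(OPTIONS, values))   # e.g., {'dead_code_removal': 0, ..., 'function_inlining': 1}
    compile_ms, exec_ms, energy_mwh = measure(config)
    results.append({**config, "compile_ms": compile_ms, "exec_ms": exec_ms, "energy_mwh": energy_mwh})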

Slide 62

Slide 62 text

Our setup for performance measurements 62

Slide 63

Slide 63 text

Learning Causal Performance Model (repeated from Slide 60)

Slide 64

Slide 64 text

Learning Causal Performance Model (repeated from Slide 60)

Slide 65

Slide 65 text

Learning Causal Performance Model (repeated from Slide 60)

Slide 66

Slide 66 text

Causal Performance Model
(Causal graph over software options (Bitrate, Buffer Size, Batch Size, Enable Padding), performance events (Branch Misses, Cache Misses, No of Cycles), components (Decoder, Muxer), and performance objectives (Throughput, Energy), with causal interactions f along causal paths.)
Example functional form of a causal interaction:
f_BranchMisses = 2 × Bitrate + 8.1 × BufferSize + 4.1 × Bitrate × BufferSize × CacheMisses

Slide 67

Slide 67 text

UNICORN: Our Causal AI for Systems Method (pipeline repeated from Slide 57)

Slide 68

Slide 68 text

68 Causal Debugging
• What is the root-cause of my fault?
• How do I fix my misconfigurations to improve performance?
Workflow: starting from a misconfiguration and about 25 sample configurations (training data) as observational data: Build Causal Graph → Extract Causal Paths → Rank Paths → Counterfactual Queries (what-if questions, e.g., what if the configuration option X was set to a value ‘x’?) → Best Query. If the fault is not fixed, update the observational data and repeat.

Slide 69

Slide 69 text

Extracting Causal Paths from the Causal Model 69
Problem
✕ In real-world cases, this causal graph can be very complex
✕ It may be intractable to reason over the entire graph directly
Solution
✓ Extract paths from the causal graph
✓ Rank them based on their Average Causal Effect on latency, etc.
✓ Reason over the top K paths

Slide 70

Slide 70 text

Extracting Causal Paths from the Causal Model 70
A causal path always begins with a configuration option or a system event and always terminates at a performance objective.
(Example graph over Load, Swap Mem., GPU Mem., and Latency, with extracted paths such as Load → GPU Mem. → Latency and Swap Mem. → Latency.)
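A short sketch of this path-extraction step: enumerate directed paths that start at a configuration option (or system event) and end at a performance objective. networkx is assumed, and the small example graph only mirrors the labels on the slide.

# Sketch: extract causal paths option/event -> ... -> performance objective.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("Load", "GPU Mem."), ("Swap Mem.", "GPU Mem."),
    ("GPU Mem.", "Latency"), ("Swap Mem.", "Latency"),
])

options_and_events = {"Load", "Swap Mem."}
objectives = {"Latency"}

paths = [p
         for src in options_and_events
         for obj in objectives
         for p in nx.all_simple_paths(g, src, obj)]
# e.g., ['Load', 'GPU Mem.', 'Latency'], ['Swap Mem.', 'Latency'], ...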

Slide 71

Slide 71 text

Ranking Causal Paths from the Causal Model 71
● There may be too many causal paths
● We need to select the most useful ones
● Compute the Average Causal Effect (ACE) of each pair of neighbors in a path (e.g., the path over Swap Mem., GPU Mem., and Latency)
ACE(GPU Mem., Swap) = (1/N) Σ_{a,b ∈ Z} E[GPU Mem. | do(Swap = b)] − E[GPU Mem. | do(Swap = a)]
Here E[GPU Mem. | do(Swap = b)] is the expected value of GPU Mem. when we artificially intervene by setting Swap to the value b, and E[GPU Mem. | do(Swap = a)] is the expected value when we set Swap to the value a; the sum averages over all permitted values of Swap memory. If this difference is large, then small changes to Swap Mem. will cause large changes to GPU Mem.

Slide 72

Slide 72 text

Ranking Causal Paths from the Causal Model 72
● Average the ACE of all pairs of adjacent nodes in the path
● Rank paths from highest path ACE (PACE) score to the lowest
● Use the top K paths for subsequent analysis
For a path Z → X → Y (e.g., Swap Mem. → GPU Mem. → Latency):
PACE(Z, Y) = 1/2 (ACE(Z, X) + ACE(X, Y))
i.e., the sum over all pairs of adjacent nodes in the causal path, divided by the number of pairs.
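The ACE and PACE computations can be sketched as follows. The estimate_do() helper is a crude observational stand-in for an interventional estimator (a conditional mean), so treat it as an assumption rather than UNICORN’s actual do-calculus machinery; the absolute difference is used only to make the ranking monotone in effect size.

# Sketch: rank causal paths by the average of pairwise Average Causal Effects (PACE).
import itertools
import pandas as pd

def estimate_do(df, target, var, value):
    """Crude stand-in for E[target | do(var = value)]: conditional mean over the data."""
    subset = df[df[var] == value]
    return subset[target].mean() if len(subset) else df[target].mean()

def ace(df, target, var):
    """Average Causal Effect of `var` on `target`, averaged over pairs of permitted values."""
    values = sorted(df[var].unique())
    pairs = list(itertools.combinations(values, 2))
    return sum(abs(estimate_do(df, target, var, b) - estimate_do(df, target, var, a))
               for a, b in pairs) / max(len(pairs), 1)

def pace(df, path):
    """Average ACE over adjacent (cause, effect) node pairs along a causal path."""
    pairs = list(zip(path, path[1:]))
    return sum(ace(df, effect, cause) for cause, effect in pairs) / len(pairs)

# Usage (hypothetical data and paths):
# df = pd.read_csv("perf_measurements.csv")
# ranked = sorted(paths, key=lambda p: pace(df, p), reverse=True)[:K]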

Slide 73

Slide 73 text

73 Diagnosing and Fixing the Faults
• What is the root-cause of my fault?
• How do I fix my misconfigurations to improve performance?
(Same workflow as Slide 68: build the causal graph from about 25 sample configurations, extract and rank causal paths, issue counterfactual what-if queries such as “what if the configuration option X was set to a value ‘x’?”, apply the best query, and update the observational data until the fault is fixed.)

Slide 74

Slide 74 text

Diagnosing and Fixing the Faults 74
● Counterfactual inference asks “what if” questions about changes to the misconfigurations
Example: Given that my current swap memory is 2 Gb and I have high latency, what is the probability of having low latency if swap memory were increased to 4 Gb?
We are interested in the scenario where:
• We hypothetically have low latency;
Conditioned on the following events:
• We hypothetically set the new Swap memory to 4 Gb
• Swap Memory was initially set to 2 Gb
• We observed high latency when Swap was set to 2 Gb
• Everything else remains the same

Slide 75

Slide 75 text

Diagnosing and Fixing the Faults 75
Original path: Swap → GPU Mem. → Latency (with Load also influencing GPU Mem.).
Path after proposed change: Swap = 4 Gb → GPU Mem. → Latency (Low?). Remove incoming edges to the changed node, assume no external influence, and modify the graph to reflect the hypothetical scenario.
Use both models to compute the answer to the counterfactual question.

Slide 76

Slide 76 text

Diagnosing and Fixing the Faults 76
Original path: Swap → GPU Mem. → Latency (with Load); path after proposed change: Swap = 4 Gb.
Potential = P( Latency_{Swap←4Gb} = low | Swap = 2 Gb, Latency_{Swap=2Gb} = high, U )
We expect a low latency, given that the Swap is now (hypothetically) 4 Gb, the Swap was initially 2 Gb, the latency was high, and everything else (U) stays the same.

Slide 77

Slide 77 text

Diagnosing and Fixing the Faults 77
Potential = P( outcome_{change} = good | outcome_{¬change} = bad, ¬change, U )
(the probability that the outcome is good after a change, conditioned on the past)
Control = P( outcome = bad | ¬change, U )
(the probability that the outcome was bad before the change)
Individual Treatment Effect = Potential − Control
If this difference is large, then our change is useful.

Slide 78

Slide 78 text

Diagnosing and Fixing the Faults 78
Take the top K paths (e.g., Swap Mem. → GPU Mem. → Latency, ⋮) and enumerate all possible changes: set every configuration option in the path to all permitted values, compute ITE(change) for each, and pick the change with the largest ITE. These estimates are inferred from observed data, so this step is very cheap.
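A sketch of this repair loop: enumerate candidate values for each option on the top paths, score each candidate with an ITE-style estimate, and propose the change with the largest effect. The scoring function below is a simplified observational proxy, not UNICORN’s counterfactual engine, and all names are illustrative.

# Sketch: enumerate candidate option changes on the top paths and rank them by an
# ITE-style score estimated from the observational data.
import pandas as pd

def ite_proxy(df, option, new_value, current_value, objective, lower_is_better=True):
    """Simplified stand-in for the Individual Treatment Effect of setting `option`
    from `current_value` to `new_value` (observational estimate, not a true counterfactual)."""
    potential = df[df[option] == new_value][objective].mean()     # expected outcome with the change
    control = df[df[option] == current_value][objective].mean()   # expected outcome without it
    return (control - potential) if lower_is_better else (potential - control)

def propose_fix(df, top_paths, current_config, objective="latency"):
    candidates = []
    for path in top_paths:
        for option in path:
            if option not in current_config:      # skip system events / performance objectives
                continue
            for value in sorted(df[option].unique()):
                if value != current_config[option]:
                    score = ite_proxy(df, option, value, current_config[option], objective)
                    candidates.append((option, value, score))
    return max(candidates, key=lambda c: c[2])    # change with the largest estimated effect

# Apply the proposed change, measure, and check whether the fault is fixed; if not,
# add the new observation, update the causal model, and repeat.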

Slide 79

Slide 79 text

Diagnosing and Fixing the Faults 79 Change with the largest ITE Fault fixed? Yes No • Add to observational data • Update causal model • Repeat… Measure Performance

Slide 80

Slide 80 text

UNICORN: Our Causal AI for Systems Method (pipeline repeated from Slide 57)

Slide 81

Slide 81 text

Active Learning for Updating Causal Performance Model
1- Evaluate Candidate Interventions: model averaging; expected change in belief & KL; causal effects on objectives.
2- Determine & Perform next Perf Measurement: interventions on hardware, workload, and kernel options, e.g.
Option/Event/Obj: Bitrate = 1k, Buffer Size = 20k, Batch Size = 10, Enable Padding = 1, Branch Misses = 24m, Cache Misses = 42m, No of Cycles = 73b, FPS = 31/s, Energy = 42J
3- Updating Causal Model with the new Performance Data.
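As a rough sketch of the measurement-selection step, one can use disagreement across an ensemble (model averaging) as a cheap proxy for the expected-information-gain / change-in-belief criterion named above. This is only a proxy under that assumption, not UNICORN’s actual acquisition function; column names are hypothetical.

# Sketch: choose the next configuration to measure via ensemble disagreement.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def next_measurement(measured_df, candidate_df, objective="FPS"):
    """Return the candidate configuration whose predicted objective is most uncertain."""
    feature_cols = [c for c in measured_df.columns if c != objective]
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(measured_df[feature_cols].to_numpy(), measured_df[objective].to_numpy())
    per_tree = np.stack([tree.predict(candidate_df[feature_cols].to_numpy())
                         for tree in model.estimators_])
    disagreement = per_tree.std(axis=0)   # high std ~ beliefs would change most after measuring here
    return candidate_df.iloc[int(disagreement.argmax())]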

Slide 82

Slide 82 text

Active Learning for Updating Causal Performance Model (repeated from Slide 81)

Slide 83

Slide 83 text

Active Learning for Updating Causal Performance Model (repeated from Slide 81)

Slide 84

Slide 84 text

Benefits of Causal Reasoning for System Performance Analysis

Slide 85

Slide 85 text

There are two fundamental benefits we get from our “Causal AI for Systems” methodology 85
1. We learn one central (causal) performance model from the data across different performance tasks:
• Performance understanding
• Performance optimization
• Performance debugging and repair
• Performance prediction for different environments (e.g., canary -> production)
2. The causal model is transferable across environments.
• We observed Sparse Mechanism Shift in systems too!
• Alternative non-causal models (e.g., regression-based models for performance tasks) are not transferable as they rely on the i.i.d. setting.

Slide 86

Slide 86 text

86 The new version of CADET, called UNICORN, was accepted at EuroSys 2022. https://github.com/softsys4ai/UNICORN

Slide 87

Slide 87 text

Outline 87: Motivation, Causal AI for Systems, UNICORN, Results, Causal AI for Autonomy and Robotics, Autonomy Evaluation at JPL

Slide 88

Slide 88 text

Results: Case Study 88
“When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases.”
• The user is transferring the code from one hardware platform to another.
• The target hardware is faster than the source hardware; the user expects the code to run at least 30-40% faster.
• The code ran 2x slower on the more powerful hardware.

Slide 89

Slide 89 text

Results: Case Study 89
Nvidia TX1: CPU 4 cores, 1.3 GHz; GPU 128 cores, 0.9 GHz; Memory 4 GB, 25 GB/s
Nvidia TX2 (more powerful): CPU 6 cores, 2 GHz; GPU 256 cores, 1.3 GHz; Memory 8 GB, 58 GB/s
Embedded real-time stereo estimation, same source code: 17 FPS on TX1 vs. 4 FPS on TX2, i.e., 4x slower!

Slide 90

Slide 90 text

Results: Case Study 90
Configuration options changed (UNICORN / Decision Tree / Forum):
CPU Cores: ✓ ✓ ✓
CPU Freq.: ✓ ✓ ✓
EMC Freq.: ✓ ✓ ✓
GPU Freq.: ✓ ✓ ✓
Sched. Policy: ✓
Sched. Runtime: ✓
Sched. Child Proc: ✓
Dirty Bg. Ratio: ✓
Drop Caches: ✓
CUDA_STATIC_RT: ✓ ✓ ✓
Swap Memory: ✓
Results (UNICORN / Decision Tree / Forum):
Throughput (on TX2): 26 FPS / 20 FPS / 23 FPS
Throughput Gain (over TX1): 53% / 21% / 39% (the user expected a 30-40% gain)
Time to resolve: 24 min. / 3.5 Hrs. / 2 days
X Finds the root-causes accurately
X No unnecessary changes
X Better improvements than forum’s recommendation
X Much faster

Slide 91

Slide 91 text

Evaluation: Experimental Setup 91
Hardware Systems: Nvidia TX1 (CPU 4 cores, 1.3 GHz; GPU 128 cores, 0.9 GHz; Memory 4 GB, 25 GB/s), Nvidia TX2 (CPU 6 cores, 2 GHz; GPU 256 cores, 1.3 GHz; Memory 8 GB, 58 GB/s), Nvidia Xavier (CPU 8 cores, 2.26 GHz; GPU 512 cores, 1.3 GHz; Memory 32 GB, 137 GB/s)
Software Systems: Xception, image recognition (50,000 test images); DeepSpeech, voice recognition (5 sec. audio clip); BERT, sentiment analysis (10,000 IMDb reviews); x264, video encoder (11 MB, 1080p video)
Configuration Space: 30 configuration options (10 software, 10 OS/kernel, 10 hardware) and 17 system events

Slide 92

Slide 92 text

Evaluation: Data Collection 92
● For each software/hardware combination, create a benchmark dataset:
○ Exhaustively set each configuration option to all permitted values.
○ For continuous options (e.g., GPU Mem.), sample 10 equally spaced values between [min, max].
● Measure the latency, energy consumption, and heat dissipation.
○ Repeat 5x and average.
(Figure: performance distributions highlighting Latency Faults, Energy Faults, and Multiple Faults)
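The collection procedure above (exhaustive discrete values, 10 equally spaced values for continuous options, 5 repetitions averaged) could look roughly like the sketch below; the option specification format and the measure() stub are placeholders, not the actual benchmarking harness.

# Sketch: benchmark data collection with repetition and averaging.
import itertools
import numpy as np
import pandas as pd

def value_grid(option):
    """Exhaustive values for discrete options; 10 equally spaced values for continuous ones."""
    if option["type"] == "discrete":
        return list(option["values"])
    return np.linspace(option["min"], option["max"], 10).tolist()

def measure(config):
    """Placeholder: deploy `config` on the device and return (latency, energy, heat)."""
    return 0.0, 0.0, 0.0  # replace with real measurements

def collect(options, repetitions=5):
    names = [o["name"] for o in options]
    rows = []
    for values in itertools.product(*(value_grid(o) for o in options)):
        config = dict(zip(names, values))
        runs = np.array([measure(config) for _ in range(repetitions)])
        latency, energy, heat = runs.mean(axis=0)   # average over the repeated runs
        rows.append({**config, "latency": latency, "energy": energy, "heat": heat})
    return pd.DataFrame(rows)

# Example option spec (hypothetical):
# options = [{"name": "GPU Mem.", "type": "continuous", "min": 1, "max": 4},
#            {"name": "Swap Mem.", "type": "discrete", "values": [1, 2, 4]}]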

Slide 93

Slide 93 text

Evaluation: Ground Truth 93
● For each performance fault:
○ Manually investigate the root-cause
○ “Fix” the misconfigurations
● A “fix” implies the configuration no longer has tail performance
○ User-defined benchmark (i.e., 10th percentile)
○ Or some QoS/SLA benchmark
● Record the configurations that were changed
(Figure: performance distributions highlighting Latency Faults, Energy Faults, and Multiple Faults)

Slide 94

Slide 94 text

Results: Research Questions 94
RQ1: How does UNICORN perform compared to Model-based Diagnostics?
RQ2: How does UNICORN perform compared to Search-Based Optimization?

Slide 95

Slide 95 text

95 Results: Research Question 1 (single objective)
RQ1: How does UNICORN perform compared to Model-based Diagnostics?
Takeaways: X Finds the root-causes accurately (more accurate than ML-based methods) X Better gain X Much faster (up to 20x faster)

Slide 96

Slide 96 text

96 Results: Research Question 1 (multi-objective)
RQ1: How does UNICORN perform compared to Model-based Diagnostics? (multiple faults in latency & energy usage)
Takeaways: X No deterioration of other performance objectives

Slide 97

Slide 97 text

Results: Research Questions 97
RQ1: How does UNICORN perform compared to Model-based Diagnostics?
RQ2: How does UNICORN perform compared to Search-Based Optimization?

Slide 98

Slide 98 text

Results: Research Question 2 98
RQ2: How does UNICORN perform compared to Search-Based Optimization?
Takeaways: X Better with no deterioration of other performance objectives

Slide 99

Slide 99 text

99 Results: Research Question 3
RQ2: How does UNICORN perform compared to Search-Based Optimization?
Takeaways: X Considerably faster than search-based optimization

Slide 100

Slide 100 text

Summary: Causal AI for Systems 100
1. Learning a Functional Causal Model for different downstream systems tasks
2. The learned causal model is transferable across different environments
(UNICORN pipeline figure repeated from Slide 57)

Slide 101

Slide 101 text

Artificial Intelligence and Systems Laboratory (AISys Lab) 101
Machine Learning, Computer Systems, Autonomy, AI/ML Systems
https://pooyanjamshidi.github.io/AISys/
Ying Meng (PhD student), Shuge Lei (PhD student), Kimia Noorbakhsh (Undergrad), Shahriar Iqbal (PhD student), Jianhai Su (PhD student), M.A. Javidian (Postdoc), Fatemeh Ghofrani (PhD student), Abir Hossen (PhD student), Hamed Damirchi (PhD student), Mahdi Sharifi (PhD student), Lane Stanley (Intern), Sonam Kharde (Postdoc)
Sponsors, thanks!

Slide 102

Slide 102 text

Rahul Krishna Columbia Shahriar Iqbal UofSC M. A. Javidian Purdue Baishakhi Ray Columbia Collaborators

Slide 103

Slide 103 text

No content

Slide 104

Slide 104 text

Outline 104: Motivation, Causal AI for Systems, UNICORN, Results, Causal AI for Autonomy and Robotics, Autonomy Evaluation at JPL

Slide 105

Slide 105 text

RASPBERRY SI David Garlan CMU Co-I Bradley Schmerl CMU Co-I Pooyan Jamshidi UofSC PI Javier Camara York (UK) Collaborator Ellen Czaplinski NASA JPL Consultant Katherine Dzurilla UArk Consultant Jianhai Su UofSC Graduate Student Matt DeMinico NASA Co-I Resource Adaptive Software Purpose-Built for Extraordinary Robotic Research Yields - Science Instruments Abir Hossen UofSC Graduate Student Sonam Kharde UofSC Postdoc Autonomous Robotics Research for Ocean Worlds (ARROW)

Slide 106

Slide 106 text

K. MICHAEL DALAL Team Lead USSAMA NAAL Software Engineer LANSSIE MA Software Engineer Autonomy • Quantitative Planning • Transfer & Online Learning • Causal AI JIANHAI SU UofSC, Graduate Student BRADLEY SCHMERL CMU, Co-I DAVID GARLAN CMU, Co-I JAVIER CAMARA York, Collaborator MATT DeMINICO NASA, Co-I HARI D NAYAR Team Lead ANNA E BOETTCHER Robotics System Engineer ASHISH GOEL Research Technologist ANJAN CHAKRABARTY Software Engineer CHETAN KULKARNI Prognostics Researcher THOMAS STUCKY Software Engineer TERENCE WELSH Software Engineer CHRISTOPHER LIM Robotics Software Engineer JACEK SAWONIEWICZ Robotics System Engineer ABIR HOSSEN UofSC, Graduate Student ELLEN CZAPLINSKI Arkansas, Consultant KATHERINE DZURILLA Arkansas, Consultant POOYAN JAMSHIDI UofSC, PI RASPBERRY SI Physical Testbed Virtual Testbed AISR: Autonomous Robotics Research for Ocean Worlds (ARROW) CAROLYN R. MERCER Program Manager Develop Develop and maintain Evaluate Evaluate Develop and maintain Sonam Kharde UofSC, Postdoc

Slide 107

Slide 107 text

107

Slide 108

Slide 108 text

No content

Slide 109

Slide 109 text

Autonomy Module: Evaluation 109 Design • MAPE-K loop based design • Machine learning driven quantitative planning and adaptation Evaluation • Two testbeds: different fidelities & simulation flexibilities Monitor Analyze Plan Execute Knowledge System Under Test (NASA Lander) Autonomy Physical Testbed OWLAT (NASA/JPL) Virtual Testbed OceanWATERS (NASA/ARC)

Slide 110

Slide 110 text

Learning in Simulation for Transfer Learning to Physical Testbed Sim2Real Transfer 110 Physical Testbed Simulation Environment OWLAT OWLAT-sim Causal Invariances

Slide 111

Slide 111 text

Causal AI for Autonomous Robot Testing 111
• Testing cyber-physical systems such as robots is complicated. The key reason is that there are additional interactions with the environment and the task that the robot is performing.
• Evaluating our Causal AI for Systems methodology with autonomous robots provides the following opportunities:
1. Identifying difficult-to-catch bugs in robots
2. Identifying the root cause of an observed fault and repairing the issue automatically during mission time.

Slide 112

Slide 112 text

Outline 112: Motivation, Causal AI for Systems, UNICORN, Results, Causal AI for Autonomy and Robotics, Autonomy Evaluation at JPL

Slide 113

Slide 113 text

Lessons Learned • Open Science, Open Source, Open Data, and Open Collaborations • Diverse Team, Diverse Background, Diverse Expertise • Close Collaborations with the JPL and Ames teams • Evaluation in Real Environment Project Website: https://nasa-raspberry-si.github.io/raspberry-si

Slide 114

Slide 114 text

Lessons Learned • In the simulation, we can debug/develop/test our implementation without worrying about damaging the hardware. • High bandwidth and close interaction between the testbed provider (JPL Team) and the autonomy team (RASPBERRY-SI) • Faster identification of the issues • Resolving the issues a lot faster • Getting help for development

Slide 115

Slide 115 text

Lessons Learned
• Importance of risk reduction phases
• Integration testing
• The interface and capability of the testbeds will evolve, and the autonomy needs to be designed at the same time.
• Handling the different capabilities between the simulation and the physical testbed.
• Rigorous testing remotely and in interaction with testbed providers.
• The interaction would be beneficial for autonomy providers as well as testbed providers.

Slide 116

Slide 116 text

Incremental Integration Testing 116
Components: A = Model Learning, B = Transfer Learning, C = Model Compression, D = Online Learning, E = Quantitative Planning
Component and integration test cases: Case 1 (Baseline): A + E; Case 2 (Transfer): A + E + B; Case 3 (Compress): A + E + B + C; Case 4 (Online): A + E + B + C + D
Expected Performance: Case 1 < Case 2 < Case 3 < Case 4
OWLAT Code: https://github.com/nasa/ow_simulator
Physical Autonomy Testbed: https://www1.grc.nasa.gov/wp-content/uploads/2020_ASCE_OWLAT_20191028.pdf

Slide 117

Slide 117 text

Real-World Experiments using OWLAT • Models learned from simulation • Adaptive System (Learning + Planning) • Sets of Tests 117 Adaptive System Machine learning Models Mission Environment Continual Learning: refining models Log Mission Reports Local Machine Cloud Storage

Slide 118

Slide 118 text

Test Coverage • Mission Types: landing and scientific explorations -> sampling • Mission Difficulty: • Rough regions for landing • Number of locations where a sample needs to be fetched • Unexpected events: • Changes in the environments: e.g., uneven terrain and weather • Changes to the lander capabilities: e.g., deploy new sensors • Faults (power, instruments, etc) 118

Slide 119

Slide 119 text

Infrastructure for Automated Evaluation 119 Test Generator Autonomy Module Test 1 Test Harness Mission Configuration Testbed Monitoring & Logging Communication Logging Logs Log Analysis Evaluation Report Environment & Lander Simulation Adapter Interface Learning & Planning Plan Executive

Slide 120

Slide 120 text

Thank You!