
Understanding and Explaining the Root Causes of Performance Faults with Causal AI: A Path towards Building Dependable Computer Systems

Pooyan Jamshidi
September 07, 2022


An invited talk at NASA JPL on August 19th, 2022.


Speaker: Pooyan Jamshidi

In this talk, I will present our recent progress in employing Causal AI (causal structure learning and inference, counterfactual reasoning, and transfer learning) to address several significant challenges in computer systems. After motivating the work, I will show how mainstream machine learning, which relies on spurious correlations, may become unreliable in certain situations. Next, I will present empirical observations that explain the underlying root causes of performance faults in several highly configurable systems, including autonomous systems, robotics, on-device machine learning systems, and data analytics pipelines. I will then present our framework, Unicorn, and discuss how it fills the gap by employing a causal reasoning approach. In particular, I will discuss how Unicorn captures intricate interactions between configuration options across the software-hardware stack and how such interactions impact performance variations. Finally, I will talk about our two-year journey in a NASA-funded project called RASPBERRY-SI, in which we developed a causal reasoning approach for synthesizing adaptation plans that reconfigure autonomous systems in response to environmental uncertainties during operation.

For more information regarding the technical work and the people behind the work that I present, please refer to the following websites:
- The Unicorn framework: https://github.com/softsys4ai/unicorn
- The NASA-funded RASPBERRY-SI project: https://nasa-raspberry-si.github.io/raspberry-si/

Bio: Pooyan Jamshidi is an Assistant Professor of Computer Science and Engineering at UofSC. His research involves designing novel artificial intelligence (AI) and machine learning (ML) algorithms and investigating their theoretical guarantees. He is also interested in applying AI/ML algorithms in high-impact applications, including robotics, computer systems, and space exploration. Pooyan has extensive collaborations with Google and NASA and is always open to exploring new collaborations. He received the UofSC Breakthrough Award in 2022. Before his current position, he was a postdoctoral associate at CMU (Pittsburgh, US) and Imperial College (London, UK). He received a Ph.D. (Computer Science) from DCU (Dublin, Ireland) in 2014, and an M.S. (Systems Engineering) and a B.S. (Math & Computer Science) from AUT (Iran) in 2003 and 2006, respectively. For more info about Pooyan's research and his group at UofSC, please refer to: http://pooyanjamshidi.github.io/.


Transcript

  1. Understanding and Explaining the Root
    Causes of Performance Faults with Causal AI
    A Path towards Building Dependable Computer Systems
    Pooyan Jamshidi


  2. SEAMS’23


  3. Melbourne, Australia, 15-16 May 2023

  4. 4
    Topics of Interest


  5. 5
    Keynote Speaker at SEAMS’20 from NASA


  6. Outline
     Motivation | Causal AI for Systems | UNICORN | Results | Causal AI for Autonomy and Robotics | Autonomy Evaluation at JPL

  7. Goal: Enable developers/users to find the right quality tradeoff

  8. Today's most popular systems are built configurable

  9. 9


  10. Empirical observations confirm that systems are becoming increasingly configurable
      [Plots: number of configuration parameters vs. release time for Apache and Hadoop (MapReduce, HDFS)]
      [Tianyin Xu, et al., "Too Many Knobs…", FSE'15]

  11. Empirical observations confirm that systems are becoming increasingly configurable
      [Snapshot of the FSE'15 paper (UC San Diego, Huazhong Univ. of Science & Technology, NetApp, Inc.), with plots of the number of configuration parameters vs. release time for Storage-A, MySQL, Apache, and Hadoop (MapReduce, HDFS)]
      [Tianyin Xu, et al., "Too Many Knobs…", FSE'15]

  12. Configurations determine the performance behavior
      void Parrot_setenv(. . . name,. . . value){
      #ifdef PARROT_HAS_SETENV
          my_setenv(name, value, 1);
      #else
          int name_len=strlen(name);
          int val_len=strlen(value);
          char* envs=glob_env;
          if(envs==NULL){
              return;
          }
          strcpy(envs,name);
          strcpy(envs+name_len,"=");
          strcpy(envs+name_len + 1,value);
          putenv(envs);
      #endif
      }

      #ifdef LINUX
      extern int Parrot_signbit(double x){
      ...
      #endif

      Highlighted on the slide: the #else / #endif branches and the compile-time options PARROT_HAS_SETENV and LINUX, which influence Speed and Energy.

  13. Outline
      Motivation | Causal AI for Systems | Results | Case Study | Causal AI for Autonomy and Robotics | Autonomy Evaluation at JPL

  14. Case Study 1
    SocialSensor


  15. SocialSensor
      [Pipeline diagram with components: Crawling, Orchestrator, Content Analysis, and Search and Integration; data flows: tweets arrive from the Internet at 5k-20k/min, ~100k tweets are pushed every 10 min, and 10M tweets are fetched; crawled items are stored and fetched between components]

  16. Challenges
      [Same SocialSensor pipeline diagram, annotated with the challenges: 100X, 10X, and real-time requirements]

  17. How can we gain better performance without using more resources?

  18. Let's try out different system configurations!

  19. Opportunity: Data processing engines in the pipeline were all configurable
      > 100   > 100   > 100
      2300

  20. 20
    More combinations than estimated
    atoms in the universe


  21. The default configuration is typically bad and the optimal configuration is noticeably better than median
      [Scatter plot: throughput (ops/sec, 0-1500) vs. average write latency (µs, 0-5000), marking the default configuration and the optimal configuration; arrows indicate the "better" directions]
      • Default is bad
      • 2X-10X faster than worst
      • Noticeably faster than median

  22. Performance behavior varies in different environments
    22


  23. Case Study 2
    Robotics


  24. CoBot experiment: DARPA BRASS
      [Plot: localization error [m] (0-8) vs. CPU utilization [%] (10-40), showing the energy constraint, the safety constraint, the Pareto front, and the sweet spot; configuration options: no_of_particles=x, no_of_refinement=y]

  25. CoBot experiment
      [Four heat maps of CPU usage [%]: source (given), target (ground truth after 6 months), prediction with 4 samples, and prediction with transfer learning]

  26. Transfer Learning for Improving Model Predictions in Highly Configurable Software
      Pooyan Jamshidi, Miguel Velez, Christian Kästner (Carnegie Mellon University, USA), Norbert Siegmund (Bauhaus-University Weimar, Germany), Prasad Kawthekar (Stanford University, USA)
      Abstract: Modern software systems are built to be used in dynamic environments using configuration capabilities to adapt to changes and external uncertainties. In a self-adaptation context, we are often interested in reasoning about the performance of the systems under different configurations. Usually, we learn a black-box model based on real measurements to predict the performance of the system given a specific configuration. However, as modern systems become more complex, there are many configuration parameters that may interact and we end up learning an exponentially large configuration space. Naturally, this does not scale when relying on real measurements in the actual changing environment. We propose a different solution: Instead of taking the measurements from the real system, we learn the model using samples from other sources, such as simulators that approximate performance of the real system [...] in order to identify the best performing configuration for a robot.
      Fig. 1: Transfer learning for performance model learning (simulator as source, robot as target).
      Details: [SEAMS '17]

  27. Looking further: When transfer learning goes wrong
      [Box plots: absolute percentage error [%] (10-60) for sources s, s1-s6, compared to non-transfer-learning]

      Source        s      s1     s2     s3     s4     s5     s6
      noise-level   0      5      10     15     20     25     30
      corr. coeff.  0.98   0.95   0.89   0.75   0.54   0.34   0.19
      µ(pe)         15.34  14.14  17.09  18.71  33.06  40.93  46.75

      It worked!  It didn't!
      Insight: Predictions become more accurate when the source is more related to the target.

  28. [Six heat maps (a)-(f) of CPU usage [%] over number of particles (5-25) × number of refinements (5-25); transfer learning worked for three of the environments ("It worked!") and failed for the other three ("It didn't!")]

  29. Key question: Can we develop a theory to explain when transfer learning works?
      [Diagram: Source (Given) with its Data and Model, Target (Learn) with its Data and Model; Transferable Knowledge is extracted from the source and reused to learn the target]
      Q1: How are source and target "related"?
      Q2: What characteristics are preserved?
      Q3: What are the actionable insights?

      Excerpt shown on the slide (Section II, Intuition): Understanding the performance behavior of configurable software systems can enable (i) performance debugging, (ii) performance tuning, (iii) design-time evolution, or (iv) runtime adaptation [11]. We lack empirical understanding of how the performance behavior of a system will vary when the environment of the system changes. Such empirical understanding will provide important insights to develop faster and more accurate learning techniques that allow us to make predictions and extrapolations of performance for highly configurable systems in changing environments [10]. For instance, we can learn the performance behavior of a system on cheap hardware in a controlled lab environment and use that to understand the performance behavior of the system on a production server before shipping to the end user. More specifically, we would like to know what the relationship is between the performance of a system in a specific environment (characterized by software configuration, hardware, workload, and system version) and its performance when we vary its environmental conditions. In this research, we aim for an empirical understanding of performance behavior to improve learning via an informed sampling process; in other words, we aim at learning a performance model in a changed environment based on a well-suited sampling set that has been determined by the knowledge we gained in other environments.

      A. Preliminary concepts
      1) Configuration and environment space: Let Fi indicate the i-th feature of a configurable system A, which is either enabled or disabled, and one of the two holds by default. The configuration space is mathematically a Cartesian product of all the features, C = Dom(F1) × ... × Dom(Fd), where Dom(Fi) = {0, 1}. A configuration of a system is a member of the configuration space (feature space) where all the parameters are assigned a specific value in their range (i.e., complete instantiations of the system's parameters). We also describe an environment instance by 3 variables e = [w, h, v] drawn from a given environment space E = W × H × V, where they respectively represent sets of possible values for workload, hardware, and system version.
      2) Performance model: Given a software system A with configuration space F and environmental instances E, a performance model is a black-box function f: F × E → R given some observations of the system performance for each combination of the system's features x ∈ F in an environment e ∈ E. To construct a performance model for a system A with configuration space F, we run A in environment instance e ∈ E on various combinations of configurations xi ∈ F and record the resulting performance values yi = f(xi) + εi, xi ∈ F, where εi ~ N(0, σi). The training data for our regression models is then simply Dtr = {(xi, yi)}, i = 1..n. In other words, a response function is simply a mapping from the input space to a measurable performance metric that produces interval-scaled data (here we assume it produces real numbers).
      3) Performance distribution: For the performance model, we measured and associated the performance response to each configuration; now we introduce another concept where we vary the environment and measure the performance. An empirical performance distribution is a stochastic process, pd: E → Δ(R), that defines a probability distribution over performance measures for each environmental condition. To construct a performance distribution for a system A with configuration space F, similarly to the process of deriving the performance models, we run A on various combinations of configurations xi ∈ F for a specific environment instance.
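
      To make the notation above concrete, here is a minimal Python sketch (not from the talk) of how one might represent a binary configuration space and collect the training data Dtr = {(xi, yi)}; the measure() function and the option names are hypothetical placeholders for a real benchmark harness.

      import itertools
      import random

      # Hypothetical binary options F1..Fd of a configurable system A.
      OPTIONS = ["encryption", "compression", "caching", "batching"]

      def configuration_space(options):
          """Enumerate C = Dom(F1) x ... x Dom(Fd) with Dom(Fi) = {0, 1}."""
          for values in itertools.product([0, 1], repeat=len(options)):
              yield dict(zip(options, values))

      def measure(config, environment):
          """Placeholder for running system A in environment e and measuring
          performance; returns y = f(x) + noise. Replace with a real benchmark."""
          base = 10.0 + 5.0 * config["compression"] - 3.0 * config["caching"]
          return base + random.gauss(0.0, 0.5)

      # Environment instance e = [workload, hardware, version].
      environment = {"workload": "w1", "hardware": "tx2", "version": "v1.0"}

      # Training data D_tr = {(x_i, y_i)} from a random sample of configurations.
      configs = random.sample(list(configuration_space(OPTIONS)), k=8)
      D_tr = [(c, measure(c, environment)) for c in configs]
      for x, y in D_tr:
          print(x, round(y, 2))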


  30. Transfer Learning for Performance Modeling of Configurable Systems: An Exploratory Analysis
      Pooyan Jamshidi (Carnegie Mellon University, USA), Norbert Siegmund (Bauhaus-University Weimar, Germany), Miguel Velez, Christian Kästner, Akshay Patel, Yuvraj Agarwal (Carnegie Mellon University, USA)
      Abstract: Modern software systems provide many configuration options which significantly influence their non-functional properties. To understand and predict the effect of configuration options, several sampling and learning strategies have been proposed, albeit often with significant cost to cover the highly dimensional configuration space. Recently, transfer learning has been applied to reduce the effort of constructing performance models by transferring knowledge about performance behavior across environments. While this line of research is promising to learn more accurate models at a lower cost, it is unclear why and when transfer learning works for performance modeling. To shed light on when it is beneficial to apply transfer learning, we conducted an empirical study on four popular software systems, varying software configurations and environmental conditions, such as hardware, workload, and software versions, to identify the key knowledge pieces that can be exploited for transfer learning. Our results show that in small environmental changes (e.g., homogeneous workload change), by applying a linear transformation to the performance model, we can understand the performance behavior of the target environment, while for severe environmental changes (e.g., drastic workload change) we can transfer only knowledge that makes sampling more efficient, e.g., by reducing the dimensionality of the configuration space.
      Index Terms: Performance analysis, transfer learning.
      Fig. 1: Transfer learning is a form of machine learning that takes advantage of transferable knowledge from source to learn an accurate, reliable, and less costly model for the target environment.
      Details: [ASE '17]

  31. Details: [AAAI Spring Symposium ’19]


  32. Outline
      Motivation | Causal AI for Systems | UNICORN | Results | Future Directions

  33. Causal AI in Systems and Software
    33
    Computer Architecture
    Database
    Operating Systems
    Programming Languages
    BigData Software Engineering
    https://github.com/y-ding/causal-system-papers


  34. Misconfiguration and its Effects
      ● Misconfigurations can elicit unexpected interactions between software and hardware
      ● These can result in non-functional faults
      ○ Affecting non-functional system properties like latency, throughput, energy consumption, etc.
      The system does not crash or exhibit an obvious misbehavior; it remains operational but with degraded performance, e.g., high latency, low throughput, high energy consumption, high heat dissipation, or a combination of several.

  35. Motivating Example: CUDA performance issue on TX2
      "When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2's CUDA API runs much slower than TX1 in many cases."
      ● The user is transferring the code from one hardware to another.
      ● The target hardware is faster than the source hardware; the user expects the code to run at least 30-40% faster.
      ● The code ran 2x slower on the more powerful hardware.

  36. Motivating Example
      [Forum thread screenshot, June 3rd-5th]
      "Any suggestions on how to improve my performance? Thanks!"
      "Please do the following and let us know if it works:
        1. Install JetPack 3.0
        2. Set nvpmodel=MAX-N
        3. Run jetson_clock.sh"
      "We have already tried this. We still have high latency. Any other suggestions?"
      "TX2 is pascal architecture. Please update your CMakeLists:
        + set(CUDA_STATIC_RUNTIME OFF)
        ...
        + -gencode=arch=compute_62,code=sm_62"
      The user had several misconfigurations.
      In Software: ✖ Wrong compilation flags, ✖ Wrong SDK version
      In Hardware: ✖ Wrong power mode, ✖ Wrong clock/fan settings
      The discussions took 2 days. How to resolve such issues faster?

  37. 37
    How to resolve these
    issues faster?


  38. Performance Influence Models
      [3-D surface plot: latency (ms) over number of counters × number of splitters (cubic interpolation over a finer grid)]
      Observational Data → Black-box models → Regression Equation

      Config   Bitrate (bits/s)   Enable Padding   …   Cache Misses   …   Throughput (fps)
      c1       1k                 1                …   42m            …   7
      c2       2k                 1                …   32m            …   22
      …        …                  …                …   …              …   …
      cn       5k                 0                …   12m            …   25

      Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
      (options and discovered interactions between options appear as the predictors of the regression equation)
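
      As a point of reference, here is a minimal sketch (not from the talk) of how a performance-influence model of this form can be fit from observational data with an off-the-shelf regressor (scikit-learn); the column names and numbers are hypothetical.

      import numpy as np
      from sklearn.linear_model import LinearRegression
      from sklearn.preprocessing import PolynomialFeatures

      # Hypothetical observational data: rows are measured configurations.
      # Columns: Bitrate (kbits/s), BatchSize; target: Throughput (fps).
      X = np.array([[1, 2], [2, 2], [2, 4], [5, 8], [5, 2], [3, 6]], dtype=float)
      y = np.array([7.0, 22.0, 30.0, 95.0, 40.0, 55.0])

      # Expand features with pairwise interaction terms (Bitrate x BatchSize),
      # mirroring the "discovered interactions" of a performance-influence model.
      poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
      X_poly = poly.fit_transform(X)

      model = LinearRegression().fit(X_poly, y)
      terms = poly.get_feature_names_out(["Bitrate", "BatchSize"])
      for name, coef in zip(terms, model.coef_):
          print(f"{coef:+.2f} * {name}")
      print("intercept:", round(model.intercept_, 2))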


  39. These methods rely on statistical correlations to extract meaningful information required for performance tasks.
      [Same Performance Influence Models figure as the previous slide: observational data → black-box models → regression equation]

  40. Performance Influence Models suffer from several shortcomings
      • Performance influence models could produce incorrect explanations.
      • Performance influence models could produce unreliable predictions.
      • Performance influence models could produce unstable predictions across environments and in the presence of measurement noise.

  41. Performance Influence Models Issue: Incorrect Explanation
      [Scatter plot: Throughput (FPS, 0-20) vs. Cache Misses (100k-200k), with an upward trend]
      Increasing Cache Misses increases Throughput.

  42. Performance Influence Models Issue: Incorrect Explanation
      [Same scatter plot: Throughput (FPS) vs. Cache Misses]
      Increasing Cache Misses increases Throughput.
      This is counter-intuitive: more Cache Misses should reduce Throughput, not increase it.
      Any ML/statistical models built on this data will be incorrect.

  43. Performance Influence Models Issue: Incorrect Explanation
      [Scatter plot of Throughput (FPS) vs. Cache Misses, segregated by Cache Policy: LRU, FIFO, LIFO, MRU]
      Segregating the data on Cache Policy indicates that, within each group, an increase in Cache Misses results in a decrease in Throughput.
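
      This is the classic Simpson's-paradox pattern. The toy sketch below (not from the talk; the numbers are made up) shows how a pooled regression can report a positive slope while the per-policy slopes are all negative.

      import numpy as np

      rng = np.random.default_rng(0)

      # Hypothetical data: four cache policies with different baseline cache-miss
      # levels; within each policy, more misses -> lower throughput.
      policies = {"LRU": (100_000, 6.0), "FIFO": (130_000, 10.0),
                  "LIFO": (160_000, 14.0), "MRU": (190_000, 18.0)}
      misses, fps, labels = [], [], []
      for name, (mu_miss, mu_fps) in policies.items():
          m = rng.normal(mu_miss, 8_000, 50)
          f = mu_fps - 0.0002 * (m - mu_miss) + rng.normal(0, 0.5, 50)
          misses.append(m)
          fps.append(f)
          labels += [name] * 50
      misses, fps = np.concatenate(misses), np.concatenate(fps)

      pooled_slope = np.polyfit(misses, fps, 1)[0]
      print(f"pooled slope: {pooled_slope:+.2e}  (misleadingly positive)")
      for name in policies:
          mask = np.array(labels) == name
          slope = np.polyfit(misses[mask], fps[mask], 1)[0]
          print(f"{name:>4} slope: {slope:+.2e}  (negative within the group)")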


  44. Performance Influence Models Issue: Unstable Predictors
      Performance influence models change significantly in new environments, resulting in less accuracy.
      Performance influence model in TX2:
      Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
      Performance influence model in Xavier:
      Throughput = 2 × Bitrate + 1.9 × BatchSize + 1.8 × BufferSize + 0.5 × EnablePadding + 5.9 × Bitrate × BufferSize + 6.2 × Bitrate × EnablePadding + 4.1 × Bitrate × BufferSize × EnablePadding

  45. Performance Influence Models Issue: Unstable Predictors
      Performance influence models cannot be reliably used across environments.
      Performance influence model in TX2:
      Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
      Performance influence model in Xavier:
      Throughput = 2 × Bitrate + 1.9 × BatchSize + 1.8 × BufferSize + 0.5 × EnablePadding + 5.9 × Bitrate × BufferSize + 6.2 × Bitrate × EnablePadding + 4.1 × Bitrate × BufferSize × EnablePadding

  46. Performance Influence Models Issue: Non-generalizability
      Performance influence models do not generalize well across deployment environments.
      Performance influence model in TX2:
      Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
      Performance influence model in Xavier:
      Throughput = 2 × Bitrate + 1.9 × BatchSize + 1.8 × BufferSize + 0.5 × EnablePadding + 5.9 × Bitrate × BufferSize + 6.2 × Bitrate × EnablePadding + 4.1 × Bitrate × BufferSize × EnablePadding

  47. Causal Performance Model
      Expresses the relationships between interacting variables (configuration options, system events, and non-functional properties) as a causal graph.
      [Scatter plot: Throughput (FPS) vs. Cache Misses]
      Cache Policy → Cache Misses → Throughput   (the arrows give the direction of causality)
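
      A causal performance model can be represented as a directed acyclic graph over options, events, and objectives. Here is a minimal sketch (not the UNICORN implementation) using networkx, with the small Cache Policy example from the slide:

      import networkx as nx

      # Causal performance model: nodes are configuration options, system events,
      # and non-functional properties; edges point in the direction of causality.
      G = nx.DiGraph()
      G.add_edge("CachePolicy", "CacheMisses")   # option -> system event
      G.add_edge("CacheMisses", "Throughput")    # system event -> performance objective

      assert nx.is_directed_acyclic_graph(G)
      print("parents of Throughput:", list(G.predecessors("Throughput")))
      print("causal path:", nx.shortest_path(G, "CachePolicy", "Throughput"))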


  48. Why Causal Inference? - Produces Correct Explanations
      [Scatter plots of Throughput (FPS) vs. Cache Misses: pooled, and segregated by Cache Policy (LRU, FIFO, LIFO, MRU)]
      Cache Policy affects Throughput via Cache Misses: Cache Policy → Cache Misses → Throughput.
      Causal performance models recover the correct interactions.

  49. Why Causal Inference? - Minimal Structure Change
      Causal models remain relatively stable across environments.
      [Two partial causal performance models, one for Jetson TX2 and one for Jetson Xavier, over the same variables: Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, Cycles, FPS, Energy]

  50. Why Causal Inference? - Accurate across Environments
      [Bar charts comparing performance-influence (regression) models and the causal performance model when transferred from a source to a target environment: common terms (Source → Target), total terms (Source), total terms (Target), and MAPE (%) for Error (Source), Error (Target), and Error (Source → Target)]
      For the causal performance model, the common predictors are large in number; for the regression models, the common predictors are lower in number.

  51. Why Causal Inference? - Accurate across Environments
      [Same comparison as the previous slide, additionally annotated: the causal performance model has low error when reused in the target environment, while the regression models have high error when reused]
      Causal models can be reliably reused when environmental changes occur.

  52. Why Causal Inference? - Generalizability
      [Same comparison of regression models and the causal performance model across environments]
      Causal models are more generalizable than performance influence models.

  53. How to use Causal Performance Models?
      Cache Policy → Cache Misses → Throughput
      How to generate a causal graph?

  54. How to use Causal Performance Models?
      Cache Policy → Cache Misses → Throughput
      How to generate a causal graph?
      How to use the causal graph for performance tasks?

  55. Outline
      Motivation | Causal AI for Systems | UNICORN | Results | Future Directions

  56. UNICORN: Our Causal AI for Systems Method
      • Build a causal performance model that captures the interactions among options in the variability space using observational performance data.
      • Iteratively evaluate and update the causal performance model.
      • Perform downstream performance tasks such as performance debugging & optimization using causal reasoning.

  57. UNICORN: Our Causal AI for Systems Method
      System under study: Software: DeepStream; Middleware: TF, TensorRT; Hardware: Nvidia Xavier; Configuration: Default
      1- Specify Performance Query, e.g.  QoS: Th > 40/s;  Observed: Th < 30/s ± 5/s
      2- Learn Causal Performance Model from the performance data
      3- Translate Performance Query to Causal Queries, for performance debugging ("What is the root cause of the observed performance fault?", "How do I fix the misconfiguration?") or performance optimization ("How can I improve throughput without sacrificing accuracy?", "How do I understand performance behavior?")
      4- Estimate Causal Queries with the query engine, e.g. estimate the probability of satisfying the QoS if BufferSize is set to 6k:  P(Th > 40/s | do(BufferSize = 6k))
      5- Update Causal Performance Model: while the budget is not exhausted, measure the performance of the configuration(s) that maximize information gain and update the model.
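
      Step 4 evaluates interventional queries such as P(Th > 40/s | do(BufferSize = 6k)). Here is a minimal sketch (not the UNICORN query engine) of estimating such a query from observational data via backdoor adjustment, assuming a hypothetical confounder BatchSize that affects both BufferSize and throughput; all data are synthetic.

      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(1)

      # Hypothetical observational data: BatchSize confounds BufferSize and Th.
      n = 5000
      batch = rng.choice([2, 8], size=n)
      p_high = np.where(batch == 8, 0.7, 0.2)          # batch 8 tends to use BufferSize = 6k
      buffer_k = np.where(rng.random(n) < p_high, 6, 2)
      th = 35 + 6 * (buffer_k == 6) + 4 * (batch == 8) + rng.normal(0, 2, n)
      df = pd.DataFrame({"BatchSize": batch, "BufferSize": buffer_k, "Th": th})

      # Backdoor adjustment:
      #   P(Th > 40 | do(BufferSize = 6))
      #     = sum_z P(Th > 40 | BufferSize = 6, BatchSize = z) * P(BatchSize = z)
      def p_do(df, threshold=40.0, buffer_value=6):
          total = 0.0
          for z, p_z in df["BatchSize"].value_counts(normalize=True).items():
              stratum = df[(df.BufferSize == buffer_value) & (df.BatchSize == z)]
              if len(stratum) > 0:
                  total += (stratum.Th > threshold).mean() * p_z
          return total

      naive = (df[df.BufferSize == 6].Th > 40.0).mean()   # conditioning, not intervening
      print("P(Th > 40 | BufferSize = 6)     =", round(naive, 3))
      print("P(Th > 40 | do(BufferSize = 6)) =", round(p_do(df), 3))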


  58. UNICORN: Our Causal AI for Systems Method
      [Same pipeline figure as slide 57]

  59. UNICORN: Our Causal AI for Systems Method
      [Same pipeline figure as slide 57]

  60. Learning Causal Performance Model
      [Running example variables: Bitrate, Buffer Size, Batch Size, Enable Padding (options); Branch Misses, Cache Misses, No of Cycles (events); FPS, Energy (objectives); observational data table with configurations c1..cn]
      1- Recovering the skeleton: start from a fully connected graph, subject to constraints (e.g., no connections between configuration options).
      2- Pruning the causal structure: statistical independence tests.
      3- Orienting causal relations: orientation rules & measures (entropy) + structural constraints (colliders, v-structures).
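
      A minimal sketch of the skeleton-recovery and pruning idea (steps 1-2), not UNICORN's actual implementation: start fully connected, then drop an edge whenever its two endpoints look independent given a small conditioning set, using partial correlation as the independence test; the data and variable names are hypothetical.

      import itertools
      import numpy as np
      import pandas as pd

      def partial_corr(df, x, y, z):
          """Correlation of x and y after regressing out the variables in z."""
          if not z:
              return df[x].corr(df[y])
          Z = np.column_stack([df[c].values for c in z] + [np.ones(len(df))])
          rx = df[x].values - Z @ np.linalg.lstsq(Z, df[x].values, rcond=None)[0]
          ry = df[y].values - Z @ np.linalg.lstsq(Z, df[y].values, rcond=None)[0]
          return np.corrcoef(rx, ry)[0, 1]

      def recover_skeleton(df, options, threshold=0.1, max_cond=1):
          """Steps 1-2: start fully connected, forbid option-option edges,
          then prune edges whose endpoints look conditionally independent."""
          cols = list(df.columns)
          edges = {frozenset(e) for e in itertools.combinations(cols, 2)
                   if not (e[0] in options and e[1] in options)}   # structural constraint
          for x, y in [tuple(e) for e in edges.copy()]:
              others = [c for c in cols if c not in (x, y)]
              for k in range(max_cond + 1):
                  if any(abs(partial_corr(df, x, y, list(z))) < threshold
                         for z in itertools.combinations(others, k)):
                      edges.discard(frozenset((x, y)))
                      break
          return edges

      # Hypothetical data following BufferSize -> CacheMisses -> FPS.
      rng = np.random.default_rng(2)
      buf = rng.choice([2.0, 4.0, 6.0], 500)
      miss = 50 - 5 * buf + rng.normal(0, 1, 500)
      fps = 40 - 0.5 * miss + rng.normal(0, 1, 500)
      data = pd.DataFrame({"BufferSize": buf, "CacheMisses": miss, "FPS": fps})

      print(recover_skeleton(data, options={"BufferSize"}))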


  61. Performance measurement
      Configuration space: C = O1 × O2 × ⋯ × O19 × O20, with options such as dead code removal, constant folding, loop unrolling, and function inlining; a configuration is one point in this space, e.g. c1 = 0 × 0 × ⋯ × 0 × 1, c1 ∈ C.
      Pipeline: Program → Compiler (e.g., SaC, LLVM) → Compiled Code → Instrumented Binary → Hardware (compile, deploy, configure).
      Each non-functional, measurable/quantifiable aspect is a function of the configuration, e.g.
      fc(c1) = 11.1 ms (compile time), fe(c1) = 110.3 ms (execution time), fen(c1) = 100 mWh (energy).

  62. Our setup for performance measurements
    62


  63. Learning Causal Performance Model
      [Same figure as slide 60: recovering the skeleton, pruning the causal structure, and orienting causal relations]

  64. Learning Causal Performance Model
      [Same figure as slide 60: recovering the skeleton, pruning the causal structure, and orienting causal relations]

  65. Learning Causal Performance Model
      [Same figure as slide 60: recovering the skeleton, pruning the causal structure, and orienting causal relations]

  66. Causal Performance Model
      [Causal graph over software options (Bitrate, Buffer Size, Batch Size, Enable Padding), performance events (Branch Misses, Cache Misses, No of Cycles), and performance objectives (Throughput, Energy), with functional nodes f on the edges; causal interactions and causal paths are highlighted; components: Decoder, Muxer]
      Example functional node:
      BranchMisses = 2 × Bitrate + 8.1 × BufferSize + 4.1 × Bitrate × BufferSize × CacheMisses

  67. UNICORN: Our Causal AI for Systems Method
      [Same pipeline figure as slide 57]

  68. Causal Debugging
      • What is the root cause of my fault?
      • How do I fix my misconfigurations to improve performance?
      Workflow: observational data (about 25 sample configurations as training data) → build causal graph → extract causal paths → rank paths → counterfactual queries ("what if" questions, e.g., what if the configuration option X was set to a value 'x'?) → best query. If the misconfiguration fault is fixed, stop; otherwise update the observational data and repeat.

  69. Extracting Causal Paths from the Causal Model
    Problem
    ✕ In real world cases, this causal graph can be
    very complex
    ✕ It may be intractable to reason over the entire
    graph directly
    69
    Solution
    ✓ Extract paths from the causal graph
    ✓ Rank them based on their Average Causal
    Effect on latency, etc.
    ✓ Reason over the top K paths


  70. Extracting Causal Paths from the Causal Model
      [Causal graph over Load, GPU Mem., Swap Mem., and Latency, with the extracted paths shown alongside]
      Extract paths: a causal path always begins with a configuration option or a system event and always terminates at a performance objective.
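
      A minimal sketch (not from the talk) of extracting such paths with networkx: enumerate simple paths that start at a configuration option or system event and end at a performance objective. The small graph here is a hypothetical stand-in.

      import itertools
      import networkx as nx

      # Hypothetical causal graph: option -> events -> objective.
      G = nx.DiGraph([("Load", "GPU Mem."), ("GPU Mem.", "Swap Mem."),
                      ("Swap Mem.", "Latency"), ("GPU Mem.", "Latency")])

      options_and_events = ["Load", "GPU Mem.", "Swap Mem."]
      objectives = ["Latency"]

      # A causal path starts at an option or event and terminates at an objective.
      causal_paths = [p
                      for src, dst in itertools.product(options_and_events, objectives)
                      for p in nx.all_simple_paths(G, src, dst)]
      for path in causal_paths:
          print(" -> ".join(path))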


  71. Ranking Causal Paths from the Causal Model
      ● There may be too many causal paths
      ● We need to select the most useful ones
      ● Compute the Average Causal Effect (ACE) of each pair of neighbors in a path, e.g. for GPU Mem. and Swap Mem. on the path GPU Mem. → Swap Mem. → Latency:

      ACE(GPU Mem., Swap) = (1/N) Σ_{a,b ∈ Z} [ E(GPU Mem. | do(Swap = b)) − E(GPU Mem. | do(Swap = a)) ]

      E(GPU Mem. | do(Swap = b)) is the expected value of GPU Mem. when we artificially intervene by setting Swap to the value b (and likewise for a); the sum averages over all permitted values of Swap memory. If this difference is large, then small changes to Swap Mem. will cause large changes to GPU Mem.
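
      As an illustration, here is a small sketch (not UNICORN's code) that estimates the ACE of one variable on its child in a toy linear structural model by simulating the do() interventions directly; the variables and coefficients are hypothetical.

      import numpy as np

      rng = np.random.default_rng(3)

      def simulate(n, do_swap=None):
          """Toy structural model: Load -> GPU Mem. -> Swap -> Latency.
          do_swap fixes Swap to a value, severing its incoming edge."""
          load = rng.uniform(0, 1, n)
          gpu_mem = 2.0 * load + rng.normal(0, 0.1, n)
          swap = (np.full(n, do_swap) if do_swap is not None
                  else 1.5 * gpu_mem + rng.normal(0, 0.1, n))
          latency = 10.0 + 4.0 * swap + rng.normal(0, 0.5, n)
          return latency

      def ace(values, n=10_000):
          """Average causal effect of Swap on Latency over permitted Swap values:
          mean over pairs (a, b) of E[Latency | do(Swap=b)] - E[Latency | do(Swap=a)]."""
          expectations = {v: simulate(n, do_swap=v).mean() for v in values}
          pairs = [(a, b) for a in values for b in values if b > a]
          return np.mean([expectations[b] - expectations[a] for a, b in pairs])

      print("ACE(Latency, Swap) ~", round(ace([1.0, 2.0, 4.0]), 2))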


  72. Ranking Causal Paths from the Causal Model
      ● Average the ACE of all pairs of adjacent nodes in the path
      ● Rank paths from the highest path ACE (PACE) score to the lowest
      ● Use the top K paths for subsequent analysis
      For a path Z → X → Y (e.g., GPU Mem. → Swap Mem. → Latency):

      PACE(Z, Y) = 1/2 ( ACE(Z, X) + ACE(X, Y) )

      i.e., sum the ACE over all pairs of adjacent nodes in the causal path and average.
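
      Continuing the sketch above, ranking paths by PACE reduces to averaging the pairwise ACE values along each path; ace_of() below is a placeholder for an estimator such as the one in the previous example.

      def pace(path, ace_of):
          """Path ACE: average of ACE over adjacent pairs in the path."""
          pairs = list(zip(path, path[1:]))
          return sum(ace_of(x, y) for x, y in pairs) / len(pairs)

      def rank_paths(paths, ace_of, top_k=3):
          """Rank causal paths from the highest PACE score to the lowest."""
          return sorted(paths, key=lambda p: pace(p, ace_of), reverse=True)[:top_k]

      # Usage with a hypothetical ACE lookup table:
      ace_table = {("GPU Mem.", "Swap Mem."): 3.2, ("Swap Mem.", "Latency"): 4.0,
                   ("Load", "GPU Mem."): 0.7, ("GPU Mem.", "Latency"): 1.1}
      paths = [["GPU Mem.", "Swap Mem.", "Latency"], ["Load", "GPU Mem.", "Latency"]]
      print(rank_paths(paths, lambda x, y: ace_table[(x, y)]))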


  73. Diagnosing and Fixing the Faults
      • What is the root cause of my fault?
      • How do I fix my misconfigurations to improve performance?
      [Same workflow as slide 68: observational data (about 25 sample configurations as training data) → build causal graph → extract causal paths → rank paths → counterfactual ("what if") queries, e.g. what if the configuration option X was set to a value 'x'? → best query; if the fault is fixed, stop, otherwise update the observational data and repeat]

  74. Diagnosing and Fixing the Faults
      ● Counterfactual inference asks "what if" questions about changes to the misconfigurations.
      We are interested in the scenario where:
      • We hypothetically have low latency;
      conditioned on the following events:
      • We hypothetically set the new Swap memory to 4 Gb
      • Swap memory was initially set to 2 Gb
      • We observed high latency when Swap was set to 2 Gb
      • Everything else remains the same
      Example: Given that my current swap memory is 2 Gb and I have high latency, what is the probability of having low latency if swap memory were increased to 4 Gb?

  75. Diagnosing and Fixing the Faults
      [Original path over Load, GPU Mem., Swap, and Latency, and the path after the proposed change with Swap = 4 Gb and Latency = low?]
      To reflect the hypothetical scenario, modify the model: remove the incoming edges of the changed node (assume no external influence) and set Swap = 4 Gb.
      Use both models to compute the answer to the counterfactual question.

  76. Diagnosing and Fixing the Faults
      [Original path, and the path after the proposed change with Swap = 4 Gb, as on the previous slide]

      Potential = P( Latency^ = low | Swap^ = 4 Gb, Swap = 2 Gb, Latency_{Swap = 2 Gb} = high, U )

      i.e., we expect a low latency, given that the Swap is now hypothetically 4 Gb, the Swap was initially 2 Gb, the latency was high when Swap was 2 Gb, and everything else (U) stays the same.

  77. Diagnosing and Fixing the Faults

      Potential = P( outcome^ = good | change^, outcome_{¬change} = bad, ¬change, U )
        Probability that the outcome is good after a change, conditioned on the past.

      Control = P( outcome^ = bad | ¬change, U )
        Probability that the outcome was bad before the change.

      Individual Treatment Effect = Potential − Control
      If this difference is large, then our change is useful.
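
      A minimal sketch (not from the talk) of the counterfactual computation on a toy linear structural model, following the usual abduction-action-prediction recipe: infer the noise terms from the observed (faulty) configuration, intervene on Swap, and re-predict latency; the model, coefficients, and numbers are hypothetical.

      # Toy structural causal model (hypothetical coefficients):
      #   GPU Mem. = 8 - 1.5 * Swap + u_gpu
      #   Latency  = 30 + 6 * GPU Mem. + u_lat
      def predict(swap, u_gpu, u_lat):
          gpu_mem = 8.0 - 1.5 * swap + u_gpu
          latency = 30.0 + 6.0 * gpu_mem + u_lat
          return gpu_mem, latency

      # Observed faulty configuration: Swap = 2 Gb, GPU Mem. = 5.5, Latency = 66.
      swap_obs, gpu_obs, lat_obs = 2.0, 5.5, 66.0

      # 1) Abduction: recover the noise terms consistent with the observation.
      u_gpu = gpu_obs - (8.0 - 1.5 * swap_obs)
      u_lat = lat_obs - (30.0 + 6.0 * gpu_obs)

      # 2) Action: intervene do(Swap = 4 Gb).  3) Prediction: recompute latency.
      _, lat_cf = predict(4.0, u_gpu, u_lat)
      print("observed latency:", lat_obs,
            "-> counterfactual latency under do(Swap = 4 Gb):", lat_cf)
      # A debugging procedure in this spirit would then score the candidate change
      # by how much it raises the probability of meeting the latency QoS (the ITE).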


  78. Diagnosing and Fixing the Faults
      Take the top K paths (e.g., GPU Mem. → Swap Mem. → Latency), enumerate all possible changes (set every configuration option in the path to all permitted values), compute ITE(change) for each, and pick the change with the largest ITE.
      The ITEs are inferred from the observed data; this is very cheap.

  79. Diagnosing and Fixing the Faults
      Apply the change with the largest ITE and measure performance. If the fault is fixed, stop; if not, add the new measurement to the observational data, update the causal model, and repeat.
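
      Putting the debugging loop together, a compact sketch of the control flow (the helper callables are hypothetical stand-ins for the pieces shown on the previous slides):

      def causal_debug(observations, learn_model, top_k_paths, candidate_changes,
                       ite, measure, fault_fixed, budget=20):
          """Iterative causal debugging loop: rank candidate fixes by ITE,
          try the best one, and update the model until the fault is fixed."""
          for _ in range(budget):
              model = learn_model(observations)            # causal performance model
              paths = top_k_paths(model, k=3)              # ranked by PACE
              changes = candidate_changes(paths)           # set options to permitted values
              best = max(changes, key=lambda ch: ite(model, ch))
              result = measure(best)                       # run the system once
              observations.append(result)
              if fault_fixed(result):
                  return best
          return None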


  80. UNICORN: Our Causal AI for Systems Method
      [Same pipeline figure as slide 57]

  81. Active Learning for Updating Causal Performance Model
      1- Evaluate candidate interventions on hardware, workload, and kernel options (expected change in belief & KL; causal effects on objectives).
      2- Determine & perform the next performance measurement, e.g. a row of option/event/objective values: Bitrate 1k, Buffer Size 20k, Batch Size 10, Enable Padding 1, Branch Misses 24m, Cache Misses 42m, No of Cycles 73b, FPS 31/s, Energy 42J.
      3- Update the causal model with the new performance data (model averaging).
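
      One simple acquisition heuristic in this spirit (a sketch, not necessarily the criterion UNICORN uses): score each candidate configuration by the disagreement of an ensemble of models fit on bootstrapped data, and measure the configuration where the models disagree most, i.e., where a new observation is expected to change our beliefs the most. All names and numbers below are hypothetical.

      import numpy as np
      from sklearn.ensemble import RandomForestRegressor

      def next_measurement(X_observed, y_observed, X_candidates, n_models=10, seed=0):
          """Pick the candidate configuration with the highest predictive
          disagreement across a bootstrap ensemble (an uncertainty proxy)."""
          rng = np.random.default_rng(seed)
          preds = []
          for i in range(n_models):
              idx = rng.integers(0, len(X_observed), len(X_observed))   # bootstrap
              m = RandomForestRegressor(n_estimators=50, random_state=i)
              m.fit(X_observed[idx], y_observed[idx])
              preds.append(m.predict(X_candidates))
          disagreement = np.std(np.stack(preds), axis=0)
          return int(np.argmax(disagreement))

      # Hypothetical usage: 3 options per configuration, 5 measured, 4 candidates.
      X_obs = np.array([[1, 2, 0], [2, 2, 1], [4, 8, 0], [6, 8, 1], [3, 4, 0]], float)
      y_obs = np.array([12.0, 15.0, 30.0, 41.0, 22.0])
      X_cand = np.array([[5, 2, 1], [6, 4, 0], [1, 8, 1], [2, 4, 1]], float)
      print("measure candidate index:", next_measurement(X_obs, y_obs, X_cand))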


  82. Active Learning for Updating Causal Performance Model
      [Same active-learning figure as slide 81]

  83. Active Learning for Updating Causal Performance Model
      [Same active-learning figure as slide 81]

  84. Benefits of Causal
    Reasoning for
    System
    Performance
    Analysis

    View Slide

  85. There are two fundamental benefits of our "Causal AI for Systems" methodology:
    1. We learn one central (causal) performance model from the data and reuse it across different performance tasks:
    • Performance understanding
    • Performance optimization
    • Performance debugging and repair
    • Performance prediction for different environments (e.g., canary -> production)
    2. The causal model is transferable across environments.
    • We observed the Sparse Mechanism Shift in systems too!
    • Alternative non-causal models (e.g., regression-based models for performance tasks) are not transferable because they rely on the i.i.d. assumption.
    85

    View Slide

  86. 86
    The new version of CADET, called UNICORN, was accepted at EuroSys 2022.
    https://github.com/softsys4ai/UNICORN

    View Slide

  87. Outline
    87
    Motivation
    Causal AI
    For Systems
    Causal AI for
    Autonomy
    and Robotics
    UNICORN
    Results
    Autonomy
    Evaluation
    at JPL

    View Slide

  88. Results: Case Study
    88
    When we are trying to transplant our CUDA source code from TX1 to TX2, it
    behaved strange.
    We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation,
    we think TX2 will 30% - 40% faster than TX1 at least.
    Unfortunately, most of our code base spent twice the time as TX1, in other words,
    TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs
    much slower than TX1 in many cases.
    The user is transferring the code from one hardware platform to another.
    The target hardware is faster than the source hardware.
    The user expects the code to run at least 30-40% faster.
    Instead, the code ran 2x slower on the more powerful hardware.

    View Slide

  89. Results: Case Study
    89
    Nvidia TX1: CPU 4 cores, 1.3 GHz; GPU 128 cores, 0.9 GHz; Memory 4 GB, 25 GB/s
    Nvidia TX2 (more powerful): CPU 6 cores, 2 GHz; GPU 256 cores, 1.3 GHz; Memory 8 GB, 58 GB/s
    Workload: embedded real-time stereo estimation (same source code on both boards)
    TX1: 17 FPS; TX2: 4 FPS (4x slower on the more powerful hardware!)

    View Slide

  90. Results: Case Study
    90
    Options changed: all three approaches (UNICORN, Decision Tree, Forum) changed CPU Cores, CPU Freq., EMC Freq., GPU Freq., and CUDA_STATIC_RT;
    the remaining flagged options (Sched. Policy, Sched. Runtime, Sched. Child Proc, Dirty Bg. Ratio, Drop Caches, Swap Memory)
    were additionally changed only by the baselines.
    Metric                       UNICORN   Decision Tree   Forum
    Throughput (on TX2)          26 FPS    20 FPS          23 FPS
    Throughput Gain (over TX1)   53 %      21 %            39 %
    Time to resolve              24 min.   3 1/2 Hrs.      2 days
    Results (the user expected a 30-40% gain):
    X Finds the root-causes accurately
    X No unnecessary changes
    X Better improvements than forum's recommendation
    X Much faster

    View Slide

  91. Evaluation: Experimental Setup
    Hardware Systems:
    Nvidia TX1: CPU 4 cores, 1.3 GHz; GPU 128 cores, 0.9 GHz; Memory 4 GB, 25 GB/s
    Nvidia TX2: CPU 6 cores, 2 GHz; GPU 256 cores, 1.3 GHz; Memory 8 GB, 58 GB/s
    Nvidia Xavier: CPU 8 cores, 2.26 GHz; GPU 512 cores, 1.3 GHz; Memory 32 GB, 137 GB/s
    Software Systems:
    Xception: image recognition (50,000 test images)
    DeepSpeech: voice recognition (5 sec. audio clip)
    BERT: sentiment analysis (10,000 IMDb reviews)
    x264: video encoder (11 MB, 1080p video)
    Configuration Space:
    X 30 configuration options: 10 software, 10 OS/kernel, 10 hardware
    X 17 system events
    91

    View Slide

  92. Evaluation: Data Collection
    ● For each software/hardware combination, create a benchmark dataset:
    ○ Exhaustively set each configuration option to all permitted values.
    ○ For continuous options (e.g., GPU memory), sample 10 equally spaced values between [min, max].
    ● Measure the latency, energy consumption, and heat dissipation.
    ○ Repeat 5x and average.
    (A sketch of generating such a benchmark grid appears below.)
    92
    [Figure: benchmark data highlighting latency faults, energy faults, and multiple (latency & energy) faults.]
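    The following is a minimal sketch of building such a benchmark grid. The option names, value ranges, and the placeholder measurement function are assumptions for illustration, not the evaluation's actual configuration space.

```python
# Illustrative benchmark-grid generation: exhaustive values for discrete options,
# 10 equally spaced samples for continuous ones, 5 repeated measurements averaged.
import itertools

def spaced(lo, hi, n=10):
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

options = {
    "cpu_cores": [1, 2, 3, 4],                 # discrete: all permitted values
    "cuda_static_rt": [0, 1],
    "gpu_mem_mhz": spaced(140, 1300, 10),      # continuous: 10 equally spaced values
}

def measure_once(config):
    # Placeholder for running the real benchmark and reading latency/energy/heat.
    return {"latency_ms": 100.0, "energy_j": 40.0, "heat_c": 55.0}

dataset = []
for values in itertools.product(*options.values()):
    config = dict(zip(options.keys(), values))
    runs = [measure_once(config) for _ in range(5)]                  # repeat 5x
    avg = {k: sum(r[k] for r in runs) / len(runs) for k in runs[0]}  # and average
    dataset.append({**config, **avg})
```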

    View Slide

  93. Evaluation: Ground Truth
    ● For each performance fault:
    ○ Manually investigate the root cause.
    ○ "Fix" the misconfiguration.
    ● A "fix" implies the configuration no longer has tail performance:
    ○ a user-defined benchmark (i.e., 10th percentile), or
    ○ some QoS/SLA benchmark.
    ● Record the configuration options that were changed.
    (A sketch of labeling tail-performance faults appears below.)
    93
    [Figure: benchmark data highlighting latency faults, energy faults, and multiple faults.]
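    Below is a minimal sketch of labeling tail-performance faults in the benchmark data. The percentile threshold and the field names are assumptions for illustration.

```python
# Illustrative fault labeling: a configuration is flagged if its latency or energy
# falls in the worst tail of the benchmark (here, the worst 10%).
def percentile(values, q):
    s = sorted(values)
    idx = min(int(q / 100 * len(s)), len(s) - 1)
    return s[idx]

def label_faults(dataset, tail_q=90):
    lat_thr = percentile([d["latency_ms"] for d in dataset], tail_q)
    eng_thr = percentile([d["energy_j"] for d in dataset], tail_q)
    for d in dataset:
        lat_fault = d["latency_ms"] >= lat_thr
        eng_fault = d["energy_j"] >= eng_thr
        d["fault"] = ("multiple" if lat_fault and eng_fault
                      else "latency" if lat_fault
                      else "energy" if eng_fault
                      else None)
    return dataset

rows = [{"latency_ms": l, "energy_j": e} for l, e in [(90, 35), (100, 40), (300, 36), (95, 120)]]
print([r["fault"] for r in label_faults(rows)])   # [None, None, 'latency', 'energy']
```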

    View Slide

  94. Results: Research Questions
    94
    RQ1: How does UNICORN perform compared to model-based diagnostics?
    RQ2: How does UNICORN perform compared to search-based optimization?

    View Slide

  95. Results: Research Question 1 (single objective)
    95
    RQ1: How does UNICORN perform compared to model-based diagnostics?
    Takeaways:
    X Finds the root-causes accurately (more accurate than ML-based methods)
    X Better gain
    X Much faster (up to 20x faster)

    View Slide

  96. Results: Research Question 1 (multi-objective)
    96
    RQ1: How does UNICORN perform compared to model-based diagnostics?
    (Multiple faults in latency & energy usage)
    Takeaways:
    X No deterioration of other performance objectives

    View Slide

  97. Results: Research Questions
    97
    RQ1: How does UNICORN perform compared to model-based diagnostics?
    RQ2: How does UNICORN perform compared to search-based optimization?

    View Slide

  98. Results: Research Question 2
    98
    RQ2: How does UNICORN perform compared to search-based optimization?
    Takeaways:
    X Better, with no deterioration of other performance objectives

    View Slide

  99. Results: Research Question 2 (continued)
    99
    RQ2: How does UNICORN perform compared to search-based optimization?
    Takeaways:
    X Considerably faster than search-based optimization

    View Slide

  100. Summary: Causal AI for Systems
    1. Learning a functional causal model for different downstream systems tasks.
    2. The learned causal model is transferable across different environments.
    100
    [Recap of the UNICORN workflow (see slide 80): 1- specify the performance query (QoS: Th > 40/s; observed: Th < 30/s ± 5/s);
    2- learn the causal performance model from performance data; 3- translate the performance query to causal queries,
    e.g., P(Th > 40/s | do(BufferSize = 6k)); 4- estimate the causal queries with the query engine; 5- update the causal
    performance model by measuring the configuration(s) that maximizes information gain, until the budget is exhausted.
    Running example: Software: DeepStream; Middleware: TF, TensorRT; Hardware: Nvidia Xavier; Configuration: Default.]

    View Slide

  101. Artificial Intelligence and Systems Laboratory (AISys Lab)
    Machine Learning, Computer Systems, Autonomy, AI/ML Systems
    https://pooyanjamshidi.github.io/AISys/
    101
    Ying Meng (PhD student), Shuge Lei (PhD student), Kimia Noorbakhsh (Undergrad), Shahriar Iqbal (PhD student),
    Jianhai Su (PhD student), M.A. Javidian (Postdoc), Fatemeh Ghofrani (PhD student), Abir Hossen (PhD student),
    Hamed Damirchi (PhD student), Mahdi Sharifi (PhD student), Lane Stanley (Intern), Sonam Kharde (Postdoc)
    Sponsors, thanks!

    View Slide

  102. Collaborators
    Rahul Krishna (Columbia), Shahriar Iqbal (UofSC), M. A. Javidian (Purdue), Baishakhi Ray (Columbia)

    View Slide

  103. View Slide

  104. Outline
    104
    Causal AI
    For Systems
    UNICORN
    Results
    Motivation
    Causal AI for
    Autonomy
    and Robotics
    Autonomy
    Evaluation
    at JPL

    View Slide

  105. RASPBERRY SI
    Resource Adaptive Software Purpose-Built for Extraordinary Robotic Research Yields - Science Instruments
    AISR: Autonomous Robotics Research for Ocean Worlds (ARROW)
    Pooyan Jamshidi (UofSC, PI), David Garlan (CMU, Co-I), Bradley Schmerl (CMU, Co-I), Matt DeMinico (NASA, Co-I),
    Javier Camara (York (UK), Collaborator), Ellen Czaplinski (NASA JPL, Consultant), Katherine Dzurilla (UArk, Consultant),
    Jianhai Su (UofSC, Graduate Student), Abir Hossen (UofSC, Graduate Student), Sonam Kharde (UofSC, Postdoc)

    View Slide

  106. RASPBERRY SI / AISR: Autonomous Robotics Research for Ocean Worlds (ARROW)
    Program Manager: CAROLYN R. MERCER
    Autonomy team (RASPBERRY SI): Quantitative Planning; Transfer & Online Learning; Causal AI
    POOYAN JAMSHIDI (UofSC, PI), DAVID GARLAN (CMU, Co-I), BRADLEY SCHMERL (CMU, Co-I), MATT DeMINICO (NASA, Co-I),
    JAVIER CAMARA (York, Collaborator), ELLEN CZAPLINSKI (Arkansas, Consultant), KATHERINE DZURILLA (Arkansas, Consultant),
    JIANHAI SU (UofSC, Graduate Student), ABIR HOSSEN (UofSC, Graduate Student), SONAM KHARDE (UofSC, Postdoc)
    Testbed teams (develop and maintain the physical and virtual testbeds):
    K. MICHAEL DALAL (Team Lead), HARI D NAYAR (Team Lead), USSAMA NAAL (Software Engineer), LANSSIE MA (Software Engineer),
    ANNA E BOETTCHER (Robotics System Engineer), ASHISH GOEL (Research Technologist), ANJAN CHAKRABARTY (Software Engineer),
    CHETAN KULKARNI (Prognostics Researcher), THOMAS STUCKY (Software Engineer), TERENCE WELSH (Software Engineer),
    CHRISTOPHER LIM (Robotics Software Engineer), JACEK SAWONIEWICZ (Robotics System Engineer)
    The autonomy team develops the autonomy and evaluates it on both the physical and virtual testbeds.

    View Slide

  107. 107

    View Slide

  108. View Slide

  109. Autonomy Module: Evaluation
    109
    Design
    • MAPE-K loop-based design (Monitor, Analyze, Plan, Execute over shared Knowledge)
    • Machine-learning-driven quantitative planning and adaptation
    Evaluation
    • Two testbeds with different fidelities & simulation flexibilities
    System Under Test (NASA Lander):
    • Physical Testbed: OWLAT (NASA/JPL)
    • Virtual Testbed: OceanWATERS (NASA/ARC)

    View Slide

  110. Learning in Simulation for Transfer Learning to Physical Testbed
    110
    Sim2Real transfer: models are learned in the simulation environment (OWLAT-sim) and transferred to the
    physical testbed (OWLAT), exploiting causal invariances.

    View Slide

  111. Causal AI for Autonomous Robot Testing
    • Testing cyber-physical systems such as robots is complicated. The key reason is that there are additional
      interactions with the environment and the task that the robot is performing.
    • Evaluating our Causal AI for Systems methodology with autonomous robots provides the following opportunities:
    1. Identifying difficult-to-catch bugs in robots.
    2. Identifying the root cause of an observed fault and repairing the issue automatically during mission time.
    111

    View Slide

  112. Outline
    112
    Causal AI
    For Systems
    UNICORN
    Results
    Motivation
    Causal AI for
    Autonomy
    and Robotics
    Autonomy
    Evaluation
    at JPL

    View Slide

  113. Lessons Learned
    • Open Science, Open Source, Open Data, and Open Collaborations

    • Diverse Team, Diverse Background, Diverse Expertise

    • Close Collaborations with the JPL and Ames teams

    • Evaluation in Real Environment

    Project Website: https://nasa-raspberry-si.github.io/raspberry-si

    View Slide

  114. Lessons Learned
    • In the simulation, we can debug/develop/test our implementation without
    worrying about damaging the hardware.

    • High bandwidth and close interaction between the testbed provider (JPL
    Team) and the autonomy team (RASPBERRY-SI)

    • Faster identification of the issues

    • Resolving the issues a lot faster

    • Getting help for development

    View Slide

  115. Lessons Learned
    • Importance of risk-reduction phases
    • Integration testing
    • The interfaces and capabilities of the testbeds will evolve, and the autonomy needs to be designed at the same time.
    • The simulation and physical testbeds have different capabilities.
    • Rigorous testing, performed remotely and in interaction with the testbed providers.
    • This interaction benefits the autonomy providers as well as the testbed providers.

    View Slide

  116. Incremental Integration Testing
    116
    Components: A = Model Learning, B = Transfer Learning, C = Model Compression, D = Online Learning, E = Quantitative Planning
    Each configuration goes through a component test and an integration test.
    Case 1 (Baseline): A + E
    Case 2 (Transfer): A + B + E
    Case 3 (Compress): A + B + C + E
    Case 4 (Online): A + B + C + D + E
    Expected performance: Case 1 < Case 2 < Case 3 < Case 4
    OWLAT Code: https://github.com/nasa/ow_simulator
    Physical Autonomy Testbed: https://www1.grc.nasa.gov/wp-content/uploads/2020_ASCE_OWLAT_20191028.pdf

    View Slide

  117. Real-World Experiments using OWLAT
    117
    • Models learned from simulation
    • Adaptive system (learning + planning)
    • Sets of tests
    The adaptive system uses the machine-learning models to execute missions in the environment; mission reports
    are logged to the local machine and cloud storage, and continual learning refines the models.

    View Slide

  118. Test Coverage
    • Mission types: landing and scientific exploration -> sampling
    • Mission difficulty:
      • Rough regions for landing
      • Number of locations where a sample needs to be fetched
    • Unexpected events:
      • Changes in the environment (e.g., uneven terrain and weather)
      • Changes to the lander capabilities (e.g., deploying new sensors)
      • Faults (power, instruments, etc.)
    (A sketch of enumerating tests over these dimensions appears below.)
    118
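    The following is a minimal sketch of enumerating test scenarios over these coverage dimensions with a simple cross product. The concrete values are placeholders, not the project's actual test catalogue.

```python
# Illustrative test-scenario enumeration over mission type, difficulty, and unexpected events.
import itertools

mission_types = ["landing", "sampling"]
difficulty = ["easy", "rough_terrain", "many_sample_locations"]
unexpected_events = [None, "uneven_terrain", "weather_change",
                     "new_sensor_deployed", "power_fault", "instrument_fault"]

tests = [
    {"mission": m, "difficulty": d, "event": e}
    for m, d, e in itertools.product(mission_types, difficulty, unexpected_events)
]
print(len(tests), "test scenarios")   # 2 * 3 * 6 = 36 scenarios
```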

    View Slide

  119. Infrastructure for Automated Evaluation
    119
    Test Generator -> Test Harness (mission configuration) -> Autonomy Module (learning & planning, plan executive,
    adapter interface) <-> Testbed (environment & lander simulation), with all communication logged.
    Monitoring & logging produce logs; log analysis of those logs produces the evaluation report.
    (A minimal sketch of this harness loop appears below.)
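    Below is a minimal sketch of such a harness loop: generate tests, run the autonomy, log, and analyze. All function names and the log format are assumptions for illustration, not the project's actual infrastructure.

```python
# Illustrative automated-evaluation loop: run each generated test, log the results,
# then analyze the logs into a small evaluation report.
import json, time

def run_autonomy(test):
    # Placeholder: the real harness would launch the autonomy module against the testbed.
    return {"test": test, "completed": True, "duration_s": 42.0}

def evaluate(test_cases, log_path="evaluation_logs.jsonl"):
    with open(log_path, "w") as log:
        for test in test_cases:
            record = {"timestamp": time.time(), "result": run_autonomy(test)}
            log.write(json.dumps(record) + "\n")            # monitoring & logging
    with open(log_path) as log:                             # log analysis
        results = [json.loads(line)["result"] for line in log]
    completed = sum(r["completed"] for r in results)
    return {"tests": len(results), "completed": completed}  # evaluation report

report = evaluate([{"mission": "sampling", "event": "power_fault"}])
print(report)
```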

    View Slide

  120. Thank You!

    View Slide