Pooyan Jamshidi
January 15, 2022

# Causal AI for Systems

exploreCSR workshop, Jan 2022.
https://democratizeai.org/

## Transcript

1. Causal AI for Systems
Learning Causal Performance Models for Conducting Performance Tasks in a Principled and Transferable Fashion
Pooyan Jamshidi

2. What is Causal AI?

• Zeus is a patient waiting for a heart transplant. On 1 January, he received a new heart. Five days later, he died.

• Imagine that we can somehow know that, had Zeus not received a heart transplant on 1 January, he would have been alive five days later.

• All other things in his life being unchanged.

• Now, what do you think was the cause of Zeus’s death?

• Most people would agree that the transplant caused Zeus’s death.

• The intervention had a causal effect.

• Hera received a heart transplant on 1 January. Five days later she was alive.

• Again, imagine we can somehow know that, had Hera not received the heart on 1 January, she would still have been alive five days later.

• All other things in her life being unchanged.

• The transplant did not have a causal effect on Hera’s five-day survival.

5. Let’s collect some data!
Exposure variable A (1: exposed, 0: unexposed); Outcome variable Y (1: death, 0: survival)

6. Individual Causal Effect
A contrast of the values of counterfactual outcomes; for each individual, only one of those values is observed.

7. Population Causal Effects
• $\Pr[Y^{a} = 1]$: proportion of subjects that would have developed the outcome Y had all subjects in the population of interest received exposure value a.

• The exposure has a causal effect in the population if $\Pr[Y^{a=1} = 1] \neq \Pr[Y^{a=0} = 1]$.

• Unlike individual causal effects, population causal effects can sometimes be computed or, more rigorously, consistently estimated.

$$\Pr[Y^{a=1} = 1] - \Pr[Y^{a=0} = 1] \neq 0$$

8. Now let’s do some cool ML
ML models characterize association
Pr[Y = 1|A = 1] = 7/13
Pr[Y = 1|A = 0] = 3/7
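
The two associational risks are plain conditional frequencies. A minimal sketch, assuming the counts implied by the fractions on the slide (13 exposed subjects with 7 deaths, 7 unexposed subjects with 3 deaths); the individual records are rebuilt from those counts for illustration and are not copied from the original data table.

```python
from fractions import Fraction

# Hypothetical records (A, Y) rebuilt from the slide's fractions:
# 13 exposed (A=1) with 7 deaths, 7 unexposed (A=0) with 3 deaths.
records = [(1, 1)] * 7 + [(1, 0)] * 6 + [(0, 1)] * 3 + [(0, 0)] * 4

def risk(records, a):
    """Pr[Y = 1 | A = a]: share of deaths among subjects with exposure a."""
    outcomes = [y for (exposure, y) in records if exposure == a]
    return Fraction(sum(outcomes), len(outcomes))

risk_exposed = risk(records, 1)
risk_unexposed = risk(records, 0)
print(risk_exposed, risk_unexposed, float(risk_exposed - risk_unexposed))
```

The difference between the two risks measures association only; whether it also measures a causal effect depends on how exposure was assigned.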

9. Association is not Causation!

10. Computing Causal Effects via Randomization
Unlike association measures, effect measures cannot be directly computed because of missing data. However, effect measures can be computed/estimated in randomized experiments!
• Suppose we have a (near-infinite) population and that we flip a coin for each subject in such population. We assign the subject to group 1 if the coin turns tails, and to group 2 if it turns heads.

• Next we administer the treatment or exposure of interest (A = 1) to subjects in group 1 and placebo
(A = 0) to those in group 2. Five days later, at the end of the study, we compute the mortality risks in
each group, Pr[Y = 1|A = 1] and Pr[Y = 1|A = 0].

• When subjects are randomly assigned to groups 1 and 2, the proportion of deaths among the
exposed, Pr[Y = 1|A = 1], will be the same whether subjects in group 1 receive the exposure and
subjects in group 2 receive placebo, or vice versa.

• Because group membership is randomised, both groups are ‘‘comparable’’: which particular group
got the exposure is irrelevant for the value of Pr[Y = 1|A = 1]. (The same reasoning applies to Pr[Y =
1|A = 0].)

• Formally, we say that both groups are exchangeable.

11. Let’s do some math!
Exchangeability: $\Pr[Y^{a} = 1 \mid A = 1] = \Pr[Y^{a} = 1 \mid A = 0] = \Pr[Y^{a} = 1]$
Consistency: $\Pr[Y^{a} = 1 \mid A = a] = \Pr[Y = 1 \mid A = a]$
Therefore: $\Pr[Y = 1 \mid A = a] = \Pr[Y^{a} = 1]$
In ideal randomized experiments, Association is Causation!
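
The claim can be checked numerically. The sketch below simulates hypothetical potential outcomes for a large population, assigns exposure by a fair coin flip, and compares the associational risk among the exposed with the counterfactual risk had everyone been exposed. All probabilities and the `frail` attribute are made-up illustration values.

```python
import random

random.seed(0)
N = 200_000

# Hypothetical potential outcomes per subject: y0 (unexposed) and y1 (exposed).
# Frail subjects are more likely to die under either exposure value.
subjects = []
for _ in range(N):
    frail = random.random() < 0.4
    y0 = int(random.random() < (0.6 if frail else 0.2))
    y1 = int(random.random() < (0.7 if frail else 0.3))
    subjects.append((y0, y1))

# Randomized assignment: a fair coin flip, independent of the potential outcomes.
assignment = [random.random() < 0.5 for _ in range(N)]

# Causal quantity Pr[Y^{a=1} = 1]: uses y1 for everyone, which a real study cannot observe.
causal_risk = sum(y1 for _, y1 in subjects) / N

# Associational quantity Pr[Y = 1 | A = 1]: uses only the observed outcomes of the exposed.
exposed_outcomes = [y1 for (_, y1), a in zip(subjects, assignment) if a]
assoc_risk = sum(exposed_outcomes) / len(exposed_outcomes)

print(round(causal_risk, 3), round(assoc_risk, 3))
```

Because the coin flip ignores frailty, the exposed group is a fair sample of the whole population and the two numbers agree up to sampling noise.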

12. But not in non-randomized observational studies
Still remember this?
Pr[Y = 1|A = 1] = 7/13
Pr[Y = 1|A = 0] = 3/7

13. Outline
Motivation
Causal AI For Systems
UNICORN
Results
Future Directions

14. Goal: Enable developers/users to fi…
15. Today’s most popular systems are configurable

16. 16

17. Empirical observations confirm that systems are becoming increasingly configurable
[Figure: number of parameters vs. release time, with panels including Apache, MapReduce, and HDFS]
[Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]

18. Empirical observations confirm that systems are becoming increasingly configurable
[Figure: number of parameters vs. release time for Storage-A, MySQL, Apache, MapReduce, and HDFS, shown over an excerpt of the cited paper]
[Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]

19. Configurations determine the performance behavior
void Parrot_setenv(. . . name,. . . value){
#ifdef PARROT_HAS_SETENV
my_setenv(name, value, 1);
#else
int name_len=strlen(name);
int val_len=strlen(value);
char* envs=glob_env;
if(envs==NULL){
return;
}
strcpy(envs,name);
strcpy(envs+name_len,"=");
strcpy(envs+name_len + 1,value);
putenv(envs);
#endif
}
#ifdef LINUX
extern int Parrot_signbit(double x){
[Figure: compile-time options PARROT_HAS_SETENV and LINUX, linked to Speed and Energy]

20. Misconfiguration and its Effects
● Misconfigurations can elicit unexpected interactions between software and hardware
● These can result in non-functional faults
○ Affecting non-functional system properties like latency, throughput, energy consumption, etc.
● The system doesn’t crash or exhibit an obvious misbehavior
● Systems are still operational but with a degraded performance, e.g., high latency, low throughput, high energy consumption, high heat dissipation, or a combination of several

21. Motivating Example
CUDA performance issue on tx2
When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange.
We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least.
Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases.
Annotations on the post:
• The user is transferring the code from one hardware to another
• The target hardware is faster than the source hardware. User expects the code to run at least 30-40% faster.
• The code ran 2x slower on the more powerful hardware

22. Motivating Example
June 3rd
We have already tried this. We still have high latency.
Any other suggestions?
June 4th
Please do the following and let us know if it works
1. Install JetPack 3.0
2. Set nvpmodel=MAX-N
3. Run jetson_clock.sh
June 5th
June 4th
+ set(CUDA_STATIC_RUNTIME OFF)
...
+ -gencode=arch=compute_62,code=sm_62
In Software:
✖ Wrong compilation flags
✖ Wrong SDK version
In Hardware:
✖ Wrong power mode
✖ Wrong clock/fan settings
The discussions took 2 days!
Any suggestions on how to improve my performance?
Thanks!
How to resolve such issues faster?

23. Today’s most popular systems are complex!
multiscale, multi-modal, and multi-stream
Variability Space = Configuration Space + System Architecture + Deployment Environment
Pipeline components: Video Decoder, Stream Muxer, Primary Detector, Object Tracker, Secondary Classifier
# Configuration Options: 55, 86, 14, 44, 86

24. More combinations than estimated atoms in the universe
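
A quick worked check of this claim, under two stated assumptions: every option is treated as binary (a simplification; options with more values only enlarge the space), and the number of atoms is taken as roughly 10^80, a commonly quoted estimate. The option counts are the five numbers listed on the previous slide.

```python
# Option counts per pipeline component, as listed on the previous slide.
option_counts = [55, 86, 14, 44, 86]
total_options = sum(option_counts)

# Assumption: every option is binary, so each one doubles the number of configurations.
num_configurations = 2 ** total_options

# Assumption: about 10**80 atoms in the observable universe (a commonly quoted estimate).
atoms_estimate = 10 ** 80

print(total_options, num_configurations > atoms_estimate)
```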

25. The default configuration is typically bad and the optimal configuration is noticeably better than median
[Figure: average write latency (µs) vs. throughput (ops/sec), marking the Default Configuration and the Optimal Configuration]
• 2X-10X faster than worst
• Noticeably faster than median

26. Performance behavior varies in different environments

27. Outline
Motivation
Causal AI For Systems
UNICORN
Results
Future Directions

28. Causal AI in Systems and Software
Computer Architecture
Database
Operating Systems
Programming Languages
BigData
Software Engineering
https://github.com/y-ding/causal-system-papers

29. Throughput = 9 × Bitrate + 2.1 × Buffersize − 4.4 × Bitrate × Buffersize × BatchSize
VS
Causal Performance Model
[Figure: causal graph for the Decoder and Muxer components, in three layers. Software Options: Bitrate, Buffer Size, Batch Size, Enable. Intermediate Causal Mechanisms: Branch Misses, Cache Misses, No. of Cycles. Performance Objective: Throughput, Energy. Edges carry functional nodes (f, f1 to f4); a Causal Interaction and Causal Paths are marked.]
Branchmisses = 2 × Bitrate + 8.1 × Buffersize + 4.1 × Bitrate × Buffersize × Cachemisses

30. Critical Issues of Correlation-based Performance Analysis
• Performance influence models could produce unreliable predictions.

• Performance influence models could produce unstable predictions across environments and in the presence of measurement noise.

• Performance influence models could produce incorrect explanations.

31. Why Causal Inference? (Simpson’s Paradox)
Increasing GPU memory increases Latency
More GPU memory usage should reduce latency, not increase it. Counterintuitive!
Any ML-/statistical models built on this data will be incorrect!

32. Why Causal Inference? (Simpson’s Paradox)
Segregate data on swap memory
Available swap memory is reducing
GPU memory borrows memory from the swap for some intensive workloads. Other host processes may reduce the available swap. Little will be left for the GPU to use.
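
The reversal can be reproduced with a few lines of arithmetic. The sketch below uses made-up measurements (not the data behind the slides): within each level of available swap, latency falls as GPU memory grows, yet the pooled trend points the other way because low-swap groups sit at both higher GPU memory and much higher latency.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    return cov / var

# Hypothetical measurements: (GPU memory in GB, latency in ms), grouped by available swap in GB.
groups = {}
for k, swap_gb in enumerate([4, 3, 2, 1]):
    groups[swap_gb] = [(k + 0.1 * j, 10 + 8 * k - 0.2 * j) for j in range(5)]

# Pooled trend ignores swap; per-group trends condition on it.
pooled = [p for pts in groups.values() for p in pts]
pooled_slope = slope(pooled)
group_slopes = {swap: slope(pts) for swap, pts in groups.items()}

print("pooled slope:", round(pooled_slope, 2))
print("per-swap slopes:", {s: round(v, 2) for s, v in group_slopes.items()})
```

A model fitted to the pooled data would recommend reducing GPU memory, which is the opposite of what every swap-level subgroup suggests.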

33. Why Causal Inference?
Real world problems can have 100s if not 1000s of interacting configuration options!
Manually understanding and evaluating each combination is impractical, if not impossible.

34. Causal Performance Models
Express the relationships between interacting variables as a causal graph
[Figure: causal graph over Swap Mem., GPU Mem., and Latency. Legend: configuration option, system event, non-functional property, direction(s) of the causality]
• Latency is affected by GPU Mem., which in turn is influenced by swap memory
• External factors like resource pressure also affect swap memory
35. Causal Performance Models
How to construct this causal graph?
If there is a fault in latency, how to diagnose and fix it?
[Figure: causal graph over Swap Mem., GPU Mem., and Latency]

36. Outline
Motivation
Causal AI For Systems
UNICORN
Results
Future Directions

37. UNICORN: Our Causal AI for Systems Method
• Build a Causal Performance Model that captures the interactions among options in the variability space using the observational performance data.

• Iterative causal performance model evaluation and model update
• Perform downstream performance debugging & optimization using Causal Reasoning

38. UNICORN: Our Causal AI for Systems Method
Software: DeepStream
Middleware: TF, TensorRT
Hardware: Nvidia Xavier
Configuration: Default
[Figure: latency (ms) over number of counters and number of splitters, cubic interpolation over finer grid]
Workflow:
1- Specify Performance Query (QoS : Th > 40/s; Observed : Th < 30/s ± 5/s)
2- Learn Causal Performance Model (from Performance Data)
3- Translate Perf. Query to Causal Queries, e.g., P(Th > 40/s | do(Buffersize = 6k))
4- Estimate Causal Queries (Query Engine), e.g., estimate probability of satisfying QoS if BufferSize is set to 6k?
5- Update Causal Performance Model: measure performance of the configuration(s) that maximizes information gain
Budget Exhausted? Yes / No
Downstream tasks: Performance Debugging, Performance Optimization
Example performance queries:
• What is the root-cause of observed perf. fault?
• How do I fix the misconfig.?
• How can I improve throughput without sacrificing accuracy?
• How do I understand perf behavior?
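
A query of the form P(Th > 40/s | do(Buffersize = 6k)) is not the same as the conditional probability read off the data. One standard way to estimate it from observational data, when the observed confounders satisfy the backdoor criterion, is the adjustment formula P(y | do(x)) = Σ_z P(y | x, z) P(z). The sketch below applies it to hypothetical records with a single assumed confounder named `workload`; it illustrates the idea only and is not UNICORN's query engine.

```python
from collections import Counter

# Hypothetical observations: (workload, buffer_size, qos_met).
# `workload` is an assumed confounder that affects both the buffer size in use and throughput.
data = (
    [("light", "6k", 1)] * 18 + [("light", "6k", 0)] * 2 +
    [("light", "2k", 1)] * 6 + [("light", "2k", 0)] * 4 +
    [("heavy", "6k", 1)] * 4 + [("heavy", "6k", 0)] * 6 +
    [("heavy", "2k", 1)] * 8 + [("heavy", "2k", 0)] * 32
)

def p_do(data, buffer_size):
    """Backdoor adjustment: sum over z of P(qos=1 | buffer_size, z) * P(z).

    Assumes every stratum z contains at least one record with this buffer size.
    """
    z_counts = Counter(z for z, _, _ in data)
    total = len(data)
    estimate = 0.0
    for z, n_z in z_counts.items():
        stratum = [q for (zz, b, q) in data if zz == z and b == buffer_size]
        estimate += (sum(stratum) / len(stratum)) * (n_z / total)
    return estimate

def p_cond(data, buffer_size):
    """Plain conditional P(qos=1 | buffer_size), with no adjustment."""
    rows = [q for (_, b, q) in data if b == buffer_size]
    return sum(rows) / len(rows)

print(round(p_do(data, "6k"), 4), round(p_cond(data, "6k"), 4))
```

In this made-up data the unadjusted conditional overstates the benefit of the 6k buffer, because that setting was mostly observed under light workloads.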

39. UNICORN: Our Causal AI for Systems Method
[Workflow figure, as on slide 38]

40. UNICORN: Our Causal AI for Systems Method
[Workflow figure, as on slide 38]

41. Learning Causal Performance Model
[Figure: causal graphs over Bitrate, Buffer Size, Batch Size, Enable, Branch Misses, Cache Misses, No of Cycles, FPS, and Energy, one per step]

| | Bitrate (bits/s) | Enable | … | Cache Misses | … | Throughput (fps) |
|---|---|---|---|---|---|---|
| c1 | 1k | 1 | … | 42m | … | 7 |
| c2 | 2k | 1 | … | 32m | … | 22 |
| … | … | … | … | … | … | … |
| cn | 5k | 0 | … | 12m | … | 25 |

Steps:
1- Recovering the Skeleton
2- Pruning Causal Structure
3- Orienting Causal Relations
Annotations:
• statistical independence tests
• fully connected graph given constraints (e.g., no connections btw configuration options)
• orientation rules & measures (entropy) + structural constraints (colliders, v-structures)
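
The starting point for structure learning can be sketched in a few lines: build the fully connected undirected graph over all variables, then drop the edges ruled out by the structural constraint that configuration options are not connected to each other. The variable names come from the slide; statistical independence tests and edge orientation are left out of this sketch.

```python
from itertools import combinations

# Variable names follow the slide, grouped by their role in the figure.
options = ["Bitrate", "Buffer Size", "Batch Size", "Enable"]
events = ["Branch Misses", "Cache Misses", "No of Cycles"]
objectives = ["FPS", "Energy"]
variables = options + events + objectives

# Start from a fully connected undirected graph (each edge is an unordered pair).
full_graph = {frozenset(pair) for pair in combinations(variables, 2)}

# Structural constraint from the slide: no connections between configuration options.
forbidden = {frozenset(pair) for pair in combinations(options, 2)}
skeleton_start = full_graph - forbidden

print(len(full_graph), len(skeleton_start))
```

Independence tests would then remove further edges from `skeleton_start` before any orientation rules are applied.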

42. Performance measurement
Configuration Space: $\mathbb{C} = O_1 \times O_2 \times \cdots \times O_{19} \times O_{20}$
Example options: Constant folding, Loop unrolling, Function inlining
$c_1 = 0 \times 0 \times \cdots \times 0 \times 1$, with $c_1 \in \mathbb{C}$
[Figure: measurement pipeline. Program, Compiler (e.g., SaC, LLVM), Compiled Code, Instrumented Binary, Hardware; steps: Configure, Compile, Deploy]
Non-functional measurable/quantifiable aspects:
• Compile time: $f_c(c_1) = 11.1\,\text{ms}$
• Execution time: $f_e(c_1) = 110.3\,\text{ms}$
• Energy: $f_{en}(c_1) = 100\,\text{mWh}$
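
The configuration space is a Cartesian product of option domains, which maps directly onto `itertools.product`. A minimal sketch, assuming all 20 options are binary on/off flags; the metric key names are hypothetical, and the measured values are the ones shown on the slide for c1.

```python
from itertools import islice, product

NUM_OPTIONS = 20  # O_1 ... O_20 on the slide

# Assumption: each option is binary (on/off), e.g. a compiler optimization flag.
option_domains = [(0, 1)] * NUM_OPTIONS

# Size of the configuration space: product of the domain sizes.
space_size = 1
for domain in option_domains:
    space_size *= len(domain)

# product() enumerates lazily with the last option varying fastest, so the
# second configuration generated is 0 x 0 x ... x 0 x 1, i.e. c1 on the slide.
c1 = next(islice(product(*option_domains), 1, None))

# Measurements for c1 as reported on the slide, under hypothetical metric names.
measurements = {c1: {"compile_time_ms": 11.1, "execution_time_ms": 110.3, "energy_mwh": 100}}

print(space_size, c1)
```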

43. Our setup for performance measurements

44. Hardware platforms in our experiments
The reason behind using different types of hardware platforms is that they exhibit different behaviors due to differences in terms of resources, their microarchitecture, etc.
AWS DeepLens: Cloud-connected device
System on Chip (SoC)
Microcontrollers (MCUs)

45. Measuring performance for systems involves lots of challenges
Each hardware platform requires a different way of instrumentation, and obtaining clean measurements that contain the least amount of noise is the most challenging part of our experiments.

46. Learning Causal Performance Model
[Figure, as on slide 41]

47. Learning Causal Performance Model
[Figure, as on slide 41]

48. Learning Causal Performance Model
[Figure, as on slide 41]

49. Causal Performance Model
[Figure: causal graph for the Decoder and Muxer components, in three layers. Software Options: Bitrate, Buffer Size, Batch Size, Enable. Perf. Events: Branch Misses, Cache Misses, No of Cycles. Performance Objective: Throughput, Energy. Edges carry functional nodes (f); a Causal Interaction and Causal Paths are marked.]
Branchmisses = 2 × Bitrate + 8.1 × Buffersize + 4.1 × Bitrate × Buffersize × Cachemisses

50. UNICORN: Our Causal AI for Systems Method
[Workflow figure, as on slide 38]

51. Causal Debugging: An example of downstream performance task
Objective
Diagnose and fix the root-cause of misconfigurations that cause non-functional faults
Approach
• Use causal models to model various cross-stack configuration interactions; and
• Counterfactual reasoning to recommend fixes for these misconfigurations

52. Causal Debugging
• What is the root-cause of my fault?
• How do I fix my misconfigurations to improve performance?
[Figure: debugging loop. Misconfiguration, Observational Data (configurations, training data), Build Causal Graph, Extract Causal Paths, Rank Paths, Counterfactual Queries, Best Query, Fault fixed? (Yes / No: update observational data)]
Counterfactual queries are what-if questions, e.g., what if the configuration option X was set to a value ‘x’?

53. Extracting Causal Paths from the Causal Model
[Debugging-loop figure, as on slide 52]

54. Extracting Causal Paths from the Causal Model
Problem
✕ In real world cases, this causal graph can be very complex
✕ It may be intractable to reason over the entire graph directly
Solution
✓ Extract paths from the causal graph
✓ Rank them based on their Average Causal Effect on latency, etc.
✓ Reason over the top K paths

55. Extracting Causal Paths from the Causal Model
[Figure: causal graph over Swap Mem., GPU Mem., and Latency]
Extract paths:
• A path always begins with a configuration option or a system event
• A path always terminates at a performance objective
Extracted paths:
• Swap Mem. → GPU Mem. → Latency
• Swap Mem. → Latency
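
Path extraction can be sketched as a depth-first walk over the directed causal graph, starting at configuration options or system events and stopping at a performance objective. The edges below are an assumption read off the three-node example on the slide, and only Swap Mem. is used as a start node.

```python
# Assumed edges for the slide's example: Swap Mem. -> GPU Mem. -> Latency and Swap Mem. -> Latency.
graph = {
    "Swap Mem.": ["GPU Mem.", "Latency"],
    "GPU Mem.": ["Latency"],
    "Latency": [],
}
start_nodes = ["Swap Mem."]   # configuration options / system events to start from
objectives = {"Latency"}      # performance objectives where every path must end

def extract_paths(graph, start_nodes, objectives):
    """Depth-first enumeration of directed paths from a start node to an objective."""
    paths = []

    def walk(node, path):
        if node in objectives:
            paths.append(path)
            return
        for child in graph.get(node, []):
            if child not in path:  # guard against cycles
                walk(child, path + [child])

    for start in start_nodes:
        walk(start, [start])
    return paths

paths = extract_paths(graph, start_nodes, objectives)
for p in paths:
    print(" -> ".join(p))
```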

56. Ranking Causal Paths from the Causal Model
● There may be too many causal paths
● We need to select the most useful ones
● Compute the Average Causal Effect (ACE) of each pair of neighbors in a path
[Figure: path Swap Mem. → GPU Mem. → Latency]

$$\mathrm{ACE}(\text{GPU Mem.}, \text{Swap}) = \frac{1}{N} \sum_{a, b \in Z} \mathbb{E}\big(\text{GPU Mem.} \mid do(\text{Swap} = b)\big) - \mathbb{E}\big(\text{GPU Mem.} \mid do(\text{Swap} = a)\big)$$

• $\mathbb{E}(\text{GPU Mem.} \mid do(\text{Swap} = b))$: expected value of GPU Mem. when we artificially intervene by setting Swap to the value b
• $\mathbb{E}(\text{GPU Mem.} \mid do(\text{Swap} = a))$: expected value of GPU Mem. when we artificially intervene by setting Swap to the value a
• If this difference is large, then small changes to Swap Mem. will cause large changes to GPU Mem.
• $\frac{1}{N} \sum_{a, b \in Z}$: average over all permitted values of Swap memory.
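
The ACE of a pair of neighbors averages differences between interventional expectations over the permitted values of the parent. A minimal sketch with hypothetical expectations for GPU Mem. under do(Swap = v); the convention of counting each pair once with a < b is an assumption made here so that the differences do not cancel, and N is then the number of such pairs.

```python
from itertools import combinations

# Hypothetical interventional expectations E[GPU Mem. | do(Swap = v)], in GB,
# for each permitted swap value v (in GB). In practice these come from the causal model.
expected_gpu_mem = {1: 3.0, 2: 2.4, 3: 2.0, 4: 1.8}

def average_causal_effect(expectation_by_value):
    """Mean of E[. | do(X=b)] - E[. | do(X=a)] over all pairs with a < b.

    Assumption: each unordered pair is counted once, ordered so that a < b.
    """
    values = sorted(expectation_by_value)
    diffs = [expectation_by_value[b] - expectation_by_value[a]
             for a, b in combinations(values, 2)]
    return sum(diffs) / len(diffs)

ace = average_causal_effect(expected_gpu_mem)
print(round(ace, 3))
```

A negative value here means that, in this made-up example, increasing swap lowers expected GPU memory use; for ranking, the magnitude is what matters.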

57. Ranking Causal Paths from the Causal Model
● Average the ACE of all pairs of adjacent nodes in the path
● Rank paths from highest path ACE (PACE) score to the lowest
● Use the top K paths for subsequent analysis

$$\mathrm{PACE}(Z, Y) = \frac{1}{2}\big(\mathrm{ACE}(Z, X) + \mathrm{ACE}(X, Y)\big)$$

Sum over all pairs of nodes in the causal path.
[Figure: path Z → X → Y, illustrated by Swap Mem. → GPU Mem. → Latency]
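
Ranking then reduces to averaging edge-level ACE values along each path and sorting. A minimal sketch with hypothetical ACE magnitudes for the three edges of the running example; the helper names `pace` and `top_k_paths` are illustrative.

```python
# Hypothetical ACE magnitudes for directed edges (source, target).
ace = {
    ("Swap Mem.", "GPU Mem."): 0.8,
    ("GPU Mem.", "Latency"): 0.6,
    ("Swap Mem.", "Latency"): 0.3,
}

def pace(path, ace):
    """Path ACE: mean ACE over consecutive node pairs along the path."""
    edges = list(zip(path, path[1:]))
    return sum(ace[edge] for edge in edges) / len(edges)

def top_k_paths(paths, ace, k):
    """Rank paths from highest to lowest PACE and keep the first k."""
    return sorted(paths, key=lambda p: pace(p, ace), reverse=True)[:k]

paths = [
    ["Swap Mem.", "Latency"],
    ["Swap Mem.", "GPU Mem.", "Latency"],
]
best = top_k_paths(paths, ace, k=1)
print(best, round(pace(best[0], ace), 2))
```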

58. Diagnosing and Fixing the Faults
[Debugging-loop figure, as on slide 52]

59. Diagnosing and Fixing the Faults
Example
Given that my current swap memory is 2 Gb, and I have high latency, what is the probability of having low latency if swap memory was increased to 4 Gb?
We are interested in the scenario where:
• We hypothetically have low latency;
Conditioned on the following events:
• We hypothetically set the new Swap memory to 4 Gb
• Swap Memory was initially set to 2 Gb
• We observed high latency when Swap was set to 2 Gb
• Everything else remains the same

60. Diagnosing and Fixing the Faults
Original Path: Swap → GPU Mem. → Latency
Path after proposed change: Swap = 4 Gb → GPU Mem. → Latency (Low?)
• Remove incoming edges. Assume no external influence.
• Modify to reflect the hypothetical scenario
Use both the models to compute the answer to the counterfactual question

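61. Diagnosing and Fixing the Faults
Original Path: Swap → GPU Mem. → Latency
Path after proposed change: Swap = 4 Gb → GPU Mem. → Latency

$$\text{Potential} = P\big(\widehat{\text{Latency}} = \text{low} \mid \widehat{\text{Swap}} = 4\,\text{Gb},\ \text{Swap} = 2\,\text{Gb},\ \text{Latency}_{\text{Swap} = 2\,\text{Gb}} = \text{high},\ U\big)$$

• $\widehat{\text{Latency}} = \text{low}$: we expect a low latency
• $\widehat{\text{Swap}} = 4\,\text{Gb}$: the Swap is now 4 Gb
• $\text{Swap} = 2\,\text{Gb}$: the Swap was initially 2 Gb
• $\text{Latency}_{\text{Swap} = 2\,\text{Gb}} = \text{high}$: the latency was high
• $U$: everything else stays the same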

62. Diagnosing and Fixing the Faults

$$\text{Potential} = P\big(\widehat{\text{outcome}} = \text{good} \mid \text{change},\ \text{outcome}_{\neg\text{change}} = \text{bad},\ \neg\text{change},\ U\big)$$

Probability that the outcome is good after a change, conditioned on the past

$$\text{Control} = P\big(\widehat{\text{outcome}} = \text{bad} \mid \neg\text{change},\ U\big)$$

Probability that the outcome was bad before the change
Individual Treatment Effect = Potential − Outcome
If this difference is large, then our change is useful

63. Diagnosing and Fixing the Faults
[Figure: Top K paths (over Swap Mem., GPU Mem., Latency) → Enumerate all possible changes → ITE(change) → Change with the largest ITE]
• Enumerate all possible changes: set every configuration option in the path to all permitted values
• ITE(change): inferred from observed data. This is very cheap!
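
The selection step can be sketched as an exhaustive enumeration followed by an argmax. The permitted values and the scoring function below are made-up placeholders; in the method described on the slides the ITE of each candidate change is inferred from the causal model and the observed data, which is not reproduced here.

```python
from itertools import product

# Hypothetical permitted values for the configuration options on one top-ranked path.
permitted = {"Swap Mem. (Gb)": [1, 2, 4], "CUDA_STATIC_RT": [0, 1]}

def ite(change):
    """Placeholder ITE score for a candidate change.

    Assumption: a made-up scoring function standing in for the estimate
    that would be inferred from the causal model and observed data.
    """
    return 0.1 * change["Swap Mem. (Gb)"] - 0.2 * change["CUDA_STATIC_RT"]

# Enumerate every combination of permitted values for the options in the path.
names = list(permitted)
candidates = [dict(zip(names, values)) for values in product(*permitted.values())]

# Pick the change with the largest ITE.
best_change = max(candidates, key=ite)
print(len(candidates), best_change)
```

Enumeration stays cheap because it is restricted to the options on the top K paths rather than the full configuration space.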

64. Diagnosing and Fixing the Faults
Change with the largest ITE → Measure Performance → Fault fixed?
• Yes
• No:
  • Add to observational data
  • Update causal model
  • Repeat…

65. UNICORN: Our Causal AI for Systems Method
[Workflow figure, as on slide 38]

66. Active Learning for Updating Causal Performance Model
[Figure: two causal graphs over Bitrate, Buffer Size, Batch Size, Enable, Branch Misses, Cache Misses, No of Cycles, FPS, and Energy, labelled Interventions]
Interventions on Hardware,

| Option/Event/Obj | Values |
|---|---|
| Bitrate | 1k |
| Buffer Size | 20k |
| Batch Size | 10 |
| Branch Misses | 24m |
| Cache Misses | 42m |
| No of Cycles | 73b |
| FPS | 31/s |
| Energy | 42J |

Steps:
2- Determine & Perform next Perf Measurement
3- Updating Causal Model
Annotations: Performance Data; Model averaging; Expected change in belief & KL; Causal effects on objectives

67. Active Learning for Updating Causal Performance Model
[Figure, as on slide 66]

68. Active Learning for Updating Causal Performance Model
[Figure, as on slide 66]

69. Benefits of Causal Reasoning for System Performance Analysis

70. There are two fundamental benefits that we get from our “Causal AI for Systems” methodology
1. We learn one central (causal) performance model from the data across different performance tasks:

• Performance understanding

• Performance optimization

• Performance debugging and repair

• Performance prediction for different environments (e.g., canary -> production)

2. The causal model is transferable across environments.

• We observed Sparse Mechanism Shift in systems too!

• Alternative non-causal models (e.g., regression-based models for performance tasks) are not transferable as they rely on the i.i.d. setting.

71. Questions of this nature require precise mathematical language lest they will […]
Here we are simultaneously conditioning on two values of GPU memory growth (i.e., $\hat{X} = 0.66$ and $X$ […]
approaches cannot handle such expressions. Instead, we must resort to causal models to compute them.

72. Difference between statistical (left) and causal models (right) on a given set of three variables
While a statistical model specifies a single probability distribution, a causal model represents a set of distributions, one for each possible intervention.

73. Independent Causal Mechanisms (ICM) Principle

74. Sparse Mechanism Shift (SMS) Hypothesis
Example of SMS hypothesis, where an intervention (which may or may not be intentional/observed) changes the position of one finger, and as a consequence, the object falls. The change in pixel space is entangled (or distributed), in contrast to the change in the causal model.

75. NeurIPS 2020 (ML For Systems), Dec 12th, 2020
https://arxiv.org/pdf/2010.06061.pdf

76. The new version of CADET, called UNICORN, was accepted at EuroSys 2022.
https://github.com/softsys4ai/UNICORN

77. Outline
Motivation
Causal AI For Systems
UNICORN
Results
Future Directions

78. Results: Case Study
79
When we are trying to transplant our CUDA source code from TX1 to TX2, it
behaved strange.
We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation,
we think TX2 will 30% - 40% faster than TX1 at least.
Unfortunately, most of our code base spent twice the time as TX1, in other words,
TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs
much slower than TX1 in many cases.
• The user is transferring the code from one hardware platform to another.
• The target hardware is faster than the source hardware.
• The user expects the code to run at least 30-40% faster.
• The code ran 2x slower on the more powerful hardware.

79. Results: Case Study
Nvidia TX1: CPU 4 cores, 1.3 GHz; GPU 128 cores, 0.9 GHz; memory 4 GB, 25 GB/s
Nvidia TX2 (more powerful): CPU 6 cores, 2 GHz; GPU 256 cores, 1.3 GHz; memory 8 GB, 58 GB/s
Source code: embedded real-time stereo estimation
Throughput: 17 FPS on TX1, 4 FPS on TX2 (4× slower!)

80. Results: Case Study
Configuration (columns: UNICORN, Decision Tree, Forum)
CPU Cores ✓ ✓ ✓
CPU Freq. ✓ ✓ ✓
EMC Freq. ✓ ✓ ✓
GPU Freq. ✓ ✓ ✓
Sched. Policy ✓
Sched. Runtime ✓
Sched. Child Proc ✓
Dirty Bg. Ratio ✓
Drop Caches ✓
CUDA_STATIC_RT ✓ ✓ ✓
Swap Memory ✓
| | UNICORN | Decision Tree | Forum |
|---|---|---|---|
| Throughput (on TX2) | 26 FPS | 20 FPS | 23 FPS |
| Throughput gain (over TX1) | 53% | 21% | 39% |
| Time to resolve | 24 min | 3½ hrs | 2 days |
• Finds the root causes accurately
• No unnecessary changes
• Better improvements than the forum’s recommendation
• Much faster
Results
The user expected 30-40% gain

81. Evaluation: Experimental Setup
Hardware Systems
• Nvidia TX1: CPU 4 cores, 1.3 GHz; GPU 128 cores, 0.9 GHz; memory 4 GB, 25 GB/s
• Nvidia TX2: CPU 6 cores, 2 GHz; GPU 256 cores, 1.3 GHz; memory 8 GB, 58 GB/s
• Nvidia Xavier: CPU 8 cores, 2.26 GHz; GPU 512 cores, 1.3 GHz; memory 32 GB, 137 GB/s
Software Systems
• Xception: image recognition (50,000 test images)
• DeepSpeech: voice recognition (5 sec. audio clip)
• BERT: sentiment analysis (10,000 IMDb reviews)
• x264: video encoder (11 Mb, 1080p video)
Configuration Space
• 30 configurations: 10 software, 10 OS/kernel, 10 hardware
• 17 system events

82. Evaluation: Data Collection
● For each software/hardware
combination create a benchmark
dataset
○ Exhaustively set each configuration
option to each of its permitted values.
○ For continuous options (e.g., GPU memory),
sample 10 equally spaced values
in [min, max]
● Measure the latency, energy
consumption, and heat dissipation
○ Repeat 5x and average
Fault types: multiple faults, latency faults, energy faults
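The collection procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' benchmarking harness: the option names, value ranges, and the `fake_latency` function are hypothetical, and "exhaustively" is read here as a cross product over a small set of options (a full cross product over many options quickly becomes infeasible).

```python
import itertools
import statistics
from typing import Callable, Dict, List, Sequence


def equally_spaced(lo: float, hi: float, n: int = 10) -> List[float]:
    """Return n equally spaced values covering the closed interval [lo, hi]."""
    if n < 2:
        raise ValueError("need at least two sample points")
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def build_grid(options: Dict[str, Sequence]) -> List[dict]:
    """Enumerate every combination of the permitted values (cross product)."""
    names = list(options)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(options[name] for name in names))]


def measure_averaged(measure: Callable[[dict], float], config: dict,
                     repeats: int = 5) -> float:
    """Run the measurement `repeats` times and return the mean."""
    return statistics.mean(measure(config) for _ in range(repeats))


if __name__ == "__main__":
    # Hypothetical option names and ranges, for illustration only.
    options = {
        "cpu_cores": [1, 2, 3, 4],
        "gpu_memory_fraction": equally_spaced(0.1, 1.0, n=10),
    }
    grid = build_grid(options)

    def fake_latency(config: dict) -> float:
        # Stand-in for running the real benchmark on the device.
        return 100.0 / config["cpu_cores"] + 10.0 * config["gpu_memory_fraction"]

    dataset = [(cfg, measure_averaged(fake_latency, cfg)) for cfg in grid]
    print(len(dataset), "configurations measured")
```

Averaging over five repetitions, as the slide describes, reduces run-to-run measurement noise before the data is used for modeling.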

83. Evaluation: Ground Truth
● For each performance fault:
○ Manually investigate the root-cause
○ “Fix” the misconfigurations
● A “fix” implies the configuration no longer
has tail performance
○ User-defined benchmark (e.g., 10th percentile)
○ Or some QoS/SLA benchmark
● Record the configurations that were
changed
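One way to make the "no longer has tail performance" check concrete is sketched below. It rests on assumptions the slide does not state: the non-functional property is lower-is-better (e.g., latency), and "tail" means the worst 10% of all measured configurations. The function names are hypothetical.

```python
from typing import Sequence


def tail_threshold(values: Sequence[float], worst_percent: int = 10) -> float:
    """Smallest measurement that still falls in the worst `worst_percent`%.

    Assumes higher values are worse (e.g., latency or energy).
    """
    if not values:
        raise ValueError("no measurements given")
    if not 0 < worst_percent <= 100:
        raise ValueError("worst_percent must be in (0, 100]")
    ordered = sorted(values, reverse=True)               # worst first
    k = max(1, (len(ordered) * worst_percent + 99) // 100)  # integer ceiling
    return ordered[k - 1]


def is_fixed(repaired_value: float, values: Sequence[float],
             worst_percent: int = 10) -> bool:
    """A repair counts as a fix if the repaired value leaves the tail."""
    return repaired_value < tail_threshold(values, worst_percent)
```

With a QoS/SLA benchmark instead of a percentile, the same check reduces to comparing the repaired value against the agreed bound.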

84. Evaluation: Metrics
Relevance Scores
Repair Quality
$$\mathrm{Gain} = \frac{\mathrm{NFP}_{\mathrm{fault}} - \mathrm{NFP}_{\mathrm{repair}}}{\mathrm{NFP}_{\mathrm{fault}}} \times 100$$
NFP = Non-Functional Property (e.g., latency, energy)
NFP_fault: faulty value; NFP_repair: repair value
The larger the gain, the better the repair.
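The gain metric is simple enough to compute directly. The sketch below follows the formula on this slide; the numbers in the usage example are made up for illustration and are not results from the talk.

```python
def gain(nfp_fault: float, nfp_repair: float) -> float:
    """Percentage improvement of the repaired value over the faulty value.

    Positive when the repair lowers the non-functional property
    (e.g., latency or energy), negative when the repair makes it worse.
    """
    if nfp_fault == 0:
        raise ValueError("faulty value must be non-zero")
    return (nfp_fault - nfp_repair) / nfp_fault * 100.0


if __name__ == "__main__":
    # Hypothetical example: latency drops from 200 ms to 150 ms.
    print(gain(200.0, 150.0))
```

Because the formula subtracts the repaired value from the faulty one, it suits properties where lower is better.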

85. Results: Research Questions
RQ1: How does UNICORN perform compared to model-based diagnostics?
RQ2: How does UNICORN perform compared to search-based optimization?

86. Results: Research Question 1 (single objective)
RQ1: How does UNICORN perform compared to model-based diagnostics?
Takeaways
• Finds the root causes accurately (more accurate than ML-based methods)
• Better gain
• Much faster (up to 20x faster)

87. Results: Research Question 1 (multi-objective)
RQ1: How does UNICORN perform compared to model-based diagnostics?
Multiple faults in latency and energy usage
Takeaways
• No deterioration of other performance objectives

88. Results: Research Questions
RQ1: How does UNICORN perform compared to model-based diagnostics?
RQ2: How does UNICORN perform compared to search-based optimization?

89. Results: Research Question 2
RQ2: How does UNICORN perform compared to search-based optimization?
Takeaways
• Better with no deterioration of other performance objectives

90. Results: Research Question 2
RQ2: How does UNICORN perform compared to search-based optimization?
Takeaways
• Considerably faster than search-based optimization

91. Outline
Motivation
Causal AI for Systems
UNICORN
Results
Future Directions

92. Causal AI for Serverless
• Evaluating our Causal AI for Systems methodology with serverless
systems provides the following opportunities:

1. Dynamic system reconfigurations

• Dynamic placement of functions

• Dynamic reconfigurations of the network of functions

• Dynamic multi-cloud placement of functions

2. Root cause analysis of failures or QoS drops

93. Causal AI for Autonomous Robot Testing
• Testing cyber-physical systems such as robots is difficult. The key reason
is that there are additional interactions with the environment and the task
that the robot is performing.

• Evaluating our Causal AI for Systems methodology with autonomous
robots provides the following opportunities:

1. Identifying difficult-to-catch bugs in robots

2. Identifying the root cause of an observed fault and repairing the issue
automatically during mission time.

94. Summary: Causal AI for Systems
1. Learning a Functional Causal Model for different downstream tasks

2. The learned causal model is transferable across different environments
Figure: latency (ms) as a function of the number of counters and the number of splitters (cubic interpolation over a finer grid). Software: DeepStream; middleware: TF, TensorRT; hardware: Nvidia Xavier; configuration: default.
Figure: workflow steps
1- Specify Performance Query (QoS: Th > 40/s; observed: Th < 30/s ± 5/s)
2- Learn Causal Performance Model (from performance data)
3- Translate Perf. Query to Causal Queries
4- Estimate Causal Queries (query engine), e.g., estimate the probability of satisfying QoS if BufferSize is set to 6k: P(Th > 40/s | do(BufferSize = 6k))
5- Update Causal Performance Model: measure the performance of the configuration(s) that maximize information gain
Budget exhausted? Yes / No
Performance tasks: Performance Debugging, Performance Optimization
Example performance queries:
• What is the root cause of the observed perf. fault?
• How do I fix the misconfig.?
• How can I improve throughput without sacrificing accuracy?
• How do I understand perf. behavior?
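To illustrate what a causal query such as P(Th > 40/s | do(BufferSize = 6k)) asks, here is a minimal Monte Carlo sketch over a toy structural causal model. Everything in it is an assumption made for illustration: the variables, coefficients, and noise terms are invented, and this is not UNICORN's query engine. The point is only that an intervention fixes BufferSize and cuts its dependence on its usual causes, instead of conditioning on observed data.

```python
import random
from typing import Optional


def simulate_throughput(rng: random.Random,
                        buffer_size: Optional[float] = None) -> float:
    """Draw one throughput sample from a toy structural causal model.

    All mechanisms below are hypothetical.
    """
    load = rng.gauss(1.0, 0.2)  # exogenous workload intensity
    if buffer_size is None:
        # Observational regime: buffer size depends on the workload.
        buffer_size = 2000 + 3000 * load + rng.gauss(0, 300)
    # Otherwise do(BufferSize = buffer_size): the load -> buffer edge is cut.
    # Throughput rises with buffer size and falls with load.
    return 20 + 0.005 * buffer_size - 8 * load + rng.gauss(0, 2)


def prob_qos(buffer_size: Optional[float], threshold: float = 40.0,
             n: int = 20000, seed: int = 0) -> float:
    """Estimate P(Th > threshold), under do(BufferSize) if a value is given."""
    rng = random.Random(seed)
    hits = sum(simulate_throughput(rng, buffer_size) > threshold
               for _ in range(n))
    return hits / n


if __name__ == "__main__":
    print("P(Th > 40 | do(BufferSize = 6000)) ~", prob_qos(6000))
    print("P(Th > 40) with no intervention    ~", prob_qos(None))
```

In this toy model the interventional estimate differs from the observational one because workload influences both buffer size and throughput, which is the kind of confounding a causal performance model is meant to account for.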

95. I played a very minor role

96. Artificial Intelligence and Systems Laboratory (AISys Lab)
Machine Learning, Computer Systems, Autonomy, AI/ML Systems
https://pooyanjamshidi.github.io/AISys/
Ying Meng (PhD student)
Shuge Lei (PhD student)
Kimia Noorbakhsh
Shahriar Iqbal (PhD student)
Jianhai Su (PhD student)
M.A. Javidian (postdoc)
Fatemeh Ghofrani (PhD student)
Abir Hossen (PhD student)
Hamed Damirchi (PhD student)
Mahdi Sharifi (PhD student)
Lane Stanley (Intern)

97. Collaborators (Causal AI)
Rahul Krishna (Columbia)
Shahriar Iqbal (UofSC)
M. A. Javidian (Purdue)
Baishakhi Ray (Columbia)
Christian Kästner (CMU)
Sven Apel (Saarland)
Marco Valtorta (UofSC)
REU student
Forest Agostinelli (UofSC)
Causal AI for Systems
Causal AI for Robot Learning (Causal RL + Transfer Learning + Robotics)
Abir Hossen (UofSC)
Theory of Causal AI
Ahana Biswas (IIT)
Om Pandey (KIIT)
Hamed Damirchi (UofSC)
Causal AI for
Ying Meng (UofSC)
Fatemeh Ghofrani (UofSC)
Mahdi Sharifi (UofSC)
Sugato Basu