Causal AI for Systems

Stanford MLSys Seminars: https://youtu.be/csB_cF6MA9A

Pooyan Jamshidi

August 14, 2021


Transcript

  1. Causal AI
    for Systems
    A journey from performance optimization to transfer learning all the way to Causal AI
    Pooyan Jamshidi
    UofSC & Google


  2. It is all about team work
    I played a very minor role


  3. Artificial Intelligence and Systems Laboratory
    (AISys Lab)
    Machine
    Learning
    Computer
    Systems
    Autonomy
    AI/ML Systems
    https://pooyanjamshidi.github.io/AISys/
    3
    Ying Meng
    (PhD student)
    Shuge Lei
    (PhD student)
    Kimia Noorbakhsh
    (Undergrad)
    Shahriar Iqbal
    (PhD student)
    Jianhai Su
    (PhD student)
    M.A. Javidian
    (postdoc)
    Sponsors, thanks!
    Fatemeh Ghofrani
    (PhD student)
    Abir Hossen
    (PhD student)
    Hamed Damirchi
    (PhD student)
    Mahdi Sharifi
    (PhD student)
    Mahdi Sharifi
    (Intern)


  4. Collaborators (Systems)
    4
    Rahul Krishna
    Columbia
    Shahriar Iqbal
    UofSC
    Baishakhi Ray
    Columbia
    Christian Kästner
    CMU
    Norbert Siegmund
    Leipzig
    Miguel Velez
    CMU
    Sven Apel
    Saarland
    Lars Kotthoff
    Wyoming
    Vivek Nair
    Facebook
    Tim Menzies
    NCSU
    Ramtin Zand
    UofSC
    Mohsen Amini
    UofSC


  5. 5
    Rahul Krishna
    Columbia
    Shahriar Iqbal
    UofSC
    M. A. Javidian
    Purdue
    Baishakhi Ray
    Columbia
    Christian Kästner
    CMU
    Sven Apel
    Saarland
    Marco Valtorta
    UofSC
    Madelyn Khoury
    REU student
    Forest Agostinelli
    UofSC
    Causal AI
    for Systems
    Causal AI for
    Robot Learning
    (Causal RL +
    Transfer Learning +
    Robotics) Abir Hossen
    UofSC
    Theory of
    Causal AI
    Ahana Biswas
    IIT
    Om Pandey
    KIIT
    Hamed Damirchi
    UofSC
    Causal AI for
    Adversarial ML
    Ying Meng
    UofSC
    Fatemeh Ghofrani
    UofSC
    Mahdi Sharifi
    UofSC
    The Causal AI Team!
    Sugato Basu
    Google AdsAI
    Garima Pruthi
    Google AdsAI
    Causal
    Representation
    Learning


  6. Configuration Space
    (Software, Deployment, Hardware)
    Program
    (Code)
    Performance
    Modeling
    Performance
    Visualization
    Whitebox
    Sampling
    [Figure: latency (ms) as a response surface over the number of counters and the number of splitters, cubic interpolation over a finer grid]
    Developer User
    Transfer
    Learning
    Performance
    Understanding
    Tradeoff
    Analysis
    Hands-off
    Debugging
    Performance
    Debugging
    Active
    Learning
    Q1
    Q2
    Q3
    Q4
    Foundation Application
    Artifacts Techniques
    Program
    Analysis
    Causal
    Inference
    Causal-based
    Documentation
    Q5
    Cause
    Localization
    Q3
    Baishakhi Ray
    Columbia
    Christian Kästner
    CMU
    Co-PIs
    Causal Performance Debugging
    for Highly-Configurable Systems


  7. Causal AI + Representation Learning
    Causal Representation Learning
    Learned
    Representation
    FCA
    (Attribution via Causal
    Inference and
    Counterfactual
    Reasoning)
    Multi-Objective
    Optimization, RL,
    Active Learning
    Visualization
    Specification
    (contextual
    badness, model
    robustness)
    Intervention/Update
    Data
    - Slices
    - Groups
    • Generalization

    • Robustness

    • Bias

    • Explainability
    Sugato Basu
    Google AdsAI
    Garima Pruthi
    Google AdsAI


  8. Outline
    8
    Case
    Study
    Causal AI
    For Systems
    CADET
    Current
    Results
    Future
    Directions


  9. 9
    Goal: Enable developers/users
    to find the right quality tradeoff


  10. Today’s most popular systems are configurable
    10
    built


  11. 11


  12. Empirical observations confirm that systems are
    becoming increasingly configurable
    12
    [Figure: number of configuration parameters vs. release time for Apache (1.3.24 through 2.3.4) and Hadoop (MapReduce, HDFS; 0.1.0 through 2.0.0)]
    [Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]


  13. Empirical observations confirm that systems are
    becoming increasingly configurable
    13
    [Excerpt from the cited FSE'15 study (UC San Diego, Huazhong Univ. of Science & Technology, NetApp): the number of configuration parameters ("knobs") grows steadily with release time for Storage-A, MySQL, Apache, and Hadoop (MapReduce, HDFS)]
    [Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]


  14. Configuration options live across stack
    14
    CPU Memory
    Controller
    GPU
    Lib API
    Clients
    Devices
    Network
    Task Scheduler Device Drivers
    File System
    Compilers
    Memory Manager
    Process Manager
    Frontend
    Application
    Layer
    OS/Kernel
    Layer
    Hardware
    Layer
    Deployment
    SoC Generic hardware Production Servers


  15. Today’s most popular systems are also composable!
    Data analytic pipelines
    15


  16. Today’s most popular systems are complex!
    multiscale, multi-modal, and multi-stream
    16
    Multi-Modal Data
    (Configurable)
    Image Processing
    Voice Recognition
    Context Extraction
    ML Models
    (Configurable)
    Deployment Environment
    (Configurable)
    System Components
    (Configurable)
    Multi-Cloud
    Variability Space =
    Configuration Space +

    System Architecture +

    Deployment Environment


  17. Configurations determine the performance
    behavior
    17
    void Parrot_setenv( . . . name, . . . value){
    #ifdef PARROT_HAS_SETENV
        my_setenv(name, value, 1);
    #else
        int name_len = strlen(name);
        int val_len = strlen(value);
        char* envs = glob_env;
        if (envs == NULL) {
            return;
        }
        strcpy(envs, name);
        strcpy(envs + name_len, "=");
        strcpy(envs + name_len + 1, value);
        putenv(envs);
    #endif
    }
    #ifdef LINUX
    extern int Parrot_signbit(double x){
    . . .
    Compile-time options such as PARROT_HAS_SETENV and LINUX select different code paths (#ifdef/#else/#endif), which in turn determine speed and energy.


  18. Performance distributions are multi-modal and have long tails
    • Certain configurations can cause performance
    to take abnormally large values

    • Faulty configurations take the tail values (worse
    than 99.99th percentile)

    • Certain configurations can cause faults on
    multiple performance objectives. 

    18


  19. Identifying the root cause of performance faults is difficult
    ● An auto-pilot code base was transplanted from
    TX1 to TX2
    ● TX2 is more powerful, but the software was
    2x slower than on TX1
    Fig 1. Performance fault on NVIDIA TX2
    https://forums.developer.nvidia.com/t/50477
    19


  20. 20
    Long conversations in issue trackers are common when finding root causes and possible fixes


  21. Users want to understand the effect of configuration options
    21


  22. Fixing performance faults is difficult and not obvious
    ● These were not in the default settings
    ● Took 1 month to fix in the end...
    ● Three misconfigurations:
    ○ Wrong compilation flags for compiling
    CUDA (didn't use 'dynamic' flag)
    ○ Wrong CPU/GPU modes (didn't use TX2
    optimized cores)
    ○ Wrong Fan mode (didn't change to handle
    thermal throttling)
    ● We need to do this better
    Performance fault on NVIDIA TX2
    https://forums.developer.nvidia.com/t/50477
    22


  23. We performed a systematic study of performance faults in ML systems
    1. There are different kinds of performance faults across ML systems as a
    result of misconfigurations.
    I. Latency, Thermal, Energy, Throughput
    II. Combinations of above faults
    2. Configuration options interact with one another across the stack
    I. e.g., software options with hardware options
    3. The interactions between options are usually low degree (2-5).
    4. The interactions between options may change across environments;
    however, such changes are local, confined to a few causal mechanisms.
    5. Non-functional faults take a long time to resolve
    23


  24. We performed a systematic study of performance in different types of systems, with options
    living across the stack and with different deployment topologies
    1. ML Systems
    2. Data analytics Pipelines
    3. Big Data Systems
    4. Stream Processing Systems
    5. Compilers
    6. Video Encoders
    7. Databases
    8. SAT solvers
    24
    [Figure: excerpts from our transfer-learning study and project plan: a table comparing environmental changes ec1-ec10 (hardware h1: Azure, h2: AWS, h3: TK1, h4: GPU; workloads w1: Coffee, w2: DiatomSizeReduction, w3: Adiac, w4: ShapesAll; frameworks v1: TensorFlow, v2: Theano, v3: CNTK) across correlation, divergence, and option/interaction metrics, and the DNN system development stack (network design, hyper-parameters, model compiler, hybrid deployment, OS/hardware) with its deployment topology, which is the scope of this project]


  25. Each system has different performance objectives and configuration options
    25
    SPEAR (SAT Solver)
    Analysis time
    14 options
    16,384 configurations
    SAT problems
    3 hardware
    2 versions
    X264 (video encoder)
    Encoding time
    16 options
    4,000 configurations
    Video quality/size
    2 hardware
    3 versions
    SQLite (DB engine)
    Query time
    14 options
    1,000 configurations
    DB Queries
    2 hardware
    2 versions
    SaC (Compiler)
    Execution time
    50 options
    71,267 configurations
    10 Demo programs


  26. More information regarding setup and the gained insights can be found here
    26
    Transfer Learning for Performance Modeling of
    Configurable Systems: An Exploratory Analysis
    Pooyan Jamshidi
    Carnegie Mellon University, USA
    Norbert Siegmund
    Bauhaus-University Weimar, Germany
    Miguel Velez, Christian Kästner,
    Akshay Patel, Yuvraj Agarwal
    Carnegie Mellon University, USA
    Abstract—Modern software systems provide many configura-
    tion options which significantly influence their non-functional
    properties. To understand and predict the effect of configuration
    options, several sampling and learning strategies have been
    proposed, albeit often with significant cost to cover the highly
    dimensional configuration space. Recently, transfer learning has
    been applied to reduce the effort of constructing performance
    models by transferring knowledge about performance behavior
    across environments. While this line of research is promising to
    learn more accurate models at a lower cost, it is unclear why
    and when transfer learning works for performance modeling. To
    shed light on when it is beneficial to apply transfer learning, we
    conducted an empirical study on four popular software systems,
    varying software configurations and environmental conditions,
    such as hardware, workload, and software versions, to identify
    the key knowledge pieces that can be exploited for transfer
    learning. Our results show that in small environmental changes
    (e.g., homogeneous workload change), by applying a linear
    transformation to the performance model, we can understand
    the performance behavior of the target environment, while for
    severe environmental changes (e.g., drastic workload change) we
    can transfer only knowledge that makes sampling more efficient,
    e.g., by reducing the dimensionality of the configuration space.
    Index Terms—Performance analysis, transfer learning.
    I. INTRODUCTION
    Highly configurable software systems, such as mobile apps,
    compilers, and big data engines, are increasingly exposed to
    end users and developers on a daily basis for varying use cases.
    Users are interested not only in the fastest configuration but
    also in whether the fastest configuration for their applications
    also remains the fastest when the environmental situation has
    been changed. For instance, a mobile developer might be
    interested to know if the software that she has configured
    to consume minimal energy on a testing platform will also
    remain energy efficient on the users’ mobile platform; or, in
    general, whether the configuration will remain optimal when
    the software is used in a different environment (e.g., with a
    different workload, on different hardware).
    Performance models have been extensively used to learn
    and describe the performance behavior of configurable sys-
    Fig. 1: Transfer learning is a form of machine learning that takes
    advantage of transferable knowledge from source to learn an accurate,
    reliable, and less costly model for the target environment.
    their byproducts across environments is demanded by many
    application scenarios, here we mention two common scenarios:
    • Scenario 1: Hardware change: The developers of a soft-
    ware system performed a performance benchmarking of the
    system in its staging environment and built a performance
    model. The model may not be able to provide accurate
    predictions for the performance of the system in the actual
    production environment though (e.g., due to the instability
    of measurements in its staging environment [6], [30], [38]).
    • Scenario 2: Workload change: The developers of a database
    system built a performance model using a read-heavy
    workload, however, the model may not be able to provide
    accurate predictions once the workload changes to a write-
    heavy one. The reason is that if the workload changes,
    different functions of the software might get activated (more
    often) and so the non-functional behavior changes, too.
    In such scenarios, not every user wants to repeat the costly
    process of building a new performance model to find a


  27. Outline
    27
    Case
    Study
    Causal AI
    For Systems
    CADET
    Current
    Results
    Future
    Directions


  28. SocialSensor
    •Identifying trending topics

    •Identifying user defined topics

    •Social media search
    28


  29. SocialSensor
    29
    Content Analysis
    Orchestrator
    Crawling
    Search and Integration
    Tweets: [5k-20k/min]
    Every 10 min:
    [100k tweets]
    Tweets: [10M]
    Fetch
    Store
    Push
    Store
    Crawled
    items
    Fetch
    Internet


  30. Challenges
    30
    Content Analysis
    Orchestrator
    Crawling
    Search and Integration
    Tweets: [5k-20k/min]
    Every 10 min:
    [100k tweets]
    Tweets: [10M]
    Fetch
    Store
    Push
    Store
    Crawled
    items
    Fetch
    Internet
    100X
    10X
    Real time


  31. 31
    How can we gain better performance without
    using more resources?


  32. 32
    Let’s try out different system configurations!


  33. Opportunity: Data processing engines in the
    pipeline were all configurable
    33
    > 100    > 100    > 100
    2^300


  34. 34
    More combinations than the estimated number of
    atoms in the universe


  35. The default configuration is typically bad and the
    optimal configuration is noticeably better than the median
    35
    [Figure: average write latency (µs) vs. throughput (ops/sec) across sampled configurations; the default configuration sits far from the optimal configuration (better toward lower latency and higher throughput)]
    • Default is bad
    • 2X-10X faster than worst
    • Noticeably faster than median


  36. Performance behavior varies in different environments
    36


  37. 100X more users

    Cloud resources reduced by 20%
    Outperformed the expert's recommendation



  38. Outline
    39
    Case
    Study
    CADET
    Current
    Results
    Future
    Directions
    Causal AI for
    Systems


  39. Causal AI in Systems and Software
    38
    Computer Architecture
    Database
    Operating Systems
    Programming Languages
    BigData Software Engineering
    https://github.com/y-ding/causal-system-papers


  40. • Build a causal model that captures the
    interactions among options in the variability
    space, using observational
    performance data.

    • Iteratively evaluate and update the causal
    model.
    • Perform downstream tasks such as
    performance debugging or performance
    optimization using causal inference,
    counterfactual reasoning, causal
    interactions, causal invariances, and causal
    representations.
    Our Causal AI for Systems
    methodology


  41. Our Causal AI for Systems methodology
    41


  42. Step 1: Determining the variability space
    The larger the variability space, the more difficult the downstream tasks become
    42
    Multi-Modal Data
    (Configurable)
    Image Processing
    Voice Recognition
    Context Extraction
    ML Models
    (Configurable)
    Deployment Environment
    (Configurable)
    System Components
    (Configurable)
    Multi-Cloud
    Variability Space =
    Configuration Space +

    System Architecture +

    Deployment Environment


  43. Determining the variability space
    43
    Configuration space: ℂ = O1 × O2 × ⋯ × O19 × O20
    (example options: dead code removal, constant folding, loop unrolling, function inlining)
    A configuration is one assignment to all options, e.g., c1 = (0, 0, ⋯, 0, 1), c1 ∈ ℂ
    The compiler (e.g., SaC, LLVM) is configured, the program is compiled, and the instrumented binary is deployed on hardware:
    Program → Compiled Code → Instrumented Binary → Hardware (configure, compile, deploy)
    Each performance objective is a non-functional, measurable/quantifiable aspect, e.g.:
    fc(c1) = 11.1 ms (compile time), fe(c1) = 110.3 ms (execution time), fen(c1) = 100 mWh (energy)
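
    To make the notation above concrete, here is a minimal sketch in plain Python; the option names and the measure() stub are illustrative assumptions, not the real SaC/LLVM option set:

    from itertools import product

    # Configuration space as a Cartesian product of (here, binary) options.
    options = {
        "dead_code_removal": [0, 1],
        "constant_folding":  [0, 1],
        "loop_unrolling":    [0, 1],
        "function_inlining": [0, 1],
    }
    configuration_space = [
        dict(zip(options, values)) for values in product(*options.values())
    ]

    def measure(config):
        """Stand-in for configuring, compiling, deploying, and measuring one
        configuration; a real run would execute the instrumented binary."""
        return {"compile_time_ms": 11.1,     # f_c(c)
                "execution_time_ms": 110.3,  # f_e(c)
                "energy_mwh": 100.0}         # f_en(c)

    c1 = configuration_space[0]
    print(c1, measure(c1))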


  44. Step 2: Collecting observational data
    By instrumenting the system across the stack (software, middleware, hardware) and measuring performance objectives for different configurations (the
    objective depends on the system, e.g., throughput for data analytics pipelines)
    44
    Config | GPU Mem. | Swap Mem. | Load | Latency
    c1     | 0.2      | 2 Gb      | 10%  | 1 sec
    c2     | 0.5      | 1 Gb      | 20%  | 2 sec
    ⋮      | ⋮        | ⋮         | ⋮    | ⋮
    cn     | 1.0      | 4 Gb      | 40%  | 0.1 sec
    [Figure: the variability space (configuration space + system architecture + deployment environment), spanning configurable multi-modal data, ML models, system components, and the multi-cloud deployment environment, is instrumented to produce measurement rows like the above]
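
    As an illustration, such observational data can be organized as one row per measured configuration; a minimal pandas sketch (column names mirror the table above, values are made up):

    import pandas as pd

    # One row per sampled configuration, measured across the stack.
    observational_data = pd.DataFrame([
        {"gpu_mem": 0.2, "swap_mem_gb": 2, "load_pct": 10, "latency_s": 1.0},
        {"gpu_mem": 0.5, "swap_mem_gb": 1, "load_pct": 20, "latency_s": 2.0},
        {"gpu_mem": 1.0, "swap_mem_gb": 4, "load_pct": 40, "latency_s": 0.1},
    ])
    print(observational_data.describe())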


  45. Our setup for performance measurements
    45


  46. Hardware platforms in our experiments
    The reason for using different types of hardware platforms is that they exhibit different behaviors due to differences in
    resources, microarchitecture, etc.
    46
    AWS DeepLens:
    Cloud-connected device
    System on Chip (SoC)
    Microcontrollers (MCUs)


  47. 47
    System-on-Module (SoM)
    Hardware platforms in our experiments
    The reason for using different types of hardware platforms is that they exhibit different behaviors due to differences in
    resources, microarchitecture, etc.


  48. 48
    Edge TPU devices
    Hardware platforms in our experiments
    The reason for using different types of hardware platforms is that they exhibit different behaviors due to differences in
    resources, microarchitecture, etc.


  49. 49
    FPGA
    Hardware platforms in our experiments
    The reason for using different types of hardware platforms is that they exhibit different behaviors due to differences in
    resources, microarchitecture, etc.


  50. Measuring performance for systems involves many challenges
    Each hardware platform requires different instrumentation, and obtaining clean measurements with the least amount of noise is the
    most challenging part of our experiments.
    50


  51. Step 3: Learning a Functional Causal Model
    We developed Perf-SCM, an instantiation of structural causal models (SCMs) for performance, which captures causal interactions via functional nodes
    51
    [Figure: observational measurements (e.g., GPU Mem., Swap Mem., Load, Latency per configuration) feed a causal structure learning algorithm (e.g., CGNN), which produces the Perf-SCM. Its nodes include configuration options (Batch Size, Batch Timeout, Memory Growth, QoS, Interval, Cache Pressure, Swappiness, Cache Size, CPU Freq, GPU Freq, EMC Freq, CPU Cores), system events (CPU/GPU Utilization, Context Switches, Migrations, Num Cycles, Cache Misses, Cache References, Branch Misses, Num Instructions, Scheduler Wait/Sleep Time, Scheduler Task Migrations, Major/Minor Faults, Softirq Entry), and performance objectives (Throughput, Energy), connected through functional nodes f1-f16]
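
    The structure-learning step can be instantiated with different algorithms (the slide names CGNN as one example). Purely as an illustrative sketch, and assuming the causal-learn package is installed, one might run a constraint-based learner such as FCI over toy observational data like this:

    import numpy as np
    import pandas as pd
    # Assumption: pip install causal-learn; FCI is used here as a stand-in for CGNN.
    from causallearn.search.ConstraintBased.FCI import fci

    rng = np.random.default_rng(0)
    n = 200
    # Toy data with a known chain: swap_mem -> gpu_mem -> latency.
    swap_mem = rng.choice([1.0, 2.0, 4.0], size=n)
    gpu_mem = 0.2 * swap_mem + rng.normal(0, 0.05, n)
    latency = 2.0 - 0.4 * gpu_mem + rng.normal(0, 0.05, n)
    df = pd.DataFrame({"swap_mem": swap_mem, "gpu_mem": gpu_mem,
                       "latency": latency})

    # Constraint-based structure learning (Fisher-z independence tests by default).
    g, edges = fci(df.to_numpy())
    for edge in edges:
        print(edge)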


  52. Step 4: Formulating queries for the downstream tasks
    E.g., conditional probabilities for performance prediction tasks.
    52
    [Figure: the learned Perf-SCM from the previous step]
    For performance understanding tasks, one may formulate the following query:
    P(Throughput | M, Configuration = C1)
    For performance debugging, one may formulate the following query:
    P(Throughput | M, Configuration = C1)
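
    A purely observational version of such a query can be approximated directly from the measured data by conditioning on the configuration; a minimal sketch follows (the column names and threshold are hypothetical, and interventional or counterfactual variants additionally require the causal model M rather than simple row filtering):

    import pandas as pd

    def p_throughput_given_config(df: pd.DataFrame, config: dict,
                                  min_throughput: float) -> float:
        """Naive estimate of P(Throughput > t | Configuration = C) from rows
        whose options match C."""
        rows = df
        for option, value in config.items():
            rows = rows[rows[option] == value]
        if rows.empty:
            return float("nan")
        return (rows["throughput"] > min_throughput).mean()

    df = pd.DataFrame({"batch_size": [16, 16, 32, 32],
                       "throughput": [90.0, 95.0, 60.0, 65.0]})
    print(p_throughput_given_config(df, {"batch_size": 16}, min_throughput=80.0))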


  53. Questions of this nature require precise mathematical language, lest they
    be misleading.
    Here we are simultaneously conditioning on two values of GPU memory growth (i.e., a hypothetical value of 0.66 and the observed value of 0.33). Traditional machine learning
    approaches cannot handle such expressions. Instead, we must resort to causal models to compute them.
    53


  54. There are two fundamental benefits that we get by our “Causal AI for Systems”
    methodology
    1. We learn one central (causal) model from the data and use it reliably across different performance tasks:

    • Performance understanding

    • Performance optimization

    • Performance debugging and repair

    • Performance prediction for different environments where we cannot intervene (e.g., canary-> production, we can
    intervene in canary environment, while it is not possible to disturb production environment, we may only be able
    to use measurement data)

    2. The causal model is transferable across environments.

    • We observed Sparse Mechanism Shift in systems too!

    • Alternative non-causal models (e.g., regression-based models for performance tasks) are not transferable as
    they rely on i.i.d. setting and only capture association/correlations among variables, resulting in many non-
    causal terms that may drastically change when the system is deployed in different environments.
    54


  55. Difference between statistical (left) and causal models (right) on a given set of
    three variables
    While a statistical model specifies a single probability distribution, a causal model represents a set of distributions, one for each
    possible intervention.
    55


  56. Independent Causal Mechanisms (ICM)
    Principle


  57. Sparse Mechanism Shift (SMS)
    Hypothesis
    Example of SMS hypothesis,
    where an intervention (which may
    or may not be intentional/observed)
    changes the position of one finger,
    and as a consequence, the object
    falls. The change in pixel space is
    entangled (or distributed), in
    contrast to the change in the causal
    model.


  58. Step 5: Estimating the queries based on the learned causal model
    The estimation process traverses the causal model: (i) extracting the causal paths by backtracking from the performance objectives, (ii)
    ranking the causal paths by their average causal effect, and (iii) extracting the required information from the causal paths.
    58
    [Figure: the Perf-SCM; causal paths are extracted by backtracking from the performance objectives (Throughput, Energy) through the functional nodes back to configuration options and system events]
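
    A minimal sketch of this path extraction and ranking, using networkx on a toy graph; ace() stands in for the average-causal-effect computation shown later in the talk:

    import networkx as nx

    def causal_paths(graph: nx.DiGraph, objectives: set) -> list:
        """All simple paths that terminate at a performance objective."""
        paths = []
        for node in graph.nodes:
            if node in objectives:
                continue
            for obj in objectives:
                paths.extend(nx.all_simple_paths(graph, source=node, target=obj))
        return paths

    def rank_paths(paths: list, ace) -> list:
        """Order paths by the average causal effect over adjacent node pairs."""
        def score(path):
            pairs = list(zip(path, path[1:]))
            return sum(ace(a, b) for a, b in pairs) / len(pairs)
        return sorted(paths, key=score, reverse=True)

    g = nx.DiGraph([("load", "swap_mem"), ("swap_mem", "gpu_mem"),
                    ("gpu_mem", "latency")])
    paths = causal_paths(g, objectives={"latency"})
    print(rank_paths(paths, ace=lambda a, b: 1.0))  # constant ACE, demo only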


  59. Step 6: Evaluating and updating the causal model
    We evaluate ground-truth queries to test whether the causal model is accurate enough to estimate queries with sufficient accuracy.
    In a typical setting, we have a limited sampling budget, say 100 measurements.
    59
    [Figure: the learned causal model is evaluated against ground-truth queries and refined with new measurements (causal model update), yielding an updated Perf-SCM]


  60. Step 7: Calculating quantities for the downstream tasks
    Depending on the downstream task (the estimated queries), we may need to perform some final transformations or calculations;
    here we assume the downstream task is performance optimization.
    60
    [Figure: 1- the query P(Throughput | M, Configuration = C1) is estimated from the Perf-SCM; 2- this estimate is used inside an optimization loop for performance optimization tasks (sequential design over the configuration space: fit an empirical model, apply a selection criterion, run the next experiment)]

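    As a rough sketch of such an optimization loop (not the actual implementation; estimate_throughput() and measure() are placeholders for the causal-model estimate and the benchmarking harness, and configurations are assumed to be hashable, e.g., tuples of option values):

    def optimize(candidate_configs, estimate_throughput, measure, budget=25):
        """Sequential design: repeatedly measure the configuration the current
        estimate believes is best, within a fixed sampling budget."""
        observed = {}
        for _ in range(budget):
            remaining = [c for c in candidate_configs if c not in observed]
            if not remaining:
                break
            best_guess = max(remaining, key=estimate_throughput)
            observed[best_guess] = measure(best_guess)
            # In the full methodology, the new measurement would also trigger
            # a causal-model update before the next iteration.
        return max(observed, key=observed.get)
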

  61. Outline
    61
    Case
    Study
    Current
    Results
    Future
    Directions
    Causal AI for
    Systems
    CADET


  62. A Typical Software Lifecycle
    Write > Test > Deploy > Monitor
    Artifact
    Vulnerabilities   Deep bugs   Misconfigurations   Poor product metrics
    62
    TODAY'S TALK: CADET, diagnosing and fixing misconfigurations with causal inference


  63. Today’s Talk
    Deploy
    Artifact
    Challenge
    • Each deployment environment
    must be configured correctly
    • This is challenging and prone to
    misconfigurations
    Software may be deployed
    in several environments
    Server
    Personal Devices
    Embedded Hardware
    Autonomous Vehicles
    Deployment Environments
    63


  64. Today’s Talk
    Problem
    • Each deployment environment
    must be configured correctly
    • This is challenging and prone to
    misconfigurations
    Why?
    • The configuration options lie
    across the software stack
    • There are several non-trivial
    interactions with one another
    • The configuration space is
    combinatorially large, with 100s
    of configuration options
    64
    CPU Memory
    Controller
    GPU
    Lib API
    Clients
    Devices
    Network
    Task Scheduler Device Drivers
    File System
    Compilers
    Memory Manager
    Process Manager
    Frontend
    Application
    Layer
    OS/Kernel
    Layer
    Hardware
    Layer
    Deployment
    SoC Generic hardware Production Servers


  65. Misconfiguration and its Effects
    ● Misconfigurations can elicit unexpected interactions between software and hardware
    ● These can result in non-functional faults
    ○ Affecting non-functional system properties like
    latency, throughput, energy consumption, etc.
    65
    The system doesn’t crash or
    exhibit an obvious misbehavior
    Systems are still operational but with a
    degraded performance, e.g., high latency, low
    throughput, high energy consumption, high
    heat dissipation, or a combination of several


  66. 66
    Motivating Example: CUDA performance issue on TX2
    "When we are trying to transplant our CUDA source code from TX1 to TX2, it
    behaved strange.
    We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation,
    we think TX2 will 30% - 40% faster than TX1 at least.
    Unfortunately, most of our code base spent twice the time as TX1, in other words,
    TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs
    much slower than TX1 in many cases."
    The user is transferring the code
    from one hardware to another.
    The target hardware is faster
    than the source hardware.
    The user expects the code to run
    at least 30-40% faster.
    The code ran 2x slower on the
    more powerful hardware.


  67. Motivating Example
    67
    June 3rd
    We have already tried this. We still have high latency.
    Any other suggestions?
    June 4th
    Please do the following and let us know if it works
    1. Install JetPack 3.0
    2. Set nvpmodel=MAX-N
    3. Run jetson_clock.sh
    June 5th
    June 4th
    TX2 is pascal architecture. Please update your CMakeLists:
    + set(CUDA_STATIC_RUNTIME OFF)
    ...
    + -gencode=arch=compute_62,code=sm_62
    The user had several misconfigurations
    In Software:
    ✖ Wrong compilation flags
    ✖ Wrong SDK version
    In Hardware:
    ✖ Wrong power mode
    ✖ Wrong clock/fan settings
    The discussions took 2 days
    Any suggestions on how to improve my performance?
    Thanks!
    How to resolve such issues faster?
    ?


  68. 68
    Diagnose and fix the root-cause of misconfigurations that cause non-functional faults
    Objective
    Causal Debugging (with CADET)
    • Use causal models to model various cross-stack configuration interactions;
    and
    • Counterfactual reasoning to recommend fixes for these misconfigurations
    Approach


  69. 69
    NeurIPS 2020 (ML For Systems), Dec 12th, 2020
    https://arxiv.org/pdf/2010.06061.pdf
    https://github.com/rahlk/CADET


  70. Why Causal Inference? (Simpson’s Paradox)
    70
    Increasing GPU memory
    increases Latency
    More GPU memory
    usage should reduce
    latency not increase it.
    Counterintuitive!
    Any ML-/statistical models built
    on this data will be incorrect
    !


  71. Why Causal Inference? (Simpson’s Paradox)
    71
    Segregate data on swap memory
    Available swap
    memory is
    reducing
    GPU memory borrows memory from the swap for some intensive workloads. Other
    host processes may reduce the available swap. Little will be left for the GPU to use.
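
    A tiny synthetic illustration of this reversal (the numbers are made up): pooled over all rows, higher GPU memory growth correlates with higher latency, but within each swap-memory stratum the correlation flips sign.

    import pandas as pd

    df = pd.DataFrame({
        "swap_gb":        [4, 4, 4, 4, 1, 1, 1, 1],
        "gpu_mem_growth": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        "latency_s":      [1.2, 1.1, 1.0, 0.9, 2.0, 1.9, 1.8, 1.7],
    })

    # Pooled correlation is positive (more GPU memory growth, more latency) ...
    print("pooled:", df["gpu_mem_growth"].corr(df["latency_s"]))
    # ... but within each swap-memory stratum the correlation is negative.
    for swap, group in df.groupby("swap_gb"):
        print(f"swap={swap} Gb:", group["gpu_mem_growth"].corr(group["latency_s"]))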


  72. 72
    Why Causal Inference?
    Real world problems can have
    100s if not 1000s of interacting
    configuration options
    !
    Manually understanding and
    evaluating each combination
    is impractical, if not
    impossible.


  73. Load
    GPU Mem.
    Swap Mem.
    Latency
    Express the relationships between
    interacting variables as a causal graph
    73
    Causal Models
    Configuration option Direction(s) of the causality
    • Latency is affected by GPU Mem. which
    in turn is influenced by swap memory
    • External factors like resource pressure
    also affect swap memory
    Non-functional property
    System event


  74. 74
    Causal Models
    How to construct
    this causal graph?
    ?
    If there is a fault in latency,
    how to diagnose and fix it?
    ?
    Load
    GPU Mem.
    Swap Mem.
    Latency


  75. 75
    CADET: Causal Debugging Tool
    • What is the root-cause
    of my fault?
    • How do I fix my
    misconfigurations to
    improve performance?
    Misconfiguration
    Fault
    fixed?
    Observational Data Build Causal Graph Extract Causal Paths
    Best Query
    Yes
    No
    update
    observational
    data
    Counterfactual Queries
    Rank Paths
    What if questions.
    E.g., What if the
    configuration option X was
    set to a value ‘x’?
    About 25 sample
    configurations
    (training data)


  76. Best Query
    Counterfactual Queries
    Rank Paths
    What if questions.
    E.g., What if the
    configuration option X was
    set to a value ‘x’?
    Extract Causal Paths
    76
    STEP 1: Generating a Causal Graph
    • What is the root-cause
    of my fault?
    • How do I fix my
    misconfigurations to
    improve performance?
    Misconfiguration
    Fault
    fixed?
    Observational Data
    Yes
    No
    update
    observational
    data
    About 25 sample
    configurations
    (training data)
    Build Causal Graph


  77. Directed Acyclic Graph
    Load
    GPU Mem. Latency
    Swap Mem.
    77
    Generating a Causal Graph: With FCI
    Config | GPU Mem. | Swap Mem. | Load | Latency
    c1     | 0.2      | 2 Gb      | 10%  | 1 sec
    c2     | 0.5      | 1 Gb      | 20%  | 2 sec
    ⋮      | ⋮        | ⋮         | ⋮    | ⋮
    cn     | 1.0      | 4 Gb      | 40%  | 0.1 sec
    Load
    Swap Mem. Latency
    GPU Mem.

    Fully connected
    skeleton
    Prune away edges
    between independent
    variables
    use statistical
    independence
    tests
    orient remaining
    edges
    Use standard orientation
    rules for forks, colliders,
    v-structures, and cycles
    Load
    GPU Mem. Latency
    Swap Mem.
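
    A minimal sketch of the skeleton phase only (full FCI additionally conditions on subsets of neighbours and then orients the remaining edges); a simple Pearson-correlation test stands in for the Fisher-z conditional-independence test, and the variable names are illustrative:

    import itertools
    import pandas as pd
    from scipy import stats

    def skeleton(df: pd.DataFrame, alpha: float = 0.05) -> set:
        """Start from a fully connected skeleton and prune edges between
        (marginally) independent variables."""
        edges = set(itertools.combinations(df.columns, 2))
        for x, y in list(edges):
            _, p_value = stats.pearsonr(df[x], df[y])
            if p_value > alpha:   # cannot reject independence -> prune the edge
                edges.discard((x, y))
        return edges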


  78. Best Query
    Counterfactual Queries
    Rank Paths
    What if questions.
    E.g., What if the
    configuration option X was
    set to a value ‘x’?
    Extract Causal Paths
    78
    STEP 2: Extracting Paths from the Graph
    • What is the root-cause
    of my fault?
    • How do I fix my
    misconfigurations to
    improve performance?
    Misconfiguration
    Fault
    fixed?
    Observational Data Build Causal Graph
    Yes
    No
    update
    observational
    data
    About 25 sample
    configurations
    (training data)


  79. Extracting Paths from the Causal Graph
    Problem
    ✕ In real world cases, this causal graph can be
    very complex
    ✕ It may be intractable to reason over the entire
    graph directly
    79
    Solution
    ✓ Extract paths from the causal graph
    ✓ Rank them based on their Average Causal
    Effect on latency, etc.
    ✓ Reason over the top K paths


  80. Extracting Paths from the Causal Graph
    80
    GPU Mem. Latency
    Swap Mem.
    Extract paths
    Always begins with a
    configuration option
    Or a system
    event
    Always terminates at a
    performance objective
    Load
    GPU Mem. Latency
    Swap Mem.
    Swap Mem. Latency
    Load GPU Mem.


  81. Ranking Paths from the Causal Graph
    81
    ● There may be too many causal paths
    ● We need to select the most useful ones
    ● Compute the Average Causal Effect (ACE) of
    each pair of neighbors in a path
    Example path: Swap Mem. → GPU Mem. → Latency
    ACE(GPU Mem., Swap) = (1/N) Σ_{a,b ∈ Swap} ( E[GPU Mem. | do(Swap = b)] − E[GPU Mem. | do(Swap = a)] )
    E[GPU Mem. | do(Swap = b)] is the expected value of GPU Mem. when we artificially intervene by setting Swap to the value b (and likewise for a).
    If this difference is large, then small changes to Swap Mem. will cause large changes to GPU Mem.
    The sum averages over all permitted values of Swap memory.


  82. Ranking Paths from the Causal Graph
    82
    ● Average the ACE of all pairs of adjacent nodes in the path
    ACE_path(X → Z → Y) = (1/2) ( ACE(Z, X) + ACE(Y, Z) )
    i.e., sum over all pairs of adjacent nodes in the causal path.
    Example path: Swap Mem. → GPU Mem. → Latency
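
    A small sketch of these two computations over observational data (as an approximation, the interventional expectation E[Y | do(X = x)] is replaced by a conditional mean, which is only valid once confounders are adjusted for; column names are illustrative):

    import itertools
    import pandas as pd

    def ace(df: pd.DataFrame, cause: str, effect: str) -> float:
        """Average causal effect of `cause` on `effect`, approximated from data."""
        values = sorted(df[cause].unique())
        mean = {v: df.loc[df[cause] == v, effect].mean() for v in values}
        diffs = [abs(mean[b] - mean[a])
                 for a, b in itertools.combinations(values, 2)]
        return sum(diffs) / len(diffs) if diffs else 0.0

    def path_ace(df: pd.DataFrame, path: list) -> float:
        """Average the pairwise ACE over adjacent nodes of a causal path."""
        pairs = list(zip(path, path[1:]))
        return sum(ace(df, a, b) for a, b in pairs) / len(pairs)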


  83. Best Query
    Counterfactual Queries
    Rank Paths
    What if questions.
    E.g., What if the
    configuration option X was
    set to a value ‘x’?
    Extract Causal Paths
    83
    STEP 3: Diagnosing and Fixing the Faults
    • What is the root-cause
    of my fault?
    • How do I fix my
    misconfigurations to
    improve performance?
    Misconfiguration
    Fault
    fixed?
    Observational Data Build Causal Graph
    Yes
    No
    update
    observational
    data
    About 25 sample
    configurations
    (training data)


  84. Diagnosing and Fixing the Faults
    84
    ● Counterfactual inference asks “what if” questions about changes to the
    misconfigurations
    We are interested in the scenario where:
    • We hypothetically have low latency;
    Conditioned on the following events:
    • We hypothetically set the new Swap memory to 4 Gb
    • Swap Memory was initially set to 2 Gb
    • We observed high latency when Swap was set to 2 Gb
    • Everything else remains the same
    Example
    Given that my current swap memory is 2 Gb, and I have high latency. What is
    the probability of having low latency if swap memory was increased to 4 Gb?


  85. Diagnosing and Fixing the Faults
    85
    Original path: Load → Swap → GPU Mem. → Latency
    Path after proposed change: (Swap = 4 Gb) → GPU Mem. → Latency = Low?
    Remove the incoming edges of the intervened node (assume no external influence) and modify the model to reflect the hypothetical scenario.
    Use both models to compute the answer to the counterfactual question.


  86. Diagnosing and Fixing the Faults
    86
    Original path: Load → Swap → GPU Mem. → Latency
    Path after proposed change: (Swap = 4 Gb) → GPU Mem. → Latency
    P( Latency_{Swap = 4 Gb} = low | Swap = 2 Gb, Latency_{Swap = 2 Gb} = high, U )
    We expect a low latency; the Swap is now (hypothetically) 4 Gb; the Swap was initially 2 Gb; the latency was high; everything else (U) stays the same.


  87. Diagnosing and Fixing the Faults
    87
    Potential = P( Latency_{after change} = low | proposed change, observed Latency = high, U )
    Probability that the outcome is good after a change, conditioned on the past.
    Control = P( Latency = high | no change, U )
    Probability that the outcome was bad before the change.
    Individual Treatment Effect = Potential − Control
    If this difference is large, then our change is useful.
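
    A crude sketch of ranking candidate fixes by such an ITE-style score (the proper computation uses the structural causal model via abduction, action, and prediction; here the probabilities are approximated from observational rows, and the column name and threshold are hypothetical):

    import pandas as pd

    def ite_score(df: pd.DataFrame, option: str, new_value,
                  latency_threshold: float) -> float:
        """Potential (outcome good after setting option=new_value) minus
        Control (outcome bad in the observed data)."""
        after = df[df[option] == new_value]
        potential = (after["latency"] < latency_threshold).mean() if len(after) else 0.0
        control = (df["latency"] >= latency_threshold).mean()
        return potential - control

    def best_fix(df: pd.DataFrame, option: str, candidate_values,
                 latency_threshold: float):
        """Enumerate permitted values and pick the change with the largest ITE."""
        return max(candidate_values,
                   key=lambda v: ite_score(df, option, v, latency_threshold))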


  88. Diagnosing and Fixing the Faults
    88
    For the top K paths (e.g., Swap Mem. → GPU Mem. → Latency):
    • Enumerate all possible changes: set every configuration option in the path to all permitted values.
    • Pick the change with the largest ITE.
    • The ITE is inferred from observed data, so this is very cheap!


  89. Diagnosing and Fixing the Faults
    89
    Change with
    the largest ITE
    Fault
    fixed?
    Yes
    No • Add to observational data
    • Update causal model
    • Repeat…
    Measure
    Performance


  90. 90
    CADET: End-to-End Pipeline
    • What is the root-cause
    of my fault?
    • How do I fix my
    misconfigurations to
    improve performance?
    Misconfiguration
    Fault
    fixed?
    Observational Data Build Causal Graph Extract Causal Paths
    Best Query
    Yes
    No
    update
    observational
    data
    Counterfactual Queries
    Rank Paths
    What if questions.
    E.g., What if the
    configuration option X was
    set to a value ‘x’?
    About 25 sample
    configurations
    (training data)


  91. Results: Motivating Example
    91
    "When we are trying to transplant our CUDA source code from TX1 to TX2, it
    behaved strange.
    We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation,
    we think TX2 will 30% - 40% faster than TX1 at least.
    Unfortunately, most of our code base spent twice the time as TX1, in other words,
    TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs
    much slower than TX1 in many cases."
    The user is transferring the code
    from one hardware to another.
    The target hardware is faster
    than the source hardware.
    The user expects the code to run
    at least 30-40% faster.
    The code ran 2x slower on the
    more powerful hardware


  92. Results: Motivating Example
    92
    Nvidia TX1: CPU 4 cores, 1.3 GHz; GPU 128 Cores, 0.9 GHz; Memory 4 Gb, 25 Gb/s
    Nvidia TX2 (more powerful): CPU 6 cores, 2 GHz; GPU 256 Cores, 1.3 GHz; Memory 8 Gb, 58 Gb/s
    Embedded real-time stereo estimation source code: 17 FPS on TX1, but 4 FPS on TX2 (4x slower!)


  93. Results: Motivating Example
    93
    Configuration CADET Decision Tree Forum
    CPU Cores ✓ ✓ ✓
    CPU Freq. ✓ ✓ ✓
    EMC Freq. ✓ ✓ ✓
    GPU Freq. ✓ ✓ ✓
    Sched. Policy ✓
    Sched. Runtime ✓
    Sched. Child Proc ✓
    Dirty Bg. Ratio ✓
    Drop Caches ✓
    CUDA_STATIC_R
    T
    ✓ ✓ ✓
    Swap Memory ✓
                                 CADET     Decision Tree   Forum
    Throughput (on TX2)          26 FPS    20 FPS          23 FPS
    Throughput gain (over TX1)   53 %      21 %            39 %
    Time to resolve              24 min.   3.5 hrs.        2 days
    • Finds the root causes accurately
    • No unnecessary changes
    • Better improvements than the forum's recommendation
    • Much faster
    The user expected 30-40% gain


  94. Evaluation: Experimental Setup
    Nvidia TX1
    CPU 4 cores, 1.3 GHz
    GPU 128 Cores, 0.9 GHz
    Memory 4 Gb, 25 GB/s
    Nvidia TX2
    CPU 6 cores, 2 GHz
    GPU 256 Cores, 1.3 GHz
    Memory 8 Gb, 58 GB/s
    Nvidia Xavier
    CPU 8 cores, 2.26 GHz
    GPU 512 cores, 1.3 GHz
    Memory 32 Gb, 137 GB/s
    Hardware Systems
    Software Systems
    Xception
    Image recognition
    (50,000 test images)
    DeepSpeech
    Voice recognition
    (5 sec. audio clip)
    BERT
    Sentiment Analysis
    (10000 IMDb reviews)
    x264
    Video Encoder
    (11 Mb, 1080p video)
    Configuration Space
    • 30 configuration options: 10 software, 10 OS/Kernel, 10 hardware
    • 17 system events
    94

    View Slide

  95. Outline
    95
    Case
    Study
    CADET
    Future
    Directions
    Causal AI for
    Systems
    Current
    Results


  96. 96
    RQ1: How does CADET perform compared to Model based
    Diagnostics
    RQ2: How does CADET perform compared to Search-Based
    Optimization
    Results: Research Questions


  97. 97
    Results: Research Question 1 (single objective)
    RQ1: How does CADET perform compared to Model based Diagnostics
    • Finds the root causes accurately
    • Better gain
    • Much faster
    Takeaways
    More accurate than
    ML-based methods
    Better Gain
    Up to 20x
    faster


  98. 98
    Results: Research Question 1 (multi-objective)
    RQ1: How does CADET perform compared to Model based Diagnostics
    • No deterioration of other performance objectives
    Takeaways
    Multiple Faults
    in Latency &
    Energy usage


  99. 99
    RQ1: How does CADET perform compared to Model based
    Diagnostics
    RQ2: How does CADET perform compared to Search-Based
    Optimization
    Results: Research Questions


  100. Results: Research Question 2
    RQ2: How does CADET perform compared to Search-Based
    Optimization
    • Better, with no deterioration of other performance objectives
    Takeaways
    100


  101. 101
    Results: Research Question 3
    RQ2: How does CADET perform compared to Search-Based
    Optimization
    • Considerably faster than search-based optimization
    Takeaways


  102. Outline
    102
    Case
    Study
    CADET
    Causal AI for
    Systems
    Current
    Results
    Future
    Directions


  103. Opportunities of Causal AI for Serverless
    • Evaluating our Causal AI for Systems methodology with serverless
    systems provides the following opportunities:

    1. Dynamic system reconfigurations

    • Dynamic placement of functions

    • Dynamic reconfigurations of the network of functions

    • Dynamic multi-cloud placement of functions.

    2. Root cause analysis of failures or QoS drop
    103


  104. Opportunities of Causal AI for autonomous robot
    testing
    • Testing cyber-physical systems such as robots is difficult. The key reason
    is that there are additional interactions with the environment and the task
    that the robot is performing.

    • Evaluating our Causal AI for Systems methodology with autonomous
    robots provides the following opportunities:

    1. Identifying difficult to catch bugs in robots

    2. Identifying the root cause of an observed fault and repairing the issue
    automatically during mission time.
    104


  105. Summary: Causal AI for Systems
    1. Learning a Functional Causal Model for different downstream systems tasks

    2. The learned causal model is transferable across different environments
    105


  106. 106
