
Understanding and Explaining the Root Causes of Performance Faults with Causal AI: A Path towards Building Dependable Computer Systems

Pooyan Jamshidi
September 07, 2022


An invited talk at NASA JPL on August 19th, 2022.


Speaker: Pooyan Jamshidi

In this talk, I will present our recent progress in employing Causal AI (causal structure learning and inference, counterfactual reasoning, and transfer learning) to address several significant challenges in computer systems. After motivating the work, I will show how mainstream machine learning, which relies on spurious correlations, may become unreliable in certain situations. Next, I will present empirical observations that explain the underlying root causes of performance faults in several highly configurable systems, including autonomous systems, robotics, on-device machine learning systems, and data analytics pipelines. I will then present our framework, Unicorn, and discuss how it fills the gap by employing a causal reasoning approach. In particular, I will discuss how Unicorn captures intricate interactions between configuration options across the software-hardware stack and how such interactions impact performance variations. Finally, I will talk about our two-year journey in a NASA-funded project called RASPBERRY-SI, in which we developed a causal reasoning approach for synthesizing adaptation plans that reconfigure autonomous systems in response to environmental uncertainties during operation.

For more information regarding the technical work and the people behind the work that I present, please refer to the following websites:
- The Unicorn framework: https://github.com/softsys4ai/unicorn
- The NASA-funded RASPBERRY-SI project: https://nasa-raspberry-si.github.io/raspberry-si/

Bio: Pooyan Jamshidi is an Assistant Professor of Computer Science and Engineering at UofSC. His research involves designing novel artificial intelligence (AI) and machine learning (ML) algorithms and investigating their theoretical guarantees. He is also interested in applying AI/ML algorithms in high-impact applications, including robotics, computer systems, and space exploration. Pooyan has extensive collaborations with Google and NASA and is always open to exploring new collaborations. He received the UofSC Breakthrough Award in 2022. Before his current position, he was a postdoctoral associate at CMU (Pittsburgh, US) and Imperial College (London, UK). He received a Ph.D. (Computer Science) from DCU (Dublin, Ireland) in 2014, and an M.S. (Systems Engineering) and a B.S. (Math & Computer Science) from AUT (Iran) in 2003 and 2006, respectively. For more info about Pooyan's research and his group at UofSC, please refer to: http://pooyanjamshidi.github.io/.


Transcript

  1. Understanding and Explaining the Root
    Causes of Performance Faults with Causal AI
    A Path towards Building Dependable Computer Systems
    Pooyan Jamshidi


  2. SEAMS’23


  3. Melbourne, Australia, 15-16 May 2023

  4. 4
    Topics of Interest


  5. 5
    Keynote Speaker at SEAMS’20 from NASA


  6. Outline
     Motivation | Causal AI for Systems | UNICORN | Results | Causal AI for Autonomy and Robotics | Autonomy Evaluation at JPL

  7. Goal: Enable developers/users to find the right quality tradeoff

  8. Today's most popular systems are built configurable

  9. 9


  10. Empirical observations confirm that systems are becoming increasingly configurable
      [Plots: number of configuration parameters vs. release time for Apache and Hadoop (MapReduce, HDFS)]
      [Tianyin Xu, et al., "Too Many Knobs…", FSE'15]

  11. Empirical observations confirm that systems are becoming increasingly configurable
      [Snapshot of the FSE'15 paper (UC San Diego, Huazhong Univ. of Science & Technology, NetApp, Inc.), with plots of the number of configuration parameters vs. release time for Storage-A, MySQL, Apache, and Hadoop (MapReduce, HDFS)]
      [Tianyin Xu, et al., "Too Many Knobs…", FSE'15]

  12. Configurations determine the performance behavior
      void Parrot_setenv(. . . name,. . . value){
      #ifdef PARROT_HAS_SETENV
          my_setenv(name, value, 1);
      #else
          int name_len=strlen(name);
          int val_len=strlen(value);
          char* envs=glob_env;
          if(envs==NULL){
              return;
          }
          strcpy(envs,name);
          strcpy(envs+name_len,"=");
          strcpy(envs+name_len + 1,value);
          putenv(envs);
      #endif
      }

      #ifdef LINUX
      extern int Parrot_signbit(double x){
      ...
      #endif

      Highlighted on the slide: the #else / #endif branches and the compile-time options PARROT_HAS_SETENV and LINUX, which influence Speed and Energy.

  13. Outline
      Motivation | Causal AI for Systems | Results | Case Study | Causal AI for Autonomy and Robotics | Autonomy Evaluation at JPL

  14. Case Study 1
    SocialSensor


  15. SocialSensor
      [Pipeline diagram with components: Crawling, Orchestrator, Content Analysis, and Search and Integration; data flows: tweets arrive from the Internet at 5k-20k/min, ~100k tweets are pushed every 10 min, and 10M tweets are fetched; crawled items are stored and fetched between components]

  16. Challenges
      [Same SocialSensor pipeline diagram, annotated with the challenges: 100X, 10X, and real-time requirements]

  17. How can we gain better performance without using more resources?

  18. Let's try out different system configurations!

  19. Opportunity: Data processing engines in the pipeline were all configurable
      > 100   > 100   > 100
      2300

  20. 20
    More combinations than estimated
    atoms in the universe


  21. The default configuration is typically bad and the optimal configuration is noticeably better than median
      [Scatter plot: throughput (ops/sec, 0-1500) vs. average write latency (µs, 0-5000), marking the default configuration and the optimal configuration; arrows indicate the "better" directions]
      • Default is bad
      • 2X-10X faster than worst
      • Noticeably faster than median

  22. Performance behavior varies in different environments
    22


  23. Case Study 2
    Robotics


  24. CoBot experiment: DARPA BRASS
      [Plot: localization error [m] (0-8) vs. CPU utilization [%] (10-40), showing the energy constraint, the safety constraint, the Pareto front, and the sweet spot; configuration options: no_of_particles=x, no_of_refinement=y]

  25. CoBot experiment
      [Four heat maps of CPU usage [%]: source (given), target (ground truth after 6 months), prediction with 4 samples, and prediction with transfer learning]

  26. Transfer Learning for Improving Model Predictions in Highly Configurable Software
      Pooyan Jamshidi, Miguel Velez, Christian Kästner (Carnegie Mellon University, USA), Norbert Siegmund (Bauhaus-University Weimar, Germany), Prasad Kawthekar (Stanford University, USA)
      Abstract: Modern software systems are built to be used in dynamic environments using configuration capabilities to adapt to changes and external uncertainties. In a self-adaptation context, we are often interested in reasoning about the performance of the systems under different configurations. Usually, we learn a black-box model based on real measurements to predict the performance of the system given a specific configuration. However, as modern systems become more complex, there are many configuration parameters that may interact and we end up learning an exponentially large configuration space. Naturally, this does not scale when relying on real measurements in the actual changing environment. We propose a different solution: Instead of taking the measurements from the real system, we learn the model using samples from other sources, such as simulators that approximate performance of the real system [...] in order to identify the best performing configuration for a robot.
      Fig. 1: Transfer learning for performance model learning (simulator as source, robot as target).
      Details: [SEAMS '17]

  27. Looking further: When transfer learning goes wrong
      [Box plots: absolute percentage error [%] (10-60) for sources s, s1-s6, compared to non-transfer-learning]

      Source        s      s1     s2     s3     s4     s5     s6
      noise-level   0      5      10     15     20     25     30
      corr. coeff.  0.98   0.95   0.89   0.75   0.54   0.34   0.19
      µ(pe)         15.34  14.14  17.09  18.71  33.06  40.93  46.75

      It worked!  It didn't!
      Insight: Predictions become more accurate when the source is more related to the target.

  28. [Six heat maps (a)-(f) of CPU usage [%] over number of particles (5-25) × number of refinements (5-25); transfer learning worked for three of the environments ("It worked!") and failed for the other three ("It didn't!")]

  29. Key question: Can we develop a theory to explain when transfer learning works?
      [Diagram: Source (Given) with its Data and Model, Target (Learn) with its Data and Model; Transferable Knowledge is extracted from the source and reused to learn the target]
      Q1: How are source and target "related"?
      Q2: What characteristics are preserved?
      Q3: What are the actionable insights?

      Excerpt shown on the slide (Section II, Intuition): Understanding the performance behavior of configurable software systems can enable (i) performance debugging, (ii) performance tuning, (iii) design-time evolution, or (iv) runtime adaptation [11]. We lack empirical understanding of how the performance behavior of a system will vary when the environment of the system changes. Such empirical understanding will provide important insights to develop faster and more accurate learning techniques that allow us to make predictions and extrapolations of performance for highly configurable systems in changing environments [10]. For instance, we can learn the performance behavior of a system on cheap hardware in a controlled lab environment and use that to understand the performance behavior of the system on a production server before shipping to the end user. More specifically, we would like to know what the relationship is between the performance of a system in a specific environment (characterized by software configuration, hardware, workload, and system version) and its performance when we vary its environmental conditions. In this research, we aim for an empirical understanding of performance behavior to improve learning via an informed sampling process; in other words, we aim at learning a performance model in a changed environment based on a well-suited sampling set that has been determined by the knowledge we gained in other environments.

      A. Preliminary concepts
      1) Configuration and environment space: Let Fi indicate the i-th feature of a configurable system A, which is either enabled or disabled, and one of the two holds by default. The configuration space is mathematically a Cartesian product of all the features, C = Dom(F1) × ... × Dom(Fd), where Dom(Fi) = {0, 1}. A configuration of a system is a member of the configuration space (feature space) where all the parameters are assigned a specific value in their range (i.e., complete instantiations of the system's parameters). We also describe an environment instance by 3 variables e = [w, h, v] drawn from a given environment space E = W × H × V, where they respectively represent sets of possible values for workload, hardware, and system version.
      2) Performance model: Given a software system A with configuration space F and environmental instances E, a performance model is a black-box function f: F × E → R given some observations of the system performance for each combination of the system's features x ∈ F in an environment e ∈ E. To construct a performance model for a system A with configuration space F, we run A in environment instance e ∈ E on various combinations of configurations xi ∈ F and record the resulting performance values yi = f(xi) + εi, xi ∈ F, where εi ~ N(0, σi). The training data for our regression models is then simply Dtr = {(xi, yi)}, i = 1..n. In other words, a response function is simply a mapping from the input space to a measurable performance metric that produces interval-scaled data (here we assume it produces real numbers).
      3) Performance distribution: For the performance model, we measured and associated the performance response to each configuration; now we introduce another concept where we vary the environment and measure the performance. An empirical performance distribution is a stochastic process, pd: E → Δ(R), that defines a probability distribution over performance measures for each environmental condition. To construct a performance distribution for a system A with configuration space F, similarly to the process of deriving the performance models, we run A on various combinations of configurations xi ∈ F for a specific environment instance.
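
      To make the notation above concrete, here is a minimal Python sketch (not from the talk) of how one might represent a binary configuration space and collect the training data Dtr = {(xi, yi)}; the measure() function and the option names are hypothetical placeholders for a real benchmark harness.

      import itertools
      import random

      # Hypothetical binary options F1..Fd of a configurable system A.
      OPTIONS = ["encryption", "compression", "caching", "batching"]

      def configuration_space(options):
          """Enumerate C = Dom(F1) x ... x Dom(Fd) with Dom(Fi) = {0, 1}."""
          for values in itertools.product([0, 1], repeat=len(options)):
              yield dict(zip(options, values))

      def measure(config, environment):
          """Placeholder for running system A in environment e and measuring
          performance; returns y = f(x) + noise. Replace with a real benchmark."""
          base = 10.0 + 5.0 * config["compression"] - 3.0 * config["caching"]
          return base + random.gauss(0.0, 0.5)

      # Environment instance e = [workload, hardware, version].
      environment = {"workload": "w1", "hardware": "tx2", "version": "v1.0"}

      # Training data D_tr = {(x_i, y_i)} from a random sample of configurations.
      configs = random.sample(list(configuration_space(OPTIONS)), k=8)
      D_tr = [(c, measure(c, environment)) for c in configs]
      for x, y in D_tr:
          print(x, round(y, 2))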


  30. Transfer Learning for Performance Modeling of Configurable Systems: An Exploratory Analysis
      Pooyan Jamshidi (Carnegie Mellon University, USA), Norbert Siegmund (Bauhaus-University Weimar, Germany), Miguel Velez, Christian Kästner, Akshay Patel, Yuvraj Agarwal (Carnegie Mellon University, USA)
      Abstract: Modern software systems provide many configuration options which significantly influence their non-functional properties. To understand and predict the effect of configuration options, several sampling and learning strategies have been proposed, albeit often with significant cost to cover the highly dimensional configuration space. Recently, transfer learning has been applied to reduce the effort of constructing performance models by transferring knowledge about performance behavior across environments. While this line of research is promising to learn more accurate models at a lower cost, it is unclear why and when transfer learning works for performance modeling. To shed light on when it is beneficial to apply transfer learning, we conducted an empirical study on four popular software systems, varying software configurations and environmental conditions, such as hardware, workload, and software versions, to identify the key knowledge pieces that can be exploited for transfer learning. Our results show that in small environmental changes (e.g., homogeneous workload change), by applying a linear transformation to the performance model, we can understand the performance behavior of the target environment, while for severe environmental changes (e.g., drastic workload change) we can transfer only knowledge that makes sampling more efficient, e.g., by reducing the dimensionality of the configuration space.
      Index Terms: Performance analysis, transfer learning.
      Fig. 1: Transfer learning is a form of machine learning that takes advantage of transferable knowledge from source to learn an accurate, reliable, and less costly model for the target environment.
      Details: [ASE '17]

  31. Details: [AAAI Spring Symposium ’19]


  32. Outline
      Motivation | Causal AI for Systems | UNICORN | Results | Future Directions

  33. Causal AI in Systems and Software
    33
    Computer Architecture
    Database
    Operating Systems
    Programming Languages
    BigData Software Engineering
    https://github.com/y-ding/causal-system-papers


  34. Misconfiguration and its Effects
      ● Misconfigurations can elicit unexpected interactions between software and hardware
      ● These can result in non-functional faults
      ○ Affecting non-functional system properties like latency, throughput, energy consumption, etc.
      The system does not crash or exhibit an obvious misbehavior; it remains operational but with degraded performance, e.g., high latency, low throughput, high energy consumption, high heat dissipation, or a combination of several.

  35. Motivating Example: CUDA performance issue on TX2
      "When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2's CUDA API runs much slower than TX1 in many cases."
      ● The user is transferring the code from one hardware to another.
      ● The target hardware is faster than the source hardware; the user expects the code to run at least 30-40% faster.
      ● The code ran 2x slower on the more powerful hardware.

  36. Motivating Example
      [Forum thread screenshot, June 3rd-5th]
      "Any suggestions on how to improve my performance? Thanks!"
      "Please do the following and let us know if it works:
        1. Install JetPack 3.0
        2. Set nvpmodel=MAX-N
        3. Run jetson_clock.sh"
      "We have already tried this. We still have high latency. Any other suggestions?"
      "TX2 is pascal architecture. Please update your CMakeLists:
        + set(CUDA_STATIC_RUNTIME OFF)
        ...
        + -gencode=arch=compute_62,code=sm_62"
      The user had several misconfigurations.
      In Software: ✖ Wrong compilation flags, ✖ Wrong SDK version
      In Hardware: ✖ Wrong power mode, ✖ Wrong clock/fan settings
      The discussions took 2 days. How to resolve such issues faster?

  37. 37
    How to resolve these
    issues faster?


  38. Performance Influence Models
      [3-D surface plot: latency (ms) over number of counters × number of splitters (cubic interpolation over a finer grid)]
      Observational Data → Black-box models → Regression Equation

      Config   Bitrate (bits/s)   Enable Padding   …   Cache Misses   …   Throughput (fps)
      c1       1k                 1                …   42m            …   7
      c2       2k                 1                …   32m            …   22
      …        …                  …                …   …              …   …
      cn       5k                 0                …   12m            …   25

      Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
      (options and discovered interactions between options appear as the predictors of the regression equation)
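
      As a point of reference, here is a minimal sketch (not from the talk) of how a performance-influence model of this form can be fit from observational data with an off-the-shelf regressor (scikit-learn); the column names and numbers are hypothetical.

      import numpy as np
      from sklearn.linear_model import LinearRegression
      from sklearn.preprocessing import PolynomialFeatures

      # Hypothetical observational data: rows are measured configurations.
      # Columns: Bitrate (kbits/s), BatchSize; target: Throughput (fps).
      X = np.array([[1, 2], [2, 2], [2, 4], [5, 8], [5, 2], [3, 6]], dtype=float)
      y = np.array([7.0, 22.0, 30.0, 95.0, 40.0, 55.0])

      # Expand features with pairwise interaction terms (Bitrate x BatchSize),
      # mirroring the "discovered interactions" of a performance-influence model.
      poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
      X_poly = poly.fit_transform(X)

      model = LinearRegression().fit(X_poly, y)
      terms = poly.get_feature_names_out(["Bitrate", "BatchSize"])
      for name, coef in zip(terms, model.coef_):
          print(f"{coef:+.2f} * {name}")
      print("intercept:", round(model.intercept_, 2))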


  39. These methods rely on statistical correlations to extract meaningful information required for performance tasks.
      [Same Performance Influence Models figure as the previous slide: observational data → black-box models → regression equation]

  40. Performance Influence Models suffer from several shortcomings
      • Performance influence models could produce incorrect explanations.
      • Performance influence models could produce unreliable predictions.
      • Performance influence models could produce unstable predictions across environments and in the presence of measurement noise.

  41. Performance Influence Models Issue: Incorrect Explanation
      [Scatter plot: Throughput (FPS, 0-20) vs. Cache Misses (100k-200k), with an upward trend]
      Increasing Cache Misses increases Throughput.

  42. Performance Influence Models Issue: Incorrect Explanation
      [Same scatter plot: Throughput (FPS) vs. Cache Misses]
      Increasing Cache Misses increases Throughput.
      This is counter-intuitive: more Cache Misses should reduce Throughput, not increase it.
      Any ML/statistical models built on this data will be incorrect.

  43. Performance Influence Models Issue: Incorrect Explanation
      [Scatter plot of Throughput (FPS) vs. Cache Misses, segregated by Cache Policy: LRU, FIFO, LIFO, MRU]
      Segregating the data on Cache Policy indicates that, within each group, an increase in Cache Misses results in a decrease in Throughput.
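
      This is the classic Simpson's-paradox pattern. The toy sketch below (not from the talk; the numbers are made up) shows how a pooled regression can report a positive slope while the per-policy slopes are all negative.

      import numpy as np

      rng = np.random.default_rng(0)

      # Hypothetical data: four cache policies with different baseline cache-miss
      # levels; within each policy, more misses -> lower throughput.
      policies = {"LRU": (100_000, 6.0), "FIFO": (130_000, 10.0),
                  "LIFO": (160_000, 14.0), "MRU": (190_000, 18.0)}
      misses, fps, labels = [], [], []
      for name, (mu_miss, mu_fps) in policies.items():
          m = rng.normal(mu_miss, 8_000, 50)
          f = mu_fps - 0.0002 * (m - mu_miss) + rng.normal(0, 0.5, 50)
          misses.append(m)
          fps.append(f)
          labels += [name] * 50
      misses, fps = np.concatenate(misses), np.concatenate(fps)

      pooled_slope = np.polyfit(misses, fps, 1)[0]
      print(f"pooled slope: {pooled_slope:+.2e}  (misleadingly positive)")
      for name in policies:
          mask = np.array(labels) == name
          slope = np.polyfit(misses[mask], fps[mask], 1)[0]
          print(f"{name:>4} slope: {slope:+.2e}  (negative within the group)")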


  44. Performance Influence Models Issue: Unstable Predictors
      Performance influence models change significantly in new environments, resulting in less accuracy.
      Performance influence model in TX2:
      Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
      Performance influence model in Xavier:
      Throughput = 2 × Bitrate + 1.9 × BatchSize + 1.8 × BufferSize + 0.5 × EnablePadding + 5.9 × Bitrate × BufferSize + 6.2 × Bitrate × EnablePadding + 4.1 × Bitrate × BufferSize × EnablePadding

  45. Performance Influence Models Issue: Unstable Predictors
      Performance influence models cannot be reliably used across environments.
      Performance influence model in TX2:
      Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
      Performance influence model in Xavier:
      Throughput = 2 × Bitrate + 1.9 × BatchSize + 1.8 × BufferSize + 0.5 × EnablePadding + 5.9 × Bitrate × BufferSize + 6.2 × Bitrate × EnablePadding + 4.1 × Bitrate × BufferSize × EnablePadding

  46. Performance Influence Models Issue: Non-generalizability
      Performance influence models do not generalize well across deployment environments.
      Performance influence model in TX2:
      Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
      Performance influence model in Xavier:
      Throughput = 2 × Bitrate + 1.9 × BatchSize + 1.8 × BufferSize + 0.5 × EnablePadding + 5.9 × Bitrate × BufferSize + 6.2 × Bitrate × EnablePadding + 4.1 × Bitrate × BufferSize × EnablePadding

  47. Causal Performance Model
      Expresses the relationships between interacting variables (configuration options, system events, and non-functional properties) as a causal graph.
      [Scatter plot: Throughput (FPS) vs. Cache Misses]
      Cache Policy → Cache Misses → Throughput   (the arrows give the direction of causality)
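
      A causal performance model can be represented as a directed acyclic graph over options, events, and objectives. Here is a minimal sketch (not the UNICORN implementation) using networkx, with the small Cache Policy example from the slide:

      import networkx as nx

      # Causal performance model: nodes are configuration options, system events,
      # and non-functional properties; edges point in the direction of causality.
      G = nx.DiGraph()
      G.add_edge("CachePolicy", "CacheMisses")   # option -> system event
      G.add_edge("CacheMisses", "Throughput")    # system event -> performance objective

      assert nx.is_directed_acyclic_graph(G)
      print("parents of Throughput:", list(G.predecessors("Throughput")))
      print("causal path:", nx.shortest_path(G, "CachePolicy", "Throughput"))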


  48. Why Causal Inference? - Produces Correct Explanations
      [Scatter plots of Throughput (FPS) vs. Cache Misses: pooled, and segregated by Cache Policy (LRU, FIFO, LIFO, MRU)]
      Cache Policy affects Throughput via Cache Misses: Cache Policy → Cache Misses → Throughput.
      Causal performance models recover the correct interactions.

  49. Why Causal Inference? - Minimal Structure Change
      Causal models remain relatively stable across environments.
      [Two partial causal performance models, one for Jetson TX2 and one for Jetson Xavier, over the same variables: Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, Cycles, FPS, Energy]

  50. Why Causal Inference? - Accurate across Environments
      [Bar charts comparing performance-influence (regression) models and the causal performance model when transferred from a source to a target environment: common terms (Source → Target), total terms (Source), total terms (Target), and MAPE (%) for Error (Source), Error (Target), and Error (Source → Target)]
      For the causal performance model, the common predictors are large in number; for the regression models, the common predictors are lower in number.

  51. Why Causal Inference? - Accurate across Environments
      [Same comparison as the previous slide, additionally annotated: the causal performance model has low error when reused in the target environment, while the regression models have high error when reused]
      Causal models can be reliably reused when environmental changes occur.

  52. Why Causal Inference? - Generalizability
      [Same comparison of regression models and the causal performance model across environments]
      Causal models are more generalizable than performance influence models.

  53. How to use Causal Performance Models?
      Cache Policy → Cache Misses → Throughput
      How to generate a causal graph?

  54. How to use Causal Performance Models?
      Cache Policy → Cache Misses → Throughput
      How to generate a causal graph?
      How to use the causal graph for performance tasks?

  55. Outline
      Motivation | Causal AI for Systems | UNICORN | Results | Future Directions

  56. UNICORN: Our Causal AI for Systems Method
      • Build a causal performance model that captures the interactions among options in the variability space using observational performance data.
      • Iteratively evaluate and update the causal performance model.
      • Perform downstream performance tasks such as performance debugging & optimization using causal reasoning.

  57. UNICORN: Our Causal AI for Systems Method
      System under study: Software: DeepStream; Middleware: TF, TensorRT; Hardware: Nvidia Xavier; Configuration: Default
      1- Specify Performance Query, e.g.  QoS: Th > 40/s;  Observed: Th < 30/s ± 5/s
      2- Learn Causal Performance Model from the performance data
      3- Translate Performance Query to Causal Queries, for performance debugging ("What is the root cause of the observed performance fault?", "How do I fix the misconfiguration?") or performance optimization ("How can I improve throughput without sacrificing accuracy?", "How do I understand performance behavior?")
      4- Estimate Causal Queries with the query engine, e.g. estimate the probability of satisfying the QoS if BufferSize is set to 6k:  P(Th > 40/s | do(BufferSize = 6k))
      5- Update Causal Performance Model: while the budget is not exhausted, measure the performance of the configuration(s) that maximize information gain and update the model.
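
      Step 4 evaluates interventional queries such as P(Th > 40/s | do(BufferSize = 6k)). Here is a minimal sketch (not the UNICORN query engine) of estimating such a query from observational data via backdoor adjustment, assuming a hypothetical confounder BatchSize that affects both BufferSize and throughput; all data are synthetic.

      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(1)

      # Hypothetical observational data: BatchSize confounds BufferSize and Th.
      n = 5000
      batch = rng.choice([2, 8], size=n)
      p_high = np.where(batch == 8, 0.7, 0.2)          # batch 8 tends to use BufferSize = 6k
      buffer_k = np.where(rng.random(n) < p_high, 6, 2)
      th = 35 + 6 * (buffer_k == 6) + 4 * (batch == 8) + rng.normal(0, 2, n)
      df = pd.DataFrame({"BatchSize": batch, "BufferSize": buffer_k, "Th": th})

      # Backdoor adjustment:
      #   P(Th > 40 | do(BufferSize = 6))
      #     = sum_z P(Th > 40 | BufferSize = 6, BatchSize = z) * P(BatchSize = z)
      def p_do(df, threshold=40.0, buffer_value=6):
          total = 0.0
          for z, p_z in df["BatchSize"].value_counts(normalize=True).items():
              stratum = df[(df.BufferSize == buffer_value) & (df.BatchSize == z)]
              if len(stratum) > 0:
                  total += (stratum.Th > threshold).mean() * p_z
          return total

      naive = (df[df.BufferSize == 6].Th > 40.0).mean()   # conditioning, not intervening
      print("P(Th > 40 | BufferSize = 6)     =", round(naive, 3))
      print("P(Th > 40 | do(BufferSize = 6)) =", round(p_do(df), 3))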


  58. UNICORN: Our Causal AI for Systems Method
      [Same pipeline figure as slide 57]

  59. UNICORN: Our Causal AI for Systems Method
      [Same pipeline figure as slide 57]

  60. Learning Causal Performance Model
      [Running example variables: Bitrate, Buffer Size, Batch Size, Enable Padding (options); Branch Misses, Cache Misses, No of Cycles (events); FPS, Energy (objectives); observational data table with configurations c1..cn]
      1- Recovering the skeleton: start from a fully connected graph, subject to constraints (e.g., no connections between configuration options).
      2- Pruning the causal structure: statistical independence tests.
      3- Orienting causal relations: orientation rules & measures (entropy) + structural constraints (colliders, v-structures).
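
      A minimal sketch of the skeleton-recovery and pruning idea (steps 1-2), not UNICORN's actual implementation: start fully connected, then drop an edge whenever its two endpoints look independent given a small conditioning set, using partial correlation as the independence test; the data and variable names are hypothetical.

      import itertools
      import numpy as np
      import pandas as pd

      def partial_corr(df, x, y, z):
          """Correlation of x and y after regressing out the variables in z."""
          if not z:
              return df[x].corr(df[y])
          Z = np.column_stack([df[c].values for c in z] + [np.ones(len(df))])
          rx = df[x].values - Z @ np.linalg.lstsq(Z, df[x].values, rcond=None)[0]
          ry = df[y].values - Z @ np.linalg.lstsq(Z, df[y].values, rcond=None)[0]
          return np.corrcoef(rx, ry)[0, 1]

      def recover_skeleton(df, options, threshold=0.1, max_cond=1):
          """Steps 1-2: start fully connected, forbid option-option edges,
          then prune edges whose endpoints look conditionally independent."""
          cols = list(df.columns)
          edges = {frozenset(e) for e in itertools.combinations(cols, 2)
                   if not (e[0] in options and e[1] in options)}   # structural constraint
          for x, y in [tuple(e) for e in edges.copy()]:
              others = [c for c in cols if c not in (x, y)]
              for k in range(max_cond + 1):
                  if any(abs(partial_corr(df, x, y, list(z))) < threshold
                         for z in itertools.combinations(others, k)):
                      edges.discard(frozenset((x, y)))
                      break
          return edges

      # Hypothetical data following BufferSize -> CacheMisses -> FPS.
      rng = np.random.default_rng(2)
      buf = rng.choice([2.0, 4.0, 6.0], 500)
      miss = 50 - 5 * buf + rng.normal(0, 1, 500)
      fps = 40 - 0.5 * miss + rng.normal(0, 1, 500)
      data = pd.DataFrame({"BufferSize": buf, "CacheMisses": miss, "FPS": fps})

      print(recover_skeleton(data, options={"BufferSize"}))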


  61. Performance measurement
      Configuration space: C = O1 × O2 × ⋯ × O19 × O20, with options such as dead code removal, constant folding, loop unrolling, and function inlining; a configuration is one point in this space, e.g. c1 = 0 × 0 × ⋯ × 0 × 1, c1 ∈ C.
      Pipeline: Program → Compiler (e.g., SaC, LLVM) → Compiled Code → Instrumented Binary → Hardware (compile, deploy, configure).
      Each non-functional, measurable/quantifiable aspect is a function of the configuration, e.g.
      fc(c1) = 11.1 ms (compile time), fe(c1) = 110.3 ms (execution time), fen(c1) = 100 mWh (energy).

  62. Our setup for performance measurements
    62


  63. Learning Causal Performance Model
      [Same figure as slide 60: recovering the skeleton, pruning the causal structure, and orienting causal relations]

  64. Learning Causal Performance Model
      [Same figure as slide 60: recovering the skeleton, pruning the causal structure, and orienting causal relations]

  65. Learning Causal Performance Model
      [Same figure as slide 60: recovering the skeleton, pruning the causal structure, and orienting causal relations]

  66. Causal Performance Model
      [Causal graph over software options (Bitrate, Buffer Size, Batch Size, Enable Padding), performance events (Branch Misses, Cache Misses, No of Cycles), and performance objectives (Throughput, Energy), with functional nodes f on the edges; causal interactions and causal paths are highlighted; components: Decoder, Muxer]
      Example functional node:
      BranchMisses = 2 × Bitrate + 8.1 × BufferSize + 4.1 × Bitrate × BufferSize × CacheMisses

  67. UNICORN: Our Causal AI for Systems Method
      [Same pipeline figure as slide 57]

  68. Causal Debugging
      • What is the root cause of my fault?
      • How do I fix my misconfigurations to improve performance?
      Workflow: observational data (about 25 sample configurations as training data) → build causal graph → extract causal paths → rank paths → counterfactual queries ("what if" questions, e.g., what if the configuration option X was set to a value 'x'?) → best query. If the misconfiguration fault is fixed, stop; otherwise update the observational data and repeat.

  69. Extracting Causal Paths from the Causal Model
    Problem
    ✕ In real world cases, this causal graph can be
    very complex
    ✕ It may be intractable to reason over the entire
    graph directly
    69
    Solution
    ✓ Extract paths from the causal graph
    ✓ Rank them based on their Average Causal
    Effect on latency, etc.
    ✓ Reason over the top K paths


  70. Extracting Causal Paths from the Causal Model
      [Causal graph over Load, GPU Mem., Swap Mem., and Latency, with the extracted paths shown alongside]
      Extract paths: a causal path always begins with a configuration option or a system event and always terminates at a performance objective.
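
      A minimal sketch (not from the talk) of extracting such paths with networkx: enumerate simple paths that start at a configuration option or system event and end at a performance objective. The small graph here is a hypothetical stand-in.

      import itertools
      import networkx as nx

      # Hypothetical causal graph: option -> events -> objective.
      G = nx.DiGraph([("Load", "GPU Mem."), ("GPU Mem.", "Swap Mem."),
                      ("Swap Mem.", "Latency"), ("GPU Mem.", "Latency")])

      options_and_events = ["Load", "GPU Mem.", "Swap Mem."]
      objectives = ["Latency"]

      # A causal path starts at an option or event and terminates at an objective.
      causal_paths = [p
                      for src, dst in itertools.product(options_and_events, objectives)
                      for p in nx.all_simple_paths(G, src, dst)]
      for path in causal_paths:
          print(" -> ".join(path))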


  71. Ranking Causal Paths from the Causal Model
      ● There may be too many causal paths
      ● We need to select the most useful ones
      ● Compute the Average Causal Effect (ACE) of each pair of neighbors in a path, e.g. for GPU Mem. and Swap Mem. on the path GPU Mem. → Swap Mem. → Latency:

      ACE(GPU Mem., Swap) = (1/N) Σ_{a,b ∈ Z} [ E(GPU Mem. | do(Swap = b)) − E(GPU Mem. | do(Swap = a)) ]

      E(GPU Mem. | do(Swap = b)) is the expected value of GPU Mem. when we artificially intervene by setting Swap to the value b (and likewise for a); the sum averages over all permitted values of Swap memory. If this difference is large, then small changes to Swap Mem. will cause large changes to GPU Mem.
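
      As an illustration, here is a small sketch (not UNICORN's code) that estimates the ACE of one variable on its child in a toy linear structural model by simulating the do() interventions directly; the variables and coefficients are hypothetical.

      import numpy as np

      rng = np.random.default_rng(3)

      def simulate(n, do_swap=None):
          """Toy structural model: Load -> GPU Mem. -> Swap -> Latency.
          do_swap fixes Swap to a value, severing its incoming edge."""
          load = rng.uniform(0, 1, n)
          gpu_mem = 2.0 * load + rng.normal(0, 0.1, n)
          swap = (np.full(n, do_swap) if do_swap is not None
                  else 1.5 * gpu_mem + rng.normal(0, 0.1, n))
          latency = 10.0 + 4.0 * swap + rng.normal(0, 0.5, n)
          return latency

      def ace(values, n=10_000):
          """Average causal effect of Swap on Latency over permitted Swap values:
          mean over pairs (a, b) of E[Latency | do(Swap=b)] - E[Latency | do(Swap=a)]."""
          expectations = {v: simulate(n, do_swap=v).mean() for v in values}
          pairs = [(a, b) for a in values for b in values if b > a]
          return np.mean([expectations[b] - expectations[a] for a, b in pairs])

      print("ACE(Latency, Swap) ~", round(ace([1.0, 2.0, 4.0]), 2))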


  72. Ranking Causal Paths from the Causal Model
      ● Average the ACE of all pairs of adjacent nodes in the path
      ● Rank paths from the highest path ACE (PACE) score to the lowest
      ● Use the top K paths for subsequent analysis
      For a path Z → X → Y (e.g., GPU Mem. → Swap Mem. → Latency):

      PACE(Z, Y) = 1/2 ( ACE(Z, X) + ACE(X, Y) )

      i.e., sum the ACE over all pairs of adjacent nodes in the causal path and average.
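
      Continuing the sketch above, ranking paths by PACE reduces to averaging the pairwise ACE values along each path; ace_of() below is a placeholder for an estimator such as the one in the previous example.

      def pace(path, ace_of):
          """Path ACE: average of ACE over adjacent pairs in the path."""
          pairs = list(zip(path, path[1:]))
          return sum(ace_of(x, y) for x, y in pairs) / len(pairs)

      def rank_paths(paths, ace_of, top_k=3):
          """Rank causal paths from the highest PACE score to the lowest."""
          return sorted(paths, key=lambda p: pace(p, ace_of), reverse=True)[:top_k]

      # Usage with a hypothetical ACE lookup table:
      ace_table = {("GPU Mem.", "Swap Mem."): 3.2, ("Swap Mem.", "Latency"): 4.0,
                   ("Load", "GPU Mem."): 0.7, ("GPU Mem.", "Latency"): 1.1}
      paths = [["GPU Mem.", "Swap Mem.", "Latency"], ["Load", "GPU Mem.", "Latency"]]
      print(rank_paths(paths, lambda x, y: ace_table[(x, y)]))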


  73. Diagnosing and Fixing the Faults
      • What is the root cause of my fault?
      • How do I fix my misconfigurations to improve performance?
      [Same workflow as slide 68: observational data (about 25 sample configurations as training data) → build causal graph → extract causal paths → rank paths → counterfactual ("what if") queries, e.g. what if the configuration option X was set to a value 'x'? → best query; if the fault is fixed, stop, otherwise update the observational data and repeat]

  74. Diagnosing and Fixing the Faults
      ● Counterfactual inference asks "what if" questions about changes to the misconfigurations.
      We are interested in the scenario where:
      • We hypothetically have low latency;
      conditioned on the following events:
      • We hypothetically set the new Swap memory to 4 Gb
      • Swap memory was initially set to 2 Gb
      • We observed high latency when Swap was set to 2 Gb
      • Everything else remains the same
      Example: Given that my current swap memory is 2 Gb and I have high latency, what is the probability of having low latency if swap memory were increased to 4 Gb?

  75. Diagnosing and Fixing the Faults
      [Original path over Load, GPU Mem., Swap, and Latency, and the path after the proposed change with Swap = 4 Gb and Latency = low?]
      To reflect the hypothetical scenario, modify the model: remove the incoming edges of the changed node (assume no external influence) and set Swap = 4 Gb.
      Use both models to compute the answer to the counterfactual question.

  76. Diagnosing and Fixing the Faults
      [Original path, and the path after the proposed change with Swap = 4 Gb, as on the previous slide]

      Potential = P( Latency^ = low | Swap^ = 4 Gb, Swap = 2 Gb, Latency_{Swap = 2 Gb} = high, U )

      i.e., we expect a low latency, given that the Swap is now hypothetically 4 Gb, the Swap was initially 2 Gb, the latency was high when Swap was 2 Gb, and everything else (U) stays the same.

  77. Diagnosing and Fixing the Faults

      Potential = P( outcome^ = good | change^, outcome_{¬change} = bad, ¬change, U )
        Probability that the outcome is good after a change, conditioned on the past.

      Control = P( outcome^ = bad | ¬change, U )
        Probability that the outcome was bad before the change.

      Individual Treatment Effect = Potential − Control
      If this difference is large, then our change is useful.
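
      A minimal sketch (not from the talk) of the counterfactual computation on a toy linear structural model, following the usual abduction-action-prediction recipe: infer the noise terms from the observed (faulty) configuration, intervene on Swap, and re-predict latency; the model, coefficients, and numbers are hypothetical.

      # Toy structural causal model (hypothetical coefficients):
      #   GPU Mem. = 8 - 1.5 * Swap + u_gpu
      #   Latency  = 30 + 6 * GPU Mem. + u_lat
      def predict(swap, u_gpu, u_lat):
          gpu_mem = 8.0 - 1.5 * swap + u_gpu
          latency = 30.0 + 6.0 * gpu_mem + u_lat
          return gpu_mem, latency

      # Observed faulty configuration: Swap = 2 Gb, GPU Mem. = 5.5, Latency = 66.
      swap_obs, gpu_obs, lat_obs = 2.0, 5.5, 66.0

      # 1) Abduction: recover the noise terms consistent with the observation.
      u_gpu = gpu_obs - (8.0 - 1.5 * swap_obs)
      u_lat = lat_obs - (30.0 + 6.0 * gpu_obs)

      # 2) Action: intervene do(Swap = 4 Gb).  3) Prediction: recompute latency.
      _, lat_cf = predict(4.0, u_gpu, u_lat)
      print("observed latency:", lat_obs,
            "-> counterfactual latency under do(Swap = 4 Gb):", lat_cf)
      # A debugging procedure in this spirit would then score the candidate change
      # by how much it raises the probability of meeting the latency QoS (the ITE).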


  78. Diagnosing and Fixing the Faults
      Take the top K paths (e.g., GPU Mem. → Swap Mem. → Latency), enumerate all possible changes (set every configuration option in the path to all permitted values), compute ITE(change) for each, and pick the change with the largest ITE.
      The ITEs are inferred from the observed data; this is very cheap.

  79. Diagnosing and Fixing the Faults
      Apply the change with the largest ITE and measure performance. If the fault is fixed, stop; if not, add the new measurement to the observational data, update the causal model, and repeat.
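
      Putting the debugging loop together, a compact sketch of the control flow (the helper callables are hypothetical stand-ins for the pieces shown on the previous slides):

      def causal_debug(observations, learn_model, top_k_paths, candidate_changes,
                       ite, measure, fault_fixed, budget=20):
          """Iterative causal debugging loop: rank candidate fixes by ITE,
          try the best one, and update the model until the fault is fixed."""
          for _ in range(budget):
              model = learn_model(observations)            # causal performance model
              paths = top_k_paths(model, k=3)              # ranked by PACE
              changes = candidate_changes(paths)           # set options to permitted values
              best = max(changes, key=lambda ch: ite(model, ch))
              result = measure(best)                       # run the system once
              observations.append(result)
              if fault_fixed(result):
                  return best
          return None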


  80. UNICORN: Our Causal AI for Systems Method
      [Same pipeline figure as slide 57]

  81. Active Learning for Updating Causal Performance Model
      1- Evaluate candidate interventions on hardware, workload, and kernel options (expected change in belief & KL; causal effects on objectives).
      2- Determine & perform the next performance measurement, e.g. a row of option/event/objective values: Bitrate 1k, Buffer Size 20k, Batch Size 10, Enable Padding 1, Branch Misses 24m, Cache Misses 42m, No of Cycles 73b, FPS 31/s, Energy 42J.
      3- Update the causal model with the new performance data (model averaging).
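
      One simple acquisition heuristic in this spirit (a sketch, not necessarily the criterion UNICORN uses): score each candidate configuration by the disagreement of an ensemble of models fit on bootstrapped data, and measure the configuration where the models disagree most, i.e., where a new observation is expected to change our beliefs the most. All names and numbers below are hypothetical.

      import numpy as np
      from sklearn.ensemble import RandomForestRegressor

      def next_measurement(X_observed, y_observed, X_candidates, n_models=10, seed=0):
          """Pick the candidate configuration with the highest predictive
          disagreement across a bootstrap ensemble (an uncertainty proxy)."""
          rng = np.random.default_rng(seed)
          preds = []
          for i in range(n_models):
              idx = rng.integers(0, len(X_observed), len(X_observed))   # bootstrap
              m = RandomForestRegressor(n_estimators=50, random_state=i)
              m.fit(X_observed[idx], y_observed[idx])
              preds.append(m.predict(X_candidates))
          disagreement = np.std(np.stack(preds), axis=0)
          return int(np.argmax(disagreement))

      # Hypothetical usage: 3 options per configuration, 5 measured, 4 candidates.
      X_obs = np.array([[1, 2, 0], [2, 2, 1], [4, 8, 0], [6, 8, 1], [3, 4, 0]], float)
      y_obs = np.array([12.0, 15.0, 30.0, 41.0, 22.0])
      X_cand = np.array([[5, 2, 1], [6, 4, 0], [1, 8, 1], [2, 4, 1]], float)
      print("measure candidate index:", next_measurement(X_obs, y_obs, X_cand))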


  82. Active Learning for Updating Causal Performance Model
      [Same active-learning figure as slide 81]

  83. Active Learning for Updating Causal Performance Model
      [Same active-learning figure as slide 81]

  84. Benefits of Causal
    Reasoning for
    System
    Performance
    Analysis

    View Slide

  85. There are two fundamental benefits of our "Causal AI for Systems" methodology:
    1. We learn one central (causal) performance model from the data and reuse it across different performance tasks:
    • Performance understanding
    • Performance optimization
    • Performance debugging and repair
    • Performance prediction for different environments (e.g., canary -> production)
    2. The causal model is transferable across environments.
    • We observed the Sparse Mechanism Shift in systems too!
    • Alternative non-causal models (e.g., regression-based models for performance tasks) are not transferable because they rely on the i.i.d. assumption.
    85

    View Slide

  86. 86
    The new version of CADET, called UNICORN, was accepted at EuroSys 2022.
    https://github.com/softsys4ai/UNICORN

    View Slide

  87. Outline
    87
    Motivation
    Causal AI
    For Systems
    Causal AI for
    Autonomy
    and Robotics
    UNICORN
    Results
    Autonomy
    Evaluation
    at JPL

    View Slide

  88. Results: Case Study
    88
    When we are trying to transplant our CUDA source code from TX1 to TX2, it
    behaved strange.
    We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation,
    we think TX2 will 30% - 40% faster than TX1 at least.
    Unfortunately, most of our code base spent twice the time as TX1, in other words,
    TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs
    much slower than TX1 in many cases.
    The user is transferring the code from one hardware platform to another.
    The target hardware is faster than the source hardware.
    The user expects the code to run at least 30-40% faster.
    Instead, the code ran 2x slower on the more powerful hardware.

    View Slide

  89. Results: Case Study
    89
    Nvidia TX1: CPU 4 cores, 1.3 GHz; GPU 128 cores, 0.9 GHz; Memory 4 GB, 25 GB/s
    Nvidia TX2 (more powerful): CPU 6 cores, 2 GHz; GPU 256 cores, 1.3 GHz; Memory 8 GB, 58 GB/s
    Workload: embedded real-time stereo estimation (same source code on both boards)
    TX1: 17 FPS; TX2: 4 FPS (4x slower on the more powerful hardware!)

    View Slide

  90. Results: Case Study
    90
    Options changed: all three approaches (UNICORN, Decision Tree, Forum) changed CPU Cores, CPU Freq., EMC Freq., GPU Freq., and CUDA_STATIC_RT;
    the remaining flagged options (Sched. Policy, Sched. Runtime, Sched. Child Proc, Dirty Bg. Ratio, Drop Caches, Swap Memory)
    were additionally changed only by the baselines.
    Metric                       UNICORN   Decision Tree   Forum
    Throughput (on TX2)          26 FPS    20 FPS          23 FPS
    Throughput Gain (over TX1)   53 %      21 %            39 %
    Time to resolve              24 min.   3 1/2 Hrs.      2 days
    Results (the user expected a 30-40% gain):
    X Finds the root-causes accurately
    X No unnecessary changes
    X Better improvements than forum's recommendation
    X Much faster

    View Slide

  91. Evaluation: Experimental Setup
    Hardware Systems:
    Nvidia TX1: CPU 4 cores, 1.3 GHz; GPU 128 cores, 0.9 GHz; Memory 4 GB, 25 GB/s
    Nvidia TX2: CPU 6 cores, 2 GHz; GPU 256 cores, 1.3 GHz; Memory 8 GB, 58 GB/s
    Nvidia Xavier: CPU 8 cores, 2.26 GHz; GPU 512 cores, 1.3 GHz; Memory 32 GB, 137 GB/s
    Software Systems:
    Xception: image recognition (50,000 test images)
    DeepSpeech: voice recognition (5 sec. audio clip)
    BERT: sentiment analysis (10,000 IMDb reviews)
    x264: video encoder (11 MB, 1080p video)
    Configuration Space:
    X 30 configuration options: 10 software, 10 OS/kernel, 10 hardware
    X 17 system events
    91

    View Slide

  92. Evaluation: Data Collection
    ● For each software/hardware combination, create a benchmark dataset:
    ○ Exhaustively set each configuration option to all permitted values.
    ○ For continuous options (e.g., GPU memory), sample 10 equally spaced values between [min, max].
    ● Measure the latency, energy consumption, and heat dissipation.
    ○ Repeat 5x and average.
    (A sketch of generating such a benchmark grid appears below.)
    92
    [Figure: benchmark data highlighting latency faults, energy faults, and multiple (latency & energy) faults.]
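    The following is a minimal sketch of building such a benchmark grid. The option names, value ranges, and the placeholder measurement function are assumptions for illustration, not the evaluation's actual configuration space.

```python
# Illustrative benchmark-grid generation: exhaustive values for discrete options,
# 10 equally spaced samples for continuous ones, 5 repeated measurements averaged.
import itertools

def spaced(lo, hi, n=10):
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

options = {
    "cpu_cores": [1, 2, 3, 4],                 # discrete: all permitted values
    "cuda_static_rt": [0, 1],
    "gpu_mem_mhz": spaced(140, 1300, 10),      # continuous: 10 equally spaced values
}

def measure_once(config):
    # Placeholder for running the real benchmark and reading latency/energy/heat.
    return {"latency_ms": 100.0, "energy_j": 40.0, "heat_c": 55.0}

dataset = []
for values in itertools.product(*options.values()):
    config = dict(zip(options.keys(), values))
    runs = [measure_once(config) for _ in range(5)]                  # repeat 5x
    avg = {k: sum(r[k] for r in runs) / len(runs) for k in runs[0]}  # and average
    dataset.append({**config, **avg})
```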

    View Slide

  93. Evaluation: Ground Truth
    ● For each performance fault:
    ○ Manually investigate the root cause.
    ○ "Fix" the misconfiguration.
    ● A "fix" implies the configuration no longer has tail performance:
    ○ a user-defined benchmark (i.e., 10th percentile), or
    ○ some QoS/SLA benchmark.
    ● Record the configuration options that were changed.
    (A sketch of labeling tail-performance faults appears below.)
    93
    [Figure: benchmark data highlighting latency faults, energy faults, and multiple faults.]
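    Below is a minimal sketch of labeling tail-performance faults in the benchmark data. The percentile threshold and the field names are assumptions for illustration.

```python
# Illustrative fault labeling: a configuration is flagged if its latency or energy
# falls in the worst tail of the benchmark (here, the worst 10%).
def percentile(values, q):
    s = sorted(values)
    idx = min(int(q / 100 * len(s)), len(s) - 1)
    return s[idx]

def label_faults(dataset, tail_q=90):
    lat_thr = percentile([d["latency_ms"] for d in dataset], tail_q)
    eng_thr = percentile([d["energy_j"] for d in dataset], tail_q)
    for d in dataset:
        lat_fault = d["latency_ms"] >= lat_thr
        eng_fault = d["energy_j"] >= eng_thr
        d["fault"] = ("multiple" if lat_fault and eng_fault
                      else "latency" if lat_fault
                      else "energy" if eng_fault
                      else None)
    return dataset

rows = [{"latency_ms": l, "energy_j": e} for l, e in [(90, 35), (100, 40), (300, 36), (95, 120)]]
print([r["fault"] for r in label_faults(rows)])   # [None, None, 'latency', 'energy']
```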

    View Slide

  94. Results: Research Questions
    94
    RQ1: How does UNICORN perform compared to model-based diagnostics?
    RQ2: How does UNICORN perform compared to search-based optimization?

    View Slide

  95. Results: Research Question 1 (single objective)
    95
    RQ1: How does UNICORN perform compared to model-based diagnostics?
    Takeaways:
    X Finds the root-causes accurately (more accurate than ML-based methods)
    X Better gain
    X Much faster (up to 20x faster)

    View Slide

  96. Results: Research Question 1 (multi-objective)
    96
    RQ1: How does UNICORN perform compared to model-based diagnostics?
    (Multiple faults in latency & energy usage)
    Takeaways:
    X No deterioration of other performance objectives

    View Slide

  97. Results: Research Questions
    97
    RQ1: How does UNICORN perform compared to model-based diagnostics?
    RQ2: How does UNICORN perform compared to search-based optimization?

    View Slide

  98. Results: Research Question 2
    98
    RQ2: How does UNICORN perform compared to search-based optimization?
    Takeaways:
    X Better, with no deterioration of other performance objectives

    View Slide

  99. Results: Research Question 2 (continued)
    99
    RQ2: How does UNICORN perform compared to search-based optimization?
    Takeaways:
    X Considerably faster than search-based optimization

    View Slide

  100. Summary: Causal AI for Systems
    1. Learning a functional causal model for different downstream systems tasks.
    2. The learned causal model is transferable across different environments.
    100
    [Recap of the UNICORN workflow (see slide 80): 1- specify the performance query (QoS: Th > 40/s; observed: Th < 30/s ± 5/s);
    2- learn the causal performance model from performance data; 3- translate the performance query to causal queries,
    e.g., P(Th > 40/s | do(BufferSize = 6k)); 4- estimate the causal queries with the query engine; 5- update the causal
    performance model by measuring the configuration(s) that maximizes information gain, until the budget is exhausted.
    Running example: Software: DeepStream; Middleware: TF, TensorRT; Hardware: Nvidia Xavier; Configuration: Default.]

    View Slide

  101. Artificial Intelligence and Systems Laboratory (AISys Lab)
    Machine Learning, Computer Systems, Autonomy, AI/ML Systems
    https://pooyanjamshidi.github.io/AISys/
    101
    Ying Meng (PhD student), Shuge Lei (PhD student), Kimia Noorbakhsh (Undergrad), Shahriar Iqbal (PhD student),
    Jianhai Su (PhD student), M.A. Javidian (Postdoc), Fatemeh Ghofrani (PhD student), Abir Hossen (PhD student),
    Hamed Damirchi (PhD student), Mahdi Sharifi (PhD student), Lane Stanley (Intern), Sonam Kharde (Postdoc)
    Sponsors, thanks!

    View Slide

  102. Collaborators
    Rahul Krishna (Columbia), Shahriar Iqbal (UofSC), M. A. Javidian (Purdue), Baishakhi Ray (Columbia)

    View Slide

  103. View Slide

  104. Outline
    104
    Causal AI
    For Systems
    UNICORN
    Results
    Motivation
    Causal AI for
    Autonomy
    and Robotics
    Autonomy
    Evaluation
    at JPL

    View Slide

  105. RASPBERRY SI
    Resource Adaptive Software Purpose-Built for Extraordinary Robotic Research Yields - Science Instruments
    AISR: Autonomous Robotics Research for Ocean Worlds (ARROW)
    Pooyan Jamshidi (UofSC, PI), David Garlan (CMU, Co-I), Bradley Schmerl (CMU, Co-I), Matt DeMinico (NASA, Co-I),
    Javier Camara (York (UK), Collaborator), Ellen Czaplinski (NASA JPL, Consultant), Katherine Dzurilla (UArk, Consultant),
    Jianhai Su (UofSC, Graduate Student), Abir Hossen (UofSC, Graduate Student), Sonam Kharde (UofSC, Postdoc)

    View Slide

  106. RASPBERRY SI / AISR: Autonomous Robotics Research for Ocean Worlds (ARROW)
    Program Manager: CAROLYN R. MERCER
    Autonomy team (RASPBERRY SI): Quantitative Planning; Transfer & Online Learning; Causal AI
    POOYAN JAMSHIDI (UofSC, PI), DAVID GARLAN (CMU, Co-I), BRADLEY SCHMERL (CMU, Co-I), MATT DeMINICO (NASA, Co-I),
    JAVIER CAMARA (York, Collaborator), ELLEN CZAPLINSKI (Arkansas, Consultant), KATHERINE DZURILLA (Arkansas, Consultant),
    JIANHAI SU (UofSC, Graduate Student), ABIR HOSSEN (UofSC, Graduate Student), SONAM KHARDE (UofSC, Postdoc)
    Testbed teams (develop and maintain the physical and virtual testbeds):
    K. MICHAEL DALAL (Team Lead), HARI D NAYAR (Team Lead), USSAMA NAAL (Software Engineer), LANSSIE MA (Software Engineer),
    ANNA E BOETTCHER (Robotics System Engineer), ASHISH GOEL (Research Technologist), ANJAN CHAKRABARTY (Software Engineer),
    CHETAN KULKARNI (Prognostics Researcher), THOMAS STUCKY (Software Engineer), TERENCE WELSH (Software Engineer),
    CHRISTOPHER LIM (Robotics Software Engineer), JACEK SAWONIEWICZ (Robotics System Engineer)
    The autonomy team develops the autonomy and evaluates it on both the physical and virtual testbeds.

    View Slide

  107. 107

    View Slide

  108. View Slide

  109. Autonomy Module: Evaluation
    109
    Design
    • MAPE-K loop-based design (Monitor, Analyze, Plan, Execute over shared Knowledge)
    • Machine-learning-driven quantitative planning and adaptation
    Evaluation
    • Two testbeds with different fidelities & simulation flexibilities
    System Under Test (NASA Lander):
    • Physical Testbed: OWLAT (NASA/JPL)
    • Virtual Testbed: OceanWATERS (NASA/ARC)

    View Slide

  110. Learning in Simulation for Transfer Learning to Physical Testbed
    110
    Sim2Real transfer: models are learned in the simulation environment (OWLAT-sim) and transferred to the
    physical testbed (OWLAT), exploiting causal invariances.

    View Slide

  111. Causal AI for Autonomous Robot Testing
    • Testing cyber-physical systems such as robots is complicated. The key reason is that there are additional
      interactions with the environment and the task that the robot is performing.
    • Evaluating our Causal AI for Systems methodology with autonomous robots provides the following opportunities:
    1. Identifying difficult-to-catch bugs in robots.
    2. Identifying the root cause of an observed fault and repairing the issue automatically during mission time.
    111

    View Slide

  112. Outline
    112
    Causal AI
    For Systems
    UNICORN
    Results
    Motivation
    Causal AI for
    Autonomy
    and Robotics
    Autonomy
    Evaluation
    at JPL

    View Slide

  113. Lessons Learned
    • Open Science, Open Source, Open Data, and Open Collaborations

    • Diverse Team, Diverse Background, Diverse Expertise

    • Close Collaborations with the JPL and Ames teams

    • Evaluation in Real Environment

    Project Website: https://nasa-raspberry-si.github.io/raspberry-si

    View Slide

  114. Lessons Learned
    • In the simulation, we can debug/develop/test our implementation without
    worrying about damaging the hardware.

    • High bandwidth and close interaction between the testbed provider (JPL
    Team) and the autonomy team (RASPBERRY-SI)

    • Faster identification of the issues

    • Resolving the issues a lot faster

    • Getting help for development

    View Slide

  115. Lessons Learned
    • Importance of risk-reduction phases
    • Integration testing
    • The interfaces and capabilities of the testbeds will evolve, and the autonomy needs to be designed at the same time.
    • The simulation and physical testbeds have different capabilities.
    • Rigorous testing, performed remotely and in interaction with the testbed providers.
    • This interaction benefits the autonomy providers as well as the testbed providers.

    View Slide

  116. Incremental Integration Testing
    116
    Components: A = Model Learning, B = Transfer Learning, C = Model Compression, D = Online Learning, E = Quantitative Planning
    Each configuration goes through a component test and an integration test.
    Case 1 (Baseline): A + E
    Case 2 (Transfer): A + B + E
    Case 3 (Compress): A + B + C + E
    Case 4 (Online): A + B + C + D + E
    Expected performance: Case 1 < Case 2 < Case 3 < Case 4
    OWLAT Code: https://github.com/nasa/ow_simulator
    Physical Autonomy Testbed: https://www1.grc.nasa.gov/wp-content/uploads/2020_ASCE_OWLAT_20191028.pdf

    View Slide

  117. Real-World Experiments using OWLAT
    117
    • Models learned from simulation
    • Adaptive system (learning + planning)
    • Sets of tests
    The adaptive system uses the machine-learning models to execute missions in the environment; mission reports
    are logged to the local machine and cloud storage, and continual learning refines the models.

    View Slide

  118. Test Coverage
    • Mission types: landing and scientific exploration -> sampling
    • Mission difficulty:
      • Rough regions for landing
      • Number of locations where a sample needs to be fetched
    • Unexpected events:
      • Changes in the environment (e.g., uneven terrain and weather)
      • Changes to the lander capabilities (e.g., deploying new sensors)
      • Faults (power, instruments, etc.)
    (A sketch of enumerating tests over these dimensions appears below.)
    118
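    The following is a minimal sketch of enumerating test scenarios over these coverage dimensions with a simple cross product. The concrete values are placeholders, not the project's actual test catalogue.

```python
# Illustrative test-scenario enumeration over mission type, difficulty, and unexpected events.
import itertools

mission_types = ["landing", "sampling"]
difficulty = ["easy", "rough_terrain", "many_sample_locations"]
unexpected_events = [None, "uneven_terrain", "weather_change",
                     "new_sensor_deployed", "power_fault", "instrument_fault"]

tests = [
    {"mission": m, "difficulty": d, "event": e}
    for m, d, e in itertools.product(mission_types, difficulty, unexpected_events)
]
print(len(tests), "test scenarios")   # 2 * 3 * 6 = 36 scenarios
```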

    View Slide

  119. Infrastructure for Automated Evaluation
    119
    Test Generator -> Test Harness (mission configuration) -> Autonomy Module (learning & planning, plan executive,
    adapter interface) <-> Testbed (environment & lander simulation), with all communication logged.
    Monitoring & logging produce logs; log analysis of those logs produces the evaluation report.
    (A minimal sketch of this harness loop appears below.)
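    Below is a minimal sketch of such a harness loop: generate tests, run the autonomy, log, and analyze. All function names and the log format are assumptions for illustration, not the project's actual infrastructure.

```python
# Illustrative automated-evaluation loop: run each generated test, log the results,
# then analyze the logs into a small evaluation report.
import json, time

def run_autonomy(test):
    # Placeholder: the real harness would launch the autonomy module against the testbed.
    return {"test": test, "completed": True, "duration_s": 42.0}

def evaluate(test_cases, log_path="evaluation_logs.jsonl"):
    with open(log_path, "w") as log:
        for test in test_cases:
            record = {"timestamp": time.time(), "result": run_autonomy(test)}
            log.write(json.dumps(record) + "\n")            # monitoring & logging
    with open(log_path) as log:                             # log analysis
        results = [json.loads(line)["result"] for line in log]
    completed = sum(r["completed"] for r in results)
    return {"tests": len(results), "completed": completed}  # evaluation report

report = evaluate([{"mission": "sampling", "event": "power_fault"}])
print(report)
```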

    View Slide

  120. Thank You!

    View Slide