Causal AI for Systems

exploreCSR workshop, Jan 2022.
https://democratizeai.org/

Pooyan Jamshidi

January 15, 2022

Transcript

  1. Causal AI for Systems
    Learning Causal Performance Models for Conducting Performance Tasks in a Principled and Transferable Fashion
    Pooyan Jamshidi

  2. What is Causal AI?

  3. Let’s start with a (fictional) story
    • Zeus is a patient waiting for a heart transplant. On 1 January, he received a new heart. Five days later, he died.

    • Imagine that we can somehow know that, had Zeus not received a heart transplant on 1 January, he would have been alive five days later.

    • All other things in his life being unchanged.

    • Now, what do you think was the cause of Zeus’s death?

    • Most people would agree that the transplant caused Zeus’s death.

    • The intervention had a causal effect.

  4. Let’s start with a (fictional) story
    • Hera received a heart transplant on 1 January. Five days later, she was alive.

    • Again, imagine we can somehow know that, had Hera not received the heart on 1 January, she would still have been alive five days later.

    • All other things in her life being unchanged.

    • The transplant did not have a causal effect on Hera’s five-day survival.

  5. Let’s collect some data!
    Exposure variable A (1: exposed, 0: unexposed); Outcome variable Y (1: death, 0: survival)

  6. Individual Causal Effect
    A contrast of the values of counterfactual outcomes; for each individual, only one of those values is observed.

  7. Population Causal Effects
    • Pr[Y^a = 1]: proportion of subjects that would have developed the outcome Y had all subjects in the population of interest received exposure value a.

    • The exposure has a causal effect in the population if Pr[Y^{a=1} = 1] ≠ Pr[Y^{a=0} = 1].

    • Unlike individual causal effects, population causal effects can sometimes be computed or, more rigorously, consistently estimated.

    Pr[Y^{a=1} = 1] − Pr[Y^{a=0} = 1] ≠ 0
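
As a minimal sketch of this definition, the snippet below computes a population causal effect from a table of counterfactual outcomes. The six subjects and their outcomes are invented for illustration; in real data only one of the two outcomes per subject would ever be observed.

```python
from fractions import Fraction

# Hypothetical counterfactual outcomes (NOT data from the slides):
# each pair is (Y^{a=0}, Y^{a=1}) for one subject, with 1 = death, 0 = survival.
potential_outcomes = [(0, 1), (1, 1), (0, 0), (1, 0), (0, 1), (0, 0)]

def counterfactual_risk(outcomes, a):
    """Pr[Y^a = 1]: share of subjects who would die under exposure value a."""
    return Fraction(sum(pair[a] for pair in outcomes), len(outcomes))

risk_exposed = counterfactual_risk(potential_outcomes, 1)    # Pr[Y^{a=1} = 1]
risk_unexposed = counterfactual_risk(potential_outcomes, 0)  # Pr[Y^{a=0} = 1]
causal_risk_difference = risk_exposed - risk_unexposed

print(risk_exposed, risk_unexposed, causal_risk_difference)
```

A non-zero difference means the exposure has a causal effect in this made-up population.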

  8. Now let’s do some cool ML
    ML models characterize association
    Pr[Y = 1|A = 1] = 7/13
    Pr[Y = 1|A = 0] = 3/7
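
The two associational risks can be reproduced from observed (A, Y) records alone. The sketch below assumes a record layout that matches the quoted ratios (13 exposed subjects with 7 deaths, 7 unexposed subjects with 3 deaths); only the two ratios come from the slide.

```python
from fractions import Fraction

# Assumed per-subject layout chosen to match the risks quoted on the slide.
# Each record is (A, Y): A = 1 exposed / 0 unexposed, Y = 1 death / 0 survival.
records = [(1, 1)] * 7 + [(1, 0)] * 6 + [(0, 1)] * 3 + [(0, 0)] * 4

def associational_risk(rows, a):
    """Pr[Y = 1 | A = a]: observed share of deaths among subjects with A = a."""
    outcomes = [y for exposure, y in rows if exposure == a]
    return Fraction(sum(outcomes), len(outcomes))

risk_given_exposed = associational_risk(records, 1)
risk_given_unexposed = associational_risk(records, 0)
print(risk_given_exposed, risk_given_unexposed)
```

Nothing in this computation uses counterfactual outcomes, which is why it measures association rather than causation.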

  9. Association is not Causation!

  10. Computing Causal Effects via Randomization
    Unlike association measures, effect measures cannot be directly computed because of missing data. However, effect measures can be computed/estimated in randomized experiments!
    • Suppose we have a (near-infinite) population and that we flip a coin for each subject in that population. We assign the subject to group 1 if the coin turns tails, and to group 2 if it turns heads.

    • Next we administer the treatment or exposure of interest (A = 1) to subjects in group 1 and placebo
    (A = 0) to those in group 2. Five days later, at the end of the study, we compute the mortality risks in
    each group, Pr[Y = 1|A = 1] and Pr[Y = 1|A = 0].

    • When subjects are randomly assigned to groups 1 and 2, the proportion of deaths among the
    exposed, Pr[Y = 1|A = 1], will be the same whether subjects in group 1 receive the exposure and
    subjects in group 2 receive placebo, or vice versa.

    • Because group membership is randomised, both groups are ‘‘comparable’’: which particular group
    got the exposure is irrelevant for the value of Pr[Y = 1|A = 1]. (The same reasoning applies to Pr[Y =
    1|A = 0].)

    • Formally, we say that both groups are exchangeable.

  11. Let’s do some math!
    Exchangeability: Pr[Y^a = 1|A = 1] = Pr[Y^a = 1|A = 0] = Pr[Y^a = 1]
    Consistency: Pr[Y^a = 1|A = a] = Pr[Y = 1|A = a]
    Therefore: Pr[Y = 1|A = a] = Pr[Y^a = 1]
    In ideal randomized experiments, association is causation!
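
A small simulation can illustrate the claim. The sketch below assumes a made-up population with a hidden risk factor and compares the associational risk Pr[Y = 1|A = 1] with the counterfactual risk Pr[Y^{a=1} = 1] under two assignment schemes: a fair coin flip, and an assignment that depends on the hidden factor. All probabilities are illustrative assumptions.

```python
import random

def simulate(assignment, n=100_000, seed=0):
    """Return (Pr[Y = 1 | A = 1], Pr[Y^{a=1} = 1]) for a simulated population."""
    rng = random.Random(seed)
    deaths_if_treated = 0          # counts Y^{a=1} = 1 over everyone
    treated = treated_deaths = 0   # counts among subjects with A = 1
    for _ in range(n):
        severe = rng.random() < 0.5                      # hidden risk factor
        y1 = rng.random() < (0.6 if severe else 0.1)     # outcome under a = 1
        deaths_if_treated += y1
        if assignment == "randomized":
            a = rng.random() < 0.5                       # fair coin flip
        else:
            a = rng.random() < (0.9 if severe else 0.1)  # sicker subjects exposed more often
        if a:
            treated += 1
            treated_deaths += y1   # consistency: Y equals Y^{a=1} when A = 1
    return treated_deaths / treated, deaths_if_treated / n

rand_assoc, rand_causal = simulate("randomized")
conf_assoc, conf_causal = simulate("confounded")
print(f"randomized: {rand_assoc:.3f} vs {rand_causal:.3f}")
print(f"confounded: {conf_assoc:.3f} vs {conf_causal:.3f}")
```

Under the coin flip the two quantities agree up to sampling noise; under the confounded assignment the associational risk overstates the counterfactual risk.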

  12. But not in non-randomized observational studies
    Still remember this?
    Pr[Y = 1|A = 1] = 7/13
    Pr[Y = 1|A = 0] = 3/7

  13. Outline
    Motivation (current section), Causal AI for Systems, UNICORN, Results, Future Directions

  14. Goal: Enable developers/users to find the right quality tradeoff

  15. Today’s most popular systems are configurable
    built

  16. [Image-only slide]

  17. Empirical observations confirm that systems are becoming increasingly configurable
    [Figure: number of parameters versus release time for Apache and Hadoop (MapReduce, HDFS)]
    [Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]

  18. Empirical observations confirm that systems are becoming increasingly configurable
    [Figure: number of parameters versus release time for Storage-A, MySQL, Apache, and Hadoop (MapReduce, HDFS), shown over an excerpt of the paper’s first page]
    [Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]

  19. Configurations determine the performance behavior
    void Parrot_setenv(. . . name,. . . value){
    #ifdef PARROT_HAS_SETENV
    my_setenv(name, value, 1);
    #else
    int name_len=strlen(name);
    int val_len=strlen(value);
    char* envs=glob_env;
    if(envs==NULL){
    return;
    }
    strcpy(envs,name);
    strcpy(envs+name_len,"=");
    strcpy(envs+name_len + 1,value);
    putenv(envs);
    #endif
    }
    #ifdef LINUX
    extern int Parrot_signbit(double x){
    [Diagram labels: PARROT_HAS_SETENV, LINUX, Speed, Energy]

  20. Misconfiguration and its Effects
    ● Misconfigurations can elicit unexpected interactions between software and hardware
    ● These can result in non-functional faults
    ○ Affecting non-functional system properties like latency, throughput, energy consumption, etc.
    ● The system doesn’t crash or exhibit an obvious misbehavior
    ● Systems are still operational but with degraded performance, e.g., high latency, low throughput, high energy consumption, high heat dissipation, or a combination of several

  21. Motivating Example
    CUDA performance issue on tx2
    When we are trying to transplant our CUDA source code from TX1 to TX2, it
    behaved strange.
    We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation,
    we think TX2 will 30% - 40% faster than TX1 at least.
    Unfortunately, most of our code base spent twice the time as TX1, in other words,
    TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs
    much slower than TX1 in many cases.
    • The user is transferring the code from one hardware platform to another.
    • The target hardware is faster than the source hardware. The user expects the code to run at least 30-40% faster.
    • The code ran 2x slower on the more powerful hardware.

  22. Motivating Example
    June 3rd
    We have already tried this. We still have high latency.
    Any other suggestions?
    June 4th
    Please do the following and let us know if it works
    1. Install JetPack 3.0
    2. Set nvpmodel=MAX-N
    3. Run jetson_clock.sh
    June 5th
    June 4th
    TX2 is pascal architecture. Please update your CMakeLists:
    + set(CUDA_STATIC_RUNTIME OFF)
    ...
    + -gencode=arch=compute_62,code=sm_62
    The user had several misconfigurations
    In Software:
    ✖ Wrong compilation flags
    ✖ Wrong SDK version
    In Hardware:
    ✖ Wrong power mode
    ✖ Wrong clock/fan settings
    The discussions took 2 days!
    Any suggestions on how to improve my performance?
    Thanks!
    How to resolve such issues faster?

  23. Today’s most popular systems are complex!
    multiscale, multi-modal, and multi-stream
    Variability Space = Configuration Space + System Architecture + Deployment Environment
    [Figure: pipeline of Video Decoder, Stream Muxer, Primary Detector, Object Tracker, and Secondary Classifier]
    # Configuration Options: 55, 86, 14, 44, 86

  24. More combinations than estimated atoms in the universe
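
A back-of-the-envelope check of this claim, using the option counts listed for the five pipeline components on the preceding slide. The sketch assumes every option is binary, which only gives a lower bound, and uses the commonly quoted order-of-magnitude estimate of 10^80 atoms in the observable universe.

```python
# Option counts listed for the five pipeline components on the preceding slide.
options_per_component = [55, 86, 14, 44, 86]
total_options = sum(options_per_component)

# Assumption: every option is binary, so this is a lower bound on the space size.
num_configurations = 2 ** total_options

# Commonly quoted order-of-magnitude estimate, used here only for comparison.
ATOMS_IN_UNIVERSE = 10 ** 80

print(total_options)
print(num_configurations > ATOMS_IN_UNIVERSE)
```

Options with more than two values, or numeric ranges, make the space larger still.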

  25. The default configuration is typically bad and the optimal configuration is noticeably better than median
    [Figure: average write latency (µs) versus throughput (ops/sec), with the default and optimal configurations marked]
    • Default is bad
    • 2X-10X faster than worst
    • Noticeably faster than median

  26. Performance behavior varies in different environments

  27. Outline
    Motivation, Causal AI for Systems (current section), UNICORN, Results, Future Directions

  28. Causal AI in Systems and Software
    Computer Architecture
    Database
    Operating Systems
    Programming Languages
    Big Data
    Software Engineering
    https://github.com/y-ding/causal-system-papers

  29. Causal Performance Model vs. Traditional Performance Model
    Traditional performance model:
    Throughput = 9 × Bitrate + 2.1 × Buffersize − 4.4 × Bitrate × Buffersize × BatchSize
    Causal performance model:
    [Figure: causal graph for the Decoder and Muxer components, in three layers]
    • Software options: Bitrate, Buffer Size, Batch Size, Enable Padding
    • Intermediate causal mechanisms: Branch Misses, Cache Misses, No. of Cycles
    • Performance objectives: Throughput, Energy
    • Annotations: causal interactions, causal paths, and functional nodes f, f1, f2, f3, f4
    Branchmisses = 2 × Bitrate + 8.1 × Buffersize + 4.1 × Bitrate × Buffersize × Cachemisses
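
To make the contrast concrete, the two equations on this slide can be written as plain functions. This is a sketch only: the coefficients are the ones shown on the slide, while the function names and the unit-free toy inputs are assumptions for illustration.

```python
def traditional_throughput(bitrate, buffer_size, batch_size):
    """Regression-style model: options map straight to the objective."""
    return 9 * bitrate + 2.1 * buffer_size - 4.4 * bitrate * buffer_size * batch_size

def branch_misses(bitrate, buffer_size, cache_misses):
    """One structural equation of the causal model: an intermediate event
    (branch misses) expressed as a function of its parents in the graph."""
    return 2 * bitrate + 8.1 * buffer_size + 4.1 * bitrate * buffer_size * cache_misses

# Unit-free toy inputs, chosen only to exercise the formulas.
throughput_value = traditional_throughput(bitrate=1, buffer_size=1, batch_size=1)
branch_value = branch_misses(bitrate=1, buffer_size=1, cache_misses=1)
print(throughput_value, branch_value)
```

The traditional model has a single equation for the objective, whereas the causal model attaches one such equation to every non-root node, so an intervention on an option propagates through the intermediate events.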

  30. Critical Issues of Correlation-based Performance Analysis
    • Performance influence models could produce unreliable predictions.

    • Performance influence models could produce unstable predictions across environments and in the presence of measurement noise.

    • Performance influence models could produce incorrect explanations.

  31. Why Causal Inference? (Simpson’s Paradox)
    • Observation: increasing GPU memory increases latency.
    • Counterintuitive! More GPU memory usage should reduce latency, not increase it.
    • Any ML or statistical model built on this data will be incorrect!

  32. Why Causal Inference? (Simpson’s Paradox)
    • Segregate data on swap memory
    • Available swap memory is reducing
    • The GPU borrows memory from swap for some intensive workloads. Other host processes may reduce the available swap, leaving little for the GPU to use.
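
The reversal can be reproduced with a few synthetic points. In the sketch below the measurements are invented: within each swap level, more GPU memory lowers latency, but lower swap raises both GPU memory use and latency, so the pooled trend points the other way.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic measurements (made up): available swap (GB) -> [(gpu_mem_gb, latency_ms)].
measurements = {
    4: [(1, 9), (2, 8), (3, 7)],
    2: [(4, 16), (5, 15), (6, 14)],
    1: [(7, 23), (8, 22), (9, 21)],
}

pooled = [point for points in measurements.values() for point in points]
pooled_slope = ols_slope([gpu for gpu, _ in pooled], [lat for _, lat in pooled])
stratum_slopes = {
    swap: ols_slope([gpu for gpu, _ in pts], [lat for _, lat in pts])
    for swap, pts in measurements.items()
}
print(pooled_slope)    # positive: pooled data suggests GPU memory hurts latency
print(stratum_slopes)  # negative within every swap level
```

A model fitted to the pooled data would learn the positive slope, which is the wrong sign for every individual swap level.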

  33. Why Causal Inference?
    • Real-world problems can have 100s if not 1000s of interacting configuration options!
    • Manually understanding and evaluating each combination is impractical, if not impossible.

  34. Causal Performance Models
    Express the relationships between interacting variables as a causal graph
    [Figure: causal graph over Load, Swap Mem., GPU Mem., and Latency; the legend distinguishes configuration options, system events, non-functional properties, and the direction(s) of causality]
    • Latency is affected by GPU Mem., which in turn is influenced by swap memory
    • External factors like resource pressure also affect swap memory

  35. Causal Performance Models
    [Figure: causal graph over Load, Swap Mem., GPU Mem., and Latency]
    • How to construct this causal graph?
    • If there is a fault in latency, how to diagnose and fix it?

  36. [Image-only slide]

  37. Outline
    Motivation, Causal AI for Systems, UNICORN (current section), Results, Future Directions

  38. UNICORN: Our Causal AI for Systems Method
    • Build a causal performance model that captures the interactions among options in the variability space using observational performance data.

    • Iteratively evaluate and update the causal performance model.
    • Perform downstream performance tasks, such as performance debugging and optimization, using causal reasoning.

  39. UNICORN: Our Causal AI for Systems Method
    System under study: Software: DeepStream; Middleware: TF, TensorRT; Hardware: Nvidia Xavier; Configuration: Default
    [Figure: latency (ms) as a function of number of counters and number of splitters, cubic interpolation over finer grid]
    1- Specify Performance Query (QoS: Th > 40/s; Observed: Th < 30/s ± 5/s)
    2- Learn Causal Performance Model (from performance data)
    3- Translate Perf. Query to Causal Queries (e.g., “Estimate probability of satisfying QoS if BufferSize is set to 6k?” becomes P(Th > 40/s|do(Buffersize = 6k)))
    4- Estimate Causal Queries (query engine)
    5- Update Causal Performance Model: while the budget is not exhausted, measure performance of the configuration(s) that maximize information gain
    Downstream tasks: Performance Debugging, Performance Optimization
    Example performance queries:
    • What is the root cause of observed perf. fault?
    • How do I fix the misconfig.?
    • How can I improve throughput without sacrificing accuracy?
    • How do I understand perf behavior?
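
As an illustration of what estimating a query such as P(Th > 40/s|do(Buffersize = 6k)) involves, the sketch below uses a toy structural causal model and Monte Carlo sampling. Every mechanism and coefficient here is invented; this is not the UNICORN query engine, only a minimal example of the do-operator: the intervened variable is set directly and its usual cause is ignored.

```python
import random

def sample_throughput(rng, do_buffer_size=None):
    """Draw one throughput sample from a toy structural causal model.

    All mechanisms and coefficients are invented for illustration.
    """
    load = rng.random()                                  # exogenous system load
    if do_buffer_size is None:
        buffer_size = 1000 + 5000 * (1 - load)           # observational regime
    else:
        buffer_size = do_buffer_size                     # do(): cut the link from load
    cache_misses = 50 - 0.005 * buffer_size + 20 * load + rng.gauss(0, 2)
    return 60 - 0.8 * cache_misses + rng.gauss(0, 2)     # frames per second

def prob_qos(do_buffer_size, threshold=40, n=20_000, seed=0):
    """Monte Carlo estimate of P(Th > threshold | do(Buffersize = value))."""
    rng = random.Random(seed)
    hits = sum(sample_throughput(rng, do_buffer_size) > threshold for _ in range(n))
    return hits / n

p_6k = prob_qos(6000)
p_1k = prob_qos(1000)
print(p_6k, p_1k)
```

Comparing the two estimates answers the "what happens if I set BufferSize to 6k" question without conflating it with the load conditions under which large buffers were historically observed.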

  40. UNICORN: Our Causal AI for Systems Method
    [Workflow diagram repeated from slide 39]

  41. UNICORN: Our Causal AI for Systems Method
    [Workflow diagram repeated from slide 39]

  42. Learning Causal Performance Model
    Performance data (one row per measured configuration):
          Bitrate (bits/s)   Enable Padding   …   Cache Misses   …   Throughput (fps)
    c1    1k                 1                …   42m            …   7
    c2    2k                 1                …   32m            …   22
    …     …                  …                …   …              …   …
    cn    5k                 0                …   12m            …   25
    1- Recovering the Skeleton: fully connected graph given constraints (e.g., no connections btw configuration options)
    2- Pruning Causal Structure: statistical independence tests
    3- Orienting Causal Relations: orientation rules & measures (entropy) + structural constraints (colliders, v-structures)
    [Figure: the causal graph at each step, over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No of Cycles, FPS, and Energy]
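
The first two steps can be sketched on synthetic data. The example below is a simplified stand-in, not the algorithm used in the talk: it starts from a fully connected skeleton without option-to-option edges, then prunes an edge when the pair looks independent either marginally or given one other variable, using a fixed correlation threshold instead of a proper statistical test. The variable names and the data-generating chain are assumptions.

```python
import itertools
import random

def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def partial_corr(data, x, y, z):
    """Correlation of x and y after linearly controlling for z."""
    rxy, rxz, ryz = corr(data[x], data[y]), corr(data[x], data[z]), corr(data[y], data[z])
    return (rxy - rxz * ryz) / ((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5

# Synthetic data from a known chain: options -> cache_misses -> throughput.
rng = random.Random(0)
data = {"bitrate": [], "buffer_size": [], "cache_misses": [], "throughput": []}
for _ in range(2000):
    bitrate, buffer_size = rng.gauss(0, 1), rng.gauss(0, 1)
    cache_misses = bitrate + buffer_size + rng.gauss(0, 0.5)
    data["bitrate"].append(bitrate)
    data["buffer_size"].append(buffer_size)
    data["cache_misses"].append(cache_misses)
    data["throughput"].append(-cache_misses + rng.gauss(0, 0.5))

options = {"bitrate", "buffer_size"}
# Step 1: fully connected skeleton, minus edges between configuration options.
edges = {frozenset(p) for p in itertools.combinations(data, 2) if not set(p) <= options}

# Step 2: prune edges whose endpoints look (conditionally) independent.
THRESHOLD = 0.1  # crude stand-in for a significance test
for edge in sorted(edges, key=sorted):
    x, y = sorted(edge)
    others = [v for v in data if v not in edge]
    if abs(corr(data[x], data[y])) < THRESHOLD or any(
        abs(partial_corr(data, x, y, z)) < THRESHOLD for z in others
    ):
        edges.discard(edge)

print(sorted(sorted(e) for e in edges))
```

On this data the direct option-to-throughput edges are pruned because they vanish once cache misses are controlled for; orienting the remaining edges (step 3) is not covered by this sketch.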

  43. Performance measurement
    Configuration space: ℂ = O_1 × O_2 × ⋯ × O_19 × O_20, with options such as dead code removal, constant folding, loop unrolling, and function inlining
    Example configuration: c_1 = 0 × 0 × ⋯ × 0 × 1, c_1 ∈ ℂ
    Pipeline: Program → Compiler (e.g., SaC, LLVM) → Compiled Code → Instrumented Binary → Hardware, with Configure, Compile, and Deploy steps
    Non-functional (measurable/quantifiable) aspects:
    • Compile time: f_c(c_1) = 11.1 ms
    • Execution time: f_e(c_1) = 110.3 ms
    • Energy: f_en(c_1) = 100 mWh

  44. Our setup for performance measurements

  45. Hardware platforms in our experiments
    We use different types of hardware platforms because they exhibit different behaviors due to differences in resources, microarchitecture, etc.
    • AWS DeepLens: cloud-connected device
    • System on Chip (SoC)
    • Microcontrollers (MCUs)

  46. Measuring performance for systems involves lots of challenges
    Each hardware platform requires a different form of instrumentation, and obtaining clean measurements with the least amount of noise is the most challenging part of our experiments.

  47. Learning Causal Performance Model
    [Diagram repeated from slide 42]

  48. Learning Causal Performance Model
    [Diagram repeated from slide 42]

  49. Learning Causal Performance Model
    [Diagram repeated from slide 42]

  50. Causal Performance Model
    [Figure: causal graph for the Decoder and Muxer components, in three layers]
    • Software options: Bitrate, Buffer Size, Batch Size, Enable Padding
    • Perf. events: Branch Misses, Cache Misses, No of Cycles
    • Performance objectives: Throughput, Energy
    • Annotations: causal interactions, causal paths, and functional nodes f
    Branchmisses = 2 × Bitrate + 8.1 × Buffersize + 4.1 × Bitrate × Buffersize × Cachemisses

  51. UNICORN: Our Causal AI for Systems Method
    [Workflow diagram repeated from slide 39]

  52. Causal Debugging: An example of downstream performance task
    Objective
    Diagnose and fix the root cause of misconfigurations that cause non-functional faults
    Approach
    • Use causal models to model various cross-stack configuration interactions; and
    • Counterfactual reasoning to recommend fixes for these misconfigurations

  53. Causal Debugging
    • What is the root cause of my fault?
    • How do I fix my misconfigurations to improve performance?
    Workflow: Misconfiguration → Observational Data (about 25 sample configurations, training data) → Build Causal Graph → Extract Causal Paths → Rank Paths → Counterfactual Queries → Best Query → Fault fixed?
    • Counterfactual queries are what-if questions, e.g., what if the configuration option X was set to a value ‘x’?
    • Fault fixed? Yes: done. No: update observational data and repeat.

  54. Extracting Causal Paths from the Causal Model
    [Causal debugging workflow diagram repeated from slide 53]

  55. Extracting Causal Paths from the Causal Model
    Problem
    ✕ In real-world cases, this causal graph can be very complex
    ✕ It may be intractable to reason over the entire graph directly
    Solution
    ✓ Extract paths from the causal graph
    ✓ Rank them based on their Average Causal Effect on latency, etc.
    ✓ Reason over the top K paths

  56. Extracting Causal Paths from the Causal Model
    Extract paths from the causal graph. Each path:
    • Always begins with a configuration option or a system event
    • Always terminates at a performance objective
    [Figure: causal graph over Load, Swap Mem., GPU Mem., and Latency, alongside the paths extracted from it]
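
Path extraction can be sketched as a depth-first traversal. The edges below are an assumption based on the bullets of the earlier "Causal Performance Models" slide (load affects swap memory, swap memory affects GPU memory, GPU memory affects latency); the node roles are also assumed rather than read off the figure.

```python
# Assumed edges (hypothetical reconstruction, not read off the figure).
causal_graph = {
    "load": ["swap_mem"],
    "swap_mem": ["gpu_mem"],
    "gpu_mem": ["latency"],
    "latency": [],
}
# Assumed node roles.
start_nodes = {"load", "swap_mem", "gpu_mem"}   # configuration options or system events
objectives = {"latency"}                        # performance objectives

def extract_paths(graph, node, path=()):
    """Depth-first enumeration of directed paths that end at an objective."""
    path = path + (node,)
    if node in objectives:
        return [path]
    found = []
    for child in graph[node]:
        if child not in path:                   # guard against cycles
            found.extend(extract_paths(graph, child, path))
    return found

causal_paths = [p for start in sorted(start_nodes) for p in extract_paths(causal_graph, start)]
for p in causal_paths:
    print(" -> ".join(p))
```

Every returned path starts at an option or system event and ends at the performance objective, matching the two rules on the slide.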

  57. Ranking Causal Paths from the Causal Model
    ● There may be too many causal paths
    ● We need to select the most useful ones
    ● Compute the Average Causal Effect (ACE) of each pair of neighbors in a path
    [Figure: example path through Swap Mem., GPU Mem., and Latency]
    $\mathrm{ACE}(\text{GPU Mem.}, \text{Swap}) = \frac{1}{N} \sum_{a, b \in Z} \left( \mathbb{E}[\text{GPU Mem.} \mid do(\text{Swap} = b)] - \mathbb{E}[\text{GPU Mem.} \mid do(\text{Swap} = a)] \right)$
    ● $\mathbb{E}[\text{GPU Mem.} \mid do(\text{Swap} = b)]$: expected value of GPU Mem. when we artificially intervene by setting Swap to the value b
    ● $\mathbb{E}[\text{GPU Mem.} \mid do(\text{Swap} = a)]$: expected value of GPU Mem. when we artificially intervene by setting Swap to the value a
    ● The average is taken over all permitted values of Swap memory.
    ● If this difference is large, then small changes to Swap Mem. will cause large changes to GPU Mem.
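
A minimal numeric sketch of the ACE computation follows. Two assumptions are made that the slide does not spell out: the sum runs over unordered pairs a < b of permitted values, and absolute differences are averaged (summing signed differences over all ordered pairs would cancel to zero). The interventional expectation is a hypothetical linear function standing in for an estimate from the learned causal model.

```python
from itertools import combinations

# Hypothetical interventional expectation E[GPU Mem. | do(Swap = s)], in GB.
def expected_gpu_mem(swap_gb):
    return 0.5 + 0.25 * swap_gb

def average_causal_effect(expectation, permitted_values):
    """Mean absolute change in the expectation over all pairs a < b."""
    pairs = list(combinations(sorted(permitted_values), 2))
    total = sum(abs(expectation(b) - expectation(a)) for a, b in pairs)
    return total / len(pairs)

ace = average_causal_effect(expected_gpu_mem, [1, 2, 4])
print(ace)
```

Here N is the number of value pairs, so the result is the typical change in GPU memory caused by moving Swap from one permitted value to another.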

  58. Ranking Causal Paths from the Causal Model
    ● Average the ACE of all pairs of adjacent nodes in the path
    ● Rank paths from highest path ACE (PACE) score to the lowest
    ● Use the top K paths for subsequent analysis
    For a path Z → X → Y:
    $\mathrm{PACE}(Z, Y) = \frac{1}{2} \left( \mathrm{ACE}(Z, X) + \mathrm{ACE}(X, Y) \right)$
    The sum runs over all pairs of adjacent nodes in the causal path.
    [Figure: example path through Swap Mem., GPU Mem., and Latency]
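
Ranking by PACE can be sketched in a few lines. The per-edge ACE values and the candidate paths below are made-up numbers for illustration; only the averaging and top-K selection follow the slide.

```python
# Hypothetical ACE values for adjacent node pairs (illustrative numbers only).
edge_ace = {
    ("swap_mem", "gpu_mem"): 0.8,
    ("gpu_mem", "latency"): 0.6,
    ("load", "swap_mem"): 0.2,
}

paths = [
    ("swap_mem", "gpu_mem", "latency"),
    ("load", "swap_mem", "gpu_mem", "latency"),
    ("gpu_mem", "latency"),
]

def path_ace(path):
    """PACE: average of the ACE over consecutive node pairs in the path."""
    pairs = list(zip(path, path[1:]))
    return sum(edge_ace[pair] for pair in pairs) / len(pairs)

def top_k_paths(candidates, k):
    """Rank paths from highest to lowest PACE and keep the first k."""
    return sorted(candidates, key=path_ace, reverse=True)[:k]

best = top_k_paths(paths, k=2)
print(best)
```

For a three-node path the average has two terms, which recovers the 1/2 factor in the formula above.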

  59. Diagnosing and Fixing the Faults
    [Causal debugging workflow diagram repeated from slide 53]

  60. Diagnosing and Fixing the Faults
    ● Counterfactual inference asks “what if” questions about changes to the misconfigurations
    Example
    Given that my current swap memory is 2 Gb and I have high latency, what is the probability of having low latency if swap memory were increased to 4 Gb?
    We are interested in the scenario where:
    • We hypothetically have low latency;
    Conditioned on the following events:
    • We hypothetically set the new Swap memory to 4 Gb
    • Swap Memory was initially set to 2 Gb
    • We observed high latency when Swap was set to 2 Gb
    • Everything else remains the same

  61. Diagnosing and Fixing the Faults
    Original path: Load, Swap, GPU Mem., Latency
    Path after proposed change: Load, Swap = 4 Gb, GPU Mem., Latency (Low?)
    • Remove incoming edges. Assume no external influence.
    • Modify to reflect the hypothetical scenario
    Use both the models to compute the answer to the counterfactual question

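  62. Diagnosing and Fixing the Faults
    Original path: Load, Swap, GPU Mem., Latency
    Path after proposed change: Load, Swap = 4 Gb, GPU Mem., Latency
    $\text{Potential} = P(\widehat{\text{Latency}} = \text{low} \mid \widehat{\text{Swap}} = 4\,\text{Gb},\ \text{Swap} = 2\,\text{Gb},\ \text{Latency}_{\text{swap} = 2\,\text{Gb}} = \text{high},\ U)$
    • $\widehat{\text{Latency}} = \text{low}$: we expect a low latency
    • $\widehat{\text{Swap}} = 4\,\text{Gb}$: the Swap is now 4 Gb
    • $\text{Swap} = 2\,\text{Gb}$: the Swap was initially 2 Gb
    • $\text{Latency}_{\text{swap} = 2\,\text{Gb}} = \text{high}$: the latency was high
    • $U$: everything else stays the same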

    View Slide

  63. Diagnosing and Fixing the Faults
    $\text{Potential} = P(\widehat{\text{outcome}} = \text{good} \mid \widehat{\text{change}},\ \text{outcome}_{\neg \text{change}} = \text{bad},\ \neg \text{change},\ U)$
    Probability that the outcome is good after a change, conditioned on the past
    $\text{Control} = P(\widehat{\text{outcome}} = \text{bad} \mid \neg \text{change},\ U)$
    Probability that the outcome was bad before the change
    Individual Treatment Effect = Potential − Control
    If this difference is large, then our change is useful

    View Slide

  64. Diagnosing and Fixing the Faults
    [Figure: causal graph over Swap Mem., GPU Mem., and Latency.]
    1. Take the top K paths.
    2. Enumerate all possible changes: set every configuration option in the path to all permitted values.
    3. Compute ITE(change) for each change. This is inferred from observed data, so it is very cheap.
    4. Select the change with the largest ITE.
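    The enumerate-and-rank step can be sketched as follows. This is an illustration only: the options, their permitted values, the latency mechanism, and the prior over the unobserved U are hypothetical, and Control is computed here under the prior over U, which is an assumption about how the slide's definition is evaluated.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

# Hypothetical toy setup; none of these numbers come from the talk.
PRIOR_U = {0: Fraction(1, 2), 1: Fraction(3, 10), 2: Fraction(1, 5)}
FAULTY = {"swap_gb": 2, "boost": 0}              # configuration with the fault
PERMITTED = {"swap_gb": [2, 4, 8], "boost": [0, 1]}

def is_bad(config: Dict[str, int], u: int) -> bool:
    """Assumed mechanism: latency of 80 ms or more is a bad outcome."""
    latency = 60 + 40 * u - 10 * config["swap_gb"] - 20 * config["boost"]
    return latency >= 80

def control() -> Fraction:
    """P(outcome = bad | no change), marginalized over the prior on U."""
    return sum(p for u, p in PRIOR_U.items() if is_bad(FAULTY, u))

def potential(option: str, value: int) -> Fraction:
    """P(outcome = good after the change | outcome was bad without it)."""
    consistent = {u: p for u, p in PRIOR_U.items() if is_bad(FAULTY, u)}
    total = sum(consistent.values())
    changed = dict(FAULTY, **{option: value})
    return sum(p / total for u, p in consistent.items()
               if not is_bad(changed, u))

def rank_changes() -> List[Tuple[Tuple[str, int], Fraction]]:
    """Score every single-option change by ITE = Potential - Control."""
    scored = []
    for option, values in PERMITTED.items():
        for value in values:
            if value == FAULTY[option]:
                continue  # not a change
            scored.append(((option, value), potential(option, value) - control()))
    return sorted(scored, key=lambda item: item[1], reverse=True)

for change, ite in rank_changes():
    print(change, ite)
```

    No new measurements are taken inside `rank_changes`; every score is computed from the model, which is why this step is cheap compared with running the system.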

    View Slide

  65. Diagnosing and Fixing the Faults
    1. Apply the change with the largest ITE and measure performance.
    2. Fault fixed? If yes, stop.
    3. If no:
    • Add to observational data
    • Update causal model
    • Repeat…

    View Slide

  66. UNICORN: Our Causal AI for Systems Method
    Software: DeepStream; Middleware: TF, TensorRT; Hardware: Nvidia Xavier; Configuration: Default
    [Figure: latency (ms) as a function of number of counters and number of splitters; cubic interpolation over finer grid.]
    1- Specify Performance Query
    QoS: Th > 40/s; Observed: Th < 30/s ± 5/s
    2- Learn Causal Perf. Model (Performance Data → Causal Model)
    3- Translate Performance Query to Causal Queries (Performance Debugging, Performance Optimization)
    • What is the root-cause of observed perf. fault?
    • How do I fix the misconfig.?
    • How can I improve throughput without sacrificing accuracy?
    • How do I understand perf behavior?
    4- Estimate Causal Queries (Query Engine)
    Estimate probability of satisfying QoS if BufferSize is set to 6k: P(Th > 40/s | do(Buffersize = 6k))
    5- Update Causal Performance Model
    Measure performance of the configuration(s) that maximize information gain
    Budget Exhausted? (Yes / No)
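    A query such as P(Th > 40/s | do(Buffersize = 6k)) differs from the plain conditional P(Th > 40/s | Buffersize = 6k) whenever something else influences both the buffer size and the throughput. The sketch below uses the standard backdoor adjustment on a hypothetical data set in which workload is assumed to be the only confounder; it illustrates the kind of query involved and is not a description of UNICORN's own estimator.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical measurements: (workload, buffer_size, meets_qos).
DATA = (
    [("light", "6k", True)] * 3 + [("light", "6k", False)] * 1
    + [("light", "2k", True)] * 1 + [("light", "2k", False)] * 1
    + [("heavy", "6k", True)] * 1 + [("heavy", "6k", False)] * 1
    + [("heavy", "2k", False)] * 4
)

def p_qos_given(buffer_size: str) -> Fraction:
    """Observational estimate: P(QoS met | buffer_size)."""
    rows = [ok for _, b, ok in DATA if b == buffer_size]
    return Fraction(sum(rows), len(rows))

def p_qos_do(buffer_size: str) -> Fraction:
    """Interventional estimate via backdoor adjustment over workload:
    sum_w P(QoS met | buffer_size, w) * P(w)."""
    total = Fraction(0)
    for workload, count in Counter(w for w, _, _ in DATA).items():
        rows = [ok for w, b, ok in DATA if w == workload and b == buffer_size]
        total += Fraction(sum(rows), len(rows)) * Fraction(count, len(DATA))
    return total

print(p_qos_given("6k"), p_qos_do("6k"))
```

    In this made-up data the large buffer is used more often under light workload, so the observational estimate overstates how often the QoS would be met if the buffer size were set to 6k for every workload.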

    View Slide

  67. Active Learning for Updating Causal Performance Model
    [Figure: causal graph over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No of Cycles, FPS, and Energy, shown before and after the update.]
    1- Evaluate Candidate Interventions
    2- Determine & Perform next Perf Measurement
    3- Updating Causal Model (Performance Data)
    Model averaging
    Expected change in belief & KL; Causal effects on objectives
    Interventions on Hardware, Workload, and Kernel Options
    Option/Event/Obj   Values
    Bitrate            1k
    Buffer Size        20k
    Batch Size         10
    Enable Padding     1
    Branch Misses      24m
    Cache Misses       42m
    No of Cycles       73b
    FPS                31/s
    Energy             42J
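    One way to read “expected change in belief & KL” is as an expected information gain: for each candidate intervention, average the KL divergence between the updated and the current belief over candidate causal models, weighted by how likely each outcome is. The sketch below implements that reading for two hypothetical candidate models and a binary outcome; the option names, the probabilities, and the choice of acquisition function are assumptions, not UNICORN's documented procedure.

```python
import math
from typing import Dict, List

def expected_information_gain(prior: List[float],
                              likelihoods: List[List[float]]) -> float:
    """Expected KL(posterior || prior) over candidate models.

    prior[m]          = current belief in candidate causal model m
    likelihoods[m][y] = P(outcome y | model m) under the intervention
    """
    gain = 0.0
    for y in range(len(likelihoods[0])):
        # Model-averaged probability of observing outcome y.
        p_y = sum(p_m * lik[y] for p_m, lik in zip(prior, likelihoods))
        if p_y == 0:
            continue
        for p_m, lik in zip(prior, likelihoods):
            posterior = p_m * lik[y] / p_y
            if posterior > 0:
                gain += p_y * posterior * math.log(posterior / p_m)
    return gain

# Two hypothetical candidate models with equal belief.
PRIOR = [0.5, 0.5]
# Hypothetical predictions: both models agree about intervening on Bitrate,
# but disagree completely about intervening on Buffer Size.
CANDIDATES: Dict[str, List[List[float]]] = {
    "Bitrate": [[0.5, 0.5], [0.5, 0.5]],
    "Buffer Size": [[1.0, 0.0], [0.0, 1.0]],
}

gains = {name: expected_information_gain(PRIOR, lik)
         for name, lik in CANDIDATES.items()}
best = max(gains, key=gains.get)
print(gains, best)
```

    An intervention on which all candidate models agree teaches nothing, so the next measurement goes to the option whose outcome would best discriminate between the models.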

    View Slide

  70. Benefits of Causal Reasoning for System Performance Analysis

    View Slide

  71. There are two fundamental benefits that we get from our “Causal AI for Systems” methodology
    1. We learn one central (causal) performance model from the data across different performance tasks:
    • Performance understanding
    • Performance optimization
    • Performance debugging and repair
    • Performance prediction for different environments (e.g., canary → production)
    2. The causal model is transferable across environments.
    • We observed Sparse Mechanism Shift in systems too!
    • Alternative non-causal models (e.g., regression-based models for performance tasks) are not transferable, as they rely on the i.i.d. setting.

    View Slide

  72. Questions of this nature require precise mathematical language lest they be misleading.
    Here we are simultaneously conditioning on two values of GPU memory growth (i.e., $\hat{X} = 0.66$ and $X = 0.33$). Traditional machine learning approaches cannot handle such expressions. Instead, we must resort to causal models to compute them.

    View Slide

  73. Difference between statistical (left) and causal (right) models on a given set of three variables
    While a statistical model specifies a single probability distribution, a causal model represents a set of distributions, one for each possible intervention.
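    The distinction can be made concrete with a three-variable toy model. The sketch below defines a hypothetical chain X → Y → Z with binary variables and returns one joint distribution per intervention; the mechanisms and probabilities are invented for illustration.

```python
from fractions import Fraction
from itertools import product
from typing import Dict, Optional, Tuple

P_X1 = Fraction(1, 2)    # hypothetical P(X = 1)
P_FLIP = Fraction(1, 4)  # hypothetical chance that a mechanism flips its input

def joint(do: Optional[Dict[str, int]] = None) -> Dict[Tuple[int, int, int], Fraction]:
    """Joint distribution over (X, Y, Z) for the chain X -> Y -> Z.

    Passing do={"Y": 1} replaces the mechanism of Y with the constant 1,
    which yields a different distribution from the observational one.
    """
    do = do or {}
    dist: Dict[Tuple[int, int, int], Fraction] = {}
    for x0, flip_y, flip_z in product((0, 1), repeat=3):
        weight = ((P_X1 if x0 else 1 - P_X1)
                  * (P_FLIP if flip_y else 1 - P_FLIP)
                  * (P_FLIP if flip_z else 1 - P_FLIP))
        x = do.get("X", x0)
        y = do.get("Y", x ^ flip_y)  # Y copies X unless flipped
        z = do.get("Z", y ^ flip_z)  # Z copies Y unless flipped
        dist[(x, y, z)] = dist.get((x, y, z), Fraction(0)) + weight
    return dist

def p_x1_given_y1(dist: Dict[Tuple[int, int, int], Fraction]) -> Fraction:
    """P(X = 1 | Y = 1) computed from a joint distribution."""
    p_y1 = sum(p for (x, y, z), p in dist.items() if y == 1)
    p_x1_y1 = sum(p for (x, y, z), p in dist.items() if x == 1 and y == 1)
    return p_x1_y1 / p_y1

print(p_x1_given_y1(joint()))          # observing Y = 1 is evidence about X
print(p_x1_given_y1(joint({"Y": 1})))  # forcing Y = 1 says nothing about X
```

    The single function `joint` stands for a whole family of distributions, one per choice of `do`, whereas a purely statistical model would only describe `joint()`.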

    View Slide

  74. Independent Causal Mechanisms (ICM) Principle

    View Slide

  75. Sparse Mechanism Shift (SMS) Hypothesis
    Example of the SMS hypothesis, where an intervention (which may or may not be intentional/observed) changes the position of one finger, and as a consequence, the object falls. The change in pixel space is entangled (or distributed), in contrast to the change in the causal model.

    View Slide

  76. NeurIPS 2020 (ML For Systems), Dec 12th, 2020
    https://arxiv.org/pdf/2010.06061.pdf
    https://github.com/softsys4ai/CADET

    View Slide

  77. The new version of CADET, called UNICORN, was accepted at EuroSys 2022.
    https://github.com/softsys4ai/UNICORN

    View Slide

  78. Outline
    Motivation
    Causal AI For Systems
    UNICORN
    Results
    Future Directions

    View Slide

  79. Results: Case Study
    When we are trying to transplant our CUDA source code from TX1 to TX2, it
    behaved strange.
    We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation,
    we think TX2 will 30% - 40% faster than TX1 at least.
    Unfortunately, most of our code base spent twice the time as TX1, in other words,
    TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs
    much slower than TX1 in many cases.
    • The user is porting the code from one hardware platform to another.
    • The target hardware is faster than the source hardware.
    • The user expects the code to run at least 30-40% faster.
    • The code ran 2x slower on the more powerful hardware.

    View Slide

  80. Results: Case Study
    Embedded real-time stereo estimation (same source code on both boards)
                 Nvidia TX1            Nvidia TX2 (more powerful)
    CPU          4 cores, 1.3 GHz      6 cores, 2 GHz
    GPU          128 cores, 0.9 GHz    256 cores, 1.3 GHz
    Memory       4 GB, 25 GB/s         8 GB, 58 GB/s
    Throughput   17 FPS                4 FPS (4× slower!)

    View Slide

  81. Results: Case Study
    Configuration UNICORN Decision Tree Forum
    CPU Cores ✓ ✓ ✓
    CPU Freq. ✓ ✓ ✓
    EMC Freq. ✓ ✓ ✓
    GPU Freq. ✓ ✓ ✓
    Sched. Policy ✓
    Sched. Runtime ✓
    Sched. Child Proc ✓
    Dirty Bg. Ratio ✓
    Drop Caches ✓
    CUDA_STATIC_RT ✓ ✓ ✓
    Swap Memory ✓
    Results                      UNICORN    Decision Tree   Forum
    Throughput (on TX2)          26 FPS     20 FPS          23 FPS
    Throughput Gain (over TX1)   53 %       21 %            39 %
    Time to resolve              24 min.    3½ hrs.         2 days
    The user expected a 30-40% gain.
    • Finds the root-causes accurately
    • No unnecessary changes
    • Better improvements than forum’s recommendation
    • Much faster
    View Slide

  82. Evaluation: Experimental Setup
    Hardware Systems
                 Nvidia TX1            Nvidia TX2            Nvidia Xavier
    CPU          4 cores, 1.3 GHz      6 cores, 2 GHz        8 cores, 2.26 GHz
    GPU          128 cores, 0.9 GHz    256 cores, 1.3 GHz    512 cores, 1.3 GHz
    Memory       4 GB, 25 GB/s         8 GB, 58 GB/s         32 GB, 137 GB/s
    Software Systems
    • Xception: image recognition (50,000 test images)
    • DeepSpeech: voice recognition (5 sec. audio clip)
    • BERT: sentiment analysis (10,000 IMDb reviews)
    • x264: video encoder (11 MB, 1080p video)
    Configuration Space
    • 30 configuration options: 10 software, 10 OS/kernel, 10 hardware
    • 17 system events

    View Slide

  83. Evaluation: Data Collection
    ● For each software/hardware combination, create a benchmark dataset
    ○ Exhaustively set each configuration option to all permitted values.
    ○ For continuous options (e.g., GPU memory), sample 10 equally spaced values in [min, max]
    ● Measure the latency, energy consumption, and heat dissipation
    ○ Repeat 5x and average
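    The sampling and averaging rules above can be sketched as two small helpers. The function names and the stand-in measurement function are hypothetical; a real benchmark would run the system under test instead.

```python
from statistics import mean
from typing import Callable, List

def equally_spaced(lo: float, hi: float, n: int = 10) -> List[float]:
    """Return n equally spaced values covering [lo, hi], endpoints included."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def averaged_measurement(measure: Callable[[float], float],
                         value: float, repeats: int = 5) -> float:
    """Measure the same setting several times and return the average."""
    return mean(measure(value) for _ in range(repeats))

def fake_latency_ms(setting: float) -> float:
    """Stand-in for a real measurement, used only to make the sketch runnable."""
    return 2.0 * setting

grid = equally_spaced(0.0, 9.0)
benchmark = {value: averaged_measurement(fake_latency_ms, value) for value in grid}
print(benchmark)
```

    Repeating each measurement and averaging reduces the effect of run-to-run noise before the data reaches the causal model.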
    [Figure: fault categories: Latency Faults, Energy Faults, and Multiple Faults.]

    View Slide

  84. Evaluation: Ground Truth
    ● For each performance fault:
    ○ Manually investigate the root-cause
    ○ “Fix” the misconfigurations
    ● A “fix” implies the configuration no longer
    has tail performance
    ○ User defined benchmark (i.e., 10th percentile)
    ○ Or some QoS/SLA benchmark
    ● Record the configurations that were
    changed
    [Figure: fault categories: Latency Faults, Energy Faults, and Multiple Faults.]

    View Slide

  85. Evaluation: Metrics
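    Relevance Scores
    Repair Quality
    $\text{Gain} = \frac{\text{NFP}_{\text{fault}} - \text{NFP}_{\text{repair}}}{\text{NFP}_{\text{fault}}} \times 100$
    NFP = Non-Functional Property (e.g., latency, energy, etc.); NFP_fault is the faulty value and NFP_repair is the repair value.
    The larger the gain, the better the repair.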
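    As a quick worked instance of the gain metric, the sketch below evaluates it for a hypothetical latency fault. The function name and the numbers are illustrative, and the formula as written assumes a property where lower is better, such as latency or energy.

```python
def gain_percent(nfp_fault: float, nfp_repair: float) -> float:
    """Relative improvement of the repaired value over the faulty value, in percent."""
    if nfp_fault == 0:
        raise ValueError("faulty value must be non-zero")
    return (nfp_fault - nfp_repair) / nfp_fault * 100

# Hypothetical example: latency drops from 200 ms (fault) to 150 ms (repair).
print(gain_percent(200.0, 150.0))
```

    A repair that makes the property worse yields a negative gain, which makes regressions easy to spot.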

    View Slide

  86. Results: Research Questions
    RQ1: How does UNICORN perform compared to model-based diagnostics?
    RQ2: How does UNICORN perform compared to search-based optimization?

    View Slide

  87. Results: Research Question 1 (single objective)
    RQ1: How does UNICORN perform compared to model-based diagnostics?
    Takeaways
    • Finds the root-causes accurately (more accurate than ML-based methods)
    • Better gain
    • Much faster (up to 20x faster)

    View Slide

  88. Results: Research Question 1 (multi-objective)
    RQ1: How does UNICORN perform compared to model-based diagnostics?
    Multiple faults in latency & energy usage
    Takeaways
    • No deterioration of other performance objectives

    View Slide

  89. Results: Research Questions
    RQ1: How does UNICORN perform compared to model-based diagnostics?
    RQ2: How does UNICORN perform compared to search-based optimization?

    View Slide

  90. Results: Research Question 2
    RQ2: How does UNICORN perform compared to search-based optimization?
    Takeaways
    • Better, with no deterioration of other performance objectives

    View Slide

  91. Results: Research Question 2
    RQ2: How does UNICORN perform compared to search-based optimization?
    Takeaways
    • Considerably faster than search-based optimization

    View Slide

  92. Outline
    Motivation
    Causal AI For Systems
    UNICORN
    Results
    Future Directions

    View Slide

  93. Causal AI for Serverless
    • Evaluating our Causal AI for Systems methodology with serverless systems provides the following opportunities:
    1. Dynamic system reconfigurations
    • Dynamic placement of functions
    • Dynamic reconfigurations of the network of functions
    • Dynamic multi-cloud placement of functions
    2. Root cause analysis of failures or QoS drop

    View Slide

  94. Causal AI for Autonomous Robot Testing
    • Testing cyberphysical systems such as robots is difficult. The key reason is that there are additional interactions with the environment and the task that the robot is performing.
    • Evaluating our Causal AI for Systems methodology with autonomous robots provides the following opportunities:
    1. Identifying difficult-to-catch bugs in robots
    2. Identifying the root cause of an observed fault and repairing the issue automatically during mission time.

    View Slide

  95. Summary: Causal AI for Systems
    1. Learning a Functional Causal Model for different downstream systems tasks
    2. The learned causal model is transferable across different environments
    [Figure: the UNICORN workflow for DeepStream on Nvidia Xavier: 1- Specify Performance Query, 2- Learn Causal Performance Model, 3- Translate Perf. Query to Causal Queries, 4- Estimate Causal Queries, 5- Update Causal Performance Model, with a “Budget Exhausted?” check.]

    View Slide

  96. I played a very minor role

    View Slide

  97. Artificial Intelligence and Systems Laboratory (AISys Lab)
    [Figure: research areas: Machine Learning, Computer Systems, Autonomy, AI/ML Systems.]
    https://pooyanjamshidi.github.io/AISys/
    Ying Meng (PhD student)
    Shuge Lei (PhD student)
    Kimia Noorbakhsh (Undergrad)
    Shahriar Iqbal (PhD student)
    Jianhai Su (PhD student)
    M.A. Javidian (postdoc)
    Fatemeh Ghofrani (PhD student)
    Abir Hossen (PhD student)
    Hamed Damirchi (PhD student)
    Mahdi Sharifi (PhD student)
    Lane Stanley (Intern)
    Sponsors, thanks!

    View Slide

  98. Collaborators (Causal AI)
    Rahul Krishna (Columbia)
    Shahriar Iqbal (UofSC)
    M. A. Javidian (Purdue)
    Baishakhi Ray (Columbia)
    Christian Kästner (CMU)
    Sven Apel (Saarland)
    Marco Valtorta (UofSC)
    Madelyn Khoury (REU student)
    Forest Agostinelli (UofSC)
    Causal AI for Systems
    Causal AI for Robot Learning (Causal RL + Transfer Learning + Robotics)
    Abir Hossen (UofSC)
    Theory of Causal AI
    Ahana Biswas (IIT)
    Om Pandey (KIIT)
    Hamed Damirchi (UofSC)
    Causal AI for Adversarial ML
    Ying Meng (UofSC)
    Fatemeh Ghofrani (UofSC)
    Mahdi Sharifi (UofSC)
    Sugato Basu (Google AdsAI)
    Garima Pruthi (Google AdsAI)
    Causal Representation Learning

    View Slide
