
Causal AI for Systems

Learning Causal Performance Models for conducting Performance Tasks in a Principled and Transferable Fashion

Pooyan Jamshidi

Invited talk at the IEEE International Conference on Smart Data Services, September 6, 2021

Transcript

  1. Causal AI for Systems
     Learning Causal Performance Models for conducting Performance Tasks in a Principled and Transferable Fashion
     Pooyan Jamshidi

  2. It is all about teamwork; I played a very minor role.

  3. Artificial Intelligence and Systems Laboratory (AISys Lab)
     Machine Learning · Computer Systems · Autonomy · AI/ML Systems
     https://pooyanjamshidi.github.io/AISys/
     Ying Meng (PhD student), Shuge Lei (PhD student), Kimia Noorbakhsh (undergrad), Shahriar Iqbal (PhD student), Jianhai Su (PhD student), M.A. Javidian (postdoc), Fatemeh Ghofrani (PhD student), Abir Hossen (PhD student), Hamed Damirchi (PhD student), Mahdi Sharifi (PhD student), Mahdi Sharifi (intern)
     Sponsors, thanks!

  4. Collaborators (Causal AI)
     Causal AI for Systems: Rahul Krishna (Columbia), Shahriar Iqbal (UofSC), M.A. Javidian (Purdue), Baishakhi Ray (Columbia), Christian Kästner (CMU), Sven Apel (Saarland), Marco Valtorta (UofSC), Madelyn Khoury (REU student), Forest Agostinelli (UofSC)
     Causal AI for Robot Learning (Causal RL + Transfer Learning + Robotics): Abir Hossen (UofSC)
     Theory of Causal AI: Ahana Biswas (IIT), Om Pandey (KIIT)
     Causal AI for Adversarial ML: Hamed Damirchi (UofSC), Ying Meng (UofSC), Fatemeh Ghofrani (UofSC), Mahdi Sharifi (UofSC)
     Causal Representation Learning: Sugato Basu (Google AdsAI), Garima Pruthi (Google AdsAI)

  5. Outline
     Motivation · Case Study · Causal AI for Systems · Results · Future Directions (up next: Motivation)

  6. Goal: Enable developers/users to find the right quality tradeoff.

  7. Today’s most popular systems are configurable.


  9. Empirical observations confirm that systems are becoming increasingly configurable.
     [Plots: number of configuration parameters vs. release time for Apache (versions 1.3.24 through 2.3.4) and Hadoop (MapReduce and HDFS, versions 0.1.0 through 2.0.0), partially cropped.]
     [Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]

  10. Empirical observations confirm that systems are becoming increasingly configurable.
     [Cropped excerpt of the cited paper's abstract: a study of configuration usage spanning thousands of customers of a commercial storage system (Storage-A) and several open-source system software projects, with guidelines that can significantly reduce the number of parameters.]
     [Plots: number of configuration parameters vs. release time for Storage-A, MySQL (3.23.0 through 5.6.2), Apache (1.3.14 through 2.3.4), and Hadoop (MapReduce and HDFS, 0.1.0 through 2.0.0).]
     [Tianyin Xu, et al., “Too Many Knobs…”, FSE’15]

  11. Today’s most popular systems are complex: multiscale, multi-modal, and multi-stream.
     Variability Space = Configuration Space + System Architecture + Deployment Environment
     [Video analytics pipeline: Video Decoder → Stream Muxer → Primary Detector → Object Tracker → Secondary Classifier, with 55, 86, 14, 44, and 86 configuration options per stage.]

  12. Configurations determine the performance behavior.
     void Parrot_setenv(. . . name, . . . value){
     #ifdef PARROT_HAS_SETENV
         my_setenv(name, value, 1);
     #else
         int name_len = strlen(name);
         int val_len = strlen(value);
         char* envs = glob_env;
         if (envs == NULL) {
             return;
         }
         strcpy(envs, name);
         strcpy(envs + name_len, "=");
         strcpy(envs + name_len + 1, value);
         putenv(envs);
     #endif
     }

     #ifdef LINUX
     extern int Parrot_signbit(double x){
     ...
     The highlighted compile-time options (PARROT_HAS_SETENV, LINUX) select which code paths are built and thereby affect speed and energy.

  13. Performance distributions are multi-modal and have long tails.
     • Certain configurations can cause performance to take abnormally large values.
     • Faulty configurations take the tail values (worse than the 99.99th percentile).
     • Certain configurations can cause faults in multiple performance objectives.

  14. Misconfiguration and its Effects
     ● Misconfigurations can elicit unexpected interactions between software and hardware.
     ● These can result in non-functional faults, affecting non-functional system properties like latency, throughput, energy consumption, etc.
     The system doesn’t crash or exhibit an obvious misbehavior; it remains operational but with degraded performance, e.g., high latency, low throughput, high energy consumption, high heat dissipation, or a combination of several.

  15. Motivating Example
     "CUDA performance issue on TX2: When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases."
     • The user is transferring the code from one hardware to another.
     • The target hardware is faster than the source hardware.
     • The user expects the code to run at least 30-40% faster.
     • The code ran 2x slower on the more powerful hardware.

  16. Motivating Example
     June 3rd (user): Any suggestions on how to improve my performance? Thanks!
     June 4th (reply): Please do the following and let us know if it works:
        1. Install JetPack 3.0
        2. Set nvpmodel=MAX-N
        3. Run jetson_clock.sh
     June 4th (user): We have already tried this. We still have high latency. Any other suggestions?
     June 5th (reply): TX2 is Pascal architecture. Please update your CMakeLists:
        + set(CUDA_STATIC_RUNTIME OFF)
        ...
        + -gencode=arch=compute_62,code=sm_62
     The user had several misconfigurations.
     In software: ✖ wrong compilation flags; ✖ wrong SDK version.
     In hardware: ✖ wrong power mode; ✖ wrong clock/fan settings.
     The discussion took 2 days. How to resolve such issues faster?

  17. Users want to understand the effect of configuration options.

  18. Outline
     Motivation · Case Study · Causal AI for Systems · Results · Future Directions (up next: Case Study)

  19. SocialSensor
     • Identifying trending topics
     • Identifying user-defined topics
     • Social media search

  20. SocialSensor
     [Pipeline: the Crawling component fetches tweets from the Internet (5k-20k tweets/min) and stores crawled items; the Orchestrator fetches them and pushes to Content Analysis (every 10 min: 100k tweets), which stores its results; Search and Integration fetches from a store of 10M tweets.]

  21. Challenges
     [The same pipeline, annotated with scale requirements: 100X and 10X load increases across components, plus a real-time constraint.]

  22. How can we gain better performance without using more resources?

  23. Let’s try out different system configurations!

  24. Opportunity: the data processing engines in the pipeline were all configurable.
     [Each engine exposes > 100 configuration options; roughly 2,300 options in total across the stack.]

  25. More combinations than the estimated number of atoms in the universe.

  26. The default configuration is typically bad, and the optimal configuration is noticeably better than the median.
     [Scatter plot: throughput (ops/sec) vs. average write latency (µs), marking the default and the optimal configurations; better is toward high throughput and low latency.]
     • The default is bad.
     • The optimal is 2X-10X faster than the worst.
     • It is noticeably faster than the median.

  27. Performance behavior varies in different environments.

  28. 100X more users; cloud resources reduced by 20%; outperformed the expert recommendation.

  29. Outline
     Motivation · Case Study · Causal AI for Systems · Results · Future Directions (up next: Causal AI for Systems)

  30. Causal AI in Systems and Software
     Computer Architecture · Database · Operating Systems · Programming Languages · Big Data · Software Engineering
     https://github.com/y-ding/causal-system-papers

  31. Traditional Performance Model vs. Causal Performance Model
     Traditional (regression over options):
     Throughput = 9 × Bitrate + 2.1 × Buffersize − 4.4 × Bitrate × Buffersize × BatchSize
     Causal: a graph in which software options (Bitrate, Buffer Size, Batch Size, Enable Padding, coming from components such as the Decoder and Muxer) act through intermediate causal mechanisms (Branch Misses, Cache Misses, No. of Cycles) on performance objectives (Throughput, Energy). Edges carry causal mechanisms f1, f2, f3, f4, f and capture causal interactions and causal paths, e.g.:
     Branchmisses = 2 × Bitrate + 8.1 × Buffersize + 4.1 × Bitrate × Buffersize × Cachemisses
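     To make the contrast concrete, a causal performance model can be encoded as a DAG with a structural equation ("causal mechanism") attached to each non-root node. A minimal Python sketch: the BranchMisses equation is the slide's example, while the Throughput mechanism and the input values are illustrative placeholders, not the talk's fitted model.

     # Minimal sketch: a causal performance model as a DAG plus mechanisms.
     # BranchMisses comes from the slide; Throughput is a made-up placeholder.
     mechanisms = {
         "BranchMisses": lambda v: 2 * v["Bitrate"] + 8.1 * v["BufferSize"]
             + 4.1 * v["Bitrate"] * v["BufferSize"] * v["CacheMisses"],
         "Throughput": lambda v: 50.0 - 1e-6 * v["BranchMisses"],  # illustrative
     }
     order = ["BranchMisses", "Throughput"]  # topological order of non-roots

     def evaluate(roots):
         """Propagate option/event values through the DAG, roots to objectives."""
         values = dict(roots)
         for node in order:
             values[node] = mechanisms[node](values)
         return values

     print(evaluate({"Bitrate": 2.0, "BufferSize": 6.0, "CacheMisses": 0.5}))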

  32. Critical Issues of Correlation-based Performance Analysis
     • Performance influence models could produce unreliable predictions.
     • Performance influence models could produce unstable predictions across environments and in the presence of measurement noise.
     • Performance influence models could produce incorrect explanations.

  33. Why Causal Inference? (Simpson’s Paradox)
     Increasing GPU memory increases latency. Counterintuitive! More GPU memory usage should reduce latency, not increase it. Any ML/statistical model built on this data will be incorrect!

  34. Why Causal Inference? (Simpson’s Paradox)
     Segregate the data on swap memory: available swap memory is decreasing. GPU memory borrows memory from the swap for some intensive workloads; other host processes may reduce the available swap, leaving little for the GPU to use.
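     The flip in sign can be reproduced on synthetic data. A minimal sketch, with entirely made-up numbers (not the talk's measurements): within each swap-memory stratum more GPU memory lowers latency, yet the pooled trend points the other way.

     import numpy as np

     # Synthetic Simpson's-paradox illustration. Configurations with little
     # available swap tend to use more GPU memory AND run slower, so the
     # pooled GPU-mem -> latency slope flips sign vs. the within-stratum slope.
     rng = np.random.default_rng(0)

     rows = []
     for swap, gpu_lo, gpu_hi, base in [(4.0, 1.0, 2.0, 100.0),   # ample swap
                                        (1.0, 2.0, 4.0, 220.0)]:  # swap squeezed
         gpu = rng.uniform(gpu_lo, gpu_hi, 500)
         latency = base - 20.0 * gpu + rng.normal(0, 5, 500)  # within: downhill
         rows.append((np.full(500, swap), gpu, latency))

     swap = np.concatenate([r[0] for r in rows])
     gpu = np.concatenate([r[1] for r in rows])
     lat = np.concatenate([r[2] for r in rows])

     def slope(x, y):
         return np.polyfit(x, y, 1)[0]

     print("pooled slope (GPU mem -> latency):", slope(gpu, lat))   # positive
     for s in (4.0, 1.0):
         m = swap == s
         print(f"slope within swap={s} Gb:", slope(gpu[m], lat[m]))  # negative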

  35. Why Causal Inference?
     Real-world problems can have 100s if not 1000s of interacting configuration options! Manually understanding and evaluating each combination is impractical, if not impossible.

  36. Causal Performance Models
     Express the relationships between interacting variables as a causal graph.
     [Graph: Load (system event) → Swap Mem. → GPU Mem. → Latency; nodes are configuration options, system events, and non-functional properties; edges give the direction(s) of causality.]
     • Latency is affected by GPU Mem., which in turn is influenced by swap memory.
     • External factors like resource pressure (Load) also affect swap memory.

  37. Causal Performance Models
     [Graph: Load → Swap Mem. → GPU Mem. → Latency]
     ? How to construct this causal graph?
     ? If there is a fault in latency, how to diagnose and fix it?

  38. UNICORN: Our Causal AI for Systems Method
     • Build a causal performance model that captures the interactions among options in the variability space using observational performance data.
     • Iteratively evaluate and update the causal performance model.
     • Perform downstream performance tasks such as performance debugging & optimization using causal reasoning.

  39. UNICORN: Our Causal AI for Systems Method
     Setup: Software: DeepStream; Middleware: TF, TensorRT; Hardware: Nvidia Xavier; Configuration: default.
     [Surface plot: latency (ms) vs. number of counters and number of splitters, cubic interpolation over a finer grid.]
     1- Specify performance query. QoS: Th > 40/s; Observed: Th < 30/s ± 5/s.
     2- Learn a causal performance model from the performance data.
     3- Translate the performance query to causal queries, e.g.:
        • What is the root cause of the observed performance fault?
        • How do I fix the misconfiguration?
        • How can I improve throughput without sacrificing accuracy?
        • How do I understand performance behavior?
     4- Estimate causal queries with the query engine, e.g., estimate the probability of satisfying the QoS if BufferSize is set to 6k: P(Th > 40/s | do(Buffersize = 6k)).
     5- Update the causal performance model: measure the performance of the configuration(s) that maximizes information gain and repeat until the budget is exhausted. The resulting model supports performance debugging and performance optimization.
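     The step-4 query is an interventional probability. A minimal sketch of estimating it by graph surgery on a toy structural causal model: cut BufferSize off from its usual causes, clamp it, and forward-simulate the rest. All mechanisms, coefficients, and noise scales here are illustrative assumptions, not UNICORN's learned model.

     import numpy as np

     # Toy estimate of P(Throughput > 40/s | do(BufferSize = 6k)) by surgery.
     rng = np.random.default_rng(1)

     def simulate(n, do_buffersize=None):
         load = rng.uniform(0, 1, n)                      # exogenous load
         buffersize = (2 + 4 * load if do_buffersize is None
                       else np.full(n, float(do_buffersize)))  # intervention
         cache_misses = 10 + 5 * buffersize + rng.normal(0, 1, n)
         throughput = 85 - 1.5 * (cache_misses - 10) / 5 * 25 - 20 * load \
             + rng.normal(0, 2, n)  # illustrative response
         return throughput

     obs = simulate(100_000)
     intv = simulate(100_000, do_buffersize=6.0)
     print("P(Th > 40/s), observational:      ", (obs > 40).mean())
     print("P(Th > 40/s | do(BufferSize=6k)): ", (intv > 40).mean())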

  40.-41. UNICORN: Our Causal AI for Systems Method (the same pipeline as slide 39, stepping through its stages).

  42. Learning Causal Performance Model
     Input: performance data over options, events, and objectives, e.g.:
     Config | Bitrate (bits/s) | Enable Padding | … | Cache Misses | … | Throughput (fps)
     c1     | 1k               | 1              | … | 42m          | … | 7
     c2     | 2k               | 1              | … | 32m          | … | 22
     …      | …                | …              | … | …            | … | …
     cn     | 5k               | 0              | … | 12m          | … | 25
     Variables: options (Bitrate, Buffer Size, Batch Size, Enable Padding), events (Branch Misses, Cache Misses, No. of Cycles), objectives (FPS, Energy).
     1- Recovering the skeleton: start from a fully connected graph given constraints (e.g., no connections between configuration options).
     2- Pruning the causal structure: statistical independence tests remove spurious edges.
     3- Orienting causal relations: orientation rules & measures (entropy) + structural constraints (colliders, v-structures).
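     A minimal, self-contained PC-style sketch of these three stages. The talk's pipeline uses a richer constraint-based algorithm (e.g., FCI) with entropy-based orientation; this toy only shows the mechanics, and the test choice, thresholds, and synthetic data are assumptions.

     import itertools, math
     import numpy as np

     def partial_corr(data, i, j, cond):
         # Partial correlation of columns i, j given `cond`, via precision matrix.
         idx = [i, j] + list(cond)
         prec = np.linalg.pinv(np.corrcoef(data[:, idx], rowvar=False))
         return -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])

     def independent(data, i, j, cond, alpha=0.05):
         # Fisher-z test of conditional independence.
         r = float(np.clip(partial_corr(data, i, j, cond), -0.999999, 0.999999))
         z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(len(data) - len(cond) - 3)
         p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
         return p > alpha

     def pc_skeleton(data, max_cond=1):
         # Stage 1: fully connected graph; Stage 2: prune by independence tests.
         n = data.shape[1]
         adj = {frozenset(e) for e in itertools.combinations(range(n), 2)}
         sepsets = {}
         for size in range(max_cond + 1):
             for i, j in itertools.combinations(range(n), 2):
                 if frozenset((i, j)) not in adj:
                     continue
                 others = [k for k in range(n) if k not in (i, j)]
                 for cond in itertools.combinations(others, size):
                     if independent(data, i, j, cond):
                         adj.discard(frozenset((i, j)))
                         sepsets[frozenset((i, j))] = set(cond)
                         break
         return adj, sepsets

     def v_structures(adj, sepsets, n):
         # Stage 3: orient i -> k <- j when i,j are nonadjacent, both touch k,
         # and k is absent from their separating set (a collider).
         out = []
         for i, j in itertools.combinations(range(n), 2):
             if frozenset((i, j)) in adj:
                 continue
             for k in range(n):
                 if (k not in (i, j) and frozenset((i, k)) in adj
                         and frozenset((j, k)) in adj
                         and k not in sepsets.get(frozenset((i, j)), set())):
                     out.append((i, k, j))
         return out

     rng = np.random.default_rng(2)
     n = 5000
     x0 = rng.normal(size=n)                  # e.g., a configuration option
     x1 = rng.normal(size=n)                  # another option
     x2 = x0 + x1 + 0.5 * rng.normal(size=n)  # a system event caused by both
     x3 = 2 * x2 + 0.5 * rng.normal(size=n)   # an objective caused by the event
     data = np.column_stack([x0, x1, x2, x3])
     skel, seps = pc_skeleton(data)
     print(sorted(tuple(sorted(e)) for e in skel))  # expect edges 0-2, 1-2, 2-3
     print(v_structures(skel, seps, 4))             # expect collider 0 -> 2 <- 1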

  43. Performance measurement
     Configuration space: ℂ = O1 × O2 × ⋯ × O19 × O20, with options such as dead code removal, constant folding, loop unrolling, and function inlining.
     A configuration c1 ∈ ℂ, e.g., c1 = 0 × 0 × ⋯ × 0 × 1, is compiled (e.g., SaC, LLVM), deployed, and configured on the hardware; the instrumented binary is then measured:
     fc(c1) = 11.1 ms (compile time), fe(c1) = 110.3 ms (execution time), fen(c1) = 100 mWh (energy).
     A non-functional property is a measurable/quantifiable aspect of the system.
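     A minimal sketch of this formalism in code: a configuration is a point in ℂ over binary options, and each non-functional property is a function measured over ℂ. The option names come from the slide; measure() is a stub standing in for the real compile/deploy/run toolchain, and its return values simply echo the slide's example numbers.

     # Configuration as a point in C = O1 x ... x O20 (binary options here).
     OPTIONS = ["dead_code_removal", "constant_folding", "loop_unrolling",
                "function_inlining"]  # ... up to O20

     def measure(config):
         # In practice: compile (e.g., SaC/LLVM), deploy, run the binary.
         return {"compile_time_ms": 11.1, "exec_time_ms": 110.3,
                 "energy_mwh": 100}

     c1 = {opt: 0 for opt in OPTIONS}
     c1["function_inlining"] = 1        # c1 = 0 x 0 x ... x 1
     print(measure(c1))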

  44. Our setup for performance measurements

  45. Hardware platforms in our experiments
     The reason behind using different types of hardware platforms is that they exhibit different behaviors due to differences in resources, microarchitecture, etc.
     AWS DeepLens (cloud-connected device) · System on Chip (SoC) · Microcontrollers (MCUs)

  46. Measuring performance for systems involves many challenges.
     Each hardware platform requires different instrumentation; obtaining clean measurements with the least amount of noise is the most challenging part of our experiments.

  47.-49. Learning Causal Performance Model (the same figure as slide 42, stepping through stages 1-3: recovering the skeleton, pruning the causal structure, orienting causal relations).

  50. Causal Performance Model
     [Graph: software options (Bitrate, Buffer Size, Batch Size, Enable Padding, from the Decoder and Muxer) → performance events (Branch Misses, Cache Misses, No. of Cycles) → performance objectives (Throughput, Energy), with causal mechanisms f along the edges; the graph captures causal interactions and causal paths.]
     Branchmisses = 2 × Bitrate + 8.1 × Buffersize + 4.1 × Bitrate × Buffersize × Cachemisses

  51. UNICORN: Our Causal AI for Systems Method (pipeline recap; see slide 39).

  52. Causal Debugging: an example of a downstream performance task
     Objective: diagnose and fix the root cause of misconfigurations that cause non-functional faults.
     Approach: use causal models to model various cross-stack configuration interactions, and counterfactual reasoning to recommend fixes for these misconfigurations.

  53. 53
    Causal Debugging
    • What is the root-cause
    of my fault
    ?

    • How do I fix my
    misconfigurations to
    improve performance?
    Misconfiguration
    Fault


    fixed?
    Observational Data Build Causal Graph Extract Causal Paths
    Best Query
    Yes
    No
    updat
    e

    observationa
    l

    data
    Counterfactual Queries
    Rank Paths
    What if questions
    .

    E.g., What if the configuration
    option X was set to a value ‘x’?
    About 25 sample
    configurations
    (training data)

    View Slide

  54. Extracting Causal Paths from the Causal Model (the "extract causal paths" stage of the loop above).

  55. Extracting Causal Paths from the Causal Model
     Problem:
     ✕ In real-world cases, the causal graph can be very complex.
     ✕ It may be intractable to reason over the entire graph directly.
     Solution:
     ✓ Extract paths from the causal graph.
     ✓ Rank them based on their average causal effect on latency, etc.
     ✓ Reason over the top K paths.

  56. Extracting Causal Paths from the Causal Model
     [From the graph Load → Swap Mem. → GPU Mem. → Latency, extract paths such as Swap Mem. → GPU Mem. → Latency and Load → Swap Mem. → GPU Mem. → Latency.]
     A causal path always begins with a configuration option or a system event, and always terminates at a performance objective.
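     A minimal sketch of this extraction step: enumerate directed paths that begin at an option or system event and end at a performance objective. The graph below mirrors the slide's example; the traversal itself is a plain depth-first walk.

     def paths_to_objectives(graph, sources, objectives):
         out = []
         def walk(node, path):
             if node in objectives:
                 out.append(path)
                 return
             for nxt in graph.get(node, []):
                 if nxt not in path:          # guard against cycles
                     walk(nxt, path + [nxt])
         for s in sources:
             walk(s, [s])
         return out

     graph = {"Load": ["Swap Mem."], "Swap Mem.": ["GPU Mem."],
              "GPU Mem.": ["Latency"]}
     print(paths_to_objectives(graph, ["Load", "Swap Mem."], {"Latency"}))
     # [['Load', 'Swap Mem.', 'GPU Mem.', 'Latency'],
     #  ['Swap Mem.', 'GPU Mem.', 'Latency']]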

  57. Ranking Causal Paths from the Causal Model
     ● There may be too many causal paths.
     ● We need to select the most useful ones.
     ● Compute the Average Causal Effect (ACE) of each pair of neighbors in a path, e.g., for Swap Mem. → GPU Mem. → Latency:
     ACE(GPU Mem., Swap) = (1/N) Σ_{a,b ∈ Z} [ E(GPU Mem. | do(Swap = b)) − E(GPU Mem. | do(Swap = a)) ]
     Here E(GPU Mem. | do(Swap = b)) is the expected value of GPU Mem. when we artificially intervene by setting Swap to the value b (likewise for a), and the sum averages over all permitted values of swap memory. If this difference is large, then small changes to Swap Mem. will cause large changes to GPU Mem.
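     A minimal sketch of the ACE computation: average the change in the interventional expectation over permitted swap values (paired here as consecutive values, which is one reading of the formula). The expected_under_do function stands in for the query engine's model-based estimate; the linear response below is a toy assumption.

     import numpy as np

     def ace(expected_under_do, values):
         values = sorted(values)
         pairs = list(zip(values, values[1:]))   # consecutive permitted values
         return np.mean([expected_under_do(b) - expected_under_do(a)
                         for a, b in pairs])

     # Toy stand-in for E[GPU Mem. | do(Swap = s)] from a fitted causal model.
     expected_gpu_mem = lambda s: 3.0 - 0.4 * s
     print(ace(expected_gpu_mem, [1, 2, 4]))  # negative: more swap, less GPU use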

  58. Ranking Causal Paths from the Causal Model
     ● Average the ACE of all pairs of adjacent nodes in the path.
     ● Rank paths from the highest path ACE (PACE) score to the lowest.
     ● Use the top K paths for subsequent analysis.
     For a path Z → X → Y (e.g., Swap Mem. → GPU Mem. → Latency):
     PACE(Z, Y) = (1/2) ( ACE(Z, X) + ACE(X, Y) )
     i.e., the sum over all pairs of adjacent nodes in the causal path, normalized by the number of pairs.
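     A minimal sketch of the ranking: compute PACE per path and keep the top K. The edge-level ACE values below are toy stand-ins for estimates from the causal model.

     def pace(path, ace_of):
         pairs = list(zip(path, path[1:]))
         return sum(ace_of(u, v) for u, v in pairs) / len(pairs)

     def top_k(paths, ace_of, k=2):
         return sorted(paths, key=lambda p: pace(p, ace_of), reverse=True)[:k]

     edge_ace = {("Swap Mem.", "GPU Mem."): 0.8, ("GPU Mem.", "Latency"): 1.2,
                 ("Load", "Swap Mem."): 0.3}
     ace_of = lambda u, v: edge_ace.get((u, v), 0.0)
     paths = [["Swap Mem.", "GPU Mem.", "Latency"],
              ["Load", "Swap Mem.", "GPU Mem.", "Latency"]]
     print(top_k(paths, ace_of, k=1))  # the shorter, stronger path ranks first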

  59. Diagnosing and Fixing the Faults (the "counterfactual queries" and "best query" stages of the loop above).

  60. Diagnosing and Fixing the Faults
     ● Counterfactual inference asks "what if" questions about changes to the misconfigurations.
     Example: given that my current swap memory is 2 Gb and I have high latency, what is the probability of having low latency if swap memory were increased to 4 Gb?
     We are interested in the scenario where we hypothetically have low latency, conditioned on the following events:
     • We hypothetically set the new swap memory to 4 Gb.
     • Swap memory was initially set to 2 Gb.
     • We observed high latency when swap was set to 2 Gb.
     • Everything else remains the same.

  61. Diagnosing and Fixing the Faults
     Original path: Load → Swap → GPU Mem. → Latency.
     Path after the proposed change: Swap = 4 Gb → GPU Mem. → Latency, with incoming edges to Swap removed (assume no external influence) and the graph modified to reflect the hypothetical scenario. Is latency now low?
     Use both models to compute the answer to the counterfactual question.

  62. Diagnosing and Fixing the Faults
     Original path: Swap → GPU Mem. → Latency (with Load). Path after the proposed change: Swap = 4 Gb → GPU Mem. → Latency.
     Potential = P( Latency^ = low | Swap^ = 4 Gb, Swap = 2 Gb, Latency_{Swap=2Gb} = high, U )
     Reading the terms: we expect a low latency (Latency^ = low); the swap is now 4 Gb (Swap^ = 4 Gb); the swap was initially 2 Gb; the latency was high when swap was 2 Gb; and everything else (U) stays the same.

  63. Diagnosing and Fixing the Faults
     Potential = P( outcome^ = good | change, outcome_{¬change} = bad, ¬change, U )
        the probability that the outcome is good after a change, conditioned on the past.
     Control = P( outcome^ = bad | ¬change, U )
        the probability that the outcome was bad before the change.
     Individual Treatment Effect = Potential − Control
     If this difference is large, then our change is useful.
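     A minimal counterfactual sketch in the abduction-action-prediction style on a toy linear SCM: keep only the exogenous noise consistent with the observed fault (Swap = 2 Gb, high latency), then rerun the mechanism with Swap forced to 4 Gb. All coefficients, thresholds, and noise scales are illustrative assumptions.

     import numpy as np

     rng = np.random.default_rng(3)

     def latency(swap, u_load, u_lat):
         gpu_pressure = 5.0 - 1.0 * swap + u_load   # less swap -> more pressure
         return 50.0 + 30.0 * gpu_pressure + u_lat

     # Abduction: sample noise consistent with the evidence (latency > 150 ms
     # at Swap = 2 Gb counts as "high" in this toy).
     u_load = rng.normal(0, 1, 200_000)
     u_lat = rng.normal(0, 10, 200_000)
     consistent = latency(2.0, u_load, u_lat) > 150.0

     # Action + prediction: rerun with Swap = 4 Gb on the consistent noise only.
     cf = latency(4.0, u_load[consistent], u_lat[consistent])
     print("P(low latency | do(Swap=4Gb), observed fault):", (cf < 150.0).mean())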

  64. Diagnosing and Fixing the Faults
     Take the top K paths and enumerate all possible changes: set every configuration option in the path to all permitted values. Compute ITE(change) for each candidate, inferred from the observed data (which is very cheap), and pick the change with the largest ITE.

  65. Diagnosing and Fixing the Faults
     Measure the performance of the change with the largest ITE. Fault fixed? If yes, done. If not: add the measurement to the observational data, update the causal model, and repeat.
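     The last two slides form a loop; a minimal sketch of it, where estimate_ite, measure, update_model, and qos_ok are assumed hooks into the causal model and the instrumented system (the control flow, not UNICORN's exact code):

     def repair(candidates, estimate_ite, measure, update_model, qos_ok):
         while candidates:
             best = max(candidates, key=estimate_ite)  # largest ITE, cheap
             result = measure(best)                    # actually run the system
             if qos_ok(result):
                 return best                           # fault fixed
             update_model(best, result)                # learn from the attempt
             candidates.remove(best)                   # and try the next change
         return None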

  66. UNICORN: Our Causal AI for Systems Method (pipeline recap; see slide 39).

  67. Active Learning for Updating Causal Performance Model
     1- Evaluate candidate interventions on hardware, workload, and kernel options, scored by the expected change in belief & KL divergence and by causal effects on the objectives.
     2- Determine & perform the next performance measurement, e.g.:
        Option/Event/Objective | Value
        Bitrate                | 1k
        Buffer Size            | 20k
        Batch Size             | 10
        Enable Padding         | 1
        Branch Misses          | 24m
        Cache Misses           | 42m
        No. of Cycles          | 73b
        FPS                    | 31/s
        Energy                 | 42 J
     3- Update the causal model via model averaging.

  68.-69. Active Learning for Updating Causal Performance Model (the same figure, stepping through steps 1-3).
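     A minimal sketch of the acquisition step: score candidate interventions by the KL divergence between the belief expected after the intervention and the current belief over candidate models, and measure the highest-scoring configuration next. The belief vectors below are toy stand-ins for a posterior over causal models.

     import numpy as np

     def kl(p, q):
         p, q = np.asarray(p, float), np.asarray(q, float)
         return float(np.sum(p * np.log(p / q)))

     def next_intervention(candidates, current_belief, expected_belief_after):
         gains = {c: kl(expected_belief_after(c), current_belief)
                  for c in candidates}
         return max(gains, key=gains.get)   # largest expected change in belief

     belief = np.array([0.5, 0.3, 0.2])     # belief over 3 candidate models
     after = {"BufferSize=6k": np.array([0.7, 0.2, 0.1]),   # informative
              "Bitrate=1k": np.array([0.5, 0.3, 0.2])}      # uninformative
     print(next_intervention(after.keys(), belief, lambda c: after[c]))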

  70. Benefits of Causal Reasoning for System Performance Analysis

  71. There are two fundamental benefits that we get from our "Causal AI for Systems" methodology:
     1. We learn one central (causal) performance model from the data across different performance tasks:
        • performance understanding
        • performance optimization
        • performance debugging and repair
        • performance prediction for different environments (e.g., canary → production)
     2. The causal model is transferable across environments.
        • We observed the Sparse Mechanism Shift in systems too!
        • Alternative non-causal models (e.g., regression-based models for performance tasks) are not transferable, as they rely on the i.i.d. setting.

  72. Questions of this nature require precise mathematical language, lest they be misleading.
     Here we are simultaneously conditioning on two values of GPU memory growth (i.e., X̂ = 0.66 and X = 0.33). Traditional machine learning approaches cannot handle such expressions; instead, we must resort to causal models to compute them.

  73. Difference between statistical (left) and causal models (right) on a given set of three variables
     While a statistical model specifies a single probability distribution, a causal model represents a set of distributions, one for each possible intervention.

  74. Independent Causal Mechanisms (ICM) Principle

  75. Sparse Mechanism Shift (SMS) Hypothesis
     Example of the SMS hypothesis, where an intervention (which may or may not be intentional/observed) changes the position of one finger and, as a consequence, the object falls. The change in pixel space is entangled (or distributed), in contrast to the change in the causal model.

  76. NeurIPS 2020 (ML for Systems), Dec 12th, 2020
     https://arxiv.org/pdf/2010.06061.pdf
     https://github.com/softsys4ai/CADET

  77. Outline
     Motivation · Case Study · Causal AI for Systems · Results · Future Directions (up next: Results)

  78. Results: Case Study
     [The TX1 → TX2 forum post from slide 15 again, with its annotations: the user is transferring the code from one hardware to another; the target hardware is faster than the source hardware; the user expects the code to run at least 30-40% faster; the code ran 2x slower on the more powerful hardware.]

  79. Results: Case Study
     Nvidia TX1: CPU 4 cores @ 1.3 GHz; GPU 128 cores @ 0.9 GHz; memory 4 GB, 25 GB/s.
     Nvidia TX2 (more powerful): CPU 6 cores @ 2 GHz; GPU 256 cores @ 1.3 GHz; memory 8 GB, 58 GB/s.
     The embedded real-time stereo estimation source code runs at 17 FPS on TX1 but only 4 FPS on TX2: 4× slower!

  80. Results: Case Study
     Recommended configuration changes:
     Configuration      | CADET | Decision Tree | Forum
     CPU Cores          | ✓     | ✓             | ✓
     CPU Freq.          | ✓     | ✓             | ✓
     EMC Freq.          | ✓     | ✓             | ✓
     GPU Freq.          | ✓     | ✓             | ✓
     Sched. Policy      |       | ✓             |
     Sched. Runtime     |       | ✓             |
     Sched. Child Proc  |       | ✓             |
     Dirty Bg. Ratio    |       | ✓             |
     Drop Caches        |       | ✓             |
     CUDA_STATIC_RT     | ✓     | ✓             | ✓
     Swap Memory        | ✓     |               |
     Results:                    | CADET   | Decision Tree | Forum
     Throughput (on TX2)         | 26 FPS  | 20 FPS        | 23 FPS
     Throughput gain (over TX1)  | 53%     | 21%           | 39%
     Time to resolve             | 24 min. | 3 1/2 hrs.    | 2 days
     ✓ Finds the root causes accurately
     ✓ No unnecessary changes
     ✓ Better improvements than the forum's recommendation (the user expected a 30-40% gain)
     ✓ Much faster

  81. Evaluation: Experimental Setup
     Hardware systems:
     • Nvidia TX1: CPU 4 cores @ 1.3 GHz; GPU 128 cores @ 0.9 GHz; memory 4 GB, 25 GB/s
     • Nvidia TX2: CPU 6 cores @ 2 GHz; GPU 256 cores @ 1.3 GHz; memory 8 GB, 58 GB/s
     • Nvidia Xavier: CPU 8 cores @ 2.26 GHz; GPU 512 cores @ 1.3 GHz; memory 32 GB, 137 GB/s
     Software systems:
     • Xception: image recognition (50,000 test images)
     • DeepSpeech: voice recognition (5 sec. audio clip)
     • BERT: sentiment analysis (10,000 IMDb reviews)
     • x264: video encoder (11 MB, 1080p video)
     Configuration space: 30 configuration options (10 software, 10 OS/kernel, 10 hardware) and 17 system events.

  82. Evaluation: Data Collection
     ● For each software/hardware combination, create a benchmark dataset (see the sketch after this slide):
        ○ Exhaustively set each configuration option to all permitted values.
        ○ For continuous options (e.g., GPU memory), sample 10 equally spaced values between [min, max].
     ● Measure the latency, energy consumption, and heat dissipation:
        ○ Repeat 5x and average.
     [Observed in the data: latency faults, energy faults, and multiple simultaneous faults.]
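     A minimal sketch of this collection rule. The measure_latency hook and the option ranges are illustrative assumptions, not the study's actual option list.

     import itertools
     import numpy as np

     gpu_mem_values = np.linspace(1.0, 8.0, 10)  # continuous: 10 points in [min, max]
     swap_values = [1, 2, 4]                     # discrete: all permitted values

     def benchmark(measure_latency, repeats=5):
         results = {}
         for gpu, swap in itertools.product(gpu_mem_values, swap_values):
             runs = [measure_latency(gpu, swap) for _ in range(repeats)]
             results[(gpu, swap)] = sum(runs) / repeats  # repeat 5x, average
         return results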

  83. Evaluation: Ground Truth
     ● For each performance fault:
        ○ Manually investigate the root cause.
        ○ "Fix" the misconfigurations.
     ● A "fix" implies the configuration no longer has tail performance:
        ○ a user-defined benchmark (i.e., 10th percentile), or
        ○ some QoS/SLA benchmark.
     ● Record the configurations that were changed.

  84. Evaluation: Metrics
     Repair quality (relevance score):
     Gain = (NFP_fault − NFP_repair) / NFP_fault × 100
     where NFP is a non-functional property (e.g., latency, energy), NFP_fault is the faulty value, and NFP_repair is the repaired value. The larger the gain, the better the repair. For example, repairing a latency fault from 300 ms down to 120 ms yields Gain = (300 − 120) / 300 × 100 = 60%.

  85. Results: Research Questions
     RQ1: How does CADET perform compared to model-based diagnostics?
     RQ2: How does CADET perform compared to search-based optimization?

  86. Results: Research Question 1 (single objective)
     RQ1: How does CADET perform compared to model-based diagnostics?
     Takeaways:
     ✓ Finds the root causes accurately (more accurate than ML-based methods)
     ✓ Better gain
     ✓ Much faster (up to 20x)

  87. Results: Research Question 1 (multi-objective)
     RQ1: How does CADET perform compared to model-based diagnostics?
     Takeaway: no deterioration of other performance objectives, even with multiple faults in latency & energy usage.

  88. Results: Research Questions (recap of RQ1 and RQ2).

  89. Results: Research Question 2
     RQ2: How does CADET perform compared to search-based optimization?
     Takeaway: better, with no deterioration of other performance objectives.

  90. Results: Research Question 2 (continued)
     RQ2: How does CADET perform compared to search-based optimization?
     Takeaway: considerably faster than search-based optimization.

  91. Outline
     Motivation · Case Study · Causal AI for Systems · Results · Future Directions (up next: Future Directions)

  92. Causal AI for Serverless
     • Evaluating our Causal AI for Systems methodology with serverless systems provides the following opportunities:
       1. Dynamic system reconfiguration:
          • dynamic placement of functions
          • dynamic reconfiguration of the network of functions
          • dynamic multi-cloud placement of functions
       2. Root-cause analysis of failures or QoS drops

  93. Causal AI for Autonomous Robot Testing
     • Testing cyberphysical systems such as robots is difficult; the key reason is that there are additional interactions with the environment and with the task the robot is performing.
     • Evaluating our Causal AI for Systems methodology with autonomous robots provides the following opportunities:
       1. Identifying difficult-to-catch bugs in robots
       2. Identifying the root cause of an observed fault and repairing the issue automatically during mission time

  94. Summary: Causal AI for Systems
     1. Learning a functional causal model for different downstream systems tasks.
     2. The learned causal model is transferable across different environments.
     [The UNICORN pipeline figure from slide 39 is shown again as a recap.]
