Unicorn: Reasoning about Configurable System Performance through the Lens of Causality

This is the EuroSys'22 presentation, delivered by Shahriar Iqbal.
Paper: https://dl.acm.org/doi/abs/10.1145/3492321.3519575
Code + Data + Replication Package: https://github.com/softsys4ai/unicorn

Pooyan Jamshidi

April 06, 2022

Transcript

  1. UNICORN: Reasoning about Configurable System Performance through the Lens of Causality
    Md Shahriar Iqbal, Rahul Krishna, MA Javidian, Baishakhi Ray, Pooyan Jamshidi

  2. Correlation vs Causation

  3. Outline
    Motivation
    Causal Inference
    UNICORN
    Results

  4. Consider a data analytics pipeline
    Components: Video Decoder, Stream Muxer, Primary Detector, Object Tracker, Secondary Classifier
    # Configuration Options (as listed on the slide): 55, 86, 14, 44, 86

  5. Each component has a plethora of configuration options
    Components: Video Decoder, Stream Muxer, Primary Detector, Object Tracker, Secondary Classifier
    # Configuration Options (as listed on the slide): 55, 86, 14, 44, 86
    Composed System
    Compression
    Encryption

  6. Each component has a plethora of configuration options
    Components: Video Decoder, Stream Muxer, Primary Detector, Object Tracker, Secondary Classifier
    # Configuration Options (as listed on the slide): 55, 86, 14, 44, 86
    Configurations Possible: 2^285
    Complex interactions between options (intra- or inter-component) give rise to a combinatorially large configuration space
    Compression
    Encryption
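The five per-component option counts on this slide sum to 285, so a space of 2^285 configurations follows if every option is a binary switch. The sketch below checks that arithmetic; the binary-option assumption and the function names are illustrative and are not stated on the slide.

```python
# Option counts per component, in the order listed on the slide.
OPTION_COUNTS = [55, 86, 14, 44, 86]


def total_options(counts):
    """Total number of configuration options across all components."""
    return sum(counts)


def binary_configuration_count(counts):
    """Size of the configuration space, ASSUMING every option is binary."""
    return 2 ** total_options(counts)


total = total_options(OPTION_COUNTS)
space = binary_configuration_count(OPTION_COUNTS)
print(total)            # 285
print(len(str(space)))  # number of decimal digits in 2**285
```

Options with more than two permitted values would make the space larger still.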

  7. Performance varies significantly when systems are deployed with different configurations
    [Figure: Latency (Seconds) and Energy Consumption (Joules) across configurations; annotations: ">99.99%" (shown twice), "Varies by 5x", "Varies by 4.5x"]
    The goal is to set the system to a configuration for which performance is optimal or close to optimal

  8. Performance varies significantly when systems are deployed with different configurations
    [Figure: Latency (Seconds) and Energy Consumption (Joules) across configurations; annotations: ">99.99%" (shown twice), "Varies by 5x", "Varies by 4.5x"]
    The goal is to set the system to a configuration for which performance is optimal or close to optimal
    Reaching the desired performance goal is difficult due to the sheer size of the configuration space and the high cost of measuring configurations

  9. Computer systems undergo several environmental changes
    Source Environment: Decoder, Muxer, Detector, Tracker, Classifier
    Target Environment: Decoder, Muxer, Detector, Tracker, Classifier

  10. Real world example: Deployment environment change
    "When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases."
    The user is transferring the code from one hardware platform to another
    The code ran 2x slower on the more powerful hardware

  11. Real world example: Deployment environment change
    "When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases."
    The user is transferring the code from one hardware platform to another
    The code ran 2x slower on the more powerful hardware
    An incorrect understanding of performance behavior often leads to misconfiguration

  12. What is misconfiguration?
    Misconfigurations happen due to unexpected interactions between configuration options in the deployment system stack.

  13. What is misconfiguration?
    Misconfigurations happen due to unexpected interactions between configuration options in the deployment system stack.
    The system does not crash but remains operational with degraded performance, e.g., high latency, low throughput, high energy consumption.
    [Figure: Latency (Seconds) and Energy Consumption (Joules) across configurations; annotations: ">99.99%" (shown twice), "Misconfiguration"]

  14. Performance task: Debugging
    Performance debugging aims at finding the root cause of the misconfiguration and fixing it.
    "When we are trying to transplant our CUDA source code from TX1 to TX2, it behaved strange. We noticed that TX2 has twice computing-ability as TX1 in GPU, as expectation, we think TX2 will 30% - 40% faster than TX1 at least. Unfortunately, most of our code base spent twice the time as TX1, in other words, TX2 only has 1/2 speed as TX1, mostly. We believe that TX2’s CUDA API runs much slower than TX1 in many cases."
    The user is transferring the code from one hardware platform to another
    The code ran 2x slower on the more powerful hardware
    The user expects 30-40% improvement

  15. Performance task: Optimization
    [Figure: Latency (Seconds) and Energy Consumption (Joules) across configurations; annotation: ">99.99%" (shown twice)]
    Here, the developer aims at finding the optimal configuration with or without experiencing any misconfiguration.

  16. Performance debugging tasks take a significantly long time, and the fixes
    are typically non-intuitive (changes to seemingly underrated options)
    June 3
    June 4
    June 4
    June 5
    Any suggestions on how to improve my performance?
    Thanks!
    Please do the following and let us know if it works
    1. Install JetPack 3.0
    2. Set nvpmodel=MAX-N
    3. Run jetson_clock.sh
    We have already tried this. We still have high latency.
    Any other suggestions?
    TX2 is pascal architecture. Please update your CMakeLists:
    + set(CUDA_STATIC_RUNTIME OFF)
    ...
    + -gencode=arch=compute_62,code=sm_62
    The user had several misconfigurations
    In Software:
    ✖ Wrong compilation flags
    ✖ Wrong SDK version
    In Hardware:
    ✖ Wrong power mode
    ✖ Wrong clock/fan settings

  17. How to resolve these issues?
    Current approaches: Reasoning based on correlation!
    Our key idea: Reasoning based on causation :)

  18. Performance Influence Models
    Panels: Observational Data, Black-box models, Regression Equation
    Observational data:
          Bitrate (bits/s)  Enable Padding  …  Cache Misses  …  Throughput (fps)
      c1  1k                1               …  42m           …  7
      c2  2k                1               …  32m           …  22
      …   …                 …               …  …             …  …
      cn  5k                0               …  12m           …  25
    [Figure: latency (ms) as a function of number of counters and number of splitters; "Cubic Interpolation Over Finer Grid"]
    Regression equation: Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
    Term labels: Options, Options, Interactions
    This is a representative work, but there are many other works related to using regression models (as well as other statistical models) for building performance models
    We have selection bias here ;)

  19. Performance Influence Models
    These methods rely on statistical correlations to extract meaningful information required for performance tasks.
    Panels: Observational Data, Black-box models, Regression Equation
    [Table: observational data for configurations c1 … cn; columns Bitrate (bits/s), Enable Padding, …, Cache Misses, …, Throughput (fps)]
    [Figure: latency (ms) as a function of number of counters and number of splitters; "Cubic Interpolation Over Finer Grid"]
    Regression equation: Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
    Term labels: Options, Options, Discovered Interactions

  20. Performance Influence Models suffer from several shortcomings
    • Incorrect Explanations and Unreliable Predictions
    • Non-transferable across Environments

  21. Performance Influence Models might be Unreliable
    [Figure: scatter plot of Throughput (FPS) vs. Cache Misses]
    Increasing Cache Misses increases Throughput.

  22. Performance Influence Models might be Unreliable
    [Figure: scatter plot of Throughput (FPS) vs. Cache Misses]
    Increasing Cache Misses increases Throughput.
    This is counter-intuitive
    More Cache Misses should reduce Throughput, not increase it
    Purely statistical models built on this data will be unreliable.

  23. Performance Influence Models might be Unreliable
    [Figure: Throughput (FPS) vs. Cache Misses, pooled and segregated by Cache Policy (LRU, FIFO, LIFO, MRU)]
    Segregating the data by Cache Policy shows that, within each group, an increase in Cache Misses results in a decrease in Throughput.
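The reversal described on this slide can be reproduced with a small sketch. The numbers below are synthetic and made up for illustration (they are not the measurements behind the slide), and the assignment of baselines to cache policies is arbitrary. The pooled correlation between cache misses and throughput comes out positive, while the correlation within every cache policy is negative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# SYNTHETIC baselines per cache policy: (cache misses in thousands, throughput in FPS).
BASELINES = {"LRU": (60, 5.0), "FIFO": (100, 10.0), "LIFO": (140, 15.0), "MRU": (180, 20.0)}

groups = {}
for policy, (base_misses, base_fps) in BASELINES.items():
    # Within a policy, 10k additional misses cost 1 FPS.
    groups[policy] = [(base_misses + d, base_fps - 0.1 * d) for d in (-10, 0, 10)]

pooled = [point for points in groups.values() for point in points]
overall = pearson([m for m, _ in pooled], [t for _, t in pooled])
within = {
    policy: pearson([m for m, _ in pts], [t for _, t in pts])
    for policy, pts in groups.items()
}

print(f"pooled correlation: {overall:+.2f}")
for policy, r in within.items():
    print(f"{policy}: {r:+.2f}")
```

A model fitted to the pooled data would learn the positive slope, which is the wrong explanation once the cache policy is taken into account.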

  24. Performance Influence Models are not Transferable
    DeepStream (Environment: TX2)
    DeepStream (Environment: Xavier)
    Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
    Throughput = 2 × Bitrate + 1.9 × BatchSize + 1.8 × BufferSize + 0.5 × EnablePadding + 5.9 × Bitrate × BufferSize + 6.2 × Bitrate × EnablePadding + 4.1 × Bitrate × BufferSize × EnablePadding
    Each term in the regression equations is considered a predictor

  25. Performance Influence Models are not Transferable
    Performance Influence Models change significantly in new environments, resulting in lower accuracy.
    DeepStream (Environment: TX2)
    DeepStream (Environment: Xavier)
    Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
    Throughput = 2 × Bitrate + 1.9 × BatchSize + 1.8 × BufferSize + 0.5 × EnablePadding + 5.9 × Bitrate × BufferSize + 6.2 × Bitrate × EnablePadding + 4.1 × Bitrate × BufferSize × EnablePadding

  26. Performance Influence Models are not Transferable
    Performance influence models cannot be reliably used across environments.
    DeepStream (Environment: TX2)
    DeepStream (Environment: Xavier)
    Throughput = 5.1 × Bitrate + 2.5 × BatchSize + 12.3 × Bitrate × BatchSize
    Throughput = 2 × Bitrate + 1.9 × BatchSize + 1.8 × BufferSize + 0.5 × EnablePadding + 5.9 × Bitrate × BufferSize + 6.2 × Bitrate × EnablePadding + 4.1 × Bitrate × BufferSize × EnablePadding
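To make the transferability problem concrete, the two regression equations from these slides can be written as term-to-coefficient maps and compared. This is only a sketch: the names `MODEL_A` and `MODEL_B`, the helper functions, and the all-ones configuration are hypothetical, and the sketch does not assume which equation belongs to which environment.

```python
from math import prod

# The two regression equations from the slides, as {term: coefficient}.
MODEL_A = {
    ("Bitrate",): 5.1,
    ("BatchSize",): 2.5,
    ("Bitrate", "BatchSize"): 12.3,
}
MODEL_B = {
    ("Bitrate",): 2.0,
    ("BatchSize",): 1.9,
    ("BufferSize",): 1.8,
    ("EnablePadding",): 0.5,
    ("Bitrate", "BufferSize"): 5.9,
    ("Bitrate", "EnablePadding"): 6.2,
    ("Bitrate", "BufferSize", "EnablePadding"): 4.1,
}


def predict(model, config):
    """Evaluate a performance influence model: sum of coefficient * product of options."""
    return sum(coef * prod(config[name] for name in term) for term, coef in model.items())


def term_difference(model_a, model_b):
    """Terms (predictors) that appear in only one of the two models."""
    return set(model_a) - set(model_b), set(model_b) - set(model_a)


# HYPOTHETICAL normalized configuration, used only to exercise the equations.
config = {"Bitrate": 1.0, "BatchSize": 1.0, "BufferSize": 1.0, "EnablePadding": 1.0}
only_a, only_b = term_difference(MODEL_A, MODEL_B)
print(predict(MODEL_A, config), predict(MODEL_B, config))
print(sorted(only_a), sorted(only_b))
```

Only two predictors (Bitrate and BatchSize) are shared, and even their coefficients differ, so a model learned in one environment says little about the other.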

  27. Outline
    Motivation
    Causal Inference
    UNICORN
    Results

  28. Our Key Idea: Building Causal Performance Models instead of Performance Influence Models
    A causal performance model expresses the relationships between interacting variables as a causal graph
    Node types: Configuration options, System Events, Non-functional Properties
    [Figure: scatter plot of Throughput (FPS) vs. Cache Misses]
    [Figure: causal graph over Cache Policy, Cache Misses, and Throughput, with an arrow marking the Direction of Causality]

  29. Why Causal Performance Model? To build reliable models that produce correct explanations
    [Figure: Throughput (FPS) vs. Cache Misses, pooled and segregated by Cache Policy (LRU, FIFO, LIFO, MRU)]
    Cache Policy affects Throughput via Cache Misses.
    Causal Performance Models recover the correct interactions.
    [Figure: causal graph over Cache Policy, Cache Misses, and Throughput]

  30. Why Causal Performance Models? To reuse them when the system environment changes
    Causal models remain relatively stable
    [Figure: a partial causal performance model in Jetson Xavier and a partial causal performance model in Jetson TX2, each over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, Cycles, FPS, Energy]

  31. How to use Causal Performance Models?
    How to generate a causal performance model?
    [Figure: causal graph over Cache Policy, Cache Misses, and Throughput]

  32. How to use Causal Performance Models?
    How to generate a causal performance model?
    How to use the causal performance model for performance tasks?
    [Figure: causal graph over Cache Policy, Cache Misses, and Throughput]

  33. Outline
    Motivation
    Causal Inference
    UNICORN
    Results

  34. UNICORN: End-to-end Pipeline
    System Stack
      Software: DeepStream
      Middleware: TF, TensorRT
      Hardware: Nvidia Xavier
      Configuration: Default
    Performance Fault/Issue
      QoS: Th > 40/s
      Observed: Th < 30/s ± 5/s
    Performance Tasks
      • What is the root-cause of fault?
      • How do I fix misconfiguration?
      • How do I optimize perf.?
      • How do I understand perf.?
    Pipeline steps
      1- Specify Performance Query
      2- Learn Causal Performance Model
      3- Determine Next Configuration
      4- Update Causal Performance Model
      5- Estimate Causal Queries
    Other elements: Initial Perf. Data, Causal Performance Model, Causal Inference Engine; legend: Uses, Stage
    Example query: Estimate probability of satisfying QoS if BufferSize is set to 6k?
      P(Th > 40/s | do(BufferSize = 6k))
    Stages
      I. Specify Performance Query
      II. Learn Causal Performance Model
      III. Iterative Sampling
      IV. Update Causal Performance Model
      V. Estimate Causal Queries

  35. Stage-I: Specify Performance Query
    Performance Queries
    Query: What are the root causes of my performance fault and how can I improve performance by 70%?

  36. Stage-I: Specify Performance Query
    Performance Queries
    Query: What are the root causes of my performance fault and how can I improve performance by 70%?
    Query Engine
    Extracted Information
    Info: 70% gain expected
    The query engine extracts information that is useful in the subsequent stages of a performance task.

  37. Stage-II: Learn Causal Performance Model
    Observational data:
          Bitrate (bits/s)  Enable Padding  …  Cache Misses  …  Throughput (fps)
      c1  1k                1               …  42m           …  7
      c2  2k                1               …  32m           …  22
      …   …                 …               …  …             …  …
      cn  5k                0               …  12m           …  25
    1- Recovering the Skeleton: fully connected graph given constraints (e.g., no connections between configuration options)
    [Figure: graph over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No of Cycles, FPS, Energy]

  38. Stage-II: Learn Causal Performance Model
    [Table: observational data for configurations c1 … cn; columns Bitrate (bits/s), Enable Padding, …, Cache Misses, …, Throughput (fps)]
    1- Recovering the Skeleton: fully connected graph given constraints (e.g., no connections between configuration options)
    2- Pruning Causal Model: statistical independence tests
    [Figure: graph after each step over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No of Cycles, FPS, Energy]

  39. Stage-II: Learn Causal Performance Model
    [Table: observational data for configurations c1 … cn; columns Bitrate (bits/s), Enable Padding, …, Cache Misses, …, Throughput (fps)]
    1- Recovering the Skeleton: fully connected graph given constraints (e.g., no connections between configuration options)
    2- Pruning Causal Model: statistical independence tests
    3- Orienting Causal Relations: orientation rules & measures + constraints (colliders, v-structures)
    Partial Ancestral Graph (PAG)
    [Figure: graph after each step over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No of Cycles, FPS, Energy]

  40. Stage-II: Learn Causal Performance Model
    Partial Ancestral Graph (PAG)
    [Figure: PAG over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No of Cycles, FPS, Energy]
    A PAG can have three types of edges between any nodes X and Y:
    • X is a parent of Y
    • A confounder exists between X and Y
    • Not sufficient data to recover causal direction

  43. Stage-II: Learn Causal Performance Model
    [Table: observational data for configurations c1 … cn; columns Bitrate (bits/s), Enable Padding, …, Cache Misses, …, Throughput (fps)]
    1- Recovering the Skeleton: fully connected graph given constraints (e.g., no connections between configuration options)
    2- Pruning Causal Model: statistical independence tests
    3- Orienting Causal Relations: orientation rules & measures + constraints (colliders, v-structures)
    Partial Ancestral Graph (PAG)
    4- Refining Causal Directions: latent search and entropy
    Acyclic Directed Mixed Graph (ADMG)
    [Figure: graph after each step over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No of Cycles, FPS, Energy]

  44. Stage-III: Iterative Sampling
    Elements shown: Causal Performance Model; Select Top K Paths using Average Causal Effect (ACE); Selected Subsection of Causal Performance Model (Bitrate, Branch Misses, FPS); Individual Causal Effect (ICE) Estimation; Recommended Configuration; Interventional Measurement
    [Table: observational data for configurations c1 … cn; columns Bitrate (bits/s), Enable Padding, …, Cache Misses, …, Throughput (fps)]
    [Figure: causal performance model over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No of Cycles, FPS, Energy]

  45. Why Select Top K Paths?
    • In real-world cases, causal graphs can be very complex
    • It may be intractable to reason over the entire graph directly
    [Figure: a real-world causal graph for a data analytics pipeline]

  46. Extracting Causal Paths from the Causal Model
    Extract paths
    • A path always begins with a configuration option or a system event
    • A path always terminates at a performance objective
    [Figure: causal graph over Bitrate, Cache Misses, Branch Misses, and FPS, with the paths extracted from it, e.g., Bitrate, Branch Misses, FPS]
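The path-extraction rule above (start at a configuration option or system event, stop at a performance objective) can be sketched as a depth-first walk over a directed graph. The edge list below is a hypothetical reading of the small example graph on the slide, and the function name is illustrative; it is not UNICORN's implementation.

```python
def extract_causal_paths(edges, start_nodes, objectives):
    """Enumerate directed paths that begin at a start node and end at an objective."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    paths = []

    def walk(node, path):
        if node in objectives:
            paths.append(path)  # a path terminates at a performance objective
            return
        for nxt in children.get(node, []):
            if nxt not in path:  # guard against revisiting a node
                walk(nxt, path + [nxt])

    for start in start_nodes:
        walk(start, [start])
    return paths


# HYPOTHETICAL edges for the example graph (directions are an assumption).
EDGES = [
    ("Bitrate", "Branch Misses"),
    ("Cache Misses", "Branch Misses"),
    ("Branch Misses", "FPS"),
]
# Start nodes: one configuration option and two system events.
paths = extract_causal_paths(EDGES, ["Bitrate", "Cache Misses", "Branch Misses"], {"FPS"})
for p in paths:
    print(" -> ".join(p))
```

On a real causal performance model the number of such paths grows quickly, which is why the following slides rank them and keep only the top K.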

  47. Ranking Causal Paths from the Causal Model
    • There may be too many causal paths.
    • We need to select the most useful ones.
    • Compute the Average Causal Effect (ACE) of each pair of neighbors in a path.
    [Figure: causal path over Bitrate, Branch Misses, FPS]
    \mathrm{ACE}(\mathrm{BranchMisses}, \mathrm{Bitrate}) = \frac{1}{N} \sum \big( E(\mathrm{BranchMisses} \mid do(\mathrm{Bitrate} = b)) - E(\mathrm{BranchMisses} \mid do(\mathrm{Bitrate} = a)) \big)
    E(BranchMisses | do(Bitrate = b)): expected value of Branch Misses when we artificially intervene by setting Bitrate to the value b
    E(BranchMisses | do(Bitrate = a)): expected value of Branch Misses when we artificially intervene by setting Bitrate to the value a
    If this difference is large, then small changes to Bitrate will cause large changes to Branch Misses
    (1/N) ∑: average over all permitted values of Bitrate.
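A minimal sketch of the ACE computation, under stated assumptions. Summing the difference over every ordered pair (a, b) would cancel to zero, so this sketch averages the absolute difference over unordered pairs of permitted values; that reading of the formula, the function name, and the interventional expectations are all hypothetical and are not taken from the slides.

```python
from itertools import combinations


def average_causal_effect(expected_under_do):
    """Average |E[Z | do(X=b)] - E[Z | do(X=a)]| over unordered pairs (a, b).

    expected_under_do maps each permitted value of X to E[Z | do(X=value)].
    """
    values = sorted(expected_under_do)
    pairs = list(combinations(values, 2))
    if not pairs:
        raise ValueError("need at least two permitted values")
    total = sum(abs(expected_under_do[b] - expected_under_do[a]) for a, b in pairs)
    return total / len(pairs)


# HYPOTHETICAL expectations of Branch Misses (in millions) under do(Bitrate = value).
branch_misses_under_do_bitrate = {1000: 10.0, 2000: 14.0, 5000: 22.0}
ace = average_causal_effect(branch_misses_under_do_bitrate)
print(ace)  # 8.0
```

In practice the interventional expectations would come from the learned causal model rather than from a hand-written table.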

  48. Ranking Causal Paths from the Causal Model
    ● Average the ACE of all pairs of adjacent nodes in the path
    ● Rank paths from highest path ACE (PACE) score to the lowest
    ● Use the top K paths for subsequent analysis
    \mathrm{PACE}(Z, Y) = \frac{1}{2} \left( \mathrm{ACE}(Z, X) + \mathrm{ACE}(X, Y) \right)
    The sum runs over all pairs of adjacent nodes in the causal path.
    [Figure: causal path over Bitrate, Branch Misses, FPS]
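The ranking step can be sketched as follows: score each path by the mean ACE of its adjacent node pairs, sort in descending order, and keep the top K. The ACE values and helper names are hypothetical, and the dictionary is keyed by (cause, effect) along the path direction, which is a convention chosen for this sketch.

```python
def path_ace(path, ace):
    """PACE score: mean ACE over adjacent (cause, effect) pairs of a path."""
    pairs = list(zip(path, path[1:]))
    return sum(ace[pair] for pair in pairs) / len(pairs)


def top_k_paths(paths, ace, k):
    """Rank paths from the highest PACE score to the lowest and keep the first k."""
    return sorted(paths, key=lambda p: path_ace(p, ace), reverse=True)[:k]


# HYPOTHETICAL pairwise ACE values.
ACE = {
    ("Bitrate", "Branch Misses"): 8.0,
    ("Cache Misses", "Branch Misses"): 2.0,
    ("Branch Misses", "FPS"): 4.0,
}
PATHS = [
    ["Cache Misses", "Branch Misses", "FPS"],
    ["Bitrate", "Branch Misses", "FPS"],
]
best = top_k_paths(PATHS, ACE, k=1)
print(best)  # [['Bitrate', 'Branch Misses', 'FPS']]
```

For the three-node path Z, X, Y this reduces to the formula on the slide, (ACE(Z, X) + ACE(X, Y)) / 2.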

  49. How to reason over a path?
    To reason, we need to evaluate counterfactual queries that can be formulated using the configuration options and performance objectives in a particular path to resolve a particular performance task.

  50. Counterfactual Queries
    ● Counterfactual inference asks “what if” questions about changes to the misconfigurations
    Example
    "Given that my current Bitrate is 6000 and I have low throughput, what is the probability of having low throughput if my Bitrate is increased to 10000?"
    We are interested in the scenario where:
    • We hypothetically have low throughput;
    Conditioned on the following events:
    • We hypothetically set the new Bitrate to 10000
    • Bitrate was initially set to 6000
    • We observed low throughput when Bitrate was set to 6000
    • Everything else remains the same

  51. Selecting configuration for next intervention
    Top K paths
    Enumerate all possible changes: set every configuration option in the path to all permitted values
    ICE (change): inferred from observational data. This is very cheap
    Change with the largest ICE
    [Figure: causal path over Bitrate, Branch Misses, FPS]

  52. Selecting configuration for next intervention
    Change with the largest ICE
    Measure Performance
    Query Satisfied?
    • Yes: Terminate
    • No: Proceed to next stage
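The selection loop on these two slides can be sketched as: pick the candidate change with the largest estimated ICE, measure the system, and stop once the query is satisfied. Everything below is a toy under stated assumptions: the ICE estimates and measured FPS values are invented, and the callbacks `estimate_ice`, `measure`, and `is_satisfied` are hypothetical stand-ins for the causal model, the real measurement, and the QoS check.

```python
def select_next_change(ice_estimates):
    """Return the candidate change with the largest estimated ICE."""
    return max(ice_estimates, key=ice_estimates.get)


def iterate(estimate_ice, measure, is_satisfied, budget):
    """Apply the largest-ICE change, measure, and stop once the query is satisfied."""
    history = []
    for _ in range(budget):
        ice_estimates = estimate_ice(history)
        if not ice_estimates:
            break  # no candidate changes left
        change = select_next_change(ice_estimates)
        observed = measure(change)
        history.append((change, observed))
        if is_satisfied(observed):
            return change, history
    return None, history


# HYPOTHETICAL ICE estimates and measured throughput (FPS) per candidate change.
ICE = {("Buffer Size", 20000): 0.6, ("Batch Size", 10): 0.4, ("Bitrate", 10000): 0.2}
FPS = {("Buffer Size", 20000): 35.0, ("Batch Size", 10): 42.0, ("Bitrate", 10000): 30.0}


def estimate_ice(history):
    """Toy estimator: drop the changes that were already tried."""
    tried = {change for change, _ in history}
    return {c: v for c, v in ICE.items() if c not in tried}


fix, history = iterate(estimate_ice, FPS.get, lambda fps: fps > 40.0, budget=3)
print(fix, len(history))
```

In the pipeline described in the deck, a measurement that does not satisfy the query feeds the model update in Stage IV before the next change is chosen; the toy estimator above skips that step.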

  53. Stage-IV: Update Causal Performance Model
    1- Evaluate Candidate Interventions
    2- Determine & Perform next Perf. Measurement
    3- Update Causal Performance Model
    4- Replace Causal Performance Model
    Annotations: Model averaging; Expected change in belief & KL; Causal effects on objectives; Interventions on Hardware, Workload, and Kernel options; Intervention 1 … Intervention n; Performance Data; Prior Belief; Belief Update
    Measurement:
      Option/Event/Obj  Values
      Bitrate           1k
      Buffer Size       20k
      Batch Size        10
      Enable Padding    1
      Branch Misses     24m
      Cache Misses      42m
      No of Cycles      73b
      FPS               31/s
      Energy            42J
    [Figure: causal performance model before and after the update over Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, No of Cycles, FPS, Energy]

  54. Stage-V: Estimate Causal Queries
    P(Throughput > 40/s | do(BufferSize = 20000))
    Estimate the probability of satisfying QoS given BufferSize=20000
    • Use do-calculus to evaluate causal queries
    • Estimate budget and additional constraints
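One simple case of evaluating such a query is backdoor adjustment, which can be derived with do-calculus when a set of observed variables blocks all confounding paths. The sketch below assumes, purely for illustration, that a single variable named Workload is such an adjustment set; the records, the variable, and the function name are hypothetical and do not come from the slides.

```python
def p_do(records, buffer_size, threshold):
    """Backdoor-adjusted estimate of P(throughput > threshold | do(BufferSize = buffer_size)).

    records: (workload, buffer_size, throughput) tuples.
    ASSUMPTION: workload is a valid adjustment set for BufferSize -> throughput.
    """
    n = len(records)
    estimate = 0.0
    for workload in {r[0] for r in records}:
        stratum = [r for r in records if r[0] == workload]
        treated = [r for r in stratum if r[1] == buffer_size]
        if not treated:
            raise ValueError(f"no records with BufferSize={buffer_size} for workload {workload!r}")
        p_success = sum(1 for r in treated if r[2] > threshold) / len(treated)
        # Weight the within-stratum success rate by P(workload).
        estimate += p_success * len(stratum) / n
    return estimate


# HYPOTHETICAL observations: (workload, BufferSize, throughput in FPS).
RECORDS = [
    ("light", 20000, 45.0), ("light", 20000, 50.0), ("light", 20000, 38.0), ("light", 6000, 42.0),
    ("heavy", 20000, 41.0), ("heavy", 6000, 30.0), ("heavy", 6000, 28.0), ("heavy", 6000, 35.0),
]

adjusted = p_do(RECORDS, 20000, 40.0)
treated = [r for r in RECORDS if r[1] == 20000]
naive = sum(1 for r in treated if r[2] > 40.0) / len(treated)
print(round(adjusted, 3), naive)
```

The adjusted estimate differs from the naive conditional probability because the larger buffer size is unevenly spread across workloads in this toy data.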

  55. Outline
    Motivation
    Causal Inference
    UNICORN
    Results

  56. Experimental Setup: Systems, Workload, Hardware
    Hardware
      Nvidia TX1: CPU 4 cores, 1.3 GHz; GPU 128 cores, 0.9 GHz; Memory 4 GB, 25 GB/s
      Nvidia TX2: CPU 6 cores, 2.0 GHz; GPU 256 cores, 1.3 GHz; Memory 8 GB, 58 GB/s
      Nvidia Xavier: CPU 8 cores, 2.26 GHz; GPU 512 cores, 1.3 GHz; Memory 16 GB, 137 GB/s
    Systems and workloads
      Xception: Image Recognition (50,000 test images)
      DeepSpeech: Voice Recognition (5 sec. audio clip)
      BERT: Sentiment Analysis (10000 IMDb reviews)
      x264: Video Encoder (11 Mb, 1080p video)

  57. Experimental Setup: Baselines
    Optimization
    Debugging

  58. Results: Efficiency
    Unicorn finds the root-causes accurately

  59. Results: Efficiency
    Unicorn finds the root-causes accurately
    Unicorn achieves higher gain

  60. Results: Efficiency
    Unicorn finds the root-causes accurately
    Unicorn achieves higher gain
    Unicorn performs them much faster
    Takeaway: UNICORN achieves higher sample efficiency than other baselines.

  61. Results: Transferability
    [Figure: Gain % (0 to 90) by Workload Size (10k, 20k, 50k) for Unicorn + 20%, Unicorn + 10%, Unicorn (Reuse), Smac + 20%, Smac + 10%, Smac (Reuse)]
    UNICORN finds configurations with higher gain when the workload changes.

  62. Results: Transferability
    [Figure: Gain % (0 to 90) by Workload Size (10k, 20k, 50k) for Unicorn + 20%, Unicorn + 10%, Unicorn (Reuse), Smac + 20%, Smac + 10%, Smac (Reuse)]
    Takeaway: UNICORN can be effectively reused in new environments for different performance tasks

  63. Results: Scalability
    Discovery time, query evaluation time, and total time do not increase exponentially as the number of configuration options and system events increases

  64. Results: Scalability
    Causal graphs are sparse

  65. Results: Scalability
    Takeaway: UNICORN is scalable for larger multi-component systems and systems with large configuration spaces.

  66. Causal reasoning enables more reliable performance analyses and more transferable performance models
    [Figure: pipeline of Decoder, Muxer, Detector, Tracker, Classifier]

  67. Causal reasoning enables more reliable performance analyses and more transferable performance models
    [Figure: pipeline of Decoder, Muxer, Detector, Tracker, Classifier]
    [Figure: Throughput (FPS) vs. Cache Misses, pooled and segregated by Cache Policy (LRU, FIFO, LIFO, MRU)]
    [Figure: causal graph over Cache Policy, Cache Misses, and Throughput]

  68. Causal reasoning enables more reliable performance analyses and more transferable performance models
    [Figure: pipeline of Decoder, Muxer, Detector, Tracker, Classifier]
    [Figure: Throughput (FPS) vs. Cache Misses, pooled and segregated by Cache Policy (LRU, FIFO, LIFO, MRU)]
    [Figure: causal graph over Cache Policy, Cache Misses, and Throughput]
    [Figure: UNICORN end-to-end pipeline: 1- Specify Performance Query, 2- Learn Causal Performance Model, 3- Determine Next Configuration, 4- Update Causal Performance Model, 5- Estimate Causal Queries]

  69. Causal reasoning enables more reliable performance analyses and more transferable performance models
    [Figure: two panels of Throughput (FPS) versus Cache Misses (100k to 200k), the second with a legend of cache policies (LRU, FIFO, LIFO, MRU), next to a causal graph over Cache Policy, Cache Misses, and Throughput.]
    [Diagram: UNICORN workflow, unchanged from slide 68: 1- Specify Performance Query, 2- Learn Causal Performance Model, 3- Determine Next Configuration, 4- Update Causal Performance Model, 5- Estimate Causal Queries.]
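The cache-miss plot on this slide illustrates why a pooled trend can mislead. A minimal sketch of that effect follows, assuming synthetic runs in which the cache policy shifts both cache misses and throughput. All numbers, and the assignment of values to policies, are invented for illustration.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Hypothetical runs: (cache misses in thousands, throughput in FPS) per policy.
RUNS = {
    "LRU":  [(100, 8), (110, 7), (120, 6)],
    "FIFO": [(130, 12), (140, 11), (150, 10)],
    "LIFO": [(160, 16), (170, 15), (180, 14)],
    "MRU":  [(190, 20), (200, 19), (210, 18)],
}

# Ignoring the policy: one regression over all runs.
pooled = [point for points in RUNS.values() for point in points]
pooled_slope = slope(pooled)

# Accounting for the policy: one regression per policy.
per_policy_slopes = {policy: slope(points) for policy, points in RUNS.items()}

print(f"pooled slope: {pooled_slope:+.3f} FPS per 1k misses")
for policy, value in per_policy_slopes.items():
    print(f"{policy:>4} slope: {value:+.3f} FPS per 1k misses")
```

The pooled slope is positive while every per-policy slope is negative, so a model fitted without the policy variable would suggest that more cache misses raise throughput.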


  70. https://github.com/softsys4ai/UNICORN


  71. Supplementary Slides

  72. Maintaining performance in a highly configurable system is challenging
    • The configuration space is combinatorially large, with thousands of configuration options.


  73. Maintaining performance in a highly configurable system is challenging
    • The configuration space is combinatorially large, with thousands of configuration options.
    • Configuration options from each component interact non-trivially with one another.


  74. Maintaining performance in a highly configurable system is challenging
    • The configuration space is combinatorially large, with thousands of configuration options.
    • Configuration options from each component interact non-trivially with one another.
    • Individual component developers have a localized and limited understanding of the performance behavior of these systems.


  75. Maintaining performance in a highly configurable system is challenging
    • The configuration space is combinatorially large, with thousands of configuration options.
    • Configuration options from each component interact non-trivially with one another.
    • Individual component developers have a localized and limited understanding of the performance behavior of these systems.
    • Each deployment needs to be configured correctly, a process that is prone to misconfigurations.


  76. Maintaining performance in a highly configurable system is challenging
    • The configuration space is combinatorially large, with thousands of configuration options.
    • Configuration options from each component interact non-trivially with one another.
    • Individual component developers have a localized and limited understanding of the performance behavior of these systems.
    • Each deployment needs to be configured correctly every time an environmental change occurs.
    An incorrect understanding of the performance behavior often leads to misconfiguration.


  77. Building configuration space of a highly configurable system
    $C = O_1 \times O_2 \times O_3 \times \dots \times O_n$
    Example options: Batch Size, Interval, Enable Past Frame, Presets


  78. A modern computer system is composed of multiple components.
    [Diagram: a Composed System built from Component 1, Component 2, Component 3, ...]
    Pipeline components: Video Decoder, Stream Muxer, Primary Detector, Object Tracker, Secondary Classifier
    # Configuration Options (in slide order): 55, 86, 14, 44, 86


  79. Building configuration space of a highly configurable system.
    $C = O_1 \times O_2 \times O_3 \times \dots \times O_n$
    Example options: Batch Size, Interval, Enable Past Frame, Presets
    $c_1 \in C$: $\text{False} \times 20 \times 5 \times \dots \times \text{True}$


  80. Building configuration space of a highly configurable system.
    $C = O_1 \times O_2 \times O_3 \times \dots \times O_n$
    Example options: Batch Size, Interval, Enable Past Frame, Presets
    $c_1 \in C$: $\text{False} \times 20 \times 5 \times \dots \times \text{True}$
    $f_1(c_1) = 32$/s (Throughput)
    $f_2(c_1) = 63.6$ Joules (Energy)
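The construction on this slide, a configuration space formed as the Cartesian product of option domains, can be sketched in a few lines. The option domains below are assumptions chosen for illustration, as is the mapping of the slide's example values (False, 20, 5, True) onto option names. The two measurements are the values shown on the slide.

```python
from itertools import product
from math import prod

# Hypothetical option domains; the real system has far more options and values.
OPTIONS = {
    "enable_past_frame": [False, True],
    "batch_size": [10, 20, 30],
    "interval": [0, 5, 10],
    "presets": [False, True],
}


def configuration_space(options):
    """Enumerate C = O_1 x O_2 x ... x O_n as a list of dicts."""
    names = list(options)
    return [dict(zip(names, values)) for values in product(*options.values())]


space = configuration_space(OPTIONS)

# One configuration c_1 in C (assumed mapping of the slide's example values).
c1 = {"enable_past_frame": False, "batch_size": 20, "interval": 5, "presets": True}

# Objectives measured for c_1, as reported on the slide.
measurements = {"throughput_per_s": 32.0, "energy_joules": 63.6}

print(f"|C| = {len(space)}")
print(f"c1 in C: {c1 in space}")
print(f"f1(c1) = {measurements['throughput_per_s']}/s, f2(c1) = {measurements['energy_joules']} J")
```

The size of the space is the product of the domain sizes, so it grows multiplicatively with every option added. Exhaustive enumeration like this is only feasible for toy spaces.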


  81. Results: Transferability
    [Figure: Gain (%) versus Workload Size (10k, 20k, 50k) for Unicorn + 20%, Unicorn + 10%, Unicorn (Reuse), Smac + 20%, Smac + 10%, and Smac (Reuse).]
    [Figure: Accuracy, Precision, Recall, and Gain (%), and Time (Hours), for Unicorn (Reuse), Unicorn + 25, Unicorn (Rerun), Bugdoc (Reuse), Bugdoc + 25, and Bugdoc (Rerun).]
    UNICORN quickly fixes the bug and achieves higher gain, accuracy, precision, and recall when hardware changes.


  82. Why Causal Inference? - Accurate across Environments
    [Figure: two panels, Performance Influence Model (Regression Models) and Causal Performance Model. Each reports Common Terms (Source → Target), Total Terms (Source), and Total Terms (Target) on a Terms axis (0 to 50), and Error (Source), Error (Target), and Error (Source → Target) on a MAPE (%) axis (0 to 90).]
    Annotations: "Common Predictors are Large"; "Common Predictors are lower in number"


  83. Why Causal Inference? - Accurate across Environments
    [Figure: two panels, Performance Influence Model (Regression Models) and Causal Performance Model. Each reports Common Terms (Source → Target), Total Terms (Source), and Total Terms (Target) on a Terms axis (0 to 50), and Error (Source), Error (Target), and Error (Source → Target) on a MAPE (%) axis (0 to 90).]
    Annotations: "Low error when reused"; "High error when reused"; "Common Predictors are Large"; "Common Predictors are lower in number"
    Causal models can be reliably reused when environmental changes occur.
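The two quantities compared in this figure, the terms shared by a source and a target model and the MAPE of a source model reused in the target environment, can be computed as in the following sketch. The term names, measurements, and predictions are invented for illustration and are not results from the paper.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical terms of models learned in two environments.
source_terms = {"batch_size", "buffer_size", "batch_size*interval", "bitrate"}
target_terms = {"batch_size", "buffer_size", "enable_padding"}
common_terms = source_terms & target_terms

# Hypothetical throughput measured in the target environment, and the
# predictions of a model that was learned in the source environment.
target_measured = [20.0, 25.0, 40.0, 50.0]
source_model_predictions = [22.0, 20.0, 44.0, 45.0]
reuse_error = mape(target_measured, source_model_predictions)

print(f"common terms: {sorted(common_terms)}")
print(f"MAPE (Source -> Target): {reuse_error:.1f}%")
```

A model whose terms largely survive the environment change, and whose reuse error stays close to its source error, is the kind the slide describes as reliably reusable.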


  84. Why Causal Inference? - Generalizability
    [Figure: two panels, Performance Influence Model (Regression Models) and Causal Performance Model. Each reports Common Terms (Source → Target), Total Terms (Source), and Total Terms (Target) on a Terms axis (0 to 50), and Error (Source), Error (Target), and Error (Source → Target) on a MAPE (%) axis (0 to 90).]
    Causal models are more generalizable than performance influence models.


  85. Future work
    • Determining more accurate causal graphs by incorporating domain knowledge
    [Diagram: a Causal Performance Model from Observational Data and a Causal Performance Model Corrected by Expert, both over the nodes Bitrate, Buffer Size, Batch Size, Enable Padding, Branch Misses, Cache Misses, Cycles, FPS, and Energy; a Domain Expert contributes Background knowledge.]
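One simple way to picture the expert correction on this slide is as a set of required and forbidden edges applied to a learned graph. The sketch below assumes that representation; the edges, the expert's constraints, and the function names are hypothetical and do not describe how UNICORN would implement this.

```python
def is_acyclic(edges):
    """Return True if the directed graph given as (src, dst) pairs has no cycle."""
    nodes = {node for edge in edges for node in edge}
    indegree = {node: 0 for node in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    ready = [node for node in nodes if indegree[node] == 0]
    visited = 0
    while ready:  # Kahn's algorithm: repeatedly remove nodes with no incoming edge
        node = ready.pop()
        visited += 1
        for src, dst in edges:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return visited == len(nodes)


def apply_background_knowledge(edges, required=(), forbidden=()):
    """Drop forbidden edges, add required ones, and reject cyclic results."""
    corrected = (set(edges) - set(forbidden)) | set(required)
    if not is_acyclic(corrected):
        raise ValueError("background knowledge makes the graph cyclic")
    return corrected


# Hypothetical graph learned from observational data.
learned = {
    ("Batch Size", "Cache Misses"),
    ("Cache Misses", "FPS"),
    ("FPS", "Cycles"),  # implausible direction, to be fixed by the expert
    ("Cycles", "Energy"),
}

# Hypothetical expert input: an objective cannot cause a hardware counter.
corrected = apply_background_knowledge(
    learned,
    required=[("Cycles", "FPS")],
    forbidden=[("FPS", "Cycles")],
)
print(sorted(corrected))
```

Checking acyclicity after each correction keeps the result a valid causal graph even when the expert's constraints conflict with the learned edges.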


  86. Future work
    • Developing new domain-specific languages for performance query specification
    [Diagram labels: End user, Unstructured Performance Queries, Semantic Analysis, Query Engine, Useful Information]
