
End-to-end Deep Learning of Optimization Heuristics (PACT'17)

Chris Cummins
September 12, 2017

Paper: https://github.com/ChrisCummins/paper-end2end-dl

Accurate automatic optimization heuristics are necessary for dealing with the complexity and diversity of modern hardware and software. Machine learning is a proven technique for learning such heuristics, but its success is bound by the quality of the features used. These features must be hand-crafted by developers through a combination of expert domain knowledge and trial and error. This makes the quality of the final model directly dependent on the skill and available time of the system architect.

Our work introduces a better way for building heuristics. We develop a deep neural network that learns heuristics over raw code, entirely without using code features. The neural network simultaneously constructs appropriate representations of the code and learns how best to optimize, removing the need for manual feature creation. Further, we show that our neural nets can transfer learning from one optimization problem to another, improving the accuracy of new models, without the help of human experts.

We compare the effectiveness of our automatically generated heuristics against ones with features hand-picked by experts. We examine two challenging tasks: predicting optimal mapping for heterogeneous parallelism and GPU thread coarsening factors. In 89% of the cases, the quality of our fully automatic heuristics matches or surpasses that of state-of-the-art predictive models using hand-crafted features, providing on average 14% and 12% more performance with no human effort expended on designing features.

Transcript

  1. End-to-end Deep Learning of Optimization Heuristics
    http://chriscummins.cc/pact17

  2. Chris Cummins (University of Edinburgh)
    Pavlos Petoumenos (University of Edinburgh)
    Zheng Wang (Lancaster University)
    Hugh Leather (University of Edinburgh)

  3. Compilers are very complex
    Hand-coded heuristics make hundreds, thousands, millions of choices
    (e.g. from "int main(int argc, char** argv) {..." down to
    "_main: .cfi_startproc ## BB#0: pushq %rbp ...")
    and are out of date by the time of release

  4. Machine learning in compilers
    y = f(x)
    A model f maps features x (derived from IR) to an optimization decision y

  5. Machine learning in compilers
    Training Programs -> Feature Extractor -> Feature Vectors
    Training Programs -> Driver -> Best Decisions
    (Feature Vectors + Best Decisions) -> Training Data -> Optimization Heuristic
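The pipeline on this slide can be sketched in a few lines of Python. This is a minimal, hypothetical example, not any model from the talk: invented feature names and values, with a toy 1-nearest-neighbor classifier standing in for the learned heuristic (the actual prior-art models are decision trees and neural networks).

```python
# Toy feature-based heuristic: memorize (feature vector, best decision)
# pairs gathered by the driver, then predict by nearest neighbor.
# All feature names and numbers below are invented for illustration.

def train(examples):
    """examples: list of (feature_vector, best_decision) pairs
    collected by running training programs through the driver."""
    return list(examples)  # a 1-NN "model" just memorizes its training data

def predict(model, features):
    """Return the decision of the closest training point."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, decision = min(model, key=lambda ex: dist(ex[0], features))
    return decision

# Hypothetical features: (compute ops, memory ops, data transfer size)
training_data = [
    ((120.0, 10.0, 1.0), "GPU"),   # compute-heavy kernel: GPU won
    ((5.0, 40.0, 90.0), "CPU"),    # transfer-dominated kernel: CPU won
]
model = train(training_data)
print(predict(model, (100.0, 8.0, 2.0)))   # -> GPU
```

The point of the sketch is the data flow, not the model: the hard part, which the next slide calls out, is choosing the feature vector in the first place.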

  6. Machine learning in compilers
    Same pipeline, but the Feature Extractor is the human bit!
    1. hard to get right
    2. time consuming
    3. repetitious

  7. [Figure: a learned heuristic divides a 2D feature space
    (Feature "X" vs Feature "Y") into "Use a CPU" and "Use a GPU" regions]

  8. [Figure: the same CPU/GPU feature-space plot]
    The learned boundary is only as good as the feature space:
    need good features!

  9. Ways to fail
    irrelevant: e.g. not capturing the right information
    incomplete: e.g. missing critical information
    unsuitable: e.g. wrong combination of features / model

  10. What we have
    Training Programs -> Feature Extractor -> Feature Vectors
    Training Programs -> Driver -> Best Decisions
    (Feature Vectors + Best Decisions) -> Training Data -> Predictive Model

  11. What we need
    Training Programs -> Driver -> Best Decisions
    (Programs + Best Decisions) -> Training Data -> Predictive Model

  12. Contributions
    Heuristics without features
    Beats expert approach
    Learning across heuristics

  13. Our approach
    Program Code ("int main(int argc, char **argv) { ...")
    -> Deep Learning -> Optimization Decision

  14. Our approach: preprocessing
    Code in -> Rewriter -> Encoder -> Deep Learning -> Optimization Decision
    Rewriter: normalize identifiers & code style
      1. var/fun names: 'foo', 'bar', ... to 'a', 'b', ...
      2. sanitize whitespace
      3. consistent use of optional braces
    Encoder: encode as a sequence of vocabulary indices, using a
      vocabulary table of characters + language keywords
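The two preprocessing steps can be illustrated with a toy sketch. This is not DeepTune's actual rewriter (which is a proper source-to-source tool); the keyword set, renaming rule, and greedy encoding below are simplified assumptions for illustration.

```python
# Toy rewriter + encoder. Assumptions: a tiny keyword set, regex-based
# identifier detection, and greedy longest-match keyword encoding.
import re
import string

KEYWORDS = {"int", "char", "return", "if", "else", "for", "while", "void"}

def rewrite(code):
    """Normalize identifiers: rename each var/fun name to 'a', 'b', ...
    in order of first appearance, and collapse runs of whitespace."""
    names = {}
    def rename(m):
        tok = m.group(0)
        if tok in KEYWORDS:
            return tok
        if tok not in names:
            names[tok] = string.ascii_lowercase[len(names) % 26]
        return names[tok]
    code = re.sub(r"[A-Za-z_]\w*", rename, code)
    return re.sub(r"\s+", " ", code).strip()

def encode(code):
    """Encode as vocabulary indices: language keywords get their own
    vocabulary entries, everything else is encoded character by character."""
    vocab = sorted(KEYWORDS) + sorted(set(code))
    index = {tok: i for i, tok in enumerate(vocab)}
    out, i = [], 0
    while i < len(code):
        kw = next((k for k in KEYWORDS if code.startswith(k, i)), None)
        if kw:
            out.append(index[kw]); i += len(kw)
        else:
            out.append(index[code[i]]); i += 1
    return out, vocab

src = "int   myCount = fooBar(myCount);"
norm = rewrite(src)
print(norm)  # -> int a = b(a);
seq, vocab = encode(norm)
```

Note how normalization makes the two uses of `myCount` map to the same token `a`, so semantically identical programs with different naming styles produce identical input sequences.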

  15. Our approach
    Code in -> Rewriter -> Encoder -> Embedding -> Language Model
      -> Heuristic Model
    Embedding: map vocab indices into real space
    Language Model: summarize the sequence as a vector (2 layer LSTM network)
    Heuristic Model: predict the optimization from that vector (2 layer DNN)
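A shape-level sketch of this stack, with random untrained weights: the dimensions are illustrative rather than the paper's, and the 2-layer LSTM is stood in for by a mean-pooling placeholder, since the point here is only how data flows from vocabulary indices to a decision.

```python
# Shape sketch of an embedding -> sequence summary -> decision stack.
# Dimensions and weights are arbitrary placeholders, not DeepTune's.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMBED, HIDDEN, CLASSES = 128, 64, 64, 2

embedding = rng.normal(size=(VOCAB, EMBED))  # vocab index -> real vector

def language_model(embedded):
    # Placeholder for the 2-layer LSTM: summarize the sequence as one vector.
    return embedded.mean(axis=0)             # (seq_len, EMBED) -> (EMBED,)

W1 = rng.normal(size=(EMBED, HIDDEN))        # heuristic model: 2-layer DNN
W2 = rng.normal(size=(HIDDEN, CLASSES))

def heuristic_model(v):
    h = np.maximum(W1.T @ v, 0)              # hidden layer with ReLU
    logits = W2.T @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()                       # probability per decision

seq = np.array([3, 17, 42, 17, 99])          # encoded program tokens
embedded = embedding[seq]                    # (5, 64): one vector per token
probs = heuristic_model(language_model(embedded))
print(embedded.shape, probs.shape)           # -> (5, 64) (2,)
```

For the binary heterogeneous-mapping task the two output probabilities would correspond to the {CPU, GPU} decision; swapping the head to six outputs covers the coarsening factors.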

  16. Our approach
    Program Code -> Rewriter -> Encoder -> Embedding -> Language Model
      -> Heuristic Model -> Optimization Decision

  17. How does it work?


  18. [image-only slide]

  19. How does it work?
    well

  20. Prior Art
    Heterogeneous Mapping: Grewe et al., CGO'13
    Thread Coarsening: Magni et al., PACT'14

  21. Prior Art
    Decision Space:
      CGO'13: binary classification {CPU, GPU}
      PACT'14: one-of-six classification {1, 2, 4, 8, 16, 32}
    Model:
      CGO'13: Decision Tree
      PACT'14: Cascading Neural Networks

  22. Prior Art: Features (2 papers!)
    CGO'13: 4 features, combined from 7 raw values.
      Instruction counts / ratios.
    PACT'14: 7 features, Principal Components of 34 raw values.
      Instruction counts / ratios / relative deltas.

  23. Our Approach (same raw code input for Heterogeneous Mapping
    and Thread Coarsening)
    1. Use the same model design for both
    2. No tweaking of parameters
    3. Minimum change: 3 line diff

  24. Prior Art
    Hardware:
      CGO'13: 2x CPU-GPU architectures
      PACT'14: 4x GPU architectures
    Training Programs:
      CGO'13: 7 Benchmark Suites
      PACT'14: 3 Benchmark Suites

  25. Results

  26. 14% and 5% improvements over state-of-the-art
    Heterogeneous Mapping speedup: 2.09x (state-of-the-art) vs 2.38x (DeepTune)
    Thread Coarsening speedup: 1.01x (state-of-the-art) vs 1.06x (DeepTune)

  27. 14% and 5% improvements over state-of-the-art
    Heterogeneous Mapping (256 benchmarks): 2.09x (state-of-the-art)
      vs 2.38x (DeepTune)
    Thread Coarsening (17 benchmarks): 1.01x (state-of-the-art)
      vs 1.06x (DeepTune)

  28. Transfer Learning (Heterogeneous Mapping & Thread Coarsening)
    Two copies of the model stack (Embedding + Language Model
      + Heuristic Model):
    one trained as a general model, one specialized to the target heuristic

  29. Transfer Learning (Heterogeneous Mapping & Thread Coarsening)
    Initialize the specialized model with values learned by the
    general model
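The transfer step can be sketched with plain dicts standing in for real networks: copy the general components from the model trained on one heuristic, and leave the task-specific head freshly initialized. The component names and seed values here are hypothetical placeholders.

```python
# Toy transfer-learning sketch: dicts of "weights" in place of networks.
# Component names and values are hypothetical, for illustration only.

def init_model(seed):
    return {
        "embedding": [seed] * 4,        # general: learned token vectors
        "language_model": [seed] * 8,   # general: learned sequence summarizer
        "heuristic_model": [0.0] * 2,   # specialized: per-task decision head
    }

def transfer(src, dst):
    """Initialize dst's general components with src's trained values."""
    for part in ("embedding", "language_model"):
        dst[part] = list(src[part])     # copy trained weights over
    return dst

het_mapping = init_model(seed=0.5)      # pretend this was trained on task 1
het_mapping["heuristic_model"] = [1.0, -1.0]

coarsening = transfer(het_mapping, init_model(seed=0.0))
print(coarsening["embedding"][0])       # -> 0.5 (inherited from task 1)
print(coarsening["heuristic_model"])    # -> [0.0, 0.0] (fresh head)
```

The design intuition, per the slides, is that the embedding and language model capture general properties of code, so a heuristic with little training data (here, thread coarsening with 17 benchmarks) can reuse what was learned on a better-resourced task.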

  30. 14% and 5% improvements over state-of-the-art
    Heterogeneous Mapping speedup: 2.09x (state-of-the-art) vs 2.38x (DeepTune)
    Thread Coarsening speedup: 1.01x (state-of-the-art) vs 1.06x (DeepTune)

  31. 14% and 11% improvements over state-of-the-art
    Heterogeneous Mapping speedup: 2.09x (state-of-the-art) vs 2.38x (DeepTune)
    Thread Coarsening speedup: 1.01x (state-of-the-art) vs 1.06x (DeepTune)
      vs 1.12x (DeepTune w. Transfer Learning)

  32. Try it for yourself!
    http://chriscummins.cc/pact17
    code and data on GitHub
    runs in the browser
    [PACT Artifact Evaluation badges: Consistent, Complete,
    Well Documented, Easy to Reuse, Evaluated]

  33. End-to-end Deep Learning of Optimisation Heuristics
    Problem: feature design is hard
    Featureless heuristics
    First cross-domain learning
    11-14% speedups
    http://chriscummins.cc/pact17