DL4Compilers @ CGO'22 - Speaker Deck

Slide 1

Slide 1 text

DL4Compilers CGO Tutorial Sandya Mannarswamy Dibyendu Das Chris Cummins

Slide 2

Slide 2 text

1. Learning over Programs (Chris) 2. Applications of DL to Compilers (Dibyendu) 3. Challenges & Research Directions (Sandya) overview 3 sections, 45 min each, 10 min Q&A, 5 min break

Slide 3

Slide 3 text


Slide 4

Slide 4 text

Building compilers... a job for life
● 100s of variables
● NP-hard or worse
● Compiler / HW keeps changing

● Bad heuristics
● Wasted energy
● Widening performance gap

Slide 5

Slide 5 text

Collect examples Learn from examples Update heuristic Repeat on change "Build an optimizing compiler, your code will be fast for a day. Teach a compiler to optimize ... "

Slide 6

Slide 6 text

Summarize the program

Program:
void LinearAlgebraOp::AnalyzeInputs(
    OpKernelContext* context, TensorInputs* inputs,
    TensorShapes* input_matrix_shapes, TensorShape* batch_shape) {
  int input_rank = -1;
  for (int i = 0; i < NumMatrixInputs(context); ++i) {
    const Tensor& in = context->input(i);
    if (i == 0) {
      input_rank = in.dims();
      OP_REQUIRES(
          context, input_rank >= 2,
          errors::InvalidArgument(
              "Input tensor ", i, " must have rank >= 2"));

Features: 0.2 0.31 -0.7 1.24

Slide 7

Slide 7 text

Collect examples Features Best Param ... ...

Slide 8

Slide 8 text

Supervised Machine Learner Model Learn from examples Features Param Features Best Param ... ...

Slide 9

Slide 9 text

Model The model is the heuristic Model Model Features Param Features Param Features Param

Slide 10

Slide 10 text

Model The model is the heuristic Model Model Features Param Model Model Features Param Model Features Param Model Model Features Param Model Features Param Features Param New Program Features Predicted param
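The workflow on the preceding slides (collect feature / best-parameter pairs, learn from them, then use the model as the heuristic) can be sketched as follows. This is only a minimal illustration: the 1-nearest-neighbour learner, the feature vectors, and the parameter values are all assumptions made up for the example, not taken from the tutorial.

```python
import math

# Hypothetical training data: (feature vector, best parameter found by search).
TRAINING = [
    ([0.2, 0.31, -0.7, 1.24], 4),
    ([1.5, 0.10, 0.3, 0.02], 1),
    ([0.1, 0.40, -0.6, 1.10], 4),
    ([1.2, 0.05, 0.5, 0.10], 2),
]


def predict(features, training=TRAINING):
    """Return the parameter of the closest training example (1-NN)."""
    nearest = min(training, key=lambda example: math.dist(example[0], features))
    return nearest[1]


# "New program": extract its features, then ask the model for a parameter.
predicted_param = predict([0.25, 0.30, -0.65, 1.20])
```

Any supervised learner could stand in for the nearest-neighbour lookup; the point is only that the trained model replaces the hand-written heuristic.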

Slide 11

Slide 11 text

Feature Vectors Feature Vectors Best Decisions Feature Vectors Feature Vectors Feature Vectors Feature Vectors Feature Vectors Training Programs Feature Vectors Feature Vectors Datasets Ad-hoc Drivers Training Data Feature Extractor Learned Heuristic Machine learning for compilers

Slide 12

Slide 12 text

Feature Vectors Feature Vectors Training Programs Feature Vectors Feature Vectors Datasets Feature Vectors Feature Vectors Best Decisions Feature Vectors Feature Vectors Feature Vectors Ad-hoc Drivers Training Data Feature Extractor Learned Heuristic Machine learning for compilers The bit I'm going to talk about

Slide 13

Slide 13 text

Feature Vectors Feature Vectors Best Decisions Feature Vectors Feature Vectors Feature Vectors Ad-hoc Drivers Training Data Feature Extractor Learned Heuristic Machine learning for compilers (the bit Dibyendu is going to talk about) Feature Vectors Feature Vectors Training Programs Feature Vectors Feature Vectors Datasets

Slide 14

Slide 14 text

History of ML in compilers Autotuning 1970s 1998 2008 2015 2012 2020 Milepost GCC AlexNet 2016 GGNN 2017 [paper] [paper] ML-guided Autotuning [paper] MLGO 2003 [paper] [source] DeepTune [paper] ICSA'22 competition

Slide 15

Slide 15 text

1. Learning over Programs

Slide 16

Slide 16 text

Three approaches to program representation
1. Handcrafted features: Rotem et al. Profile Guided Optimization without Profiles: A Machine Learning Approach (2022)
2. Language Modeling: Cummins et al. End-to-end Deep Learning of Optimization Heuristics (2017)
3. Graph Reasoning: Cummins et al. ProGraML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations (2021)

Slide 17

Slide 17 text

Slide 18

Slide 18 text

How it works

Program:
void LinearAlgebraOp::AnalyzeInputs(
    OpKernelContext* context, TensorInputs* inputs,
    TensorShapes* input_matrix_shapes, TensorShape* batch_shape) {
  int input_rank = -1;
  for (int i = 0; i < NumMatrixInputs(context); ++i) {
    const Tensor& in = context->input(i);
    if (i == 0) {
      input_rank = in.dims();
      OP_REQUIRES(
          context, input_rank >= 2,
          errors::InvalidArgument(
              "Input tensor ", i, " must have rank >= 2"));

Features: 0.2 0.31 -0.7 1.24

Slide 19

Slide 19 text

How it works

Program:
void LinearAlgebraOp::AnalyzeInputs(
    OpKernelContext* context, TensorInputs* inputs,
    TensorShapes* input_matrix_shapes, TensorShape* batch_shape) {
  int input_rank = -1;
  for (int i = 0; i < NumMatrixInputs(context); ++i) {
    const Tensor& in = context->input(i);
    if (i == 0) {
      input_rank = in.dims();
      OP_REQUIRES(
          context, input_rank >= 2,
          errors::InvalidArgument(
              "Input tensor ", i, " must have rank >= 2"));

IR: (CFG, DFG, AST,...)

Features: 0.2 0.31 -0.7 1.24
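As a rough sketch of the Program, IR, Features pipeline, the snippet below derives a small handcrafted feature vector from a few lines of IR-like text. The toy IR, the `opcode_of` helper, and the choice of four features are assumptions for illustration; real feature extractors are normally written as compiler analyses rather than text processing.

```python
from collections import Counter

# Hypothetical, simplified IR in an LLVM-like textual form.
TOY_IR = """\
%1 = load i32, i32* %a
%2 = add i32 %1, 1
store i32 %2, i32* %a
br label %exit
"""


def opcode_of(line):
    """Return the opcode of one IR line ('%x = op ...' or 'op ...')."""
    parts = line.split()
    if "=" in parts:
        return parts[parts.index("=") + 1]
    return parts[0]


def extract_features(ir_text):
    """Feature vector: [#instructions, #memory ops, #branches, memory-op ratio]."""
    counts = Counter(opcode_of(line) for line in ir_text.splitlines() if line.strip())
    total = sum(counts.values())
    memory_ops = counts["load"] + counts["store"]
    ratio = memory_ops / total if total else 0.0
    return [total, memory_ops, counts["br"], ratio]


features = extract_features(TOY_IR)
```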

Slide 20

Slide 20 text

Slide 21

Slide 21 text

Case Study: PGO ● PGO is difficult to deploy because it requires multiple steps ● LLVM has a set of hard-coded rules that predict things (e.g. branch probabilities) ● Developed by dozens of engineers, using thousands of lines of code, over a decade

Slide 22

Slide 22 text

Case Study: PGO

Slide 23

Slide 23 text

Advantages
1. Interpretable: e.g. "# instructions"
2. Fast to extract: typically lightweight analyses
3. Fast to process: e.g. >100k inferences / sec
Drawbacks
1. Difficult to get right: how do you know when "done"?
2. Time consuming to develop: model / features relationship
3. Repetitious: features aren't transferable

Slide 24

Slide 24 text

Ways to fail
● Irrelevant: e.g. not capturing the right information
● Incomplete: e.g. missing critical information
● Unsuitable: e.g. wrong combination of features + model

Slide 25

Slide 25 text

Slide 26

Slide 26 text

Slide 27

Slide 27 text

Feature Vectors Feature Vectors Training Programs Feature Vectors Feature Vectors Datasets Feature Vectors Feature Vectors Best Decisions Ad-hoc Drivers Learned Heuristic How it works Training Data

Slide 28

Slide 28 text

Program Code Code in Normalizer Tokenizer Optimization Decision ✓ LSTM DNN

Slide 29

Slide 29 text

Tokenization kernel void A(global float* a, const float b) { a[get_global_id(0)] *= 3.14 + b; } Vocab Encoded Token Index Candidate Vocab const float get_global_id global int kernel void ... Input

Slide 30

Slide 30 text

Tokenization kernel void A(global float* a, const float b) { a[get_global_id(0)] *= 3.14 + b; } Vocab Encoded Token Index kernel 0 0 Candidate Vocab const float get_global_id global int kernel void ... Input

Slide 31

Slide 31 text

Tokenization kernel void A(global float* a, const float b) { a[get_global_id(0)] *= 3.14 + b; } Vocab Encoded Token Index kernel 0 [space] 1 0 1 Candidate Vocab const float get_global_id global int kernel void ... Input

Slide 32

Slide 32 text

Tokenization kernel void A(global float* a, const float b) { a[get_global_id(0)] *= 3.14 + b; } Vocab Encoded Token Index kernel 0 [space] 1 void 2 0 1 2 Candidate Vocab const float get_global_id global int kernel void ... Input

Slide 33

Slide 33 text

Tokenization kernel void A(global float* a, const float b) { a[get_global_id(0)] *= 3.14 + b; } Vocab Encoded Token Index kernel 0 [space] 1 void 2 0 1 2 1 Candidate Vocab const float get_global_id global int kernel void ... Input

Slide 34

Slide 34 text

Tokenization kernel void A(global float* a, const float b) { a[get_global_id(0)] *= 3.14 + b; } Vocab Encoded Token Index kernel 0 [space] 1 void 2 A 3 0 1 2 1 3 Candidate Vocab const float get_global_id global int kernel void ... Input

Slide 35

Slide 35 text

Tokenization kernel void A(global float* a, const float b) { a[get_global_id(0)] *= 3.14 + b; } Vocab Encoded Token Index kernel 0 [space] 1 void 2 A 3 ( 4 0 1 2 1 3 4 Candidate Vocab const float get_global_id global int kernel void ... Input

Slide 36

Slide 36 text

Tokenization kernel void A(global float* a, const float b) { a[get_global_id(0)] *= 3.14 + b; } Vocab Encoded Token Index kernel 0 [space] 1 void 2 A 3 ( 4 global 5 0 1 2 1 3 4 5 Candidate Vocab const float get_global_id global int kernel void ... Input

Slide 37

Slide 37 text

Tokenization kernel void A(global float* a, const float b) { a[get_global_id(0)] *= 3.14 + b; } Vocab Encoded Token Index kernel 0 [space] 1 void 2 A 3 ( 4 global 5 0 1 2 1 3 4 5 1 Candidate Vocab const float get_global_id global int kernel void ... Input

Slide 38

Slide 38 text

Tokenization kernel void A(global float* a, const float b) { a[get_global_id(0)] *= 3.14 + b; } Vocab Encoded Token Index kernel 0 [space] 1 void 2 A 3 ( 4 global 5 float 6 0 1 2 1 3 4 5 1 6 Candidate Vocab const float get_global_id global int kernel void ... Input

Slide 39

Slide 39 text

Tokenization kernel void A(global float* a, const float b) { a[get_global_id(0)] *= 3.14 + b; } Vocab Encoded Token Index kernel 0 [space] 1 void 2 A 3 ( 4 global 5 float 6 * 7 0 1 2 1 3 4 5 1 6 7 Candidate Vocab const float get_global_id global int kernel void ... Input

Slide 40

Slide 40 text

Tokenization kernel void A(global float* a, const float b) { a[get_global_id(0)] *= 3.14 + b; } Vocab Encoded Token Index kernel 0 [space] 1 void 2 A 3 ( 4 global 5 float 6 * 7 0 1 2 1 3 4 5 1 6 7 1 Candidate Vocab const float get_global_id global int kernel void ... Input

Slide 41

Slide 41 text

Tokenization kernel void A(global float* a, const float b) { a[get_global_id(0)] *= 3.14 + b; } Vocab Encoded Token Index kernel 0 [space] 1 void 2 A 3 ( 4 global 5 float 6 * 7 a 8 0 1 2 1 3 4 5 1 6 7 1 8 Candidate Vocab const float get_global_id global int kernel void ... Input

Slide 42

Slide 42 text

Tokenization
Input: kernel void A(global float* a, const float b) { a[get_global_id(0)] *= 3.14 + b; }
Candidate Vocab: const float get_global_id global int kernel void ...
Vocab:
Token          Index
kernel         0
[space]        1
void           2
A              3
(              4
global         5
float          6
*              7
a              8
,              9
const          10
b              11
)              12
{              13
\n             14
[              15
get_global_id  16
0              17
Encoded: 0 1 2 1 3 4 5 1 6 7 1 8 ...
181 tokens 33M tokens
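The tokenization walk-through above can be reproduced with a short sketch. It assumes greedy longest-match against the candidate vocabulary with a fall-back to single characters, and assigns indices in order of first appearance; the function name `tokenize` and the exact line breaks in the sample kernel are choices made for this example.

```python
CANDIDATE_VOCAB = ["const", "float", "get_global_id", "global", "int", "kernel", "void"]

KERNEL = "kernel void A(global float* a, const float b) {\n  a[get_global_id(0)] *= 3.14 + b;\n}"


def tokenize(source, candidates=CANDIDATE_VOCAB):
    """Return (vocab, encoded): token -> index map and the index sequence."""
    ordered = sorted(candidates, key=len, reverse=True)  # try longest tokens first
    vocab = {}
    encoded = []
    pos = 0
    while pos < len(source):
        # Use a multi-character candidate if one starts here, else one character.
        token = next((c for c in ordered if source.startswith(c, pos)), source[pos])
        encoded.append(vocab.setdefault(token, len(vocab)))
        pos += len(token)
    return vocab, encoded


vocab, encoded = tokenize(KERNEL)
```

With this input, `encoded` begins `0 1 2 1 3 4 5 1 6 7 1 8`, and `get_global_id` receives index 16, matching the slide.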

Slide 43

Slide 43 text

CGO’13 PACT’14 Prior Art Heterogeneous Mapping Thread Coarsening

Slide 44

Slide 44 text

Prior Art (2x papers!)
Heterogeneous Mapping (CGO’13)
● Decision space: binary classification, {CPU, GPU}
● Model: Decision Tree
● Features: combinations of values from ad-hoc LLVM analysis
Thread Coarsening (PACT’14)
● Decision space: one-of-six classification, {1, 2, 4, 8, 16, 32}
● Model: Cascading Neural Networks
● Features: 7 Principal Components of 34 raw features

Slide 45

Slide 45 text

CGO’13 PACT’14 Heterogeneous Mapping Thread Coarsening Using DeepTune 1. Use the same model design for both 2. No tweaking of parameters 3. Minimum change - 3 line diff

Slide 46

Slide 46 text

14% and 5% improvements over state-of-the-art Tiny dataset!

Slide 47

Slide 47 text

Heterogeneous Mapping Thread Coarsening +fine-tune

Slide 48

Slide 48 text

14% and 5% improvements over state-of-the-art

Slide 49

Slide 49 text

14% and 11% improvements over state-of-the-art

Slide 50

Slide 50 text

Advantages
1. No Feature Engineering: save time!
2. Enables Transfer Learning: reuse training across problems
Drawbacks
1. Black box: not interpretable
2. Doesn't suit all problems: nonlinear code dependencies?

Slide 51

Slide 51 text

Slide 52

Slide 52 text

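How it works
1. IR: source (int main(int argc, char** argv) {...) is lowered to compiler IR (target triple = "..." define i32 @main() {...)
2. Graph: Control-Flow, Data-Flow, and Call Graph combined, with Control, Data, and Call edges
3. GGNN
[Figure: example graph with nodes %0, res, %1, main, printf, A, br, ret, add, mul]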

Slide 53

Slide 53 text

Building ProGraML: Control-flow Derived from compiler IR (here, LLVM) Full-flow-graph: represent each instruction as a vertex. Vertex label is the instruction name. Edges are control-flow. Edge position attribute for branching control-flow.

Slide 54

Slide 54 text

Building ProGraML: Data-flow Add graph vertices for constants (diamonds) and variables (oblongs). Vertex label is the data type. Edges are data-flow. Edge position attribute for operand order.

Slide 55

Slide 55 text

Building ProGraML: Call-flow Edges are call-flow. Inbound edge to function entry instruction. Outbound edge from (all) function exit instruction(s).
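The three construction steps (control-flow, data-flow, call-flow) can be pictured with a small data structure. This is a hypothetical sketch, not the ProGraML library API: the `ProgramGraph` class, its field layout, and the toy caller/callee are assumptions chosen to mirror the slides.

```python
from dataclasses import dataclass, field


@dataclass
class ProgramGraph:
    nodes: list = field(default_factory=list)  # (kind, label) per vertex
    edges: list = field(default_factory=list)  # (src, dst, flow, position)

    def add_node(self, kind, label):
        self.nodes.append((kind, label))
        return len(self.nodes) - 1

    def add_edge(self, src, dst, flow, position=0):
        self.edges.append((src, dst, flow, position))


def build_example():
    """Toy graph: a call site whose callee executes `add` and then `ret`."""
    g = ProgramGraph()
    # Control-flow: one vertex per instruction, labelled with the instruction name.
    call = g.add_node("instruction", "call")
    add = g.add_node("instruction", "add")
    ret = g.add_node("instruction", "ret")
    g.add_edge(add, ret, "control")
    # Data-flow: variable and constant vertices labelled with the data type;
    # the position attribute records operand order.
    var = g.add_node("variable", "i32")
    const = g.add_node("constant", "i32")
    g.add_edge(var, add, "data", position=0)
    g.add_edge(const, add, "data", position=1)
    # Call-flow: into the callee's entry instruction, back out from its exit.
    g.add_edge(call, add, "call")
    g.add_edge(ret, call, "call")
    return g


graph = build_example()
```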

Slide 56

Slide 56 text

Learning with ProGraML: Node Embeddings
Use vertex labels as embedding keys. Derive the vocab from the set of unique vertex labels on the training graphs.
Separating type and instruction nodes leads to a compact vocab and excellent coverage on unseen programs compared to prior approaches:
● inst2vec: combined instruction + operands
● CDFG: uses only instructions for vocab, ignores data
[Figure residue: br, add, i32; 0, 1, 2; i32 = a]

Slide 57

Slide 57 text

Learning with ProGraML: GGNNs
Message Passing
● 6 typed weight matrices for {forwards, backwards} x {control, data, call} edge types
● Position gating to differentiate control branches and operand order
Readout Head
● Per-vertex prediction after T message-passing steps
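To make the "6 typed weight matrices" concrete, here is a deliberately simplified message-passing step in plain Python. It is not a full GGNN: the GRU update is replaced by a `tanh` of state plus summed messages, position gating is omitted, and the weights are fixed scaled identity matrices instead of learned parameters. All names are hypothetical.

```python
import math

FLOWS = ("control", "data", "call")
DIRECTIONS = ("forward", "backward")


def make_weights(dim, scale=0.1):
    """One dim x dim matrix per (flow, direction) pair: 3 x 2 = 6 in total."""
    return {
        (flow, direction): [[scale if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        for flow in FLOWS
        for direction in DIRECTIONS
    }


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def message_passing_step(states, edges, weights):
    """states: per-node vectors; edges: (src, dst, flow) triples."""
    dim = len(states[0])
    incoming = [[0.0] * dim for _ in states]
    for src, dst, flow in edges:
        # Each edge carries a forward message (src -> dst) and a backward one.
        for sender, receiver, direction in ((src, dst, "forward"), (dst, src, "backward")):
            message = matvec(weights[(flow, direction)], states[sender])
            incoming[receiver] = [a + b for a, b in zip(incoming[receiver], message)]
    # Simplified update; a real GGNN would apply a GRU cell here.
    return [[math.tanh(h + m) for h, m in zip(state, inc)]
            for state, inc in zip(states, incoming)]


weights = make_weights(dim=2)
new_states = message_passing_step([[1.0, 0.0], [0.0, 1.0]], [(0, 1, "control")], weights)
```

Running the step T times and then applying a per-vertex classifier to the final states would correspond to the readout head described on the slide.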

Slide 58

Slide 58 text

Learning compiler analyses (F1 Scores)
Analysis                        DeepTune   ProGraML
Reachability                    0.504      0.996
Dominator Trees                 0.114      0.781
Data Dependencies               0.236      0.993
Live-out Variables              -          1.000
Global Common Subexpressions    0.214      0.930
Dataset: 250k LLVM graphs covering 6 programming languages

Slide 59

Slide 59 text

Advantages
1. Powerful representation: well-established principles
2. Models non-linear relations: broad range of data flows
Drawbacks
1. Slow to create / process: many nodes, many FLOPs
2. Lossy node featurization: how to represent literals?
3. GNNs struggle on large inputs

Slide 60

Slide 60 text

Conclusions 1. ML promises better optimizations, faster 2. Variety of approaches to featurizing code: a. Handcrafted analyses b. Deep Language Modeling c. Graph Reasoning 3. Not a "solved problem", plenty to be done!

Slide 61

Slide 61 text

5 min break

Slide 62

Slide 62 text

2. Challenges and Research Directions

Slide 63

Slide 63 text

Presenter

Slide 64

Slide 64 text

Outline ● DL in Production Compilers – Challenges ● DL Driven Compiler Heuristics ● A Case Study ● Takeaways and Research Directions

Slide 65

Slide 65 text

ML/DL in Compiler Optimization is not New! ● Considerable research in applying ML to various problems ● Even from last decade: ○ Sameer Kulkarni, John Cavazos, Christian Wimmer, Douglas Simon. Automatic construction of inlining heuristics using machine learning. CGO 2013 ○ H. Leather, E. Bonilla, M. O’Boyle. Automatic feature generation for machine learning based optimizing compilers. CGO 2009 (Unroll factor learning) ● Not much of this has made it to production compilers ● Adopting ML/DL in production compilers is not easy! ● Many challenges in adopting ML/DL driven optimizations!

Slide 66

Slide 66 text

ML/DL in Production Compilers - Challenges ● “One of the challenges is that the existing compilers, LLVM included, were never designed for machine learning integration. ● And so there’s a lot of work that could be done to integrate machine learning techniques … into our compiler frameworks. ● But because the abstractions were wrong, it’s really hard to do that outside of a one-off research paper…”

Slide 67

Slide 67 text

With great challenges come opportunities for research ☺

Slide 68

Slide 68 text

Learning Compiler Heuristics

Slide 69

Slide 69 text

Compiler Heuristics ● Every optimization can be modelled with two phases ○ Analysis & Transformation ● Analysis Phase (Decision Making) ○ Driven by heuristics, No code change involved ○ Wrong decisions – no impact on correctness. Only on Performance. ● Transformation Phase (Implementing the Decision) ○ Impact Correctness (Not just performance) ● Complex Optimizations - Inlining, Scheduling, Register allocation ○ NP Hard Problems ○ Decision making approximated by complex and multifactorial heuristics

Slide 70

Slide 70 text

Heuristics vs Learnt Models/Policies
● Heuristics are human-trained
○ Based on a human-manageable set of benchmarks and regression cases
● Heuristics are human-written code that needs to be maintained
○ Limits the number of program features and combinations involved
○ But using more features and feature combinations -> better opt. decisions
● Human heuristics are more comprehensible in theory, but can grow complex over time
● DL scales easily to large sets of training examples
○ Generalization to diverse real-world programs is likely to be better
● DL scales well with the addition of features
○ Avoids need of retraining models often in production compilers
○ Can also automatically discover profitable feature combinations
● DL models are not comprehensible/explainable

Slide 71

Slide 71 text

Applying DL in Compilers Human Trained, NO ML Assistive ML Embedded ML

Slide 72

Slide 72 text

Desiderata for DL driven Compiler Policies ● Day to day production use of compiler should remain unchanged ● Separate deployment of compiler from training of ML models/policies ● Minimize compile time overheads to acceptable levels ○ ML model embedded in compiler and deployed in inference mode ● Can cater to different user personas ○ Normal User (uses compiler to compile his application) ○ Compiler engineer who is developing/maintaining the compiler

Slide 73

Slide 73 text

DL driven Compiler Heuristics Case Study ● Compiler Heuristics amenable to replacement by learnt models/policies ● What are the challenges in adopting this in a production compiler? ● Let us look at a case study ○ MLGO: a Machine Learning Guided Compiler Optimizations Framework (Trofin et al. 2020).

Slide 74

Slide 74 text

MLGO Case Study

Slide 75

Slide 75 text

MLGO Overview ● Use ML driven policy for the task of inlining-for-size in LLVM compiler ● Size instead of performance ○ Modelling performance rewards are noisy, costly ● Inlining heuristics are complex! ● Supervised learning not viable ○ there are no optimal labels for the task ○ no simple way of saying an inline decision is optimal ● Need to explore different strategies, learn from these experiences ● Reinforcement Learning (RL) & Evolution Strategies (ES) more suitable

Slide 76

Slide 76 text

User Personas Envisaged Normal User Compiler Developer

Slide 77

Slide 77 text

Supporting Normal User Persona in MLGO
Normal User Needs
• Correctness and performance of the generated code should not be impacted
• Timeliness of build
• Compilation determinism (incremental build support)
• No added cost/complexity to build and release pipelines
MLGO Design Goals
• No visible changes to normal user
• Separating Correctness & Policy
• Only support ML driven heuristics, not code changes
• Minimal impact on compile time
• No online training
• Should not need frequent retraining
• Generalize across code bases

Slide 78

Slide 78 text

Supporting Compiler Engineer Persona
Compiler Engineer Needs
• Wants better optimizations in compiler
• Wants to apply ML to improve compiler passes
• Improve ML driven opts
• Fix regressions & ship blockers
MLGO Design Goals
• Support efficient retraining
• ML models/policy visible to user
• Can lead to additional dependencies
• Build/release pipelines can be changed
• Improve policy by adding missing features/regressions to retraining data
• Flexibility to support alternate training algorithms

Slide 79

Slide 79 text

Usage Scenarios for DL Driven Policies ● Policy/Model Creation ○ Dataset created, new ML model trained iteratively to replace existing human heuristics. ○ User Persona – Compiler Engineer ● Policy Deployment ○ Model incorporated into compiler in inference mode and deployed in release compiler ○ User persona – Application Developer ● Policy Improvement ○ Retraining Model to improve performance. User Persona : Compiler Engineer ● Policy Maintenance ○ Bug Fixing ML model. User Persona: Compiler Engineer

Slide 80

Slide 80 text

Inlining Pass in LLVM Compiler ● Operates on SCC of Call Graph of a module in a bottom up order ● The inlined callee’s call sites added to worklist for iterative processing ● Inlining pass includes a number of decisions ○ The order of traversal ○ Clean ups done and their timing ○ Decision to inline a callsite or not ● MLGO focusses on the decision to inline a callsite or not based on size

Slide 81

Slide 81 text

Inlining Heuristics in LLVM Compiler ● Compute static cost of callee post inlining ● Compare the computed cost with a threshold ● Threshold based on call site hotness and inline keyword ● Bonuses/threshold modifications based on ○ Callee characteristics like single BB, number of SIMD instructions etc ● Inlining may be deferred, if it may be profitable to inline caller first ● Interplay of a number of program characteristics ○ Local to the callsite ○ Global ○ Source level directives, optimization options etc
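The cost-versus-threshold structure described on this slide can be sketched as below. The numbers and parameter names are purely illustrative assumptions and are not LLVM's actual constants or API; the sketch only shows how call-site hotness, the inline keyword, and a callee bonus each adjust the threshold before the comparison.

```python
def should_inline(callee_cost, hot_callsite=False, inline_keyword=False,
                  single_basic_block=False):
    """Return True if the estimated callee cost is below the adjusted threshold."""
    threshold = 100                  # hypothetical base threshold
    if hot_callsite:
        threshold += 200             # hot call sites tolerate larger callees
    if inline_keyword:
        threshold += 50              # source-level hint raises the threshold
    if single_basic_block:
        threshold += threshold // 2  # bonus for simple callees
    return callee_cost < threshold


decision = should_inline(120, inline_keyword=True)
```

Every branch here is a hand-tuned rule; a learnt policy would replace this function with a model that takes the same kinds of inputs as features.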

Slide 82

Slide 82 text

Replacing Manual Heuristics with learnt models ● MLGO trains the inlining policy with 2 different algorithms 1. Reinforcement Learning based 2. Evolution Strategies (ES) based ● To handle cold start issue for RL Policy, use behavioral cloning ● Behavioral Cloning mimics the standard LLVM Inliner heuristics

Slide 83

Slide 83 text

RL driven Approach ● In RL, an agent interacts with the environment ○ based on current state and learnt policy, performs actions ● The action leads to a reward ○ Also changes the current state of the environment ● Reward feedback tunes the policy for further steps ● In our case, compiler is the agent and Policy is the learnt model for heuristics ● Action is inline/not inline & State is the current state of the call graph ● We will talk about reward later!

Slide 84

Slide 84 text

RL Formulation ● Inlining for size formulated as Markov Decision Process ○ Sequential Decision Making ● MDP represented by the tuple < S, A, P, R > ○ state space S ○ action space A, ○ state transition distribution P(𝑠′|𝑠, 𝑎), ○ reward function R(𝑠, 𝑎). ● The agent’s decisions governed by policy 𝜋 = 𝑃𝑟 (𝑎|𝑠) ○ maps observed state 𝑠 to a distribution over actions. ○ 𝜋 is a neural network and we call it policy network. ● Goal is to find the optimal policy 𝜋∗ to maximize total reward
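Written out, the goal on this slide is the usual expected-return objective. The finite horizon T and the trajectory symbol τ are assumptions added here for concreteness; they do not appear on the slide.

```latex
\pi^{*} \;=\; \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{T} R(s_t, a_t) \right],
\qquad a_t \sim \pi(\cdot \mid s_t), \quad s_{t+1} \sim P(\cdot \mid s_t, a_t)
```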

Slide 85

Slide 85 text

RL - Inlining for Size ● State S is the current call graph state and the call site being visited ○ Not practical ○ Approximated using a set of features ● Action A = {0, 1}: 0 -> no inline, 1 -> inline ● Deterministic state transition based on the action; the call graph is updated ● Reward R – native size reduction after action A ○ If inlined: R = S(caller_before) – S(caller_after) + [S(callee) if callee deleted, 0 if not deleted] ○ Not inlined: R = 0 ○ Compute total native size with/without inlining and subtract
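The per-decision reward formula above translates directly into code. The function and argument names are hypothetical; sizes are assumed to be native sizes in bytes supplied by the caller.

```python
def inline_reward(inlined, caller_before, caller_after, callee_size, callee_deleted):
    """Size reduction credited to one inlining decision (positive is better)."""
    if not inlined:
        return 0
    # The callee's size is only saved if inlining let the callee be deleted.
    saved_callee = callee_size if callee_deleted else 0
    return caller_before - caller_after + saved_callee


# Caller grows by 20 bytes, but a 40-byte callee disappears: net saving of 20.
reward = inline_reward(True, caller_before=100, caller_after=120,
                       callee_size=40, callee_deleted=True)
```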

Slide 86

Slide 86 text

Representing RL State Space • RL State representation is Call Graph and call site • Encoding CG state at each point is computationally expensive • MLGO approximates the state space ○ by handful of numerical features ○ Local call site related features and global CG features

Slide 87

Slide 87 text

Challenge: Representing RL State Space ● Full RL State space representation computationally unviable ● Can impact compile time significantly ● MLGO trades off state representation fidelity to mitigate this ○ Falls back to handful of numeric features ● This reduces information available to the RL model ○ impacts the policy trained in MLGO ● MLGO also does not use IR code embedding of callee.. ○ To reduce memory/compute costs ● Opportunity: Develop computationally viable & high fidelity ○ State space representation ○ IR embedding representations (both task agnostic/task specific)

Slide 88

Slide 88 text

Challenge: RL Reward Computation ● Difficult to estimate native function size during inlining pass ● MLGO opts to use total reward instead of partial rewards ○ Evaluate native size with and without inlining and subtract ● This requires more compute and can impact model quality ● Inlining for performance would make this even more complicated! Opportunity: Develop scalable reward formulations without impacting model quality

Slide 89

Slide 89 text

Data Collection Challenges ● No standard/ready made datasets for inlining for size.. ● Training Data Collection is a major bottleneck ● Needs to be parallelized to reduce training cycle time ● Model Quality can vary based on ○ Generality of corpus ○ Size of training corpus Training Data Collection

Slide 90

Slide 90 text

Model Improvement Challenges ● Similar to Manual Heuristics Improvement ● Long cycle time and requires compiler engineer expertise ● Identify missing features based on regressions ○ Black box nature of DL algorithms makes it difficult ● Incorporate regression test cases from field and retraining ○ Requires model updates in production compiler ● Explore alternative learning algorithms ○ Longer dev time and compiler release updates needed ○ Trade off between simpler (better interpretable) algorithms vs performance

Slide 91

Slide 91 text

Model Debug/Fix - Challenges ● Similar challenges as model improvement ● DL model policies blackbox in nature, hamper debugging ● For Ship blockers, fall back to ○ Earlier working policy/model ○ Manual heuristics ● Fix would require model retraining ○ Trained with newer training data ○ Adding/dropping features ● Selective application of manual heuristics to buggy code + DL model driven inlining for rest of code base

Slide 92

Slide 92 text

Key takeaways from MLGO Case Study ● Trade-offs exist between model/policy quality and compute costs ● Timeliness of compiles is non-negotiable for normal user ● Explainability of DL models/policies is desirable for troubleshooting

Slide 93

Slide 93 text

Research Opportunities from MLGO Study ● How do we support speed optimizations? ○ Handling noisy rewards like speedup/runtime ○ Task Specific proxy reward formulations ○ Scalable and compute efficient ● How do we design richer & efficient state representations? ○ Encode CG State into a compact representation ○ With minimal impact on compile time ○ Learning these representations from pre-trained models? ○ Exploring IR code embedding techniques for callee analysis

Slide 94

Slide 94 text

Key Takeaways Challenges 1. Lack of standardized datasets ❑ Small/custom datasets typical ❑ Large datasets like AnghaBench available only at source code level 2. Lack of Pretrained IR models ❑ Transfer learning not yet possible 3. Non-availability of generalized contextual IR embeddings Research Directions 1. Automatically Synthesizing benchmarks and datasets for various compiler tasks 2. Pre-trained LMs at different points of optimization pipeline ❑ Middle end and at codegen level 3. Techniques for IR embeddings that can generalize across code bases and different compiler tasks

Slide 95

Slide 95 text

Thank you!

Slide 96

Slide 96 text

Writing Compiler Optimizations is Hard! ● Software stack is becoming increasingly complex ● Correctness & performance of applications depends on the compiler ● Increasingly difficult to write compiler optimizations which generalize across software ● Can ML/DL assist human in developing better compiler optimizations?

Slide 97

Slide 97 text

DL is Ideal for…. ● We usually apply DL to problems that are hard to solve manually/algorithmically ● Typical characteristics include ○ Large search space ○ Approximate solutions preferred ○ Availability of large code bases/samples that can be mined ○ Probabilistic nature of ML does not impact correctness ● Compiler Problems that fit these characteristics well ○ Heuristics, Phase Ordering decisions, Cost Modelling

Slide 98

Slide 98 text

5 min break

Slide 99

Slide 99 text

3. Applications of DL to Compilers

Slide 100

Slide 100 text

Writing Compiler Optimizations is Hard ➔ Software stack is becoming increasingly complex ➔ Increasingly difficult to write compiler optimizations which generalize across software ➔ But correctness & performance of applications depends on the compiler ➔ Where can we target ML to ease & improve compiler optimization?

Slide 101

Slide 101 text

Where to apply ML in Compiler Opts ➔ Optimization usually modelled with two phases ◆ Analysis & Transformation ➔ Analysis Phase (Correctness, Cost/Benefit, Feasibility) ◆ May be driven by heuristics, No code change involved ➔ Transformation Phase (Implementing the Decision) ◆ The big concern is correctness ➔ Complex Optimizations - Inlining, Scheduling, Register allocation ◆ NP Hard Problems ◆ Decision making approximated by complex and multifactorial heuristics

Slide 102

Slide 102 text

What to Target in Compiler Opts ➔ We usually apply ML to problems that are hard to solve manually/algorithmically ➔ Preferable characteristics include ◆ Large search space ◆ Approximate solutions preferred ◆ Availability of large code bases/samples that can be mined ◆ Probabilistic nature of ML does not impact correctness ➔ Three areas: Optimization heuristics, Phase Ordering, Cost Modelling

Slide 103

Slide 103 text

DL-based 5 Compiler Optimizations ➔ We will talk about 5 compiler opts/techniques Optimization Middle-End Back-End Generic Ithemal: Basic Block Throughput Estimation ✓ Register Allocation ✓ Auto-Vectorization Using Imitation Learning ✓ Learned Performance Model for TPUs ✓ Phase-Ordering via Deep RL ✓

Slide 104

Slide 104 text

Ithemal : Basic Block Throughput Prediction

Slide 105

Slide 105 text

Estimating Basic Block Throughput The problem: Given a basic-block of x86 instructions estimate the throughput in terms of clock cycles Analytical models (llvm-mca, IACA) are usually used in such scenarios

Slide 106

Slide 106 text

Accurate Modeling of Processor Core is Complex ➔ Modeling the micro-architectural details of a complex core is a very hard problem ◆ Very easy to omit details ◆ Specs are not always accurate ◆ Some details are proprietary

Slide 107

Slide 107 text

Ithemal: a Data-driven approach ➔ Run/train a DL model using many samples of x86 BBs and corresponding cycle count ◆ Measured on real hardware ◆ A new hardware just means re-training the new model OR some form of transfer learning ➔ High accuracy and ease of portability

Slide 108

Slide 108 text

Ithemal DL model ➔ Hierarchical LSTMs with x86 instruction set in a BB as input ◆ 2-layers ◆ Layer-1 for the sequence of operands of each instruction ◆ Layer-2 for the sequence of instructions ➔ Regression model ◆ Throughput predictor

Slide 109

Slide 109 text

Ithemal competitive performance ➔ Ithemal delivers better throughput accuracy compared to analytical models ➔ Portable and robust ➔ Github: https://github.com/ithemal/Ithemal/tree/master/learning/pytorch

Slide 110

Slide 110 text

DL-based Graph Coloring Register Allocation

Slide 111

Slide 111 text

Register Allocation as a Graph-Coloring Problem ➔ Register allocation is an important problem in code generation ➔ The number of registers available may be < number of variables ➔ Create an *interference graph*, which models the variables (virtual registers) that are *live* at the same time
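A minimal sketch of building an interference graph is shown below. It assumes each variable's liveness is a single half-open interval of instruction indices, which is a simplification (real live ranges can have holes); the function name and the example ranges are made up.

```python
from itertools import combinations


def build_interference_graph(live_ranges):
    """live_ranges: variable -> (start, end), half-open. Returns adjacency sets."""
    graph = {variable: set() for variable in live_ranges}
    for (a, (a_start, a_end)), (b, (b_start, b_end)) in combinations(live_ranges.items(), 2):
        # Two variables interfere if their live intervals overlap.
        if a_start < b_end and b_start < a_end:
            graph[a].add(b)
            graph[b].add(a)
    return graph


interference = build_interference_graph({"x": (0, 4), "y": (2, 6), "z": (4, 8)})
```

Here `x` and `z` never overlap, so they could share a register, while `y` interferes with both.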

Slide 112

Slide 112 text

Modeling Graph Coloring using LSTMs ➔ Viewed as a sequence-2-sequence translation via LSTMs ➔ An input sequence where each item of the sequence corresponds to a node of the graph ➔ The output sequence is of the same length as the input sequence (number of nodes of the graph) ➔ Trained using random graphs

Slide 113

Slide 113 text

Inference and Color-Correction ➔ Difficult to encode in an LSTM the constraint that two adjacent nodes cannot have the same color ➔ *Invalid* edges may appear during inference ➔ Rectify these edges using a post-inference color-correct pass
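One possible shape for a color-correct pass is sketched below. This is an assumption about how such a pass could work, not the authors' implementation: it visits nodes in a fixed order and recolors any node that clashes with a neighbour using the smallest color not used by its neighbours, which may introduce colors the model never predicted.

```python
def color_correct(graph, colors):
    """graph: node -> set of neighbours; colors: node -> predicted color (int)."""
    fixed = dict(colors)
    for node in sorted(graph):
        neighbour_colors = {fixed[n] for n in graph[node]}
        if fixed[node] in neighbour_colors:
            # Invalid edge: pick the smallest color no neighbour currently uses.
            candidate = 0
            while candidate in neighbour_colors:
                candidate += 1
            fixed[node] = candidate
    return fixed


graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
predicted = {"a": 0, "b": 0, "c": 1}  # edge a-b is invalid
corrected = color_correct(graph, predicted)
```

Because a recolored node never takes a neighbour's current color, no new invalid edge is created, so one sweep leaves the coloring valid.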

Slide 114

Slide 114 text

DL-model vs LLVM’s GRA ➔ Collected the interference graphs for the functions of certain SPEC CPU® 2017 benchmarks ◆ Use these graphs to predict colors using the DL-model ➔ Collect the actual register count of each function after codegen from LLVM ➔ Comparison shows DL-model performing better than GRA ➔ Github: https://github.com/dasdibye/DL4RegAlloc

Slide 115

Slide 115 text

Compiler Auto-vectorization using Imitation Learning

Slide 116

Slide 116 text

Auto-vectorization/SLP via ML ➔ The problem: ◆ Solve SLP (Superword-Level Parallelism) vectorization using ML ◆ SLP is a superset of Loop Vectorization whereby straight-line code can also be vectorized ➔ Naïve ML strategies may lead to correctness issues as only isomorphic statements can be vectorized

Slide 117

Slide 117 text

MDP Formulation ➔ The pairwise instruction packing problem is formulated as an MDP by iteratively forming one vector pack at a time following a particular instruction traversal policy – Bottom-Up or Top-Down

Slide 118

Slide 118 text

MDP Formulation - State Details ❖ Nodes: 5 types of nodes encode the graph features of a state: ◆ Instruction Node: corresponds to each instruction with at least one valid packing opportunity, or to instructions which are already packed ◆ Pack Node: common node representing overhead packing instructions ◆ Unpack Node: common node representing overhead unpacking instructions ◆ Constant Node: common node representing any constant value used by instructions ◆ Focus Node: connected to the instruction that is considered for packing ❖ Edges: 4 types of edges connect the above nodes: ◆ Dependency Edge: encodes if an instruction must be executed after another ◆ Possible Pack Edge: encodes whether two instructions can be packed together ◆ Packed Edge: encodes instructions that are already packed together ◆ Focus Edge: connects the focus node to the instruction node that is considered for packing
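
The node and edge types above could be held in a small typed graph. The sketch below is an assumed data layout for illustration only: the class names, the edge direction convention, and the three-instruction example are hypothetical and not taken from the original implementation.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class NodeType(Enum):
    INSTRUCTION = auto()
    PACK = auto()
    UNPACK = auto()
    CONSTANT = auto()
    FOCUS = auto()


class EdgeType(Enum):
    DEPENDENCY = auto()
    POSSIBLE_PACK = auto()
    PACKED = auto()
    FOCUS = auto()


@dataclass
class StateGraph:
    nodes: dict = field(default_factory=dict)  # node name -> NodeType
    edges: list = field(default_factory=list)  # (src, dst, EdgeType)

    def add_node(self, name, kind):
        self.nodes[name] = kind

    def add_edge(self, src, dst, kind):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, dst, kind))

    def focused_instruction(self):
        """Instruction currently considered for packing, if any."""
        for _src, dst, kind in self.edges:
            if kind is EdgeType.FOCUS:
                return dst
        return None

    def pack_candidates(self, name):
        """Instructions joined to `name` by a possible-pack edge."""
        return [
            dst if src == name else src
            for src, dst, kind in self.edges
            if kind is EdgeType.POSSIBLE_PACK and name in (src, dst)
        ]


# Hypothetical state: two adjacent loads that may be packed, and an add
# that must execute after the first load (assumed direction: user -> def).
state = StateGraph()
for inst in ("load_x0", "load_x1", "add0"):
    state.add_node(inst, NodeType.INSTRUCTION)
state.add_node("focus", NodeType.FOCUS)
state.add_edge("add0", "load_x0", EdgeType.DEPENDENCY)
state.add_edge("load_x0", "load_x1", EdgeType.POSSIBLE_PACK)
state.add_edge("focus", "load_x0", EdgeType.FOCUS)
print(state.focused_instruction(), state.pack_candidates("load_x0"))
```

In the MDP, an action would pick one of the pack candidates of the focused instruction (or decline to pack), after which the possible-pack edge becomes a packed edge and the focus moves on.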

Slide 119

Slide 119 text

Imitation Learning ➔ goSLP solves the SLP problem *exactly* using ILP solvers ➔ Use such a solution to imitate/mimic the action space ➔ Use actual runtimes/estimates as cost/reward function ➔ Gated Graph Neural Network (GGNN) used as part of the policy network modeling to make packing decisions for each state

Slide 120

Slide 120 text

A Learned Performance Model for TPUs

Slide 121

Slide 121 text

Learning a performance model ➔ Program performance is tightly coupled with the underlying processor architecture as well as the optimization decisions that are made during compilation ➔ Developing an accurate analytical model of program performance on a modern processor is challenging and can take months of engineering effort ➔ Developers of analytical models are often unaware of detailed features of the processor or effects from all compiler passes ➔ Learn a performance model for XLA graphs on the TPU

Slide 122

Slide 122 text

Approach ➔ Decompose the dataflow graph into kernels (kernel = ∪ DL ops) ➔ Train a DNN to map kernels to runtime estimates: f(kernel) ≅ T secs ➔ Predict per kernel and sum over all kernels
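
The "predict per kernel and sum" step can be sketched as below. The per-kernel predictor here is a toy lookup table standing in for the trained DNN; the op names, the microsecond costs, and both function names are made up for illustration.

```python
def predict_program_runtime(kernels, predict_kernel):
    """Estimate whole-program runtime as the sum of per-kernel estimates."""
    return sum(predict_kernel(kernel) for kernel in kernels)


def toy_predict_kernel(kernel):
    """Stand-in for the learned model f(kernel): a fixed cost per op.

    A real model would consume the kernel's graph structure and features;
    these per-op costs (in microseconds) are invented.
    """
    cost_us = {"matmul": 40, "add": 2, "relu": 1}
    return sum(cost_us[op] for op in kernel)


# A hypothetical program decomposed into two kernels, each a set of DL ops.
kernels = [["matmul", "add", "relu"], ["matmul", "add"]]
total_us = predict_program_runtime(kernels, toy_predict_kernel)
print(total_us)
```

Passing the predictor as a parameter keeps the summation independent of how the per-kernel estimate is produced, so a learned model can replace the toy one without changing the caller.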

Slide 123

Slide 123 text

Model Architecture [Figure: DL op embeddings and node/op features → Concat → Feedforward → GNN (Graph NN) over the full dataflow graph → final node embeddings, topo-sorted and fed to LSTM → Linear → kernel runtime estimate]

Slide 124

Slide 124 text

Application of the learnt Model ➔ Tile Selection ◆ Find the optimal tile size among many for a kernel ◆ Query the valid tile sizes and obtain the relative performance among them ● A kernel may have up to 500K valid tile sizes ● For training, up to 25M samples were used ➔ Operator Fusion ◆ Used a random search strategy for fusion on the entire computation graph ◆ For training, up to 50K fusion configurations ◆ More than 200M samples

Slide 125

Slide 125 text

Static Neural Compiler Optimization via Deep Reinforcement Learning

Slide 126

Slide 126 text

Phase Ordering problem ➔ Large number of passes in any modern compiler ◆ 50+ passes in LLVM pass infrastructure depending on the optimization level ◆ Each pass may have several tunable parameters ➔ Phase-ordering problem extremely hard to solve ◆ Current order is rigid but enumerating possibilities creates a huge search space ◆ We may need to trade off between compile time and performance

Slide 127

Slide 127 text

Phase ordering as an RL problem ➔ The IR is the state ➔ Passes are the actions ➔ The optimizer is the environment ➔ Rewards are derived from runtimes [Figure: an IR with runtime 100 ms; Pass 1 and Pass 2 lead to new IRs, IR1 (25 ms) and IR2 (125 ms)]

Slide 128

Slide 128 text

Overall RL approach ➔ Learn the value function: $Q(S_t, A_t, w) = R(S_t, A_t) + \gamma \max_{a \in A} Q(S_{t+1}, a, w)$ ➔ Where $R = \ln\left(T(S_t) / T(S_{t+1})\right)$ and $T(x)$ is the runtime of IR $x$ [Figure: flowchart with boxes Start (input_ir.ll), Agent Prediction (table of actions and predicted rewards: 1: -2.3, 2: +1.5, 3: +0.3, 4: -0.9), Max (action 2, +1.5), Value > 0 ?, Record / Action History, LLVM Opt, End]
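
A small worked sketch of the reward and the bootstrapped target from the value function above, using the runtimes from the earlier phase-ordering example (100 ms, 25 ms, 125 ms) and the action table from the flowchart. The discount factor 0.9, the next-state Q-values, and the stopping rule in `pick_action` (stop when the best predicted value is not positive, our reading of the "Value > 0 ?" box) are assumptions.

```python
import math


def reward(runtime_before, runtime_after):
    """R = ln(T(S_t) / T(S_{t+1})): positive when the pass speeds the code up."""
    return math.log(runtime_before / runtime_after)


def q_target(r, next_q_values, gamma):
    """Target for Q(S_t, A_t): R + gamma * max_a Q(S_{t+1}, a)."""
    return r + gamma * max(next_q_values)


def pick_action(predicted_values):
    """Greedy choice; returns None when no action has a positive value."""
    action = max(predicted_values, key=predicted_values.get)
    return action if predicted_values[action] > 0 else None


r_speedup = reward(100.0, 25.0)    # 100 ms -> 25 ms
r_slowdown = reward(100.0, 125.0)  # 100 ms -> 125 ms

# Predicted values per action, as in the flowchart's table.
predicted = {1: -2.3, 2: 1.5, 3: 0.3, 4: -0.9}
best = pick_action(predicted)

# Hypothetical next-state Q-values and discount factor.
target = q_target(r_speedup, [0.2, -0.1], gamma=0.9)
print(r_speedup, r_slowdown, best, target)
```

The log ratio makes rewards additive along a pass sequence: the sum of rewards over several passes equals the log of the overall speedup.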

Slide 129

Slide 129 text

States and Action Spaces ➔ IR is encoded using NCC (Neural Code Comprehension) vector embedding (Ben-Nun et al.) ➔ Action history is one-hot encoded ➔ A state combines the encoded IR and the action history ➔ The actions are divided into 3 levels of abstraction: H, M, L ◆ H denotes groups of passes, while M and L denote individual passes ◆ H creates a smaller search space than M and L [Figure: IR → NCC → vector embedding; action history → 1-hot; both combined into the state]
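
One way to picture the state construction is a plain concatenation of the IR embedding with a fixed-length one-hot encoding of the action history. The layout below (one one-hot slot per step, most recent `max_steps` actions kept) is an assumption for illustration; the slides do not specify the exact encoding, and the embedding values are placeholders rather than real NCC output.

```python
def one_hot_history(history, num_actions, max_steps):
    """Encode up to max_steps most recent actions, one one-hot slot per step."""
    vec = [0.0] * (num_actions * max_steps)
    for step, action in enumerate(history[-max_steps:]):
        vec[step * num_actions + action] = 1.0
    return vec


def make_state(ir_embedding, history, num_actions, max_steps):
    """State = IR embedding followed by the encoded action history."""
    return list(ir_embedding) + one_hot_history(history, num_actions, max_steps)


# Placeholder 2-dimensional "embedding" and a history of two actions
# drawn from a hypothetical 3-action space.
state = make_state([0.5, -1.0], [2, 0], num_actions=3, max_steps=2)
print(state)
```

With a fixed `max_steps`, every state vector has the same length, which is what a feedforward Q-network would need as input.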

Slide 130

Slide 130 text

In Conclusion ➔ We are at the cusp of utilizing DL techniques for compiler passes ➔ A big challenge is to ensure correctness ➔ Proposed work consists of NLP-like techniques and RL ➔ No consensus yet on standardized program representations and CFG/DFG ➔ Open questions: ◆ Can we generate *semantically correct* transformed code using a DL model? ● Would the optimizations of the future be a mix of DL models + fast correction algorithms? ● Get most of it right using a DL model and then apply a quick and simple correction pass ◆ Can these models be really portable, across compilers and across hardware?

Slide 131

Slide 131 text

End of slides