DL4Compilers @ CGO'22

DL4Compilers CGO Tutorial Sandya Mannarswamy Dibyendu Das Chris Cummins

Overview
3 sections, 45 min each, 10 min Q&A, 5 min break
1. Learning over Programs (Chris)
2. Applications of DL to Compilers (Dibyendu)
3. Challenges & Research Directions (Sandya)

Building compilers... a job for life
• 100s of variables • NP-hard or worse • Compiler / HW keeps changing
• Bad heuristics • Wasted energy • Widening performance gap

Collect examples Learn from examples Update heuristic Repeat on change
"Build an optimizing compiler, your code will be fast for a day. Teach a compiler to optimize ... "

Summarize the program: Program → Features
Program:
  void LinearAlgebraOp<InputScalar, OutputScalar>::AnalyzeInputs(
      OpKernelContext* context, TensorInputs* inputs,
      TensorShapes* input_matrix_shapes, TensorShape* batch_shape) {
    int input_rank = -1;
    for (int i = 0; i < NumMatrixInputs(context); ++i) {
      const Tensor& in = context->input(i);
      if (i == 0) {
        input_rank = in.dims();
        OP_REQUIRES(
            context, input_rank >= 2,
            errors::InvalidArgument(
                "Input tensor ", i, " must have rank >= 2"));
Features: 0.2 0.31 -0.7 1.24

Collect examples
[Figure: table of training examples with columns Features and Best Param]

Learn from examples
[Figure: the Features / Best Param examples feed a Supervised Machine Learner, which produces a Model]

The model is the heuristic
[Figure: the Model maps Features to Param. For a New Program: Features in, Predicted param out]
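
The "model is the heuristic" idea can be sketched with the simplest possible learner. This is a minimal illustration, not the approach of any paper in the deck: the training pairs, the feature values and the nearest-neighbour model are all assumptions made for the example.

```python
import math

# Hypothetical training data: (feature vector, best parameter) pairs,
# as would be collected by running each training program with every parameter.
TRAINING = [
    ((0.2, 0.31, -0.7, 1.24), 4),
    ((1.5, -0.2, 0.9, 0.10), 16),
    ((0.1, 0.40, -0.5, 1.00), 4),
]


def predict_param(features, training=TRAINING):
    """1-nearest-neighbour 'heuristic': reuse the best parameter of the closest example."""
    nearest = min(training, key=lambda example: math.dist(features, example[0]))
    return nearest[1]


# Features of a new, unseen program go in; a predicted parameter comes out.
new_program_features = (0.25, 0.3, -0.6, 1.2)
print(predict_param(new_program_features))
```

Repeating on change then means re-collecting the examples and refitting, rather than rewriting the heuristic by hand.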

Machine learning for compilers
[Figure: pipeline diagram with the boxes Training Programs, Datasets, Ad-hoc Drivers, Feature Extractor, Feature Vectors, Best Decisions, Training Data and Learned Heuristic. One part is highlighted as "the bit I'm going to talk about", another as "the bit Dibyendu is going to talk about".]

History of ML in compilers Autotuning 1970s 1998 2008 2015
2012 2020 Milepost GCC AlexNet 2016 GGNN 2017 [paper] [paper] ML-guided Autotuning [paper] MLGO 2003 [paper] [source] DeepTune [paper] ICSA'22 competition

1. Learning over Programs

Three approaches to program representation
1. Handcrafted features: Rotem et al. Profile Guided Optimization without Profiles: A Machine Learning Approach (2022)
2. Language Modeling: Cummins et al. End-to-end Deep Learning of Optimization Heuristics (2017)
3. Graph Reasoning: Cummins et al. ProGraML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations (2021)

How it works: Program → IR (CFG, DFG, AST, ...) → Features
Program:
  void LinearAlgebraOp<InputScalar, OutputScalar>::AnalyzeInputs(
      OpKernelContext* context, TensorInputs* inputs,
      TensorShapes* input_matrix_shapes, TensorShape* batch_shape) {
    int input_rank = -1;
    for (int i = 0; i < NumMatrixInputs(context); ++i) {
      const Tensor& in = context->input(i);
      if (i == 0) {
        input_rank = in.dims();
        OP_REQUIRES(
            context, input_rank >= 2,
            errors::InvalidArgument(
                "Input tensor ", i, " must have rank >= 2"));
Features: #. instructions, loop nest level, arithmetic intensity, trip counts
Feature vector: 0.2 0.31 -0.7 1.24
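
A handcrafted feature extractor can be as small as a few counters over the IR. The sketch below is an assumption-laden illustration: it parses a simplified textual IR line by line (a real extractor would use the compiler's own analyses), and the opcode sets and the definition of arithmetic intensity as arithmetic ops per memory op are choices made for this example.

```python
ARITH_OPS = {"add", "sub", "mul", "fadd", "fsub", "fmul", "fdiv"}
MEM_OPS = {"load", "store"}


def opcode(line):
    """Return the opcode of one simplified IR line, or None if it is not an instruction."""
    line = line.strip()
    if not line or line.endswith(":") or line.startswith(("define", "}", ";")):
        return None
    if "=" in line:  # "%x = add i32 ..." -> "add i32 ..."
        line = line.split("=", 1)[1].strip()
    return line.split()[0]


def extract_features(ir_text):
    """Compute two of the features named on the slide from IR text."""
    ops = [op for op in map(opcode, ir_text.splitlines()) if op]
    arith = sum(op in ARITH_OPS for op in ops)
    mem = sum(op in MEM_OPS for op in ops)
    return {
        "num_instructions": len(ops),
        "arithmetic_intensity": arith / mem if mem else 0.0,
    }


# Hypothetical input, written in LLVM-like syntax for illustration only.
EXAMPLE_IR = """
define i32 @f(i32* %p) {
entry:
  %a = load i32, i32* %p
  %b = mul i32 %a, 3
  %c = add i32 %b, 1
  store i32 %c, i32* %p
  ret i32 %c
}
"""
features = extract_features(EXAMPLE_IR)
print(features)
```

Loop nest level and trip counts need loop analysis and are left out here, which is itself an example of the "incomplete features" failure mode.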

Case Study: PGO
• PGO is difficult to deploy because it requires multiple steps
• LLVM has a set of hard-coded rules that predict things
• Developed by dozens of engineers, using thousands of lines of code, over a decade

Case Study: PGO

Advantages
1. Interpretable: e.g. "#. instructions"
2. Fast to extract: typically lightweight analyses
3. Fast to process: e.g. >100k inferences / sec
Drawbacks
1. Difficult to get right: how do you know when "done"?
2. Time consuming to develop: model / features relationship
3. Repetitious: features aren't transferable

Ways to fail
• Irrelevant: e.g. not capturing the right information
• Incomplete: e.g. missing critical information
• Unsuitable: e.g. wrong combination of features+model

How it works
[Figure: the machine-learning-for-compilers pipeline again: Training Programs, Datasets, Ad-hoc Drivers, Feature Vectors, Best Decisions, Training Data, Learned Heuristic. The Feature Extractor box appears in the first build only.]

[Figure: Program Code in → Normalizer → Tokenizer → LSTM → DNN → Optimization Decision ✓]

Tokenization
Input:
  kernel void A(global float* a, const float b) {
    a[get_global_id(0)] *= 3.14 + b;
  }
Candidate Vocab: const, float, get_global_id, global, int, kernel, void, ...
Vocab (built one token at a time as the input is scanned):
  Token          Index
  kernel         0
  [space]        1
  void           2
  A              3
  (              4
  global         5
  float          6
  *              7
  a              8
  ,              9
  const          10
  b              11
  )              12
  {              13
  \n             14
  [              15
  get_global_id  16
  0              17
Encoded: 0 1 2 1 3 4 5 1 6 7 1 8 ...
181 tokens 33M tokens
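
The vocabulary build-up on this slide can be reproduced with a short greedy tokenizer. This is a sketch of the idea rather than the DeepTune implementation: the function names are hypothetical, the candidate vocabulary is only the part visible on the slide, and the two-space indentation of the kernel body is an assumption.

```python
CANDIDATE_VOCAB = ["const", "float", "get_global_id", "global", "int", "kernel", "void"]


def tokenize(text, candidates=CANDIDATE_VOCAB):
    """Greedy scan: take the longest candidate token at each position, else one character."""
    by_length = sorted(candidates, key=len, reverse=True)
    tokens, pos = [], 0
    while pos < len(text):
        match = next((c for c in by_length if text.startswith(c, pos)), text[pos])
        tokens.append(match)
        pos += len(match)
    return tokens


def encode(tokens):
    """Give each distinct token the next free index, in order of first appearance."""
    vocab, encoded = {}, []
    for token in tokens:
        encoded.append(vocab.setdefault(token, len(vocab)))
    return vocab, encoded


SOURCE = "kernel void A(global float* a, const float b) {\n  a[get_global_id(0)] *= 3.14 + b;\n}"
vocab, encoded = encode(tokenize(SOURCE))
print(encoded[:12])  # [0, 1, 2, 1, 3, 4, 5, 1, 6, 7, 1, 8]
```

A plain prefix match like this would also split identifiers that merely contain a keyword (for example `int` inside `print`), which is one reason a normalizer runs before the tokenizer.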

Prior Art (2x papers!)
                  Heterogeneous Mapping (CGO’13)                     Thread Coarsening (PACT’14)
  Decision Space  Binary classification {CPU, GPU}                   One-of-six classification {1, 2, 4, 8, 16, 32}
  Model           Decision Tree                                      Cascading Neural Networks
  Features        Combinations of values from ad-hoc LLVM analysis   7 Principal Components of 34 raw features

Using DeepTune for Heterogeneous Mapping (CGO’13) and Thread Coarsening (PACT’14)
1. Use the same model design for both
2. No tweaking of parameters
3. Minimum change - 3 line diff

14% and 5% improvements over state-of-the-art Tiny dataset!

Heterogeneous Mapping Thread Coarsening +fine-tune

14% and 11% improvements over state-of-the-art

Advantages
1. No Feature Engineering: save time!
2. Enables Transfer Learning: reuse training across problems
Drawbacks
1. Black box: not interpretable
2. Doesn't suit all problems: nonlinear code dependencies?

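How it works
1. IR: int main( int argc, char** argv) {... is lowered to: target triple = "..." define i32 @main() {...
2. Graph: Control-Flow, Data-Flow and Call Graph combined into one graph with Control, Data and Call edges
3. GGNN
[Figure: example graph with vertices %0, %1, res, main, printf, A, br, ret, add, mul]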

Building ProGraML: Control-flow
Derived from compiler IR (here, LLVM).
Full-flow-graph: represent each instruction as a vertex. Vertex label is the instruction name.
Edges are control-flow. Edge position attribute for branching control-flow.

Building ProGraML: Data-flow
Add graph vertices for constants (diamonds) and variables (oblongs). Vertex label is the data type.
Edges are data-flow. Edge position attribute for operand order.

Building ProGraML: Call-flow
Edges are call-flow. Inbound edge to function entry instruction. Outbound edge from (all) function exit instruction(s).
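
The three construction steps can be mirrored in a small data structure. This is an illustrative sketch, not the ProGraML API: the `Graph` class, the tuple layouts and the tiny example program are assumptions made here.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Graph:
    vertices: list = field(default_factory=list)  # (kind, label)
    edges: list = field(default_factory=list)     # (src, dst, flow, position)

    def add_vertex(self, kind, label):
        self.vertices.append((kind, label))
        return len(self.vertices) - 1

    def add_edge(self, src, dst, flow, position=0):
        self.edges.append((src, dst, flow, position))


g = Graph()

# Control-flow: one vertex per instruction, labelled with the instruction name.
call = g.add_vertex("instruction", "call")  # call site in the caller
mul = g.add_vertex("instruction", "mul")    # callee entry instruction
add = g.add_vertex("instruction", "add")
ret = g.add_vertex("instruction", "ret")    # callee exit instruction
g.add_edge(mul, add, "control")
g.add_edge(add, ret, "control")

# Data-flow: vertices for variables and constants, labelled with the data type.
var = g.add_vertex("variable", "i32")
const = g.add_vertex("constant", "i32")
g.add_edge(mul, var, "data")                # mul produces the variable
g.add_edge(var, add, "data", position=0)    # first operand of add
g.add_edge(const, add, "data", position=1)  # second operand of add

# Call-flow: edge into the function entry, edge out of the function exit.
g.add_edge(call, mul, "call")
g.add_edge(ret, call, "call")

flow_counts = Counter(flow for _, _, flow, _ in g.edges)
print(flow_counts)
```

Keeping the operand position on the edge, not on the vertex, is what lets a non-commutative instruction distinguish its first and second operand.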

Learning with ProGraML: Node Embeddings
Use vertex labels as embedding keys. Derive vocab from the set of unique vertex labels on training graphs.
Separate type/instruction nodes lead to a compact vocab and excellent coverage on unseen programs compared to prior approaches:
• inst2vec: combined instruction+operands
• CDFG: uses only instructions for vocab, ignores data
[Figure: example vocabularies, with entries such as br, add, i32 (indices 0, 1, 2) next to combined entries such as i32 <id> = a<id> <int8>]

Learning with ProGraML: GGNNs
Message Passing
• Position gating to differentiate control branches and operand order
• 6 typed weight matrices for {forwards, backwards} × {control, data, call} edge types
Readout Head
• Per-vertex prediction after T message-passing steps
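
The role of the six typed weight matrices can be shown with one heavily simplified propagation step. This sketch is not the ProGraML model: it omits position gating, replaces the GRU update of a GGNN with a plain tanh, and uses identity matrices as hypothetical weights.

```python
import math

# One weight matrix per (flow, direction) pair: 3 flows x 2 directions = 6.
EDGE_TYPES = [(flow, direction)
              for flow in ("control", "data", "call")
              for direction in ("forwards", "backwards")]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def message_passing_step(states, edges, weights):
    """states: {vertex: [float]}, edges: [(src, dst, flow)], weights: {(flow, direction): matrix}."""
    dim = len(next(iter(states.values())))
    incoming = {v: [0.0] * dim for v in states}
    for src, dst, flow in edges:
        # Every edge carries a forwards message and a backwards message,
        # each transformed by the matrix of its own edge type.
        for sender, receiver, direction in ((src, dst, "forwards"), (dst, src, "backwards")):
            msg = matvec(weights[(flow, direction)], states[sender])
            incoming[receiver] = [a + m for a, m in zip(incoming[receiver], msg)]
    # Stand-in update; a real GGNN applies a GRU cell here.
    return {v: [math.tanh(s + m) for s, m in zip(states[v], incoming[v])] for v in states}


identity = [[1.0, 0.0], [0.0, 1.0]]
weights = {edge_type: identity for edge_type in EDGE_TYPES}
states = {0: [1.0, 0.0], 1: [0.0, 1.0]}
new_states = message_passing_step(states, [(0, 1, "control")], weights)
print(new_states)
```

Running the step T times and applying a readout function to each final vertex state gives the per-vertex prediction described on the slide.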

Learning compiler analyses: F1 Scores
Dataset: 250k LLVM graphs covering 6 program languages
                                 DeepTune   ProGraML
  Reachability                   0.504      0.996
  Dominator Trees                0.114      0.781
  Data Dependencies              0.236      0.993
  Live-out Variables             -          1.000
  Global Common Subexpressions   0.214      0.930

Advantages
1. Powerful representation: well-established principles
2. Models non-linear relations: broad range of data flows
Drawbacks
1. Slow to create / process: many nodes, many FLOPs
2. Lossy node featurization: how to represent literals?
3. GNNs struggle on large inputs

Conclusions
1. ML promises better optimizations, faster
2. Variety of approaches to featurizing code:
   a. Handcrafted analyses
   b. Deep Language Modeling
   c. Graph Reasoning
3. Not a "solved problem", plenty to be done!

5 min break

2. Challenges and Research Directions

Presenter

Outline • DL in Production Compilers – Challenges • DL
Driven Compiler Heuristics • A Case Study • Takeaways and Research Directions

ML/DL in Compiler Optimization is not New! • Considerable research
in applying ML to various problems • Even from last decade: ◦ Sameer Kulkarni, John Cavazos, Christian Wimmer, Douglas Simon. Automatic construction of inlining heuristics using machine learning. CGO 2013 ◦ H. Leather, E. Bonilla, M. O’Boyle. Automatic feature generation for machine learning based optimizing compilers. CGO 2009 (Unroll factor learning) • Not much of this has made it to production compilers • Adopting ML/DL in production compilers is not easy! • Many challenges in adopting ML/DL driven optimizations!

ML/DL in Production Compilers - Challenges • “One of the
challenges is that the existing compilers, LLVM included, were never designed for machine learning integration. • And so there’s a lot of work that could be done to integrate machine learning techniques … into our compiler frameworks. • But because the abstractions were wrong, it’s really hard to do that outside of a one-off research paper…”

With great challenges come opportunities for research ☺

Learning Compiler Heuristics

Compiler Heuristics • Every optimization can be modelled with two
phases ◦ Analysis & Transformation • Analysis Phase (Decision Making) ◦ Driven by heuristics, No code change involved ◦ Wrong decisions – no impact on correctness. Only on Performance. • Transformation Phase (Implementing the Decision) ◦ Impact Correctness (Not just performance) • Complex Optimizations - Inlining, Scheduling, Register allocation ◦ NP Hard Problems ◦ Decision making approximated by complex and multifactorial heuristics

Heuristics vs Learnt Models/Policies Heuristics are human-trained ◦ based on
a human-manageable set of benchmarks and regression cases. Heuristics are human-written code that needs to be maintained ◦ Limits the number of program features and combinations involved ◦ But using more features and feature combinations -> better opt. decisions Human heuristics more comprehensible in theory, can grow complex over time DL easily scales to large training examples ◦ Generalization to real world diverse programs likely to be better DL scales well with the addition of features ◦ Avoids need of retraining models often in production compilers ◦ Can also discover automatically profitable feature combinations DL models not comprehensible/explainable

Applying DL in Compilers Human Trained, NO ML Assistive ML
Embedded ML

Desiderata for DL driven Compiler Policies • Day to day
production use of compiler should remain unchanged • Separate deployment of compiler from training of ML models/policies • Minimize compile time overheads to acceptable levels ◦ ML model embedded in compiler and deployed in inference mode • Can cater to different user personas ◦ Normal User (uses compiler to compile his application) ◦ Compiler engineer who is developing/maintaining the compiler

DL driven Compiler Heuristics Case Study
• Compiler heuristics are amenable to replacement by learnt models/policies
• What are the challenges in adopting this in a production compiler?
• Let us look at a case study
  ◦ MLGO: a Machine Learning Guided Compiler Optimizations Framework (Trofin et al., 2020).

MLGO Case Study

MLGO Overview • Use ML driven policy for the task
of inlining-for-size in LLVM compiler • Size instead of performance ◦ Modelling performance rewards are noisy, costly • Inlining heuristics are complex! • Supervised learning not viable ◦ there are no optimal labels for the task ◦ no simple way of saying an inline decision is optimal • Need to explore different strategies, learn from these experiences • Reinforcement Learning (RL) & Evolution Strategies (ES) more suitable

User Personas Envisaged Normal User Compiler Developer

Supporting Normal User Persona in MLGO Normal User Needs •
correctness and performance of the generated code should not be impacted • Timeliness of build • compilation determinism (incremental build support) • No added cost/complexity to build and release pipelines MLGO Design Goals • No visible changes to normal user • Separating Correctness & Policy • Only support ML driven heuristics • not code changes • Minimal impact on compile time • No online training • Should not need frequent retraining • Generalize across code bases

Supporting Compiler Engineer Persona Compiler Engineer Needs • Wants better
optimizations in compiler • Wants to apply ML to improve compiler passes • Improve ML driven opts • Fix regressions & ship blockers MLGO Design Goals • Support efficient retraining • ML models/policy visible to user • Can lead to additional dependencies • Build/release pipelines can be changed • Improve policy by adding missing features/regressions to retraining data • Flexibility to support alternate training algorithms

Usage Scenarios for DL Driven Policies • Policy/Model Creation ◦
Dataset created, new ML model trained iteratively to replace existing human heuristics. ◦ User Persona – Compiler Engineer • Policy Deployment ◦ Model incorporated into compiler in inference mode and deployed in release compiler ◦ User persona – Application Developer • Policy Improvement ◦ Retraining Model to improve performance. User Persona : Compiler Engineer • Policy Maintenance ◦ Bug Fixing ML model. User Persona: Compiler Engineer

Inlining Pass in LLVM Compiler • Operates on SCC of
Call Graph of a module in a bottom up order • The inlined callee’s call sites added to worklist for iterative processing • Inlining pass includes a number of decisions ◦ The order of traversal ◦ Clean ups done and their timing ◦ Decision to inline a callsite or not • MLGO focusses on the decision to inline a callsite or not based on size

Inlining Heuristics in LLVM Compiler • Compute static cost of
callee post inlining • Compare the computed cost with a threshold • Threshold based on call site hotness and inline keyword • Bonuses/threshold modifications based on ◦ Callee characteristics like single BB, number of SIMD instructions etc • Inlining may be deferred, if it may be profitable to inline caller first • Interplay of a number of program characteristics ◦ Local to the callsite ◦ Global ◦ Source level directives, optimization options etc
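
The cost-versus-threshold structure of such a heuristic can be sketched in a few lines. Every constant and every bonus rule below is illustrative only and does not reproduce LLVM's actual inline cost model; the function name and parameters are hypothetical.

```python
def should_inline(callee_cost, hot_callsite=False, inline_keyword=False,
                  single_basic_block=False, base_threshold=225):
    """Threshold-style inline decision with hand-written threshold modifications."""
    threshold = base_threshold
    if hot_callsite:
        threshold *= 2                 # hot call sites tolerate more code growth
    if inline_keyword:
        threshold += 100               # source-level hint from the programmer
    if single_basic_block:
        threshold += threshold // 2    # bonus for simple callees
    return callee_cost < threshold


print(should_inline(300))                     # cost above the base threshold
print(should_inline(300, hot_callsite=True))  # same callee at a hot call site
```

Each extra `if` is a hand-tuned interaction between program characteristics, which is the part a learnt policy is meant to replace.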

Replacing Manual Heuristics with learnt models • MLGO trains the
inlining policy with 2 different algorithms 1. Reinforcement Learning based 2. Evolution Strategies (ES) based • To handle cold start issue for RL Policy, use behavioral cloning • Behavioral Cloning mimics the standard LLVM Inliner heuristics

RL driven Approach • In RL, an agent interacts with
the environment ◦ based on current state and learnt policy, performs actions • The action leads to a reward ◦ Also changes the current state of the environment • Reward feedback tunes the policy for further steps • In our case, compiler is the agent and Policy is the learnt model for heuristics • Action is inline/not inline & State is the current state of the call graph • We will talk about reward later!

RL Formulation
• Inlining for size formulated as a Markov Decision Process (MDP)
  ◦ Sequential Decision Making
• MDP represented by the tuple < S, A, P, R >
  ◦ state space S
  ◦ action space A
  ◦ state transition distribution P(𝑠′|𝑠, 𝑎)
  ◦ reward function R(𝑠, 𝑎)
• The agent’s decisions are governed by the policy 𝜋 = 𝑃𝑟(𝑎|𝑠)
  ◦ maps observed state 𝑠 to a distribution over actions
  ◦ 𝜋 is a neural network and we call it the policy network
• Goal is to find the optimal policy 𝜋∗ to maximize total reward

RL - Inlining for Size
• State S is the current call graph state and the call site being visited
  ◦ Not practical
  ◦ Approximated using a set of features
• Action A = {0, 1}: 0 -> no inline, 1 -> inline
• Deterministic state transition based on the action; Call Graph updated
• Reward R – native size reduction after action A
  ◦ If inlined: R = S(caller_before) – S(caller_after) + [S(callee) if callee deleted, 0 if not deleted]
  ◦ Not inlined: R = 0
  ◦ Compute total native size with/without inlining and subtract
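
The per-decision reward on this slide translates directly into code. The function and parameter names are hypothetical and the sizes in the example are made up; only the formula comes from the slide.

```python
def inline_reward(inlined, caller_before, caller_after, callee_size, callee_deleted):
    """Native size reduction for one inlining decision (sizes in bytes)."""
    if not inlined:
        return 0
    # The callee's size only counts as saved if the callee can be deleted.
    return caller_before - caller_after + (callee_size if callee_deleted else 0)


# The caller grows from 100 to 120 bytes; the 40-byte callee has no other users.
print(inline_reward(True, 100, 120, 40, callee_deleted=True))   # 20
# Same decision, but the callee must be kept for other call sites.
print(inline_reward(True, 100, 120, 40, callee_deleted=False))  # -20
```

The two calls show why the decision is not local: the same call site is a win or a loss depending on whether the callee survives elsewhere in the call graph.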

Representing RL State Space • RL State representation is Call
Graph and call site • Encoding CG state at each point is computationally expensive • MLGO approximates the state space ◦ by handful of numerical features ◦ Local call site related features and global CG features

Challenge: Representing RL State Space • Full RL State space
representation computationally unviable • Can impact compile time significantly • MLGO trades off state representation fidelity to mitigate this ◦ Falls back to handful of numeric features • This reduces information available to the RL model ◦ impacts the policy trained in MLGO • MLGO also does not use IR code embedding of callee.. ◦ To reduce memory/compute costs • Opportunity: Develop computationally viable & high fidelity ◦ State space representation ◦ IR embedding representations (both task agnostic/task specific)

Challenge: RL Reward Computation • Difficult to estimate native function
size during inlining pass • MLGO opts to use total reward instead of partial rewards ◦ Evaluate native size with and without inlining and subtract • This requires more compute and can impact model quality • Inlining for performance would make this even more complicated! Opportunity: Develop scalable reward formulations without impacting model quality

Data Collection Challenges • No standard/ready made datasets for inlining
for size.. • Training Data Collection is a major bottleneck • Needs to be parallelized to reduce training cycle time • Model Quality can vary based on ◦ Generality of corpus ◦ Size of training corpus Training Data Collection

Model Improvement Challenges • Similar to Manual Heuristics Improvement •
Long cycle time and requires compiler engineer expertise • Identify missing features based on regressions ◦ Black box nature of DL algorithms makes it difficult • Incorporate regression test cases from field and retraining ◦ Requires model updates in production compiler • Explore alternative learning algorithms ◦ Longer dev time and compiler release updates needed ◦ Trade off between simpler (better interpretable) algorithms vs performance

Model Debug/Fix - Challenges • Similar challenges as model improvement
• DL model policies blackbox in nature, hamper debugging • For Ship blockers, fall back to ◦ Earlier working policy/model ◦ Manual heuristics • Fix would require model retraining ◦ Trained with newer training data ◦ Adding/dropping features • Selective application of manual heuristics to buggy code + DL model driven inlining for rest of code base

Key takeaways from MLGO Case Study • Trade-offs exist between
model/policy quality and compute costs • Timeliness of compiles is non-negotiable for normal user • Explainability of DL models/policies is desirable for troubleshooting

Research Opportunities from MLGO Study • How do we support
speed optimizations? ◦ Handling noisy rewards like speedup/runtime ◦ Task Specific proxy reward formulations ◦ Scalable and compute efficient • How do we design richer & efficient state representations? ◦ Encode CG State into a compact representation ◦ With minimal impact on compile time ◦ Learning these representations from pre-trained models? ◦ Exploring IR code embedding techniques for callee analysis

Key Takeaways Challenges 1. Lack of standardized datasets ❑ Small/custom
datasets typical ❑ Large datasets like AnghaBench available only at source code level 2. Lack of Pretrained IR models ❑ Transfer learning not yet possible 3. Non-availability of generalized contextual IR embeddings Research Directions 1. Automatically Synthesizing benchmarks and datasets for various compiler tasks 2. Pre-trained LMs at different points of optimization pipeline ❑ Middle end and at codegen level 3. Techniques for IR embeddings that can generalize across code bases and different compiler tasks

Thank you!

Writing Compiler Optimizations is Hard! • Software stack is becoming
increasingly complex • Correctness & performance of applications depends on the compiler • Increasingly difficult to write compiler optimizations which generalize across software • Can ML/DL assist human in developing better compiler optimizations?

DL is Ideal for…. • We usually apply DL to
problems that are hard to solve manually/algorithmically • Typical characteristics include ◦ Large search space ◦ Approximate solutions preferred ◦ Availability of large code bases/samples that can be mined ◦ Probabilistic nature of ML does not impact correctness • Compiler Problems that fit these characteristics well ◦ Heuristics, Phase Ordering decisions, Cost Modelling

5 min break

3. Applications of DL to Compilers

Writing Compiler Optimizations is Hard ➔ Software stack is becoming
increasingly complex ➔ Increasingly difficult to write compiler optimizations which generalize across software ➔ But correctness & performance of applications depends on the compiler ➔ Where can we target ML to ease & improve compiler optimization?

Where to apply ML in Compiler Opts ➔ Optimization usually
modelled with two phases ◆ Analysis & Transformation ➔ Analysis Phase (Correctness, Cost/Benefit, Feasibility) ◆ May be driven by heuristics, No code change involved ➔ Transformation Phase (Implementing the Decision) ◆ The big concern is correctness ➔ Complex Optimizations - Inlining, Scheduling, Register allocation ◆ NP Hard Problems ◆ Decision making approximated by complex and multifactorial heuristics

What to Target in Compiler Opts ➔ We usually apply
ML to problems that are hard to solve manually/algorithmically ➔ Preferable characteristics include ◆ Large search space ◆ Approximate solutions preferred ◆ Availability of large code bases/samples that can be mined ◆ Probabilistic nature of ML does not impact correctness ➔ Three areas: Optimization heuristics, Phase Ordering, Cost Modelling

DL-based 5 Compiler Optimizations ➔ We will talk about 5
compiler opts/techniques Optimization Middle-End Back-End Generic Ithemal: Basic Block Throughput Estimation ✓ Register Allocation ✓ Auto-Vectorization Using Imitation Learning ✓ Learned Performance Model for TPUs ✓ Phase-Ordering via Deep RL ✓

Ithemal : Basic Block Throughput Prediction

Estimating Basic Block Throughput The problem: Given a basic-block of
x86 instructions estimate the throughput in terms of clock cycles Analytical models (llvm-mca, IACA) are usually used in such scenarios

Accurate Modeling of Processor Core is Complex ➔ Modeling the
micro-architectural details of a complex core is a very hard problem ◆ Very easy to omit details ◆ Specs are not always accurate ◆ Some details are proprietary

Ithemal: a Data-driven approach ➔ Run/train a DL model using
many samples of x86 BBs and corresponding cycle count ◆ Measured on real hardware ◆ A new hardware just means re-training the new model OR some form of transfer learning ➔ High accuracy and ease of portability

Ithemal DL model ➔ Hierarchical LSTMs with x86 instruction set
in a BB as input ◆ 2-layers ◆ Layer-1 for the sequence of operands of each instruction ◆ Layer-2 for the sequence of instructions ➔ Regression model ◆ Throughput predictor

Ithemal competitive performance ➔ Ithemal delivers better throughput accuracy compared
to analytical models ➔ Portable and robust ➔ Github: https://github.com/ithemal/Ithemal/tree/master/learning/pytorch

DL-based Graph Coloring Register Allocation

Register Allocation as a Graph-Coloring Problem
➔ Register Allocation: an important problem in code generation
➔ The number of registers available may be < number of variables
➔ Create an *interference graph* which models the variables (virtual registers) that need to be *live* at the same time

Modeling Graph Coloring using LSTMs ➔ Viewed as a sequence-2-sequence
translation via LSTMs ➔ An input sequence where each item of the sequence corresponds to a node of the graph ➔ The output sequence is of the same length as the input sequence (number of nodes of the graph) ➔ Trained using random graphs

Inference and Color-Correction
➔ Difficult to encode in an LSTM the constraint that two adjacent nodes cannot have the same color
➔ *Invalid* edges (both endpoints given the same color) may appear during inference
➔ Rectify these edges using a post-inference color-correct pass
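
A color-correct pass can be as simple as one greedy sweep over the nodes. The sketch below is one possible correction strategy, assumed here for illustration and not taken from the DL4RegAlloc repository; the graph and the predicted colors are made up.

```python
def color_correct(adjacency, colors):
    """Recolor nodes until no edge joins two nodes of the same color.

    adjacency: {node: set of neighbouring nodes}, colors: {node: predicted color}.
    """
    fixed = dict(colors)
    for node in sorted(adjacency):
        neighbor_colors = {fixed[n] for n in adjacency[node]}
        if fixed[node] in neighbor_colors:
            # A node has fewer neighbours than there are nodes, so a free color always exists.
            fixed[node] = next(c for c in range(len(adjacency)) if c not in neighbor_colors)
    return fixed


# Triangle 0-1-2 with node 3 attached to node 2.
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
predicted = {0: 0, 1: 1, 2: 1, 3: 0}   # edge 1-2 is invalid
corrected = color_correct(adjacency, predicted)
print(corrected)
```

A recolored node never clashes with its neighbours, so one sweep is enough for validity, though the result may use more colors than an optimal coloring.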

DL-model vs LLVM’s GRA ➔ Collected the interference graphs for
the functions of certain SPEC CPU® 2017 benchmarks ◆ Use these graphs to predict colors using the DL-model ➔ Collect the actual register count of each function after codegen from LLVM ➔ Comparison shows DL-model performing better than GRA ➔ Github: https://github.com/dasdibye/DL4RegAlloc

Compiler Auto-vectorization using Imitation Learning

Auto-vectorization/SLP via ML
➔ The problem:
  ◆ Solve the SLP (Superword-Level Parallelism) problem using ML
  ◆ SLP is a superset of Loop Vectorization whereby straight-line code can also be vectorized
➔ Naïve ML strategies may lead to correctness issues as only isomorphic statements can be vectorized

MDP Formulation ➔ The pairwise instruction packing problem formulated as
a MDP by iteratively forming one vector pack at a time following a particular instruction traversal policy – Bottom-Up or Top-Down

MDP Formulation - State Details
❖ Nodes: 5 types of nodes to encode the graph features of a state:
  • Instruction Node: corresponds to each instruction with at least one valid packing opportunity, or instructions which are already packed
  • Pack Node: common node representing overhead packing instructions
  • Unpack Node: common node representing overhead unpacking instructions
  • Constant Node: common node representing any constant value used by instructions
  • Focus Node: connected to the instruction that is considered for packing
❖ Edges: 4 types of edges connecting the above nodes:
  • Dependency Edge: encodes if an instruction must be executed after another
  • Possible Pack Edge: encodes whether two instructions can be packed together
  • Packed Edge: encodes instructions that are already packed together
  • Focus Edge: connects the focus node to the instruction node that is considered for packing

Imitation Learning ➔ goSLP solves the SLP problem *exactly* using
ILP solvers ➔ Use such a solution to imitate/mimic the action space ➔ Use actual runtimes/estimates as cost/reward function ➔ Gated Graph Neural Network (GGNN) used as part of the policy network modeling to make packing decisions for each state

A Learned Performance Model for TPUs

Learning a performance model ➔ Program performance is tightly coupled
with the underlying processor architecture as well as the optimization decisions that are made during compilation ➔ Developing an accurate analytical model of program performance on a modern processor is challenging and can take months of engineering effort ➔ Developers of analytical models are often unaware of detailed features of the processor or effects from all compiler passes ➔ Learn a performance model for XLA graphs on the TPU

Approach
• Dataflow Graph: decompose into kernels (Kernel = ∪ DL ops)
• Train a DNN to map kernels to runtime estimates: f(kernel) ≅ Τ secs
• Predict per kernel and sum over all kernels

Model Architecture
[Figure: for each DL op in the full dataflow graph, node/op features and DL op embeddings → Concat → Feedforward → GNN (Graph NN) → final node embeddings, topo-sorted and fed to LSTM → Linear → Kernel Runtime Estimate]

Application of the learnt Model ➔ Tile Selection ◆ Find
the optimal tile size among many for a kernel ◆ Query the valid tile sizes and obtain the relative performance among them • A kernel may have up to 500K valid tile sizes • For training, up to 25M samples were used ➔ Operator Fusion ◆ Used a random search strategy for fusion on the entire computation graph ◆ For training, up to 50K fusion configurations ◆ More than 200M samples

Static Neural Compiler Optimization via Deep Reinforcement Learning

Phase Ordering problem ➔ Large number of passes in any
modern compiler ◆ 50+ passes in LLVM pass infrastructure depending on the optimization level ◆ Each pass may have several tunable parameters ➔ Phase-ordering problem extremely hard to solve ◆ Current order is rigid but enumerating possibilities creates a huge search space ◆ We may need to trade off between compile time and performance

Phase ordering as an RL problem
➔ IR can resemble a state
➔ Passes can resemble actions
➔ Optimizer is the environment
➔ Rewards are runtimes
[Figure: applying Pass 1 and Pass 2 to an IR (100 ms) yields IR1 (25 ms) and IR2 (125 ms)]

Overall RL approach
➔ Learn the value function: $Q(S_t, A_t, w) = R(S_t, A_t) + \gamma \max_{a \in A} Q(S_{t+1}, a, w)$
➔ Where $R = \ln\left(T(S_t) / T(S_{t+1})\right)$, and $T(x)$ is the runtime of IR $x$
[Figure: flowchart with the nodes Start, input_ir.ll, Agent Prediction, Max, Value > 0 ?, LLVM Opt, Record, Action History and End. The agent predicts a reward per action (1: -2.3, 2: +1.5, 3: +0.3, 4: -0.9) and Max selects action 2 (+1.5).]
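
The log-ratio reward and the right-hand side of the value-function equation are easy to compute by hand. In this sketch the function names and the discount factor of 0.9 are assumptions; the runtimes reuse the 100 ms, 25 ms and 125 ms from the earlier phase-ordering figure, and the next-state Q-values reuse the per-action predictions shown in the flowchart.

```python
import math


def reward(runtime_before, runtime_after):
    """R = ln(T(S_t) / T(S_{t+1})): positive when the pass made the code faster."""
    return math.log(runtime_before / runtime_after)


def q_target(r, next_q_values, gamma=0.9):
    """R(S_t, A_t) + gamma * max_a Q(S_{t+1}, a, w)."""
    return r + gamma * max(next_q_values)


speedup = reward(100.0, 25.0)     # ln(4), about +1.386
slowdown = reward(100.0, 125.0)   # ln(0.8), about -0.223
print(speedup, slowdown)

# Target for the pass that gave the speedup, using hypothetical next-state predictions.
print(q_target(speedup, [-2.3, 1.5, 0.3, -0.9]))
```

Using the logarithm of the runtime ratio makes rewards additive along a pass sequence: the sum of rewards equals the log of the overall speedup.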

States and Action Spaces
➔ IR is encoded using NCC (Neural Code Comprehension) vector embedding (Ben-Nun et al.)
➔ Action history is one-hot encoded
➔ A state is an agglomeration of the encoded IR and action history
➔ The actions are divided into 3 levels of abstraction: H, M, L
  ◆ H denotes a group of passes together while M and L denote individual passes
  ◆ H creates a smaller search space than M and L
[Figure: IR → NCC → Vector Embedding; Action History → 1-hot; both are combined into the State]

In Conclusion ➔ At the cusp of utilizing DL techniques
for compiler passes ➔ A big challenge is to ensure correctness ➔ Proposed work consists of NLP-like techniques and RL ➔ No consensus yet on standardized program representations and CFG/DFG ➔ Open questions: ◆ Can we generate *semantically correct* transformed code using a DL model ? • Would the optimizations of the future be a mix of DL models + Fast correction algorithms ? • Get most of it right using a DL model and then apply a quick and simple correction pass ◆ Can these models be really portable – across compilers and across hardware ?

End of slides
