TMPA-2017: Simple Type Based Alias Analysis for a VLIW Processor

Simple Type-Based Alias Analysis for a VLIW Processor

MCST
Markin A. L. (Alex.L.Markin@mcst.ru)
Ermolitsky A. V. (Alexander.V.Ermolitsky@mcst.ru)
4 March 2017

Elbrus

Elbrus is a general-purpose VLIW (Very Long Instruction Word) microprocessor.

Features:
- 23 instructions per tick
- In-order instruction execution
- Array Access Unit (AAU): asynchronous array loading from memory to the Array Prefetch Buffer (APB)
- Hardware support for loop pipelining
- Disambiguation Access Memory (DAM): hardware support for pointer disambiguation

All these features depend critically on good compiler optimization.

Pointer analysis

    void foo(int * a, float * b) {
        for(int i = 1; i < N; i++) {
            a[0] += a[i];
            b[0] *= b[i];
        }
    }

The purpose of pointer analysis is to detect whether a and b may refer to the same memory area.

It is difficult because:
- Information about the program is lacking (in per-module build mode)
- Pointer analysis needs a lot of resources (in whole-program mode)
- Pointer analysis algorithms are complicated
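
To make concrete why the answer to the aliasing question matters, here is a minimal sketch (the function names `sum_reload` and `sum_cached` are hypothetical, not from the slides). It uses a single element type, so no type-based reasoning applies: keeping `dst[0]` in a register is only safe when the compiler can show that `dst` and `src` do not overlap.

```c
#include <stddef.h>

/* Re-reads dst[0] from memory on every iteration. This is what the compiler
   has to emit when dst and src may refer to the same memory area. */
int sum_reload(int *dst, const int *src, size_t n) {
    for (size_t i = 1; i < n; i++) {
        dst[0] += src[i];
    }
    return dst[0];
}

/* Keeps dst[0] in a local variable (a "register") and stores it once after
   the loop. Equivalent to sum_reload only if dst[0] does not overlap
   src[1..n-1]. */
int sum_cached(int *dst, const int *src, size_t n) {
    int acc = dst[0];
    for (size_t i = 1; i < n; i++) {
        acc += src[i];
    }
    dst[0] = acc;
    return acc;
}
```

With `int buf[3] = {0, 1, 10}`, `dst = &buf[2]` and `src = buf`, `sum_reload` returns 22 while `sum_cached` returns 21; with disjoint arguments both return the same value.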

Strict-aliasing

The C language allows pointers to be disambiguated by their types:

7 An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
- a type compatible with the effective type of the object,
- a qualified version of a type compatible with the effective type of the object,
- a type that is the signed or unsigned type corresponding to the effective type of the object,
- a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
- an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
- a character type.
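
As an illustration of the rule quoted above, the following sketch shows three accesses it permits and, in a comment, one it forbids. The function names are hypothetical, and the block assumes `float` and `uint32_t` have the same size (checked at compile time).

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes 32-bit float");

/* Allowed: memcpy copies the object representation, so the float object is
   never accessed through an lvalue of an incompatible type. */
uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Allowed: any object may be accessed through a character type. */
unsigned char first_byte(const float *f) {
    return *(const unsigned char *)f;
}

/* Allowed: unsigned int is the unsigned type corresponding to int. */
unsigned int as_unsigned(const int *p) {
    return *(const unsigned int *)p;
}

/* NOT allowed (undefined behaviour): a float object accessed through a
   uint32_t lvalue. A compiler that relies on strict aliasing may assume
   this access never touches a float.

   uint32_t bad_bits(float f) { return *(uint32_t *)&f; }
*/
```

The forbidden form is the kind of code that a type-based alias analysis is entitled to treat as non-aliasing, which is why such violations become visible only once the analysis is enabled.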

Algorithm

The strict-aliasing implementation for lcc (Elbrus C Compiler) works with the architecture-independent IR (EIR).

General description:
1. Gather all interesting READ and WRITE operations
2. Generate a compatibility vector for the type of each operation
3. Assign the results of the analysis to the corresponding operations

Type-based alias analysis is implemented in all major compilers.
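
The three steps can be sketched roughly as follows. This is not the lcc implementation: the data structures (`mem_op`, `compat_fn`), the 32-type limit, the example type ids and the example compatibility rule are all assumptions made for illustration. Step 1 is assumed to have been done by the caller, who passes in the gathered operations.

```c
#include <stddef.h>
#include <stdint.h>

enum { MAX_TYPES = 32 };              /* one bit per type in a 32-bit vector */

/* Example type ids (hypothetical; a real compiler has its own type table). */
enum { T_STR1_PTR, T_STR2_PTR, T_INT32, T_CHAR, N_EXAMPLE_TYPES };

typedef enum { OP_READ, OP_WRITE } op_kind;

typedef struct {
    op_kind  kind;
    int      type_id;    /* type of the accessed object */
    uint32_t alias_vec;  /* step 3 stores the result here */
} mem_op;

/* Compatibility rule supplied by the caller; returns nonzero when an access
   of type a may touch an object of type b. */
typedef int (*compat_fn)(int a, int b);

/* Example rule: only the character type is compatible with other types. */
int example_rule(int a, int b) {
    return a == T_CHAR || b == T_CHAR;
}

/* ops[] is assumed to be the output of step 1 (the gathered READ and WRITE
   operations). Returns 0 on success, -1 on an out-of-range type. */
int assign_alias_vectors(mem_op *ops, size_t n_ops, int n_types,
                         compat_fn compatible) {
    uint32_t vec[MAX_TYPES] = {0};

    if (n_types < 0 || n_types > MAX_TYPES) return -1;

    /* Step 2: one compatibility vector per type, kept symmetric. */
    for (int a = 0; a < n_types; a++) {
        for (int b = 0; b < n_types; b++) {
            if (a == b || compatible(a, b) || compatible(b, a)) {
                vec[a] |= UINT32_C(1) << b;
            }
        }
    }

    /* Step 3: attach the vector of its type to every operation. */
    for (size_t i = 0; i < n_ops; i++) {
        if (ops[i].type_id < 0 || ops[i].type_id >= n_types) return -1;
        ops[i].alias_vec = vec[ops[i].type_id];
    }
    return 0;
}

/* Two operations may alias if y's type is marked in x's vector. */
int may_alias(const mem_op *x, const mem_op *y) {
    return (int)((x->alias_vec >> y->type_id) & 1u);
}
```

The query tests a single bit rather than intersecting the two vectors, because with a rule such as "a character type is compatible with everything" two otherwise unrelated types would share the character bit and an intersection would wrongly report a possible alias. A client optimization would additionally skip READ/READ pairs, which never conflict.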

Implementation characteristics

- Pointer analysis: answers whether two pointers can refer to the same memory area
- Intraprocedural: does not require whole-program information
- Flow-insensitive: does not use information about the program's control flow
- Context-insensitive: does not use information from the call sites of functions
- No memory modeling
- The result is represented as a vector

Runtime results

Figure: Integer SPEC CPU2006 execution speedup (> 1 is better).
Series: lcc module, lcc whole, gcc module, gcc lto.
Benchmarks: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk.
The vertical axis runs from 0.90 to 1.15; one bar exceeds this range and is labelled 17.49.

Runtime results

Figure: Floating point SPEC CPU2006 execution speedup (> 1 is better).
Series: lcc module, lcc whole, gcc module, gcc lto.
Benchmarks: 416.gamess, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex, 453.povray, 454.calculix, 459.GemsFDTD, 465.tonto, 482.sphinx3.
The vertical axis runs from 0.8 to 2.2.

Runtime results

GMean speedup gained with the help of strict-aliasing:

                        lcc -O3 -ffast   lcc -O3 -ffast -fwhole   gcc -O3   gcc -O3 -flto
    SPEC CPU2006 INT    28.6%            1.9%                     1%        0%
    SPEC CPU2006 FP     13.3%            4.3%                     1.5%      1.1%

Testing environment:
- lcc: Elbrus 4C (Elbrus v3 ISA)
- gcc: Intel Xeon E5-2650 (x86_64 ISA)
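
For reference, a geometric-mean speedup is conventionally computed as below. The slides do not define the per-benchmark ratio, so the definition of s_i (run time with the analysis disabled divided by run time with it enabled) is an assumption, as is the set of n benchmarks the mean is taken over. The second line is a purely hypothetical illustration, not measured data.

```latex
\[
  \mathrm{GMean} = \Bigl(\prod_{i=1}^{n} s_i\Bigr)^{1/n},
  \qquad
  s_i = \frac{t_i^{\mathrm{off}}}{t_i^{\mathrm{on}}},
  \qquad
  \text{speedup in percent} = (\mathrm{GMean} - 1)\cdot 100
\]
% Hypothetical case: n = 10, one benchmark at 17.49, the other nine unchanged
\[
  \bigl(17.49 \cdot 1^{9}\bigr)^{1/10} = 17.49^{0.1} \approx 1.33
\]
```

Under these assumptions a single outlier of that size would by itself give a mean of roughly 33%. This suggests, although the slides do not state it, that the integer result for lcc in per-module mode is dominated by one benchmark.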

Implementation Aspects

- Problem: strict-aliasing violations are common, so a separate analysis that detects strict-aliasing errors was implemented.
- Problem: unions are hard to analyse at compile time, so they are ignored.

462.libquantum

This test ran 17.49 times faster after strict-aliasing analysis was enabled in per-module build mode!

The three hottest functions share the same pattern:

    void foo(str_1 * str) {
        for(int i = 0; i < N; i++) {
            str->arr[i].field;        // LOAD of arr and LOAD of field
            ...
            str->arr[i].field = val;  // STORE to field
        }
    }

The dependence between the STORE to field and the LOAD of arr prevents the invariant LOAD from being eliminated.
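
At source level, the transformation that the dependence blocks looks roughly like the sketch below. The struct definitions are hypothetical stand-ins for `str_1` and `str_2` (the slides show only their names and the `arr` and `field` members), and the loop body is simplified to one read-modify-write of `field`.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the types named on the slide. */
typedef struct { int field; } str_2;
typedef struct { str_2 *arr; } str_1;

/* As written: str->arr is formally re-read on every iteration, because
   without type information a store to an int might modify the pointer
   str->arr itself. */
void add_to_fields(str_1 *str, size_t n, int val) {
    for (size_t i = 0; i < n; i++) {
        str->arr[i].field += val;   /* LOAD arr, LOAD field, STORE field */
    }
}

/* What becomes legal once an int store is known not to change a str_2 *
   object: the invariant load of str->arr is done once, before the loop. */
void add_to_fields_hoisted(str_1 *str, size_t n, int val) {
    str_2 *arr = str->arr;          /* invariant LOAD moved out of the loop */
    for (size_t i = 0; i < n; i++) {
        arr[i].field += val;        /* LOAD field, STORE field */
    }
}
```

Both functions produce the same result for any program that respects the strict-aliasing rule; the second form simply does by hand what the compiler is allowed to do after disambiguation.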

462.libquantum

In the lcc architecture-independent representation (EIR) we have the following operations:

    loop:
        ...
        o1. READ str                   : str_1 *
        o2. RD_FIELD o1.arr            : str_2 *
        o3. ADD_P o2, i                : str_2 *
        o4. RD_FIELD o3.field          : int32
        ...
        o4. WR_FIELD o3.field <- val   : int32

462.libquantum

The strict-aliasing analysis builds a type-compatibility table for the three types:

                 str_1 *   str_2 *   int32
    str_1 *      1         0         0
    str_2 *      0         1         0
    int32        0         0         1

In this example all three types are incompatible, so the operations working with them cannot refer to the same memory area.

462.libquantum

The speedup was gained by the Elbrus-specific optimizations. The architecture-dependent IR of the loop is the following:

    loop:
        ...
        o1. LOAD str->arr 0 -> r1            // Alias vector: 010
        o2. ADD_P r1 i -> r2
        o3. LOAD r2 offset(field) -> r3      // Alias vector: 001
        ...
        o4. STORE r2 offset(field) val       // Alias vector: 001

The strict-aliasing results make it possible to disambiguate operations o1. LOAD and o4. STORE and to eliminate the invariant o1. LOAD from the loop.

462.libquantum

With only one LOAD left in the loop, further optimizations can be applied:

    o1. LOAD str->arr 0 -> r1                // Alias vector: 010
    loop:
        ...
        o2. MOVA arr_buff
        ...
        o3. ADD_P r1 i -> r2
        o4. STORE r2 offset(field) val       // Alias vector: 001

Before strict-aliasing:
- weak pipelining
- DAM applied
- no APB

After strict-aliasing:
- improved pipelining
- no DAM
- APB

Other tests

Almost all other tests (except 453.povray) have code patterns similar to those in 462.libquantum, but more complicated.

The tests 459.GemsFDTD and 437.leslie3d are Fortran tests, but lcc translates them to C, so we can also see their speedup.

The hot functions of 453.povray contain no loops. Its 16% speedup comes only from improved peephole optimization!

Strict-aliasing clients

Strict-aliasing
- Redundant Load/Store Elimination
- Memory Runtime Optimizations
    - DAM
    - RTMD
- Loop Optimizations
    - APB
    - Pipelining
- Peephole

Compile Time

In general, the impact of the analysis on compilation time is low.

Compilation time speedup:

             lcc -O3 -ffast   lcc -O3 -ffast -fwhole   gcc -O3   gcc -O3 -flto
    GMean    -3%              1%                       1%        2%

The size of the stored analysis results is linear in the number of operations in the procedure.

Summary

Advantages of strict-aliasing:
- Relatively easy implementation
- Works in per-module build mode
- In some cases works with object fields
- High scalability
- Great execution speedup on a VLIW processor

Disadvantages of strict-aliasing:
- Needs a complicated analysis for detecting strict-aliasing errors
- Low precision

Conclusion

In this work:
- A simple type-based alias analysis algorithm was described and implemented for the Elbrus compiler
- Its impact on runtime and compile-time characteristics was analyzed

Further work:
- Extending the algorithm to disambiguate fields of structures
- Detailed research into strict-aliasing errors in a GNU/Linux distribution
- Comparing the precision of different pointer analyses
