Slide 1

Slide 1 text

MCST

Simple Type-Based Alias Analysis for a VLIW Processor

Markin A. L., Ermolitsky A. V.

4 March 2017

Slide 2

Slide 2 text

Elbrus

Elbrus is a general-purpose VLIW (Very Long Instruction Word) microprocessor. Features:
- 23 instructions per tick
- In-order instruction execution
- Array Access Unit (AAU): asynchronous array loading from memory to the Array Prefetch Buffer (APB)
- Hardware support for loop pipelining
- Disambiguation Access Memory (DAM): hardware support for pointer disambiguation

All of these features depend vitally on good compiler optimization.

Slide 3

Slide 3 text

Pointer analysis

    void foo(int *a, float *b) {
        for (int i = 1; i < N; i++) {
            a[0] += a[i];
            b[0] *= b[i];
        }
    }

The purpose of pointer analysis is to detect whether a and b may refer to the same memory area. It is difficult because:
- Information about the program is incomplete (in per-module build mode)
- Pointer analysis needs a lot of resources (in whole-program mode)
- Pointer analysis algorithms are complicated
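Why the answer matters can be shown with a variant in which both pointers have the same type, so types alone cannot help. This is a minimal sketch; the names sum_reload and sum_hoisted are illustrative and do not come from the slides. Keeping the accumulator in a register is equivalent to the original loop only when dst[0] is not one of the elements read through src.

```c
#include <assert.h>
#include <stddef.h>

/* Reference semantics: dst[0] is re-read and re-written on every iteration. */
int sum_reload(int *dst, const int *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 1; i < n; i++)
        dst[0] += src[i];
    return dst[0];
}

/* What an optimizer would like to emit: accumulate in a local variable
 * (a register) and store once. This matches sum_reload only if dst[0]
 * is not one of src[1..n-1]. */
int sum_hoisted(int *dst, const int *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    int acc = dst[0];
    for (size_t i = 1; i < n; i++)
        acc += src[i];
    dst[0] = acc;
    return acc;
}
```

With disjoint arguments the two functions agree. When dst points into the array read through src, they can differ, which is why the compiler must prove the absence of aliasing before making this transformation.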

Slide 4

Slide 4 text

Strict-aliasing

The C language allows pointers to be disambiguated by type (ISO C, section 6.5, paragraph 7):

7 An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
- a type compatible with the effective type of the object,
- a qualified version of a type compatible with the effective type of the object,
- a type that is the signed or unsigned type corresponding to the effective type of the object,
- a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
- an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
- a character type.
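A few of the permitted cases can be sketched in C. The function names below are illustrative, not part of the presentation, and float_bits assumes that float and uint32_t have the same size. Reading a float object through a uint32_t lvalue is not on the list, so the sketch copies the bytes with memcpy instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Permitted: any object may be inspected through a character type. */
unsigned char first_byte(const float *f)
{
    assert(f != NULL);
    return *(const unsigned char *)f;
}

/* Permitted: the unsigned type corresponding to the effective type (int). */
unsigned int as_unsigned(const int *p)
{
    assert(p != NULL);
    return *(const unsigned int *)p;
}

/* Not permitted: *(uint32_t *)&f for a float f.
 * Copying the object representation avoids the violation. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);   /* assumption: 32-bit float */
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

On a platform with IEEE 754 single-precision floats, float_bits(1.0f) yields 0x3F800000.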

Slide 5

Slide 5 text

Algorithm

The strict-aliasing implementation for lcc (Elbrus C Compiler) works with the architecture-independent IR (EIR).

General description:
1. Gather all interesting READ and WRITE operations
2. Generate a compatibility vector for each type accessed by these operations
3. Assign the results of the analysis to the corresponding operations

Type-based alias analysis is implemented in all major compilers.
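The three steps can be sketched as follows. This is not the lcc implementation: the type identifiers, the op structure, and the simplified compatibility rule (identical types, or a character type on either side) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type identifiers; a real compiler would use its type table. */
enum type_id { T_CHAR, T_INT32, T_FLOAT, T_COUNT };

enum op_kind { OP_READ, OP_WRITE, OP_OTHER };

struct op {
    enum op_kind kind;
    enum type_id type;    /* type of the accessed lvalue */
    uint32_t alias_vec;   /* bit t set: may touch an object of type t */
};

/* Simplified rule: identical types, or a character type on either side. */
static int compatible(enum type_id a, enum type_id b)
{
    return a == b || a == T_CHAR || b == T_CHAR;
}

/* Step 2: build one compatibility vector per type. */
static void build_vectors(uint32_t vec[T_COUNT])
{
    for (int a = 0; a < T_COUNT; a++) {
        vec[a] = 0;
        for (int b = 0; b < T_COUNT; b++)
            if (compatible((enum type_id)a, (enum type_id)b))
                vec[a] |= UINT32_C(1) << b;
    }
}

/* Steps 1 and 3: visit READ/WRITE operations and attach the vector
 * that belongs to the type each of them accesses. */
void annotate(struct op *ops, size_t n)
{
    uint32_t vec[T_COUNT];
    assert(ops != NULL || n == 0);
    build_vectors(vec);
    for (size_t i = 0; i < n; i++)
        if (ops[i].kind == OP_READ || ops[i].kind == OP_WRITE)
            ops[i].alias_vec = vec[ops[i].type];
}
```

In this sketch two operations a and b may alias when bit b.type is set in a.alias_vec. Since one vector is stored per operation, the stored result grows linearly with the number of operations.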

Slide 6

Slide 6 text

Implementation characteristics

- Pointer analysis: answers whether two pointers can refer to the same memory area
- Intraprocedural: does not require whole-program information
- Flow-insensitive: does not use information about the program's control flow
- Context-insensitive: does not use information from function call sites
- No memory modeling
- The result is represented as a vector

Slide 7

Slide 7 text

Runtime results

Figure: Integer SPEC CPU2006 execution speedup (> 1 is better)

Series: lcc module, lcc whole, gcc module, gcc lto.
Benchmarks: 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk.
The speedup axis runs from 0.90 to 1.15; one bar exceeds the scale and is labeled 17.49.

Slide 8

Slide 8 text

Runtime results

Figure: Floating point SPEC CPU2006 execution speedup (> 1 is better)

Series: lcc module, lcc whole, gcc module, gcc lto.
Benchmarks: 416.gamess, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex, 453.povray, 454.calculix, 459.GemsFDTD, 465.tonto, 482.sphinx3.
The speedup axis runs from 0.8 to 2.2.

Slide 9

Slide 9 text

Runtime results

GMean speedup gained with the help of strict-aliasing:

                       lcc -O3 -ffast   lcc -O3 -ffast -fwhole   gcc -O3   gcc -O3 -flto
    SPEC CPU2006 INT   28.6%            1.9%                     1%        0%
    SPEC CPU2006 FP    13.3%            4.3%                     1.5%      1.1%

Testing environment:
- lcc: Elbrus 4C (Elbrus v3 ISA)
- gcc: Intel Xeon E5-2650 (x86_64 ISA)

Slide 10

Slide 10 text

Implementation Aspects

- Problem: strict-aliasing violations are common, so a separate analysis for detecting strict-aliasing errors was implemented.
- Problem: unions are hard to analyse at compile time, so they are ignored.

Slide 11

Slide 11 text

462.libquantum

This test got a 17.49 times execution speedup after enabling strict-aliasing analysis in per-module build mode!

The three hottest functions have the same pattern:

    void foo(str_1 *str) {
        for (int i = 0; i < N; i++) {
            str->arr[i].field;        // LOAD of arr and LOAD of field
            ...
            str->arr[i].field = val;  // STORE to field
        }
    }

The dependence between the STORE to field and the LOAD of arr prevents the invariant LOAD from being eliminated.
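At the source level, the effect of eliminating the invariant LOAD can be sketched as below. The struct layouts are hypothetical stand-ins for str_1 and str_2, and the loop body (a read-modify-write of field) is a simplification of the pattern on the slide, not code from the benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layouts standing in for str_1 and str_2. */
struct str_2 { int field; };
struct str_1 { struct str_2 *arr; };

/* As written: str->arr is loaded again on every iteration, because a
 * store to field might, without further knowledge, overwrite str->arr. */
void update_reload(struct str_1 *str, size_t n, int val)
{
    assert(str != NULL);
    for (size_t i = 0; i < n; i++)
        str->arr[i].field += val;
}

/* After hoisting: valid once the int store to field is known not to
 * modify the pointer str->arr, whose type is struct str_2 *. */
void update_hoisted(struct str_1 *str, size_t n, int val)
{
    assert(str != NULL);
    struct str_2 *arr = str->arr;   /* invariant LOAD moved out of the loop */
    for (size_t i = 0; i < n; i++)
        arr[i].field += val;
}
```

Both functions produce the same result for conforming programs; the second form leaves a single load and a single store per iteration.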

Slide 12

Slide 12 text

462.libquantum

In the lcc architecture-independent representation (EIR) we have the following operations:

    loop:
        ...
        o1. READ str : str_1 *
        o2. RD_FIELD o1.arr : str_2 *
        o3. ADD_P o2, i : str_2 *
        o4. RD_FIELD o3.field : int32
        ...
        o4. WR_FIELD o3.field <- val : int32

Slide 13

Slide 13 text

462.libquantum

The strict-aliasing analysis builds a type compatibility table for the three types:

                  str_1 *   str_2 *   int32
        str_1 *   1         0         0
        str_2 *   0         1         0
        int32     0         0         1

In this example all three types are incompatible, so the operations working with them cannot refer to the same memory area.

Slide 14

Slide 14 text

462.libquantum

The speedup was gained through Elbrus-specific optimizations. The architecture-dependent IR of the loop is the following:

    loop:
        ...
        o1. LOAD str->arr 0 -> r1         // Alias vector: 010
        o2. ADD_P r1 i -> r2
        o3. LOAD r2 offset(field) -> r3   // Alias vector: 001
        ...
        o4. STORE r2 offset(field) val    // Alias vector: 001

The results of strict-aliasing make it possible to disambiguate operations o1. LOAD and o4. STORE and to eliminate the invariant o1. LOAD from the loop.
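How an optimization might consume these results can be sketched as a query over the compatibility table for str_1 *, str_2 * and int32. The mem_op structure and the function names are assumptions for illustration; the actual lcc data structures are not shown in the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Row and column order: str_1 *, str_2 *, int32. */
enum ty { TY_STR1_PTR, TY_STR2_PTR, TY_INT32, TY_COUNT };

static const unsigned char compat[TY_COUNT][TY_COUNT] = {
    { 1, 0, 0 },
    { 0, 1, 0 },
    { 0, 0, 1 },
};

/* Hypothetical memory operation: only what the query needs. */
struct mem_op {
    bool is_store;
    enum ty type;
};

bool may_alias(const struct mem_op *a, const struct mem_op *b)
{
    assert(a != NULL && b != NULL);
    return compat[a->type][b->type] != 0;
}

/* A load may be moved out of the loop only if no store in the loop body
 * may alias it. Invariance of the load address is assumed to be checked
 * elsewhere. */
bool load_is_hoistable(const struct mem_op *load,
                       const struct mem_op *body, size_t n)
{
    assert(load != NULL && !load->is_store);
    assert(body != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (body[i].is_store && may_alias(load, &body[i]))
            return false;
    return true;
}
```

For the loop above, the LOAD of str->arr (type str_2 *) does not conflict with the int32 STORE and can be hoisted, while the int32 LOAD of field still conflicts with that STORE.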

Slide 15

Slide 15 text

462.libquantum

With only one LOAD left in the loop, further optimizations can be applied:

        o1. LOAD str->arr 0 -> r1         // Alias vector: 010
    loop:
        ...
        o2. MOVA arr_buff
        ...
        o3. ADD_P r1 i -> r2
        o4. STORE r2 offset(field) val    // Alias vector: 001

Before strict-aliasing:
- weak pipelining
- DAM applied
- no APB

After strict-aliasing:
- improved pipelining
- no DAM
- APB

Slide 16

Slide 16 text

Other tests

- Almost all other tests (except 453.povray) have code patterns similar to those in 462.libquantum, but more complicated.
- The tests 459.GemsFDTD and 437.leslie3d are Fortran tests, but lcc translates them to C, so they also show a speedup.
- The hot functions of 453.povray contain no loops. Its 16% speedup comes only from peephole improvements!

Slide 17

Slide 17 text

Strict-aliasing clients

Strict-aliasing results are used by:
- Redundant Load/Store Elimination
- Memory Runtime Optimizations: DAM, RTMD
- Loop Optimizations: APB, Pipelining
- Peephole

Slide 18

Slide 18 text

Compile Time

In general, the impact of the analysis on compilation time is low.

Compilation time speedup:

                lcc -O3 -ffast   lcc -O3 -ffast -fwhole   gcc -O3   gcc -O3 -flto
        GMean   -3%              1%                       1%        2%

The size of the stored analysis results is linear in the number of operations in the procedure.

Slide 19

Slide 19 text

Summary

Advantages of strict-aliasing:
- Relatively easy implementation
- Works in per-module build mode
- In some cases works with object fields
- High scalability
- Great execution speedup on a VLIW processor

Disadvantages of strict-aliasing:
- Needs a complicated analysis for detecting strict-aliasing errors
- Low precision

Slide 20

Slide 20 text

Conclusion

In this work:
- A simple type-based alias analysis algorithm was described and implemented for the Elbrus compiler
- Its impact on runtime and compile time characteristics was analyzed

Further work:
- Extending the algorithm to disambiguate fields of structures
- Detailed research of strict-aliasing errors in a GNU/Linux distribution
- Comparison of the precision of different pointer analyses