VMIL'16: Bringing Low-Level Languages to the JVM: Efficient Execution of LLVM IR on Truffle

Manuel Rigger¹, Matthias Grimmer¹, Christian Wimmer², Thomas Würthinger², Hanspeter Mössenböck¹
¹Johannes Kepler University Linz, ²Oracle Labs
VMIL, 31 October 2016

What is Sulong?
Sulong executes low-level languages (C, C++, Fortran, ...) on the JVM by compiling them to LLVM IR, interpreting this IR, and using a dynamic (JIT) compiler to make the approach fast.
[Figure: C, C++, Fortran, ... are compiled to LLVM IR, which is executed by the LLVM IR interpreter with a JIT compiler on the JVM]

Why do We Need Sulong?
(1) To implement dynamic languages' native interfaces (JRuby+Truffle, FastR, and Graal.JS on the JVM)
(2) As an alternative to Java's foreign function interfaces
Rigger, M.; et al.: Sulong - Execution of LLVM-Based Languages on the JVM. ICOOOLPS'16

System Overview
[Figure: C and C++ (Clang), Fortran (GCC), and other languages (other LLVM frontends) are compiled to LLVM IR, which LLVM tools can process; the LLVM IR interpreter is built on Truffle and runs on the JVM with the Graal compiler]
Lattner, et al.: LLVM: A compilation framework for lifelong program analysis & transformation. CGO 2004.
Würthinger, et al.: One VM to rule them all. Onward! 2013

Three Contributions/Challenges
• Interpretation approach
• Compilation approach
• Native calls + performance

Example program

C source:

    void processRequests() {
        int i = 0;
        do {
            processPacket();
            i++;
        } while (i < 10000);
    }

LLVM IR produced by Clang:

    define void @processRequests() #0 {
    ; (basic block 0)
      br label %1

    ; <label>:1 (basic block 1)
      %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
      call void @processPacket()
      %2 = add nsw i32 %i.0, 1
      %3 = icmp slt i32 %2, 10000
      br i1 %3, label %1, label %4

    ; <label>:4 (basic block 2)
      ret void
    }

Example program: control flow
The LLVM IR of @processRequests contains unstructured control flow: basic block 1 branches either back to itself (label %1) or on to basic block 2 (label %4).
Recover the control flow information?
Erosa et al.: Taming control flow: A structured approach to eliminating goto statements. Computer Languages 1994

Interpreter
[Figure: a Basic Block Dispatch Node with the children Block0, Block1, and Block2. Successor block indices: Block0 returns 1, Block1 returns 1 or 2, Block2 returns -1]

Interpreter implementation:

    int blockIndex = 0;
    while (blockIndex != -1)
        blockIndex = blocks[blockIndex].execute();

Each block's execute() returns the index of its successor block; -1 ends the function.
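The dispatch loop above can be sketched as a small runnable Java program. This is only an illustration under assumptions: the names `BasicBlock`, `State`, and `buildProcessRequests` are hypothetical and not taken from Sulong, the phi node is modeled as a plain assignment, and the call to processPacket() is replaced by a counter.

```java
public class DispatchSketch {
    /** Hypothetical block abstraction: executes the block and returns the successor index (-1 = return). */
    interface BasicBlock {
        int execute(State s);
    }

    /** Hypothetical frame state for the example function. */
    static final class State {
        int i;                // models the LLVM IR value %i.0 / %2
        int packetsProcessed; // stand-in for the side effect of processPacket()
    }

    /** Builds the three basic blocks of processRequests() from the example program. */
    static BasicBlock[] buildProcessRequests() {
        return new BasicBlock[] {
            // basic block 0: br label %1 (the phi picks 0 when coming from this block)
            s -> { s.i = 0; return 1; },
            // basic block 1: call, increment, compare, conditional branch
            s -> { s.packetsProcessed++; s.i++; return s.i < 10000 ? 1 : 2; },
            // basic block 2: ret void
            s -> { return -1; }
        };
    }

    /** The basic block dispatch loop from the slide. */
    static State run(BasicBlock[] blocks) {
        State s = new State();
        int blockIndex = 0;
        while (blockIndex != -1) {
            blockIndex = blocks[blockIndex].execute(s);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(run(buildProcessRequests()).packetsProcessed); // prints 10000
    }
}
```

The loop body never inspects what a block does; it only follows the returned successor index, which is what lets the same loop handle arbitrary unstructured control flow.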

Interpreter: program execution
For @processRequests, the dispatch loop executes basic block 0 once, repeats basic block 1 as long as it returns successor index 1, and then executes basic block 2, which returns -1 and ends the function.

Compiler: unrolling of the interpreter
The compiler unrolls the dispatch loop, starting at blockIndex = 0. The blocks form a control flow graph, not a tree → cycles!

Steps:
1. blockIndex = 0: emit blocks[0].execute(); its successor is always 1.
2. blockIndex = 1: emit blockIndex = blocks[1].execute(); the successor is 1 or 2.
3. Successor 1 has already been expanded, so the compiler jumps back to it (merge1) instead of expanding it again. Merging already expanded paths makes the compilation work!
4. blockIndex = 2: emit blocks[2].execute(); the successor is -1, so return.

Result:

    blocks[0].execute();                // blockIndex = 1
    merge1:
    blockIndex = blocks[1].execute();   // blockIndex = 1 or 2
    if (blockIndex == 1)
        goto merge1;
    else
        blocks[2].execute();            // blockIndex = -1
    return;

[Figure: Basic Block Dispatch Node with Block0, Block1, Block2 and the successor indices 1, 1 or 2, and -1]
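As a rough illustration of what the unrolled code amounts to, the sketch below writes the result by hand in Java. Java has no goto, so the back edge to merge1 is expressed as a do-while loop; the class and method names are hypothetical, and the call to processPacket() is again replaced by a counter. It is not output of the Graal compiler.

```java
public class UnrolledSketch {
    /** Hand-written equivalent of the unrolled dispatch loop for processRequests(). */
    static int unrolled() {
        int i = 0;
        int packetsProcessed = 0; // stand-in for calls to processPacket()
        int blockIndex;

        // blocks[0].execute(): only branches to block 1, so nothing is left to do here.

        do { // merge1:
            // blocks[1].execute(): call, increment, compare
            packetsProcessed++;
            i++;
            blockIndex = (i < 10000) ? 1 : 2;
        } while (blockIndex == 1); // if (blockIndex == 1) goto merge1;

        // blocks[2].execute(): ret void
        return packetsProcessed;
    }

    public static void main(String[] args) {
        System.out.println(unrolled()); // prints 10000
    }
}
```

After unrolling, the dispatch on blockIndex has disappeared from the straight-line parts, and the only remaining check is the loop condition that the original C program already had.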

Native Calls
• We call the standard library functions to allocate unmanaged heap memory. Example: the LLVM IR interpreter calls malloc(sizeof(int) * 10) in the C standard library.
• We use sun.misc.Unsafe to access unmanaged memory.
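A minimal sketch of accessing unmanaged memory through sun.misc.Unsafe follows. It makes two assumptions that differ from the slide: the memory is obtained with `Unsafe.allocateMemory` as a stand-in for a native malloc call, and the class and method names are hypothetical. Newer JDK releases may print deprecation warnings for these methods.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnmanagedMemorySketch {
    /** Obtains the Unsafe singleton via reflection (the usual workaround outside the JDK). */
    static Unsafe loadUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not available", e);
        }
    }

    /** Allocates room for 10 ints off-heap, writes 0..9, reads them back, and returns their sum. */
    static int sumOfTenInts() {
        Unsafe unsafe = loadUnsafe();
        final int count = 10;
        // Stand-in for malloc(sizeof(int) * 10); returns a raw address.
        long base = unsafe.allocateMemory((long) Integer.BYTES * count);
        try {
            for (int k = 0; k < count; k++) {
                unsafe.putInt(base + (long) k * Integer.BYTES, k); // store, like an i32 store
            }
            int sum = 0;
            for (int k = 0; k < count; k++) {
                sum += unsafe.getInt(base + (long) k * Integer.BYTES); // load, like an i32 load
            }
            return sum;
        } finally {
            unsafe.freeMemory(base); // stand-in for free()
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfTenInts()); // prints 45
    }
}
```

Because the memory is addressed by plain long values, pointers from C code can be represented directly, at the cost of losing the JVM's bounds and type checks.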

Optimizations (compile time)
Problem: Java compilers do not optimize accesses to unmanaged memory
Solution: LLVM optimizations
• Promote to local variables
• Dead store elimination
• Traditional compiler optimizations based on alias analysis

Optimizations (run time)
Problem: Static compilers cannot react to program behavior
Solution: Profiling information exploited by Truffle and Graal
• Runtime inlining
• Dynamic dead code elimination
• Value profiling
• Polymorphic inline caches
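To make the last bullet concrete, here is a minimal sketch of a polymorphic inline cache for an indirect call site in plain Java. It does not use the Truffle API; the class name, the function table, the addresses, and the cache limit of 2 are all assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

public class InlineCacheSketch {
    static final int CACHE_LIMIT = 2; // hypothetical limit before the site counts as megamorphic

    // Slow path: a generic lookup from function "address" to its implementation.
    final Map<Long, IntUnaryOperator> functionTable = new HashMap<>();

    // Fast path: the targets this call site has already seen.
    final long[] cachedAddresses = new long[CACHE_LIMIT];
    final IntUnaryOperator[] cachedTargets = new IntUnaryOperator[CACHE_LIMIT];
    int cacheSize = 0;
    int slowLookups = 0;

    /** Calls the function at the given address, consulting the per-site cache first. */
    int call(long address, int argument) {
        for (int k = 0; k < cacheSize; k++) {
            if (cachedAddresses[k] == address) {
                return cachedTargets[k].applyAsInt(argument); // cache hit
            }
        }
        slowLookups++;
        IntUnaryOperator target = functionTable.get(address);
        if (target == null) {
            throw new IllegalArgumentException("unknown function address: " + address);
        }
        if (cacheSize < CACHE_LIMIT) { // remember the target for later calls
            cachedAddresses[cacheSize] = address;
            cachedTargets[cacheSize] = target;
            cacheSize++;
        }
        return target.applyAsInt(argument);
    }

    /** Exercises one call site with two targets and returns the number of slow lookups. */
    static int demo() {
        InlineCacheSketch site = new InlineCacheSketch();
        site.functionTable.put(0x10L, x -> x + 1);
        site.functionTable.put(0x20L, x -> x * 2);
        site.call(0x10L, 1); // slow lookup, then cached
        site.call(0x10L, 2); // cache hit
        site.call(0x20L, 3); // slow lookup, then cached
        site.call(0x20L, 4); // cache hit
        return site.slowLookups;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 2
    }
}
```

In a dynamic compiler, the cached comparisons become cheap guards in the compiled code and the cached targets become candidates for runtime inlining, which a static compiler cannot do without the profile.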

Performance Evaluation
• C (and Fortran) programs
• Peak performance → warmup runs
• With LLVM optimizations
• Sulong revision ad56c6f, available on https://github.com/graalvm/sulong

C Shootouts
[Bar chart: The Computer Language Benchmarks Game, all benchmarks, Sulong vs. Clang -O3. Performance is shown relative to Clang -O3 (based on LLVM 3.3); higher is better]
• Small benchmarks: up to 453 LOC
• binarytrees: 2.5x slower
• nbody: 1.3x faster
• Geometric mean: 1.2x slower

Single Compilation-Unit Benchmarks
[Bar chart: large single compilation-unit C programs (oggenc, bzip2, gzip), Sulong vs. Clang -O3]
• Large benchmarks: 5K LOC-48K LOC
• Geometric mean: 2.0x slower

Evaluation Summary

Language   Speedup (geometric mean)   Reference
C          0.7x                       Clang 3.3 -O3
Fortran    0.4x                       GCC 4.6 -O3

The Fortran results are for the Polyhedron benchmarks.

Reasons
• Missing micro optimizations
• Unnecessary allocations of interpreter data structures
• Truffle function calls
• (we did not look into optimizing Fortran LLVM IR programs)

Future Work
• Implement foreign function interfaces using Sulong (for JRuby+Truffle, FastR, and Graal.JS on the JVM)
• Comparison with other systems/foreign function interfaces
• Managed execution

Bibliography
• Rigger, M.; Grimmer, M.; Mössenböck, H.: Sulong - Execution of LLVM-Based Languages on the JVM. Int. Workshop on Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems (ICOOOLPS'16), July 18, 2016, Rome, Italy.
• Lattner, C.; Adve, V.: LLVM: A compilation framework for lifelong program analysis & transformation. Int. Symposium on Code Generation and Optimization (CGO 2004). IEEE, 2004.
• Würthinger, T.; et al.: One VM to rule them all. Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software. ACM, 2013.
• Erosa, A. M.; Hendren, L. J.: Taming control flow: A structured approach to eliminating goto statements. In Proceedings of Computer Languages, pages 229–240, 1994.

Thanks for listening!
https://github.com/graalvm/sulong/
@RiggerManuel
