Slide 1

Slide 1 text

Bringing Low-Level Languages to the JVM: Efficient Execution of LLVM IR on Truffle
Manuel Rigger (1), Matthias Grimmer (1), Christian Wimmer (2), Thomas Würthinger (2), Hanspeter Mössenböck (1)
(1) Johannes Kepler University Linz, (2) Oracle Labs
VMIL, 31 October 2016

Slide 2

Slide 2 text

What is Sulong?
Execute low-level languages (C, C++, Fortran, ...) on the JVM …

Slide 3

Slide 3 text

What is Sulong?
Execute low-level languages (C, C++, Fortran, ...) on the JVM by compiling them to LLVM IR, …

Slide 4

Slide 4 text

What is Sulong?
Execute low-level languages (C, C++, Fortran, ...) on the JVM by compiling them to LLVM IR, interpreting this IR with an LLVM IR interpreter, …

Slide 5

Slide 5 text

What is Sulong?
Execute low-level languages (C, C++, Fortran, ...) on the JVM by compiling them to LLVM IR, interpreting this IR with an LLVM IR interpreter, and using a dynamic (JIT) compiler to make the approach fast.

Slide 6

Slide 6 text

Why do We Need Sulong?
(1) To implement dynamic languages' native interfaces (diagram: JRuby+Truffle, FastR, Graal.JS on the JVM)
(2) As an alternative to Java's foreign function interfaces
Rigger, M.; et al.: Sulong - Execution of LLVM-Based Languages on the JVM. ICOOOLPS'16

Slide 7

Slide 7 text

System Overview
Frontends: C and C++ (Clang), Fortran (GCC), ... (other LLVM frontend)
Intermediate form: LLVM IR, LLVM tools
Execution: LLVM IR Interpreter on Truffle, Graal compiler, JVM

Slide 8

Slide 8 text

System Overview
Frontends: C and C++ (Clang), Fortran (GCC), ... (other LLVM frontend)
Intermediate form: LLVM IR, LLVM tools
Execution: LLVM IR Interpreter on Truffle, Graal compiler, JVM
Lattner, et al.: LLVM: A compilation framework for lifelong program analysis & transformation. CGO 2004.

Slide 9

Slide 9 text

System Overview
Frontends: C and C++ (Clang), Fortran (GCC), ... (other LLVM frontend)
Intermediate form: LLVM IR, LLVM tools
Execution: LLVM IR Interpreter on Truffle, Graal compiler, JVM

Slide 10

Slide 10 text

System Overview
Frontends: C and C++ (Clang), Fortran (GCC), ... (other LLVM frontend)
Intermediate form: LLVM IR, LLVM tools
Execution: LLVM IR Interpreter on Truffle, Graal compiler, JVM
Würthinger, et al.: One VM to rule them all. Onward! 2013

Slide 11

Slide 11 text

Three Contributions/Challenges
• Interpretation approach
• Compilation approach
• Native Calls + Performance

Slide 12

Slide 12 text

Example program
C source:
    void processRequests() {
        int i = 0;
        do {
            processPacket();
            i++;
        } while (i < 10000);
    }
LLVM IR (produced by Clang):
    define void @processRequests() #0 {
      ; (basic block 0)
      br label %1
    ; <label>:1 (basic block 1)
      %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
      call void @processPacket()
      %2 = add nsw i32 %i.0, 1
      %3 = icmp slt i32 %2, 10000
      br i1 %3, label %1, label %4
    ; <label>:4 (basic block 2)
      ret void
    }

Slide 13

Slide 13 text

Example program
LLVM IR:
    define void @processRequests() #0 {
      ; (basic block 0)
      br label %1
    ; <label>:1 (basic block 1)
      %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
      call void @processPacket()
      %2 = add nsw i32 %i.0, 1
      %3 = icmp slt i32 %2, 10000
      br i1 %3, label %1, label %4
    ; <label>:4 (basic block 2)
      ret void
    }

Slide 14

Slide 14 text

Example program
LLVM IR:
    define void @processRequests() #0 {
      ; (basic block 0)
      br label %1
    ; <label>:1 (basic block 1)
      %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
      call void @processPacket()
      %2 = add nsw i32 %i.0, 1
      %3 = icmp slt i32 %2, 10000
      br i1 %3, label %1, label %4
    ; <label>:4 (basic block 2)
      ret void
    }
Contains unstructured control flow

Slide 15

Slide 15 text

Example program
LLVM IR:
    define void @processRequests() #0 {
      ; (basic block 0)
      br label %1
    ; <label>:1 (basic block 1)
      %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
      call void @processPacket()
      %2 = add nsw i32 %i.0, 1
      %3 = icmp slt i32 %2, 10000
      br i1 %3, label %1, label %4
    ; <label>:4 (basic block 2)
      ret void
    }
Contains unstructured control flow
Recover the control flow information?
Erosa et al.: Taming control flow: A structured approach to eliminating goto statements. Computer Languages 1994

Slide 16

Slide 16 text

Interpreter
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Interpreter implementation:
    int blockIndex = 0;
    while (blockIndex != -1)
        blockIndex = blocks[blockIndex].execute();

Slide 17

Slide 17 text

Interpreter
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
The numbers are the successor block indices.
Interpreter implementation:
    int blockIndex = 0;
    while (blockIndex != -1)
        blockIndex = blocks[blockIndex].execute();
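The dispatch loop above can be tried out with a minimal Java sketch. It assumes each basic block is an object whose `execute()` returns the index of its successor, or -1 to return from the function. The names `BlockDispatchSketch`, `BasicBlock`, and `processRequestsBlocks` are hypothetical and are not taken from Sulong's sources.

```java
// Hypothetical sketch; class and method names are illustrative, not Sulong's.
final class BlockDispatchSketch {
    /** Counts calls standing in for processPacket(). */
    static int packetsProcessed = 0;

    /** A basic block runs its instructions and returns its successor's index, or -1 to return. */
    interface BasicBlock {
        int execute();
    }

    /** The dispatch loop from the slide. */
    static void dispatch(BasicBlock[] blocks) {
        int blockIndex = 0;
        while (blockIndex != -1) {
            blockIndex = blocks[blockIndex].execute();
        }
    }

    /** Builds the three basic blocks of processRequests(). */
    static BasicBlock[] processRequestsBlocks() {
        final int[] i = {0}; // stands in for the phi node %i.0
        return new BasicBlock[] {
            () -> 1,                          // block 0: br label %1
            () -> {                           // block 1: loop body
                packetsProcessed++;           // call void @processPacket()
                i[0]++;                       // %2 = add nsw i32 %i.0, 1
                return i[0] < 10000 ? 1 : 2;  // br i1 %3, label %1, label %4
            },
            () -> -1                          // block 2: ret void
        };
    }

    public static void main(String[] args) {
        dispatch(processRequestsBlocks());
        System.out.println(packetsProcessed); // prints 10000
    }
}
```

Returning the successor index from `execute()` keeps the loop free of any knowledge about branch instructions, which is what lets one dispatch node handle arbitrary, unstructured control flow.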

Slide 18

Slide 18 text

Interpreter
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Program execution:
    define void @processRequests() #0 {
      ; (basic block 0)
      br label %1
    ; <label>:1 (basic block 1)
      %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
      call void @processPacket()
      %2 = add nsw i32 %i.0, 1
      %3 = icmp slt i32 %2, 10000
      br i1 %3, label %1, label %4
    ; <label>:4 (basic block 2)
      ret void
    }

Slide 19

Slide 19 text

Interpreter
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Program execution:
    define void @processRequests() #0 {
      ; (basic block 0)
      br label %1
    ; <label>:1 (basic block 1)
      %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
      call void @processPacket()
      %2 = add nsw i32 %i.0, 1
      %3 = icmp slt i32 %2, 10000
      br i1 %3, label %1, label %4
    ; <label>:4 (basic block 2)
      ret void
    }

Slide 20

Slide 20 text

Interpreter
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Program execution:
    define void @processRequests() #0 {
      ; (basic block 0)
      br label %1
    ; <label>:1 (basic block 1)
      %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
      call void @processPacket()
      %2 = add nsw i32 %i.0, 1
      %3 = icmp slt i32 %2, 10000
      br i1 %3, label %1, label %4
    ; <label>:4 (basic block 2)
      ret void
    }

Slide 21

Slide 21 text

Compiler
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Unrolling of the interpreter:
    // blockIndex = 0
Control flow graph, not tree → cycles!

Slide 22

Slide 22 text

Compiler
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Unrolling of the interpreter:
    blocks[0].execute();

Slide 23

Slide 23 text

Compiler
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Unrolling of the interpreter:
    blocks[0].execute();
    // blockIndex = 1

Slide 24

Slide 24 text

Compiler
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Unrolling of the interpreter:
    blocks[0].execute();
    // blockIndex = 1
    blockIndex = blocks[1].execute();

Slide 25

Slide 25 text

Compiler
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Unrolling of the interpreter:
    blocks[0].execute();
    // blockIndex = 1
    blockIndex = blocks[1].execute();
    // blockIndex = 1

Slide 26

Slide 26 text

Compiler
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Unrolling of the interpreter:
    blocks[0].execute();
    // blockIndex = 1
    merge1:
    blockIndex = blocks[1].execute();
    // blockIndex = 1 or …
    if (blockIndex == 1)
        goto merge1;

Slide 27

Slide 27 text

Compiler
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Unrolling of the interpreter:
    blocks[0].execute();
    // blockIndex = 1
    merge1:
    blockIndex = blocks[1].execute();
    // blockIndex = 1 or …
    if (blockIndex == 1)
        goto merge1;
Merging already expanded paths makes the compilation work!

Slide 28

Slide 28 text

Compiler
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Unrolling of the interpreter:
    blocks[0].execute();
    // blockIndex = 1
    merge1:
    blockIndex = blocks[1].execute();
    // blockIndex = 1 or 2
    if (blockIndex == 1)
        goto merge1;
    else
        // blockIndex = 2

Slide 29

Slide 29 text

Compiler
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Unrolling of the interpreter:
    blocks[0].execute();
    // blockIndex = 1
    merge1:
    blockIndex = blocks[1].execute();
    // blockIndex = 1 or 2
    if (blockIndex == 1)
        goto merge1;
    else
        blocks[2].execute();
    // blockIndex = -1

Slide 30

Slide 30 text

Compiler
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Unrolling of the interpreter:
    blocks[0].execute();
    // blockIndex = 1
    merge1:
    blockIndex = blocks[1].execute();
    // blockIndex = 1 or 2
    if (blockIndex == 1)
        goto merge1;
    else
        blocks[2].execute();
    // blockIndex = -1

Slide 31

Slide 31 text

Compiler
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Unrolling of the interpreter:
    blocks[0].execute();
    // blockIndex = 1
    merge1:
    blockIndex = blocks[1].execute();
    // blockIndex = 1 or 2
    if (blockIndex == 1)
        goto merge1;
    else
        blocks[2].execute();
    // blockIndex = -1
    return;

Slide 32

Slide 32 text

Compiler
Basic Block Dispatch Node: Block0 (successor 1), Block1 (successor 1 or 2), Block2 (successor -1)
Unrolling of the interpreter:
    blocks[0].execute();
    // blockIndex = 1
    merge1:
    blockIndex = blocks[1].execute();
    // blockIndex = 1 or 2
    if (blockIndex == 1)
        goto merge1;
    else
        blocks[2].execute();
    // blockIndex = -1
    return;
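Java has no `goto`, so the unrolled result above can only be approximated in source form. The sketch below writes the same control-flow shape as a `do`/`while` loop and checks that it behaves like the generic dispatch loop for the example program. It illustrates the shape only and is not Graal output; the names `UnrolledDispatchSketch`, `interpret`, `unrolled`, and `blocks` are hypothetical.

```java
// Hypothetical sketch of the control-flow shape only; this is not compiler output.
final class UnrolledDispatchSketch {
    /** A basic block returns its successor's index, or -1 to return. */
    interface BasicBlock {
        int execute();
    }

    /** Generic dispatch loop, as run by the interpreter. */
    static void interpret(BasicBlock[] blocks) {
        int blockIndex = 0;
        while (blockIndex != -1) {
            blockIndex = blocks[blockIndex].execute();
        }
    }

    /** The dispatch loop after unrolling for the three-block example. */
    static void unrolled(BasicBlock[] blocks) {
        blocks[0].execute();                   // blockIndex = 1
        int blockIndex;
        do {                                   // merge1:
            blockIndex = blocks[1].execute();  // blockIndex = 1 or 2
        } while (blockIndex == 1);             // if (blockIndex == 1) goto merge1;
        blocks[2].execute();                   // blockIndex = -1
    }                                          // return;

    /** Blocks of processRequests(); calls[0] counts processPacket() calls. */
    static BasicBlock[] blocks(int[] calls) {
        final int[] i = {0}; // stands in for the phi node %i.0
        return new BasicBlock[] {
            () -> 1,
            () -> { calls[0]++; i[0]++; return i[0] < 10000 ? 1 : 2; },
            () -> -1
        };
    }

    public static void main(String[] args) {
        int[] a = {0}, b = {0};
        interpret(blocks(a));
        unrolled(blocks(b));
        System.out.println(a[0] == b[0]); // prints true
    }
}
```

The back edge to `merge1` is what turns the expanded paths into a loop again; without merging, unrolling a cyclic control flow graph would not terminate.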

Slide 33

Slide 33 text

Native Calls
LLVM IR interpreter → malloc(sizeof(int) * 10) → C standard library

Slide 34

Slide 34 text

Native Calls
LLVM IR interpreter → malloc(sizeof(int) * 10) → C standard library
We call the standard library functions to allocate unmanaged heap memory

Slide 35

Slide 35 text

Native Calls
LLVM IR interpreter → malloc(sizeof(int) * 10) → C standard library
We call the standard library functions to allocate unmanaged heap memory
We use sun.misc.Unsafe to access unmanaged memory
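As an illustration of the last point, the following sketch reads and writes unmanaged memory through `sun.misc.Unsafe`. Two assumptions apply: it uses `Unsafe.allocateMemory` as a stand-in for the native `malloc` call shown on the slide, and it assumes a JDK that still permits the `sun.misc.Unsafe` memory-access methods (recent JDKs deprecate them and may warn or refuse). The class `UnmanagedMemorySketch` and its methods are hypothetical.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical sketch; Unsafe.allocateMemory stands in for the native malloc call.
final class UnmanagedMemorySketch {
    /** Obtains the Unsafe singleton reflectively (it has no public accessor for application code). */
    static Unsafe loadUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not accessible", e);
        }
    }

    /** Mirrors malloc(sizeof(int) * 10): stores 0..9 in unmanaged memory and sums them. */
    static int sumTenInts() {
        Unsafe unsafe = loadUnsafe();
        final int count = 10;
        long base = unsafe.allocateMemory((long) count * Integer.BYTES);
        try {
            for (int i = 0; i < count; i++) {
                unsafe.putInt(base + (long) i * Integer.BYTES, i); // store through a raw address
            }
            int sum = 0;
            for (int i = 0; i < count; i++) {
                sum += unsafe.getInt(base + (long) i * Integer.BYTES); // load through a raw address
            }
            return sum;
        } finally {
            unsafe.freeMemory(base); // unmanaged memory is not garbage collected
        }
    }

    public static void main(String[] args) {
        System.out.println(sumTenInts()); // prints 45
    }
}
```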

Slide 36

Slide 36 text

Optimizations: compile-time
Problem: Java compilers do not optimize accesses to unmanaged memory
Solution: LLVM optimizations
• Promote to local variables
• Dead store elimination
• Traditional compiler optimizations based on alias analysis

Slide 37

Slide 37 text

Optimizations: run-time
Problem: Static compilers cannot react to program behavior
Solution: Profiling information exploited by Truffle and Graal
• Runtime inlining
• Dynamic dead code elimination
• Value profiling
• Polymorphic inline caches
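To make one of these techniques concrete, here is a plain-Java sketch of the idea behind value profiling: remember the first value seen at a program point and keep treating it as a constant until a different value shows up. It does not use Truffle's API; the class `ValueProfileSketch` and its methods are hypothetical, and a real dynamic compiler would additionally speculate on the cached value and deoptimize when the speculation fails.

```java
// Hypothetical sketch of value profiling; it does not use Truffle's API.
final class ValueProfileSketch {
    private enum State { UNINITIALIZED, CONSTANT, GENERIC }

    private State state = State.UNINITIALIZED;
    private int cached;

    /** Records the observed value and returns it unchanged. */
    int profile(int value) {
        switch (state) {
            case UNINITIALIZED:
                cached = value;          // first observation: remember it
                state = State.CONSTANT;
                return value;
            case CONSTANT:
                if (value == cached) {
                    return cached;       // a compiler could treat this as a constant
                }
                state = State.GENERIC;   // a second distinct value: give up on the speculation
                return value;
            default:
                return value;
        }
    }

    /** True while only a single value has been observed. */
    boolean isConstant() {
        return state == State.CONSTANT;
    }

    public static void main(String[] args) {
        ValueProfileSketch profile = new ValueProfileSketch();
        profile.profile(42);
        profile.profile(42);
        System.out.println(profile.isConstant()); // prints true
        profile.profile(7);
        System.out.println(profile.isConstant()); // prints false
    }
}
```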

Slide 38

Slide 38 text

Performance Evaluation
• C (and Fortran) programs
• Peak performance → warmup runs
• With LLVM optimizations
• Sulong revision ad56c6f available on https://github.com/graalvm/sulong

Slide 39

Slide 39 text

C Shootouts
[Bar chart: The Computer Language Benchmark Game, Sulong vs. Clang O3; y-axis from 0 to 1.4, relative to Clang -O3 based on LLVM 3.3; higher is better (all benchmarks)]

Slide 40

Slide 40 text

C Shootouts
[Bar chart: The Computer Language Benchmark Game, Sulong vs. Clang O3; y-axis from 0 to 1.4]
small benchmarks: up to 453 LOC

Slide 41

Slide 41 text

C Shootouts
[Bar chart: The Computer Language Benchmark Game, Sulong vs. Clang O3; y-axis from 0 to 1.4]
binarytrees: 2.5x slower

Slide 42

Slide 42 text

C Shootouts
[Bar chart: The Computer Language Benchmark Game, Sulong vs. Clang O3; y-axis from 0 to 1.4]
nbody: 1.3x faster

Slide 43

Slide 43 text

C Shootouts
[Bar chart: The Computer Language Benchmark Game, Sulong vs. Clang O3; y-axis from 0 to 1.4]
geometric mean: 1.2x slower
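For readers unfamiliar with the metric, this sketch shows how a geometric mean over per-benchmark ratios is computed and how an "N times slower" figure maps onto the chart's relative scale (1.2x slower corresponds to roughly 0.83 relative to Clang -O3). The input values in `main` are placeholders chosen for illustration, not measurements from the talk; the class `GeometricMeanSketch` is hypothetical.

```java
// Hypothetical sketch; the inputs in main are placeholders, not measured results.
final class GeometricMeanSketch {
    /** Geometric mean of positive ratios, computed via logarithms to avoid overflow. */
    static double geometricMean(double[] ratios) {
        double logSum = 0.0;
        for (double ratio : ratios) {
            logSum += Math.log(ratio);
        }
        return Math.exp(logSum / ratios.length);
    }

    /** Converts "N times slower" into speed relative to the baseline (baseline = 1.0). */
    static double relativeSpeed(double timesSlower) {
        return 1.0 / timesSlower;
    }

    public static void main(String[] args) {
        // One benchmark at half speed and one at double speed cancel out.
        System.out.println(geometricMean(new double[] {0.5, 2.0}));
        // 1.2x slower expressed on the chart's relative scale.
        System.out.println(relativeSpeed(1.2));
    }
}
```

The geometric mean is the usual choice for averaging ratios because a 2x slowdown and a 2x speedup then average to 1.0, which an arithmetic mean would not give.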

Slide 44

Slide 44 text

Single Compilation-Unit Benchmarks
[Bar chart: Large single compilation-unit C programs (oggenc, bzip2, gzip), Sulong vs. Clang O3; y-axis from 0 to 1.2]
large benchmarks: 5K LOC-48K LOC

Slide 45

Slide 45 text

C Large Benchmarks
[Bar chart: Large single compilation-unit C programs (oggenc, bzip2, gzip), Sulong vs. Clang O3; y-axis from 0 to 1.2]
geometric mean: 2.0x slower

Slide 46

Slide 46 text

Evaluation Summary
Language   Speedup (geometric mean)   Reference
C          0.7x                       Clang 3.3 -O3
Fortran    0.4x                       GCC 4.6 -O3

Slide 47

Slide 47 text

Evaluation Summary
Language   Speedup (geometric mean)   Reference
C          0.7x                       Clang 3.3 -O3
Fortran    0.4x                       GCC 4.6 -O3
Fortran: Polyhedron benchmarks

Slide 48

Slide 48 text

Evaluation Summary
Language   Speedup (geometric mean)   Reference
C          0.7x                       Clang 3.3 -O3
Fortran    0.4x                       GCC 4.6 -O3
Fortran: Polyhedron benchmarks
Reasons
• Missing micro optimizations
• Unnecessary allocations of interpreter data structures
• Truffle function calls
• (we did not look into optimizing Fortran LLVM IR programs)

Slide 49

Slide 49 text

Future Work
• Implement foreign function interfaces using Sulong
• Comparison with other systems/foreign function interfaces
• Managed Execution
Diagram: JRuby+Truffle, FastR, Graal.JS on the JVM

Slide 50

Slide 50 text

Bibliography
• Rigger, M.; Grimmer, M.; Mössenböck, H.: Sulong - Execution of LLVM-Based Languages on the JVM. Int. Workshop on Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems (ICOOOLPS'16), July 18, 2016, Rome, Italy.
• Lattner, C.; Adve, V.: LLVM: A compilation framework for lifelong program analysis & transformation. International Symposium on Code Generation and Optimization (CGO 2004). IEEE, 2004.
• Würthinger, T.; et al.: One VM to rule them all. Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software. ACM, 2013.
• Erosa, A. M.; Hendren, L. J.: Taming control flow: A structured approach to eliminating goto statements. In Proceedings of Computer Languages, pages 229–240, 1994.

Slide 51

Slide 51 text

Thanks for listening!
https://github.com/graalvm/sulong/
@RiggerManuel