VMIL'16: Bringing Low-Level Languages to the JVM: Efficient Execution of LLVM IR on Truffle

Manuel Rigger
November 07, 2016

Transcript

  1. 1.

    Bringing Low-Level Languages to the JVM: Efficient Execution of LLVM IR on Truffle

    Manuel Rigger¹, Matthias Grimmer¹, Christian Wimmer², Thomas Würthinger², Hanspeter Mössenböck¹ (¹Johannes Kepler University Linz, ²Oracle Labs). VMIL, October 31, 2016.
  2. 2.

    What is Sulong?

    Execute low-level languages (C, C++, Fortran, ...) on the JVM by compiling them to LLVM IR, interpreting this IR with an LLVM IR interpreter, and using a dynamic (JIT) compiler to make the approach fast.
  6. 6.

    Why Do We Need Sulong?

    (1) To implement dynamic languages' native interfaces
    (2) As an alternative to Java's foreign function interfaces

    Diagram: JRuby+Truffle, FastR, and Graal.JS on the JVM.

    Rigger, M.; et al.: Sulong - Execution of LLVM-Based Languages on the JVM. ICOOOLPS'16
  7. 7.

    System Overview

    Diagram: C, C++, Fortran, ... are compiled to LLVM IR by Clang, GCC, or another LLVM front end; LLVM tools operate on the LLVM IR. The LLVM IR interpreter is built on Truffle and runs on the JVM together with the Graal compiler.

    Lattner, et al.: LLVM: A compilation framework for lifelong program analysis & transformation. CGO 2004.
    Würthinger, et al.: One VM to rule them all. Onward! 2013
  11. 12.

    Example program

    C source:

        void processRequests() {
            int i = 0;
            do {
                processPacket();
                i++;
            } while (i < 10000);
        }

    LLVM IR (produced by Clang):

        define void @processRequests() #0 {
        ; (basic block 0)
          br label %1
        ; <label>:1 (basic block 1)
          %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
          call void @processPacket()
          %2 = add nsw i32 %i.0, 1
          %3 = icmp slt i32 %2, 10000
          br i1 %3, label %1, label %4
        ; <label>:4 (basic block 2)
          ret void
        }
  12. 13.

    Example program

    The LLVM IR of @processRequests contains unstructured control flow. Recover the control flow information?

    Erosa et al.: Taming control flow: A structured approach to eliminating goto statements. Computer Languages 1994
  15. 16.

    Interpreter

    Diagram: a Basic Block Dispatch Node with the children Block0, Block1, and Block2. Successor block indices: Block0 returns 1, Block1 returns 1 or 2, Block2 returns -1 (exit).

    Interpreter implementation:

        int blockIndex = 0;
        while (blockIndex != -1)
            blockIndex = blocks[blockIndex].execute();
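The dispatch loop on this slide can be tried out in plain Java. The following is only a minimal sketch: the `BasicBlock` interface, the class name, and the lambda bodies are hypothetical stand-ins, not Sulong's actual node classes. Block 0 initializes the counter directly instead of modeling the phi instruction.

```java
// Hypothetical sketch of a basic block dispatch loop; not Sulong's real classes.
final class BlockDispatchSketch {

    /** A basic block executes its instructions and returns its successor index, or -1 to exit. */
    interface BasicBlock {
        int execute();
    }

    /** The dispatch loop from the slide. */
    static void dispatch(BasicBlock[] blocks) {
        int blockIndex = 0;
        while (blockIndex != -1) {
            blockIndex = blocks[blockIndex].execute();
        }
    }

    /** Models processRequests() and returns how often processPacket() would have been called. */
    static int runProcessRequests() {
        int[] i = {0};        // the loop counter from the C example
        int[] packets = {0};  // counts simulated processPacket() calls
        BasicBlock[] blocks = {
            () -> { i[0] = 0; return 1; },                                // block 0: br label %1
            () -> { packets[0]++; i[0]++; return i[0] < 10000 ? 1 : 2; }, // block 1: loop body
            () -> -1                                                      // block 2: ret void
        };
        dispatch(blocks);
        return packets[0];
    }

    public static void main(String[] args) {
        System.out.println(runProcessRequests()); // prints 10000
    }
}
```

Each block only reports where to go next, so the loop can represent any control flow graph, including the back edge from block 1 to itself.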
  17. 18.

    Interpreter: Program execution

    The Basic Block Dispatch Node executes the basic blocks of @processRequests (Block0 = basic block 0, Block1 = basic block 1, Block2 = basic block 2), following the successor indices 1, 1 or 2, and -1.
  20. 21.

    Compiler: Unrolling of the interpreter

    Unrolling the dispatch loop starts at blockIndex = 0. The program is a control flow graph, not a tree, so unrolling runs into cycles. Merging already expanded paths makes the compilation work!

    Unrolled result (pseudocode):

        blocks[0].execute();               // blockIndex = 1
        merge1:
        blockIndex = blocks[1].execute();  // blockIndex = 1 or 2
        if (blockIndex == 1)
            goto merge1;
        else
            blocks[2].execute();           // blockIndex = -1
        return;
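To illustrate why merging already expanded paths matters, here is a toy simulation in plain Java. It is not Graal's partial evaluator: the class, the method names, and the string-based output are assumptions made for this sketch. Given the possible successor indices of each block, it emits unrolled pseudocode and jumps back to a label whenever a block has been expanded before.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of unrolling a dispatch loop with merging; purely illustrative.
final class MergeExplodeSketch {

    /** successors[b] lists the indices block b may return; -1 means "leave the loop". */
    static List<String> unroll(int[][] successors) {
        List<String> out = new ArrayList<>();
        expand(0, successors, new boolean[successors.length], out);
        return out;
    }

    private static void expand(int block, int[][] successors, boolean[] expanded, List<String> out) {
        if (block == -1) {
            out.add("return;");
            return;
        }
        if (expanded[block]) {
            // Merge with the already expanded path instead of unrolling the cycle again.
            out.add("goto merge" + block + ";");
            return;
        }
        expanded[block] = true;
        out.add("merge" + block + ": blockIndex = blocks[" + block + "].execute();");
        int[] succ = successors[block];
        for (int s : succ) {
            if (succ.length > 1) {
                out.add("if (blockIndex == " + s + ")");
            }
            expand(s, successors, expanded, out);
        }
    }

    public static void main(String[] args) {
        // Successor indices of the example: Block0 -> 1, Block1 -> 1 or 2, Block2 -> -1.
        int[][] successors = {{1}, {1, 2}, {-1}};
        unroll(successors).forEach(System.out::println);
    }
}
```

Without the `expanded` check, the recursion would follow the back edge from block 1 to itself forever, which is the cycle problem named on the slide.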
  32. 34.

    Native Calls

    We call the standard library functions to allocate unmanaged heap memory. Diagram: the LLVM IR interpreter calls malloc with sizeof(int) * 10 in the C standard library.

    We use sun.misc.Unsafe to access unmanaged memory.
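Accessing unmanaged memory through sun.misc.Unsafe could look roughly like the sketch below. One deliberate difference from the slide: the sketch allocates with `Unsafe.allocateMemory` rather than through a native call to malloc, so that it stays self-contained. The class and method names are hypothetical, and the memory-access methods of sun.misc.Unsafe are deprecated for removal in recent JDK releases, so expect compiler warnings.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Illustrative only: off-heap allocation and access via sun.misc.Unsafe.
final class UnmanagedMemorySketch {

    /** Obtains the Unsafe singleton reflectively (the usual workaround outside the JDK). */
    static Unsafe unsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not accessible", e);
        }
    }

    /** Stores 0..count-1 as ints in unmanaged memory, reads them back, and returns their sum. */
    static long sumOffHeap(int count) {
        Unsafe u = unsafe();
        // Stand-in for malloc(sizeof(int) * count); the slide uses a native call instead.
        long base = u.allocateMemory((long) count * Integer.BYTES);
        try {
            for (int k = 0; k < count; k++) {
                u.putInt(base + (long) k * Integer.BYTES, k);
            }
            long sum = 0;
            for (int k = 0; k < count; k++) {
                sum += u.getInt(base + (long) k * Integer.BYTES);
            }
            return sum;
        } finally {
            u.freeMemory(base); // stand-in for free()
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOffHeap(10)); // prints 45
    }
}
```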
  34. 36.

    Optimizations (compile-time)

    Problem: Java compilers do not optimize accesses to unmanaged memory

    Solution: LLVM optimizations
    • Promote to local variables
    • Dead store elimination
    • Traditional compiler optimizations based on alias analysis
  35. 37.

    Optimizations (run-time)

    Problem: Static compilers cannot react to program behavior

    Solution: Profiling information exploited by Truffle and Graal
    • Runtime inlining
    • Dynamic dead code elimination
    • Value profiling
    • Polymorphic inline caches
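As a rough illustration of value profiling, the sketch below records whether an observed value has stayed the same across executions. It is a hand-written stand-in with hypothetical names, not one of Truffle's own profile classes. In a real system, the dynamic compiler would treat a stable value as a constant and deoptimize if the assumption breaks.

```java
// Hypothetical value profile: remembers whether an int value has been constant so far.
final class ValueProfileSketch {
    private boolean seen;     // has any value been observed yet?
    private boolean generic;  // has more than one distinct value been observed?
    private int cachedValue;  // the single value observed so far

    /** Records the value and returns it; on the fast path the cached copy is returned. */
    int profile(int value) {
        if (generic) {
            return value;
        }
        if (!seen) {
            seen = true;
            cachedValue = value;
            return value;
        }
        if (cachedValue == value) {
            return cachedValue; // a dynamic compiler could fold this to a constant
        }
        generic = true; // assumption failed: stop speculating
        return value;
    }

    /** True while exactly one distinct value has been observed. */
    boolean isConstantSoFar() {
        return seen && !generic;
    }

    public static void main(String[] args) {
        ValueProfileSketch p = new ValueProfileSketch();
        p.profile(4);
        p.profile(4);
        System.out.println(p.isConstantSoFar()); // prints true
        p.profile(5);
        System.out.println(p.isConstantSoFar()); // prints false
    }
}
```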
  36. 38.

    Performance Evaluation
    • C (and Fortran) programs
    • Peak performance → warmup runs
    • With LLVM optimizations
    • Sulong revision ad56c6f available on https://github.com/graalvm/sulong
  37. 39.

    C Shootouts

    [Bar chart: The Computer Language Benchmark Game, Sulong vs. Clang O3; performance relative to Clang -O3 based on LLVM 3.3; higher is better; all benchmarks]
    • Small benchmarks: up to 453 LOC
    • binarytrees: 2.5x slower
    • nbody: 1.3x faster
    • Geometric mean: 1.2x slower
  42. 44.

    C Large Benchmarks (Single Compilation-Unit Benchmarks)

    [Bar chart: large single compilation-unit C programs (oggenc, bzip2, gzip), Sulong vs. Clang O3]
    • Large benchmarks: 5K LOC-48K LOC
    • Geometric mean: 2.0x slower
  44. 47.

    Evaluation Summary

    Language   Speedup (geometric mean)   Reference
    C          0.7x                       Clang 3.3 -O3
    Fortran    0.4x                       GCC 4.6 -O3 (Polyhedron benchmarks)

    Reasons
    • Missing micro optimizations
    • Unnecessary allocations of interpreter data structures
    • Truffle function calls
    • (we did not look into optimizing Fortran LLVM IR programs)
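The summary figures are geometric means of per-benchmark speedups. As a reminder of how such a figure is computed, here is a minimal sketch. The input ratios in `main` are made up for illustration and are not measurements from the talk.

```java
// Geometric mean of speedup ratios; the sample inputs are hypothetical.
final class GeoMeanSketch {

    /** Returns the n-th root of the product of n positive ratios, computed via logarithms. */
    static double geometricMean(double[] ratios) {
        double logSum = 0.0;
        for (double r : ratios) {
            logSum += Math.log(r);
        }
        return Math.exp(logSum / ratios.length);
    }

    public static void main(String[] args) {
        // One benchmark at half speed and one at double speed average out to about 1.0.
        System.out.println(geometricMean(new double[] {0.5, 2.0}));
    }
}
```

Unlike the arithmetic mean, the geometric mean treats a 2x slowdown and a 2x speedup symmetrically, which is why it is commonly used for relative performance numbers.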
  46. 49.

    Future Work
    • Implement foreign function interfaces using Sulong
    • Comparison with other systems/foreign function interfaces
    • Managed Execution

    Diagram: JRuby+Truffle, FastR, and Graal.JS on the JVM.
  47. 50.

    Bibliography
    • Rigger, M.; Grimmer, M.; Mössenböck, H.: Sulong - Execution of LLVM-Based Languages on the JVM. Int. Workshop on Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems (ICOOOLPS'16), July 18, 2016, Rome, Italy
    • Lattner, Chris, and Vikram Adve: LLVM: A compilation framework for lifelong program analysis & transformation. Code Generation and Optimization, 2004. CGO 2004. International Symposium on. IEEE, 2004.
    • Würthinger, Thomas, et al.: One VM to rule them all. Proceedings of the 2013 ACM international symposium on New ideas, new paradigms, and reflections on programming & software. ACM, 2013.
    • A. M. Erosa and L. J. Hendren. Taming control flow: A structured approach to eliminating goto statements. In Proceedings of Computer Languages, pages 229–240, 1994.