VMIL'16: Bringing Low-Level Languages to the JVM: Efficient Execution of LLVM IR on Truffle

Manuel Rigger
November 07, 2016

Transcript

  1. Bringing Low-Level Languages to the JVM:
    Efficient Execution of LLVM IR on Truffle
    Manuel Rigger (1), Matthias Grimmer (1), Christian Wimmer (2), Thomas Würthinger (2), Hanspeter Mössenböck (1)
    VMIL, October 31, 2016
    (1) Johannes Kepler University Linz, (2) Oracle Labs

  2. What is Sulong?
    Execute low-level languages …
    … on the JVM …
    [Diagram: C, C++, Fortran, ... (execute on) JVM]

  3. What is Sulong?
    Execute low-level languages …
    … by compiling them to LLVM IR, …
    … on the JVM …
    [Diagram: C, C++, Fortran, ... (compile to) LLVM IR (execute on) JVM]

  4. What is Sulong?
    Execute low-level languages …
    … by compiling them to LLVM IR, …
    … interpreting this IR, …
    … on the JVM …
    [Diagram: C, C++, Fortran, ... (compile to) LLVM IR (execute with) LLVM IR Interpreter, on the JVM]

  5. What is Sulong?
    Execute low-level languages …
    … by compiling them to LLVM IR, …
    … interpreting this IR, …
    … on the JVM …
    … and using a dynamic compiler to make the approach fast.
    [Diagram: C, C++, Fortran, ... (compile to) LLVM IR (execute with) LLVM IR Interpreter and JIT compiler, on the JVM]

  6. Why do We Need Sulong?
    (1) To implement dynamic languages' native interfaces
    (2) As an alternative to Java's foreign function interfaces
    [Diagram: JRuby+Truffle, FastR, Graal.JS on the JVM]
    Rigger, M.; et al.: Sulong - Execution of LLVM-Based Languages on the JVM. ICOOOLPS'16

  7. System Overview
    [Diagram components: source languages C, C++, Fortran, ...; frontends Clang, GCC, other LLVM frontend; LLVM IR; LLVM tools; LLVM IR Interpreter; Truffle; Graal compiler; JVM]

  8. System Overview
    [Diagram components: source languages C, C++, Fortran, ...; frontends Clang, GCC, other LLVM frontend; LLVM IR; LLVM tools; LLVM IR Interpreter; Truffle; Graal compiler; JVM]
    Lattner, et al.: LLVM: A compilation framework for lifelong program analysis & transformation. CGO 2004.

  9. System Overview
    [Diagram components: source languages C, C++, Fortran, ...; frontends Clang, GCC, other LLVM frontend; LLVM IR; LLVM tools; LLVM IR Interpreter; Truffle; Graal compiler; JVM]

  10. System Overview
    [Diagram components: source languages C, C++, Fortran, ...; frontends Clang, GCC, other LLVM frontend; LLVM IR; LLVM tools; LLVM IR Interpreter; Truffle; Graal compiler; JVM]
    Würthinger, et al.: One VM to rule them all. Onward! 2013

  11. Three Contributions/Challenges
    • Interpretation approach
    • Compilation approach
    • Native Calls + Performance

  12. Example program
    C source:
        void processRequests() {
          int i = 0;
          do {
            processPacket();
            i++;
          } while (i < 10000);
        }
    LLVM IR (produced by Clang):
        define void @processRequests() #0 {
        ; (basic block 0)
          br label %1
        ; <label>:1 (basic block 1)
          %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
          call void @processPacket()
          %2 = add nsw i32 %i.0, 1
          %3 = icmp slt i32 %2, 10000
          br i1 %3, label %1, label %4
        ; <label>:4 (basic block 2)
          ret void
        }

  13. Example program
    LLVM IR:
        define void @processRequests() #0 {
        ; (basic block 0)
          br label %1
        ; <label>:1 (basic block 1)
          %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
          call void @processPacket()
          %2 = add nsw i32 %i.0, 1
          %3 = icmp slt i32 %2, 10000
          br i1 %3, label %1, label %4
        ; <label>:4 (basic block 2)
          ret void
        }

  14. Example program
    LLVM IR:
        define void @processRequests() #0 {
        ; (basic block 0)
          br label %1
        ; <label>:1 (basic block 1)
          %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
          call void @processPacket()
          %2 = add nsw i32 %i.0, 1
          %3 = icmp slt i32 %2, 10000
          br i1 %3, label %1, label %4
        ; <label>:4 (basic block 2)
          ret void
        }
    Contains unstructured control flow

  15. Example program
    LLVM IR:
        define void @processRequests() #0 {
        ; (basic block 0)
          br label %1
        ; <label>:1 (basic block 1)
          %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
          call void @processPacket()
          %2 = add nsw i32 %i.0, 1
          %3 = icmp slt i32 %2, 10000
          br i1 %3, label %1, label %4
        ; <label>:4 (basic block 2)
          ret void
        }
    Contains unstructured control flow
    Recover the control flow information?
    Erosa et al.: Taming control flow: A structured approach to eliminating goto statements. Computer Languages 1994

  16. Interpreter
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; the blocks are annotated with the indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Interpreter implementation:
        int blockIndex = 0;
        while (blockIndex != -1)
          blockIndex = blocks[blockIndex].execute();

  17. Interpreter
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Interpreter implementation:
        int blockIndex = 0;
        while (blockIndex != -1)
          blockIndex = blocks[blockIndex].execute();
    Successor block indices
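
The dispatch loop on this slide can be tried out as a small self-contained Java program. This is a minimal sketch, not Sulong's actual node classes: the `BasicBlock` interface, the `BLOCKS` array, and the `packets` counter standing in for `processPacket()` are assumptions made for illustration. Each block returns the index of its successor, and -1 means the function returns.

```java
final class BlockDispatchSketch {
    /** A basic block: runs its instructions and returns the successor index (-1 = return). */
    interface BasicBlock {
        int execute();
    }

    static int i;        // models the loop counter that the phi node carries
    static int packets;  // counts calls to the stand-in for processPacket()

    // Hypothetical blocks mirroring the three basic blocks of processRequests().
    static final BasicBlock[] BLOCKS = {
        () -> { i = 0; return 1; },                           // block 0: br label %1
        () -> { packets++; i++; return i < 10000 ? 1 : 2; },  // block 1: loop body and conditional branch
        () -> -1                                              // block 2: ret void
    };

    /** The dispatch loop: keep executing blocks until one reports -1. */
    static void dispatch(BasicBlock[] blocks) {
        int blockIndex = 0;
        while (blockIndex != -1) {
            blockIndex = blocks[blockIndex].execute();
        }
    }

    public static void main(String[] args) {
        dispatch(BLOCKS);
        System.out.println(packets); // prints 10000
    }
}
```

Because every block only reports an index, the loop can represent arbitrary jumps between blocks without first recovering structured control flow.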

  18. Interpreter
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Program execution:
        define void @processRequests() #0 {
        ; (basic block 0)
          br label %1
        ; <label>:1 (basic block 1)
          %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
          call void @processPacket()
          %2 = add nsw i32 %i.0, 1
          %3 = icmp slt i32 %2, 10000
          br i1 %3, label %1, label %4
        ; <label>:4 (basic block 2)
          ret void
        }

  19. Interpreter
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Program execution:
        define void @processRequests() #0 {
        ; (basic block 0)
          br label %1
        ; <label>:1 (basic block 1)
          %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
          call void @processPacket()
          %2 = add nsw i32 %i.0, 1
          %3 = icmp slt i32 %2, 10000
          br i1 %3, label %1, label %4
        ; <label>:4 (basic block 2)
          ret void
        }

  20. Interpreter
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Program execution:
        define void @processRequests() #0 {
        ; (basic block 0)
          br label %1
        ; <label>:1 (basic block 1)
          %i.0 = phi i32 [ 0, %0 ], [ %2, %1 ]
          call void @processPacket()
          %2 = add nsw i32 %i.0, 1
          %3 = icmp slt i32 %2, 10000
          br i1 %3, label %1, label %4
        ; <label>:4 (basic block 2)
          ret void
        }

  21. Compiler
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Unrolling of the interpreter:
        // blockIndex = 0
    Control flow graph, not tree → cycles!

  22. Compiler
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Unrolling of the interpreter:
        blocks[0].execute();

  23. Compiler
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Unrolling of the interpreter:
        blocks[0].execute(); // blockIndex = 1

  24. Compiler
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Unrolling of the interpreter:
        blocks[0].execute(); // blockIndex = 1
        blockIndex = blocks[1].execute();

  25. Compiler
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Unrolling of the interpreter:
        blocks[0].execute(); // blockIndex = 1
        blockIndex = blocks[1].execute(); // blockIndex = 1

  26. Compiler
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Unrolling of the interpreter:
        blocks[0].execute(); // blockIndex = 1
        merge1:
        blockIndex = blocks[1].execute(); // blockIndex = 1 or …
        if (blockIndex == 1)
          goto merge1;

  27. Compiler
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Unrolling of the interpreter:
        blocks[0].execute(); // blockIndex = 1
        merge1:
        blockIndex = blocks[1].execute(); // blockIndex = 1 or …
        if (blockIndex == 1)
          goto merge1;
    Merging already expanded paths makes the compilation work!

  28. Compiler
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Unrolling of the interpreter:
        blocks[0].execute(); // blockIndex = 1
        merge1:
        blockIndex = blocks[1].execute(); // blockIndex = 1 or 2
        if (blockIndex == 1)
          goto merge1;
        else // blockIndex = 2

  29. Compiler
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Unrolling of the interpreter:
        blocks[0].execute(); // blockIndex = 1
        merge1:
        blockIndex = blocks[1].execute(); // blockIndex = 1 or 2
        if (blockIndex == 1)
          goto merge1;
        else
          blocks[2].execute(); // blockIndex = -1

  30. Compiler
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Unrolling of the interpreter:
        blocks[0].execute(); // blockIndex = 1
        merge1:
        blockIndex = blocks[1].execute(); // blockIndex = 1 or 2
        if (blockIndex == 1)
          goto merge1;
        else
          blocks[2].execute(); // blockIndex = -1

  31. Compiler
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Unrolling of the interpreter:
        blocks[0].execute(); // blockIndex = 1
        merge1:
        blockIndex = blocks[1].execute(); // blockIndex = 1 or 2
        if (blockIndex == 1)
          goto merge1;
        else
          blocks[2].execute(); // blockIndex = -1
        return;

  32. Compiler
    [Diagram: Basic Block Dispatch Node with children Block0, Block1, Block2; successor block indices Block0: 1, Block1: 1 or 2, Block2: -1]
    Unrolling of the interpreter:
        blocks[0].execute(); // blockIndex = 1
        merge1:
        blockIndex = blocks[1].execute(); // blockIndex = 1 or 2
        if (blockIndex == 1)
          goto merge1;
        else
          blocks[2].execute(); // blockIndex = -1
        return;
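
Java has no `goto`, so the unrolled code on this slide cannot be written literally in Java source. The sketch below expresses the same control flow with a `do`/`while` loop whose header plays the role of `merge1`. It only mirrors what the slides describe the compiler as producing; it is not compiler output, and the names `runUnrolled` and `runCounting` and the counting blocks are hypothetical.

```java
final class UnrolledSketch {
    /** A basic block: runs its instructions and returns the successor index (-1 = return). */
    interface BasicBlock {
        int execute();
    }

    /** Hand-written equivalent of the dispatch loop after unrolling for three blocks. */
    static void runUnrolled(BasicBlock[] blocks) {
        blocks[0].execute();                   // successor is known to be block 1
        int blockIndex;
        do {                                   // "merge1:" becomes the loop header
            blockIndex = blocks[1].execute();  // successor is 1 or 2
        } while (blockIndex == 1);             // "goto merge1"
        blocks[2].execute();                   // successor is -1, so the function returns
    }

    /** Runs hypothetical blocks for a loop with the given limit and returns the iteration count. */
    static int runCounting(int limit) {
        int[] state = new int[2];              // state[0] = i, state[1] = number of "packets"
        BasicBlock[] blocks = {
            () -> { state[0] = 0; return 1; },
            () -> { state[1]++; state[0]++; return state[0] < limit ? 1 : 2; },
            () -> -1
        };
        runUnrolled(blocks);
        return state[1];
    }

    public static void main(String[] args) {
        System.out.println(runCounting(10000)); // prints 10000
    }
}
```

The back edge exists only because the path that reaches block 1 a second time is merged with the one already expanded; without that merge the unrolling would never terminate for a cyclic control flow graph.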

  33. Native Calls
    [Diagram: the LLVM IR interpreter calls malloc(sizeof(int) * 10) in the C standard library]

  34. Native Calls
    We call the standard library functions to allocate unmanaged heap memory
    [Diagram: the LLVM IR interpreter calls malloc(sizeof(int) * 10) in the C standard library]

  35. Native Calls
    We call the standard library functions to allocate unmanaged heap memory
    We use sun.misc.Unsafe to access unmanaged memory
    [Diagram: the LLVM IR interpreter calls malloc(sizeof(int) * 10) in the C standard library]
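
Accessing unmanaged memory through `sun.misc.Unsafe` can look roughly like the following sketch. It is an illustration under assumptions, not Sulong's code: `Unsafe.allocateMemory` stands in for the native `malloc(sizeof(int) * 10)` call shown on the slide, and the class and method names are hypothetical. Recent JDK releases deprecate these memory-access methods and may print warnings when they are used.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

final class UnmanagedMemorySketch {
    /** Obtains the Unsafe singleton reflectively; there is no public accessor for application code. */
    static Unsafe loadUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not accessible", e);
        }
    }

    /** Allocates room for ten ints off-heap, writes 0..9 into it, and returns the sum read back. */
    static int sumOfTenInts() {
        Unsafe unsafe = loadUnsafe();
        final int count = 10;
        // Stand-in for malloc(sizeof(int) * 10); the result is a raw address, not a Java object.
        long base = unsafe.allocateMemory((long) Integer.BYTES * count);
        try {
            for (int i = 0; i < count; i++) {
                unsafe.putInt(base + (long) i * Integer.BYTES, i);   // store through the raw address
            }
            int sum = 0;
            for (int i = 0; i < count; i++) {
                sum += unsafe.getInt(base + (long) i * Integer.BYTES); // load through the raw address
            }
            return sum;
        } finally {
            unsafe.freeMemory(base); // stand-in for free()
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfTenInts()); // prints 45
    }
}
```

The memory is addressed by a plain `long`, so the garbage collector neither tracks nor moves it, which is what allows pointers to be shared with native library code.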

  36. Optimizations
    Problem: Java compilers do not optimize accesses to unmanaged memory
    Solution: LLVM optimizations
    • Promote to local variables
    • Dead store elimination
    • Traditional compiler optimizations based on alias analysis
    [Graphic: compile-time / run-time]

  37. Optimizations
    Problem: Static compilers cannot react to program behavior
    Solution: Profiling information exploited by Truffle and Graal
    • Runtime inlining
    • Dynamic dead code elimination
    • Value profiling
    • Polymorphic inline caches
    [Graphic: compile-time / run-time]
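
To make one of these techniques concrete, the idea behind value profiling can be sketched in plain Java. This is a simplified illustration and not Truffle's profiling API: the class `IntValueProfile` and its three states are assumptions. The interpreter records whether a value has always been the same; while that holds, a dynamic compiler could treat the value as a constant, and it has to fall back to the generic path once a different value shows up.

```java
final class ValueProfileSketch {
    /** Speculates that an int is always the same value until a different one is observed. */
    static final class IntValueProfile {
        private static final int UNINITIALIZED = 0;
        private static final int CONSTANT = 1;
        private static final int GENERIC = 2;

        private int state = UNINITIALIZED;
        private int cached;

        /** Records the observed value and returns it (the cached copy while the speculation holds). */
        int profile(int value) {
            if (state == CONSTANT) {
                if (value == cached) {
                    return cached;      // speculation holds: the value is a known constant
                }
                state = GENERIC;        // speculation failed: stop assuming a constant
            } else if (state == UNINITIALIZED) {
                cached = value;         // first observation becomes the speculated constant
                state = CONSTANT;
            }
            return value;
        }

        boolean isConstant() {
            return state == CONSTANT;
        }
    }

    public static void main(String[] args) {
        IntValueProfile profile = new IntValueProfile();
        for (int n = 0; n < 3; n++) {
            profile.profile(4);
        }
        System.out.println(profile.isConstant()); // prints true
        profile.profile(8);
        System.out.println(profile.isConstant()); // prints false
    }
}
```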

  38. Performance Evaluation
    • C (and Fortran) programs
    • Peak performance → warmup runs
    • With LLVM optimizations
    • Sulong revision ad56c6f available on https://github.com/graalvm/sulong

  39. C Shootouts
    [Bar chart: The Computer Language Benchmark Game; series Sulong and Clang O3; y-axis from 0 to 1.4]
    higher is better (all benchmarks)
    relative to Clang -O3 based on LLVM 3.3

  40. C Shootouts
    [Bar chart: The Computer Language Benchmark Game; series Sulong and Clang O3; y-axis from 0 to 1.4]
    small benchmarks: up to 453 LOC

  41. C Shootouts
    [Bar chart: The Computer Language Benchmark Game; series Sulong and Clang O3; y-axis from 0 to 1.4]
    binarytrees: 2.5x slower

  42. C Shootouts
    [Bar chart: The Computer Language Benchmark Game; series Sulong and Clang O3; y-axis from 0 to 1.4]
    nbody: 1.3x faster

  43. C Shootouts
    [Bar chart: The Computer Language Benchmark Game; series Sulong and Clang O3; y-axis from 0 to 1.4]
    geometric mean: 1.2x slower

  44. Single Compilation-Unit Benchmarks
    [Bar chart: Large single compilation-unit C programs (oggenc, bzip2, gzip); series Sulong and Clang O3; y-axis from 0 to 1.2]
    large benchmarks: 5K LOC-48K LOC

  45. C Large Benchmarks
    [Bar chart: Large single compilation-unit C programs (oggenc, bzip2, gzip); series Sulong and Clang O3; y-axis from 0 to 1.2]
    geometric mean: 2.0x slower

  46. Evaluation Summary
    Language   Speedup (geometric mean)   Reference
    C          0.7x                       Clang 3.3 -O3
    Fortran    0.4x                       GCC 4.6 -O3
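
The summary figures are geometric means of per-benchmark speedups, where "N times slower" corresponds to a speedup of 1/N (so 1.2x slower is roughly 0.83x, and 2.0x slower is 0.5x). The sketch below shows the calculation. It uses only the two individual results quoted on the earlier slides (nbody at 1.3x faster, binarytrees at 2.5x slower), so its result is illustrative and does not reproduce the reported means, which cover all benchmarks. The class and method names are hypothetical.

```java
final class GeometricMeanSketch {
    /** Geometric mean of positive ratios, computed via logarithms to avoid overflow of the product. */
    static double geometricMean(double[] speedups) {
        double logSum = 0.0;
        for (double speedup : speedups) {
            logSum += Math.log(speedup);
        }
        return Math.exp(logSum / speedups.length);
    }

    /** Converts "N times slower" into a speedup factor below 1. */
    static double slowdownToSpeedup(double timesSlower) {
        return 1.0 / timesSlower;
    }

    public static void main(String[] args) {
        double[] twoResults = {
            1.3,                    // nbody: 1.3x faster
            slowdownToSpeedup(2.5)  // binarytrees: 2.5x slower, i.e. 0.4x
        };
        // Equals sqrt(1.3 * 0.4); only two of the benchmarks are included here.
        System.out.println(geometricMean(twoResults));
    }
}
```

The geometric mean is the usual choice for averaging ratios because a 2x speedup and a 2x slowdown cancel out to 1.0, which an arithmetic mean would not give.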

  47. Evaluation Summary
    Language   Speedup (geometric mean)   Reference
    C          0.7x                       Clang 3.3 -O3
    Fortran    0.4x                       GCC 4.6 -O3
    Polyhedron benchmarks

  48. Evaluation Summary
    Language   Speedup (geometric mean)   Reference
    C          0.7x                       Clang 3.3 -O3
    Fortran    0.4x                       GCC 4.6 -O3
    Polyhedron benchmarks
    Reasons
    • Missing micro optimizations
    • Unnecessary allocations of interpreter data structures
    • Truffle function calls
    • (we did not look into optimizing Fortran LLVM IR programs)

  49. Future Work
    • Implement foreign function interfaces using Sulong
    • Comparison with other systems/foreign function interfaces
    • Managed Execution
    [Diagram: JRuby+Truffle, FastR, Graal.JS on the JVM]

  50. Bibliography
    • Rigger, M.; Grimmer, M.; Mössenböck, H.: Sulong - Execution of LLVM-Based Languages on the JVM. Int. Workshop on Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems (ICOOOLPS'16), July 18, 2016, Rome, Italy.
    • Lattner, C.; Adve, V.: LLVM: A compilation framework for lifelong program analysis & transformation. International Symposium on Code Generation and Optimization (CGO 2004). IEEE, 2004.
    • Würthinger, T.; et al.: One VM to rule them all. Proceedings of the 2013 ACM international symposium on New ideas, new paradigms, and reflections on programming & software. ACM, 2013.
    • Erosa, A. M.; Hendren, L. J.: Taming control flow: A structured approach to eliminating goto statements. Proceedings of Computer Languages, pages 229–240, 1994.

  51. Thanks for listening!
    https://github.com/graalvm/sulong/
    @RiggerManuel
