
Parallel and Thread-Safe Ruby at High-Speed with TruffleRuby

These are the slides of my keynote at RubyKaigi 2018: http://rubykaigi.org/2018/presentations/eregontp.html#jun02

Array and Hash are used in every Ruby program. Yet, current implementations either prevent their use in parallel (the global interpreter lock in MRI) or lack thread-safety guarantees (JRuby and Rubinius raise exceptions on concurrent Array append). Concurrent::Array from concurrent-ruby is thread-safe but prevents parallel access.

This talk shows a technique to make Array and Hash thread-safe while enabling parallel access, with no penalty on single-threaded performance. In short, we keep the most important thread-safety guarantees of the global lock while allowing Ruby to scale up to tens of cores!

Benoit Daloze

June 02, 2018

Transcript

  1. Parallel and Thread-Safe Ruby
    at High-Speed with TruffleRuby
    Benoit Daloze
    @eregontp


  2. Who am I?
    Benoit Daloze
    Twitter: @eregontp
    GitHub: @eregon
    PhD student at Johannes Kepler
    University, Austria
    Research on concurrency with
    TruffleRuby
    Worked on TruffleRuby for 3+ years
    Maintainer of ruby/spec
    CRuby (MRI) committer
    2 / 63


  3. Agenda
    1. Performance of TruffleRuby
    2. Parallel and Thread-Safe Ruby
    3 / 63


  4. TruffleRuby
    A high-performance implementation of Ruby by Oracle Labs
    Uses the Graal Just-In-Time Compiler
    Targets full compatibility with CRuby, including C extensions
    https://github.com/oracle/truffleruby
    4 / 63


  5. Two Modes to Run TruffleRuby
    On the Java Virtual Machine (JVM)
    Can interoperate with Java
    Great peak performance
    On SubstrateVM, which AOT-compiles TruffleRuby & Graal to
    produce a native executable (default mode)
    Fast startup, even faster than MRI 2.5.1! (25ms vs 44ms)
    Fast warmup (Graal & TruffleRuby interpreter precompiled)
    Lower footprint
    Great peak performance
    5 / 63




  8. Ruby 3x3 Project
    Goal: CRuby 3.0 should be 3x faster than CRuby 2.0
    ⇒ with a just-in-time (JIT) compiler: MJIT
    Do we need to wait for CRuby 3? (≈ 2020)
    Can Ruby be faster than 3x CRuby 2.0?
    6 / 63



  10. OptCarrot: Demonstration
    The main CPU benchmark for Ruby 3x3
    A Nintendo Entertainment System emulator written in Ruby
    Created by @mame (Yusuke Endoh)
    Demo!
    7 / 63


  11. OptCarrot Warmup
    8 / 63


  12. Classic Benchmarks
    [Bar chart: speedup relative to CRuby (0–30×) on richards, pidigits, fannkuch, red-black, neural-net, matrix-multiply, spectral-norm, mandelbrot, deltablue, n-body and binary-trees, comparing CRuby 2.3, JRuby 9.1.12 and TruffleRuby]
    9 / 63


  13. MJIT Micro-Benchmarks
    Geomean of 22 micro-benchmarks
    Speedup over CRuby 2.0:
    CRuby 2.0: 1×
    RTL-MJIT: 4.21×
    JRuby: 2.48×
    TruffleRuby Native 0.31: 32.83×
    10 / 63


  14. Other areas where TruffleRuby is very fast
    9.4x faster than MRI 2.3 on rendering a small ERB template
    Thanks to Strings being represented as Ropes
    See Kevin Menard’s talk at RubyKaigi 2016
    “A Tale of Two String Representations”
    https://www.youtube.com/watch?v=UQnxukip368
    eval(), Kernel#binding, Proc & lambdas & blocks, …
    But there is no time to detail them in this talk
    11 / 63


  15. Rails benchmarks? Discourse?
    We are working on running Discourse
    Database migration and the asset pipeline work with a few
    patches
    Many dependencies (> 100)
    Many C extensions, which currently often need patches
    Ongoing research & experiments to reduce needed patches
    We support openssl, psych, zlib, syslog, puma, sqlite3, unf, etc.
    12 / 63


  16. How TruffleRuby achieves great performance
    Partial Evaluation
    The Graal JIT compiler
    Implementation:
    core primitives (e.g., Integer#+) are written in Java
    the rest of the core library is written in Ruby
    13 / 63


  17. From Ruby code to Abstract Syntax Tree
    def foo
    [1, 2].map { |e| e * 3 }
    end
    ⇓ Parse
    [AST diagram: foo → call map on the built array [1, 2], passing a block whose body is e * 3]
    14 / 63


  18. Truffle AST: a tree of Node objects with execute()
    def foo
    [1, 2].map { |e| e * 3 }
    end
    ⇓ Parse
    [AST of Node objects: foo → Call("map") with a BuildArray node for the [1, 2] literal and a block whose body is Mul(ReadLocal e, IntLit 3)]
    15 / 63


  19. Partial Evaluation
    PartialEvaluation(Truffle AST of a Ruby method) =
    a CompilerGraph representing how to execute the Ruby method
    Start from the root Node execute() method and inline every
    Java method while performing constant folding
    16 / 63
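The idea can be sketched with a toy partial evaluator in plain Ruby (illustration only — the `Lit` and `Mul` node classes are invented here, and Graal's real machinery works on inlined Java execute() methods, not Ruby objects):

```ruby
# Toy partial evaluation: fold Mul nodes whose operands are known
# literals into a single literal node, i.e. constant folding after
# "inlining" the children's evaluation.
Lit = Struct.new(:value)
Mul = Struct.new(:left, :right)

def partial_eval(node)
  case node
  when Mul
    left  = partial_eval(node.left)
    right = partial_eval(node.right)
    if left.is_a?(Lit) && right.is_a?(Lit)
      Lit.new(left.value * right.value)  # both operands constant: fold
    else
      Mul.new(left, right)               # residual operation remains
    end
  else
    node
  end
end

folded = partial_eval(Mul.new(Lit.new(2), Lit.new(3)))
p folded  # => #<struct Lit value=6>
```

Graal applies the same principle at a larger scale: once every operand of an operation is a compile-time constant, the operation itself disappears from the compiler graph.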


  20. Partial Evaluation of foo()
    [AST: foo → Call with an ArrayLit node holding IntLit(val=1) and IntLit(val=2), plus a BlockLit node]
    foo.execute() =
    return child.execute()
    CallNode.execute() =
    return callMethod(receiver.execute(), name, args.execute())
    ArrayLitNode.execute() =
    return new RubyArray(values.execute())
    IntLitNode.execute() =
    return val;
    ⇓ Partial Evaluation
    Object foo() {
    return callMethod(
    new RubyArray(new int[2] { 1, 2 }),
    "map",
    new RubyProc(...));
    }
    17 / 63


  21. Partial Evaluation
    Object foo() {
    RubyArray array = new RubyArray(new int[2] { 1, 2 });
    RubyProc block = new RubyProc(this::block);
    return callMethod(array, "map", block);
    }
    RubyArray map(RubyArray array, RubyProc block) {
    RubyArray newArray = new RubyArray(new int[array.size]);
    for (int i = 0; i < array.size; i++) {
    newArray[i] = callBlock(block, array[i]);
    }
    return newArray;
    }
    Object block(Object e) {
    if (e instanceof Integer) {
    return Math.multiplyExact(e, 3);
    } else // deoptimize();
    }
    18 / 63


  22. Inlining
    Inline caches in Call nodes cache which method or block was called
    foo
    Call
    Array
    Lit
    Block
    Lit
    map
    Read
    Array
    Build
    Array
    Call
    Block
    block
    Mul
    Read
    Local
    IntLit
    19 / 63




  25. Inlining: inline map() and block()
    Object foo() {
    RubyArray array = new RubyArray(new int[2] { 1, 2 });
    RubyArray newArray = new RubyArray(new int[array.size]);
    for (int i = 0; i < array.size; i++) {
    Object e = array[i];
    if (e instanceof Integer) {
    newArray[i] = Math.multiplyExact(e, 3);
    } else // deoptimize();
    }
    return newArray;
    }
    22 / 63


  26. Escape Analysis: remove array
    Object foo() {
    int[] arrayStorage = new int[2] { 1, 2 };
    RubyArray newArray = new RubyArray(new int[2]);
    for (int i = 0; i < 2; i++) {
    Object e = arrayStorage[i];
    if (e instanceof Integer) {
    newArray[i] = Math.multiplyExact(e, 3);
    } else // deoptimize();
    }
    return newArray;
    }
    23 / 63


  27. Type propagation: arrayStorage is int[]
    Object foo() {
    int[] arrayStorage = new int[2] { 1, 2 };
    RubyArray newArray = new RubyArray(new int[2]);
    for (int i = 0; i < 2; i++) {
    int e = arrayStorage[i];
    newArray[i] = Math.multiplyExact(e, 3);
    }
    return newArray;
    }
    24 / 63


  28. Loop unrolling
    Object foo() {
    int[] arrayStorage = new int[2] { 1, 2 };
    RubyArray newArray = new RubyArray(new int[2]);
    newArray[0] = Math.multiplyExact(arrayStorage[0], 3);
    newArray[1] = Math.multiplyExact(arrayStorage[1], 3);
    return newArray;
    }
    25 / 63


  29. Escape Analysis: remove arrayStorage and replace usages
    Object foo() {
    RubyArray newArray = new RubyArray(new int[2]);
    newArray[0] = Math.multiplyExact(1, 3);
    newArray[1] = Math.multiplyExact(2, 3);
    return newArray;
    }
    26 / 63


  30. Constant Folding: multiplication of constants
    Object foo() {
    RubyArray newArray = new RubyArray(new int[2]);
    newArray[0] = 3;
    newArray[1] = 6;
    return newArray;
    }
    27 / 63


  31. Escape analysis: reassociate reads/writes on same locations
    Object foo() {
    return new RubyArray(new int[2] { 3, 6 });
    }
    28 / 63


  32. TruffleRuby: Partial Evaluation + Graal Compilation
    def foo
    [1, 2].map { |e| e * 3 }
    end

    Object foo() {
    return new RubyArray(new int[2] { 3, 6 });
    }
    29 / 63


  33. TruffleRuby: Partial Evaluation + Graal Compilation
    def foo
    [1, 2].map { |e| e * 3 }
    end

    def foo
    [3, 6]
    end
    30 / 63


  34. MJIT: The Method JIT for CRuby
    The Ruby code is parsed to an AST, then transformed to bytecode
    1. When a method is called many times, MJIT generates C code
    from Ruby bytecode
    2. Call gcc or clang on the C code to create a shared library
    3. Load the shared library and call the compiled C function
    31 / 63


  35. Current MJIT from CRuby trunk: generated C code
    def foo
    [1, 2].map { |e| e * 3 }
    end
    ⇓ MJIT
    VALUE foo() {
    VALUE values[] = { 1, 2 };
    VALUE array = rb_ary_new_from_values(2, values);
    return rb_funcall_with_block(array, "map", block_function);
    }
    VALUE block_function(VALUE e) {
    if (FIXNUM_P(e) && FIXNUM_P(3)) {
    return rb_fix_mul_fix(e, 3);
    } else if (FLOAT_P(e) && FLOAT_P(3)) {
    ...
    } else goto deoptimize;
    }
    32 / 63


  36. Clang/GCC: fold FIXNUM_P(3)
    VALUE foo() {
    VALUE values[] = { 1, 2 };
    VALUE array = rb_ary_new_from_values(2, values);
    return rb_funcall_with_block(array, "map", block_function);
    }
    VALUE block_function(VALUE e) {
    if (FIXNUM_P(e)) {
    return rb_fix_mul_fix(e, 3);
    } else goto deoptimize;
    }
    33 / 63


  37. Current MJIT: Cannot inline rb_ary_map()
    rb_ary_map() is part of the Ruby binary
    MJIT knows nothing about it and cannot inline it
    foo() and the block cannot be optimized together
    VALUE foo() {
    VALUE values[] = { 1, 2 };
    VALUE array = rb_ary_new_from_values(2, values);
    return rb_funcall_with_block(array, "map", block_function);
    }
    VALUE block_function(VALUE e) {
    if (FIXNUM_P(e)) {
    return rb_fix_mul_fix(e, 3);
    } else goto deoptimize;
    }
    34 / 63


  38. Ruby Performance Summary
    The performance of Ruby can be significantly improved, as
    TruffleRuby shows
    No need to rewrite applications in other languages for speed
    The JIT compiler needs access to the core library to be able to
    inline through it
    The JIT compiler needs to understand Ruby constructs
    For example, C has no concept of Ruby object allocations and
    cannot easily optimize them (they are just writes to the heap)
    35 / 63


  39. Agenda
    Parallel and Thread-Safe Ruby
    36 / 63






  44. The Problem
    Dynamic language implementations
    have poor support for parallelism
    Global Lock CRuby, CPython: cannot execute Ruby code in
    parallel in a single process. Multiple processes waste
    memory and have slow communication
    Unsafe JRuby, Rubinius: concurrent Array#<< raises
    exceptions!
    Share Nothing JavaScript, Erlang, Guilds: cannot pass objects by
    reference between threads, need to deep copy.
    Or, only pass deeply immutable data structures.
    38 / 63
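To make the deep-copy cost concrete, here is a plain-Ruby sketch of passing data by copy, using a Marshal round-trip as a stand-in for what a share-nothing runtime does at its boundary (an illustration, not how Guilds or Erlang are implemented):

```ruby
# Share-nothing models hand each thread its own copy of the data.
# A Marshal round-trip is the plain-Ruby equivalent of that deep copy.
data = { "users" => Array.new(1_000) { |i| { "id" => i } } }
copy = Marshal.load(Marshal.dump(data))

copy["users"][0]["id"] = -1
data["users"][0]["id"]  # => 0: the original is untouched, at the cost
                        #    of serializing and rebuilding the whole structure
```

Safety comes for free, but every transfer pays for allocating and populating a second copy of the entire structure.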


  45. Guilds
    Stronger memory model, almost no shared memory and no
    low-level data races
    But objects & collections cannot be shared between multiple
    guilds, they have to be deep copied or transfer ownership
    Unclear about scaling due to copy overhead. Shared mutable
    data (in their own Guild) can only be accessed sequentially.
    Existing libraries using Ruby Threads need some rewriting to
    use Guilds; it is a different programming model
    Complementary: some problems are easier to express with
    shared memory than an actor-like model and vice versa
    39 / 63


  46. Appending concurrently
    array = []
    # Create 100 threads
    100.times.map {
    Thread.new {
    # Append 1000 integers to the array
    1000.times { |i|
    array << i
    }
    }
    }.each { |thread| thread.join }
    puts array.size
    40 / 63




  49. Appending concurrently
    CRuby, the reference implementation with a Global Lock:
    ruby append.rb
    100000
    JRuby, with concurrent threads:
    jruby append.rb
    64324
    # or
    ConcurrencyError: Detected invalid array contents due to
    unsynchronized modifications with concurrent users
    << at org/jruby/RubyArray.java:1256
    Rubinius, with concurrent threads:
    rbx append.rb
    Tuple::copy_from: index 2538 out of bounds for size 2398
    (Rubinius::ObjectBoundsExceededError)
    41 / 63


  50. A Workaround: using a Mutex
    array = [] # or use Concurrent::Array.new
    mutex = Mutex.new
    100.times.map {
    Thread.new {
    1000.times { |i|
    # Add user-level synchronization
    mutex.synchronize {
    array << i
    }
    }
    }
    }.each { |thread| thread.join }
    puts array.size
    42 / 63


  51. Problems of the Workarounds
    Easy to forget to wrap with Mutex#synchronize or to use
    Concurrent::Array.new
    If the synchronization is unnecessary, adds significant overhead
    Both workarounds prevent parallel access to the Array / Hash
    From the user's point of view, Array and Hash are thread-safe
    on CRuby; thread-unsafe implementations are incompatible
    Bugs were reported to Bundler and RubyGems because JRuby
    and Rubinius do not provide thread-safety for Array and Hash
    43 / 63
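Besides Mutex and Concurrent::Array, the standard library's Queue is thread-safe on every major implementation. A sketch (not from the slides) of the appending example rewritten with it:

```ruby
# Thread-safe appending without an explicit Mutex: Queue synchronizes
# internally, so concurrent pushes are never lost.
queue = Queue.new
100.times.map {
  Thread.new {
    1000.times { |i| queue << i }
  }
}.each(&:join)

puts queue.size  # prints 100000 on every implementation
```

Note that Queue still serializes access internally, so it shares the workaround's limitation: it is safe, but it does not allow parallel access — which is exactly the gap this talk's technique addresses.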



  53. A Hard Problem
    How to make collections thread-safe,
    have no single-threaded overhead,
    and support parallel access?
    45 / 63


  54. Similar Problem with Objects
    Objects in dynamic languages support adding and removing fields
    at runtime
    The storage for fields in objects needs to grow dynamically
    Concurrent writes might be lost
    46 / 63


  55. Lost Object Field Updates
    obj = Object.new
    t1 = Thread.new { obj.a = 1; obj.a = 2 }
    t2 = Thread.new { obj.b = "b" }
    t1.join; t2.join
    p obj.a # => 1 !!!
    47 / 63
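The slide elides the accessor definitions, so here is a runnable variant; guarding the writes with a Mutex shows the user-level fix that makes the outcome deterministic on any implementation:

```ruby
# Runnable variant of the lost-update example above.
obj = Object.new
class << obj
  attr_accessor :a, :b  # the slide implies these accessors exist
end

lock = Mutex.new
t1 = Thread.new { lock.synchronize { obj.a = 1; obj.a = 2 } }
t2 = Thread.new { lock.synchronize { obj.b = "b" } }
t1.join; t2.join

p obj.a  # => 2, never the lost-update result 1
```

The point of the following slides is that the VM can provide this guarantee automatically, without the user adding a lock.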


  56. Idea: Distinguishing Local and Shared Objects
    Idea: Only synchronize objects and collections which need it:
    Objects reachable by only 1 thread need no synchronization
    Objects reachable by multiple threads need synchronization
    48 / 63


  57. Local and Shared Objects: Reachability
    49 / 63



  59. Write Barrier: Tracking the set of shared objects
    Writing to a shared object ⇒ share the written value, transitively
    # Share 1 Hash, 1 String and 1 Object
    shared_obj.field = { "a" => Object.new }
    Shared collections use a write barrier when adding elements
    shared_array << Object.new # Share 1 Object
    shared_hash["foo"] = "bar" # Share the key and value
    51 / 63
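A conceptual model of the write barrier in plain Ruby (illustrative only — TruffleRuby tracks sharing inside the VM, not in Ruby code, and the `SHARED` set and `share` helper are invented here):

```ruby
# Model the shared-object set and the transitive sharing a write
# barrier performs when a value is written into a shared object.
SHARED = {}.compare_by_identity  # identity set of shared objects

def share(obj)
  case obj
  when Numeric, Symbol, NilClass, TrueClass, FalseClass
    return  # immediate values need no tracking
  end
  return if SHARED.key?(obj)  # already shared: stop the traversal
  SHARED[obj] = true
  case obj
  when Array then obj.each { |e| share(e) }
  when Hash  then obj.each { |k, v| share(k); share(v) }
  else
    obj.instance_variables.each { |n| share(obj.instance_variable_get(n)) }
  end
end

share({ "a" => Object.new })  # shares 1 Hash, 1 String and 1 Object
SHARED.size  # => 3
```

A shared Array's `<<` then amounts to calling `share` on the appended element before storing it, exactly as the slide's examples describe.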


  60. Single-Threaded Performance for Objects
    Peak performance, normalized to Unsafe, lower is better
    [Scatter plot: per-invocation peak performance (0.0–2.5, normalized to Unsafe, lower is better) on Bounce, DeltaBlue, JSON, List, NBody, Richards and Towers, comparing Unsafe, Safe and All Shared]
    All Shared synchronizes on all object writes (similar to JRuby)
    Benchmarks from Cross-Language Compiler Benchmarking: Are We Fast Yet?
    S. Marr, B. Daloze, H. Mössenböck, 2016. 52 / 63


  61. Single-Threaded Performance for Collections
    Peak performance, normalized to TruffleRuby, lower is better
    [Bar chart: peak performance normalized to TruffleRuby (0.9–1.4, lower is better) on Bounce, List, Mandelbrot, NBody, Permute, Queens, Sieve, Storage, Towers, DeltaBlue, Json and Richards, comparing TruffleRuby with and without Concurrent Collections]
    No difference because these benchmarks do not use shared collections.
    53 / 63


  62. Array Storage Strategies
    [Strategy transition diagram: empty → int[] (store int) → long[] (store long) → Object[] (store Object); empty → double[] (store double) → Object[] (store Object)]
    array = [] # empty
    array << 1 # int[]
    array << "foo" # Object[]
    Storage Strategies for Collections in Dynamically Typed Languages
    C.F. Bolz, L. Diekmann & L. Tratt, OOPSLA 2013.
    54 / 63


  63. Goals for Shared Arrays
    Goals:
    All Array operations supported and thread-safe
    Preserve the compact representation of storage strategies
    Enable parallel reads and writes to different parts of the Array,
    as they are frequent in many parallel workloads
    55 / 63


  64. Concurrent Array Strategies
    [Diagram: on sharing, each local storage strategy (empty, int[], long[], double[], Object[]) transitions to a matching SharedFixedStorage concurrent strategy; operations that change the internal storage (<<, delete, etc.) transition it to SharedDynamicStorage]
    56 / 63


  65. Scalability of Array Reads and Writes
    [Line charts: billion array accesses per second (0–30) vs. threads (1–44), for 100% reads, 90% reads / 10% writes and 50% reads / 50% writes, comparing SharedFixedStorage, VolatileFixedStorage, LightweightLayoutLock, LayoutLock, StampedLock, ReentrantLock and a Local baseline]
    57 / 63


  66. NASA Parallel Benchmarks
    [Line charts: scalability relative to 1-thread performance (0–15) vs. threads (1–16) on the NASA Parallel Benchmarks BT-W, CG-B, EP-B, FT-A, IS-C, LU-W, MG-A and SP-W, comparing TruffleRuby with Concurrent Strategies, Java and Fortran]
    58 / 63


  67. TruffleRuby Conclusion
    TruffleRuby runs unmodified Ruby code faster, on par with the
    best dynamic language implementations, such as V8 for JavaScript
    TruffleRuby can run Ruby code in parallel safely: “GIL bugfix”
    Safe Array/Hash not in master yet (ETA: this summer)
    TruffleRuby can run existing Ruby code using Threads in
    parallel, no need to change to a different programming model
    59 / 63


  68. Conclusion
    The performance of Ruby can be significantly improved, as
    TruffleRuby shows
    No need to rewrite applications in other languages for speed
    We can have parallelism and thread-safety for objects and
    collections in Ruby, with no single-threaded overhead
    We can execute Ruby code in parallel, with the most
    important thread-safety guarantees and scale to many cores
    60 / 63


  69. Trying TruffleRuby
    Soon:
    $ ruby-install truffleruby
    $ rbenv install truffleruby
    $ rvm install truffleruby
    61 / 63


  70. GraalVM 1.0
    GraalVM 1.0RC1 was released (17 April 2018)
    http://www.graalvm.org/
    Open-Source Community Edition and Enterprise Edition
    TruffleRuby, JavaScript and Node.js, R, Python and
    LLVM bitcode (C, C++, Rust) in a single VM!
    All these languages can interoperate easily and efficiently
    62 / 63


  71. Parallel and Thread-Safe Ruby
    at High-Speed with TruffleRuby
    Benoit Daloze
    @eregontp
