Parallel and Thread-Safe Ruby at High-Speed with TruffleRuby - vienna.rb edition

These are the slides of my talk at vienna.rb:
https://www.meetup.com/en-AU/vienna-rb/events/253616653/

How does TruffleRuby achieve great performance?
How does it understand Ruby code?
The first part of this talk explains how TruffleRuby compiles Ruby code.

Array and Hash are used in every Ruby program. Yet current implementations either prevent using them in parallel (the global interpreter lock in MRI) or lack thread-safety guarantees (JRuby and Rubinius raise an exception on concurrent Array appends).

This talk shows a technique to make Array and Hash thread-safe while enabling parallel access, with no penalty on single-threaded performance. In short, we keep the most important thread-safety guarantees of the global lock while allowing Ruby to scale up to tens of cores!

Benoit Daloze

August 23, 2018

Transcript

  1. Who am I? Benoit Daloze
     Twitter: @eregontp, GitHub: @eregon
     PhD student at Johannes Kepler University, Linz, Austria
     Research on concurrency with TruffleRuby
     Worked on TruffleRuby for 3+ years
     Maintainer of ruby/spec
     CRuby (MRI) committer
     2 / 62
  2. TruffleRuby
     A high-performance implementation of Ruby by Oracle Labs
     Uses the Graal Just-In-Time compiler
     Targets full compatibility with CRuby, including C extensions
     https://github.com/oracle/truffleruby
     4 / 62
  3. Two Modes to Run TruffleRuby
     On the Java Virtual Machine (JVM):
       Can interoperate with Java
       Great peak performance
     On SubstrateVM, which AOT-compiles TruffleRuby & Graal to produce a native executable (default mode):
       Fast startup, even faster than MRI 2.5.1! (25ms vs 44ms)
       Fast warmup (Graal & TruffleRuby interpreter precompiled)
       Lower footprint (≈ 60MB max RSS for Hello World)
       Great peak performance
     5 / 62
  4. Ruby 3x3
     Project goal: CRuby 3.0 should be 3x faster than CRuby 2.0
     ⇒ with a just-in-time (JIT) compiler: MJIT
     6 / 62
  5. Ruby 3x3
     Project goal: CRuby 3.0 should be 3x faster than CRuby 2.0
     ⇒ with a just-in-time (JIT) compiler: MJIT
     Do we need to wait for CRuby 3? (≈ 2020)
     6 / 62
  6. Ruby 3x3
     Project goal: CRuby 3.0 should be 3x faster than CRuby 2.0
     ⇒ with a just-in-time (JIT) compiler: MJIT
     Do we need to wait for CRuby 3? (≈ 2020)
     Can Ruby be faster than 3x CRuby 2.0?
     6 / 62
  7. OptCarrot: Demonstration
     The main CPU benchmark for Ruby 3x3
     A Nintendo Entertainment System emulator written in Ruby
     Created by @mame (Yusuke Endoh)
     7 / 62
  8. OptCarrot: Demonstration
     The main CPU benchmark for Ruby 3x3
     A Nintendo Entertainment System emulator written in Ruby
     Created by @mame (Yusuke Endoh)
     Demo!
     7 / 62
  9. Classic Benchmarks
     [Bar chart: speedup relative to CRuby on richards, pidigits, fannkuch, red-black, neural-net, matrix-multiply, spectral-norm, mandelbrot, deltablue, n-body and binary-trees, comparing CRuby 2.3, JRuby 9.1.12 and TruffleRuby]
     9 / 62
  10. MJIT Micro-Benchmarks
     Geomean of 22 micro-benchmarks, speedup over CRuby 2.0
     [Bar chart comparing CRuby 2.0, RTL-MJIT, JRuby and TruffleRuby Native; the values shown include 0.31, 1, 2.48, 4.21 and 32.83]
     10 / 62
  11. Other areas where TruffleRuby is very fast
     9.4x faster than MRI 2.3 on rendering a small ERB template
     Thanks to Strings being represented as Ropes
     See Kevin Menard's talk at RubyKaigi 2016, "A Tale of Two String Representations": https://www.youtube.com/watch?v=UQnxukip368
     eval(), Kernel#binding, Proc & lambdas & blocks, ...
     But no time to detail these in this talk
     11 / 62
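     A rough illustration (mine, not from the slides) of why ropes help here: with a flat byte-array String, every concatenation copies the whole buffer, while a rope can represent the result as a small node that shares the existing data.

       # Building output piece by piece, as ERB templates do:
       html = ""
       1_000.times do |i|
         html += "<li>item #{i}</li>"  # flat string: copies the whole buffer each time
                                       # rope: builds a cheap concatenation node instead
       end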
  12. Rails benchmarks? Discourse?
     We are working on running Discourse
     The database migration and the asset pipeline work with a few patches
     Many dependencies (> 100)
     Many C extensions, which currently often need patches
     Ongoing research & experiments to reduce the patches needed
     We support openssl, psych, zlib, syslog, puma, sqlite3, unf, etc.
     12 / 62
  13. How TruffleRuby achieves great performance
     Partial Evaluation
     The Graal JIT compiler
     Implementation: core primitives (e.g., Integer#+) are written in Java; the rest of the core library is written in Ruby
     13 / 62
  14. From Ruby code to Abstract Syntax Tree
     def foo
       [1, 2].map { |e| e * 3 }
     end
     ⇓ Parse
     [AST diagram: foo calls map on the array literal [1, 2] with a block; the block body multiplies e by 3]
     14 / 62
  15. Truffle AST: a tree of Node objects with execute()
     def foo
       [1, 2].map { |e| e * 3 }
     end
     ⇓ Parse
     [AST diagram: foo's body is a Call node (map) with ArrayLit and BlockLit children; the block's body is a Mul node with ReadLocal and IntLit children]
     15 / 62
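     To make the "tree of Node objects with execute()" idea concrete, here is a minimal interpreter sketch in plain Ruby (illustration only; TruffleRuby's real nodes are Java classes):

       class IntLitNode
         def initialize(value)
           @value = value
         end
         def execute(frame)
           @value
         end
       end

       class ReadLocalNode
         def initialize(name)
           @name = name
         end
         def execute(frame)
           frame[@name]
         end
       end

       class MulNode
         def initialize(left, right)
           @left, @right = left, right
         end
         def execute(frame)
           @left.execute(frame) * @right.execute(frame)
         end
       end

       # The body of the block { |e| e * 3 } as a tree of nodes:
       block_body = MulNode.new(ReadLocalNode.new(:e), IntLitNode.new(3))
       block_body.execute({ e: 2 })  # => 6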
  16. Partial Evaluation
     PartialEvaluation(Truffle AST of a Ruby method) = a CompilerGraph representing how to execute the Ruby method
     Start from the root Node's execute() method and inline every Java method, while performing constant folding
     16 / 62
  17. Partial Evaluation of foo()
     [AST diagram: foo → Call (map) with ArrayLit (IntLit val=1, IntLit val=2) and BlockLit children]
     foo.execute()          = return child.execute()
     CallNode.execute()     = return callMethod(receiver.execute(), name, args.execute())
     ArrayLitNode.execute() = return new RubyArray(values.execute())
     IntLitNode.execute()   = return val;
     ⇓ Partial Evaluation
     Object foo() {
       return callMethod(
         new RubyArray(new int[] { 1, 2 }),
         "map",
         new RubyProc(...));
     }
     17 / 62
  18. Partial Evaluation
     Object foo() {
       RubyArray array = new RubyArray(new int[] { 1, 2 });
       RubyProc block = new RubyProc(this::block);
       return callMethod(array, "map", block);
     }
     RubyArray map(RubyArray array, RubyProc block) {
       RubyArray newArray = new RubyArray(new int[array.size]);
       for (int i = 0; i < array.size; i++) {
         newArray[i] = callBlock(block, array[i]);
       }
       return newArray;
     }
     Object block(Object e) {
       if (e instanceof Integer) {
         return Math.multiplyExact(e, 3);
       } else {
         // deoptimize();
       }
     }
     18 / 62
  19. Inlining
     Inline caches in Call nodes cache which method or block was called
     [Same AST diagram: Call (map) with ArrayLit and BlockLit children; the block's body is Mul(ReadLocal, IntLit)]
     19 / 62
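     A hand-rolled sketch of what an inline cache does at a call site (plain Ruby, for illustration; in TruffleRuby the cache lives inside the Call node itself):

       class CallSite
         def initialize(name)
           @name = name
           @cached_class = nil
           @cached_method = nil
         end

         def call(receiver, *args)
           if receiver.class.equal?(@cached_class)
             # Fast path: reuse the cached lookup (the compiler can then inline it)
             @cached_method.bind(receiver).call(*args)
           else
             # Slow path: look the method up and cache it for next time
             @cached_class = receiver.class
             @cached_method = @cached_class.instance_method(@name)
             @cached_method.bind(receiver).call(*args)
           end
         end
       end

       site = CallSite.new(:size)
       site.call([1, 2, 3])  # looks up Array#size and caches it
       site.call([4, 5])     # cache hit: same receiver class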
  20. Inlining
     [Same AST diagram as the previous slide, shown again as inlining proceeds]
     20 / 62
  21. Inlining
     [Same AST diagram as the previous slide, shown again as inlining proceeds]
     21 / 62
  22. Inlining: inline map() and block()
     Object foo() {
       RubyArray array = new RubyArray(new int[] { 1, 2 });
       RubyArray newArray = new RubyArray(new int[array.size]);
       for (int i = 0; i < array.size; i++) {
         Object e = array[i];
         if (e instanceof Integer) {
           newArray[i] = Math.multiplyExact(e, 3);
         } else {
           // deoptimize();
         }
       }
       return newArray;
     }
     22 / 62
  23. Escape Analysis: remove array
     Object foo() {
       int[] arrayStorage = new int[] { 1, 2 };
       RubyArray newArray = new RubyArray(new int[2]);
       for (int i = 0; i < 2; i++) {
         Object e = arrayStorage[i];
         if (e instanceof Integer) {
           newArray[i] = Math.multiplyExact(e, 3);
         } else {
           // deoptimize();
         }
       }
       return newArray;
     }
     23 / 62
  24. Type propagation: arrayStorage is int[]
     Object foo() {
       int[] arrayStorage = new int[] { 1, 2 };
       RubyArray newArray = new RubyArray(new int[2]);
       for (int i = 0; i < 2; i++) {
         int e = arrayStorage[i];
         newArray[i] = Math.multiplyExact(e, 3);
       }
       return newArray;
     }
     24 / 62
  25. Loop unrolling
     Object foo() {
       int[] arrayStorage = new int[] { 1, 2 };
       RubyArray newArray = new RubyArray(new int[2]);
       newArray[0] = Math.multiplyExact(arrayStorage[0], 3);
       newArray[1] = Math.multiplyExact(arrayStorage[1], 3);
       return newArray;
     }
     25 / 62
  26. Escape Analysis: remove arrayStorage and replace usages
     Object foo() {
       RubyArray newArray = new RubyArray(new int[2]);
       newArray[0] = Math.multiplyExact(1, 3);
       newArray[1] = Math.multiplyExact(2, 3);
       return newArray;
     }
     26 / 62
  27. Constant Folding: multiplication of constants
     Object foo() {
       RubyArray newArray = new RubyArray(new int[2]);
       newArray[0] = 3;
       newArray[1] = 6;
       return newArray;
     }
     27 / 62
  28. Escape analysis: reassociate reads/writes on same locations
     Object foo() {
       return new RubyArray(new int[] { 3, 6 });
     }
     28 / 62
  29. TruffleRuby: Partial Evaluation + Graal Compilation
     def foo
       [1, 2].map { |e| e * 3 }
     end
     ⇓
     Object foo() {
       return new RubyArray(new int[] { 3, 6 });
     }
     29 / 62
  30. TruffleRuby: Partial Evaluation + Graal Compilation
     def foo
       [1, 2].map { |e| e * 3 }
     end
     ⇓
     def foo
       [3, 6]
     end
     30 / 62
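     One way to observe this kind of optimization yourself is a micro-benchmark over the same method, for example with the benchmark-ips gem (my example, not from the slides), run once with CRuby and once with TruffleRuby:

       require 'benchmark/ips'

       def foo
         [1, 2].map { |e| e * 3 }
       end

       Benchmark.ips do |x|
         x.report("foo") { foo }
       end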
  31. MJIT: The Method JIT for CRuby
     The Ruby code is parsed to an AST, then transformed to bytecode
     1. When a method is called many times, MJIT generates C code from the Ruby bytecode
     2. Call gcc or clang on the C code to create a shared library
     3. Load the shared library and call the compiled C function
     31 / 62
  32. Current MJIT from CRuby trunk: generated C code
     def foo
       [1, 2].map { |e| e * 3 }
     end
     ⇓ MJIT
     VALUE foo() {
       VALUE values[] = { 1, 2 };
       VALUE array = rb_ary_new_from_values(2, values);
       return rb_funcall_with_block(array, "map", block_function);
     }
     VALUE block_function(VALUE e) {
       if (FIXNUM_P(e) && FIXNUM_P(3)) {
         return rb_fix_mul_fix(e, 3);
       } else if (FLOAT_P(e) && FLOAT_P(3)) {
         ...
       } else goto deoptimize;
     }
     32 / 62
  33. Clang/GCC: fold FIXNUM_P(3)
     VALUE foo() {
       VALUE values[] = { 1, 2 };
       VALUE array = rb_ary_new_from_values(2, values);
       return rb_funcall_with_block(array, "map", block_function);
     }
     VALUE block_function(VALUE e) {
       if (FIXNUM_P(e)) {
         return rb_fix_mul_fix(e, 3);
       } else goto deoptimize;
     }
     33 / 62
  34. Current MJIT: Cannot inline rb_ary_map()
     rb_ary_map() is part of the Ruby binary
     MJIT knows nothing about it and cannot inline it
     foo() and the block cannot be optimized together
     VALUE foo() {
       VALUE values[] = { 1, 2 };
       VALUE array = rb_ary_new_from_values(2, values);
       return rb_funcall_with_block(array, "map", block_function);
     }
     VALUE block_function(VALUE e) {
       if (FIXNUM_P(e)) {
         return rb_fix_mul_fix(e, 3);
       } else goto deoptimize;
     }
     34 / 62
  35. Ruby Performance Summary
     The performance of Ruby can be significantly improved, as TruffleRuby shows
     No need to rewrite applications in other languages for speed
     The JIT compiler needs access to the core library to be able to inline through it
     The JIT compiler needs to understand Ruby constructs
     For example, C has no concept of Ruby object allocation, so the compiler cannot easily understand it (it is just writes to the heap)
     35 / 62
  36. The Problem
     Dynamic language implementations have poor support for parallelism
     Global Lock (CRuby, CPython): cannot execute Ruby code in parallel in a single process. Using multiple processes wastes memory and slows down communication.
     38 / 62
  37. The Problem
     Dynamic language implementations have poor support for parallelism
     Global Lock (CRuby, CPython): cannot execute Ruby code in parallel in a single process. Using multiple processes wastes memory and slows down communication.
     Unsafe (JRuby, Rubinius): concurrent Array#<< raises exceptions!
     38 / 62
  38. The Problem
     Dynamic language implementations have poor support for parallelism
     Global Lock (CRuby, CPython): cannot execute Ruby code in parallel in a single process. Using multiple processes wastes memory and slows down communication.
     Unsafe (JRuby, Rubinius): concurrent Array#<< raises exceptions!
     Share Nothing (JavaScript, Erlang, Guilds): cannot pass objects by reference between threads; they must be deep-copied, or only deeply immutable data structures may be passed.
     38 / 62
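     A small experiment (mine, not from the slides) showing the effect of the Global Lock on CPU-bound threads; on CRuby the two-thread version takes about as long as the sequential one, while on an implementation with parallel threads it can run in roughly half the time:

       require 'benchmark'

       def work
         1_000_000.times { |i| i * i }
       end

       Benchmark.bm(9) do |x|
         x.report("1 thread")  { work; work }
         x.report("2 threads") { [Thread.new { work }, Thread.new { work }].each(&:join) }
       end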
  39. Guilds
     Stronger memory model, almost no shared memory and no low-level data races
     But objects & collections cannot be shared between multiple Guilds; they have to be deep-copied or have their ownership transferred
     Unclear how well it scales, due to the copy overhead; shared mutable data (in its own Guild) can only be accessed sequentially
     Existing libraries using Ruby Threads need some rewriting to use Guilds; it is a different programming model
     Complementary: some problems are easier to express with shared memory than with an actor-like model, and vice versa
     39 / 62
  40. Appending concurrently
     array = []
     # Create 100 threads
     100.times.map {
       Thread.new {
         # Append 1000 integers to the array
         1000.times { |i| array << i }
       }
     }.each { |thread| thread.join }
     puts array.size
     40 / 62
  41. Appending concurrently
     CRuby, the reference implementation, with a Global Lock:
       ruby append.rb
       100000
     JRuby, with concurrent threads:
       jruby append.rb
       64324
       # or
       ConcurrencyError: Detected invalid array contents due to unsynchronized modifications with concurrent users
         << at org/jruby/RubyArray.java:1256
     41 / 62
  42. Appending concurrently
     CRuby, the reference implementation, with a Global Lock:
       ruby append.rb
       100000
     JRuby, with concurrent threads:
       jruby append.rb
       64324
       # or
       ConcurrencyError: Detected invalid array contents due to unsynchronized modifications with concurrent users
         << at org/jruby/RubyArray.java:1256
     Rubinius, with concurrent threads:
       rbx append.rb
       Tuple::copy_from: index 2538 out of bounds for size 2398 (Rubinius::ObjectBoundsExceededError)
     41 / 62
  43. A Workaround: using a Mutex
     array = [] # or use Concurrent::Array.new
     mutex = Mutex.new
     100.times.map {
       Thread.new {
         1000.times { |i|
           # Add user-level synchronization
           mutex.synchronize { array << i }
         }
       }
     }.each { |thread| thread.join }
     puts array.size
     42 / 62
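     The other workaround mentioned on the slide, using Concurrent::Array from the concurrent-ruby gem instead of an explicit Mutex:

       require 'concurrent'

       array = Concurrent::Array.new  # a thread-safe Array
       100.times.map {
         Thread.new {
           1000.times { |i| array << i }
         }
       }.each(&:join)
       puts array.size  # => 100000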
  44. Problems of the Workarounds
     Easy to forget to wrap with Mutex#synchronize or to use Concurrent::Array.new
     If the synchronization is unnecessary, it adds significant overhead
     Both workarounds prevent parallel access to the Array / Hash
     From the user's point of view, Array and Hash are thread-safe on CRuby; thread-unsafe implementations are incompatible
     Bugs were reported to Bundler and RubyGems because JRuby and Rubinius do not provide thread-safety for Array and Hash
     43 / 62
  45. A Hard Problem
     How to make collections thread-safe, have no single-threaded overhead, and support parallel access?
     45 / 62
  46. Similar Problem with Objects
     Objects in dynamic languages support adding and removing fields at runtime
     The storage for fields in objects needs to grow dynamically
     Concurrent writes might be lost
     46 / 62
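     For example, instance variables can be added to a Ruby object at any time, so its field storage has to grow:

       obj = Object.new
       obj.instance_variable_set(:@a, 1)  # storage for one field
       obj.instance_variable_set(:@b, 2)  # storage must grow to hold a second field;
                                          # a concurrent write racing with this growth can be lost
       p obj.instance_variables           # => [:@a, :@b]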
  47. Lost Object Field Updates
     obj = Object.new
     # (assuming attr_accessor :a and :b are defined for obj)
     t1 = Thread.new { obj.a = 1; obj.a = 2 }
     t2 = Thread.new { obj.b = "b" }
     t1.join; t2.join
     p obj.a # => 1 !!!
     47 / 62
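     One way this race can play out (a schematic, not the exact VM steps): growing the field storage copies it, and a write to the old storage is then discarded.

       # t1: writes @a = 1 into the current storage
       # t2: allocates a bigger storage, copies the existing fields, adds @b
       # t1: writes @a = 2 into the *old* storage
       # t2: installs the new storage on the object, discarding t1's second write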
  48. Idea: Distinguishing Local and Shared Objects
     Idea: only synchronize objects and collections which need it:
     Objects reachable by only 1 thread need no synchronization
     Objects reachable by multiple threads need synchronization
     48 / 62
  49. Write Barrier: Tracking the set of shared objects
     Write to a shared object ⇒ share the value, transitively
       # Share 1 Hash, 1 String and 1 Object
       shared_obj.field = { "a" => Object.new }
     Shared collections use a write barrier when adding elements
       shared_array << Object.new    # Share 1 Object
       shared_hash["foo"] = "bar"    # Share the key and value
     51 / 62
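     A toy model in plain Ruby of this write barrier (illustration only; in TruffleRuby the sharing logic is part of the VM, not Ruby code):

       SHARED = {}.compare_by_identity  # identity set of objects known to be shared

       def share(obj)
         return if SHARED.key?(obj)
         SHARED[obj] = true
         # Transitively share everything reachable from obj
         obj.instance_variables.each { |iv| share(obj.instance_variable_get(iv)) }
         obj.each { |e| share(e) }              if obj.is_a?(Array)
         obj.each { |k, v| share(k); share(v) } if obj.is_a?(Hash)
       end

       # Write barrier: writing a value into a shared object shares the value too
       def write_to_shared(obj, ivar, value)
         share(value) if SHARED.key?(obj)
         obj.instance_variable_set(ivar, value)
       end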
  50. Single-Threaded Performance for Objects
     Peak performance, normalized to Unsafe, lower is better
     [Plots for Bounce, DeltaBlue, JSON, List, NBody, Richards and Towers, comparing Unsafe, Safe and All Shared]
     All Shared synchronizes on all object writes (similar to JRuby)
     Benchmarks from "Cross-Language Compiler Benchmarking: Are We Fast Yet?", S. Marr, B. Daloze, H. Mössenböck, 2016
     52 / 62
  51. Single-Threaded Performance for Collections
     Peak performance, normalized to TruffleRuby, lower is better
     [Plot over Bounce, List, Mandelbrot, NBody, Permute, Queens, Sieve, Storage, Towers, DeltaBlue, Json and Richards, comparing TruffleRuby and TruffleRuby with Concurrent Collections]
     No difference, because these benchmarks do not use shared collections.
     53 / 62
  52. Array Storage Strategies
     [Diagram: storage strategies empty, int[], long[], double[] and Object[], with transitions on storing an int, a double, a long or an Object]
     array = []       # empty
     array << 1       # int[]
     array << "foo"   # Object[]
     "Storage Strategies for Collections in Dynamically Typed Languages", C.F. Bolz, L. Diekmann & L. Tratt, OOPSLA 2013
     54 / 62
  53. Goals for Shared Arrays
     Goals:
     All Array operations supported and thread-safe
     Preserve the compact representation of storage strategies
     Enable parallel reads and writes to different parts of the Array, as they are frequent in many parallel workloads
     55 / 62
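     The kind of workload the last goal refers to, sketched in plain Ruby: several threads reading and writing disjoint parts of one shared Array in parallel.

       array = Array.new(1_000_000, 0)
       n_threads = 4
       chunk = array.size / n_threads
       (0...n_threads).map { |t|
         Thread.new {
           (t * chunk...(t + 1) * chunk).each { |i| array[i] = i * 2 }
         }
       }.each(&:join)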
  54. Concurrent Array Strategies
     [Diagram: on sharing, each local storage strategy (empty, int[], long[], double[], Object[]) transitions to a corresponding SharedFixedStorage; an internal storage change (<<, delete, etc.) transitions to SharedDynamicStorage]
     56 / 62
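     Reading the diagram as code (the shared_obj name is hypothetical, standing for anything reachable by other threads):

       array = [1, 2, 3]        # local, int[] storage strategy
       shared_obj.list = array  # on sharing: switches to SharedFixedStorage(int[])
       array[1] = 42            # in-bounds reads/writes stay on the fixed storage
       array << 4               # changes the storage layout, so the array switches to
                                # SharedDynamicStorage, which synchronizes layout changes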
  55. Scalability of Array Reads and Writes
     [Plots of billions of array accesses per second vs. number of threads (1-44), for 100% reads, 90% reads / 10% writes and 50% reads / 50% writes, comparing SharedFixedStorage, VolatileFixedStorage, LightweightLayoutLock, LayoutLock, StampedLock, ReentrantLock and Local]
     57 / 62
  56. NASA Parallel Benchmarks
     [Plots of scalability relative to 1-thread performance vs. number of threads (1-16) on IS-C, LU-W, MG-A, SP-W, BT-W, CG-B, EP-B and FT-A, comparing TruffleRuby with Concurrent Strategies, Java and Fortran]
     58 / 62
  57. Conclusion
     The performance of Ruby can be significantly improved, as TruffleRuby shows
     No need to rewrite applications in other languages for speed
     We can have parallelism and thread-safety for objects and collections in Ruby, with no single-threaded overhead
     We can execute Ruby code in parallel, with the most important thread-safety guarantees, and scale to many cores
     59 / 62
  58. Trying TruffleRuby
     Use your favorite Ruby manager/installer:
     $ ruby-install truffleruby
     $ rbenv install truffleruby-1.0.0-rc5
     $ rvm install truffleruby
     60 / 62
  59. GraalVM 1.0
     GraalVM 1.0 RC1 was released (17 April 2018): http://www.graalvm.org/
     TruffleRuby, JavaScript and Node.js, R, Python and LLVM bitcode (C, C++, Rust) in a single VM!
     All these languages can interoperate easily and efficiently
     Open-source Community Edition and Enterprise Edition
     61 / 62