
Parallel and Thread-Safe Ruby at High-Speed with TruffleRuby Benoit Daloze @eregontp


Who am I?
Benoit Daloze
Twitter: @eregontp, GitHub: @eregon
PhD student at Johannes Kepler University, Austria
Research on concurrency with TruffleRuby
Worked on TruffleRuby for 3+ years
Maintainer of ruby/spec
CRuby (MRI) committer
2 / 63


Agenda
1. Performance of TruffleRuby
2. Parallel and Thread-Safe Ruby
3 / 63


TruffleRuby
A high-performance implementation of Ruby by Oracle Labs
Uses the Graal just-in-time compiler
Targets full compatibility with CRuby, including C extensions
https://github.com/oracle/truffleruby
4 / 63


Two Modes to Run TruffleRuby
On the Java Virtual Machine (JVM):
Can interoperate with Java
Great peak performance
On SubstrateVM, which AOT-compiles TruffleRuby & Graal to produce a native executable (default mode):
Fast startup, even faster than MRI 2.5.1! (25 ms vs 44 ms)
Fast warmup (Graal & TruffleRuby interpreter precompiled)
Lower footprint
Great peak performance
5 / 63






Ruby 3x3 Project
Goal: CRuby 3.0 should be 3x faster than CRuby 2.0
⇒ with a just-in-time (JIT) compiler: MJIT
Do we need to wait for CRuby 3? (≈ 2020)
Can Ruby be faster than 3x CRuby 2.0?
6 / 63




OptCarrot: Demonstration
The main CPU benchmark for Ruby 3x3
A Nintendo Entertainment System emulator written in Ruby
Created by @mame (Yusuke Endoh)
Demo!
7 / 63


OptCarrot Warmup 8 / 63


Classic Benchmarks
[Bar chart: speedup relative to CRuby on richards, pidigits, fannkuch, red-black, neural-net, matrix-multiply, spectral-norm, mandelbrot, deltablue, n-body, binary-trees, comparing CRuby 2.3, JRuby 9.1.12, and TruffleRuby; speedups range up to roughly 30x]
9 / 63


MJIT Micro-Benchmarks
[Bar chart: geomean speedup over CRuby 2.0 across 22 micro-benchmarks. CRuby 2.0: 1x, RTL-MJIT: 4.21x, JRuby: 2.48x, TruffleRuby Native: 32.83x]
10 / 63


Other areas where TruffleRuby is very fast
9.4x faster than MRI 2.3 on rendering a small ERB template
Thanks to Strings being represented as Ropes
See Kevin Menard's talk at RubyKaigi 2016, "A Tale of Two String Representations": https://www.youtube.com/watch?v=UQnxukip368
eval(), Kernel#binding, Procs & lambdas & blocks, ...
But there is no time to cover these in detail in this talk
11 / 63
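For reference, the stdlib ERB API that this kind of benchmark exercises looks like the following (a minimal example, not the actual benchmark template):

```ruby
require 'erb'

# Render a tiny template; template rendering is mostly String work,
# which is where Rope-based strings help.
template = ERB.new("Hello, <%= name %>!")
name = "RubyKaigi"
puts template.result(binding) # => Hello, RubyKaigi!
```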


Rails benchmarks? Discourse?
We are working on running Discourse
Database migration and the asset pipeline work with a few patches
Many dependencies (> 100)
Many C extensions, which currently often need patches
Ongoing research & experiments to reduce the needed patches
We support openssl, psych, zlib, syslog, puma, sqlite3, unf, etc.
12 / 63


How TruffleRuby achieves great performance
Partial Evaluation
The Graal JIT compiler
Implementation:
core primitives (e.g., Integer#+) are written in Java
the rest of the core library is written in Ruby
13 / 63


From Ruby code to Abstract Syntax Tree

def foo
  [1, 2].map { |e| e * 3 }
end

⇓ Parse

[AST diagram: foo calls map on a built array [1, 2] with a block; the block multiplies e by 3]
14 / 63
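To see a concrete parse tree for this snippet, the stdlib Ripper parser can dump CRuby's s-expression form (a related, but not identical, representation to the AST sketched above):

```ruby
require 'ripper'
require 'pp'

# Print the s-expression CRuby's parser produces for the method body
pp Ripper.sexp("[1, 2].map { |e| e * 3 }")
```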


Truffle AST: a tree of Node objects with execute()

def foo
  [1, 2].map { |e| e * 3 }
end

⇓ Parse

[Truffle AST diagram: foo → Call(map) with ArrayLit and BlockLit children; block → Mul(ReadLocal e, IntLit 3)]
15 / 63


Partial Evaluation
PartialEvaluation(Truffle AST of a Ruby method) = a compiler graph representing how to execute the Ruby method
Start from the root Node's execute() method and inline every Java method while performing constant folding
16 / 63


Partial Evaluation of foo()

[AST: foo → Call(map), ArrayLit (IntLit val=1, IntLit val=2), BlockLit]

foo.execute() = return child.execute()
CallNode.execute() = return callMethod(receiver.execute(), name, args.execute())
ArrayLitNode.execute() = return new RubyArray(values.execute())
IntLitNode.execute() = return val;

⇓ Partial Evaluation

Object foo() {
  return callMethod(
      new RubyArray(new int[] { 1, 2 }),
      "map", new RubyProc(...));
}
17 / 63


Partial Evaluation

Object foo() {
  RubyArray array = new RubyArray(new int[] { 1, 2 });
  RubyProc block = new RubyProc(this::block);
  return callMethod(array, "map", block);
}

RubyArray map(RubyArray array, RubyProc block) {
  RubyArray newArray = new RubyArray(new int[array.size]);
  for (int i = 0; i < array.size; i++) {
    newArray[i] = callBlock(block, array[i]);
  }
  return newArray;
}

Object block(Object e) {
  if (e instanceof Integer) {
    return Math.multiplyExact(e, 3);
  } else {
    deoptimize();
  }
}
18 / 63


Inlining
Inline caches in Call nodes cache which method or block was called
[AST diagram: the call nodes link foo to map() and to the block, enabling inlining]
19 / 63






Inlining: inline map() and block()

Object foo() {
  RubyArray array = new RubyArray(new int[] { 1, 2 });
  RubyArray newArray = new RubyArray(new int[array.size]);
  for (int i = 0; i < array.size; i++) {
    Object e = array[i];
    if (e instanceof Integer) {
      newArray[i] = Math.multiplyExact(e, 3);
    } else {
      deoptimize();
    }
  }
  return newArray;
}
22 / 63


Escape Analysis: remove array

Object foo() {
  int[] arrayStorage = new int[] { 1, 2 };
  RubyArray newArray = new RubyArray(new int[2]);
  for (int i = 0; i < 2; i++) {
    Object e = arrayStorage[i];
    if (e instanceof Integer) {
      newArray[i] = Math.multiplyExact(e, 3);
    } else {
      deoptimize();
    }
  }
  return newArray;
}
23 / 63


Type propagation: arrayStorage is int[]

Object foo() {
  int[] arrayStorage = new int[] { 1, 2 };
  RubyArray newArray = new RubyArray(new int[2]);
  for (int i = 0; i < 2; i++) {
    int e = arrayStorage[i];
    newArray[i] = Math.multiplyExact(e, 3);
  }
  return newArray;
}
24 / 63


Loop unrolling

Object foo() {
  int[] arrayStorage = new int[] { 1, 2 };
  RubyArray newArray = new RubyArray(new int[2]);
  newArray[0] = Math.multiplyExact(arrayStorage[0], 3);
  newArray[1] = Math.multiplyExact(arrayStorage[1], 3);
  return newArray;
}
25 / 63


Escape Analysis: remove arrayStorage and replace usages

Object foo() {
  RubyArray newArray = new RubyArray(new int[2]);
  newArray[0] = Math.multiplyExact(1, 3);
  newArray[1] = Math.multiplyExact(2, 3);
  return newArray;
}
26 / 63


Constant Folding: multiplication of constants

Object foo() {
  RubyArray newArray = new RubyArray(new int[2]);
  newArray[0] = 3;
  newArray[1] = 6;
  return newArray;
}
27 / 63


Escape analysis: reassociate reads/writes on same locations

Object foo() {
  return new RubyArray(new int[] { 3, 6 });
}
28 / 63


TruffleRuby: Partial Evaluation + Graal Compilation

def foo
  [1, 2].map { |e| e * 3 }
end

⇓

Object foo() {
  return new RubyArray(new int[] { 3, 6 });
}
29 / 63


TruffleRuby: Partial Evaluation + Graal Compilation

def foo
  [1, 2].map { |e| e * 3 }
end

⇓

def foo
  [3, 6]
end
30 / 63


MJIT: The Method JIT for CRuby
The Ruby code is parsed to an AST, then transformed to bytecode
1. When a method is called many times, MJIT generates C code from the Ruby bytecode
2. Call gcc or clang on the C code to create a shared library
3. Load the shared library and call the compiled C function
31 / 63


Current MJIT from CRuby trunk: generated C code

def foo
  [1, 2].map { |e| e * 3 }
end

⇓ MJIT

VALUE foo() {
  VALUE values[] = { 1, 2 };
  VALUE array = rb_ary_new_from_values(2, values);
  return rb_funcall_with_block(array, "map", block_function);
}

VALUE block_function(VALUE e) {
  if (FIXNUM_P(e) && FIXNUM_P(3)) {
    return rb_fix_mul_fix(e, 3);
  } else if (FLOAT_P(e) && FLOAT_P(3)) {
    ...
  } else {
    goto deoptimize;
  }
}
32 / 63


Clang/GCC: fold FIXNUM_P(3)

VALUE foo() {
  VALUE values[] = { 1, 2 };
  VALUE array = rb_ary_new_from_values(2, values);
  return rb_funcall_with_block(array, "map", block_function);
}

VALUE block_function(VALUE e) {
  if (FIXNUM_P(e)) {
    return rb_fix_mul_fix(e, 3);
  } else {
    goto deoptimize;
  }
}
33 / 63


Current MJIT: Cannot inline rb_ary_map()
rb_ary_map() is part of the Ruby binary
MJIT knows nothing about it and cannot inline it
foo() and the block cannot be optimized together

VALUE foo() {
  VALUE values[] = { 1, 2 };
  VALUE array = rb_ary_new_from_values(2, values);
  return rb_funcall_with_block(array, "map", block_function);
}

VALUE block_function(VALUE e) {
  if (FIXNUM_P(e)) {
    return rb_fix_mul_fix(e, 3);
  } else {
    goto deoptimize;
  }
}
34 / 63


Ruby Performance Summary
The performance of Ruby can be significantly improved, as TruffleRuby shows
No need to rewrite applications in other languages for speed
The JIT compiler needs access to the core library to be able to inline through it
The JIT compiler needs to understand Ruby constructs
For example, C has no concept of Ruby object allocations, so a C compiler cannot easily understand them (they are just writes to the heap)
35 / 63


Agenda Parallel and Thread-Safe Ruby 36 / 63




The Problem
Dynamic languages have poor support for parallelism
Due to the implementations!
37 / 63






The Problem
Dynamic language implementations have poor support for parallelism
Global Lock: CRuby, CPython: cannot execute Ruby code in parallel in a single process. Multiple processes waste memory and have slow communication
Unsafe: JRuby, Rubinius: concurrent Array#<< raises exceptions!
Share Nothing: JavaScript, Erlang, Guilds: cannot pass objects by reference between threads; they need to be deep copied. Or, only pass deeply immutable data structures.
38 / 63
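The deep-copy cost of the share-nothing model can be seen with the common Marshal round-trip idiom (a sketch; actual share-nothing runtimes copy at the VM level):

```ruby
# Passing "by copy": the receiver gets a structurally independent object.
original = { "a" => [1, 2, 3] }
copy = Marshal.load(Marshal.dump(original))

copy["a"] << 4
p original["a"] # => [1, 2, 3] (unchanged: no structure is shared)
p copy["a"]     # => [1, 2, 3, 4]
```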


Guilds
Stronger memory model, almost no shared memory and no low-level data races
But objects & collections cannot be shared between multiple guilds; they have to be deep copied, or ownership must be transferred
Unclear about scaling due to the copy overhead. Shared mutable data (in its own Guild) can only be accessed sequentially
Existing libraries using Ruby Threads need some rewriting to use Guilds; it is a different programming model
Complementary: some problems are easier to express with shared memory than with an actor-like model, and vice versa
39 / 63


Appending concurrently

array = []
# Create 100 threads
100.times.map {
  Thread.new {
    # Append 1000 integers to the array
    1000.times { |i| array << i }
  }
}.each { |thread| thread.join }

puts array.size
40 / 63






Appending concurrently
CRuby, the reference implementation, with a Global Lock:
ruby append.rb
100000
JRuby, with concurrent threads:
jruby append.rb
64324
# or
ConcurrencyError: Detected invalid array contents due to unsynchronized modifications with concurrent users
<< at org/jruby/RubyArray.java:1256
Rubinius, with concurrent threads:
rbx append.rb
Tuple::copy_from: index 2538 out of bounds for size 2398 (Rubinius::ObjectBoundsExceededError)
41 / 63


A Workaround: using a Mutex

array = [] # or use Concurrent::Array.new
mutex = Mutex.new
100.times.map {
  Thread.new {
    1000.times { |i|
      # Add user-level synchronization
      mutex.synchronize { array << i }
    }
  }
}.each { |thread| thread.join }

puts array.size
42 / 63
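Another workaround for this particular pattern: Ruby's built-in Queue is thread-safe, so the per-append Mutex can be avoided (access is still serialized internally, though):

```ruby
# Queue (Thread::Queue) synchronizes push/pop internally
queue = Queue.new
100.times.map {
  Thread.new { 1000.times { |i| queue << i } }
}.each(&:join)

puts queue.size # => 100000
```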


Problems of the Workarounds
Easy to forget to wrap with Mutex#synchronize or to use Concurrent::Array.new
If the synchronization is unnecessary, it adds significant overhead
Both workarounds prevent parallel access to the Array / Hash
From the user's point of view, Array and Hash are thread-safe on CRuby; thread-unsafe implementations are incompatible
Bugs were reported to Bundler and RubyGems because JRuby and Rubinius do not provide thread-safety for Array and Hash
43 / 63


A Hard Problem How to make collections thread-safe, and have no single-threaded overhead? 44 / 63


A Hard Problem How to make collections thread-safe, have no single-threaded overhead, and support parallel access? 45 / 63


Similar Problem with Objects
Objects in dynamic languages support adding and removing fields at runtime
The storage for fields in objects needs to grow dynamically
Concurrent writes might be lost
46 / 63


Lost Object Field Updates

obj = Object.new
t1 = Thread.new { obj.a = 1; obj.a = 2 }
t2 = Thread.new { obj.b = "b" }
t1.join; t2.join
p obj.a # => 1 !!!
47 / 63
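On implementations without a global lock, guarding the writes with a Mutex avoids the lost update; a minimal runnable sketch (using an attr_accessor class so the field writes work on any Ruby):

```ruby
class SharedObj
  attr_accessor :a, :b
end

obj = SharedObj.new
lock = Mutex.new

# Each field write happens under the lock, so the object's storage
# cannot grow concurrently and no update is lost.
t1 = Thread.new { lock.synchronize { obj.a = 1 }; lock.synchronize { obj.a = 2 } }
t2 = Thread.new { lock.synchronize { obj.b = "b" } }
t1.join; t2.join
p obj.a # => 2 (no lost update)
```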


Idea: Distinguishing Local and Shared Objects
Idea: only synchronize objects and collections which need it:
Objects reachable by only 1 thread need no synchronization
Objects reachable by multiple threads need synchronization
48 / 63


Local and Shared Objects: Reachability 49 / 63


Local and Shared Objects: Reachability 50 / 63


Write Barrier: Tracking the set of shared objects
Write to a shared object ⇒ share the value, transitively

# Share 1 Hash, 1 String and 1 Object
shared_obj.field = { "a" => Object.new }

Shared collections use a write barrier when adding elements:

shared_array << Object.new    # Share 1 Object
shared_hash["foo"] = "bar"    # Share the key and value
51 / 63
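The idea can be illustrated with a toy sketch in plain Ruby (TruffleRuby's real write barrier is implemented inside the VM, in Java; SHARED, share, and write_barrier are hypothetical names for this illustration):

```ruby
# Track shared objects by identity
SHARED = {}.compare_by_identity

def share(obj)
  return if SHARED.key?(obj)
  SHARED[obj] = true
  # Share everything reachable: instance variables and collection elements
  obj.instance_variables.each { |iv| share(obj.instance_variable_get(iv)) }
  obj.each { |e| share(e) } if obj.is_a?(Array)
  obj.each { |k, v| share(k); share(v) } if obj.is_a?(Hash)
end

def write_barrier(obj, ivar, value)
  share(value) if SHARED.key?(obj) # writing into a shared object shares the value
  obj.instance_variable_set(ivar, value)
end

obj = Object.new
share(obj)
write_barrier(obj, :@field, { "a" => Object.new })
# the Hash, its String key and its Object value are now all marked shared
```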


Single-Threaded Performance for Objects
[Bar chart: peak performance normalized to Unsafe, lower is better, on Bounce, DeltaBlue, JSON, List, NBody, Richards, Towers; configurations: Unsafe, Safe, All Shared. All Shared synchronizes on all object writes (similar to JRuby)]
Benchmarks from Cross-Language Compiler Benchmarking: Are We Fast Yet? S. Marr, B. Daloze, H. Mössenböck, 2016.
52 / 63


Single-Threaded Performance for Collections
[Bar chart: peak performance normalized to TruffleRuby, lower is better, on Bounce, List, Mandelbrot, NBody, Permute, Queens, Sieve, Storage, Towers, DeltaBlue, Json, Richards; configurations: TruffleRuby, TruffleRuby with Concurrent Collections]
No difference, because these benchmarks do not use shared collections.
53 / 63


Array Storage Strategies
[Strategy transition diagram: empty transitions to int[], long[], double[] or Object[] on the first store, depending on the stored type; storing an incompatible value transitions to Object[]]

array = []       # empty
array << 1       # int[]
array << "foo"   # Object[]

Storage Strategies for Collections in Dynamically Typed Languages, C.F. Bolz, L. Diekmann & L. Tratt, OOPSLA 2013.
54 / 63


Goals for Shared Arrays
All Array operations supported and thread-safe
Preserve the compact representation of storage strategies
Enable parallel reads and writes to different parts of the Array, as they are frequent in many parallel workloads
55 / 63


Concurrent Array Strategies
[Diagram: on sharing, each local storage strategy (empty, int[], long[], double[], Object[]) transitions to a SharedFixedStorage variant; internal storage changes (<<, delete, etc.) transition to SharedDynamicStorage]
56 / 63


Scalability of Array Reads and Writes
[Line charts: billion array accesses per second vs. threads (1 to 44), for 100% reads, 90% reads / 10% writes, and 50% reads / 50% writes; strategies: SharedFixedStorage, VolatileFixedStorage, LightweightLayoutLock, LayoutLock, StampedLock, ReentrantLock, Local]
57 / 63


NASA Parallel Benchmarks
[Line charts: scalability relative to 1-thread performance, 1 to 16 threads, on IS-C, LU-W, MG-A, SP-W, BT-W, CG-B, EP-B, FT-A; comparing TruffleRuby with Concurrent Strategies, Java, and Fortran]
58 / 63


TruffleRuby Conclusion
TruffleRuby runs unmodified Ruby code faster, as fast as the best dynamic language implementations, like V8 for JavaScript
TruffleRuby can run Ruby code in parallel safely: a "GIL bugfix"
Safe Array/Hash is not in master yet (ETA: this summer)
TruffleRuby can run existing Ruby code using Threads in parallel; no need to change to a different programming model
59 / 63


Conclusion
The performance of Ruby can be significantly improved, as TruffleRuby shows
No need to rewrite applications in other languages for speed
We can have parallelism and thread-safety for objects and collections in Ruby, with no single-threaded overhead
We can execute Ruby code in parallel, with the most important thread-safety guarantees, and scale to many cores
60 / 63


Trying TruffleRuby
Soon:
$ ruby-install truffleruby
$ rbenv install truffleruby
$ rvm install truffleruby
61 / 63


GraalVM 1.0
GraalVM 1.0 RC1 was released (17 April 2018): http://www.graalvm.org/
Open-source Community Edition and Enterprise Edition
TruffleRuby, JavaScript and Node.js, R, Python and LLVM bitcode (C, C++, Rust) in a single VM!
All these languages can interoperate easily and efficiently
62 / 63


Parallel and Thread-Safe Ruby at High-Speed with TruffleRuby Benoit Daloze @eregontp