Slide 1


Concurrent Storage Strategies:
Making Collections in Dynamic Languages Thread-Safe and Efficient

Benoit Daloze, Arie Tal, Stefan Marr, Hanspeter Mössenböck, Erez Petrank

Slide 2


Introduction

We are in the multi-core era, but:
- Dynamically-typed languages have poor support for parallel execution
  (e.g.: Ruby, Python, JavaScript, ...)
- Built-in collections are either inefficient or thread-unsafe

Slide 3


Built-in collections

Implem.   Synchronization on collections
CRuby     Global Interpreter Lock  ⇒  no parallelism
CPython   Global Interpreter Lock  ⇒  no parallelism
Jython    synchronized             ⇒  slow single-threaded, no scaling
JRuby     No synchronization       ⇒  unsafe with multiple threads
Nashorn   No synchronization       ⇒  unsafe with multiple threads

Slide 4


Appending concurrently

array = []
# Create 100 threads
100.times.map {
  Thread.new {
    # Append 1000 integers to the array
    1000.times { |i| array << i }
  }
}.each { |thread| thread.join }
puts array.size
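On implementations without a GIL, making this snippet safe today requires manual locking by the programmer. A minimal sketch (not part of the slides) using Ruby's standard Mutex:

```ruby
array = []
lock = Mutex.new

# 100 threads, each appending 1000 integers under a shared lock
100.times.map {
  Thread.new {
    1000.times { |i| lock.synchronize { array << i } }
  }
}.each { |thread| thread.join }

puts array.size  # 100000
```

This always prints 100000 on any implementation, at the cost of serializing every append through one lock.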

Slide 5


Appending concurrently

MRI/CRuby, the reference implementation with a GIL:

ruby append.rb
100000

Slide 6


Appending concurrently

MRI/CRuby, the reference implementation with a GIL:

ruby append.rb
100000

JRuby, on the JVM with concurrent threads:

jruby append.rb
64324

Slide 7


Appending concurrently

MRI/CRuby, the reference implementation with a GIL:

ruby append.rb
100000

JRuby, on the JVM with concurrent threads:

jruby append.rb
64324

# If you are not lucky
ConcurrencyError: Detected invalid array contents due to unsynchronized modifications with concurrent users
  << at org/jruby/RubyArray.java:1256
  block at append.rb:8
zsh: exit 1

Slide 8


Appending concurrently

TruffleRuby, on top of GraalVM with concurrent threads:

truffleruby append.rb
77148

# If you are not lucky
append.rb:8:in '<<': 1338 (RubyTruffleError)
  ArrayIndexOutOfBoundsException
  IntegerArrayMirror.set
  from append.rb:8:in 'block (2 levels) in '
zsh: exit 1

Slide 9


Appending concurrently

TruffleRuby, with Thread-Safe Collections:

truffleruby-safe append.rb
100000

Slide 10


Ruby built-in collections

- Array (a stack, a queue, a deque, set-like operations)
- Hash (compare keys by #hash + #eql? or by identity)
- String (mutable)

That’s all!
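The two key-comparison modes of Hash mentioned above can be demonstrated with `Hash#compare_by_identity`:

```ruby
# Default: keys compared by #hash and #eql?
by_value = {}
by_value["foo"] = 1
by_value["foo".dup] = 2   # equal value => same key, overwrites

# Identity mode: keys compared by object identity
by_identity = {}.compare_by_identity
by_identity["foo"] = 1
by_identity["foo".dup] = 2  # distinct String objects => distinct keys

puts by_value.size     # 1
puts by_identity.size  # 2
```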

Slide 11


Goals

- Dynamic languages have few but very versatile built-in collections
- Enables a programming style that does not require so many upfront decisions
  (e.g.: choosing a collection implementation)
- Use them for both single-threaded and multi-threaded workloads
- The collections should be efficient
- The collections should scale when used concurrently

Slide 12


Outline

- Tracking Sharing
- Concurrent Arrays
- Performance

Slide 13


Tracking Sharing

Slide 14


Local and Shared Objects

- Only synchronize on objects which are accessed concurrently
- Expensive to track exactly, so we make an over-approximation:
  track all objects which can be accessed concurrently, based on reachability

Slide 15


Local and Shared Objects

Efficient and Thread-Safe Objects for Dynamically-Typed Languages.
B. Daloze, S. Marr, D. Bonetta, H. Mössenböck, OOPSLA’16.


Slide 17


Extending Sharing to Collections

- Collections are objects, they can track sharing the same way
- Shared collections use a write barrier when adding an element to the collection:

  shared_array[3] = Object.new
  shared_hash["foo"] = "bar"

- Collections can change their representation when shared
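Conceptually, the write barrier marks the stored value and everything reachable from it as shared before it becomes visible through a shared collection. A toy Ruby model of that idea (all names here are hypothetical illustrations, not TruffleRuby's API):

```ruby
SHARED = {}.compare_by_identity  # stand-in for a per-object "shared" flag

# Mark obj and everything reachable from it as shared
# (toy reachability: only Arrays are traversed here).
def share(obj)
  return if SHARED.key?(obj)
  SHARED[obj] = true
  obj.each { |child| share(child) } if obj.is_a?(Array)
end

# Barrier: share the value before storing it into a shared collection
def write_barrier_store(shared_array, index, value)
  share(value)
  shared_array[index] = value
end

root = []
SHARED[root] = true                     # root is already shared
write_barrier_store(root, 0, [[1, 2]])  # nested arrays become shared too
```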

Slide 18


Impact on Single-Threaded Performance

[Bar chart: peak performance, normalized to TruffleRuby, lower is better;
benchmarks: Bounce, List, Mandelbrot, NBody, Permute, Queens, Sieve, Storage,
Towers, DeltaBlue, Json, Richards; comparing TruffleRuby and TruffleRuby with
Concurrent Collections]

No difference because these benchmarks do not use shared collections.

Benchmarks from Cross-Language Compiler Benchmarking: Are We Fast Yet?
S. Marr, B. Daloze, H. Mössenböck, DLS’16.

Slide 19


Concurrent Arrays

Slide 20


Array storage strategies

[State diagram: transitions between the empty, int[], long[], double[] and
Object[] storages, driven by the type of the stored value (store int,
store long, store double, store Object)]

array = []       # empty
array << 1       # int[]
array << "foo"   # Object[]
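The strategy selection can be illustrated with a small Ruby sketch (a hypothetical helper, loosely following the transitions above): a homogeneous integer or float array gets a compact primitive storage, anything else falls back to Object[].

```ruby
# Pick a storage strategy from the element types, as storage strategies
# would: small Integers fit int[], big ones long[], Floats double[],
# and mixed contents fall back to Object[].
def storage_strategy(elements)
  return :empty if elements.empty?
  if elements.all? { |e| e.is_a?(Integer) }
    elements.all? { |e| e.bit_length < 32 } ? :int : :long
  elsif elements.all? { |e| e.is_a?(Float) }
    :double
  else
    :object
  end
end

p storage_strategy([])            # :empty
p storage_strategy([1, 2, 3])     # :int
p storage_strategy([2**40])       # :long
p storage_strategy([1.5, 2.5])    # :double
p storage_strategy([1, "foo"])    # :object
```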

Slide 21


A Closer Look at Array

class RubyArray {
    // null, int[], long[], double[] or Object[]
    Object storage;
    // Invariant: size <= storage.length
    int size;
}

Slide 22


Concurrent Arrays

Goals:
- Each Array operation should appear atomic
- Keep the compact representation of storage strategies
- Scale concurrent reads and writes, as they are frequent in many usages

Slide 23


Concurrent Array Strategies

[Diagram: on sharing, each storage strategy (empty, int[], long[], double[],
Object[]) transitions to a SharedFixedStorage of the same type; internal
storage changes (<<, delete, etc.) migrate to SharedDynamicStorage]

Slide 24


SharedFixedStorage

- Assumes the storage (e.g. int[16]) does not need to change
  ⇒ the Array size and the types of the elements fit the storage
- If so, the Array can be accessed without any synchronization,
  in parallel and without any overhead (except the write barrier)
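A toy model of the fast-path check (hypothetical names; the real implementation works on the underlying Java storage arrays):

```ruby
# SharedFixedStorage sketch: a write succeeds without synchronization
# only if the index is within the current size and the value fits the
# storage type; otherwise the array must migrate to another strategy.
class FixedIntStorage
  def initialize(values)
    @storage = values  # stands in for an int[] of fixed length
    @size = values.size
  end

  def write(index, value)
    return :migrate unless index >= 0 && index < @size  # size would change
    return :migrate unless value.is_a?(Integer)         # type would change
    @storage[index] = value                             # fast path, no lock
    :ok
  end
end

a = FixedIntStorage.new([1, 2, 3])
p a.write(1, 42)      # :ok
p a.write(1, "foo")   # :migrate
p a.write(5, 7)       # :migrate
```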

Slide 25


Migrating to SharedDynamicStorage

What if we need to change the storage?

$array = [1, 2, 3]  # SharedFixedStorage

# All of these migrate to SharedDynamicStorage
$array[1] = Object.new
$array << 4
$array.delete_at(1)

We use a Guest-Language Safepoint to migrate to SharedDynamicStorage.
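In spirit, the migration looks like the sketch below. All names are hypothetical, and a Mutex stands in for the guest-language safepoint, which in the real system pauses all threads so the storage swap is atomic without any cost on the fast path:

```ruby
class ConcurrentArray
  attr_reader :strategy

  def initialize(elements)
    @elements = elements
    @strategy = :shared_fixed_storage
    @migration_lock = Mutex.new  # stand-in for a guest-language safepoint
  end

  def [](i)
    @elements[i]  # fast path: no synchronization under SharedFixedStorage
  end

  def <<(value)
    # Appending changes the size, so it cannot stay on SharedFixedStorage
    @migration_lock.synchronize do
      @strategy = :shared_dynamic_storage
      @elements << value
    end
    self
  end
end

arr = ConcurrentArray.new([1, 2, 3])
arr << 4
p arr.strategy  # :shared_dynamic_storage
```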

Slide 26


SharedDynamicStorage

- SharedDynamicStorage uses a lock to synchronize operations
- To keep scalability when writing to different parts of the Array,
  an exclusive lock or a read-write lock is not enough
- We use a Layout Lock: reads, writes and layout changes

Layout Lock: A Scalable Locking Paradigm for Concurrent Data Layout Modifications.
N. Cohen, A. Tal, E. Petrank, PPoPP’17.
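The point of the Layout Lock is that reads and writes do not exclude each other; only layout changes are exclusive. A toy three-mode lock conveying the idea (this is not the PPoPP'17 algorithm, which avoids this kind of shared counter on the fast path):

```ruby
class ToyLayoutLock
  def initialize
    @mutex = Mutex.new
    @cond = ConditionVariable.new
    @active = 0             # threads currently reading or writing
    @layout_change = false
  end

  # Shared mode: reads AND writes run concurrently with each other
  def with_operation
    @mutex.synchronize do
      @cond.wait(@mutex) while @layout_change
      @active += 1
    end
    begin
      yield
    ensure
      @mutex.synchronize { @active -= 1; @cond.broadcast }
    end
  end

  # Exclusive mode: a layout change waits until it runs alone
  def with_layout_change
    @mutex.synchronize do
      @cond.wait(@mutex) while @layout_change || @active > 0
      @layout_change = true
    end
    begin
      yield
    ensure
      @mutex.synchronize { @layout_change = false; @cond.broadcast }
    end
  end
end

lock = ToyLayoutLock.new
array = Array.new(4, 0)
4.times.map { |t|
  Thread.new { lock.with_operation { array[t] = t } }  # concurrent writes
}.each(&:join)
lock.with_layout_change { array << 4 }  # e.g. an append changes the layout
p array  # [0, 1, 2, 3, 4]
```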

Slide 27


Performance

Slide 28


Array Benchmarks

- All threads work on a single Array
- Each thread has its own section of the Array
- With 6 different synchronization mechanisms:
  - SharedFixedStorage: no synchronization
  - ReentrantLock, Synchronized: from Java
  - StampedLock: a read-write lock
  - LayoutLock: scalable reads, writes and layout changes
  - LightweightLayoutLock: an improvement over LayoutLock
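The partitioning described above can be sketched as follows (hypothetical sizes; each thread only performs in-bounds writes to its own disjoint section, which is exactly the case SharedFixedStorage handles without synchronization):

```ruby
THREADS_COUNT = 4
SECTION_SIZE = 1_000
array = Array.new(THREADS_COUNT * SECTION_SIZE, 0)

THREADS_COUNT.times.map { |t|
  Thread.new {
    base = t * SECTION_SIZE
    # In-bounds writes to a disjoint section: no size or type change
    SECTION_SIZE.times { |i| array[base + i] = base + i }
  }
}.each(&:join)

puts array.last  # 3999
```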

Slide 29


Scalability of Array Reads

[Line chart: throughput vs. 1-44 threads for LightweightLayoutLock,
SharedFixedStorage, LayoutLock, ReentrantLock, StampedLock, Synchronized]

Throughput in billions of accesses per second.

Slide 30


Scalability of Array with 50% reads / 50% writes

[Line chart: throughput vs. 1-44 threads for LightweightLayoutLock,
SharedFixedStorage, LayoutLock, ReentrantLock, StampedLock, Synchronized]

Throughput in billions of accesses per second.

Slide 31


Scalability of Array Appends

[Line chart: throughput vs. 1-44 threads for LightweightLayoutLock,
LayoutLock, ReentrantLock, StampedLock, Synchronized]

Appends are considered layout changes and use the exclusive lock.
Throughput in millions of appends per second.

Slide 32


PyPy’s Parallel Mandelbrot

Slide 33


PyPy’s Parallel Mandelbrot

[Line chart: scalability relative to Local, 1-44 threads,
comparing SharedFixedStorage and Local]

Parallelized by distributing 64 groups of rows between threads dynamically
using a global queue.

Slide 34


Scalability of Hash

[Line chart: throughput vs. 1-44 threads for LightweightLayoutLock,
LayoutLock, Local, ReentrantLock, StampedLock]

80% lookups, 10% puts, 10% removes over a range of 65536 keys.
Throughput in millions of operations per second.

Slide 35


Conclusion

- Standard built-in collections in dynamic languages can be thread-safe
  and yet as efficient as unsynchronized collections
- We can make Array and Hash scale up to 44 cores linearly
  with SharedFixedStorage and the Lightweight Layout Lock
- We enable parallel programming with the existing built-in collections,
  not requiring upfront decisions by the programmer
  (e.g.: choosing a collection implementation based on concurrency or usage)