Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Who reordered my code?!
Petr Chalupa, Principal Member of Technical Staff, Oracle Labs
September 08, 2016
JRuby+Truffle, Concurrent Ruby

Slide 3

Slide 3 text

Safe Harbor Statement
The following is intended to provide some insight into a line of research in Oracle Labs. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. Oracle reserves the right to alter its development plans and practices at any time, and the development, release, and timing of any features or functionality described in connection with any Oracle product or service remains at the sole discretion of Oracle. Any views expressed in this presentation are my own and do not necessarily reflect the views of Oracle.

Slide 4

Slide 4 text

Live example
• Mutual exclusion of two threads
• No locks

Dekker's algorithm:

flag1 = flag2 = false

Thread 1:
flag1 = true
flag2 ? contention : critical_section

Thread 2:
flag2 = true
flag1 ? contention : critical_section
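A minimal runnable sketch of the live example (the global variables and result names are mine, not from the slides): under sequential consistency at most one thread can reach the critical section; once reordering kicks in, both can.

$flag1 = $flag2 = false
$entered = []

t1 = Thread.new do
  $flag1 = true
  if $flag2
    :contention            # back off, the other thread got there first
  else
    $entered << :thread_1  # critical section
  end
end

t2 = Thread.new do
  $flag2 = true
  if $flag1
    :contention
  else
    $entered << :thread_2
  end
end

[t1, t2].each(&:join)
p $entered  # [:thread_1], [:thread_2] or [] are fine; [:thread_1, :thread_2] means mutual exclusion was broken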

Slide 5

Slide 5 text

Outline
1. When can you see reordering?
2. What does it do?
3. Embrace or reject?
4. How to deal with reordering?
5. Does it have a practical use?

Slide 6

Slide 6 text

Ruby's new goals

Slide 7

Slide 7 text

Performance
• CRuby 3x3 (Heroku, Appfolio)
• Ruby OMR preview – OMR, J9 (IBM)
• JRuby – invokedynamic, new IR (Red Hat)
• JRuby+Truffle – Truffle, Graal (Oracle)

Slide 8

Slide 8 text

Parallelism
• Almost every computer has more than one core
• Parallel computation has to be supported to utilize all cores
• JRuby, JRuby+Truffle and Rubinius support parallel execution
• Maybe the GIL will be removed in Ruby 3?

[Diagram: Ruby threads mapped onto OS threads. With a GIL (MRI), the Ruby interpreter and C extensions run only one Ruby thread at a time; without it, compiled Ruby and C code run in parallel on the kernel's OS threads.]
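A rough, illustrative way to observe the difference (this example is mine, not from the deck): split CPU-bound work across threads and compare against the serial time.

require 'benchmark'

work = ->(n) { x = 0; n.times { x += 1 }; x }  # CPU-bound busy loop

serial = Benchmark.realtime { 4.times { work.call(5_000_000) } }
threaded = Benchmark.realtime do
  4.times.map { Thread.new { work.call(5_000_000) } }.each(&:join)
end

puts "serial: #{serial.round(2)}s, threaded: #{threaded.round(2)}s"
# Under MRI's GIL the two times stay similar; on JRuby, JRuby+Truffle or Rubinius
# the threaded version can approach a 4x speed-up on four cores.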

Slide 9

Slide 9 text

Concurrent library
• Ideas considered for Ruby 3: actors, isolation, channels, streams, …
  – Easy-to-use high-level concurrency abstractions
• Unanswered questions:
  – How do we write fast concurrent data structures?
  – How do we write more concurrent abstractions?

Slide 10

Slide 10 text

Reordering

Slide 11

Slide 11 text

When can we see it?
• Fast Ruby implementation
• Parallel execution

Slide 12

Slide 12 text

Fast Ruby implementation
For the Ruby language to be fast, an implementation with speculatively optimizing dynamic compilation and parallel execution is needed.
• Speculative: it can speculate on propositions such as
  – the method body is stable
  – a constant's value is stable
  – type speculation
  – …

COUNT = 2

def foo(a, b)
  COUNT * (a + b)
end

foo(1, 2)

Slide 13

Slide 13 text

Fast Ruby implementation
For the Ruby language to be fast, an implementation with speculatively optimizing dynamic compilation and parallel execution is needed.
• Optimizing: it performs the same clever optimizations as, e.g., GCC
  – Inlining
  – Splitting
  – Constant folding
  – Value numbering
  – Hoisting
  – …

Slide 14

Slide 14 text

Fast Ruby implementation
For the Ruby language to be fast, an implementation with speculatively optimizing dynamic compilation and parallel execution is needed.
• Dynamic:
  – Just-in-time compilation of hot methods
  – Also deoptimizes when speculatively taken assumptions fail
• Parallel:
  – Ruby code runs in parallel

COUNT = 2

def foo(a, b)
  COUNT * (a + b)
end

COUNT = 3

Slide 15

Slide 15 text

Fast Ruby implementation
• JRuby+Truffle is such an implementation
  – Truffle: self-optimizing AST interpreter
  – Graal: compiler written in Java

Slide 16

Slide 16 text

Sources of reordering

Slide 17

Slide 17 text

Compiler reorders code
• Optimizes by transforming the code
• It is allowed to perform any optimization if the transformation cannot be observed on the same thread
  – The code has the same result
  – It assumes only one thread

Slide 18

Slide 18 text

Seemingly sequential Ruby code

def foo(a, b, c, d)
  x = a + b
  y = c + d
  x * y
end

[Diagram: the method expanded to a parallel data-flow graph in the compiler: two + nodes feeding a × node.]

The two additions can happen in either order. Why? Because they are independent operations – there are no dependencies between the two.

Slide 19

Slide 19 text

Seemingly sequential Ruby code

Generated machine code can use either order of operations:

add a b %r1          add c d %r1
add c d %r2          add a b %r2
mul %r1 %r2 %r3      mul %r1 %r2 %r3
ret %r3              ret %r3

Why? Because they are independent operations – there are no dependencies between the two.

Slide 20

Slide 20 text

Seemingly sequential Ruby code

Even if our compiler didn't reorder, the processor could do it anyway!

add a b %r1
add c d %r2
mul %r1 %r2 %r3
ret %r3

The two add instructions can execute in your processor in either order. Why? Because they are independent instructions – there are no dependencies between the two.

Slide 21

Slide 21 text

Dekker's algorithm seen by compiler

flag1 = flag2 = false

Thread 1:
flag1 = true
flag2 ? contention : critical_section

Thread 2:
flag2 = true
flag1 ? contention : critical_section

What the compiler sees for Thread 1 – a write of flag1 and a read of flag2 with no dependency between them:

flag1 = true
if flag2
  contention
else
  critical_section
end

Slide 22

Slide 22 text

Example

class Future
  def initialize; @value = nil; end
  def fulfill(v); end
  def value; end
end

Thread 1:
def fulfill(result)
  @value = result
end

Thread 2:
def value
  Thread.pass until @value
  @value
end

Transformed into:
def value
  temp = @value
  Thread.pass until temp
  @value
end

If value is called before fulfill, it will block indefinitely.

Order:
2: temp = @value           # nil
2: Thread.pass until temp  # nil
1: @value = result         # :result
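For reference, a complete runnable version of this (still unsafe) Future, with the method bodies filled in as shown above and a small driver of my own:

class Future
  def initialize
    @value = nil
  end

  def fulfill(result)
    @value = result
  end

  def value
    Thread.pass until @value  # the read the compiler is allowed to hoist out of the loop
    @value
  end
end

f = Future.new
reader = Thread.new { f.value }
f.fulfill(:result)
p reader.value  # usually :result, but with the hoisted read the reader may spin forever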

Slide 23

Slide 23 text

Cache reordering effects
• Dekker's algorithm
• Compiled without reordering
• Old processor executing in program order
  – No out-of-order execution
• Coherent cache with just a write buffer

Slide 24

Slide 24 text

Cache reordering effects

flag1 = flag2 = false

Thread 1:
flag1 = true
flag2 ? contention : critical_section

Thread 2:
flag2 = true
flag1 ? contention : critical_section

Slide 25

Slide 25 text

Cache reordering effects

[Diagram: each thread writes its flag into its own store buffer while global memory still holds flag1 = false and flag2 = false, so each thread's read of the other flag returns false.]

Thread 1:
flag1 = true
flag2 ? contention : critical_section

Thread 2:
flag2 = true
flag1 ? contention : critical_section

Slide 26

Slide 26 text

Processor reordering effects
• Dekker's algorithm
• Compiled without reordering
• Out-of-order processor
• No cache

Slide 27

Slide 27 text

Processor reordering effects

flag1 = flag2 = false

Thread 1:
flag1 = true
flag2 ? contention : critical_section

Thread 2:
flag2 = true
flag1 ? contention : critical_section

Slide 28

Slide 28 text

Processor reordering effects

flag1 = flag2 = false

Thread 1:
r1 = flag2    # read
flag1 = true  # write
r1 ? contention : critical_section

Thread 2:
r1 = flag1    # read
flag2 = true  # write
r1 ? contention : critical_section

• The store was reordered with the load
• StoreLoad reordering is allowed on x86

Slide 29

Slide 29 text

Who reordered my code?!
• It might have been:
  – the compiler
  – the cache
  – the processor
• We do not care who it was, though – only the actual execution matters
• The reordered code runs faster, while the transformation cannot be observed on a single thread

Slide 30

Slide 30 text

Do we want reordering?
• Yes
  – Even very basic code transformations would be forbidden without it
  – Forbidding it would require memory barriers around every read and write
  – It cannot be avoided
• We want to let the compiler, cache and processor
  – keep working for us
  – run our code faster than we wrote it
  – minimize waiting for memory

Slide 31

Slide 31 text

Relaxed memory order

class Variable
  def initialize
    @mutex, @updates, @seen_up_to = Mutex.new, [], {}
  end

  def write(value)
    @mutex.synchronize do
      @seen_up_to[Thread.current] = @updates.size
      @updates << value
    end
    value
  end

  def read
    @mutex.synchronize do
      seen     = @seen_up_to[Thread.current] || 0
      new_seen = (seen...@updates.size).to_a.sample
      @seen_up_to[Thread.current] = new_seen
      return @updates[new_seen]
    end
  end
end

[Table: the list of updates and which threads have so far seen the variable up to which update.]
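A usage sketch for the Variable class above (the thread setup is mine): different reader threads can legitimately come back with different, possibly stale, values.

v = Variable.new
v.write 1
v.write 42

readers = 4.times.map { Thread.new { v.read } }
p readers.map(&:value)  # e.g. [1, 42, 1, 42] -- the threads do not have to agree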

Slide 32

Slide 32 text

Relaxed memory order
• Each thread can see different values
• Variables are completely independent
• Only the order of the values is shared
• Not every value has to be seen by a given thread
• There is no way to tell if a thread got the latest value

Slide 33

Slide 33 text

Taming reordering

Slide 34

Slide 34 text

Sequential consistency
"The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." – Leslie Lamport, 1979
• Allows us to reason about the program as if it were executed interleaved on one thread, even though it is executed in parallel on many threads
• Cannot be done for all variables
• Better to apply it to just the shared variables

Slide 35

Slide 35 text

Sequential consistency

Thread 1:
line :a
line :b

Thread 2:
line 1
line 2

Allowed orders:
line :a, line :b, line 1, line 2
line :a, line 1, line 2, line :b
line :a, line 1, line :b, line 2
line 1, line 2, line :a, line :b
line 1, line :a, line 2, line :b
line 1, line :a, line :b, line 2

Slide 36

Slide 36 text

Sequential consistency
Can :a and :b both be printed?

a = b = false

Thread 1:
a = true

Thread 2:
b = true

Thread 3:
if a && !b
  puts :a
end

Thread 4:
if b && !a
  puts :b
end

Assuming a && !b holds, the order has to be:
a = true
a && !b   # => true, puts :a
b = true

• It is impossible to insert b && !a at a place where it would be true
• The reasoning is simply mirrored for puts :b
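A runnable sketch of the thought experiment (the globals and thread setup are mine): under sequential consistency it may print :a, or :b, or nothing, but never both.

$a = $b = false

threads = [
  Thread.new { $a = true },
  Thread.new { $b = true },
  Thread.new { puts :a if $a && !$b },
  Thread.new { puts :b if $b && !$a }
]
threads.each(&:join)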

Slide 37

Slide 37 text

Memory model
• Defines shared variables
• Allows optimizations while keeping sequential consistency
• Contract: the program is sequentially consistent if there are no data races
• Answers which values a particular read in a program can return
• It is difficult to define
  – We'll focus only on the implications

Slide 38

Slide 38 text

Shared variables
• Called volatile in Java and atomic in C++
• We have to tell the compiler which variables are shared
  – It has to assume that they may be accessed at any time from other threads
  – Reads and writes of shared variables cannot be reordered
• Reads and writes are atomic

Slide 39

Slide 39 text

Shared variables
• To conform with sequential consistency, intuitively:
  – When written, the value has to be made visible immediately to all other threads – called release
  – When read, it reads the latest value – called acquire
• Provides safe publication
  – Release and acquire have a very useful effect on non-shared variables

[Diagram: changes made by Thread 1 before a release on variable @a become visible to Thread 2 after its acquire on @a.]

Slide 40

Slide 40 text

Shared variables

a = 0
shared = false

Thread 1:
a = 42         # cannot be moved down
shared = true  # release

Thread 2:
if (r1 = shared)  # acquire
  r2 = a          # cannot be moved up
end
[r1, r2]  # => [true, 42] or [false, nil]

Possible orders:

r1 = shared   # false
# no `r2 = a`
a = 42
shared = true

a = 42
r1 = shared   # false
# no `r2 = a`
shared = true

a = 42
shared = true
r1 = shared   # true
r2 = a
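Ruby has no `shared` declaration today, but the same release/acquire publication pattern can be sketched with an atomic reference from the concurrent-ruby gem (a sketch; variable names are mine):

require 'concurrent'

shared = Concurrent::AtomicReference.new(nil)

publisher = Thread.new do
  config = { answer: 42 }  # plain, non-shared writes
  shared.set(config)       # release: publishes config and everything written before it
end

consumer = Thread.new do
  Thread.pass until (c = shared.get)  # acquire
  c[:answer]                          # sees 42, not a half-initialized Hash
end

publisher.join
p consumer.value  # => 42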

Slide 41

Slide 41 text

Dekker's algorithm seen by compiler – fixed

flag1 = flag2 = false
shared :flag1, :flag2

Thread 1:
flag1 = true
flag2 ? contention : critical_section

Thread 2:
flag2 = true
flag1 ? contention : critical_section

One possible order:
flag1 = true
flag2 ? contention : critical_section  # false -> critical
flag2 = true
flag1 ? contention : critical_section  # true -> contention
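The `shared :flag1, :flag2` declaration above is a proposed notation, not current Ruby. One way to get comparable volatile semantics today is concurrent-ruby's AtomicBoolean (a sketch under that assumption):

require 'concurrent'

flag1 = Concurrent::AtomicBoolean.new(false)
flag2 = Concurrent::AtomicBoolean.new(false)

t1 = Thread.new do
  flag1.make_true                        # volatile write (release)
  flag2.true? ? :contention : :critical  # volatile read (acquire)
end

t2 = Thread.new do
  flag2.make_true
  flag1.true? ? :contention : :critical
end

p [t1.value, t2.value]  # never [:critical, :critical] once the flags are shared variables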

Slide 42

Slide 42 text

Example – fixed

class Future
  shared :@value
  def initialize; @value = nil; end
  def fulfill(v); end
  def value; end
end

Thread 1:
def value
  Thread.pass until @value
  @value
end

Thread 2:
def fulfill(value)
  @value = value
end

The transformation into

def value
  temp = @value
  Thread.pass until temp
  @value
end

is no longer allowed: the read of @value cannot be reordered, it has to actually read the value each time.
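Again, `shared :@value` is a proposed notation; with concurrent-ruby the fixed Future could be sketched like this (volatile get/set come from AtomicReference, class name is mine):

require 'concurrent'

class AtomicFuture
  def initialize
    @value = Concurrent::AtomicReference.new(nil)
  end

  def fulfill(result)
    @value.set(result)   # volatile write -- visible to other threads
  end

  def value
    Thread.pass until @value.get  # volatile read -- cannot be hoisted out of the loop
    @value.get
  end
end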

Slide 43

Slide 43 text

Building with memory model

Slide 44

Slide 44 text

Counter
• A counter:
  – .new(value = 0)
  – #add(increment = 1)
  – #value
• Let's build one using the core library
  – Mutex

Slide 45

Slide 45 text

Counter

class MutexCounter
  def initialize(value = 0)
    @mutex = Mutex.new
    @mutex.synchronize { @value = value }
  end

  def add(increment = 1)
    @mutex.synchronize do
      @value += increment
    end
  end

  def value
    @mutex.synchronize { @value }
  end
end
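A quick usage sketch (thread and iteration counts are arbitrary): concurrent adds from several threads still sum up correctly because every update runs under the mutex.

counter = MutexCounter.new
threads = 8.times.map { Thread.new { 1_000.times { counter.add } } }
threads.each(&:join)
p counter.value  # => 8000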

Slide 46

Slide 46 text

Counter

class SharedCounter
  def initialize(value = 0)
    @mutex = Mutex.new
    @value = AtomicReference.new value
  end

  def add(increment = 1)
    @mutex.synchronize do
      @value.set @value.get + increment
    end
  end

  def value
    @value.get
  end
end

Slide 47

Slide 47 text

Benchmark – value improvement

               MutexCounter   SharedCounter
MRI            24.29          11.07
JRuby          9.69           4.96
JRuby+Truffle  1.01           0.17

Slide 48

Slide 48 text

Compare-and-set operations
• Atomic operation on a shared variable

compare_and_set expected, new_value # => true || false

attr_atomic :value  # shared variable
self.value = 1

Thread 1:
while true
  current   = value
  new_value = current + 1
  break if compare_and_set_value(current, new_value)
end

Thread 2:
while true
  current   = value
  new_value = current * 2
  break if compare_and_set_value(current, new_value)
end
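attr_atomic and compare_and_set_value above belong to a proposed API, not something plain Ruby ships today. The same retry loops can be sketched with concurrent-ruby's AtomicReference, whose compare_and_set succeeds only if the value is still the expected one:

require 'concurrent'

value = Concurrent::AtomicReference.new(1)

add_one = Thread.new do
  loop do
    current = value.get
    break if value.compare_and_set(current, current + 1)
  end
end

double = Thread.new do
  loop do
    current = value.get
    break if value.compare_and_set(current, current * 2)
  end
end

[add_one, double].each(&:join)
p value.get  # => 3 or 4, depending on which thread's update landed first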

Slide 49

Slide 49 text

Counter

class CasCounter
  def initialize(value = 0)
    @value = AtomicReference.new value
  end

  def add(increment = 1)
    while true
      current   = @value.get
      new_value = current + increment
      break if @value.compare_and_set(current, new_value)
    end
  end

  def value
    @value.get
  end
end

Slide 50

Slide 50 text

Benchmark – add improvement

               MutexCounter   CasCounter
MRI            26.81          20.23
JRuby          9.95           15.06
JRuby+Truffle  2.97           1.75

Slide 51

Slide 51 text

Conclusions

Reordering:
• Fast Ruby implementation
• Parallel execution
• Shared memory

Memory model:
• Shared variables
• Sequential consistency

Together: fast concurrent data structures and concurrency abstractions built directly in Ruby.

It is not for everyday coding. Look for abstractions in gems like concurrent-ruby first.

Slide 52

Slide 52 text

Acknowledgements

Benoit Daloze, Brandon Fish, Petr Chalupa, Kevin Menard, Chris Seaton, JRuby & Rubinius Contributors

Oracle: Danilo Ansaloni, Stefan Anzinger, Cosmin Basca, Daniele Bonetta, Matthias Brantner, Petr Chalupa, Jürgen Christ, Laurent Daynès, Gilles Duboscq, Martin Entlicher, Bastian Hossbach, Christian Humer, Mick Jordan, Vojin Jovanovic, Peter Kessler, David Leopoldseder, Kevin Menard, Jakub Podlešák, Aleksandar Prokopec, Tom Rodriguez, Roland Schatz, Chris Seaton, Doug Simon, Štěpán Šindelář, Zbyněk Šlajchrt, Lukas Stadler, Codrut Stancu, Jan Štola, Jaroslav Tulach, Michael Van De Vanter, Adam Welc, Christian Wimmer, Christian Wirth, Paul Wögerer, Mario Wolczko, Andreas Wöß, Thomas Würthinger

JKU Linz: Prof. Hanspeter Mössenböck, Benoit Daloze, Josef Eisl, Thomas Feichtinger, Matthias Grimmer, Christian Häubl, Josef Haider, Christian Huber, Stefan Marr, Manuel Rigger, Stefan Rumzucker, Bernhard Urban

University of Edinburgh: Christophe Dubach, Juan José Fumero Alfonso, Ranjeet Singh, Toomas Remmelg

LaBRI: Floréal Morandat

University of California, Irvine: Prof. Michael Franz, Gulfem Savrun Yeniceri, Wei Zhang

Purdue University: Prof. Jan Vitek, Tomas Kalibera, Petr Maj, Lei Zhao

T. U. Dortmund: Prof. Peter Marwedel, Helena Kotthaus, Ingo Korb

University of California, Davis: Prof. Duncan Temple Lang, Nicholas Ulle

University of Lugano, Switzerland: Prof. Walter Binder, Sun Haiyang, Yudi Zheng

Oracle Interns: Brian Belleville, Miguel Garcia, Shams Imam, Alexey Karyakin, Stephen Kell, Andreas Kunft, Volker Lanting, Gero Leinemann, Julian Lettner, Joe Nash, David Piorkowski, Gregor Richards, Robert Seilbeck, Rifat Shariyar

Alumni: Erik Eckstein, Michael Haupt, Christos Kotselidis, Hyunjin Lee, David Leibs, Chris Thalinger, Till Westmann

Slide 53

Slide 53 text

Safe Harbor Statement
The preceding is intended to provide some insight into a line of research in Oracle Labs. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. Oracle reserves the right to alter its development plans and practices at any time, and the development, release, and timing of any features or functionality described in connection with any Oracle product or service remains at the sole discretion of Oracle. Any views expressed in this presentation are my own and do not necessarily reflect the views of Oracle.

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

No content