Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Who reordered my code?!
Petr Chalupa, Principal Member of Technical Staff, Oracle Labs
September 08, 2016
JRuby+Truffle, Concurrent Ruby

Slide 3

Slide 3 text

Safe Harbor Statement
The following is intended to provide some insight into a line of research in Oracle Labs. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. Oracle reserves the right to alter its development plans and practices at any time, and the development, release, and timing of any features or functionality described in connection with any Oracle product or service remains at the sole discretion of Oracle. Any views expressed in this presentation are my own and do not necessarily reflect the views of Oracle.

Slide 4

Slide 4 text

Live example
• Mutual exclusion of two threads
• No locks

Dekker's algorithm:

flag1 = flag2 = false

Thread 1:
flag1 = true
flag2 ? contention : critical_section

Thread 2:
flag2 = true
flag1 ? contention : critical_section
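A minimal runnable sketch of the live example (the global variables and result names are mine, not from the slides): under sequential consistency at most one thread can reach the critical section; once reordering kicks in, both can.

$flag1 = $flag2 = false
$entered = []

t1 = Thread.new do
  $flag1 = true
  if $flag2
    :contention            # back off, the other thread got there first
  else
    $entered << :thread_1  # critical section
  end
end

t2 = Thread.new do
  $flag2 = true
  if $flag1
    :contention
  else
    $entered << :thread_2
  end
end

[t1, t2].each(&:join)
p $entered  # [:thread_1], [:thread_2] or [] are fine; [:thread_1, :thread_2] means mutual exclusion was broken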

Slide 5

Slide 5 text

Outline
1. When can you see reordering?
2. What does it do?
3. Embrace or reject?
4. How to deal with reordering?
5. Does it have a practical use?

Slide 6

Slide 6 text

Ruby's new goals

Slide 7

Slide 7 text

Performance
• CRuby 3x3 (Heroku, Appfolio)
• Ruby OMR preview – OMR, J9 (IBM)
• JRuby – invokedynamic, new IR (Red Hat)
• JRuby+Truffle – Truffle, Graal (Oracle)

Slide 8

Slide 8 text

Parallelism
• Almost every computer has more than one core
• Parallel computation has to be supported to utilize all cores
• JRuby, JRuby+Truffle and Rubinius support parallel execution
• Maybe the GIL will be removed in Ruby 3?

[Diagram: Ruby threads mapped onto OS threads. With a GIL (MRI), the Ruby interpreter and C extensions run only one Ruby thread at a time; without it, compiled Ruby and C code run in parallel on the kernel's OS threads.]
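A rough, illustrative way to observe the difference (this example is mine, not from the deck): split CPU-bound work across threads and compare against the serial time.

require 'benchmark'

work = ->(n) { x = 0; n.times { x += 1 }; x }  # CPU-bound busy loop

serial = Benchmark.realtime { 4.times { work.call(5_000_000) } }
threaded = Benchmark.realtime do
  4.times.map { Thread.new { work.call(5_000_000) } }.each(&:join)
end

puts "serial: #{serial.round(2)}s, threaded: #{threaded.round(2)}s"
# Under MRI's GIL the two times stay similar; on JRuby, JRuby+Truffle or Rubinius
# the threaded version can approach a 4x speed-up on four cores.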

Slide 9

Slide 9 text

Concurrent library
• Ideas considered for Ruby 3: actors, isolation, channels, streams, …
  – Easy-to-use high-level concurrency abstractions
• Unanswered questions:
  – How do we write fast concurrent data structures?
  – How do we write more concurrent abstractions?

Slide 10

Slide 10 text

Reordering

Slide 11

Slide 11 text

When can we see it?
• Fast Ruby implementation
• Parallel execution

Slide 12

Slide 12 text

Fast Ruby implementation
For the Ruby language to be fast, an implementation with speculatively optimizing dynamic compilation and parallel execution is needed.
• Speculative: it can speculate on propositions such as
  – the method body is stable
  – a constant's value is stable
  – type speculation
  – …

COUNT = 2

def foo(a, b)
  COUNT * (a + b)
end

foo(1, 2)

Slide 13

Slide 13 text

Fast Ruby implementation
For the Ruby language to be fast, an implementation with speculatively optimizing dynamic compilation and parallel execution is needed.
• Optimizing: it performs the same clever optimizations as, e.g., GCC
  – Inlining
  – Splitting
  – Constant folding
  – Value numbering
  – Hoisting
  – …

Slide 14

Slide 14 text

Fast Ruby implementation
For the Ruby language to be fast, an implementation with speculatively optimizing dynamic compilation and parallel execution is needed.
• Dynamic:
  – Just-in-time compilation of hot methods
  – Also deoptimizes when speculatively taken assumptions fail
• Parallel:
  – Ruby code runs in parallel

COUNT = 2

def foo(a, b)
  COUNT * (a + b)
end

COUNT = 3

Slide 15

Slide 15 text

Fast Ruby implementation
• JRuby+Truffle is such an implementation
  – Truffle: self-optimizing AST interpreter
  – Graal: compiler written in Java

Slide 16

Slide 16 text

Sources of reordering

Slide 17

Slide 17 text

Compiler reorders code
• Optimizes by transforming the code
• It is allowed to perform any optimization if the transformation cannot be observed on the same thread
  – The code has the same result
  – It assumes only one thread

Slide 18

Slide 18 text

Seemingly sequential Ruby code

def foo(a, b, c, d)
  x = a + b
  y = c + d
  x * y
end

[Diagram: the method expanded to a parallel data-flow graph in the compiler: two + nodes feeding a × node.]

The two additions can happen in either order. Why? Because they are independent operations – there are no dependencies between the two.

Slide 19

Slide 19 text

Seemingly sequential Ruby code

Generated machine code can use either order of operations:

add a b %r1          add c d %r1
add c d %r2          add a b %r2
mul %r1 %r2 %r3      mul %r1 %r2 %r3
ret %r3              ret %r3

Why? Because they are independent operations – there are no dependencies between the two.

Slide 20

Slide 20 text

Seemingly sequential Ruby code

Even if our compiler didn't reorder, the processor could do it anyway!

add a b %r1
add c d %r2
mul %r1 %r2 %r3
ret %r3

The two add instructions can execute in your processor in either order. Why? Because they are independent instructions – there are no dependencies between the two.

Slide 21

Slide 21 text

Dekker's algorithm seen by compiler

flag1 = flag2 = false

Thread 1:
flag1 = true
flag2 ? contention : critical_section

Thread 2:
flag2 = true
flag1 ? contention : critical_section

What the compiler sees for Thread 1 – a write of flag1 and a read of flag2 with no dependency between them:

flag1 = true
if flag2
  contention
else
  critical_section
end

Slide 22

Slide 22 text

Example

class Future
  def initialize; @value = nil; end
  def fulfill(v); end
  def value; end
end

Thread 1:
def fulfill(result)
  @value = result
end

Thread 2:
def value
  Thread.pass until @value
  @value
end

Transformed into:
def value
  temp = @value
  Thread.pass until temp
  @value
end

If value is called before fulfill, it will block indefinitely.

Order:
2: temp = @value           # nil
2: Thread.pass until temp  # nil
1: @value = result         # :result
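For reference, a complete runnable version of this (still unsafe) Future, with the method bodies filled in as shown above and a small driver of my own:

class Future
  def initialize
    @value = nil
  end

  def fulfill(result)
    @value = result
  end

  def value
    Thread.pass until @value  # the read the compiler is allowed to hoist out of the loop
    @value
  end
end

f = Future.new
reader = Thread.new { f.value }
f.fulfill(:result)
p reader.value  # usually :result, but with the hoisted read the reader may spin forever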

Slide 23

Slide 23 text

Cache reordering effects
• Dekker's algorithm
• Compiled without reordering
• Old processor executing in program order
  – No out-of-order execution
• Coherent cache with just a write buffer

Slide 24

Slide 24 text

Cache reordering effects

flag1 = flag2 = false

Thread 1:
flag1 = true
flag2 ? contention : critical_section

Thread 2:
flag2 = true
flag1 ? contention : critical_section

Slide 25

Slide 25 text

Cache reordering effects

[Diagram: each thread writes its flag into its own store buffer while global memory still holds flag1 = false and flag2 = false, so each thread's read of the other flag returns false.]

Thread 1:
flag1 = true
flag2 ? contention : critical_section

Thread 2:
flag2 = true
flag1 ? contention : critical_section

Slide 26

Slide 26 text

Processor reordering effects
• Dekker's algorithm
• Compiled without reordering
• Out-of-order processor
• No cache

Slide 27

Slide 27 text

Processor reordering effects

flag1 = flag2 = false

Thread 1:
flag1 = true
flag2 ? contention : critical_section

Thread 2:
flag2 = true
flag1 ? contention : critical_section

Slide 28

Slide 28 text

Processor reordering effects

flag1 = flag2 = false

Thread 1:
r1 = flag2    # read
flag1 = true  # write
r1 ? contention : critical_section

Thread 2:
r1 = flag1    # read
flag2 = true  # write
r1 ? contention : critical_section

• The store was reordered with the load
• StoreLoad reordering is allowed on x86

Slide 29

Slide 29 text

Who reordered my code?!
• It might have been:
  – the compiler
  – the cache
  – the processor
• We do not care who it was, though – only the actual execution matters
• The reordered code runs faster, while the transformation cannot be observed on a single thread

Slide 30

Slide 30 text

Do we want reordering?
• Yes
  – Even very basic code transformations would be forbidden without it
  – Forbidding it would require memory barriers around every read and write
  – It cannot be avoided
• We want to let the compiler, cache and processor
  – keep working for us
  – run our code faster than we wrote it
  – minimize waiting for memory

Slide 31

Slide 31 text

Relaxed memory order

class Variable
  def initialize
    @mutex, @updates, @seen_up_to = Mutex.new, [], {}
  end

  def write(value)
    @mutex.synchronize do
      @seen_up_to[Thread.current] = @updates.size
      @updates << value
    end
    value
  end

  def read
    @mutex.synchronize do
      seen     = @seen_up_to[Thread.current] || 0
      new_seen = (seen...@updates.size).to_a.sample
      @seen_up_to[Thread.current] = new_seen
      return @updates[new_seen]
    end
  end
end

[Table: the list of updates and which threads have so far seen the variable up to which update.]
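A usage sketch for the Variable class above (the thread setup is mine): different reader threads can legitimately come back with different, possibly stale, values.

v = Variable.new
v.write 1
v.write 42

readers = 4.times.map { Thread.new { v.read } }
p readers.map(&:value)  # e.g. [1, 42, 1, 42] -- the threads do not have to agree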

Slide 32

Slide 32 text

Relaxed memory order
• Each thread can see different values
• Variables are completely independent
• Only the order of the values is shared
• Not every value has to be seen by a given thread
• There is no way to tell if a thread got the latest value

Slide 33

Slide 33 text

Taming reordering

Slide 34

Slide 34 text

Sequential consistency
"The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." – Leslie Lamport, 1979
• Allows us to reason about the program as if it were executed interleaved on one thread, even though it is executed in parallel on many threads
• Cannot be done for all variables
• Better to apply it to just the shared variables

Slide 35

Slide 35 text

Sequential consistency

Thread 1:
line :a
line :b

Thread 2:
line 1
line 2

Allowed orders:
line :a, line :b, line 1, line 2
line :a, line 1, line 2, line :b
line :a, line 1, line :b, line 2
line 1, line 2, line :a, line :b
line 1, line :a, line 2, line :b
line 1, line :a, line :b, line 2

Slide 36

Slide 36 text

Sequential consistency
Can :a and :b both be printed?

a = b = false

Thread 1:
a = true

Thread 2:
b = true

Thread 3:
if a && !b
  puts :a
end

Thread 4:
if b && !a
  puts :b
end

Assuming a && !b holds, the order has to be:
a = true
a && !b   # => true, puts :a
b = true

• It is impossible to insert b && !a at a place where it would be true
• The reasoning is simply mirrored for puts :b
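A runnable sketch of the thought experiment (the globals and thread setup are mine): under sequential consistency it may print :a, or :b, or nothing, but never both.

$a = $b = false

threads = [
  Thread.new { $a = true },
  Thread.new { $b = true },
  Thread.new { puts :a if $a && !$b },
  Thread.new { puts :b if $b && !$a }
]
threads.each(&:join)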

Slide 37

Slide 37 text

Memory model
• Defines shared variables
• Allows optimizations while keeping sequential consistency
• Contract: the program is sequentially consistent if there are no data races
• Answers which values a particular read in a program can return
• It is difficult to define
  – We'll focus only on the implications

Slide 38

Slide 38 text

Shared variables
• Called volatile in Java and atomic in C++
• We have to tell the compiler which variables are shared
  – It has to assume that they may be accessed at any time from other threads
  – Reads and writes of shared variables cannot be reordered
• Reads and writes are atomic

Slide 39

Slide 39 text

Shared variables
• To conform with sequential consistency, intuitively:
  – When written, the value has to be made visible immediately to all other threads – called release
  – When read, it reads the latest value – called acquire
• Provides safe publication
  – Release and acquire have a very useful effect on non-shared variables

[Diagram: changes made by Thread 1 before a release on variable @a become visible to Thread 2 after its acquire on @a.]

Slide 40

Slide 40 text

Shared variables

a = 0
shared = false

Thread 1:
a = 42         # cannot be moved down
shared = true  # release

Thread 2:
if (r1 = shared)  # acquire
  r2 = a          # cannot be moved up
end
[r1, r2]  # => [true, 42] or [false, nil]

Possible orders:

r1 = shared   # false
# no `r2 = a`
a = 42
shared = true

a = 42
r1 = shared   # false
# no `r2 = a`
shared = true

a = 42
shared = true
r1 = shared   # true
r2 = a
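Ruby has no `shared` declaration today, but the same release/acquire publication pattern can be sketched with an atomic reference from the concurrent-ruby gem (a sketch; variable names are mine):

require 'concurrent'

shared = Concurrent::AtomicReference.new(nil)

publisher = Thread.new do
  config = { answer: 42 }  # plain, non-shared writes
  shared.set(config)       # release: publishes config and everything written before it
end

consumer = Thread.new do
  Thread.pass until (c = shared.get)  # acquire
  c[:answer]                          # sees 42, not a half-initialized Hash
end

publisher.join
p consumer.value  # => 42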

Slide 41

Slide 41 text

Dekker's algorithm seen by compiler – fixed

flag1 = flag2 = false
shared :flag1, :flag2

Thread 1:
flag1 = true
flag2 ? contention : critical_section

Thread 2:
flag2 = true
flag1 ? contention : critical_section

One possible order:
flag1 = true
flag2 ? contention : critical_section  # false -> critical
flag2 = true
flag1 ? contention : critical_section  # true -> contention
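The `shared :flag1, :flag2` declaration above is a proposed notation, not current Ruby. One way to get comparable volatile semantics today is concurrent-ruby's AtomicBoolean (a sketch under that assumption):

require 'concurrent'

flag1 = Concurrent::AtomicBoolean.new(false)
flag2 = Concurrent::AtomicBoolean.new(false)

t1 = Thread.new do
  flag1.make_true                        # volatile write (release)
  flag2.true? ? :contention : :critical  # volatile read (acquire)
end

t2 = Thread.new do
  flag2.make_true
  flag1.true? ? :contention : :critical
end

p [t1.value, t2.value]  # never [:critical, :critical] once the flags are shared variables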

Slide 42

Slide 42 text

Example – fixed

class Future
  shared :@value
  def initialize; @value = nil; end
  def fulfill(v); end
  def value; end
end

Thread 1:
def value
  Thread.pass until @value
  @value
end

Thread 2:
def fulfill(value)
  @value = value
end

The transformation into

def value
  temp = @value
  Thread.pass until temp
  @value
end

is no longer allowed: the read of @value cannot be reordered, it has to actually read the value each time.
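Again, `shared :@value` is a proposed notation; with concurrent-ruby the fixed Future could be sketched like this (volatile get/set come from AtomicReference, class name is mine):

require 'concurrent'

class AtomicFuture
  def initialize
    @value = Concurrent::AtomicReference.new(nil)
  end

  def fulfill(result)
    @value.set(result)   # volatile write -- visible to other threads
  end

  def value
    Thread.pass until @value.get  # volatile read -- cannot be hoisted out of the loop
    @value.get
  end
end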

Slide 43

Slide 43 text

Building with memory model

Slide 44

Slide 44 text

Counter
• A counter:
  – .new(value = 0)
  – #add(increment = 1)
  – #value
• Let's build one using the core library
  – Mutex

Slide 45

Slide 45 text

Counter

class MutexCounter
  def initialize(value = 0)
    @mutex = Mutex.new
    @mutex.synchronize { @value = value }
  end

  def add(increment = 1)
    @mutex.synchronize do
      @value += increment
    end
  end

  def value
    @mutex.synchronize { @value }
  end
end
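A quick usage sketch (thread and iteration counts are arbitrary): concurrent adds from several threads still sum up correctly because every update runs under the mutex.

counter = MutexCounter.new
threads = 8.times.map { Thread.new { 1_000.times { counter.add } } }
threads.each(&:join)
p counter.value  # => 8000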

Slide 46

Slide 46 text

Counter

class SharedCounter
  def initialize(value = 0)
    @mutex = Mutex.new
    @value = AtomicReference.new value
  end

  def add(increment = 1)
    @mutex.synchronize do
      @value.set @value.get + increment
    end
  end

  def value
    @value.get
  end
end

Slide 47

Slide 47 text

Benchmark – value improvement

               MutexCounter   SharedCounter
MRI            24.29          11.07
JRuby          9.69           4.96
JRuby+Truffle  1.01           0.17

Slide 48

Slide 48 text

Compare-and-set operations
• Atomic operation on a shared variable

compare_and_set expected, new_value # => true || false

attr_atomic :value  # shared variable
self.value = 1

Thread 1:
while true
  current   = value
  new_value = current + 1
  break if compare_and_set_value(current, new_value)
end

Thread 2:
while true
  current   = value
  new_value = current * 2
  break if compare_and_set_value(current, new_value)
end
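attr_atomic and compare_and_set_value above belong to a proposed API, not something plain Ruby ships today. The same retry loops can be sketched with concurrent-ruby's AtomicReference, whose compare_and_set succeeds only if the value is still the expected one:

require 'concurrent'

value = Concurrent::AtomicReference.new(1)

add_one = Thread.new do
  loop do
    current = value.get
    break if value.compare_and_set(current, current + 1)
  end
end

double = Thread.new do
  loop do
    current = value.get
    break if value.compare_and_set(current, current * 2)
  end
end

[add_one, double].each(&:join)
p value.get  # => 3 or 4, depending on which thread's update landed first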

Slide 49

Slide 49 text

Counter

class CasCounter
  def initialize(value = 0)
    @value = AtomicReference.new value
  end

  def add(increment = 1)
    while true
      current   = @value.get
      new_value = current + increment
      break if @value.compare_and_set(current, new_value)
    end
  end

  def value
    @value.get
  end
end

Slide 50

Slide 50 text

Benchmark – add improvement

               MutexCounter   CasCounter
MRI            26.81          20.23
JRuby          9.95           15.06
JRuby+Truffle  2.97           1.75

Slide 51

Slide 51 text

Conclusions

Reordering:
• Fast Ruby implementation
• Parallel execution
• Shared memory

Memory model:
• Shared variables
• Sequential consistency

Together: fast concurrent data structures and concurrency abstractions built directly in Ruby.

It is not for everyday coding. Look for abstractions in gems like concurrent-ruby first.

Slide 52

Slide 52 text

Acknowledgements

Benoit Daloze, Brandon Fish, Petr Chalupa, Kevin Menard, Chris Seaton, JRuby & Rubinius Contributors

Oracle: Danilo Ansaloni, Stefan Anzinger, Cosmin Basca, Daniele Bonetta, Matthias Brantner, Petr Chalupa, Jürgen Christ, Laurent Daynès, Gilles Duboscq, Martin Entlicher, Bastian Hossbach, Christian Humer, Mick Jordan, Vojin Jovanovic, Peter Kessler, David Leopoldseder, Kevin Menard, Jakub Podlešák, Aleksandar Prokopec, Tom Rodriguez, Roland Schatz, Chris Seaton, Doug Simon, Štěpán Šindelář, Zbyněk Šlajchrt, Lukas Stadler, Codrut Stancu, Jan Štola, Jaroslav Tulach, Michael Van De Vanter, Adam Welc, Christian Wimmer, Christian Wirth, Paul Wögerer, Mario Wolczko, Andreas Wöß, Thomas Würthinger

JKU Linz: Prof. Hanspeter Mössenböck, Benoit Daloze, Josef Eisl, Thomas Feichtinger, Matthias Grimmer, Christian Häubl, Josef Haider, Christian Huber, Stefan Marr, Manuel Rigger, Stefan Rumzucker, Bernhard Urban

University of Edinburgh: Christophe Dubach, Juan José Fumero Alfonso, Ranjeet Singh, Toomas Remmelg

LaBRI: Floréal Morandat

University of California, Irvine: Prof. Michael Franz, Gulfem Savrun Yeniceri, Wei Zhang

Purdue University: Prof. Jan Vitek, Tomas Kalibera, Petr Maj, Lei Zhao

T. U. Dortmund: Prof. Peter Marwedel, Helena Kotthaus, Ingo Korb

University of California, Davis: Prof. Duncan Temple Lang, Nicholas Ulle

University of Lugano, Switzerland: Prof. Walter Binder, Sun Haiyang, Yudi Zheng

Oracle Interns: Brian Belleville, Miguel Garcia, Shams Imam, Alexey Karyakin, Stephen Kell, Andreas Kunft, Volker Lanting, Gero Leinemann, Julian Lettner, Joe Nash, David Piorkowski, Gregor Richards, Robert Seilbeck, Rifat Shariyar

Alumni: Erik Eckstein, Michael Haupt, Christos Kotselidis, Hyunjin Lee, David Leibs, Chris Thalinger, Till Westmann

Slide 53

Slide 53 text

Safe Harbor Statement
The preceding is intended to provide some insight into a line of research in Oracle Labs. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. Oracle reserves the right to alter its development plans and practices at any time, and the development, release, and timing of any features or functionality described in connection with any Oracle product or service remains at the sole discretion of Oracle. Any views expressed in this presentation are my own and do not necessarily reflect the views of Oracle.

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

No content