Who reordered my code?!

Petr Chalupa
September 08, 2016

There is a hidden problem waiting as Ruby becomes 3x faster and starts to support parallel computation: reordering by JIT compilers and CPUs.

In this talk, we’ll start by trying to optimize a few simple Ruby snippets. We’ll play the role of a JIT and a CPU and order operations as the rules of the system allow. Then we’ll add a second thread to the snippets and watch them break horribly.

In the second part, we’ll fix the unwanted reorderings by introducing a memory model to Ruby. We’ll discuss in detail how it fixes the snippets and how it can be used to write faster code for parallel execution.

Transcript

  1. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Who reordered my code?!
         Petr Chalupa
         Principal Member of Technical Staff, Oracle Labs
         September 08, 2016
         JRuby+Truffle, Concurrent Ruby
  2. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Safe Harbor Statement
         The following is intended to provide some insight into a line of research in Oracle Labs. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. Oracle reserves the right to alter its development plans and practices at any time, and the development, release, and timing of any features or functionality described in connection with any Oracle product or service remains at the sole discretion of Oracle. Any views expressed in this presentation are my own and do not necessarily reflect the views of Oracle.
  3. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Live example: Dekker’s algorithm
         • Mutual exclusion of two threads
         • No locks

         flag1 = flag2 = false

         Thread 1
           flag1 = true
           flag2 ? contention : critical_section

         Thread 2
           flag2 = true
           flag1 ? contention : critical_section
  4. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Outline
         1. When can you see reordering?
         2. What does it do?
         3. Embrace or reject?
         4. How to deal with reordering?
         5. Does it have a practical use?
  5. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Performance
         • CRuby 3x3 (Heroku, Appfolio)
         • Ruby OMR preview – OMR, J9 (IBM)
         • JRuby – invokedynamic, new IR (Red Hat)
         • JRuby+Truffle – Truffle, Graal (Oracle)
  6. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Parallelism
         • Almost every computer has more than one core
         • Parallel computation has to be supported to utilize all cores
         • JRuby, JRuby+Truffle and Rubinius support parallel execution
         • Maybe the GIL will be removed in Ruby 3?
         [Diagram: two stacks. With a GIL: Kernel, OS Threads, Ruby Threads, GIL, Ruby interpreter, C extensions. Without a GIL: Kernel, OS Threads, Ruby Threads, Ruby Interpreted, Ruby compiled, C extensions, C code.]
  7. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Concurrent library
         • Ideas considered for Ruby 3: actors, isolation, channels, streams, …
           – Easy to use high-level concurrency abstraction
         • Unanswered questions:
           – How do we write fast concurrent data structures?
           – How do we write more concurrent abstractions?
  8. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | When can we see it?
         • Fast Ruby implementation
         • Parallel execution
  9. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Fast Ruby implementation
         For the Ruby language to be fast, an implementation with speculatively optimizing dynamic compilation and parallel execution is needed.
         • Speculative: can speculate on the following propositions
           – Method body is stable
           – Constant's value is stable
           – Type speculation
           – …

         COUNT = 2
         def foo(a, b)
           COUNT * (a + b)
         end
         foo(1, 2)
  10. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Fast Ruby implementation
         For the Ruby language to be fast, an implementation with speculatively optimizing dynamic compilation and parallel execution is needed.
         • Optimizing: does all the clever optimizations that e.g. gcc does
           – Inlining
           – Splitting
           – Constant folding
           – Value numbering
           – Hoisting
           – …
  11. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Fast Ruby implementation
         For the Ruby language to be fast, an implementation with speculatively optimizing dynamic compilation and parallel execution is needed.
         • Dynamic:
           – Just-in-time compilation of hot methods
           – Also deoptimizes when speculative assumptions fail
         • Parallel:
           – Ruby code runs in parallel

         COUNT = 2
         def foo(a, b)
           COUNT * (a + b)
         end
         COUNT = 3
  12. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Fast Ruby implementation
         • JRuby+Truffle is such an implementation
           – Truffle: self-optimizing AST interpreter
           – Graal: compiler written in Java
  13. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Compiler reorders code
         • Optimizes by transforming the code
         • Is allowed to perform any optimization if the transformation cannot be observed on the same thread
           – The code has the same result
           – Assumes only one thread
  14. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Seemingly sequential Ruby code
         def foo(a, b, c, d)
           x = a + b
           y = c + d
           x * y
         end

         Expanded to a parallel graph in the compiler.
         [Diagram: graph from foo(a, b, c, d) through two independent + nodes into × and then end]
         These two operations can happen in either order.
         Why? Because they are independent operations: there are no dependencies between the two.
  15. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Seemingly sequential Ruby code
         Generated machine code can use either order of operations.

         Order 1
           add a b %r1
           add c d %r2
           mul %r1 %r2 %r3
           ret %r3

         Order 2
           add c d %r1
           add a b %r2
           mul %r1 %r2 %r3
           ret %r3

         Why? Because they are independent operations: there are no dependencies between the two.
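
A minimal sketch of the same point in Ruby: the two additions are written in both orders by hand, and a single thread cannot tell the difference. The method names `foo_program_order` and `foo_reordered` are made up for this illustration.

```ruby
# Two hand-written "schedules" of the same method body. A compiler or a CPU
# may pick either one, because neither addition depends on the other.
def foo_program_order(a, b, c, d)
  x = a + b # first addition
  y = c + d # second addition
  x * y
end

def foo_reordered(a, b, c, d)
  y = c + d # second addition moved up
  x = a + b
  x * y
end

p foo_program_order(1, 2, 3, 4) # => 21
p foo_reordered(1, 2, 3, 4)     # => 21
```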
  16. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Seemingly sequential Ruby code
         Even if our compiler didn’t reorder, the processor could do it anyway!

           add a b %r1
           add c d %r2
           mul %r1 %r2 %r3
           ret %r3

         The two add instructions can happen in your processor in either order.
         Why? Because they are independent instructions: there are no dependencies between the two.
  17. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Dekker’s algorithm seen by compiler
         flag1 = flag2 = false

         Thread 1
           flag1 = true
           flag2 ? contention : critical_section

         Thread 2
           flag2 = true
           flag1 ? contention : critical_section

         [Diagram: compiler's view of Thread 1 with the nodes flag1 = true, if flag2, contention, critical_section]
  18. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Example
         class Future
           def initialize; @value = nil; end
           def fulfill(v); end
           def value; end
         end

         Thread 1
           def fulfill(result)
             @value = result
           end

         Thread 2
           def value
             Thread.pass until @value
             @value
           end

         Transformed into
           def value
             temp = @value
             Thread.pass until temp
             @value
           end

         If value is called before fulfill it will block indefinitely.

         Order
           2: temp = @value           # nil
           2: Thread.pass until temp  # nil
           1: @value = result         # :result
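
The effect of the hoisted read can be replayed deterministically on one thread. The sketch below is an illustration under assumptions, not the slide's code: `SpinFuture`, `hoisted_value` and the `max_spins` bound are hypothetical names, and the block passed to each reader plays the role of the other thread calling `fulfill` while the reader spins.

```ruby
class SpinFuture
  def initialize
    @value = nil
  end

  def fulfill(result)
    @value = result
  end

  # As written in the source: @value is re-read on every iteration.
  def value(max_spins)
    spins = 0
    until @value
      return :gave_up if (spins += 1) > max_spins # bound only so the demo ends
      yield if block_given?                       # "other thread" runs here
    end
    @value
  end

  # As a compiler might transform it: the read is hoisted out of the loop,
  # so the loop condition never observes a later write.
  def hoisted_value(max_spins)
    temp = @value
    spins = 0
    until temp
      return :gave_up if (spins += 1) > max_spins
      yield if block_given?
    end
    @value
  end
end

f = SpinFuture.new
p f.value(10) { f.fulfill(:result) }         # => :result
g = SpinFuture.new
p g.hoisted_value(10) { g.fulfill(:result) } # => :gave_up
```

Without the `max_spins` bound the hoisted version would spin forever, which is the indefinite blocking the slide describes.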
  19. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Cache reordering effects
         • Dekker's algorithm
         • Compiled without reordering
         • Old processor executing in program order
           – No out-of-order execution
         • Coherent cache with just a write buffer
  20. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Cache reordering effects
         flag1 = flag2 = false

         Thread 1
           flag1 = true
           flag2 ? contention : critical_section

         Thread 2
           flag2 = true
           flag1 ? contention : critical_section
  21. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Cache reordering effects
         Thread 1
           flag1 = true
           flag2 ? contention : critical_section

         Thread 2
           flag2 = true
           flag1 ? contention : critical_section

         [Diagram: each thread has its own store buffer in front of global memory; global memory still holds flag1 = false and flag2 = false]
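
A toy model can make the store-buffer effect concrete. This is a sketch under stated assumptions rather than a description of real hardware: the `Core` class, its `write`, `read` and `flush` methods, and the fixed schedule in `dekker_with_store_buffers` are all invented for illustration.

```ruby
# Toy model: one shared memory plus a private store buffer per core.
class Core
  def initialize(memory)
    @memory = memory
    @store_buffer = {}
  end

  # A write lands in the private store buffer first.
  def write(name, value)
    @store_buffer[name] = value
  end

  # A read sees this core's own buffered writes, otherwise global memory.
  def read(name)
    @store_buffer.fetch(name) { @memory.fetch(name) }
  end

  # Later, the buffer drains into global memory.
  def flush
    @memory.merge!(@store_buffer)
    @store_buffer.clear
  end
end

# Replays one schedule: both writes are still buffered when the reads happen.
def dekker_with_store_buffers
  memory = { flag1: false, flag2: false }
  core1 = Core.new(memory)
  core2 = Core.new(memory)

  core1.write(:flag1, true) # Thread 1: flag1 = true (buffered)
  core2.write(:flag2, true) # Thread 2: flag2 = true (buffered)
  t1 = core1.read(:flag2) ? :contention : :critical_section
  t2 = core2.read(:flag1) ? :contention : :critical_section
  core1.flush
  core2.flush
  [t1, t2]
end

p dekker_with_store_buffers # => [:critical_section, :critical_section]
```

Both threads enter the critical section even though each core executed its own instructions strictly in program order.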
  22. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Processor reordering effects
         • Dekker's algorithm
         • Compiled without reordering
         • Out-of-order processor
         • No cache
  23. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Processor reordering effects
         flag1 = flag2 = false

         Thread 1
           flag1 = true
           flag2 ? contention : critical_section

         Thread 2
           flag2 = true
           flag1 ? contention : critical_section
  24. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Processor reordering effects
         flag1 = flag2 = false

         Thread 1
           r1 = flag2    # read
           flag1 = true  # write
           r1 ? contention : critical_section

         Thread 2
           r1 = flag1    # read
           flag2 = true  # write
           r1 ? contention : critical_section

         • Store reordered with load
         • StoreLoad reordering is allowed on x86
  25. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Who reordered my code?!
         • It might have been:
           – Compiler
           – Cache
           – Processor
         • We do not care who it was though, only the actual execution matters
         • The reordered code runs faster while the transformation cannot be observed on a single thread
  26. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Do we want reordering?
         • Yes
           – Even the very basic code transformations would be forbidden without it
           – It would require memory barriers around every read and write
           – It cannot be avoided
         • We want to let the compiler, cache, processor
           – keep working for us
           – run our code faster than we wrote it
           – minimize waiting for memory
  27. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Relaxed memory order
         class Variable
           def initialize
             @mutex, @updates, @seen_up_to = Mutex.new, [], {}
           end

           def write(value)
             @mutex.synchronize do
               # the writer has seen everything up to its own update
               @seen_up_to[Thread.current] = @updates.size
               @updates << value
             end
             value
           end

           def read
             @mutex.synchronize do
               seen = @seen_up_to[Thread.current] || 0
               # pick any update at or after the last one this thread saw
               new_seen = (seen...@updates.size).to_a.sample
               @seen_up_to[Thread.current] = new_seen
               return @updates[new_seen]
             end
           end
         end

         Updates   Seen by
         -         Thread 1
         0
         1         Thread 2, Thread 3
         42        Thread 4
         54
  28. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Relaxed memory order
         • Each thread sees different values
         • Variables are completely independent
         • Only the order of the values is shared
         • Not every value has to be seen by a given thread
         • No way to tell if a thread got the latest value
  29. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Sequential consistency
         “The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” (Leslie Lamport, 1979)
         • Allows reasoning about the program as if it were executed interleaved on one thread, even though it is executed in parallel on many threads
         • Cannot be done for all variables
         • Better to apply it to just the shared variables
  30. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Sequential consistency
         Thread 1
           line :a
           line :b

         Thread 2
           line 1
           line 2

         Allowed orders
           line :a, line :b, line 1, line 2
           line :a, line 1, line 2, line :b
           line :a, line 1, line :b, line 2
           line 1, line 2, line :a, line :b
           line 1, line :a, line 2, line :b
           line 1, line :a, line :b, line 2
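
The allowed orders are exactly the interleavings that keep each thread's own lines in program order. A short sketch that enumerates them; the helper name `interleavings` is an assumption made for this example.

```ruby
# All interleavings of two sequences that keep each sequence's own order.
def interleavings(left, right)
  return [right] if left.empty?
  return [left] if right.empty?

  # Either the next step comes from the left thread or from the right one.
  interleavings(left.drop(1), right).map { |rest| [left.first] + rest } +
    interleavings(left, right.drop(1)).map { |rest| [right.first] + rest }
end

orders = interleavings([:a, :b], [1, 2])
orders.each { |order| puts order.inspect }
puts orders.size # => 6
```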
  31. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Sequential consistency
         Can :a and :b both be printed?

         a = b = false

         Thread 1
           a = true

         Thread 2
           b = true

         Thread 3
           if a && !b
             puts :a
           end

         Thread 4
           if b && !a
             puts :b
           end

         Assuming a && !b was true, the order has to be
           a = true
           a && !b   # => true, puts :a
           b = true

         • It is impossible to insert b && !a at a place where it would be true
         • The reasoning is just mirrored for puts :b
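
The same argument can be checked by brute force. The sketch below assumes that each `if` condition performs two separate reads (first its own flag, then the other one) and enumerates every sequentially consistent interleaving of the four threads. The names `interleavings`, `printed_sets` and the state keys are made up for this illustration.

```ruby
# All interleavings of several threads' operation lists, each kept in order.
def interleavings(threads)
  return [[]] if threads.all?(&:empty?)

  threads.each_with_index.flat_map do |ops, i|
    next [] if ops.empty?

    rest = threads.dup
    rest[i] = ops.drop(1)
    interleavings(rest).map { |tail| [ops.first] + tail }
  end
end

# Runs every interleaving and collects the distinct sets of printed symbols.
def printed_sets
  threads = [
    [->(s) { s[:a] = true }],                                  # Thread 1
    [->(s) { s[:b] = true }],                                  # Thread 2
    [->(s) { s[:t3_a] = s[:a] }, ->(s) { s[:t3_b] = s[:b] }],  # Thread 3 reads
    [->(s) { s[:t4_b] = s[:b] }, ->(s) { s[:t4_a] = s[:a] }]   # Thread 4 reads
  ]

  interleavings(threads).map do |order|
    s = { a: false, b: false }
    order.each { |op| op.call(s) }
    printed = []
    printed << :a if s[:t3_a] && !s[:t3_b] # Thread 3: if a && !b
    printed << :b if s[:t4_b] && !s[:t4_a] # Thread 4: if b && !a
    printed
  end.uniq
end

p printed_sets.include?([:a, :b]) # => false
```

Under these assumptions no interleaving prints both symbols, while runs that print only `:a`, only `:b`, or nothing all exist.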
  32. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Memory model
         • Defines shared variables
         • Allows optimizations while keeping sequential consistency
         • Contract: the program is sequentially consistent if there are no data races
         • Answers which values a particular read can return in a program
         • It's difficult to define
           – We'll focus only on implications
  33. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Shared variables
         • Called volatile in Java and atomic in C++
         • We have to tell the compiler which variables are shared
           – It has to assume that they may be accessed at any time from other threads
           – Reads and writes of shared variables cannot be reordered
         • Reads and writes are atomic
  34. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Shared variables
         • To conform with sequential consistency, intuitively:
           – When written, the value has to be made visible immediately to all other threads; this is called release
           – When read, the read returns the latest value; this is called acquire
         • Provides safe publication
           – Release and acquire have a very useful effect on non-shared variables
         [Diagram: changes made by Thread 1 before its release on variable @a are visible to Thread 2 after its acquire on variable @a]
  35. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Shared variables
         a = 0
         shared = false

         Thread 1
           a = 42         # cannot be moved down
           shared = true  # release

         Thread 2
           if (r1 = shared)  # acquire
             r2 = a          # cannot be moved up
           end
           [r1, r2]  # => [true, 42], [false, nil]

         Possible orders
           1. r1 = shared  # false, no `r2 = a`
              a = 42
              shared = true
           2. a = 42
              r1 = shared  # false, no `r2 = a`
              shared = true
           3. a = 42
              shared = true
              r1 = shared  # true
              r2 = a
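
A brute-force sketch of why the write order matters. It enumerates all interleavings of a writer and a reader over a small state hash, once with the writes in program order and once with them swapped, which is the reordering a shared (release) write forbids. All names here (`interleavings`, `outcomes`, `WRITE_A`, `WRITE_SHARED`) are assumptions made for the illustration.

```ruby
# All interleavings of two operation lists that keep each list's own order.
def interleavings(left, right)
  return [right] if left.empty?
  return [left] if right.empty?

  interleavings(left.drop(1), right).map { |rest| [left.first] + rest } +
    interleavings(left, right.drop(1)).map { |rest| [right.first] + rest }
end

WRITE_A      = ->(s) { s[:a] = 42 }
WRITE_SHARED = ->(s) { s[:shared] = true }

# Distinct [r1, r2] results the reader can observe for a given writer order.
def outcomes(writer_ops)
  reader_ops = [
    ->(s) { s[:r1] = s[:shared] },     # acquire read of the flag
    ->(s) { s[:r2] = s[:a] if s[:r1] } # read a only if the flag was seen
  ]

  interleavings(writer_ops, reader_ops).map do |order|
    s = { a: 0, shared: false, r1: nil, r2: nil }
    order.each { |op| op.call(s) }
    [s[:r1], s[:r2]]
  end.uniq.sort_by(&:inspect)
end

p outcomes([WRITE_A, WRITE_SHARED]) # => [[false, nil], [true, 42]]
p outcomes([WRITE_SHARED, WRITE_A]) # => [[false, nil], [true, 0], [true, 42]]
```

With the writes swapped, the reader can see the flag set and still read the stale `0`, so the publication is no longer safe.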
  36. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Dekker’s algorithm seen by compiler – fixed
         flag1 = flag2 = false
         shared :flag1, :flag2

         Thread 1
           flag1 = true
           flag2 ? contention : critical_section

         Thread 2
           flag2 = true
           flag1 ? contention : critical_section

         Order
           flag1 = true
           flag2 ? contention : critical_section  # false -> critical
           flag2 = true
           flag1 ? contention : critical_section  # true -> contention
  37. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Example – fixed
         class Future
           shared :@value
           def initialize; @value = nil; end
           def fulfill(v); end
           def value; end
         end

         Thread 1
           def value
             Thread.pass until @value
             @value
           end

         No longer transformed into
           def value
             temp = @value
             Thread.pass until temp
             @value
           end

         Thread 2
           def fulfill(value)
             @value = value
           end

         Reads of @value cannot be reordered; the value actually has to be read each time.
  38. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Counter
         • A counter:
           – .new(value = 0)
           – #add(increment = 1)
           – #value
         • Let’s build one using the core library
           – Mutex
  39. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Counter
         class MutexCounter
           def initialize(value = 0)
             @mutex = Mutex.new
             @mutex.synchronize { @value = value }
           end

           def add(increment = 1)
             @mutex.synchronize do
               @value += increment
             end
           end

           def value
             @mutex.synchronize { @value }
           end
         end
  40. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Counter
         class SharedCounter
           def initialize(value = 0)
             @mutex = Mutex.new
             @value = AtomicReference.new value
           end

           def add(increment = 1)
             @mutex.synchronize do
               @value.set @value.get + increment
             end
           end

           def value
             @value.get
           end
         end
  41. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Benchmark – value improvement
         [Bar chart: MutexCounter vs. SharedCounter on MRI, JRuby and JRuby+Truffle, axis from 0 to 30; bar labels 24,29 11,07 9,69 4,96 1,01 0,17]
  42. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Compare-and-set operations
         • Atomic operation on a shared variable

         compare_and_set expected, new_value  # => true || false

         attr_atomic :value  # shared variable
         self.value = 1

         Thread 1
           while true
             current = value
             new_value = current + 1
             break if compare_and_set_value(current, new_value)
           end

         Thread 2
           while true
             current = value
             new_value = current * 2
             break if compare_and_set_value(current, new_value)
           end
  43. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Counter
         class CasCounter
           def initialize(value = 0)
             @value = AtomicReference.new value
           end

           def add(increment = 1)
             while true
               current = @value.get
               new_value = current + increment
               break if @value.compare_and_set(current, new_value)
             end
           end

           def value
             @value.get
           end
         end
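
To try the compare-and-set counter without any gems, the sketch below swaps in a Mutex-based stand-in for the atomic reference. `SimpleAtomicReference` and `hammer` are hypothetical names invented here; real code would more likely use an atomic reference from a library such as concurrent-ruby, and the stand-in gives none of the performance benefits measured in the talk.

```ruby
# Stand-in for an atomic reference, built on Mutex so the sketch needs no gems.
class SimpleAtomicReference
  def initialize(value)
    @mutex = Mutex.new
    @value = value
  end

  def get
    @mutex.synchronize { @value }
  end

  # Atomically stores new_value only if the current value still equals expected.
  def compare_and_set(expected, new_value)
    @mutex.synchronize do
      return false unless @value == expected

      @value = new_value
      true
    end
  end
end

class CasCounter
  def initialize(value = 0)
    @value = SimpleAtomicReference.new(value)
  end

  # Retries until no other thread changed the value between get and set.
  def add(increment = 1)
    loop do
      current = @value.get
      break if @value.compare_and_set(current, current + increment)
    end
  end

  def value
    @value.get
  end
end

# Runs several threads that each increment the counter many times.
def hammer(counter, threads: 4, adds: 1_000)
  Array.new(threads) { Thread.new { adds.times { counter.add } } }.each(&:join)
  counter.value
end

p hammer(CasCounter.new) # => 4000
```

A failed `compare_and_set` means another thread won the race, so the loop re-reads the value and tries again; no increment is lost.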
  44. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Benchmark – add improvement
         [Bar chart: MutexCounter vs. CasCounter on MRI, JRuby and JRuby+Truffle, axis from 0 to 30; bar labels 26,81 20,23 9,95 15,06 2,97 1,75]
  45. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Conclusions
         • Fast Ruby implementation, parallel execution and shared memory lead to reordering
         • Memory model
           – Shared variables
           – Sequential consistency
         • Fast concurrent data structures and concurrency abstractions built directly in Ruby
         It is not for everyday coding. Look for abstractions in gems like concurrent-ruby first.
  46. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Acknowledgements
         Benoit Daloze, Brandon Fish, Petr Chalupa, Kevin Menard, Chris Seaton, JRuby & Rubinius Contributors
         Oracle: Danilo Ansaloni, Stefan Anzinger, Cosmin Basca, Daniele Bonetta, Matthias Brantner, Petr Chalupa, Jürgen Christ, Laurent Daynès, Gilles Duboscq, Martin Entlicher, Bastian Hossbach, Christian Humer, Mick Jordan, Vojin Jovanovic, Peter Kessler
         Oracle (continued): David Leopoldseder, Kevin Menard, Jakub Podlešák, Aleksandar Prokopec, Tom Rodriguez, Roland Schatz, Chris Seaton, Doug Simon, Štěpán Šindelář, Zbyněk Šlajchrt, Lukas Stadler, Codrut Stancu, Jan Štola, Jaroslav Tulach, Michael Van De Vanter, Adam Welc, Christian Wimmer, Christian Wirth, Paul Wögerer, Mario Wolczko, Andreas Wöß, Thomas Würthinger
         JKU Linz: Prof. Hanspeter Mössenböck, Benoit Daloze, Josef Eisl, Thomas Feichtinger, Matthias Grimmer, Christian Häubl, Josef Haider, Christian Huber, Stefan Marr, Manuel Rigger, Stefan Rumzucker, Bernhard Urban
         University of Edinburgh: Christophe Dubach, Juan José Fumero Alfonso, Ranjeet Singh, Toomas Remmelg
         LaBRI: Floréal Morandat
         University of California, Irvine: Prof. Michael Franz, Gulfem Savrun Yeniceri, Wei Zhang
         Purdue University: Prof. Jan Vitek, Tomas Kalibera, Petr Maj, Lei Zhao
         T. U. Dortmund: Prof. Peter Marwedel, Helena Kotthaus, Ingo Korb
         University of California, Davis: Prof. Duncan Temple Lang, Nicholas Ulle
         University of Lugano, Switzerland: Prof. Walter Binder, Sun Haiyang, Yudi Zheng
         Oracle Interns: Brian Belleville, Miguel Garcia, Shams Imam, Alexey Karyakin, Stephen Kell, Andreas Kunft, Volker Lanting, Gero Leinemann, Julian Lettner, Joe Nash, David Piorkowski, Gregor Richards, Robert Seilbeck, Rifat Shariyar
         Alumni: Erik Eckstein, Michael Haupt, Christos Kotselidis, Hyunjin Lee, David Leibs, Chris Thalinger, Till Westmann
  47. Copyright  ©  2016, Oracle  and/or  its  affiliates.  All  rights  reserved.

       | Safe Harbor Statement
         The preceding is intended to provide some insight into a line of research in Oracle Labs. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. Oracle reserves the right to alter its development plans and practices at any time, and the development, release, and timing of any features or functionality described in connection with any Oracle product or service remains at the sole discretion of Oracle. Any views expressed in this presentation are my own and do not necessarily reflect the views of Oracle.