Method JIT Compiler for MRI

RubyElixirConf 2018
https://2018.rubyconf.tw/

Takashi Kokubun

April 27, 2018

Transcript

  1. Method JIT Compiler for MRI ~ Optimizations in Ruby 2.6.0 preview1, 2 ~

    RubyElixirConf Taiwan 2018, @k0kubun / Treasure Data Inc.
  2. @k0kubun Treasure Data Inc. ERB maintainer, developing Ruby’s JIT

  3. The history of MRI JIT

  4. March 2017: RTL & MJIT

  5. October 2017: YARV-MJIT

    [Chart: Optcarrot with Lan_Master, fps, for Ruby 2.0, Ruby 2.5, YARV-MJIT, and RTL MJIT] https://github.com/k0kubun/yarv-mjit/tree/master-171211#optcarrot-benchmark
  6. February 2018: Merge MJIT infrastructure

  7. February 2018: Released in 2.6.0-preview1

  8. How does it work?

  9. Optionally enabled by "--jit". Tip: RUBYOPT="--jit" ruby … works too.

  10. New runtime dependency: gcc / clang

  11. How Ruby’s method JIT works

    Methods are interpreted.
  12. How Ruby’s method JIT works

    Methods are interpreted, and frequent calls are detected.
  13. How Ruby’s method JIT works

    While interpretation continues, frequently called methods are compiled to machine code.
  14. How Ruby’s method JIT works

    Compiled methods are called as machine code; the others are still interpreted.
  15. How Ruby’s method JIT works

    More methods are compiled while the rest are still interpreted.
  16. How Ruby’s method JIT works

    Methods are compiled and then called as machine code.
  17. How Ruby’s method JIT works

    Machine code is called.
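
The flow on slides 11 to 17 can be sketched as a toy model in plain Ruby. This is only an illustration: `ToyMethodJIT`, `MIN_CALLS`, and the lambda standing in for machine code are hypothetical, and compilation happens synchronously here, whereas MJIT generates C code and runs gcc or clang.

```ruby
# Toy model of a method JIT: interpret first, compile hot methods, then call
# the compiled version. All names here are hypothetical illustrations.
class ToyMethodJIT
  MIN_CALLS = 3 # assumed threshold, not MJIT's actual setting

  Entry = Struct.new(:body, :calls, :compiled)

  attr_reader :log

  def initialize
    @methods = {}
    @log = []
  end

  def define(name, &body)
    @methods[name] = Entry.new(body, 0, nil)
  end

  def call(name, *args)
    entry = @methods.fetch(name)
    if entry.compiled
      @log << :machine_code
      return entry.compiled.call(*args)
    end

    entry.calls += 1
    # Frequent calls trigger compilation; this call is still interpreted.
    entry.compiled = compile(entry.body) if entry.calls >= MIN_CALLS
    @log << :interpret
    entry.body.call(*args)
  end

  private

  # Stand-in for "generate C, run the C compiler, load the result".
  def compile(body)
    ->(*args) { body.call(*args) }
  end
end

jit = ToyMethodJIT.new
jit.define(:double) { |x| x * 2 }
results = 5.times.map { |i| jit.call(:double, i) }
```

After the third call the method counts as hot, so the fourth and fifth calls go through the "compiled" path while the results stay the same.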

  18. Latest Ruby’s performance benchmarks

  19. Ruby 2.6.0-preview1 https://benchmark-driver.github.io/benchmarks/optcarrot/releases.html

  20. Ruby trunk https://benchmark-driver.github.io/benchmarks/optcarrot/commits.html

  21. Ruby trunk https://benchmark-driver.github.io/benchmarks/optcarrot/commits.html 2.6.0 Preview1 2.6.0 Preview2 ?

  22. Micro benchmark: while 5.7x faster 2.6.0 Preview1 2.6.0 Preview2 ?

    https://benchmark-driver.github.io/benchmarks/mjit/commits.html
  23. But… we’re still far from Ruby 3x3: 39fps to go

    [Chart: Optcarrot with Lan_Master, fps, for Ruby 2.0, trunk, trunk+JIT, RTL+JIT, and Ruby 3x3] https://gist.github.com/k0kubun/7074ad434d0affd1bd98edaaa011ac1d
  24. How to get there?

    Inlining methods alone doesn’t help if the code is too complex. We need more effort to exploit C compiler optimizations. Let’s see what we’ve done so far.
  25. 2.6.0-Preview1 Optimizations

  26. 1. Basic inlining of Ruby method (r62197)

    Ruby code: def foo; bar; end
  27. 1. Basic inlining of Ruby method (r62197)

    The Ruby code is compiled to an ISeq: putself; send :bar, cache: nil; leave
  28. 1. Basic inlining of Ruby method (r62197)

    The Ruby VM interprets the ISeq at the program counter.
  29. 1. Basic inlining of Ruby method (r62197)

    For putself, the VM calls the C code for the instruction: putself() { val = GET_SELF(); }
  30. 1. Basic inlining of Ruby method (r62197)

    The VM returns to interpreting the ISeq at the program counter.
  31. 1. Basic inlining of Ruby method (r62197)

    For send, the VM calls the C code for the instruction: send(cache) { search_method(cache); CALL_METHOD(cache); }
  32. 1. Basic inlining of Ruby method (r62197)

    Which type will be called? Ruby method push, C method call, attr_reader, attr_writer, ...
  33. 1. Basic inlining of Ruby method (r62197)

    The instruction becomes send :bar, cache: Ruby. Store the C function pointer with the class timestamp.
  34. 1. Basic inlining of Ruby method (r62197)

    The interpreter dispatches the Ruby method push by calling the function pointer (the compiler can't optimize it).
  35. 1. Basic inlining of Ruby method (r62197)

    In JIT, we can inline this operation (Ruby method push) by checking the cache in the ISeq.
  36. 1. Basic inlining of Ruby method (r62197)

    Using the “method cache”, we can bypass method dispatch and inline the C function that pushes a Ruby method frame. Once it is inlined, the C compiler can apply various optimizations to the Ruby method call, which is known to be slow. Optcarrot: 53.84fps -> 57.52fps
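
The method cache idea from slides 33 to 36 can be modeled in plain Ruby. This is a sketch under assumptions, not MRI's implementation: `CallSite`, `SERIALS`, and `CallSite.invalidate` are hypothetical stand-ins for the per-instruction call cache and the class timestamp, which MRI keeps in C structures.

```ruby
# Toy inline method cache: remember the method found for a receiver class and
# reuse it while the class timestamp is unchanged.
class CallSite
  # Hypothetical stand-in for the class timestamp: bump it for a class
  # whenever one of its methods is redefined.
  SERIALS = Hash.new(0)

  def self.invalidate(klass)
    SERIALS[klass] += 1
  end

  attr_reader :lookups

  def initialize(method_name)
    @method_name = method_name
    @cached_class = nil
    @cached_serial = nil
    @cached_method = nil
    @lookups = 0
  end

  def call(receiver, *args)
    klass = receiver.class
    serial = SERIALS[klass]
    unless klass.equal?(@cached_class) && serial == @cached_serial
      # Slow path: full method search, then fill the cache.
      @lookups += 1
      @cached_method = klass.instance_method(@method_name)
      @cached_class = klass
      @cached_serial = serial
    end
    # Fast path: the target is already known, so a JIT could inline it here.
    @cached_method.bind(receiver).call(*args)
  end
end

class Greeter
  def bar
    :old
  end
end

site = CallSite.new(:bar)
greeter = Greeter.new
3.times { site.call(greeter) } # one lookup, then two cache hits

class Greeter
  def bar
    :new
  end
end
CallSite.invalidate(Greeter) # the next call must search again
```

Because the cached target is fixed while the timestamp matches, a compiler that sees the cache can treat the call as a direct call, which is what makes inlining possible.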
  37. 2. Bypass Array/Hash check for #[] (r62398)

    optimized_#[](recv, key) {
      if recv.is_a?(Array) { fast_Array#[](recv, key); }
      else if recv.is_a?(Hash) { fast_Hash#[](recv, key); }
      else { dispatch(recv, #[], key); }
    }
  38. 2. Bypass Array/Hash check for #[] (r62398)

    With the same optimized_#[]: array = [1,2,3]; array[1] takes the Array branch.
  39. 2. Bypass Array/Hash check for #[] (r62398)

    With the same optimized_#[]: hash = { foo: 1 }; hash[:foo] takes the Hash branch.
  40. 2. Bypass Array/Hash check for #[] (r62398)

    With the same optimized_#[]: def show; params[:id]; end calls ActionController::Parameters#[], which falls through to dispatch.
  41. 2. Bypass Array/Hash check for #[] (r62398)

    These checks are NOT needed for classes other than Array and Hash.
  42. 2. Bypass Array/Hash check for #[] (r62398)

    JIT-ed code for params[:id] (ActionController::Parameters#[]): jit_#[](recv, key) { dispatch(recv, #[], key); }
  43. 2. Bypass Array/Hash check for #[] (r62398)

    Ruby always optimizes #[] for Array/Hash, but that is suboptimal for other classes. JIT removes the guard for Array/Hash by looking at the call cache, and also inlines pushing a method frame. The same optimization can be applied to other methods later.
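
A minimal Ruby sketch of the guard removal described on slides 37 to 43. The names `generic_aref`, `compile_aref`, and `Params` are hypothetical, and the sketch leaves out the check that the call cache is still valid, which real JIT-ed code needs.

```ruby
# Generic path: roughly what the interpreter's optimized #[] instruction does.
def generic_aref(recv, key)
  if recv.instance_of?(Array)
    recv[key]                  # stands in for the Array fast path
  elsif recv.instance_of?(Hash)
    recv[key]                  # stands in for the Hash fast path
  else
    recv.public_send(:[], key) # ordinary method dispatch
  end
end

# JIT-time decision: when the call cache says the receiver is some other
# class, emit only the dispatch and skip the Array/Hash guards.
def compile_aref(cached_class)
  if cached_class == Array || cached_class == Hash
    ->(recv, key) { generic_aref(recv, key) }
  else
    ->(recv, key) { recv.public_send(:[], key) }
  end
end

# Hypothetical stand-in for a class such as ActionController::Parameters.
class Params
  def initialize(values)
    @values = values
  end

  def [](key)
    @values[key]
  end
end

params_aref = compile_aref(Params)
array_aref = compile_aref(Array)
```

Both compiled lambdas return the same values as the generic path; the difference is only that the `Params` version never evaluates the two type checks.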
  44. 3. Inline Array#[] with Integer (r62388)

    In optimized_#[], fast_Array#[](recv, key) is an extern function. It's not inlined and not optimized well by the compiler.
  45. 3. Inline Array#[] with Integer (r62388)

    optimized_#[](recv, key) {
      if recv.is_a?(Array) {
        if key.is_a?(Integer) { Array#[Integer](recv, key); } // inline
        else { fast_Array#[](recv, key); } // extern
      }
      else if recv.is_a?(Hash) { fast_Hash#[](recv, key); }
      else { dispatch(recv, #[], key); }
    }
    This special path is inlined and optimized well on JIT.
  46. 3. Inline Array#[] with Integer (r62388)

    Currently the “JIT header” has limited definitions of C functions in Ruby core. I inlined a part of the Array#[] definition, and then the C compiler could optimize the code. Optcarrot: 54.93fps -> 58.41fps
  47. 2.6.0-Preview1 wrap up

    I mainly worked on portability, stability, and maintainability: fixing SEGVs and deadlocks, removing broken optimizations… There were only 3 notable optimizations, so it wasn't fast yet.
  48. 2.6.0-Preview2 Optimizations

  49. 1. Use C local variable for VM stack (r62655)

    Ruby code: def three; 1 + 2; end
  50. 1. Use C local variable for VM stack (r62655)

    ISeq: putobject 1; putobject 2; opt_plus; leave
  51. 1. Use C local variable for VM stack (r62655)

    The Ruby VM has a program counter and a stack pointer. VM stack: empty
  52. 1. Use C local variable for VM stack (r62655)

    VM stack: 1
  54. 1. Use C local variable for VM stack (r62655)

    VM stack: 1 2
  56. 1. Use C local variable for VM stack (r62655)

    VM stack: 3
  58. 1. Use C local variable for VM stack (r62655)

    VM stack: 3. How to skip the stack pointer motion in JIT?
  59. 1. Use C local variable for VM stack (r62655)

    JIT-ed code: before, built up one instruction at a time. jit_three() { }
  60. 1. Use C local variable for VM stack (r62655)

    putobject 1 adds: *sp = 1; sp++;
  61. 1. Use C local variable for VM stack (r62655)

    putobject 2 adds: *sp = 2; sp++;
  62. 1. Use C local variable for VM stack (r62655)

    opt_plus adds: *(sp-2) = opt_plus(*(sp-2), *(sp-1)); sp--;
  63. 1. Use C local variable for VM stack (r62655)

    leave adds: return *(sp-1);
  64. 1. Use C local variable for VM stack (r62655)

    JIT-ed code: before
    jit_three() { *sp = 1; sp++; *sp = 2; sp++; *(sp-2) = opt_plus(*(sp-2), *(sp-1)); sp--; return *(sp-1); }
    JIT-ed code: after
    jit_three() { VALUE stack[2]; stack[0] = 1; stack[1] = 2; stack[0] = opt_plus(stack[0], stack[1]); return stack[0]; }
  65. 1. Use C local variable for VM stack (r62655)

    In the "after" code, stack is an array local variable. This seems okay for just "1 + 2", but...
  66. 1. Use C local variable for VM stack (r62655)

    Ruby code: def err; raise 'error'; end and def three; 1 + (err rescue 2); end
  67. 1. Use C local variable for VM stack (r62655)

    Both err and three are JIT-ed. Call stack in C: main(). VM stack: empty
  68. 1. Use C local variable for VM stack (r62655)

    Call stack in C: main(), ruby_vm() (setjmp called). VM stack: empty
  69. 1. Use C local variable for VM stack (r62655)

    Call stack in C: main(), ruby_vm(), jit_three(), with stack[nil, nil] in jit_three(). VM stack: empty
  70. 1. Use C local variable for VM stack (r62655)

    Push 1 to the array local variable: stack[1, nil] in jit_three(). VM stack: empty
  71. 1. Use C local variable for VM stack (r62655)

    jit_three() calls jit_err(). VM stack: empty
  72. 1. Use C local variable for VM stack (r62655)

    jit_err() calls rb_raise(), which calls longjmp. VM stack: empty
  73. 1. Use C local variable for VM stack (r62655)

    longjmp purges the JIT-ed frames. VM stack: empty
  74. 1. Use C local variable for VM stack (r62655)

    After the rescue, 2 is pushed. VM stack: empty, 2
  75. 1. Use C local variable for VM stack (r62655)

    The VM stack doesn't have 2 values => SEGV. The 1 is expired.
  76. 1. Use C local variable for VM stack (r62655)

    When a "catch table" (rescue, ensure, etc.) does not exist, we don't need to resurrect stack values on exception. So, only in that case, we can use plain C local variables to reproduce the Ruby VM stack. The stack pointer is not moved and the compiler can inline values. Optcarrot: 57.13fps -> 62.14fps
  77. 2. Bypass setjmp for yield (r62643)

    setjmp is slow. If JIT-ed code is called directly from the VM (no C function frames have been created yet), we don’t need to call setjmp again. Now yield is 1.3x faster than in the non-JIT-ed case.
  78. 3. Skip moving program counter (r62678)

    Ruby code: def err; raise 'error'; end and def three; 1 + (err rescue 2); end
  79. 3. Skip moving program counter (r62678)

    Ruby call stack: #three, with its program counter.
  82. 3. Skip moving program counter (r62678)

    Ruby call stack: #three and #err, each with its own program counter.
  84. 3. Skip moving program counter (r62678)

    Ruby call stack: #three, #err, and #raise, each with its own program counter. #raise calls longjmp.
  85. 3. Skip moving program counter (r62678)

    The program counter is used to resurrect the position after longjmp.
  86. 3. Skip moving program counter (r62678)

    As with the stack values, we skip moving the program counter only when a catch table (rescue, ensure, etc.) does not exist. Optcarrot: 64.92fps -> 68.08fps
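
The two catch-table-dependent optimizations (slides 76 and 86) can be illustrated with a toy code generator written in Ruby. It is only a sketch: `gen_c`, the instruction tuples, and the emitted `pc = N;` text are hypothetical, and it covers just the three instructions used by `three`. The emitted C text follows the before/after code on slide 64.

```ruby
# Toy generator: emits C-like text for a tiny subset of YARV instructions.
# With a catch table it keeps the stack pointer and program counter moving;
# without one it uses a local array and drops the program counter updates.
def gen_c(insns, has_catch_table:)
  depth = 0 # stack depth, known at compile time
  insns.each_with_index.map do |(op, arg), pc|
    body =
      case op
      when :putobject
        depth += 1
        has_catch_table ? "*sp = #{arg}; sp++;" : "stack[#{depth - 1}] = #{arg};"
      when :opt_plus
        depth -= 1
        if has_catch_table
          "*(sp-2) = opt_plus(*(sp-2), *(sp-1)); sp--;"
        else
          "stack[#{depth - 1}] = opt_plus(stack[#{depth - 1}], stack[#{depth}]);"
        end
      when :leave
        has_catch_table ? "return *(sp-1);" : "return stack[#{depth - 1}];"
      else
        raise ArgumentError, "unsupported instruction: #{op}"
      end
    # The program counter is only kept up to date when an exception handler
    # might need it to resume execution.
    has_catch_table ? "pc = #{pc}; #{body}" : body
  end
end

three = [[:putobject, 1], [:putobject, 2], [:opt_plus], [:leave]]
fast = gen_c(three, has_catch_table: false)
slow = gen_c(three, has_catch_table: true)
```

In the `fast` output every stack slot is a fixed array index, so a C compiler can keep the values in registers or fold `1 + 2` entirely; the `slow` output has to write through `sp` and update `pc` so that a rescue can find its state after longjmp.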
  87. 4. Force inlining arithmetic instructions (r62677)

    The C compiler has a function-size threshold for inlining. Some Ruby instructions (+, -, *, /, ...) are too large to be inlined by default, so I applied an "always inline" attribute. In the future, we should reduce the size of the code instead. Optcarrot: 60.19fps -> 64.92fps
  88. 5. Force inlining ivar instructions (r62693)

    Not only the arithmetic instructions but also the instructions for instance variables are large, so I force-inlined them too. Optcarrot: 67.04fps -> 68.20fps
  89. 6. Disable stack consistency check (r63092)

    The Ruby VM always asserts the stack size when returning from a method, and that is slow. We can skip it in JIT-ed code because it's already checked by the VM. Optcarrot: 67.43fps -> 69.92fps
  90. 7. Inline attr_reader method call (r63212)

    Ruby code: def foo; bar; end. ISeq: putself; send :bar, cache: nil; leave. C code for instruction: send(cache) { search_method(cache); CALL_METHOD(cache); }. The call may be a Ruby method push, C method call, attr_reader, attr_writer, ...
  91. 7. Inline attr_reader method call (r63212)

    The instruction becomes send :bar, cache: attr.
  92. 7. Inline attr_reader method call (r63212)

    For attr_reader, the call ends up in get_instance_variable().
  93. 7. Inline attr_reader method call (r63212)

    Using the call cache in the same way as for Ruby methods, we can fully inline attr_reader without a large compilation time. The cost becomes the same as referencing a normal instance variable. Calling attr_reader became 4x faster.
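
A small Ruby sketch of the attr_reader case. The `:attr` and `:ruby` cache tags and `compile_send` are hypothetical names used for illustration; in MRI the method type is recorded in the call cache in C.

```ruby
class Point
  attr_reader :x

  def initialize(x)
    @x = x
  end

  def double_x
    x * 2
  end
end

# Toy "compile" step: pick the code to emit from the cached method type.
def compile_send(name, cache_type)
  case cache_type
  when :attr
    ivar = :"@#{name}"
    # attr_reader: emit a direct instance variable read, no method frame.
    ->(recv) { recv.instance_variable_get(ivar) }
  else
    # Anything else: fall back to ordinary dispatch.
    ->(recv) { recv.public_send(name) }
  end
end

point = Point.new(21)
read_x = compile_send(:x, :attr)
call_double = compile_send(:double_x, :ruby)
```

The `:attr` version does the same work as reading `@x` directly, which matches the slide's point that the cost becomes that of a normal instance variable reference.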
  94. 2.6.0-Preview2 (trunk) wrap up

    I've mainly worked on performance, because a JIT is useless if it's slow. The generated code is much simpler and faster after removing program counter and stack pointer motions. But it still has some complexity, which blocks significant performance improvement from Ruby method inlining.
  95. Future of Ruby's JIT

  96. 1. Deoptimization by longjmp

    We can generate aggressive code and cancel all JIT-ed calls by longjmp when something unexpected happens. I’m going to remove the guard for TracePoint and cancel the JIT-ed code later instead. It should also be used when all method caches are purged.
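
The cancellation idea can be modeled in Ruby, with catch/throw standing in for setjmp/longjmp. `ToyDeopt` and all of its methods are hypothetical. The sketch also simplifies in two ways: the broken assumption is noticed by a check in the innermost frame, and the method is simply re-run in the interpreter from the start, which is only safe because this toy code has no side effects.

```ruby
# Toy deoptimization: one throw unwinds every JIT-ed frame at once, and the
# VM falls back to the interpreter.
module ToyDeopt
  @assumption_valid = true
  @log = []

  class << self
    attr_accessor :assumption_valid
    attr_reader :log

    # Innermost JIT-ed frame: bails out when the assumption no longer holds.
    def jit_inner(x)
      throw :cancel_jit unless assumption_valid
      x + 1
    end

    # Outer JIT-ed frame: contains no recovery code of its own.
    def jit_outer(x)
      jit_inner(x) * 2
    end

    # Interpreter version of the same method.
    def interp_outer(x)
      (x + 1) * 2
    end

    # VM entry point: catch plays the role of setjmp.
    def call_outer(x)
      cancelled = true
      value = catch(:cancel_jit) do
        result = jit_outer(x)
        cancelled = false
        result
      end
      if cancelled
        log << :interpreted
        interp_outer(x)
      else
        log << :jit
        value
      end
    end
  end
end

first = ToyDeopt.call_outer(3)    # runs the JIT-ed path
ToyDeopt.assumption_valid = false # e.g. something unexpected happened
second = ToyDeopt.call_outer(3)   # unwinds, then runs the interpreter
```

The outer frame needs no guard or cleanup code, which is the attraction of cancelling from one place instead of checking in every compiled method.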
  97. 2. Instruction specialization for types

    Currently the same code is generated for both Hash#[] and Array#[]. We need some instrumentation to detect the type that is passed to an optimized instruction. Vladimir's RTL instructions achieve this by dynamically modifying instructions.
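
One possible shape for such instrumentation, sketched in Ruby. `ArefSite` and the `:array_aref`, `:hash_aref`, and `:generic` labels are hypothetical; this is not how RTL or MJIT record types.

```ruby
# Toy call-site instrumentation: count receiver classes seen by a #[] call
# site, then pick a specialized instruction only if a single class was seen.
class ArefSite
  attr_reader :seen

  def initialize
    @seen = Hash.new(0)
  end

  def call(recv, key)
    @seen[recv.class] += 1 # instrumentation
    recv[key]
  end

  def specialization
    return :generic unless @seen.size == 1

    klass = @seen.keys.first
    if klass == Array
      :array_aref
    elsif klass == Hash
      :hash_aref
    else
      :generic
    end
  end
end

site = ArefSite.new
site.call([10, 20, 30], 1)
site.call([1], 0)
mono = site.specialization # only Array seen so far
site.call({ foo: 1 }, :foo)
poly = site.specialization # Array and Hash seen
```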
  98. 3. Multi-tier JIT

    Some other languages have multiple stages of JIT. Depending on how frequently a method is called, it may be better to balance compilation time against optimization level. Vladimir is working on light JIT compilation. Sometimes people deploy an application every 10 minutes.
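
As a rough illustration of the trade-off, tier selection by call count might look like the following. The tier names and thresholds are invented for this sketch and do not describe any existing Ruby JIT.

```ruby
# Hypothetical tiers: each entry is [minimum call count, tier name].
module ToyTiers
  THRESHOLDS = [
    [0,    :interpreter],
    [10,   :light_jit],      # cheap compilation, modest speedup
    [1000, :optimizing_jit], # slow compilation, best code
  ].freeze

  # Pick the highest tier whose threshold the call count has reached.
  def self.tier_for(call_count)
    THRESHOLDS.select { |min_calls, _tier| call_count >= min_calls }.last.last
  end
end

tiers = [0, 10, 999, 1000].map { |calls| ToyTiers.tier_for(calls) }
```

A process that is redeployed every 10 minutes may never get enough calls to repay the expensive tier, which is why a cheaper tier in between can matter.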
  99. 4. Profile-guided JIT

    C compilers have a feature to profile compiled code and generate faster code using the profiling result. With a multi-tier JIT, we may be able to profile code in the first tier and generate faster code in the second tier.
  100. 5. Better JIT scheduler for Rails

    In Rails, an application becomes slower only while JIT compilation is happening. A possible cause is the number of methods to be JIT-ed, which is large compared to some other benchmarks. Possibly we should reduce the number of methods to be JIT-ed or reduce the frequency of JIT compilation.
  101. 6. Ruby / C method inlining

    I already succeeded in implementing Ruby method inlining, but it increases compilation time. I have ideas for implementing C method inlining, but the question of which methods to inline should be solved first.
  102. 7. Exploit more C compiler optimizations

    Loop invariant motion; folding Ruby's constants; type check removal by type inference; reducing unnecessary memory accesses to VM registers.
  103. Conclusion

    2.6.0-preview2 will be much faster than 2.6.0-preview1 (still not ready for Rails). We still have many things to do for Ruby 3x3.