Method JIT Compiler for MRI

Slide 1

Slide 1 text

Method JIT Compiler for MRI
~ Optimizations in Ruby 2.6.0 preview1, 2 ~
RubyElixirConf Taiwan 2018
@k0kubun / Treasure Data Inc.

Slide 2

Slide 2 text

@k0kubun Treasure Data Inc. ERB maintainer, developing Ruby’s JIT

Slide 3

Slide 3 text

The history of MRI JIT

Slide 4

Slide 4 text

March 2017: RTL & MJIT

Slide 5

Slide 5 text

October 2017: YARV-MJIT
Chart: Optcarrot (fps) for Ruby 2.0, Ruby 2.5, YARV-MJIT, RTL MJIT
https://github.com/k0kubun/yarv-mjit/tree/master-171211#optcarrot-benchmark

Slide 6

Slide 6 text

February 2018: Merge MJIT infrastructure

Slide 7

Slide 7 text

February 2018: Released in 2.6.0-preview1

Slide 8

Slide 8 text

How does it work?

Slide 9

Slide 9 text

Optionally enabled by "--jit"
Tip: RUBYOPT="--jit" ruby … works too
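A small sketch for checking, from inside a script, whether the process was started with the JIT on. It assumes the `RubyVM::MJIT.enabled?` method found in released Ruby 2.6 through 3.2; other versions and implementations do not have it, hence the `defined?` guard. The helper name `jit_enabled?` is hypothetical.

```ruby
# Returns true only when MJIT is available and was switched on (e.g. via --jit).
# RubyVM::MJIT is an assumption about the running Ruby (2.6 to 3.2), so the
# constant lookup is guarded and everything else reports false.
def jit_enabled?
  return false unless defined?(RubyVM::MJIT)
  return false unless RubyVM::MJIT.respond_to?(:enabled?)

  RubyVM::MJIT.enabled? ? true : false
end

puts "JIT enabled: #{jit_enabled?}"
```

Running the same script with and without the option is a quick way to confirm which mode a benchmark actually ran in.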

Slide 10

Slide 10 text

New runtime dependency: gcc / clang

Slide 11

Slide 11 text

How Ruby’s method JIT works
Diagram labels: Methods, Interpret

Slide 12

Slide 12 text

How Ruby’s method JIT works
Diagram labels: Methods, Interpret, Frequent calls!

Slide 13

Slide 13 text

How Ruby’s method JIT works
Diagram labels: Methods, Compile, Machine code, Interpret

Slide 14

Slide 14 text

How Ruby’s method JIT works
Diagram labels: Methods, Machine code, Interpret, Call

Slide 15

Slide 15 text

How Ruby’s method JIT works
Diagram labels: Methods, Machine code, Interpret, Call, Compile

Slide 16

Slide 16 text

How Ruby’s method JIT works
Diagram labels: Methods, Machine code, Call, Compile

Slide 17

Slide 17 text

How Ruby’s method JIT works
Diagram labels: Machine code, Call
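The flow in the diagrams above (interpret, notice frequent calls, compile, then call machine code) can be sketched as a toy model. This is not MJIT's implementation: `ToyMethod`, `COMPILE_THRESHOLD`, and the synchronous "compile" step are all assumptions made for illustration, and a lambda stands in for machine code.

```ruby
# Toy model of a method JIT: interpret first, switch to "machine code"
# once a method has been called often enough.
class ToyMethod
  COMPILE_THRESHOLD = 5 # hypothetical value, chosen for the demo

  attr_reader :calls

  def initialize(&body)
    @body = body     # stands in for the method's bytecode
    @calls = 0
    @compiled = nil  # stands in for generated machine code
  end

  def call(*args)
    @calls += 1
    return @compiled.call(*args) if @compiled # fast path: call machine code

    # Frequent calls detected: "compile" so that later calls take the fast path.
    @compiled = compile if @calls >= COMPILE_THRESHOLD
    interpret(*args) # this call is still interpreted
  end

  def compiled?
    !@compiled.nil?
  end

  private

  def interpret(*args)
    @body.call(*args)
  end

  # A real JIT would generate C and run gcc/clang; here we only wrap the body.
  def compile
    body = @body
    ->(*args) { body.call(*args) }
  end
end

double = ToyMethod.new { |x| x * 2 }
results = Array.new(7) { |i| double.call(i) }
puts results.inspect
puts "compiled after #{double.calls} calls: #{double.compiled?}"
```

The results are the same on both paths; only the route taken to compute them changes after the threshold is reached.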

Slide 18

Slide 18 text

Latest Ruby’s performance benchmarks

Slide 19

Slide 19 text

Ruby 2.6.0-preview1 https://benchmark-driver.github.io/benchmarks/optcarrot/releases.html

Slide 20

Slide 20 text

Ruby trunk https://benchmark-driver.github.io/benchmarks/optcarrot/commits.html

Slide 21

Slide 21 text

Ruby trunk
Chart annotations: 2.6.0 Preview1, 2.6.0 Preview2 ?
https://benchmark-driver.github.io/benchmarks/optcarrot/commits.html

Slide 22

Slide 22 text

Micro benchmark: while
5.7x faster
Chart annotations: 2.6.0 Preview1, 2.6.0 Preview2 ?
https://benchmark-driver.github.io/benchmarks/mjit/commits.html
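The benchmark script itself is not shown on the slide. As a rough idea of what a `while` micro benchmark looks like, here is a minimal sketch; the method name `count_up` and the iteration count are assumptions, and the timing it prints depends entirely on the machine and Ruby build.

```ruby
require 'benchmark'

# Counts up to `limit` with a bare while loop, the kind of tight loop
# a method JIT can turn into a simple native loop.
def count_up(limit)
  i = 0
  while i < limit
    i += 1
  end
  i
end

elapsed = Benchmark.realtime { count_up(1_000_000) }
puts format('while loop: %.4f s', elapsed)
```

Comparing the printed time between `ruby script.rb` and `ruby --jit script.rb` gives a local, unscientific version of the comparison on the chart.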

Slide 23

Slide 23 text

Chart: Optcarrot (fps) for Ruby 2.0, trunk, trunk+JIT, RTL+JIT, Ruby 3x3
But… we’re still far from Ruby 3x3: 39fps to go
https://gist.github.com/k0kubun/7074ad434d0affd1bd98edaaa011ac1d

Slide 24

Slide 24 text

How to get there?
Just inlining methods doesn’t help if the code is too complex.
We need more effort to exploit C compiler optimizations.
Let’s see what we’ve done so far.

Slide 25

Slide 25 text

2.6.0-Preview1 Optimizations

Slide 26

Slide 26 text

1. Basic inlining of Ruby method (r62197)
Ruby code: def foo; bar; end

Slide 27

Slide 27 text

1. Basic inlining of Ruby method (r62197)
Ruby code: def foo; bar; end
Compile to ISeq: putself; send :bar, cache: nil; leave

Slide 28

Slide 28 text

1. Basic inlining of Ruby method (r62197)
Ruby code: def foo; bar; end
ISeq: putself; send :bar, cache: nil; leave
Ruby VM (program counter): Interpret

Slide 29

Slide 29 text

1. Basic inlining of Ruby method (r62197)
Ruby code: def foo; bar; end
ISeq: putself; send :bar, cache: nil; leave
Ruby VM (program counter): Call
C code for instruction: putself() { val = GET_SELF(); }

Slide 30

Slide 30 text

1. Basic inlining of Ruby method (r62197)
Ruby code: def foo; bar; end
ISeq: putself; send :bar, cache: nil; leave
Ruby VM (program counter): Interpret

Slide 31

Slide 31 text

1. Basic inlining of Ruby method (r62197)
Ruby code: def foo; bar; end
ISeq: putself; send :bar, cache: nil; leave
Ruby VM (program counter): Call
C code for instruction: send(cache) { search_method(cache); CALL_METHOD(cache); }

Slide 32

Slide 32 text

Slide 33

Slide 33 text

1. Basic inlining of Ruby method (r62197)
Ruby code: def foo; bar; end
ISeq: putself; send :bar, cache: Ruby; leave
Ruby VM (program counter): Call
C code for instruction: send(cache) { search_method(cache); CALL_METHOD(cache); }
Store C function pointer w/ class timestamp
Function candidates: Ruby method push, C method call, attr_reader, attr_writer, ...

Slide 34

Slide 34 text

1. Basic inlining of Ruby method (r62197)
Ruby code: def foo; bar; end
ISeq: putself; send :bar, cache: Ruby; leave
Ruby VM (program counter): Call
C code for instruction: send(cache) { search_method(cache); CALL_METHOD(cache); }
Cached function: Ruby method push
Dispatch it by calling function pointer (compiler can't optimize)

Slide 35

Slide 35 text

1. Basic inlining of Ruby method (r62197)
Ruby code: def foo; bar; end
ISeq: putself; send :bar, cache: Ruby; leave
Ruby VM (program counter): Call
C code for instruction: send(cache) { search_method(cache); CALL_METHOD(cache); }
Cached function: Ruby method push
In JIT, we can inline this operation (Ruby method push) by checking cache in ISeq

Slide 36

Slide 36 text

1. Basic inlining of Ruby method (r62197)
Using the “method cache”, we can bypass method dispatch and inline the C function that pushes a Ruby method frame.
Once it is inlined, the C compiler can apply various optimizations to Ruby method calls, which are known to be slow.
Optcarrot: 53.84fps -> 57.52fps
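To make the "method cache" idea concrete, here is a toy model of a per-call-site cache that is validated with a class timestamp. It is a sketch in Ruby, not the VM's C implementation: `ToyClass`, `CallSite`, and the `lookups` counter are hypothetical names. JIT-ed code would perform the equivalent validity check inline and then run the cached operation directly.

```ruby
# Toy inline method cache: one cache per call site, validated by the
# receiver's class and a "class timestamp" that changes on redefinition.
class ToyClass
  attr_reader :timestamp

  def initialize
    @methods = {}
    @timestamp = 0
  end

  def define(name, &body)
    @methods[name] = body
    @timestamp += 1 # invalidates every cache entry made before
  end

  def search_method(name)
    @methods.fetch(name)
  end
end

class CallSite
  attr_reader :lookups

  def initialize(name)
    @name = name
    @cached_class = nil
    @cached_timestamp = nil
    @cached_method = nil
    @lookups = 0 # how many slow lookups happened
  end

  def call(klass, *args)
    unless @cached_class.equal?(klass) && @cached_timestamp == klass.timestamp
      @lookups += 1
      @cached_method = klass.search_method(@name) # slow path: method search
      @cached_class = klass
      @cached_timestamp = klass.timestamp
    end
    @cached_method.call(*args) # fast path: no method search
  end
end

klass = ToyClass.new
klass.define(:bar) { 1 }
site = CallSite.new(:bar)
3.times { site.call(klass) }
puts "lookups after 3 calls: #{site.lookups}"
klass.define(:bar) { 2 }
puts "result after redefinition: #{site.call(klass)}"
puts "lookups after redefinition: #{site.lookups}"
```

The timestamp comparison is what keeps the fast path safe: as soon as the class changes, the stale entry is refused and the slow lookup runs again.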

Slide 37

Slide 37 text

Slide 38

Slide 38 text

2. Bypass Array/Hash check for #[] (r62398)
Ruby code: array = [1,2,3]; array[1]
optimized_#[](recv, key) {
  if recv.is_a?(Array) {
    fast_Array#[](recv, key);
  } else if recv.is_a?(Hash) {
    fast_Hash#[](recv, key);
  } else {
    dispatch(recv, #[], key);
  }
}

Slide 39

Slide 39 text

Slide 40

Slide 40 text

Slide 41

Slide 41 text

Slide 42

Slide 42 text

2. Bypass Array/Hash check for #[] (r62398)
Ruby code: def show; params[:id]; end
Called method: ActionController::Parameters#[]
jit_#[](recv, key) { dispatch(recv, #[], key); }

Slide 43

Slide 43 text

2. Bypass Array/Hash check for #[] (r62398)
Ruby always optimizes #[] for Array/Hash, but that is suboptimal for other classes.
JIT removes the Array/Hash guard by looking at the call cache, and also inlines pushing a method frame.
The same optimization can be applied to other methods later.
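The difference between the generic path and the JIT-ed path can be imitated in plain Ruby. This is only an illustration of the control flow: `optimized_aref`, `jit_aref`, the `trace` array, and the `Params` class (a stand-in for ActionController::Parameters) are all hypothetical.

```ruby
# Toy version of the generic VM path for #[]: check Array, then Hash,
# and only then fall back to a normal method dispatch.
def optimized_aref(recv, key, trace)
  if recv.is_a?(Array)
    trace << :array_guard
    recv[key]
  elsif recv.is_a?(Hash)
    trace << :array_guard << :hash_guard
    recv[key]
  else
    trace << :array_guard << :hash_guard << :dispatch
    recv.public_send(:[], key)
  end
end

# Toy version of JIT-ed code for a call site whose cache says the receiver
# is some other class: dispatch directly, without the Array/Hash guards.
def jit_aref(recv, key, trace)
  trace << :dispatch
  recv.public_send(:[], key)
end

# Hypothetical stand-in for ActionController::Parameters.
class Params
  def initialize(hash)
    @hash = hash
  end

  def [](key)
    @hash[key]
  end
end

params = Params.new(id: 42)
generic_trace = []
jit_trace = []
puts optimized_aref(params, :id, generic_trace)
puts jit_aref(params, :id, jit_trace)
puts "generic path: #{generic_trace.inspect}"
puts "JIT-ed path:  #{jit_trace.inspect}"
```

Both calls return the same value; the traces show that the specialized version skips the two type checks that can never succeed for this receiver.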

Slide 44

Slide 44 text

3. Inline Array#[] with Integer (r62388)
optimized_#[](recv, key) {
  if recv.is_a?(Array) {
    fast_Array#[](recv, key); // extern
  } else if recv.is_a?(Hash) {
    fast_Hash#[](recv, key);
  } else {
    dispatch(recv, #[], key);
  }
}
The extern call is not inlined, so the compiler cannot optimize it well

Slide 45

Slide 45 text

3. Inline Array#[] with Integer (r62388)
optimized_#[](recv, key) {
  if recv.is_a?(Array) {
    if key.is_a?(Integer) {
      Array#[Integer](recv, key); // inline
    } else {
      fast_Array#[](recv, key); // extern
    }
  } else if recv.is_a?(Hash) {
    fast_Hash#[](recv, key);
  } else {
    dispatch(recv, #[], key);
  }
}
This special path is inlined and optimized well on JIT

Slide 46

Slide 46 text

3. Inline Array#[] with Integer (r62388)
Currently the “JIT header” has limited definitions of C functions in Ruby core.
I inlined a part of the Array#[] definition, and then the C compiler could optimize the code.
Optcarrot: 54.93fps -> 58.41fps

Slide 47

Slide 47 text

2.6.0-Preview1 wrap up
I mainly worked on portability, stability, and maintainability.
Fixed SEGVs and deadlocks, removed broken optimizations…
There were only 3 notable optimizations, so it wasn't fast yet.

Slide 48

Slide 48 text

2.6.0-Preview2 Optimizations

Slide 49

Slide 49 text

1. Use C local variable for VM stack (r62655)
Ruby code: def three; 1 + 2; end

Slide 50

Slide 50 text

1. Use C local variable for VM stack (r62655)
Ruby code: def three; 1 + 2; end
ISeq: putobject 1; putobject 2; opt_plus; leave

Slide 51

Slide 51 text

1. Use C local variable for VM stack (r62655)
Ruby code: def three; 1 + 2; end
ISeq: putobject 1; putobject 2; opt_plus; leave
Ruby VM: program counter, stack pointer
VM stack: (empty)

Slide 52

Slide 52 text

1. Use C local variable for VM stack (r62655)
Ruby code: def three; 1 + 2; end
ISeq: putobject 1; putobject 2; opt_plus; leave
Ruby VM: program counter, stack pointer
VM stack: 1

Slide 53

Slide 53 text

1. Use C local variable for VM stack (r62655)
Ruby code: def three; 1 + 2; end
ISeq: putobject 1; putobject 2; opt_plus; leave
Ruby VM: program counter, stack pointer
VM stack: 1

Slide 54

Slide 54 text

1. Use C local variable for VM stack (r62655)
Ruby code: def three; 1 + 2; end
ISeq: putobject 1; putobject 2; opt_plus; leave
Ruby VM: program counter, stack pointer
VM stack: 1, 2

Slide 55

Slide 55 text

1. Use C local variable for VM stack (r62655)
Ruby code: def three; 1 + 2; end
ISeq: putobject 1; putobject 2; opt_plus; leave
Ruby VM: program counter, stack pointer
VM stack: 1, 2

Slide 56

Slide 56 text

1. Use C local variable for VM stack (r62655)
Ruby code: def three; 1 + 2; end
ISeq: putobject 1; putobject 2; opt_plus; leave
Ruby VM: program counter, stack pointer
VM stack: 3

Slide 57

Slide 57 text

1. Use C local variable for VM stack (r62655)
Ruby code: def three; 1 + 2; end
ISeq: putobject 1; putobject 2; opt_plus; leave
Ruby VM: program counter, stack pointer
VM stack: 3

Slide 58

Slide 58 text

1. Use C local variable for VM stack (r62655)
Ruby code: def three; 1 + 2; end
ISeq: putobject 1; putobject 2; opt_plus; leave
Ruby VM: program counter, stack pointer
VM stack: 3
How to skip the stack pointer motion in JIT?

Slide 59

Slide 59 text

1. Use C local variable for VM stack (r62655)
Ruby code: def three; 1 + 2; end
ISeq: putobject 1; putobject 2; opt_plus; leave
JIT-ed code: before
jit_three() { }

Slide 60

Slide 60 text

1. Use C local variable for VM stack (r62655)
Ruby code: def three; 1 + 2; end
ISeq: putobject 1; putobject 2; opt_plus; leave
JIT-ed code: before
jit_three() {
  *sp = 1; sp++;
}

Slide 61

Slide 61 text

1. Use C local variable for VM stack (r62655)
Ruby code: def three; 1 + 2; end
ISeq: putobject 1; putobject 2; opt_plus; leave
JIT-ed code: before
jit_three() {
  *sp = 1; sp++;
  *sp = 2; sp++;
}

Slide 62

Slide 62 text

Slide 63

Slide 63 text

Slide 64

Slide 64 text

1. Use C local variable for VM stack (r62655)
Ruby code: def three; 1 + 2; end
ISeq: putobject 1; putobject 2; opt_plus; leave
JIT-ed code: before
jit_three() {
  *sp = 1; sp++;
  *sp = 2; sp++;
  *(sp-2) = opt_plus(*(sp-2), *(sp-1)); sp--;
  return *(sp-1);
}
JIT-ed code: after
jit_three() {
  VALUE stack[2];
  stack[0] = 1;
  stack[1] = 2;
  stack[0] = opt_plus(stack[0], stack[1]);
  return stack[0];
}

Slide 65

Slide 65 text

Slide 66

Slide 66 text

1. Use C local variable for VM stack (r62655)
Ruby code:
  def err; raise 'error'; end
  def three; 1 + (err rescue 2); end

Slide 67

Slide 67 text

1. Use C local variable for VM stack (r62655)
Ruby code:
  def err; raise 'error'; end  # JIT-ed
  def three; 1 + (err rescue 2); end  # JIT-ed
Call stack in C: main()
VM stack: (empty)

Slide 68

Slide 68 text

1. Use C local variable for VM stack (r62655)
Ruby code:
  def err; raise 'error'; end  # JIT-ed
  def three; 1 + (err rescue 2); end  # JIT-ed
Call stack in C: main(), ruby_vm() (setjmp called)
VM stack: (empty)

Slide 69

Slide 69 text

1. Use C local variable for VM stack (r62655)
Ruby code:
  def err; raise 'error'; end  # JIT-ed
  def three; 1 + (err rescue 2); end  # JIT-ed
Call stack in C: main(), ruby_vm() (setjmp called), jit_three()
stack[nil, nil] in jit_three()
VM stack: (empty)

Slide 70

Slide 70 text

Slide 71

Slide 71 text

Slide 72

Slide 72 text

Slide 73

Slide 73 text

Slide 74

Slide 74 text

Slide 75

Slide 75 text

Slide 76

Slide 76 text

1. Use C local variable for VM stack (r62655)
When a "catch table" (rescue, ensure, etc.) does not exist, we don't need to resurrect stack values on an exception.
So, only when no catch table exists, we can use plain C local variables to reproduce the stack of the Ruby VM.
The stack pointer is not moved, and the compiler can inline values.
Optcarrot: 57.13fps -> 62.14fps
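The before/after contrast can be replayed in Ruby as a toy. `interpret` below walks a hand-written instruction list for `1 + 2` while moving a stack pointer, and `jit_three` mirrors the "after" shape, where each stack slot is just a local variable. The instruction encoding, the `sp_moves` counter, and both method names are assumptions made for this sketch; real JIT-ed code is C.

```ruby
# Hand-written instruction list for `def three; 1 + 2; end`.
ISEQ = [[:putobject, 1], [:putobject, 2], [:opt_plus], [:leave]].freeze

# Stack-machine style: every push and pop moves the stack pointer.
# Returns [result, number of stack pointer moves].
def interpret(iseq)
  stack = []
  sp = 0 # index of the next free slot
  sp_moves = 0
  iseq.each do |insn, operand|
    case insn
    when :putobject
      stack[sp] = operand
      sp += 1
      sp_moves += 1
    when :opt_plus
      stack[sp - 2] = stack[sp - 2] + stack[sp - 1]
      sp -= 1
      sp_moves += 1
    when :leave
      return [stack[sp - 1], sp_moves]
    end
  end
  nil # no :leave instruction found
end

# "After" shape: stack slots become local variables, so there is
# no stack pointer to move at all.
def jit_three
  stack0 = 1
  stack1 = 2
  stack0 += stack1
  stack0
end

result, moves = interpret(ISEQ)
puts "interpreted: #{result} (#{moves} stack pointer moves)"
puts "locals only: #{jit_three}"
```

With plain locals there is nothing for the C compiler to keep in memory, which is why it can fold such code down to constants; the stack-pointer version hides the same values behind pointer arithmetic.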

Slide 77

Slide 77 text

2. Bypass setjmp for yield (r62643)
setjmp is slow.
If JIT-ed code is called directly from the VM (no C function frames have been created yet), we don’t need to call setjmp again.
Now yield is 1.3x faster than in the non-JIT-ed case.

Slide 78

Slide 78 text

3. Skip moving program counter (r62678)
Ruby code:
  def err; raise 'error'; end
  def three; 1 + (err rescue 2); end

Slide 79

Slide 79 text

3. Skip moving program counter (r62678)
Ruby code:
  def err; raise 'error'; end
  def three; 1 + (err rescue 2); end
Ruby call stack: #three (program counter)

Slide 80

Slide 80 text

3. Skip moving program counter (r62678)
Ruby code:
  def err; raise 'error'; end
  def three; 1 + (err rescue 2); end
Ruby call stack: #three (program counter)

Slide 81

Slide 81 text

3. Skip moving program counter (r62678)
Ruby code:
  def err; raise 'error'; end
  def three; 1 + (err rescue 2); end
Ruby call stack: #three (program counter)

Slide 82

Slide 82 text

3. Skip moving program counter (r62678)
Ruby code:
  def err; raise 'error'; end
  def three; 1 + (err rescue 2); end
Ruby call stack: #three (program counter), #err (program counter)

Slide 83

Slide 83 text

3. Skip moving program counter (r62678)
Ruby code:
  def err; raise 'error'; end
  def three; 1 + (err rescue 2); end
Ruby call stack: #three (program counter), #err (program counter)

Slide 84

Slide 84 text

3. Skip moving program counter (r62678)
Ruby code:
  def err; raise 'error'; end
  def three; 1 + (err rescue 2); end
Ruby call stack: #three (program counter), #err (program counter), #raise (program counter)
longjmp

Slide 85

Slide 85 text

3. Skip moving program counter (r62678)
Ruby code:
  def err; raise 'error'; end
  def three; 1 + (err rescue 2); end
Ruby call stack: #three (program counter), #err (program counter), #raise (program counter)
Program counter is used to resurrect the position after longjmp

Slide 86

Slide 86 text

3. Skip moving program counter (r62678)
As with stack values, we skip moving the program counter only when a catch table (rescue, ensure, etc.) does not exist.
Optcarrot: 64.92fps -> 68.08fps
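Whether a method has a catch table can be observed from Ruby on MRI. The sketch below assumes `RubyVM::InstructionSequence.of` and `#disasm` are available (MRI only) and that the disassembly prints a "catch table" section when one exists; the helper `catch_table?` and the method names `three_plain` and `three_rescue` are hypothetical. It only inspects bytecode and says nothing about what the JIT actually decided.

```ruby
def err
  raise 'error'
end

# No rescue/ensure: the optimization described above could apply.
def three_plain
  1 + 2
end

# The rescue modifier gives this method a catch table.
def three_rescue
  1 + (err rescue 2)
end

# True when MRI's disassembly of the method lists a catch table.
# Returns nil on Ruby implementations without RubyVM::InstructionSequence.
def catch_table?(name)
  return nil unless defined?(RubyVM::InstructionSequence)

  iseq = RubyVM::InstructionSequence.of(method(name))
  iseq.disasm.include?('catch table')
end

puts "three_plain  has catch table: #{catch_table?(:three_plain).inspect}"
puts "three_rescue has catch table: #{catch_table?(:three_rescue).inspect}"
```

Methods of the second kind are the ones that must keep the program counter and VM stack up to date, because the rescue handler needs them after an exception.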

Slide 87

Slide 87 text

4. Force inlining arithmetic instructions (r62677)
The C compiler has a function-size threshold for inlining.
Some of Ruby's instructions (+, -, *, /, ...) are too large to be inlined by default, so I applied an "always inline" attribute.
In the future, we should reduce the size of the code instead.
Optcarrot: 60.19fps -> 64.92fps

Slide 88

Slide 88 text

5. Force inlining ivar instructions (r62693)
Not only arithmetic instructions but also the instance variable instructions are large, so I force-inlined them too.
Optcarrot: 67.04fps -> 68.20fps

Slide 89

Slide 89 text

6. Disable stack consistency check (r63092)
The Ruby VM always asserts the stack size when returning from a method, and that is slow.
We can skip it in JIT-ed code because it's already checked by the VM.
Optcarrot: 67.43fps -> 69.92fps

Slide 90

Slide 90 text

7. Inline attr_reader method call (r63212)
Ruby code: def foo; bar; end
ISeq: putself; send :bar, cache: nil; leave
Ruby VM (program counter): Call
C code for instruction: send(cache) { search_method(cache); CALL_METHOD(cache); }
Function candidates: Ruby method push, C method call, attr_reader, attr_writer, ...

Slide 91

Slide 91 text

7. Inline attr_reader method call (r63212)
Ruby code: def foo; bar; end
ISeq: putself; send :bar, cache: attr; leave
Ruby VM (program counter): Call
C code for instruction: send(cache) { search_method(cache); CALL_METHOD(cache); }
Function candidates: Ruby method push, C method call, attr_reader, attr_writer, ...

Slide 92

Slide 92 text

Slide 93

Slide 93 text

7. Inline attr_reader method call (r63212)
Using the call cache in the same way as for Ruby methods, we can fully inline attr_reader without a large compilation time.
The cost becomes the same as a reference to a normal instance variable.
Calling attr_reader becomes 4x faster.
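The two access styles being compared can be put side by side in a small script. This is a sketch, not the benchmark behind the 4x figure: the `Point` class, both method names, and the iteration count are assumptions, and the printed timings will vary by machine and Ruby version.

```ruby
require 'benchmark'

class Point
  attr_reader :x # accessor generated by attr_reader

  def initialize(x)
    @x = x
  end

  # Reads the value through the attr_reader method call.
  def sum_via_reader(n)
    sum = 0
    n.times { sum += x }
    sum
  end

  # Reads the instance variable directly.
  def sum_via_ivar(n)
    sum = 0
    n.times { sum += @x }
    sum
  end
end

point = Point.new(2)
n = 200_000
puts format('attr_reader: %.4f s', Benchmark.realtime { point.sum_via_reader(n) })
puts format('ivar:        %.4f s', Benchmark.realtime { point.sum_via_ivar(n) })
```

If the inlining works as described, the two timings should move closer together when the script is run with the JIT enabled.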

Slide 94

Slide 94 text

2.6.0-Preview2 (trunk) wrap up
I've mainly worked on performance, because a JIT is useless if it's slow.
The generated code is much simpler and faster now that program counter and stack pointer motions are removed.
But it still has some complexity, which blocks significant performance improvement from Ruby method inlining.

Slide 95

Slide 95 text

Future of Ruby's JIT

Slide 96

Slide 96 text

1. Deoptimization by longjmp
We can generate aggressive code and cancel all JIT-ed calls by longjmp when something unexpected happens.
I’m going to remove the guard for TracePoint and cancel JIT-ed calls later instead.
It should also be used when all method caches are purged.
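A toy model of this plan, with Ruby's `catch`/`throw` standing in for setjmp/longjmp. Everything here is hypothetical (`ToyVM`, `enable_tracing`, `jit_code`, `interpret`), and the toy simply redoes the whole call on the generic path after cancelling, which is only acceptable because the example body has no other side effects.

```ruby
# Toy deoptimization: aggressive code runs without a tracing guard and is
# cancelled by a non-local jump when tracing gets enabled underneath it.
class ToyVM
  attr_reader :cancelled

  def initialize
    @tracing = false # the assumption JIT-ed code relies on: tracing is off
    @in_jit = false
    @cancelled = 0
  end

  # The unexpected event. If JIT-ed code is running, jump out of it.
  def enable_tracing
    @tracing = true
    throw :cancel_jit if @in_jit
  end

  def call(x, &callee)
    return interpret(x, &callee) if @tracing # JIT-ed code is no longer valid

    catch(:cancel_jit) do # plays the role of setjmp
      begin
        @in_jit = true
        return jit_code(x, &callee)
      ensure
        @in_jit = false
      end
    end
    @cancelled += 1       # reached only after a throw
    interpret(x, &callee) # redo the call on the generic path
  end

  private

  # Aggressive code: no tracing guard inside.
  def jit_code(x)
    yield if block_given?
    x * 2
  end

  # Generic path: a real VM would fire trace hooks here.
  def interpret(x)
    yield if block_given?
    x * 2
  end
end

vm = ToyVM.new
puts vm.call(21)                       # runs the JIT-ed path
puts vm.call(21) { vm.enable_tracing } # cancelled mid-call, then interpreted
puts "cancelled calls: #{vm.cancelled}"
```

The point of the design is that the common case pays nothing for the check; the cost moves to the rare moment when the assumption breaks.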

Slide 97

Slide 97 text

2. Instruction specialization for types
Currently the same code is generated for both Hash#[] and Array#[].
We need some instrumentation to detect the type that is passed to an optimized instruction.
Vladimir's RTL instructions achieve this by dynamically modifying instructions.

Slide 98

Slide 98 text

3. Multi-tier JIT
Some other languages have multiple stages of JIT.
Depending on how frequently a method is called, it may be better to balance compilation time against optimization level.
Vladimir is working on light JIT compilation.
Sometimes people deploy an application every 10 minutes.

Slide 99

Slide 99 text

4. Profile-guided JIT
C compilers have a feature to profile compiled code and generate faster code using the profiling result.
Using the multi-tier JIT, we may be able to profile code in the first tier and generate faster code in the second tier.

Slide 100

Slide 100 text

5. Better JIT scheduler for Rails
In Rails, an application becomes slower only while JIT compilation is happening.
A possible cause is the number of methods to be JIT-ed, which is large compared to some other benchmarks.
Possibly we should reduce the number of methods to be JIT-ed, or reduce the frequency of JIT compilation.

Slide 101

Slide 101 text

6. Ruby / C method inlining
I have already succeeded in implementing Ruby method inlining, but it increases compilation time.
I have ideas for implementing C method inlining, but the question of which methods to inline must be solved first.

Slide 102

Slide 102 text

7. Exploit more C compiler optimizations
Loop-invariant code motion
Folding Ruby's constants
Type check removal by type inference
Reduce unnecessary memory accesses to VM registers

Slide 103

Slide 103 text

Conclusion
2.6.0-preview2 will be much faster than 2.6.0-preview1 (still not ready for Rails).
We still have many things to do for Ruby 3x3.