Slide 1

Slide 1 text

JIT compiler improvements in Ruby 2.7 @k0kubun

Slide 2

Slide 2 text

@k0kubun Ruby's JIT, ERB, Haml, Hamlit

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

JIT

Slide 5

Slide 5 text

Just-In-Time compiler

Slide 6

Slide 6 text

What's JIT?
• Experimental optional feature since Ruby 2.6
• Compile your Ruby code to faster C code automatically
• Just-in-Time: Use runtime information for optimizations

$ ruby --jit
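As a minimal sketch (the file name and numbers are mine, not from the talk), this is the kind of CPU-bound hot spot that benefits from `--jit`: once a method passes the call threshold, MJIT compiles it to C and loads the result as machine code.

```ruby
# fib.rb -- a CPU-bound hot spot (hypothetical example).
# Run with: ruby --jit fib.rb
# After enough calls, MJIT compiles fib and swaps in the machine code.
def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

puts fib(25)  # => 75025
```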

Slide 7

Slide 7 text

Ruby 3x3 benchmark: Optcarrot NES emulator: mame/optcarrot

Slide 8

Slide 8 text

Ruby 3x3 benchmark: Optcarrot
Frames Per Second (fps), Ruby 2.6.0 w/ Optcarrot: JIT off 53.8
(Intel 4.0GHz i7-4790K, 8 cores, 16GB memory, x86-64 Ubuntu)

Slide 9

Slide 9 text

Ruby 3x3 benchmark: Optcarrot (Speed: 1.61x)
Frames Per Second (fps), Ruby 2.6.0 w/ Optcarrot: JIT off 53.8, JIT on 86.6
(Intel 4.0GHz i7-4790K, 8 cores, 16GB memory, x86-64 Ubuntu)

Slide 10

Slide 10 text

What's JIT?
$ ps aufx
ruby --jit bin/optcarrot-bench
 \_ /usr/bin/gcc -w -Wfatal-errors -fPIC -shared -w -pipe -O3
    \_ /usr/lib/gcc/x86_64-linux-gnu/7/cc1 -quiet -imultiarch
    \_ as -W --64 -o /tmp/_ruby_mjit_p31673u20.o

Slide 11

Slide 11 text

How does it work?
[Diagram: a Ruby process with a VM thread and an MJIT worker thread. At build time, the VM's C code header is transformed and precompiled into a precompiled header.]

Slide 12

Slide 12 text

How does it work?
[Diagram: the VM thread enqueues bytecode to JIT onto a queue, and the MJIT worker thread dequeues it.]

Slide 13

Slide 13 text

How does it work?
[Diagram: the MJIT worker thread generates C code from the dequeued bytecode; the generated C code includes the precompiled header.]

Slide 14

Slide 14 text

How does it work?
[Diagram: the C compiler (CC) compiles the generated C code into a .o file.]

Slide 15

Slide 15 text

How does it work?
[Diagram: the .o file is linked into a .so file.]

Slide 16

Slide 16 text

How does it work?
[Diagram: the .so file is loaded into the Ruby process, and the VM thread calls the resulting function pointer of machine code.]

Slide 17

Slide 17 text

How to use JIT

Slide 18

Slide 18 text

How to use JIT
• Just "--jit" is fine
• You can also use the RUBYOPT=--jit environment variable

$ ruby --jit
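You can also check from inside a process whether it was started with JIT: `RubyVM::MJIT.enabled?` exists since Ruby 2.6 (guarded with `defined?` below, since the constant is not present on every Ruby version).

```ruby
# Report whether this process is running with MJIT enabled.
# RubyVM::MJIT.enabled? is available since Ruby 2.6; the defined? guard
# keeps the snippet working on rubies without that constant.
def jit_enabled?
  defined?(RubyVM::MJIT) ? RubyVM::MJIT.enabled? : false
end

puts jit_enabled?  # true under `ruby --jit`, false otherwise
```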

Slide 19

Slide 19 text

How to use JIT
$ ruby --help
JIT options (experimental):
  --jit-warnings      Enable printing JIT warnings
  --jit-debug         Enable JIT debugging (very slow)
  --jit-wait          Wait until JIT compilation is finished everytime (for testing)
  --jit-save-temps    Save JIT temporary files in $TMP or /tmp (for testing)
  --jit-verbose=num   Print JIT logs of level num or less to stderr (default: 0)
  --jit-max-cache=num Max number of methods to be JIT-ed in a cache (default: 100)
  --jit-min-calls=num Number of calls to trigger JIT (for testing, default: 10000)

Slide 20

Slide 20 text

How to use JIT
$ ruby --help
JIT options (experimental):
  --jit-warnings      Enable printing JIT warnings
  --jit-debug         Enable JIT debugging (very slow)
  --jit-wait          Wait until JIT compilation is finished everytime (for testing)
  --jit-save-temps    Save JIT temporary files in $TMP or /tmp (for testing)
  --jit-verbose=num   Print JIT logs of level num or less to stderr (default: 0)
  --jit-max-cache=num Max number of methods to be JIT-ed in a cache (default: 100)
  --jit-min-calls=num Number of calls to trigger JIT (for testing, default: 10000)

Slide 21

Slide 21 text

How to use JIT
$ ruby --jit-verbose=1
...

Slide 22

Slide 22 text

How to use JIT
$ ruby --jit-verbose=1
...
JIT success (35.1ms): block in symbolize_keys!@...
JIT success (89.9ms): block in forwarded_scheme@...

Slide 23

Slide 23 text

How to use JIT
$ ruby --jit-verbose=1
...
JIT success (35.1ms): block in symbolize_keys!@...
JIT success (89.9ms): block in forwarded_scheme@...
JIT inline: unwrapped_html_escape@...
JIT success (106.9ms): unwrapped_html_escape@...
JIT inline: present?@...
JIT success (37.5ms): present?@...

Slide 24

Slide 24 text

How to use JIT
$ ruby --jit-verbose=1
...
JIT success (35.1ms): block in symbolize_keys!@...
JIT success (89.9ms): block in forwarded_scheme@...
JIT inline: unwrapped_html_escape@...
JIT success (106.9ms): unwrapped_html_escape@...
JIT inline: present?@...
JIT success (37.5ms): present?@...
Optimization in Ruby 2.7

Slide 25

Slide 25 text

How to use JIT
$ ruby --jit-verbose=1
...
JIT success (35.1ms): block in symbolize_keys!@...
JIT success (89.9ms): block in forwarded_scheme@...
JIT inline: unwrapped_html_escape@...
JIT success (106.9ms): unwrapped_html_escape@...
JIT inline: present?@...
JIT success (37.5ms): present?@...
JIT recompile: present?@...

Slide 26

Slide 26 text

How to use JIT
$ ruby --jit-verbose=1
...
JIT success (35.1ms): block in symbolize_keys!@...
JIT success (89.9ms): block in forwarded_scheme@...
JIT inline: unwrapped_html_escape@...
JIT success (106.9ms): unwrapped_html_escape@...
JIT inline: present?@...
JIT success (37.5ms): present?@...
JIT recompile: present?@...
Another optimization in Ruby 2.7

Slide 27

Slide 27 text

How to use JIT
$ ruby --jit-verbose=1
...
JIT success (35.1ms): block in symbolize_keys!@...
JIT success (89.9ms): block in forwarded_scheme@...
JIT inline: unwrapped_html_escape@...
JIT success (106.9ms): unwrapped_html_escape@...
JIT inline: present?@...
JIT success (37.5ms): present?@...
JIT recompile: present?@...
...
JIT compaction (17.0ms): Compacted 100 methods -> ...

Slide 28

Slide 28 text

How to use JIT
$ ruby --jit-verbose=1
...
JIT success (35.1ms): block in symbolize_keys!@...
JIT success (89.9ms): block in forwarded_scheme@...
JIT inline: unwrapped_html_escape@...
JIT success (106.9ms): unwrapped_html_escape@...
JIT inline: present?@...
JIT success (37.5ms): present?@...
JIT recompile: present?@...
...
JIT compaction (17.0ms): Compacted 100 methods -> ...
?

Slide 29

Slide 29 text

"JIT compaction"
[Diagram: each JIT-ed method is a separate .o file loaded as its own function pointer of machine code, called by the VM thread.]

Slide 30

Slide 30 text

"JIT compaction"
[Diagram: the MJIT worker thread links all .o files into a single .so file ("Link all").]

Slide 31

Slide 31 text

"JIT compaction"
[Diagram: all function pointers of machine code are reloaded from the single .so file ("Reload all") and called by the VM thread.]

Slide 32

Slide 32 text

"JIT compaction"
[Diagram: after compaction, the VM thread calls the function pointers of machine code loaded from the single .so file.]

Slide 33

Slide 33 text

JIT's performance on Rails

Slide 34

Slide 34 text

Ruby benchmark on Rails: Railsbench
• Just rails scaffold #show: k0kubun/railsbench
• headius/pgrailsbench, but on Rails 5.2 and w/ db:seed
• Small but captures some Rails characteristics

Slide 35

Slide 35 text

Railsbench: Speed
Ruby 2.6, Request Per Second (#/s): JIT off 924.9, JIT on 720.7
(k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench; Intel 4.0GHz i7-4790K, 8 cores, 16GB memory, x86-64 Ubuntu; Ruby 2.6=2.6.2, Ruby 2.7=r67600)

Slide 36

Slide 36 text

Railsbench: Speed
Ruby 2.6, Request Per Second (#/s): JIT off 924.9, JIT on 720.7
Ruby 2.7, Request Per Second (#/s): JIT off 932.0, JIT on 899.9
(k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench; Intel 4.0GHz i7-4790K, 8 cores, 16GB memory, x86-64 Ubuntu; Ruby 2.6=2.6.2, Ruby 2.7=r67600)

Slide 37

Slide 37 text

Railsbench: Memory
Ruby 2.6: JIT off 105.2, JIT on 107.2
Ruby 2.7: JIT off 106.5, JIT on 107.6
(k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench; Intel 4.0GHz i7-4790K, 8 cores, 16GB memory, x86-64 Ubuntu)

Slide 38

Slide 38 text

Why is it slow on Rails?
• Too many methods => cache inefficiency
• Less CPU-bound and fewer optimization chances

Slide 39

Slide 39 text

Performance Improvements in Ruby 2.7 JIT

Slide 40

Slide 40 text

Ruby 2.7 JIT Performance Improvements
1. Default Option Changes
2. Deoptimized Recompilation
3. Method Inlining
4. Optimized Dispatch of JIT-ed Code (WIP)
5. Stack-based Object Allocation (PoC)

Slide 41

Slide 41 text

1. Default Option Changes

Slide 42

Slide 42 text

1. Default Option Changes
• Ruby 2.7 changes the default values of JIT options
• --jit-min-calls: 5 → 10,000
• --jit-max-cache: 1,000 → 100

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

2. Deoptimized Recompilation

Slide 45

Slide 45 text

Problem 2: JIT calls may be cancelled frequently
• Cancelling JIT execution has some overhead
• How many cancels did we have?

Slide 46

Slide 46 text

Problem 2: JIT calls may be cancelled frequently

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

A change in self's class causes a JIT cancel

Slide 50

Slide 50 text

Solution 2: Deoptimized Recompilation
• Recompile a method when the JIT's speculation is invalidated
• It was in the original MJIT by Vladimir Makarov, but was removed for simplicity in Ruby 2.6
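A hypothetical illustration of the speculation involved (method and values are mine): MJIT compiles a hot method assuming the operand classes it has seen so far, and a call that breaks that assumption invalidates the compiled code. With this change, Ruby 2.7 recompiles the method instead of permanently falling back to the VM.

```ruby
# Hypothetical example. MJIT compiles `double` speculating that `n` is
# an Integer, since that's all it has seen during the warm-up calls.
def double(n)
  n + n
end

10_000.times { double(1) }  # hot enough to be JIT-ed with Integer speculation
double(1.5)                 # a Float invalidates the speculation -> recompile
puts double(3)
```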

Slide 51

Slide 51 text

Solution 2: Deoptimized Recompilation
• Committed to trunk. Inspectable with --jit-verbose=1

Slide 52

Slide 52 text

Solution 2: Deoptimized Recompilation

Slide 53

Slide 53 text

3. Method Inlining

Slide 54

Slide 54 text

Problem 3: Method call is slow
• We're calling methods everywhere
• Method call cost:
  VM → VM    10.28ns
  VM → JIT    9.12ns
  JIT → JIT   8.98ns
  JIT → VM   19.59ns

Slide 55

Slide 55 text

Problem 3: Method call is slow

Slide 56

Slide 56 text

Solution 3: Method Inlining
• Method inlining levels:
  • Level 1: Just call an inline function instead of the JIT-ed code's function pointer
  • Level 2: Skip pushing a call frame by default, but lazily push it when something happens
• For level 2, we need to know the "purity" of each VM instruction

Slide 57

Slide 57 text

Solution 3: Method Inlining
• Can Numeric#zero? written in Ruby be pure?
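A sketch of what such a Ruby definition looks like (`my_zero?` is a hypothetical name, used here to avoid redefining the real `Numeric#zero?`). Whether it is "pure" hinges on `==`: for core numeric receivers `==` has no side effects, so the JIT could omit the call frame, but a user-defined `==` could do anything.

```ruby
# Numeric#zero? re-written in Ruby (hypothetical my_zero? name).
# For core numeric receivers `==` is side-effect-free, so this body
# is a candidate for frame-omitted inlining; a redefined `==` is not.
class Numeric
  def my_zero?
    self == 0
  end
end

p 0.my_zero?    # => true
p 1.my_zero?    # => false
p 0.0.my_zero?  # => true
```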

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

Solution 3: Method Inlining

Slide 60

Slide 60 text

Solution 3: Frame-omitted Method Inlining

Slide 61

Slide 61 text

Solution 3: Method Inlining
[Chart: VM]

Slide 62

Slide 62 text

Solution 3: Method Inlining
[Chart: VM vs. JIT]

Slide 63

Slide 63 text

Solution 3: Method Inlining
• Method inlining is already on master!
• It's working for limited things like #html_safe? and #present?
• To make it really useful, we need to prepare Ruby versions of core class methods for JIT

Slide 64

Slide 64 text

4. Optimized Dispatch of JIT-ed Code

Slide 65

Slide 65 text

Problem 4: Calling JIT-ed code seems slow
• When benchmarking after-compile Rails performance, the maximum number of methods should be compiled
• Max: 1,000 in Ruby 2.6, 100 in Ruby 2.7
• Note: only 30 methods are compiled on Optcarrot

Slide 66

Slide 66 text

Problem 4: Calling JIT-ed code seems slow
[Chart: time to call a method returning nil (ns, 0-32) vs. number of called methods (1-97), comparing VM and JIT. Benchmark methods: def foo1; nil; end, def foo2; nil; end, def foo3; nil; end, ...]

Slide 67

Slide 67 text

So we did this in Ruby 2.6
[Diagram: "JIT compaction": the MJIT worker thread links all .o files into a single .so file, reloads all function pointers of machine code from it, and the VM thread calls them.]

Slide 68

Slide 68 text

After "JIT compaction" in Ruby 2.6
[Chart: time to call a method returning nil (ns) vs. number of called methods (1-97), comparing VM and JIT.]

Slide 69

Slide 69 text

But we still see icache stalls

Slide 70

Slide 70 text

Solution 4: Profile-guided Optimization?
• Use GCC/Clang's -fprofile-generate and -fprofile-use

Slide 71

Slide 71 text

Solution 4: Profile-guided Optimization?
• Use GCC/Clang's -fprofile-generate and -fprofile-use
• Unfortunately this did not help the situation

Slide 72

Slide 72 text

Solution 4: Optimized Dispatch of JIT-ed Code
• Calling JIT-ed code from the VM is slow
• Can we generate special code for dispatch from the VM?
• We can reduce the number of virtual calls from two to one
• Work in progress, but I can show you a graph

Slide 73

Slide 73 text

After optimized dispatch of JIT-ed code (WIP)
[Chart: time to call a method returning nil (ns) vs. number of called methods (1-97), comparing VM and JIT.]

Slide 74

Slide 74 text

5. Stack-based Object Allocation

Slide 75

Slide 75 text

Problem 5: Object allocation is slow
• Rails apps allocate objects (of course!), unlike Optcarrot
• It takes time to allocate memory from the heap and to GC it

Slide 76

Slide 76 text

Problem 5: Object allocation is slow
• In a perf profile, Railsbench spends time on memory management (memory management + GC: 9.3%)

Slide 77

Slide 77 text

Solution 5: Stack-based Object Allocation (PoC)
• If an object does not "escape", we can allocate it on the stack
• Implementing really clever escape analysis is hard, but a basic one can cover some real-world use cases
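A hedged sketch of what "does not escape" means (example mine, not from the talk): the Range below is only used inside the method and never leaks out of the frame, so in principle it could be allocated on the stack and freed on return without involving the GC.

```ruby
# The Range object (1..n) never escapes sum_upto's frame: it is not
# returned, stored in another object, or passed anywhere that outlives
# the call, so a stack-based allocation would be safe for it.
def sum_upto(n)
  (1..n).sum
end

puts sum_upto(10)  # => 55
```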

Slide 78

Slide 78 text

Solution 5: Stack-based Object Allocation (PoC)

Slide 79

Slide 79 text

Solution 5: Stack-based Object Allocation (PoC)
[Chart: VM]

Slide 80

Slide 80 text

Solution 5: Stack-based Object Allocation (PoC)
[Chart: VM vs. JIT]

Slide 81

Slide 81 text

Solution 5: Stack-based Object Allocation (PoC)

Slide 82

Slide 82 text

Solution 5: Stack-based Object Allocation (PoC)
[Chart: VM]

Slide 83

Slide 83 text

Solution 5: Stack-based Object Allocation (PoC)
[Chart: VM vs. JIT]

Slide 84

Slide 84 text

Conclusion
• Optimizing JIT-ed code dispatch may remove the current bottleneck of the JIT on Rails
• Once that problem is solved, we'd be able to continuously improve performance
• By allocating objects on the stack, eliminating branches, ...