Slide 1

Performance Improvement of Ruby 2.7 JIT in Real World @k0kubun / RubyKaigi 2019

Slide 2

@k0kubun Ruby Committer: JIT, ERB

Slide 3

What's JIT?
• Experimental optional feature since Ruby 2.6
• Compile your Ruby code to faster C code automatically
• Just-in-Time: Use runtime information for optimizations
$ ruby --jit
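
A quick way to confirm the flag took effect, as a minimal sketch (RubyVM::MJIT is the interface name as of Ruby 2.6):

  # Prints true when the process was started with --jit
  puts RubyVM::MJIT.enabled?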

Slide 4

What's "MJIT"? VM's C code Ruby process header queue VM Thread Build time Transform Precompile precompiled header MJIT Worker Thread

Slide 5

What's "MJIT"? [Diagram: at runtime, the VM thread enqueues bytecode to JIT and the MJIT worker thread dequeues it from the queue]

Slide 6

What's "MJIT"? [Diagram: the MJIT worker thread generates C code from the dequeued bytecode; the generated C code includes the precompiled header]

Slide 7

What's "MJIT"? [Diagram: the C compiler (CC) compiles the generated C code into a .o file]

Slide 8

What's "MJIT"? [Diagram: the .o file is linked into a .so file]

Slide 9

What's "MJIT"? [Diagram: the .so file is loaded into the Ruby process and the VM thread calls the function pointer of the machine code]

Slide 10

What's "MJIT"? [Diagram: the same steps repeat per method, leaving many .o files and many function pointers of machine code called by the VM thread]

Slide 11

What's "MJIT"? [Diagram: the Ruby process now holds many .o files and many separately loaded function pointers of machine code]

Slide 12

What's "MJIT"? [Diagram: all .o files are linked into a single .so file ("Link all")]

Slide 13

What's "MJIT"? [Diagram: the single .so file is reloaded and all function pointers of machine code are pointed into it ("Reload all")]

Slide 14

What's "MJIT"? [Diagram: final state, in which the VM thread calls the reloaded function pointers of machine code]
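
To watch this pipeline on a small method, here is a hedged example (flag names as of Ruby 2.6/2.7: --jit-verbose=1 prints each compiled method, and --jit-save-temps keeps the generated .c/.so files so you can inspect the C code the worker produced):

  # Run as: ruby --jit --jit-verbose=1 --jit-save-temps mjit_demo.rb
  def add(a, b)
    a + b
  end

  50_000.times { add(1, 2) }  # call it enough times for MJIT to enqueue it
  sleep 2                     # give the MJIT worker thread time to compile and load it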

Slide 15

Ruby 2.6 JIT stability
• No SEGV reports after release
• To avoid bugs, it's designed conservatively
• We've run JIT CIs for 24h and detected bugs ourselves

Slide 16

JIT performance in real world

Slide 17

Ruby 3x3 benchmark: 1. Optcarrot
• NES emulator: mame/optcarrot

Slide 18

Ruby 3x3 benchmark: 1. Optcarrot
Speed: 1.61x
[Chart: Frames Per Second (fps), Ruby 2.6.0 w/ Optcarrot: JIT off 53.8, JIT on 86.6]
Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu

Slide 19

Ruby 3x3 benchmark: 1. Optcarrot
Memory: 1.01x
[Chart: Max Resident Set Size (MB), Ruby 2.6.0 w/ Optcarrot: JIT off 62.8, JIT on 63.7]
Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu

Slide 20

What's "real world"? • We're not running NES emulator on production • Rails is popular for large-scale use

Slide 21

Ruby 3x3 benchmark: 2. Discourse

Slide 22

Ruby 3x3 benchmark: 2. Discourse
• Forum application: discourse/discourse
• For AWS: noahgibbs/rails_ruby_bench (a.k.a. RRB)
• Its author has reported JIT's performance for all preview, rc, and final releases (thanks!)

Slide 23

Ruby 3x3 benchmark: 2. Discourse

Slide 24

Ruby 3x3 benchmark: 2. Discourse
• It captures the first 10~100k requests after startup
• Is that "real-world"?
• Compiling 1,000 methods takes a long time and makes it slow
• Ruby 2.7 uses --jit-max-cache=100 by default, so it runs in an all-compiled state most of the time (see the sketch below)
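
One hedged way to benchmark that all-compiled state is to warm the app up and then stop the compiler before measuring (RubyVM::MJIT.pause/resume exist since Ruby 2.6; the lambda below is a stand-in for a real Rack application):

  require 'benchmark'

  app = ->(env) { [200, {}, ["ok #{env[:path]}"]] }   # stand-in for a Rack app
  env = { path: '/posts/1' }

  10_000.times { app.call(env) }                      # warmup: let MJIT compile the hotspots
  RubyVM::MJIT.pause if RubyVM::MJIT.enabled?         # wait for queued compilations, then stop the worker
  puts Benchmark.realtime { 1_000.times { app.call(env) } }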

Slide 25

Discourse: Speed
[Chart: Ruby 2.6 Request Per Second (#/s): JIT off 17.1, JIT on 16.2]
k0kubun/discourse: WARMUP=5000 BENCHMARK=1000 script/simple_bench
Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2, Ruby 2.7=r67600

Slide 26

Discourse: Speed
[Chart: Request Per Second (#/s). Ruby 2.6: JIT off 17.1, JIT on 16.2. Ruby 2.7: JIT off 17.5, JIT on 17.4]
k0kubun/discourse: WARMUP=5000 BENCHMARK=1000 script/simple_bench
Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2, Ruby 2.7=r67600

Slide 27

Why isn't JIT fast on Rails yet?
• Checking Discourse may not be the best way to find out
• Some hotspots are not Rails-specific
• Reaching a stable state takes some time

Slide 28

New Ruby benchmark: Railsbench
• Just rails scaffold #show: k0kubun/railsbench
• Like headius/pgrailsbench, but on Rails 5.2 and w/ db:seed
• Small, but captures some Rails characteristics

Slide 29

Railsbench: Speed
[Chart: Ruby 2.6 Request Per Second (#/s): JIT off 924.9, JIT on 720.7]
k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench
Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2, Ruby 2.7=r67600

Slide 30

Railsbench: Speed
[Chart: Request Per Second (#/s). Ruby 2.6: JIT off 924.9, JIT on 720.7. Ruby 2.7: JIT off 932.0, JIT on 899.9]
k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench
Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2, Ruby 2.7=r67600

Slide 31

Railsbench: Memory
[Chart: memory usage (MB). Ruby 2.6: JIT off 105.2, JIT on 107.2. Ruby 2.7: JIT off 106.5, JIT on 107.6]
k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench
Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu

Slide 32

Railsbench
• We can easily run it because the application is very lightweight
• By profiling it, we've found which parts are not good for JIT
• And those parts should exist in all Rails applications too

Slide 33

Why is JIT still slow on Rails? Let’s explore our challenges for JIT on Rails!!

Slide 34

Ruby 2.7 JIT Performance Challenges

Slide 35

Ruby 2.7 JIT Performance Challenges
1. Profile-guided Optimization
2. Optimization Prediction
3. Deoptimized Recompilation
4. Frame-omitted Method Inlining
5. Stack-based Object Allocation

Slide 36

Problem 1: Calling JIT-ed code seems slow
• When benchmarking after-compile Rails performance, the maximum number of methods should be compiled
• Max: 1,000 in Ruby 2.6, 100 in trunk
• Note: only 30 methods are compiled on Optcarrot

Slide 37

Problem 1: Calling JIT-ed code seems slow
[Chart: time to call methods returning nil (s) vs. number of called methods (1 to 40), comparing VM and JIT]
def foo1
  nil
end
def foo2
  nil
end
def foo3
  nil
end
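
A rough, hedged sketch of this kind of measurement (not the slide's exact benchmark script): define N trivial methods, warm them up so MJIT compiles them, then time calling all of them, once with plain ruby and once with ruby --jit:

  require 'benchmark'

  N = Integer(ENV.fetch('N', 40))
  N.times { |i| eval("def foo#{i}; nil; end") }                          # N methods returning nil
  eval("def call_all; #{(0...N).map { |i| "foo#{i}" }.join('; ')}; end")

  20_000.times { call_all }                                              # warmup so the methods get JIT-ed
  sleep 2 if defined?(RubyVM::MJIT) && RubyVM::MJIT.enabled?             # let the worker finish compiling
  puts Benchmark.realtime { 1_000_000.times { call_all } }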

Slide 38

So we did this in Ruby 2.6 [Diagram: the compaction step shown earlier, where all .o files are linked into one .so, which is reloaded so every function pointer of machine code points into it]

Slide 39

But we still see icache stalls

Slide 40

Approach 1: Profile-guided Optimization • Use GCC/Clang's -fprofile-generate and -fprofile-use

Slide 41

Approach 1: Profile-guided Optimization https://github.com/k0kubun/ruby/tree/pgo

Slide 42

Approach 1: Profile-guided Optimization
• It didn't magically solve the current bottleneck on Rails
• It may be helpful later, but not now
• As originally discussed, it complicates the build system

Slide 43

Problem 2: Can we avoid making things slow?
• Why can't we just skip compiling things like a method returning just nil?
• If we compile only well-optimized methods, it should be fast

Slide 44

Approach 2: Optimization Prediction
• Several situations are known to be well-optimized in the current JIT
• Let's measure the impacts and build heuristics!

Slide 45

Approach 2: Optimization Prediction
Call overhead w/ 100 methods (distributed): +22.4ns
Call overhead w/ 100 methods (compacted): +3.22ns
Invoke another VM: +7.90ns
Cancel JIT execution: +1.94ns
Stack pointer motion elimination: -0.15ns
Method call w/ inline cache: -2.21ns
Instance variable read w/ inline cache: -0.86ns
opt_* instructions: -1~4ns
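
The kind of heuristic this table suggests, illustrated with made-up Ruby methods: a method dominated by opt_* instructions and inline-cache-friendly calls is a promising JIT candidate, while a near-empty method mostly just pays the call overhead from the first rows:

  def promising(a, b)
    a + b * 2          # opt_plus / opt_mult: the rows measured at a few ns of savings
  end

  def not_promising
    nil                # almost nothing to optimize; compiling it mainly adds call overhead
  end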

Slide 46

Approach 2: Optimization Prediction
• It did not magically solve the current bottleneck, again
• Maybe the impact is more dynamic than I assumed
• Compiling the same number of hotspot methods always seems to bring the same level of overhead
• Is our code too big for the icache?

Slide 47

Problem 3: JIT calls may be cancelled frequently
• The "Cancel JIT execution" case above had some overhead
• How many cancels did we have?

Slide 48

Problem 3: JIT calls may be cancelled frequently

Slide 49

No content

Slide 50

No content

Slide 51

self's class change causes JIT cancel
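
A hedged illustration of that situation (the classes and method here are made up): a shared method becomes hot while self is always one class, gets JIT-ed with checks specialized to it, and a later call with a different self class invalidates that speculation:

  module Greet
    def hello
      "hi from #{self.class}"
    end
  end

  class Foo; include Greet; end
  class Bar; include Greet; end

  foo = Foo.new
  20_000.times { foo.hello }   # hot enough to be JIT-ed while self is always a Foo
  Bar.new.hello                # self's class changes here, cancelling the JIT-ed speculation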

Slide 52

Solution 3: Deoptimized Recompilation
• Recompile a method when the JIT's speculation is invalidated
• It was in the original MJIT by Vladimir Makarov, but was removed for simplicity in Ruby 2.6

Slide 53

Solution 3: Deoptimized Recompilation • Committed to trunk. Inspectable with --jit-verbose=1

Slide 54

Solution 3: Deoptimized Recompilation

Slide 55

Problem 4: Method call is slow
• We're calling methods everywhere
• Method call cost:
  VM -> VM: 10.28ns
  VM -> JIT: 9.12ns
  JIT -> JIT: 8.98ns
  JIT -> VM: 19.59ns

Slide 56

Problem 4: Method call is slow

Slide 57

Solution 4: Frame-omitted Method Inlining
• Method inlining levels:
  • Level 1: Just call an inline function instead of the JIT-ed code's function pointer
  • Level 2: Skip pushing a call frame by default, but lazily push it when something happens
• For level 2, we need to know the "purity" of VM instructions

Slide 58

Solution 4: Frame-omitted Method Inlining • Can Numeric#zero? written in Ruby be pure?
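
Roughly what such a Ruby definition would look like, as an illustrative sketch rather than the actual source: whether it can be treated as pure comes down to whether `self == 0` can run arbitrary user code through a redefined #==.

  class Numeric
    def zero?
      self == 0   # pure for Integer's #==, but a subclass may redefine #== with side effects
    end
  end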

Slide 59

Solution 4: Frame-omitted Method Inlining

Slide 60

Solution 4: Frame-omitted Method Inlining If it were true (valid for Integer), what would happen?

Slide 61

Solution 4: Frame-omitted Method Inlining

Slide 62

Solution 4: Frame-omitted Method Inlining

Slide 63

Solution 4: Frame-omitted Method Inlining [figure: VM]

Slide 64

Solution 4: Frame-omitted Method Inlining [figure: VM vs. JIT]

Slide 65

Solution 4: Frame-omitted Method Inlining
• Frame-omitted method inlining (level 2) is already on trunk!
• It's working only for limited things like #html_safe? and #present?
• To make it really useful, we need to improve the metadata for methods and VM instructions

Slide 66

Problem 5: Object allocation is slow
• Rails apps allocate objects (of course!), unlike Optcarrot
• It takes time to allocate memory from the heap and to GC it

Slide 67

Problem 5: Object allocation is slow
• Railsbench takes time for memory management in perf
[Profile: memory management, GC: 9.3%]
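
A hedged, Ruby-level way to see the allocation pressure behind that number, alongside the perf profile (the hash/array churn below just stands in for per-request allocations):

  GC.start
  before = GC.stat(:total_allocated_objects)
  1_000.times { { id: 1, title: "post" }.map { |k, v| [k, v.to_s] } }  # stand-in for request work
  puts "allocated #{GC.stat(:total_allocated_objects) - before} objects"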

Slide 68

Solution 5: Stack-based Object Allocation
• If an object does not "escape", we can allocate it on the stack (see the sketch below)
• Implementing a really clever escape analysis is hard, but a basic one can cover some real-world use cases
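
A hedged Ruby illustration of what "escape" means here (Point is a made-up struct): the two points in distance never leave the method, so an escape analysis could in principle put them on the stack, while make_point returns its object, which therefore escapes and needs the heap.

  Point = Struct.new(:x, :y)

  def distance(x1, y1, x2, y2)
    p1 = Point.new(x1, y1)   # never leaves this method: a candidate for stack allocation
    p2 = Point.new(x2, y2)   # same
    Math.sqrt((p1.x - p2.x)**2 + (p1.y - p2.y)**2)
  end

  def make_point(x, y)
    Point.new(x, y)          # escapes by being returned, so it must stay on the heap
  end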

Slide 69

Solution 5: Stack-based Object Allocation

Slide 70

Solution 5: Stack-based Object Allocation [figure: VM]

Slide 71

Solution 5: Stack-based Object Allocation [figure: VM vs. JIT]

Slide 72

Solution 5: Stack-based Object Allocation

Slide 73

Solution 5: Stack-based Object Allocation [figure: VM]

Slide 74

Solution 5: Stack-based Object Allocation [figure: VM vs. JIT]

Slide 75

Ruby 2.7 JIT future work

Slide 76

More Ruby code in core
• Many parts of the Ruby implementation are written in C, which blocks optimizations like method inlining
• @ko1 is proposing to use more Ruby in core and to add more method metadata
• Let's do it

Slide 77

TracePoint support
• Ruby 2.6's JIT just stops when a TracePoint is enabled (see the sketch below)
• TracePoint is often enabled in development environments
• web-console + bindex, zeitwerk
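
The kind of usage that triggers this, as a minimal sketch similar in spirit to what web-console/bindex rely on: once any TracePoint is enabled, Ruby 2.6 stops JIT-ing.

  def greet
    "hi"
  end

  trace = TracePoint.new(:call) { |tp| puts "#{tp.defined_class}##{tp.method_id}" }
  trace.enable      # from this point, Ruby 2.6's JIT gives up
  greet
  trace.disable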

Slide 78

Use GVL from MJIT worker
• Ruby 2.6's JIT does not compile methods during blocking IO or sleep
• That would be the best time to do compilation!

Slide 79

LLVM as an optional loader
• The overhead might be coming from dlopen's way of loading
• What if we:
  • Generate LLVM IR with Clang from MJIT's C code
  • Load it into an LLVM Module and execute it

Slide 80

Conclusion
• At this stage, we're focusing on Rails speed with JIT after all compilations are done
• We're not there yet, but we're moving forward
• JIT will allow us to stop caring about micro-optimizations in real-world code
• No more Performance/* cops!