
Performance Improvement of Ruby 2.7 JIT in Real World / RubyKaigi 2019

Takashi Kokubun

April 18, 2019

Transcript

  1. Performance Improvement of Ruby 2.7 JIT in Real World @k0kubun

    / RubyKaigi 2019
  2. @k0kubun Ruby Committer: JIT, ERB

  3. What's JIT? • Experimental optional feature since Ruby 2.6 •

    Compile your Ruby code to faster C code automatically • Just-in-Time: Use runtime information for optimizations $ ruby --jit
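A minimal sketch of what `--jit` operates on (the file name and method are mine, not from the talk): a hot, self-contained method like this crosses the JIT's call-count threshold and gets compiled.

```ruby
# fib.rb: run as `ruby --jit fib.rb` (add --jit-verbose=1 to see compiles).
# A hot recursive method like this is a typical JIT target.
def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

puts fib(25)  # => 75025; the recursion calls fib far past the compile threshold
```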
  4. What's "MJIT"? (diagram: a Ruby process runs a VM thread and an MJIT worker thread; at Ruby's build time, the VM's C code header is transformed into a precompiled header)
  5. What's "MJIT"? (diagram: the VM thread enqueues bytecode to JIT; the MJIT worker thread dequeues it)
  6. What's "MJIT"? (diagram: the worker generates C code that includes the precompiled header)
  7. What's "MJIT"? (diagram: a C compiler (CC) compiles the generated C code into a .o file)
  8. What's "MJIT"? (diagram: the .o file is linked into a .so file)
  9. What's "MJIT"? (diagram: the .so file is loaded, and the VM calls the resulting function pointer of machine code)
  10. What's "MJIT"? (diagram: this repeats per method, leaving many .o files and many separately loaded function pointers called by the VM)
  11. What's "MJIT"? (diagram: the VM ends up calling many scattered function pointers of machine code)
  12. What's "MJIT"? (diagram: the MJIT worker links all .o files into a single .so file)
  13. What's "MJIT"? (diagram: all function pointers of machine code are reloaded from that single .so file)
  14. What's "MJIT"? (diagram: the VM then calls the compacted function pointers of machine code)
  15. Ruby 2.6 JIT stability • No SEGV reports after release • To avoid bugs, it's designed conservatively • We've run JIT CIs for 24h and detected bugs ourselves
  16. JIT performance in real world

  17. Ruby 3x3 benchmark: 1. Optcarrot NES emulator: mame/optcarrot

  18. Ruby 3x3 benchmark: 1. Optcarrot, Speed 1.61x (chart: Frames Per Second, JIT off 53.8 fps vs JIT on 86.6 fps; Intel 4.0GHz i7-4790K, 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6.0 w/ Optcarrot)
  19. Ruby 3x3 benchmark: 1. Optcarrot, Memory 1.01x (chart: Max Resident Set Size, JIT off 62.8 MB vs JIT on 63.7 MB; Intel 4.0GHz i7-4790K, 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6.0 w/ Optcarrot)
  20. What's "real world"? • We're not running NES emulator on

    production • Rails is popular for large-scale use
  21. Ruby 3x3 benchmark: 2. Discourse

  22. Ruby 3x3 benchmark: 2. Discourse • Forum application: discourse/discourse • For AWS: noahgibbs/rails_ruby_bench (a.k.a. RRB) • Noah Gibbs has reported JIT's performance for every preview, rc, and final release (thanks!)
  23. Ruby 3x3 benchmark: 2. Discourse

  24. Ruby 3x3 benchmark: 2. Discourse • It captures the first 10~100k requests after startup • Is that "real-world"? • Compiling 1,000 methods takes a long time and makes it slow • Ruby 2.7 uses --jit-max-cache=100 by default, so it runs in an all-compiled state most of the time
  25. Discourse: Speed, Ruby 2.6 (chart: Request Per Second, JIT off 17.1 vs JIT on 16.2; k0kubun/discourse: WARMUP=5000 BENCHMARK=1000 script/simple_bench; Intel 4.0GHz i7-4790K, 8 cores, memory 16GB, x86-64 Ubuntu; Ruby 2.6 = 2.6.2, Ruby 2.7 = r67600)
  26. Discourse: Speed, Ruby 2.7 added (chart: Ruby 2.6 JIT off 17.1 vs JIT on 16.2; Ruby 2.7 JIT off 17.5 vs JIT on 17.4; same setup as above)
  27. Why isn't JIT fast on Rails yet? • Discourse may not be the best benchmark for finding out • Some of its hotspots are not Rails-specific • Reaching a stable state takes some time
  28. New Ruby benchmark: Railsbench • Just a rails scaffold #show action: k0kubun/railsbench • Based on headius/pgrailsbench, but on Rails 5.2 and with db:seed • Small, but captures some Rails characteristics
  29. Railsbench: Speed, Ruby 2.6 (chart: Request Per Second, JIT off 924.9 vs JIT on 720.7; k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench; Intel 4.0GHz i7-4790K, 8 cores, memory 16GB, x86-64 Ubuntu; Ruby 2.6 = 2.6.2, Ruby 2.7 = r67600)
  30. Railsbench: Speed, Ruby 2.7 added (chart: Ruby 2.6 JIT off 924.9 vs JIT on 720.7; Ruby 2.7 JIT off 932.0 vs JIT on 899.9; same setup as above)
  31. Railsbench: Memory (chart: Ruby 2.6 JIT off 105.2 vs JIT on 107.2; Ruby 2.7 JIT off 106.5 vs JIT on 107.6; k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench; Intel 4.0GHz i7-4790K, 8 cores, memory 16GB, x86-64 Ubuntu)
  32. Railsbench • Easy to run, as the application is very lightweight • By profiling it, we've found which parts are bad for JIT • Those parts should exist in all Rails applications too
  33. Why is JIT still slow on Rails? Let’s explore our

    challenges for JIT on Rails!!
  34. Ruby 2.7 JIT Performance Challenges

  35. Ruby 2.7 JIT Performance Challenges 1. Profile-guided Optimization 2. Optimization

    Prediction 3. Deoptimized Recompilation 4. Frame-omitted Method Inlining 5. Stack-based Object Allocation
  36. Problem 1: Calling JIT-ed code seems slow • When benchmarking after-compile Rails performance, the maximum number of methods should be compiled • Max: 1,000 in Ruby 2.6, 100 on trunk • Note: only 30 methods are compiled on Optcarrot
  37. Problem 1: Calling JIT-ed code seems slow (chart: time to call methods returning nil, in seconds, vs number of called methods from 1 to 39, for VM and JIT; the benchmark defines trivial methods like `def foo1; nil; end`)
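The shape of that microbenchmark can be sketched like this (my reconstruction; the talk's version presumably writes each `def fooN` out literally):

```ruby
# Sketch: many trivial methods returning nil, each defined separately so
# the JIT compiles one body per method (class and method names are mine).
class NilMethods
  40.times do |i|
    define_method(:"foo#{i + 1}") { nil }
  end
end

m = NilMethods.new
m.foo1  # with --jit, each fooN gets its own compiled function
```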
  38. So we did this in Ruby 2.6 (diagram: the MJIT worker links all .o files into a single .so file, and all function pointers of machine code are reloaded from it)
  39. But we still see icache stalls

  40. Approach 1: Profile-guided Optimization • Use GCC/Clang's -fprofile-generate and -fprofile-use

  41. Approach 1: Profile-guided Optimization https://github.com/k0kubun/ruby/tree/pgo

  42. Approach 1: Profile-guided Optimization • It didn't magically solve the

    current bottleneck on Rails • It may be helpful later, but not now • As originally discussed, it complicates the build system
  43. Problem 2: Can we avoid making things slow? • Why

    can't we just skip compiling things like a method returning just nil? • If we compile only well-optimized methods, it should be fast
  44. Approach 2: Optimization Prediction • Several situations are known to

    be well-optimized in the current JIT • Let's measure the impacts and build heuristics!
  45. Approach 2: Optimization Prediction • Call overhead w/ 100 methods (distributed): +22.4ns • Call overhead w/ 100 methods (compacted): +3.22ns • Invoke another VM: +7.90s • Cancel JIT execution: +1.94s • Stack pointer motion elimination: -0.15ns • Method call w/ inline cache: -2.21ns • Instance variable read w/ inline cache: -0.86ns • opt_* instructions: -1~4ns
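As a rough illustration (the class and method names are mine, not from the talk), method bodies that hit the well-optimized rows above look like:

```ruby
# Hypothetical Counter whose methods exercise the measured fast paths:
# an instance variable read through an inline cache, and an opt_*
# instruction (opt_plus) for the addition.
class Counter
  def initialize
    @count = 0
  end

  def count
    @count        # instance variable read w/ inline cache
  end

  def bump(n)
    @count += n   # getinstancevariable + opt_plus + setinstancevariable
  end
end
```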
  46. Approach 2: Optimization Prediction • It did not magically solve the current bottleneck, again • Maybe the impact is more dynamic than I assumed • Compiling the same number of hotspot methods always seems to bring the same level of overhead • Is our code too big for the icache?
  47. Problem 3: JIT calls may be cancelled frequently • The "Cancel JIT execution" path had some overhead • How many cancels did we have?
  48. Problem 3: JIT calls may be cancelled frequently

  51. A change in self's class causes a JIT cancel
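A sketch of the pattern behind this (class and method names are mine): a call site compiled while the receiver was always one class speculates on that class, and a different receiver class invalidates the speculation.

```ruby
# Sketch: the call site in #describe speculates on the receiver's class
# via its inline cache; once a B shows up, the speculation is invalid
# and JIT execution is cancelled on that path.
class A
  def name
    "A"
  end
end

class B
  def name
    "B"
  end
end

def describe(obj)
  obj.name  # compiled (hypothetically) with A in the inline cache
end

describe(A.new)  # fills the cache with A
describe(B.new)  # class change: JIT cancel
```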

  52. Solution 3: Deoptimized Recompilation • Recompile a method when JIT's

    speculation is invalidated • It was in the original MJIT by Vladimir Makarov, but removed for simplicity in Ruby 2.6
  53. Solution 3: Deoptimized Recompilation • Committed to trunk. Inspectable with

    --jit-verbose=1
  54. Solution 3: Deoptimized Recompilation

  55. Problem 4: Method call is slow • We're calling methods everywhere • Method call cost: VM -> VM 10.28ns, VM -> JIT 9.12ns, JIT -> JIT 8.98ns, JIT -> VM 19.59ns
  56. Problem 4: Method call is slow

  57. Solution 4: Frame-omitted Method Inlining • Method inlining levels: • Level 1: Just call an inline function instead of the JIT-ed code's function pointer • Level 2: Skip pushing a call frame by default, but lazily push it when something happens • For level 2, we need to know the "purity" of each VM instruction
  58. Solution 4: Frame-omitted Method Inlining • Can Numeric#zero? written in

    Ruby be pure?
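Roughly what a Ruby-defined Numeric#zero? looks like (I use a hypothetical method name to avoid clobbering the builtin). "Pure" here means the body only reads self and compares it, with no side effects that would need a call frame:

```ruby
# Sketch of Numeric#zero? written in Ruby; the method name is mine.
class Numeric
  def zero_sketch?
    self == 0  # no ivars, no globals, no raises of its own: candidate for purity
  end
end
```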
  59. Solution 4: Frame-omitted Method Inlining

  60. Solution 4: Frame-omitted Method Inlining If it were true (valid

    for Integer), what would happen?
  61. Solution 4: Frame-omitted Method Inlining

  62. Solution 4: Frame-omitted Method Inlining

  63. Solution 4: Frame-omitted Method Inlining VM

  64. Solution 4: Frame-omitted Method Inlining VM JIT

  65. Solution 4: Frame-omitted Method Inlining • Frame-omitted method inlining (level

    2) is already on trunk! • It's working for limited things like #html_safe?, #present? • To make it really useful, we need to improve metadata for methods and VM instructions
  66. Problem 5: Object allocation is slow • A Rails app allocates objects (of course!), unlike Optcarrot • It takes time to allocate memory from the heap and to GC it
  67. Problem 5: Object allocation is slow • Railsbench spends time on memory management in perf: memory management and GC account for 9.3%
  68. Solution 5: Stack-based Object Allocation • If an object does not "escape", we can allocate it on the stack • Implementing really clever escape analysis is hard, but a basic one can cover some real-world use cases
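A sketch of a non-escaping allocation (the method name is mine): the Range below is neither returned nor stored anywhere, so it lives and dies within one frame, and an escape-analyzing JIT could place it on the stack instead of the GC heap.

```ruby
# Sketch: `range` never leaves this method, so a basic escape analysis
# could prove it safe to allocate on the stack.
def within_limit?(x)
  range = (1..10)     # allocated, used, and dropped inside this frame
  range.cover?(x)     # only a boolean escapes
end

within_limit?(5)
```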
  69. Solution 5: Stack-based Object Allocation

  70. Solution 5: Stack-based Object Allocation VM

  71. Solution 5: Stack-based Object Allocation VM JIT

  72. Solution 5: Stack-based Object Allocation

  73. Solution 5: Stack-based Object Allocation VM

  74. Solution 5: Stack-based Object Allocation VM JIT

  75. Ruby 2.7 JIT future works

  76. More Ruby code in core • Many parts of the Ruby implementation are written in C, which blocks optimizations like method inlining • @ko1 is proposing to use more Ruby in core and to add more method metadata • Let's do it
  77. TracePoint support • Ruby 2.6's JIT just stops when TracePoint is enabled • TracePoint is often enabled in development environments • e.g. web-console + bindex, zeitwerk
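Even a tiny trace like the sketch below counts as "TracePoint is enabled" (the method name is mine); tools like web-console + bindex enable traces like this in development, which is what makes Ruby 2.6's JIT stop.

```ruby
# Sketch: a minimal TracePoint on Ruby-level :call events.
def sample_method  # hypothetical method, just something to trace
  42
end

calls = []
trace = TracePoint.new(:call) { |tp| calls << tp.method_id }
trace.enable { sample_method }
calls  # => [:sample_method]
```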
  78. Use GVL from MJIT worker • Ruby 2.6's JIT does not compile methods during blocking IO or sleep • That would be the best time to do compilation!
  79. LLVM as an optional loader • The overhead might be coming from dlopen's way of loading • What if we: • Generate LLVM IR with Clang from MJIT's C code • Load it into an LLVM Module and execute it
  80. Conclusion • At this stage, we're focusing on Rails speed with JIT after all compilations finish • We're not there yet, but moving forward • JIT will allow us to stop caring about micro-optimizations in real-world code • No more Performance/* cops!