MJIT is over 3x faster! That's very impressive, and it's already doing better than both JRuby and Rubinius. TruffleRuby is over 300x faster (I only mention it because it's my own implementation of a Ruby JIT), so there's still lots of room for optimization, as the authors have said themselves.
languages, and I created a PoC: LLRB • http://github.com/k0kubun/llrb • But I learned that we can't efficiently use it for Ruby • The major optimization is done by inlining Ruby core's LLVM IR generated by clang • Just generating C code and using clang seemed enough
• It puts a C file generated from a method's bytecode on disk (method JIT) • Then it lets cc(1) compile the C code to a .so file, and dynamically loads it • This idea was proposed and implemented by Vladimir Makarov • https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch
[Diagram: MJIT architecture inside the Ruby process — the VM thread enqueues bytecode to be JIT-ed; the MJIT worker thread dequeues it and generates C code from the bytecode; the C compiler (CC), which includes the header transformed at build time, compiles that C code to a .so file; the worker loads it and returns a function pointer of machine code, which is called by the VM thread.]
is almost unchanged • Maintenance cost of the JIT compiler is relatively low • Downside • A C compiler becomes an optional runtime dependency • It's highly recommended to keep the C compiler used to build Ruby available on your server/container
JIT Infrastructure: "MJIT" • JIT Compiler: "YARV-MJIT" • MJIT had a built-in JIT compiler, but it required many VM changes and was risky • So I built a conservative JIT compiler which runs on top of MJIT • Let's talk about those 2 components
Ruby runtime on the MJIT worker thread • The Ruby VM is process-global, and the Ruby runtime is not thread safe • Who wants to apply the GVL between the main thread and the JIT thread? • Using the Ruby runtime on the MJIT worker causes random SEGVs...
compiles files like "/tmp/_ruby_mjit_p12789u161.c" • p12789 is the PID, u161 is a sequential number, so the name can be easily predicted • The MJIT worker should prevent the file from being modified by others • The initial implementation had a vulnerability • nobu fixed it to use "open(c_file, O_EXCL|O_CREAT, 0600)" • "O_EXCL|O_CREAT" is needed because an existing file may have unexpected permissions
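The same exclusive-create pattern can be demonstrated from Ruby with `File::EXCL | File::CREAT` (the fix itself is in MJIT's C code; the path below just mirrors the slide's example name):

```ruby
require "tmpdir"

dir = Dir.mktmpdir
c_file = File.join(dir, "_ruby_mjit_p12789u161.c")

# Create the file only if it does not already exist, owner-only permission
File.open(c_file, File::WRONLY | File::CREAT | File::EXCL, 0o600) do |f|
  f.write("/* generated C code */\n")
end

# A second exclusive create fails with EEXIST instead of silently reusing a
# file that an attacker may have pre-created at the predictable path:
begin
  File.open(c_file, File::WRONLY | File::CREAT | File::EXCL, 0o600)
rescue Errno::EEXIST
  puts "EEXIST"  # => EEXIST
end
```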
AIX, NetBSD, MinGW... • JIT header generation depends on gcc/clang's "-E -dD", which preprocesses C code while leaving macros • But Visual Studio doesn't have such a feature... • Use a pure-Ruby C preprocessor for Windows (!?) • Dynamic C code transformation by regexp (!!!) • Adding "static inline" for inlining and to reduce compilation time
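A toy version of that regexp-based transformation, prefixing a function definition with "static inline" (the real transformation in Ruby's build scripts is much more careful than this illustration; the function name is made up):

```ruby
# A header fragment as it might appear after preprocessing
header = <<~C
  VALUE
  example_vm_func(rb_execution_context_t *ec)
  {
      return Qundef;
  }
C

# Prefix the definition with "static inline" so the C compiler can inline it
# into JIT-ed code and avoid emitting unused copies
transformed = header.sub(/\A(?=VALUE\b)/, "static inline ")
puts transformed.lines.first  # => "static inline VALUE\n"
```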
--jit-wait - if JIT is triggered, wait until JIT compilation is finished • --jit-min-calls=N - change the threshold to trigger JIT • This is needed to control inlining by call cache (explained later) • Now trunk has unit tests that spawn "ruby --jit-wait --jit-min-calls=1 --jit-verbose=1" and confirm stderr has "JIT success" output • When a big JIT change is made, we need to verify that "make test-all" passes with RUN_OPTS="--jit-wait --jit-min-calls=1" (and "--jit-min-calls=5" too, for call cache)
a single object file, mjit_compile.o, and its interface is only a single function, mjit_compile() • I believe the current approach is the easiest to maintain and has no blocker for any JIT optimization • But if we find a better strategy for the JIT compiler, we can fully replace it easily • Vladimir Makarov is working on another approach that uses RTL as an intermediate representation between YARV instructions and JIT-ed code
[Diagram: code-generation pipeline — "This is an ERB template that generates Ruby code that generates C code that generates JIT-ed C code." The ERB template (source, build-time only) is compiled (#compile) into the MJIT worker source; at runtime the worker (Kernel#eval, fprintf) emits the JIT-ed temporary C code, which gcc/clang compiles into machine code.]
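A hypothetical miniature of the "ERB generates C code" step — the template, function name, and values below are illustrative only, not MJIT's real templates:

```ruby
require "erb"

# ERB template that generates the C source of a trivial "JIT-ed" function
template = ERB.new(<<~TMPL)
  static VALUE
  _mjit<%= func_id %>(rb_execution_context_t *ec, rb_control_frame_t *cfp)
  {
      return INT2FIX(<%= value %>);
  }
TMPL

func_id = 0
value = 42
c_code = template.result(binding)
puts c_code
```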
THROW_EXCEPTION • Special compilation of JUMP for opt_case_dispatch • Keep moving the program counter to match the catch table • Properly ignore unhandled execution from the exception handler • We may be able to support it later • tl;dr: it was hard
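The catch table the slide refers to is visible when disassembling an iseq with begin/rescue — JIT-ed code has to keep the program counter accurate so these entries still resolve:

```ruby
# Compile a method body with an exception handler and inspect its bytecode
iseq = RubyVM::InstructionSequence.compile(<<~RUBY)
  begin
    raise "boom"
  rescue => e
    e.message
  end
RUBY

# The disassembly includes a "catch table" section mapping instruction
# ranges to their rescue handlers
puts iseq.disasm
```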
function definitions in the MJIT header as much as possible • Major optimization is done here, by inlining VM operations from the MJIT header • Non-automated example: • Carve out the fast path of the method search function and inline it • Inline functions used by instructions optimized by the VM • I inlined Array# with an Integer argument, and it makes the VM faster too
Method call setup: method search, prepare arguments, push frame • The VM has a cache for method calls, and the JIT compiler utilizes it • But it requires the receiver's class to invalidate the cache • The JIT compiler doesn't know the receiver at compilation time • I introduced an invalidator for obsoleted call caches to avoid random SEGVs
[Slide example: call cache serials — `def baz; 2; end` compiles to Bytecode A (putobject 2); a call site compiles to Bytecode B (putobject 1; opt_send :baz; opt_plus) with cache :A, serial: 2. When the receiver object's class is Foo, it has a new serial and invalidates the old one. Redefining `def baz; 3; end` compiles to Bytecode C (putobject 3); on method redefinition, the serial is incremented.]
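A pure-Ruby model of that serial-based invalidation (not the VM's real data structures — class and struct names here are made up): a cache entry stays valid only while its recorded serial matches the class's current serial, and redefinition bumps the serial.

```ruby
# Stand-in for a class's VM-internal state
class KlassState
  attr_reader :serial
  def initialize; @serial = 1; end
  def redefine_method!; @serial += 1; end  # redefinition increments serial
end

# Stand-in for an inline call cache entry
CacheEntry = Struct.new(:serial, :target)

foo = KlassState.new
cache = CacheEntry.new(foo.serial, :baz_v1)  # filled on the first call

hit_before = (cache.serial == foo.serial)    # cache hit
foo.redefine_method!                         # e.g. `def baz; 3; end`
hit_after  = (cache.serial == foo.serial)    # stale: must re-search the method

p [hit_before, hit_after]  # => [true, false]
```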
instructions have a guard for the receiver class to optimize (e.g. opt_aref has a guard for Array / Hash), and they dispatch a normal method call if the class is not the expected one • But if the non-optimized method is not called, we can eliminate the guard by using the call cache
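The specialized instruction is easy to see in a disassembly — `a[0]` compiles to opt_aref, the instruction that carries the Array/Hash guard:

```ruby
# Compile a snippet that indexes an Array and inspect its bytecode
aref_iseq = RubyVM::InstructionSequence.compile("a = [1, 2, 3]; a[0]")
puts aref_iseq.disasm
```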
is called, the JIT-ed function's call frame goes away • We must restore the VM's state so that it's the same as in the middle of the JIT-ed function • I'm moving the stack pointer in JIT-ed code even though it's sometimes unnecessary • As we're moving the program counter, we could restore the stack pointer from it • But it's hard...
"trace" instruction by default, and it dynamically alters all bytecode to support tracing when TracePoint is enabled • This means we need to cancel the JIT-ed function call in that case • For now, I added guards for it after every method call • If we can properly cancel a JIT-ed function call back to VM execution outside the frame by longjmp, we can remove the guards
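TracePoint can be switched on at any moment at runtime, which is exactly why JIT-ed code needs those guards — the method name below is made up for illustration:

```ruby
events = []
tp = TracePoint.new(:call) { |t| events << t.method_id }

def traced_method
  :ok
end

# Enabling tracing retroactively affects already-defined methods; bytecode is
# rewritten to the tracing variants, and JIT-ed code must notice and bail out
tp.enable { traced_method }
p events  # => [:traced_method]
```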
of an NES emulator (optcarrot) is different from Rails, and currently Rails is not optimized by the JIT • There is no single perfect benchmark for Ruby • I believe JIT can improve the performance of many pure-Ruby parts of Rails, but somehow that's not the case for now • I need more time to investigate the reason
inlining • We can use the same strategy as Ruby -> Ruby method inlining • If we successfully build a header that has both core method definitions and the VM implementation, we may be able to do this • Not tried yet, but identifying the function in the call cache might be a blocker
Using "while" is faster than "Enumerable#each", but many Ruby developers don't want to write "while" • Inlining blocks in the JIT should solve this • But such block invocations in Ruby core methods are out of the JIT's control when generating JIT-ed code for now
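The two styles compute the same result; the while version avoids a block invocation per element, which is precisely the overhead that block inlining in the JIT would remove:

```ruby
# Idiomatic: calls the block once per element via Array#each
def sum_each(ary)
  total = 0
  ary.each { |x| total += x }
  total
end

# Manual loop: no block invocation, just an index and a condition
def sum_while(ary)
  total = 0
  i = 0
  while i < ary.size
    total += ary[i]
    i += 1
  end
  total
end

ary = (1..100).to_a
p sum_each(ary)   # => 5050
p sum_while(ary)  # => 5050
```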