MJIT is over 3x faster, which is very impressive, and it's already doing better than both JRuby and Rubinius. TruffleRuby is over 300x faster (I only mention it because it's my own implementation of a Ruby JIT), so there's still lots of room for optimization, as the authors have already said themselves.
languages, and I created a PoC: LLRB • http://github.com/k0kubun/llrb • But I learned that we can't efficiently use it for Ruby • Major optimization is done by inlining Ruby core's LLVM IR generated by clang • Just generating C code and using clang seemed enough
• It writes a C file generated from a method's bytecode to disk (method JIT) • Then it lets cc(1) compile the C code into a .so file, and dynamically loads it • This idea was proposed and implemented by Vladimir Makarov • https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch
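To make the "C file generated from a method's bytecode" step concrete, here is a minimal sketch of such a generator. It is a toy model, not MJIT's actual code: the instruction names are real YARV instruction names, but `Insn`, `generate_c`, and the shape of the emitted C are assumptions made for illustration.

```ruby
# Toy model: turns a tiny list of YARV-like instructions into C source text.
# Insn and generate_c are hypothetical names, not part of MJIT.
Insn = Struct.new(:name, :operand)

def generate_c(func_name, insns)
  body = insns.map do |insn|
    case insn.name
    when :putobject then "    stack[sp++] = #{Integer(insn.operand)};"
    when :opt_plus  then "    sp--; stack[sp - 1] = stack[sp - 1] + stack[sp];"
    when :leave     then "    return stack[--sp];"
    else raise ArgumentError, "unsupported instruction: #{insn.name}"
    end
  end

  lines = [
    "long #{func_name}(void)",
    "{",
    "    long stack[#{[insns.size, 1].max}];", # upper bound on stack depth
    "    int sp = 0;",
    *body,
    "}"
  ]
  lines.join("\n") + "\n"
end

# Bytecode for a method body equivalent to `1 + 2`.
insns = [
  Insn.new(:putobject, 1),
  Insn.new(:putobject, 2),
  Insn.new(:opt_plus, nil),
  Insn.new(:leave, nil)
]
puts generate_c("jit_add", insns)
```

The resulting text is what would be written to a temporary .c file and handed to cc(1) in the flow described above.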
[Diagram: Ruby process with a VM thread and an MJIT worker thread. The VM thread enqueues bytecode to a queue and the MJIT worker dequeues it, generates C code from the bytecode, and runs CC to produce a .so file. At build time, CC transforms the VM source into a header that is included by the generated C code. The .so file is loaded as a function pointer to machine code, which is called by the VM thread.]
is almost unchanged • Maintenance cost of JIT compiler is relatively low • Downside • C compiler becomes an optional runtime dependency • It's highly recommended to keep the C compiler used to build Ruby available on your server/container
JIT Infrastructure: "MJIT" • JIT Compiler: "YARV-MJIT" • MJIT had a built-in JIT compiler, but it required many VM changes and was risky • So I built a conservative JIT compiler which runs on top of MJIT • Let's talk about those 2 components
Ruby runtime on MJIT worker thread • Ruby VM is process global, and Ruby runtime is not thread safe • Who wants to apply GVL between main thread and JIT thread? • Using Ruby runtime on MJIT worker causes random SEGV...
compiles files like: "/tmp/_ruby_mjit_p12789u161.c" • p12789 is the PID and u161 is a sequential number, so the file name can be easily predicted • MJIT worker should prevent the file from being modified by others • The initial implementation had a vulnerability • nobu fixed it to use: "open(c_file, O_EXCL|O_CREAT, 0600)" • "O_EXCL|O_CREAT" is needed because an existing file may have unexpected permissions
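The same exclusive-create idea can be sketched in Ruby with `File.open` and the `File::EXCL | File::CREAT` flags. This is only an illustration of the flag combination quoted above: the helper name `create_private_file` and the file name are made up, and MJIT itself does this in C.

```ruby
require "tmpdir"

# Create `path` only if it does not exist yet, with owner-only permissions.
# If someone else already created the file, File.open raises Errno::EEXIST
# instead of silently reusing a file with unknown permissions.
# create_private_file is a hypothetical helper name.
def create_private_file(path)
  File.open(path, File::WRONLY | File::CREAT | File::EXCL, 0o600) do |f|
    yield f
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "_example_jit.c") # illustrative name only

  create_private_file(path) { |f| f.write("int x;\n") }

  begin
    # A second attempt on the same predictable name must fail.
    create_private_file(path) { |f| f.write("/* tampered */\n") }
  rescue Errno::EEXIST
    puts "refused to reuse existing file"
  end
end
```

Without `File::EXCL`, the second call would open the existing file and the mode argument would be ignored, which is the situation the fix avoids.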
Windows native thread early • The actual hard parts: • long is 32-bit; MinGW still seems to have some issues with it • cl.exe (Visual Studio) and Windows headers are not good for preprocessing
AIX, NetBSD, MinGW... • JIT header generation depends on gcc/clang's "-E -dD", which preprocesses C code while leaving macro definitions in the output • But Visual Studio doesn't have such a feature... • Use Pure-Ruby C preprocessor for Windows (!?) • Dynamic C code transformation by regexp (!!!) • Adding "static inline" for inlining and to reduce compilation time
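As a rough illustration of what "C code transformation by regexp" can look like, the sketch below prefixes top-level, single-line function headers with "static inline". The pattern `FUNC_HEADER` and the helper `add_static_inline` are assumptions for this example; the real transformation has to cope with far messier input, and this sketch ignores multi-line headers, `extern`, and attributes.

```ruby
# Hypothetical pattern: an unindented, single-line function header such as
# "int add(int a, int b)" that does not already start with "static".
# Prototypes (ending in ";") and indented statements do not match.
FUNC_HEADER = /\A(?!static\b)(?:[A-Za-z_]\w*[\s*]+)+[A-Za-z_]\w*\s*\([^;{}]*\)\s*\{?\s*\z/

# Prefix every matching header line with "static inline".
def add_static_inline(c_source)
  c_source.each_line.map do |line|
    FUNC_HEADER.match?(line.chomp) ? "static inline #{line}" : line
  end.join
end

source = <<~C
  int add(int a, int b);
  int add(int a, int b)
  {
      return a + b;
  }
  static int helper(void) { return 1; }
C

puts add_static_inline(source)
```

Only the definition of `add` is rewritten here; the prototype and the already-static `helper` are left alone.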
--jit-wait - if JIT is triggered, wait until JIT compilation is finished • --jit-min-calls=N - change the threshold to trigger JIT • This is needed to control inlining by call cache (explained later) • Now trunk has unit tests that spawn "ruby --jit-wait --jit-min-calls=1 --jit-verbose=1" and confirm stderr has "JIT success" output • When a big JIT change is made, we need to verify that "make test-all" passes with RUN_OPTS="--jit-wait --jit-min-calls=1" (and "--jit-min-calls=5" too for call cache)
a single object file mjit_compile.o, and its interface is only a single function mjit_compile() • I believe the current approach is the easiest to maintain and has no blocker for any JIT optimization • But if we find a better strategy for the JIT compiler, we can easily replace it entirely • Vladimir Makarov is working on another approach that uses RTL as an intermediate representation between YARV instructions and JIT-ed code
[Diagram: "This is an ERB template that generates Ruby code that generates C code that generates JIT-ed C code." Labels: #compile, Kernel#eval, fprintf, gcc/clang, Machine Code. Legend: build-time-only source, MJIT worker source, JIT-ed temporary code.]
THROW_EXCEPTION • Special compilation of JUMP for opt_case_dispatch • Keep moving program counter to meet catch table • Properly ignore unhandled execution from exception handler • We may be able to support it later • tl;dr: it was hard
function definitions in MJIT header as many as possible • Major optimization is done here, by inlining VM operations in MJIT header • Non-automated example: • Carve out fast path of method search function and inline it • Inline function used by instruction optimized by VM • I inlined Array#[] with Integer argument and it makes VM faster too
Method call setup: method search, prepare arguments, push frame • VM has a cache for method calls, and the JIT compiler utilizes it • But it requires the receiver's class to invalidate the cache • JIT compiler doesn't know the receiver at compilation time • I introduced an invalidator for obsolete call caches to avoid random SEGV
def baz; 2; end • Bytecode A: putobject 2 • Bytecode B: putobject 1, opt_send :baz, opt_plus • cache: A, serial: 2 • Once a method is called, the call cache holds a pointer to the bytecode and a serial
def baz; 2; end • Bytecode A: putobject 2 • Bytecode B: putobject 1, opt_send :baz, opt_plus • cache: A, serial: 2 • When the receiver object's class is Foo, it has a new serial and invalidates the old one • def baz; 3; end • Bytecode C: putobject 3 • On method redefinition, it increments the serial
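The serial check described on these slides can be modeled in a few lines of Ruby. This is a toy model under stated assumptions: `ToyClass` and `CallCache` are hypothetical names, Ruby procs stand in for bytecode, and the real VM's cache layout and serial handling differ.

```ruby
# Toy model of a serial-validated call cache. All names are hypothetical.
class ToyClass
  attr_reader :serial

  def initialize
    @methods = {}
    @serial = 0
  end

  # Defining or redefining a method bumps the class serial.
  def define(name, &body)
    @methods[name] = body
    @serial += 1
  end

  # Slow path: full method search.
  def search(name)
    @methods.fetch(name)
  end
end

class CallCache
  attr_reader :misses

  def initialize
    @body = nil
    @serial = nil
    @misses = 0
  end

  # Use the cached body only while the cached serial matches the class serial.
  def call(klass, name)
    unless @body && @serial == klass.serial
      @misses += 1
      @body = klass.search(name)
      @serial = klass.serial
    end
    @body.call
  end
end

foo = ToyClass.new
foo.define(:baz) { 2 }
cache = CallCache.new
p cache.call(foo, :baz) # slow path, fills the cache
p cache.call(foo, :baz) # cache hit
foo.define(:baz) { 3 }  # redefinition bumps the serial
p cache.call(foo, :baz) # stale serial, slow path again
```

The point of the model is that a cached pointer is only trusted while its serial is current, which is the condition JIT-ed code also has to respect before relying on the cache.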
we have a JIT compiler for bytecode, when the call cache has valid bytecode, we can inline it and invalidate it by call cache • The patch is almost complete but has not been properly verified/measured yet
instructions have a guard for the receiver class to optimize (like opt_aref has a guard for Array / Hash), and they dispatch a normal method call if the class is not the expected one • But if a non-optimized method is called, we can eliminate the guard by call cache
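The guard-then-fallback shape of such an instruction can be sketched as follows. `toy_opt_aref` is a hypothetical stand-in written in Ruby; the real opt_aref lives in the VM and also checks that the core method has not been redefined, which this sketch omits.

```ruby
# Hypothetical stand-in for an optimized instruction such as opt_aref.
# Returns the looked-up value and which path was taken.
def toy_opt_aref(recv, index)
  if recv.instance_of?(Array) && index.is_a?(Integer)
    [recv.at(index), :fast_path]              # guard passed: specialized code
  else
    [recv.public_send(:[], index), :dispatch] # guard failed: normal method call
  end
end

p toy_opt_aref([10, 20, 30], 1) # guard passes
p toy_opt_aref({ a: 1 }, :a)    # falls back to normal dispatch
```

When the receiver is never an Array, the guard only adds overhead, which is why removing it for call sites known to hit a non-optimized method is attractive.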
is called, the JIT-ed function call frame goes away • We must restore VM's state so that it matches the state in the middle of the JIT-ed function • I'm moving the stack pointer in JIT-ed code even though it's sometimes unnecessary • As we're moving the program counter, we can restore the stack pointer from it • But it's hard...
[Diagram: baz raises "err" in the middle of a JIT-ed function. Panels: JIT local variable array, VM stack, program counter. What we need to do: dynamic stack extension (difficult) to insert value.]
"trace" instruction by default, and it dynamically alters all bytecodes to support tracing when TracePoint is enabled • It means that we need to cancel JIT function call on it • For now, I added guards for it after any method call • If we can cancel JIT-ed function call to VM execution outside the frame by longjmp properly, we can remove the guards
of NES emulator (optcarrot) is different from Rails, and currently Rails is not optimized by the JIT • There is no single perfect benchmark for Ruby • I believe JIT can improve performance of many pure-Ruby parts on Rails, but somehow it's not the case for now • I need more time to investigate the reason
somewhat working on MinGW, but it still has some bugs to be addressed • Visual Studio support • usa has already done some great work • Installing VM sources or pure-Ruby C preprocessor?
inlining • We can use the same strategy as Ruby -> Ruby method inlining • If we successfully build a header that has both core method definitions and VM implementation, we may be able to do this • Not tried yet, but identifying the function in call cache might be a blocker
Using "while" is faster than "Enumerable#each", but many Ruby developers don't want to write "while" • Inlining block in JIT should solve it • But for now, such block invocation inside Ruby core methods is outside the JIT compiler's control when generating JIT-ed code
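For reference, these are the two styles being compared. The method names are made up for this example, and no timing is claimed here; the difference depends on the Ruby version and workload, and the standard `benchmark` library can be used to measure it.

```ruby
# Block-based iteration: the block is invoked from inside a core method,
# so the loop itself is not visible to code generated for this method.
def sum_with_each(values)
  total = 0
  values.each { |v| total += v }
  total
end

# Explicit loop: every step is ordinary bytecode of this method.
def sum_with_while(values)
  total = 0
  i = 0
  while i < values.size
    total += values[i]
    i += 1
  end
  total
end

values = (1..10).to_a
p sum_with_each(values)
p sum_with_while(values)
```

Both methods compute the same sum; the `while` version simply trades readability for a loop with no block call.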