Slide 1

Slide 1 text

VM-Generated JIT compiler for Ruby 2.6 PLAZMA OSS Day: TD Tech Talk 2018 Takashi Kokubun

Slide 2

Slide 2 text

Who? • GitHub, Twitter: k0kubun • Ruby Committer • Maintainer of default template engine: ERB • Developed some JIT compilers for Ruby • LLRB, YARV-MJIT

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Ad: WEB+DB PRESS Vol.103 • Introducing optimized Ruby 2.5 features • Real example of Ruby code optimization • Profiling • Bytecode-wise optimization

Slide 5

Slide 5 text

NEWS: Ruby 2.6 merged JIT compiler

Slide 6

Slide 6 text

How is the performance? [Bar chart: Optcarrot benchmark, fps (axis 0 to 60), for Ruby 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0, 2.5.0, and 2.6.0-dev r62403; legend: JIT off, JIT on. Bar values as extracted: 59.22, 53.09, 48.33, 45.54, 38.92, 38.32, 38.76, 37.2] • Intel 4.0GHz i7-4790K with 16GB memory under x86-64 Ubuntu, 8 cores • https://github.com/mame/optcarrot

Slide 7

Slide 7 text

How is the performance? [Bar chart: MJIT micro benchmarks with 2.6.0-dev r62403, speedup ratio compared to JIT off (axis 0 to 3). Benchmarks: aread, aref, aset, awrite, call, const2, fannk, fib, ivread, ivwrite, mandelbrot, meteor, nbody, nest-ntimes, nest-write, norm, nsvb, sieve, trees, while. JIT on values as extracted: 3.0, 1.1, 1.2, 1.1, 1.2, 1.2, 1.3, 1.1, 1.1, 2.1, 2.9, 1.5, 1.0, 2.3, 2.3, 1.5, 1.9, 2.1, 2.8, 2.1. JIT off is the 1.0 baseline for every benchmark] • Intel 4.0GHz i7-4790K with 16GB memory under x86-64 Ubuntu, 8 cores • https://github.com/benchmark-driver/mjit-benchmarks

Slide 8

Slide 8 text

How is the performance? https://twitter.com/ChrisGSeaton/status/961035035385237509 Running that it looks like MJIT is over 3x faster! Which is very impressive and it's already doing better than both JRuby and Rubinius. TruffleRuby is over 300x faster (I only mention it because it's my own implementation of a Ruby JIT), so there's still lots of rooms for optimizations, as the authors have already said themselves.

Slide 9

Slide 9 text

Agenda 1. Overview of Ruby's JIT compilation 2. JIT Infrastructure: The hard work for portability 3. JIT Compiler: Internals of VM-Generated JIT compiler 4. Future work

Slide 10

Slide 10 text

1. Overview of Ruby's JIT compilation

Slide 11

Slide 11 text

Options for JIT compilation • What to JIT-compile • Method JIT • Tracing JIT • How to JIT-compile • Generate assembly code and assemble • Use JIT library's interface like LLVM

Slide 12

Slide 12 text

How about constructing LLVM IR? • It's popular in modern languages, and I created a PoC: LLRB • http://github.com/k0kubun/llrb • But I learned that we can't use it efficiently for Ruby • The major optimization is done by inlining Ruby core's LLVM IR generated by clang • Just generating C code and using clang seemed enough

Slide 13

Slide 13 text

The Ruby's way: "MJIT" infrastructure • "MJIT" (MRI JIT) infrastructure • It writes a C file generated from a method's bytecode to disk (method JIT) • Then it lets cc(1) compile the C code to a .so file and dynamically loads it • This idea was proposed and implemented by Vladimir Makarov • https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch
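
The flow described on this slide (write C to disk, run cc(1), load the resulting .so, call into it) can be imitated from plain Ruby. The following is only a rough sketch under several assumptions: it uses Fiddle to load the shared object, the file and function names (`jit_sketch.c`, `jit_sketch_add`) are made up for illustration, the C code is hard-coded rather than generated from bytecode, and the flags are the gcc-style ones. It is not how MJIT itself is implemented, since MJIT does this work in C on a worker thread.

```ruby
require 'tmpdir'

# Sketch of the MJIT flow: write C to disk, run cc(1), load the .so, call it.
# Returns the function's result, or nil if no compiler or Fiddle is available.
def compile_and_call(cc = 'cc')
  require 'fiddle'
  Dir.mktmpdir do |dir|
    c_file  = File.join(dir, 'jit_sketch.c')   # hypothetical file name
    so_file = File.join(dir, 'jit_sketch.so')

    # 1. "Generate" C code (hard-coded here; MJIT derives it from bytecode).
    File.write(c_file, "int jit_sketch_add(int a, int b) { return a + b; }\n")

    # 2. Let the C compiler turn it into a shared object.
    ok = system(cc, '-shared', '-fPIC', '-w', '-o', so_file, c_file,
                out: File::NULL, err: File::NULL)
    return nil unless ok

    # 3. Load the shared object and wrap the symbol as a callable function.
    handle = Fiddle.dlopen(so_file)
    func = Fiddle::Function.new(handle['jit_sketch_add'],
                                [Fiddle::TYPE_INT, Fiddle::TYPE_INT],
                                Fiddle::TYPE_INT)
    func.call(1, 2)
  end
rescue LoadError, StandardError
  nil
end

result = compile_and_call
puts(result.nil? ? 'no C compiler or Fiddle available' : "native add(1, 2) = #{result}")
```

The sketch returns nil instead of failing when no compiler is present, which mirrors the point that the C compiler is an optional runtime dependency.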

Slide 14

Slide 14 text

The Ruby's way: "MJIT" infrastructure • [Diagram] Build time: VM's C code • Ruby process: VM Thread, queue, MJIT Worker Thread

Slide 15

Slide 15 text

The Ruby's way: "MJIT" infrastructure • [Diagram] Build time: VM's C code → (Transform) → header • Ruby process: VM Thread, queue, MJIT Worker Thread

Slide 16

Slide 16 text

The Ruby's way: "MJIT" infrastructure • [Diagram] Build time: VM's C code → (Transform) → header → (CC) → precompiled header • Ruby process: VM Thread, queue, MJIT Worker Thread

Slide 17

Slide 17 text

The Ruby's way: "MJIT" infrastructure • [Diagram] Build time: VM's C code → (Transform) → header → (CC) → precompiled header • Ruby process: VM Thread → (Enqueue) → queue of bytecode to JIT → (Dequeue) → MJIT Worker Thread

Slide 18

Slide 18 text

The Ruby's way: "MJIT" infrastructure • [Diagram] Build time: VM's C code → (Transform) → header → (CC) → precompiled header • Ruby process: VM Thread → (Enqueue) → queue of bytecode to JIT → (Dequeue) → MJIT Worker Thread → (Generate C code from bytecode) → C code

Slide 19

Slide 19 text

The Ruby's way: "MJIT" infrastructure • [Diagram] Build time: VM's C code → (Transform) → header → (CC) → precompiled header • Ruby process: VM Thread → (Enqueue) → queue of bytecode to JIT → (Dequeue) → MJIT Worker Thread → (Generate C code from bytecode) → C code, which includes the precompiled header → (CC) → .so file

Slide 20

Slide 20 text

The Ruby's way: "MJIT" infrastructure • [Diagram] Build time: VM's C code → (Transform) → header → (CC) → precompiled header • Ruby process: VM Thread → (Enqueue) → queue of bytecode to JIT → (Dequeue) → MJIT Worker Thread → (Generate C code from bytecode) → C code, which includes the precompiled header → (CC) → .so file → (Load) → function pointer of machine code, called by VM Thread

Slide 21

Slide 21 text

The Ruby's way: "MJIT" infrastructure • Upside • Build dependencies are almost unchanged • Maintenance cost of the JIT compiler is relatively low • Downside • A C compiler becomes an optional runtime dependency • It's highly recommended to keep the C compiler used to build Ruby available on your server/container

Slide 22

Slide 22 text

What did Ruby 2.6 merge? • Ruby 2.6 merged: • JIT Infrastructure: "MJIT" • JIT Compiler: "YARV-MJIT" • MJIT had a built-in JIT compiler, but it required many VM changes and was risky • So I built a conservative JIT compiler that runs on top of MJIT • Let's talk about those 2 components

Slide 23

Slide 23 text

2. JIT Infrastructure: The hard work for portability

Slide 24

Slide 24 text

Command line construction for C compilers • Spawn compiler with $(CC) and compiler-specific flags (improved by nobu, usa) • gcc: gcc -fPIC -shared -w -pipe ... • clang: clang -O2 -dynamic -w -bundle -include-pch ... • cl.exe: cl.exe -Fe ...

Slide 25

Slide 25 text

Command line construction for C compilers • Ruby committers want to use Ruby

Slide 26

Slide 26 text

Command line construction for C compilers • We can't use Ruby runtime on MJIT worker thread • Ruby VM is process global, and Ruby runtime is not thread safe • Who wants to apply GVL between main thread and JIT thread? • Using Ruby runtime on MJIT worker causes random SEGV...

Slide 27

Slide 27 text

Extra topic: Security on dynamic loading • It creates and compiles files like: "/tmp/_ruby_mjit_p12789u161.c" • p12789 is the PID and u161 is a sequential number, so the name can be easily predicted • The MJIT worker should prevent the file from being modified by others • The initial implementation had a vulnerability • nobu fixed it to use: "open(c_file, O_EXCL|O_CREAT, 0600)" • "O_EXCL|O_CREAT" is needed because an existing file may have unexpected permissions
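
The effect of "O_EXCL|O_CREAT" can be seen from Ruby, whose `File.open` accepts the same flags. This is a minimal sketch, not the MJIT code itself (which calls open(2) from C): the helper name `create_exclusive` is made up, and a private temporary directory stands in for /tmp.

```ruby
require 'tmpdir'
require 'fileutils'

# Create the file only if it does not exist yet, with mode 0600.
# Returns true on success and false if someone else created it first.
def create_exclusive(path)
  File.open(path, File::WRONLY | File::CREAT | File::EXCL, 0o600) do |f|
    f.write("/* generated C code would go here */\n")
  end
  true
rescue Errno::EEXIST
  false
end

dir = Dir.mktmpdir
c_file = File.join(dir, '_ruby_mjit_p12789u161.c')

first  = create_exclusive(c_file)  # the file is new, so creation succeeds
second = create_exclusive(c_file)  # the file already exists, so open fails with EEXIST
puts "first: #{first}, second: #{second}"

FileUtils.remove_entry(dir)
```

Because the second call fails rather than reusing the existing file, a file planted in advance under the predictable name is never written to or compiled.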

Slide 28

Slide 28 text

Windows support • I was able to port MJIT's pthread usage to Windows native threads early on • The actual hard parts: • long is 32-bit, and MinGW still seems to have some issues with it • cl.exe (Visual Studio) and Windows headers are not well suited to preprocessing

Slide 29

Slide 29 text

Transformation of C header for JIT • Platform support: ICC, AIX, NetBSD, MinGW... • JIT header generation depends on gcc/clang's "-E -dD", which preprocesses C code while leaving macro definitions in place • But Visual Studio doesn't have such a feature... • Use a pure-Ruby C preprocessor for Windows (!?) • Dynamic C code transformation by regexp (!!!) • Adding "static inline" for inlining and to reduce compilation time

Slide 30

Slide 30 text

Transformation of C header for JIT • He says it is not mature yet and not meant very seriously for now

Slide 31

Slide 31 text

Find C function with regexp ↓ Transform with String#sub!
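
As a rough illustration of this "find with regexp, transform with String#sub!" approach, the sketch below prepends "static inline" to function definitions in a tiny header. The regexp and the sample header are simplified assumptions for illustration only. The real transformation has to cope with far messier preprocessed C.

```ruby
# Matches a one-line function definition such as "int f(int x) {",
# skipping lines that already start with "static".
definition = /\A(?!static\b)\w[\w\s*]*\b\w+\([^;()]*\)\s*\{/

# Prepend "static inline" to every line that looks like a function definition.
def add_static_inline(src, pattern)
  src.lines.each do |line|
    line.sub!(/\A/, 'static inline ') if line =~ pattern
  end.join
end

header = <<~C
  int declared_only(int x);
  int vm_helper(int x) { return x + 1; }
  static int already_static(void) { return 0; }
C

transformed = add_static_inline(header, definition)
puts transformed
```

Only `vm_helper` is rewritten here: the declaration ends with a semicolon instead of a body, and the last function is already static.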

Slide 32

Slide 32 text

Testing strategy • ruby(1) introduced options for JIT testing: • --jit-wait - if JIT is triggered, wait until JIT compilation is finished • --jit-min-calls=N - change the threshold to trigger JIT • This is needed to control inlining by call cache (explained later) • Now trunk has unit tests that spawn "ruby --jit-wait --jit-min-calls=1 --jit-verbose=1" and confirm that stderr has "JIT success" output • When a big JIT change is made, we need to verify that "make test-all" passes with RUN_OPTS="--jit-wait --jit-min-calls=1" (and "--jit-min-calls=5" too for call cache)

Slide 33

Slide 33 text

Replaceable JIT compiler • Ruby's JIT compiler is implemented as a single object file, mjit_compile.o, and its interface is only a single function, mjit_compile() • I believe the current approach is the easiest to maintain and has no blocker for any JIT optimization • But if we find a better strategy for the JIT compiler, we can easily replace it entirely • Vladimir Makarov is working on another approach that uses RTL as an intermediate representation between YARV instructions and JIT-ed code

Slide 34

Slide 34 text

3. JIT Compiler: Internals of VM-Generated JIT compiler

Slide 35

Slide 35 text

The design philosophy of my JIT compiler • Make it very easy to maintain and debug • Keep it simple at the first release to minimize risks

Slide 36

Slide 36 text

The commit for Ruby's initial JIT compiler

Slide 37

Slide 37 text

JIT compiler needed only 680 lines (2,584 in total with MJIT infrastructure)

Slide 38

Slide 38 text

Super meta code generator • [Diagram] ERB template → (ERB#compile) → Ruby → (Kernel#eval) → C → (fprintf) → C → (gcc/clang) → Machine Code • Stage labels: Source, Build-time only, MJIT worker source, JIT-ed temporary code • "This is an ERB template that generates Ruby code that generates C code that generates JIT-ed C code."
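
The first two stages of this chain (ERB template to Ruby code, then Ruby code to C code) can be tried with the standard `erb` library. The sketch below is a toy under stated assumptions: the two-entry instruction table and the C bodies are invented for illustration, and the real template is driven by the VM's instruction definitions. The generated C is the "JIT compiler" stage, a switch-case that would fprintf JIT-ed code at runtime.

```ruby
require 'erb'

# Hypothetical, heavily simplified instruction table (name => C body).
insns = {
  'putnil' => 'stack[sp++] = Qnil;',
  'pop'    => 'sp--;',
}

# Stage 0: the ERB template, written once by hand.
template = <<~'TEMPLATE'
  switch (insn) {
  <% insns.each do |name, body| -%>
    case BIN(<%= name %>):
      fprintf(f, "  <%= body %>\n");
      break;
  <% end -%>
  }
TEMPLATE

# Stage 1: ERB template -> Ruby code.
ruby_src = ERB.new(template, trim_mode: '-').src

# Stage 2: Ruby code -> C source of the JIT compiler (via Kernel#eval).
compiler_c = eval(ruby_src, binding)
puts compiler_c
```

Adding an entry to `insns` adds a `case` to the generated C without touching the template, which is the property that lets the JIT compiler's source follow VM changes automatically.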

Slide 39

Slide 39 text

Switch-case for each instruction ERB

Slide 40

Slide 40 text

Static macro expansion Main JIT implementation (Just printing VM source) Dynamic macro expansion ERB

Slide 41

Slide 41 text

Generated C code (JIT compiler) fprintf for each instruction

Slide 42

Slide 42 text

Generated C code (JIT-ed code) Copy-paste of VM instruction code (sometimes optimized)

Slide 43

Slide 43 text

Super meta code generator • Even while I'm sleeping, JIT compiler's source code is updated automatically when VM implementation is changed • JIT compiler actually worked before and after recent VM changes

Slide 44

Slide 44 text

Hacks to achieve this automation • Replacing macros like EXEC_EC_CFP, THROW_EXCEPTION • Special compilation of JUMP for opt_case_dispatch • Keep moving program counter to meet catch table • Properly ignore unhandled execution from exception handler • We may be able to support it later • tl;dr: it was hard

Slide 45

Slide 45 text

Optimization 1: VM instruction inlining for JIT • Put as many C function definitions as possible in the MJIT header • The major optimization is done here, by inlining VM operations defined in the MJIT header • Non-automated examples: • Carve out the fast path of the method search function and inline it • Inline functions used by instructions that the VM optimizes • I inlined Array#[] with an Integer argument, which makes the VM faster too

Slide 46

Slide 46 text

Separate slow path as external function (which is slow to compile, so header doesn't have its definition) Make sure fast path is inlined (kept in JIT header)

Slide 47

Slide 47 text

Change external function reference to inline function (for fast path) Array#[] with Integer is optimized in both VM and JIT

Slide 48

Slide 48 text

Optimization 2: Inlining method call setup by call cache • Method call setup: method search, prepare arguments, push frame • The VM has a cache for method calls, and the JIT compiler utilizes it • But it requires the receiver class to invalidate the cache • The JIT compiler doesn't know the receiver at compilation time • I introduced an invalidator for obsolete call caches to avoid random SEGV

Slide 49

Slide 49 text

class Foo (serial 0)

Slide 50

Slide 50 text

class Foo (serial 1) • def baz; 2; end • Increment class serial on method definition

Slide 51

Slide 51 text

class Foo (serial 2) • def bar; 1 + baz; end • def baz; 2; end • Increment class serial on method definition

Slide 52

Slide 52 text

class Foo (serial 2) • def bar; 1 + baz; end • def baz; 2; end • Bytecode A: putobject 2 • Bytecode B: putobject 1, opt_send :baz (cache: nil), opt_plus • On generating bytecode, it creates call cache

Slide 53

Slide 53 text

class Foo (serial 2) • def bar; 1 + baz; end • def baz; 2; end • Bytecode A: putobject 2 • Bytecode B: putobject 1, opt_send :baz (cache: :A, serial: 2), opt_plus • Once method is called, it holds pointer to bytecode and serial

Slide 54

Slide 54 text

class Foo (serial 3) • def bar; 1 + baz; end • def baz; 2; end, redefined as def baz; 3; end • Bytecode A: putobject 2 • Bytecode B: putobject 1, opt_send :baz (cache: :A, serial: 2), opt_plus • Bytecode C: putobject 3 • On method redefinition, it increments serial • When receiver object's class is Foo, it has new serial and invalidates old one
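
The serial-based invalidation walked through on the last few slides can be modeled in a few lines of Ruby. This is a toy model only: classes are plain hashes, methods are lambdas, and the helper names (`define_method_on`, `call_with_cache`) are invented for the sketch. The VM's actual call cache is a C structure that this does not try to mirror.

```ruby
# Toy model: defining (or redefining) a method bumps the class serial.
def define_method_on(klass, name, body)
  klass[:methods][name] = body
  klass[:serial] += 1
end

# A call site caches the resolved method together with the class serial.
def call_with_cache(cache, klass, name)
  if cache[:serial] != klass[:serial]             # guard: cache is empty or stale
    cache[:method] = klass[:methods].fetch(name)  # slow path: method search
    cache[:serial] = klass[:serial]
    cache[:misses] += 1
  end
  cache[:method].call                             # fast path: use cached method
end

foo = { serial: 0, methods: {} }
define_method_on(foo, :baz, -> { 2 })  # serial becomes 1
define_method_on(foo, :bar, -> { 1 })  # serial becomes 2

cache = { serial: nil, method: nil, misses: 0 }
first  = call_with_cache(cache, foo, :baz)  # miss: cache filled with serial 2
second = call_with_cache(cache, foo, :baz)  # hit: serial still matches

define_method_on(foo, :baz, -> { 3 })       # redefinition: serial becomes 3
third = call_with_cache(cache, foo, :baz)   # miss: serial 2 != 3, cache refreshed

puts "results: #{first}, #{second}, #{third}; misses: #{cache[:misses]}"
```

The guard is the same comparison that JIT-ed code needs before it can trust anything inlined from the cache.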

Slide 55

Slide 55 text

Optimization 2: Inlining method call setup by call cache • Why don't you use this for method inlining? • Currently it's only used for inlining Ruby-specific method call setup • But working on it!

Slide 56

Slide 56 text

WIP Optimization 3: Ruby -> Ruby method inlining • As we have JIT compiler for bytecode, when call cache has valid bytecode, we can inline it and invalidate it by call cache • Patch is almost completed but is not properly verified/measured yet

Slide 57

Slide 57 text

Inlined call Redefinition guard

Slide 58

Slide 58 text

Optimization 4: Call cache based type guard removal • Some instructions have a guard on the receiver class for optimization (e.g. opt_aref has a guard for Array / Hash), and they dispatch a normal method call if the class is not the expected one • But if a non-optimized method is called, we can eliminate the guard by using the call cache

Slide 59

Slide 59 text

Optimized case for Array / Hash (This is removed for others in JIT) Only this is needed for other classes

Slide 60

Slide 60 text

WIP Optimization 5: Lazy stack pointer motion • When longjmp is called, JIT-ed function call frame goes away • We must restore VM's state so that it's the same as the middle of JIT-ed function • I'm moving stack pointer in JIT-ed code even though it's sometimes unnecessary • As we're moving program counter, we can restore stack pointer from it • But it's hard...

Slide 61

Slide 61 text

I want to change this to local variable. (currently it's VM's and needs sp) Then this stack pointer motion is removed

Slide 62

Slide 62 text

class Foo • def bar (JIT-ed); 1 + baz; end • def baz; raise "err"; end • [Diagram: JIT local variable array, VM stack, and program counter; values shown: yyy, xxx] • What we need to do

Slide 63

Slide 63 text

class Foo • def bar (JIT-ed); 1 + baz; end • def baz; raise "err"; end • [Diagram: JIT local variable array, VM stack, and program counter; values shown: 1, xxx, yyy] • What we need to do

Slide 64

Slide 64 text

class Foo • def bar (JIT-ed); 1 + baz; end • def baz; raise "err"; end • [Diagram: JIT local variable array, VM stack, and program counter; values shown: 1, xxx, yyy] • What we need to do

Slide 65

Slide 65 text

class Foo • def bar (JIT-ed); 1 + baz; end • def baz; raise "err"; end • [Diagram: JIT local variable array, VM stack, and program counter; values shown: 1, yyy, nil, xxx] • What we need to do • Dynamic stack extension (difficult) to insert value

Slide 66

Slide 66 text

class Foo • def bar (JIT-ed); 1 + baz; end • def baz; raise "err"; end • [Diagram: JIT local variable array, VM stack, and program counter; values shown: 1, yyy, 1, xxx] • This should be done before longjmp

Slide 67

Slide 67 text

4. Future work

Slide 68

Slide 68 text

Near future 1: TracePoint check removal • Ruby 2.5 removed "trace" instruction by default, and it dynamically alters all bytecodes to support tracing when TracePoint is enabled • It means that we need to cancel JIT function call on it • For now, I added guards for it after any method call • If we can cancel JIT-ed function call to VM execution outside the frame by longjmp properly, we can remove the guards

Slide 69

Slide 69 text

Near future 1: TracePoint check removal I want to remove this guard

Slide 70

Slide 70 text

Near future 2: Improve performance on Rails • Unfortunately workload of NES emulator (optcarrot) is different from Rails, and currently Rails is not optimized by the JIT • There is no single perfect benchmark for Ruby • I believe JIT can improve performance of many pure-Ruby parts on Rails, but somehow it's not the case for now • I need more time to investigate the reason

Slide 71

Slide 71 text

Near future 3: Full Windows support • The JIT compiler is somewhat working on MinGW, but it still has some bugs to be addressed • Visual Studio support • usa has already done some great work • Installing VM sources or a pure-Ruby C preprocessor?

Slide 72

Slide 72 text

A little far future 4: Ruby -> C core method inlining • We can use the same strategy as Ruby -> Ruby method inlining • If we successfully build a header that has both core method definitions and VM implementation, we may be able to do this • Not tried yet, but identifying the function in call cache might be a blocker

Slide 73

Slide 73 text

Far future 5: C core -> Ruby method inlining • Using "while" is faster than "Enumerable#each", but many Ruby developers don't want to write "while" • Inlining block in JIT should solve it • But such block invocation in Ruby core methods is out of control when generating JIT-ed code for now

Slide 74

Slide 74 text

Conclusion • We're working hard to improve portability and performance • It's not so fast yet, but many optimizations are now possible and we have plenty of time to implement them before Ruby 2.6 • Ruby method inlining is almost there