VM-Generated JIT Compiler for Ruby 2.6

PLAZMA OSS Day: TD Tech Talk 2018
https://techplay.jp/event/650389

Takashi Kokubun

February 15, 2018

Transcript

  1. 2.

    Who? • GitHub, Twitter: k0kubun • Ruby Committer • Maintainer

    of default template engine: ERB • Developed some JIT compilers for Ruby • LLRB, YARV-MJIT
  2. 3.
  3. 4.

    Ad: WEB+DB PRESS Vol.103 • Introducing optimized Ruby 2.5 features

    • Real example of Ruby code optimization • Profiling • Bytecode-wise optimization
  4. 6.

    How is the performance? Optcarrot benchmark (fps)

    [Bar chart, y-axis 0 to 60 fps: Ruby 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0, 2.5.0, and 2.6.0-dev r62403 with JIT off and JIT on. Values as extracted: 59.22, 53.09, 48.33, 45.54, 38.92, 38.32, 38.76, 37.2.] • Intel 4.0GHz i7-4790K with 16GB memory under x86-64 Ubuntu, 8 cores • https://github.com/mame/optcarrot
  5. 7.

    How is the performance? MJIT micro benchmarks w/ 2.6.0-dev r62403

    [Bar chart: speedup ratio compared to JIT off (JIT off is 1.0 for every benchmark). Benchmarks: aread, aref, aset, awrite, call, const2, fannk, fib, ivread, ivwrite, mandelbrot, meteor, nbody, nest-ntimes, nest-write, norm, nsvb, sieve, trees, while. JIT-on ratios as extracted: 3.0, 1.1, 1.2, 1.1, 1.2, 1.2, 1.3, 1.1, 1.1, 2.1, 2.9, 1.5, 1.0, 2.3, 2.3, 1.5, 1.9, 2.1, 2.8, 2.1.] • Intel 4.0GHz i7-4790K with 16GB memory under x86-64 Ubuntu, 8 cores • https://github.com/benchmark-driver/mjit-benchmarks
  6. 8.

    How is the performance? https://twitter.com/ChrisGSeaton/status/961035035385237509 Running that it looks like

    MJIT is over 3x faster! Which is very impressive and it's already doing better than both JRuby and Rubinius. TruffleRuby is over 300x faster (I only mention it because it's my own implementation of a Ruby JIT), so there's still lots of rooms for optimizations, as the authors have already said themselves.
  7. 9.

    Agenda 1. Overview of Ruby's JIT compilation 2. JIT Infrastructure:

    The hard work for portability 3. JIT Compiler: Internals of VM-Generated JIT compiler 4. Future work
  8. 11.

    Options for JIT compilation • What to JIT-compile • Method

    JIT • Tracing JIT • How to JIT-compile • Generate assembly code and assemble • Use JIT library's interface like LLVM
  9. 12.

    How about constructing LLVM IR? • It's popular in modern

    languages, and I created a PoC: LLRB • http://github.com/k0kubun/llrb • But I learned that we can't use it efficiently for Ruby • The major optimization comes from inlining Ruby core's LLVM IR generated by clang • Just generating C code and using clang seemed enough
  10. 13.

    The Ruby's way: "MJIT" infrastructure • "MJIT" (MRI JIT) infrastructure

    • It puts a C file generated from a method's bytecode on disk (method JIT) • Then it lets cc(1) compile the C code to a .so file, and dynamically loads it • This idea was proposed and implemented by Vladimir Makarov • https://github.com/vnmakarov/ruby/tree/rtl_mjit_branch
  11. 14.

    The Ruby's way: "MJIT" infrastructure VM's C code Ruby process

    queue MJIT Worker Thread VM Thread Build time
  12. 15.

    The Ruby's way: "MJIT" infrastructure VM's C code Ruby process

    queue MJIT Worker Thread VM Thread Build time header Transform
  13. 16.

    The Ruby's way: "MJIT" infrastructure VM's C code precompiled header

    Ruby process header queue MJIT Worker Thread VM Thread Build time Transform CC
  14. 17.

    The Ruby's way: "MJIT" infrastructure VM's C code precompiled header

    Ruby process header queue MJIT Worker Thread VM Thread Build time Transform CC Enqueue / Dequeue Bytecode to JIT
  15. 18.

    The Ruby's way: "MJIT" infrastructure VM's C code precompiled header

    Ruby process header queue MJIT Worker Thread VM Thread Build time Transform CC Enqueue / Dequeue Bytecode to JIT C code Generate C code from bytecode
  16. 19.

    The Ruby's way: "MJIT" infrastructure VM's C code precompiled header

    Ruby process header queue MJIT Worker Thread VM Thread Build time Transform CC Enqueue / Dequeue Bytecode to JIT C code .so file CC Included by C code Generate C code from bytecode
  17. 20.

    The Ruby's way: "MJIT" infrastructure VM's C code precompiled header

    Ruby process header queue MJIT Worker Thread VM Thread Build time Transform CC Enqueue / Dequeue Bytecode to JIT C code .so file CC Included by C code Generate C code from bytecode Function pointer of machine code Load Called by
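The flow built up in the diagram above can be sketched roughly as follows. This is a toy model in Ruby for illustration only: the real worker is C code inside the VM, and `generate_c_code`, `compile_command`, and the tuple-based "bytecode" are hypothetical names invented for this sketch. The flags are the gcc ones listed later in the talk, plus the standard `-o`. The sketch stops before actually running the compiler and loading the .so file.

```ruby
require "tmpdir"

# Hypothetical toy model: the real MJIT worker is C code inside the VM and
# compiles YARV bytecode, not these two-element tuples.
def generate_c_code(func_name, insns)
  body = insns.map do |op, arg|
    case op
    when :putobject then "    stack[sp++] = #{Integer(arg)};"
    when :opt_plus  then "    sp--; stack[sp - 1] += stack[sp];"
    else raise ArgumentError, "unsupported instruction: #{op}"
    end
  end
  <<~SRC
    long #{func_name}(void) {
        long stack[8]; int sp = 0;
    #{body.join("\n")}
        return stack[sp - 1];
    }
  SRC
end

# Build the compiler command line (gcc-style flags from the slides).
def compile_command(cc, c_file, so_file)
  [cc, "-fPIC", "-shared", "-w", "-pipe", "-o", so_file, c_file]
end

# The VM thread enqueues bytecode to JIT; the worker thread dequeues it.
queue = Queue.new
queue << [:jit_bar, [[:putobject, 1], [:putobject, 2], [:opt_plus, nil]]]
queue.close                     # no more work in this sketch

sources = {}
commands = []
worker = Thread.new do
  Dir.mktmpdir do |dir|
    while (job = queue.pop)     # returns nil once the closed queue is empty
      name, insns = job
      c_file = File.join(dir, "#{name}.c")
      File.write(c_file, generate_c_code(name, insns))
      sources[name] = File.read(c_file)
      # A real worker would run this command, load the resulting .so file,
      # and hand a function pointer back to the VM thread.
      commands << compile_command("cc", c_file, c_file.sub(/\.c\z/, ".so"))
    end
  end
end
worker.join

puts sources[:jit_bar]
puts commands.first.join(" ")
```

Keeping the compile step as a plain argument list reflects the point that the worker only spawns an external compiler; nothing here links against a JIT library.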
  18. 21.

    The Ruby's way: "MJIT" infrastructure • Upside • Build dependencies

    are almost unchanged • Maintenance cost of the JIT compiler is relatively low • Downside • A C compiler becomes an optional runtime dependency • It's highly recommended to keep the C compiler used to build Ruby available on your server/container
  19. 22.

    What did Ruby 2.6 merge? • Ruby 2.6 merged: •

    JIT Infrastructure: "MJIT" • JIT Compiler: "YARV-MJIT" • MJIT had a built-in JIT compiler, but it required many VM changes and was risky • So I built a conservative JIT compiler which runs on top of MJIT • Let's talk about those 2 components
  20. 24.

    Command line construction for C compilers • Spawn compiler with

    $(CC) and compiler-specific flags (improved by nobu, usa) • gcc: gcc -fPIC -shared -w -pipe ... • clang: clang -O2 -dynamic -w -bundle -include-pch ... • cl.exe: cl.exe -Fe ...
  21. 26.

    Command line construction for C compilers • We can't use

    Ruby runtime on MJIT worker thread • Ruby VM is process global, and Ruby runtime is not thread safe • Who wants to apply GVL between main thread and JIT thread? • Using Ruby runtime on MJIT worker causes random SEGV...
  22. 27.

    Extra topic: Security on dynamic loading • It creates and

    compiles files like: "/tmp/_ruby_mjit_p12789u161.c" • p12789 is the PID, u161 is a sequential number, so the name can be easily predicted • MJIT worker should prevent the file from being modified by others • The initial implementation had a vulnerability • nobu fixed it to use: "open(c_file, O_EXCL|O_CREAT, 0600)" • "O_EXCL|O_CREAT" is needed because an existing file may have unexpected permissions
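The exclusive-create idea can be illustrated from Ruby, whose `File.open` accepts the same flags as open(2). This is a minimal sketch, not the MJIT code itself (which is C); `create_private_file` is a hypothetical helper name, and a temporary directory stands in for /tmp.

```ruby
require "tmpdir"

# Create the file only if it does not exist yet, with owner-only permissions,
# mirroring open(c_file, O_EXCL|O_CREAT, 0600) from the slide.
def create_private_file(path)
  File.open(path, File::WRONLY | File::CREAT | File::EXCL, 0o600)
end

second_attempt = nil
mode = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "_ruby_mjit_p12789u161.c")

  file = create_private_file(path)
  file.write("/* generated code */\n")
  file.close
  mode = File.stat(path).mode & 0o777

  begin
    # A pre-existing file (possibly planted by someone else) is refused
    # instead of being reused with whatever permissions it already has.
    create_private_file(path).close
    second_attempt = :created
  rescue Errno::EEXIST
    second_attempt = :refused
  end
end

puts "second attempt: #{second_attempt}, mode: #{mode.to_s(8)}"
```

Without `File::EXCL`, the second open would succeed and silently reuse the existing file, which is the case the fix is meant to rule out.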
  23. 28.

    Windows support • I could port MJIT's pthread usage to

    Windows native threads early • The actual hard parts: • long is 32bit - MinGW still seems to have some issues with it • cl.exe (Visual Studio) and Windows headers are not good for preprocessing
  24. 29.

    Transformation of C header for JIT • Platform support: ICC,

    AIX, NetBSD, MinGW... • JIT header generation depends on gcc/clang's "-E -dD", which preprocesses C code while leaving macro definitions in place • But Visual Studio doesn't have such a feature... • Use Pure-Ruby C preprocessor for Windows (!?) • Dynamic C code transformation by regexp (!!!) • Adding "static inline" for inlining and to reduce compilation time
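To give a feel for what "C code transformation by regexp" can look like, here is a deliberately crude sketch. The pattern `FUNC_DEF` and the helper `add_static_inline` are hypothetical and far simpler than whatever the real header transformation does; the sketch only handles unindented function definitions whose opening brace follows the parameter list.

```ruby
# Hypothetical and much cruder than the real transformation: match the start
# of a line that begins a function definition (type, name, parameter list,
# opening brace) and is not already declared static.
FUNC_DEF = /^(?!static\b)(?=\w[\w\s\*]*\b\w+\s*\([^;{}]*\)\s*\{)/

# Prefix every matching definition with "static inline".
def add_static_inline(header)
  header.gsub(FUNC_DEF, "static inline ")
end

header = <<~SRC
  int add(int a, int b) {
      return a + b;
  }
  static int sub(int a, int b) {
      return a - b;
  }
  int mul(int a, int b);
SRC

transformed = add_static_inline(header)
puts transformed
```

Only the definition of `add` is rewritten; the already-static `sub` and the prototype of `mul` are left alone.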
  25. 30.

    Transformation of C header for JIT He says it is

    not mature and not so serious for now
  26. 32.

    Testing strategy • ruby(1) introduced options for JIT testing: •

    --jit-wait - if JIT is triggered, wait until JIT compilation is finished • --jit-min-calls=N - change the threshold to trigger JIT • This is needed to control inlining by call cache (explained later) • Now trunk has unit tests that spawn "ruby --jit-wait --jit-min-calls=1 --jit-verbose=1" and confirm stderr has "JIT success" output • When a big JIT change is made, we need to verify that "make test-all" passes with RUN_OPTS="--jit-wait --jit-min-calls=1" (and "--jit-min-calls=5" too for call cache)
  27. 33.

    Replaceable JIT compiler • Ruby's JIT compiler is implemented as

    a single object file mjit_compile.o, and its interface is only a single function mjit_compile() • I believe the current approach is the easiest to maintain and has no blocker for any JIT optimization • But if we find a better strategy for the JIT compiler, we can fully replace it easily • Vladimir Makarov is working on another approach that uses RTL as an intermediate representation between YARV instructions and JIT-ed code
  28. 35.

    The design philosophy of my JIT compiler • Make it

    very easy to maintain and debug • Keep it simple at the first release to minimize risks
  29. 38.

    Super meta code generator

    [Diagram: ERB template (source) → ERB#compile → Ruby (build-time only) → Kernel#eval → C (MJIT worker source) → fprintf → C (JIT-ed temporary code) → gcc/clang → Machine Code] "This is an ERB template that generates Ruby code that generates C code that generates JIT-ed C code."
  30. 43.

    Super meta code generator • Even while I'm sleeping, JIT

    compiler's source code is updated automatically when VM implementation is changed • JIT compiler actually worked before and after recent VM changes
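The layering of generators can be shown in miniature. The sketch below is an assumption-laden toy: the instruction table `INSNS`, the template, and the generated `compile_insn` method are all hypothetical, and the generated Ruby writes C text directly, whereas in the talk's pipeline the Ruby stage produces the worker's C source, which in turn prints the JIT-ed C code.

```ruby
require "erb"
require "stringio"

# Hypothetical instruction table; the real generator derives its input from
# the VM's instruction definitions.
INSNS = {
  "putobject" => "stack[sp++] = %d;",
  "setlocal"  => "locals[%d] = stack[--sp];",
}

# ERB template that expands to Ruby source. The single-quoted heredoc keeps
# "#{name}" literal so that it is interpolated by the generated code later.
TEMPLATE = <<~'EOS'
  def compile_insn(out, name, operand)
    case name
  <% insns.each do |insn, c_body| -%>
    when <%= insn.inspect %>
      out.printf(<%= c_body.inspect %> + "\n", operand)
  <% end -%>
    else
      raise ArgumentError, "unknown instruction: #{name}"
    end
  end
EOS

# Step 1: the template generates Ruby code.
ruby_source = ERB.new(TEMPLATE, trim_mode: "-").result_with_hash(insns: INSNS)

# Step 2: evaluating that Ruby code defines the C code generator.
eval(ruby_source)

# Step 3: the generator prints C code for each instruction.
out = StringIO.new
compile_insn(out, "putobject", 1)
compile_insn(out, "setlocal", 0)
c_code = out.string
puts c_code
```

Adding an entry to the table changes the generated compiler without touching the template, which is the property the slide describes as the JIT compiler updating itself when the VM implementation changes.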
  31. 44.

    Hacks to achieve this automation • Replacing macros like EXEC_EC_CFP,

    THROW_EXCEPTION • Special compilation of JUMP for opt_case_dispatch • Keep moving program counter to meet catch table • Properly ignore unhandled execution from exception handler • We may be able to support it later tl;dr it was hard
  32. 45.

    Optimization 1: VM instruction inlining for JIT • Keep as

    many C function definitions in the MJIT header as possible • The major optimization is done here, by inlining VM operations in the MJIT header • Non-automated example: • Carve out the fast path of the method search function and inline it • Inline functions used by instructions optimized by the VM • I inlined Array#[] with Integer argument and it makes the VM faster too
  33. 46.

    Separate slow path as external function (which is slow to

    compile, so header doesn't have its definition) Make sure fast path is inlined (kept in JIT header)
  34. 47.

    Change external function reference to inline function (for fast path)

    Array#[] with Integer is optimized in both VM and JIT
  35. 48.

    Optimization 2: Inlining method call setup by call cache •

    Method call setup: method search, prepare arguments, push frame • VM has a cache for method calls, and the JIT compiler utilizes it • But it requires the receiver's class to invalidate the cache • JIT compiler doesn't know the receiver at compile time • I introduced an invalidator for obsolete call caches to avoid random SEGV
  36. 50.
  37. 51.

    class Foo (serial 2) def bar 1 + baz end

    def baz 2 end Increment class serial on method definition
  38. 52.

    class Foo (serial 2) def bar 1 + baz end

    def baz 2 end Bytecode A: putobject 2 Bytecode B: putobject 1 opt_send :baz, opt_plus cache nil On generating bytecode, it creates call cache
  39. 53.

    class Foo (serial 2) def bar 1 + baz end

    def baz 2 end Bytecode A: putobject 2 Bytecode B: putobject 1 opt_send :baz, opt_plus cache :A, serial: 2 Once method is called, it holds pointer to bytecode and serial
  40. 54.

    class Foo (serial 3) def bar 1 + baz end

    def baz 2 end Bytecode A: putobject 2 Bytecode B: putobject 1 opt_send :baz, opt_plus cache :A, serial: 2 When receiver object's class is Foo, it has new serial and invalidates old one def baz 3 end Bytecode C: putobject 3 On method redefinition, it increments serial
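The serial-based invalidation walked through in the slides above can be modeled with a toy. `ToyClass` and `CallCache` are hypothetical names and lambdas stand in for bytecode; this is not how the VM stores classes or caches, only the same idea in plain Ruby.

```ruby
# Toy model of a class whose serial is incremented on every method definition.
class ToyClass
  attr_reader :serial

  def initialize
    @serial = 1
    @methods = {}
  end

  def define(name, body)
    @methods[name] = body
    @serial += 1              # definition or redefinition bumps the serial
  end

  def lookup(name)
    @methods.fetch(name)      # the slow path: a full method search
  end
end

# Toy call cache: remembers the method body and the serial seen at lookup.
class CallCache
  attr_reader :hits, :misses

  def initialize(name)
    @name = name
    @body = nil
    @serial = nil
    @hits = 0
    @misses = 0
  end

  def call(klass)
    if @serial == klass.serial
      @hits += 1              # cache is still valid, skip the method search
    else
      @misses += 1            # stale or empty cache, search again and refill
      @body = klass.lookup(@name)
      @serial = klass.serial
    end
    @body.call
  end
end

foo = ToyClass.new
foo.define(:baz, -> { 2 })    # serial becomes 2
cache = CallCache.new(:baz)

results = []
results << cache.call(foo)    # miss: the cache is filled with serial 2
results << cache.call(foo)    # hit
foo.define(:baz, -> { 3 })    # redefinition: serial becomes 3
results << cache.call(foo)    # miss: the old entry is invalidated

p results
p [cache.hits, cache.misses]
```

The comparison of serials is the cheap check that inlined call setup has to keep; everything behind the miss branch is what the cache lets the fast path skip.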
  41. 55.

    Optimization 2: Inlining method call setup by call cache •

    Why don't you use this for method inlining? • Currently it's only used for inlining Ruby-specific method call setup • But working on it!
  42. 56.

    WIP Optimization 3: Ruby -> Ruby method inlining • As

    we have JIT compiler for bytecode, when call cache has valid bytecode, we can inline it and invalidate it by call cache • Patch is almost completed but is not properly verified/measured yet
  43. 58.

    Optimization 4: Call cache based type guard removal • Some

    instructions have a guard for the receiver class to optimize (like opt_aref has a guard for Array / Hash), and they dispatch a normal method call if the class is not the expected one • But if a non-optimized method is called, we can eliminate the guard by using the call cache
  44. 59.

    Optimized case for Array / Hash (This is removed for

    others in JIT) Only this is needed for other classes
  45. 60.

    WIP Optimization 5: Lazy stack pointer motion • When longjmp

    is called, JIT-ed function call frame goes away • We must restore VM's state so that it's the same as the middle of JIT-ed function • I'm moving stack pointer in JIT-ed code even though it's sometimes unnecessary • As we're moving program counter, we can restore stack pointer from it • But it's hard...
  46. 61.

    I want to change this to local variable. (currently it's

    VM's and needs sp) Then this stack pointer motion is removed
  47. 62.

    class Foo def bar (JIT-ed) 1 + baz end def

    baz raise "err" end JIT local variable array VM stack Program counter yyy xxx What we need to do
  48. 63.

    class Foo def bar (JIT-ed) 1 + baz end def

    baz raise "err" end JIT local variable array VM stack 1 Program counter xxx yyy What we need to do
  49. 64.

    class Foo def bar (JIT-ed) 1 + baz end def

    baz raise "err" end JIT local variable array VM stack 1 Program counter xxx yyy What we need to do
  50. 65.

    class Foo def bar (JIT-ed) 1 + baz end def

    baz raise "err" end JIT local variable array VM stack 1 Program counter yyy nil What we need to do xxx Dynamic stack extension (difficult) to insert value
  51. 66.

    class Foo def bar (JIT-ed) 1 + baz end def

    baz raise "err" end JIT local variable array VM stack 1 Program counter yyy 1 This should be done before longjmp xxx
  52. 68.

    Near future 1: TracePoint check removal • Ruby 2.5 removed

    "trace" instruction by default, and it dynamically alters all bytecodes to support tracing when TracePoint is enabled • It means that we need to cancel JIT function call on it • For now, I added guards for it after any method call • If we can cancel JIT-ed function call to VM execution outside the frame by longjmp properly, we can remove the guards
  53. 70.

    Near future 2: Improve performance on Rails • Unfortunately workload

    of NES emulator (optcarrot) is different from Rails, and currently Rails is not optimized by the JIT • There is no single perfect benchmark for Ruby • I believe JIT can improve performance of many pure-Ruby parts on Rails, but somehow it's not the case for now • I need more time to investigate the reason
  54. 71.

    Near future 3: Full Windows support • JIT compiler is

    somewhat working on MinGW, but it still has some bugs to be addressed • Visual Studio support • usa already did some great work • Installing VM sources or pure-Ruby C preprocessor?
  55. 72.

    A little far future 4: Ruby -> C core method

    inlining • We can use the same strategy as Ruby -> Ruby method inlining • If we successfully build a header that has both core method definitions and VM implementation, we may be able to do this • Not tried yet, but identifying the function in call cache might be a blocker
  56. 73.

    Far future 5: C core -> Ruby method inlining •

    Using "while" is faster than "Enumerable#each", but many Ruby developers don't want to write "while" • Inlining block in JIT should solve it • But such block invocation in Ruby core methods is out of control when generating JIT-ed code for now
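For reference, the two styles being compared look like this. The method names are made up for illustration, and the sketch makes no timing claim; it only shows that both forms compute the same result, with the block-based version relying on a core method to invoke the block.

```ruby
# Block-based iteration: the block is invoked from inside a core method.
def sum_with_each(array)
  total = 0
  array.each { |x| total += x }
  total
end

# Explicit loop: no block invocation, everything stays in one method body.
def sum_with_while(array)
  total = 0
  i = 0
  while i < array.length
    total += array[i]
    i += 1
  end
  total
end

numbers = (1..100).to_a
each_sum = sum_with_each(numbers)
while_sum = sum_with_while(numbers)
puts each_sum == while_sum
```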
  57. 74.

    Conclusion • We're working hard to improve portability and performance

    • Not so fast yet, but many optimizations are now possible and we have plenty of time to do them before Ruby 2.6 is released • Ruby method inlining is almost there