Performance Improvement of Ruby 2.7 JIT in Real World / RubyKaigi 2019

Takashi Kokubun

April 18, 2019

Transcript

  1. 3.

    What's JIT? • Experimental optional feature since Ruby 2.6 •

    Compiles your Ruby code to faster C code automatically • Just-in-Time: uses runtime information for optimizations • $ ruby --jit
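    A quick aside (a minimal sketch, not from the slides): whether the process was started with --jit can be checked at runtime. RubyVM::MJIT is the Ruby 2.6/2.7 constant; later versions renamed it, so both are probed defensively here.

```ruby
# Detect at runtime whether this process was started with --jit.
# RubyVM::MJIT is the 2.6/2.7 name; RubyVM::JIT is the later name.
jit_enabled =
  if defined?(RubyVM::JIT)
    RubyVM::JIT.enabled?
  elsif defined?(RubyVM::MJIT)
    RubyVM::MJIT.enabled?
  else
    false
  end
puts "JIT enabled: #{jit_enabled}"
```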
  2. 4.

    What's "MJIT"? (slides 4–14 build up one diagram step by step) • The Ruby process runs the VM thread alongside an MJIT worker thread; at Ruby's build time, the VM's C code is transformed into a header and precompiled into a precompiled header • The VM thread enqueues bytecode to JIT; the MJIT worker thread dequeues it, generates C code that includes the precompiled header, compiles it with CC into a .o file, and links that into a .so file • The .so file is loaded, and its function pointer of machine code is called by the VM • As more methods are compiled, each keeps its own .o file; eventually all .o files are linked into a single .so, and all function pointers of machine code are reloaded from it
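    The queue-and-worker flow above can be sketched in plain Ruby (an illustrative toy, not the real MJIT internals — the artifact names are fake):

```ruby
# Toy model of MJIT's structure: the VM thread enqueues compilation
# units while a worker thread dequeues them and produces artifacts.
queue    = Queue.new
compiled = Queue.new

worker = Thread.new do
  while (iseq_name = queue.pop)     # dequeue bytecode to JIT
    # The real worker generates C code including the precompiled
    # header, runs CC to get a .o file, links a .so, and hands a
    # function pointer back to the VM. Here we only fake the artifact.
    compiled << "#{iseq_name}.so"
  end
end

%w[foo bar baz].each { |m| queue << m }  # "VM thread" enqueues hot methods
queue << nil                             # shut the worker down
worker.join
puts compiled.size
```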
  13. 15.

    Ruby 2.6 JIT stability • No SEGV reports after release

    • To avoid bugs, it's designed conservatively • We run JIT CIs 24 hours a day and detect bugs ourselves
  14. 18.

    Ruby 3x3 benchmark: 1. Optcarrot — Speed: 1.61x

    Ruby 2.6.0 w/ Optcarrot, Frames Per Second (fps): JIT off 53.8, JIT on 86.6 • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu
  15. 19.

    Ruby 3x3 benchmark: 1. Optcarrot — Memory: 1.01x

    Ruby 2.6.0 w/ Optcarrot, Max Resident Set Size (MB): JIT off 62.8, JIT on 63.7 • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu
  16. 20.

    What's "real world"? • We're not running a NES

    emulator in production • Rails is popular for large-scale use
  17. 22.

    Ruby 3x3 benchmark: 2. Discourse • Forum application: discourse/discourse •

    For AWS: noahgibbs/rails_ruby_bench (a.k.a. RRB) • Its author has reported JIT performance for every preview, rc, and final release (thanks!)
  18. 24.

    Ruby 3x3 benchmark: 2. Discourse • It captures the first

    10~100k requests after startup • Is that "real-world"? • Compiling 1,000 methods takes a long time and makes it slow • Ruby 2.7 uses --jit-max-cache=100 by default, so it runs in an all-compiled state most of the time
  19. 25.

    Discourse: Speed • Ruby 2.6, Requests Per Second (#/s):

    JIT off 17.1, JIT on 16.2 • k0kubun/discourse: WARMUP=5000 BENCHMARK=1000 script/simple_bench • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2, Ruby 2.7=r67600
  20. 26.

    Discourse: Speed • Ruby 2.6, Requests Per Second (#/s): JIT off 17.1, JIT on 16.2 •

    Ruby 2.7, Requests Per Second (#/s): JIT off 17.5, JIT on 17.4 • k0kubun/discourse: WARMUP=5000 BENCHMARK=1000 script/simple_bench • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2, Ruby 2.7=r67600
  21. 27.

    Why isn't JIT fast on Rails yet? • Checking

    Discourse may not be the right way to find out • Some hotspots are not Rails-specific • Reaching a stable state takes some time
  22. 28.

    New Ruby benchmark: Railsbench • Just rails scaffold #show: k0kubun/railsbench

    • Based on headius/pgrailsbench, but on Rails 5.2 and with db:seed • Small, but captures some Rails characteristics
  23. 29.

    Railsbench: Speed • Ruby 2.6, Requests Per Second (#/s):

    JIT off 924.9, JIT on 720.7 • k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2, Ruby 2.7=r67600
  24. 30.

    Railsbench: Speed • Ruby 2.6, Requests Per Second (#/s): JIT off 924.9, JIT on 720.7 •

    Ruby 2.7, Requests Per Second (#/s): JIT off 932.0, JIT on 899.9 • k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2, Ruby 2.7=r67600
  25. 31.

    Railsbench: Memory • Ruby 2.6, Max Resident Set Size (MB): JIT off 105.2, JIT on 107.2 •

    Ruby 2.7, Max Resident Set Size (MB): JIT off 106.5, JIT on 107.6 • k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu
  26. 32.

    Railsbench • We can run it easily because the application

    is very lightweight • By profiling it, we've found which parts are not good for JIT • And those parts should exist in every Rails application too
  28. 35.

    Ruby 2.7 JIT Performance Challenges: 1. Profile-guided Optimization 2. Optimization Prediction 3. Deoptimized Recompilation 4. Frame-omitted Method Inlining 5. Stack-based Object Allocation
  29. 36.

    Problem 1: Calling JIT-ed code seems slow • When benchmarking

    after-compile Rails performance, the maximum number of methods should be compiled • Max: 1,000 in Ruby 2.6, 100 in trunk • Note: only 30 methods are compiled for Optcarrot
  30. 37.

    Problem 1: Calling JIT-ed code seems slow

    Chart: time to call methods returning nil (s) against the number of called methods (1 to ~40), comparing VM and JIT — the JIT's call time grows with the number of compiled methods. The benchmarked methods are all of the form: def foo1; nil; end, def foo2; nil; end, def foo3; nil; end, ...
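    A hedged reconstruction of this microbenchmark (method names like foo1 follow the slide; the iteration counts here are arbitrary):

```ruby
# Define N trivial methods returning nil, then time calling all of
# them repeatedly — the shape of the slide's benchmark.
N = 40
N.times { |i| eval("def foo#{i}; nil; end") }  # like foo1, foo2, foo3...

names = (0...N).map { |i| :"foo#{i}" }
t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
10_000.times { names.each { |m| send(m) } }
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
puts format("%.4fs for %d methods", elapsed, N)
```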
  31. 38.

    So we did this in Ruby 2.6

    Link all .o files into a single .so file on the MJIT worker thread, reload all function pointers of machine code from it, and let the VM call those
  32. 42.

    Approach 1: Profile-guided Optimization • It didn't magically solve the

    current bottleneck on Rails • It may be helpful later, but not now • As originally discussed, it complicates the build system
  33. 43.

    Problem 2: Can we avoid making things slow? • Why

    can't we just skip compiling things like a method returning just nil? • If we compile only well-optimized methods, it should be fast
  34. 44.

    Approach 2: Optimization Prediction • Several situations are known to

    be well-optimized in the current JIT • Let's measure the impacts and build heuristics!
  35. 45.

    Approach 2: Optimization Prediction — measured costs and savings:

    • Call overhead w/ 100 methods (distributed): +22.4ns • Call overhead w/ 100 methods (compacted): +3.22ns • Invoke another VM: +7.90ns • Cancel JIT execution: +1.94ns • Stack pointer motion elimination: -0.15ns • Method call w/ inline cache: -2.21ns • Instance variable read w/ inline cache: -0.86ns • opt_* instructions: -1~4ns
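    As a small aside (not from the slides), the opt_* instructions in that list are ordinary YARV bytecode and can be inspected directly on CRuby:

```ruby
# CRuby compiles `a + b` to the specialized opt_plus instruction,
# one of the opt_* instructions the JIT turns into a fast path.
iseq = RubyVM::InstructionSequence.compile("a = 1; b = 2; a + b")
puts iseq.disasm.lines.grep(/opt_plus/)
```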
  36. 46.

    Approach 2: Optimization Prediction • It did not magically solve

    the current bottleneck, again • Maybe the impact is more dynamic than I assumed • Compiling the same number of hotspot methods always seems to bring the same level of overhead • Is our code too big for the icache?
  37. 47.

    Problem 3: JIT calls may be cancelled frequently • The

    "Cancel JIT execution" path had some overhead • How many cancels did we have?
  40. 52.

    Solution 3: Deoptimized Recompilation • Recompile a method when the JIT's

    speculation is invalidated • It was in Vladimir Makarov's original MJIT, but was removed for simplicity in Ruby 2.6
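    A minimal sketch of the kind of event that invalidates speculation (not from the slides): redefining a basic operation such as Integer#+. `__orig_plus` is a hypothetical alias introduced here purely for illustration.

```ruby
# Redefining Integer#+ invalidates JIT-ed code that speculated on the
# original implementation; with deoptimized recompilation, the method
# would be recompiled without that assumption instead of cancelled forever.
class Integer
  alias_method :__orig_plus, :+   # hypothetical name, keeps the original
  def +(other)
    __orig_plus(other)            # same behavior — the redefinition alone
  end                             # is what breaks the speculation
end

puts 1 + 2  # still correct, just deoptimized
```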
  41. 55.

    Problem 4: Method calls are slow • We're calling methods

    everywhere • Method call cost: VM -> VM 10.28ns • VM -> JIT 9.12ns • JIT -> JIT 8.98ns • JIT -> VM 19.59ns
  42. 57.

    Solution 4: Frame-omitted Method Inlining • Method inlining levels: •

    Level 1: Just call an inline function instead of the JIT-ed code's function pointer • Level 2: Skip pushing a call frame by default, but lazily push it when something happens • For level 2, we need to know the "purity" of VM instructions
  43. 65.

    Solution 4: Frame-omitted Method Inlining • Frame-omitted method inlining (level

    2) is already in trunk! • It works only for limited things like #html_safe? and #present? • To make it really useful, we need to improve the metadata for methods and VM instructions
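    For flavor, a sketch of the shape of method this targets (`blank_ish?` is a hypothetical stand-in, not a Rails API — Active Support's #present? and #html_safe? have this shape):

```ruby
# A tiny, side-effect-free predicate: nothing in the body can observe
# its own call frame, so the frame push can be omitted by default.
class String
  def blank_ish?
    strip.empty?
  end
end

puts "   ".blank_ish?
puts "hi".blank_ish?
```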
  44. 66.

    Problem 5: Object allocation is slow • A Rails app allocates

    objects (of course!), unlike Optcarrot • Allocating memory from the heap and GCing it takes time
  45. 67.

    Problem 5: Object allocation is slow • In perf, Railsbench spends

    9.3% of its time on memory management and GC
  46. 68.

    Solution 5: Stack-based Object Allocation • If an object does

    not "escape", we can allocate it on the stack • Implementing a really clever escape analysis is hard, but a basic one can cover some real-world use cases
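    A sketch of a non-escaping allocation (`Point` and `distance` are illustrative, not from the talk):

```ruby
# p1 and p2 never leave #distance — not stored anywhere, not returned —
# so even a basic escape analysis could allocate them on the stack
# instead of the heap, avoiding GC work.
Point = Struct.new(:x, :y)

def distance(x1, y1, x2, y2)
  p1 = Point.new(x1, y1)   # non-escaping allocation
  p2 = Point.new(x2, y2)   # non-escaping allocation
  Math.sqrt((p1.x - p2.x)**2 + (p1.y - p2.y)**2)
end

puts distance(0, 0, 3, 4)  # => 5.0
```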
  47. 76.

    More Ruby code in core • Many parts of Ruby's

    implementation are written in C, which blocks optimizations like method inlining • @ko1 is proposing to use more Ruby in core and to add more method metadata • Let's do it
  48. 77.

    TracePoint support • Ruby 2.6's JIT just stops when TracePoint

    is enabled • TracePoint is often enabled in development environments • e.g. web-console + bindex, zeitwerk
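    A minimal sketch of the kind of TracePoint usage that makes Ruby 2.6's JIT give up (`traced` is a hypothetical method for illustration):

```ruby
# Any enabled TracePoint stops Ruby 2.6's JIT entirely — and tools like
# web-console (via bindex) enable hooks like this in development.
calls = []
tp = TracePoint.new(:call) { |ev| calls << ev.method_id }

def traced; :ok; end

tp.enable do
  traced        # fires the :call hook (and would disable 2.6's JIT)
end
p calls
```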
  49. 78.

    Use GVL from MJIT worker • Ruby 2.6's JIT does

    not compile methods during blocking IO or sleep • That would be the best time to do compilation!
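    The idea can be sketched with plain threads (illustrative only — the background thread stands in for an MJIT worker holding the GVL):

```ruby
# While a thread sleeps or blocks on IO it releases the GVL, so another
# thread gets to run — exactly the window an MJIT worker using the GVL
# could spend on compilation for free.
progress = 0
background = Thread.new do
  10.times { progress += 1; sleep 0.01 }  # stands in for compile work
end

sleep 0.2          # main thread "blocked": GVL is free for background
background.join
puts progress
```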
  50. 79.

    LLVM as an optional loader • The overhead might be

    coming from the way dlopen loads code • What if we: • Generate LLVM IR from MJIT's C code with Clang • Load it into an LLVM Module and execute it
  51. 80.

    Conclusion • At this stage, we're focusing on Rails speed with JIT after

    all compilations are done • We're not there yet, but we're moving forward • JIT will let us stop caring about micro-optimizations in real-world code • No more Performance/* cops!