
Performance Improvement of Ruby 2.7 JIT in Real World / RubyKaigi 2019

Takashi Kokubun

April 18, 2019

Transcript

  1. Performance Improvement of Ruby 2.7 JIT in Real World @k0kubun

    / RubyKaigi 2019
  2. @k0kubun Ruby Committer: JIT, ERB

  3. What's JIT? • Experimental optional feature since Ruby 2.6 •

    Compile your Ruby code to faster C code automatically • Just-in-Time: Use runtime information for optimizations $ ruby --jit
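A minimal sketch of what `--jit` operates on (the file name and method are mine, not from the talk): a hot, self-contained method like this crosses the JIT's call-count threshold and gets compiled.

```ruby
# fib.rb: run as `ruby --jit fib.rb` (add --jit-verbose=1 to see compiles).
# A hot recursive method like this is a typical JIT target.
def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

puts fib(25)  # => 75025; the recursion calls fib far past the compile threshold
```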
  4. What's "MJIT"? (diagram: a Ruby process runs a VM thread and an MJIT worker thread; at Ruby's build time, the VM's C code header is transformed into a precompiled header)
  5. What's "MJIT"? (diagram: the VM thread enqueues bytecode to JIT; the MJIT worker thread dequeues it)
  6. What's "MJIT"? (diagram: the worker generates C code that includes the precompiled header)
  7. What's "MJIT"? (diagram: a C compiler (CC) compiles the generated C code into a .o file)
  8. What's "MJIT"? (diagram: the .o file is linked into a .so file)
  9. What's "MJIT"? (diagram: the .so file is loaded, and the VM calls the resulting function pointer of machine code)
  10. What's "MJIT"? (diagram: this repeats per method, leaving many .o files and many separately loaded function pointers called by the VM)
  11. What's "MJIT"? (diagram: the VM ends up calling many scattered function pointers of machine code)
  12. What's "MJIT"? (diagram: the MJIT worker links all .o files into a single .so file)
  13. What's "MJIT"? (diagram: all function pointers of machine code are reloaded from that single .so file)
  14. What's "MJIT"? (diagram: the VM then calls the compacted function pointers of machine code)
  15. Ruby 2.6 JIT stability • No SEGV reports after release • To avoid bugs, it's designed conservatively • We've run JIT CIs for 24h and detected bugs ourselves
  16. JIT performance in real world

  17. Ruby 3x3 benchmark: 1. Optcarrot NES emulator: mame/optcarrot

  18. Ruby 3x3 benchmark: 1. Optcarrot, Speed 1.61x (chart: Frames Per Second, JIT off 53.8 fps vs JIT on 86.6 fps; Intel 4.0GHz i7-4790K, 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6.0 w/ Optcarrot)
  19. Ruby 3x3 benchmark: 1. Optcarrot, Memory 1.01x (chart: Max Resident Set Size, JIT off 62.8 MB vs JIT on 63.7 MB; Intel 4.0GHz i7-4790K, 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6.0 w/ Optcarrot)
  20. What's "real world"? • We're not running NES emulator on

    production • Rails is popular for large-scale use
  21. Ruby 3x3 benchmark: 2. Discourse

  22. Ruby 3x3 benchmark: 2. Discourse • Forum application: discourse/discourse • For AWS: noahgibbs/rails_ruby_bench (a.k.a. RRB) • Noah Gibbs has reported JIT's performance for every preview, rc, and final release (thanks!)
  23. Ruby 3x3 benchmark: 2. Discourse

  24. Ruby 3x3 benchmark: 2. Discourse • It captures the first 10~100k requests after startup • Is that "real-world"? • Compiling 1,000 methods takes a long time and makes it slow • Ruby 2.7 uses --jit-max-cache=100 by default, so it runs in an all-compiled state most of the time
  25. Discourse: Speed, Ruby 2.6 (chart: Request Per Second, JIT off 17.1 vs JIT on 16.2; k0kubun/discourse: WARMUP=5000 BENCHMARK=1000 script/simple_bench; Intel 4.0GHz i7-4790K, 8 cores, memory 16GB, x86-64 Ubuntu; Ruby 2.6 = 2.6.2, Ruby 2.7 = r67600)
  26. Discourse: Speed, Ruby 2.7 added (chart: Ruby 2.6 JIT off 17.1 vs JIT on 16.2; Ruby 2.7 JIT off 17.5 vs JIT on 17.4; same setup as above)
  27. Why isn't JIT fast on Rails yet? • Discourse may not be the best benchmark for finding out • Some of its hotspots are not Rails-specific • Reaching a stable state takes some time
  28. New Ruby benchmark: Railsbench • Just a rails scaffold #show action: k0kubun/railsbench • Based on headius/pgrailsbench, but on Rails 5.2 and with db:seed • Small, but captures some Rails characteristics
  29. Railsbench: Speed, Ruby 2.6 (chart: Request Per Second, JIT off 924.9 vs JIT on 720.7; k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench; Intel 4.0GHz i7-4790K, 8 cores, memory 16GB, x86-64 Ubuntu; Ruby 2.6 = 2.6.2, Ruby 2.7 = r67600)
  30. Railsbench: Speed, Ruby 2.7 added (chart: Ruby 2.6 JIT off 924.9 vs JIT on 720.7; Ruby 2.7 JIT off 932.0 vs JIT on 899.9; same setup as above)
  31. Railsbench: Memory (chart: Ruby 2.6 JIT off 105.2 vs JIT on 107.2; Ruby 2.7 JIT off 106.5 vs JIT on 107.6; k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench; Intel 4.0GHz i7-4790K, 8 cores, memory 16GB, x86-64 Ubuntu)
  32. Railsbench • Easy to run, as the application is very lightweight • By profiling it, we've found which parts are bad for JIT • Those parts should exist in all Rails applications too
  33. Why is JIT still slow on Rails? Let’s explore our

    challenges for JIT on Rails!!
  34. Ruby 2.7 JIT Performance Challenges

  35. Ruby 2.7 JIT Performance Challenges 1. Profile-guided Optimization 2. Optimization

    Prediction 3. Deoptimized Recompilation 4. Frame-omitted Method Inlining 5. Stack-based Object Allocation
  36. Problem 1: Calling JIT-ed code seems slow • When benchmarking after-compile Rails performance, the maximum number of methods should be compiled • Max: 1,000 in Ruby 2.6, 100 on trunk • Note: only 30 methods are compiled on Optcarrot
  37. Problem 1: Calling JIT-ed code seems slow (chart: time to call methods returning nil, in seconds, vs number of called methods from 1 to 39, for VM and JIT; the benchmark defines trivial methods like `def foo1; nil; end`)
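The shape of that microbenchmark can be sketched like this (my reconstruction; the talk's version presumably writes each `def fooN` out literally):

```ruby
# Sketch: many trivial methods returning nil, each defined separately so
# the JIT compiles one body per method (class and method names are mine).
class NilMethods
  40.times do |i|
    define_method(:"foo#{i + 1}") { nil }
  end
end

m = NilMethods.new
m.foo1  # with --jit, each fooN gets its own compiled function
```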
  38. So we did this in Ruby 2.6 (diagram: the MJIT worker links all .o files into a single .so file, and all function pointers of machine code are reloaded from it)
  39. But we still see icache stalls

  40. Approach 1: Profile-guided Optimization • Use GCC/Clang's -fprofile-generate and -fprofile-use

  41. Approach 1: Profile-guided Optimization https://github.com/k0kubun/ruby/tree/pgo

  42. Approach 1: Profile-guided Optimization • It didn't magically solve the

    current bottleneck on Rails • It may be helpful later, but not now • As originally discussed, it complicates the build system
  43. Problem 2: Can we avoid making things slow? • Why

    can't we just skip compiling things like a method returning just nil? • If we compile only well-optimized methods, it should be fast
  44. Approach 2: Optimization Prediction • Several situations are known to

    be well-optimized in the current JIT • Let's measure the impacts and build heuristics!
  45. Approach 2: Optimization Prediction • Call overhead w/ 100 methods (distributed): +22.4ns • Call overhead w/ 100 methods (compacted): +3.22ns • Invoke another VM: +7.90s • Cancel JIT execution: +1.94s • Stack pointer motion elimination: -0.15ns • Method call w/ inline cache: -2.21ns • Instance variable read w/ inline cache: -0.86ns • opt_* instructions: -1~4ns
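As a rough illustration (the class and method names are mine, not from the talk), method bodies that hit the well-optimized rows above look like:

```ruby
# Hypothetical Counter whose methods exercise the measured fast paths:
# an instance variable read through an inline cache, and an opt_*
# instruction (opt_plus) for the addition.
class Counter
  def initialize
    @count = 0
  end

  def count
    @count        # instance variable read w/ inline cache
  end

  def bump(n)
    @count += n   # getinstancevariable + opt_plus + setinstancevariable
  end
end
```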
  46. Approach 2: Optimization Prediction • It did not magically solve the current bottleneck, again • Maybe the impact is more dynamic than I assumed • Compiling the same number of hotspot methods always seems to bring the same level of overhead • Is our code too big for the icache?
  47. Problem 3: JIT calls may be cancelled frequently • The "Cancel JIT execution" path had some overhead • How many cancels did we have?
  48. Problem 3: JIT calls may be cancelled frequently

  51. A change in self's class causes a JIT cancel
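A sketch of the pattern behind this (class and method names are mine): a call site compiled while the receiver was always one class speculates on that class, and a different receiver class invalidates the speculation.

```ruby
# Sketch: the call site in #describe speculates on the receiver's class
# via its inline cache; once a B shows up, the speculation is invalid
# and JIT execution is cancelled on that path.
class A
  def name
    "A"
  end
end

class B
  def name
    "B"
  end
end

def describe(obj)
  obj.name  # compiled (hypothetically) with A in the inline cache
end

describe(A.new)  # fills the cache with A
describe(B.new)  # class change: JIT cancel
```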

  52. Solution 3: Deoptimized Recompilation • Recompile a method when JIT's

    speculation is invalidated • It was in the original MJIT by Vladimir Makarov, but removed for simplicity in Ruby 2.6
  53. Solution 3: Deoptimized Recompilation • Committed to trunk. Inspectable with

    --jit-verbose=1
  54. Solution 3: Deoptimized Recompilation

  55. Problem 4: Method call is slow • We're calling methods everywhere • Method call cost: VM -> VM 10.28ns, VM -> JIT 9.12ns, JIT -> JIT 8.98ns, JIT -> VM 19.59ns
  56. Problem 4: Method call is slow

  57. Solution 4: Frame-omitted Method Inlining • Method inlining levels: • Level 1: Just call an inline function instead of the JIT-ed code's function pointer • Level 2: Skip pushing a call frame by default, but lazily push it when something happens • For level 2, we need to know the "purity" of each VM instruction
  58. Solution 4: Frame-omitted Method Inlining • Can Numeric#zero? written in

    Ruby be pure?
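Roughly what a Ruby-defined Numeric#zero? looks like (I use a hypothetical method name to avoid clobbering the builtin). "Pure" here means the body only reads self and compares it, with no side effects that would need a call frame:

```ruby
# Sketch of Numeric#zero? written in Ruby; the method name is mine.
class Numeric
  def zero_sketch?
    self == 0  # no ivars, no globals, no raises of its own: candidate for purity
  end
end
```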
  59. Solution 4: Frame-omitted Method Inlining

  60. Solution 4: Frame-omitted Method Inlining If it were true (valid

    for Integer), what would happen?
  61. Solution 4: Frame-omitted Method Inlining

  62. Solution 4: Frame-omitted Method Inlining

  63. Solution 4: Frame-omitted Method Inlining VM

  64. Solution 4: Frame-omitted Method Inlining VM JIT

  65. Solution 4: Frame-omitted Method Inlining • Frame-omitted method inlining (level

    2) is already on trunk! • It's working for limited things like #html_safe?, #present? • To make it really useful, we need to improve metadata for methods and VM instructions
  66. Problem 5: Object allocation is slow • A Rails app allocates objects (of course!), unlike Optcarrot • It takes time to allocate memory from the heap and to GC it
  67. Problem 5: Object allocation is slow • Railsbench spends time on memory management in perf: memory management and GC account for 9.3%
  68. Solution 5: Stack-based Object Allocation • If an object does not "escape", we can allocate it on the stack • Implementing really clever escape analysis is hard, but a basic one can cover some real-world use cases
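A sketch of a non-escaping allocation (the method name is mine): the Range below is neither returned nor stored anywhere, so it lives and dies within one frame, and an escape-analyzing JIT could place it on the stack instead of the GC heap.

```ruby
# Sketch: `range` never leaves this method, so a basic escape analysis
# could prove it safe to allocate on the stack.
def within_limit?(x)
  range = (1..10)     # allocated, used, and dropped inside this frame
  range.cover?(x)     # only a boolean escapes
end

within_limit?(5)
```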
  69. Solution 5: Stack-based Object Allocation

  70. Solution 5: Stack-based Object Allocation VM

  71. Solution 5: Stack-based Object Allocation VM JIT

  72. Solution 5: Stack-based Object Allocation

  73. Solution 5: Stack-based Object Allocation VM

  74. Solution 5: Stack-based Object Allocation VM JIT

  75. Ruby 2.7 JIT future works

  76. More Ruby code in core • Many parts of the Ruby implementation are written in C, which blocks optimizations like method inlining • @ko1 is proposing to use more Ruby in core and to add more method metadata • Let's do it
  77. TracePoint support • Ruby 2.6's JIT just stops when TracePoint is enabled • TracePoint is often enabled in development environments • e.g. web-console + bindex, zeitwerk
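Even a tiny trace like the sketch below counts as "TracePoint is enabled" (the method name is mine); tools like web-console + bindex enable traces like this in development, which is what makes Ruby 2.6's JIT stop.

```ruby
# Sketch: a minimal TracePoint on Ruby-level :call events.
def sample_method  # hypothetical method, just something to trace
  42
end

calls = []
trace = TracePoint.new(:call) { |tp| calls << tp.method_id }
trace.enable { sample_method }
calls  # => [:sample_method]
```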
  78. Use GVL from MJIT worker • Ruby 2.6's JIT does not compile methods during blocking IO or sleep • That would be the best time to do compilation!
  79. LLVM as an optional loader • The overhead might be coming from dlopen's way of loading • What if we: • Generate LLVM IR with Clang from MJIT's C code • Load it into an LLVM Module and execute it
  80. Conclusion • At this stage, we're focusing on Rails speed with JIT after all compilations finish • We're not there yet, but moving forward • JIT will allow us to stop caring about micro-optimizations in real-world code • No more Performance/* cops!