JIT compiler improvements in Ruby 2.7 / RubyRussia 2019


Takashi Kokubun

September 28, 2019

Transcript

  1. JIT compiler improvements in Ruby 2.7 @k0kubun

  2. @k0kubun Ruby's JIT, ERB, Haml, Hamlit

  3. None
  4. JIT

  5. Just-In-Time compiler

  6. What's JIT?

    • Experimental optional feature since Ruby 2.6
    • Compile your Ruby code to faster C code automatically
    • Just-in-Time: Use runtime information for optimizations

    $ ruby --jit
  7. Ruby 3x3 benchmark: Optcarrot NES emulator: mame/optcarrot

  8. Ruby 3x3 benchmark: Optcarrot

    [Bar chart: Frames Per Second (fps), Ruby 2.6.0 w/ Optcarrot. JIT off: 53.8]
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu
  9. Ruby 3x3 benchmark: Optcarrot

    [Bar chart: Frames Per Second (fps), Ruby 2.6.0 w/ Optcarrot. JIT off: 53.8, JIT on: 86.6. Speed 1.61x]
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu
  10. What's JIT?

    $ ps aufx
    ruby --jit bin/optcarrot-bench
     \_ /usr/bin/gcc -w -Wfatal-errors -fPIC -shared -w -pipe -O3
         \_ /usr/lib/gcc/x86_64-linux-gnu/7/cc1 -quiet -imultiarch
         \_ as -W --64 -o /tmp/_ruby_mjit_p31673u20.o
  11. How does it work?

    [Diagram] Build time: VM's C code header → Transform → Precompile → precompiled header. Ruby process: VM Thread, queue, MJIT Worker Thread.
  12. How does it work?

    [Diagram, adds] Enqueue / Dequeue: "Bytecode to JIT" passes through the queue between the VM Thread and the MJIT Worker Thread.
  13. How does it work?

    [Diagram, adds] Generate: the MJIT Worker Thread generates C code; the precompiled header is Included.
  14. How does it work?

    [Diagram, adds] CC: C code → .o file.
  15. How does it work?

    [Diagram, adds] Link: .o file → .so file.
  16. How does it work?

    [Diagram, adds] Load: .so file → Function pointer of machine code, Called by the VM Thread.
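The flow on slides 11 to 16 can be sketched as a toy model in plain Ruby. This is only an illustration under stated assumptions, not MJIT's real implementation: `ToyJit`, `MIN_CALLS`, and the lambda standing in for "machine code" are hypothetical names, and the "compile" step does no real code generation.

```ruby
# Toy model of the enqueue / compile / load flow; NOT MJIT's real implementation.
# ToyJit, MIN_CALLS and the lambdas are hypothetical names for illustration.
class ToyJit
  MIN_CALLS = 5 # hypothetical threshold, standing in for --jit-min-calls

  def initialize
    @queue = Queue.new    # VM thread enqueues, worker thread dequeues
    @calls = Hash.new(0)  # per-method call counters (touched by the caller only)
    @compiled = {}        # name => "function pointer" (a lambda here)
    @lock = Mutex.new
    @worker = Thread.new { work }
  end

  # Called by the "VM thread" for every method invocation.
  def call(name, interpreted, *args)
    fast = @lock.synchronize { @compiled[name] }
    return fast.call(*args) if fast # compiled code is loaded: call it

    @calls[name] += 1
    @queue << [name, interpreted] if @calls[name] == MIN_CALLS
    interpreted.call(*args)         # keep interpreting in the meantime
  end

  def compiled?(name)
    @lock.synchronize { @compiled.key?(name) }
  end

  # Let the worker drain the queue, then stop it.
  def shutdown
    @queue << nil
    @worker.join
  end

  private

  # "MJIT worker thread": dequeue a method, pretend to compile, publish the result.
  def work
    while (job = @queue.pop)
      name, body = job
      native = body # stand-in for: generate C code, run CC, link .so, load
      @lock.synchronize { @compiled[name] = native }
    end
  end
end

jit = ToyJit.new
double = ->(x) { x * 2 }
results = 10.times.map { |i| jit.call(:double, double, i) }
jit.shutdown
puts results.inspect
puts jit.compiled?(:double)
```

The point of the sketch is that the caller never waits for compilation: it keeps running the interpreted body until the worker publishes a replacement.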
  17. How to use JIT

  18. How to use JIT

    • Just "--jit" is fine
    • You can also use RUBYOPT=--jit environment variable

    $ ruby --jit
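To confirm from inside a script that the flag or `RUBYOPT` actually took effect, something like the following may help. It is a minimal sketch: `mjit_enabled?` is a hypothetical helper name, and it assumes `RubyVM::MJIT.enabled?` is present, which to my knowledge holds only on Ruby versions that ship MJIT. On other versions or implementations the helper simply returns false.

```ruby
# Hypothetical helper: report whether this process runs with MJIT enabled.
# Returns false when RubyVM::MJIT does not exist on this Ruby.
def mjit_enabled?
  return false unless defined?(RubyVM::MJIT)

  RubyVM::MJIT.enabled? ? true : false
end

puts "MJIT enabled: #{mjit_enabled?}"
```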
  19. How to use JIT

    $ ruby --help
    JIT options (experimental):
      --jit-warnings       Enable printing JIT warnings
      --jit-debug          Enable JIT debugging (very slow)
      --jit-wait           Wait until JIT compilation is finished everytime (for testing)
      --jit-save-temps     Save JIT temporary files in $TMP or /tmp (for testing)
      --jit-verbose=num    Print JIT logs of level num or less to stderr (default: 0)
      --jit-max-cache=num  Max number of methods to be JIT-ed in a cache (default: 100)
      --jit-min-calls=num  Number of calls to trigger JIT (for testing, default: 10000)
  21. How to use JIT

    $ ruby --jit-verbose=1 ...
  22. How to use JIT

    $ ruby --jit-verbose=1 ...
    JIT success (35.1ms): block in symbolize_keys!@...
    JIT success (89.9ms): block in forwarded_scheme@...
  23. How to use JIT

    $ ruby --jit-verbose=1 ...
    JIT success (35.1ms): block in symbolize_keys!@...
    JIT success (89.9ms): block in forwarded_scheme@...
    JIT inline: unwrapped_html_escape@...
    JIT success (106.9ms): unwrapped_html_escape@...
    JIT inline: present?@...
    JIT success (37.5ms): present?@...
  24. How to use JIT

    $ ruby --jit-verbose=1 ...
    JIT success (35.1ms): block in symbolize_keys!@...
    JIT success (89.9ms): block in forwarded_scheme@...
    JIT inline: unwrapped_html_escape@...
    JIT success (106.9ms): unwrapped_html_escape@...
    JIT inline: present?@...
    JIT success (37.5ms): present?@...
    [Callout: Optimization in Ruby 2.7]
  25. How to use JIT

    $ ruby --jit-verbose=1 ...
    JIT success (35.1ms): block in symbolize_keys!@...
    JIT success (89.9ms): block in forwarded_scheme@...
    JIT inline: unwrapped_html_escape@...
    JIT success (106.9ms): unwrapped_html_escape@...
    JIT inline: present?@...
    JIT success (37.5ms): present?@...
    JIT recompile: present?@...
  26. How to use JIT

    $ ruby --jit-verbose=1 ...
    JIT success (35.1ms): block in symbolize_keys!@...
    JIT success (89.9ms): block in forwarded_scheme@...
    JIT inline: unwrapped_html_escape@...
    JIT success (106.9ms): unwrapped_html_escape@...
    JIT inline: present?@...
    JIT success (37.5ms): present?@...
    JIT recompile: present?@...
    [Callout: Another optimization in Ruby 2.7]
  27. How to use JIT

    $ ruby --jit-verbose=1 ...
    JIT success (35.1ms): block in symbolize_keys!@...
    JIT success (89.9ms): block in forwarded_scheme@...
    JIT inline: unwrapped_html_escape@...
    JIT success (106.9ms): unwrapped_html_escape@...
    JIT inline: present?@...
    JIT success (37.5ms): present?@...
    JIT recompile: present?@...
    ...
    JIT compaction (17.0ms): Compacted 100 methods -> ...
  28. How to use JIT

    $ ruby --jit-verbose=1 ...
    JIT success (35.1ms): block in symbolize_keys!@...
    JIT success (89.9ms): block in forwarded_scheme@...
    JIT inline: unwrapped_html_escape@...
    JIT success (106.9ms): unwrapped_html_escape@...
    JIT inline: present?@...
    JIT success (37.5ms): present?@...
    JIT recompile: present?@...
    ...
    JIT compaction (17.0ms): Compacted 100 methods -> ...
    [Callout: ?]
  29. "JIT compaction"

    [Diagram] MJIT Worker Thread: three .o files, each with its own Function pointer of machine code, Called by the VM Thread.
  30. "JIT compaction"

    [Diagram, adds] Link all: the .o files → one .so file.
  31. "JIT compaction"

    [Diagram, adds] Reload all: .so file → Function pointers of machine code, Called by the VM Thread.
  32. "JIT compaction"

    [Diagram] Function pointers of machine code, Called by the VM Thread; .o files on the MJIT Worker Thread side.
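The `--jit-verbose=1` lines from the "How to use JIT" slides can be tallied with a few lines of Ruby. This is a sketch that assumes the log format is exactly what the slides show; `summarize_jit_log` is a hypothetical helper, and the sample text is copied from the slides with the elided parts left as `...`.

```ruby
# Sample lines copied from the --jit-verbose=1 output shown on the slides.
LOG = <<~TEXT
  JIT success (35.1ms): block in symbolize_keys!@...
  JIT success (89.9ms): block in forwarded_scheme@...
  JIT inline: unwrapped_html_escape@...
  JIT success (106.9ms): unwrapped_html_escape@...
  JIT inline: present?@...
  JIT success (37.5ms): present?@...
  JIT recompile: present?@...
  JIT compaction (17.0ms): Compacted 100 methods -> ...
TEXT

# Hypothetical helper: count each event kind and total the "success" compile time.
def summarize_jit_log(text)
  summary = { counts: Hash.new(0), success_ms: 0.0 }
  text.each_line do |line|
    match = line.match(/\AJIT (\w+)(?: \((\d+(?:\.\d+)?)ms\))?:/)
    next unless match # skip anything that is not a JIT event line

    kind = match[1]
    summary[:counts][kind] += 1
    summary[:success_ms] += match[2].to_f if kind == "success"
  end
  summary
end

summary = summarize_jit_log(LOG)
puts summary[:counts]
puts summary[:success_ms].round(1)
```

Counting "recompile" and "compaction" events this way gives a rough view of how often the Ruby 2.7 behaviors described in this talk kick in for a given workload.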
  33. JIT's performance on Rails

  34. Ruby benchmark on Rails: Railsbench

    • Just rails scaffold #show: k0kubun/railsbench
    • headius/pgrailsbench, but on Rails 5.2 and w/ db:seed
    • Small but capturing some Rails characteristics
  35. Railsbench: Speed

    [Bar chart: Request Per Second (#/s), Ruby 2.6. JIT off: 924.9, JIT on: 720.7]
    k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2 Ruby2.7=r67600
  36. Railsbench: Speed

    [Bar chart: Request Per Second (#/s), Ruby 2.6. JIT off: 924.9, JIT on: 720.7]
    [Bar chart: Request Per Second (#/s), Ruby 2.7. JIT off: 932.0, JIT on: 899.9]
    k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2 Ruby2.7=r67600
  37. Railsbench: Memory

    [Bar chart: Request Per Second (#/s), Ruby 2.6. JIT off: 105.2, JIT on: 107.2]
    [Bar chart: Request Per Second (#/s), Ruby 2.7. JIT off: 106.5, JIT on: 107.6]
    k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu
  38. Why is it slow on Rails?

    • Too many methods => Cache inefficiency
    • Less CPU bound and fewer optimization chances
  39. Performance Improvements in Ruby 2.7 JIT

  40. Ruby 2.7 JIT Performance Improvements

    1. Default Option Changes
    2. Deoptimized Recompilation
    3. Method Inlining
    4. Optimized Dispatch of JIT-ed Code (WIP)
    5. Stack-based Object Allocation (PoC)
  41. 1. Default Option Changes

  42. 1. Default Option Changes

    • Ruby 2.7 changes in default values of JIT options
    • --jit-min-calls: 5 → 10,000
    • --jit-max-cache: 1,000 → 100
  43. None
  44. 2. Deoptimized Recompilation

  45. Problem 2: JIT calls may be cancelled frequently

    • The "Cancel JIT execution" had some overhead
    • How many cancels did we have?
  46. Problem 2: JIT calls may be cancelled frequently

  47. None
  48. None
  49. self's class change causes JIT cancel

  50. Solution 2: Deoptimized Recompilation

    • Recompile a method when JIT's speculation is invalidated
    • It was in the original MJIT by Vladimir Makarov, but removed for simplicity in Ruby 2.6
  51. Solution 2: Deoptimized Recompilation

    • Committed to trunk. Inspectable with --jit-verbose=1
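The idea of cancelling on a failed speculation and then recompiling without it can be illustrated with a toy model. This is a sketch under stated assumptions rather than MJIT's actual mechanism: `ToyMethod`, its counters, and the "speculation" that the receiver is always one class are all hypothetical.

```ruby
# Toy model of speculation, cancel and recompilation; not MJIT's real code.
# ToyMethod and its fields are hypothetical names for illustration.
class ToyMethod
  attr_reader :cancels, :recompiles

  def initialize(expected_class, &generic)
    @expected_class = expected_class # speculation: receiver is always this class
    @generic = generic               # always-correct (slower) implementation
    @speculative = true
    @cancels = 0
    @recompiles = 0
  end

  def call(receiver)
    if @speculative
      # Guard: the fast path is only valid while the speculation holds.
      return fast_path(receiver) if receiver.instance_of?(@expected_class)

      @cancels += 1 # "cancel JIT execution": fall back for this call
      recompile_without_speculation
    end
    @generic.call(receiver)
  end

  private

  def fast_path(receiver)
    receiver + 1 # specialized for the expected class (Integer in the usage below)
  end

  def recompile_without_speculation
    @speculative = false # later calls skip the failing guard entirely
    @recompiles += 1
  end
end

succ = ToyMethod.new(Integer) { |x| x.succ }
p succ.call(1)   # speculation holds: fast path
p succ.call("a") # class changed: cancel once, then recompile
p succ.call("b") # generic code, no further cancels
p [succ.cancels, succ.recompiles]
```

Without the recompile step, every call with an unexpected class would pay the cancel overhead again; with it, the cost is paid once per invalidated speculation.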
  52. Solution 2: Deoptimized Recompilation

  53. 3. Method Inlining

  54. Problem 3: Method call is slow

    • We're calling methods everywhere
    • Method call cost:
        VM → VM    10.28ns
        VM → JIT    9.12ns
        JIT → JIT   8.98ns
        JIT → VM   19.59ns
  55. Problem 3: Method call is slow

  56. Solution 3: Method Inlining

    • Method inlining levels:
      • Level 1: Just call an inline function instead of JIT-ed code's function pointer
      • Level 2: Skip pushing a call frame by default, but lazily push it when something happens
    • For 2, we need to know "purity" of VM instruction
  57. Solution 3: Method Inlining

    • Can Numeric#zero? written in Ruby be pure?
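One way to see why the purity question is not trivial: a Ruby-level `zero?` is just a call to `==`, and whether that call has side effects depends on the receiver. The sketch below is illustrative only. `ruby_zero?` and `NoisyNumber` are hypothetical names, and this is not how `Numeric#zero?` is actually defined in CRuby.

```ruby
# A Ruby-level definition in the spirit of the slide's question.
# `ruby_zero?` is a hypothetical name, not CRuby's Numeric#zero?.
def ruby_zero?(num)
  num == 0
end

# A receiver whose #== has a side effect, so the same method body is not pure.
class NoisyNumber
  attr_reader :comparisons

  def initialize(value)
    @value = value
    @comparisons = 0
  end

  def ==(other)
    @comparisons += 1 # observable side effect
    @value == other
  end
end

noisy = NoisyNumber.new(0)
p ruby_zero?(0)     # built-in Integer#== is used here
p ruby_zero?(noisy) # same body, but now it runs user-defined code
p noisy.comparisons

# On CRuby the compiled VM instructions can be inspected; output varies by version.
if defined?(RubyVM::InstructionSequence)
  iseq = RubyVM::InstructionSequence.of(method(:ruby_zero?))
  puts iseq.disasm if iseq
end
```

Looking at the disassembly shows which VM instructions the body compiles to, which is the level at which the slide's "purity" of VM instruction would have to be decided.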
  58. None
  59. Solution 3: Method Inlining

  60. Solution 4: Frame-omitted Method Inlining

  61. Solution 3: Method Inlining VM

  62. Solution 3: Method Inlining VM JIT

  63. Solution 3: Method Inlining

    • Method inlining is already on master!
    • It's working for limited things like #html_safe?, #present?
    • To make it really useful, we need to prepare Ruby version of core class methods for JIT
  64. 4. Optimized Dispatch of JIT-ed Code

  65. Problem 4: Calling JIT-ed code seems slow

    • When benchmarking after-compile Rails performance, maximum number of methods should be compiled
    • Max: 1,000 in Ruby 2.6, 100 in Ruby 2.7
    • Note: only 30 methods are compiled on Optcarrot
  66. Problem 4: Calling JIT-ed code seems slow

    [Line graph: Time to call a method returning nil (ns) vs. Number of called methods (1 to 97), VM and JIT]
    def foo1 nil end
    def foo2 nil end
    def foo3 nil end
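A measurement in the spirit of this graph could be sketched as follows. This is not the benchmark behind the slide: `NilMethods` and `ns_per_call` are hypothetical names, the calls go through `public_send` (which adds its own dispatch overhead), and the absolute numbers depend on the machine and Ruby version, so they are not comparable with the slide's figures.

```ruby
# Hypothetical helper names; timings depend heavily on machine and Ruby version.
module NilMethods
  # Define foo1..fooN, each returning nil, like the methods on the slide.
  def self.define(count)
    (1..count).each do |i|
      module_eval("def self.foo#{i}; nil; end", __FILE__, __LINE__)
    end
  end
end

# Average time in nanoseconds per call when cycling through `count` methods.
# Includes public_send and block overhead, so treat results as relative only.
def ns_per_call(count, rounds: 10_000)
  names = (1..count).map { |i| :"foo#{i}" }
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
  rounds.times { names.each { |name| NilMethods.public_send(name) } }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond) - started
  elapsed.to_f / (rounds * count)
end

NilMethods.define(97)
[1, 25, 49, 97].each do |count|
  puts format("%3d methods: %.1f ns/call", count, ns_per_call(count))
end
```

Running it once with and once without `--jit` gives a rough feel for how call time changes as the number of distinct called methods grows.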
  67. So we did this in Ruby 2.6

    [Diagram: "JIT compaction". Link all: .o files → .so file. Reload all: .so file → Function pointers of machine code, Called by the VM Thread.]
  68. After "JIT compaction" in Ruby 2.6

    [Line graph: Time to call a method returning nil (ns) vs. Number of called methods (1 to 97), VM and JIT]
    def foo1 nil end
    def foo2 nil end
    def foo3 nil end
  69. But still we see icache stall

  70. Solution 4: Profile-guided Optimization?

    • Use GCC/Clang's -fprofile-generate and -fprofile-use
  71. Solution 4: Profile-guided Optimization?

    • Use GCC/Clang's -fprofile-generate and -fprofile-use
    • Unfortunately this did not help the situation
  72. Solution 4: Optimized Dispatch of JIT-ed Code

    • Calling JIT-ed code from VM is slow
    • Can we generate special code for dispatch from VM?
    • We can reduce # of virtual calls from two to one
    • Work in progress, but I can show you a graph
  73. After optimized dispatch of JIT-ed code (WIP)

    [Line graph: Time to call a method returning nil (ns) vs. Number of called methods (1 to 97), VM and JIT]
    def foo1 nil end
    def foo2 nil end
    def foo3 nil end
  74. 5. Stack-based Object Allocation

  75. Problem 5: Object allocation is slow

    • Rails app allocates objects (of course!), unlike Optcarrot
    • It takes time to allocate memory from heap and GC it
  76. Problem 5: Object allocation is slow

    • Railsbench takes time for memory management in perf
    • memory management, GC: 9.3%
  77. Solution 5: Stack-based Object Allocation (PoC)

    • If an object does not "escape", we can allocate it on the stack
    • Implementing really clever escape analysis is hard, but a basic one can suffice for some real-world use cases
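What "escape" means here can be shown with two small methods. This is an illustrative sketch only: `Point`, `distance_squared`, `remember`, and `allocations` are hypothetical names, and the allocation counter relies on the CRuby-specific `GC.stat(:total_allocated_objects)` key.

```ruby
# Hypothetical examples of a non-escaping and an escaping object.
Point = Struct.new(:x, :y)

# The Point is only used inside the method: it does not escape, so a
# compiler with escape analysis could in principle keep it on the stack.
def distance_squared(x, y)
  point = Point.new(x, y)
  point.x * point.x + point.y * point.y
end

REGISTRY = []

# The Point is stored in a longer-lived structure and returned:
# it escapes and must outlive the call, so it needs the heap.
def remember(x, y)
  point = Point.new(x, y)
  REGISTRY << point
  point
end

# Count object allocations around a block (CRuby-specific GC.stat key).
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

puts distance_squared(3, 4)
puts allocations { 1_000.times { distance_squared(3, 4) } }
```

The second number shows how many objects a tight loop over the non-escaping case allocates on the Ruby you run it on, which is the cost a stack-based allocation would aim to avoid.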
  78. Solution 5: Stack-based Object Allocation (PoC)

  79. Solution 5: Stack-based Object Allocation (PoC) VM

  80. Solution 5: Stack-based Object Allocation (PoC) VM JIT

  81. Solution 5: Stack-based Object Allocation (PoC)

  82. Solution 5: Stack-based Object Allocation (PoC) VM

  83. Solution 5: Stack-based Object Allocation (PoC) VM JIT

  84. Conclusion

    • Optimizing JIT-ed code dispatch may offset the current bottleneck of JIT on Rails
    • Once the problem is solved, we'd be able to continuously improve performance
      • By allocating objects on stack, eliminating branches, ...