
JIT compiler improvements in Ruby 2.7 / RubyRussia 2019


Takashi Kokubun

September 28, 2019


Transcript

  1. JIT compiler improvements in Ruby 2.7
    @k0kubun


  2. @k0kubun
    Ruby's JIT, ERB, Haml, Hamlit



  4. JIT


  5. Just-In-Time compiler


  6. What's JIT?
    • Experimental optional feature since Ruby 2.6

    • Compile your Ruby code to faster C code automatically

    • Just-in-Time: Use runtime information for optimizations
    $ ruby --jit
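
A minimal sketch of checking whether the JIT is active while running a small CPU-bound workload. `RubyVM::MJIT.enabled?` is the Ruby 2.6/2.7 API; the `defined?` guard is an assumption added here so the script also runs on interpreters without MJIT. The `fib` workload and the file name are illustrative only.

```ruby
# Save as jit_check.rb (hypothetical name) and run with: ruby --jit jit_check.rb

# Returns true only when this interpreter has MJIT and it was enabled.
def mjit_enabled?
  return false unless defined?(RubyVM::MJIT)
  RubyVM::MJIT.enabled? ? true : false
end

# A small CPU-bound method, the kind of code a JIT is meant to speed up.
def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
result = fib(20)
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts "MJIT enabled: #{mjit_enabled?}"
puts "fib(20) = #{result} in #{(elapsed * 1000).round(2)} ms"
```

Running the same file with and without `--jit` gives a rough feel for the difference, although a workload this short may finish before any method is compiled.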


  7. Ruby 3x3 benchmark: Optcarrot
    NES emulator: mame/optcarrot


  8. Ruby 3x3 benchmark: Optcarrot
    [Chart: Frames Per Second (fps), Ruby 2.6.0 w/ Optcarrot. JIT off: 53.8]
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu


  9. Ruby 3x3 benchmark: Optcarrot
    [Chart: Frames Per Second (fps), Ruby 2.6.0 w/ Optcarrot. JIT off: 53.8, JIT on: 86.6. Speed: 1.61x]
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu


  10. What's JIT?
    $ ps aufx
    ruby --jit bin/optcarrot-bench
    \_ /usr/bin/gcc -w -Wfatal-errors -fPIC -shared -w -pipe -O3
    \_ /usr/lib/gcc/x86_64-linux-gnu/7/cc1 -quiet -imultiarch
    \_ as -W --64 -o /tmp/_ruby_mjit_p31673u20.o


  11. How does it work?
    [Diagram]
    Build time: VM's C code → Transform → header → Precompile → precompiled header
    Ruby process: VM Thread, queue, MJIT Worker Thread


  12. How does it work?
    [Diagram]
    Build time: VM's C code → header → precompiled header
    Ruby process: VM Thread → Enqueue (bytecode to JIT) → queue → Dequeue → MJIT Worker Thread


  13. How does it work?
    [Diagram]
    Build time: VM's C code → header → precompiled header
    Ruby process: VM Thread → Enqueue (bytecode to JIT) → queue → Dequeue → MJIT Worker Thread
    MJIT Worker Thread → Generate → C code (the precompiled header is included)


  14. How does it work?
    [Diagram]
    Build time: VM's C code → header → precompiled header
    Ruby process: VM Thread → Enqueue (bytecode to JIT) → queue → Dequeue → MJIT Worker Thread
    MJIT Worker Thread → Generate → C code (the precompiled header is included)
    C code → CC → .o file


  15. How does it work?
    [Diagram]
    Build time: VM's C code → header → precompiled header
    Ruby process: VM Thread → Enqueue (bytecode to JIT) → queue → Dequeue → MJIT Worker Thread
    MJIT Worker Thread → Generate → C code (the precompiled header is included)
    C code → CC → .o file → Link → .so file


  16. How does it work?
    [Diagram]
    Build time: VM's C code → header → precompiled header
    Ruby process: VM Thread → Enqueue (bytecode to JIT) → queue → Dequeue → MJIT Worker Thread
    MJIT Worker Thread → Generate → C code (the precompiled header is included)
    C code → CC → .o file → Link → .so file
    .so file → Load → function pointer of machine code, called by the VM Thread
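
The enqueue/dequeue flow in the diagram can be sketched as a toy model in plain Ruby. This is not the real MJIT implementation: the `ToyJit` class and its methods are made up for illustration, and "compilation" just wraps the method body in a lambda as a stand-in for generating C code, running the C compiler, linking, and loading the .so file.

```ruby
# Toy model of the VM thread / worker thread hand-off. NOT real MJIT code.
class ToyJit
  def initialize
    @queue = Queue.new   # methods waiting to be compiled
    @compiled = {}       # name => callable standing in for a function pointer
    @lock = Mutex.new
    @worker = Thread.new { work }
  end

  # Called by the "VM thread": hand a method over to the worker.
  def enqueue(name, body)
    @queue << [name, body]
  end

  # Called by the "VM thread": use compiled code if loaded, else interpret.
  def call(name, body, *args)
    compiled = @lock.synchronize { @compiled[name] }
    compiled ? compiled.call(*args) : body.call(*args)
  end

  def compiled?(name)
    @lock.synchronize { @compiled.key?(name) }
  end

  # Ask the worker to stop once it has drained the queue, then wait for it.
  def shutdown
    @queue << :stop
    @worker.join
  end

  private

  # Worker loop: dequeue a job, "compile" it, publish the result.
  def work
    while (job = @queue.pop) != :stop
      name, body = job
      function = ->(*args) { body.call(*args) } # stand-in for CC + link + load
      @lock.synchronize { @compiled[name] = function }
    end
  end
end

jit = ToyJit.new
square = ->(x) { x * x }
jit.enqueue(:square, square)
jit.shutdown
puts jit.compiled?(:square)
puts jit.call(:square, square, 7)
```

The point of the separate worker thread is that the caller keeps running the interpreted version until the compiled one is published.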


  17. How to use JIT


  18. How to use JIT
    • Just "--jit" is fine

    • You can also use RUBYOPT=--jit environment variable
    $ ruby --jit


  19. How to use JIT
    $ ruby --help
    JIT options (experimental):
    --jit-warnings Enable printing JIT warnings
    --jit-debug Enable JIT debugging (very slow)
    --jit-wait Wait until JIT compilation is finished everytime (for testing)
    --jit-save-temps
    Save JIT temporary files in $TMP or /tmp (for testing)
    --jit-verbose=num
    Print JIT logs of level num or less to stderr (default: 0)
    --jit-max-cache=num
    Max number of methods to be JIT-ed in a cache (default: 100)
    --jit-min-calls=num
    Number of calls to trigger JIT (for testing, default: 10000)
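
A small sketch of combining these options from a Ruby script. `jit_command` is a hypothetical helper, not part of Ruby; it only assembles flags that appear in the help output above and does not run anything by itself.

```ruby
# Hypothetical helper: builds an argv array for running a script with JIT flags.
# Only flags listed in the `ruby --help` output above are used.
def jit_command(script, verbose: nil, min_calls: nil, max_cache: nil, wait: false)
  command = ["ruby", "--jit"]
  command << "--jit-wait" if wait
  command << "--jit-verbose=#{verbose}" if verbose
  command << "--jit-min-calls=#{min_calls}" if min_calls
  command << "--jit-max-cache=#{max_cache}" if max_cache
  command << script
end

argv = jit_command("bin/optcarrot-bench", verbose: 1, min_calls: 5)
puts argv.join(" ")
# To actually launch it from Ruby: system(*argv)
```

Passing an array to `system` avoids shell quoting issues when the script path contains spaces.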



  21. How to use JIT
    $ ruby --jit-verbose=1 ...


  22. How to use JIT
    $ ruby --jit-verbose=1 ...
    JIT success (35.1ms): block in symbolize_keys!@...
    JIT success (89.9ms): block in forwarded_scheme@...


  23. How to use JIT
    $ ruby --jit-verbose=1 ...
    JIT success (35.1ms): block in symbolize_keys!@...
    JIT success (89.9ms): block in forwarded_scheme@...
    JIT inline: unwrapped_html_escape@...
    JIT success (106.9ms): unwrapped_html_escape@...
    JIT inline: present?@...
    JIT success (37.5ms): present?@...


  24. How to use JIT
    $ ruby --jit-verbose=1 ...
    JIT success (35.1ms): block in symbolize_keys!@...
    JIT success (89.9ms): block in forwarded_scheme@...
    JIT inline: unwrapped_html_escape@...
    JIT success (106.9ms): unwrapped_html_escape@...
    JIT inline: present?@...
    JIT success (37.5ms): present?@...
    Optimization in Ruby 2.7


  25. How to use JIT
    $ ruby --jit-verbose=1 ...
    JIT success (35.1ms): block in symbolize_keys!@...
    JIT success (89.9ms): block in forwarded_scheme@...
    JIT inline: unwrapped_html_escape@...
    JIT success (106.9ms): unwrapped_html_escape@...
    JIT inline: present?@...
    JIT success (37.5ms): present?@...
    JIT recompile: present?@...


  26. How to use JIT
    $ ruby --jit-verbose=1 ...
    JIT success (35.1ms): block in symbolize_keys!@...
    JIT success (89.9ms): block in forwarded_scheme@...
    JIT inline: unwrapped_html_escape@...
    JIT success (106.9ms): unwrapped_html_escape@...
    JIT inline: present?@...
    JIT success (37.5ms): present?@...
    JIT recompile: present?@...
    Another optimization in Ruby 2.7


  27. How to use JIT
    $ ruby --jit-verbose=1 ...
    JIT success (35.1ms): block in symbolize_keys!@...
    JIT success (89.9ms): block in forwarded_scheme@...
    JIT inline: unwrapped_html_escape@...
    JIT success (106.9ms): unwrapped_html_escape@...
    JIT inline: present?@...
    JIT success (37.5ms): present?@...
    JIT recompile: present?@...
    ...
    JIT compaction (17.0ms): Compacted 100 methods -> ...
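
Log lines in the shape shown above can be summarized with a short script. This is a sketch under the assumption that every relevant line looks like the samples on this slide; the regular expression is inferred from those samples only, and real `--jit-verbose=1` output may contain other kinds of lines. `summarize_jit_log` is a made-up helper name.

```ruby
# Matches "JIT <kind>[ (<time>ms)]: <target>" as seen in the sample output.
LOG_LINE = /\AJIT (?<kind>success|inline|recompile|compaction)(?: \((?<ms>[\d.]+)ms\))?: (?<target>.*)\z/

# Counts each kind of line and sums the time spent in successful compilations.
def summarize_jit_log(lines)
  summary = { counts: Hash.new(0), success_ms: 0.0 }
  lines.each do |line|
    match = LOG_LINE.match(line.strip)
    next unless match # skip anything that is not a recognized JIT line
    kind = match[:kind]
    summary[:counts][kind] += 1
    summary[:success_ms] += match[:ms].to_f if kind == "success"
  end
  summary
end

# The sample lines from the slide, kept with their elided "..." parts.
sample = [
  "JIT success (35.1ms): block in symbolize_keys!@...",
  "JIT success (89.9ms): block in forwarded_scheme@...",
  "JIT inline: unwrapped_html_escape@...",
  "JIT success (106.9ms): unwrapped_html_escape@...",
  "JIT inline: present?@...",
  "JIT success (37.5ms): present?@...",
  "JIT recompile: present?@...",
  "...",
  "JIT compaction (17.0ms): Compacted 100 methods -> ..."
]

summary = summarize_jit_log(sample)
puts summary[:counts].inspect
puts "time in successful compilations: #{summary[:success_ms].round(1)} ms"
```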


  28. How to use JIT
    $ ruby --jit-verbose=1 ...
    JIT success (35.1ms): block in symbolize_keys!@...
    JIT success (89.9ms): block in forwarded_scheme@...
    JIT inline: unwrapped_html_escape@...
    JIT success (106.9ms): unwrapped_html_escape@...
    JIT inline: present?@...
    JIT success (37.5ms): present?@...
    JIT recompile: present?@...
    ...
    JIT compaction (17.0ms): Compacted 100 methods -> ...
    ?


  29. "JIT compaction"
    [Diagram]
    Build time: VM's C code → header → precompiled header
    Ruby process: the MJIT Worker Thread has produced three .o files
    Each .o file has its own function pointer of machine code, called by the VM Thread


  30. "JIT compaction"
    [Diagram]
    Build time: VM's C code → header → precompiled header
    Ruby process: the MJIT Worker Thread has produced three .o files
    Each .o file has its own function pointer of machine code, called by the VM Thread
    .o files → Link all → .so file


  31. "JIT compaction"
    [Diagram]
    Build time: VM's C code → header → precompiled header
    Ruby process: .o files (MJIT Worker Thread) → Link all → .so file
    .so file → Reload all → function pointers of machine code, called by the VM Thread


  32. "JIT compaction"
    [Diagram]
    Build time: VM's C code → header → precompiled header
    Ruby process: .o files (MJIT Worker Thread); function pointers of machine code, called by the VM Thread


  33. JIT's performance on Rails


  34. Ruby benchmark on Rails: Railsbench
    • Just rails scaffold #show: k0kubun/railsbench

    • headius/pgrailsbench, but on Rails 5.2 and w/ db:seed

    • Small, but it captures some Rails characteristics


  35. Railsbench: Speed
    [Chart: Request Per Second (#/s). Ruby 2.6: JIT off 924.9, JIT on 720.7]
    k0kubun/railsbench : WARMUP=30000 BENCHMARK=10000 bin/bench
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2 Ruby2.7=r67600


  36. Railsbench: Speed
    [Chart: Request Per Second (#/s). Ruby 2.6: JIT off 924.9, JIT on 720.7. Ruby 2.7: JIT off 932.0, JIT on 899.9]
    k0kubun/railsbench : WARMUP=30000 BENCHMARK=10000 bin/bench
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2 Ruby2.7=r67600


  37. Railsbench: Memory
    [Chart: bars for JIT off and JIT on. Ruby 2.6 values: 107.2, 105.2. Ruby 2.7 values: 107.6, 106.5]
    k0kubun/railsbench : WARMUP=30000 BENCHMARK=10000 bin/bench
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu


  38. Why is it slow on Rails?
    • Too many methods => Cache inefficiency

    • Less CPU bound and fewer optimization chances


  39. Performance Improvements
    in Ruby 2.7 JIT


  40. Ruby 2.7 JIT Performance Improvements
    1. Default Option Changes

    2. Deoptimized Recompilation

    3. Method Inlining

    4. Optimized Dispatch of JIT-ed Code (WIP)

    5. Stack-based Object Allocation (PoC)


  41. 1. Default Option Changes


  42. 1. Default Option Changes
    • Ruby 2.7 changes the default values of JIT options

    • --jit-min-calls: 5 → 10,000

    • --jit-max-cache: 1,000 → 100
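
The effect of the two defaults can be illustrated with a toy model. This is not MJIT's real bookkeeping: `select_jit_candidates` is a made-up function that treats a method as a JIT candidate once its call count reaches `min_calls`, until `max_cache` methods have been selected. The workload is hypothetical.

```ruby
# Toy model: returns the methods that would be picked for compilation.
def select_jit_candidates(call_sequence, min_calls:, max_cache:)
  counts = Hash.new(0)
  selected = []
  call_sequence.each do |name|
    counts[name] += 1
    next unless counts[name] == min_calls # trigger exactly once per method
    selected << name if selected.size < max_cache
  end
  selected
end

# Hypothetical workload: one hot method and 200 methods called 5 times each.
calls = Array.new(10_000, :hot) + (1..200).flat_map { |i| [:"cold#{i}"] * 5 }

old_defaults = select_jit_candidates(calls, min_calls: 5, max_cache: 1_000)
new_defaults = select_jit_candidates(calls, min_calls: 10_000, max_cache: 100)

puts old_defaults.size # 201: the hot method plus every cold one
puts new_defaults.size # 1: only the hot method
```

With the higher threshold, only methods that are called very often are compiled, so fewer methods compete for the cache.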



  44. 2. Deoptimized Recompilation


  45. Problem 2: JIT calls may be cancelled frequently
    • The "Cancel JIT execution" had some overhead

    • How many cancels did we have?


  46. Problem 2: JIT calls may be cancelled frequently




  49. self's class change
    causes JIT cancel


  50. Solution 2: Deoptimized Recompilation
    • Recompile a method when JIT's speculation is invalidated

    • It was in the original MJIT by Vladimir Makarov, but
    removed for simplicity in Ruby 2.6
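
The idea of recompiling after an invalidated speculation can be sketched with a toy model in plain Ruby. It is not how MJIT is implemented: `SpeculativeCode` is a made-up class whose "speculation" is a guess about the receiver's class, and "recompilation" simply drops that guess so later calls no longer cancel.

```ruby
# Toy model of speculation, cancel, and recompilation. NOT real MJIT code.
class SpeculativeCode
  attr_reader :cancels

  def initialize(expected_class, recompile_on_cancel:, &body)
    @expected_class = expected_class          # the speculation
    @recompile_on_cancel = recompile_on_cancel
    @body = body
    @speculating = true
    @cancels = 0
  end

  def call(receiver)
    if @speculating && !receiver.instance_of?(@expected_class)
      @cancels += 1                                 # "cancel JIT execution"
      @speculating = false if @recompile_on_cancel  # recompile without the guess
    end
    @body.call(receiver)
  end
end

receivers = [1, 2.0, 3, 4.5, 5.5]
without_recompile = SpeculativeCode.new(Integer, recompile_on_cancel: false) { |x| x.zero? }
with_recompile = SpeculativeCode.new(Integer, recompile_on_cancel: true) { |x| x.zero? }

receivers.each do |receiver|
  without_recompile.call(receiver)
  with_recompile.call(receiver)
end

puts without_recompile.cancels # 3: every Float receiver cancels
puts with_recompile.cancels    # 1: only the first mismatch cancels
```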


  51. Solution 2: Deoptimized Recompilation
    • Committed to trunk. Inspectable with --jit-verbose=1


  52. Solution 2: Deoptimized Recompilation


  53. 3. Method Inlining


  54. Problem 3: Method call is slow
    • We're calling methods everywhere

    • Method call cost:
    VM → VM 10.28ns
    VM → JIT 9.12ns
    JIT → JIT 8.98ns
    JIT → VM 19.59ns


  55. Problem 3: Method call is slow


  56. Solution 3: Method Inlining
    • Method inlining levels:

    • Level 1: Just call an inline function instead of JIT-ed
    code's function pointer

    • Level 2: Skip pushing a call frame by default, but lazily
    push it when something happens

    • For level 2, we need to know the "purity" of each VM instruction


  57. Solution 3: Method Inlining
    • Can Numeric#zero? written in Ruby be pure?
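
To make the question concrete, here is a sketch that defines a `zero?` method in Ruby and lists the VM instructions it compiles to. `Meters` is a hypothetical class standing in for a core method rewritten in Ruby, and `instruction_names` is a made-up helper. It relies on CRuby's `RubyVM::InstructionSequence` and returns an empty list elsewhere; the exact instruction names differ between Ruby versions, so none are asserted here.

```ruby
# Hypothetical class with a zero? written in Ruby rather than C.
class Meters
  def initialize(value)
    @value = value
  end

  def zero?
    @value == 0
  end
end

# Lists the VM instruction names of a method (CRuby only, otherwise []).
def instruction_names(method_object)
  return [] unless defined?(RubyVM::InstructionSequence)
  iseq = RubyVM::InstructionSequence.of(method_object)
  return [] unless iseq
  # The last element of to_a is the body; instructions are the Array entries.
  iseq.to_a.last.select { |entry| entry.is_a?(Array) }.map(&:first)
end

puts Meters.new(0).zero?
puts instruction_names(Meters.new(0).method(:zero?)).inspect
```

Whether such a method is pure depends on each instruction in that list, since an instruction that can dispatch to user-defined code (for example a redefined `==`) may have side effects.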



  59. Solution 3: Method Inlining


  60. Solution 3: Frame-omitted Method Inlining


  61. Solution 3: Method Inlining
    VM


  62. Solution 3: Method Inlining
    VM
    JIT


  63. Solution 3: Method Inlining
    • Method inlining is already on master!

    • It works only for limited cases such as #html_safe? and #present?

    • To make it really useful, we need to prepare Ruby versions
    of core class methods for the JIT


  64. 4. Optimized Dispatch of JIT-ed Code


  65. Problem 4: Calling JIT-ed code seems slow
    • When benchmarking Rails performance after compilation, the
    maximum number of methods should have been compiled

    • Max: 1,000 in Ruby 2.6, 100 in Ruby 2.7

    • Note: only 30 methods are compiled on Optcarrot


  66. Problem 4: Calling JIT-ed code seems slow
    [Chart: time to call a method returning nil (ns) vs. number of called methods (1 to 97); series: VM, JIT]
    def foo3
      nil
    end
    def foo2
      nil
    end
    def foo1
      nil
    end
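
A rough sketch of this kind of measurement: define `foo1` to `fooN`, each returning nil, and time the average cost per call when all of them are called in turn. The helper names, the iteration count, and the chosen method counts are assumptions for illustration, and this is not the benchmark script used for the chart. Run it with and without `--jit` to compare.

```ruby
# Builds an anonymous class with foo1..fooN (each returning nil) and a
# call_all method that calls every one of them directly.
def build_benchmark_class(method_count)
  klass = Class.new
  (1..method_count).each { |i| klass.class_eval("def foo#{i}; nil; end") }
  calls = (1..method_count).map { |i| "foo#{i}" }.join("; ")
  klass.class_eval("def call_all; #{calls}; end")
  klass
end

# Average nanoseconds per fooN call, including loop overhead.
def nanoseconds_per_call(method_count, iterations: 10_000)
  object = build_benchmark_class(method_count).new
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
  iterations.times { object.call_all }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond) - started
  elapsed.to_f / (iterations * method_count)
end

[1, 25, 97].each do |count|
  puts format("%3d methods: %.2f ns/call", count, nanoseconds_per_call(count))
end
```

Numbers from a micro-benchmark like this vary between machines and runs, so only the trend across method counts is meaningful.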


  67. So we did this in Ruby 2.6
    [Diagram]
    Build time: VM's C code → header → precompiled header
    Ruby process: .o files (MJIT Worker Thread) → Link all → .so file
    .so file → Reload all → function pointers of machine code, called by the VM Thread


  68. After "JIT compaction" in Ruby 2.6
    [Chart: time to call a method returning nil (ns) vs. number of called methods (1 to 97); series: VM, JIT]
    def foo3
      nil
    end
    def foo2
      nil
    end
    def foo1
      nil
    end


  69. But we still see icache stalls


  70. Solution 4: Profile-guided Optimization?
    • Use GCC/Clang's -fprofile-generate and -fprofile-use


  71. Solution 4: Profile-guided Optimization?
    • Use GCC/Clang's -fprofile-generate and -fprofile-use

    • Unfortunately this did not help the situation


  72. Solution 4: Optimized Dispatch of JIT-ed Code
    • Calling JIT-ed code from VM is slow

    • Can we generate special code for dispatch from VM?

    • We can reduce # of virtual calls from two to one

    • Work in progress, but I can show you a graph


  73. After optimized dispatch of JIT-ed code (WIP)
    [Chart: time to call a method returning nil (ns) vs. number of called methods (1 to 97); series: VM, JIT]
    def foo3
      nil
    end
    def foo2
      nil
    end
    def foo1
      nil
    end


  74. 5. Stack-based Object Allocation


  75. Problem 5: Object allocation is slow
    • A Rails app allocates objects (of course!), unlike Optcarrot

    • It takes time to allocate memory from the heap and to GC it


  76. Problem 5: Object allocation is slow
    • According to perf, Railsbench spends time on memory management
    (memory management, GC: 9.3%)


  77. Solution 5: Stack-based Object Allocation (PoC)
    • If an object does not "escape", we can allocate it on the
    stack

    • Implementing really clever escape analysis is hard, but
    a basic one can suffice for some real-world use cases
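
The difference between an object that escapes and one that does not can be shown with two small methods. This is only an illustration of the concept, not the proof-of-concept from the talk: `Point`, `distance_squared`, `remember_point`, `REGISTRY`, and `allocations_during` are all made-up names. The allocation counter uses `GC.stat(:total_allocated_objects)`, which assumes CRuby.

```ruby
Point = Struct.new(:x, :y)
REGISTRY = []

# The Point is used only inside this method and is never stored or returned,
# so it does not escape. A stack allocation would be safe here.
def distance_squared(x, y)
  point = Point.new(x, y)
  point.x * point.x + point.y * point.y
end

# The Point is stored in a constant and returned to the caller, so it
# escapes and has to live on the heap.
def remember_point(x, y)
  point = Point.new(x, y)
  REGISTRY << point
  point
end

# Number of objects allocated while the block runs (CRuby only).
def allocations_during
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

allocated = allocations_during { 1_000.times { distance_squared(3, 4) } }
puts "distance_squared(3, 4) = #{distance_squared(3, 4)}"
puts "objects allocated by 1,000 calls: #{allocated}"
```

Without escape analysis, each call to `distance_squared` still allocates a heap object even though nothing outside the method can observe it.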


  78. Solution 5: Stack-based Object Allocation (PoC)


  79. Solution 5: Stack-based Object Allocation (PoC)
    VM


  80. Solution 5: Stack-based Object Allocation (PoC)
    VM
    JIT


  81. Solution 5: Stack-based Object Allocation (PoC)


  82. Solution 5: Stack-based Object Allocation (PoC)
    VM


  83. Solution 5: Stack-based Object Allocation (PoC)
    VM
    JIT


  84. Conclusion
    • Optimizing JIT-ed code dispatch may offset the current
    bottleneck of JIT on Rails

    • Once the problem is solved, we'd be able to continuously
    improve performance

    • By allocating objects on the stack, eliminating branches, ...
