Performance Improvement of Ruby 2.7 JIT in Real World / RubyKaigi 2019


Takashi Kokubun

April 18, 2019

Transcript

  1. Performance Improvement of
    Ruby 2.7 JIT in Real World
    @k0kubun / RubyKaigi 2019


  2. @k0kubun
    Ruby Committer: JIT, ERB


  3. What's JIT?
    • Experimental optional feature since Ruby 2.6

    • Compile your Ruby code to faster C code automatically

    • Just-in-Time: Use runtime information for optimizations
    $ ruby --jit


  4–14. What's "MJIT"?
    [Diagram, built up step by step across slides 4–14]

    • A Ruby process runs two threads: the VM thread and the MJIT worker
      thread, connected by a queue

    • At build time, the VM's C code header is transformed and
      precompiled into a "precompiled header"

    • The VM thread enqueues bytecode to JIT; the MJIT worker dequeues it
      and generates C code that includes the precompiled header

    • The C compiler (CC) compiles the C code to a .o file, which is
      linked into a .so file and loaded; the VM then calls the resulting
      function pointer of machine code

    • As methods accumulate, so do .o files; Ruby 2.6 links all .o files
      into a single .so file and reloads all function pointers from it

  15. Ruby 2.6 JIT stability
    • No SEGV report after release

    • To avoid bugs, it’s designed conservatively

    • We run JIT CIs 24 hours a day and have detected bugs ourselves

  16. JIT performance in real world


  17. Ruby 3x3 benchmark: 1. Optcarrot
    NES emulator: mame/optcarrot


  18. Ruby 3x3 benchmark: 1. Optcarrot
    Speed: 1.61x
    [Bar chart — Frames Per Second (fps): JIT off 53.8, JIT on 86.6]
    Ruby 2.6.0 w/ Optcarrot
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu

  19. Ruby 3x3 benchmark: 1. Optcarrot
    Memory: 1.01x
    [Bar chart — Max Resident Set Size (MB): JIT off 62.8, JIT on 63.7]
    Ruby 2.6.0 w/ Optcarrot
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu

  20. What's "real world"?
    • We're not running an NES emulator in production

    • Rails is popular for large-scale use

  21. Ruby 3x3 benchmark: 2. Discourse


  22. Ruby 3x3 benchmark: 2. Discourse
    • Forum application: discourse/discourse
    • For AWS: noahgibbs/rails_ruby_bench (a.k.a. RRB)

    • Its author has reported JIT performance for every preview, rc,
      and final release (thanks!)

  23. Ruby 3x3 benchmark: 2. Discourse


  24. Ruby 3x3 benchmark: 2. Discourse
    • It has captured the first 10~100k requests after startup

    • Is it “real-world”?

    • Compiling 1,000 methods takes a long time and makes it slow

    • Ruby 2.7 uses --jit-max-cache=100 by default, so it will run
      in an all-compiled state most of the time
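The cache cap is a real command-line option that can also be set explicitly; the script name here is a placeholder.

```shell
# Limit how many methods the JIT keeps compiled (Ruby 2.7's default is 100):
ruby --jit --jit-max-cache=100 app.rb
```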

  25. Discourse: Speed
    [Bar chart — Ruby 2.6, Request Per Second (#/s): JIT off 17.1, JIT on 16.2]
    k0kubun/discourse : WARMUP=5000 BENCHMARK=1000 script/simple_bench
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2, Ruby 2.7=r67600

  26. Discourse: Speed
    [Bar charts — Request Per Second (#/s):
     Ruby 2.6: JIT off 17.1, JIT on 16.2
     Ruby 2.7: JIT off 17.5, JIT on 17.4]
    k0kubun/discourse : WARMUP=5000 BENCHMARK=1000 script/simple_bench
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2, Ruby 2.7=r67600

  27. Why is JIT not fast on Rails yet?
    • Discourse may not be a suitable benchmark for finding out

    • Some hotspots are not Rails-specific

    • Reaching a stable state takes some time

  28. New Ruby benchmark: Railsbench
    • Just a Rails scaffold's #show action: k0kubun/railsbench

    • Based on headius/pgrailsbench, but on Rails 5.2 and with db:seed

    • Small, but it captures some Rails characteristics

  29. Railsbench: Speed
    [Bar chart — Ruby 2.6, Request Per Second (#/s): JIT off 924.9, JIT on 720.7]
    k0kubun/railsbench : WARMUP=30000 BENCHMARK=10000 bin/bench
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2, Ruby 2.7=r67600

  30. Railsbench: Speed
    [Bar charts — Request Per Second (#/s):
     Ruby 2.6: JIT off 924.9, JIT on 720.7
     Ruby 2.7: JIT off 932.0, JIT on 899.9]
    k0kubun/railsbench : WARMUP=30000 BENCHMARK=10000 bin/bench
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu, Ruby 2.6=2.6.2, Ruby 2.7=r67600

  31. Railsbench: Memory
    [Bar charts — Max Resident Set Size (MB):
     Ruby 2.6: JIT off 105.2, JIT on 107.2
     Ruby 2.7: JIT off 106.5, JIT on 107.6]
    k0kubun/railsbench : WARMUP=30000 BENCHMARK=10000 bin/bench
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu

  32. Railsbench
    • We can easily run it because the application is very lightweight

    • By profiling it, we've found which parts are not JIT-friendly

    • And those parts should exist in all Rails applications too

  33. Why is JIT still slow on Rails?
    Let’s explore our challenges for JIT on Rails!!


  34. Ruby 2.7 JIT
    Performance Challenges


  35. Ruby 2.7 JIT Performance Challenges
    1. Profile-guided Optimization

    2. Optimization Prediction

    3. Deoptimized Recompilation

    4. Frame-omitted Method Inlining

    5. Stack-based Object Allocation


  36. Problem 1: Calling JIT-ed code seems slow
    • When benchmarking after-compile Rails performance, the maximum
      number of methods should be compiled

    • Max: 1,000 in Ruby 2.6, 100 in trunk

    • Note: only 30 methods are compiled for Optcarrot

  37. Problem 1: Calling JIT-ed code seems slow
    [Chart: time to call methods returning nil (s) vs. number of called
    methods (1–40), VM vs. JIT]

    def foo1
      nil
    end

    def foo2
      nil
    end

    def foo3
      nil
    end
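The slide's microbenchmark can be reproduced in miniature with plain Ruby; run it with and without `--jit` to compare. The 40-method count mirrors the chart's x-axis, and the `Probe` class name is invented here.

```ruby
require "benchmark"

# Define 40 trivial methods that just return nil, like foo1..foo3 above.
class Probe
  (1..40).each { |i| define_method("foo#{i}") { nil } }
end

probe   = Probe.new
names   = (1..40).map { |i| :"foo#{i}" }
elapsed = Benchmark.realtime do
  10_000.times { names.each { |n| probe.send(n) } }
end

puts probe.foo1.inspect  # => nil
puts elapsed >= 0        # => true (absolute time varies by machine)
```

On Ruby 2.6, the JIT line of the chart grew with the number of compiled methods while the VM line stayed flat, which is what motivated the single-.so linking below.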

  38. So we did this in Ruby 2.6
    [Diagram: the MJIT worker links all .o files into a single .so file,
    and the VM reloads all function pointers of machine code from it]

  39. But we still see icache stalls

  40. Approach 1: Profile-guided Optimization
    • Use GCC/Clang's -fprofile-generate and -fprofile-use

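For reference, GCC's PGO flags are used in two passes like this. This is the generic recipe, not the actual MJIT build integration; the file names and workload flag are placeholders.

```shell
# 1. Build with instrumentation, 2. run a representative workload
#    (writes .gcda profile data), 3. rebuild using the profile.
gcc -O2 -fprofile-generate -o app app.c
./app --benchmark
gcc -O2 -fprofile-use -o app app.c
```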

  41. Approach 1: Profile-guided Optimization
    https://github.com/k0kubun/ruby/tree/pgo


  42. Approach 1: Profile-guided Optimization
    • It didn't magically solve the current bottleneck on Rails

    • It may be helpful later, but not now

    • As originally discussed, it complicates the build system


  43. Problem 2: Can we avoid making things slow?
    • Why can't we just skip compiling things like a method
    returning just nil?

    • If we compile only well-optimized methods, it should be fast


  44. Approach 2: Optimization Prediction
    • Several situations are known to be well-optimized in the
    current JIT

    • Let's measure the impacts and build heuristics!


  45. Approach 2: Optimization Prediction
    • Call overhead w/ 100 methods (distributed): +22.4ns
    • Call overhead w/ 100 methods (compacted): +3.22ns
    • Invoke another VM: +7.90s
    • Cancel JIT execution: +1.94s
    • Stack pointer motion elimination: -0.15ns
    • Method call w/ inline cache: -2.21ns
    • Instance variable read w/ inline cache: -0.86ns
    • opt_* instructions: -1~4ns

  46. Approach 2: Optimization Prediction
    • It did not magically solve the current bottleneck, again

    • Maybe the impact is more dynamic than I assumed

    • Compiling the same number of hotspot methods always seems to
      bring the same level of overhead

    • Is our code too big for the icache?

  47. Problem 3: JIT calls may be cancelled frequently
    • The "Cancel JIT execution" had some overhead

    • How many cancels did we have?


  48. Problem 3: JIT calls may be cancelled frequently


  49. self's class change causes JIT cancel
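The kind of speculation being cancelled can be illustrated in plain Ruby. This is an illustration of the concept only; whether a cancel actually happens depends on what the JIT chose to speculate on.

```ruby
# A call site compiled while `n` was always an Integer may specialize on
# that class; the first Float receiver invalidates the speculation and
# forces a cancel (fall back to the VM) or, with Solution 3, a recompile.
def classify(n)
  n.zero? ? :zero : :nonzero
end

puts classify(0)    # => zero     (Integer receiver: matches the speculation)
puts classify(0.0)  # => zero     (Float receiver: would cancel speculative code)
puts classify(1)    # => nonzero
```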

  50. Solution 3: Deoptimized Recompilation
    • Recompile a method when JIT's speculation is invalidated

    • It was in the original MJIT by Vladimir Makarov, but
    removed for simplicity in Ruby 2.6


  51. Solution 3: Deoptimized Recompilation
    • Committed to trunk. Inspectable with --jit-verbose=1


  52. Solution 3: Deoptimized Recompilation


  53. Problem 4: Method call is slow
    • We're calling methods everywhere

    • Method call cost:
      VM  -> VM:  10.28ns
      VM  -> JIT:  9.12ns
      JIT -> JIT:  8.98ns
      JIT -> VM:  19.59ns

  54. Problem 4: Method call is slow


  55. Solution 4: Frame-omitted Method Inlining
    • Method inlining levels:

    • Level 1: Just call an inline function instead of the JIT-ed
      code's function pointer

    • Level 2: Skip pushing a call frame by default, but lazily
      push it when something happens

    • For level 2, we need to know the "purity" of VM instructions

  56. Solution 4: Frame-omitted Method Inlining
    • Can Numeric#zero? written in Ruby be pure?

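A Ruby version of it could look like the sketch below. The method name is hypothetical; this is not the real core definition.

```ruby
# For an Integer receiver, `self == 0` reads no globals and writes no
# outside state, so the body is effectively pure: a frame-omitting
# inliner could skip pushing a call frame and push one lazily only if
# something observable (an exception, a trace) actually needs it.
class Numeric
  def zero_sketch?
    self == 0
  end
end

puts 0.zero_sketch?    # => true
puts 1.5.zero_sketch?  # => false
```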

  57. Solution 4: Frame-omitted Method Inlining


  58. Solution 4: Frame-omitted Method Inlining
    If it were true (valid for Integer),
    what would happen?


  59–62. Solution 4: Frame-omitted Method Inlining
    [Screenshots across slides 59–62, comparing VM and JIT]

  63. Solution 4: Frame-omitted Method Inlining
    • Frame-omitted method inlining (level 2) is already on trunk!

    • It works for limited things like #html_safe? and #present?

    • To make it really useful, we need to improve the metadata for
      methods and VM instructions

  64. Problem 5: Object allocation is slow
    • A Rails app allocates objects (of course!), unlike Optcarrot

    • It takes time to allocate memory from the heap and to GC it

  65. Problem 5: Object allocation is slow
    • In a perf profile, Railsbench spends time on memory management
      (memory management + GC: 9.3%)

  66. Solution 5: Stack-based Object Allocation
    • If an object does not "escape", we can allocate it on the stack

    • Implementing really clever escape analysis is hard, but a basic
      one can cover some real-world use cases
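What escape analysis looks for can be shown in plain Ruby. The method names are invented for illustration; Ruby itself does not stack-allocate these today.

```ruby
# The Range in sum_range never leaves the method (it does not escape),
# so a JIT could in principle allocate it on the stack and skip GC;
# the one returned by make_range escapes to the caller and must stay
# on the heap.
def sum_range(a, b)
  r = (a..b)   # used only locally: a stack-allocation candidate
  r.sum
end

def make_range(a, b)
  (a..b)       # returned: escapes, must be heap-allocated
end

puts sum_range(1, 10)             # => 55
puts make_range(1, 3).to_a.inspect  # => [1, 2, 3]
```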

  67–72. Solution 5: Stack-based Object Allocation
    [Screenshots across slides 67–72, comparing VM and JIT]

  73. Ruby 2.7 JIT future work

  74. More Ruby code in core
    • Many parts of the Ruby implementation are written in C, which
      blocks optimizations like method inlining

    • @ko1 is proposing to write more of core in Ruby and to add more
      metadata to methods

    • Let's do it

  75. TracePoint support
    • Ruby 2.6's JIT simply stops when TracePoint is enabled

    • TracePoint is often enabled in development environments

    • e.g. web-console + bindex, zeitwerk
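TracePoint itself is easy to demonstrate, and the demo shows why it conflicts with JIT-ed code: once tracing is on, every Ruby-level call must be observable.

```ruby
# Record every Ruby-level :call event while the trace is enabled.
calls = []
trace = TracePoint.new(:call) { |tp| calls << tp.method_id }

def traced_method
  42
end

trace.enable { traced_method }
puts calls.inspect  # => [:traced_method]
```

JIT-ed machine code that omits frames or inlines calls would skip exactly these events, which is why Ruby 2.6 falls back to the VM instead.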

  76. Use GVL from MJIT worker
    • Ruby 2.6's JIT does not compile methods during blocking I/O or
      sleep

    • That would be the best time to do compilation!

  77. LLVM as an optional loader
    • The overhead might come from dlopen's way of loading

    • What if we:

    • Generate LLVM IR with Clang from MJIT's C code

    • Load it into an LLVM module and execute it

  78. Conclusion
    • At this stage, we're focusing on Rails speed with JIT after all
      compilations have finished

    • We're not there yet, but we're moving forward

    • JIT will let us stop caring about micro-optimizations in
      real-world code

    • No more Performance/* cops!