JIT on Rails / Rails Developers Meetup 2019

Rails Developers Meetup 2019
https://railsdm.github.io/

Takashi Kokubun

March 23, 2019

Transcript

  1. JIT on Rails
    Takashi Kokubun / RailsDM

  2. Self introduction
    • GitHub, Twitter: k0kubun
    • Arm Treasure Data
    • Ruby committer: JIT, ERB

  3. Treasure Data is hiring

  4. Agenda
    • Benchmark of JIT on Rails
    • What's slow in Ruby on Rails?
    • JIT optimization experiments for Rails

  5. Benchmark of JIT on Rails

  6. Why benchmark?
    • There's no silver bullet
    • Understand the patterns and use JIT only where it's suitable

  7. Why benchmark?
    • There's no silver bullet
    • Understand the patterns and use JIT only where it's suitable
    • Fact: Treasure Data isn't using --jit for Rails w/ Ruby 2.6
    JFYI: I'm enabling --jit on my personal servers with the latest Ruby

  8. Benchmark of non-Rails application
    Optcarrot

  9. Benchmark of non-Rails application
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu
    Ruby 2.6.0 w/ Optcarrot, speed in frames per second (fps):
    JIT off: 53.8 fps, JIT on: 86.6 fps (1.61x speedup)

  10. Benchmark of non-Rails application
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu
    Ruby 2.6.0 w/ Optcarrot, memory as max resident set size (MB):
    JIT off: 62.8 MB, JIT on: 63.7 MB (1.01x memory)

  11. Ruby benchmarks using Rails
    • Discourse
    • Railsbench

  12. Discourse

  13. Discourse
    • discourse/discourse: Popular forum application
    • DB seed + Apache Bench with script/bench.rb
    • noahgibbs/rails_ruby_bench: For AWS

  14. Discourse
    http://engineering.appfolio.com/appfolio-engineering/2019/1/4/how-fast-is-the-released-ruby-260
    Ruby 2.6.0: known to be quite slow, at least while JIT compilation is still happening

  15. Railsbench
    • headius/pgrailsbench: just a Rails scaffold (#show)
    • A good middle step between Optcarrot and Discourse
    • Used by the JRuby team to compare JRuby with CRuby
    • k0kubun/railsbench: Rails 5.2, DB seed

  16. Railsbench
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu
    Speed in requests per second (#/s), --enable=jit w/ Railsbench:
    Ruby 2.6.2: 670.5 (trunk shown on the next slide)
    k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench

  17. Railsbench
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu
    Speed in requests per second (#/s), --enable=jit w/ Railsbench:
    Ruby 2.6.2: 670.5, trunk: 861.2
    k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench

  18. Railsbench
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu
    Speed in requests per second (#/s) w/ Railsbench:
    --enable=jit:  Ruby 2.6.2: 670.5, trunk: 861.2
    --disable=jit: Ruby 2.6.2: 871.5, trunk: 909.3
    k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench

  19. Railsbench
    Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu
    Memory as max resident set size (MB) w/ Railsbench:
    --enable=jit:  Ruby 2.6.2: 107.2, trunk: 107.6
    --disable=jit: Ruby 2.6.2: 105.2, trunk: 106.5
    k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench

  20. What's slow in Ruby on Rails?

  21. Bottleneck on non-Rails (Optcarrot)
    Super-intensive calls to the same set of methods
    (using cache-friendly instance variables); a sketch of this pattern follows
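
A minimal sketch of the pattern above (the Emulator class is my own illustrative example, not code from Optcarrot): a tight loop keeps calling the same method on a receiver of a single class, so method lookup and instance-variable access stay cached.

```ruby
# Hot loop over one receiver class: the call sites only ever see Emulator,
# so the method and instance-variable caches keep hitting.
class Emulator
  def initialize
    @cycles = 0
  end

  def step
    @cycles += 1   # same receiver class every call: cache-friendly ivar access
  end
end

emu = Emulator.new
3_000_000.times { emu.step }
puts emu.instance_variable_get(:@cycles)   # => 3000000
```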

  22. Bottleneck on non-Rails (Optcarrot)
    --disable=jit

  23. Bottleneck on non-Rails (Optcarrot)
    --enable=jit

  24. Bottleneck on non-Rails (Optcarrot)
    --enable=jit
    --disable=jit

  25. Bottleneck on Railsbench
    Which Rails component?

  27. ActiveRecord#find 10%

  28. ActionView find* 5%

  29. Logger#info 17%
    (AR : #debug, 1.2%)

  30. ActiveRecord attributes 3%

  31. Route helper 12%

  32. link_to 4%

  33. Logger#info 17%
    (AR : #debug, 1.2%)

  34. csrf_meta_tags 5%

  35. stylesheet_link_tag 9%

  36. javascript_include_tag 7%

  37. What's slow in Ruby on Rails
    ActiveRecord#find
    ActionView find (template/layout)
    Logger#info
    ActiveRecord attrs
    Route helper
    link_to
    csrf_meta_tags
    stylesheet_link_tag
    javascript_include_tag
    Rails scaffold #show
    The bottleneck "Big Four"

  38. What's slow in "JIT" on Rails
    --disable=jit

  39. What's slow in "JIT" on Rails
    --enable=jit

  43. Instance variable's
    cache key: self's class

  44. Ruby Performance Pro Tips:
    Do not use inheritance
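
A minimal sketch of why slides 43-44 matter, assuming the cache behaviour stated on slide 43 (the Base/A/B classes are hypothetical examples): the instance-variable inline cache at a call site is keyed by the receiver's class, so a call site that sees many subclasses keeps missing that cache.

```ruby
# One reader method, one call site, but two receiver classes: because the
# ivar cache key is self's class, the cache churns between A and B.
class Base
  def initialize
    @value = 42
  end

  def value
    @value          # cached per receiver class
  end
end

class A < Base; end
class B < Base; end

pair = [A.new, B.new]
1_000_000.times { |i| pair[i % 2].value }   # alternating classes defeat the cache
```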

  45. What's slow in "JIT" on Rails
    --enable=jit
    --disable=jit

  46. What's slow in "JIT" on Rails
    --enable=jit
    --disable=jit

  47. What's slow in "JIT" on Rails --disable=jit

  48. What's slow in "JIT" on Rails --enable=jit

  49. What's slow in "JIT" on Rails
    icache.hit
    [Number of Instruction Cache, Streaming Buffer and Victim Cache Reads. both cacheable and
    noncacheable, including UC fetches]
    icache.ifdata_stall
    [Cycles where a code fetch is stalled due to L1 instruction-cache miss]
    icache.ifetch_stall
    [Cycles where a code fetch is stalled due to L1 instruction-cache miss]
    icache.misses
    [Number of Instruction Cache, Streaming Buffer and Victim Cache Misses. Includes Uncacheable accesses]

  50. How about Optcarrot? --disable=jit

  51. How about Optcarrot? --enable=jit

  52. What's slow in "JIT" on Rails
    VM 13.5%
    --disable=jit

  53. What's slow in "JIT" on Rails
    VM 13.5%
    7.5% (-6%)
    --enable=jit

  54. What's slow in "JIT" on Rails
    memory management, GC 9.3%

  55. What's slow in "JIT" on Rails
    Regexp 7.3%

  56. How about Optcarrot? --disable=jit

  57. How about Optcarrot? --enable=jit

  58. What's slow in "JIT" on Rails
    • The Rails implementation is not friendly to the VM's inline caches
    • Even when code is optimized, JIT calls lower instructions per cycle
    • A lot of territory is still uncharted: GC, Regexp
    • There is a large overhead while JIT compilation is still happening
    (most existing benchmark results are slowed down by this)

  59. JIT optimization
    experiments for Rails

  60. JIT optimizations for Rails
    1. Default parameter tuning
    2. Elimination of function pointer reference
    3. Discard code after deoptimization
    4. Multi-level method inlining
    5. Profile-Guided Optimization

  61. Optimization 1:
    Default parameter tuning
    • Ruby 2.7 default JIT parameter changes:
    • --jit-min-calls: 5 → 10,000
    • --jit-max-cache: 1,000 → 100
    https://github.com/ruby/ruby/commit/0fa4a6a618
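
A hedged usage sketch using only the MJIT flags named on this slide (the bin/bench invocation is illustrative): the tuned values can also be passed explicitly on Ruby 2.6.

```ruby
# Passing the Ruby 2.7 defaults explicitly on Ruby 2.6 (flags from the slide):
#
#   ruby --jit --jit-min-calls=10000 --jit-max-cache=100 bin/bench
#
# At runtime, confirm whether the interpreter was started with JIT enabled:
puts RubyVM::MJIT.enabled?   # => true only under --jit / --enable=jit
```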

  63. Attempt 2:
    Reduce function pointer reference
    • Before: "Ruby method setup" → "JIT code"
    • After: "Ruby method setup + JIT code"
    https://github.com/k0kubun/ruby/tree/mjit-cc-call

  64. Attempt 2:
    Reduce function pointer reference

  65. Attempt 3:
    Discard code after deoptimization
    • JIT sometimes falls back to VM execution
    (deoptimization)
    • Never reuse compiled code that tends to fall back
    https://github.com/k0kubun/ruby/tree/discard-code
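
A hedged illustration of the kind of event that forces JIT-ed code to fall back to the VM (the add method and the Integer#+ redefinition are my own example, not code from the branch): code compiled while a basic operation was unredefined has to be abandoned once that assumption breaks.

```ruby
# Code compiled with a fast integer-add assumption must fall back once
# Integer#+ is redefined.
def add(a, b)
  a + b                        # eligible for a fast path while + is unredefined
end

100_000.times { add(1, 2) }    # hot enough to be JIT compiled when run with --jit

class Integer
  def +(other)                 # redefining a basic operation breaks the assumption
    self - (-other)
  end
end

p add(1, 2)                    # still 3, but now served by the generic path
```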

  66. Attempt 3:
    Discard code after deoptimization

  67. Attempt 4:
    Multi-level method inlining
    • Previously, only a single level of send + yield was inlinable
    • I have now implemented recursive inlining, plus heuristics
    to decide when to inline (sketched below)
    https://github.com/k0kubun/ruby/tree/mjit-inline-multi
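
A hedged sketch of what "multi-level" means here (the Box class is a made-up example, not MJIT internals): with one level of inlining only area can be merged into describe, while recursive inlining can also fold width_px and height_px into the same compiled body.

```ruby
# A three-level call chain: describe -> area -> width_px / height_px.
class Box
  def initialize(w, h)
    @w = w
    @h = h
  end

  def width_px;  @w * 96; end
  def height_px; @h * 96; end

  def area
    width_px * height_px       # second-level calls
  end

  def describe
    "#{area} square pixels"    # first-level call
  end
end

puts Box.new(2, 3).describe    # => "55296 square pixels"
```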

  68. Attempt 4:
    Multi-level method inlining

  69. Attempt 5:
    Profile-Guided Optimization
    • Use GCC/Clang's -fprofile-generate and -fprofile-use
    • Unfortunately, profiling was heavy and the build system
    became complicated
    https://github.com/k0kubun/ruby/tree/pgo

  70. Attempt 5:
    Profile-Guided Optimization

  71. Future plans

  72. Future plans
    1. Reduce sync between threads
    2. Specialized JIT code dispatch
    3. Consider icache stall cost in heuristics
    4. "Real" method inlining
    5. Stack allocation of objects

  73. Future 1: Reduce sync between threads
    • Ruby 2.6's MJIT worker synchronizes with the main thread
    • This fixes a race condition around the inline cache
    • We could copy the inline cache on enqueue (with trade-offs)
    or use the GVL instead of the internal synchronization
    mechanism

  74. Future 2: Specialized JIT code dispatch
    • Current implementation:
    • Method dispatch → Call JIT if needed
    • Future options (other than Attempt 2):
    • Modify the ISeq to use an insn specialized for the JIT call
    • Integrate the JIT call at a point closer to insn dispatch

  75. Future 3:
    Consider icache stall cost in heuristics
    • Currently a method is always JIT-ed if its insn size is < 1000
    • Estimate the benefit with heuristics and skip JIT when it's low
    • Keep a threshold that avoids icache stalls

  76. Future 4: "Real" method inlining
    • The previous implementation didn't eliminate the call frame
    • Doing that is fast by itself, and may enable further
    optimizations in the compiler
    • Propagate type information to the callee for polymorphism
    • May need class-specific inline caches?

  77. Future 5: Stack allocation of objects
    • Ruby spends a lot of time on memory management in Rails
    • We need to solve escape analysis first (sketched below)
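
A hedged example of the kind of allocation escape analysis could prove removable (the Point struct and distance method are hypothetical; CRuby does not stack-allocate objects today, this slide lists it as a future plan):

```ruby
# The temporary Point never escapes distance, so in principle it could live
# on the stack instead of the GC heap.
Point = Struct.new(:x, :y)

def distance(x1, y1, x2, y2)
  pt = Point.new(x2 - x1, y2 - y1)   # never stored, returned, or captured
  Math.sqrt(pt.x * pt.x + pt.y * pt.y)
end

puts distance(0, 0, 3, 4)            # => 5.0
```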

  78. Conclusion
    • Ruby 2.7's JIT on Rails has started to get better, but most of my
    patches are still experimental (not merged yet)
    • Stay tuned for RubyKaigi and Ruby 2.7
