
JIT on Rails / Rails Developers Meetup 2019


Rails Developers Meetup 2019
https://railsdm.github.io/

Takashi Kokubun

March 23, 2019


Transcript

  1. Agenda • Benchmark of JIT on Rails • What's slow

    in Ruby on Rails? • JIT optimization experiments for Rails
  2. Why benchmark? • There's no silver bullet • Understand the patterns

     and use JIT only where it's suitable • Fact: Treasure Data isn't using --jit for Rails with Ruby 2.6 • JFYI: I'm enabling --jit on my personal servers with the latest Ruby
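Whether --jit is actually active can be checked from inside a process. A minimal sketch (hedged: the MJIT module existed under that name in Ruby 2.6–3.2 and was later renamed to RJIT, so both are probed):

```ruby
# Report whether the method JIT is enabled in the running interpreter.
# RubyVM::MJIT exists in Ruby 2.6-3.2; later versions ship RubyVM::RJIT.
jit_enabled =
  if defined?(RubyVM::MJIT)
    RubyVM::MJIT.enabled?
  elsif defined?(RubyVM::RJIT)
    RubyVM::RJIT.enabled?
  else
    false
  end
puts "JIT enabled: #{jit_enabled}"
```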
  3. Benchmark of a non-Rails application • Intel 4.0GHz i7-4790K (8 cores), 16GB memory, x86-64 Ubuntu

     • Ruby 2.6.0 w/ Optcarrot • Frames per second: 53.8 fps (JIT off) vs 86.6 fps (JIT on) • Speed: 1.61x
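The 1.61x figure is just the ratio of the two frame rates:

```ruby
# Speedup = fps with JIT on / fps with JIT off
speedup = (86.6 / 53.8).round(2)
puts speedup  # => 1.61
```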
  4. Benchmark of a non-Rails application • Intel 4.0GHz i7-4790K (8 cores), 16GB memory, x86-64 Ubuntu

     • Ruby 2.6.0 w/ Optcarrot • Max resident set size: 62.8 MB (JIT off) vs 63.7 MB (JIT on) • Memory: 1.01x
  5. Discourse • discourse/discourse: Popular forum application • DB seed +

    Apache Bench with script/bench.rb • noahgibbs/rails_ruby_bench: For AWS
  6. Railsbench • headius/pgrailsbench: just a Rails scaffold (#show) • Should be

     a step between Optcarrot and Discourse • Used by the JRuby team to compare JRuby with CRuby • k0kubun/railsbench: Rails 5.2, DB seed
  7.–9. Railsbench • Intel 4.0GHz i7-4790K (8 cores), 16GB memory, x86-64 Ubuntu

     • k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench • Speed, requests per second (#/s) • --enable=jit: 670.5 (2.6.2) vs 861.2 (trunk) • --disable=jit: 871.5 (2.6.2) vs 909.3 (trunk)
  10. Railsbench • Intel 4.0GHz i7-4790K (8 cores), 16GB memory, x86-64 Ubuntu

     • k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench • Memory, max resident set size (MB) • --enable=jit: 107.2 (2.6.2) vs 107.6 (trunk) • --disable=jit: 105.2 (2.6.2) vs 106.5 (trunk)
  11. Bottleneck on non-Rails (Optcarrot): extremely intensive calls to the

     same small set of methods (which use cache-friendly instance variables)
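A hypothetical sketch of that pattern (Pixel and luminance are illustrative, not Optcarrot's actual code): a tight loop hitting the same small method on one class, so the VM's inline method/ivar caches and any JIT-compiled code stay hot:

```ruby
class Pixel
  def initialize(r, g, b)
    @r, @g, @b = r, g, b          # same ivar layout on every instance
  end

  def luminance
    (@r * 299 + @g * 587 + @b * 114) / 1000  # monomorphic, cache-friendly
  end
end

pixels = Array.new(1_000) { Pixel.new(10, 20, 30) }
total = 0
pixels.each { |px| total += px.luminance }   # same method, same class, every call
puts total  # => 18000
```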
  12. What's slow in Ruby on Rails • ActiveRecord find • ActionView find

     template/layout • Logger info • ActiveRecord attrs • Route helper • link_to • csrf_meta_tags • stylesheet_link_tag • javascript_include_tag • The "big four" (四天王) bottlenecks of the Rails scaffold #show action
  13. What's slow in "JIT" on Rails • icache.hit [Number of Instruction

     Cache, Streaming Buffer and Victim Cache Reads; both cacheable and noncacheable, including UC fetches] • icache.ifdata_stall [Cycles where a code fetch is stalled due to an L1 instruction-cache miss] • icache.ifetch_stall [Cycles where a code fetch is stalled due to an L1 instruction-cache miss] • icache.misses [Number of Instruction Cache, Streaming Buffer and Victim Cache Misses; includes uncacheable accesses]
  14. What's slow in "JIT" on Rails • The Rails implementation is

     not friendly to the VM's inline caches • Even when code is optimized, instructions per cycle are lowered by the JIT call overhead • The uncharted territory is large: GC, Regexp • There is a large overhead while JIT compilation is still in progress (most existing benchmark results are slowed down by this)
  15. JIT optimizations for Rails 1. Default parameter tuning 2. Elimination

    of function pointer reference 3. Discard code after deoptimization 4. Multi-level method inlining 5. Profile-Guided Optimization
  16. Optimization 1: Default parameter tuning • Ruby 2.7 default JIT

     parameter changes • --jit-min-calls: 5 → 10,000 • --jit-max-cache: 1,000 → 100 https://github.com/ruby/ruby/commit/0fa4a6a618
  17. Attempt 2: Reduce function pointer references • Before: "Ruby method

     setup" → "JIT code" • After: "Ruby method setup + JIT code" https://github.com/k0kubun/ruby/tree/mjit-cc-call
  18. Attempt 3: Discard code after deoptimization • JIT sometimes falls back

     to VM execution (deoptimization) • Stop using code that tends to fall back https://github.com/k0kubun/ruby/tree/discard-code
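A minimal, hypothetical illustration of one deoptimization trigger: the JIT compiles methods assuming basic operations such as Integer#+ are not redefined, so redefining one invalidates that assumption and execution has to fall back to the VM:

```ruby
def hot(a, b)
  a + b                       # compiled assuming the built-in Integer#+
end

10_000.times { hot(1, 2) }    # enough calls to cross a JIT call threshold

class Integer
  alias_method :orig_plus, :+
  def +(other)                # redefining + breaks the JIT code's assumption;
    orig_plus(other)          # execution falls back to the interpreter
  end
end

puts hot(1, 2)                # => 3 (semantics preserved by the fallback)
```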
  19. Attempt 4: Multi-level method inlining • Previously only one level of send

     + yield was inlinable • Now I implemented recursive inlining, plus heuristics to decide when to inline https://github.com/k0kubun/ruby/tree/mjit-inline-multi
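An illustrative (made-up) call shape for this: with only one level of send + yield inlinable, level2 and level3 below remain real calls; recursive inlining can collapse the whole chain into level1's compiled body:

```ruby
def level3(x)
  x * 2
end

def level2(x)
  level3(x) + 1     # a second inlining level
end

def level1(x)
  yield level2(x)   # send + yield: the single previously inlinable level
end

puts level1(10) { |v| v - 1 }  # => 20
```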
  20. Attempt 5: Profile-Guided Optimization • Use GCC/Clang's -fprofile-generate and -fprofile-use

     • Unfortunately, profiling was heavy and the build system became complicated https://github.com/k0kubun/ruby/tree/pgo
  21. Future plans 1. Reduce sync between threads 2. Specialized JIT

    code dispatch 3. Consider icache stall cost in heuristics 4. "Real" method inlining 5. Stack allocation of objects
  22. Future 1: Reduce sync between threads • Ruby 2.6's MJIT

     worker synchronizes with the main thread • This synchronization fixes race conditions around inline caches • We could copy inline caches on enqueue (with trade-offs) or use the GVL instead of an internal synchronization mechanism
  23. Future 2: Specialized JIT code dispatch • Current implementation: •

     Method dispatch → call JIT code if needed • Future options (other than Attempt 2): • Modify the ISeq to use an instruction specialized for the JIT call • Integrate the JIT call somewhere closer to instruction dispatch
  24. Future 3: Consider icache stall cost in heuristics • Currently

     a method is always JIT-ed if its instruction size is < 1000 • Estimate the benefit with heuristics and skip JIT when it's low • Use a threshold to avoid icache stalls
  25. Future 4: "Real" method inlining • The previous implementation didn't eliminate

     the call frame • Eliminating it is itself fast, and may enable further compiler optimizations • Propagate type information to the callee for polymorphism • May need class-specific inline caches?
  26. Future 5: Stack allocation of objects • Ruby spends a lot of

     time on memory management in Rails • We need to implement escape analysis first
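A made-up sketch of what escape analysis would have to prove: the Point objects in distance never leave the method and could in principle live on the stack, while the one in remember escapes to a global and must stay on the GC heap:

```ruby
Point = Struct.new(:x, :y)

def distance(x1, y1, x2, y2)
  p1 = Point.new(x1, y1)   # never escapes: only read inside this frame
  p2 = Point.new(x2, y2)
  Math.sqrt((p1.x - p2.x)**2 + (p1.y - p2.y)**2)
end

SEEN = []
def remember(x, y)
  pt = Point.new(x, y)
  SEEN << pt               # escapes: stored outside the frame, must be heap-allocated
  pt
end

puts distance(0, 0, 3, 4)  # => 5.0
```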
  27. Conclusion • Ruby 2.7 on Rails has started to get better,

     but most of my patches are still experimental (not merged yet) • Stay tuned for RubyKaigi and Ruby 2.7