JIT on Rails / Rails Developers Meetup 2019

Rails Developers Meetup 2019
https://railsdm.github.io/

Takashi Kokubun

March 23, 2019

Transcript

  1. JIT on Rails Takashi Kokubun / RailsDM

  2. Self introduction • GitHub, Twitter: k0kubun • Arm Treasure Data • Ruby committer: JIT, ERB
  3. Treasure Data is hiring

  4. Agenda • Benchmark of JIT on Rails • What's slow in Ruby on Rails? • JIT optimization experiments for Rails
  5. Benchmark of JIT on Rails

  6. Why benchmark? • There's no silver bullet • Understand the patterns and use JIT only when it's suitable
  7. Why benchmark? • There's no silver bullet • Understand the patterns and use JIT only when it's suitable • Fact: Treasure Data isn't using --jit for Rails w/ Ruby 2.6 • JFYI: I'm enabling --jit on my personal servers with the latest Ruby
  8. Benchmark of non-Rails application Optcarrot

  9. Benchmark of non-Rails application • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu • Ruby 2.6.0 w/ Optcarrot • Frames Per Second (fps): JIT off 53.8, JIT on 86.6 • Speed: 1.61x
  10. Benchmark of non-Rails application • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu • Ruby 2.6.0 w/ Optcarrot • Max Resident Set Size (MB): JIT off 62.8, JIT on 63.7 • Memory: 1.01x
  11. Ruby benchmarks using Rails • Discourse • Railsbench

  12. Discourse

  13. Discourse • discourse/discourse: Popular forum application • DB seed + Apache Bench with script/bench.rb • noahgibbs/rails_ruby_bench: For AWS
  14. Discourse • http://engineering.appfolio.com/appfolio-engineering/2019/1/4/how-fast-is-the-released-ruby-260 • Ruby 2.6.0: Known to be very slow, at least while JIT compilation is still happening
  15. Railsbench • headius/pgrailsbench: Just rails scaffold (#show) • Should be a step between Optcarrot and Discourse • Used by JRuby team to compare it with CRuby • k0kubun/railsbench: Rails 5.2, DB seed
  16. Railsbench • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu • k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench • Speed, Request Per Second (#/s) with --enable=jit: 2.6.2 670.5, trunk not yet shown (0.0)
  17. Railsbench • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu • k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench • Speed, Request Per Second (#/s) with --enable=jit: 2.6.2 670.5, trunk 861.2
  18. Railsbench • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu • k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench • Speed, Request Per Second (#/s) with --enable=jit: 2.6.2 670.5, trunk 861.2 • Speed, Request Per Second (#/s) with --disable=jit: 2.6.2 871.5, trunk 909.3
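    The relative cost of JIT on Railsbench can be worked out from the chart values. The following is only a small arithmetic sketch: the `RPS` table and the `jit_ratio` helper are names made up for illustration, and the assignment of each number to a version is an assumption read off the chart labels.

```ruby
# Requests per second from the Railsbench slide.
# Assumption: the bar-to-label mapping below matches the original chart.
RPS = {
  "2.6.2" => { jit: 670.5, vm: 871.5 },
  "trunk" => { jit: 861.2, vm: 909.3 },
}.freeze

# Ratio of --enable=jit throughput to --disable=jit throughput,
# rounded to two decimal places.
def jit_ratio(result)
  (result[:jit] / result[:vm]).round(2)
end

RPS.each do |version, result|
  puts "#{version}: JIT runs at #{jit_ratio(result)}x the --disable=jit throughput"
end
```

    Under that mapping, JIT is roughly 0.77x of the VM-only throughput on 2.6.2 and roughly 0.95x on trunk, so the gap narrows without closing.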
  19. Railsbench • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu • k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench • Memory, Max Resident Set Size (MB) with --enable=jit: 2.6.2 107.2, trunk 107.6 • Memory, Max Resident Set Size (MB) with --disable=jit: 2.6.2 105.2, trunk 106.5
  20. What's slow in Ruby on Rails?

  21. Bottleneck on non-Rails (Optcarrot) • Super intensive calls to the same set of methods (using cache-friendly instance variables)
  22. Bottleneck on non-Rails (Optcarrot) --disable=jit

  23. Bottleneck on non-Rails (Optcarrot) --enable=jit

  24. Bottleneck on non-Rails (Optcarrot) --enable=jit --disable=jit

  25. Bottleneck on Railsbench Which Rails component?

  26. None
  27. ActiveRecord #find 10%

  28. ActionView find* 5%

  29. Logger#info 17% (AR : #debug, 1.2%)

  30. ActiveRecord attributes 3%

  31. Route helper 12%

  32. link_to 4%

  33. Logger#info 17% (AR : #debug, 1.2%)

  34. csrf_meta_tags 5%

  35. stylesheet_link_tag 9%

  36. javascript_include_tag 7%

  37. What's slow in Ruby on Rails • ActiveRecord#find • ActionView find (template/layout) • Logger#info • ActiveRecord attrs • Route helper • link_to • csrf_meta_tags • stylesheet_link_tag • javascript_include_tag • Rails scaffold #show bottlenecks
  38. What's slow in "JIT" on Rails --disable=jit

  39. What's slow in "JIT" on Rails --enable=jit

  40. None
  41. None
  42. None
  43. Instance variable's cache key: self's class

  44. Ruby Performance Pro Tips: Do not use inheritance
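    Why inheritance hurts here can be illustrated with a toy model. This is a minimal sketch, not CRuby's actual implementation: `ToyIvarCache`, `Base`, `Post`, and `Comment` are hypothetical names, and the model assumes a cache at one call site that remembers only the last receiver class it saw.

```ruby
# Toy model of an inline cache for instance-variable access at one call site.
# Assumption: the cache key is the receiver's class and only one entry is kept.
class ToyIvarCache
  attr_reader :hits, :misses

  def initialize
    @cached_class = nil
    @hits = 0
    @misses = 0
  end

  # Simulates one instance-variable read through this call site.
  def read(receiver, ivar)
    if receiver.class.equal?(@cached_class)
      @hits += 1
    else
      @misses += 1
      @cached_class = receiver.class # refill the cache for the new class
    end
    receiver.instance_variable_get(ivar)
  end
end

class Base
  def initialize(name)
    @name = name
  end
end

class Post < Base; end
class Comment < Base; end

# The same code path used with a single class keeps hitting the cache.
same_class = ToyIvarCache.new
4.times { same_class.read(Post.new("a"), :@name) }

# Shared code in a parent class sees alternating subclasses and keeps missing.
mixed_classes = ToyIvarCache.new
[Post, Comment, Post, Comment].each do |klass|
  mixed_classes.read(klass.new("a"), :@name)
end

puts "same class:    #{same_class.hits} hits, #{same_class.misses} misses"
puts "mixed classes: #{mixed_classes.hits} hits, #{mixed_classes.misses} misses"
```

    In this model a method inherited by many subclasses, as is common in Rails, invalidates the cache whenever the receiver class changes.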

  45. What's slow in "JIT" on Rails --enable=jit --disable=jit

  46. What's slow in "JIT" on Rails --enable=jit --disable=jit

  47. What's slow in "JIT" on Rails --disable=jit

  48. What's slow in "JIT" on Rails --enable=jit

  49. What's slow in "JIT" on Rails • icache.hit [Number of Instruction Cache, Streaming Buffer and Victim Cache Reads. both cacheable and noncacheable, including UC fetches] • icache.ifdata_stall [Cycles where a code fetch is stalled due to L1 instruction-cache miss] • icache.ifetch_stall [Cycles where a code fetch is stalled due to L1 instruction-cache miss] • icache.misses [Number of Instruction Cache, Streaming Buffer and Victim Cache Misses. Includes Uncacheable accesses]
  50. How about Optcarrot? --disable=jit

  51. How about Optcarrot? --enable=jit

  52. What's slow in "JIT" on Rails VM 13.5% --disable=jit

  53. What's slow in "JIT" on Rails • VM 13.5% → 7.5% (-6%) with --enable=jit
  54. What's slow in "JIT" on Rails memory management, GC 9.3%

  55. What's slow in "JIT" on Rails Regexp 7.3%

  56. How about Optcarrot? --disable=jit

  57. How about Optcarrot? --enable=jit

  58. What's slow in "JIT" on Rails • Rails implementation is not friendly to the VM's inline cache • Even when code is optimized, insn/cycle is lowered by JIT calls • Uncharted territory is large: GC, Regexp • There is a large overhead while JIT compilation is still happening (most existing benchmark results are slow because of this)
  59. JIT optimization experiments for Rails

  60. JIT optimizations for Rails • 1. Default parameter tuning • 2. Elimination of function pointer reference • 3. Discard code after deoptimization • 4. Multi-level method inlining • 5. Profile-Guided Optimization
  61. Optimization 1: Default parameter tuning • Ruby 2.7 default JIT parameter changes • --jit-min-calls: 5 → 10,000 • --jit-max-cache: 1,000 → 100 • https://github.com/ruby/ruby/commit/0fa4a6a618
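    To try the new defaults on an older Ruby, the same values can be passed explicitly. The sketch below only assembles the command line: `JIT_DEFAULTS` and `jit_flags` are illustrative names, and `bin/rails server` stands in for whatever command an application actually runs.

```ruby
# Default values quoted on the slide, keyed by Ruby version.
JIT_DEFAULTS = {
  "2.6" => { min_calls: 5,      max_cache: 1_000 },
  "2.7" => { min_calls: 10_000, max_cache: 100 },
}.freeze

# Builds the explicit flags equivalent to one version's defaults.
def jit_flags(version)
  defaults = JIT_DEFAULTS.fetch(version)
  [
    "--jit",
    "--jit-min-calls=#{defaults[:min_calls]}",
    "--jit-max-cache=#{defaults[:max_cache]}",
  ]
end

# Hypothetical invocation: run a Rails server with the 2.7-style values.
command = ["ruby", *jit_flags("2.7"), "bin/rails", "server"]
puts command.join(" ")
```

    A higher --jit-min-calls delays compilation until a method is clearly hot, and a lower --jit-max-cache limits how many methods are kept compiled at once.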
  62. None
  63. Attempt 2: Reduce function pointer reference • Before: "Ruby method setup" → "JIT code" • After: "Ruby method setup + JIT code" • https://github.com/k0kubun/ruby/tree/mjit-cc-call
  64. Attempt 2: Reduce function pointer reference

  65. Attempt 3: Discard code after deoptimization • JIT sometimes falls back to VM execution (deoptimization) • Never use code that tends to fall back • https://github.com/k0kubun/ruby/tree/discard-code
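    The idea of discarding code that deoptimizes can be pictured with a small bookkeeping model. This is a sketch under stated assumptions, not the code in the branch above: `ToyJitEntry`, `MAX_DEOPTS`, and the rule "discard after the first fallback" are all hypothetical.

```ruby
# Toy bookkeeping for one method's JIT-compiled code.
# Assumption: the code is dropped once it has fallen back to the VM
# MAX_DEOPTS times, and the method then always runs on the VM.
class ToyJitEntry
  MAX_DEOPTS = 1 # hypothetical threshold, not taken from the talk

  attr_reader :state

  def initialize
    @deopts = 0
    @state = :compiled
  end

  # Called whenever the JIT code bails out to the interpreter.
  def record_deopt
    @deopts += 1
    @state = :discarded if @deopts >= MAX_DEOPTS
  end

  # Which execution path the next call would take.
  def dispatch
    @state == :compiled ? :jit : :vm
  end
end

entry = ToyJitEntry.new
puts entry.dispatch # still runs JIT code
entry.record_deopt
puts entry.dispatch # runs on the VM from now on
```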
  66. Attempt 3: Discard code after deoptimization

  67. Attempt 4: Multi-level method inlining • Previously only 1-level send + yield was inlinable • I have now implemented recursive inlining, plus heuristics to decide when to inline • https://github.com/k0kubun/ruby/tree/mjit-inline-multi
  68. Attempt 4: Multi-level method inlining

  69. Attempt 5: Profile-Guided Optimization • Use GCC/Clang's -fprofile-generate and -fprofile-use • Unfortunately, profiling was heavy and the build system became complicated • https://github.com/k0kubun/ruby/tree/pgo
  70. Attempt 5: Profile-Guided Optimization

  71. Future plans

  72. Future plans • 1. Reduce sync between threads • 2. Specialized JIT code dispatch • 3. Consider icache stall cost in heuristics • 4. "Real" method inlining • 5. Stack allocation of objects
  73. Future 1: Reduce sync between threads • Ruby 2.6's MJIT worker synchronizes with the main thread • This fixes a race condition around the inline cache • We could copy the inline cache on enqueue (w/ trade-offs) or use the GVL instead of an internal synchronization mechanism
  74. Future 2: Specialized JIT code dispatch • Current implementation: method dispatch → call JIT if needed • Future options (other than Attempt 2): modify ISeq to use an insn specialized for JIT calls, or integrate the JIT call at a place closer to insn dispatch
  75. Future 3: Consider icache stall cost in heuristics • Currently a method is always JIT-ed if insn size < 1000 • Estimate the benefit with heuristics and skip JIT if it's low • Have some threshold to defeat icache stall
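    One possible shape for such a heuristic is sketched below. Only the size limit of 1000 comes from the slide: the benefit formula, the `MIN_CALLS_PER_INSN` threshold, and both method names are assumptions made up for illustration.

```ruby
MAX_INSN_SIZE = 1000      # size limit quoted on the slide
MIN_CALLS_PER_INSN = 20.0 # hypothetical threshold, not from the talk

# Hypothetical benefit estimate: how often a method runs relative to its size.
# Larger compiled code costs more instruction-cache space per call saved.
def estimated_benefit(call_count:, insn_size:)
  call_count.to_f / insn_size
end

# Decides whether a method should be handed to the JIT compiler.
def worth_jitting?(call_count:, insn_size:)
  return false if insn_size >= MAX_INSN_SIZE

  estimated_benefit(call_count: call_count, insn_size: insn_size) >= MIN_CALLS_PER_INSN
end

puts worth_jitting?(call_count: 10_000, insn_size: 50)    # small, hot method
puts worth_jitting?(call_count: 10_000, insn_size: 900)   # large method, same call count
puts worth_jitting?(call_count: 10_000, insn_size: 1_500) # over the size limit
```

    With these made-up numbers only the small, hot method is compiled; the 900-insn method passes the existing size check but is skipped by the benefit threshold.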
  76. Future 4: "Real" method inlining • Previous implementation didn't eliminate a call frame • Eliminating it is fast by itself, and may help further compiler optimizations • Propagate type information to callee for polymorphism • Needs to prepare class-specific inline cache?
  77. Future 5: Stack allocation of objects • Ruby spends a lot of time on memory management on Rails • We need to solve escape analysis first
  78. Conclusion • Ruby 2.7 on Rails has started to get better, but most of my patches are still experimental (not merged yet) • Stay tuned for RubyKaigi and Ruby 2.7