Why benchmark? • There's no silver bullet • Understand the patterns where JIT helps and use it only when it's suitable • Fact: Treasure Data isn't using --jit for Rails w/ Ruby 2.6 • JFYI: I'm enabling --jit on my personal servers with the latest Ruby
Discourse http://engineering.appfolio.com/appfolio-engineering/2019/1/4/how-fast-is-the-released-ruby-260 • Ruby 2.6.0 with JIT: known to be slow, at least while JIT compilation is still happening
Railsbench • headius/pgrailsbench: Just rails scaffold (#show) • Should be a step between Optcarrot and Discourse • Used by JRuby team to compare it with CRuby • k0kubun/railsbench: Rails 5.2, DB seed
What's slow in "JIT" on Rails • icache.hit [Number of Instruction Cache, Streaming Buffer and Victim Cache Reads. both cacheable and noncacheable, including UC fetches] • icache.ifdata_stall [Cycles where a code fetch is stalled due to L1 instruction-cache miss] • icache.ifetch_stall [Cycles where a code fetch is stalled due to L1 instruction-cache miss] • icache.misses [Number of Instruction Cache, Streaming Buffer and Victim Cache Misses. Includes Uncacheable accesses]
What's slow in "JIT" on Rails • Rails' implementation is not friendly to the VM's inline cache • Even when the code is optimized, calling JIT-ed code lowers insn/cycle • Uncharted territory is large: GC, Regexp • Overhead is large while JIT compilation is still happening (most existing benchmark results are slow because of this)
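To keep compilation-time overhead from dominating a benchmark result, early iterations can be reported separately from later ones. The following is only a minimal sketch of that idea: `measure_phases`, the iteration counts, and the workload are hypothetical and not taken from any of the benchmarks named in this talk.

```ruby
# Sketch: time early ("warmup") iterations separately from later ones, so that
# the cost paid while JIT compilation is still running is visible on its own.
# `measure_phases` and the iteration counts are hypothetical.
def measure_phases(warmup:, measured:)
  clock = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  t0 = clock.call
  warmup.times { yield }    # JIT compilation may still be in progress here
  t1 = clock.call
  measured.times { yield }  # closer to steady state
  t2 = clock.call

  {
    warmup_per_iter: (t1 - t0) / warmup,
    steady_per_iter: (t2 - t1) / measured,
  }
end

# Placeholder workload; a real run would issue a request to the app instead.
result = measure_phases(warmup: 1_000, measured: 1_000) do
  (1..100).map { |i| i.to_s }.join(",")
end

puts format("warmup: %.2f us/iter", result[:warmup_per_iter] * 1e6)
puts format("steady: %.2f us/iter", result[:steady_per_iter] * 1e6)
```

Running the same script with and without `--jit` and comparing the two phases shows whether a slowdown comes from compilation still happening or from the generated code itself.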
Attempt 3: Discard code after deoptimization • JIT-ed code sometimes falls back to VM execution (deoptimization) • Never use code that tends to fall back https://github.com/k0kubun/ruby/tree/discard-code
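The idea of discarding code that keeps falling back can be modeled in plain Ruby. This is a toy model, not the code on the discard-code branch: `JitEntry`, `MAX_DEOPTS`, and the `:deopt` return value are all assumptions made for illustration.

```ruby
# Toy model of "discard code after deoptimization". Every name is hypothetical.
class JitEntry
  MAX_DEOPTS = 3 # assumed threshold; the real heuristic may differ

  attr_reader :deopt_count

  def initialize(compiled)
    @compiled = compiled # a lambda standing in for native code
    @deopt_count = 0
    @discarded = false
  end

  def discarded?
    @discarded
  end

  # Run the "compiled" code; if its guard fails, fall back to the interpreter.
  def call(arg, interpreter)
    return interpreter.call(arg) if @discarded

    result = @compiled.call(arg)
    return result unless result == :deopt

    @deopt_count += 1
    @discarded = true if @deopt_count >= MAX_DEOPTS
    interpreter.call(arg)
  end
end

interpreter = ->(x) { x.to_s }
# "Compiled" code specialized for Integer; any other class fails the guard.
compiled = ->(x) { x.is_a?(Integer) ? x.to_s : :deopt }

entry = JitEntry.new(compiled)
results = [1, :a, 2.0, nil, 5].map { |v| entry.call(v, interpreter) }

p results            # every call still returns the correct value
p entry.discarded?   # true after three failed guards
```

After the third failed guard the entry stops trying the specialized path at all, so later calls skip the cost of entering and leaving the compiled code.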
Attempt 4: Multi-level method inlining • Previously only 1-level send + yield was inlinable • Now I have implemented recursive inlining, with heuristics to decide when to inline https://github.com/k0kubun/ruby/tree/mjit-inline-multi
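A rough picture of what "recursive inlining plus heuristics" can look like is a walk over the call graph with a depth limit and a size budget. The sketch below is an assumption about the general shape of such a heuristic, not the logic on the mjit-inline-multi branch; `MethodInfo`, `MAX_DEPTH`, `SIZE_BUDGET`, and the example methods are hypothetical.

```ruby
# Hypothetical model: a method is described by its insn size and its callees.
MethodInfo = Struct.new(:name, :insn_size, :callees)

MAX_DEPTH = 2     # assumed limit on inlining depth
SIZE_BUDGET = 50  # assumed limit on the total number of inlined insns

# Returns [depth, name] pairs for every callee that would be inlined.
def inline_plan(method, methods, depth: 1, budget: { left: SIZE_BUDGET }, plan: [])
  return plan if depth > MAX_DEPTH

  method.callees.each do |name|
    callee = methods.fetch(name)
    next if callee.insn_size > budget[:left] # too large for the remaining budget

    budget[:left] -= callee.insn_size
    plan << [depth, name]
    inline_plan(callee, methods, depth: depth + 1, budget: budget, plan: plan)
  end
  plan
end

methods = {
  index:       MethodInfo.new(:index, 30, [:find_all, :render]),
  find_all:    MethodInfo.new(:find_all, 20, [:build_query]),
  build_query: MethodInfo.new(:build_query, 15, [:quote]),
  quote:       MethodInfo.new(:quote, 10, []),
  render:      MethodInfo.new(:render, 40, []),
}

plan = inline_plan(methods[:index], methods)
p plan
```

In this example `quote` is rejected by the depth limit and `render` by the size budget, which are the two kinds of decision such a heuristic has to make.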
Attempt 5: Profile-Guided Optimization • Use GCC/Clang's -fprofile-generate and -fprofile-use • Unfortunately, profiling was heavy and the build system became complicated https://github.com/k0kubun/ruby/tree/pgo
Future 1: Reduce sync between threads • Ruby 2.6's MJIT worker synchronizes with the main thread • This fixes a race condition around the inline cache • We could copy the inline cache on enqueue (w/ trade-offs) or use the GVL instead of an internal synchronization mechanism
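The "copy on enqueue" option and its trade-off can be illustrated with Ruby threads. This is only an analogy written in Ruby: the real inline cache lives in C inside the VM, and the hash layout and names below are hypothetical.

```ruby
# Hypothetical stand-in for a method's inline cache.
inline_cache = { class_serial: 1, method_entry: :to_s_v1 }

queue = Queue.new

# Copy on enqueue: the worker receives its own frozen snapshot, so it never
# reads the live cache and needs no synchronization with the main thread.
queue << { iseq: :example_method, cache: inline_cache.dup.freeze }

# The main thread keeps running and updates the live cache afterwards.
inline_cache[:class_serial] = 2

worker = Thread.new do
  job = queue.pop
  job[:cache][:class_serial] # the worker compiles against the snapshot
end

seen = worker.value
p seen                        # the value at enqueue time
p inline_cache[:class_serial] # the live value has moved on
```

The trade-off is visible here: the worker is race-free, but it may compile against cache contents that are already stale.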
Future 2: Specialized JIT code dispatch • Current implementation: Method dispatch → Call JIT if needed • Future options (other than Attempt 2): modify ISeq to use an insn specialized for JIT calls, or integrate the JIT call at a place closer to insn dispatch
Future 3: Consider icache stall cost in heuristics • Currently a method is always JIT-ed if its insn size < 1000 • Estimate the benefit with heuristics and skip JIT if it's low • Have some threshold to offset the icache stall cost
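One way to picture such a heuristic is a size check combined with an estimated benefit that has to exceed a fixed penalty. Only the 1000-insn limit comes from the slide; the benefit model, its constants, and the names `Candidate`, `estimated_benefit`, and `jit?` are made up for illustration.

```ruby
MAX_INSN_SIZE = 1000 # the existing size limit mentioned on the slide

# Hypothetical description of a JIT candidate.
Candidate = Struct.new(:name, :insn_size, :calls, keyword_init: true)

# Made-up benefit model: cycles assumed to be saved per executed insn, minus a
# fixed penalty standing in for the icache stall cost of calling JIT-ed code.
def estimated_benefit(candidate, saved_per_insn: 0.5, stall_penalty: 2_000)
  candidate.calls * candidate.insn_size * saved_per_insn - stall_penalty
end

def jit?(candidate)
  candidate.insn_size < MAX_INSN_SIZE && estimated_benefit(candidate) > 0
end

hot_small  = Candidate.new(name: :hot_small,  insn_size: 50,    calls: 10_000)
cold_small = Candidate.new(name: :cold_small, insn_size: 50,    calls: 5)
hot_huge   = Candidate.new(name: :hot_huge,   insn_size: 1_500, calls: 10_000)

[hot_small, cold_small, hot_huge].each do |c|
  puts "#{c.name}: #{jit?(c) ? 'JIT' : 'skip'}"
end
```

With these assumed numbers only the hot, small method is compiled; the rarely called one is skipped because its estimated benefit does not cover the penalty.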
Future 4: "Real" method inlining • The previous implementation didn't eliminate the call frame • Eliminating it is fast in itself, and may help compilers optimize further • Propagate type information to the callee for polymorphism • Needs a class-specific inline cache?
Conclusion • Ruby 2.7 on Rails has started to get better, but most of my patches are still experimental (not merged yet) • Stay tuned for RubyKaigi and Ruby 2.7