JIT on Rails / Rails Developers Meetup 2019

JIT on Rails Takashi Kokubun / RailsDM

Self introduction • GitHub, Twitter: k0kubun • Arm Treasure Data
• Ruby committer: JIT, ERB

Treasure Data is hiring

Agenda • Benchmark of JIT on Rails • What's slow
in Ruby on Rails? • JIT optimization experiments for Rails

Benchmark of JIT on Rails

Why benchmark? • There's no silver bullet • Understand the patterns and use JIT only where it's suitable • Fact: Treasure Data isn't using --jit for Rails w/ Ruby 2.6 • JFYI: I'm enabling --jit on my personal servers with the latest Ruby

Benchmark of non-Rails application Optcarrot

Benchmark of non-Rails application: Speed • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu • Ruby 2.6.0 w/ Optcarrot • Frames per second (fps): JIT off 53.8, JIT on 86.6 (1.61x)

Benchmark of non-Rails application: Memory • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu • Ruby 2.6.0 w/ Optcarrot • Max resident set size (MB): JIT off 62.8, JIT on 63.7 (1.01x)

Ruby benchmarks using Rails • Discourse • Railsbench

Discourse

Discourse • discourse/discourse: Popular forum application • DB seed +
Apache Bench with script/bench.rb • noahgibbs/rails_ruby_bench: For AWS

Discourse • Ruby 2.6.0 with JIT is known to be slow, at least while JIT compilation is still happening http://engineering.appfolio.com/appfolio-engineering/2019/1/4/how-fast-is-the-released-ruby-260

Railsbench • headius/pgrailsbench: Just rails scaffold (#show) • Should be
a step between Optcarrot and Discourse • Used by JRuby team to compare it with CRuby • k0kubun/railsbench: Rails 5.2, DB seed

Railsbench: Speed • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu • k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench • Requests per second (#/s) w/ --enable=jit: 2.6.2 670.5, trunk 861.2 • Requests per second (#/s) w/ --disable=jit: 2.6.2 871.5, trunk 909.3

Railsbench: Memory • k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench • Max resident set size (MB) w/ --enable=jit: 2.6.2 107.2, trunk 107.6 • Max resident set size (MB) w/ --disable=jit: 2.6.2 105.2, trunk 106.5

What's slow in Ruby on Rails?

Bottleneck on non-Rails (Optcarrot) • Very frequent calls to the same small set of methods (using cache-friendly instance variables)

Bottleneck on non-Rails (Optcarrot) --disable=jit

Bottleneck on non-Rails (Optcarrot) --enable=jit

Bottleneck on non-Rails (Optcarrot) --enable=jit --disable=jit

Bottleneck on Railsbench Which Rails component?

ActiveRecord #find 10%

ActionView find* 5%

Logger#info 17% (AR: #debug, 1.2%)

ActiveRecord attributes 3%

Route helper 12%

link_to 4%

csrf_meta_tags 5%

stylesheet_link_tag 9%

javascript_include_tag 7%

What's slow in Ruby on Rails • ActiveRecord find • ActionView find (template, layout) • Logger info • ActiveRecord attrs • Route helper • link_to • csrf_meta_tags • stylesheet_link_tag • javascript_include_tag • Rails scaffold #show bottleneck "Big Four"

What's slow in "JIT" on Rails --disable=jit

What's slow in "JIT" on Rails --enable=jit

Instance variable's cache key: self's class

Ruby Performance Pro Tips: Do not use inheritance
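
The two slides above can be illustrated with a small sketch. It assumes the VM behaviour described in the talk, where the inline cache for an instance variable access is keyed by the class of self, so one inherited method reached through receivers of different subclasses keeps seeing a different cache key. The class names are hypothetical, and the script only shows the access pattern; it does not measure cache hits.

```ruby
# Hypothetical classes: one inherited method whose instance variable
# access is reached through receivers of different classes.
class Base
  def initialize(value)
    @value = value
  end

  # Single access site for @value. With a cache keyed by self's class,
  # alternating receivers of different subclasses keep changing the key.
  def value
    @value
  end
end

class Post < Base; end
class Comment < Base; end

# Monomorphic: the receiver's class is always Post.
mono = Array.new(4) { |i| Post.new(i) }

# Polymorphic: the receiver's class alternates between Post and Comment.
poly = Array.new(4) { |i| (i.even? ? Post : Comment).new(i) }

MONO_CLASSES = mono.map(&:class).uniq
POLY_CLASSES = poly.map(&:class).uniq
MONO_SUM = mono.sum(&:value)
POLY_SUM = poly.sum(&:value)

puts "monomorphic receivers: #{MONO_CLASSES.inspect}"
puts "polymorphic receivers: #{POLY_CLASSES.inspect}"
```

Both loops compute the same sum; only the sequence of receiver classes seen by Base#value differs, which is the part the cache key depends on.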

What's slow in "JIT" on Rails --enable=jit --disable=jit

What's slow in "JIT" on Rails --disable=jit

What's slow in "JIT" on Rails --enable=jit

What's slow in "JIT" on Rails • icache.hit [Number of Instruction Cache, Streaming Buffer and Victim Cache Reads. both cacheable and noncacheable, including UC fetches] • icache.ifdata_stall [Cycles where a code fetch is stalled due to L1 instruction-cache miss] • icache.ifetch_stall [Cycles where a code fetch is stalled due to L1 instruction-cache miss] • icache.misses [Number of Instruction Cache, Streaming Buffer and Victim Cache Misses. Includes Uncacheable accesses]

How about Optcarrot? --disable=jit

How about Optcarrot? --enable=jit

What's slow in "JIT" on Rails • VM: 13.5% (--disable=jit)

What's slow in "JIT" on Rails • VM: 13.5% → 7.5% (-6%) (--enable=jit)

What's slow in "JIT" on Rails memory management, GC 9.3%

What's slow in "JIT" on Rails Regexp 7.3%

How about Optcarrot? --disable=jit

How about Optcarrot? --enable=jit

What's slow in "JIT" on Rails • The Rails implementation is not friendly to the VM's inline cache • Even when code is optimized, insn/cycle gets lower because of the JIT call • The uncharted territory is large: GC, Regexp • Overhead is large while JIT compilation is still happening (most existing benchmark results are slow because of this)

JIT optimization experiments for Rails

JIT optimizations for Rails 1. Default parameter tuning 2. Elimination of function pointer reference 3. Discard code after deoptimization 4. Multi-level method inlining 5. Profile-Guided Optimization

Optimization 1: Default parameter tuning • Ruby 2.7 default JIT parameter changes • --jit-min-calls: 5 → 10,000 • --jit-max-cache: 1,000 → 100 https://github.com/ruby/ruby/commit/0fa4a6a618
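
To get a feel for what the higher --jit-min-calls threshold means, here is a toy model in Ruby. It assumes only what the option name suggests, that a method becomes a JIT candidate once its call count reaches the threshold. The method names and call counts are made up, and this is not how MJIT is implemented.

```ruby
# Toy model (not MJIT's implementation): a method becomes a JIT candidate
# once its call count reaches the --jit-min-calls threshold.
def jit_candidates(call_counts, min_calls:)
  call_counts.select { |_name, calls| calls >= min_calls }.keys
end

# Hypothetical call counts collected while serving requests.
CALL_COUNTS = {
  "PostsController#show" => 40_000,
  "Post#title"           => 120_000,
  "ApplicationHelper#x"  => 800,
  "Initializer#run"      => 5,
}.freeze

OLD_DEFAULT = jit_candidates(CALL_COUNTS, min_calls: 5)       # old default
NEW_DEFAULT = jit_candidates(CALL_COUNTS, min_calls: 10_000)  # Ruby 2.7 default

puts "min_calls=5:     #{OLD_DEFAULT.size} methods would be compiled"
puts "min_calls=10000: #{NEW_DEFAULT.size} methods would be compiled"
```

In this model the higher threshold leaves rarely called methods to the VM, so fewer compilations compete with request handling.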

Attempt 2: Reduce function pointer reference • Before: "Ruby method
setup" → "JIT code" • After: "Ruby method setup + JIT code" https://github.com/k0kubun/ruby/tree/mjit-cc-call

Attempt 2: Reduce function pointer reference

Attempt 3: Discard code after deoptimization • JIT-ed code sometimes falls back to VM execution (deoptimization) • Never use code that tends to fall back https://github.com/k0kubun/ruby/tree/discard-code
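
The idea can be sketched as a toy model in plain Ruby. Everything here is hypothetical (the class name, the guard, the max_fallbacks parameter); it only mimics the policy of dropping speculative code once it has fallen back, and says nothing about how the linked branch implements it.

```ruby
# Toy model (not MJIT's implementation): speculative code guarded by an
# assumption, discarded once it has fallen back too often.
class SpeculativeCode
  attr_reader :fallbacks

  def initialize(guard:, fast:, slow:, max_fallbacks: 1)
    @guard = guard          # assumption the fast path relies on
    @fast = fast            # stands in for JIT-ed code
    @slow = slow            # stands in for VM execution
    @max_fallbacks = max_fallbacks
    @fallbacks = 0
  end

  def discarded?
    @fast.nil?
  end

  def call(arg)
    return @slow.call(arg) if discarded?
    return @fast.call(arg) if @guard.call(arg)

    # Guard failed: "deoptimize" to the slow path and remember it.
    @fallbacks += 1
    @fast = nil if @fallbacks >= @max_fallbacks
    @slow.call(arg)
  end
end

DOUBLE = SpeculativeCode.new(
  guard: ->(x) { x.is_a?(Integer) },  # speculate on Integer arguments
  fast:  ->(x) { x << 1 },
  slow:  ->(x) { x * 2 }
)

RESULTS = [DOUBLE.call(21), DOUBLE.call(1.5), DOUBLE.call(4)]
puts RESULTS.inspect
puts "discarded: #{DOUBLE.discarded?}"
```

After the Float argument breaks the guard once, later calls skip the guard check entirely and go straight to the slow path.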

Attempt 3: Discard code after deoptimization

Attempt 4: Multi-level method inlining • Previously only 1-level send + yield was inlinable • Now I have implemented recursive inlining, plus heuristics to decide when to inline https://github.com/k0kubun/ruby/tree/mjit-inline-multi
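
As a rough illustration of what an inliner would see, the hypothetical class below has a call chain several levels deep plus a block. With a 1-level limit only the outermost send or the yield would be a candidate; recursive inlining could in principle also cover dollars and cents. Whether any of these is actually inlined depends on heuristics not shown here.

```ruby
# Hypothetical methods: a chain of small calls three levels deep plus a block.
class Price
  def initialize(cents)
    @cents = cents
  end

  # Level 3: leaf method, just reads an instance variable.
  def cents
    @cents
  end

  # Level 2: calls cents.
  def dollars
    cents / 100.0
  end

  # Level 1: calls dollars and yields the result to a block.
  def with_tax(rate)
    yield dollars * (1 + rate)
  end
end

TOTAL = Price.new(1_000).with_tax(0.5) { |amount| amount.round(2) }
puts TOTAL
```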

Attempt 4: Multi-level method inlining

Attempt 5: Profile-Guided Optimization • Use GCC/Clang's -fprofile-generate and -fprofile-use • Unfortunately, profiling was heavy and the build system became complicated https://github.com/k0kubun/ruby/tree/pgo

Attempt 5: Profile-Guided Optimization

Future plans

Future plans 1. Reduce sync between threads 2. Specialized JIT
code dispatch 3. Consider icache stall cost in heuristics 4. "Real" method inlining 5. Stack allocation of objects

Future 1: Reduce sync between threads • Ruby 2.6's MJIT worker synchronizes with the main thread • This fixes a race condition around the inline cache • We could copy the inline cache on enqueue (w/ trade-offs) or use the GVL instead of an internal synchronization mechanism

Future 2: Specialized JIT code dispatch • Current implementation: •
Method dispatch → Call JIT if needed • Future options (other than Attempt 2): • Modify ISeq to use insn specialized for JIT call • Integrate JIT call with a place closer to insn dispatch

Future 3: Consider icache stall cost in heuristics • Currently a method is always JIT-ed if its insn size < 1000 • Estimate the benefit by heuristics and skip JIT if it's low • Have some threshold to defeat icache stalls
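
For a rough sense of how large ordinary methods are compared with that threshold, the sketch below estimates a method's instruction size on CRuby. It assumes that the last element of RubyVM::InstructionSequence#to_a is the instruction body and that each instruction appears there as an Array of its name and operands; the helper name insn_size and the two sample methods are hypothetical, and the number may not match what MJIT counts internally.

```ruby
# Rough estimate of a method's instruction size on CRuby.
# Assumption: the last element of InstructionSequence#to_a is the body,
# where each instruction is an Array of [name, *operands].
def insn_size(callable)
  body = RubyVM::InstructionSequence.of(callable).to_a.last
  body.grep(Array).sum(&:size)
end

def tiny
  1
end

def larger(list)
  list.map { |x| x * 2 }.select(&:even?).reduce(0) { |sum, x| sum + x }
end

TINY_SIZE = insn_size(method(:tiny))
LARGER_SIZE = insn_size(method(:larger))

puts "tiny:   #{TINY_SIZE}"
puts "larger: #{LARGER_SIZE}"
puts "both under 1000: #{[TINY_SIZE, LARGER_SIZE].all? { |s| s < 1000 }}"
```

Blocks are separate instruction sequences, so this counts only the method body itself.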

Future 4: "Real" method inlining • The previous implementation didn't eliminate the call frame • Eliminating it is fast by itself, and may help the compiler optimize further • Propagate type information to the callee for polymorphism • Needs to prepare a class-specific inline cache?

Future 5: Stack allocation of objects • Ruby spends significant time on memory management on Rails • We need to solve escape analysis first
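
The distinction escape analysis has to draw can be sketched with two hypothetical methods: one whose object never leaves the method, and one whose object is stored somewhere that outlives the call. The allocation counter uses GC.stat(:total_allocated_objects) on CRuby; all names are made up for illustration.

```ruby
# Hypothetical examples of an object that stays local vs. one that escapes.
Point = Struct.new(:x, :y)

# The Point never leaves this method: only its fields are read, so in
# principle it would not need a heap allocation.
def distance_squared(x, y)
  point = Point.new(x, y)
  point.x * point.x + point.y * point.y
end

REGISTRY = []

# The Point is stored in a long-lived array, so it escapes the method
# and must outlive the call.
def remember(x, y)
  point = Point.new(x, y)
  REGISTRY << point
  point
end

# Count objects allocated while running a block on CRuby.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

LOCAL_ALLOCS = allocations { 1_000.times { |i| distance_squared(i, i) } }
puts "allocations for non-escaping Points: #{LOCAL_ALLOCS}"
```

Today each call to distance_squared still allocates a Point on the heap, which is the cost stack allocation would aim to remove for the non-escaping case.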

Conclusion • Ruby 2.7 on Rails has started to get better, but most of my patches are still experimental (not merged yet) • Stay tuned for RubyKaigi and Ruby 2.7
