Slide 1

Slide 1 text

JIT on Rails Takashi Kokubun / RailsDM

Slide 2

Slide 2 text

Self introduction • GitHub, Twitter: k0kubun • Arm Treasure Data • Ruby committer: JIT, ERB

Slide 3

Slide 3 text

Treasure Data is hiring

Slide 4

Slide 4 text

Agenda • Benchmark of JIT on Rails • What's slow in Ruby on Rails? • JIT optimization experiments for Rails

Slide 5

Slide 5 text

Benchmark of JIT on Rails

Slide 6

Slide 6 text

Why benchmark? • There's no silver bullet • Understand patterns and use it only when it's suitable

Slide 7

Slide 7 text

Why benchmark? • There's no silver bullet • Understand patterns and use it only when it's suitable • Fact: Treasure Data isn't using --jit for Rails w/ Ruby 2.6 • JFYI: I'm enabling --jit on my personal servers with the latest Ruby

Slide 8

Slide 8 text

Benchmark of non-Rails application Optcarrot

Slide 9

Slide 9 text

Benchmark of non-Rails application • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu • Ruby 2.6.0 w/ Optcarrot • Frames Per Second (fps): JIT off 53.8, JIT on 86.6 • Speed 1.61x

Slide 10

Slide 10 text

Benchmark of non-Rails application • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu • Ruby 2.6.0 w/ Optcarrot • Max Resident Set Size (MB): JIT off 62.8, JIT on 63.7 • Memory 1.01x

Slide 11

Slide 11 text

Ruby benchmarks using Rails • Discourse • Railsbench

Slide 12

Slide 12 text

Discourse

Slide 13

Slide 13 text

Discourse • discourse/discourse: Popular forum application • DB seed + Apache Bench with script/bench.rb • noahgibbs/rails_ruby_bench: For AWS

Slide 14

Slide 14 text

Discourse http://engineering.appfolio.com/appfolio-engineering/2019/1/4/how-fast-is-the-released-ruby-260 Ruby 2.6.0: Known to be very slow, at least while JIT compilation is still in progress

Slide 15

Slide 15 text

Railsbench • headius/pgrailsbench: Just rails scaffold (#show) • Should be a step between Optcarrot and Discourse • Used by JRuby team to compare it with CRuby • k0kubun/railsbench: Rails 5.2, DB seed

Slide 16

Slide 16 text

Railsbench • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu • k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench • Speed, --enable=jit w/ Railsbench, Request Per Second (#/s): 2.6.2 670.5 (trunk bar not yet shown)

Slide 17

Slide 17 text

Railsbench • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu • k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench • Speed, --enable=jit w/ Railsbench, Request Per Second (#/s): 2.6.2 670.5, trunk 861.2

Slide 18

Slide 18 text

Railsbench • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu • k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench • Speed, --enable=jit w/ Railsbench, Request Per Second (#/s): 2.6.2 670.5, trunk 861.2 • Speed, --disable=jit w/ Railsbench, Request Per Second (#/s): 2.6.2 871.5, trunk 909.3
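
One way to read the two charts together is to compute, for each Ruby version, the ratio of JIT-on to JIT-off throughput. The sketch below assumes the bar values map to the labels as 2.6.2 = 670.5 (JIT on) / 871.5 (JIT off) and trunk = 861.2 (JIT on) / 909.3 (JIT off); the constant and method names are illustrative and not part of railsbench.

```ruby
# Requests per second read off the Railsbench charts.
# Assumption: values are assigned to versions as stated in the lead-in.
RPS = {
  "2.6.2" => { jit: 670.5, vm: 871.5 },
  "trunk" => { jit: 861.2, vm: 909.3 },
}.freeze

# Ratio of JIT-on throughput to JIT-off throughput.
# A value below 1.0 means the JIT run served fewer requests per second.
def jit_ratio(version)
  entry = RPS.fetch(version)
  (entry[:jit] / entry[:vm]).round(3)
end

RPS.each_key do |version|
  puts format("%-6s JIT/VM = %.3f", version, jit_ratio(version))
end
```

Under that reading, JIT on trunk is roughly 5% slower than the plain VM, compared with roughly 23% slower on 2.6.2.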

Slide 19

Slide 19 text

Railsbench • Intel 4.0GHz i7-4790K 8 cores, memory 16GB, x86-64 Ubuntu • k0kubun/railsbench: WARMUP=30000 BENCHMARK=10000 bin/bench • Memory, --enable=jit w/ Railsbench, Max Resident Set Size (MB): 2.6.2 107.2, trunk 107.6 • Memory, --disable=jit w/ Railsbench, Max Resident Set Size (MB): 2.6.2 105.2, trunk 106.5

Slide 20

Slide 20 text

What's slow in Ruby on Rails?

Slide 21

Slide 21 text

Bottleneck on non-Rails (Optcarrot) super intensive calls to the same set of methods (using cache-friendly instance variables)

Slide 22

Slide 22 text

Bottleneck on non-Rails (Optcarrot) --disable=jit

Slide 23

Slide 23 text

Bottleneck on non-Rails (Optcarrot) --enable=jit

Slide 24

Slide 24 text

Bottleneck on non-Rails (Optcarrot) --enable=jit --disable=jit

Slide 25

Slide 25 text

Bottleneck on Railsbench Which Rails component?

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

ActiveRecord #find 10%

Slide 28

Slide 28 text

ActionView find* 5%

Slide 29

Slide 29 text

Logger#info 17% (AR : #debug, 1.2%)

Slide 30

Slide 30 text

ActiveRecord attributes 3%

Slide 31

Slide 31 text

Route helper 12%

Slide 32

Slide 32 text

link_to 4%

Slide 33

Slide 33 text

Logger#info 17% (AR : #debug, 1.2%)

Slide 34

Slide 34 text

csrf_meta_tags 5%

Slide 35

Slide 35 text

stylesheet_link_tag 9%

Slide 36

Slide 36 text

javascript_include_tag 7%

Slide 37

Slide 37 text

What's slow in Ruby on Rails • ActiveRecord#find • ActionView find (template, layout) • Logger#info • ActiveRecord attrs • Route helper • link_to • csrf_meta_tags • stylesheet_link_tag • javascript_include_tag • Rails scaffold #show • Bottleneck "Big Four"

Slide 38

Slide 38 text

What's slow in "JIT" on Rails --disable=jit

Slide 39

Slide 39 text

What's slow in "JIT" on Rails --enable=jit

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

Instance variable's cache key: self's class

Slide 44

Slide 44 text

Ruby Performance Pro Tips: Do not use inheritance
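
The tip above follows from the previous slide: if the inline cache for an instance variable is keyed by the class of self, then a method defined once in a base class and called on instances of several subclasses keeps seeing different keys at the same access site. A minimal sketch of that code shape, with hypothetical class names; it only illustrates the pattern and does not measure cache misses:

```ruby
# A reader defined once in a base class but invoked on many subclasses.
# Assumption (taken from the slide): the inline cache for @name is keyed by
# the receiver's class, so each change of class at this site is a miss.
class Base
  def initialize(name)
    @name = name
  end

  def name
    @name # a single access site shared by every subclass
  end
end

class Admin < Base; end
class Guest < Base; end

# Alternating receivers: the class seen at the @name site keeps changing.
mixed = [Admin.new("a"), Guest.new("g")] * 3
mixed_names = mixed.map(&:name)

# One class only: the cache key at the @name site stays stable.
uniform = Array.new(6) { |i| Admin.new("a#{i}") }
uniform_names = uniform.map(&:name)
```

Rails uses this shape heavily (for example, models inheriting shared behavior from a common base), which is one reading of why the slide singles out inheritance.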

Slide 45

Slide 45 text

What's slow in "JIT" on Rails --enable=jit --disable=jit

Slide 46

Slide 46 text

What's slow in "JIT" on Rails --enable=jit --disable=jit

Slide 47

Slide 47 text

What's slow in "JIT" on Rails --disable=jit

Slide 48

Slide 48 text

What's slow in "JIT" on Rails --enable=jit

Slide 49

Slide 49 text

What's slow in "JIT" on Rails icache.hit [Number of Instruction Cache, Streaming Buffer and Victim Cache Reads. both cacheable and noncacheable, including UC fetches] icache.ifdata_stall [Cycles where a code fetch is stalled due to L1 instruction-cache miss] icache.ifetch_stall [Cycles where a code fetch is stalled due to L1 instruction-cache miss] icache.misses [Number of Instruction Cache, Streaming Buffer and Victim Cache Misses. Includes Uncacheable accesses]

Slide 50

Slide 50 text

How about Optcarrot? --disable=jit

Slide 51

Slide 51 text

How about Optcarrot? --enable=jit

Slide 52

Slide 52 text

What's slow in "JIT" on Rails VM 13.5% --disable=jit

Slide 53

Slide 53 text

What's slow in "JIT" on Rails VM 13.5% 7.5% (-6%) --enable=jit

Slide 54

Slide 54 text

What's slow in "JIT" on Rails memory management, GC 9.3%

Slide 55

Slide 55 text

What's slow in "JIT" on Rails Regexp 7.3%

Slide 56

Slide 56 text

How about Optcarrot? --disable=jit

Slide 57

Slide 57 text

How about Optcarrot? --enable=jit

Slide 58

Slide 58 text

What's slow in "JIT" on Rails • Rails implementation is not friendly to the VM's inline cache • Even when code is optimized, calling JIT code lowers insn/cycle • Uncharted territory is large: GC, Regexp • There is a large overhead while JIT compilation is still happening (most existing benchmark results are slow because of this)

Slide 59

Slide 59 text

JIT optimization experiments for Rails

Slide 60

Slide 60 text

JIT optimizations for Rails 1. Default parameter tuning 2. Elimination of function pointer reference 3. Discard code after deoptimization 4. Multi-level method inlining 5. Profile-Guided Optimization

Slide 61

Slide 61 text

Optimization 1: Default parameter tuning • Ruby 2.7 default JIT parameter changes • --jit-min-calls: 5 → 10,000 • --jit-max-cache: 1,000 → 100 https://github.com/ruby/ruby/commit/0fa4a6a618
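
To make the effect of the new defaults concrete, here is a small sketch. The flag names and values come from the slide; the helper methods, the call counts, and the "candidate" rule are hypothetical simplifications and do not reproduce MJIT's actual scheduling logic.

```ruby
# Defaults named on the slide: Ruby 2.6 vs. the Ruby 2.7 change.
JIT_DEFAULTS = {
  "2.6" => { min_calls: 5,      max_cache: 1_000 },
  "2.7" => { min_calls: 10_000, max_cache: 100 },
}.freeze

# Build command-line flags that pin both parameters explicitly.
def jit_flags(min_calls:, max_cache:)
  ["--jit", "--jit-min-calls=#{min_calls}", "--jit-max-cache=#{max_cache}"]
end

# Toy model (not MJIT's real logic): a method becomes a JIT candidate once
# its call count reaches min_calls.
def jit_candidates(call_counts, min_calls:)
  call_counts.select { |_name, calls| calls >= min_calls }.keys
end

# Hypothetical call counts, for illustration only.
calls = { "Logger#info" => 30_000, "boot_once" => 7, "rare_helper" => 120 }

JIT_DEFAULTS.each do |version, opts|
  hot = jit_candidates(calls, min_calls: opts[:min_calls])
  puts "#{version}: #{jit_flags(**opts).join(' ')} -> #{hot.inspect}"
end
```

In this toy model the higher threshold leaves only the genuinely hot method as a candidate, so methods that run a handful of times during boot are never compiled.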

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

Attempt 2: Reduce function pointer reference • Before: "Ruby method setup" → "JIT code" • After: "Ruby method setup + JIT code" https://github.com/k0kubun/ruby/tree/mjit-cc-call

Slide 64

Slide 64 text

Attempt 2: Reduce function pointer reference

Slide 65

Slide 65 text

Attempt 3: Discard code after deoptimization • JIT code sometimes falls back to VM execution (deoptimization) • Never use code that tends to fall back https://github.com/k0kubun/ruby/tree/discard-code
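
The idea can be sketched as a toy model in plain Ruby. JitEntry and its lambdas are hypothetical stand-ins, not MJIT data structures: the fast path is valid only while a guard holds, and the first guard failure both falls back to the slow path and drops the fast path for good.

```ruby
# Toy model of "discard code after deoptimization" (hypothetical class).
class JitEntry
  attr_reader :deopt_count

  def initialize(guard:, fast:, slow:)
    @guard = guard
    @fast = fast
    @slow = slow
    @deopt_count = 0
  end

  # True while the specialized code is still installed.
  def jit_code?
    !@fast.nil?
  end

  def call(arg)
    return @slow.call(arg) unless @fast        # code was already discarded
    return @fast.call(arg) if @guard.call(arg) # speculation still holds

    @deopt_count += 1
    @fast = nil                                # never use this code again
    @slow.call(arg)
  end
end

entry = JitEntry.new(
  guard: ->(x) { x.is_a?(Integer) }, # speculation: argument is an Integer
  fast:  ->(x) { x + 1 },            # stands in for specialized JIT code
  slow:  ->(x) { x.succ },           # stands in for generic VM execution
)
results = [1, "a", 2].map { |x| entry.call(x) }
```

After the String argument breaks the guard, the third call goes straight to the slow path even though its argument is an Integer again; that is the trade-off this attempt makes to avoid paying repeatedly for failed speculation.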

Slide 66

Slide 66 text

Attempt 3: Discard code after deoptimization

Slide 67

Slide 67 text

Attempt 4: Multi-level method inlining • Previously only 1-level send + yield was inlinable • I have now implemented recursive inlining, plus heuristics to decide when to inline https://github.com/k0kubun/ruby/tree/mjit-inline-multi

Slide 68

Slide 68 text

Attempt 4: Multi-level method inlining

Slide 69

Slide 69 text

Attempt 5: Profile-Guided Optimization • Use GCC/Clang's -fprofile-generate and -fprofile-use • Unfortunately, profiling was heavy and the build system became complicated https://github.com/k0kubun/ruby/tree/pgo

Slide 70

Slide 70 text

Attempt 5: Profile-Guided Optimization

Slide 71

Slide 71 text

Future plans

Slide 72

Slide 72 text

Future plans 1. Reduce sync between threads 2. Specialized JIT code dispatch 3. Consider icache stall cost in heuristics 4. "Real" method inlining 5. Stack allocation of objects

Slide 73

Slide 73 text

Future 1: Reduce sync between threads • Ruby 2.6's MJIT worker synchronizes with the main thread • This fixes a race condition around the inline cache • We could copy the inline cache on enqueue (w/ trade-offs) or use the GVL instead of an internal synchronization mechanism

Slide 74

Slide 74 text

Future 2: Specialized JIT code dispatch • Current implementation: • Method dispatch → Call JIT if needed • Future options (other than Attempt 2): • Modify ISeq to use insn specialized for JIT call • Integrate JIT call with a place closer to insn dispatch

Slide 75

Slide 75 text

Future 3: Consider icache stall cost in heuristics • Currently a method is always JIT-ed if insn size < 1000 • Estimate the benefit by heuristics and skip JIT if it's low • Have some threshold to defeat icache stall

Slide 76

Slide 76 text

Future 4: "Real" method inlining • Previous implementation didn't eliminate a call frame • Eliminating it is fast by itself, and may help the compiler optimize further • Propagate type information to callee for polymorphism • Needs to prepare class-specific inline cache?

Slide 77

Slide 77 text

Future 5: Stack allocation of objects • Ruby spends significant time on memory management in Rails • We need to solve escape analysis first
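
Escape analysis asks whether an object can outlive the call that created it. A minimal illustration with hypothetical methods; it shows the distinction only, since current Ruby allocates both objects on the heap:

```ruby
Point = Struct.new(:x, :y)

# `pt` is not returned or stored anywhere by this method. If an analysis
# could also prove that the calls made on it do not retain it, the object
# would not escape, which is the precondition for stack allocation.
def distance_from_origin(x, y)
  pt = Point.new(x, y)
  Math.sqrt(pt.x**2 + pt.y**2)
end

# Here the object escapes through the return value, so it must outlive the
# call frame and cannot live on that frame's stack.
def make_point(x, y)
  Point.new(x, y)
end

d = distance_from_origin(3, 4)
kept = make_point(3, 4)
```

The hard part in Ruby is the proof itself: any method, including Point.new or an attribute reader, can be redefined at run time, so the analysis has to be paired with guards or deoptimization.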

Slide 78

Slide 78 text

Conclusion • Ruby 2.7 on Rails has started to get better, but most of my patches are still experimental (not merged yet) • Stay tuned for RubyKaigi and Ruby 2.7