Slide 1

Slide 1 text

Photo by Tomoyuki Kengaku On sending methods Urabe, Shyouhei

Slide 2

Slide 2 text

About myself • Urabe, Shyouhei • @shyouhei • Ruby committer @ Money Forward, Inc. • 2020’s #2 most active `ruby/ruby` committer, according to GitHub.

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Motivations Photo by Tomoyuki Kengaku

Slide 5

Slide 5 text

One day, I was profiling `ruby` using `discourse/discourse`’s `script/bench.rb`. Then I found something…

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Samples: 39K of event 'cycles', Event count (approx.): 499087050327 Overhead Command Shared Object Symbol 24.91% thread_pool.rb* ruby [.] vm_exec_core 2.90% thread_pool.rb* ruby [.] vm_call_iseq_setup 2.59% thread_pool.rb* ruby [.] method_entry_get 2.52% thread_pool.rb* ruby [.] vm_call_cfunc_with_frame.isra.0 2.48% thread_pool.rb* ruby [.] st_lookup 2.47% thread_pool.rb* ruby [.] rb_id_table_lookup 2.15% thread_pool.rb* ruby [.] vm_call_iseq_setup_normal_0start_0params_0local 1.82% thread_pool.rb* ruby [.] rb_wb_protected_newobj_of 1.71% thread_pool.rb* ruby [.] find_table_entry_ind 1.70% thread_pool.rb* ruby [.] gc_sweep_step 1.47% thread_pool.rb* ruby [.] vm_call_iseq_setup_normal_0start_1params_1local 1.30% thread_pool.rb* ruby [.] find_entry 1.17% thread_pool.rb* ruby [.] setup_parameters_complex 1.11% thread_pool.rb* libc-2.31.so [.] malloc 1.03% thread_pool.rb* ruby [.] rgengc_check_relation.part.0 0.93% thread_pool.rb* ruby [.] rb_obj_is_kind_of 0.92% thread_pool.rb* libc-2.31.so [.] malloc_usable_size 0.90% thread_pool.rb* ruby [.] ruby_sip_hash13 0.89% thread_pool.rb* ruby [.] gc_mark_children 0.89% thread_pool.rb* ruby [.] vm_call_method_each_type 0.81% thread_pool.rb* ruby [.] vm_search_method.isra.0 0.75% thread_pool.rb* ruby [.] vm_call_ivar 0.67% thread_pool.rb* ruby [.] match_at

Slide 8

Slide 8 text

This is WRONG!

Slide 9

Slide 9 text

The problem Photo by Tomoyuki Kengaku

Slide 10

Slide 10 text

So what is vm_call_iseq_setup? zsh % global -xr vm_call_iseq_setup vm_call_iseq_setup 134 vm_eval.c vm_call_iseq_setup(ec, reg_cfp, calli vm_call_iseq_setup 2757 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_iseq_setup, T vm_call_iseq_setup 2758 vm_insnhelper.c return vm_call_iseq_setup(ec, cfp, callin zsh %

Slide 11

Slide 11 text

So what is vm_call_iseq_setup? • It sets up VM stack frame, right at the moment when a method is about to run. • This function is not designed to be called frequently. • Normal method calls must skip it, because there are inline method caches.

Slide 12

Slide 12 text

Inline method cache def foo pp self end

Slide 13

Slide 13 text

Inline method cache == disasm: # (catch: FALSE) 0000 putself ( 1)[LiCa 0001 putself 0002 opt_send_without_block 0004 leave [Re]

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

What is cc->call() ?

Slide 18

Slide 18 text

zsh % global -xr CC_SET_FASTPATH CC_SET_FASTPATH 1635 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_iseq_setup_func(ci, param_size, CC_SET_FASTPATH 2239 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_iseq_setup, TRUE); CC_SET_FASTPATH 2244 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_cfunc, TRUE); CC_SET_FASTPATH 2251 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_attrset, !((ci->flag & VM_CALL_AR CC_SET_FASTPATH 2258 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_ivar, !(ci->flag & VM_CALL_ARGS_ CC_SET_FASTPATH 2263 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_method_missing, TRUE); CC_SET_FASTPATH 2267 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_bmethod, TRUE); CC_SET_FASTPATH 2278 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_opt_send, TRUE); CC_SET_FASTPATH 2281 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_opt_call, TRUE); CC_SET_FASTPATH 2284 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_opt_block_call, TRUE); CC_SET_FASTPATH 2361 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_method_missing, TRUE); CC_SET_FASTPATH 2382 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_method_missing, TRUE); CC_SET_FASTPATH 2492 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_method_missing, TRUE); CC_SET_FASTPATH 2497 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_super_method, TRUE); What is cc->call() ?

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

In short… • `vm_call_iseq_setup` is called when inline method cache mishits. • This function to show up in the #2 most frequent operation in a profiler output means the cache mishits very frequently. • This is of course not by design!

Slide 25

Slide 25 text

Cache hit rate • Recap `vm_search_method`, where you see there already are cache hit rate counters:

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

Cache hit rate • Recap `vm_search_method`, where you see there already are cache hit rate counters. • When enabled, we get this:

Slide 28

Slide 28 text

[RUBY_DEBUG_COUNTER] 42122 killed by signal. [RUBY_DEBUG_COUNTER] mc_inline_hit 928210882 [RUBY_DEBUG_COUNTER] mc_inline_miss 93975080 [RUBY_DEBUG_COUNTER] mc_global_hit 226048406 [RUBY_DEBUG_COUNTER] mc_global_miss 24044279 [RUBY_DEBUG_COUNTER] mc_global_state_miss 9246312 [RUBY_DEBUG_COUNTER] mc_class_serial_miss 84728768 [RUBY_DEBUG_COUNTER] mc_cme_complement 40695315 [RUBY_DEBUG_COUNTER] mc_cme_complement_hit 40690629 [RUBY_DEBUG_COUNTER] mc_search_super 242234049 [RUBY_DEBUG_COUNTER] mc_miss_by_nome 35088 [RUBY_DEBUG_COUNTER] mc_miss_by_distinct 31311981 [RUBY_DEBUG_COUNTER] mc_miss_by_refine 0 [RUBY_DEBUG_COUNTER] mc_miss_by_visi 1002898 [RUBY_DEBUG_COUNTER] mc_miss_spurious 61625113 [RUBY_DEBUG_COUNTER] ccf_general 37693318 [RUBY_DEBUG_COUNTER] ccf_iseq_setup 105021309

Slide 29

Slide 29 text

Cache hit rate • Recap `vm_search_method`, where you see there already are cache hit rate counters. • When enabled, we get this: 928,210,882 928,210,882 + 93,975,080 ≅ 90.80%

Slide 30

Slide 30 text

The design

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Conclusion • Our inline method cache mechanism is not working as intended.

Slide 33

Slide 33 text

Inspection Photo by Tomoyuki Kengaku

Slide 34

Slide 34 text

Deep dive into cache mishits • Now that we know the cache mechanism is somehow not working… • In order to fix it we have to know what exactly is going on when a cache mishits.

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

zsh % LC_ALL=C bundle exec script/bench.rb 2>&1 | sort | uniq -c | sort -nr 14894134 cache eviction: ActiveModel::AttributeSet#fetch_value is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/read.rb:39 4139225 cache eviction: Concurrent::Map#[] is called from concurrent-ruby-1.1.6/lib/concurrent-ruby/concurrent/map.rb:133 3912799 cache eviction: Topic#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/read.rb:17 3329993 cache eviction: User#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/primary_key.rb:19 1739540 cache eviction: User#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/read.rb:17 1722911 cache eviction: ActionView::OutputBuffer#<< is called from actionview-6.0.3.1/lib/action_view/buffers.rb:29 1670132 cache eviction: Module#site_id_lookup is called from message_bus-3.3.1/lib/message_bus/distributed_cache.rb:154 1285677 cache eviction: ActiveRecord::Type::HashLookupTypeMap#fetch is called from activerecord-6.0.3.1/lib/active_record/connection_adapters/postgresql_adapte 1158602 cache eviction: Post#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/read.rb:17 1111963 cache eviction: ActionDispatch::Journey::Visitors::Each#visit is called from actionpack-6.0.3.1/lib/action_dispatch/journey/visitors.rb:162 1079094 cache eviction: NilClass#blank? is called from activesupport-6.0.3.1/lib/active_support/core_ext/object/blank.rb:26 973103 cache eviction: ActiveRecord::Reflection::ThroughReflection#options is called from activerecord-6.0.3.1/lib/active_record/reflection.rb:806 973103 cache eviction: ActiveRecord::Reflection::ThroughReflection#active_record is called from activerecord-6.0.3.1/lib/active_record/reflection.rb:806 940974 cache eviction: Arel::Visitors::PostgreSQL#visit_Arel_Attributes_Attribute is called from activerecord-6.0.3.1/lib/arel/visitors/visitor.rb:30 740641 cache eviction: Post#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/associations/belongs_to_association.rb:119 686061 cache eviction: Topic#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/primary_key.rb:19 675719 cache eviction: ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#_quote is called from activerecord-6.0.3.1/lib/active_record/connection_adapters/po 610875 cache eviction: User#respond_to? is called from activerecord-6.0.3.1/lib/active_record/attribute_methods.rb:212 609323 cache eviction: Class#type_for_attribute is called from activerecord-6.0.3.1/lib/active_record/table_metadata.rb:33 604621 cache eviction: Topic#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/read.rb:32 555429 cache eviction: User#read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods.rb:323 555429 cache eviction: User#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/read.rb:32 544500 cache eviction: Arel::Visitors::PostgreSQL#visit_Arel_Nodes_BindParam is called from activerecord-6.0.3.1/lib/arel/visitors/visitor.rb:30 541906 cache eviction: ActiveModel::Type::Integer#cast is called from activemodel-6.0.3.1/lib/active_model/type/helpers/numeric.rb:19

Slide 37

Slide 37 text

Something is happening on `active_record/attribute_methods/read.rb`!

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

So? • `ActiveRecord::AttributeMethods::Read` is… • A Module. • Included from (`discourse`’s) `User` etc. • Which means… • Its `#read_attribute` is shared across several classes.

Slide 41

Slide 41 text

User Post Topic AR::Base AR::AttributeMethods AR::AM::Read Inherit Include Include #read_attribute #_read_attribute Calls

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

Actual scenario 1. `User#read_attribute` is called. 1. Runs `AR::AM::Read#read_attribute` 2. Populates its call cache with `User#_read_attribute` 2. `Post#read_attribute` is called. 1. Runs `AR::AM::Read#read_attribute` 2. Evicts previous `User#_read_attribute` entry and populates its own `Post#_read_attribute` 3. `User#read_attribute` is called. 1. Runs `AR::AM::Read#read_attribute` 2. Evicts previous `Post#_read_attribute` entry and populates its own `User#_read_attribute`

Slide 44

Slide 44 text

Fix it Photo by Tomoyuki Kengaku

Slide 45

Slide 45 text

Recap the problem • Inline method cache mishits too frequently. • Mishits happen when a method is shared among several classes possibly due to: • Inheritance • Module inclusion

Slide 46

Slide 46 text

How do we tackle this? • Inline call cache to consider class hierarchy? • NG; there are refinements. Method resolution can vary from call site to call site. Class hierarchy is not suitable for the needs. • Also, it is dead slow to scan inheritance tree every time a method is called. • Remember (class, method) tuple for each and every call site • → Way to go.

Slide 47

Slide 47 text

The new design • An inline cache was a map from (class, symbol) tuple to a method body. We modified it to stores multiple classes, which share the identical method name for identical method body. • Ko1 told me that this is a variant of polymorphic inline caching. • Hölzle et al., “Optimizing dynamically-typed object-oriented languages with polymorphic inline caches” in Proceedings of ECOOP ’91. LNCS 512. https://doi.org/10.1007/BFb0057013

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

Expand here to store multiple serials.

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

Benchmarks Photo by Tomoyuki Kengaku

Slide 54

Slide 54 text

Inline cache hit rate improvement before after mc_inline_hit 928,210,882 971,641,144 mc_inline_miss 93,975,080 50,482,155 90.80% 95.06%

Slide 55

Slide 55 text

Samples: 39K of event 'cycles', Event count (approx.): 499087050327 Overhead Command Shared Object Symbol 24.91% thread_pool.rb* ruby [.] vm_exec_core 2.90% thread_pool.rb* ruby [.] vm_call_iseq_setup 2.59% thread_pool.rb* ruby [.] method_entry_get 2.52% thread_pool.rb* ruby [.] vm_call_cfunc_with_frame.isra.0 2.48% thread_pool.rb* ruby [.] st_lookup 2.47% thread_pool.rb* ruby [.] rb_id_table_lookup 2.15% thread_pool.rb* ruby [.] vm_call_iseq_setup_normal_0start_0params_0local 1.82% thread_pool.rb* ruby [.] rb_wb_protected_newobj_of 1.71% thread_pool.rb* ruby [.] find_table_entry_ind 1.70% thread_pool.rb* ruby [.] gc_sweep_step 1.47% thread_pool.rb* ruby [.] vm_call_iseq_setup_normal_0start_1params_1local 1.30% thread_pool.rb* ruby [.] find_entry 1.17% thread_pool.rb* ruby [.] setup_parameters_complex 1.11% thread_pool.rb* libc-2.31.so [.] malloc 1.03% thread_pool.rb* ruby [.] rgengc_check_relation.part.0 0.93% thread_pool.rb* ruby [.] rb_obj_is_kind_of 0.92% thread_pool.rb* libc-2.31.so [.] malloc_usable_size 0.90% thread_pool.rb* ruby [.] ruby_sip_hash13 0.89% thread_pool.rb* ruby [.] gc_mark_children 0.89% thread_pool.rb* ruby [.] vm_call_method_each_type 0.81% thread_pool.rb* ruby [.] vm_search_method.isra.0 0.75% thread_pool.rb* ruby [.] vm_call_ivar 0.67% thread_pool.rb* ruby [.] match_at

Slide 56

Slide 56 text

Samples: 40K of event 'cycles', Event count (approx.): 502933506043 Overhead Command Shared Object Symbol 25.22% thread_pool.rb* ruby [.] vm_exec_core 2.61% thread_pool.rb* ruby [.] vm_call_iseq_setup_normal_0start_0params_0local 2.60% thread_pool.rb* ruby [.] vm_call_cfunc 2.56% thread_pool.rb* ruby [.] rb_st_lookup 2.16% thread_pool.rb* ruby [.] gc_sweep_step 2.12% thread_pool.rb* ruby [.] vm_call_iseq_setup 2.06% thread_pool.rb* ruby [.] find_table_entry_ind 2.04% thread_pool.rb* ruby [.] method_entry_get 1.61% thread_pool.rb* ruby [.] vm_call_iseq_setup_normal_0start_1params_1local 1.58% thread_pool.rb* ruby [.] rb_id_table_lookup 1.56% thread_pool.rb* libc-2.31.so [.] malloc_usable_size 0.98% thread_pool.rb* ruby [.] rb_obj_is_kind_of 0.87% thread_pool.rb* ruby [.] vm_call_ivar 0.86% thread_pool.rb* ruby [.] ruby_sip_hash13 0.76% thread_pool.rb* ruby [.] gc_mark_children 0.74% thread_pool.rb* ruby [.] rgengc_check_relation.part.0 0.71% thread_pool.rb* ruby [.] rb_call0 0.71% thread_pool.rb* ruby [.] rb_get_alloc_func 0.69% thread_pool.rb* ruby [.] vm_call_iseq_setup_normal_opt_start 0.68% thread_pool.rb* ruby [.] match_at 0.68% thread_pool.rb* ruby [.] rb_vm_search_method_slowpath 0.66% thread_pool.rb* ruby [.] rb_wb_protected_newobj_of 0.62% thread_pool.rb* ruby [.] gc_mark_ptr

Slide 57

Slide 57 text

# Event 'cycles' # # Baseline Delta Abs Shared Object Symbol # ........ ......... .................. .................................................... # 2.39% -0.75% ruby [.] rb_id_table_lookup 25.90% -0.34% ruby [.] vm_exec_core 2.35% -0.27% ruby [.] method_entry_get 0.88% -0.19% ruby [.] rb_vm_search_method_slowpath 0.15% +0.17% ruby [.] check_match 1.98% +0.16% ruby [.] find_table_entry_ind 0.57% +0.15% ruby [.] rb_call0 0.80% -0.15% libc-2.31.so [.] malloc +0.14% ruby [.] vm_search_method_fastpath 0.45% +0.13% ruby [.] vm_callee_setup_arg 2.52% +0.11% ruby [.] vm_call_iseq_setup_normal_0start_0params_0locals 0.41% +0.10% ruby [.] rb_vm_exec 0.61% +0.10% ruby [.] rb_get_alloc_func 0.60% +0.09% ruby [.] match_at 0.50% +0.08% ruby [.] vm_call_method_each_type 0.80% +0.08% ruby [.] vm_call_ivar 0.50% +0.07% ruby [.] rb_vm_invoke_bmethod 0.31% -0.07% ruby [.] tbl_update 0.64% -0.07% ruby [.] setup_parameters_complex 0.62% +0.07% ruby [.] rb_wb_protected_newobj_of

Slide 58

Slide 58 text

0 128 256 384 512 64 80 96 112 128 144 160 Histgram Response time [msec] [ `discourse` | `script/bench.rb` | `home` ] result before after Faster

Slide 59

Slide 59 text

Key result • Quantitative improvements: • Cache hit rate improved by 90.80% → 95.06% • Response time median decreased by 70msec → 69msec • Response time mode decreased by 69msec → 67msec • Profiler no longer considers `vm_call_iseq_setup` one of the hottest path. Specialised functions to show up instead.

Slide 60

Slide 60 text

Conclusions Photo by Tomoyuki Kengaku

Slide 61

Slide 61 text

Conclusions • Found that the way interpreter caches method resolution does not work well with inheritances, situations like e.g. Rails. • Fixed that problem by storing multiple keys per call site. • Results in improved cache hit rate, actual Rails performance boost.

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

Conclusions • The changeset has already merged into 2.7. • You can boost your Rails app TODAY by just using newer ruby!

Slide 64

Slide 64 text

Future works Photo by Tomoyuki Kengaku

Slide 65

Slide 65 text

No content

Slide 66

Slide 66 text

Future? • After 2.7 was released, inline cache mechanism (not only what I described today) was entirely discarded then rewritten by @ko1. • This was for Ractor (then-called Guild) migration. • https://bugs.ruby-lang.org/issues/16614 • This slowed things down a bit, according to him. • Would 2.7 be the fastest lethal ruby implementation we ever made? Watch out for Ruby 3! • It might be possible to re-implement polymorphic inline caches on top of Ractor. But cannot reuse this original design.

Slide 67

Slide 67 text

No content