Upgrade to Pro — share decks privately, control downloads, hide ads and more …

On sending methods

Urabe Shyouhei
September 04, 2020

On sending methods

Urabe Shyouhei

September 04, 2020
Tweet

More Decks by Urabe Shyouhei

Other Decks in Technology

Transcript

  1. Photo by Tomoyuki Kengaku
    On sending methods
    Urabe, Shyouhei

    View full-size slide

  2. About myself
    • Urabe, Shyouhei
    • @shyouhei
    • Ruby committer @ Money Forward, Inc.
    • 2020’s #2 most active `ruby/ruby` committer, according to
    GitHub.

    View full-size slide

  3. Motivations
    Photo by Tomoyuki Kengaku

    View full-size slide

  4. One day, I was profiling `ruby` using `discourse/discourse`’s
    `script/bench.rb`. Then I found something…

    View full-size slide

  5. Samples: 39K of event 'cycles', Event count (approx.): 499087050327
    Overhead Command Shared Object Symbol
    24.91% thread_pool.rb* ruby [.] vm_exec_core
    2.90% thread_pool.rb* ruby [.] vm_call_iseq_setup
    2.59% thread_pool.rb* ruby [.] method_entry_get
    2.52% thread_pool.rb* ruby [.] vm_call_cfunc_with_frame.isra.0
    2.48% thread_pool.rb* ruby [.] st_lookup
    2.47% thread_pool.rb* ruby [.] rb_id_table_lookup
    2.15% thread_pool.rb* ruby [.] vm_call_iseq_setup_normal_0start_0params_0local
    1.82% thread_pool.rb* ruby [.] rb_wb_protected_newobj_of
    1.71% thread_pool.rb* ruby [.] find_table_entry_ind
    1.70% thread_pool.rb* ruby [.] gc_sweep_step
    1.47% thread_pool.rb* ruby [.] vm_call_iseq_setup_normal_0start_1params_1local
    1.30% thread_pool.rb* ruby [.] find_entry
    1.17% thread_pool.rb* ruby [.] setup_parameters_complex
    1.11% thread_pool.rb* libc-2.31.so [.] malloc
    1.03% thread_pool.rb* ruby [.] rgengc_check_relation.part.0
    0.93% thread_pool.rb* ruby [.] rb_obj_is_kind_of
    0.92% thread_pool.rb* libc-2.31.so [.] malloc_usable_size
    0.90% thread_pool.rb* ruby [.] ruby_sip_hash13
    0.89% thread_pool.rb* ruby [.] gc_mark_children
    0.89% thread_pool.rb* ruby [.] vm_call_method_each_type
    0.81% thread_pool.rb* ruby [.] vm_search_method.isra.0
    0.75% thread_pool.rb* ruby [.] vm_call_ivar
    0.67% thread_pool.rb* ruby [.] match_at

    View full-size slide

  6. This is WRONG!

    View full-size slide

  7. The problem
    Photo by Tomoyuki Kengaku

    View full-size slide

  8. So what is vm_call_iseq_setup?
    zsh % global -xr vm_call_iseq_setup
    vm_call_iseq_setup 134 vm_eval.c vm_call_iseq_setup(ec, reg_cfp, calli
    vm_call_iseq_setup 2757 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_iseq_setup, T
    vm_call_iseq_setup 2758 vm_insnhelper.c return vm_call_iseq_setup(ec, cfp, callin
    zsh %

    View full-size slide

  9. So what is vm_call_iseq_setup?
    • It sets up VM stack frame, right at the moment when a
    method is about to run.
    • This function is not designed to be called frequently.
    • Normal method calls must skip it, because there are
    inline method caches.

    View full-size slide

  10. Inline method cache
    def foo
    pp self
    end

    View full-size slide

  11. Inline method cache
    == disasm: # (catch: FALSE)
    0000 putself ( 1)[LiCa
    0001 putself
    0002 opt_send_without_block
    0004 leave [Re]

    View full-size slide

  12. What is cc->call() ?

    View full-size slide

  13. zsh % global -xr CC_SET_FASTPATH
    CC_SET_FASTPATH 1635 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_iseq_setup_func(ci, param_size,
    CC_SET_FASTPATH 2239 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_iseq_setup, TRUE);
    CC_SET_FASTPATH 2244 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_cfunc, TRUE);
    CC_SET_FASTPATH 2251 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_attrset, !((ci->flag & VM_CALL_AR
    CC_SET_FASTPATH 2258 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_ivar, !(ci->flag & VM_CALL_ARGS_
    CC_SET_FASTPATH 2263 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_method_missing, TRUE);
    CC_SET_FASTPATH 2267 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_bmethod, TRUE);
    CC_SET_FASTPATH 2278 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_opt_send, TRUE);
    CC_SET_FASTPATH 2281 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_opt_call, TRUE);
    CC_SET_FASTPATH 2284 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_opt_block_call, TRUE);
    CC_SET_FASTPATH 2361 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_method_missing, TRUE);
    CC_SET_FASTPATH 2382 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_method_missing, TRUE);
    CC_SET_FASTPATH 2492 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_method_missing, TRUE);
    CC_SET_FASTPATH 2497 vm_insnhelper.c CC_SET_FASTPATH(cc, vm_call_super_method, TRUE);
    What is cc->call() ?

    View full-size slide

  14. In short…
    • `vm_call_iseq_setup` is called when inline method
    cache mishits.
    • This function to show up in the #2 most frequent
    operation in a profiler output means the cache mishits
    very frequently.
    • This is of course not by design!

    View full-size slide

  15. Cache hit rate
    • Recap `vm_search_method`, where you see there
    already are cache hit rate counters:

    View full-size slide

  16. Cache hit rate
    • Recap `vm_search_method`, where you see there
    already are cache hit rate counters.
    • When enabled, we get this:

    View full-size slide

  17. [RUBY_DEBUG_COUNTER] 42122 killed by signal.
    [RUBY_DEBUG_COUNTER] mc_inline_hit 928210882
    [RUBY_DEBUG_COUNTER] mc_inline_miss 93975080
    [RUBY_DEBUG_COUNTER] mc_global_hit 226048406
    [RUBY_DEBUG_COUNTER] mc_global_miss 24044279
    [RUBY_DEBUG_COUNTER] mc_global_state_miss 9246312
    [RUBY_DEBUG_COUNTER] mc_class_serial_miss 84728768
    [RUBY_DEBUG_COUNTER] mc_cme_complement 40695315
    [RUBY_DEBUG_COUNTER] mc_cme_complement_hit 40690629
    [RUBY_DEBUG_COUNTER] mc_search_super 242234049
    [RUBY_DEBUG_COUNTER] mc_miss_by_nome 35088
    [RUBY_DEBUG_COUNTER] mc_miss_by_distinct 31311981
    [RUBY_DEBUG_COUNTER] mc_miss_by_refine 0
    [RUBY_DEBUG_COUNTER] mc_miss_by_visi 1002898
    [RUBY_DEBUG_COUNTER] mc_miss_spurious 61625113
    [RUBY_DEBUG_COUNTER] ccf_general 37693318
    [RUBY_DEBUG_COUNTER] ccf_iseq_setup 105021309

    View full-size slide

  18. Cache hit rate
    • Recap `vm_search_method`, where you see there
    already are cache hit rate counters.
    • When enabled, we get this:
    928,210,882
    928,210,882 + 93,975,080
    ≅ 90.80%

    View full-size slide

  19. Conclusion
    • Our inline method cache mechanism is not working as
    intended.

    View full-size slide

  20. Inspection
    Photo by Tomoyuki Kengaku

    View full-size slide

  21. Deep dive into cache mishits
    • Now that we know the cache mechanism is somehow
    not working…
    • In order to fix it we have to know what exactly is going
    on when a cache mishits.

    View full-size slide

  22. zsh % LC_ALL=C bundle exec script/bench.rb 2>&1 | sort | uniq -c | sort -nr
    14894134 cache eviction: ActiveModel::AttributeSet#fetch_value is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/read.rb:39
    4139225 cache eviction: Concurrent::Map#[] is called from concurrent-ruby-1.1.6/lib/concurrent-ruby/concurrent/map.rb:133
    3912799 cache eviction: Topic#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/read.rb:17
    3329993 cache eviction: User#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/primary_key.rb:19
    1739540 cache eviction: User#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/read.rb:17
    1722911 cache eviction: ActionView::OutputBuffer#<< is called from actionview-6.0.3.1/lib/action_view/buffers.rb:29
    1670132 cache eviction: Module#site_id_lookup is called from message_bus-3.3.1/lib/message_bus/distributed_cache.rb:154
    1285677 cache eviction: ActiveRecord::Type::HashLookupTypeMap#fetch is called from activerecord-6.0.3.1/lib/active_record/connection_adapters/postgresql_adapte
    1158602 cache eviction: Post#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/read.rb:17
    1111963 cache eviction: ActionDispatch::Journey::Visitors::Each#visit is called from actionpack-6.0.3.1/lib/action_dispatch/journey/visitors.rb:162
    1079094 cache eviction: NilClass#blank? is called from activesupport-6.0.3.1/lib/active_support/core_ext/object/blank.rb:26
    973103 cache eviction: ActiveRecord::Reflection::ThroughReflection#options is called from activerecord-6.0.3.1/lib/active_record/reflection.rb:806
    973103 cache eviction: ActiveRecord::Reflection::ThroughReflection#active_record is called from activerecord-6.0.3.1/lib/active_record/reflection.rb:806
    940974 cache eviction: Arel::Visitors::PostgreSQL#visit_Arel_Attributes_Attribute is called from activerecord-6.0.3.1/lib/arel/visitors/visitor.rb:30
    740641 cache eviction: Post#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/associations/belongs_to_association.rb:119
    686061 cache eviction: Topic#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/primary_key.rb:19
    675719 cache eviction: ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#_quote is called from activerecord-6.0.3.1/lib/active_record/connection_adapters/po
    610875 cache eviction: User#respond_to? is called from activerecord-6.0.3.1/lib/active_record/attribute_methods.rb:212
    609323 cache eviction: Class#type_for_attribute is called from activerecord-6.0.3.1/lib/active_record/table_metadata.rb:33
    604621 cache eviction: Topic#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/read.rb:32
    555429 cache eviction: User#read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods.rb:323
    555429 cache eviction: User#_read_attribute is called from activerecord-6.0.3.1/lib/active_record/attribute_methods/read.rb:32
    544500 cache eviction: Arel::Visitors::PostgreSQL#visit_Arel_Nodes_BindParam is called from activerecord-6.0.3.1/lib/arel/visitors/visitor.rb:30
    541906 cache eviction: ActiveModel::Type::Integer#cast is called from activemodel-6.0.3.1/lib/active_model/type/helpers/numeric.rb:19

    View full-size slide

  23. Something is happening on
    `active_record/attribute_methods/read.rb`!

    View full-size slide

  24. So?
    • `ActiveRecord::AttributeMethods::Read` is…
    • A Module.
    • Included from (`discourse`’s) `User` etc.
    • Which means…
    • Its `#read_attribute` is shared across several classes.

    View full-size slide

  25. User
    Post
    Topic
    AR::Base
    AR::AttributeMethods
    AR::AM::Read
    Inherit
    Include
    Include
    #read_attribute
    #_read_attribute
    Calls

    View full-size slide

  26. Actual scenario
    1. `User#read_attribute` is called.
    1. Runs `AR::AM::Read#read_attribute`
    2. Populates its call cache with `User#_read_attribute`
    2. `Post#read_attribute` is called.
    1. Runs `AR::AM::Read#read_attribute`
    2. Evicts previous `User#_read_attribute` entry and
    populates its own `Post#_read_attribute`
    3. `User#read_attribute` is called.
    1. Runs `AR::AM::Read#read_attribute`
    2. Evicts previous `Post#_read_attribute` entry and
    populates its own `User#_read_attribute`

    View full-size slide

  27. Fix it
    Photo by Tomoyuki Kengaku

    View full-size slide

  28. Recap the problem
    • Inline method cache mishits too frequently.
    • Mishits happen when a method is shared among several
    classes possibly due to:
    • Inheritance
    • Module inclusion

    View full-size slide

  29. How do we tackle this?
    • Inline call cache to consider class hierarchy?
    • NG; there are refinements. Method resolution can vary from call site
    to call site. Class hierarchy is not suitable for the needs.
    • Also, it is dead slow to scan inheritance tree every time a method is
    called.
    • Remember (class, method) tuple for each and every call site
    • → Way to go.

    View full-size slide

  30. The new design
    • An inline cache was a map from (class, symbol) tuple to a
    method body. We modified it to stores multiple classes, which
    share the identical method name for identical method body.
    • Ko1 told me that this is a variant of polymorphic inline caching.
    • Hölzle et al., “Optimizing dynamically-typed object-oriented languages
    with polymorphic inline caches” in Proceedings of ECOOP ’91.
    LNCS 512. https://doi.org/10.1007/BFb0057013

    View full-size slide

  31. Expand here to store multiple serials.

    View full-size slide

  32. Benchmarks
    Photo by Tomoyuki Kengaku

    View full-size slide

  33. Inline cache hit rate improvement
    before after
    mc_inline_hit 928,210,882 971,641,144
    mc_inline_miss 93,975,080 50,482,155
    90.80% 95.06%

    View full-size slide

  34. Samples: 39K of event 'cycles', Event count (approx.): 499087050327
    Overhead Command Shared Object Symbol
    24.91% thread_pool.rb* ruby [.] vm_exec_core
    2.90% thread_pool.rb* ruby [.] vm_call_iseq_setup
    2.59% thread_pool.rb* ruby [.] method_entry_get
    2.52% thread_pool.rb* ruby [.] vm_call_cfunc_with_frame.isra.0
    2.48% thread_pool.rb* ruby [.] st_lookup
    2.47% thread_pool.rb* ruby [.] rb_id_table_lookup
    2.15% thread_pool.rb* ruby [.] vm_call_iseq_setup_normal_0start_0params_0local
    1.82% thread_pool.rb* ruby [.] rb_wb_protected_newobj_of
    1.71% thread_pool.rb* ruby [.] find_table_entry_ind
    1.70% thread_pool.rb* ruby [.] gc_sweep_step
    1.47% thread_pool.rb* ruby [.] vm_call_iseq_setup_normal_0start_1params_1local
    1.30% thread_pool.rb* ruby [.] find_entry
    1.17% thread_pool.rb* ruby [.] setup_parameters_complex
    1.11% thread_pool.rb* libc-2.31.so [.] malloc
    1.03% thread_pool.rb* ruby [.] rgengc_check_relation.part.0
    0.93% thread_pool.rb* ruby [.] rb_obj_is_kind_of
    0.92% thread_pool.rb* libc-2.31.so [.] malloc_usable_size
    0.90% thread_pool.rb* ruby [.] ruby_sip_hash13
    0.89% thread_pool.rb* ruby [.] gc_mark_children
    0.89% thread_pool.rb* ruby [.] vm_call_method_each_type
    0.81% thread_pool.rb* ruby [.] vm_search_method.isra.0
    0.75% thread_pool.rb* ruby [.] vm_call_ivar
    0.67% thread_pool.rb* ruby [.] match_at

    View full-size slide

  35. Samples: 40K of event 'cycles', Event count (approx.): 502933506043
    Overhead Command Shared Object Symbol
    25.22% thread_pool.rb* ruby [.] vm_exec_core
    2.61% thread_pool.rb* ruby [.] vm_call_iseq_setup_normal_0start_0params_0local
    2.60% thread_pool.rb* ruby [.] vm_call_cfunc
    2.56% thread_pool.rb* ruby [.] rb_st_lookup
    2.16% thread_pool.rb* ruby [.] gc_sweep_step
    2.12% thread_pool.rb* ruby [.] vm_call_iseq_setup
    2.06% thread_pool.rb* ruby [.] find_table_entry_ind
    2.04% thread_pool.rb* ruby [.] method_entry_get
    1.61% thread_pool.rb* ruby [.] vm_call_iseq_setup_normal_0start_1params_1local
    1.58% thread_pool.rb* ruby [.] rb_id_table_lookup
    1.56% thread_pool.rb* libc-2.31.so [.] malloc_usable_size
    0.98% thread_pool.rb* ruby [.] rb_obj_is_kind_of
    0.87% thread_pool.rb* ruby [.] vm_call_ivar
    0.86% thread_pool.rb* ruby [.] ruby_sip_hash13
    0.76% thread_pool.rb* ruby [.] gc_mark_children
    0.74% thread_pool.rb* ruby [.] rgengc_check_relation.part.0
    0.71% thread_pool.rb* ruby [.] rb_call0
    0.71% thread_pool.rb* ruby [.] rb_get_alloc_func
    0.69% thread_pool.rb* ruby [.] vm_call_iseq_setup_normal_opt_start
    0.68% thread_pool.rb* ruby [.] match_at
    0.68% thread_pool.rb* ruby [.] rb_vm_search_method_slowpath
    0.66% thread_pool.rb* ruby [.] rb_wb_protected_newobj_of
    0.62% thread_pool.rb* ruby [.] gc_mark_ptr

    View full-size slide

  36. # Event 'cycles'
    #
    # Baseline Delta Abs Shared Object Symbol
    # ........ ......... .................. ....................................................
    #
    2.39% -0.75% ruby [.] rb_id_table_lookup
    25.90% -0.34% ruby [.] vm_exec_core
    2.35% -0.27% ruby [.] method_entry_get
    0.88% -0.19% ruby [.] rb_vm_search_method_slowpath
    0.15% +0.17% ruby [.] check_match
    1.98% +0.16% ruby [.] find_table_entry_ind
    0.57% +0.15% ruby [.] rb_call0
    0.80% -0.15% libc-2.31.so [.] malloc
    +0.14% ruby [.] vm_search_method_fastpath
    0.45% +0.13% ruby [.] vm_callee_setup_arg
    2.52% +0.11% ruby [.] vm_call_iseq_setup_normal_0start_0params_0locals
    0.41% +0.10% ruby [.] rb_vm_exec
    0.61% +0.10% ruby [.] rb_get_alloc_func
    0.60% +0.09% ruby [.] match_at
    0.50% +0.08% ruby [.] vm_call_method_each_type
    0.80% +0.08% ruby [.] vm_call_ivar
    0.50% +0.07% ruby [.] rb_vm_invoke_bmethod
    0.31% -0.07% ruby [.] tbl_update
    0.64% -0.07% ruby [.] setup_parameters_complex
    0.62% +0.07% ruby [.] rb_wb_protected_newobj_of

    View full-size slide

  37. 0
    128
    256
    384
    512
    64 80 96 112 128 144 160
    Histgram
    Response time [msec]
    [ `discourse` | `script/bench.rb` | `home` ] result
    before
    after
    Faster

    View full-size slide

  38. Key result
    • Quantitative improvements:
    • Cache hit rate improved by 90.80% → 95.06%
    • Response time median decreased by 70msec → 69msec
    • Response time mode decreased by 69msec → 67msec
    • Profiler no longer considers `vm_call_iseq_setup` one of the
    hottest path. Specialised functions to show up instead.

    View full-size slide

  39. Conclusions
    Photo by Tomoyuki Kengaku

    View full-size slide

  40. Conclusions
    • Found that the way interpreter caches method resolution does
    not work well with inheritances, situations like e.g. Rails.
    • Fixed that problem by storing multiple keys per call site.
    • Results in improved cache hit rate, actual Rails performance
    boost.

    View full-size slide

  41. Conclusions
    • The changeset has already merged into 2.7.
    • You can boost your Rails app TODAY by just using newer ruby!

    View full-size slide

  42. Future works
    Photo by Tomoyuki Kengaku

    View full-size slide

  43. Future?
    • After 2.7 was released, inline cache mechanism (not only what
    I described today) was entirely discarded then rewritten by
    @ko1.
    • This was for Ractor (then-called Guild) migration.
    • https://bugs.ruby-lang.org/issues/16614
    • This slowed things down a bit, according to him.
    • Would 2.7 be the fastest lethal ruby implementation we ever
    made? Watch out for Ruby 3!
    • It might be possible to re-implement polymorphic inline caches on top
    of Ractor. But cannot reuse this original design.

    View full-size slide