Upgrade to Pro — share decks privately, control downloads, hide ads and more …

The depths of profiling Ruby - RubyKaigi 2024

osyoyu
May 14, 2024

The depths of profiling Ruby - RubyKaigi 2024

osyoyu

May 14, 2024
Tweet

More Decks by osyoyu

Other Decks in Technology

Transcript

  1. pp @osyoyu Daisuke Aritomo (osyoyu) Member of #rubykaigiNOC (Wi-Fi team)

    Loving ϙʔΫۄࢠ͓ʹ͗Γ Probably in the top 1% among Rubyists pork tamago onigiri
  2. profiler noun /ˈprəʊfaɪlə(r)/ a person whose job is to collect

    useful information about somebody/something so that they can give a description of them or it "Profiler, N. (1)." Oxford Advanced Learner's Dictionary, https://www.oxfordlearnersdictionaries.com/definition/english/profiler. Accessed 6 May 2024.
  3. (Ruby) profilers 101 A tool injected into a target program

    and analyzes the performance Reveals the hotspot The code block that is making the program slow The "flamegraph" visualization. "Heavy" methods are shown as long bars. Flame Graph Reset Zoom Search test_bu.. x.. d.. s.. gener.. S.. _.. unary_.. execute_command_internal x.. bash shell_expand_word_list red.. execute_builtin __xst.. __dup2 main _.. do_redirection_internal do_redirections do_f.. ext4_f.. __GI___li.. __.. expand_word_list_internal unary_.. expan.. tra.. path.. execute_command two_ar.. cleanup_redi.. execute_command do_sy.. trac.. __GI___libc_.. posixt.. execute_builtin_or_function _.. [unknown] tra.. expand_word_internal _.. tracesys ex.. _.. vfs_write c.. test_co.. e.. u.. v.. SY.. i.. execute_simple_command execute_command_internal p.. gl.. s.. sys_write __GI___l.. expand_words d.. __libc_start_main d.. do_sync.. generi.. e.. reader_loop sy.. execute_while_command do_redirecti.. sys_o.. g.. __gene.. __xsta.. tracesys c.. _.. execute_while_or_until
  4. Announcing Pf2, a new Ruby profiler I've been building a

    new profiler! Pf2 is focused on analyzing Ruby programs from various point of views github.com/osyoyu/pf2
  5. Finding slow methods Pf2 can find time-consuming methods and display

    them in a flamegraph (just like other Ruby profilers)
  6. Thread statistics Indicates time in GC and time wasted waiting

    for the CPU lock (GVL) In Ruby, only 1 Thread can use the CPU simultaneously Useful information for adjusting # of Threads
  7. Diving into C extensions With Ruby-only profiling, it's hard to

    explore bottlenecks which live in C extensions
  8. Diving into C extensions read_query_result (Receiving the result = parsing

    results is slow) ppoll (Waiting for query exec = heavy work on the server) do_send_query (Sending the SQL = the network is slow)
  9. Mutlple visualizations Pf2 offers several visualization modes (through Firefox Profiler)

    Call Tree (for finding heavy methods) Flamegraph Inverted call tree (for finding methods called repeatedly)
  10. Why so many angles? Bottlenecks often live in unexpected, unguessable

    regions Especially when the bottleneck is not in your code e.g. Spawning 50 Threads in puma won't make your server 50x faster You'd need to know about the GVL to guess the reason, but a GVL-aware profiler can teach you
  11. How to build this? Building a profiler is not like

    writing a normal Ruby program Let's see how to do this!
  12. A profiler's mission Be informative as possible And never be

    misleading (it's very easy to create a misleading profiler) Maintain low overhead The profiler must not be the bottleneck If collecting more information incurs too much overhead, don't collect it Be a friendly tool
  13. Pf2's life cycle Sample Collection Record the target program's exec

    state periodically Unwinding & Aggregation Collect info about samples (after profiling is done) Visualization Format the data into flamegraphs, call graphs, ...
  14. Pf2's life cycle Sample Collection Record the target program's exec

    state periodically Unwinding & Aggregation Collect info about samples (after profiling is done) Visualization Format the data into flamegraphs, call graphs, ...
  15. Sample collection A sampling profiler collects samples every few milliseconds

    while the target program runs Samples contain the current execution state What Ruby method was running now? What was the underlying C function? Did the Thread have the GVL now?
  16. <main> a() c() b() 0ms 10ms 20ms 30ms c() is

    never recorded since it starts and ends between captures <main>, a <main>, a, b <main> The function running at capture time will be recorded Sampling profiler timeline Samples are periodically captured (in this case, every 10 ms) Samples
  17. Collecting samples 💥 Pf2 observes the execution state from the

    outside Many types of information must be collected Running Ruby methods GC/GVL state Running C functions Profiling from the inside is not impossible, but would be horrible
  18. Querying Ruby about Ruby The CRuby runtime knows the best

    about CRuby state The stack, Threads, Ractors, GC ... Many aspects are not observable from Ruby code Contrarily, tools like perf won't work without special support The Ruby stack is mere memory from perf's POV
  19. Profiling support in CRuby Pf2 Profiler rb_profile_thread_frames() Obtaining the stack

    trace TracePoint API Receiving GC events Thread Events API Observing GVL state
  20. Takes a Thread as an arg, and returns its stack

    trace Ruby 2.1+/3.3+ The stack is represented as a opaque VALUE This is actually a pointer to the CME (callable method entry) We abuse this later rb_profile_thread_frames() Obtaining the stack trace TracePoint API Receiving GC events Thread Events API Observing GVL state Profiling support in CRuby
  21. Capable of reporting many "events", including GC state RUBY_INTERNAL_EVENT_GC_ {START,END_MARK,END_SWEEP}

    (Tracing profilers could be built on this API) Method entry/return events Profiling support in CRuby rb_profile_thread_frames() Obtaining the stack trace TracePoint API Receiving GC events Thread Events API Observing GVL state
  22. Calls a hook when the Thread state changes Started, suspended,

    restarted, waiting for GVL, stopped Profiling support in CRuby rb_profile_thread_frames() Obtaining the stack trace TracePoint API Receiving GC events Thread Events API Observing GVL state
  23. (Off-topic: The alternative strategy) It is possible to do sampling

    without runtime support... ... by knowing everything about the runtime memory If we can correctly re-interpret CRuby's entire memory, we have the same information which CRuby has The rbspy profiler does exactly this. It peeks into CRuby memory (process_vm_readv(2)), and reinterprets it using its copy of all CRuby structs
  24. Capturing C-level stack traces Needed to profile C extensions C-level

    stacks are managed by the OS kernel (as opposed to CRuby-managed Ruby stacks) A completely different approach is needed
  25. Capturing C-level stack traces Pf2 records the current program counter

    (PC) (with the help of libbacktrace) The PC will be used later to restore the original stack trace This process is done pthread-wise, and will be matched with Ruby Threads
  26. Scheduling Controlling when to collect samples Pf2's main strategy is

    using OS-level timers Linux provides several timers through timer_create(2)
  27. 0ms 10ms 20ms 30ms Ruby Thread 1 Ruby Thread 2

    SIGPROF SIGPROF Scheduling: Using timer_create(2) Using CPU Using CPU Using CPU Pf2 signal handler: Capture a sample for Ruby Thread 1 Pf2 signal Handler
  28. Scheduling: Supported definitions of time Real time (CLOCK_MONOTONIC in timer_create(2))

    Time as it flows in the real world. Also known as "wall-clock time" CPU time (CLOCK_PROCESS_CPUTIME_ID) The amount of time which the CPU consumed in behalf of the entire process Per-kernel thread CPU time (CLOCK_THREAD_CPUTIME_ID) This mode correctly reflects non-equal workloads Kernel threads can be linked to Ruby Threads 1:1 (when not +MN)
  29. The sampling happens in the signal handler In signal handlers,

    very little operation can be done No new allocations (no malloc)! Pf2's signal handler is designed not to violate this Collected data is kept in a pre-allocated ring buffer rb_profile_thread_frames() is (mostly) safe Scheduling: Running sampling in signal handlers
  30. Scheduling: Choosing the sampling rate It is important to choose

    a good sampling rate Higher Hz = More accuracy, more overhead Pf2 chooses 19 ms as the default Avoid lockstep sampling (missing periodical events) check() 0ms 10ms 20ms 30ms check() check()
  31. Scheduling: Controlling sampling overhead 0ms 19ms 38ms 57ms Ruby Thread

    Sample! This is 5% overhead Sample! Sample! 19.95ms 38.95ms Sampling (signal handlers) always pauses the program Pf2 default's overhead goal is <5%, so we can use 0.95 ms (950 microseconds) for each round
  32. SYSTEMS PROGRAMMING!!! Building a profiler is like systems programming So

    Pf2 is implemented in Rust unsafe { I thought Rust was a good language for do this... however most code needed to be unsafe { } anyways... So I ended up fighting with SEGVs all day But still, the programming experience was 100x better than C }
  33. Pf2's life cycle Sample Collection Record the target program's exec

    state periodically Unwinding & Aggregation Collect info about samples (after profiling is done) Visualization Format the data into flamegraphs, call graphs, ...
  34. Unwinding & Aggregation Pf2 collects minimum information during sampling to

    make the capture process low-overhead as possible This info does not contain file names and line numbers Unwinding is required to restore the full stack trace This is done after the target program completes
  35. Aggregation: Restoring func information Ruby Stack Method names can be

    obtained through C APIs (rb_profile_frame_full_label) The same for line numbers C stack addr2line, DWARF, etc.
  36. Aggregation: Weaving Ruby and C stacks The Ruby stack and

    C stack are collected as completely separate stacks As Ruby can call C and vice versa, the stacks must be "weaved" to create a meaningful report <main> Integer#times block in <main> Kernel#foo block (2 levels) in <main> main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() vm_exec_core() rb_yield() rb_foo() vm_call_cfunc_with_cfra me() main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() <main> Integer#times block in <main> Kernel#my_cfunc vm_callc_cfunc_with_cf vm_exec_core() rb_yield() Kernel#foo block (2 levels) in <main
  37. Weaving Ruby stack and C stacks <main> Integer#times block in

    <main> Kernel#my_cfunc block (2 levels) in <main> Kernel#foo block (2 levels) in <main> main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() vm_exec_core() rb_yield() rb_foo() main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() Program execution always starts from Native main() Copy the C stack until we hit rb_vm_exec(), then start copying Ruby stack
  38. Weaving Ruby stack and C stack <main> Integer#times block in

    <main> Kernel#my_cfunc block (2 levels) in <main> Kernel#foo block (2 levels) in <main> main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() vm_exec_core() rb_yield() rb_foo() main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() <main> Integer#times block in <main> Kernel#my_cfunc When we see a cfunc, switch back to C stack
  39. Weaving Ruby stack and C stack <main> Integer#times block in

    <main> Kernel#my_cfunc block (2 levels) in <main> Kernel#foo block (2 levels) in <main> main() ... ... vm_callc_cfunc_with_cframe() my_cfunc() vm_exec_core() ... We use the C func address stored in Cfunc's Callable Method Entry (CME) (cme->cfunc->func) We copy CRuby headers to do this (very fragile implementation...) 0x7cabca (non-cfunc) (non-cfunc) 0x7cabca 0x7c0009 0x7c0010
  40. Weaving Ruby stack and C Stack <main> Integer#times block in

    <main> Kernel#my_cfunc block (2 levels) in <main> Kernel#foo block (2 levels) in <main> main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() vm_exec_core() rb_yield() rb_foo() main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() <main> Integer#times block in <main> Kernel#my_cfunc vm_callc_cfunc_with_cframe() vm_exec_core() rb_yield() Kernel#foo block (2 levels) in <main> Repeat this process until we consume the entire original stack
  41. Pf2's life cycle Sample Collection Record the target program's exec

    state periodically Unwinding & Aggregation Collect info about samples (after profiling is done) Visualization Format the data into flamegraphs, call graphs, ...
  42. Visualization Good visualization gives us many ideas However, writing graphical

    code is very different from writing profiling code Pf2 decided to use existing visualizers The Firefox Profiler (profiler.firefox.com)
  43. Wrapping up Profilers cooperate with runtimes (CRuby, OS kernel) to

    collect information Accurate scheduling is the key to controlling the balance between information and overhead Unwinding is a heavy process and is done separately
  44. The future of Ruby profiling As long as Ruby continues

    to evolve, its profiling features shall also continue to evolve We need to be able to perfectly profile weird code! Profilers are closely tied to the runtime implementation e.g. Pf2 relies on the fact that 1 pthread is assigned per 1 Ruby Thread, which is broken when +MN Profilers need to catch up
  45. The future of Ruby profiling Since profiling is so tied

    to its target, maybe more of the profiler should be moved into Ruby Node.js, Go (pprof) has the profiler fully integrated into the runtime
  46. The future of Ruby profiling Software profiling is a never-complete

    art For example, the demand on "Always-on" profiling (continuous profiling) is increasing We'd need to keep lowering the overhead Let's build the future together!
  47. Acknowledgements Work on Pf2 was supported by The Ruby Association

    2023 Grant Program k0kubun san, my mentor ko1 san and mame san, as advisors