useful information about somebody/something so that they can give a description of them or it ("Profiler, N. (1)." Oxford Advanced Learner's Dictionary, https://www.oxfordlearnersdictionaries.com/definition/english/profiler. Accessed 6 May 2024.)
regions. Especially when the bottleneck is not in your code: e.g. spawning 50 Threads in Puma won't make your server 50x faster. You'd need to know about the GVL to guess the reason, but a GVL-aware profiler can teach you why.
misleading (it's very easy to create a misleading profiler). Maintain low overhead: the profiler must not be the bottleneck; if collecting more information incurs too much overhead, don't collect it. Be a friendly tool.
state periodically. Unwinding & Aggregation: collect info about the samples (after profiling is done). Visualization: format the data into flamegraphs, call graphs, ...
while the target program runs. Samples contain the current execution state: What Ruby method was running at that moment? What was the underlying C function? Did the Thread hold the GVL?
Sampling profiler timeline: samples are periodically captured (in this case, every 10 ms). The function running at capture time is recorded; a function that starts and ends between two captures is never recorded. (Diagram: captured stacks such as "<main>", "<main>, a", "<main>, a, b".)
outside. Many types of information must be collected: running Ruby methods, GC/GVL state, running C functions. Profiling from the inside is not impossible, but it would be horrible.
about CRuby state: the stack, Threads, Ractors, GC, ... Many aspects are not observable from Ruby code. Conversely, tools like perf won't work without special support: the Ruby stack is mere memory from perf's point of view.
Profiling support in CRuby:
rb_profile_thread_frames(): obtaining the stack trace (Ruby 2.1+ / 3.3+). The stack is represented as an opaque VALUE, which is actually a pointer to the CME (callable method entry). We abuse this later.
TracePoint API: receiving GC events, plus method entry/return events (tracing profilers could be built on this API).
Thread Events API: observing GVL state (..., restarted, waiting for GVL, stopped).
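To make these three hooks concrete, here is a minimal sketch of how a C extension might touch them. The function names, buffer size, and event selection are illustrative assumptions, not Pf2's actual code, and the APIs assume Ruby 3.3+.

```c
/* Sketch only: illustrative names and sizes, not Pf2's actual code. */
#include <ruby/ruby.h>
#include <ruby/debug.h>   /* rb_profile_thread_frames(), TracePoint C API */
#include <ruby/thread.h>  /* rb_internal_thread_add_event_hook() */

#define MAX_DEPTH 512

/* 1. rb_profile_thread_frames(): obtain the Ruby stack of a given Thread (Ruby 3.3+).
 *    Each entry written into `frames` is an opaque VALUE (a pointer to the CME). */
static int
capture_ruby_stack(VALUE thread, VALUE *frames, int *lines)
{
    return rb_profile_thread_frames(thread, 0, MAX_DEPTH, frames, lines);
}

/* 2. TracePoint C API: receive GC enter/exit events. */
static void
on_gc_event(VALUE tpval, void *data)
{
    /* record "GC was running" for the current sample window */
}

static VALUE
install_gc_tracepoint(void)
{
    VALUE tp = rb_tracepoint_new(Qnil,
                                 RUBY_INTERNAL_EVENT_GC_ENTER | RUBY_INTERNAL_EVENT_GC_EXIT,
                                 on_gc_event, NULL);
    rb_tracepoint_enable(tp);
    return tp; /* the caller must keep this VALUE reachable so GC does not collect it */
}

/* 3. Thread Events API: observe GVL-related state changes.
 *    READY = waiting for the GVL, RESUMED = acquired it, SUSPENDED = released it. */
static void
on_thread_event(rb_event_flag_t event,
                const rb_internal_thread_event_data_t *event_data,
                void *user_data)
{
    /* record the GVL state transition for the thread in event_data */
}

static void
install_thread_event_hook(void)
{
    rb_internal_thread_add_event_hook(on_thread_event,
                                      RUBY_INTERNAL_THREAD_EVENT_READY |
                                      RUBY_INTERNAL_THREAD_EVENT_RESUMED |
                                      RUBY_INTERNAL_THREAD_EVENT_SUSPENDED,
                                      NULL);
}
```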
without runtime support... ...by knowing everything about the runtime memory. If we can correctly re-interpret CRuby's entire memory, we have the same information that CRuby itself has. The rbspy profiler does exactly this: it peeks into CRuby's memory (process_vm_readv(2)) and reinterprets it using its own copy of all CRuby structs.
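For illustration, a minimal sketch of the mechanism rbspy relies on: process_vm_readv(2) copies bytes out of another process's address space without stopping it. The PID and remote address below are placeholders; a real tool locates CRuby's structures via the target's symbol table and /proc/<pid>/maps.

```c
/* Sketch: the PID and remote address are placeholders, not something rbspy hardcodes. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

int
main(void)
{
    pid_t     target_pid  = 12345;            /* placeholder: the Ruby process to inspect */
    uintptr_t remote_addr = 0x7f0000000000;   /* placeholder: address of some CRuby struct */
    unsigned char buf[256];

    struct iovec local  = { .iov_base = buf,                 .iov_len = sizeof(buf) };
    struct iovec remote = { .iov_base = (void *)remote_addr, .iov_len = sizeof(buf) };

    ssize_t n = process_vm_readv(target_pid, &local, 1, &remote, 1, 0);
    if (n < 0) {
        perror("process_vm_readv");
        return 1;
    }
    /* `buf` now holds raw bytes of the target's memory; rbspy reinterprets bytes
     * like these using its own copies of CRuby's struct definitions. */
    printf("read %zd bytes from pid %d\n", n, (int)target_pid);
    return 0;
}
```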
(PC) (with the help of libbacktrace). The PC will be used later to restore the original stack trace. This process is done per pthread, and each native stack is matched with its Ruby Thread.
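As a rough illustration of PC capture, here is a sketch using libbacktrace's backtrace_simple(); the callback names and the fixed buffer are assumptions, and error handling is minimal. The captured PCs are what later gets symbolized and matched back to Ruby Threads.

```c
/* Sketch: illustrative use of libbacktrace (link with -lbacktrace). */
#include <stdint.h>
#include <stdio.h>
#include <backtrace.h>

#define MAX_NATIVE_DEPTH 256

static uintptr_t pcs[MAX_NATIVE_DEPTH];
static int pc_count;

static int
on_pc(void *data, uintptr_t pc)
{
    if (pc_count < MAX_NATIVE_DEPTH) pcs[pc_count++] = pc;
    return 0; /* 0 = keep unwinding */
}

static void
on_error(void *data, const char *msg, int errnum)
{
    fprintf(stderr, "libbacktrace: %s (%d)\n", msg, errnum);
}

int
main(void)
{
    /* The state is created once; threaded = 1 allows sharing it between pthreads. */
    struct backtrace_state *state = backtrace_create_state(NULL, 1, on_error, NULL);

    pc_count = 0;
    backtrace_simple(state, 0, on_pc, on_error, NULL);

    /* `pcs` now holds raw program counters for the current pthread; symbolization
     * (file/function names) happens later, after profiling is done. */
    printf("captured %d native frames\n", pc_count);
    return 0;
}
```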
Scheduling: using timer_create(2) to deliver SIGPROF while a thread is using CPU; the Pf2 signal handler then captures a sample for the corresponding Ruby Thread. (Diagram: timers fire SIGPROF during the on-CPU periods of each thread.)
Time as it flows in the real world, also known as "wall-clock time". CPU time (CLOCK_PROCESS_CPUTIME_ID): the amount of CPU time consumed on behalf of the entire process. Per-kernel-thread CPU time (CLOCK_THREAD_CPUTIME_ID): this mode correctly reflects unequal workloads, and kernel threads can be linked to Ruby Threads 1:1 (when not running M:N).
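A simplified sketch of the scheduling described above, assuming Linux: a timer_create(2) timer delivers SIGPROF every 19 ms. For brevity this uses a single process-wide wall-clock timer; per-Ruby-Thread sampling as in Pf2 would instead create one timer per kernel thread (e.g. on a thread CPU-time clock).

```c
/* Sketch: one process-wide wall-clock timer; Pf2 itself uses per-thread timers.
 * Compile with -lrt on older glibc. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t ticks;

static void
on_sigprof(int sig, siginfo_t *info, void *ucontext)
{
    ticks++;  /* a real profiler captures a sample here (see the next sketch) */
}

int
main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigprof;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    struct sigevent sev;
    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo  = SIGPROF;

    timer_t timer;
    timer_create(CLOCK_MONOTONIC, &sev, &timer);  /* wall-time mode; swap the clock for CPU time */

    struct itimerspec its = {
        .it_value    = { .tv_sec = 0, .tv_nsec = 19 * 1000 * 1000 },  /* first shot      */
        .it_interval = { .tv_sec = 0, .tv_nsec = 19 * 1000 * 1000 },  /* then every 19 ms */
    };
    timer_settime(timer, 0, &its, NULL);

    sleep(1);
    printf("received %d SIGPROFs\n", (int)ticks);
    return 0;
}
```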
very little work can be done: no new allocations (no malloc)! Pf2's signal handler is designed not to violate this: collected data is kept in a pre-allocated ring buffer, and rb_profile_thread_frames() is (mostly) safe to call there. (Scheduling: running sampling in signal handlers.)
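To illustrate these constraints, a sketch of a signal handler that only writes into a pre-allocated ring buffer; the struct layout, sizes, and the use of rb_profile_frames() for the current thread are illustrative assumptions rather than Pf2's exact code.

```c
/* Sketch: everything the handler touches is allocated before profiling starts. */
#include <signal.h>
#include <ruby/ruby.h>
#include <ruby/debug.h>

#define MAX_DEPTH 512
#define RING_SIZE 1024  /* number of sample slots, pre-allocated */

struct sample {
    int   depth;
    VALUE frames[MAX_DEPTH];
    int   lines[MAX_DEPTH];
};

static struct sample ring[RING_SIZE];  /* no malloc inside the handler */
static volatile int ring_head;

static void
on_sigprof(int sig, siginfo_t *info, void *ucontext)
{
    struct sample *slot = &ring[ring_head % RING_SIZE];

    /* rb_profile_frames() is (mostly) safe here: it only walks the VM's frame
     * pointers for the current thread and writes into caller-provided buffers. */
    slot->depth = rb_profile_frames(0, MAX_DEPTH, slot->frames, slot->lines);

    ring_head++;  /* single writer per buffer, so a plain increment suffices here */
}
```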
a good sampling rate. Higher Hz = more accuracy, but more overhead. Pf2 chooses 19 ms as the default interval. Avoid lockstep sampling, which keeps missing periodic events. (Diagram: a check() that runs every 10 ms, at 0 ms, 10 ms, 20 ms, 30 ms, illustrating lockstep with a 10 ms sampler.)
Sampling (in signal handlers) always pauses the program. Pf2's default overhead goal is <5%, so with a 19 ms interval we can spend 0.95 ms (950 microseconds) on each sampling round (5% of 19 ms = 0.95 ms). (Timeline: "Sample!" at 19.95 ms, 38.95 ms, ..., which is 5% overhead.)
Pf2 is implemented in Rust. unsafe { I thought Rust was a good language for doing this... however, most of the code needed to be unsafe { } anyway... so I ended up fighting SEGVs all day. But still, the programming experience was 100x better than C. }
state periodically. Unwinding & Aggregation: collect info about the samples (after profiling is done). Visualization: format the data into flamegraphs, call graphs, ...
make the capture process as low-overhead as possible. This info does not contain file names or line numbers; unwinding is required to restore the full stack trace. This is done after the target program completes.
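A sketch of what that post-processing step can look like: the opaque frame VALUEs captured earlier are resolved into labels, paths, and line numbers with the rb_profile_frame_* helpers. The helper name here is illustrative, and the profiler must keep the captured VALUEs marked so GC does not recycle them before this runs.

```c
/* Sketch: resolving captured frames after the target program has finished. */
#include <stdio.h>
#include <ruby/ruby.h>
#include <ruby/debug.h>

static void
print_frame(VALUE frame, int line)
{
    VALUE label = rb_profile_frame_full_label(frame);  /* e.g. "Kernel#foo" */
    VALUE path  = rb_profile_frame_path(frame);        /* e.g. "app.rb"     */

    printf("%s (%s:%d)\n",
           NIL_P(label) ? "(unknown)" : StringValueCStr(label),
           NIL_P(path)  ? "(unknown)" : StringValueCStr(path),
           line);  /* `line` comes from the lines buffer filled at sample time */
}
```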
The Ruby stack and the C stack are collected as completely separate stacks. As Ruby can call C and vice versa, the stacks must be "weaved" to create a meaningful report. (Diagram: a Ruby stack such as <main> / Integer#times / block in <main> / Kernel#foo / block (2 levels) in <main> / Kernel#my_cfunc, alongside a C stack such as main() / ruby_run_node() / rb_ec_exec_node() / rb_vm_exec() / vm_exec_core() / rb_yield() / rb_foo() / vm_call_cfunc_with_cframe().)
Program execution always starts from the native main(). Copy the C stack until we hit rb_vm_exec(), then start copying the Ruby stack. (Diagram: the same stacks, with main() through rb_vm_exec() copied first.)
When we see a cfunc (e.g. Kernel#my_cfunc), switch back to the C stack. (Diagram: the same stacks, with the weave switching from Ruby frames to C frames at the cfunc.)
We use the C function address stored in the cfunc's Callable Method Entry (CME), i.e. cme->cfunc->func, to find where to resume in the C stack. We copy CRuby headers to do this (a very fragile implementation...). (Diagram: matching frame addresses such as 0x7cabca against the C stack; non-cfunc frames have no address to match.)
Repeat this process until we consume the entire original stack. (Diagram: the fully weaved stack, interleaving Ruby frames and C frames.) A toy sketch of this weaving loop follows.
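This is a toy, self-contained sketch of the weaving loop, using strings instead of real frames. The example stacks and the rule for matching a cfunc to its native frames (by name here) are simplifications; Pf2 matches by the function address stored in the CME.

```c
/* Toy illustration only: real weaving matches cfuncs by cme->cfunc->func address. */
#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* Outermost frame first. "Kernel#my_cfunc" is a cfunc; the rest is pure Ruby. */
    const char *ruby_stack[] = {
        "<main>", "Integer#times", "block in <main>", "Kernel#my_cfunc",
    };
    const char *c_stack[] = {
        "main()", "ruby_run_node()", "rb_ec_exec_node()", "rb_vm_exec()",
        "vm_exec_core()", "rb_yield()", "vm_call_cfunc_with_cframe()", "my_cfunc()",
    };
    int n_ruby = sizeof(ruby_stack) / sizeof(ruby_stack[0]);
    int n_c    = sizeof(c_stack) / sizeof(c_stack[0]);
    int ri = 0, ci = 0;

    /* 1. Execution always starts from the native main(): copy the C stack
     *    until we hit rb_vm_exec(). */
    for (; ci < n_c; ci++) {
        printf("%s\n", c_stack[ci]);
        if (strcmp(c_stack[ci], "rb_vm_exec()") == 0) { ci++; break; }
    }

    /* 2. Copy Ruby frames; when we reach a cfunc, switch back to the C stack
     *    and emit the native frames that belong to it. */
    for (; ri < n_ruby; ri++) {
        printf("%s\n", ruby_stack[ri]);
        if (strcmp(ruby_stack[ri], "Kernel#my_cfunc") == 0) {
            while (ci < n_c && strcmp(c_stack[ci], "my_cfunc()") != 0) ci++;
            for (; ci < n_c; ci++) printf("%s\n", c_stack[ci]);
        }
    }
    /* 3. Repeat until both stacks are consumed. */
    return 0;
}
```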
state periodically. Unwinding & Aggregation: collect info about the samples (after profiling is done). Visualization: format the data into flamegraphs, call graphs, ...
collect information. Accurate scheduling is the key to controlling the balance between information and overhead. Unwinding is a heavy process and is done separately.
to evolve, its profiling features shall also continue to evolve. We need to be able to profile even weird code perfectly! Profilers are closely tied to the runtime implementation: e.g. Pf2 relies on the fact that one pthread is assigned per Ruby Thread, which breaks under M:N threads. Profilers need to catch up.
art. For example, the demand for "always-on" profiling (continuous profiling) is increasing; we'd need to keep lowering the overhead. Let's build the future together!