useful information about somebody/something so that they can give a description of them or it ("Profiler, N. (1)." Oxford Advanced Learner's Dictionary, https://www.oxfordlearnersdictionaries.com/definition/english/profiler. Accessed 6 May 2024.)
regions. Especially when the bottleneck is not in your code: e.g. spawning 50 Threads in Puma won't make your server 50x faster. You'd need to know about the GVL to guess the reason, but a GVL-aware profiler can teach you why.
misleading (it's very easy to create a misleading profiler). Maintain low overhead: the profiler must not be the bottleneck; if collecting more information incurs too much overhead, don't collect it. Be a friendly tool.
state periodically. Unwinding & Aggregation: collect info about the samples (after profiling is done). Visualization: format the data into flamegraphs, call graphs, ...
while the target program runs. Samples contain the current execution state: What Ruby method was running at that moment? What was the underlying C function? Did the Thread hold the GVL?
Sampling profiler timeline: samples are periodically captured (in this case, every 10 ms). The function running at capture time is recorded; a function that starts and ends between two captures is never recorded. (Diagram: captured stacks such as "<main>", "<main>, a", "<main>, a, b".)
outside. Many types of information must be collected: running Ruby methods, GC/GVL state, running C functions. Profiling from the inside is not impossible, but it would be horrible.
about CRuby state: the stack, Threads, Ractors, GC, ... Many aspects are not observable from Ruby code. Conversely, tools like perf won't work without special support: the Ruby stack is mere memory from perf's point of view.
Profiling support in CRuby:
rb_profile_thread_frames(): obtaining the stack trace (Ruby 2.1+ / 3.3+). The stack is represented as an opaque VALUE, which is actually a pointer to the CME (callable method entry). We abuse this later.
TracePoint API: receiving GC events, plus method entry/return events (tracing profilers could be built on this API).
Thread Events API: observing GVL state (..., restarted, waiting for GVL, stopped).
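To make these three hooks concrete, here is a minimal sketch of how a C extension might touch them. The function names, buffer size, and event selection are illustrative assumptions, not Pf2's actual code, and the APIs assume Ruby 3.3+.

```c
/* Sketch only: illustrative names and sizes, not Pf2's actual code. */
#include <ruby/ruby.h>
#include <ruby/debug.h>   /* rb_profile_thread_frames(), TracePoint C API */
#include <ruby/thread.h>  /* rb_internal_thread_add_event_hook() */

#define MAX_DEPTH 512

/* 1. rb_profile_thread_frames(): obtain the Ruby stack of a given Thread (Ruby 3.3+).
 *    Each entry written into `frames` is an opaque VALUE (a pointer to the CME). */
static int
capture_ruby_stack(VALUE thread, VALUE *frames, int *lines)
{
    return rb_profile_thread_frames(thread, 0, MAX_DEPTH, frames, lines);
}

/* 2. TracePoint C API: receive GC enter/exit events. */
static void
on_gc_event(VALUE tpval, void *data)
{
    /* record "GC was running" for the current sample window */
}

static VALUE
install_gc_tracepoint(void)
{
    VALUE tp = rb_tracepoint_new(Qnil,
                                 RUBY_INTERNAL_EVENT_GC_ENTER | RUBY_INTERNAL_EVENT_GC_EXIT,
                                 on_gc_event, NULL);
    rb_tracepoint_enable(tp);
    return tp; /* the caller must keep this VALUE reachable so GC does not collect it */
}

/* 3. Thread Events API: observe GVL-related state changes.
 *    READY = waiting for the GVL, RESUMED = acquired it, SUSPENDED = released it. */
static void
on_thread_event(rb_event_flag_t event,
                const rb_internal_thread_event_data_t *event_data,
                void *user_data)
{
    /* record the GVL state transition for the thread in event_data */
}

static void
install_thread_event_hook(void)
{
    rb_internal_thread_add_event_hook(on_thread_event,
                                      RUBY_INTERNAL_THREAD_EVENT_READY |
                                      RUBY_INTERNAL_THREAD_EVENT_RESUMED |
                                      RUBY_INTERNAL_THREAD_EVENT_SUSPENDED,
                                      NULL);
}
```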
without runtime support... ...by knowing everything about the runtime memory. If we can correctly re-interpret CRuby's entire memory, we have the same information that CRuby itself has. The rbspy profiler does exactly this: it peeks into CRuby's memory (process_vm_readv(2)) and reinterprets it using its own copy of all CRuby structs.
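For illustration, a minimal sketch of the mechanism rbspy relies on: process_vm_readv(2) copies bytes out of another process's address space without stopping it. The PID and remote address below are placeholders; a real tool locates CRuby's structures via the target's symbol table and /proc/<pid>/maps.

```c
/* Sketch: the PID and remote address are placeholders, not something rbspy hardcodes. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

int
main(void)
{
    pid_t     target_pid  = 12345;            /* placeholder: the Ruby process to inspect */
    uintptr_t remote_addr = 0x7f0000000000;   /* placeholder: address of some CRuby struct */
    unsigned char buf[256];

    struct iovec local  = { .iov_base = buf,                 .iov_len = sizeof(buf) };
    struct iovec remote = { .iov_base = (void *)remote_addr, .iov_len = sizeof(buf) };

    ssize_t n = process_vm_readv(target_pid, &local, 1, &remote, 1, 0);
    if (n < 0) {
        perror("process_vm_readv");
        return 1;
    }
    /* `buf` now holds raw bytes of the target's memory; rbspy reinterprets bytes
     * like these using its own copies of CRuby's struct definitions. */
    printf("read %zd bytes from pid %d\n", n, (int)target_pid);
    return 0;
}
```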
(PC) (with the help of libbacktrace). The PC will be used later to restore the original stack trace. This process is done per pthread, and each native stack is matched with its Ruby Thread.
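As a rough illustration of PC capture, here is a sketch using libbacktrace's backtrace_simple(); the callback names and the fixed buffer are assumptions, and error handling is minimal. The captured PCs are what later gets symbolized and matched back to Ruby Threads.

```c
/* Sketch: illustrative use of libbacktrace (link with -lbacktrace). */
#include <stdint.h>
#include <stdio.h>
#include <backtrace.h>

#define MAX_NATIVE_DEPTH 256

static uintptr_t pcs[MAX_NATIVE_DEPTH];
static int pc_count;

static int
on_pc(void *data, uintptr_t pc)
{
    if (pc_count < MAX_NATIVE_DEPTH) pcs[pc_count++] = pc;
    return 0; /* 0 = keep unwinding */
}

static void
on_error(void *data, const char *msg, int errnum)
{
    fprintf(stderr, "libbacktrace: %s (%d)\n", msg, errnum);
}

int
main(void)
{
    /* The state is created once; threaded = 1 allows sharing it between pthreads. */
    struct backtrace_state *state = backtrace_create_state(NULL, 1, on_error, NULL);

    pc_count = 0;
    backtrace_simple(state, 0, on_pc, on_error, NULL);

    /* `pcs` now holds raw program counters for the current pthread; symbolization
     * (file/function names) happens later, after profiling is done. */
    printf("captured %d native frames\n", pc_count);
    return 0;
}
```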
Scheduling: using timer_create(2) to deliver SIGPROF while a thread is using CPU; the Pf2 signal handler then captures a sample for the corresponding Ruby Thread. (Diagram: timers fire SIGPROF during the on-CPU periods of each thread.)
Time as it flows in the real world, also known as "wall-clock time". CPU time (CLOCK_PROCESS_CPUTIME_ID): the amount of CPU time consumed on behalf of the entire process. Per-kernel-thread CPU time (CLOCK_THREAD_CPUTIME_ID): this mode correctly reflects unequal workloads, and kernel threads can be linked to Ruby Threads 1:1 (when not running M:N).
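A simplified sketch of the scheduling described above, assuming Linux: a timer_create(2) timer delivers SIGPROF every 19 ms. For brevity this uses a single process-wide wall-clock timer; per-Ruby-Thread sampling as in Pf2 would instead create one timer per kernel thread (e.g. on a thread CPU-time clock).

```c
/* Sketch: one process-wide wall-clock timer; Pf2 itself uses per-thread timers.
 * Compile with -lrt on older glibc. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t ticks;

static void
on_sigprof(int sig, siginfo_t *info, void *ucontext)
{
    ticks++;  /* a real profiler captures a sample here (see the next sketch) */
}

int
main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigprof;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    struct sigevent sev;
    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo  = SIGPROF;

    timer_t timer;
    timer_create(CLOCK_MONOTONIC, &sev, &timer);  /* wall-time mode; swap the clock for CPU time */

    struct itimerspec its = {
        .it_value    = { .tv_sec = 0, .tv_nsec = 19 * 1000 * 1000 },  /* first shot      */
        .it_interval = { .tv_sec = 0, .tv_nsec = 19 * 1000 * 1000 },  /* then every 19 ms */
    };
    timer_settime(timer, 0, &its, NULL);

    sleep(1);
    printf("received %d SIGPROFs\n", (int)ticks);
    return 0;
}
```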
very little work can be done: no new allocations (no malloc)! Pf2's signal handler is designed not to violate this: collected data is kept in a pre-allocated ring buffer, and rb_profile_thread_frames() is (mostly) safe to call there. (Scheduling: running sampling in signal handlers.)
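To illustrate these constraints, a sketch of a signal handler that only writes into a pre-allocated ring buffer; the struct layout, sizes, and the use of rb_profile_frames() for the current thread are illustrative assumptions rather than Pf2's exact code.

```c
/* Sketch: everything the handler touches is allocated before profiling starts. */
#include <signal.h>
#include <ruby/ruby.h>
#include <ruby/debug.h>

#define MAX_DEPTH 512
#define RING_SIZE 1024  /* number of sample slots, pre-allocated */

struct sample {
    int   depth;
    VALUE frames[MAX_DEPTH];
    int   lines[MAX_DEPTH];
};

static struct sample ring[RING_SIZE];  /* no malloc inside the handler */
static volatile int ring_head;

static void
on_sigprof(int sig, siginfo_t *info, void *ucontext)
{
    struct sample *slot = &ring[ring_head % RING_SIZE];

    /* rb_profile_frames() is (mostly) safe here: it only walks the VM's frame
     * pointers for the current thread and writes into caller-provided buffers. */
    slot->depth = rb_profile_frames(0, MAX_DEPTH, slot->frames, slot->lines);

    ring_head++;  /* single writer per buffer, so a plain increment suffices here */
}
```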
a good sampling rate. Higher Hz = more accuracy, but more overhead. Pf2 chooses 19 ms as the default interval. Avoid lockstep sampling, which keeps missing periodic events. (Diagram: a check() that runs every 10 ms, at 0 ms, 10 ms, 20 ms, 30 ms, illustrating lockstep with a 10 ms sampler.)
Sampling (in signal handlers) always pauses the program. Pf2's default overhead goal is <5%, so with a 19 ms interval we can spend 0.95 ms (950 microseconds) on each sampling round (5% of 19 ms = 0.95 ms). (Timeline: "Sample!" at 19.95 ms, 38.95 ms, ..., which is 5% overhead.)
Pf2 is implemented in Rust. unsafe { I thought Rust was a good language for doing this... however, most of the code needed to be unsafe { } anyway... so I ended up fighting SEGVs all day. But still, the programming experience was 100x better than C. }
state periodically. Unwinding & Aggregation: collect info about the samples (after profiling is done). Visualization: format the data into flamegraphs, call graphs, ...
make the capture process as low-overhead as possible. This info does not contain file names or line numbers; unwinding is required to restore the full stack trace. This is done after the target program completes.
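A sketch of what that post-processing step can look like: the opaque frame VALUEs captured earlier are resolved into labels, paths, and line numbers with the rb_profile_frame_* helpers. The helper name here is illustrative, and the profiler must keep the captured VALUEs marked so GC does not recycle them before this runs.

```c
/* Sketch: resolving captured frames after the target program has finished. */
#include <stdio.h>
#include <ruby/ruby.h>
#include <ruby/debug.h>

static void
print_frame(VALUE frame, int line)
{
    VALUE label = rb_profile_frame_full_label(frame);  /* e.g. "Kernel#foo" */
    VALUE path  = rb_profile_frame_path(frame);        /* e.g. "app.rb"     */

    printf("%s (%s:%d)\n",
           NIL_P(label) ? "(unknown)" : StringValueCStr(label),
           NIL_P(path)  ? "(unknown)" : StringValueCStr(path),
           line);  /* `line` comes from the lines buffer filled at sample time */
}
```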
The Ruby stack and the C stack are collected as completely separate stacks. As Ruby can call C and vice versa, the stacks must be "weaved" to create a meaningful report. (Diagram: a Ruby stack such as <main> / Integer#times / block in <main> / Kernel#foo / block (2 levels) in <main> / Kernel#my_cfunc, alongside a C stack such as main() / ruby_run_node() / rb_ec_exec_node() / rb_vm_exec() / vm_exec_core() / rb_yield() / rb_foo() / vm_call_cfunc_with_cframe().)
Program execution always starts from the native main(). Copy the C stack until we hit rb_vm_exec(), then start copying the Ruby stack. (Diagram: the same stacks, with main() through rb_vm_exec() copied first.)
When we see a cfunc (e.g. Kernel#my_cfunc), switch back to the C stack. (Diagram: the same stacks, with the weave switching from Ruby frames to C frames at the cfunc.)
We use the C function address stored in the cfunc's Callable Method Entry (CME), i.e. cme->cfunc->func, to find where to resume in the C stack. We copy CRuby headers to do this (a very fragile implementation...). (Diagram: matching frame addresses such as 0x7cabca against the C stack; non-cfunc frames have no address to match.)
Repeat this process until we consume the entire original stack. (Diagram: the fully weaved stack, interleaving Ruby frames and C frames.) A toy sketch of this weaving loop follows.
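This is a toy, self-contained sketch of the weaving loop, using strings instead of real frames. The example stacks and the rule for matching a cfunc to its native frames (by name here) are simplifications; Pf2 matches by the function address stored in the CME.

```c
/* Toy illustration only: real weaving matches cfuncs by cme->cfunc->func address. */
#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* Outermost frame first. "Kernel#my_cfunc" is a cfunc; the rest is pure Ruby. */
    const char *ruby_stack[] = {
        "<main>", "Integer#times", "block in <main>", "Kernel#my_cfunc",
    };
    const char *c_stack[] = {
        "main()", "ruby_run_node()", "rb_ec_exec_node()", "rb_vm_exec()",
        "vm_exec_core()", "rb_yield()", "vm_call_cfunc_with_cframe()", "my_cfunc()",
    };
    int n_ruby = sizeof(ruby_stack) / sizeof(ruby_stack[0]);
    int n_c    = sizeof(c_stack) / sizeof(c_stack[0]);
    int ri = 0, ci = 0;

    /* 1. Execution always starts from the native main(): copy the C stack
     *    until we hit rb_vm_exec(). */
    for (; ci < n_c; ci++) {
        printf("%s\n", c_stack[ci]);
        if (strcmp(c_stack[ci], "rb_vm_exec()") == 0) { ci++; break; }
    }

    /* 2. Copy Ruby frames; when we reach a cfunc, switch back to the C stack
     *    and emit the native frames that belong to it. */
    for (; ri < n_ruby; ri++) {
        printf("%s\n", ruby_stack[ri]);
        if (strcmp(ruby_stack[ri], "Kernel#my_cfunc") == 0) {
            while (ci < n_c && strcmp(c_stack[ci], "my_cfunc()") != 0) ci++;
            for (; ci < n_c; ci++) printf("%s\n", c_stack[ci]);
        }
    }
    /* 3. Repeat until both stacks are consumed. */
    return 0;
}
```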
state periodically. Unwinding & Aggregation: collect info about the samples (after profiling is done). Visualization: format the data into flamegraphs, call graphs, ...
collect information. Accurate scheduling is the key to controlling the balance between information and overhead. Unwinding is a heavy process and is done separately.
to evolve, its profiling features shall also continue to evolve. We need to be able to profile even weird code perfectly! Profilers are closely tied to the runtime implementation: e.g. Pf2 relies on the fact that one pthread is assigned per Ruby Thread, which breaks under M:N threads. Profilers need to catch up.
art. For example, the demand for "always-on" profiling (continuous profiling) is increasing; we'd need to keep lowering the overhead. Let's build the future together!