Slide 1

Slide 1 text

The depths of pro fi ling Ruby Daisuke Aritomo (@osyoyu)

Slide 2

Slide 2 text

pp @osyoyu Daisuke Aritomo (osyoyu) Member of #rubykaigiNOC (Wi-Fi team) Loving ϙʔΫۄࢠ͓ʹ͗Γ Probably in the top 1% among Rubyists pork tamago onigiri

Slide 3

Slide 3 text

A D We're the Hydration Sponsor ਫɹ෼ɹิɹڅɹεɹϙɹϯɹαɹʔ Meet SmartBank and get FREE DRINKS outside the Large Hall!

Slide 4

Slide 4 text

profiler

Slide 5

Slide 5 text

profiler noun /ˈprəʊfaɪlə(r)/ a person whose job is to collect useful information about somebody/something so that they can give a description of them or it "Profiler, N. (1)." Oxford Advanced Learner's Dictionary, https://www.oxfordlearnersdictionaries.com/definition/english/profiler. Accessed 6 May 2024.

Slide 6

Slide 6 text

(Ruby) profilers 101 A tool injected into a target program and analyzes the performance Reveals the hotspot The code block that is making the program slow The "flamegraph" visualization. "Heavy" methods are shown as long bars. Flame Graph Reset Zoom Search test_bu.. x.. d.. s.. gener.. S.. _.. unary_.. execute_command_internal x.. bash shell_expand_word_list red.. execute_builtin __xst.. __dup2 main _.. do_redirection_internal do_redirections do_f.. ext4_f.. __GI___li.. __.. expand_word_list_internal unary_.. expan.. tra.. path.. execute_command two_ar.. cleanup_redi.. execute_command do_sy.. trac.. __GI___libc_.. posixt.. execute_builtin_or_function _.. [unknown] tra.. expand_word_internal _.. tracesys ex.. _.. vfs_write c.. test_co.. e.. u.. v.. SY.. i.. execute_simple_command execute_command_internal p.. gl.. s.. sys_write __GI___l.. expand_words d.. __libc_start_main d.. do_sync.. generi.. e.. reader_loop sy.. execute_while_command do_redirecti.. sys_o.. g.. __gene.. __xsta.. tracesys c.. _.. execute_while_or_until

Slide 7

Slide 7 text

Announcing Pf2, a new Ruby profiler I've been building a new profiler! Pf2 is focused on analyzing Ruby programs from various point of views github.com/osyoyu/pf2

Slide 8

Slide 8 text

Finding slow methods Pf2 can find time-consuming methods and display them in a flamegraph (just like other Ruby profilers)

Slide 9

Slide 9 text

Analyzing multithreaded programs A 4-threaded program

Slide 10

Slide 10 text

Analyzing multithreaded programs When the Thread did actual work A 4-threaded program

Slide 11

Slide 11 text

Thread statistics Indicates time in GC and time wasted waiting for the CPU lock (GVL) In Ruby, only 1 Thread can use the CPU simultaneously Useful information for adjusting # of Threads

Slide 12

Slide 12 text

Diving into C extensions With Ruby-only profiling, it's hard to explore bottlenecks which live in C extensions

Slide 13

Slide 13 text

Diving into C extensions read_query_result (Receiving the result = parsing results is slow) ppoll (Waiting for query exec = heavy work on the server) do_send_query (Sending the SQL = the network is slow)

Slide 14

Slide 14 text

Finding "hidden" Ruby methods

Slide 15

Slide 15 text

Finding "hidden" Ruby methods rb_hash_aref = Hash#[] not appearing in Ruby stack due to optimization

Slide 16

Slide 16 text

Mutlple visualizations Pf2 offers several visualization modes (through Firefox Profiler) Call Tree (for finding heavy methods) Flamegraph Inverted call tree (for finding methods called repeatedly)

Slide 17

Slide 17 text

Why so many angles? Bottlenecks often live in unexpected, unguessable regions Especially when the bottleneck is not in your code e.g. Spawning 50 Threads in puma won't make your server 50x faster You'd need to know about the GVL to guess the reason, but a GVL-aware profiler can teach you

Slide 18

Slide 18 text

How to build this? Building a profiler is not like writing a normal Ruby program Let's see how to do this!

Slide 19

Slide 19 text

The depths of pro fi ling Ruby Daisuke Aritomo (@osyoyu)

Slide 20

Slide 20 text

A profiler's mission Be informative as possible And never be misleading (it's very easy to create a misleading profiler) Maintain low overhead The profiler must not be the bottleneck If collecting more information incurs too much overhead, don't collect it Be a friendly tool

Slide 21

Slide 21 text

Pf2's life cycle Sample Collection Record the target program's exec state periodically Unwinding & Aggregation Collect info about samples (after profiling is done) Visualization Format the data into flamegraphs, call graphs, ...

Slide 22

Slide 22 text

Pf2's life cycle Sample Collection Record the target program's exec state periodically Unwinding & Aggregation Collect info about samples (after profiling is done) Visualization Format the data into flamegraphs, call graphs, ...

Slide 23

Slide 23 text

Sample collection A sampling profiler collects samples every few milliseconds while the target program runs Samples contain the current execution state What Ruby method was running now? What was the underlying C function? Did the Thread have the GVL now?

Slide 24

Slide 24 text

a() c() b() 0ms 10ms 20ms 30ms c() is never recorded since it starts and ends between captures , a , a, b The function running at capture time will be recorded Sampling profiler timeline Samples are periodically captured (in this case, every 10 ms) Samples

Slide 25

Slide 25 text

Collecting samples 💥 Pf2 observes the execution state from the outside Many types of information must be collected Running Ruby methods GC/GVL state Running C functions Profiling from the inside is not impossible, but would be horrible

Slide 26

Slide 26 text

Querying Ruby about Ruby The CRuby runtime knows the best about CRuby state The stack, Threads, Ractors, GC ... Many aspects are not observable from Ruby code Contrarily, tools like perf won't work without special support The Ruby stack is mere memory from perf's POV

Slide 27

Slide 27 text

Profiling support in CRuby Pf2 Profiler rb_profile_thread_frames() Obtaining the stack trace TracePoint API Receiving GC events Thread Events API Observing GVL state

Slide 28

Slide 28 text

Takes a Thread as an arg, and returns its stack trace Ruby 2.1+/3.3+ The stack is represented as a opaque VALUE This is actually a pointer to the CME (callable method entry) We abuse this later rb_profile_thread_frames() Obtaining the stack trace TracePoint API Receiving GC events Thread Events API Observing GVL state Profiling support in CRuby

Slide 29

Slide 29 text

Capable of reporting many "events", including GC state RUBY_INTERNAL_EVENT_GC_ {START,END_MARK,END_SWEEP} (Tracing profilers could be built on this API) Method entry/return events Profiling support in CRuby rb_profile_thread_frames() Obtaining the stack trace TracePoint API Receiving GC events Thread Events API Observing GVL state

Slide 30

Slide 30 text

Calls a hook when the Thread state changes Started, suspended, restarted, waiting for GVL, stopped Profiling support in CRuby rb_profile_thread_frames() Obtaining the stack trace TracePoint API Receiving GC events Thread Events API Observing GVL state

Slide 31

Slide 31 text

(Off-topic: The alternative strategy) It is possible to do sampling without runtime support... ... by knowing everything about the runtime memory If we can correctly re-interpret CRuby's entire memory, we have the same information which CRuby has The rbspy profiler does exactly this. It peeks into CRuby memory (process_vm_readv(2)), and reinterprets it using its copy of all CRuby structs

Slide 32

Slide 32 text

Capturing C-level stack traces Needed to profile C extensions C-level stacks are managed by the OS kernel (as opposed to CRuby-managed Ruby stacks) A completely different approach is needed

Slide 33

Slide 33 text

Capturing C-level stack traces Pf2 records the current program counter (PC) (with the help of libbacktrace) The PC will be used later to restore the original stack trace This process is done pthread-wise, and will be matched with Ruby Threads

Slide 34

Slide 34 text

a() c() b() 0ms 10ms 20ms 30ms Sampling profiler timeline Scheduling

Slide 35

Slide 35 text

Scheduling Controlling when to collect samples Pf2's main strategy is using OS-level timers Linux provides several timers through timer_create(2)

Slide 36

Slide 36 text

0ms 10ms 20ms 30ms Ruby Thread 1 Ruby Thread 2 SIGPROF SIGPROF Scheduling: Using timer_create(2) Using CPU Using CPU Using CPU Pf2 signal handler: Capture a sample for Ruby Thread 1 Pf2 signal Handler

Slide 37

Slide 37 text

Scheduling: Supported definitions of time Real time (CLOCK_MONOTONIC in timer_create(2)) Time as it flows in the real world. Also known as "wall-clock time" CPU time (CLOCK_PROCESS_CPUTIME_ID) The amount of time which the CPU consumed in behalf of the entire process Per-kernel thread CPU time (CLOCK_THREAD_CPUTIME_ID) This mode correctly reflects non-equal workloads Kernel threads can be linked to Ruby Threads 1:1 (when not +MN)

Slide 38

Slide 38 text

The sampling happens in the signal handler In signal handlers, very little operation can be done No new allocations (no malloc)! Pf2's signal handler is designed not to violate this Collected data is kept in a pre-allocated ring buffer rb_profile_thread_frames() is (mostly) safe Scheduling: Running sampling in signal handlers

Slide 39

Slide 39 text

Scheduling: Choosing the sampling rate It is important to choose a good sampling rate Higher Hz = More accuracy, more overhead Pf2 chooses 19 ms as the default Avoid lockstep sampling (missing periodical events) check() 0ms 10ms 20ms 30ms check() check()

Slide 40

Slide 40 text

Scheduling: Controlling sampling overhead 0ms 19ms 38ms 57ms Ruby Thread Sample! This is 5% overhead Sample! Sample! 19.95ms 38.95ms Sampling (signal handlers) always pauses the program Pf2 default's overhead goal is <5%, so we can use 0.95 ms (950 microseconds) for each round

Slide 41

Slide 41 text

SYSTEMS PROGRAMMING!!! Building a profiler is like systems programming So Pf2 is implemented in Rust unsafe { I thought Rust was a good language for do this... however most code needed to be unsafe { } anyways... So I ended up fighting with SEGVs all day But still, the programming experience was 100x better than C }

Slide 42

Slide 42 text

Pf2's life cycle Sample Collection Record the target program's exec state periodically Unwinding & Aggregation Collect info about samples (after profiling is done) Visualization Format the data into flamegraphs, call graphs, ...

Slide 43

Slide 43 text

Unwinding & Aggregation Pf2 collects minimum information during sampling to make the capture process low-overhead as possible This info does not contain file names and line numbers Unwinding is required to restore the full stack trace This is done after the target program completes

Slide 44

Slide 44 text

Aggregation: Restoring func information Ruby Stack Method names can be obtained through C APIs (rb_profile_frame_full_label) The same for line numbers C stack addr2line, DWARF, etc.

Slide 45

Slide 45 text

Aggregation: Weaving Ruby and C stacks The Ruby stack and C stack are collected as completely separate stacks As Ruby can call C and vice versa, the stacks must be "weaved" to create a meaningful report Integer#times block in Kernel#foo block (2 levels) in main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() vm_exec_core() rb_yield() rb_foo() vm_call_cfunc_with_cfra me() main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() Integer#times block in Kernel#my_cfunc vm_callc_cfunc_with_cf vm_exec_core() rb_yield() Kernel#foo block (2 levels) in

Slide 46

Slide 46 text

Weaving Ruby stack and C stacks Integer#times block in Kernel#my_cfunc block (2 levels) in Kernel#foo block (2 levels) in main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() vm_exec_core() rb_yield() rb_foo() main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() Program execution always starts from Native main() Copy the C stack until we hit rb_vm_exec(), then start copying Ruby stack

Slide 47

Slide 47 text

Weaving Ruby stack and C stack Integer#times block in Kernel#my_cfunc block (2 levels) in Kernel#foo block (2 levels) in main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() vm_exec_core() rb_yield() rb_foo() main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() Integer#times block in Kernel#my_cfunc When we see a cfunc, switch back to C stack

Slide 48

Slide 48 text

Weaving Ruby stack and C stack Integer#times block in Kernel#my_cfunc block (2 levels) in Kernel#foo block (2 levels) in main() ... ... vm_callc_cfunc_with_cframe() my_cfunc() vm_exec_core() ... We use the C func address stored in Cfunc's Callable Method Entry (CME) (cme->cfunc->func) We copy CRuby headers to do this (very fragile implementation...) 0x7cabca (non-cfunc) (non-cfunc) 0x7cabca 0x7c0009 0x7c0010

Slide 49

Slide 49 text

Weaving Ruby stack and C Stack Integer#times block in Kernel#my_cfunc block (2 levels) in Kernel#foo block (2 levels) in main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() vm_exec_core() rb_yield() rb_foo() main() ruby_run_node() rb_ec_exec_node() rb_vm_exec() Integer#times block in Kernel#my_cfunc vm_callc_cfunc_with_cframe() vm_exec_core() rb_yield() Kernel#foo block (2 levels) in Repeat this process until we consume the entire original stack

Slide 50

Slide 50 text

Pf2's life cycle Sample Collection Record the target program's exec state periodically Unwinding & Aggregation Collect info about samples (after profiling is done) Visualization Format the data into flamegraphs, call graphs, ...

Slide 51

Slide 51 text

Visualization Good visualization gives us many ideas However, writing graphical code is very different from writing profiling code Pf2 decided to use existing visualizers The Firefox Profiler (profiler.firefox.com)

Slide 52

Slide 52 text

Wrapping up Profilers cooperate with runtimes (CRuby, OS kernel) to collect information Accurate scheduling is the key to controlling the balance between information and overhead Unwinding is a heavy process and is done separately

Slide 53

Slide 53 text

The future of Ruby pro fi ling

Slide 54

Slide 54 text

The future of Ruby profiling As long as Ruby continues to evolve, its profiling features shall also continue to evolve We need to be able to perfectly profile weird code! Profilers are closely tied to the runtime implementation e.g. Pf2 relies on the fact that 1 pthread is assigned per 1 Ruby Thread, which is broken when +MN Profilers need to catch up

Slide 55

Slide 55 text

The future of Ruby profiling Since profiling is so tied to its target, maybe more of the profiler should be moved into Ruby Node.js, Go (pprof) has the profiler fully integrated into the runtime

Slide 56

Slide 56 text

The future of Ruby profiling Software profiling is a never-complete art For example, the demand on "Always-on" profiling (continuous profiling) is increasing We'd need to keep lowering the overhead Let's build the future together!

Slide 57

Slide 57 text

Acknowledgements Work on Pf2 was supported by The Ruby Association 2023 Grant Program k0kubun san, my mentor ko1 san and mame san, as advisors

Slide 58

Slide 58 text

Try out Pf2 & feedback please! github.com/osyoyu/pf2 ϙʔΫۄࢠ͓ʹ͗Γ ΋ ΑΖ͘͠

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

May 15th - 17th, 2024 NAHA CULTURAL ARTS THEATER NAHArt, Okinawa, Japan Thank you!

Slide 61

Slide 61 text

No content