0x.tools with eBPF beta release and demo

0x.tools with eBPF Tanel Poder tanelpoder.com

0x.tools v2 beta nerd-launch :-)

Let's sample thread states!
System-wide utilization metrics vs thread state sampling

Linux Thread State Sampling method, implementation & tools
• Demo of what you get (all free & open source)
• How does it work?
• FAQ & future plans

sudo xcapture-bpf (raw output)

=== Active Threads ===============================================================================================
timestamp               | st | flags    | tid    | pid    | username | comm            | syscall | offcpu_ustack |
------------------------------------------------------------------------------------------------------------------
2024-06-24 01:08:08.776 | RQ | 4210688  | 133703 | 133703 | oracle   | oracle_133703_l | read    | 39219         |
2024-06-24 01:08:08.776 | R  | 4210688  | 133141 | 133141 | oracle   | oracle_133141_l | read    | 59492         |
2024-06-24 01:08:08.776 | R  | 4210688  | 133281 | 133281 | oracle   | oracle_133281_l | read    | 127357        |
2024-06-24 01:08:08.776 | R  | 4210688  | 134311 | 134311 | oracle   | oracle_134311_l | read    | 84414         |
2024-06-24 01:08:08.776 | R  | 4210688  | 134299 | 134299 | oracle   | oracle_134299_l | read    | 12566         |
2024-06-24 01:08:08.776 | R  | 4210688  | 133812 | 133812 | oracle   | oracle_133812_l | read    | 61924         |
2024-06-24 01:08:08.776 | R  | 4210688  | 134031 | 134031 | oracle   | oracle_134031_l | read    | 126319        |
2024-06-24 01:08:08.776 | R  | 4210688  | 133863 | 133863 | oracle   | oracle_133863_l | read    | 9540          |
2024-06-24 01:08:08.776 | R  | 4210688  | 134136 | 134136 | oracle   | oracle_134136_l | read    | 47853         |
2024-06-24 01:08:08.776 | R  | 4210688  | 133423 | 133423 | oracle   | oracle_133423_l | read    | 39614         |
2024-06-24 01:08:08.776 | R  | 4210688  | 134185 | 134185 | oracle   | oracle_134185_l | read    | 51890         |
2024-06-24 01:08:08.776 | R  | 4210688  | 134431 | 134431 | oracle   | oracle_134431_l | read    | 12124         |
2024-06-24 01:08:08.776 | R  | 4210688  | 133797 | 133797 | oracle   | oracle_133797_l | read    | 48065         |
2024-06-24 01:08:08.776 | R  | 4210688  | 134023 | 134023 | oracle   | oracle_134023_l | read    | 44375         |
2024-06-24 01:08:08.776 | R  | 4210688  | 133944 | 133944 | oracle   | oracle_133944_l | read    | 23243         |
2024-06-24 01:08:08.776 | R  | 4210688  | 133544 | 133544 | oracle   | oracle_133544_l | read    | 102337        |
2024-06-24 01:08:08.776 | R  | 4210688  | 134359 | 134359 | oracle   | oracle_134359_l | read    | 51969         |
2024-06-24 01:08:08.776 | R  | 4210688  | 133728 | 133728 | oracle   | oracle_133728_l | read    | 99119         |
2024-06-24 01:08:08.776 | R  | 69238880 | 11338  | 11338  | root     | kworker/131:2   | -       | 80415         |
2024-06-24 01:08:08.776 | R  | 4210688  | 134115 | 134115 | oracle   | oracle_134115_l | read    | 60264         |

There will be many more columns!
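Because the raw output is plain pipe-delimited text, it is easy to post-process. The following is a minimal sketch, assuming the nine columns shown on this slide; `parse_rows` and `HEADER` are hypothetical names for illustration, not part of 0x.tools.

```python
from collections import Counter

# Column names as they appear in the raw output on this slide.
HEADER = "timestamp | st | flags | tid | pid | username | comm | syscall | offcpu_ustack"


def parse_rows(lines, header=HEADER):
    """Turn pipe-delimited sample lines into dicts keyed by column name."""
    columns = [c.strip() for c in header.split("|")]
    rows = []
    for line in lines:
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != len(columns):
            continue  # skip banners, separator lines, and truncated rows
        rows.append(dict(zip(columns, cells)))
    return rows


sample_lines = [
    "2024-06-24 01:08:08.776 | RQ | 4210688 | 133703 | 133703 | oracle | oracle_133703_l | read | 39219 |",
    "2024-06-24 01:08:08.776 | R | 4210688 | 133141 | 133141 | oracle | oracle_133141_l | read | 59492 |",
    "2024-06-24 01:08:08.776 | R | 69238880 | 11338 | 11338 | root | kworker/131:2 | - | 80415 |",
    "-------------------------------------------",
]

rows = parse_rows(sample_lines)
# Count samples per (state, user, syscall) combination.
by_activity = Counter((r["st"], r["username"], r["syscall"]) for r in rows)
for key, count in by_activity.most_common():
    print(count, key)
```

Grouping by other columns (for example `comm` or `offcpu_ustack`) only requires changing the tuple passed to `Counter`.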

sudo xtop

=== Active Threads ==========================================================================
samples | avg_threads | st | username | comm       | syscall  | offcpu_u | offcpu_k | sch
---------------------------------------------------------------------------------------------
12.00   | 2.40        | R  | mysql    | connection | -        | 53698    | 50735    | __-
1.00    | 0.20        | R  | mysql    | connection | recvfrom | 53698    | 50735    | __-
1.00    | 0.20        | R  | tanel    | sysbench   | -        | 24801    | 46538    | __-
1.00    | 0.20        | R  | mysql    | connection | ppoll    | 53698    | 50735    | __-
1.00    | 0.20        | R  | mysql    | connection | sendto   | 53698    | 50735    | __-
1.00    | 0.20        | R  | tanel    | sysbench   | recvfrom | 42407    | 46538    | __-
1.00    | 0.20        | R  | mysql    | connection | -        | 2360     | 50735    | __-
1.00    | 0.20        | R  | mysql    | connection | -        | 6808     | 50735    | __-
1.00    | 0.20        | R  | mysql    | connection | -        | 53698    | 46538    | __-
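In the output above, every row's `samples` value is exactly five times its `avg_threads` value (12.00 / 2.40 = 5, 1.00 / 0.20 = 5). That is consistent with dividing each group's sample count by the number of sampling passes in the reporting window. The sketch below illustrates that aggregation under this assumption; it is inferred from the numbers on the slide, not taken from the xtop source, and `summarize` is a hypothetical name.

```python
from collections import Counter


def summarize(samples, group_cols, num_passes):
    """Group thread-state samples and derive the average number of active threads.

    Assumption: avg_threads = samples_in_group / number_of_sampling_passes.
    """
    counts = Counter(tuple(s[c] for c in group_cols) for s in samples)
    return [
        {"samples": n, "avg_threads": n / num_passes, **dict(zip(group_cols, key))}
        for key, n in counts.most_common()
    ]


# Made-up samples shaped like the first two xtop rows above.
samples = (
    [{"st": "R", "username": "mysql", "comm": "connection", "syscall": "-"}] * 12
    + [{"st": "R", "username": "mysql", "comm": "connection", "syscall": "recvfrom"}]
)

report = summarize(samples, ["st", "username", "comm", "syscall"], num_passes=5)
for row in report:
    print(f'{row["samples"]:7.2f} | {row["avg_threads"]:11.2f} | {row["syscall"]}')
```

Read this way, an `avg_threads` of 2.40 means that, on average, 2.4 threads were in that state at each sampling pass.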

StackTiles
• A new simple way to compress info on a terminal screen
• Structured
• Readable
• Navigable (search & lookup)
• Not fancy, but practical
• ... for people who work on command line

StackTiles (giant nerd mode)

=== Active Threads ===
timestamp               | st | flags   | tid    | pid    | username | comm            | syscall | offcpu_ustack | offcpu_kstack | profile_ustack | profile_kstack | in_sched_waking | in_sched_wakeup | waker_tid
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2024-06-24 01:03:15.551 | R  | 4210688 | 127994 | 127994 | oracle   | oracle_127994_l | read    | 63287         | 110001        | -12            | -14            | True            | False           | 127992
2024-06-24 01:03:15.551 | R  | 4210688 | 128568 | 128568 | oracle   | oracle_128568_l | read    | 122640        | 110001        | -12            | -14            | True            | False           | 128566
2024-06-24 01:03:15.551 | R  | 4210688 | 129235 | 129235 | oracle   | oracle_129235_l | read    | 96273         | 110001        | 46265          | -14            | True            | False           | 129233
2024-06-24 01:03:15.551 | RQ | 4210752 | 127896 | 126484 | tanel    | UserSession347  | write   | 42723         | 79814         | 127544         | -14            | False           | False           | 127898
2024-06-24 01:03:15.551 | RQ | 4210752 | 128834 | 126484 | tanel    | UserSession652  | write   | 42723         | 79814         | 98729          | -14            | False           | False           | 128836
2024-06-24 01:03:15.551 | R  | 4210688 | 128490 | 128490 | oracle   | oracle_128490_l | read    | -12           | 82885         | 93790          | -14            | False           | False           | 128488
2024-06-24 01:03:15.551 | R  | 4210688 | 127324 | 127324 | oracle   | oracle_127324_l | read    | 5882          | 110001        | 62996          | -14            | False           | False           | 127322
2024-06-24 01:03:15.551 | R  | 4210688 | 128200 | 128200 | oracle   | oracle_128200_l | read    | 70293         | 110001        | 9415           | -14            | True            | False           | 128198
2024-06-24 01:03:15.551 | R  | 4210688 | 126814 | 126814 | oracle   | oracle_126814_l | read    | 58328         | 110001        | 9100           | -14            | True            | False           | 126812
2024-06-24 01:03:15.551 | R  | 4210688 | 128836 | 128836 | oracle   | oracle_128836_l | read    | 119055        | 110001        | 98729          | -14            | True            | False           | 128836
2024-06-24 01:03:15.551 | R  | 4210688 | 129667 | 129667 | oracle   | oracle_129667_l | read    | 110383        | 110001        | 122449         | -14            | True            | False           | 129665

---------------------------------------------------------------------------------------------------------------
kstack 7454                | kstack 9826        | kstack 13991    | kstack 26933       | kstack 32123         |
                           |                    |                 |                    |                      |
asm_sysvec_reschedule_ipi  | asm_exc_page_fault | ksys_write      | __do_sys_getrusage | __x64_sys_semtimedop |
irqentry_exit_to_user_mode | exc_page_fault     | vfs_write       | getrusage          | do_semtimedop        |
exit_to_user_mode_prepare  | do_user_addr_fault | sock_write_iter | mmput              | schedule_timeout     |
exit_to_user_mode_loop     |                    |                 | __cond_resched     | schedule             |
schedule                   |                    |                 | __schedule         | __schedule           |
__schedule                 |                    |                 | handshake_exit     | handshake_exit       |
handshake_exit             |                    |                 | handshake_exit     | handshake_exit       |
handshake_exit             |                    |                 |                    |                      |
---------------------------------------------------------------------------------------------------------------
kstack 32426            | kstack 35640               | kstack 37329   | kstack 41759                 |
                        |                            |                |                              |
ksys_write              | asm_sysvec_irq_work        | kthread        | ksys_write                   |
vfs_write               | irqentry_exit_to_user_mode | worker_thread  | vfs_write                    |
sock_write_iter         | exit_to_user_mode_prepare  | schedule       | sock_write_iter              |
tcp_sendmsg             | exit_to_user_mode_loop     | __schedule     | tcp_sendmsg                  |
tcp_sendmsg_locked      | schedule                   | handshake_exit | tcp_sendmsg_locked           |
sk_stream_alloc_skb     | __schedule                 | handshake_exit | __tcp_push_pending_frames    |
mem_cgroup_charge_skmem | handshake_exit             |                | tcp_write_xmit               |
try_charge_memcg        | handshake_exit             |                | __tcp_transmit_skb           |
page_counter_try_charge |                            |                | __ip_queue_xmit              |
                        |                            |                | ip_finish_output2            |
                        |                            |                | __local_bh_enable_ip         |
                        |                            |                | do_softirq                   |
                        |                            |                | __do_softirq                 |
                        |                            |                | net_rx_action                |
                        |                            |                | __napi_poll                  |
                        |                            |                | process_backlog              |
                        |                            |                | __netif_receive_skb_one_core |

StackTiles

---------------------------------------------------------------------------------------------------------------
kstack 7454                | kstack 9826        | kstack 13991    | kstack 26933       | kstack 32123         |
                           |                    |                 |                    |                      |
asm_sysvec_reschedule_ipi  | asm_exc_page_fault | ksys_write      | __do_sys_getrusage | __x64_sys_semtimedop |
irqentry_exit_to_user_mode | exc_page_fault     | vfs_write       | getrusage          | do_semtimedop        |
exit_to_user_mode_prepare  | do_user_addr_fault | sock_write_iter | mmput              | schedule_timeout     |
exit_to_user_mode_loop     |                    |                 | __cond_resched     | schedule             |
schedule                   |                    |                 | __schedule         | __schedule           |
__schedule                 |                    |                 | handshake_exit     | handshake_exit       |
handshake_exit             |                    |                 | handshake_exit     | handshake_exit       |
handshake_exit             |                    |                 |                    |                      |
---------------------------------------------------------------------------------------------------------------
kstack 32426            | kstack 35640               | kstack 37329   | kstack 41759                 |
                        |                            |                |                              |
ksys_write              | asm_sysvec_irq_work        | kthread        | ksys_write                   |
vfs_write               | irqentry_exit_to_user_mode | worker_thread  | vfs_write                    |
sock_write_iter         | exit_to_user_mode_prepare  | schedule       | sock_write_iter              |
tcp_sendmsg             | exit_to_user_mode_loop     | __schedule     | tcp_sendmsg                  |
tcp_sendmsg_locked      | schedule                   | handshake_exit | tcp_sendmsg_locked           |
sk_stream_alloc_skb     | __schedule                 | handshake_exit | __tcp_push_pending_frames    |
mem_cgroup_charge_skmem | handshake_exit             |                | tcp_write_xmit               |
try_charge_memcg        | handshake_exit             |                | __tcp_transmit_skb           |
page_counter_try_charge |                            |                | __ip_queue_xmit              |
                        |                            |                | ip_finish_output2            |
                        |                            |                | __local_bh_enable_ip         |
                        |                            |                | do_softirq                   |
                        |                            |                | __do_softirq                 |
                        |                            |                | net_rx_action                |
                        |                            |                | __napi_poll                  |
                        |                            |                | process_backlog              |
                        |                            |                | __netif_receive_skb_one_core |
                        |                            |                | ip_local_deliver_finish      |
                        |                            |                | ip_protocol_deliver_rcu      |
                        |                            |                | tcp_v4_rcv                   |
                        |                            |                | tcp_v4_do_rcv                |
                        |                            |                | tcp_rcv_established          |
                        |                            |                | sock_def_readable            |
                        |                            |                | _raw_spin_unlock_irqrestore  |
---------------------------------------------------------------------------------------------------------------
kstack 45647                 | kstack 48081                 | kstack 56466       | kstack 65895              |
                             |                              |                    |                           |
syscall_enter_from_user_mode | ksys_write                   | __do_sys_getrusage | kthread                   |
                             | vfs_write                    | getrusage          | worker_thread             |
                             | sock_write_iter              |                    | schedule                  |
                             | tcp_sendmsg                  |                    | __schedule                |
                             | tcp_sendmsg_locked           |                    | finish_task_switch.isra.0 |
                             | __tcp_push_pending_frames    |                    |                           |
                             | tcp_write_xmit               |                    |                           |
                             | __tcp_transmit_skb           |                    |                           |
                             | __ip_queue_xmit              |                    |                           |
                             | ip_finish_output2            |                    |                           |
                             | __local_bh_enable_ip         |                    |                           |
                             | do_softirq                   |                    |                           |
                             | __do_softirq                 |                    |                           |
                             | net_rx_action                |                    |                           |
                             | __napi_poll                  |                    |                           |
                             | process_backlog              |                    |                           |
                             | __netif_receive_skb_one_core |                    |                           |
                             | ip_local_deliver_finish      |                    |                           |
                             | ip_protocol_deliver_rcu      |                    |                           |
                             | tcp_v4_rcv                   |                    |                           |
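The tile layout itself is simple to reproduce: each stack becomes a column headed by its stack ID, and several columns are printed side by side. The sketch below shows one way to do this in Python. It is an illustration of the idea, not the 0x.tools implementation; `render_tiles` and `tiles_per_row` are hypothetical names.

```python
from itertools import zip_longest


def render_tiles(stacks, tiles_per_row=3):
    """Render {stack_id: [frame, ...]} as side-by-side text columns ("tiles")."""
    items = sorted(stacks.items())
    out = []
    for start in range(0, len(items), tiles_per_row):
        group = items[start:start + tiles_per_row]
        # Each column: a title cell, one blank cell, then the frames.
        columns = [[f"kstack {sid}", "", *frames] for sid, frames in group]
        widths = [max(len(cell) for cell in col) for col in columns]
        out.append("-" * (sum(widths) + 3 * len(widths) - 1))
        # Shorter stacks are padded with empty cells at the bottom.
        for cells in zip_longest(*columns, fillvalue=""):
            out.append(" | ".join(c.ljust(w) for c, w in zip(cells, widths)) + " |")
    return "\n".join(out)


stacks = {
    13991: ["ksys_write", "vfs_write", "sock_write_iter"],
    26933: ["__do_sys_getrusage", "getrusage", "mmput", "__cond_resched"],
}
text = render_tiles(stacks)
print(text)
```

Because each stack is printed once and referenced elsewhere by its numeric ID, the sample table stays narrow while the full stacks remain one search away.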

How does it work? Explaining mostly the concept here

How does it work?!
• Two halves
  • Instrumenter – the BPF program (xcapture-bpf.c)
  • Sampler – currently Python/BCC (xcapture-bpf)

Populating & sampling the thread state "array"

[Diagram: a timeline with threads tid 10, tid 11 and tid 42. Below it, BPF_HASH(syscall_id) holds one slot per thread (10, 11, 42, ... N); tid 10 writes into its slot.]

tracepoint:raw_syscalls:sys_enter {
    @syscall_id[tid] = args->id;
}

Populating & sampling the thread state "array"

[Diagram: the same timeline. BPF_HASH(syscall_id) and a second map, BPF_HASH(syscall_ustack), each hold one slot per thread (10, 11, 42, ... N); tid 11 writes into its slot.]

tracepoint:raw_syscalls:sys_enter {
    @syscall_id[tid] = args->id;
}

Populating & sampling the thread state "array"

[Diagram: the same timeline. tid 42 repeatedly writes into its slot in BPF_HASH(syscall_id) and BPF_HASH(syscall_ustack).]

tracepoint:raw_syscalls:sys_enter {
    @syscall_id[tid] = args->id;
}

We are not tracing, logging, appending all events.
We update, overwrite the current, latest action in custom state arrays ...

Populating & sampling the thread state "array"

[Diagram: the same timeline. At each sampling point, the slots 10, 11, 42, ... N of BPF_HASH(syscall_id) and BPF_HASH(syscall_ustack) are read out together.]

tracepoint:raw_syscalls:sys_enter {
    @syscall_id[tid] = args->id;
}

interval:hz:1 {
    print(@SAMPLE_TIME);
    print(@syscall_id);
}

A separate, independent program samples the state arrays to userspace, at its own chosen frequency and with its own filter rules.

Populating & sampling the thread state "array"

[Diagram: the same timeline. At each sampling point, the slots 10, 11, 42, ... N of BPF_HASH(syscall_id) and BPF_HASH(syscall_ustack) are read out together.]

tracepoint:raw_syscalls:sys_enter {
    @syscall_id[tid] = args->id;
}

interval:hz:1 {
    print(@SAMPLE_TIME);
    print(@syscall_id);
}

The sampler can be an eBPF program (bpftrace, bcc, libbpf) or a userspace agent that reads the maps' pseudofiles.
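The "overwrite, then sample" idea from these slides can be modelled without any BPF at all. The sketch below is a pure-Python simulation under simplifying assumptions: a dict stands in for BPF_HASH(syscall_id), events arrive as pre-sorted `(time, tid, syscall)` tuples, and `sample_latest_state` is a hypothetical name. It is a conceptual model, not the xcapture-bpf code.

```python
def sample_latest_state(events, sample_times):
    """Simulate an instrumenter that overwrites per-thread state and a periodic sampler.

    events: (time, tid, syscall) tuples sorted by time.
    sample_times: sorted times at which the sampler reads the state.
    """
    state = {}      # stands in for BPF_HASH(syscall_id): tid -> latest syscall
    samples = []    # what the sampler hands to userspace
    next_event = 0
    for now in sample_times:
        # Instrumenter side: apply every event up to "now", overwriting in place.
        while next_event < len(events) and events[next_event][0] <= now:
            _, tid, syscall = events[next_event]
            state[tid] = syscall  # update the slot, never append to a log
            next_event += 1
        # Sampler side: copy out the current state once per pass.
        samples.append((now, dict(state)))
    return samples


events = [
    (0.1, 10, "read"),
    (0.2, 11, "write"),
    (0.3, 10, "futex"),
    (1.5, 42, "ppoll"),
    (1.6, 10, "read"),
]
samples = sample_latest_state(events, sample_times=[1.0, 2.0])
for when, snapshot in samples:
    print(when, snapshot)
```

Five events produce only two snapshots, and the intermediate `read` by tid 10 at 0.1 never reaches the output because it was overwritten before the first sample.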

How can it be efficient?
• We do not trace every single event to output
  • Unrealistic amount of output & high instrumentation overhead
• We do not sample only on-CPU threads
  • The profile event only samples on-CPU threads (also commands like perf top by default)
  • We will additionally use the finish_task_switch kprobe for thread sleep (off-CPU) analysis
• We will "trace" the latest thread state changes into a custom array
• And "clients" then periodically sample the thread state array & consume the output
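A back-of-the-envelope comparison makes the output-volume argument concrete. The numbers below are hypothetical, chosen only for illustration, and `records_per_second` is not a 0x.tools function.

```python
def records_per_second(threads, events_per_thread_per_sec, sample_hz):
    """Compare output volume of event tracing vs thread state sampling."""
    # Tracing emits one record for every event of every thread.
    traced = threads * events_per_thread_per_sec
    # Sampling emits at most one row per thread per sampling pass.
    sampled = threads * sample_hz
    return traced, sampled


# Hypothetical workload: 100 busy threads, 10,000 syscalls/s each, sampled at 1 Hz.
traced, sampled = records_per_second(
    threads=100, events_per_thread_per_sec=10_000, sample_hz=1
)
print(f"traced: {traced} records/s, sampled: at most {sampled} rows/s")
```

With sampling, the output volume depends on the thread count and the sampling frequency, not on how many events each thread generates.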

FAQ
• xcapture-bpf is not showing stack traces for Java, Python, etc
  • Currently you get stacks & symbols only for compiled binaries with symbols or debuginfo available
  • It is possible to add higher-level language runtime support and Java runtime-optimized code
  • This has already been done by other tools and works
• What's the performance overhead?
  • Test it out! :-)
  • Still beta, I have 6-7 categories of ideas for further improvement
  • It doesn't matter how frequently the frontend samples the thread state (TS) arrays; it doesn't slow others down
• Will this work with distributed systems?
  • Yes, but not yet implemented (for example, capture + include end-to-end traceID in TS2 array)
  • Distributed systems are still just a bunch of individual systems that talk to each other
• Instrumentation is investment!

FAQ
• Practical requirements
  • RHEL 8.1 or later, Ubuntu 20 or later
  • bcc-tools package installed
  • xcapture-bpf running as root
  • But any Linux user with read access can read its output files!
  • debuginfo in some cases (ideally)
• xcapture-bpf isn't showing some-detail-I-want (like syscall or IO latencies)
  • I have built out less than 5% of what this method & implementation can provide!
  • BPF approaches are not only customizable, but completely programmable
  • You can access all kernel events & structures related to thread execution and access userspace memory

Near-term plans
• Community beta testing & tidy up code
• Evangelize! So that drilldown into thread activity eventually makes sense to everyone!
• Proper documentation, examples (and a man-page)
• Optimize the BPF kernel-space performance, also userspace record extraction
• Profile the instrumentation code itself
• Improve stack-tracking hashmap to lower memory usage
• Make some instrumentation dynamic/optional (get_stack on every N iterations)
• Proper distro packaging
• Automated CSV compression, archiving (optionally convert to parquet format too)
• Release v2 GA (September 2024?)
• For future v3 use libbpf, allow multiple independent samplers of the BPF program maps
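The "get_stack on every N iterations" item can be pictured as a per-thread counter that triggers the expensive stack capture only occasionally. The sketch below models that in plain Python. It is one possible reading of the plan, with hypothetical names (`EveryNthStack`, `capture_stack`), and says nothing about how xcapture-bpf will actually implement it.

```python
from collections import defaultdict


class EveryNthStack:
    """Capture a stack only on every Nth event per thread (illustrative model)."""

    def __init__(self, every_n, capture_stack):
        if every_n < 1:
            raise ValueError("every_n must be >= 1")
        self.every_n = every_n
        self.capture_stack = capture_stack  # the expensive call being rationed
        self.counts = defaultdict(int)      # tid -> events seen so far
        self.latest = {}                    # tid -> most recently captured stack

    def on_event(self, tid):
        """Record one event; return True if a stack was captured this time."""
        self.counts[tid] += 1
        if self.counts[tid] % self.every_n == 0:
            self.latest[tid] = self.capture_stack(tid)
            return True
        return False


captured = []
sampler = EveryNthStack(every_n=4, capture_stack=lambda tid: captured.append(tid) or f"stack-{tid}")
results = [sampler.on_event(10) for _ in range(10)]
print(results.count(True), "captures for", len(results), "events")
```

The cheap state update still happens on every event; only the costly stack walk is skipped most of the time.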

Links
• 0x.tools – website with docs and usage examples
• https://github.com/tanelpoder/0xtools – open source code
• https://tanelpoder.com/categories/linux/ - my Linux-related posts
• eBPF trace id
• eBPF java
• eBPF python/ruby
• https://x.com/tanelpoder @tanelpoder
• https://youtube.com/TanelPoder – I'll demo 0x.tools on many apps
