0x.tools with eBPF beta release and demo

Tanel Poder
June 25, 2024

Transcript

  1. Linux Thread State Sampling method, implementation & tools

    • Demo of what you get (all free & open source)
    • How does it work?
    • FAQ & future plans

    tanelpoder.com
  2. sudo xcapture-bpf (raw output)

    === Active Threads ===============================================================================================

    timestamp               | st | flags    | tid    | pid    | username | comm            | syscall | offcpu_ustack |
    ------------------------------------------------------------------------------------------------------------------
    2024-06-24 01:08:08.776 | RQ | 4210688  | 133703 | 133703 | oracle   | oracle_133703_l | read    | 39219         |
    2024-06-24 01:08:08.776 | R  | 4210688  | 133141 | 133141 | oracle   | oracle_133141_l | read    | 59492         |
    2024-06-24 01:08:08.776 | R  | 4210688  | 133281 | 133281 | oracle   | oracle_133281_l | read    | 127357        |
    2024-06-24 01:08:08.776 | R  | 4210688  | 134311 | 134311 | oracle   | oracle_134311_l | read    | 84414         |
    2024-06-24 01:08:08.776 | R  | 4210688  | 134299 | 134299 | oracle   | oracle_134299_l | read    | 12566         |
    2024-06-24 01:08:08.776 | R  | 4210688  | 133812 | 133812 | oracle   | oracle_133812_l | read    | 61924         |
    2024-06-24 01:08:08.776 | R  | 4210688  | 134031 | 134031 | oracle   | oracle_134031_l | read    | 126319        |
    2024-06-24 01:08:08.776 | R  | 4210688  | 133863 | 133863 | oracle   | oracle_133863_l | read    | 9540          |
    2024-06-24 01:08:08.776 | R  | 4210688  | 134136 | 134136 | oracle   | oracle_134136_l | read    | 47853         |
    2024-06-24 01:08:08.776 | R  | 4210688  | 133423 | 133423 | oracle   | oracle_133423_l | read    | 39614         |
    2024-06-24 01:08:08.776 | R  | 4210688  | 134185 | 134185 | oracle   | oracle_134185_l | read    | 51890         |
    2024-06-24 01:08:08.776 | R  | 4210688  | 134431 | 134431 | oracle   | oracle_134431_l | read    | 12124         |
    2024-06-24 01:08:08.776 | R  | 4210688  | 133797 | 133797 | oracle   | oracle_133797_l | read    | 48065         |
    2024-06-24 01:08:08.776 | R  | 4210688  | 134023 | 134023 | oracle   | oracle_134023_l | read    | 44375         |
    2024-06-24 01:08:08.776 | R  | 4210688  | 133944 | 133944 | oracle   | oracle_133944_l | read    | 23243         |
    2024-06-24 01:08:08.776 | R  | 4210688  | 133544 | 133544 | oracle   | oracle_133544_l | read    | 102337        |
    2024-06-24 01:08:08.776 | R  | 4210688  | 134359 | 134359 | oracle   | oracle_134359_l | read    | 51969         |
    2024-06-24 01:08:08.776 | R  | 4210688  | 133728 | 133728 | oracle   | oracle_133728_l | read    | 99119         |
    2024-06-24 01:08:08.776 | R  | 69238880 | 11338  | 11338  | root     | kworker/131:2   | -       | 80415         |
    2024-06-24 01:08:08.776 | R  | 4210688  | 134115 | 134115 | oracle   | oracle_134115_l | read    | 60264         |

    There will be many more columns!
  3. sudo xtop

    === Active Threads ======================================================================

    samples | avg_threads | st | username | comm       | syscall  | offcpu_u | offcpu_k | sch
    -----------------------------------------------------------------------------------------
    12.00   | 2.40        | R  | mysql    | connection | -        | 53698    | 50735    | __-
    1.00    | 0.20        | R  | mysql    | connection | recvfrom | 53698    | 50735    | __-
    1.00    | 0.20        | R  | tanel    | sysbench   | -        | 24801    | 46538    | __-
    1.00    | 0.20        | R  | mysql    | connection | ppoll    | 53698    | 50735    | __-
    1.00    | 0.20        | R  | mysql    | connection | sendto   | 53698    | 50735    | __-
    1.00    | 0.20        | R  | tanel    | sysbench   | recvfrom | 42407    | 46538    | __-
    1.00    | 0.20        | R  | mysql    | connection | -        | 2360     | 50735    | __-
    1.00    | 0.20        | R  | mysql    | connection | -        | 6808     | 50735    | __-
    1.00    | 0.20        | R  | mysql    | connection | -        | 53698    | 46538    | __-
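
In the xtop output above, every row satisfies avg_threads = samples / 5, which is consistent with a plain group-and-count over five sampling passes. The sketch below shows that aggregation in Python. It is an illustration only, not the actual xtop code: the `summarize` function, the tuple layout, and the pass count of 5 are assumptions inferred from the numbers on the slide.

```python
from collections import Counter

def summarize(samples, num_passes):
    """Group raw thread-state samples and count them.

    samples:    iterable of (st, username, comm, syscall) tuples, one per
                active thread seen in one sampling pass (hypothetical shape).
    num_passes: number of sampling passes in the reporting window.
    Returns a list of (samples, avg_threads, key) rows, busiest first.
    """
    counts = Counter(samples)
    return [(n, n / num_passes, key) for key, n in counts.most_common()]

# Five passes: one state was seen 12 times, another only once.
raw = [("R", "mysql", "connection", "-")] * 12
raw += [("R", "mysql", "connection", "recvfrom")]

rows = summarize(raw, num_passes=5)
for n, avg, key in rows:
    print(f"{n:<7.2f} | {avg:<11.2f} | " + " | ".join(key))
```

Read this way, avg_threads is the average number of threads found in that state per sampling pass, so 2.40 means roughly two to three threads were in that state at any moment.
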
  4. StackTiles

    • A new simple way to compress info on a terminal screen
      • Structured
      • Readable
      • Navigable (search & lookup)
    • Not fancy, but practical
      • ... for people who work on the command line
  5. StackTiles (giant nerd mode)

    === Active Threads ==================================================================================================================================================================================================

    timestamp               | st | flags   | tid    | pid    | username | comm            | syscall | offcpu_ustack | offcpu_kstack | profile_ustack | profile_kstack | in_sched_waking | in_sched_wakeup | waker_tid
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    2024-06-24 01:03:15.551 | R  | 4210688 | 127994 | 127994 | oracle   | oracle_127994_l | read    | 63287         | 110001        | -12            | -14            | True            | False           | 127992
    2024-06-24 01:03:15.551 | R  | 4210688 | 128568 | 128568 | oracle   | oracle_128568_l | read    | 122640        | 110001        | -12            | -14            | True            | False           | 128566
    2024-06-24 01:03:15.551 | R  | 4210688 | 129235 | 129235 | oracle   | oracle_129235_l | read    | 96273         | 110001        | 46265          | -14            | True            | False           | 129233
    2024-06-24 01:03:15.551 | RQ | 4210752 | 127896 | 126484 | tanel    | UserSession347  | write   | 42723         | 79814         | 127544         | -14            | False           | False           | 127898
    2024-06-24 01:03:15.551 | RQ | 4210752 | 128834 | 126484 | tanel    | UserSession652  | write   | 42723         | 79814         | 98729          | -14            | False           | False           | 128836
    2024-06-24 01:03:15.551 | R  | 4210688 | 128490 | 128490 | oracle   | oracle_128490_l | read    | -12           | 82885         | 93790          | -14            | False           | False           | 128488
    2024-06-24 01:03:15.551 | R  | 4210688 | 127324 | 127324 | oracle   | oracle_127324_l | read    | 5882          | 110001        | 62996          | -14            | False           | False           | 127322
    2024-06-24 01:03:15.551 | R  | 4210688 | 128200 | 128200 | oracle   | oracle_128200_l | read    | 70293         | 110001        | 9415           | -14            | True            | False           | 128198
    2024-06-24 01:03:15.551 | R  | 4210688 | 126814 | 126814 | oracle   | oracle_126814_l | read    | 58328         | 110001        | 9100           | -14            | True            | False           | 126812
    2024-06-24 01:03:15.551 | R  | 4210688 | 128836 | 128836 | oracle   | oracle_128836_l | read    | 119055        | 110001        | 98729          | -14            | True            | False           | 128836
    2024-06-24 01:03:15.551 | R  | 4210688 | 129667 | 129667 | oracle   | oracle_129667_l | read    | 110383        | 110001        | 122449         | -14            | True            | False           | 129665

    ---------------------------------------------------------------------------------------------------------------
    kstack 7454                | kstack 9826        | kstack 13991    | kstack 26933       | kstack 32123         |
                               |                    |                 |                    |                      |
    asm_sysvec_reschedule_ipi  | asm_exc_page_fault | ksys_write      | __do_sys_getrusage | __x64_sys_semtimedop |
    irqentry_exit_to_user_mode | exc_page_fault     | vfs_write       | getrusage          | do_semtimedop        |
    exit_to_user_mode_prepare  | do_user_addr_fault | sock_write_iter | mmput              | schedule_timeout     |
    exit_to_user_mode_loop     |                    |                 | __cond_resched     | schedule             |
    schedule                   |                    |                 | __schedule         | __schedule           |
    __schedule                 |                    |                 | handshake_exit     | handshake_exit       |
    handshake_exit             |                    |                 | handshake_exit     | handshake_exit       |
    handshake_exit             |                    |                 |                    |                      |
    ---------------------------------------------------------------------------------------------------------------
    kstack 32426            | kstack 35640               | kstack 37329   | kstack 41759                 |
                            |                            |                |                              |
    ksys_write              | asm_sysvec_irq_work        | kthread        | ksys_write                   |
    vfs_write               | irqentry_exit_to_user_mode | worker_thread  | vfs_write                    |
    sock_write_iter         | exit_to_user_mode_prepare  | schedule       | sock_write_iter              |
    tcp_sendmsg             | exit_to_user_mode_loop     | __schedule     | tcp_sendmsg                  |
    tcp_sendmsg_locked      | schedule                   | handshake_exit | tcp_sendmsg_locked           |
    sk_stream_alloc_skb     | __schedule                 | handshake_exit | __tcp_push_pending_frames    |
    mem_cgroup_charge_skmem | handshake_exit             |                | tcp_write_xmit               |
    try_charge_memcg        | handshake_exit             |                | __tcp_transmit_skb           |
    page_counter_try_charge |                            |                | __ip_queue_xmit              |
                            |                            |                | ip_finish_output2            |
                            |                            |                | __local_bh_enable_ip         |
                            |                            |                | do_softirq                   |
                            |                            |                | __do_softirq                 |
                            |                            |                | net_rx_action                |
                            |                            |                | __napi_poll                  |
                            |                            |                | process_backlog              |
                            |                            |                | __netif_receive_skb_one_core |
  6. StackTiles

    ---------------------------------------------------------------------------------------------------------------
    kstack 7454                | kstack 9826        | kstack 13991    | kstack 26933       | kstack 32123         |
                               |                    |                 |                    |                      |
    asm_sysvec_reschedule_ipi  | asm_exc_page_fault | ksys_write      | __do_sys_getrusage | __x64_sys_semtimedop |
    irqentry_exit_to_user_mode | exc_page_fault     | vfs_write       | getrusage          | do_semtimedop        |
    exit_to_user_mode_prepare  | do_user_addr_fault | sock_write_iter | mmput              | schedule_timeout     |
    exit_to_user_mode_loop     |                    |                 | __cond_resched     | schedule             |
    schedule                   |                    |                 | __schedule         | __schedule           |
    __schedule                 |                    |                 | handshake_exit     | handshake_exit       |
    handshake_exit             |                    |                 | handshake_exit     | handshake_exit       |
    handshake_exit             |                    |                 |                    |                      |
    ---------------------------------------------------------------------------------------------------------------
    kstack 32426            | kstack 35640               | kstack 37329   | kstack 41759                 |
                            |                            |                |                              |
    ksys_write              | asm_sysvec_irq_work        | kthread        | ksys_write                   |
    vfs_write               | irqentry_exit_to_user_mode | worker_thread  | vfs_write                    |
    sock_write_iter         | exit_to_user_mode_prepare  | schedule       | sock_write_iter              |
    tcp_sendmsg             | exit_to_user_mode_loop     | __schedule     | tcp_sendmsg                  |
    tcp_sendmsg_locked      | schedule                   | handshake_exit | tcp_sendmsg_locked           |
    sk_stream_alloc_skb     | __schedule                 | handshake_exit | __tcp_push_pending_frames    |
    mem_cgroup_charge_skmem | handshake_exit             |                | tcp_write_xmit               |
    try_charge_memcg        | handshake_exit             |                | __tcp_transmit_skb           |
    page_counter_try_charge |                            |                | __ip_queue_xmit              |
                            |                            |                | ip_finish_output2            |
                            |                            |                | __local_bh_enable_ip         |
                            |                            |                | do_softirq                   |
                            |                            |                | __do_softirq                 |
                            |                            |                | net_rx_action                |
                            |                            |                | __napi_poll                  |
                            |                            |                | process_backlog              |
                            |                            |                | __netif_receive_skb_one_core |
                            |                            |                | ip_local_deliver_finish      |
                            |                            |                | ip_protocol_deliver_rcu      |
                            |                            |                | tcp_v4_rcv                   |
                            |                            |                | tcp_v4_do_rcv                |
                            |                            |                | tcp_rcv_established          |
                            |                            |                | sock_def_readable            |
                            |                            |                | _raw_spin_unlock_irqrestore  |
    ---------------------------------------------------------------------------------------------------------------
    kstack 45647                 | kstack 48081                 | kstack 56466       | kstack 65895              |
                                 |                              |                    |                           |
    syscall_enter_from_user_mode | ksys_write                   | __do_sys_getrusage | kthread                   |
                                 | vfs_write                    | getrusage          | worker_thread             |
                                 | sock_write_iter              |                    | schedule                  |
                                 | tcp_sendmsg                  |                    | __schedule                |
                                 | tcp_sendmsg_locked           |                    | finish_task_switch.isra.0 |
                                 | __tcp_push_pending_frames    |                    |                           |
                                 | tcp_write_xmit               |                    |                           |
                                 | __tcp_transmit_skb           |                    |                           |
                                 | __ip_queue_xmit              |                    |                           |
                                 | ip_finish_output2            |                    |                           |
                                 | __local_bh_enable_ip         |                    |                           |
                                 | do_softirq                   |                    |                           |
                                 | __do_softirq                 |                    |                           |
                                 | net_rx_action                |                    |                           |
                                 | __napi_poll                  |                    |                           |
                                 | process_backlog              |                    |                           |
                                 | __netif_receive_skb_one_core |                    |                           |
                                 | ip_local_deliver_finish      |                    |                           |
                                 | ip_protocol_deliver_rcu      |                    |                           |
                                 | tcp_v4_rcv                   |                    |                           |
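
The tile layout above places several stacks side by side, one per column, and starts a new band of tiles when the row is full. Here is a minimal Python sketch of how such a layout could be produced. The `render_tiles` function and its `per_row` parameter are hypothetical names for illustration, not taken from the 0x.tools source; the two sample stacks are copied from the slide.

```python
from itertools import zip_longest

def render_tiles(stacks, per_row=4):
    """Render {stack_id: [frame, ...]} as side-by-side text tiles."""
    lines = []
    ids = sorted(stacks)
    for start in range(0, len(ids), per_row):
        group = ids[start:start + per_row]
        # Each column: a header cell, an empty spacer cell, then the frames.
        columns = [[f"kstack {sid}", ""] + list(stacks[sid]) for sid in group]
        widths = [max(len(cell) for cell in col) for col in columns]
        # zip_longest pads the shorter stacks with empty cells.
        for cells in zip_longest(*columns, fillvalue=""):
            row = " | ".join(c.ljust(w) for c, w in zip(cells, widths))
            lines.append(row.rstrip())
        # Separator between bands of tiles.
        lines.append("-" * (sum(widths) + 3 * (len(widths) - 1)))
    return "\n".join(lines)

demo = {
    37329: ["kthread", "worker_thread", "schedule", "__schedule",
            "handshake_exit", "handshake_exit"],
    56466: ["__do_sys_getrusage", "getrusage"],
}
print(render_tiles(demo))
```

Because each stack is printed once and referenced by its numeric ID in the sample rows, a stack that appears in thousands of samples still costs only one tile of screen space.
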
  7. How does it work?!

    • Two halves
      • Instrumenter – the BPF program (xcapture-bpf.c)
      • Sampler – currently Python/BCC (xcapture-bpf)
  8. Populating & sampling the thread state "array"

    [Diagram: threads tid 10, tid 11, tid 42, ... N along a time axis; a BPF_HASH(syscall_id) map with one slot per tid; tid 10 writes into its own slot.]

    tracepoint:raw_syscalls:sys_enter { @syscall_id[tid] = args->id; }
  9. Populating & sampling the thread state "array"

    [Diagram: the same timeline; tid 11 writes into its own slot of BPF_HASH(syscall_id); a second map, BPF_HASH(syscall_ustack), holds slots for tid 10, 11, 42, ... N.]

    tracepoint:raw_syscalls:sys_enter { @syscall_id[tid] = args->id; }
  10. Populating & sampling the thread state "array"

    [Diagram: the same timeline; tid 42 repeatedly writes into its own slot of BPF_HASH(syscall_id); BPF_HASH(syscall_ustack) holds slots for tid 10, 11, 42, ... N.]

    tracepoint:raw_syscalls:sys_enter { @syscall_id[tid] = args->id; }

    • We are not tracing, logging, appending all events
    • We update, overwrite the current, latest action in custom state arrays ...
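
The overwrite-instead-of-append idea can be illustrated without any BPF at all. The Python sketch below replays a stream of syscall-entry events into a dictionary that stands in for BPF_HASH(syscall_id). It is a simulation under stated assumptions, not the xcapture-bpf implementation: the `apply_events` name and the event tuples (with arbitrary syscall numbers) are made up for the example.

```python
def apply_events(events):
    """Replay (tid, syscall_nr) events into a latest-state map."""
    syscall_id = {}               # stands in for BPF_HASH(syscall_id)
    for tid, syscall_nr in events:
        syscall_id[tid] = syscall_nr   # overwrite the slot, never append
    return syscall_id

# Six events from three threads (hypothetical values).
events = [(10, 0), (11, 1), (42, 7), (10, 1), (42, 0), (10, 202)]
state = apply_events(events)
print(state)
print(len(events), "events ->", len(state), "map entries")
```

However many events arrive, the map holds one entry per thread, so memory use and output volume depend on the number of threads rather than on the event rate.
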
  11. Populating & sampling the thread state "array"

    [Diagram: the same timeline and the BPF_HASH(syscall_id) and BPF_HASH(syscall_ustack) maps, followed by four successive sampled copies of the slots 10, 11, 42, ... N.]

    tracepoint:raw_syscalls:sys_enter { @syscall_id[tid] = args->id; }

    A separate, independent program samples the state arrays at its own desired frequency, applies its filter rules, and passes the result to userspace:

    interval:hz:1 { print(@SAMPLE_TIME); print(@syscall_id); }
  12. Populating & sampling the thread state "array"

    [Diagram: the same timeline and the BPF_HASH(syscall_id) and BPF_HASH(syscall_ustack) maps, followed by four successive sampled copies of the slots 10, 11, 42, ... N.]

    tracepoint:raw_syscalls:sys_enter { @syscall_id[tid] = args->id; }

    interval:hz:1 { print(@SAMPLE_TIME); print(@syscall_id); }

    The sampler can be an eBPF program (bpftrace, bcc, libbpf) or a userspace agent that reads the maps' pseudofiles
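
To make the split between the two halves concrete, the Python sketch below simulates both on one timeline: events keep overwriting the state map, and a sampler copies out a filtered snapshot at each of its own sample times. This is a simplified model, not the real sampler; `sample_states`, the `keep` filter callback, and all event values are assumptions made for the example.

```python
def sample_states(events, sample_times, keep=lambda tid, nr: True):
    """Simulate an instrumenter plus an independent periodic sampler.

    events:       list of (time, tid, syscall_nr), sorted by time.
    sample_times: ascending times at which the sampler runs.
    keep:         filter rule applied by the sampler, not the instrumenter.
    Returns {sample_time: {tid: syscall_nr}}.
    """
    state, snapshots, i = {}, {}, 0
    for t in sample_times:
        # Instrumenter side: apply every event that happened up to time t.
        while i < len(events) and events[i][0] <= t:
            _, tid, nr = events[i]
            state[tid] = nr
            i += 1
        # Sampler side: copy out only what the filter rule lets through.
        snapshots[t] = {tid: nr for tid, nr in state.items() if keep(tid, nr)}
    return snapshots

events = [(0.1, 10, 0), (0.2, 11, 1), (0.5, 10, 7), (1.4, 42, 0), (1.6, 10, 1)]
every_second = sample_states(events, [1.0, 2.0])
no_writes = sample_states(events, [1.0, 2.0], keep=lambda tid, nr: nr != 1)
print(every_second)
print(no_writes)
```

Note that tid 10's first syscall (at time 0.1) never shows up in a snapshot because it was overwritten before the first sample. That is the expected trade-off of sampling: short-lived states are seen in proportion to how much time threads spend in them.
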
  13. How can it be efficient?

    • We do not trace every single event to output
      • Unrealistic amount of output & high instrumentation overhead
    • We do not sample only on-CPU threads
      • The profile event only samples on-CPU threads (as do commands like perf top by default)
      • We will additionally use the finish_task_switch kprobe for thread sleep (off-CPU) analysis
    • We will "trace" the latest thread state changes into a custom array
      • And "clients" then periodically sample the thread state array & consume the output
  14. FAQ

    • xcapture-bpf is not showing stack traces for Java, Python, etc.
      • Currently you get stacks & symbols only for compiled binaries with symbols or debuginfo available
      • It is possible to add support for higher-level language runtimes and for Java's runtime-optimized code
      • This has already been done by other tools and works
    • What's the performance overhead?
      • Test it out! :)
      • Still beta, I have 6-7 categories of ideas for further improvement
      • It doesn't matter how frequently the frontend samples the thread state (TS) arrays; it doesn't slow others down
    • Will this work with distributed systems?
      • Yes, but not yet implemented (for example, capture + include end-to-end traceID in TS2 array)
      • Distributed systems are still just a bunch of individual systems that talk to each other
    • Instrumentation is investment!
  15. FAQ

    • Practical requirements
      • RHEL 8.1 or later, Ubuntu 20 or later
      • bcc-tools package installed
      • xcapture-bpf running as root
        • But any Linux user with read access can read its output files!
      • debuginfo in some cases (ideally)
    • xcapture-bpf isn't showing some-detail-I-want (like syscall or IO latencies)
      • I have built out less than 5% of what this method & implementation can provide!
      • BPF approaches are not only customizable, but completely programmable
      • You can access all kernel events & structures related to thread execution and access userspace memory
  16. Near-term plans

    • Community beta testing & tidy up code
    • Evangelize! So that drilldown into thread activity eventually makes sense to everyone!
    • Proper documentation, examples (and a man-page)
    • Optimize the BPF kernel-space performance, also userspace record extraction
      • Profile the instrumentation code itself
      • Improve stack-tracking hashmap to lower memory usage
      • Make some instrumentation dynamic/optional (get_stack on every N iterations)
    • Proper distro packaging
    • Automated CSV compression, archiving (optionally convert to parquet format too)
    • Release v2 GA (September 2024?)
    • For future v3 use libbpf, allow multiple independent samplers of the BPF program maps
  17. Links

    • 0x.tools – website with docs and usage examples
    • https://github.com/tanelpoder/0xtools – open source code
    • https://tanelpoder.com/categories/linux/ – my Linux-related posts
    • eBPF trace id
    • eBPF java
    • eBPF python/ruby
    • https://x.com/tanelpoder @tanelpoder
    • https://youtube.com/TanelPoder – I'll demo 0x.tools on many apps