Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Linux Kernel Tracing Internals

Sponsored · Ship Features Fearlessly Turn features on and off without deploys. Used by thousands of Ruby developers.

Linux Kernel Tracing Internals

This session dives into the core mechanisms of Linux kernel tracing. We explore dynamic instrumentation such as exception-based kprobe and code-patching ftrace, comparing internals and performance overhead. The talk also covers low-overhead, statically defined tracepoints and counter-based sampling via perf_event. Finally, we introduce BPF as a programmable execution layer that reuses and enhances tracing sources for flexible and efficient kernel observation.

Presented at: SUSE Labs Conference 2026

Avatar for Hoyeon Lee

Hoyeon Lee

May 18, 2026

More Decks by Hoyeon Lee

Other Decks in Technology

Transcript

  1. Hoyeon Lee Seoul, South Korea SUSE - HW Enablement Taipei

    BPF System Engineer Working on Linux Kernel BPF 2 Whoami
  2. Exception-based kprobe (2.6.9) 4 Kernel Tracing Event Sources Code-patching ftrace

    (2.6.27) Statically-defined tracepoints (2.6.32) Counter-based perf_event (2.6.31) Dynamic Instrumentation Static Instrumentation Sampling
  3. Berkeley Packet Filter (3.18) 5 Programmable Layer on Top Dynamic

    Instrumentation Static Instrumentation Sampling Exception-based kprobe (2.6.9) Code-patching ftrace (2.6.27) Statically-defined tracepoints (2.6.32) Counter-based perf_event (2.6.31)
  4. 1. Dynamic Instrumentation kprobe — exception-based ftrace — function-entry patching

    2. Static Instrumentation tracepoints — predefined observation points 3. Sampling perf_event — counter-overflow sampling 4. Programmable Layer BPF — programs attached to event sources Scope: x86-64, baseline paths only, no optimized paths 6 Outline
  5. Inserting observation points, to capture exact info, at exact moment

    8 What is Instrumentation? [ running ] → (observation point) → [ keep running ]
  6. Inserting observation points, to capture exact info, at exact moment

    Primitive Instrumentation: printk() 9 [ running ] → (observation point) → [ keep running ] void do_something() { · · · · · · printk("Hello World\n"); · · · · · · }
  7. Too expensive on hot paths Perturbs timing-sensitive behavior Requires source

    change and rebuild printk(): Why Not for Tracing? 10 void do_something() { · · · · · · printk("Hello World\n"); · · · · · · }
  8. ⚡ Lower overhead 🔍 Less perturbation 🧩 Enable without rebuild

    11 We Need Better Instrumentation Too expensive on hot paths Perturbs timing-sensitive behavior Requires source change and rebuild
  9. 12 What Instrumentation Can We Use? Dynamic Instrumentation Insert observation

    points at runtime Static Instrumentation Use Observation points inserted at build time void do_something() { · · · · · · · · · · · · · · · · · · } Compiled Result void do_something() { observation_point(); · · · · · · · · · · · · } Runtime void do_something() { · · · · · · observation_point(); · · · · · · } Compiled Result void do_something() { · · · · · · observation_point(); · · · · · · } Runtime Insert Enable
  10. 15 What is a breakpoint, really? How does GDB break

    at main? Does GDB stop execution directly? $ gdb ./hello (gdb) disas /r main Dump of assembler code for function main: 0x00401136 <+0>: 55 push %rbp 0x00401137 <+1>: 48 89 e5 mov %rsp,%rbp 0x0040113a <+4>: bf 04 20 40 00 mov $0x402004,%edi 0x0040113f <+9>: e8 ec fe ff ff call 0x401030 <puts@plt> 0x00401144 <+14>: b8 00 00 00 00 mov $0x0,%eax 0x00401149 <+19>: 5d pop %rbp 0x0040114a <+20>: c3 ret End of assembler dump. (gdb) break *main Breakpoint 1 at 0x401136
  11. 16 What is a breakpoint, really? GDB doesn’t stop execution

    directly. It patches INT3, and the CPU traps itself. #BP exception $ gdb ./hello (gdb) disas /r main Dump of assembler code for function main: 0x00401136 <+0>: + CC INT3 0x00401136 <+0>: - 55 push %rbp 0x00401137 <+1>: 48 89 e5 mov %rsp,%rbp 0x0040113a <+4>: bf 04 20 40 00 mov $0x402004,%edi 0x0040113f <+9>: e8 ec fe ff ff call 0x401030 <puts@plt> 0x00401144 <+14>: b8 00 00 00 00 mov $0x0,%eax 0x00401149 <+19>: 5d pop %rbp 0x0040114a <+20>: c3 ret End of assembler dump. (gdb) break *main Breakpoint 1 at 0x401136 (gdb) run Using host libthread_db library "/lib64/libthread_db.so.1". Breakpoint 1, 0x00000000004011c6 in main ()
  12. 17 What is a breakpoint, really? kprobe uses the same

    breakpoint idea, but the trap is handled inside the kernel CPU raise #BP send SIGTRAP $ gdb ./hello (gdb) disas /r main Dump of assembler code for function main: 0x00401136 <+0>: + CC INT3 0x00401136 <+0>: - 55 push %rbp 0x00401137 <+1>: 48 89 e5 mov %rsp,%rbp 0x0040113a <+4>: bf 04 20 40 00 mov $0x402004,%edi 0x0040113f <+9>: e8 ec fe ff ff call 0x401030 <puts@plt> 0x00401144 <+14>: b8 00 00 00 00 mov $0x0,%eax 0x00401149 <+19>: 5d pop %rbp 0x0040114a <+20>: c3 ret End of assembler dump. (gdb) break *main Breakpoint 1 at 0x401136 (gdb) run Using host libthread_db library "/lib64/libthread_db.so.1". Breakpoint 1, 0x00000000004011c6 in main () Process Execution Execute INT3 Debugger gains control Enter Kernel Mode Trap Handler (Interrupt Descriptor Table) Leave Kernel Mode User Mode Kernel Mode ptrace check
  13. What does the getuid system call look like? 18 A

    Small Kernel Target <__do_sys_getuid> #include <unistd.h> uid_t getuid (void); # id uid=1000(user) ...
  14. 19 Disassembly of getuid syscall $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid

    00000000 <__do_sys_getuid>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax // get_current() c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax // current->cred 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi // current->cred->uid 1a: 8b 70 08 movl 0x8(%rax), %esi // current->cred->user_ns 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq SYSCALL_DEFINE0(getuid) { return from_kuid_munged(current_user_ns(), current_uid()); }
  15. 20 kprobe Internals 0. Enable kprobe to __do_sys_getuid // from

    kernel module struct kprobe kp = { .symbol_name = "__do_sys_getuid", .offset = 0, .pre_handler = handler_pre, // optional .post_handler = handler_post, // optional }; register_kprobe(&kp); $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq
  16. 21 kprobe Internals // include/linux/kprobes.h struct kprobe { ... const

    char *symbol_name; __do_sys_getuid unsigned int offset; 0 kprobe_pre_handler_t pre_handler; kprobe_post_handler_t post_handler; ... struct arch_specific_insn ainsn; }; 1. Save original instruction bytes $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq pushq %rbp
  17. 22 kprobe Internals // include/linux/kprobes.h struct kprobe { ... const

    char *symbol_name; __do_sys_getuid unsigned int offset; 0 kprobe_pre_handler_t pre_handler; kprobe_post_handler_t post_handler; ... struct arch_specific_insn ainsn; }; $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: -0: 55 - pushq %rbp +0: CC + INT3 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq 2. Patch 1 Byte with INT3 pushq %rbp Patch 1B INT3 Kernel Mode
  18. 23 kprobe Internals 3. INT3 execution traps into kprobe framework

    // include/linux/kprobes.h struct kprobe { ... const char *symbol_name; __do_sys_getuid unsigned int offset; 0 kprobe_pre_handler_t pre_handler; kprobe_post_handler_t post_handler; ... struct arch_specific_insn ainsn; }; $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: CC INT3 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq pushq %rbp CPU raise #BP Patch 1B INT3 Execute INT3 kprobe framework Kernel Mode
  19. 24 kprobe Internals 3. INT3 execution traps into kprobe framework

    4. pre_handler() is called // include/linux/kprobes.h struct kprobe { ... const char *symbol_name; __do_sys_getuid unsigned int offset; 0 kprobe_pre_handler_t pre_handler; kprobe_post_handler_t post_handler; ... struct arch_specific_insn ainsn; }; $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: CC INT3 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq pushq %rbp CPU raise #BP Patch 1B INT3 Execute INT3 kprobe framework Kernel Mode calls pre_handler()
  20. 25 kprobe Internals 3. INT3 execution traps into kprobe framework

    4. pre_handler() is called 5. Single-Step original instruction // include/linux/kprobes.h struct kprobe { ... const char *symbol_name; __do_sys_getuid unsigned int offset; 0 kprobe_pre_handler_t pre_handler; kprobe_post_handler_t post_handler; ... struct arch_specific_insn ainsn; }; pushq %rbp $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: CC INT3 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq pushq %rbp CPU raise #BP Patch 1B INT3 Execute INT3 kprobe framework Kernel Mode calls pre_handler() Single-Step Instruction
  21. 26 kprobe Internals 3. INT3 execution traps into kprobe framework

    4. pre_handler() is called 5. Single-Step original instruction // include/linux/kprobes.h struct kprobe { ... const char *symbol_name; __do_sys_getuid unsigned int offset; 0 kprobe_pre_handler_t pre_handler; kprobe_post_handler_t post_handler; ... struct arch_specific_insn ainsn; }; pushq %rbp 6. post_handler() is called $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: CC INT3 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq pushq %rbp CPU raise #BP Patch 1B INT3 Execute INT3 kprobe framework Kernel Mode calls pre_handler() calls post_handler() Single-Step Instruction
  22. 27 kprobe Internals // include/linux/kprobes.h struct kprobe { ... const

    char *symbol_name; __do_sys_getuid unsigned int offset; 0 kprobe_pre_handler_t pre_handler; kprobe_post_handler_t post_handler; ... struct arch_specific_insn ainsn; }; $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: CC INT3 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq pushq %rbp 7. Resume Execution CPU raise #BP Patch 1B INT3 Execute INT3 kprobe framework Kernel Mode calls pre_handler() Single-Step Instruction Resume Execution calls post_handler()
  23. 29 kprobe in Action DEFINE_PER_CPU(int, getuid_cnt); static int handler_pre(struct kprobe

    *p, struct pt_regs *regs) { this_cpu_inc(getuid_cnt); return 0; } static struct kprobe kp = { .symbol_name = "__do_sys_getuid", .pre_handler = handler_pre, .offset = 0 }; static int __init my_module_init(void) { return register_kprobe(&kp); } static void __exit my_module_exit(void) { int cpu, total = 0; unregister_kprobe(&kp); for_each_possible_cpu(cpu) total += per_cpu(getuid_cnt, cpu); pr_info("__do_sys_getuid count = %d\n", total); } module_init(my_module_init); module_exit(my_module_exit); Looks simple. But what did we just pay?
  24. 30 kprobe in Action entire path runs for every single

    syscall DEFINE_PER_CPU(int, getuid_cnt); static int handler_pre(struct kprobe *p, struct pt_regs *regs) { this_cpu_inc(getuid_cnt); return 0; } static struct kprobe kp = { .symbol_name = "__do_sys_getuid", .pre_handler = handler_pre, .offset = 0 }; static int __init my_module_init(void) { return register_kprobe(&kp); } static void __exit my_module_exit(void) { int cpu, total = 0; unregister_kprobe(&kp); for_each_possible_cpu(cpu) total += per_cpu(getuid_cnt, cpu); pr_info("__do_sys_getuid count = %d\n", total); } module_init(my_module_init); module_exit(my_module_exit); CPU raise #BP Process Execution Execute INT3 kprobe framework Kernel Mode calls pre_handler() Single-Step Instruction Resume Execution calls post_handler()
  25. 32 Why did kprobe need traps? Because there was no

    place to observe. void do_something() { · · · · · · · · · · · · · · · · · · } TRAP INT3 kprobe pre_handler kprobe post_handler Single-Step
  26. 33 Why did kprobe need traps? Because there was no

    place to observe. void do_something() { · · · · · · · · · · · · · · · · · · } INT3 What if the code itself had a place to observe? (not printk or logic, does nothing by default) void do_something() { // instrumentation site · · · · · · · · · · · · } kprobe pre_handler kprobe post_handler Single-Step TRAP
  27. 35 How does an instrumentation site look? Five NOPs inserted

    by the compiler (x86-64) Does nothing by default void do_something() { // instrumentation site · · · · · · · · · · · · } [ NOP ][ NOP ][ NOP ][ NOP ][ NOP ]
  28. 36 Why five NOPs? Five NOPs inserted by the compiler

    (x86-64) Does nothing by default void do_something() { // instrumentation site · · · · · · · · · · · · } [ NOP ][ NOP ][ NOP ][ NOP ][ NOP ] CALL rel32 [ E8X ][ XXX ][ XXX ][ XXX ][ XXX ] Same size as a CALL instruction Will be replaced at runtime.
  29. 37 Revisiting getuid disassembly Compiled with CONFIG_FTRACE=N $ objdump -d

    v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq Compiled with CONFIG_FTRACE=Y $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: 0f 1f 44 00 00 nopl (%rax, %rax) 5: 55 pushq %rbp 6: 48 89 e5 movq %rsp, %rbp 9: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax 11: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 18: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1f: 8b 70 08 movl 0x8(%rax), %esi 22: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 27: 89 c0 movl %eax, %eax 29: 5d popq %rbp 2a: c3 retq
  30. $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: 0f 1f

    44 00 00 nopl (%rax, %rax) 5: 55 pushq %rbp 6: 48 89 e5 movq %rsp, %rbp 9: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax 11: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 18: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1f: 8b 70 08 movl 0x8(%rax), %esi 22: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 27: 89 c0 movl %eax, %eax 29: 5d popq %rbp 2a: c3 retq 38 ftrace Internals 0. Enable ftrace function tracer with __do_sys_getuid // from terminal, tracefs # echo __do_sys_getuid > /sys/kernel/tracing/set_ftrace_filter # echo function > /sys/kernel/tracing/current_tracer # echo 1 > /sys/kernel/tracing/tracing_on
  31. $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: +0: e8 fb

    96 35 1c call 0xffffffffc0600000 <ftrace_trampoline> -0: 0f 1f 44 00 00 nopl (%rax, %rax) 5: 55 pushq %rbp 6: 48 89 e5 movq %rsp, %rbp 9: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax 11: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 18: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1f: 8b 70 08 movl 0x8(%rax), %esi 22: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 27: 89 c0 movl %eax, %eax 29: 5d popq %rbp 2a: c3 retq ftrace Internals 1. 5-NOP is patched to ftrace_trampoline // from terminal # echo __do_sys_getuid > /sys/kernel/tracing/set_ftrace_filter # echo function > /sys/kernel/tracing/current_tracer # echo 1 > /sys/kernel/tracing/tracing_on // from terminal # cat /proc/kallsyms | grep sys_getuid$ ffffffffa42a6900 T __do_sys_getuid # gdb -q vmlinux /proc/kcore (gdb) disas /r __do_sys_getuid, +5 Dump of assembler code from 0xffffffff812a6900 to 0xffffffff812a6905: 0xffffffff812a6900 <__do_sys_getuid+0>: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) (gdb) disas /r 0xffffffffa42a6900, +5 Dump of assembler code from 0xffffffffa42a6900 to 0xffffffffa42a6905: 0xffffffffa42a6900: e8 fb 96 35 1c call 0xffffffffc0600000 # cat /proc/kallsyms | grep ^ffffffffc0600000 ffffffffc0600000 t ftrace_trampoline [__builtin__ftrace] Kernel Mode On enable, patch <ftrace_trampoline>
  32. $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: +0: e8 fb

    96 35 1c call 0xffffffffc0600000 5: 55 pushq %rbp 6: 48 89 e5 movq %rsp, %rbp 9: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax 11: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 18: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1f: 8b 70 08 movl 0x8(%rax), %esi 22: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 27: 89 c0 movl %eax, %eax 29: 5d popq %rbp 2a: c3 retq ftrace Internals // from terminal # echo __do_sys_getuid > /sys/kernel/tracing/set_ftrace_filter # echo function > /sys/kernel/tracing/current_tracer # echo 1 > /sys/kernel/tracing/tracing_on 2. Calls <ftrace_trampoline> // built on-the-fly based on ftrace_caller ftrace_trampoline(ctx) { save_minimal_cpu_state(ctx) for (ops in ftrace_ops_list): ops.func(ctx) restore_cpu_state(ctx) return; } ftrace_trampoline (calls registered function) On enable, patch <ftrace_trampoline> Kernel Mode
  33. $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: +0: e8 fb

    96 35 1c call 0xffffffffc0600000 5: 55 pushq %rbp 6: 48 89 e5 movq %rsp, %rbp 9: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax 11: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 18: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1f: 8b 70 08 movl 0x8(%rax), %esi 22: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 27: 89 c0 movl %eax, %eax 29: 5d popq %rbp 2a: c3 retq ftrace Internals // from terminal # echo __do_sys_getuid > /sys/kernel/tracing/set_ftrace_filter # echo function > /sys/kernel/tracing/current_tracer # echo 1 > /sys/kernel/tracing/tracing_on // built on-the-fly based on ftrace_caller ftrace_trampoline(ctx) { save_minimal_cpu_state(ctx) for (ops in ftrace_ops_list): ops.func(ctx) restore_cpu_state(ctx) return; } 2. Calls <ftrace_trampoline> 3. calls handler function (here ‘function’ tracer) static void function_trace_call(ulong ip, ulong parent_ip, struct ftrace_ops *op, struct ftrace_regs *fregs) { // ... if (unlikely(!op->private->function_enabled)) return; // ... trace_function(tr, ip, parent_ip, trace_ctx, NULL); } ftrace_trampoline (calls registered function) On enable, patch <ftrace_trampoline> Kernel Mode Handler functions (here ‘function’ tracer)
  34. $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: e8 fb

    96 35 1c call 0xffffffffc0600000 5: 55 pushq %rbp 6: 48 89 e5 movq %rsp, %rbp 9: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax 11: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 18: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1f: 8b 70 08 movl 0x8(%rax), %esi 22: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 27: 89 c0 movl %eax, %eax 29: 5d popq %rbp 2a: c3 retq ftrace Internals // from terminal # echo __do_sys_getuid > /sys/kernel/tracing/set_ftrace_filter # echo function > /sys/kernel/tracing/current_tracer # echo 1 > /sys/kernel/tracing/tracing_on // built on-the-fly based on ftrace_caller ftrace_trampoline(ctx) { save_minimal_cpu_state(ctx) for (ops in ftrace_ops_list): ops.func(ctx) restore_cpu_state(ctx) return; } 4. Resume Execution 3. calls handler function (here ’function’ tracer) ftrace_trampoline (calls registered function) On enable, patch <ftrace_trampoline> Kernel Mode Handler functions (here ‘function’ tracer) Resume Execution 2. Calls <ftrace_trampoline>
  35. ftrace consumers 43 function tracer function_graph tracer BPF trampoline (fentry

    / fexit) # CPU DURATION FUNCTION CALLS # | | | | | | | ------------------------------------------ 8) <idle>-0 => migrati-65 ------------------------------------------ 8) | irq_exit_rcu() { 8) 0.231 us | idle_cpu(); 8) | tick_nohz_irq_exit() { 8) 0.260 us | ktime_get(); 8) 0.742 us | } 8) 1.683 us | } 8) | __sysvec_call_function() { 8) | generic_smp_call_function...() { 8) | __flush_smp_call_function...() { 8) 0.261 us | do_sync_core(); 8) 0.781 us | } # TASK-PID CPU# ||| TIMESTAMP FUNCTION # | | | ||| | | master-159 [005] ... 42.0935: __do_sys_getuid <-do_syscall_64 Lsb_release-179 [010] ... 65.2152: __do_sys_getuid <-do_syscall_64 <...>-179 [014] ... 67.7051: __do_sys_getuid <-do_syscall_64 systemd-logi-125 [002] ... 68.4213: __do_sys_getuid <-do_syscall_64 fcitx5-205 [015] ... 70.0857: __do_sys_getuid <-do_syscall_64 systemd-udev-750 [011] ... 84.1000: __do_sys_getuid <-do_syscall_64 kded6-212 [002] ... 90.3384: __do_sys_getuid <-do_syscall_64 # echo function > $TRACEFS/current_tracer # echo function_graph > $TRACEFS/current_tracer SEC("fentry/__do_sys_getuid") int BPF_PROG(trace_getuid) { bpf_printk("getuid called\n"); return 0; } 43 ftrace_trampoline (calls registered function) On enable, patch <ftrace_trampoline> Kernel Mode etc… function_graph function Who consumes this ftrace path?
  36. 45 ftrace in Action OR // from terminal, tracefs #

    cd /sys/kernel/tracing # echo __do_sys_getuid > set_ftrace_filter # echo function > current_tracer # echo 1 > tracing_on Counting __do_sys_getuid calls // from terminal # cat /sys/kernel/tracing/trace # TASK-PID CPU# ||| TIMESTAMP FUNCTION # | | | ||| | | master-1590 [005] ... 42.0935: __do_sys_getuid <-do_syscall_64 Lsb_release-1796 [010] ... 65.2152: __do_sys_getuid <-do_syscall_64 <...>-1793 [014] ... 67.7051: __do_sys_getuid <-do_syscall_64 systemd-logi-1253 [002] ... 68.4213: __do_sys_getuid <-do_syscall_64 fcitx5-2052 [015] ... 70.0857: __do_sys_getuid <-do_syscall_64 systemd-udev-750 [011] ... 84.1000: __do_sys_getuid <-do_syscall_64 kded6-2126 [002] ... 90.3384: __do_sys_getuid <-do_syscall_64 # cat /sys/kernel/tracing/trace | wc -l 36
  37. 46 Dynamic Instrumentation: Flexibility vs. Cost Can attach almost anywhere

    Pays the trap cost kprobe Cheaper Trampoline path Limited to function entry/return ftrace CPU raise #BP Process Execution Execute INT3 kprobe framework Kernel Mode calls pre_handler() Single-Step Instruction Resume Execution calls post_handler() ftrace_trampoline (calls registered function) On enable, patch <ftrace_trampoline> Kernel Mode Resume Execution Process Execution etc… function_graph function
  38. 47 Recap: Dynamic Instrumentation Whether it’s a trap or a

    trampoline, we still intercept execution at runtime. What if the observation point was already there? void do_something() { · · · · · · · · · · · · · · · · · · } void do_something() { // instrumentation site · · · · · · · · · · · · } Runtime INT3
  39. 48 Recap: Dynamic Instrumentation Whether it’s a trap or a

    trampoline, we still intercept execution at runtime. Static Instrumentation Observation points inserted at build time void do_something() { · · · · · · observation_point(); · · · · · · } Compiled Result void do_something() { · · · · · · observation_point(); · · · · · · } Runtime Enable void do_something() { · · · · · · · · · · · · · · · · · · } void do_something() { // instrumentation site · · · · · · · · · · · · } Runtime INT3
  40. 51 Why printk Does Not Scale? printk() - String formatting

    (parse, va_list) - printk("uid=%d pid=%d\n", uid, pid); parse format string variable argument handling convert integer to string should count total length
  41. Why printk Does Not Scale? printk() - String formatting (parse,

    va_list) - Global log buffer (aka. dmesg buffer) - printk("uid=%d pid=%d\n", uid, pid); All written into single shared buffer (kernel log buffer, aka. dmesg buffer) printk("uid=%d\n", uid); printk("pid=%d\n", pid); 52
  42. 53 Why printk Does Not Scale? printk() - String formatting

    (parse, va_list) - Global log buffer (aka. dmesg buffer) - Unstructured strings (hard to filter) printk("uid=%d pid=%d\n", uid, pid); printk("user identifier=%d\n", uid); printk("user number=%d\n", uid); Format can be arbitrary and hard to filter
  43. 54 Solution? tracepoints - Fixed event schema (lower overhead) -

    Dedicated tracing buffer - Structured binary data (easier filtering, processing, and aggregating) printk() - String formatting (parse, va_list) - Global log buffer (aka. dmesg buffer) - Unstructured strings (hard to filter)
  44. 56 Example: syscall tracepoints syscall tracepoints are fired before the

    actual syscall <do_syscall_x64()> <trace_sys_enter()> SYSCALL instruction entry_SYSCALL_64 (arch/x86/entry/entry_64.s) do_syscall_64() (arch/x86/entry/syscall_64.c) syscall_enter_from _user_mode() syscall_trace_enter() trace_sys_enter() do_syscall_x64()
  45. 57 Example: syscall tracepoints TRACE_EVENT_SYSCALL(sys_enter, TP_STRUCT__entry( __field(long, id) __array(unsigned long,

    args, 6) ), TP_fast_assign( __entry->id = id; syscall_get_arguments(current, regs, __entry->args); ), // ... snip ... ); CPU 0 CPU 1 CPU 2 CPU 3 Just structured data written to a ring buffer. Per-CPU trace buffer SYSCALL instruction entry_SYSCALL_64 (arch/x86/entry/entry_64.s) do_syscall_64() (arch/x86/entry/syscall_64.c) syscall_enter_from _user_mode() syscall_trace_enter() trace_sys_enter() do_syscall_x64()
  46. 59 tracepoint in Action // from terminal # cd /sys/kernel/tracing

    # echo 1 > events/syscalls/sys_enter_getuid/enable # echo 1 > tracing_on Counting __do_sys_getuid calls with syscalls/sys_enter_getuid // from terminal # cat /sys/kernel/tracing/trace bash-601 [004] ..... 178.993751: sys_getuid() tmux: client-698 [005] ..... 180.146847: sys_getuid() tmux: server-700 [007] ..... 180.155717: sys_getuid() tput-709 [010] ..... 180.257137: sys_getuid() bash-706 [010] ..... 180.246584: sys_getuid() tput-716 [006] ..... 180.266727: sys_getuid() bash-701 [008] ..... 180.268412: sys_getuid() # cat /sys/kernel/tracing/trace | wc -l 29 OR
  47. 60 void do_something() { · · · · · ·

    observation_point(); · · · · · · } void do_something() { · · · · · · · · · · · · · · · · · · } void do_something() { // instrumentation site · · · · · · · · · · · · } INT3 We examined three tracing mechanisms for the same syscall. kprobe ftrace tracepoint Event-driven instrumentation - observe events as they happen observing every event pays the tracing cost
  48. 61 Cost of Event-driven Tracing https://www.ebpf.top/en/post/bpf_rawtracepoint/ kprobe ftrace tracepoint Overhead

    Cost > ≈ In Event-driven instrumentation, overhead grows with event rate $ cd tools/testing/selftests/bpf $ ./benchs/run_bench_trigger.sh
  49. 64 So how Sampling Works? Sampling replaces "observe every event"

    with "observe when a counter overflows". Events increment a counter. Sampling happens only on overflow. Event timeline: | | | | | | | | | | | | | | | | Counter value : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ^ ^ [ sample1 ] [ sample2 ]
  50. 66 What does perf_event can do? struct perf_event_attr attr =

    { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0); Event timeline | | | | | | | | | | | | | | | | | | | | Counter value 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 (x10,000) ^ ^ [ sample1 ] [ sample2 ] perf_event defines: - What event to count - When to sample - What to observe
  51. 67 What does perf_event can do? perf_event defines: - What

    event to count Event timeline | | | | | | | | | | | | | | | | | | | | Counter value 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 (x10,000) ^ ^ [ sample1 ] [ sample2 ] ➔ Hardware / PMU events (cycles, instructions, cache-misses) ➔ Software events (context-switches, page-fault, cpu-clock) ➔ Tracepoint-backed events ➔ Kprobe-backed events struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0);
  52. 68 What does perf_event can do? perf_event defines: - What

    event to count - When to sample Event timeline | | | | | | | | | | | | | | | | | | | | Counter value 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 (x10,000) ^ ^ [ sample1 ] [ sample2 ] ➔ Count-based overflow ➔ Timer-based sampling (clock events) struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, // every 100,000 cycle .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0);
  53. 69 What does perf_event can do? perf_event defines: - What

    event to count - When to sample - What to observe Event timeline | | | | | | | | | | | | | | | | | | | | Counter value 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 (x10,000) ^ ^ [ sample1 ] [ sample2 ] ➔ Instruction Pointer ➔ User/Kernel Stack Trace ➔ PID/TID/CPU# ➔ Register Information ➔ Call chains struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, // every 100,000 cycle .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0);
  54. 70 What does perf_event can do? Event timeline | |

    | | | | | | | | | | | | | | | | | | Counter value 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 (x10,000) ^ ^ [ sample1 ] [ sample2 ] Every 100,000 CPU Cycles, sample current IP + Call Chain => identify frequently executed code path perf_event defines: - What event to count - When to sample - What to observe struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, // every 100,000 cycle .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0);
  55. 72 perf_event in Action struct perf_event_attr attr = { .size

    = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0); ioctl(fd, PERF_EVENT_IOC_ENABLE, 0); Identifying hot code path OR # perf record -e cycles -c 100000 -C 0 -g -- sleep 1
  56. 73 perf_event in Action Identifying hot code path # perf

    report --stdio --no-children –sort comm -F overhead,comm # Overhead Command # ........ ............... # 0.21% NetworkManager 0.00% avahi-daemon 0.09% awk 0.35% bash 0.06% cat 81.05% rsync | |--12.57%--arch_local_irq_restore | | | |--3.06%--obj_cgroup_uncharge | | __memcg_slab_free_hook | | kmem_cache_free | | | | | --3.03%--free_buffer_head | | try_to_free_buffers struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0); ioctl(fd, PERF_EVENT_IOC_ENABLE, 0); Which process appeared most frequently? OR # perf record -e cycles -c 100000 -C 0 -g -- sleep 1
  57. 74 perf_event in Action OR Identifying hot code path #

    perf record -e cycles -c 100000 -C 0 -g -- sleep 1 # perf report --stdio -F overhead_children,overhead,comm,symbol # Children Self Command Symbol # ........ ........ ............... .................. # 81.03% 0.00% rsync [.] _start | ---_start __libc_start_call_main main | |--60.34%--iwrite.constprop.0 | | | --60.33%--__GI___libc_write | | | ... | --60.03%--invoke_syscall | | | --60.03%--__x64__sys_write | | | --60.03%--ksys_write struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0); ioctl(fd, PERF_EVENT_IOC_ENABLE, 0); Which call chain appeared most frequently?
  58. 75 perf_event in Action Identifying hot code path # perf

    script \ | stackcollapse-perf.pl \ | flamegraph.pl > flamegraph.svg struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0); ioctl(fd, PERF_EVENT_IOC_ENABLE, 0); FlameGraph OR # perf record -e cycles -c 100000 -C 0 -g -- sleep 1
  59. Exception-based kprobe (2.6.9) 76 Recap: Tracing Mechanism Code-patching ftrace (2.6.27)

    Statically-defined tracepoints (2.6.32) Counter-based perf_event (2.6.31) Dynamic Instrumentation Static Instrumentation Sampling Event-driven Counter-driven Tracing event sources
  60. 77 But that’s not the Whole Story Event source decides

    where/when. Exception-based kprobe Code-patching ftrace Statically-defined tracepoints Counter-based perf_event But who consumes the event?
  61. writes structured-data Event | | | | | | |

    | | | | | | | | Counter 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 (x1,000) ^ ^ [ sample ] [ sample ] 80 Event Consumers so far: ftrace function tracer function_graph tracer SYSCALL instruction entry_SYSCALL_64 (arch/x86/entry/entry_64.s) do_syscall_64() (arch/x86/entry/syscall_64.c) syscall_enter_from _user_mode() syscall_trace_enter() trace_sys_enter() do_syscall_x64() tracepoint perf_event CPU 1 CPU 0 trace event subsystem (events/syscalls/sys_enter_getuid) perf tool (perf record, perf top) struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0); 80 ftrace_trampoline (calls registered function) On enable, patch <ftrace_trampoline> Kernel Mode etc… function_graph function
  62. 81 Fixed Consumers: Configurable, But Not Programmable ftrace function tracer

    function_graph tracer tracepoint perf_event trace event subsystem (events/syscalls/sys_enter_getuid) perf tool (perf record, perf top) # CPU DURATION FUNCTION CALLS # | | | | | | | ------------------------------------------ 8) <idle>-0 => migrati-65 ------------------------------------------ 8) | irq_exit_rcu() { 8) 0.231 us | idle_cpu(); 8) | tick_nohz_irq_exit() { 8) 0.260 us | ktime_get(); 8) 0.742 us | } 8) 1.683 us | } 8) | __sysvec_call_function() { 8) | generic_smp_call_function...() { 8) | __flush_smp_call_function...() { 8) 0.261 us | do_sync_core(); 8) 0.781 us | } 8) 1.283 us | } # echo function_graph > $TRACEFS/current_tracer // from terminal # perf record -e cycles -c 100000 -C 0 -g -- sleep 1 # perf report -f --stdio --no-hierarchy # Children Self Command Symbol # ........ ........ ............... .............. # 81.03% 0.00% dd [.] _start | ---_start __libc_start_main@@GLIBC_2.34 __libc_start_call_main main | |--60.34%--iwrite.constprop.0 | | | --60.33%--__GI___libc_write | | // from terminal # cat /sys/kernel/tracing/trace bash-601 [004] ..... 178.993751: sys_getuid() tmux: client-698 [005] ..... 180.146847: sys_getuid() tmux: server-700 [007] ..... 180.155717: sys_getuid() tput-709 [010] ..... 180.257137: sys_getuid() bash-706 [010] ..... 180.246584: sys_getuid() tput-716 [006] ..... 180.266727: sys_getuid() bash-701 [008] ..... 180.268412: sys_getuid() # echo 1 > $TRACEFS/events/syscalls/sys_enter_getuid/enable
  63. 82 Real Tracing Questions Fixed consumer can answer: How many

    times getuid is called? Hard to answer with Fixed Consumers What we often want: List top processes calling getuid() (> 100 times)
  64. 83 But BPF Tracing Programs can! What we often want:

    List top processes calling getuid() (> 100 times) struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; }
  65. 84 But BPF Tracing Programs can! struct { __uint(type, BPF_MAP_TYPE_HASH);

    __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } What we often want: List top processes calling getuid() (> 100 times) 0. Define BPF MAP 15978 15 16221 152 16320 30 19234 192 21245 212 23102 52 Key Value
  66. 85 But BPF Tracing Programs can! struct { __uint(type, BPF_MAP_TYPE_HASH);

    __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } What we often want: List top processes calling getuid() (> 100 times) 1. Program Observation Logic a. fetch current process id b. Map Lookup for matching entry c. if found, update the counter d. if not, set as 1 and update entry 15978 15 16221 21 16320 103++ 19234 192 21245 212 23102 1 Key(PID) Value(Count)
  67. 86 But BPF Tracing Programs can! struct { __uint(type, BPF_MAP_TYPE_HASH);

    __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } What we often want: List top processes calling getuid() (> 100 times) 2. Annotate with SEC(“kprobe/”) CPU raise #BP Process Execution Execute INT3 kprobe framework Kernel Mode calls pre_handler() Single-Step Instruction Resume Execution calls post_handler() BPF Run BPF program as event consumer
  68. Different event source can be used too! kprobe ftrace tracepoint

    struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("fentry/__do_sys_getuid") int BPF_PROG(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); if (val) (*val)++; return 0; } struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); if (val) (*val)++; return 0; } struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("tracepoint/syscalls/sys_enter_getuid") int count_getuid(struct trace_event_raw_sys_enter *ctx) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); if (val) (*val)++; return 0; } SYSCALL instruction entry_SYSCALL_64 (arch/x86/entry/entry_64.s) do_syscall_64() (arch/x86/entry/syscall_64.c) syscall_enter_from _user_mode() syscall_trace_enter() trace_sys_enter() do_syscall_x64() CPU raise #BP Patch 1B INT3 Execute INT3 kprobe framework Kernel Mode calls pre_handler() Single-Step Instruction Resume Execution calls post_handler() 87 ftrace_trampoline (calls registered function) On enable, patch <ftrace_trampoline> Kernel Mode etc… function_graph BPF trampoline Kernel Mode BPF BPF BPF
  69. 88 But wait, what about kprobe module? kprobe uses kernel

    module! So It can implement anything, even the same?
  70. struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process

    ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } 89 Yes you can, but... DEFINE_HASHTABLE(cnt, 10); struct entry { u32 tgid; u32 count; struct hlist_node node; }; static int handler_pre(struct kprobe *p, struct pt_regs *regs) { u32 tgid = current->tgid; struct entry *e; hash_for_each_possible(cnt, e, node, tgid) { if (e->tgid == tgid) { e->count++; return 0; } } e = kmalloc(sizeof(*e), GFP_ATOMIC); if (!e) return 0; e->count = 1; e->tgid = tgid; hash_add(cnt, &e->node, tgid); return 0; } static struct kprobe kp = { .symbol_name = "__do_sys_getuid", .pre_handler = handler_pre, }; static int __init my_module_init(void) { hash_init(cnt); return register_kprobe(&kp); } static void __exit my_module_exit(void) { struct entry *e; int bkt; unregister_kprobe(&kp); hash_for_each(cnt, bkt, e, node) { pr_info("%u:%u\n", e->tgid, e->count); kfree(e); } } BPF Kernel module (kprobe)
  71. 90 Not effective static struct kprobe kp = { .symbol_name

    = "__do_sys_getuid", .pre_handler = handler_pre, }; static int __init my_module_init(void) { hash_init(cnt); return register_kprobe(&kp); } static void __exit my_module_exit(void) { struct entry *e; int bkt; unregister_kprobe(&kp); hash_for_each(cnt, bkt, e, node) { pr_info("%u:%u\n", e->tgid, e->count); kfree(e); } } iteration, cleanup DEFINE_HASHTABLE(cnt, 10); struct entry { u32 tgid; u32 count; struct hlist_node node; }; static int handler_pre(struct kprobe *p, struct pt_regs *regs) { u32 tgid = current->tgid; struct entry *e; hash_for_each_possible(cnt, e, node, tgid) { if (e->tgid == tgid) { e->count++; return 0; } } e = kmalloc(sizeof(*e), GFP_ATOMIC); if (!e) return 0; e->count = 1; e->tgid = tgid; hash_add(cnt, &e->node, tgid); return 0; } allocation hash table,entries lookup, update
  72. 92 Kernel Safety static int handler_pre(...) { ... while (1)

    { cpu_relax(); } ... return 0; } May not terminate Invalid memory access Corrupts kernel state static int handler_pre(...) { ... u32 *bad = (u32 *)0xdeadbeef; *bad = 1; ... return 0; } static int handler_pre(...) { ... spin_lock(&mylock); /* forgot spin_unlock() */ ... return 0; }
  73. Application Load *(u64 *0)=1; ... 93 BPF verifies this behavior

    struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } User Mode Kernel Mode ftrace tracepoint kprobe Attach Infinite Loops Load while (1) ... Invalid Mem Access Resource misuse Load ... Normal BPF program Load BPF Verifier Guaranteed Termination Safe kernel interaction Memory Safety … BPF … Does the program safe to execute?
  74. 94 So BPF Tracing Provides Programmability - write custom observation

    logic Flexibility - attach to different tracing event source Safety - run your BPF code safely inside the kernel
  75. 96 BPF in Action 1. Compile into BPF bytecode User

    written BPF program ↓ Clang / LLC struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } # clang -O2 -g -target bpf -c getuid.bpf.c -o getuid.bpf.o BPF
  76. 97 BPF in Action 1. Compile into BPF bytecode User

    written BPF program ↓ Clang / LLC struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } # clang -O2 -g -target bpf -c getuid.bpf.c -o getuid.bpf.o # llvm-objdump -d -r --arch=bpf getuid.bpf.o getuid.bpf.o: file format elf64-bpf Disassembly of section kprobe/__do_sys_getuid: 0000000000000000 <count_getuid>: 0: b7 01 00 00 00 00 00 00 r1 = 0x0 1: 63 1a fc ff 00 00 00 00 *(u32 *)(r10 - 0x4) = r1 2: bf a2 00 00 00 00 00 00 r2 = r10 3: 07 02 00 00 fc ff ff ff r2 += -0x4 4: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll 0000000000000020: R_BPF_64_64 cnt 6: 85 00 00 00 01 00 00 00 call 0x1 7: 15 00 03 00 00 00 00 00 if r0 == 0x0 goto +0x3 <LBB0_2> 8: 79 01 00 00 00 00 00 00 r1 = *(u64 *)(r0 + 0x0) 9: 07 01 00 00 01 00 00 00 r1 += 0x1 10: 7b 10 00 00 00 00 00 00 *(u64 *)(r0 + 0x0) = r1 11: b7 00 00 00 00 00 00 00 r0 = 0x0 12: 95 00 00 00 00 00 00 00 exit BPF
  77. 98 BPF in Action struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768);

    __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } 2. Load & Attach to kernel # clang -O2 -g -target bpf -c getuid.bpf.c -o getuid.bpf.o # bpftool prog loadall getuid.bpf.o /sys/fs/bpf/kprobe autoattach BPF Verifier Guaranteed Termination Kernel State Protection Memory Safety … Load BPF program Attach tracepoint ftrace kprobe User Mode Kernel Mode BPF
  78. CPU raise #BP Execute INT3 kprobe framework calls pre_handler() Single-Step

    Instruction Resume Execution calls post_handler() BPF in Action struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } 3. BPF Handler executed # clang -O2 -g -target bpf -c getuid.bpf.c -o getuid.bpf.o # bpftool prog loadall getuid.bpf.o /sys/fs/bpf/kprobe autoattach BPF BPF Verifier Guaranteed Termination Kernel State Protection Memory Safety … User Mode Kernel Mode Attach tracepoint ftrace kprobe Load BPF program BPF 99
  79. 100 BPF in Action struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768);

    __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } 4. Check the execution result # clang -O2 -g -target bpf -c getuid.bpf.c -o getuid.bpf.o # bpftool prog loadall getuid.bpf.o /sys/fs/bpf/kprobe autoattach # bpftool -j map dump name cnt | jq -c '.[].formatted' | sort {"key":1338,"value":12} {"key":1346,"value":90} {"key":1348,"value":107} {"key":1659,"value":4} {"key":1791,"value":197} {"key":1892,"value":561} {"key":2003,"value":14} {"key":2076,"value":51} {"key":2173,"value":159} 1346 90 1348 107 1659 4 1791 197 1829 561 2003 14 Key(PID) Value(Count)
  80. 101 BPF does not replace event sources. It reuses existing

    tracing infrastructure: kprobe, ftrace, tracepoints, or perf_event.
  81. Berkeley Packet Filter (3.18) 102 Linux Kernel Tracing Internals Dynamic

    Instrumentation Static Instrumentation Sampling Exception-based kprobe (2.6.9) Code-patching ftrace (2.6.27) Statically-defined tracepoints (2.6.32) Counter-based perf_event (2.6.31) The key is asking the right question, and choosing the right event source and consumer.