Linux Kernel Tracing Internals

Linux Kernel Tracing Internals From Event Sources to BPF Hoyeon
Lee

Hoyeon Lee Seoul, South Korea SUSE - HW Enablement Taipei
BPF System Engineer Working on Linux Kernel BPF 2 Whoami

3 Tracing? Observing what the kernel is actually doing

Exception-based kprobe (2.6.9) 4 Kernel Tracing Event Sources Code-patching ftrace
(2.6.27) Statically-deﬁned tracepoints (2.6.32) Counter-based perf_event (2.6.31) Dynamic Instrumentation Static Instrumentation Sampling

Berkeley Packet Filter (3.18) 5 Programmable Layer on Top Dynamic
Instrumentation Static Instrumentation Sampling Exception-based kprobe (2.6.9) Code-patching ftrace (2.6.27) Statically-deﬁned tracepoints (2.6.32) Counter-based perf_event (2.6.31)

1. Dynamic Instrumentation kprobe — exception-based ftrace — function-entry patching
2. Static Instrumentation tracepoints — predeﬁned observation points 3. Sampling perf_event — counter-overﬂow sampling 4. Programmable Layer BPF — programs attached to event sources Scope: x86-64, baseline paths only, no optimized paths 6 Outline

0. Instrumentation? 7

Inserting observation points, to capture exact info, at exact moment
8 What is Instrumentation? [ running ] → (observation point) → [ keep running ]

Inserting observation points, to capture exact info, at exact moment
Primitive Instrumentation: printk() 9 [ running ] → (observation point) → [ keep running ] void do_something() { · · · · · · printk("Hello World\n"); · · · · · · }

Too expensive on hot paths Perturbs timing-sensitive behavior Requires source
change and rebuild printk(): Why Not for Tracing? 10 void do_something() { · · · · · · printk("Hello World\n"); · · · · · · }

⚡ Lower overhead 🔍 Less perturbation 🧩 Enable without rebuild
11 We Need Better Instrumentation Too expensive on hot paths Perturbs timing-sensitive behavior Requires source change and rebuild

12 What Instrumentation Can We Use? Dynamic Instrumentation Insert observation
points at runtime Static Instrumentation Use Observation points inserted at build time void do_something() { · · · · · · · · · · · · · · · · · · } Compiled Result void do_something() { observation_point(); · · · · · · · · · · · · } Runtime void do_something() { · · · · · · observation_point(); · · · · · · } Compiled Result void do_something() { · · · · · · observation_point(); · · · · · · } Runtime Insert Enable

1. Dynamic Instrumentation 13

14 Exception-based kprobe Breakpoint instruction using INT3 (x86-64)

15 What is a breakpoint, really? How does GDB break
at main? Does GDB stop execution directly? $ gdb ./hello (gdb) disas /r main Dump of assembler code for function main: 0x00401136 <+0>: 55 push %rbp 0x00401137 <+1>: 48 89 e5 mov %rsp,%rbp 0x0040113a <+4>: bf 04 20 40 00 mov $0x402004,%edi 0x0040113f <+9>: e8 ec fe ff ff call 0x401030 <puts@plt> 0x00401144 <+14>: b8 00 00 00 00 mov $0x0,%eax 0x00401149 <+19>: 5d pop %rbp 0x0040114a <+20>: c3 ret End of assembler dump. (gdb) break *main Breakpoint 1 at 0x401136

16 What is a breakpoint, really? GDB doesn’t stop execution
directly. It patches INT3, and the CPU traps itself. #BP exception $ gdb ./hello (gdb) disas /r main Dump of assembler code for function main: 0x00401136 <+0>: + CC INT3 0x00401136 <+0>: - 55 push %rbp 0x00401137 <+1>: 48 89 e5 mov %rsp,%rbp 0x0040113a <+4>: bf 04 20 40 00 mov $0x402004,%edi 0x0040113f <+9>: e8 ec fe ff ff call 0x401030 <puts@plt> 0x00401144 <+14>: b8 00 00 00 00 mov $0x0,%eax 0x00401149 <+19>: 5d pop %rbp 0x0040114a <+20>: c3 ret End of assembler dump. (gdb) break *main Breakpoint 1 at 0x401136 (gdb) run Using host libthread_db library "/lib64/libthread_db.so.1". Breakpoint 1, 0x00000000004011c6 in main ()

17 What is a breakpoint, really? kprobe uses the same
breakpoint idea, but the trap is handled inside the kernel CPU raise #BP send SIGTRAP $ gdb ./hello (gdb) disas /r main Dump of assembler code for function main: 0x00401136 <+0>: + CC INT3 0x00401136 <+0>: - 55 push %rbp 0x00401137 <+1>: 48 89 e5 mov %rsp,%rbp 0x0040113a <+4>: bf 04 20 40 00 mov $0x402004,%edi 0x0040113f <+9>: e8 ec fe ff ff call 0x401030 <puts@plt> 0x00401144 <+14>: b8 00 00 00 00 mov $0x0,%eax 0x00401149 <+19>: 5d pop %rbp 0x0040114a <+20>: c3 ret End of assembler dump. (gdb) break *main Breakpoint 1 at 0x401136 (gdb) run Using host libthread_db library "/lib64/libthread_db.so.1". Breakpoint 1, 0x00000000004011c6 in main () Process Execution Execute INT3 Debugger gains control Enter Kernel Mode Trap Handler (Interrupt Descriptor Table) Leave Kernel Mode User Mode Kernel Mode ptrace check

What does the getuid system call look like? 18 A
Small Kernel Target <__do_sys_getuid> #include <unistd.h> uid_t getuid (void); # id uid=1000(user) ...

19 Disassembly of getuid syscall $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid
00000000 <__do_sys_getuid>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax // get_current() c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax // current->cred 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi // current->cred->uid 1a: 8b 70 08 movl 0x8(%rax), %esi // current->cred->user_ns 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq SYSCALL_DEFINE0(getuid) { return from_kuid_munged(current_user_ns(), current_uid()); }

20 kprobe Internals 0. Enable kprobe to __do_sys_getuid // from
kernel module struct kprobe kp = { .symbol_name = "__do_sys_getuid", .offset = 0, .pre_handler = handler_pre, // optional .post_handler = handler_post, // optional }; register_kprobe(&kp); $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq

21 kprobe Internals // include/linux/kprobes.h struct kprobe { ... const
char *symbol_name; __do_sys_getuid unsigned int offset; 0 kprobe_pre_handler_t pre_handler; kprobe_post_handler_t post_handler; ... struct arch_specific_insn ainsn; }; 1. Save original instruction bytes $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq pushq %rbp

char *symbol_name; __do_sys_getuid unsigned int offset; 0 kprobe_pre_handler_t pre_handler; kprobe_post_handler_t post_handler; ... struct arch_specific_insn ainsn; }; $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: -0: 55 - pushq %rbp +0: CC + INT3 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq 2. Patch 1 Byte with INT3 pushq %rbp Patch 1B INT3 Kernel Mode

23 kprobe Internals 3. INT3 execution traps into kprobe framework
// include/linux/kprobes.h struct kprobe { ... const char *symbol_name; __do_sys_getuid unsigned int offset; 0 kprobe_pre_handler_t pre_handler; kprobe_post_handler_t post_handler; ... struct arch_specific_insn ainsn; }; $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: CC INT3 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq pushq %rbp CPU raise #BP Patch 1B INT3 Execute INT3 kprobe framework Kernel Mode

4. pre_handler() is called // include/linux/kprobes.h struct kprobe { ... const char *symbol_name; __do_sys_getuid unsigned int offset; 0 kprobe_pre_handler_t pre_handler; kprobe_post_handler_t post_handler; ... struct arch_specific_insn ainsn; }; $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: CC INT3 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq pushq %rbp CPU raise #BP Patch 1B INT3 Execute INT3 kprobe framework Kernel Mode calls pre_handler()

4. pre_handler() is called 5. Single-Step original instruction // include/linux/kprobes.h struct kprobe { ... const char *symbol_name; __do_sys_getuid unsigned int offset; 0 kprobe_pre_handler_t pre_handler; kprobe_post_handler_t post_handler; ... struct arch_specific_insn ainsn; }; pushq %rbp $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: CC INT3 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq pushq %rbp CPU raise #BP Patch 1B INT3 Execute INT3 kprobe framework Kernel Mode calls pre_handler() Single-Step Instruction

4. pre_handler() is called 5. Single-Step original instruction // include/linux/kprobes.h struct kprobe { ... const char *symbol_name; __do_sys_getuid unsigned int offset; 0 kprobe_pre_handler_t pre_handler; kprobe_post_handler_t post_handler; ... struct arch_specific_insn ainsn; }; pushq %rbp 6. post_handler() is called $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: CC INT3 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq pushq %rbp CPU raise #BP Patch 1B INT3 Execute INT3 kprobe framework Kernel Mode calls pre_handler() calls post_handler() Single-Step Instruction

char *symbol_name; __do_sys_getuid unsigned int offset; 0 kprobe_pre_handler_t pre_handler; kprobe_post_handler_t post_handler; ... struct arch_specific_insn ainsn; }; $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: CC INT3 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq pushq %rbp 7. Resume Execution CPU raise #BP Patch 1B INT3 Execute INT3 kprobe framework Kernel Mode calls pre_handler() Single-Step Instruction Resume Execution calls post_handler()

28 kprobe in Action Counting getuid() with kprobe

29 kprobe in Action DEFINE_PER_CPU(int, getuid_cnt); static int handler_pre(struct kprobe
*p, struct pt_regs *regs) { this_cpu_inc(getuid_cnt); return 0; } static struct kprobe kp = { .symbol_name = "__do_sys_getuid", .pre_handler = handler_pre, .offset = 0 }; static int __init my_module_init(void) { return register_kprobe(&kp); } static void __exit my_module_exit(void) { int cpu, total = 0; unregister_kprobe(&kp); for_each_possible_cpu(cpu) total += per_cpu(getuid_cnt, cpu); pr_info("__do_sys_getuid count = %d\n", total); } module_init(my_module_init); module_exit(my_module_exit); Looks simple. But what did we just pay?

30 kprobe in Action entire path runs for every single
syscall DEFINE_PER_CPU(int, getuid_cnt); static int handler_pre(struct kprobe *p, struct pt_regs *regs) { this_cpu_inc(getuid_cnt); return 0; } static struct kprobe kp = { .symbol_name = "__do_sys_getuid", .pre_handler = handler_pre, .offset = 0 }; static int __init my_module_init(void) { return register_kprobe(&kp); } static void __exit my_module_exit(void) { int cpu, total = 0; unregister_kprobe(&kp); for_each_possible_cpu(cpu) total += per_cpu(getuid_cnt, cpu); pr_info("__do_sys_getuid count = %d\n", total); } module_init(my_module_init); module_exit(my_module_exit); CPU raise #BP Process Execution Execute INT3 kprobe framework Kernel Mode calls pre_handler() Single-Step Instruction Resume Execution calls post_handler()

31 Revisit kprobe Why did kprobe need traps in the
ﬁrst place? <INT3>

32 Why did kprobe need traps? Because there was no
place to observe. void do_something() { · · · · · · · · · · · · · · · · · · } TRAP INT3 kprobe pre_handler kprobe post_handler Single-Step

33 Why did kprobe need traps? Because there was no
place to observe. void do_something() { · · · · · · · · · · · · · · · · · · } INT3 What if the code itself had a place to observe? (not printk or logic, does nothing by default) void do_something() { // instrumentation site · · · · · · · · · · · · } kprobe pre_handler kprobe post_handler Single-Step TRAP

Code-patching ftrace 34 Don’t trap the CPU. Patch the code.

35 How does an instrumentation site look? Five NOPs inserted
by the compiler (x86-64) Does nothing by default void do_something() { // instrumentation site · · · · · · · · · · · · } [ NOP ][ NOP ][ NOP ][ NOP ][ NOP ]

36 Why ﬁve NOPs? Five NOPs inserted by the compiler
(x86-64) Does nothing by default void do_something() { // instrumentation site · · · · · · · · · · · · } [ NOP ][ NOP ][ NOP ][ NOP ][ NOP ] CALL rel32 [ E8X ][ XXX ][ XXX ][ XXX ][ XXX ] Same size as a CALL instruction Will be replaced at runtime.

37 Revisiting getuid disassembly Compiled with CONFIG_FTRACE=N $ objdump -d
v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: 55 pushq %rbp 1: 48 89 e5 movq %rsp, %rbp 4: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax c: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 13: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1a: 8b 70 08 movl 0x8(%rax), %esi 1d: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 22: 89 c0 movl %eax, %eax 24: 5d popq %rbp 25: c3 retq Compiled with CONFIG_FTRACE=Y $ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: 0f 1f 44 00 00 nopl (%rax, %rax) 5: 55 pushq %rbp 6: 48 89 e5 movq %rsp, %rbp 9: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax 11: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 18: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1f: 8b 70 08 movl 0x8(%rax), %esi 22: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 27: 89 c0 movl %eax, %eax 29: 5d popq %rbp 2a: c3 retq

$ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: 0f 1f
44 00 00 nopl (%rax, %rax) 5: 55 pushq %rbp 6: 48 89 e5 movq %rsp, %rbp 9: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax 11: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 18: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1f: 8b 70 08 movl 0x8(%rax), %esi 22: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 27: 89 c0 movl %eax, %eax 29: 5d popq %rbp 2a: c3 retq 38 ftrace Internals 0. Enable ftrace function tracer with __do_sys_getuid // from terminal, tracefs # echo __do_sys_getuid > /sys/kernel/tracing/set_ftrace_filter # echo function > /sys/kernel/tracing/current_tracer # echo 1 > /sys/kernel/tracing/tracing_on

$ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: +0: e8 fb
96 35 1c call 0xffffffffc0600000 <ftrace_trampoline> -0: 0f 1f 44 00 00 nopl (%rax, %rax) 5: 55 pushq %rbp 6: 48 89 e5 movq %rsp, %rbp 9: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax 11: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 18: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1f: 8b 70 08 movl 0x8(%rax), %esi 22: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 27: 89 c0 movl %eax, %eax 29: 5d popq %rbp 2a: c3 retq ftrace Internals 1. 5-NOP is patched to ftrace_trampoline // from terminal # echo __do_sys_getuid > /sys/kernel/tracing/set_ftrace_filter # echo function > /sys/kernel/tracing/current_tracer # echo 1 > /sys/kernel/tracing/tracing_on // from terminal # cat /proc/kallsyms | grep sys_getuid$ ffffffffa42a6900 T __do_sys_getuid # gdb -q vmlinux /proc/kcore (gdb) disas /r __do_sys_getuid, +5 Dump of assembler code from 0xffffffff812a6900 to 0xffffffff812a6905: 0xffffffff812a6900 <__do_sys_getuid+0>: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) (gdb) disas /r 0xffffffffa42a6900, +5 Dump of assembler code from 0xffffffffa42a6900 to 0xffffffffa42a6905: 0xffffffffa42a6900: e8 fb 96 35 1c call 0xffffffffc0600000 # cat /proc/kallsyms | grep ^ffffffffc0600000 ffffffffc0600000 t ftrace_trampoline [__builtin__ftrace] Kernel Mode On enable, patch <ftrace_trampoline>

96 35 1c call 0xffffffffc0600000 5: 55 pushq %rbp 6: 48 89 e5 movq %rsp, %rbp 9: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax 11: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 18: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1f: 8b 70 08 movl 0x8(%rax), %esi 22: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 27: 89 c0 movl %eax, %eax 29: 5d popq %rbp 2a: c3 retq ftrace Internals // from terminal # echo __do_sys_getuid > /sys/kernel/tracing/set_ftrace_filter # echo function > /sys/kernel/tracing/current_tracer # echo 1 > /sys/kernel/tracing/tracing_on 2. Calls <ftrace_trampoline> // built on-the-fly based on ftrace_caller ftrace_trampoline(ctx) { save_minimal_cpu_state(ctx) for (ops in ftrace_ops_list): ops.func(ctx) restore_cpu_state(ctx) return; } ftrace_trampoline (calls registered function) On enable, patch <ftrace_trampoline> Kernel Mode

96 35 1c call 0xffffffffc0600000 5: 55 pushq %rbp 6: 48 89 e5 movq %rsp, %rbp 9: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax 11: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 18: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1f: 8b 70 08 movl 0x8(%rax), %esi 22: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 27: 89 c0 movl %eax, %eax 29: 5d popq %rbp 2a: c3 retq ftrace Internals // from terminal # echo __do_sys_getuid > /sys/kernel/tracing/set_ftrace_filter # echo function > /sys/kernel/tracing/current_tracer # echo 1 > /sys/kernel/tracing/tracing_on // built on-the-fly based on ftrace_caller ftrace_trampoline(ctx) { save_minimal_cpu_state(ctx) for (ops in ftrace_ops_list): ops.func(ctx) restore_cpu_state(ctx) return; } 2. Calls <ftrace_trampoline> 3. calls handler function (here ‘function’ tracer) static void function_trace_call(ulong ip, ulong parent_ip, struct ftrace_ops *op, struct ftrace_regs *fregs) { // ... if (unlikely(!op->private->function_enabled)) return; // ... trace_function(tr, ip, parent_ip, trace_ctx, NULL); } ftrace_trampoline (calls registered function) On enable, patch <ftrace_trampoline> Kernel Mode Handler functions (here ‘function’ tracer)

$ objdump -d v6.18/vmlinux --disassemble=__do_sys_getuid 00000000 <__do_sys_getuid>: 0: e8 fb
96 35 1c call 0xffffffffc0600000 5: 55 pushq %rbp 6: 48 89 e5 movq %rsp, %rbp 9: 65 48 8b 05 ec da c8 01 movq %gs:0x1c8daec(%rip), %rax 11: 48 8b 80 68 08 00 00 movq 0x868(%rax), %rax 18: 48 8b b8 90 00 00 00 movq 0x90(%rax), %rdi 1f: 8b 70 08 movl 0x8(%rax), %esi 22: e8 6e 21 0c 00 callq 0xc2190 <from_kuid_munged> 27: 89 c0 movl %eax, %eax 29: 5d popq %rbp 2a: c3 retq ftrace Internals // from terminal # echo __do_sys_getuid > /sys/kernel/tracing/set_ftrace_filter # echo function > /sys/kernel/tracing/current_tracer # echo 1 > /sys/kernel/tracing/tracing_on // built on-the-fly based on ftrace_caller ftrace_trampoline(ctx) { save_minimal_cpu_state(ctx) for (ops in ftrace_ops_list): ops.func(ctx) restore_cpu_state(ctx) return; } 4. Resume Execution 3. calls handler function (here ’function’ tracer) ftrace_trampoline (calls registered function) On enable, patch <ftrace_trampoline> Kernel Mode Handler functions (here ‘function’ tracer) Resume Execution 2. Calls <ftrace_trampoline>

ftrace consumers 43 function tracer function_graph tracer BPF trampoline (fentry
/ fexit) # CPU DURATION FUNCTION CALLS # | | | | | | | ------------------------------------------ 8) <idle>-0 => migrati-65 ------------------------------------------ 8) | irq_exit_rcu() { 8) 0.231 us | idle_cpu(); 8) | tick_nohz_irq_exit() { 8) 0.260 us | ktime_get(); 8) 0.742 us | } 8) 1.683 us | } 8) | __sysvec_call_function() { 8) | generic_smp_call_function...() { 8) | __flush_smp_call_function...() { 8) 0.261 us | do_sync_core(); 8) 0.781 us | } # TASK-PID CPU# ||| TIMESTAMP FUNCTION # | | | ||| | | master-159 [005] ... 42.0935: __do_sys_getuid <-do_syscall_64 Lsb_release-179 [010] ... 65.2152: __do_sys_getuid <-do_syscall_64 <...>-179 [014] ... 67.7051: __do_sys_getuid <-do_syscall_64 systemd-logi-125 [002] ... 68.4213: __do_sys_getuid <-do_syscall_64 fcitx5-205 [015] ... 70.0857: __do_sys_getuid <-do_syscall_64 systemd-udev-750 [011] ... 84.1000: __do_sys_getuid <-do_syscall_64 kded6-212 [002] ... 90.3384: __do_sys_getuid <-do_syscall_64 # echo function > $TRACEFS/current_tracer # echo function_graph > $TRACEFS/current_tracer SEC("fentry/__do_sys_getuid") int BPF_PROG(trace_getuid) { bpf_printk("getuid called\n"); return 0; } 43 ftrace_trampoline (calls registered function) On enable, patch <ftrace_trampoline> Kernel Mode etc… function_graph function Who consumes this ftrace path?

44 ftrace in Action Again, counting getuid() with ftrace

45 ftrace in Action OR // from terminal, tracefs #
cd /sys/kernel/tracing # echo __do_sys_getuid > set_ftrace_filter # echo function > current_tracer # echo 1 > tracing_on Counting __do_sys_getuid calls // from terminal # cat /sys/kernel/tracing/trace # TASK-PID CPU# ||| TIMESTAMP FUNCTION # | | | ||| | | master-1590 [005] ... 42.0935: __do_sys_getuid <-do_syscall_64 Lsb_release-1796 [010] ... 65.2152: __do_sys_getuid <-do_syscall_64 <...>-1793 [014] ... 67.7051: __do_sys_getuid <-do_syscall_64 systemd-logi-1253 [002] ... 68.4213: __do_sys_getuid <-do_syscall_64 fcitx5-2052 [015] ... 70.0857: __do_sys_getuid <-do_syscall_64 systemd-udev-750 [011] ... 84.1000: __do_sys_getuid <-do_syscall_64 kded6-2126 [002] ... 90.3384: __do_sys_getuid <-do_syscall_64 # cat /sys/kernel/tracing/trace | wc -l 36

46 Dynamic Instrumentation: Flexibility vs. Cost Can attach almost anywhere
Pays the trap cost kprobe Cheaper Trampoline path Limited to function entry/return ftrace CPU raise #BP Process Execution Execute INT3 kprobe framework Kernel Mode calls pre_handler() Single-Step Instruction Resume Execution calls post_handler() ftrace_trampoline (calls registered function) On enable, patch <ftrace_trampoline> Kernel Mode Resume Execution Process Execution etc… function_graph function

47 Recap: Dynamic Instrumentation Whether it’s a trap or a
trampoline, we still intercept execution at runtime. What if the observation point was already there? void do_something() { · · · · · · · · · · · · · · · · · · } void do_something() { // instrumentation site · · · · · · · · · · · · } Runtime INT3

48 Recap: Dynamic Instrumentation Whether it’s a trap or a
trampoline, we still intercept execution at runtime. Static Instrumentation Observation points inserted at build time void do_something() { · · · · · · observation_point(); · · · · · · } Compiled Result void do_something() { · · · · · · observation_point(); · · · · · · } Runtime Enable void do_something() { · · · · · · · · · · · · · · · · · · } void do_something() { // instrumentation site · · · · · · · · · · · · } Runtime INT3

2. Static Instrumentation 49

Statically-deﬁned tracepoints 50 low-overhead observation point

51 Why printk Does Not Scale? printk() - String formatting
(parse, va_list) - printk("uid=%d pid=%d\n", uid, pid); parse format string variable argument handling convert integer to string should count total length

Why printk Does Not Scale? printk() - String formatting (parse,
va_list) - Global log buffer (aka. dmesg buffer) - printk("uid=%d pid=%d\n", uid, pid); All written into single shared buffer (kernel log buffer, aka. dmesg buffer) printk("uid=%d\n", uid); printk("pid=%d\n", pid); 52

53 Why printk Does Not Scale? printk() - String formatting
(parse, va_list) - Global log buffer (aka. dmesg buffer) - Unstructured strings (hard to ﬁlter) printk("uid=%d pid=%d\n", uid, pid); printk("user identifier=%d\n", uid); printk("user number=%d\n", uid); Format can be arbitrary and hard to ﬁlter

54 Solution? tracepoints - Fixed event schema (lower overhead) -
Dedicated tracing buffer - Structured binary data (easier ﬁltering, processing, and aggregating) printk() - String formatting (parse, va_list) - Global log buffer (aka. dmesg buffer) - Unstructured strings (hard to ﬁlter)

55 So then, where are tracepoints located? Built into the
execution path

56 Example: syscall tracepoints syscall tracepoints are ﬁred before the
actual syscall <do_syscall_x64()> <trace_sys_enter()> SYSCALL instruction entry_SYSCALL_64 (arch/x86/entry/entry_64.s) do_syscall_64() (arch/x86/entry/syscall_64.c) syscall_enter_from _user_mode() syscall_trace_enter() trace_sys_enter() do_syscall_x64()

57 Example: syscall tracepoints TRACE_EVENT_SYSCALL(sys_enter, TP_STRUCT__entry( __field(long, id) __array(unsigned long,
args, 6) ), TP_fast_assign( __entry->id = id; syscall_get_arguments(current, regs, __entry->args); ), // ... snip ... ); CPU 0 CPU 1 CPU 2 CPU 3 Just structured data written to a ring buffer. Per-CPU trace buffer SYSCALL instruction entry_SYSCALL_64 (arch/x86/entry/entry_64.s) do_syscall_64() (arch/x86/entry/syscall_64.c) syscall_enter_from _user_mode() syscall_trace_enter() trace_sys_enter() do_syscall_x64()

58 tracepoints in Action Once again, counting getuid() with tracepoints

59 tracepoint in Action // from terminal # cd /sys/kernel/tracing
# echo 1 > events/syscalls/sys_enter_getuid/enable # echo 1 > tracing_on Counting __do_sys_getuid calls with syscalls/sys_enter_getuid // from terminal # cat /sys/kernel/tracing/trace bash-601 [004] ..... 178.993751: sys_getuid() tmux: client-698 [005] ..... 180.146847: sys_getuid() tmux: server-700 [007] ..... 180.155717: sys_getuid() tput-709 [010] ..... 180.257137: sys_getuid() bash-706 [010] ..... 180.246584: sys_getuid() tput-716 [006] ..... 180.266727: sys_getuid() bash-701 [008] ..... 180.268412: sys_getuid() # cat /sys/kernel/tracing/trace | wc -l 29 OR

60 void do_something() { · · · · · ·
observation_point(); · · · · · · } void do_something() { · · · · · · · · · · · · · · · · · · } void do_something() { // instrumentation site · · · · · · · · · · · · } INT3 We examined three tracing mechanisms for the same syscall. kprobe ftrace tracepoint Event-driven instrumentation - observe events as they happen observing every event pays the tracing cost

61 Cost of Event-driven Tracing https://www.ebpf.top/en/post/bpf_rawtracepoint/ kprobe ftrace tracepoint Overhead
Cost > ≈ In Event-driven instrumentation, overhead grows with event rate $ cd tools/testing/selftests/bpf $ ./benchs/run_bench_trigger.sh

62 We need another mode of observation Not every event.
Just enough.

3. Sampling 63

64 So how Sampling Works? Sampling replaces "observe every event"
with "observe when a counter overﬂows". Events increment a counter. Sampling happens only on overﬂow. Event timeline: | | | | | | | | | | | | | | | | Counter value : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ^ ^ [ sample1 ] [ sample2 ]

Counter-based perf_event 65 observe enough, not everything

66 What does perf_event can do? struct perf_event_attr attr =
{ .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0); Event timeline | | | | | | | | | | | | | | | | | | | | Counter value 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 (x10,000) ^ ^ [ sample1 ] [ sample2 ] perf_event deﬁnes: - What event to count - When to sample - What to observe

67 What does perf_event can do? perf_event deﬁnes: - What
event to count Event timeline | | | | | | | | | | | | | | | | | | | | Counter value 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 (x10,000) ^ ^ [ sample1 ] [ sample2 ] ➔ Hardware / PMU events (cycles, instructions, cache-misses) ➔ Software events (context-switches, page-fault, cpu-clock) ➔ Tracepoint-backed events ➔ Kprobe-backed events struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0);

event to count - When to sample Event timeline | | | | | | | | | | | | | | | | | | | | Counter value 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 (x10,000) ^ ^ [ sample1 ] [ sample2 ] ➔ Count-based overﬂow ➔ Timer-based sampling (clock events) struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, // every 100,000 cycle .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0);

event to count - When to sample - What to observe Event timeline | | | | | | | | | | | | | | | | | | | | Counter value 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 (x10,000) ^ ^ [ sample1 ] [ sample2 ] ➔ Instruction Pointer ➔ User/Kernel Stack Trace ➔ PID/TID/CPU# ➔ Register Information ➔ Call chains struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, // every 100,000 cycle .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0);

70 What does perf_event can do? Event timeline | |
| | | | | | | | | | | | | | | | | | Counter value 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 (x10,000) ^ ^ [ sample1 ] [ sample2 ] Every 100,000 CPU Cycles, sample current IP + Call Chain => identify frequently executed code path perf_event deﬁnes: - What event to count - When to sample - What to observe struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, // every 100,000 cycle .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0);

71 perf_event in Action Where is the CPU spending time?

72 perf_event in Action struct perf_event_attr attr = { .size
= sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0); ioctl(fd, PERF_EVENT_IOC_ENABLE, 0); Identifying hot code path OR # perf record -e cycles -c 100000 -C 0 -g -- sleep 1

73 perf_event in Action Identifying hot code path # perf
report --stdio --no-children –sort comm -F overhead,comm # Overhead Command # ........ ............... # 0.21% NetworkManager 0.00% avahi-daemon 0.09% awk 0.35% bash 0.06% cat 81.05% rsync | |--12.57%--arch_local_irq_restore | | | |--3.06%--obj_cgroup_uncharge | | __memcg_slab_free_hook | | kmem_cache_free | | | | | --3.03%--free_buffer_head | | try_to_free_buffers struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0); ioctl(fd, PERF_EVENT_IOC_ENABLE, 0); Which process appeared most frequently? OR # perf record -e cycles -c 100000 -C 0 -g -- sleep 1

74 perf_event in Action OR Identifying hot code path #
perf record -e cycles -c 100000 -C 0 -g -- sleep 1 # perf report --stdio -F overhead_children,overhead,comm,symbol # Children Self Command Symbol # ........ ........ ............... .................. # 81.03% 0.00% rsync [.] _start | ---_start __libc_start_call_main main | |--60.34%--iwrite.constprop.0 | | | --60.33%--__GI___libc_write | | | ... | --60.03%--invoke_syscall | | | --60.03%--__x64__sys_write | | | --60.03%--ksys_write struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0); ioctl(fd, PERF_EVENT_IOC_ENABLE, 0); Which call chain appeared most frequently?

75 perf_event in Action Identifying hot code path # perf
script \ | stackcollapse-perf.pl \ | flamegraph.pl > flamegraph.svg struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0); ioctl(fd, PERF_EVENT_IOC_ENABLE, 0); FlameGraph OR # perf record -e cycles -c 100000 -C 0 -g -- sleep 1

Exception-based kprobe (2.6.9) 76 Recap: Tracing Mechanism Code-patching ftrace (2.6.27)
Statically-deﬁned tracepoints (2.6.32) Counter-based perf_event (2.6.31) Dynamic Instrumentation Static Instrumentation Sampling Event-driven Counter-driven Tracing event sources

77 But that’s not the Whole Story Event source decides
where/when. Exception-based kprobe Code-patching ftrace Statically-deﬁned tracepoints Counter-based perf_event But who consumes the event?

4. Programmable Layer 78

Safe, Flexible, Programmable Layer BPF 79 BPF is not a
new tracing event source

writes structured-data Event | | | | | | |
| | | | | | | | Counter 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 (x1,000) ^ ^ [ sample ] [ sample ] 80 Event Consumers so far: ftrace function tracer function_graph tracer SYSCALL instruction entry_SYSCALL_64 (arch/x86/entry/entry_64.s) do_syscall_64() (arch/x86/entry/syscall_64.c) syscall_enter_from _user_mode() syscall_trace_enter() trace_sys_enter() do_syscall_x64() tracepoint perf_event CPU 1 CPU 0 trace event subsystem (events/syscalls/sys_enter_getuid) perf tool (perf record, perf top) struct perf_event_attr attr = { .size = sizeof(struct perf_event_attr), .config = PERF_COUNT_HW_CPU_CYCLES, .type = PERF_TYPE_HARDWARE, .sample_period = 100000, .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN, }; fd = perf_event_open(&attr, -1, 0, -1, 0); 80 ftrace_trampoline (calls registered function) On enable, patch <ftrace_trampoline> Kernel Mode etc… function_graph function

81 Fixed Consumers: Conﬁgurable, But Not Programmable ftrace function tracer
function_graph tracer tracepoint perf_event trace event subsystem (events/syscalls/sys_enter_getuid) perf tool (perf record, perf top) # CPU DURATION FUNCTION CALLS # | | | | | | | ------------------------------------------ 8) <idle>-0 => migrati-65 ------------------------------------------ 8) | irq_exit_rcu() { 8) 0.231 us | idle_cpu(); 8) | tick_nohz_irq_exit() { 8) 0.260 us | ktime_get(); 8) 0.742 us | } 8) 1.683 us | } 8) | __sysvec_call_function() { 8) | generic_smp_call_function...() { 8) | __flush_smp_call_function...() { 8) 0.261 us | do_sync_core(); 8) 0.781 us | } 8) 1.283 us | } # echo function_graph > $TRACEFS/current_tracer // from terminal # perf record -e cycles -c 100000 -C 0 -g -- sleep 1 # perf report -f --stdio --no-hierarchy # Children Self Command Symbol # ........ ........ ............... .............. # 81.03% 0.00% dd [.] _start | ---_start __libc_start_main@@GLIBC_2.34 __libc_start_call_main main | |--60.34%--iwrite.constprop.0 | | | --60.33%--__GI___libc_write | | // from terminal # cat /sys/kernel/tracing/trace bash-601 [004] ..... 178.993751: sys_getuid() tmux: client-698 [005] ..... 180.146847: sys_getuid() tmux: server-700 [007] ..... 180.155717: sys_getuid() tput-709 [010] ..... 180.257137: sys_getuid() bash-706 [010] ..... 180.246584: sys_getuid() tput-716 [006] ..... 180.266727: sys_getuid() bash-701 [008] ..... 180.268412: sys_getuid() # echo 1 > $TRACEFS/events/syscalls/sys_enter_getuid/enable

82 Real Tracing Questions Fixed consumer can answer: How many
times getuid is called? Hard to answer with Fixed Consumers What we often want: List top processes calling getuid() (> 100 times)

83 But BPF Tracing Programs can! What we often want:
List top processes calling getuid() (> 100 times) struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; }

84 But BPF Tracing Programs can! struct { __uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } What we often want: List top processes calling getuid() (> 100 times) 0. Deﬁne BPF MAP 15978 15 16221 152 16320 30 19234 192 21245 212 23102 52 Key Value

__uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } What we often want: List top processes calling getuid() (> 100 times) 1. Program Observation Logic a. fetch current process id b. Map Lookup for matching entry c. if found, update the counter d. if not, set as 1 and update entry 15978 15 16221 21 16320 103++ 19234 192 21245 212 23102 1 Key(PID) Value(Count)

__uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } What we often want: List top processes calling getuid() (> 100 times) 2. Annotate with SEC(“kprobe/”) CPU raise #BP Process Execution Execute INT3 kprobe framework Kernel Mode calls pre_handler() Single-Step Instruction Resume Execution calls post_handler() BPF Run BPF program as event consumer

Different event source can be used too! kprobe ftrace tracepoint
struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("fentry/__do_sys_getuid") int BPF_PROG(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); if (val) (*val)++; return 0; } struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); if (val) (*val)++; return 0; } struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("tracepoint/syscalls/sys_enter_getuid") int count_getuid(struct trace_event_raw_sys_enter *ctx) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); if (val) (*val)++; return 0; } SYSCALL instruction entry_SYSCALL_64 (arch/x86/entry/entry_64.s) do_syscall_64() (arch/x86/entry/syscall_64.c) syscall_enter_from _user_mode() syscall_trace_enter() trace_sys_enter() do_syscall_x64() CPU raise #BP Patch 1B INT3 Execute INT3 kprobe framework Kernel Mode calls pre_handler() Single-Step Instruction Resume Execution calls post_handler() 87 ftrace_trampoline (calls registered function) On enable, patch <ftrace_trampoline> Kernel Mode etc… function_graph BPF trampoline Kernel Mode BPF BPF BPF

88 But wait, what about kprobe module? kprobe uses kernel
module! So It can implement anything, even the same?

struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process
ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } 89 Yes you can, but... DEFINE_HASHTABLE(cnt, 10); struct entry { u32 tgid; u32 count; struct hlist_node node; }; static int handler_pre(struct kprobe *p, struct pt_regs *regs) { u32 tgid = current->tgid; struct entry *e; hash_for_each_possible(cnt, e, node, tgid) { if (e->tgid == tgid) { e->count++; return 0; } } e = kmalloc(sizeof(*e), GFP_ATOMIC); if (!e) return 0; e->count = 1; e->tgid = tgid; hash_add(cnt, &e->node, tgid); return 0; } static struct kprobe kp = { .symbol_name = "__do_sys_getuid", .pre_handler = handler_pre, }; static int __init my_module_init(void) { hash_init(cnt); return register_kprobe(&kp); } static void __exit my_module_exit(void) { struct entry *e; int bkt; unregister_kprobe(&kp); hash_for_each(cnt, bkt, e, node) { pr_info("%u:%u\n", e->tgid, e->count); kfree(e); } } BPF Kernel module (kprobe)

90 Not effective static struct kprobe kp = { .symbol_name
= "__do_sys_getuid", .pre_handler = handler_pre, }; static int __init my_module_init(void) { hash_init(cnt); return register_kprobe(&kp); } static void __exit my_module_exit(void) { struct entry *e; int bkt; unregister_kprobe(&kp); hash_for_each(cnt, bkt, e, node) { pr_info("%u:%u\n", e->tgid, e->count); kfree(e); } } iteration, cleanup DEFINE_HASHTABLE(cnt, 10); struct entry { u32 tgid; u32 count; struct hlist_node node; }; static int handler_pre(struct kprobe *p, struct pt_regs *regs) { u32 tgid = current->tgid; struct entry *e; hash_for_each_possible(cnt, e, node, tgid) { if (e->tgid == tgid) { e->count++; return 0; } } e = kmalloc(sizeof(*e), GFP_ATOMIC); if (!e) return 0; e->count = 1; e->tgid = tgid; hash_add(cnt, &e->node, tgid); return 0; } allocation hash table,entries lookup, update

91 Memory allocation, lifetime, locking, cleanup, etc… More code, more
responsibility!

92 Kernel Safety static int handler_pre(...) { ... while (1)
{ cpu_relax(); } ... return 0; } May not terminate Invalid memory access Corrupts kernel state static int handler_pre(...) { ... u32 *bad = (u32 *)0xdeadbeef; *bad = 1; ... return 0; } static int handler_pre(...) { ... spin_lock(&mylock); /* forgot spin_unlock() */ ... return 0; }

Application Load *(u64 *0)=1; ... 93 BPF verifies this behavior
struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } User Mode Kernel Mode ftrace tracepoint kprobe Attach Infinite Loops Load while (1) ... Invalid Mem Access Resource misuse Load ... Normal BPF program Load BPF Verifier Guaranteed Termination Safe kernel interaction Memory Safety … BPF … Does the program safe to execute?

94 So BPF Tracing Provides Programmability - write custom observation
logic Flexibility - attach to different tracing event source Safety - run your BPF code safely inside the kernel

95 BPF in Action counting getuid() with BPF Tracing

96 BPF in Action 1. Compile into BPF bytecode User
written BPF program ↓ Clang / LLC struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } # clang -O2 -g -target bpf -c getuid.bpf.c -o getuid.bpf.o BPF

97 BPF in Action 1. Compile into BPF bytecode User
written BPF program ↓ Clang / LLC struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } # clang -O2 -g -target bpf -c getuid.bpf.c -o getuid.bpf.o # llvm-objdump -d -r --arch=bpf getuid.bpf.o getuid.bpf.o: file format elf64-bpf Disassembly of section kprobe/__do_sys_getuid: 0000000000000000 <count_getuid>: 0: b7 01 00 00 00 00 00 00 r1 = 0x0 1: 63 1a fc ff 00 00 00 00 *(u32 *)(r10 - 0x4) = r1 2: bf a2 00 00 00 00 00 00 r2 = r10 3: 07 02 00 00 fc ff ff ff r2 += -0x4 4: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll 0000000000000020: R_BPF_64_64 cnt 6: 85 00 00 00 01 00 00 00 call 0x1 7: 15 00 03 00 00 00 00 00 if r0 == 0x0 goto +0x3 <LBB0_2> 8: 79 01 00 00 00 00 00 00 r1 = *(u64 *)(r0 + 0x0) 9: 07 01 00 00 01 00 00 00 r1 += 0x1 10: 7b 10 00 00 00 00 00 00 *(u64 *)(r0 + 0x0) = r1 11: b7 00 00 00 00 00 00 00 r0 = 0x0 12: 95 00 00 00 00 00 00 00 exit BPF

98 BPF in Action struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768);
__type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } 2. Load & Attach to kernel # clang -O2 -g -target bpf -c getuid.bpf.c -o getuid.bpf.o # bpftool prog loadall getuid.bpf.o /sys/fs/bpf/kprobe autoattach BPF Veriﬁer Guaranteed Termination Kernel State Protection Memory Safety … Load BPF program Attach tracepoint ftrace kprobe User Mode Kernel Mode BPF

CPU raise #BP Execute INT3 kprobe framework calls pre_handler() Single-Step
Instruction Resume Execution calls post_handler() BPF in Action struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768); __type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } 3. BPF Handler executed # clang -O2 -g -target bpf -c getuid.bpf.c -o getuid.bpf.o # bpftool prog loadall getuid.bpf.o /sys/fs/bpf/kprobe autoattach BPF BPF Veriﬁer Guaranteed Termination Kernel State Protection Memory Safety … User Mode Kernel Mode Attach tracepoint ftrace kprobe Load BPF program BPF 99

100 BPF in Action struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, 32768);
__type(key, __u32); // process ID (TGID) __type(value, __u32); // COUNTER } cnt SEC(".maps"); SEC("kprobe/__do_sys_getuid") int BPF_KPROBE(count_getuid) { __u32 key = bpf_get_current_pid_tgid() >> 32; __u32 *val = bpf_map_lookup_elem(&cnt, &key); __u32 one = 1; if (val) (*val)++; else bpf_map_update_elem(&cnt, &key, &one, BPF_ANY); return 0; } 4. Check the execution result # clang -O2 -g -target bpf -c getuid.bpf.c -o getuid.bpf.o # bpftool prog loadall getuid.bpf.o /sys/fs/bpf/kprobe autoattach # bpftool -j map dump name cnt | jq -c '.[].formatted' | sort {"key":1338,"value":12} {"key":1346,"value":90} {"key":1348,"value":107} {"key":1659,"value":4} {"key":1791,"value":197} {"key":1892,"value":561} {"key":2003,"value":14} {"key":2076,"value":51} {"key":2173,"value":159} 1346 90 1348 107 1659 4 1791 197 1829 561 2003 14 Key(PID) Value(Count)

101 BPF does not replace event sources. It reuses existing
tracing infrastructure: kprobe, ftrace, tracepoints, or perf_event.

Berkeley Packet Filter (3.18) 102 Linux Kernel Tracing Internals Dynamic
Instrumentation Static Instrumentation Sampling Exception-based kprobe (2.6.9) Code-patching ftrace (2.6.27) Statically-deﬁned tracepoints (2.6.32) Counter-based perf_event (2.6.31) The key is asking the right question, and choosing the right event source and consumer.

Q&A 103

Linux Kernel Tracing Internals

Linux Kernel Tracing Internals

More Decks by Hoyeon Lee

Other Decks in Technology

Featured

Transcript