eBPF is a behind-the-scenes subsystem of the Linux kernel that enables new and simpler approaches to profiling, networking, and security for Kubernetes without sacrificing speed or safety.
• Instructions more closely match machine instructions
• Workflow (a minimal loader sketch follows the table below):
  • Write an eBPF program and compile it into bytecode
  • Attach it to a kernel kprobe or tracepoint using the bpf() system call
  • Kernel uses a JIT compiler to convert the bytecode to machine code
• Pros:
  • No kernel module required
  • You can construct much more sophisticated analysis within the kernel
• Cons:
  • No support on very old kernels; some backports (RHEL)

Kernel version   Feature
3.18             bpf() system call added
4.1              Attach to kprobes
4.7              Attach to tracepoints
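To make the load step of the workflow concrete, here is a minimal, hedged sketch of handing eBPF bytecode to the kernel through the raw bpf(2) system call. The two hard-coded instructions ("r0 = 0; exit") and the socket-filter program type are illustrative choices, not anything from the talk; real programs are compiled from C with clang/LLVM and usually loaded through a library such as libbpf, and loading typically requires root.

#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* Trivial bytecode: "r0 = 0; exit" -- about the smallest program the verifier accepts. */
    struct bpf_insn prog[] = {
        { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
        { .code = BPF_JMP | BPF_EXIT },
    };
    char log[4096] = "";

    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;      /* simplest program type */
    attr.insns     = (uint64_t)(unsigned long)prog;
    attr.insn_cnt  = sizeof(prog) / sizeof(prog[0]);
    attr.license   = (uint64_t)(unsigned long)"GPL";
    attr.log_buf   = (uint64_t)(unsigned long)log;
    attr.log_size  = sizeof(log);
    attr.log_level = 1;                                /* request verifier output */

    /* There is no glibc wrapper for bpf(2); it is invoked via syscall(2). */
    int fd = syscall(SYS_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
    if (fd < 0) {
        fprintf(stderr, "BPF_PROG_LOAD failed: %s\n%s", strerror(errno), log);
        return 1;
    }
    printf("verifier accepted the program, fd=%d\n", fd);
    return 0;
}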
• Programs cannot contain loops
• Execution simulation, with pruning, to check that register and stack state are valid
• Pointer arithmetic can be enabled under certain conditions
• Restricts which kernel functions can be called
• Checks for reads from uninitialized variables
• The bpf() system call only accepts eBPF bytecode
• Options:
  • Write in eBPF assembly by hand, use kernel tool bpf_asm to generate bytecode
  • Write in C using <bpf/bpf.h>, provided by the kernel for eBPF data structures and functions
/* Write the PID that called sync to a map at index 0 */
SEC("kprobe/sys_sync")
int bpf_prog1(struct pt_regs *ctx)
{
    u64 pid = bpf_get_current_pid_tgid();
    int idx = 0;

    if (!bpf_current_task_under_cgroup(&cgroup_map, 0))
        return 0;

    bpf_map_update_elem(&perf_map, &idx, &pid, BPF_ANY);
    return 0;
}
• BCC (BPF Compiler Collection)
  • Front-end bindings and wrappers for writing eBPF scripts
  • Still need to write the eBPF portion in C
• Bpftrace
  • Scripting language built on top of BCC
  • Makes it easy to write one-liners

# Read bytes by process:
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret/ { @[comm] = sum(args->ret); }'

# Read size distribution by process:
bpftrace -e 'tracepoint:syscalls:sys_exit_read { @[comm] = hist(args->ret); }'

# Count LLC cache misses by process name and PID (uses PMCs):
bpftrace -e 'hardware:cache-misses:1000000 { @[comm, pid] = count(); }'

# Profile user-level stacks at 99 Hertz, for PID 189:
bpftrace -e 'profile:hz:99 /pid == 189/ { @[ustack] = count(); }'
• ptrace allows one process to access low-level information about another process
  • Read and write the attached process's memory
    • Debugger breakpoints
  • Read and write the attached process's CPU registers
  • Be notified of system events
    • Recognize the exec syscall, clone, exit, etc.
  • Control its execution
    • CPU single-stepping
    • Alter signal handling
• ptrace is slow -- context switches for every event

long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
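A minimal, hedged sketch of the API in action on x86_64 (the orig_rax register name is architecture-specific): the child asks to be traced, the parent stops it at every syscall boundary and reads the syscall number from its registers. Each syscall produces two stops (entry and exit), each costing a context switch into the tracer -- the source of ptrace's overhead. The traced command ("ls /tmp") is just an example.

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>     /* struct user_regs_struct */
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);     /* let the parent trace us */
        execlp("ls", "ls", "/tmp", (char *)NULL);  /* example target command */
        return 1;
    }

    int status;
    waitpid(child, &status, 0);                    /* child stops after exec */
    while (!WIFEXITED(status)) {
        /* One context switch into the tracer per syscall stop. */
        struct user_regs_struct regs;
        if (ptrace(PTRACE_GETREGS, child, NULL, &regs) == 0)
            printf("syscall %lld\n", (long long)regs.orig_rax);

        ptrace(PTRACE_SYSCALL, child, NULL, NULL); /* run until the next syscall stop */
        waitpid(child, &status, 0);
    }
    return 0;
}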
• strace is a userspace tool which uses ptrace()
• Attach strace to the running process you wish to monitor
• Every system call in the kernel will add an extra context switch

long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);

rc@robby-dev:~$ strace ls /tmp
execve("/bin/ls", ["ls", "/tmp"], [/* 77 vars */]) = 0
brk(NULL) = 0x887000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=56320, ...}) = 0
mmap(NULL, 56320, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f3504e99000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
...
• Audit rules can be loaded at boot-time or runtime
• Kernel outputs audit logs
• Userland can configure audit rules for alerting and monitoring
• All communication happens over a netlink socket (AF_NETLINK)
  • Netlink sockets allow for kernel <-> userland communication
• Common for legacy security tools to use AuditD
• Pros:
  • Built-in
• Cons:
  • Not well maintained; performance hit; unwieldy log output; container support is lacking
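A hedged sketch of that kernel <-> userland netlink channel, the same one auditd and auditctl sit on: send an AUDIT_GET request and print the audit status the kernel returns. Assumes CAP_AUDIT_CONTROL/root; error handling and reply validation are kept minimal for brevity.

#include <linux/audit.h>
#include <linux/netlink.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_AUDIT);
    if (fd < 0) { perror("socket"); return 1; }

    /* AUDIT_GET carries no payload; the kernel replies with struct audit_status. */
    struct nlmsghdr req;
    memset(&req, 0, sizeof(req));
    req.nlmsg_len = NLMSG_LENGTH(0);
    req.nlmsg_type = AUDIT_GET;
    req.nlmsg_flags = NLM_F_REQUEST;

    struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };  /* nl_pid 0 = the kernel */
    if (sendto(fd, &req, req.nlmsg_len, 0,
               (struct sockaddr *)&kernel, sizeof(kernel)) < 0) {
        perror("sendto");
        return 1;
    }

    char buf[8192];
    ssize_t len = recv(fd, buf, sizeof(buf), 0);
    if (len > 0) {
        struct nlmsghdr *rep = (struct nlmsghdr *)buf;
        if (rep->nlmsg_type == AUDIT_GET) {
            struct audit_status *st = NLMSG_DATA(rep);
            printf("audit enabled=%u daemon pid=%u lost=%u\n",
                   st->enabled, st->pid, st->lost);
        }
        /* An NLMSG_ERROR reply here usually means missing CAP_AUDIT_CONTROL. */
    }
    close(fd);
    return 0;
}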
• debugfs allows kernel information to be made available to userspace
• Unlike procfs, any information, not just process information
• Used by reading and writing files to extract data or configure options

/sys/kernel/debug
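Because debugfs is consumed with ordinary file I/O, a reader is just open()/read(). A small hedged sketch: stream a debugfs file to stdout (the default path, the ftrace trace_pipe, is an example; debugfs must be mounted at /sys/kernel/debug and this usually needs root).

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/sys/kernel/debug/tracing/trace_pipe";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); return 1; }

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)   /* trace_pipe blocks until data arrives */
        write(STDOUT_FILENO, buf, (size_t)n);

    close(fd);
    return 0;
}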
• The perf_event_open() system call can be used directly from userspace
• Command-line interaction is also possible with perf

int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu, int group_fd, unsigned long flags);

# Count system calls by type for the specified PID, until Ctrl-C:
perf stat -e 'syscalls:sys_enter_*' -p PID

# Sample on-CPU user instructions, for 5 seconds:
perf record -e cycles:u -a -- sleep 5

# Count syscalls per-second system-wide:
perf stat -e raw_syscalls:sys_enter -I 1000 -a
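A hedged sketch of calling it directly, following the pattern in the perf_event_open(2) man page: count user-space instructions retired by the calling process over a small busy loop, then read the counter. The busy-loop workload is only an illustration.

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    /* There is no glibc wrapper; the call goes through syscall(2). */
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;          /* user-space instructions only */
    attr.exclude_hv = 1;

    /* pid=0, cpu=-1: measure the calling process on any CPU */
    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    for (volatile long i = 0; i < 1000000; i++)   /* some work to measure */
        ;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("instructions retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}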
• A tracepoint placed in kernel code provides a hook to call a function (probe) that you can provide at runtime
• Statically pre-compiled events in the kernel
  • DEFINE_TRACE(subsys_eventname)
• cat /sys/kernel/debug/tracing/available_events
  • 1000+ event types
  • Syscalls, networking, block, etc.
• Each tracepoint has an output format defined
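For orientation, a hedged sketch of how such a static tracepoint is declared and placed, following the pattern in the kernel's Documentation/trace/tracepoints.rst; "subsys_eventname" and the arguments are the documentation's placeholders, and the exact macro arguments (notably DEFINE_TRACE) vary between kernel versions.

/* include/trace/subsys.h -- declares the tracepoint */
#include <linux/tracepoint.h>

DECLARE_TRACE(subsys_eventname,
    TP_PROTO(int firstarg, struct task_struct *p),
    TP_ARGS(firstarg, p));

/* subsys/file.c -- where the event fires */
#include <trace/subsys.h>

DEFINE_TRACE(subsys_eventname);

void somefct(void)
{
    /* ... */
    trace_subsys_eventname(arg, task);   /* calls any probes registered at runtime */
    /* ... */
}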
• From the tracing files in /sys/kernel/debug/
  • cat /sys/kernel/debug/tracing/trace_pipe
• Directly from userspace with perf
  • perf record -e syscalls:sys_enter_execve
• Within a kernel module...
• Within an eBPF program... (see the sketch below)
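A hedged sketch of the "within an eBPF program" option: count execve() calls from the syscalls:sys_enter_execve tracepoint. The map name, function name, and libbpf conventions (<bpf/bpf_helpers.h>, SEC(), BTF-style map definition) are my assumptions; compile with clang -O2 -target bpf and load with a libbpf-based loader.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} exec_count SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int count_execve(void *ctx)
{
    __u32 key = 0;
    __u64 *val = bpf_map_lookup_elem(&exec_count, &key);

    if (val)
        __sync_fetch_and_add(val, 1);   /* atomic bump of the counter */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";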
• Kprobes can attach a probe to (almost) any location in the kernel
• Dynamically modify your kernel at runtime
• No pre-existing format; barebones
• Pro
  • Attach to the exact kernel function you want to monitor (see the module sketch below)
• Con
  • Stability is not guaranteed; internal kernel functions change over time
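A hedged sketch of a kprobe registered from a kernel module, modeled on the kernel's samples/kprobes/kprobe_example.c; the probed symbol ("do_sys_open") is only an example and, as the slide warns, such internal functions change between kernel versions.

#include <linux/kernel.h>
#include <linux/kprobes.h>
#include <linux/module.h>
#include <linux/sched.h>

static int pre_handler(struct kprobe *p, struct pt_regs *regs)
{
    /* Runs just before the probed instruction executes. */
    pr_info("kprobe hit: %s (pid %d)\n", p->symbol_name, current->pid);
    return 0;
}

static struct kprobe kp = {
    .symbol_name = "do_sys_open",   /* example target, kernel-version dependent */
    .pre_handler = pre_handler,
};

static int __init kp_init(void)
{
    return register_kprobe(&kp);
}

static void __exit kp_exit(void)
{
    unregister_kprobe(&kp);
}

module_init(kp_init);
module_exit(kp_exit);
MODULE_LICENSE("GPL");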
• Kernel modules can be loaded or unloaded from the kernel on demand
• Often used to extend hardware/filesystem/networking/graphics support
• Can take advantage of any or all of the existing kernel data collection systems
• Roll your own system call tracing...
  • Written in C
  • Mark the kernel page RW
  • Find the system call table
  • Overwrite system calls with a shim function
• ...Or utilize the perf subsystem, tracepoints, etc.
  • Tracepoints, for example, from a kernel module (sketch below):
    • <trace/syscall.h>
    • tracepoint_probe_register()
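A hedged sketch of the tracepoint route from a module, using the tracepoint_probe_register() call named above: look up the raw_syscalls "sys_enter" tracepoint and attach a probe. The probe's arguments must match that tracepoint's TP_PROTO (struct pt_regs * and the syscall id on the kernels I have checked); verify against your kernel headers before relying on this.

#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/ptrace.h>
#include <linux/string.h>
#include <linux/tracepoint.h>

static struct tracepoint *tp_sys_enter;

/* Fires on every syscall entry system-wide once registered -- noisy. */
static void probe_sys_enter(void *data, struct pt_regs *regs, long id)
{
    trace_printk("syscall entered: %ld\n", id);
}

/* for_each_kernel_tracepoint() walks every compiled-in tracepoint;
 * keep a pointer to the one named "sys_enter". */
static void find_tp(struct tracepoint *tp, void *priv)
{
    if (!strcmp(tp->name, "sys_enter"))
        tp_sys_enter = tp;
}

static int __init tp_init(void)
{
    for_each_kernel_tracepoint(find_tp, NULL);
    if (!tp_sys_enter)
        return -ENOENT;
    return tracepoint_probe_register(tp_sys_enter, probe_sys_enter, NULL);
}

static void __exit tp_exit(void)
{
    if (tp_sys_enter)
        tracepoint_probe_unregister(tp_sys_enter, probe_sys_enter, NULL);
    tracepoint_synchronize_unregister();
}

module_init(tp_init);
module_exit(tp_exit);
MODULE_LICENSE("GPL");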