Slide 1

Slide 1 text

Office Hours: eBPF 101
August 19, 2021
Robby Cochran, Senior Software Engineer @ Red Hat

Slide 2

Slide 2 text


Slide 3

Slide 3 text

System Calls API

Slide 4

Slide 4 text

Image source: https://ebpf.io/what-is-ebpf/

Slide 5

Slide 5 text

DEMO - kubectl-trace
https://github.com/iovisor/kubectl-trace
©2019 StackRox. All rights reserved.

Slide 6

Slide 6 text

Berkeley Packet Filter (BPF)
Virtual machine (VM) model for packet filtering, introduced in 1993

    setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));

    rc@robby-dev:~$ sudo tcpdump -d "tcp dst port 22"
    (000) ldh      [12]
    (001) jeq      #0x86dd          jt 2    jf 6
    (002) ldb      [20]
    (003) jeq      #0x6             jt 4    jf 15
    (004) ldh      [56]
    (005) jeq      #0x16            jt 14   jf 15
    (006) jeq      #0x800           jt 7    jf 15
    (007) ldb      [23]
    (008) jeq      #0x6             jt 9    jf 15
    (009) ldh      [20]
    (010) jset     #0x1fff          jt 15   jf 11
    (011) ldxb     4*([14]&0xf)
    (012) ldh      [x + 16]
    (013) jeq      #0x16            jt 14   jf 15
    (014) ret      #262144
    (015) ret      #0

Slide 7

Slide 7 text

extended Berkeley Packet Filter (eBPF)
• Redesigned VM model
• Instructions more closely match the machine's native instructions
• Workflow:
    • Write an eBPF program and compile it into bytecode
    • Load it with the bpf() system call and attach it to a kernel kprobe or tracepoint
    • Kernel uses a JIT compiler to convert the bytecode to machine code
• Pros:
    • No kernel module required
    • You can construct much more sophisticated analysis within the kernel
• Cons:
    • No support on very old kernels, some backports (RHEL)

Kernel Version    Feature
3.18              bpf() system call added
4.1               Attach to kprobes
4.7               Attach to tracepoints

Slide 8

Slide 8 text

eBPF Verifier
• Ensures eBPF code terminates and doesn't contain unbounded loops (bounded loops are accepted since kernel 5.3)
• Simulates execution, with state pruning, to check that register and stack state are valid
• Pointer arithmetic is allowed only under certain conditions
• Restricts which kernel functions can be called
• Checks for reads from uninitialized variables

Slide 9

Slide 9 text

How do I write an eBPF program?
• The bpf system call only accepts eBPF bytecode
• Options
    • Write in eBPF assembly by hand, use kernel tool bpf_asm to generate bytecode
    • Write in C, using the headers provided by the kernel for eBPF data structures and functions

Slide 10

Slide 10 text

eBPF System Call

    int bpf(int cmd, union bpf_attr *attr, unsigned int size);

• Commands
    • BPF_PROG_LOAD
    • BPF_MAP_CREATE, BPF_MAP_LOOKUP_ELEM, BPF_MAP_UPDATE_ELEM
• Program Types
    • BPF_PROG_TYPE_SOCKET_FILTER
    • BPF_PROG_TYPE_KPROBE
    • BPF_PROG_TYPE_TRACEPOINT
    • BPF_PROG_TYPE_XDP
    • BPF_PROG_TYPE_PERF_EVENT

Slide 11

Slide 11 text

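Example eBPF program (in C)

    /* Writes the last PID that called sync to a map at index 0.
     * cgroup_map and perf_map are BPF maps defined elsewhere in the program. */
    SEC("kprobe/sys_sync")
    int bpf_prog1(struct pt_regs *ctx)
    {
        /* Upper 32 bits hold the tgid, lower 32 bits the thread id */
        u64 pid = bpf_get_current_pid_tgid();
        int idx = 0;

        /* Ignore tasks outside the cgroup stored at index 0 of cgroup_map */
        if (!bpf_current_task_under_cgroup(&cgroup_map, 0))
            return 0;

        bpf_map_update_elem(&perf_map, &idx, &pid, BPF_ANY);
        return 0;
    }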

Slide 12

Slide 12 text

eBPF Frameworks
• BPF Compiler Collection (BCC)
    • Toolkit and wrappers for writing eBPF scripts
    • Still need to write eBPF in C
• bpftrace
    • Scripting language built on top of BCC
    • Makes it easy to write one-liners

    # Read bytes by process:
    bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret/ { @[comm] = sum(args->ret); }'

    # Read size distribution by process:
    bpftrace -e 'tracepoint:syscalls:sys_exit_read { @[comm] = hist(args->ret); }'

    # Count LLC cache misses by process name and PID (uses PMCs):
    bpftrace -e 'hardware:cache-misses:1000000 { @[comm, pid] = count(); }'

    # Profile user-level stacks at 99 Hertz, for PID 189:
    bpftrace -e 'profile:hz:99 /pid == 189/ { @[ustack] = count(); }'

Slide 13

Slide 13 text

Image source: https://ebpf.io/what-is-ebpf/

Slide 14

Slide 14 text

Resources

eBPF Development
● https://github.com/libbpf/libbpf
● https://github.com/iovisor/kubectl-trace
● https://github.com/iovisor/bcc
● https://github.com/iovisor/gobpf
● https://nakryiko.com/posts/bpf-portability-and-co-re/

Guides and Documentation
● https://ebpf.io/
● https://www.stackrox.io/blog/what-is-ebpf/
● https://facebookmicrosites.github.io/bpf/
● https://docs.cilium.io/en/stable/bpf/
● http://www.brendangregg.com/index.html
● https://www.kernel.org/doc/Documentation/kprobes.txt
● https://www.kernel.org/doc/Documentation/trace/tracepoints.txt
● https://suchakra.wordpress.com/2015/05/18/bpf-internals-i/
● https://jvns.ca/blog/2017/07/05/linux-tracing-systems/

Tools and Platforms using eBPF
● [security] Red Hat Advanced Cluster Security for Kubernetes
    ○ https://cloud.redhat.com/products/kubernetes-security
    ○ stackrox.io
● [security] https://falco.org/ and https://github.com/falcosecurity/libs
● [security] https://github.com/aquasecurity/tracee
● [networking] https://cilium.io/
● [visibility] https://osquery.io/
● [debugging] https://github.com/iovisor/kubectl-trace
● [access control] Kernel Runtime Security Instrumentation - https://lwn.net/Articles/808048/

Slide 15

Slide 15 text

System Calls
• Linux system calls are the base API for infrastructure
    • process control, file manipulation, device manipulation, information maintenance, communication and networking

Slide 16

Slide 16 text

Procfs
Kernel information associated with processes, exposed as files under /proc

Slide 17

Slide 17 text

Ptrace
• The Linux ptrace API allows a process to access low-level information about another process
    • read and write the attached process memory
    • debugger breakpoints
    • read and write the attached process CPU registers
    • be notified of system events
    • recognize the exec syscall, clone, exit, etc
    • control its execution
    • CPU single-stepping
    • alter signal handling
• Ptrace is slow: context switches for every event

    long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);

Slide 18

Slide 18 text

Ptrace
• strace is a user-mode program which uses ptrace()
• Attach strace to the running process you wish to monitor
• Every system call the traced process makes adds extra context switches

    long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);

    rc@robby-dev:~$ strace ls /tmp
    execve("/bin/ls", ["ls", "/tmp"], [/* 77 vars */]) = 0
    brk(NULL) = 0x887000
    access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
    open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
    fstat(3, {st_mode=S_IFREG|0644, st_size=56320, ...}) = 0
    mmap(NULL, 56320, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f3504e99000
    close(3) = 0
    access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
    ...

Slide 19

Slide 19 text

AuditD
• Auditing subsystem in the kernel
    • Enabled at boot-time or runtime
    • Kernel outputs audit logs
    • Userland can configure audit rules for alerting and monitoring
• All communication happens over a netlink socket (AF_NETLINK)
    • Netlink sockets allow for kernel <-> userland communication
• Common for legacy security tools to use AuditD
• Pros
    • Built-in
• Cons
    • Not well maintained, performance hit, unwieldy log output, container support is lacking

Slide 20

Slide 20 text

AuditD
• Suspicious activity
• Privilege escalation
• Unauthorized file access

Example AuditD rules

    -a always,exit -F arch=b32 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EACCES -F auid>=500 -F auid!=4294967295 -k file_access
    -a always,exit -F arch=b32 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EPERM -F auid>=500 -F auid!=4294967295 -k file_access
    -a always,exit -F arch=b64 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EACCES -F auid>=500 -F auid!=4294967295 -k file_access
    -a always,exit -F arch=b64 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EPERM -F auid>=500 -F auid!=4294967295 -k file_access

Slide 21

Slide 21 text

DebugFS
• Special kernel file system that makes kernel information available to userspace
• Unlike procfs, it can expose any kernel information, not just process information
• Used by reading and writing files to extract data or configure options
• Typically mounted at /sys/kernel/debug

Slide 22

Slide 22 text

Linux Perf Subsystem
• Hardware Events (CPU counters)
• Software Events (Kernel counters)
• Kernel Tracepoint Events
• User Statically-Defined Tracing (USDT)
• Dynamic Tracing
• Timed Profiling

Slide 23

Slide 23 text

Linux Perf Subsystem
• The perf_event_open system call can be used directly from userspace.
• Command-line interaction is also possible with perf

    int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu, int group_fd, long flags)

    # Count system calls by type for the specified PID, until Ctrl-C:
    perf stat -e 'syscalls:sys_enter_*' -p PID

    # Sample on-CPU user instructions, for 5 seconds:
    perf record -e cycles:u -a -- sleep 5

    # Count syscalls per-second system-wide:
    perf stat -e raw_syscalls:sys_enter -I 1000 -a

Slide 24

Slide 24 text

Tracepoints
• Provide a hook to call a function (probe) that you can provide at runtime.
• Statically pre-compiled events in the kernel
    • DEFINE_TRACE(subsys_eventname)
    • cat /sys/kernel/debug/tracing/available_events
• 1000+ event types
    • Syscalls, networking, block etc
• Each tracepoint has an output format defined.

Slide 25

Slide 25 text

Tracepoints

    rc@robby-dev:~$ sudo cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_execve/format
    name: sys_enter_execve
    ID: 680
    format:
        field:unsigned short common_type; offset:0; size:2; signed:0;
        field:unsigned char common_flags; offset:2; size:1; signed:0;
        field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
        field:int common_pid; offset:4; size:4; signed:1;
        field:int __syscall_nr; offset:8; size:4; signed:1;
        field:const char * filename; offset:16; size:8; signed:0;
        field:const char *const * argv; offset:24; size:8; signed:0;
        field:const char *const * envp; offset:32; size:8; signed:0;
    print fmt: "filename: 0x%08lx, argv: 0x%08lx, envp: 0x%08lx", ((unsigned long)(REC->filename)), ((unsigned long)(REC->argv)), ((unsigned long)(REC->envp))

Tracepoints provide a format when configured

Slide 26

Slide 26 text

Using tracepoints
• Directly from userspace with debugfs interaction in /sys/kernel/debug/
    • cat /sys/kernel/debug/tracing/trace_pipe
• Directly from userspace with perf
    • perf record -e syscalls:sys_enter_execve
• Within a kernel module...
• Within an eBPF program...

Slide 27

Slide 27 text

Kprobes
• Kernel feature for attaching a probe to almost any location in kernel code
• Dynamically modify your kernel at runtime
• No pre-existing format, barebones
• Pro
    • Attach to the exact kernel function you want to monitor
• Con
    • Stability is not guaranteed, internal kernel functions change over time

Slide 28

Slide 28 text

Uprobes
• Use the kernel tooling to monitor data about userland
• Userspace version of kprobes
• Attach to a memory location; events are triggered when execution reaches that location
• Overhead similar to ptrace

    echo 'r:zfree_exit /bin/zsh:0x46420 %ip %ax' >> /sys/kernel/debug/tracing/uprobe_events
    cat /sys/kernel/debug/tracing/trace

    zsh-24842 [006] 258544.995456: zfree_entry: (0x446420) arg1=446420 arg2=79
    zsh-24842 [007] 258545.000270: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0
    zsh-24842 [002] 258545.043929: zfree_entry: (0x446420) arg1=446420 arg2=79
    zsh-24842 [004] 258547.046129: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0

Slide 29

Slide 29 text

Kernel Modules
• Compiled code that can be loaded into or unloaded from the kernel on demand
• Often used to extend hardware/filesystem/networking/graphics support
• Take advantage of any or all of the existing kernel data collection systems
• Roll your own system call tracing...
    • Written in C
    • Mark kernel page RW
    • Find the system call table
    • Overwrite system calls with shim function
• ...Or utilize perf subsystem, tracepoints, etc
    • Tracepoints for example in kernel module
    • tracepoint_probe_register()

Slide 30

Slide 30 text

Questions?