StackRox Community Office Hours (E2): eBPF 101 — Implementing Security & Monitoring Kubernetes

eBPF is the behind-the-scenes subsystem of the Linux kernel that enables new and simpler methods of profiling, networking, and security for Kubernetes without compromising speed and safety.

Red Hat Livestreaming

August 19, 2021

Transcript

  2. Berkeley Packet Filter (BPF)

    • Virtual machine (VM) model for packet filtering, introduced in 1993

    setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));

    rc@robby-dev:~$ sudo tcpdump -d "tcp dst port 22"
    (000) ldh      [12]
    (001) jeq      #0x86dd          jt 2    jf 6
    (002) ldb      [20]
    (003) jeq      #0x6             jt 4    jf 15
    (004) ldh      [56]
    (005) jeq      #0x16            jt 14   jf 15
    (006) jeq      #0x800           jt 7    jf 15
    (007) ldb      [23]
    (008) jeq      #0x6             jt 9    jf 15
    (009) ldh      [20]
    (010) jset     #0x1fff          jt 15   jf 11
    (011) ldxb     4*([14]&0xf)
    (012) ldh      [x + 16]
    (013) jeq      #0x16            jt 14   jf 15
    (014) ret      #262144
    (015) ret      #0
  3. extended Berkeley Packet Filter (eBPF)

    • Redesigned VM model
    • Instructions map more closely to native machine instructions
    • Workflow:
        ◦ Write an eBPF program and compile it into bytecode
        ◦ Load it with the bpf() system call and attach it to a kernel kprobe or tracepoint
        ◦ Kernel uses a JIT compiler to convert it to machine code
    • Pros:
        ◦ No kernel module required
        ◦ You can construct much more sophisticated analysis within the kernel
    • Cons:
        ◦ Not supported on very old kernels, apart from some backports (RHEL)

    Kernel Version   Feature
    3.18             bpf() system call added
    4.1              Attach to kprobes
    4.7              Attach to tracepoints
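
The version table can be turned into a rough capability check. This is only a sketch: the enum and function names are invented here, and because of the backports mentioned under "Cons", a version comparison is a heuristic, not proof that a feature is present or absent.

```c
#include <stdio.h>
#include <sys/utsname.h>

enum ebpf_feature { EBPF_SYSCALL, EBPF_KPROBE, EBPF_TRACEPOINT };

/* Encode major.minor as one comparable integer. */
static int kver(int major, int minor)
{
    return major * 1000 + minor;
}

/* Minimum upstream kernel for each feature, taken from the table above. */
int min_kernel_for(enum ebpf_feature f)
{
    switch (f) {
    case EBPF_SYSCALL:    return kver(3, 18);
    case EBPF_KPROBE:     return kver(4, 1);
    case EBPF_TRACEPOINT: return kver(4, 7);
    }
    return -1;
}

/* Returns 1 if the release string is new enough, 0 if not, -1 if unparsable. */
int release_supports(const char *release, enum ebpf_feature f)
{
    int major, minor;

    if (sscanf(release, "%d.%d", &major, &minor) != 2)
        return -1;
    return kver(major, minor) >= min_kernel_for(f);
}

/* Same check against the running kernel, as reported by uname(2). */
int running_kernel_supports(enum ebpf_feature f)
{
    struct utsname u;

    if (uname(&u) != 0)
        return -1;
    return release_supports(u.release, f);
}
```

A more reliable probe is to attempt the operation itself and inspect the error, since a vendor kernel may carry the feature under an older version number.
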
  4. eBPF Verifier

    • Ensures eBPF code terminates; loops are rejected (kernels from 5.3 on accept loops the verifier can prove bounded)
    • Simulates execution, with pruning, to check that register and stack state are valid
    • Pointer arithmetic can be enabled under certain conditions
    • Restricts which kernel functions can be called
    • Checks for reads from uninitialized variables
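
To illustrate two of these checks, here is a sketch of three tiny programs written as raw `struct bpf_insn` arrays from `<linux/bpf.h>`. The array names are made up, nothing is loaded here, and the accept or reject outcome described in the comments is what the verifier is expected to report at `BPF_PROG_LOAD` time.

```c
#include <linux/bpf.h>

/* Expected to pass: R0 (the return value) is written before exit. */
const struct bpf_insn prog_ok[] = {
    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 }, /* r0 = 0 */
    { .code = BPF_JMP | BPF_EXIT },                                          /* exit   */
};

/* Expected to fail: exit reads R0, which was never initialized. */
const struct bpf_insn prog_uninit[] = {
    { .code = BPF_JMP | BPF_EXIT },                                          /* exit   */
};

/* Expected to fail: the jump targets itself, so the program never terminates. */
const struct bpf_insn prog_loop[] = {
    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 }, /* r0 = 0   */
    { .code = BPF_JMP | BPF_JA, .off = -1 },                                 /* goto pc  */
    { .code = BPF_JMP | BPF_EXIT },                                          /* exit     */
};
```

Jump offsets are relative to the instruction after the jump, which is why `off = -1` points a jump back at itself.
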
  5. How do I write an eBPF program?

    • The bpf system call only accepts eBPF bytecode
    • Options
        ◦ Write in eBPF assembly by hand, use kernel tool bpf_asm to generate bytecode
        ◦ Write in C using <bpf/bpf.h> provided by kernel for eBPF data structures and functions
  6. eBPF System Call

    int bpf(int cmd, union bpf_attr *attr, unsigned int size);

    • Commands
        ◦ BPF_PROG_LOAD
        ◦ BPF_MAP_CREATE, BPF_MAP_LOOKUP_ELEM, BPF_MAP_UPDATE_ELEM
    • Program Types
        ◦ BPF_PROG_TYPE_SOCKET_FILTER
        ◦ BPF_PROG_TYPE_KPROBE
        ◦ BPF_PROG_TYPE_TRACEPOINT
        ◦ BPF_PROG_TYPE_XDP
        ◦ BPF_PROG_TYPE_PERF_EVENT
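
A minimal sketch of the map commands, assuming a Linux system with kernel headers installed. glibc ships no `bpf()` wrapper, so the call goes through `syscall(2)`. The helper names (`sys_bpf`, `init_array_map_attr`, and so on) are invented for this example, and on many systems the calls need root or `CAP_BPF`.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper matching the prototype on the slide. */
static long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Fill attr for BPF_MAP_CREATE: an array of max_entries 64-bit values. */
void init_array_map_attr(union bpf_attr *attr, uint32_t max_entries)
{
    memset(attr, 0, sizeof(*attr));          /* unused fields must be zero */
    attr->map_type = BPF_MAP_TYPE_ARRAY;
    attr->key_size = sizeof(uint32_t);
    attr->value_size = sizeof(uint64_t);
    attr->max_entries = max_entries;
}

/* Returns a map file descriptor, or -1 with errno set. */
int create_array_map(uint32_t max_entries)
{
    union bpf_attr attr;

    init_array_map_attr(&attr, max_entries);
    return (int)sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}

/* BPF_MAP_UPDATE_ELEM: store value at key. Returns 0 or -1. */
int map_update(int map_fd, uint32_t key, uint64_t value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key = (uint64_t)(uintptr_t)&key;    /* pointers travel as 64-bit integers */
    attr.value = (uint64_t)(uintptr_t)&value;
    attr.flags = BPF_ANY;
    return (int)sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

/* BPF_MAP_LOOKUP_ELEM: read the value at key into *value. Returns 0 or -1. */
int map_lookup(int map_fd, uint32_t key, uint64_t *value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = map_fd;
    attr.key = (uint64_t)(uintptr_t)&key;
    attr.value = (uint64_t)(uintptr_t)value;
    return (int)sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
```

Every command shares the one `union bpf_attr` argument; which members are read depends on `cmd`, which is why the helpers zero the whole union first.
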
  7. Example eBPF program (in C)

    /* Writes the last PID that called sync to a map at index 0 */
    SEC("kprobe/sys_sync")
    int bpf_prog1(struct pt_regs *ctx)
    {
        u64 pid = bpf_get_current_pid_tgid();
        int idx = 0;

        if (!bpf_current_task_under_cgroup(&cgroup_map, 0))
            return 0;

        bpf_map_update_elem(&perf_map, &idx, &pid, BPF_ANY);
        return 0;
    }
  8. eBPF Frameworks

    • BPF Compiler Collection (BCC)
        ◦ Toolkit and wrappers for writing eBPF scripts
        ◦ Still need to write eBPF in C
    • bpftrace
        ◦ Scripting language built on top of BCC
        ◦ Makes it easy to write one-liners

    # Read bytes by process:
    bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret/ { @[comm] = sum(args->ret); }'

    # Read size distribution by process:
    bpftrace -e 'tracepoint:syscalls:sys_exit_read { @[comm] = hist(args->ret); }'

    # Count LLC cache misses by process name and PID (uses PMCs):
    bpftrace -e 'hardware:cache-misses:1000000 { @[comm, pid] = count(); }'

    # Profile user-level stacks at 99 Hertz, for PID 189:
    bpftrace -e 'profile:hz:99 /pid == 189/ { @[ustack] = count(); }'
  9. Resources

    eBPF Development
    • https://github.com/libbpf/libbpf
    • https://github.com/iovisor/kubectl-trace
    • https://github.com/iovisor/bcc
    • https://github.com/iovisor/gobpf
    • https://nakryiko.com/posts/bpf-portability-and-co-re/

    Guides and Documentation
    • https://ebpf.io/
    • https://www.stackrox.io/blog/what-is-ebpf/
    • https://facebookmicrosites.github.io/bpf/
    • https://docs.cilium.io/en/stable/bpf/
    • http://www.brendangregg.com/index.html
    • https://www.kernel.org/doc/Documentation/kprobes.txt
    • https://www.kernel.org/doc/Documentation/trace/tracepoints.txt
    • https://suchakra.wordpress.com/2015/05/18/bpf-internals-i/
    • https://jvns.ca/blog/2017/07/05/linux-tracing-systems/

    Tools and Platforms using eBPF
    • [security] Red Hat Advanced Cluster Security for Kubernetes
        ◦ https://cloud.redhat.com/products/kubernetes-security
        ◦ stackrox.io
    • [security] https://falco.org/ and https://github.com/falcosecurity/libs
    • [security] https://github.com/aquasecurity/tracee
    • [networking] https://cilium.io/
    • [visibility] https://osquery.io/
    • [debugging] https://github.com/iovisor/kubectl-trace
    • [access control] Kernel Runtime Security Instrumentation - https://lwn.net/Articles/808048/
  10. System Calls

    • Linux system calls are the base API for infrastructure
    • Process control, file manipulation, device manipulation, information maintenance, communication and networking
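
As a small illustration, most C library functions in these categories are thin wrappers, and the same kernel entry point can be reached through the generic `syscall(2)` function. The helper name `raw_getpid` is made up for this sketch.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke the getpid system call by number, bypassing the libc wrapper. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

The value returned should match what `getpid()` reports, since both end up in the same kernel handler. It is this boundary that the tracing mechanisms on the following slides observe.
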
  11. Ptrace

    • The Linux ptrace API allows a process to access low-level information about another process
        ◦ Read and write the attached process memory
        ◦ Debugger breakpoints
        ◦ Read and write the attached process CPU registers
        ◦ Be notified of system events
        ◦ Recognize the exec syscall, clone, exit, etc.
        ◦ Control its execution
        ◦ CPU single-stepping
        ◦ Alter signal handling
    • Ptrace is slow: every event costs context switches

    long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data)
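
The cost comes from the stop-and-resume cycle: the tracee halts at every syscall entry and exit, and the tracer must be scheduled to resume it. A minimal sketch, assuming Linux with glibc. `count_syscall_stops` is a hypothetical helper, and for brevity it discards any signals sent to the tracee, which a real tracer would forward.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run cmd under ptrace and count syscall stops (entry and exit are
 * separate stops). Returns the count, or -1 on error. */
long count_syscall_stops(char *const cmd[])
{
    int status;
    long stops = 0;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(126);
        raise(SIGSTOP);              /* pause until the parent is ready */
        execvp(cmd[0], cmd);
        _exit(127);                  /* exec failed */
    }

    /* Wait for the initial SIGSTOP, then ask for distinguishable syscall stops. */
    if (waitpid(child, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;
    ptrace(PTRACE_SETOPTIONS, child, NULL, (void *)(long)PTRACE_O_TRACESYSGOOD);

    for (;;) {
        /* Resume until the next syscall entry or exit: one round trip each. */
        if (ptrace(PTRACE_SYSCALL, child, NULL, NULL) < 0)
            return -1;
        if (waitpid(child, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80))
            stops++;
    }
    return stops;
}
```

Each counted stop is at least two context switches (tracee to tracer and back), which is the overhead the slide refers to.
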
  12. Ptrace

    • strace is a user-mode program which uses ptrace()
    • Attach strace to the running process you wish to monitor
    • Every system call in the kernel adds an extra context switch

    long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data)

    rc@robby-dev:~$ strace ls /tmp
    execve("/bin/ls", ["ls", "/tmp"], [/* 77 vars */]) = 0
    brk(NULL) = 0x887000
    access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
    open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
    fstat(3, {st_mode=S_IFREG|0644, st_size=56320, ...}) = 0
    mmap(NULL, 56320, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f3504e99000
    close(3) = 0
    access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
    ...
  13. AuditD

    • Auditing subsystem in the kernel
    • Enabled at boot time or at runtime
    • Kernel outputs audit logs
    • Userland can configure audit rules for alerting and monitoring
    • All communication happens over a netlink socket (AF_NETLINK)
    • Netlink sockets allow for kernel <-> userland communication
    • Common for legacy security tools to use AuditD
    • Pros
        ◦ Built-in
    • Cons
        ◦ Not well maintained, performance hit, unwieldy log output, container support is lacking
  14. AuditD

    • Suspicious activity
    • Privilege escalation
    • Unauthorized file access

    Example AuditD rules:

    -a always,exit -F arch=b32 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EACCES -F auid>=500 -F auid!=4294967295 -k file_access
    -a always,exit -F arch=b32 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EPERM -F auid>=500 -F auid!=4294967295 -k file_access
    -a always,exit -F arch=b64 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EACCES -F auid>=500 -F auid!=4294967295 -k file_access
    -a always,exit -F arch=b64 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EPERM -F auid>=500 -F auid!=4294967295 -k file_access
  15. DebugFS

    • Special kernel file system that makes kernel information available to userspace
    • Unlike procfs, it can expose any kind of information, not just process information
    • Used by reading and writing files to extract data or configure options
    • Usually mounted at /sys/kernel/debug
  16. Linux Perf Subsystem

    • Hardware Events (CPU counters)
    • Software Events (Kernel counters)
    • Kernel Tracepoint Events
    • User Statically-Defined Tracing (USDT)
    • Dynamic Tracing
    • Timed Profiling
  17. Linux Perf Subsystem

    • The perf_event_open system call can be used directly from userspace
    • Command-line interaction is also possible with perf

    int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu, int group_fd, unsigned long flags)

    # Count system calls by type for the specified PID, until Ctrl-C:
    perf stat -e 'syscalls:sys_enter_*' -p PID

    # Sample on-CPU user instructions, for 5 seconds:
    perf record -e cycles:u -a -- sleep 5

    # Count syscalls per-second system-wide:
    perf stat -e raw_syscalls:sys_enter -I 1000 -a
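
A hedged sketch of the direct route. glibc has no wrapper for this system call, so the example defines one on top of `syscall(2)`. The helper names `init_task_clock_attr` and `open_task_clock_counter` are assumptions for illustration, and whether the open succeeds depends on the system's `perf_event_paranoid` setting.

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Local wrapper with the prototype shown on the slide. */
long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                     int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Describe a software counter: CPU time consumed by the task. */
void init_task_clock_attr(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_SOFTWARE;
    attr->config = PERF_COUNT_SW_TASK_CLOCK;
    attr->disabled = 1;          /* start stopped; enable explicitly later */
    attr->exclude_kernel = 1;    /* count user-space time only */
}

/* Open the counter for the calling process on any CPU.
 * Returns a file descriptor, or -1 with errno set. */
int open_task_clock_counter(void)
{
    struct perf_event_attr attr;

    init_task_clock_attr(&attr);
    return (int)perf_event_open(&attr, 0, -1, -1, 0);
}
```

With a valid descriptor, `ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)` starts counting and a `read()` of eight bytes returns the current value.
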
  18. Tracepoints

    • Provide a hook to call a function (probe) that you can provide at runtime
    • Statically pre-compiled events in the kernel
        ◦ DEFINE_TRACE(subsys_eventname)
        ◦ cat /sys/kernel/debug/tracing/available_events
    • 1000+ event types
        ◦ Syscalls, networking, block, etc.
    • Each tracepoint has an output format defined
  19. Tracepoints

    Tracepoints provide a format when configured:

    rc@robby-dev:~$ sudo cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_execve/format
    name: sys_enter_execve
    ID: 680
    format:
        field:unsigned short common_type; offset:0; size:2; signed:0;
        field:unsigned char common_flags; offset:2; size:1; signed:0;
        field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
        field:int common_pid; offset:4; size:4; signed:1;
        field:int __syscall_nr; offset:8; size:4; signed:1;
        field:const char * filename; offset:16; size:8; signed:0;
        field:const char *const * argv; offset:24; size:8; signed:0;
        field:const char *const * envp; offset:32; size:8; signed:0;
    print fmt: "filename: 0x%08lx, argv: 0x%08lx, envp: 0x%08lx", ((unsigned long)(REC->filename)), ((unsigned long)(REC->argv)), ((unsigned long)(REC->envp))
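
Tools that consume raw tracepoint records use these `field:` lines to locate each value inside the record. A minimal parsing sketch; `struct tp_field` and `parse_tp_field` are names invented for this example, and it handles only the line shape shown above.

```c
#include <stdio.h>

/* One parsed "field:" line from a tracepoint format file. */
struct tp_field {
    char decl[128];   /* C declaration, e.g. "const char * filename" */
    int offset;       /* byte offset within the raw record */
    int size;         /* size in bytes */
    int is_signed;    /* 1 if the field is signed */
};

/* Parse one line; returns 0 on success, -1 if it is not a field line. */
int parse_tp_field(const char *line, struct tp_field *f)
{
    int n = sscanf(line, " field:%127[^;]; offset:%d; size:%d; signed:%d;",
                   f->decl, &f->offset, &f->size, &f->is_signed);

    return n == 4 ? 0 : -1;
}
```

For the `filename` line above this yields offset 16 and size 8, that is, a pointer stored 16 bytes into each `sys_enter_execve` record.
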
  20. Using tracepoints

    • Directly from userspace with debugfs interaction in /sys/kernel/debug/
        ◦ cat /sys/kernel/debug/tracing/trace_pipe
    • Directly from userspace with perf
        ◦ perf record -e syscalls:sys_enter_execve
    • Within a kernel module...
    • Within an eBPF program...
  21. Kprobes

    • Kernel feature to attach an event to almost any location
    • Dynamically modify your kernel at runtime
    • No pre-existing format, barebones
    • Pro
        ◦ Attach to the exact kernel function you want to monitor
    • Con
        ◦ Stability is not guaranteed; internal kernel functions change over time
  22. Uprobes

    • Use the kernel tooling to monitor data about userland
    • Userspace version of kprobes
    • Attach to a memory location; events are triggered when execution reaches that location
    • Overhead similar to ptrace

    echo 'r:zfree_exit /bin/zsh:0x46420 %ip %ax' >> /sys/kernel/debug/tracing/uprobe_events
    cat /sys/kernel/debug/tracing/trace
    zsh-24842 [006] 258544.995456: zfree_entry: (0x446420) arg1=446420 arg2=79
    zsh-24842 [007] 258545.000270: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0
    zsh-24842 [002] 258545.043929: zfree_entry: (0x446420) arg1=446420 arg2=79
    zsh-24842 [004] 258547.046129: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0
  23. Kernel Modules

    • Compiled code that can be loaded into or unloaded from the kernel on demand
    • Often used to extend hardware/filesystem/networking/graphics support
    • Take advantage of any or all of the existing kernel data collection systems
    • Roll your own system call tracing…
        ◦ Written in C
        ◦ Mark kernel page RW
        ◦ Find the system call table
        ◦ Overwrite system calls with a shim function
    • ...Or utilize the perf subsystem, tracepoints, etc.
    • Example: tracepoints in a kernel module
        ◦ <trace/syscall.h>
        ◦ tracepoint_probe_register()