StackRox Community Office Hours (E2): eBPF 101 — Implementing Security & Monitoring Kubernetes

eBPF is the behind-the-scenes subsystem of the Linux kernel that enables new and simpler methods of profiling, networking, and security for Kubernetes without compromising speed and safety.

Red Hat Livestreaming

August 19, 2021

Transcript

  1. Office Hours: eBPF 101

    August 19, 2021
    Robby Cochran, Senior Software Engineer @ Red Hat
  3. System Calls API

  4. Image source: https://ebpf.io/what-is-ebpf/

  5. DEMO - kubectl-trace

    https://github.com/iovisor/kubectl-trace
    ©2019 StackRox. All rights reserved.

  6. Berkeley Packet Filtering (BPF)

    Virtual machine (VM) model for packet filtering, introduced in 1993

      setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));

      rc@robby-dev:~$ sudo tcpdump -d "tcp dst port 22"
      (000) ldh      [12]
      (001) jeq      #0x86dd          jt 2    jf 6
      (002) ldb      [20]
      (003) jeq      #0x6             jt 4    jf 15
      (004) ldh      [56]
      (005) jeq      #0x16            jt 14   jf 15
      (006) jeq      #0x800           jt 7    jf 15
      (007) ldb      [23]
      (008) jeq      #0x6             jt 9    jf 15
      (009) ldh      [20]
      (010) jset     #0x1fff          jt 15   jf 11
      (011) ldxb     4*([14]&0xf)
      (012) ldh      [x + 16]
      (013) jeq      #0x16            jt 14   jf 15
      (014) ret      #262144
      (015) ret      #0
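To make the VM model concrete, here is a minimal user-space sketch that interprets the tcpdump listing above against a hand-built packet. It is not the kernel's interpreter: it handles only the opcodes that appear in the listing, and the names `run_cbpf` and `build_ipv4_tcp_frame` are hypothetical helpers introduced for illustration. The instruction encoding (`struct sock_filter`, `BPF_STMT`, `BPF_JUMP`) comes from `<linux/filter.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/filter.h>   /* struct sock_filter, BPF_STMT, BPF_JUMP, opcodes */

/* "tcp dst port 22", transcribed from the tcpdump -d listing. jt/jf are
 * stored relative to the next instruction, so "jt 2 jf 6" at index 1
 * becomes the offsets 0 and 4. */
static const struct sock_filter tcp_dst_22[] = {
    /* 000 */ BPF_STMT(BPF_LD  | BPF_H | BPF_ABS, 12),
    /* 001 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x86dd, 0, 4),
    /* 002 */ BPF_STMT(BPF_LD  | BPF_B | BPF_ABS, 20),
    /* 003 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x6, 0, 11),
    /* 004 */ BPF_STMT(BPF_LD  | BPF_H | BPF_ABS, 56),
    /* 005 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x16, 8, 9),
    /* 006 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x800, 0, 8),
    /* 007 */ BPF_STMT(BPF_LD  | BPF_B | BPF_ABS, 23),
    /* 008 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x6, 0, 6),
    /* 009 */ BPF_STMT(BPF_LD  | BPF_H | BPF_ABS, 20),
    /* 010 */ BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, 0x1fff, 4, 0),
    /* 011 */ BPF_STMT(BPF_LDX | BPF_B | BPF_MSH, 14),
    /* 012 */ BPF_STMT(BPF_LD  | BPF_H | BPF_IND, 16),
    /* 013 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x16, 0, 1),
    /* 014 */ BPF_STMT(BPF_RET | BPF_K, 262144),
    /* 015 */ BPF_STMT(BPF_RET | BPF_K, 0),
};

/* Run a classic BPF program over a packet. Returns the filter verdict
 * (number of bytes to keep, 0 = drop). Out-of-range loads and opcodes
 * this sketch does not know about yield 0. */
static uint32_t run_cbpf(const struct sock_filter *prog, size_t n,
                         const uint8_t *pkt, size_t len)
{
    uint32_t a = 0, x = 0;   /* accumulator and index register */
    size_t off;

    assert(prog != NULL && pkt != NULL);
    for (size_t pc = 0; pc < n; pc++) {
        const struct sock_filter *in = &prog[pc];
        switch (in->code) {
        case BPF_LD | BPF_H | BPF_ABS:      /* A = 16-bit big-endian load at k */
        case BPF_LD | BPF_H | BPF_IND:      /* ... or at x + k */
            off = (size_t)in->k + (in->code == (BPF_LD | BPF_H | BPF_IND) ? x : 0);
            if (off + 2 > len) return 0;
            a = (uint32_t)pkt[off] << 8 | pkt[off + 1];
            break;
        case BPF_LD | BPF_B | BPF_ABS:      /* A = byte at k */
            if (in->k >= len) return 0;
            a = pkt[in->k];
            break;
        case BPF_LDX | BPF_B | BPF_MSH:     /* X = 4 * (byte at k & 0xf) */
            if (in->k >= len) return 0;
            x = 4 * (pkt[in->k] & 0xf);
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            pc += (a == in->k) ? in->jt : in->jf;
            break;
        case BPF_JMP | BPF_JSET | BPF_K:
            pc += (a & in->k) ? in->jt : in->jf;
            break;
        case BPF_RET | BPF_K:
            return in->k;
        default:
            return 0;
        }
    }
    return 0;
}

/* Fill a 54-byte Ethernet + IPv4 + TCP frame with only the fields the
 * filter inspects; everything else stays zero. */
static void build_ipv4_tcp_frame(uint8_t frame[54], uint16_t dst_port)
{
    memset(frame, 0, 54);
    frame[12] = 0x08;                 /* EtherType 0x0800: IPv4 */
    frame[14] = 0x45;                 /* IP version 4, header length 5 words */
    frame[23] = 6;                    /* IP protocol: TCP */
    frame[36] = (uint8_t)(dst_port >> 8);   /* TCP destination port, big-endian */
    frame[37] = (uint8_t)(dst_port & 0xff);
}
```

Calling `run_cbpf` on a frame built with destination port 22 follows the IPv4 branch of the listing and ends at instruction 014; any other port falls through to instruction 015.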
  7. extended Berkeley Packet Filters (eBPF)

    • Redesigned VM model
    • Instructions more closely match the machine
    • Workflow:
      • Write an eBPF program and compile it into bytecode
      • Load it with the bpf() system call and attach it to a kernel kprobe or tracepoint
      • Kernel uses a JIT compiler to convert the bytecode to machine code
    • Pros:
      • No kernel module required
      • You can construct much more sophisticated analysis within the kernel
    • Cons:
      • No support on very old kernels, some backports (RHEL)

    Kernel Version   Feature
    3.18             bpf() system call added
    4.1              Attach to kprobes
    4.7              Attach to tracepoints
  8. eBPF Verifier

    • Ensures eBPF code terminates and doesn't contain unbounded loops (bounded loops are accepted since kernel 5.3)
    • Execution simulation, with pruning, to check that register and stack state are valid
    • Pointer arithmetic can be enabled under certain conditions
    • Restricts which kernel functions can be called
    • Checks for reads from uninitialized variables
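As a rough illustration of the termination rule, the sketch below applies a much simpler check than the kernel does: it rejects any jump that goes backwards or leaves the program, and requires the last instruction to be an exit. This is a toy, not the verifier's algorithm (which walks the control-flow graph and simulates execution); `toy_verify` and the two sample programs are hypothetical names. Only `struct bpf_insn` and the opcode constants come from `<linux/bpf.h>`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/bpf.h>   /* struct bpf_insn, BPF_JMP, BPF_EXIT, BPF_REG_0, ... */

/* Toy structural check, NOT the kernel verifier: accept a program only if
 * every branch lands inside the program, no branch goes backwards (so no
 * loop can exist), and the last instruction is BPF_EXIT. */
static bool toy_verify(const struct bpf_insn *prog, size_t n)
{
    assert(prog != NULL);
    if (n == 0 || prog[n - 1].code != (BPF_JMP | BPF_EXIT))
        return false;

    for (size_t i = 0; i < n; i++) {
        unsigned char op = BPF_OP(prog[i].code);

        /* Only BPF_JMP-class branches matter here; calls and exit do not
         * use the off field as a branch target. */
        if (BPF_CLASS(prog[i].code) != BPF_JMP || op == BPF_CALL || op == BPF_EXIT)
            continue;

        /* Branch offsets are relative to the following instruction. */
        long target = (long)i + 1 + prog[i].off;
        if (target <= (long)i || target >= (long)n)
            return false;   /* backward jump or out of bounds */
    }
    return true;
}

/* r0 = 0; exit */
static const struct bpf_insn ok_prog[] = {
    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
    { .code = BPF_JMP | BPF_EXIT },
};

/* goto self; exit  (never terminates) */
static const struct bpf_insn loop_prog[] = {
    { .code = BPF_JMP | BPF_JA, .off = -1 },
    { .code = BPF_JMP | BPF_EXIT },
};
```

Rejecting every backward jump is stricter than necessary, which is the trade-off the real verifier relaxes by proving that a loop is bounded.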
  9. How do I write an eBPF program?

    • The bpf system call only accepts eBPF bytecode
    • Options
      • Write in eBPF assembly by hand, use kernel tool bpf_asm to generate bytecode
      • Write in C using <bpf/bpf.h> provided by kernel for eBPF data structures and functions
  10. eBPF System Call

      int bpf(int cmd, union bpf_attr *attr, unsigned int size);

    • Commands
      • BPF_PROG_LOAD
      • BPF_MAP_CREATE, BPF_MAP_LOOKUP_ELEM, BPF_MAP_UPDATE_ELEM
    • Program Types
      • BPF_PROG_TYPE_SOCKET_FILTER
      • BPF_PROG_TYPE_KPROBE
      • BPF_PROG_TYPE_TRACEPOINT
      • BPF_PROG_TYPE_XDP
      • BPF_PROG_TYPE_PERF_EVENT
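The map commands can be exercised from user space without loading a program. The following is a minimal sketch, assuming a kernel and privilege level that permit bpf(); on many systems an unprivileged caller gets -1 with errno set to EPERM. glibc provides no bpf() wrapper, so the call goes through syscall(2). The helper names (`sys_bpf`, `map_create`, `map_update`, `map_lookup`) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/bpf.h>   /* union bpf_attr, BPF_MAP_* commands */

/* Thin wrapper: glibc does not export bpf(), so use the raw syscall. */
static long sys_bpf(int cmd, union bpf_attr *attr)
{
    return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/* Create a one-element array map (u32 key, u64 value).
 * Returns a file descriptor, or -1 with errno set. */
static int map_create(void)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_type = BPF_MAP_TYPE_ARRAY;
    attr.key_size = sizeof(uint32_t);
    attr.value_size = sizeof(uint64_t);
    attr.max_entries = 1;
    return (int)sys_bpf(BPF_MAP_CREATE, &attr);
}

/* Store value at key. Returns 0 on success, -1 with errno set. */
static int map_update(int fd, uint32_t key, uint64_t value)
{
    union bpf_attr attr;

    assert(fd >= 0);
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = (uint32_t)fd;
    attr.key = (uint64_t)(uintptr_t)&key;      /* pointers travel as u64 */
    attr.value = (uint64_t)(uintptr_t)&value;
    attr.flags = BPF_ANY;                      /* create or overwrite */
    return (int)sys_bpf(BPF_MAP_UPDATE_ELEM, &attr);
}

/* Read the value at key into *value. Returns 0 on success, -1 with errno set. */
static int map_lookup(int fd, uint32_t key, uint64_t *value)
{
    union bpf_attr attr;

    assert(fd >= 0 && value != NULL);
    memset(&attr, 0, sizeof(attr));
    attr.map_fd = (uint32_t)fd;
    attr.key = (uint64_t)(uintptr_t)&key;
    attr.value = (uint64_t)(uintptr_t)value;
    return (int)sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr);
}
```

An eBPF program loaded with BPF_PROG_LOAD would reach the same map through helper calls such as bpf_map_update_elem, which is how kernel-side code and user space exchange data.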
  11. Example eBPF program (in C)

      /* Writes the last PID that called sync to a map at index 0 */
      SEC("kprobe/sys_sync")
      int bpf_prog1(struct pt_regs *ctx)
      {
          u64 pid = bpf_get_current_pid_tgid();
          int idx = 0;

          if (!bpf_current_task_under_cgroup(&cgroup_map, 0))
              return 0;

          bpf_map_update_elem(&perf_map, &idx, &pid, BPF_ANY);
          return 0;
      }
  12. eBPF Frameworks

    • BPF Compiler Collection (BCC)
      • Toolkit and wrappers for writing eBPF scripts
      • Still need to write eBPF in C
    • bpftrace
      • Scripting language built on top of BCC
      • Makes it easy to write one-liners

      # Read bytes by process:
      bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret/ { @[comm] = sum(args->ret); }'

      # Read size distribution by process:
      bpftrace -e 'tracepoint:syscalls:sys_exit_read { @[comm] = hist(args->ret); }'

      # Count LLC cache misses by process name and PID (uses PMCs):
      bpftrace -e 'hardware:cache-misses:1000000 { @[comm, pid] = count(); }'

      # Profile user-level stacks at 99 Hertz, for PID 189:
      bpftrace -e 'profile:hz:99 /pid == 189/ { @[ustack] = count(); }'
  13. Image source: https://ebpf.io/what-is-ebpf/

  14. Resources

    eBPF Development
    • https://github.com/libbpf/libbpf
    • https://github.com/iovisor/kubectl-trace
    • https://github.com/iovisor/bcc
    • https://github.com/iovisor/gobpf
    • https://nakryiko.com/posts/bpf-portability-and-co-re/

    Guides and Documentation
    • https://ebpf.io/
    • https://www.stackrox.io/blog/what-is-ebpf/
    • https://facebookmicrosites.github.io/bpf/
    • https://docs.cilium.io/en/stable/bpf/
    • http://www.brendangregg.com/index.html
    • https://www.kernel.org/doc/Documentation/kprobes.txt
    • https://www.kernel.org/doc/Documentation/trace/tracepoints.txt
    • https://suchakra.wordpress.com/2015/05/18/bpf-internals-i/
    • https://jvns.ca/blog/2017/07/05/linux-tracing-systems/

    Tools and Platforms using eBPF
    • [security] Red Hat Advanced Cluster Security for Kubernetes
      ◦ https://cloud.redhat.com/products/kubernetes-security
      ◦ stackrox.io
    • [security] https://falco.org/ and https://github.com/falcosecurity/libs
    • [security] https://github.com/aquasecurity/tracee
    • [networking] https://cilium.io/
    • [visibility] https://osquery.io/
    • [debugging] https://github.com/iovisor/kubectl-trace
    • [access control] Kernel Runtime Security Instrumentation - https://lwn.net/Articles/808048/
  15. System Calls

    • Linux system calls are the base API for infrastructure
    • Process control, file manipulation, device manipulation, information maintenance, communication and networking
  16. Procfs

    Kernel information associated with processes
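Because procfs exposes this information as plain text files, a monitor can read it with ordinary file I/O. A minimal sketch, assuming /proc is mounted: it looks up one numeric field in /proc/self/status. The function names `proc_self_status` and `procfs_pid_matches` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return the numeric value of a field such as "Pid", "PPid" or "Threads"
 * from /proc/self/status, or -1 if the file or field is unavailable. */
static long proc_self_status(const char *key)
{
    char line[256];
    long value = -1;
    size_t klen;
    FILE *f;

    assert(key != NULL);
    klen = strlen(key);
    f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;

    /* Lines look like "Pid:\t1234"; match the key up to the colon. */
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            if (sscanf(line + klen + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}

/* Cross-check procfs against the getpid() system call. */
static int procfs_pid_matches(void)
{
    return proc_self_status("Pid") == (long)getpid();
}
```

The same pattern works for another process by replacing "self" with its PID, subject to the usual permission checks.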

  17. Ptrace

    • The Linux ptrace API allows a process to access low-level information about another process
      • read and write the attached process memory
      • debugger breakpoints
      • read and write the attached process CPU registers
      • be notified of system events
      • recognize the exec syscall, clone, exit, etc
      • control its execution
      • CPU single-stepping
      • alter signal handling
    • Ptrace is slow -- context switches for every event

      long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data)
  18. Ptrace

    • strace is a user-mode program which uses ptrace()
    • Attach strace to the running process you wish to monitor
    • Every system call made by the traced process adds an extra context switch

      long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data)

      rc@robby-dev:~$ strace ls /tmp
      execve("/bin/ls", ["ls", "/tmp"], [/* 77 vars */]) = 0
      brk(NULL) = 0x887000
      access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
      access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
      open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
      fstat(3, {st_mode=S_IFREG|0644, st_size=56320, ...}) = 0
      mmap(NULL, 56320, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f3504e99000
      close(3) = 0
      access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
      ...
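The cost described above shows up directly in code: the tracer must be woken for every stop. The sketch below is a stripped-down version of what an strace-like tool does, assuming ptrace is permitted in the environment. It runs a command under PTRACE_SYSCALL and counts the syscall stops (normally two per system call, one on entry and one on exit). The function name `count_syscall_stops` is hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] under ptrace and return the number of syscall stops seen,
 * or -1 if tracing could not be set up. */
static long count_syscall_stops(char *const argv[])
{
    int status;
    long stops = 0;
    pid_t child;

    assert(argv != NULL && argv[0] != NULL);
    child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Child: ask to be traced, then pause until the parent is ready. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(126);
        raise(SIGSTOP);
        execvp(argv[0], argv);
        _exit(127);
    }

    /* Parent: wait for the initial SIGSTOP. */
    if (waitpid(child, &status, 0) == -1 || !WIFSTOPPED(status))
        return -1;

    /* Mark syscall stops with bit 0x80 so they differ from plain SIGTRAP. */
    ptrace(PTRACE_SETOPTIONS, child, NULL, (void *)(long)PTRACE_O_TRACESYSGOOD);

    for (;;) {
        /* Resume until the next syscall entry or exit: one context switch
         * into the tracer for each stop. */
        if (ptrace(PTRACE_SYSCALL, child, NULL, NULL) == -1)
            return -1;
        if (waitpid(child, &status, 0) == -1)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80))
            stops++;
    }
    return stops;
}
```

Even a trivial command such as `true` produces dozens of stops during dynamic loading, each of which suspends the traced process until the tracer resumes it.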
  19. AuditD

    • Auditing subsystem in the kernel
    • Enabled at boot-time or runtime
    • Kernel outputs audit logs
    • Userland can configure audit rules for alerting and monitoring
    • All communication happens over a netlink socket (AF_NETLINK)
      • Netlink sockets allow for kernel <-> userland communication
    • Common for legacy security tools to use AuditD
    • Pros
      • Built-in
    • Cons
      • Not well maintained, performance hit, unwieldy log output, container support is lacking
  20. AuditD

    • Suspicious activity
    • Privilege escalation
    • Unauthorized file access

    Example AuditD rules

      -a always,exit -F arch=b32 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EACCES -F auid>=500 -F auid!=4294967295 -k file_access
      -a always,exit -F arch=b32 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EPERM -F auid>=500 -F auid!=4294967295 -k file_access
      -a always,exit -F arch=b64 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EACCES -F auid>=500 -F auid!=4294967295 -k file_access
      -a always,exit -F arch=b64 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EPERM -F auid>=500 -F auid!=4294967295 -k file_access
  21. DebugFS

    • Special kernel file system that makes kernel information available to userspace
    • Unlike procfs, any information, not just process information
    • Used by reading and writing files to extract data or configure options

      /sys/kernel/debug
  22. Linux Perf Subsystem

    • Hardware Events (CPU counters)
    • Software Events (Kernel counters)
    • Kernel Tracepoint Events
    • User Statically-Defined Tracing (USDT)
    • Dynamic Tracing
    • Timed Profiling
  23. Linux Perf Subsystem

    • The perf_event_open system call can be used directly from userspace.
    • Command-line interaction also possible with perf

      int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu, int group_fd, long flags)

      # Count system calls by type for the specified PID, until Ctrl-C:
      perf stat -e 'syscalls:sys_enter_*' -p PID

      # Sample on-CPU user instructions, for 5 seconds:
      perf record -e cycles:u -a -- sleep 5

      # Count syscalls per-second system-wide:
      perf stat -e raw_syscalls:sys_enter -I 1000 -a
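Calling perf_event_open directly looks roughly like the sketch below, which follows the pattern in the perf_event_open(2) man page: open a counter for the calling thread, enable it around a piece of work, and read the count. It assumes hardware counters are exposed and that perf_event_paranoid allows user-space counting; in VMs and containers the open often fails, in which case the function returns -1. The names `sys_perf_event_open`, `count_instructions` and `busy_loop` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/perf_event.h>

/* glibc has no wrapper for perf_event_open, so use the raw syscall. */
static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Some user-space work to measure. */
static void busy_loop(void)
{
    volatile unsigned long x = 0;

    for (unsigned long i = 0; i < 1000000; i++)
        x += i;
}

/* Count user-space instructions retired while work() runs on the calling
 * thread. Returns the count, or -1 if the counter is unavailable. */
static long long count_instructions(void (*work)(void))
{
    struct perf_event_attr attr;
    uint64_t count = 0;
    ssize_t n;
    int fd;

    assert(work != NULL);
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;          /* start stopped, enable explicitly */
    attr.exclude_kernel = 1;    /* user space only */
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: this thread, on whichever CPU it runs. */
    fd = (int)sys_perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    n = read(fd, &count, sizeof(count));
    close(fd);
    return n == (ssize_t)sizeof(count) ? (long long)count : -1;
}
```

Calling `count_instructions(busy_loop)` is the programmatic counterpart of a `perf stat` invocation restricted to the current thread.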
  24. Tracepoints

    • Provide a hook to call a function (probe) that you can provide at runtime.
    • Statically pre-compiled events in the kernel
      • DEFINE_TRACE(subsys_eventname)
      • cat /sys/kernel/debug/tracing/available_events
    • 1000+ event types
      • Syscalls, networking, block etc
    • Each tracepoint has an output format defined.
  25. Tracepoints

    Tracepoints provide a format when configured

      rc@robby-dev:~$ sudo cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_execve/format
      name: sys_enter_execve
      ID: 680
      format:
          field:unsigned short common_type; offset:0; size:2; signed:0;
          field:unsigned char common_flags; offset:2; size:1; signed:0;
          field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
          field:int common_pid; offset:4; size:4; signed:1;

          field:int __syscall_nr; offset:8; size:4; signed:1;
          field:const char * filename; offset:16; size:8; signed:0;
          field:const char *const * argv; offset:24; size:8; signed:0;
          field:const char *const * envp; offset:32; size:8; signed:0;

      print fmt: "filename: 0x%08lx, argv: 0x%08lx, envp: 0x%08lx", ((unsigned long)(REC->filename)), ((unsigned long)(REC->argv)), ((unsigned long)(REC->envp))
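The format description is what lets a consumer decode a raw tracepoint record: each field line gives the byte offset and size to read. A minimal sketch of that idea follows, assuming field lines shaped like the ones in the listing. The struct and function names (`tp_field`, `parse_tp_field`, `read_tp_field`) are hypothetical, and signed fields are returned zero-extended for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One "field:" line from a tracepoint format file. */
struct tp_field {
    char decl[64];       /* C declaration, e.g. "const char * filename" */
    unsigned offset;     /* byte offset within the raw record */
    unsigned size;       /* size in bytes */
    int is_signed;
};

/* Parse a line such as
 *   "field:int common_pid; offset:4; size:4; signed:1;"
 * Returns 0 on success, -1 if the line is not a field description. */
static int parse_tp_field(const char *line, struct tp_field *f)
{
    int n;

    assert(line != NULL && f != NULL);
    n = sscanf(line, " field:%63[^;]; offset:%u; size:%u; signed:%d;",
               f->decl, &f->offset, &f->size, &f->is_signed);
    return n == 4 ? 0 : -1;
}

/* Extract an integer field (1, 2, 4 or 8 bytes, native byte order) from
 * a raw record. Returns 0 on success, -1 on a bad size or short record. */
static int read_tp_field(const uint8_t *rec, size_t rec_len,
                         const struct tp_field *f, uint64_t *out)
{
    uint8_t v8; uint16_t v16; uint32_t v32; uint64_t v64;

    if ((size_t)f->offset + f->size > rec_len)
        return -1;

    switch (f->size) {
    case 1: memcpy(&v8,  rec + f->offset, 1); *out = v8;  return 0;
    case 2: memcpy(&v16, rec + f->offset, 2); *out = v16; return 0;
    case 4: memcpy(&v32, rec + f->offset, 4); *out = v32; return 0;
    case 8: memcpy(&v64, rec + f->offset, 8); *out = v64; return 0;
    default: return -1;
    }
}
```

Pointer-typed fields such as filename decode to an address in the traced process, so a real consumer needs a further step (for example an eBPF helper) to fetch the string itself.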
  26. Using tracepoints

    • Directly from userspace with debugfs interaction in /sys/kernel/debug/
      • cat /sys/kernel/debug/tracing/trace_pipe
    • Directly from userspace with perf
      • perf record -e syscalls:sys_enter_execve
    • Within a kernel module...
    • Within an eBPF program...
  27. Kprobes

    • Kernel feature to attach an event to almost any location
    • Dynamically modify your kernel at runtime
    • No pre-existing format, barebones
    • Pro
      • Attach to the exact kernel function you want to monitor
    • Con
      • Stability is not guaranteed, internal kernel functions change over time
  28. Uprobes

    • Use the kernel tooling to monitor data about userland
    • Userspace version of kprobes
    • Attach to memory location, events are triggered when execution reaches location
    • Overhead similar to ptrace

      echo 'r:zfree_exit /bin/zsh:0x46420 %ip %ax' >> /sys/kernel/debug/tracing/uprobe_events
      cat /sys/kernel/debug/tracing/trace
      zsh-24842 [006] 258544.995456: zfree_entry: (0x446420) arg1=446420 arg2=79
      zsh-24842 [007] 258545.000270: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0
      zsh-24842 [002] 258545.043929: zfree_entry: (0x446420) arg1=446420 arg2=79
      zsh-24842 [004] 258547.046129: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0
  29. Kernel Modules

    • Compiled code that can be loaded into or unloaded from the kernel on demand
    • Often used to extend hardware/filesystem/networking/graphics support
    • Take advantage of any or all of the existing kernel data collection systems
    • Roll your own system call tracing…
      • Written in C
      • Mark kernel page RW
      • Find the system call table
      • Overwrite system calls with shim function
    • ...Or utilize perf subsystem, tracepoints, etc
    • Tracepoints, for example, in a kernel module
      • <trace/syscall.h>
      • tracepoint_probe_register()
  30. Questions?