StackRox Community Office Hours (E2): eBPF 101 — Implementing Security & Monitoring Kubernetes

eBPF is the behind-the-scenes subsystem of the Linux kernel that enables new and simpler methods of profiling, networking, and security for Kubernetes without compromising speed and safety.

Red Hat Livestreaming

August 19, 2021

Transcript

  1. Office Hours: eBPF 101
    August 19, 2021
    Robby Cochran
    Senior Software Engineer @ Red Hat

  2. 2

  3. 3
    System Calls API

  4. 4
    Image source: https://ebpf.io/what-is-ebpf/

  5. 5
    ©2019 StackRox. All rights reserved.
    DEMO - kubectl-trace
    https://github.com/iovisor/kubectl-trace

  6. 6
    Berkeley Packet Filter (BPF)
    Virtual machine (VM) model for packet filtering, introduced in 1993
    setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
    $ sudo tcpdump -d "tcp dst port 22"
    (000) ldh [12]
    (001) jeq #0x86dd jt 2 jf 6
    (002) ldb [20]
    (003) jeq #0x6 jt 4 jf 15
    (004) ldh [56]
    (005) jeq #0x16 jt 14 jf 15
    (006) jeq #0x800 jt 7 jf 15
    (007) ldb [23]
    (008) jeq #0x6 jt 9 jf 15
    (009) ldh [20]
    (010) jset #0x1fff jt 15 jf 11
    (011) ldxb 4*([14]&0xf)
    (012) ldh [x + 16]
    (013) jeq #0x16 jt 14 jf 15
    (014) ret #262144
    (015) ret #0
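
To make the VM model concrete, the listing above can be encoded as a `struct sock_filter` array (the same structure `SO_ATTACH_FILTER` expects) and stepped through in userspace. The following is a minimal sketch, not the kernel's interpreter: `tcp_dst_port_22` and `run_filter` are illustrative names, only the opcodes that appear in the listing are handled, and the packet offsets assume Ethernet framing as in the tcpdump output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h> /* struct sock_filter, BPF_STMT, BPF_JUMP, opcodes */

/* The "tcp dst port 22" listing. jt/jf are offsets relative to the next
 * instruction, so "jt 2 jf 6" at index 1 becomes jt=0, jf=4. */
static const struct sock_filter tcp_dst_port_22[] = {
    BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),               /* 000 ldh [12]  */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x86dd, 0, 4),    /* 001 IPv6?     */
    BPF_STMT(BPF_LD | BPF_B | BPF_ABS, 20),               /* 002 ldb [20]  */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x6, 0, 11),      /* 003 TCP?      */
    BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 56),               /* 004 ldh [56]  */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x16, 8, 9),      /* 005 port 22?  */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x800, 0, 8),     /* 006 IPv4?     */
    BPF_STMT(BPF_LD | BPF_B | BPF_ABS, 23),               /* 007 ldb [23]  */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x6, 0, 6),       /* 008 TCP?      */
    BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 20),               /* 009 ldh [20]  */
    BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, 0x1fff, 4, 0),   /* 010 fragment? */
    BPF_STMT(BPF_LDX | BPF_B | BPF_MSH, 14),              /* 011 x = IHL*4 */
    BPF_STMT(BPF_LD | BPF_H | BPF_IND, 16),               /* 012 ldh [x+16]*/
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x16, 0, 1),      /* 013 port 22?  */
    BPF_STMT(BPF_RET | BPF_K, 262144),                    /* 014 accept    */
    BPF_STMT(BPF_RET | BPF_K, 0),                         /* 015 drop      */
};
#define TCP_DST_PORT_22_LEN (sizeof(tcp_dst_port_22) / sizeof(tcp_dst_port_22[0]))

/* Runs the filter over a packet. Returns the number of bytes to accept,
 * or 0 to drop. Out-of-bounds loads and unknown opcodes drop the packet. */
uint32_t run_filter(const struct sock_filter *prog, size_t len,
                    const uint8_t *pkt, size_t pkt_len)
{
    uint32_t a = 0, x = 0; /* accumulator and index register */

    assert(prog != NULL && pkt != NULL);
    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *in = &prog[pc];
        uint32_t off = in->k;

        switch (in->code) {
        case BPF_LD | BPF_H | BPF_IND:
            off += x;
            /* fall through */
        case BPF_LD | BPF_H | BPF_ABS:
            if ((size_t)off + 2 > pkt_len)
                return 0;
            a = ((uint32_t)pkt[off] << 8) | pkt[off + 1];
            break;
        case BPF_LD | BPF_B | BPF_ABS:
            if (off >= pkt_len)
                return 0;
            a = pkt[off];
            break;
        case BPF_LDX | BPF_B | BPF_MSH:
            if (off >= pkt_len)
                return 0;
            x = 4u * (pkt[off] & 0x0f);
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            pc += (a == in->k) ? in->jt : in->jf;
            break;
        case BPF_JMP | BPF_JSET | BPF_K:
            pc += (a & in->k) ? in->jt : in->jf;
            break;
        case BPF_RET | BPF_K:
            return in->k;
        default:
            return 0;
        }
    }
    return 0;
}
```

Calling `run_filter(tcp_dst_port_22, TCP_DST_PORT_22_LEN, pkt, pkt_len)` on an Ethernet frame carrying unfragmented TCP to port 22 follows the IPv4 or IPv6 branch of the listing and reaches the accept instruction; any other frame falls through to the drop instruction.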

  7. 7
    Extended Berkeley Packet Filter (eBPF)
    • Redesigned VM model
    • Instructions more closely match native machine instructions
    • Workflow:
      • Write an eBPF program and compile it into bytecode
      • Attach to a kernel kprobe or tracepoint using the bpf() system call
      • Kernel uses a JIT compiler to convert the bytecode to machine code
    • Pros:
      • No kernel module required
      • You can construct much more sophisticated analysis within the kernel
    • Cons:
      • No support on very old kernels, apart from some backports (RHEL)
    Kernel Version   Feature
    3.18             bpf() system call added
    4.1              Attach to kprobes
    4.7              Attach to tracepoints
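
As a rough illustration of the version table, the sketch below maps a kernel release string (such as the `release` field returned by `uname(2)`) to the features listed. The enum and function names are made up for this example. Treat the result as a heuristic only: distribution backports, which the slide mentions for RHEL, mean an older version number can still support these features, so probing with the real system call is more reliable.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical feature flags, one per row of the table above. */
enum ebpf_feature {
    EBPF_SYSCALL     = 1 << 0, /* bpf() system call, kernel 3.18 */
    EBPF_KPROBES     = 1 << 1, /* attach to kprobes, kernel 4.1 */
    EBPF_TRACEPOINTS = 1 << 2, /* attach to tracepoints, kernel 4.7 */
};

static int version_at_least(int major, int minor, int want_major, int want_minor)
{
    return major > want_major || (major == want_major && minor >= want_minor);
}

/* release looks like "4.18.0-305.el8.x86_64". Returns a bitmask of
 * enum ebpf_feature values, or 0 if the string cannot be parsed. */
unsigned ebpf_features_for_release(const char *release)
{
    int major = 0, minor = 0;
    unsigned features = 0;

    assert(release != NULL);
    if (sscanf(release, "%d.%d", &major, &minor) != 2)
        return 0;

    if (version_at_least(major, minor, 3, 18))
        features |= EBPF_SYSCALL;
    if (version_at_least(major, minor, 4, 1))
        features |= EBPF_KPROBES;
    if (version_at_least(major, minor, 4, 7))
        features |= EBPF_TRACEPOINTS;
    return features;
}
```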

  8. 8
    eBPF Verifier
    • Ensures eBPF code terminates and doesn't contain unbounded loops
    • Execution simulation, with pruning, to check that register and stack state are valid
    • Pointer arithmetic can be enabled under certain conditions
    • Restricts which kernel functions can be called
    • Checks for reads from uninitialized variables

  9. 9
    How do I write an eBPF program?
    • The bpf() system call only accepts eBPF bytecode
    • Options
      • Write eBPF assembly by hand and use the kernel tool bpf_asm to generate bytecode
      • Write in C using the headers provided by the kernel for eBPF data structures and functions

  10. 10
    eBPF System Call
    int bpf(int cmd, union bpf_attr *attr, unsigned int size);
    • Commands
      • BPF_PROG_LOAD
      • BPF_MAP_CREATE, BPF_MAP_LOOKUP_ELEM, BPF_MAP_UPDATE_ELEM
    • Program Types
      • BPF_PROG_TYPE_SOCKET_FILTER
      • BPF_PROG_TYPE_KPROBE
      • BPF_PROG_TYPE_TRACEPOINT
      • BPF_PROG_TYPE_XDP
      • BPF_PROG_TYPE_PERF_EVENT
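
glibc does not ship a `bpf()` wrapper, so userspace code usually reaches the system call through `syscall(2)`. The sketch below shows one way a `BPF_MAP_CREATE` command might be issued for an array map. The function names (`sys_bpf`, `init_array_map_attr`, `create_array_map`) are illustrative rather than part of any library, and the call is expected to fail with `EPERM` when the process lacks the required privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/bpf.h>   /* union bpf_attr, BPF_MAP_CREATE, BPF_MAP_TYPE_ARRAY */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h> /* __NR_bpf */
#include <unistd.h>      /* syscall() */

/* Thin wrapper with the same shape as the prototype on the slide. */
long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* Fills in the attributes for a BPF_MAP_TYPE_ARRAY map, whose keys are
 * 32-bit indexes. Fields the command does not use must be zero. */
void init_array_map_attr(union bpf_attr *attr, uint32_t value_size,
                         uint32_t max_entries)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->map_type = BPF_MAP_TYPE_ARRAY;
    attr->key_size = sizeof(uint32_t);
    attr->value_size = value_size;
    attr->max_entries = max_entries;
}

/* Returns a map file descriptor, or -1 with errno set on failure. */
int create_array_map(uint32_t value_size, uint32_t max_entries)
{
    union bpf_attr attr;

    init_array_map_attr(&attr, value_size, max_entries);
    return (int)sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}
```

The other commands on the slide follow the same pattern: zero a `union bpf_attr`, fill in the fields that the command reads, and pass it to the system call. In practice most programs let libbpf handle this.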

  11. 11
    Example eBPF program (in C)
    /* Writes the last PID that called sync to a map at index 0 */
    SEC("kprobe/sys_sync")
    int bpf_prog1(struct pt_regs *ctx)
    {
        u64 pid = bpf_get_current_pid_tgid();
        int idx = 0;

        /* Ignore callers outside the cgroup stored at index 0 of cgroup_map */
        if (!bpf_current_task_under_cgroup(&cgroup_map, 0))
            return 0;

        bpf_map_update_elem(&perf_map, &idx, &pid, BPF_ANY);
        return 0;
    }
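
One detail worth noting about the example: `bpf_get_current_pid_tgid()` returns a single 64-bit value with the thread group ID (what userspace calls the PID) in the upper 32 bits and the kernel task ID (the thread ID) in the lower 32 bits. A userspace reader of the map could split the stored value as in this sketch; `struct task_ids` and `split_pid_tgid` are names invented here for illustration.

```c
#include <assert.h> /* static_assert */
#include <stdint.h>

struct task_ids {
    uint32_t tgid; /* thread group ID: what userspace calls the PID */
    uint32_t tid;  /* kernel task ID: what userspace calls the thread ID */
};

static_assert(sizeof(struct task_ids) == sizeof(uint64_t),
              "both halves together must cover the 64-bit helper result");

/* Splits the value returned by bpf_get_current_pid_tgid(). */
struct task_ids split_pid_tgid(uint64_t pid_tgid)
{
    struct task_ids ids = {
        .tgid = (uint32_t)(pid_tgid >> 32),
        .tid  = (uint32_t)(pid_tgid & 0xffffffffu),
    };
    return ids;
}
```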

  12. 12
    eBPF Frameworks
    • BPF Compiler Collection (BCC)
      • Toolkit and wrappers for writing eBPF scripts
      • Still need to write eBPF in C
    • bpftrace
      • Scripting language built on top of BCC
      • Makes it easy to write one-liners
    # Read bytes by process:
    bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret/ { @[comm] = sum(args->ret); }'
    # Read size distribution by process:
    bpftrace -e 'tracepoint:syscalls:sys_exit_read { @[comm] = hist(args->ret); }'
    # Count LLC cache misses by process name and PID (uses PMCs):
    bpftrace -e 'hardware:cache-misses:1000000 { @[comm, pid] = count(); }'
    # Profile user-level stacks at 99 Hertz, for PID 189:
    bpftrace -e 'profile:hz:99 /pid == 189/ { @[ustack] = count(); }'

  13. 13
    Image source: https://ebpf.io/what-is-ebpf/

  14. 14
    Resources
    eBPF Development
    ● https://github.com/libbpf/libbpf
    ● https://github.com/iovisor/kubectl-trace
    ● https://github.com/iovisor/bcc
    ● https://github.com/iovisor/gobpf
    ● https://nakryiko.com/posts/bpf-portability-and-co-re/
    Guides and Documentation
    ● https://ebpf.io/
    ● https://www.stackrox.io/blog/what-is-ebpf/
    ● https://facebookmicrosites.github.io/bpf/
    ● https://docs.cilium.io/en/stable/bpf/
    ● http://www.brendangregg.com/index.html
    ● https://www.kernel.org/doc/Documentation/kprobes.txt
    ● https://www.kernel.org/doc/Documentation/trace/tracepoints.txt
    ● https://suchakra.wordpress.com/2015/05/18/bpf-internals-i/
    ● https://jvns.ca/blog/2017/07/05/linux-tracing-systems/
    Tools and Platforms using eBPF
    ● [security] Red Hat Advanced Cluster Security for Kubernetes
    ○ https://cloud.redhat.com/products/kubernetes-security
    ○ stackrox.io
    ● [security] https://falco.org/ and https://github.com/falcosecurity/libs
    ● [security] https://github.com/aquasecurity/tracee
    ● [networking] https://cilium.io/
    ● [visibility] https://osquery.io/
    ● [debugging] https://github.com/iovisor/kubectl-trace
    ● [access control] Kernel Runtime Security Instrumentation - https://lwn.net/Articles/808048/

  15. 15
    System Calls
    • Linux system calls are the base API for infrastructure
    • Process control, file manipulation, device manipulation, information maintenance, communication, and networking

  16. 16
    Procfs
    Kernel information associated with processes

  17. 17
    Ptrace
    • The Linux ptrace API allows a process to access low-level information about another process
    • read and write the attached process memory
    • debugger breakpoints
    • read and write the attached process CPU registers
    • be notified of system events
    • recognize the exec syscall, clone, exit, etc
    • control its execution
    • CPU single-stepping
    • alter signal handling
    • Ptrace is slow -- context switches for every event
    long ptrace(enum __ptrace_request request, pid_t pid, void * addr, void * data)

  18. 18
    Ptrace
    • strace is a user-mode program that uses ptrace()
    • Attach strace to the running process you wish to monitor
    • Every system call made by the traced process adds extra context switches
    long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data)
    $ strace ls /tmp
    execve("/bin/ls", ["ls", "/tmp"], [/* 77 vars */]) = 0
    brk(NULL) = 0x887000
    access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
    access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
    open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
    fstat(3, {st_mode=S_IFREG|0644, st_size=56320, ...}) = 0
    mmap(NULL, 56320, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f3504e99000
    close(3) = 0
    access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
    ...

  19. 19
    AuditD
    • Auditing subsystem in the kernel
    • Enabled at boot-time or runtime
    • Kernel outputs audit logs
    • Userland can configure audit rules for alerting and monitoring
    • All communication happens over a netlink socket (AF_NETLINK)
    • Netlink sockets allow for kernel <-> userland communication
    • Common for legacy security tools to use AuditD
    • Pros
      • Built-in
    • Cons
      • Not well maintained, performance hit, unwieldy log output, container support is lacking

  20. 20
    AuditD
    • Suspicious activity
    • Privilege escalation
    • Unauthorized file access
    Example AuditD rules
    -a always,exit -F arch=b32 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EACCES -F auid>=500 -F auid!=4294967295 -k file_access
    -a always,exit -F arch=b32 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EPERM -F auid>=500 -F auid!=4294967295 -k file_access
    -a always,exit -F arch=b64 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EACCES -F auid>=500 -F auid!=4294967295 -k file_access
    -a always,exit -F arch=b64 -S creat -S open -S openat -S open_by_handle_at -S truncate -S ftruncate -F exit=-EPERM -F auid>=500 -F auid!=4294967295 -k file_access
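
The field filters in these rules are terse, so here is a small sketch in C of the condition they express together: the system call failed with `EACCES` or `EPERM`, and the login UID (auid) belongs to a regular user. The value 4294967295 is `(uint32_t)-1`, which the audit subsystem uses for "no login UID set". The function name is invented for this example, and the `-S` system call list and `arch` filters are deliberately not modeled.

```c
#include <assert.h> /* static_assert */
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define AUID_UNSET ((uint32_t)-1) /* no login UID has been assigned */

static_assert(AUID_UNSET == 4294967295u,
              "-F auid!=4294967295 means the login UID is set");

/* Mirrors -F exit=-EACCES / -F exit=-EPERM, -F auid>=500 and
 * -F auid!=4294967295 from the rules above. */
bool matches_file_access_rule(long syscall_ret, uint32_t auid)
{
    bool denied = (syscall_ret == -EACCES || syscall_ret == -EPERM);
    bool regular_user = (auid >= 500 && auid != AUID_UNSET);

    return denied && regular_user;
}
```

The `b32` and `b64` variants exist because system call numbers differ between the 32-bit and 64-bit ABIs, so each failure code needs one rule per architecture.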

  21. 21
    DebugFS
    • Special kernel file system that makes kernel information available to userspace
    • Unlike procfs, exposes any kernel information, not just process information
    • Used by reading and writing files to extract data or configure options
    /sys/kernel/debug

  22. 22
    Linux Perf Subsystem
    • Hardware Events (CPU counters)
    • Software Events (Kernel counters)
    • Kernel Tracepoint Events
    • User Statically-Defined Tracing (USDT)
    • Dynamic Tracing
    • Timed Profiling

  23. 23
    Linux Perf Subsystem
    • The perf_event_open system call can be used directly from userspace
    • Command-line interaction is also possible with perf
    int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu, int group_fd, unsigned long flags)
    # Count system calls by type for the specified PID, until Ctrl-C:
    perf stat -e 'syscalls:sys_enter_*' -p PID
    # Sample on-CPU user instructions, for 5 seconds:
    perf record -e cycles:u -a -- sleep 5
    # Count syscalls per-second system-wide:
    perf stat -e raw_syscalls:sys_enter -I 1000 -a
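
Like `bpf()`, `perf_event_open` has no glibc wrapper, so direct use from C goes through `syscall(2)`. The following sketch sets up a software event counter for the calling process. The function names are illustrative, and whether the open succeeds depends on the system's `perf_event_paranoid` setting and the process's privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h> /* struct perf_event_attr, PERF_TYPE_SOFTWARE */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>      /* SYS_perf_event_open */
#include <sys/types.h>
#include <unistd.h>           /* syscall() */

/* Thin wrapper with the same shape as the prototype on the slide. */
long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                         int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Prepares a software counter such as PERF_COUNT_SW_CONTEXT_SWITCHES. */
void init_sw_counter_attr(struct perf_event_attr *attr, uint64_t config)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_SOFTWARE;
    attr->size = sizeof(*attr);
    attr->config = config;
    attr->disabled = 1;       /* start stopped; enable later via ioctl() */
    attr->exclude_kernel = 1; /* count user-space activity only */
    attr->exclude_hv = 1;
}

/* Opens the counter for the calling process (pid 0) on any CPU (cpu -1).
 * Returns a file descriptor, or -1 with errno set on failure. */
int open_self_sw_counter(uint64_t config)
{
    struct perf_event_attr attr;

    init_sw_counter_attr(&attr, config);
    return (int)sys_perf_event_open(&attr, 0, -1, -1, 0);
}
```

With a valid descriptor, the counter is started with `ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)` and its 64-bit value is fetched with `read(2)`.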

  24. 24
    Tracepoints
    • Provide a hook to call a function (probe) that you can provide at runtime.
    • Statically pre-compiled events in the kernel
    • DEFINE_TRACE(subsys_eventname)
    • cat /sys/kernel/debug/tracing/available_events
    • 1000+ event types
    • Syscalls, networking, block, etc.
    • Each tracepoint has an output format defined.

  25. 25
    Tracepoints
    $ sudo cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_execve/format
    name: sys_enter_execve
    ID: 680
    format:
    field:unsigned short common_type; offset:0; size:2; signed:0;
    field:unsigned char common_flags; offset:2; size:1; signed:0;
    field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
    field:int common_pid; offset:4; size:4; signed:1;
    field:int __syscall_nr; offset:8; size:4; signed:1;
    field:const char * filename; offset:16; size:8; signed:0;
    field:const char *const * argv; offset:24; size:8; signed:0;
    field:const char *const * envp; offset:32; size:8; signed:0;
    print fmt: "filename: 0x%08lx, argv: 0x%08lx, envp: 0x%08lx", ((unsigned long)(REC->filename)), ((unsigned long)(REC->argv)), ((unsigned long)(REC->envp))
    Tracepoints provide a format when configured
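
Because every tracepoint publishes its record layout in this format file, a tool can discover field offsets at runtime instead of hard-coding them. The sketch below parses a single `field:` line with `sscanf`. The struct and function names are invented for this example, and the name extraction is deliberately naive (for an array field such as `char comm[16]` it would return `comm[16]`).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct tp_field {
    char decl[64];  /* C declaration, e.g. "const char * filename" */
    int offset;     /* byte offset within the event record */
    int size;       /* size in bytes */
    int is_signed;  /* 1 if the type is signed */
};

/* Parses one "field:...; offset:N; size:N; signed:N;" line.
 * Returns 0 on success, -1 if the line is not a field description. */
int parse_tp_field(const char *line, struct tp_field *out)
{
    assert(line != NULL && out != NULL);
    int n = sscanf(line, " field:%63[^;]; offset:%d; size:%d; signed:%d;",
                   out->decl, &out->offset, &out->size, &out->is_signed);
    return n == 4 ? 0 : -1;
}

/* Returns the field name, taken as the last space-separated token. */
const char *tp_field_name(const struct tp_field *f)
{
    const char *space = strrchr(f->decl, ' ');

    return space ? space + 1 : f->decl;
}
```

Feeding it the `filename` line from the listing yields offset 16 and size 8, which is where an eBPF program attached to `sys_enter_execve` would find the pointer to the path being executed.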

  26. 26
    Using tracepoints
    • Directly from userspace with debugfs interaction in /sys/kernel/debug/
      • cat /sys/kernel/debug/tracing/trace_pipe
    • Directly from userspace with perf
      • perf record -e syscalls:sys_enter_execve
    • Within a kernel module...
    • Within an eBPF program...

  27. 27
    Kprobes
    • Kernel feature to attach an event to almost any location in the kernel
    • Dynamically modify your kernel at runtime
    • No pre-existing format, barebones
    • Pro
      • Attach to the exact kernel function you want to monitor
    • Con
      • Stability is not guaranteed; internal kernel functions change over time

  28. 28
    Uprobes
    • Use the kernel tooling to monitor data about userland
    • Userspace version of kprobes
    • Attach to memory location, events are triggered when execution reaches location
    • Overhead similar to ptrace
    echo 'r:zfree_exit /bin/zsh:0x46420 %ip %ax' >> /sys/kernel/debug/tracing/uprobe_events
    cat /sys/kernel/debug/tracing/trace
    zsh-24842 [006] 258544.995456: zfree_entry: (0x446420) arg1=446420 arg2=79
    zsh-24842 [007] 258545.000270: zfree_exit: (0x446540
    zsh-24842 [002] 258545.043929: zfree_entry: (0x446420) arg1=446420 arg2=79
    zsh-24842 [004] 258547.046129: zfree_exit: (0x446540

  29. 29
    Kernel Modules
    • Compiled code that can be loaded into or unloaded from the kernel on demand
    • Often used to extend hardware/filesystem/networking/graphics support
    • Take advantage of any or all of the existing kernel data collection systems
    • Roll your own system call tracing…
      • Written in C
      • Mark kernel page RW
      • Find the system call table
      • Overwrite system calls with a shim function
    • ...Or utilize the perf subsystem, tracepoints, etc.
    • For example, tracepoints in a kernel module
      • tracepoint_probe_register()

  30. 30
    Questions?
