Slide 1

Slide 1 text

TraceLeft A Configuration Driven eBPF Tracing Framework Suchakra Sharma & Alban Crequy All Systems Go, 29th September 2018, Berlin

Slide 2

Slide 2 text

Suchakra Sharma Staff Scientist, ShiftLeft Inc. Github: tuxology Twitter: @tuxology Email: [email protected] PhD, DORSAL Lab (Ecole Polytechnique de Montreal). Loves systems engineering, performance analysis, hardware tracing and runtime security Alban Crequy CTO, Kinvolk GmbH. Github: alban Twitter: @albcr Email: [email protected] Loves Kubernetes, networking, security, systemd and containers at the lower-levels of the system.

Slide 3

Slide 3 text

The Deep-stack Kubernetes Experts Engineering services and products for Kubernetes, containers, process management and Linux user-space + kernel Blog: kinvolk.io/blog Github: kinvolk Twitter: kinvolkio Email: [email protected] Kinvolk

Slide 4

Slide 4 text

Continuous Security for Cloud Native Apps Secure applications by analyzing applications pre-emptively at build-time and carrying forward its security in production... seamlessly! Blog: shiftleft.io/blog Github: ShiftLeftSecurity Twitter: ShiftLeftInc Contact: shiftleft.io/contact

Slide 5

Slide 5 text

- Traceleft - Background - Tracing 101 - Architecture - Trace Configuration - JSON/Protobuf - Process/Store Trace Events - eBPF - What is eBPF? - The eBPF programs & maps - Use cases - Syscall monitoring example - demo ncurses demo - Event auditing - traceleft demo Agenda

Slide 6

Slide 6 text

- Challenges - Recompilation - File operations - Network - Future Work - Changes in recent kernel versions - Get rid of proc connector Agenda

Slide 7

Slide 7 text

Background - Tracing 101 - Low-impact recording of high-frequency events such as syscalls, network events, scheduling, interrupts or process/container-specific functions - Used for performance analysis and security [Figure: tracing layers, from distributed tracing (microservices S1, S2, S3) through application tracing (applications) to system tracing (OS)]

Slide 8

Slide 8 text

System Tracing - Tracing 101 Think of your program as a bike with paint on tires, going down the street

Slide 9

Slide 9 text

System Tracing - Tracing 101

Slide 10

Slide 10 text

System Tracing - Tracing 101

Slide 11

Slide 11 text

System Tracing - Tracing 101

Slide 12

Slide 12 text

System Tracing - Examples - Static Tracing: Kernel Tracepoints (Perf/Ftrace/eBPF), compile-time instrumentation (GCC/Clang), LTTng, USDT (Java, Python, Ruby) - Dynamic Tracing: Kprobes/Kretprobes (Ftrace/eBPF), Custom (Pin-tools, Dyninst), Uprobes (eBPF), DTrace (BSD/macOS)

Slide 13

Slide 13 text

System Tracing - Code Instrumentation

Slide 14

Slide 14 text

System Tracing - Kprobes - Dynamic Instrumentation in Kernel

Slide 15

Slide 15 text

eBPF Stateful, programmable, in-kernel decisions for networking, tracing and security

Slide 16

Slide 16 text

Berkeley Packet Filter - Classical BPF (cBPF) - Network packet filtering [McCanne et al. 1993], Seccomp - Small, in-kernel VM. Register based, switch dispatch interpreter, few instructions - Extended BPF (eBPF) - More registers, better verifier - Attach on Tracepoint/Kprobe/Uprobe/USDT - In-kernel trace aggregation & filtering - Control via bpf(), trace collection via BPF Maps/trace pipe - Upstream in Linux Kernel (bpf() syscall, kernel v3.18+) - Bytecode compilation upstream in LLVM/Clang
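To make "register based, switch dispatch interpreter" concrete, here is a minimal sketch of such an interpreter. The opcodes, the instruction layout and the names (toy_op, toy_insn, toy_run) are invented for illustration and are not the real BPF encodings; the only properties borrowed from classic BPF are that the result is taken from register 0 and that jumps only go forward, which guarantees termination.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy opcodes; these are NOT real BPF instruction encodings. */
enum toy_op { TOY_LOAD_IMM, TOY_ADD, TOY_JEQ_IMM, TOY_EXIT };

struct toy_insn {
    enum toy_op op;
    uint8_t dst;    /* destination register index */
    uint8_t src;    /* source register index */
    uint16_t off;   /* forward-only jump offset, in instructions */
    int64_t imm;    /* immediate operand */
};

#define TOY_NUM_REGS 4

/* Interprets prog and returns register 0 on TOY_EXIT.
 * Returns -1 for a malformed program or one that runs off the end. */
int64_t toy_run(const struct toy_insn *prog, size_t len)
{
    int64_t reg[TOY_NUM_REGS] = {0};
    size_t pc = 0;

    while (pc < len) {
        const struct toy_insn *in = &prog[pc];

        if (in->dst >= TOY_NUM_REGS || in->src >= TOY_NUM_REGS)
            return -1;

        switch (in->op) {   /* one case per instruction: switch dispatch */
        case TOY_LOAD_IMM:
            reg[in->dst] = in->imm;
            pc++;
            break;
        case TOY_ADD:
            reg[in->dst] += reg[in->src];
            pc++;
            break;
        case TOY_JEQ_IMM:
            /* Skip 'off' instructions when dst equals the immediate. */
            pc += 1 + (reg[in->dst] == in->imm ? in->off : 0);
            break;
        case TOY_EXIT:
            return reg[0];
        default:
            return -1;
        }
    }
    return -1;
}
```

Because every instruction moves the program counter forward, the loop runs at most len iterations; the real in-kernel verifier enforces far stronger guarantees than this sketch does.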

Slide 17

Slide 17 text

Berkeley Packet Filter - eBPF Program

Slide 18

Slide 18 text

Berkeley Packet Filter - eBPF + Kprobes

Slide 19

Slide 19 text

Berkeley Packet Filter - eBPF + Kprobes Example (1/2): code

    SEC("kprobe/tcp_set_state")
    int kprobe__handle_tcp_set_state(struct pt_regs *ctx)
    {
        u32 cpu = bpf_get_smp_processor_id();
        u64 pid_tgid = bpf_get_current_pid_tgid();
        u32 tgid = pid_tgid >> 32;              /* upper 32 bits hold the tgid */
        int state = (int) PT_REGS_PARM2(ctx);   /* second argument: the new TCP state */
        tcp_event_t ev = {
            .timestamp = bpf_ktime_get_ns(),
            .tgid = tgid,
            .state = state,
            ...
        };
        /* Emit the event on this CPU's perf ring buffer. */
        bpf_perf_event_output(ctx, &events, cpu, &ev, sizeof(ev));
        return 0;
    }
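The value returned by bpf_get_current_pid_tgid() packs two IDs into 64 bits: the thread group ID (what user space calls the PID) in the upper half and the kernel task ID (the thread ID) in the lower half. A small user-space sketch of the split follows; the helper names tgid_of and tid_of are illustrative only.

```c
#include <stdint.h>

/* Upper 32 bits: thread group ID, i.e. the PID as seen from user space. */
static inline uint32_t tgid_of(uint64_t pid_tgid)
{
    return (uint32_t)(pid_tgid >> 32);
}

/* Lower 32 bits: kernel task ID, i.e. the thread ID in user space. */
static inline uint32_t tid_of(uint64_t pid_tgid)
{
    return (uint32_t)pid_tgid;
}
```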

Slide 20

Slide 20 text

Berkeley Packet Filter - eBPF + Kprobes Example (2/2): perf map

    /* This is a key/value store with the keys being the cpu number
     * and the values being a perf file descriptor.
     */
    struct bpf_map_def SEC("maps/events") events = {
        .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
        .key_size = sizeof(int),
        .value_size = sizeof(__u32),
        .max_entries = 1024,
        .map_flags = 0,
        .pinning = PIN_GLOBAL_NS,
        .namespace = "traceleft",
    };

    typedef struct {
        uint64_t timestamp;
        int64_t tgid;
        int64_t state;
        ...
    } tcp_event_t;

Slide 21

Slide 21 text

TraceLeft https://github.com/ShiftLeftSecurity/traceleft

Slide 22

Slide 22 text

- What’s TraceLeft? - Framework to build syscall, network & file auditing or monitoring tools - eBPF+Kprobes based, supported on kernel v4.4+ - Also a binary, traceleft, that serves as a reference implementation - Can generate a single binary with a modular trace battery - Everything is compiled based on detailed event configuration and platform information - Why? - Configurable event tracing that Just Works™ ...*coughs* - Programmable tracing, supported on older kernels TraceLeft Overview

Slide 23

Slide 23 text

Architecture

Slide 24

Slide 24 text

Components

Slide 25

Slide 25 text

- Metagenerator - Generates C and Go structures for each event to be received - Goes through /sys/kernel/debug/tracing/events/syscalls/* and generates structures - Generator - Generates the eBPF handler program sources in C - Battery - Battery of compiled eBPF programs (a battery pre-compiled on kernel v4.4 has been tested to work up to kernel v4.16) Components

Slide 26

Slide 26 text

- Probe - Responsible for registering and unregistering eBPF handlers - Tracer - Loads the probe, starts polling the events perf map and calls the callback for each received event - Metrics Aggregator - Experimental event aggregation code that allows processing of raw trace events generated by TraceLeft Components

Slide 27

Slide 27 text

Components - Configuration - A fine-grained per-event configuration that defines each BPF handler’s event structure - What to collect from each probe, along with type information and variable names - Can eventually be simplified to avoid duplication

    "event": [
      {
        "name": "open",
        "args": [{
          "position": 1,
          "type": "char",
          "name": "filename",
          "hashFunc": "string",
          "suffix": "[256]"
        }, {
          "position": 2,
          "type": "s64",
          "name": "flags"
        }, {
          "position": 3,
          "type": "u64",
          "name": "mode"
        }]
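As an illustration of how a per-event configuration like this could drive code generation, the sketch below turns a list of argument descriptions into the text of a C event structure. It is not TraceLeft's actual metagenerator: the names arg_spec and render_event_struct and the _event_t suffix are assumptions, and the JSON parsing step is left out.

```c
#include <stddef.h>
#include <stdio.h>

/* One event argument, mirroring an entry of "args" in the configuration. */
struct arg_spec {
    const char *type;    /* e.g. "char" or "s64" */
    const char *name;    /* e.g. "filename" */
    const char *suffix;  /* e.g. "[256]", or "" when the field has none */
};

/* Writes a C struct definition for the event into out.
 * Returns the number of characters written, or -1 if out is too small. */
int render_event_struct(const char *event, const struct arg_spec *args,
                        size_t nargs, char *out, size_t outlen)
{
    size_t used = 0;
    int n = snprintf(out, outlen, "typedef struct {\n");
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    used += (size_t)n;

    /* One struct member per configured argument, in position order. */
    for (size_t i = 0; i < nargs; i++) {
        n = snprintf(out + used, outlen - used, "\t%s %s%s;\n",
                     args[i].type, args[i].name, args[i].suffix);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, outlen - used, "} %s_event_t;\n", event);
    if (n < 0 || (size_t)n >= outlen - used)
        return -1;
    return (int)(used + (size_t)n);
}
```

Fed the three open() arguments from the slide, this yields a struct with members char filename[256], s64 flags and u64 mode.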

Slide 28

Slide 28 text

Components - Aggregation Spec - Defines how each collected event should be aggregated, filtered and transmitted or stored - Channels: where to store or send events - Function: how to process the input event stream - Rule: filter applied to event aggregation

    "channels": [
      {
        "id": "1",
        "type": "file",
        "path": "/tmp/traceleft.log"
      },
      {
        "id": "2",
        "type": "grpc",
        "path": "localhost:50051"
      }
    ],
    "events": [
      {
        "name": "open",
        "channel": "1",
        "stream": "filesystem",
        "group": "system_metrics",
        "rule": "arg1 == '/tmp/a.txt'",
        "function": {
          "id": "sigma",
          "parameters": "frequency=100;threshold=0"
        },
        "output": {
          "metrics": "alerts_per_sec",
          "format": "collector_spec_pb"
        }
      }
    ]
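The "rule" field above is a filter over event arguments. The slides do not specify the rule grammar, so the following sketch assumes the simplest form that fits the example, argN == 'literal', and evaluates it against an event's string arguments. The function name rule_matches is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Evaluates a rule of the ASSUMED form  argN == 'literal'  against the
 * string arguments of one event (args[0] is arg1).
 * Returns 1 on match, 0 on mismatch, and -1 when the rule cannot be parsed
 * (including an empty literal) or N is out of range. */
int rule_matches(const char *rule, const char *const *args, int nargs)
{
    int idx = 0;
    char literal[256];

    /* %255[^'] reads up to 255 characters that are not a single quote. */
    if (sscanf(rule, "arg%d == '%255[^']'", &idx, literal) != 2)
        return -1;
    if (idx < 1 || idx > nargs)
        return -1;
    return strcmp(args[idx - 1], literal) == 0;
}
```

With the rule from the slide, an open() event whose first argument is /tmp/a.txt passes the filter and any other path is dropped before aggregation.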

Slide 29

Slide 29 text

Build Process

Slide 30

Slide 30 text

Use Cases - traceleft CLI - Simple syscall logging and auditing system

    name open pid 5518 program id 0 return value 8 hash 3355305515321265881 Filename "/etc/passwd" Flags 524288 Mode 438
    name open pid 5518 program id 0 return value 8 hash 3355305515321265881 Filename "/etc/passwd" Flags 524288 Mode 438
    name open pid 5522 program id 0 return value 11 hash 10268694621493151422 Filename "/proc/sys/kernel/ngroups_max" Flags 0 Mode 0
    name open pid 5522 program id 0 return value 11 hash 5259532013223916043 Filename "/etc/group" Flags 524288 Mode 438

Slide 31

Slide 31 text

Use Cases - Syscall Monitoring Agent - Sample implementation of an ncurses-based live syscall monitor using the TraceLeft aggregation API

Slide 32

Slide 32 text

Challenges Matching pids and applications

Slide 33

Slide 33 text

- What’s an application? - One or more processes. Might be short-lived (shell scripts) - Application running as a systemd unit - In a different cgroup - Maybe in different namespaces - Application running in a container - In a different cgroup - In different namespaces Matching pids and applications

Slide 34

Slide 34 text

Matching pids and applications

    BPF helper function                 Kernel version
    bpf_get_current_pid_tgid()          4.2
    bpf_get_cgroup_classid()            4.3 (network)
    bpf_current_task_under_cgroup()     4.9
    bpf_get_current_cgroup_id()         4.18 + cgroup-v2
    bpf_get_current_pidns_info()        Future (4.20+?)

https://github.com/iovisor/bcc/blob/master/docs/kernel-versions.md

Slide 35

Slide 35 text

Using the TraceLeft API - Register handlers by PID - Matching the app and the pid externally - Using Linux’ proc connector

    func (probe *Probe) RegisterHandlerById(programID uint64, pid int, hash string) error

Slide 36

Slide 36 text

- Connector: sub-family of Netlink - Subscribe to proc events - Receive notifications for fork, exec, exit - Since Linux v2.6.15 (January 2006) Proc connector socket(AF_NETLINK, SOCK_RAW, NETLINK_CONNECTOR); sendmsg(sockfd, ...PROC_CN_MCAST_LISTEN...);
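The two calls on this slide can be expanded into a minimal subscription sketch. The function names build_proc_listen_msg and proc_connector_subscribe are illustrative; the headers, structures and constants are the ones exported by the Linux UAPI. The sketch uses SOCK_DGRAM, as many proc connector examples do; netlink does not distinguish between raw and datagram sockets. Actually receiving events requires sufficient privileges, as the limitations that follow explain.

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>

#define PROC_LISTEN_PAYLOAD \
    (sizeof(struct cn_msg) + sizeof(enum proc_cn_mcast_op))

/* Fills buf with a netlink message asking the proc connector to start
 * multicasting events. buf must be aligned for struct nlmsghdr.
 * Returns the message length, or 0 if buf is too small. */
size_t build_proc_listen_msg(void *buf, size_t buflen, unsigned int port_id)
{
    size_t len = NLMSG_LENGTH(PROC_LISTEN_PAYLOAD);
    enum proc_cn_mcast_op op = PROC_CN_MCAST_LISTEN;

    if (buflen < len)
        return 0;
    memset(buf, 0, len);

    struct nlmsghdr *nlh = buf;
    nlh->nlmsg_len = (unsigned int)len;
    nlh->nlmsg_type = NLMSG_DONE;
    nlh->nlmsg_pid = port_id;

    /* The connector header follows the netlink header. */
    struct cn_msg *cn = NLMSG_DATA(nlh);
    cn->id.idx = CN_IDX_PROC;
    cn->id.val = CN_VAL_PROC;
    cn->len = sizeof(op);
    memcpy(cn->data, &op, sizeof(op));
    return len;
}

/* Opens the connector socket, joins the proc multicast group and sends
 * the listen request. Returns the socket, or -1 on error. */
int proc_connector_subscribe(void)
{
    int fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
    if (fd < 0)
        return -1;

    struct sockaddr_nl addr;
    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_groups = CN_IDX_PROC;
    addr.nl_pid = (unsigned int)getpid();

    union { struct nlmsghdr hdr; char bytes[128]; } msg;
    size_t len = build_proc_listen_msg(&msg, sizeof(msg), addr.nl_pid);

    if (len == 0 ||
        bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        send(fd, &msg, len, 0) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

After a successful subscription, each recv() on the socket yields a netlink message whose connector payload is a struct proc_event describing a fork, exec or exit.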

Slide 37

Slide 37 text

- Only works in init userns, pidns, with net privileges - Can’t keep track of namespaces or cgroups - Need to check in /proc, asynchronously - /proc/$PID/{exe,comm,cgroup,ns} - Races - Short-lived processes: can’t read procfs fast enough - Missing early events that happened before the BPF handler was installed Proc connector limitations

Slide 38

Slide 38 text

- Avoid - Procfs - Proc connector - Using new BPF helpers - Add new BPF helpers upstream if needed Solutions

Slide 39

Slide 39 text

Challenges Strings in eBPF

Slide 40

Slide 40 text

Reporting strings - Example with open() syscall

In userspace:
    int open(const char *pathname, int flags);
In kernel:
    len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
In the eBPF kprobe:
    ret = bpf_probe_read(&evt.filename, sizeof(evt.filename), (void *) PT_REGS_PARM2(args));

Slide 41

Slide 41 text

- Time of check to time of use (TOCTOU) - Buffer copied twice from userspace - Multithreaded programs could alter the buffer in the middle - Same issue as seccomp Problems with strings

Slide 42

Slide 42 text

Problems with strings - Cannot find the size of the string - bpf_probe_read_str() is only available since Linux 4.11 - TraceLeft copies 256 bytes - Might be too little - Danger of reading too much - A page border might cause EFAULT - open() uses NUL-terminated strings [Figure: virtual memory of a process, with a 256-byte read starting near the end of an mmap’ed region] fd = open(ptr, flags);
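The page-border problem comes down to simple address arithmetic: a fixed 256-byte read that starts close to the end of a mapped page runs into the next page, which may not be mapped. The sketch below computes how many bytes remain in the current page and clamps a read accordingly. It illustrates the arithmetic only and is not what TraceLeft does (the slides state that it copies a fixed 256 bytes); the function names are hypothetical, and clamping trades a possible fault for a possibly truncated string.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes from addr to the end of the page that contains it.
 * page_size must be a power of two, e.g. 4096. */
size_t bytes_to_page_end(uintptr_t addr, size_t page_size)
{
    return page_size - (size_t)(addr & (page_size - 1));
}

/* Largest read length, capped at want, that stays inside addr's page. */
size_t clamp_read_len(uintptr_t addr, size_t want, size_t page_size)
{
    size_t room = bytes_to_page_end(addr, page_size);
    return want < room ? want : room;
}
```

For a string starting 16 bytes before a page boundary, a 256-byte request is clamped to 16 bytes, whereas a string at the start of a page can be read in full.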

Slide 43

Slide 43 text

Challenges Identifying files

Slide 44

Slide 44 text

File descriptors - Keeping track of file descriptors per process

    fd = open("/data/foo.txt", O_RDWR);
    fd2 = dup(fd);
    ret = write(fd2, buf, sz);
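To attribute the write() above to /data/foo.txt, a tracer has to remember what every (pid, fd) pair refers to and update that state on open, dup and close events. Here is a minimal user-space sketch of such bookkeeping with a fixed-size table. All names are hypothetical, and it deliberately ignores the harder cases such as fork, exec and descriptors passed between processes.

```c
#include <stdio.h>
#include <string.h>

#define FD_TABLE_MAX 64
#define FD_PATH_MAX  256

struct fd_entry { int used; int pid; int fd; char path[FD_PATH_MAX]; };
struct fd_table { struct fd_entry e[FD_TABLE_MAX]; };  /* zero-initialize */

static struct fd_entry *fd_find(struct fd_table *t, int pid, int fd)
{
    for (int i = 0; i < FD_TABLE_MAX; i++)
        if (t->e[i].used && t->e[i].pid == pid && t->e[i].fd == fd)
            return &t->e[i];
    return NULL;
}

/* Records that (pid, fd) now refers to path. Returns -1 if the table is full. */
int fd_on_open(struct fd_table *t, int pid, int fd, const char *path)
{
    struct fd_entry *slot = fd_find(t, pid, fd);  /* overwrite a stale entry */
    for (int i = 0; !slot && i < FD_TABLE_MAX; i++)
        if (!t->e[i].used)
            slot = &t->e[i];
    if (!slot)
        return -1;
    slot->used = 1;
    slot->pid = pid;
    slot->fd = fd;
    snprintf(slot->path, sizeof(slot->path), "%s", path);
    return 0;
}

/* Makes newfd an alias of oldfd. Returns -1 if oldfd is unknown. */
int fd_on_dup(struct fd_table *t, int pid, int oldfd, int newfd)
{
    struct fd_entry *src = fd_find(t, pid, oldfd);
    char path[FD_PATH_MAX];

    if (!src)
        return -1;
    memcpy(path, src->path, sizeof(path));  /* copy first: slots may alias */
    return fd_on_open(t, pid, newfd, path);
}

/* Forgets (pid, fd); other aliases of the same file stay valid. */
void fd_on_close(struct fd_table *t, int pid, int fd)
{
    struct fd_entry *e = fd_find(t, pid, fd);
    if (e)
        e->used = 0;
}

/* Returns the recorded path for (pid, fd), or NULL if unknown. */
const char *fd_lookup(struct fd_table *t, int pid, int fd)
{
    struct fd_entry *e = fd_find(t, pid, fd);
    return e ? e->path : NULL;
}
```

Replaying the three calls from the slide as events (open returns 3, dup returns 4) lets a later write on descriptor 4 resolve to /data/foo.txt even after descriptor 3 has been closed.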

Slide 45

Slide 45 text

- open(), openat()... - SCM_RIGHTS - dup(), dup2(), dup3() How processes receive a file descriptor

Slide 46

Slide 46 text

- All the string problems from before - Path lookups depend on: - mntns - root, cwd, or dirfd with openat() - at every component, a possible symlink - Cannot be evaluated atomically from eBPF Path lookups fd = open("/data/foo.txt", O_RDWR);

Slide 47

Slide 47 text

- Landlock-LSM? - eBPF programs acting on kernel objects instead of strings - More programmable actions (resource control) Solutions?

Slide 48

Slide 48 text

Challenges Networking

Slide 49

Slide 49 text

- Destination IP visible at the syscall level - But not the full connection tuple - We add kprobes on inet_csk_accept(), tcp_set_state(), tcp_close(), tcp_v4_connect() Correlating IPs with services ret = connect(sockfd, { IP: 192.168.0.40 } );
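The "full connection tuple" is the pair of addresses and ports that a TCP-level kprobe can read from the socket, in contrast to the single destination address visible in connect(). The sketch below shows one way user space might represent and print such a tuple for IPv4. The conn_tuple layout and the function name are assumptions, not TraceLeft's event format; the mixed byte order mirrors the kernel's socket structure, which keeps the local port in host order and the remote port in network order.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical IPv4 connection tuple as a TCP kprobe handler might report it. */
struct conn_tuple {
    uint32_t saddr;  /* local address, network byte order */
    uint32_t daddr;  /* remote address, network byte order */
    uint16_t sport;  /* local port, host byte order */
    uint16_t dport;  /* remote port, network byte order */
};

/* Formats the tuple as "src:port -> dst:port".
 * Returns the string length, or -1 on error or truncation. */
int format_conn_tuple(const struct conn_tuple *t, char *out, size_t outlen)
{
    char src[INET_ADDRSTRLEN];
    char dst[INET_ADDRSTRLEN];

    if (!inet_ntop(AF_INET, &t->saddr, src, sizeof(src)) ||
        !inet_ntop(AF_INET, &t->daddr, dst, sizeof(dst)))
        return -1;

    int n = snprintf(out, outlen, "%s:%u -> %s:%u",
                     src, (unsigned int)t->sport,
                     dst, (unsigned int)ntohs(t->dport));
    return (n < 0 || (size_t)n >= outlen) ? -1 : n;
}
```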

Slide 50

Slide 50 text

Challenges Lost events: perf ring buffer and kretprobes

Slide 51

Slide 51 text

- Events sent asynchronously - BPF programs cannot sleep or wait - Ring buffer has limited size - Default in traceleft: 8 pages (32KiB) per cpu - bpf_perf_event_output() just overwrites previous entries - Counter of lost events Losing events in the perf ring buffer

Slide 52

Slide 52 text

- How kprobes work - Place break exception (or jump) on function entry - How kretprobes work - Place break exception on function entry - Save the return address of the function and replace it with a trampoline - The trampoline does its job and then returns to the original address Missing kretprobes

Slide 53

Slide 53 text

- Multiple CPUs, preemptible kernels - There could be several function calls in parallel - Need to save several return addresses - Example: a synchronous accept() syscall - maxactive - Default value: rp->maxactive = max_t(unsigned int, 10, 2*num_possible_cpus()); - Since Linux 4.12 (commit 696ced4fb1d7), configurable - In TraceLeft, we chose maxactive=16 Missing kretprobes
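As a worked example of the default shown on this slide, the sketch below evaluates the same expression in user space. The function name default_maxactive is illustrative, and the formula is simply the one quoted on the slide.

```c
/* Default number of concurrently tracked kretprobe instances, following
 * the expression on the slide: max(10, 2 * number of possible CPUs). */
unsigned int default_maxactive(unsigned int possible_cpus)
{
    unsigned int scaled = 2 * possible_cpus;
    return scaled > 10 ? scaled : 10;
}
```

On a 4-CPU machine the default is therefore 10, so an eleventh call that is still blocked in the probed function (for instance many threads waiting in accept()) has no free slot and its return is missed.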

Slide 54

Slide 54 text

Future work

Slide 55

Slide 55 text

- Use tracepoints - Benefit from more stable API - Use new BPF helper functions - bpf_get_current_cgroup_id - bpf_probe_read_str - Use LLVM API directly - Avoid using clang, generation of sources etc. Future Work

Slide 56

Slide 56 text

References

Slide 57

Slide 57 text

- IOVisor/BPF - BCC (https://github.com/iovisor/bcc) - bpfd (https://github.com/genuinetools/bpfd) - BPFd (https://github.com/joelagnel/bpfd) [Deprecated] - BpfTrace (https://github.com/ajor/bpftrace) - Ply (https://github.com/iovisor/ply) - Landlock LSM (https://landlock.io/) - Auditd - Architecture (https://goo.gl/zXdfsJ) Related Work

Slide 58

Slide 58 text

- BPF Docs/Tutorials - https://github.com/zoidbergwill/awesome-ebpf (William Martin Stewart) - http://docs.cilium.io/en/latest/bpf/ (Cilium) - http://www.brendangregg.com/ebpf.html (Brendan Gregg) - https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/ (Quentin Monnet) - https://blog.yadutaf.fr/2017/07/28/tracing-a-packet-journey-using-linux-tracepoints-perf-ebpf/ (Jean-Tiare Le Bigot) - https://kinvolk.io/blog/2017/09/an-update-on-gobpf---elf-loading-uprobes-more-program-types/ (Kinvolk) Documentation and Links

Slide 59

Slide 59 text

- [McCanne et al. 1993] The BSD Packet Filter: A New Architecture for User-level Packet Capture, Winter USENIX Conference (1993) San Diego - [Tu et al 2017] Joe Stringer, and Justin Pettit. 2017. Building an Extensible Open vSwitch Datapath. SIGOPS Operating Systems Review - [Borkmann 2016-1] Advanced programmability and recent updates with tc’s cls_bpf, NetDev 1.2 (2016) Tokyo Research Papers

Slide 60

Slide 60 text

- [Borkmann 2016-2] On getting tc classifier fully programmable with cls bpf, NetDev 1.1 (2016), Seville - [Clément 2016] Linux Kernel packet transmission performance in high-speed networks, Masters Thesis (2016), KTH, Stockholm - [Sharma et al. 2016] Enhanced Userspace and In-Kernel Trace Filtering for Production Systems, J. Comput. Sci. Technol. (2016), Springer US Research Papers