Slide 1

Slide 1 text

The BSD Packet Filter: A New Architecture for User-level Packet Capture
Steven McCanne and Van Jacobson (1993 Winter USENIX, San Diego, CA)
Presented by: Suchakrapani Sharma
28th June 2017, Papers We Love - Montreal

Slide 2

Slide 2 text

Back in the olden days.. Suchakrapani Datt Sharma

Slide 3

Slide 3 text


Slide 4

Slide 4 text

Problem Scope
Network Packet Tap
- A traditional network “tap” required copying packets held in kernel buffers across the kernel-userspace boundary
- E.g. SunOS’s STREAMS NIT [10]
Network Packet Filtering
- Raw packets were accessed and filtered upstream
- Filters were represented as predicate trees and evaluated in that form
- E.g. CMU/Stanford Packet Filter (CSPF) in Unix [8]
- Tree evaluation required
  - Stack simulation*
  - Redundant operations*
*Elaborated later

Slide 5

Slide 5 text

Network Tap
In-Kernel Filters
- Filters are described in userspace but evaluated early
- If a packet “passes”, copy its buffer and pass it upstream
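A rough sketch of the filter-before-copy idea, in plain Python rather than kernel code. The names `tap` and `is_ipv4` are hypothetical and only illustrate the ordering: the predicate runs first, and a packet is copied "upstream" only if it passes.

```python
from typing import Callable, List

Filter = Callable[[bytes], bool]

def tap(packets: List[bytes], accept: Filter) -> List[bytes]:
    """Simulate an in-kernel tap: run the filter first, copy only on a pass."""
    delivered = []
    for pkt in packets:
        if accept(pkt):                    # evaluated early, before any copy
            delivered.append(bytes(pkt))   # stands in for the copy to userspace
    return delivered

def is_ipv4(pkt: bytes) -> bool:
    # The EtherType sits at bytes 12..13 of an Ethernet frame; 0x0800 is IPv4.
    return len(pkt) >= 14 and int.from_bytes(pkt[12:14], "big") == 0x0800

frames = [
    bytes(12) + b"\x08\x00" + b"ip payload",
    bytes(12) + b"\x08\x06" + b"arp payload",
]
copied = tap(frames, is_ipv4)
```

Rejected packets never reach the copy step, which is where the saving comes from.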

Slide 6

Slide 6 text

Network Tap
BPF vs NIT
- Measure bpf_tap() vs snit_intr() + mbuf copy
- 5.7us (BPF) vs 89.2us (NIT) per packet (NIT has roughly 15x the overhead)

Slide 7

Slide 7 text

Network Packet Filtering
Filter Model
- Boolean expression tree vs directed acyclic control flow graph (CFG)

Slide 8

Slide 8 text

Network Packet Filtering
Boolean Expression Tree (CSPF)
- Easier to model with a stack-based machine
- Must implement loads and stores to memory and simulate a stack
- Redundant parses of the tree are needed
(Figure: 7 comparison predicates, 6 boolean operators)

Slide 9

Slide 9 text

Network Packet Filtering
CFG (NNStat and BPF)
- Nodes are comparison predicates, with two final targets (TRUE/FALSE); easier to model on registers
- No redundant paths, but requires reordering of graph nodes
(Figure: max 5 comparisons)
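To make the comparison counts concrete, here is a small Python sketch. It assumes the filter in the figures is a "host foo" style filter over IP, ARP and RARP; that filter, the field names and the address `FOO` are assumptions for illustration, not taken from the slides. The tree version evaluates every leaf before combining, while the CFG version exits as soon as the outcome is known.

```python
IP, ARP, RARP = 0x0800, 0x0806, 0x8035
FOO = "10.0.0.1"  # hypothetical host address

def tree_match(pkt, counter):
    """Tree style: evaluate all 7 leaves, then apply 6 boolean operators."""
    def eq(field, value):
        counter[0] += 1
        return pkt.get(field) == value
    is_ip, ip_src, ip_dst = eq("type", IP), eq("ip_src", FOO), eq("ip_dst", FOO)
    is_arp, is_rarp = eq("type", ARP), eq("type", RARP)
    arp_src, arp_dst = eq("arp_src", FOO), eq("arp_dst", FOO)
    return (is_ip and (ip_src or ip_dst)) or (
        (is_arp or is_rarp) and (arp_src or arp_dst))

def cfg_match(pkt, counter):
    """CFG style: each comparison branches, so no path repeats work."""
    def eq(field, value):
        counter[0] += 1
        return pkt.get(field) == value
    if eq("type", IP):
        return eq("ip_src", FOO) or eq("ip_dst", FOO)
    if eq("type", ARP) or eq("type", RARP):
        return eq("arp_src", FOO) or eq("arp_dst", FOO)
    return False

# Worst case for the CFG: a RARP packet that matches only on the last test.
rarp_pkt = {"type": RARP, "arp_src": "10.0.0.9", "arp_dst": FOO}
tree_count, cfg_count = [0], [0]
tree_result = tree_match(rarp_pkt, tree_count)
cfg_result = cfg_match(rarp_pkt, cfg_count)
```

Under these assumptions the tree always performs 7 comparisons, and the longest CFG path performs 5.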

Slide 10

Slide 10 text

Network Packet Filtering
BPF Virtual Machine
- Not tied to any protocol; packets are byte arrays
- A generic machine, easily programmable
- Variable-length packet support*
- Simple switch-case dispatch mechanism
- Simple instruction set; A, X and scratch memory registers
Instruction Format
    { OP, JT, JF, K }
Sample Instructions
    l0: ldh [12]
    l1: jeq #0x800, l3, l2
    l2: jeq #0x805, l3, l8
    l3: ...
    l7: ret #0xffff
    l8: ret #0
Instr Representation
    { 0x28, 0, 0, 0x0000000c }, /* 0x28 is the opcode for ldh: load the halfword at offset 12 into A */
    { 0x15, 1, 0, 0x00000800 }, /* if A == 0x800, skip one instruction (JT = 1); else fall through */
    { 0x15, 0, 5, 0x00000805 }, /* if A == 0x805, fall through (JT = 0); else jump 5 ahead to FALSE (JF = 5) */
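The { OP, JT, JF, K } format can be exercised with a toy interpreter. The following is a minimal sketch in Python, not the kernel's implementation: it supports only three classic BPF opcodes (0x28 ldh, 0x15 jeq, 0x06 ret), and the `run_filter` name and the sample program are made up for illustration.

```python
import struct

LDH_ABS, JEQ_K, RET_K = 0x28, 0x15, 0x06  # classic BPF opcode values

def run_filter(program, packet):
    """Interpret a list of (op, jt, jf, k) tuples against a packet."""
    A = 0    # accumulator
    pc = 0   # index of the next instruction
    while pc < len(program):
        op, jt, jf, k = program[pc]
        pc += 1
        if op == LDH_ABS:
            if k + 2 > len(packet):
                return 0                       # out-of-bounds load rejects
            A = struct.unpack_from(">H", packet, k)[0]
        elif op == JEQ_K:
            pc += jt if A == k else jf         # offsets relative to next instr
        elif op == RET_K:
            return k                           # number of bytes to accept
        else:
            raise ValueError(f"unsupported opcode {op:#x}")
    return 0

# Accept IPv4 frames only: ldh [12]; jeq #0x800, accept, reject.
accept_ip = [
    (LDH_ABS, 0, 0, 12),
    (JEQ_K, 0, 1, 0x0800),
    (RET_K, 0, 0, 0xFFFF),
    (RET_K, 0, 0, 0),
]
ip_frame = bytes(12) + b"\x08\x00" + bytes(20)
arp_frame = bytes(12) + b"\x08\x06" + bytes(28)
```

Jump offsets count from the instruction after the jump, which is why JT = 0 means "fall through" in the encoded form.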

Slide 11

Slide 11 text

Network Packet Filtering
BPF Virtual Machine
Variable Length Packets Example (TCP)
- Find the IP header length (IHL), using a special addressing mode
- Then load the halfword 16 bytes past that (TCP destination port)
        ldx 4*([14]&0xf)
        ldh [x+16]
        jeq #N, L1, L2
    L1: ret #TRUE
    L2: ret #0
(Figure: packet layout, with offset 14 marked after the Type field)
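The same arithmetic can be checked outside the VM. This Python sketch mirrors the two loads, assuming an Ethernet frame with a 14-byte header followed directly by IPv4 and TCP; `make_frame` is a hypothetical helper that builds test frames with a chosen IHL.

```python
import struct

ETH_HLEN = 14  # Ethernet header length in bytes

def tcp_dst_port(frame: bytes) -> int:
    x = 4 * (frame[ETH_HLEN] & 0x0F)                   # ldx 4*([14]&0xf)
    # 16 = 14 (Ethernet header) + 2 (offset of dst port in the TCP header)
    return struct.unpack_from(">H", frame, x + 16)[0]  # ldh [x+16]

def make_frame(ihl_words: int, dst_port: int) -> bytes:
    """Build a zero-filled frame with the given IHL and TCP destination port."""
    eth = bytes(12) + b"\x08\x00"
    ip = bytes([0x40 | ihl_words]) + bytes(4 * ihl_words - 1)
    tcp = struct.pack(">HH", 40000, dst_port) + bytes(16)
    return eth + ip + tcp

port_plain = tcp_dst_port(make_frame(5, 80))      # 20-byte IP header
port_options = tcp_dst_port(make_frame(6, 443))   # 24-byte IP header
```

Because the offset is computed from the packet itself, the same two instructions work whether or not IP options are present.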

Slide 12

Slide 12 text

Network Packet Filtering
Sample BPF Interpreter (Linux Kernel v3.14)
    u32 A = 0;              /* Accumulator */
    u32 X = 0;              /* Index Register */
    u32 mem[BPF_MEMWORDS];  /* Scratch Memory Store */
    u32 tmp;
    int k;

    /*
     * Process array of filter instructions.
     */
    for (;; fentry++) {
    #if defined(CONFIG_X86_32)
    #define K (fentry->k)
    #else
        const u32 K = fentry->k;
    #endif

        switch (fentry->code) {
        case BPF_S_ALU_ADD_X:
            A += X;
            continue;
        case BPF_S_ALU_ADD_K:
            A += K;
            continue;
        ..

Slide 13

Slide 13 text

Network Packet Filtering
BPF vs CSPF

Slide 14

Slide 14 text

▶▶ Fast forward to present

Slide 15

Slide 15 text

BPF in Linux Kernel
Classical BPF (cBPF)
- Network packet filtering, eventually seccomp
- Filter Expressions → Bytecode → Interpret*
- Small, in-kernel VM: register based, switch-dispatch interpreter, few instructions
Extended BPF (eBPF) [Alexei Starovoitov, Borkmann et al.]
- More registers, JIT compiler (flexible/faster), verifier
- Attach on Tracepoint/Kprobe/Uprobe/USDT
- In-kernel trace aggregation & filtering
- Control via bpf(), trace collection via BPF Maps
- Upstream in Linux Kernel (bpf() syscall, v3.18+)
- Bytecode compilation upstream in LLVM/Clang
*JIT support eventually landed in kernel

Slide 16

Slide 16 text

BPF in Linux Kernel
Modern eBPF Programs
(Diagram labels: prog.bpf, LLVM/Clang, BPF Bytecode, bpf(), eBPF, Native Code)

Slide 17

Slide 17 text

eBPF for Networking
Traffic Control/XDP
- TC with cls_bpf [Borkmann, 2016], act_bpf and XDP
(Diagram labels: tc.bpf, LLVM/Clang, bpf(), BPF Code, TC Ingress, TC Egress, eth0)
Adapted from Thomas Graf’s presentation “Cilium - BPF & XDP for containers”

Slide 18

Slide 18 text

eBPF for Security
(Diagram labels: policy.bpf, LLVM/Clang, bpf(), BPF Code, LSM Hooks, Syscalls, EACCES)

Slide 19

Slide 19 text

eBPF for Tracing
(Diagram labels: trace.bpf, LLVM/Clang, bpf(), BPF Code, Kprobes/Kretprobes, Kernel Function, Perf Buffer)

Slide 20

Slide 20 text

eBPF Features & Support
Major BPF Milestones by Kernel Version*
- 3.18 : bpf() syscall
- 3.19 : Sockets support, BPF Maps
- 4.1 : Kprobe support
- 4.4 : Perf events
- 4.6 : Stack traces, per-CPU Maps
- 4.7 : Attach on Tracepoints
- 4.8 : XDP core and act
- 4.9 : Profiling, attach to Perf events
- 4.10 : cgroups support (socket filters)
- 4.11 : Tracerception – tracepoints for eBPF debugging
*Adapted from “BPF: Tracing and More” by Brendan Gregg (Linux.Conf.au 2017)

Slide 21

Slide 21 text

eBPF Features & Support
Program Types
- BPF_PROG_TYPE_UNSPEC
- BPF_PROG_TYPE_SOCKET_FILTER
- BPF_PROG_TYPE_KPROBE
- BPF_PROG_TYPE_SCHED_CLS
- BPF_PROG_TYPE_SCHED_ACT
- BPF_PROG_TYPE_TRACEPOINT
- BPF_PROG_TYPE_XDP
- BPF_PROG_TYPE_PERF_EVENT
- BPF_PROG_TYPE_CGROUP_SKB
- BPF_PROG_TYPE_CGROUP_SOCK
- BPF_PROG_TYPE_LWT_IN
- BPF_PROG_TYPE_LWT_OUT
- BPF_PROG_TYPE_LWT_XMIT
- BPF_PROG_TYPE_LANDLOCK
(Legend: Tracing, Security, Cgroups)
http://lxr.free-electrons.com/source/include/uapi/linux/bpf.h

Slide 22

Slide 22 text

eBPF Features & Support
Map Types
- BPF_MAP_TYPE_UNSPEC
- BPF_MAP_TYPE_HASH
- BPF_MAP_TYPE_ARRAY
- BPF_MAP_TYPE_PROG_ARRAY
- BPF_MAP_TYPE_PERF_EVENT_ARRAY
- BPF_MAP_TYPE_PERCPU_HASH
- BPF_MAP_TYPE_PERCPU_ARRAY
- BPF_MAP_TYPE_STACK_TRACE
- BPF_MAP_TYPE_CGROUP_ARRAY
- BPF_MAP_TYPE_LRU_HASH
- BPF_MAP_TYPE_LRU_PERCPU_HASH
http://lxr.free-electrons.com/source/include/uapi/linux/bpf.h

Slide 23

Slide 23 text

eBPF for Tracing
Frontends
- IOVisor BCC: Python, C++, Lua, Go (gobpf) APIs
- Compile BPF programs directly via LLVM interface
- Helper functions to manage maps, buffers, probes
Kprobes Example (complete program, trace_fields.py)
    from bcc import BPF

    # prog is compiled to BPF bytecode
    prog = """
    int hello(void *ctx) {
        bpf_trace_printk("Hello, World!\\n");
        return 0;
    }
    """
    b = BPF(text=prog)
    # Attach to the kprobe event
    b.attach_kprobe(event="sys_clone", fn_name="hello")
    print("PID MESSAGE")
    # Print the trace pipe
    b.trace_print(fmt="{1} {5}")

Slide 24

Slide 24 text

eBPF for Tracing
Tracepoint Example (v4.7+)
Program Excerpt
    prog = """
    #define EXIT_REASON 18

    // Attach to tracepoint
    TRACEPOINT_PROBE(kvm, kvm_exit) {
        // Filter on args
        if (args->exit_reason == EXIT_REASON) {
            bpf_trace_printk("KVM_EXIT exit_reason : %d\\n", args->exit_reason);
        }
        return 0;
    }

    TRACEPOINT_PROBE(kvm, kvm_entry) {
        // Compare with ==, not =, so vcpu_id is tested rather than assigned
        if (args->vcpu_id == 0) {
            bpf_trace_printk("KVM_ENTRY vcpu_id : %u\\n", args->vcpu_id);
        }
        return 0;
    }
    """
Output
    # ./kvm-test.py
    2445.577129000 CPU 0/KVM 8896 KVM_ENTRY vcpu_id : 0
    2445.577136000 CPU 0/KVM 8896 KVM_EXIT exit_reason : 18

Slide 25

Slide 25 text

eBPF for Tracing
Uprobes Example
Program Excerpt
    bpf_text = """
    #include
    #include
    int get_fname(struct pt_regs *ctx) {
        // Get 2nd argument
        if (!ctx->si)
            return 0;
        char str[NAME_MAX] = {};
        bpf_probe_read(&str, sizeof(str), (void *)ctx->si);
        bpf_trace_printk("%s\\n", &str);
        return 0;
    };
    """
    b = BPF(text=bpf_text)
    # name is the process binary, sym is the symbol to probe
    b.attach_uprobe(name="/usr/bin/vim", sym="readfile", fn_name="get_fname")
Output
    # ./vim-test.py
    TASK PID FILENAME
    vim 23707 /tmp/wololo

Slide 26

Slide 26 text

eBPF for Tracing
USDT Example
Program Excerpt (nodejs_http_server.py)
    from bcc import BPF, USDT
    .
    .
    bpf_text = """
    #include
    int do_trace(struct pt_regs *ctx) {
        uint64_t addr;
        char path[128]={0};
        // Get 6th argument
        bpf_usdt_readarg(6, ctx, &addr);
        // Read to local variable
        bpf_probe_read(&path, sizeof(path), (void *)addr);
        bpf_trace_printk("path:%s\\n", path);
        return 0;
    };
    """
    # Target PID
    u = USDT(pid=int(pid))
    # Probe in Node
    u.enable_probe(probe="http__server__request", fn_name="do_trace")
    b = BPF(text=bpf_text, usdt_contexts=[u])

Slide 27

Slide 27 text

eBPF for Tracing
USDT Example
Supported Frameworks
- MySQL : --enable-dtrace (Build)
- JVM : -XX:+ExtendedDTraceProbes (Runtime)
- Node : --with-dtrace (Build)
- Python : --with-dtrace (Build)
- Ruby : --enable-dtrace (Build)
Output
    # ./nodejs_http_server.py 24728
    TIME(s)            COMM PID   ARGS
    24653324.561322998 node 24728 path:/index.html
    24653335.343401998 node 24728 path:/images/welcome.png
    24653340.510164998 node 24728 path:/images/favicon.png

Slide 28

Slide 28 text

eBPF for Tracing
BPF Maps – Filters, States, Counters
Program Excerpt (tcpv4connect.py)
    bpf_text = """
    #include
    #include
    #include
    // Hash map: key type u32, value type struct sock *
    BPF_HASH(currsock, u32, struct sock *);
    int kprobe__tcp_v4_connect(struct pt_regs *ctx, struct sock *sk) {
        u32 pid = bpf_get_current_pid_tgid();
        // stash the sock ptr for lookup on return
        currsock.update(&pid, &sk);   // Update hash map
        return 0;
    };
    .
    .
    .

Slide 29

Slide 29 text

eBPF for Tracing
BPF Maps – Filters, States, Counters
Program Excerpt (tcpv4connect.py)
    int kretprobe__tcp_v4_connect(struct pt_regs *ctx) {
        int ret = PT_REGS_RC(ctx);              // Return value (ax reg)
        u32 pid = bpf_get_current_pid_tgid();   // Get key
        struct sock **skpp;
        skpp = currsock.lookup(&pid);           // Lookup
        if (skpp == 0) {
            return 0; // missed entry
        }
        if (ret != 0) {
            // failed to send SYN packet, may not have populated
            currsock.delete(&pid);              // Delete
            return 0;
        }
        // Read stuff from sock ptr
        struct sock *skp = *skpp;
        u32 saddr = 0, daddr = 0;
        u16 dport = 0;
        bpf_probe_read(&saddr, sizeof(saddr), &skp->__sk_common.skc_rcv_saddr);
        bpf_probe_read(&daddr, sizeof(daddr), &skp->__sk_common.skc_daddr);
        bpf_probe_read(&dport, sizeof(dport), &skp->__sk_common.skc_dport);
        bpf_trace_printk("trace_tcp4connect %x %x %d\\n", saddr, daddr, ntohs(dport));
        currsock.delete(&pid);                  // Delete
        return 0;
    }
    """
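The entry/return pairing in this excerpt can be modelled without a kernel. The sketch below uses a plain Python dict in place of the BPF hash map; the handler names and the dictionary standing in for `struct sock` are hypothetical and only show the stash, lookup and delete sequence.

```python
currsock = {}   # stands in for BPF_HASH(currsock, u32, struct sock *)
events = []     # stands in for the trace output

def on_connect_entry(pid, sock):
    # kprobe side: stash the socket, keyed by pid, for the return probe
    currsock[pid] = sock

def on_connect_return(pid, ret):
    # kretprobe side: look up and remove the stashed entry in one step
    sock = currsock.pop(pid, None)
    if sock is None:
        return          # missed entry
    if ret != 0:
        return          # connect failed; the entry is already deleted
    events.append((pid, sock["saddr"], sock["daddr"], sock["dport"]))

on_connect_entry(1469, {"saddr": "10.201.219.236",
                        "daddr": "54.245.105.25", "dport": 80})
on_connect_return(1469, 0)      # success: one event recorded
on_connect_entry(1479, {"saddr": "127.0.0.1",
                        "daddr": "127.0.0.1", "dport": 23})
on_connect_return(1479, -111)   # failure: no event, entry removed
on_connect_return(9999, 0)      # return without a matching entry: ignored
```

Deleting on every exit path keeps the map from accumulating stale entries, which is why the excerpt calls delete in both the failure and success branches.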

Slide 30

Slide 30 text

eBPF for Tracing
BPF Maps – Filters, States, Counters
More Uses
- Record latency (Δt)
  - biosnoop.py
- Flags for keeping track of events
  - kvm_hypercall.py
- Counting events, histograms
  - cachestat.py
  - cpudist.py
Output
    # ./tcpv4connect.py
    PID  COMM   SADDR          DADDR         DPORT
    1479 telnet 127.0.0.1      127.0.0.1     23
    1469 curl   10.201.219.236 54.245.105.25 80
    1469 curl   10.201.219.236 54.67.101.145 80

Slide 31

Slide 31 text

eBPF for Tracing
BPF Perf Event Output
- Build perf events and save to per-cpu perf buffers
Program Excerpt
    prog = """
    #include
    #include
    #include
    // Event struct
    struct data_t {
        u32 pid;
        u64 ts;
        char comm[TASK_COMM_LEN];
        char fname[NAME_MAX];
    };
    BPF_PERF_OUTPUT(events);
    int handler(struct pt_regs *ctx) {
        // Init event
        struct data_t data = {};
        // Build event
        data.pid = bpf_get_current_pid_tgid();
        data.ts = bpf_ktime_get_ns();
        bpf_get_current_comm(&data.comm, sizeof(data.comm));
        bpf_probe_read(&data.fname, sizeof(data.fname), (void *)PT_REGS_PARM1(ctx));
        // Send to buffer
        events.perf_submit(ctx, &data, sizeof(data));
        return 0;
    }
    """
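On the userspace side, each event submitted this way arrives as raw bytes laid out like `struct data_t`. A minimal sketch of decoding such a record with Python's `ctypes` follows. The `Data` class and `decode_event` are hypothetical names, the constants 16 and 255 are the usual Linux values for TASK_COMM_LEN and NAME_MAX and are assumed here, and the sample record is fabricated locally rather than read from a perf buffer.

```python
import ctypes

TASK_COMM_LEN = 16   # assumed: usual Linux value
NAME_MAX = 255       # assumed: usual Linux value

class Data(ctypes.Structure):
    """Mirror of struct data_t; ctypes applies native C alignment."""
    _fields_ = [
        ("pid", ctypes.c_uint32),
        ("ts", ctypes.c_uint64),
        ("comm", ctypes.c_char * TASK_COMM_LEN),
        ("fname", ctypes.c_char * NAME_MAX),
    ]

def decode_event(raw: bytes) -> dict:
    """Turn one raw event record into a dictionary of Python values."""
    event = Data.from_buffer_copy(raw)
    return {
        "pid": event.pid,
        "ts": event.ts,
        "comm": event.comm.decode(),     # char arrays stop at the first NUL
        "fname": event.fname.decode(),
    }

# Fabricated record, standing in for bytes delivered by the perf buffer.
sample = Data(pid=23707, ts=123456789, comm=b"vim", fname=b"/tmp/wololo")
decoded = decode_event(bytes(sample))
```

The field order and types must match the C struct exactly, since the record carries no type information of its own.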

Slide 32

Slide 32 text

eBPF Trace Visualization
Current State
- Using ASCII histograms, ASCII escape codes
- eBPF trace driven Flamegraphs
Output (argdist.py)
    # ./argdist -H 'p:c:write(int fd, void *buf, size_t len):size_t:len:fd==1'
    [01:47:19]
    p:c:write(int fd, void *buf, size_t len):size_t:len:fd==1
         len       : count   distribution
          0 -> 1   : 0      |                                        |
          2 -> 3   : 0      |                                        |
          4 -> 7   : 0      |                                        |
          8 -> 15  : 3      |*********                               |
         16 -> 31  : 0      |                                        |
         32 -> 63  : 5      |***************                         |
         64 -> 127 : 13     |****************************************|
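A power-of-two histogram like the one in this output is simple to reproduce. The sketch below is an illustration in plain Python, not the code argdist uses; `log2_histogram` and `render` are hypothetical names, and the sample values are chosen only so that the bucket counts match the slide.

```python
def log2_histogram(values):
    """Count values in power-of-two buckets: 0-1, 2-3, 4-7, 8-15, ..."""
    buckets = {}
    for v in values:
        idx = max(v, 1).bit_length() - 1   # 0,1 -> 0; 2,3 -> 1; 4..7 -> 2
        buckets[idx] = buckets.get(idx, 0) + 1
    return buckets

def render(buckets, width=40):
    """Draw one row per bucket, scaling bars so the fullest is `width` wide."""
    top = max(buckets.values())
    lines = []
    for idx in range(max(buckets) + 1):
        low = 0 if idx == 0 else 1 << idx
        high = (1 << (idx + 1)) - 1
        count = buckets.get(idx, 0)
        bar = "*" * (count * width // top)
        lines.append(f"{low:>6} -> {high:<6} : {count:<5} |{bar:<{width}}|")
    return lines

# Three writes of 10 bytes, five of 40 bytes, thirteen of 100 bytes.
lengths = [10] * 3 + [40] * 5 + [100] * 13
hist = log2_histogram(lengths)
rows = render(hist)
```

Bucketing by bit length keeps the per-event work to a single increment, which suits aggregation inside a map.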

Slide 33

Slide 33 text

eBPF Trace Visualization
Current State
- Using ASCII histograms, ASCII escape codes
- eBPF Flamegraphs, some web-based views

Slide 34

Slide 34 text

Further Reading
Papers
- [Begel et al. 1999] BPF+: exploiting global data-flow optimization in a generalized packet filter architecture, ACM SIGCOMM '99
- [Wu et al. 2008] Swift: A Fast Dynamic Packet Filter, USENIX NSDI (2008)
- [Sharma et al. 2016] Enhanced Userspace and In-Kernel Trace Filtering for Production Systems, J. Comput. Sci. Technol. (2016), Springer US
- [Clément 2016] Linux Kernel packet transmission performance in high-speed networks, Masters Thesis (2016), KTH, Stockholm
- [Borkmann 2016] Advanced programmability and recent updates with tc’s cls_bpf, NetDev 1.2 (2016), Tokyo

Slide 35

Slide 35 text

References
Links
- IOVisor BPF Docs
- bcc Reference Guide
- bcc Python Developer Tutorial
- bcc/BPF Blog Posts
- Dive into BPF: a list of reading material (Quentin Monnet)
- Cilium - Network and Application Security with BPF and XDP (Thomas Graf)
- Landlock LSM Docs (Mickaël Salaün et al.)
- XDP for the Rest of Us (Jesper Brouer & Andy Gospodarek, Netdev 2.1)
- USDT/BPF Tracing Tools (Sasha Goldshtein)
- Linux 4.x Tracing : Performance Analysis with bcc/BPF (Brendan Gregg, SCALE 15X)
- BPF/bcc for Oracle Tracing
- Weaveworks Scope HTTP Statistics Plugin

Slide 36

Slide 36 text

Ack
- DORSAL Lab, Polytechnique Montréal
- IOVisor Project Contributors
- Hopper.com
- Papers We Love

Slide 37

Slide 37 text

Fin!
Suchakrapani Datt Sharma
[email protected]
@tuxology
All the text and images in this presentation drawn by the authors are released under CC-BY-SA. Images not drawn by authors have been attributed either on slides or in references.