Overcoming Observer Effects in Memory Management with DAMON

Slide 1

Slide 1 text

BPFize Your Kernel Subsystem: The fanotify Experience
Song Liu
Kernel Recipes 2025

Slide 2

Slide 2 text

Who am I?
● Song Liu
● Software engineer @ Meta
● Linux kernel contributor, reviewer, and maintainer
● Working with BPF for 7+ years
● Supporting BPF LSM users @ Meta

Slide 3

Slide 3 text

What is BPF/eBPF?
● BPF originally stood for Berkeley Packet Filter
● eBPF means extended BPF
● BPF and eBPF are generally used interchangeably

Slide 4

Slide 4 text

What is BPF/eBPF?
● BPF originally stood for Berkeley Packet Filter
● eBPF means extended BPF
● BPF and eBPF are generally used interchangeably
● How would you describe BPF to someone who has never heard of it?

Slide 5

Slide 5 text

Well-known proverb in China:
“There are a thousand Hamlets in a thousand people’s eyes.”
“一千个人眼中有一千个哈姆雷特。”

Slide 6

Slide 6 text

“There are a thousand BPFs on a thousand kernel hackers’ keyboards.”

Slide 7

Slide 7 text

One of those 1000 kernel hackers:
“What is BPF in my eyes? How do I use BPF on my keyboard?”

Slide 8

Slide 8 text

Agenda
01 What is BPF?
02 BPFize your kernel subsystem
03 BPF LSM
04 fanotify-bpf

Slide 9

Slide 9 text

“eBPF is a revolutionary technology with origins in the Linux kernel that can run sandboxed programs in a privileged context such as the operating system kernel. It is used to safely and efficiently extend the capabilities of the kernel without requiring to change kernel source code or load kernel modules.”
Source: https://ebpf.io/what-is-ebpf/

Slide 10

Slide 10 text

Something is a tasty food from France that goes well with pesto sauce

Slide 11

Slide 11 text

Something is a tasty food from France that goes well with pesto sauce and is often served in its own shell

Slide 12

Slide 12 text

Something is a tasty food from France that goes well with pesto sauce and is often served in its own shell

Slide 13

Slide 13 text

Slide 14

Slide 14 text

Slide 15

Slide 15 text

BPF brings programmability
● Traditionally, we could only tune kernel behavior with knobs: sysfs, syscalls, sysctl, etc.
● BPF Instruction Set Architecture (ISA)

Slide 16

Slide 16 text

BPF is powerful and performant
● Running in a privileged context, BPF programs have visibility into events happening across the whole stack
● XDP (eXpress Data Path) runs in the device driver to achieve extreme performance

Slide 17

Slide 17 text

BPF is safe
● BPF programs will not crash the system
● BPF programs are guaranteed to terminate
● BPF programs will not read uninitialized memory
● BPF programs will not leak memory
● BPF programs will not leak references to kernel objects
● and the list is growing

Slide 18

Slide 18 text

BPF is safe, but how?
● The BPF verifier checks the safety of each BPF program
● The verifier works on BPF instructions
○ This is one of the reasons why we need the BPF ISA
● What about performance? JIT!

Slide 19

Slide 19 text

BPF program from source to execution
helloworld.bpf.c:
    SEC("kprobe/copy_process")
    int BPF_KPROBE(my_prog)
    {
        bpf_printk("Hello World!");
        return 0;
    }
clang/gcc: helloworld.bpf.c → helloworld.bpf.o (BPF ISA)
BPF_PROG_LOAD: my_prog (BPF ISA) → verifier → my_prog (BPF ISA) → JIT → my_prog (native ISA)
BPF_PROG_ATTACH: my_prog (native ISA) → copy_process()
cat /sys/kernel/tracing/trace
    [...] Hello World!
    [...] Hello World!

Slide 20

Slide 20 text

Sandboxed vs. privileged context
● This “sandbox” runs in a privileged context
● To fully leverage the privileges, BPF programs need reusable building blocks
○ BPF maps, BPF helpers/kfuncs, etc.
● The use of these building blocks is verified to be safe

Slide 21

Slide 21 text

Reusable building blocks
● BPF maps => memory management, input/output
● BPF helpers and kfuncs => interact with the system
● BPF iterator, BPF trampoline => trigger BPF programs for certain use cases
● BTF (BPF Type Format) => understand the kernel/BPF data structures

Slide 22

Slide 22 text

BPF challenges
● Writing a verifiable BPF program is not always straightforward
● Creating reusable building blocks is harder than writing code for a single use case

Slide 23

Slide 23 text

Agenda
01 What is BPF?
02 BPFize your kernel subsystem
03 BPF LSM
04 fanotify-bpf

Slide 24

Slide 24 text

BPFize your kernel subsystem
Pie! Four slices: Use Case, Hook Point, Community Support, Building Blocks

Slide 25

Slide 25 text

Know your use case
● Why do we need BPF?
○ Programmability? Speed?
● What are the alternatives?
● How does BPF fit into the whole solution?

Slide 26

Slide 26 text

Design your hook point
● Add a pointer to struct bpf_prog, and some code to call the program
● For most new use cases, struct_ops is the recommended solution
● Try to reuse existing mechanisms

Slide 27

Slide 27 text

Add building blocks to the kernel
● New BPF helpers / new BPF kfuncs
● New map type
● Try to reuse existing building blocks
○ e.g., use hashmap/array instead of a new map type

Slide 28

Slide 28 text

Work with the community
● Help the community understand the use case
● Be prepared to redesign your solution
● Write selftests for kernel changes
● Support your code in the upstream kernel

Slide 29

Slide 29 text

Agenda
01 What is BPF?
02 BPFize your kernel subsystem
03 BPF LSM
04 fanotify-bpf

Slide 30

Slide 30 text

Access control: DAC vs. MAC
● Discretionary Access Control (DAC): the owner has full control of the resource
    ls -l Makefile
    -rw-r--r-- 1 song myGroup 7764 Sep 12 16:56 Makefile
● Mandatory Access Control (MAC): detailed rules/policies
    ALLOW firefox_process firefox_log:FILE WRITE;
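The DAC decision behind the `ls -l` output on this slide can be sketched in user space. This is a simplified model and not the kernel's implementation: `dac_may_write` is a hypothetical helper, and it deliberately ignores root, supplementary groups, ACLs, and capabilities.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: classic owner/group/other write check.
 * Exactly one permission class applies, chosen in this order:
 * owner first, then group, then everyone else.
 */
static bool dac_may_write(mode_t mode, uid_t file_uid, gid_t file_gid,
                          uid_t uid, gid_t gid)
{
    if (uid == file_uid)
        return (mode & S_IWUSR) != 0;
    if (gid == file_gid)
        return (mode & S_IWGRP) != 0;
    return (mode & S_IWOTH) != 0;
}
```

With mode 0644 (the `-rw-r--r--` shown for Makefile), only the owner passes this check. A MAC policy adds rules on top that the owner cannot override.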

Slide 31

Slide 31 text

LSM: Linux Security Module
● Linux’s MAC solution
Diagram: The rest of the kernel → 200+ LSM hooks: security_*() → SELinux, AppArmor, BPF LSM, Landlock, ...

Slide 32

Slide 32 text

VOID hooks
Notify the LSMs that something happened in the system
    static inline void file_free(struct file *f)
    {
        security_file_free(f);
        /* ... */
    }

Slide 33

Slide 33 text

INT hooks
Also allow the LSMs to deny an operation
    static int do_dentry_open(struct file *f, ...)
    {
        /* ... */
        error = security_file_open(f);
        if (error)
            goto cleanup_all; /* deny open() */
        /* ... */
    }

Slide 34

Slide 34 text

LSM hook functions
    static int apparmor_file_open(struct file *file)
    {
        /* Logic that decides whether this open() is allowed */
    }

    static struct security_hook_list apparmor_hooks[] = {
        /* ... */
        LSM_HOOK_INIT(file_open, apparmor_file_open),
        /* ... */
    };

Slide 35

Slide 35 text

Inside an LSM hook
    int security_file_open(struct file *file)
    {
        return call_int_hook(file_open, file);
    }
call_int_hook() is roughly equivalent to:
    for_each_lsm(some_lsm) {
        ret = some_lsm_file_open(file);
        if (ret != 0)
            return ret;
    }
    return 0;
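The "first non-zero return wins" behavior of an INT hook can be tried out in a small user-space model. Everything here is a stand-in: `struct file`, the two toy LSMs, and `call_file_open_hooks` are hypothetical names chosen for illustration, not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct file. */
struct file {
    const char *path;
    int protected_file;   /* toy policy input */
};

/* An INT hook returns 0 to allow, or a negative errno to deny. */
typedef int (*file_open_hook)(struct file *file);

/* Toy LSM A: allows everything. */
static int lsm_a_file_open(struct file *file)
{
    (void)file;
    return 0;
}

/* Toy LSM B: denies files marked as protected. */
static int lsm_b_file_open(struct file *file)
{
    return file->protected_file ? -EACCES : 0;
}

/* Model of the hook loop: stop at the first LSM that denies. */
static int call_file_open_hooks(const file_open_hook *hooks, size_t n,
                                struct file *file)
{
    for (size_t i = 0; i < n; i++) {
        int ret = hooks[i](file);

        if (ret != 0)
            return ret;
    }
    return 0;
}
```

An open is allowed only if every registered LSM returns 0, so adding an LSM to the chain can only make the policy stricter.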

Slide 36

Slide 36 text

In tree LSM
● In kernel logic to support a policy language
● User space policy sets crafted for given use cases
Diagram: The rest of the kernel → LSM hooks: security_*() → LSM in kernel logic; Policies and tools in user space

Slide 37

Slide 37 text

BPF LSM
● Kernel space logic written in BPF programs
● User space daemon controls BPF programs and policies
Diagram: The rest of the kernel → LSM hooks: security_*() → LSM logic in BPF; User space daemon and policy

Slide 38

Slide 38 text

BPF LSM

Slide 39

Slide 39 text

In kernel LSM vs. BPF LSM
Diagram: The rest of the kernel → LSM hooks: security_*()
● In kernel LSM: LSM in kernel logic; policies and tools in user space
● BPF LSM: LSM logic in BPF; user space daemon, policy, tools

Slide 40

Slide 40 text

BPF LSM use case
● Flexibility
● Programmability
● Alternative: in tree LSMs

Slide 41

Slide 41 text

BPF LSM hooks
    static int bpf_lsm_file_open(struct file *file)
    {
        return 0;
    }

    static struct security_hook_list bpf_lsm_hooks[] = {
        /* ... */
        LSM_HOOK_INIT(file_open, bpf_lsm_file_open),
        /* ... */
    };

Slide 42

Slide 42 text

Slide 43

Slide 43 text

BPF trampoline
● A small function generated at run time to call BPF program(s)
○ Uses direct calls for fast execution
○ Can reliably access function arguments at function return time
● BPF_TRAMP_MODIFY_RETURN: allows a BPF program to modify the return value of the target kernel function
○ Used in error injection and BPF LSM

Slide 44

Slide 44 text

BPF trampoline with BPF_TRAMP_MODIFY_RETURN
    int bpf_lsm_file_open(struct file *file)
    {
        nop(); /* __fentry__ */
        return 0;
    }

    bpf_tramp_xxxx(struct file *file)
    {
        ret = call(bpf_lsm_file_open + sizeof(nop));
        modified_ret = bpf_prog_xxxx(file, ret);
        return modified_ret;
    }

Slide 45

Slide 45 text

BPF trampoline with BPF_TRAMP_MODIFY_RETURN
    int bpf_lsm_file_open(struct file *file)
    {
        nop();
        call(bpf_tramp_xxxx, file);
        return 0;
    }

    bpf_tramp_xxxx(struct file *file)
    {
        ret = call(bpf_lsm_file_open + sizeof(nop));
        modified_ret = bpf_prog_xxxx(file, ret);
        return modified_ret;
    }

Slide 46

Slide 46 text

BPF trampoline with BPF_TRAMP_MODIFY_RETURN
    int bpf_lsm_file_open(struct file *file)
    {
        nop();
        call(bpf_tramp_xxxx, file);
        return 0;
    }

    bpf_tramp_xxxx(struct file *file)
    {
        ret = call(bpf_lsm_file_open + sizeof(nop));
        modified_ret = bpf_prog_xxxx(file, ret);
        return modified_ret;
    }
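The slide's pseudocode can be mirrored in plain C to make the data flow concrete. This is only a model of the simplified picture above: the real trampoline is machine code generated at run time, and the names `default_file_open`, `deny_untrusted`, and `trampoline_file_open` are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct file. */
struct file {
    int untrusted;
};

/* Models the default hook body, which simply returns 0 (allow). */
static int default_file_open(struct file *file)
{
    (void)file;
    return 0;
}

/* Models an attached program: it sees the arguments and the original result. */
typedef int (*modify_return_prog)(struct file *file, int ret);

static int deny_untrusted(struct file *file, int ret)
{
    if (ret != 0)
        return ret;   /* keep an earlier denial */
    return file->untrusted ? -EPERM : 0;
}

/* Models the trampoline: run the original, then let the program override. */
static int trampoline_file_open(struct file *file, modify_return_prog prog)
{
    int ret = default_file_open(file);

    if (prog != NULL)
        ret = prog(file, ret);
    return ret;
}
```

With no program attached the hook keeps returning 0, which is why an idle BPF LSM hook allows every operation.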

Slide 47

Slide 47 text

BPF LSM building blocks
● BPF local storage
● New BPF kfuncs
○ Multiple kfuncs to read/write xattrs
○ bpf_get_fsverity_digest
○ bpf_verify_pkcs7_signature

Slide 48

Slide 48 text

Security data attached to kernel objects
● Many LSMs use fixed-size data for each kernel object, e.g., inode, task_struct
● A data blob is allocated to serve all enabled LSMs
● The data blob has the same lifetime as the kernel object it is attached to
    struct inode { void *i_security; };
Diagram: i_security → data blob: rcu_head | SELinux | ... | ...

Slide 49

Slide 49 text

BPF local storage
● BPF LSM cannot use a fixed-size data blob, because the size of the per-object data is not known ahead of time
● BPF local storage was introduced to handle multiple maps
● Limitation: three pointer dereferences to access the value
    struct inode { void *i_security; };
Diagram: i_security → data blob: rcu_head | SELinux | BPF | ... ; BPF → bpf_local_storage → ... → value
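The pointer chain behind the "three dereferences" limitation can be pictured with a toy layout. The struct names below are simplified assumptions for illustration and do not match the kernel's real definitions; in particular, this model holds a single value rather than one value per map.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified layout. */
struct value {
    int flag;
};

struct local_storage {
    struct value *value;                 /* real storage holds one entry per map */
};

struct security_blob {
    struct local_storage *bpf_storage;   /* the BPF slot in the shared blob */
};

struct inode {
    void *i_security;
};

/* Follows three pointers before the value itself can be read. */
static struct value *lookup_value(const struct inode *inode)
{
    const struct security_blob *blob = inode->i_security;   /* load 1 */

    if (blob == NULL || blob->bpf_storage == NULL)           /* load 2 */
        return NULL;
    return blob->bpf_storage->value;                         /* load 3 */
}
```

An LSM with fixed-size data finds its fields at a known offset inside the blob, so it needs only the first of these loads.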

Slide 50

Slide 50 text

Slide 51

Slide 51 text

Binary authorization with BPF LSM
● Use case: protect key resources against untrusted binaries
○ e.g., only a signed sshd binary can bind port 22
● A trusted build server signs the binary’s fsverity digest (Merkle tree root hash) with a private key
● The signature is stored in an xattr
● A user space security daemon loads the public key into a keyring

Slide 52

Slide 52 text

Binary authorization with BPF LSM, cont’d
    SEC("lsm.s/bprm_committed_creds")
    int BPF_PROG(bprm_committed_creds, const struct linux_binprm *bprm)
    {
        /* simplified to fit in one page */
        bpf_get_fsverity_digest(bprm->file, &digest);
        bpf_get_file_xattr(bprm->file, "user.sig", &sig);
        ret = bpf_verify_pkcs7_signature(&digest, &sig, keyring);
        /* If ret == 0, this binary has been signed by the private key.
         * Set a flag in a BPF map.
         */
    }

Slide 53

Slide 53 text

BPF LSM in the community
● restrict_filesystems in systemd
● Used by big companies
● Open source solutions: Tetragon, more coming
● Improvements being shipped
● Use cases evolving quickly

Slide 54

Slide 54 text

Agenda
01 What is BPF?
02 BPFize your kernel subsystem
03 BPF LSM
04 fanotify-bpf

Slide 55

Slide 55 text

Limitation of LSM: local rules vs. global hooks
● Different access rules for different files
○ e.g., LSM only protects some key files
● LSM hooks are global
○ e.g., the file_open hook is triggered on every file open

Slide 56

Slide 56 text

To Be Determined

Slide 57

Slide 57 text

Use case
● Apply security rules to a set of files
● Specifically, to a subtree in the VFS tree
● Alternative: BPF LSM

Slide 58

Slide 58 text

fanotify
● Filesystem notification system
● Also provides access control via permission events, used by antivirus scanners
● fanotify can specify rules locally for a file, directory, or superblock

Slide 59

Slide 59 text

fanotify permission events
    /* user process */
    fd = open(file, ...); /* fails with EPERM */

    /* monitor daemon */
    read(fan_fd, buf, size);
    metadata = (struct fanotify_event_metadata *)buf;
    if (metadata->mask & FAN_OPEN_PERM) {
        /* scan the file here, with metadata->fd */
        response.fd = metadata->fd;
        response.response = FAN_DENY;
        write(fan_fd, &response, sizeof(response));
    }
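The daemon's decision step can be isolated into a small function that needs no privileges to exercise. `struct fanotify_event_metadata`, `struct fanotify_response`, `FAN_OPEN_PERM`, `FAN_ALLOW`, and `FAN_DENY` come from `<sys/fanotify.h>` on Linux; the helper name `build_open_perm_response` is a hypothetical choice for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/fanotify.h>

/*
 * Fills *resp for a FAN_OPEN_PERM event and returns true.
 * Returns false for events that do not expect a response.
 */
static bool build_open_perm_response(const struct fanotify_event_metadata *md,
                                     bool allow,
                                     struct fanotify_response *resp)
{
    if (!(md->mask & FAN_OPEN_PERM))
        return false;

    resp->fd = md->fd;   /* identifies which pending open is being answered */
    resp->response = allow ? FAN_ALLOW : FAN_DENY;
    return true;
}
```

A real daemon would write the filled response back to the fanotify file descriptor and close `md->fd` afterwards. The opening process stays blocked until that write happens, which is the round trip an in-kernel fast path aims to avoid.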

Slide 60

Slide 60 text

fanotify: per inode/superblock rules
    struct inode {
        __u32 i_fsnotify_mask; /* all events this inode cares about */
        struct fsnotify_mark_connector __rcu *i_fsnotify_marks;
    };

    struct super_block {
        u32 s_fsnotify_mask;
        struct fsnotify_sb_info *s_fsnotify_info;
    };

Slide 61

Slide 61 text

Slide 62

Slide 62 text

BPF LSM + fanotify
Diagram: The rest of the kernel; LSM hooks: security_*() with BPF LSM programs; fanotify hooks with BPF fanotify programs; User space daemon, policy, and tools

Slide 63

Slide 63 text

fanotify-bpf use cases
● Apply security rules to a set of files
● Specifically, to a subtree in the VFS tree
● Alternative: BPF LSM
● Can also help antivirus scanners

Slide 64

Slide 64 text

fanotify permission events
    /* user process */
    fd = open(file, ...); /* fails with EPERM */

    /* monitor daemon */
    read(fan_fd, buf, size);
    metadata = (struct fanotify_event_metadata *)buf;
    if (metadata->mask & FAN_OPEN_PERM) {
        /* scan the file here, with metadata->fd */
        response.fd = metadata->fd;
        response.response = FAN_DENY;
        write(fan_fd, &response, sizeof(response));
    }

Slide 65

Slide 65 text

fanotify with in-kernel fast path
    /* user process */
    fd = open(file, ...); /* succeeds, or fails with EPERM */

    /* in-kernel fast path */
    fp_handler

    /* monitor daemon */
    read(fan_fd, buf, size);
    metadata = (struct fanotify_event_metadata *)buf;
    if (metadata->mask & FAN_OPEN_PERM) {
        /* scan the file here, with metadata->fd */
        response.fd = metadata->fd;
        response.response = FAN_DENY;
        write(fan_fd, &response, sizeof(response));
    }

Slide 66

Slide 66 text

fanotify fastpath handler
    struct fanotify_fastpath_ops {
        int (*fp_handler)(...);
        int (*fp_init)(...);
        void (*fp_free)(...);
        char name[FAN_FP_NAME_MAX];
        /* ... */
    };

Slide 67

Slide 67 text

fanotify fastpath handler
    struct fanotify_fastpath_ops {
        int (*fp_handler)(...);
        int (*fp_init)(...);
        void (*fp_free)(...);
        char name[FAN_FP_NAME_MAX];
        /* ... */
    };

Slide 68

Slide 68 text

struct_ops
● Define an interface
○ tcp-bpf: congestion control
○ sched_ext: scheduler
● Multiple implementations of the interface
● Can use either a kernel module or BPF
○ Kernel module: can use all exported symbols
○ BPF: safe
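The idea of one interface with several interchangeable implementations can be modeled in user space with a struct of function pointers and a name-based registry. This is an illustrative sketch only: `struct event_ops`, `register_ops`, and `find_ops` are hypothetical names, and real struct_ops registration goes through the BPF subsystem rather than a static array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical interface, loosely shaped like a struct_ops definition. */
struct event_ops {
    const char *name;
    int (*handle)(int event);   /* 0 allows the event, non-zero denies it */
};

#define MAX_OPS 4

static const struct event_ops *registry[MAX_OPS];
static size_t nr_ops;

/* Registers one implementation; returns -1 when the registry is full. */
static int register_ops(const struct event_ops *ops)
{
    assert(ops != NULL && ops->name != NULL && ops->handle != NULL);
    if (nr_ops == MAX_OPS)
        return -1;
    registry[nr_ops++] = ops;
    return 0;
}

/* Looks up an implementation by name, as a caller selecting a policy would. */
static const struct event_ops *find_ops(const char *name)
{
    for (size_t i = 0; i < nr_ops; i++)
        if (strcmp(registry[i]->name, name) == 0)
            return registry[i];
    return NULL;
}

/* Two toy implementations of the same interface. */
static int allow_all(int event) { (void)event; return 0; }
static int deny_odd(int event) { return (event % 2) ? -1 : 0; }

static const struct event_ops allow_all_ops = { "allow_all", allow_all };
static const struct event_ops deny_odd_ops = { "deny_odd", deny_odd };
```

The caller only depends on the interface, so an implementation can come from a module or from a BPF program without the call site changing.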

Slide 69

Slide 69 text

fanotify-bpf building blocks
● New kfunc bpf_is_subdir(dentry, subroot) to enable subtree rules
● Hold a reference to subroot in a BPF map
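One plausible way to picture a subtree check is a walk up the parent chain until the subtree root is found or the top is reached. The sketch below is a user-space model under that assumption, not the proposed kfunc's implementation: `struct node` and `is_subdir_model` are hypothetical, and a NULL parent marks the root here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry; a NULL parent marks the root. */
struct node {
    struct node *parent;
};

/* Returns true if d is subroot itself or lies anywhere below it. */
static bool is_subdir_model(const struct node *d, const struct node *subroot)
{
    for (; d != NULL; d = d->parent)
        if (d == subroot)
            return true;
    return false;
}
```

The cost of this walk grows with the depth of the file in the tree, so a check like this is paid on every event for files under a monitored superblock.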

Slide 70

Slide 70 text

fanotify-bpf: engage the community
● Multiple rounds of patch reviews with VFS and BPF maintainers

Slide 71

Slide 71 text

fanotify-bpf

Slide 72

Slide 72 text

fanotify-bpf: not landed yet
● Issues with subtree monitoring
○ Requires per-superblock rules, not really local
○ bpf_is_subdir() is not much faster than path walking in BPF
○ Holding subroot in a BPF map could be risky, e.g., it may block umount
● Antivirus scanners may not need an in-kernel fastpath
● The alternative, BPF LSM, is good enough for most use cases

Slide 73

Slide 73 text

Slide 74

Slide 74 text

Slide 75

Slide 75 text

What’s next?
● Present fanotify-bpf at Kernel Recipes [DONE]
● Benchmark BPF LSM vs. fanotify-bpf
● Find a better solution for subtree monitoring
● Let the use cases evolve