Overcoming Observer Effects in Memory Management with DAMON

Knowing the data access pattern of a system and its workloads is crucial for efficient memory management. However, accurate measurement can be challenging due to observer effects.

This talk will introduce how the Linux kernel overcomes this problem using the DAMON subsystem, and share practical use cases for improving memory efficiency on real-world production Linux systems.

SJ Park

Kernel Recipes

October 02, 2025

Transcript

  1. Who am I?
    • Song Liu
    • Software engineer @ Meta
    • Linux kernel contributor, reviewer, and maintainer
    • Working with BPF for 7+ years
    • Support BPF LSM users @ Meta
  3. What is BPF/eBPF?
    • BPF originally stood for Berkeley Packet Filter
    • eBPF means extended BPF
    • BPF and eBPF are generally used interchangeably
    • How would you describe BPF to someone who has never heard of it?
  4. Well-known proverb in China: “There are a thousand Hamlets in a thousand people’s eyes.” (“一千个人眼中有一千个哈姆雷特。”)
  5. One of those 1000 kernel hackers: “What is BPF in my eyes? How do I use BPF on my keyboard?”
  6. From https://ebpf.io/what-is-ebpf/: “eBPF is a revolutionary technology with origins in the Linux kernel that can run sandboxed programs in a privileged context such as the operating system kernel. It is used to safely and efficiently extend the capabilities of the kernel without requiring to change kernel source code or load kernel modules.”
  7. Something is a tasty food from France that goes well with pesto sauce and is often served in its own shell.
  11. BPF brings programmability
    • Traditionally, we can only tune kernel behavior with knobs: sysfs, syscalls, sysctl, etc.
    • BPF Instruction Set Architecture (ISA)
  12. BPF is powerful and performant
    • Running in privileged contexts, BPF programs have visibility into events happening across the whole stack
    • XDP (eXpress Data Path) runs in the device driver to achieve extreme performance
  13. BPF is safe
    • BPF programs will not crash the system
    • BPF programs are guaranteed to terminate
    • BPF programs will not read uninitialized memory
    • BPF programs will not leak memory
    • BPF programs will not leak references to kernel objects
    • and the list is growing
  14. BPF is safe, but how?
    • The BPF verifier checks the safety of each BPF program
    • The verifier works on BPF instructions
      ◦ This is one of the reasons why we need the BPF ISA
    • What about performance? JIT!
  15. BPF program from source to execution
    helloworld.bpf.c:
        SEC("kprobe/copy_process")
        int BPF_PROG(my_prog)
        {
            bpf_printk("Hello world!");
            return 0;
        }
    Pipeline:
        helloworld.bpf.c -> clang/gcc -> helloworld.bpf.o (BPF ISA)
        BPF_PROG_LOAD: my_prog (BPF ISA) -> verifier -> my_prog (BPF ISA) -> JIT -> my_prog (native ISA)
        BPF_PROG_ATTACH: my_prog (native ISA) -> copy_process()
    Output:
        cat /sys/kernel/tracing/trace
        [...] Hello world!
        [...] Hello world!
  16. Sandboxed vs. privileged context
    • This “sandbox” runs in the privileged context
    • To fully leverage the privileges, BPF programs need reusable building blocks
      ◦ BPF maps, BPF helpers/kfuncs, etc.
    • The uses of these building blocks are verified to be safe
  17. Reusable building blocks
    • BPF maps => memory management, input/output
    • BPF helpers and kfuncs => interact with the system
    • BPF iterator, BPF trampoline => trigger BPF programs for certain use cases
    • BTF (BPF Type Format) => understand the kernel/BPF data structures
  18. BPF challenges
    • Writing a verifiable BPF program is not always straightforward
    • Creating reusable building blocks is harder than writing code for a single use case
  19. Know your use case
    • Why do we need BPF?
      ◦ Programmability? Speed?
    • What are the alternatives?
    • How does BPF fit into the whole solution?
    [Diagram: Use Case, Hook Point, Community Support, Building Blocks]
  20. Design your hook point
    • Add a pointer to struct bpf_prog, and some code to call the program
    • For most new use cases, struct_ops is the recommended solution
    • Try to reuse existing mechanisms
  21. Add building blocks to the kernel
    • New BPF kfuncs (rather than new BPF helpers)
    • New map type
    • Try to reuse existing building blocks
      ◦ e.g., use hashmap/array instead of a new map type
  22. Work with the community
    • Help the community understand the use case
    • Be prepared to redesign your solution
    • Write selftests for kernel changes
    • Support your code in the upstream kernel
  23. Access control: DAC vs. MAC
    • Discretionary Access Control (DAC): the owner has full control of the resources
        ls -l Makefile
        -rw-r--r-- 1 song myGroup 7764 Sep 12 16:56 Makefile
    • Mandatory Access Control (MAC): detailed rules/policies
        ALLOW firefox_process firefox_log:FILE WRITE;
  24. LSM: Linux Security Module
    • Linux’s MAC solution
    [Diagram: the rest of the kernel -> 200+ LSM hooks: security_*() -> SELinux, AppArmor, BPF LSM, Landlock, ...]
  25. VOID hooks: notify the LSMs that something happened in the system
        static inline void file_free(struct file *f)
        {
            security_file_free(f);
            /* ... */
        }
  26. INT hooks: also allow the LSMs to deny an operation
        static int do_dentry_open(struct file *f, ...)
        {
            /* ... */
            error = security_file_open(f);
            if (error)
                goto cleanup_all; /* deny open() */
            /* ... */
        }
  27. LSM hook functions
        static int apparmor_file_open(struct file *file)
        {
            /* Logic that decides whether this open() is allowed */
        }

        static struct security_hook_list apparmor_hooks[] = {
            /* ... */
            LSM_HOOK_INIT(file_open, apparmor_file_open),
            /* ... */
        };
  28. Inside an LSM hook
        int security_file_open(struct file *file)
        {
            return call_int_hook(file_open, file);
        }

    call_int_hook() behaves roughly like:
        for_each_lsm(some_lsm) {
            ret = some_lsm_file_open(file);
            if (ret != 0)
                return ret;
        }
        return 0;
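The "first non-zero return wins" loop on this slide can be tried out in user space. This is a hedged simulation, not the kernel's call_int_hook() macro: `struct file` is a stand-in type, and the three `lsm_*_file_open` functions and the `/etc/shadow` rule are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct file { const char *path; }; /* stand-in for the kernel's struct file */

typedef int (*file_open_hook_t)(struct file *);

/* Three hypothetical LSMs registered for the file_open hook. */
static int lsm_a_file_open(struct file *f) { (void)f; return 0; }

static int lsm_b_file_open(struct file *f)
{
    /* Deny one specific path, allow everything else. */
    return strcmp(f->path, "/etc/shadow") == 0 ? -EACCES : 0;
}

static int calls_to_c; /* counts how often the last LSM was consulted */
static int lsm_c_file_open(struct file *f) { (void)f; calls_to_c++; return 0; }

static const file_open_hook_t file_open_hooks[] = {
    lsm_a_file_open, lsm_b_file_open, lsm_c_file_open,
};

/* Simulated security_file_open(): walk the LSMs in order and stop at the
 * first one that returns a non-zero (deny) value. */
static int security_file_open(struct file *f)
{
    size_t i;

    assert(f != NULL);
    for (i = 0; i < sizeof(file_open_hooks) / sizeof(file_open_hooks[0]); i++) {
        int ret = file_open_hooks[i](f);
        if (ret != 0)
            return ret;
    }
    return 0;
}
```

Once lsm_b denies the open, lsm_c is never called for that file, so an LSM later in the list cannot override an earlier denial.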
  29. In tree LSM
    • In kernel logic to support a policy language
    • User space policy sets crafted for given use cases
    [Diagram: the rest of the kernel -> LSM hooks: security_*() -> LSM in kernel logic; policies and tools in user space]
  30. BPF LSM
    • Kernel space logic written in BPF programs
    • User space daemon controls BPF programs and policies
    [Diagram: the rest of the kernel -> LSM hooks: security_*() -> LSM logic in BPF; user space daemon and policy]
  31. In kernel LSM vs. BPF LSM
    [Diagram: the rest of the kernel -> LSM hooks: security_*(). In kernel LSM: LSM in kernel logic; policies and tools in user space. BPF LSM: LSM logic in BPF; user space daemon, policy, tools]
  32. BPF LSM use case
    • Flexibility
    • Programmability
    • Alternative: in tree LSMs
  33. BPF LSM hooks
        static int bpf_lsm_file_open(struct file *file)
        {
            return 0;
        }

        static struct security_hook_list bpf_lsm_hooks[] = {
            /* ... */
            LSM_HOOK_INIT(file_open, bpf_lsm_file_open),
            /* ... */
        };
  35. BPF trampoline
    • A small function generated at run time to call BPF program(s)
      ◦ Uses direct calls for fast execution
      ◦ Can reliably access function arguments at function return time
    • BPF_TRAMP_MODIFY_RETURN: allows a BPF program to modify the return value of the target kernel function
      ◦ Used in error injection and BPF LSM
  36. BPF trampoline with BPF_TRAMP_MODIFY_RETURN
        int bpf_lsm_file_open(struct file *file)
        {
            nop(); /* __fentry__ */
            return 0;
        }

        bpf_tramp_xxxx(struct file *file)
        {
            ret = call(bpf_lsm_file_open+sizeof(nop));
            modified_ret = bpf_prog_xxxx(file, ret);
            return modified_ret;
        }
  37. BPF trampoline with BPF_TRAMP_MODIFY_RETURN
        int bpf_lsm_file_open(struct file *file)
        {
            nop() -> call(bpf_tramp_xxxx, file); /* the nop is patched into a call */
            return 0;
        }

        bpf_tramp_xxxx(struct file *file)
        {
            ret = call(bpf_lsm_file_open+sizeof(nop));
            modified_ret = bpf_prog_xxxx(file, ret);
            return modified_ret;
        }
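The control flow on these trampoline slides can be imitated with plain function pointers. This is a simplified user space sketch that follows the slide's pseudocode, not the code the kernel generates at run time; `attached_prog`, `bpf_prog_deny_secret` and the `/secret/` rule are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct file { const char *path; }; /* stand-in for the kernel's struct file */

/* Body of the default hook after the patched instruction: always allow. */
static int bpf_lsm_file_open_body(struct file *file) { (void)file; return 0; }

/* Stand-in for an attached BPF LSM program. It sees the original
 * arguments plus the return value computed so far. */
static int bpf_prog_deny_secret(struct file *file, int ret)
{
    if (ret != 0)
        return ret; /* keep an earlier denial */
    return strncmp(file->path, "/secret/", 8) == 0 ? -EPERM : 0;
}

typedef int (*modify_ret_prog_t)(struct file *, int);
static modify_ret_prog_t attached_prog; /* NULL models the unpatched nop */

/* Simulated trampoline: run the original body, then let the program
 * replace the return value. */
static int bpf_tramp(struct file *file)
{
    int ret = bpf_lsm_file_open_body(file);
    return attached_prog(file, ret);
}

/* Simulated hook entry: with no program attached the "nop" falls through
 * to the body; otherwise the entry is redirected to the trampoline. */
static int bpf_lsm_file_open(struct file *file)
{
    assert(file != NULL);
    if (attached_prog)
        return bpf_tramp(file);
    return bpf_lsm_file_open_body(file);
}
```

In the real kernel the redirection is done by patching the nop instruction rather than by testing a pointer on every call, which is where the "direct calls for fast execution" property comes from.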
  39. BPF LSM building blocks
    • BPF local storage
    • New BPF kfuncs
      ◦ Multiple kfuncs to read/write xattrs
      ◦ bpf_get_fsverity_digest
      ◦ bpf_verify_pkcs7_signature
  40. Security data attached to kernel objects
    • Many LSMs use fixed-size data for kernel objects, e.g., inode, task_struct
    • A data blob is allocated to serve all enabled LSMs
    • The data blob has the same lifetime as the kernel object it attaches to
        struct inode {
            void *i_security;
        };
    [Diagram: i_security -> data blob: rcu_head, SELinux, ...]
  41. BPF local storage
    • BPF LSM cannot use a fixed-size data blob, because the size of per-object data is not known ahead of time
    • BPF local storage was introduced to handle multiple maps
    • Limitation: 3 pointer dereferences to access the value
        struct inode {
            void *i_security;
        };
    [Diagram: i_security -> data blob: rcu_head, SELinux, BPF, ... -> bpf_local_storage -> ... -> value]
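The "3 pointer dereferences" limitation can be pictured with a toy layout. The structures below are deliberately simplified assumptions and do not match the kernel's real struct bpf_local_storage; every name ending in `_sim`, plus `security_blob` and `inode_storage_get`, is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MAPS 4

/* Hypothetical per-object storage: one value slot per BPF map. */
struct bpf_local_storage_sim {
    void *values[MAX_MAPS];
};

/* Hypothetical shared blob: a fixed-size part for an in tree LSM plus
 * one pointer reserved for BPF LSM. */
struct security_blob {
    int selinux_sid;
    struct bpf_local_storage_sim *bpf_storage;
};

struct inode_sim {
    void *i_security;
};

/* Follow the three pointers from the object to the value of one map. */
static void *inode_storage_get(struct inode_sim *inode, int map_id)
{
    struct security_blob *blob;
    struct bpf_local_storage_sim *storage;

    assert(inode != NULL && map_id >= 0 && map_id < MAX_MAPS);
    blob = inode->i_security;       /* 1: inode -> shared blob */
    if (!blob)
        return NULL;
    storage = blob->bpf_storage;    /* 2: blob -> BPF local storage */
    if (!storage)
        return NULL;
    return storage->values[map_id]; /* 3: storage -> value pointer */
}
```

An in tree LSM with fixed-size data stops after the first step, since its fields live inside the blob itself. The extra indirection is the price of not knowing the data size ahead of time.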
  43. Binary authorization with BPF LSM
    • Use case: protect key resources against untrusted binaries
      ◦ e.g., only a signed sshd binary can bind port 22
    • A trusted build server signs the binary’s fsverity digest (Merkle tree root hash) with a private key
    • The signature is stored in an xattr
    • A user space security daemon loads the public key into a keyring
  44. Binary authorization with BPF LSM, cont’d
        SEC("lsm.s/bprm_committed_creds")
        int BPF_PROG(bprm_committed_creds, const struct linux_binprm *bprm)
        {
            /* simplified to fit in one page */
            bpf_get_fsverity_digest(bprm->file, &digest);
            bpf_get_file_xattr(bprm->file, "user.sig", &sig);
            ret = bpf_verify_pkcs7_signature(&digest, &sig, keyring);
            /* If ret == 0, this binary has been signed by the private key.
             * Set a flag in a BPF map. */
        }
  45. BPF LSM in the community
    • restrict_filesystems in systemd
    • Used by big companies
    • Open source solution: Tetragon, more coming
    • Improvements being shipped
    • Use cases evolving quickly
  46. Limitation of LSM: local rules vs. global hooks
    • Different access rules for different files
      ◦ e.g., LSM only protects some key files
    • LSM hooks are global
      ◦ e.g., the file_open hook is triggered on every file open
  47. Use case
    • Apply security rules for a set of files
    • Specifically, for a subtree in the VFS tree
    • Alternative: BPF LSM
  48. fanotify
    • Filesystem notification system
    • Also provides access control via permission events, used by antivirus scanners
    • fanotify can specify rules locally to a file/directory/superblock
  49. fanotify permission events
        /* user process */
        fd = open(file, ...); /* fd == -1, errno == EPERM */

        /* monitor daemon */
        read(fan_fd, buf, size);
        metadata = (struct fanotify_event_metadata *)buf;
        if (metadata->mask & FAN_OPEN_PERM) {
            /* scan the file here, with metadata->fd */
            response.fd = metadata->fd;
            response.response = FAN_DENY;
            write(fan_fd, &response, sizeof(response));
        }
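The daemon side of this slide boils down to turning one event into one verdict. The sketch below isolates that step using the real `struct fanotify_event_metadata` and `struct fanotify_response` from `<sys/fanotify.h>` (Linux only); the helper name `build_open_perm_response` and its `allow` parameter are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/fanotify.h>

/* Fill in the verdict for one event. Returns 1 if the event is an
 * open-permission event that must be answered, 0 otherwise. */
static int build_open_perm_response(const struct fanotify_event_metadata *md,
                                    int allow,
                                    struct fanotify_response *resp)
{
    assert(md != NULL && resp != NULL);
    if (!(md->mask & FAN_OPEN_PERM))
        return 0; /* plain notification, no answer expected */

    memset(resp, 0, sizeof(*resp));
    resp->fd = md->fd; /* identifies which pending open() is answered */
    resp->response = allow ? FAN_ALLOW : FAN_DENY;
    return 1;
}
```

A real daemon would obtain fan_fd from fanotify_init() with FAN_CLASS_CONTENT, register interest with fanotify_mark() and FAN_OPEN_PERM, write the filled response back to fan_fd, and close metadata->fd afterwards. Those steps need CAP_SYS_ADMIN, so they are left out of the sketch.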
  50. fanotify: per inode/superblock rules
        struct inode {
            __u32 i_fsnotify_mask; /* all events this inode cares about */
            struct fsnotify_mark_connector __rcu *i_fsnotify_marks;
        };

        struct super_block {
            u32 s_fsnotify_mask;
            struct fsnotify_sb_info *s_fsnotify_info;
        };
  51. fanotify: fast embedded filter
        struct inode {
            __u32 i_fsnotify_mask; /* all events this inode cares about */
            struct fsnotify_mark_connector __rcu *i_fsnotify_marks;
        };

        struct super_block {
            u32 s_fsnotify_mask;
            struct fsnotify_sb_info *s_fsnotify_info;
        };
  52. BPF LSM + fanotify
    [Diagram: the rest of the kernel -> LSM hooks: security_*() -> BPF LSM programs; fanotify hooks -> BPF fanotify programs; user space daemon, policy, and tools]
  53. fanotify-bpf use cases
    • Apply security rules for a set of files
    • Specifically, for a subtree in the VFS tree
    • Alternative: BPF LSM
    • Can also help antivirus scanners
  55. fanotify with in-kernel fast path
        /* user process */
        fd = open(file, ...); /* fd is valid, or fd == -1 with errno == EPERM */

        /* monitor daemon */
        read(fan_fd, buf, size);
        metadata = (struct fanotify_event_metadata *)buf;
        if (metadata->mask & FAN_OPEN_PERM) {
            /* scan the file here, with metadata->fd */
            response.fd = metadata->fd;
            response.response = FAN_DENY;
            write(fan_fd, &response, sizeof(response));
        }
    [Diagram label: fp_handler]
  56. fanotify fastpath handler
        struct fanotify_fastpath_ops {
            int (*fp_handler)(...);
            int (*fp_init)(...);
            void (*fp_free)(...);
            char name[FAN_FP_NAME_MAX];
            /* ... */
        };
  57. struct_ops
    • Define an interface
      ◦ tcp-bpf: congestion control
      ◦ sched_ext: scheduler
    • Multiple implementations of the interface
    • Can use either a kernel module or BPF
      ◦ Kernel module: can use all exported symbols
      ◦ BPF: safe
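The struct_ops idea of one interface with several implementations selected by name is an ordinary table of function pointers. The user space sketch below only shows that pattern: `fastpath_ops_sim`, the registry functions and both sample handlers are hypothetical, and the single-argument `fp_handler` signature is an assumption, since the slide elides the real parameters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FP_NAME_MAX 32
#define MAX_IMPLS 4

/* The interface: anything that fills in this table is an implementation. */
struct fastpath_ops_sim {
    int (*fp_handler)(const char *path); /* 0 = allow, negative = deny */
    char name[FP_NAME_MAX];
};

static const struct fastpath_ops_sim *registry[MAX_IMPLS];
static size_t nr_registered;

/* Register one implementation; returns 0 on success, -1 on failure. */
static int register_fastpath(const struct fastpath_ops_sim *ops)
{
    assert(ops != NULL);
    if (nr_registered == MAX_IMPLS || !ops->fp_handler)
        return -1;
    registry[nr_registered++] = ops;
    return 0;
}

/* Look an implementation up by name, as a user would when attaching it. */
static const struct fastpath_ops_sim *find_fastpath(const char *name)
{
    size_t i;

    for (i = 0; i < nr_registered; i++) {
        if (strcmp(registry[i]->name, name) == 0)
            return registry[i];
    }
    return NULL;
}

/* Two independent implementations of the same interface. */
static int allow_all(const char *path) { (void)path; return 0; }

static int deny_tmp(const char *path)
{
    return strncmp(path, "/tmp/", 5) == 0 ? -1 : 0;
}

static const struct fastpath_ops_sim allow_all_ops = {
    .fp_handler = allow_all, .name = "allow-all",
};
static const struct fastpath_ops_sim deny_tmp_ops = {
    .fp_handler = deny_tmp, .name = "deny-tmp",
};
```

In the kernel, the table could be filled in either by a module or by a BPF program. The caller of fp_handler does not need to know which one it is.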
  58. fanotify-bpf building blocks
    • New kfunc bpf_is_subdir(dentry, subroot) to enable subtree rules
    • Hold a reference to subroot in a BPF map
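A subtree rule ultimately asks whether one directory entry sits below another. The following user space sketch shows the parent-walking idea behind such a check; it is not the proposed bpf_is_subdir() kfunc, and `dentry_sim` and `is_subdir_sim` are hypothetical names. It assumes, as in the kernel's dentry tree, that the root entry is its own parent.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified directory entry: only a name and a parent pointer. */
struct dentry_sim {
    const char *name;
    struct dentry_sim *parent; /* the root points to itself */
};

/* Return 1 if dentry is subroot itself or lies anywhere below it,
 * 0 otherwise. The cost grows with the depth of dentry in the tree. */
static int is_subdir_sim(const struct dentry_sim *dentry,
                         const struct dentry_sim *subroot)
{
    assert(dentry != NULL && subroot != NULL);
    for (;;) {
        if (dentry == subroot)
            return 1;
        if (dentry->parent == dentry)
            return 0; /* reached the root without meeting subroot */
        dentry = dentry->parent;
    }
}
```

The loop still visits every ancestor, so a check of this shape does roughly the same work as walking the path by hand.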
  59. fanotify-bpf: engage the community
    • Multiple rounds of patch reviews with VFS and BPF maintainers
  60. fanotify-bpf: not landed yet
    • Issues with subtree monitoring
      ◦ Requires per superblock rules, not really local
      ◦ bpf_is_subdir() is not much faster than path walking in BPF
      ◦ Holding subroot in a BPF map could be risky, e.g., it may block umount
    • Antivirus scanners may not need an in-kernel fastpath
    • The alternative, BPF LSM, is good enough in most use cases
  63. What’s next?
    • Present fanotify-bpf at Kernel Recipes [DONE]
    • Benchmark of BPF LSM vs. fanotify-bpf
    • Better solution for subtree monitoring
    • Let the use cases evolve