seccomp: Whereabouts of Light and Darkness

Kenta Tada

April 17, 2021

Transcript

  1. R&D Center, Sony Group Corporation. Copyright 2021 Sony Group Corporation.

    seccomp: Whereabouts of Light and Darkness. The 14th Container-Based Virtualization Information Exchange Meeting. Kenta Tada
  2. About me

    ⚫ Kenta Tada
    ⚫ Software Engineer, Sony Group Corporation

  3. Agenda

    ⚫ Internals of Seccomp Notifier
    ⚫ Use Case of Addfd
    ⚫ Seccomp Notification Preemption

  4. Environment

    ⚫ Linux Kernel v5.12-rc4
    ⚫ LXD 4.12

  5. Internals of Seccomp Notifier

  6. Overview

    ⚫ Install seccomp filter when the process is started.
    ⚫ After the process is started, handle syscalls as follows.

    Process in container (userspace):
    1. Issue a system call, e.g., socket(), mount()

    Seccomp (kernel, cBPF program):
    2. Execute filter
    3. Return “notify”

    Container Manager (userspace):
    4. The container wants to run the system call: ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, req)
    5. Read the system call arguments from /proc/$pid/mem
    6. Validate the system call. If OK, go to 7a. If NG, go to 7b.
    7a. Perform the system call on behalf of the process. (Optional) Prepare for addfd to return fd to the process: ioctl(fd, SECCOMP_IOCTL_NOTIF_ADDFD, addfd)
    7b. Reject the system call
    8a. Set the return value to 0 (success). (Optional) Return fd from the manager
    8b. Set the return value to error code (failure)
    Then reply with ioctl(fd, SECCOMP_IOCTL_NOTIF_SEND, req)

    Seccomp (kernel) back to the process:
    9a. Return 0 (success). (Optional) Get fd created by the manager
    9b. Return error code (failure)

    Note. From “Rust-based, Secure and Lightweight Container Runtime for Embedded Systems” by Manabu Sugimoto, 2021, Cloud Native Rust Day, p. 25 (Presentation Slide), https://sched.co/iLkx
  7. Install seccomp filter when seccomp notifier is used

    do_seccomp()

    case SECCOMP_GET_NOTIF_SIZES: seccomp_get_notif_sizes()
    1. Copy the sizes of the seccomp data structures to user space.

    case SECCOMP_SET_MODE_FILTER: seccomp_set_mode_filter()
    1. Check the exclusivity of NEW_LISTENER and TSYNC
    2. Prepare the file struct for the listener
    3. Check for a duplicated listener in ancestors
    4. Bind the listener fd if the seccomp filter is installed
    5. Error handling for the seccomp notifier
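The user-space side of this installation path can be sketched as follows. This is a minimal illustration, not the code used by LXD or any particular runtime: the helper names `build_notify_filter` and `install_notify_filter` and the constant `NOTIFY_FILTER_LEN` are made up for this example, and the filter deliberately omits the architecture check that a real filter should perform on `seccomp_data.arch`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older UAPI headers; values match include/uapi/linux/seccomp.h. */
#ifndef SECCOMP_RET_USER_NOTIF
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
#endif
#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
#endif

#define NOTIFY_FILTER_LEN 4 /* hypothetical: length of the sketch filter */

/* Build a cBPF program: "notify" for target_nr, allow everything else.
 * Simplification: no seccomp_data.arch check is done here. */
void build_notify_filter(struct sock_filter prog[NOTIFY_FILTER_LEN], int target_nr)
{
    assert(prog != NULL && target_nr >= 0);
    /* Load the system call number. */
    prog[0] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                                           offsetof(struct seccomp_data, nr));
    /* Equal: fall through to "notify". Not equal: skip one instruction. */
    prog[1] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
                                           (unsigned int)target_nr, 0, 1);
    prog[2] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF);
    prog[3] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
}

/* Install the filter and return the listener fd, or -1 with errno set. */
int install_notify_filter(int target_nr)
{
    struct sock_filter insns[NOTIFY_FILTER_LEN];
    struct sock_fprog prog = { .len = NOTIFY_FILTER_LEN, .filter = insns };

    build_notify_filter(insns, target_nr);
    /* Needed unless the caller has CAP_SYS_ADMIN in its user namespace. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
        return -1;
    return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                        SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

A target process would call, for example, `install_notify_filter(SYS_mount)` and then pass the returned listener fd to the Container Manager.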
  8. Check the exclusivity of NEW_LISTENER and TSYNC

    ⚫ TSYNC synchronizes the seccomp filters of the thread group at startup.
      • If it fails, the PID of one of the failing threads is returned.
    ⚫ On the other hand, when NEW_LISTENER is set, the new listener fd is returned.
    ⚫ TSYNC_ESRCH, which makes a TSYNC failure return ESRCH instead of a PID, was introduced so that both flags can be used together.
      • Ex. The Chrome sandbox wanted to use TSYNC for video drivers and NEW_LISTENER to proxy syscalls.
        https://lkml.org/lkml/2020/3/4/813

    From seccomp_set_mode_filter():

        if ((flags & SECCOMP_FILTER_FLAG_TSYNC) &&
            (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) &&
            ((flags & SECCOMP_FILTER_FLAG_TSYNC_ESRCH) == 0))
                return -EINVAL;
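The rule above can be restated as a small user-space function. This is only a sketch mirroring the kernel check shown on the slide; `check_listener_flags` and `demo_flag_checks` are hypothetical names, and the fallback flag values are taken from the kernel UAPI header.

```c
#include <assert.h>
#include <errno.h>
#include <linux/seccomp.h>

/* Fallbacks for older UAPI headers; values match include/uapi/linux/seccomp.h. */
#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
#endif
#ifndef SECCOMP_FILTER_FLAG_TSYNC_ESRCH
#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4)
#endif

/* Hypothetical helper: 0 if the combination is accepted, -EINVAL otherwise. */
int check_listener_flags(unsigned long flags)
{
    int tsync = (flags & SECCOMP_FILTER_FLAG_TSYNC) != 0;
    int listener = (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) != 0;
    int esrch = (flags & SECCOMP_FILTER_FLAG_TSYNC_ESRCH) != 0;

    /* TSYNC and NEW_LISTENER both want to use the return value, so they
     * may only be combined when TSYNC failures are reported as ESRCH. */
    if (tsync && listener && !esrch)
        return -EINVAL;
    return 0;
}

/* Usage sketch: the combinations discussed above. */
void demo_flag_checks(void)
{
    assert(check_listener_flags(SECCOMP_FILTER_FLAG_NEW_LISTENER) == 0);
    assert(check_listener_flags(SECCOMP_FILTER_FLAG_TSYNC |
                                SECCOMP_FILTER_FLAG_NEW_LISTENER) == -EINVAL);
    assert(check_listener_flags(SECCOMP_FILTER_FLAG_TSYNC |
                                SECCOMP_FILTER_FLAG_NEW_LISTENER |
                                SECCOMP_FILTER_FLAG_TSYNC_ESRCH) == 0);
}
```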
  9. Prepare the file struct for the listener

    ⚫ listener is the fd, and the kernel returns it to user space.
      • The target process passes this fd to the tracer process via a UNIX domain socket.
    ⚫ listener_f is the file struct for listener.
      • This file instance is hooked up to an anonymous inode.

    From seccomp_set_mode_filter():

        if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
            listener = get_unused_fd_flags(O_CLOEXEC);
            if (listener < 0) {
                ret = listener;
                goto out_free;
            }

            listener_f = init_listener(prepared);
            …
  10. Issue system calls from user space

    Process in container:
    1. Issue a system call, e.g., socket(), mount()

    Seccomp (cBPF program):
    2. Execute filter
    3. Return “notify”
  11. Issue system calls from the process

    __secure_computing() → __seccomp_filter() → case SECCOMP_RET_USER_NOTIF: seccomp_do_user_notification()

    1. Initialize related structs
    2. Set the state of the seccomp notifier to SECCOMP_NOTIFY_INIT
    3. Wait for a reply from user space
    4a. Continue the syscall when SECCOMP_USER_NOTIF_FLAG_CONTINUE is set
    4b. Modify the return value when SECCOMP_USER_NOTIF_FLAG_CONTINUE is not set
  12. Initialize related structs

    Main components of struct seccomp_knotif:

    Key                              Explanation
    struct task_struct *task         The task whose filter triggered the notification
    u64 id                           The "cookie" for this request. This is unique for this filter.
    const struct seccomp_data *data  The seccomp data
    enum notify_state state          Notification state: SECCOMP_NOTIFY_INIT, SECCOMP_NOTIFY_SENT, or SECCOMP_NOTIFY_REPLIED
    long val                         The return value from Container Manager
    struct completion ready          The completion to manage the status of Seccomp Notifier
    struct list_head addfd           The list of addfd requests
  13. Wait for a reply from user space

    ⚫ The process waits in the TASK_INTERRUPTIBLE state for a reply (SECCOMP_IOCTL_NOTIF_SEND) from Container Manager.

    From seccomp_do_user_notification():

        wait:
            err = wait_for_completion_interruptible(&n.ready);
            mutex_lock(&match->notify_lock);
            if (err == 0) {
                /* Check if we were woken up by a addfd message */
                addfd = list_first_entry_or_null(&n.addfd,
                                                 struct seccomp_kaddfd, list);
                if (addfd && n.state != SECCOMP_NOTIFY_REPLIED) {
                    seccomp_handle_addfd(addfd);
                    mutex_unlock(&match->notify_lock);
                    goto wait;
                }
                ret = n.val;
                err = n.error;
                flags = n.flags;
            }
  14. Container Manager handles syscalls from the process

    Container Manager:
    4. The container wants to run the system call: ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, req)
    5. Read the system call arguments from /proc/$pid/mem
    6. Validate the system call. If OK, go to 7a. If NG, go to 7b.
    7a. Perform the system call on behalf of the process. (Optional) Prepare for addfd to return fd to the process: ioctl(fd, SECCOMP_IOCTL_NOTIF_ADDFD, addfd)
    7b. Reject the system call
    8a. Set the return value to 0 (success). (Optional) Return fd from the manager
    8b. Set the return value to error code (failure)
    Then reply with ioctl(fd, SECCOMP_IOCTL_NOTIF_SEND, req)

    Seccomp:
    9a. Return 0 (success). (Optional) Get fd created by the manager
    9b. Return error code (failure)
  15. Container Manager receives the request from kernel space

    seccomp_notify_ioctl() → case SECCOMP_IOCTL_NOTIF_RECV: seccomp_notify_recv()

    1. Check whether the buffer from user space is cleared
    2. Search for a notification whose state is SECCOMP_NOTIFY_INIT
    3. If one is found, change its state from SECCOMP_NOTIFY_INIT to SECCOMP_NOTIFY_SENT
    4. Copy the data from the waiting process to Container Manager
  16. seccomp_notify_recv()

    ⚫ Checks whether the buffer from user space is cleared
    ⚫ Changes the state from SECCOMP_NOTIFY_INIT to SECCOMP_NOTIFY_SENT

    From seccomp_notify_recv():

        /* Verify that we're not given garbage to keep struct extensible. */
        ret = check_zeroed_user(buf, sizeof(unotif));
        if (ret < 0)
            return ret;
        if (!ret)
            return -EINVAL;
        …
        unotif.id = knotif->id;
        unotif.pid = task_pid_vnr(knotif->task);
        unotif.data = *(knotif->data);

        knotif->state = SECCOMP_NOTIFY_SENT;
        wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
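Because the kernel rejects a non-zeroed buffer with EINVAL, the Container Manager has to clear the request buffer before every receive. A minimal sketch of that side, assuming a valid listener fd; `buffer_is_zeroed` and `recv_notification` are hypothetical helper names, and the first one is only a user-space analogue of `check_zeroed_user()`:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* User-space analogue of check_zeroed_user(): 1 if every byte is zero. */
int buffer_is_zeroed(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}

/* Receive one notification. Returns a heap buffer the caller must free(),
 * or NULL with errno set. */
struct seccomp_notif *recv_notification(int listener_fd)
{
    struct seccomp_notif_sizes sizes;
    struct seccomp_notif *req;
    size_t len;

    /* Ask the kernel how large its structures are (they may grow). */
    if (syscall(SYS_seccomp, SECCOMP_GET_NOTIF_SIZES, 0, &sizes) < 0)
        return NULL;
    len = sizes.seccomp_notif;
    if (len < sizeof(*req))
        len = sizeof(*req);

    /* calloc() hands back zeroed memory, which is what the kernel expects. */
    req = calloc(1, len);
    if (req == NULL)
        return NULL;
    if (ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_RECV, req) < 0) {
        int saved = errno;
        free(req);
        errno = saved;
        return NULL;
    }
    return req;
}
```

When a buffer is reused across iterations of a receive loop, it needs to be cleared again (for example with `memset`) before each `SECCOMP_IOCTL_NOTIF_RECV`.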
  17. Container Manager reads actual data from memory

    ⚫ The system call arguments only contain addresses, so Container Manager needs to read the actual data from the process’s memory via /proc/[PID]/mem or process_vm_readv(2).

    User space sample code: how to read the first argument of mount(2)

        /* Read the memory of the client to get source and dest of mount */
        snprintf(mem_path, sizeof(mem_path), "/proc/%d/mem", req->pid);
        mem_fd = open(mem_path, O_RDONLY);
        …
        /* Read the path of the source directory */
        if (lseek(mem_fd, req->data.args[0], SEEK_SET) < 0) {
            fprintf(stderr, "lseek\n");
            exit(EXIT_FAILURE);
        }
        result = read(mem_fd, source, sizeof(source));
        if (result < 0) {
            fprintf(stderr, "failed to read the source\n");
            exit(EXIT_FAILURE);
        }
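The same read can be wrapped in a small helper that uses pread(2) instead of lseek(2) plus read(2). This is a sketch under stated assumptions: `read_target_string` is a hypothetical name, the string is assumed to be shorter than the output buffer, and a real Container Manager should also confirm with `SECCOMP_IOCTL_NOTIF_ID_VALID` after opening the file that the notification is still alive, since the PID may have been reused.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy up to out_len - 1 bytes from address addr in process pid into out
 * and NUL-terminate the result. Returns the number of bytes read, or -1. */
ssize_t read_target_string(pid_t pid, uint64_t addr, char *out, size_t out_len)
{
    char path[64];
    ssize_t n;
    int fd;

    assert(out != NULL && out_len > 0);
    snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    /* pread() reads at the given offset, i.e. at the target's address. */
    n = pread(fd, out, out_len - 1, (off_t)addr);
    close(fd);
    if (n < 0)
        return -1;
    out[n] = '\0';
    return n;
}
```

For the mount(2) case above, the call would look like `read_target_string(req->pid, req->data.args[0], source, sizeof(source))`.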
  18. Addfd since kernel 5.9

    ⚫ Before Addfd, Seccomp Notifier could not return file descriptors to the process.
    ⚫ Addfd adds the capability for Seccomp to add file descriptors to the process.
    ⚫ Use cases
      • LXD: Enable bpf in unprivileged containers
      • IPv4-IPv6 transition (Netflix): Get a socket from a namespace with global IPv4 reachability.
        https://lwn.net/Articles/821351/
  19. Container Manager prepares the fd on behalf of the process

    seccomp_notify_ioctl() → case EA_IOCTL(SECCOMP_IOCTL_NOTIF_ADDFD): seccomp_notify_addfd()

    1. Copy the data of seccomp_notif_addfd from Container Manager
    2. Check the flags
    3. Prepare seccomp_kaddfd
      3-1. Get the file instance from srcfd of seccomp_notif_addfd
      3-2. Copy newfd_flags to flags of seccomp_kaddfd
      3-3. Set the fd of seccomp_kaddfd from newfd of seccomp_notif_addfd
    4. Notify the process of the new addfd and wait for it to be processed
    5. Remove the handled request
  20. Prepare seccomp_kaddfd

        struct seccomp_notif_addfd {
            __u64 id;
            __u32 flags;
            __u32 srcfd;
            __u32 newfd;
            __u32 newfd_flags;
        };

        struct seccomp_kaddfd {
            struct file *file;
            int fd;
            unsigned int flags;

            /* To only be set on reply */
            int ret;
            struct completion completion;
            struct list_head list;
        };

    Mapping: srcfd → file via fget(srcfd); newfd → fd (if newfd is specified).

    ⚫ Container Manager needs to set at least id and srcfd, where srcfd is the fd created by Container Manager.
    ⚫ If you want to specify newfd, the fd number used by the process, you need to set flags to SECCOMP_ADDFD_FLAG_SETFD.
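Filling in the request on the Container Manager side might look like the sketch below. It assumes UAPI headers from Linux 5.9 or later (for `struct seccomp_notif_addfd` and `SECCOMP_ADDFD_FLAG_SETFD`); `make_addfd` is a hypothetical helper, and using a negative `newfd` to mean "let the kernel pick the fd number" is a convention of this example only.

```c
#include <assert.h>
#include <string.h>
#include <linux/seccomp.h>
#include <linux/types.h>

/* Build an addfd request for notification `id` that hands over `srcfd`.
 * newfd < 0: the kernel chooses the fd number in the target process.
 * newfd >= 0: request that exact fd number via SECCOMP_ADDFD_FLAG_SETFD. */
struct seccomp_notif_addfd make_addfd(__u64 id, int srcfd, int newfd)
{
    struct seccomp_notif_addfd addfd;

    assert(srcfd >= 0);
    memset(&addfd, 0, sizeof(addfd)); /* unused fields must stay zero */
    addfd.id = id;
    addfd.srcfd = (__u32)srcfd;
    if (newfd >= 0) {
        addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
        addfd.newfd = (__u32)newfd;
    }
    return addfd;
}
```

The result is then passed to `ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd)`, whose return value is the fd number installed in the target process.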
  21. Notify the process of the new addfd and wait for it to be processed

    ⚫ In seccomp_notify_addfd(), the process is notified of the new addfd through the completion member of struct seccomp_kaddfd.
    ⚫ The kernel handles addfds while it is waiting for a reply (SECCOMP_IOCTL_NOTIF_SEND) from Container Manager in seccomp_do_user_notification().
    ⚫ After the addfd is processed correctly, seccomp_notify_addfd() returns the number of the fd to Container Manager.
  22. Return to the process in container

    ⚫ Wait until SECCOMP_NOTIFY_REPLIED
      • Handle addfd messages until SECCOMP_NOTIFY_REPLIED

    From seccomp_do_user_notification():

        wait:
            err = wait_for_completion_interruptible(&n.ready);
            mutex_lock(&match->notify_lock);
            if (err == 0) {
                /* Check if we were woken up by a addfd message */
                addfd = list_first_entry_or_null(&n.addfd,
                                                 struct seccomp_kaddfd, list);
                if (addfd && n.state != SECCOMP_NOTIFY_REPLIED) {
                    seccomp_handle_addfd(addfd);
                    mutex_unlock(&match->notify_lock);
                    goto wait;
                }
                ret = n.val;
                err = n.error;
                flags = n.flags;
            }
  23. seccomp_handle_addfd()

    ⚫ seccomp_handle_addfd() installs (or replaces) the fd so that it refers to the file instance created by Container Manager.
      • addfd->ret is the fd number in the container’s process and also the return value of ioctl(SECCOMP_IOCTL_NOTIF_ADDFD, …)
    ⚫ Notify Container Manager of completion

    seccomp_handle_addfd():

        static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
        {
            /*
             * Remove the notification, and reset the list pointers, indicating
             * that it has been handled.
             */
            list_del_init(&addfd->list);
            addfd->ret = receive_fd_replace(addfd->fd, addfd->file, addfd->flags);
            complete(&addfd->completion);
        }
  24. Container Manager sends the response to kernel space

    seccomp_notify_ioctl() → case SECCOMP_IOCTL_NOTIF_SEND: seccomp_notify_send()

    1. Check the flags
    2. Search for the notification that has the id specified by Container Manager
    3. Change the state from SECCOMP_NOTIFY_SENT to SECCOMP_NOTIFY_REPLIED
    4. Notify the waiting process of completion
  25. seccomp_notify_send()

    ⚫ Permit SECCOMP_USER_NOTIF_FLAG_CONTINUE only when no response value is set (error and val are both 0)
    ⚫ Change the state to SECCOMP_NOTIFY_REPLIED
    ⚫ Notify the waiting process of completion

        if (copy_from_user(&resp, buf, sizeof(resp)))
            return -EFAULT;

        if (resp.flags & ~SECCOMP_USER_NOTIF_FLAG_CONTINUE)
            return -EINVAL;

        if ((resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE) &&
            (resp.error || resp.val))
            return -EINVAL;
        …
        knotif->state = SECCOMP_NOTIFY_REPLIED;
        knotif->error = resp.error;
        knotif->val = resp.val;
        knotif->flags = resp.flags;
        complete(&knotif->ready);
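On the Container Manager side, the two kinds of response can be built so that they always pass these checks. The sketch below is illustrative: `validate_response`, `make_error_response`, and `make_continue_response` are hypothetical helpers, and `validate_response` only mirrors the two flag checks shown above.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>
#include <linux/types.h>

/* Fallback for older UAPI headers; value matches include/uapi/linux/seccomp.h. */
#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Mirror of the kernel's flag checks: 0 if acceptable, -EINVAL otherwise. */
int validate_response(const struct seccomp_notif_resp *resp)
{
    assert(resp != NULL);
    if (resp->flags & ~SECCOMP_USER_NOTIF_FLAG_CONTINUE)
        return -EINVAL;
    if ((resp->flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE) &&
        (resp->error || resp->val))
        return -EINVAL;
    return 0;
}

/* Reject the system call: the process sees -1 with errno set to err. */
struct seccomp_notif_resp make_error_response(__u64 id, int err)
{
    struct seccomp_notif_resp resp;

    memset(&resp, 0, sizeof(resp));
    resp.id = id;
    resp.error = -err; /* the kernel expects a negated error number */
    return resp;
}

/* Let the kernel execute the original system call itself. */
struct seccomp_notif_resp make_continue_response(__u64 id)
{
    struct seccomp_notif_resp resp;

    memset(&resp, 0, sizeof(resp)); /* error and val must stay 0 */
    resp.id = id;
    resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
    return resp;
}
```

Setting `val` or `error` together with the CONTINUE flag is the combination that the kernel rejects with EINVAL.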
  26. Use Case of Addfd

  27. Use eBPF in unprivileged containers in LXD

    ⚫ LXD uses Addfd to handle eBPF in unprivileged containers.
    ⚫ When the container uses cgroup v2, eBPF is used for the device controller.
      • https://speakerdeck.com/kentatada/cgroup-v2-internals
    ⚫ But eBPF needs certain capabilities or root privilege, even in a rootless container.
  28. When the command is BPF_PROG_LOAD

    ⚫ Get the fd from the bpf syscall
    ⚫ Prepare seccomp_notif_addfd
    ⚫ Send addfd to kernel space
    ⚫ After this snippet, resp->val is sent to kernel space, and it is finally used as the return value seen by the process in the container.

    BPF handling code from https://github.com/lxc/lxd/blob/lxd-4.12/lxd/seccomp/seccomp.go

        bpf_prog_fd = bpf(cmd, &new_attr, sizeof(new_attr));
        …
        addfd.srcfd = bpf_prog_fd;
        addfd.id = req->id;
        addfd.flags = 0;
        ret = ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
        if (ret < 0)
            return -errno;

        resp->val = ret;
  29. Seccomp Notification Preemption

  30. Race condition when syscalls are interrupted

    ⚫ As we saw in the kernel code, the process in the container waits for a reply in the TASK_INTERRUPTIBLE state.
    ⚫ If the syscall is interrupted by a signal while it is waiting, a problem occurs.
    ⚫ Recently, Go adopted asynchronous preemption so that goroutines do not run for more than 10 ms without yielding, for example during GC.
    ⚫ To do so, the Go runtime sends the SIGURG signal, which exacerbates this situation.
  31. How to fix this issue in the community

    ⚫ The community is trying to add wait_killable semantics to Seccomp Notifier.
    ⚫ It allows the supervisor to set a flag that moves the process into a state where it can only be interrupted by terminating (fatal) signals, as opposed to all signals.
      https://lkml.kernel.org/lkml/20210318051733.2544-6-sargun@sargun.me/T/
  32. Key takeaways

    ⚫ Seccomp Notifier enables fine-grained access control.
    ⚫ Addfd helps your unprivileged container get an fd from Container Manager.
    ⚫ The current Seccomp Notifier has a problem when syscalls are interrupted by signals.
      • Go in particular exacerbates this issue.
      • It will be fixed in the future.
  33. SONY is a registered trademark of Sony Group Corporation. Names of Sony products and services are the registered trademarks and/or trademarks of Sony Group Corporation or its Group companies. Other company names and product names are registered trademarks and/or trademarks of the respective companies.