seccomp: Whereabouts of Light and Darkness

Kenta Tada

April 17, 2021
Transcript

  1. R&D Center, Sony Group Corporation
    Copyright 2021 Sony Group Corporation
    seccomp: Whereabouts of Light and Darkness
    14th Container Virtualization Information Exchange Meeting
    Kenta Tada


  2. About me
    ⚫ Kenta Tada
    ⚫ Software Engineer, Sony Group Corporation

  3. Agenda
    ⚫ Internal of Seccomp Notifier
    ⚫ Use Case of Addfd
    ⚫ Seccomp Notification Preemption

  4. Environment
    ⚫ Linux Kernel v5.12-rc4
    ⚫ LXD 4.12

  5. Internal of Seccomp Notifier

  6. Overview
    ⚫ Install the seccomp filter when the process is started.
    ⚫ After the process is started, syscalls are handled as follows.
    Process in the container (userspace):
    1. Issue a system call
    e.g., socket(), mount()
    Seccomp and its cBPF program (kernel):
    2. Execute filter
    3. Return “notify”
    Container Manager (userspace):
    4. Receive the notification that the container wants to run the system call
    ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, req)
    5. Read the system call arguments from
    /proc/$pid/mem
    6. Validate the system call
    If OK, go to 7a. If NG, go to 7b
    7a. Perform the system call on behalf of the process
    (Optional) Prepare for addfd to return fd to the process
    ioctl(fd, SECCOMP_IOCTL_NOTIF_ADDFD, addfd)
    7b. Reject the system call
    8a. Set the return value to 0 (success)
    (Optional) Return fd from the manager
    8b. Set the return value to error code (failure)
    ioctl(fd, SECCOMP_IOCTL_NOTIF_SEND, req) (used for both 8a and 8b)
    Seccomp (kernel), returning to the process:
    9a. Return 0 (success)
    (Optional) Get fd created by the manager
    9b. Return error code (failure)
    Note. From “Rust-based, Secure and Lightweight Container Runtime
    for Embedded Systems" by Manabu Sugimoto, 2021, Cloud Native
    Rust Day, p. 25 (Presentation Slide), https://sched.co/iLkx


  7. Install seccomp filter when seccomp notifier is used
    do_seccomp()
      case SECCOMP_GET_NOTIF_SIZES:
        seccomp_get_notif_sizes()
          1. Copy the size of seccomp data structures to user space.
      case SECCOMP_SET_MODE_FILTER:
        seccomp_set_mode_filter()
          1. Check the exclusivity of NEW_LISTENER and TSYNC
          2. Prepare for the file struct for the listener
          3. Check the duplicated listener in ancestors
          4. Bind the listener fd if seccomp filter is installed
          5. Error handling for seccomp notifier


  8. Check the exclusivity of NEW_LISTENER and TSYNC
    ⚫ TSYNC synchronizes the seccomp filters of a thread group when the filter is installed.
    • If it fails, the PID of one of the failing threads is returned.
    ⚫ On the other hand, when NEW_LISTENER is set, the call returns the new
    listener fd.
    ⚫ TSYNC_ESRCH, which makes a TSYNC failure return ESRCH instead, was introduced so that both flags can be used together.
    • Ex. The Chrome sandbox wanted to use TSYNC for video drivers and NEW_LISTENER to proxy syscalls.
    if ((flags & SECCOMP_FILTER_FLAG_TSYNC) &&
        (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) &&
        ((flags & SECCOMP_FILTER_FLAG_TSYNC_ESRCH) == 0))
        return -EINVAL;
    From seccomp_set_mode_filter()
    https://lkml.org/lkml/2020/3/4/813


  9. Prepare for the file struct for the listener
    ⚫ listener is the fd that the kernel returns to user space.
    • The target process passes this fd to the tracer process, for example via a UNIX domain socket.
    ⚫ listener_f is the file struct for the listener.
    • This file instance is hooked up to an anonymous inode.
    if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
        listener = get_unused_fd_flags(O_CLOEXEC);
        if (listener < 0) {
            ret = listener;
            goto out_free;
        }
        listener_f = init_listener(prepared);
        /* error handling omitted in this excerpt */
    }
    From seccomp_set_mode_filter()


  10. Issue system calls from user space
    Process in the container:
    1. Issue a system call
    e.g., socket(), mount()
    Seccomp (cBPF program):
    2. Execute filter
    3. Return “notify”


  11. Issue system calls from the process
    __secure_computing()
      __seccomp_filter()
        case SECCOMP_RET_USER_NOTIF:
          seccomp_do_user_notification()
            1. Initialize related structs
            2. Set up a state of seccomp notifier as SECCOMP_NOTIFY_INIT
            3. Wait for a reply from user space
            4a. Continue the syscall when SECCOMP_USER_NOTIF_FLAG_CONTINUE is set
            4b. Modify the return value when SECCOMP_USER_NOTIF_FLAG_CONTINUE is not set


  12. Initialize related structs
    Main components of struct seccomp_knotif (Key: Explanation)
    ⚫ struct task_struct *task: The task whose filter triggered the notification
    ⚫ u64 id: The "cookie" for this request. This is unique for this filter.
    ⚫ const struct seccomp_data *data: The seccomp data
    ⚫ enum notify_state state: Notification states.
    • SECCOMP_NOTIFY_INIT
    • SECCOMP_NOTIFY_SENT
    • SECCOMP_NOTIFY_REPLIED
    ⚫ long val: The return value from Container Manager
    ⚫ struct completion ready: The completion to manage the status of Seccomp Notifier
    ⚫ struct list_head addfd: The list of addfd requests


  13. Wait for a reply from user space
    ⚫ The process waits in the TASK_INTERRUPTIBLE state for a reply (SECCOMP_IOCTL_NOTIF_SEND)
    from the Container Manager.
    wait:
        err = wait_for_completion_interruptible(&n.ready);
        mutex_lock(&match->notify_lock);
        if (err == 0) {
            /* Check if we were woken up by a addfd message */
            addfd = list_first_entry_or_null(&n.addfd, struct seccomp_kaddfd, list);
            if (addfd && n.state != SECCOMP_NOTIFY_REPLIED) {
                seccomp_handle_addfd(addfd);
                mutex_unlock(&match->notify_lock);
                goto wait;
            }
            ret = n.val;
            err = n.error;
            flags = n.flags;
        }
    From seccomp_do_user_notification()


  14. Container Manager handles syscalls from the process
    Container Manager:
    4. Receive the notification that the container wants to run the system call
    ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, req)
    5. Read the system call arguments from
    /proc/$pid/mem
    6. Validate the system call
    If OK, go to 7a. If NG, go to 7b
    7a. Perform the system call on behalf of the process
    (Optional) Prepare for addfd to return fd to the process
    ioctl(fd, SECCOMP_IOCTL_NOTIF_ADDFD, addfd)
    7b. Reject the system call
    8a. Set the return value to 0 (success)
    (Optional) Return fd from the manager
    8b. Set the return value to error code (failure)
    ioctl(fd, SECCOMP_IOCTL_NOTIF_SEND, req) (used for both 8a and 8b)
    Seccomp:
    9a. Return 0 (success)
    (Optional) Get fd created by the manager
    9b. Return error code (failure)


  15. Container Manager receives the request from kernel space
    seccomp_notify_ioctl()
      case SECCOMP_IOCTL_NOTIF_RECV:
        seccomp_notify_recv()
          1. Check whether the buffer from user space is cleared
          2. Search for a notification whose state is SECCOMP_NOTIFY_INIT
          3. Change the state from SECCOMP_NOTIFY_INIT to SECCOMP_NOTIFY_SENT if it is found
          4. Copy the data from the waiting process to Container Manager


  16. seccomp_notify_recv()
    ⚫ Check whether the buffer from user space is cleared
    ⚫ Change the state from SECCOMP_NOTIFY_INIT to
    SECCOMP_NOTIFY_SENT
    /* Verify that we're not given garbage to keep struct extensible. */
    ret = check_zeroed_user(buf, sizeof(unotif));
    if (ret < 0)
        return ret;
    if (!ret)
        return -EINVAL;

    unotif.id = knotif->id;
    unotif.pid = task_pid_vnr(knotif->task);
    unotif.data = *(knotif->data);
    knotif->state = SECCOMP_NOTIFY_SENT;
    wake_up_poll(&filter->wqh, EPOLLOUT | EPOLLWRNORM);
    From seccomp_notify_recv()


  17. Container Manager reads actual data from memory
    ⚫ The Container Manager needs to read the actual data at the
    address in the process's memory via /proc/[PID]/mem or
    process_vm_readv(2).
    /* Read the memory of the client to get source and dest of mount */
    snprintf(mem_path, sizeof(mem_path), "/proc/%d/mem", req->pid);
    mem_fd = open(mem_path, O_RDONLY);
    if (mem_fd < 0) {
        fprintf(stderr, "failed to open %s\n", mem_path);
        exit(EXIT_FAILURE);
    }

    /* Read the path of the source directory */
    if (lseek(mem_fd, req->data.args[0], SEEK_SET) < 0) {
        fprintf(stderr, "lseek\n");
        exit(EXIT_FAILURE);
    }
    result = read(mem_fd, source, sizeof(source));
    if (result < 0) {
        fprintf(stderr, "failed to read the source\n");
        exit(EXIT_FAILURE);
    }
    User space sample code: How to read the first argument of mount(2)


  18. Addfd since kernel 5.9
    ⚫ Before Addfd, Seccomp Notifier could not return file descriptors to the process.
    ⚫ Addfd lets the Container Manager add file descriptors to the
    process through Seccomp.
    ⚫ Use cases
    • LXD: Enable bpf in unprivileged containers
    • IPv4-IPv6 transition (Netflix): Get a socket from a namespace with
    global IPv4 reachability.
    https://lwn.net/Articles/821351/


  19. Container Manager prepares for the fd instead of the process
    seccomp_notify_ioctl()
      case EA_IOCTL(SECCOMP_IOCTL_NOTIF_ADDFD):
        seccomp_notify_addfd()
          1. Copy the data of seccomp_notif_addfd from Container Manager
          2. Check the flags
          3. Prepare for seccomp_kaddfd
            3-1. Get the file instance from srcfd of seccomp_notif_addfd
            3-2. Copy newfd_flags to flags of seccomp_kaddfd
            3-3. Set the fd of seccomp_kaddfd from newfd of seccomp_notif_addfd
          4. Notify the process of the new addfd and wait to be processed
          5. Remove the handled request


  20. Prepare for seccomp_kaddfd
    struct seccomp_notif_addfd {
        __u64 id;
        __u32 flags;
        __u32 srcfd;
        __u32 newfd;
        __u32 newfd_flags;
    };
    struct seccomp_kaddfd {
        struct file *file;      /* fget(srcfd) */
        int fd;                 /* set from newfd, if newfd is specified */
        unsigned int flags;
        /* To only be set on reply */
        int ret;
        struct completion completion;
        struct list_head list;
    };
    ⚫ The Container Manager needs to set at least id and srcfd, where srcfd is the fd
    created by the Container Manager.
    ⚫ To specify newfd, the fd number to be used in the process, you
    need to set SECCOMP_ADDFD_FLAG_SETFD in flags.


  21. Notify the process of the new addfd and wait to be processed
    ⚫ seccomp_notify_addfd() notifies the process of the new
    addfd and then waits, using the completion member of struct
    seccomp_kaddfd, until the request has been processed.
    ⚫ The kernel handles addfd requests while the process is waiting for a
    reply (SECCOMP_IOCTL_NOTIF_SEND) from the Container
    Manager in seccomp_do_user_notification().
    ⚫ After the addfd request has been processed successfully,
    seccomp_notify_addfd() returns the fd number to the
    Container Manager.


  22. Return to the process in container
    ⚫ Wait until the state becomes SECCOMP_NOTIFY_REPLIED
    • Handle addfd messages until the state becomes SECCOMP_NOTIFY_REPLIED
    wait:
        err = wait_for_completion_interruptible(&n.ready);
        mutex_lock(&match->notify_lock);
        if (err == 0) {
            /* Check if we were woken up by a addfd message */
            addfd = list_first_entry_or_null(&n.addfd, struct seccomp_kaddfd, list);
            if (addfd && n.state != SECCOMP_NOTIFY_REPLIED) {
                seccomp_handle_addfd(addfd);
                mutex_unlock(&match->notify_lock);
                goto wait;
            }
            ret = n.val;
            err = n.error;
            flags = n.flags;
        }
    From seccomp_do_user_notification()


  23. seccomp_handle_addfd()
    ⚫ seccomp_handle_addfd() installs the file instance created by the
    Container Manager into the process, replacing the target fd if one was specified.
    • addfd->ret is the fd number in the container's process; it is also the
    return value of ioctl(SECCOMP_IOCTL_NOTIF_ADDFD, …)
    ⚫ Notify Container Manager of completion
    static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd)
    {
        /*
         * Remove the notification, and reset the list pointers, indicating
         * that it has been handled.
         */
        list_del_init(&addfd->list);
        addfd->ret = receive_fd_replace(addfd->fd, addfd->file, addfd->flags);
        complete(&addfd->completion);
    }
    seccomp_handle_addfd()


  24. Container Manager sends the response to kernel space
    seccomp_notify_ioctl()
      case SECCOMP_IOCTL_NOTIF_SEND:
        seccomp_notify_send()
          1. Check the flags
          2. Search for the notification that has the id specified by Container Manager
          3. Change the state from SECCOMP_NOTIFY_SENT to SECCOMP_NOTIFY_REPLIED
          4. Notify the waiting process of completion


  25. seccomp_notify_send()
    ⚫ Permit SECCOMP_USER_NOTIF_FLAG_CONTINUE only when error and
    val are both 0
    ⚫ Change the state to SECCOMP_NOTIFY_REPLIED
    ⚫ Notify the waiting process of completion
    if (copy_from_user(&resp, buf, sizeof(resp)))
        return -EFAULT;
    if (resp.flags & ~SECCOMP_USER_NOTIF_FLAG_CONTINUE)
        return -EINVAL;
    if ((resp.flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE) &&
        (resp.error || resp.val))
        return -EINVAL;
    knotif->state = SECCOMP_NOTIFY_REPLIED;
    knotif->error = resp.error;
    knotif->val = resp.val;
    knotif->flags = resp.flags;
    complete(&knotif->ready);


  26. Use Case of Addfd

  27. Use eBPF in unprivileged containers in LXD
    ⚫ LXD uses Addfd to handle eBPF in unprivileged containers.
    ⚫ When the container uses cgroup v2, eBPF is used for the
    device controller.
    • https://speakerdeck.com/kentatada/cgroup-v2-internals
    ⚫ However, eBPF requires certain capabilities or root privileges, even
    in a rootless container.


  28. When the command is BPF_PROG_LOAD
    ⚫ Get the fd from the bpf syscall
    ⚫ Prepare seccomp_notif_addfd
    ⚫ Send addfd to kernel space
    ⚫ After this snippet, resp->val is sent to kernel space, where it
    finally becomes the return value of the syscall in the container process.
    bpf_prog_fd = bpf(cmd, &new_attr, sizeof(new_attr));

    addfd.srcfd = bpf_prog_fd;
    addfd.id = req->id;
    addfd.flags = 0;
    ret = ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
    if (ret < 0)
        return -errno;
    resp->val = ret;
    Managing BPF code from https://github.com/lxc/lxd/blob/lxd-4.12/lxd/seccomp/seccomp.go


  29. Seccomp Notification Preemption

  30. Race condition when syscalls are interrupted.
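    ⚫ As we saw in the kernel code, the process in the container waits
    for a reply in the TASK_INTERRUPTIBLE state.
    ⚫ If the syscall is interrupted by a signal while the process is waiting,
    the wait is aborted even though the Container Manager may still be handling the notification.
    ⚫ Recently, Golang adopted asynchronous preemption so that a goroutine
    running for more than 10ms can be preempted, for example during GC.
    ⚫ To do so, the Golang runtime sends the SIGURG signal, which
    exacerbates this situation.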


  31. How to fix this issue in the community
    ⚫ The community is working on adding wait_killable semantics to
    Seccomp Notifier.
    ⚫ This allows the supervisor to set a flag that moves the process
    into a state where it can only be interrupted by terminating signals,
    as opposed to all signals.
    https://lkml.kernel.org/lkml/[email protected]/T/


  32. Key takeaways
    ⚫ Seccomp Notifier enables fine-grained access control.
    ⚫ Addfd helps your unprivileged container to get an fd.
    ⚫ The current Seccomp Notifier has a problem when syscalls are interrupted by signals.
    • In particular, Golang exacerbates this issue.
    • It is expected to be fixed in the future.


  33. SONY is a registered trademark of Sony Group Corporation.
    Names of Sony products and services are the registered trademarks and/or trademarks of Sony Group Corporation or its Group companies.
    Other company names and product names are registered trademarks and/or trademarks of the respective companies.
