Embedded Container Runtime using Linux capabilities, seccomp, cgroups

Kenta Tada
September 09, 2020

Transcript

  1. Copyright 2020 Sony Corporation
    Embedded Container Runtime
    using Linux capabilities, seccomp, cgroups
    CloudNative Days Tokyo 2020
    Kenta Tada
    R&D Center
    Sony Corporation

  2. About me
    ⚫Kenta Tada
    ⚫Software Engineer, Sony

  3. Agenda
    ⚫Overview of Container Runtime
    ⚫Introduction to Embedded Container Runtime
    ⚫Case Study of OSS activities

  4. Overview of Container Runtime

  5. What is Container Runtime
    ⚫A Container Runtime spawns and runs containers
    according to the OCI specification.
    ⚫OCI specification
    • https://github.com/opencontainers/runtime-spec
    ⚫Container Runtime needs
    • config.json which is a configuration file for container guest
    –environment variables, Linux namespace, seccomp, etc.
    • rootfs for a container guest
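
As a rough illustration of what such a config.json carries, the sketch below builds a small subset of it as a Python dict. The field names follow the OCI runtime-spec; the concrete values (command, environment, allowed syscalls) are assumptions chosen for the example, not settings from the talk.

```python
import json


def minimal_oci_config(args, env, rootfs="rootfs"):
    """Build an illustrative subset of an OCI config.json as a dict."""
    return {
        "ociVersion": "1.0.2",
        # What to run inside the container guest.
        "process": {"args": list(args), "env": list(env), "cwd": "/"},
        # Where the container's root filesystem lives, relative to the bundle.
        "root": {"path": rootfs, "readonly": True},
        "linux": {
            # Each entry asks the runtime to create a new namespace of that type.
            "namespaces": [{"type": "pid"}, {"type": "mount"}, {"type": "network"}],
            # Deny by default, then allow a short (hypothetical) syscall list.
            "seccomp": {
                "defaultAction": "SCMP_ACT_ERRNO",
                "syscalls": [
                    {"names": ["read", "write", "exit_group"],
                     "action": "SCMP_ACT_ALLOW"}
                ],
            },
        },
    }


config = minimal_oci_config(["/bin/app"], ["PATH=/usr/bin:/bin"])
text = json.dumps(config, indent=2)  # this text is what would be written to config.json
```

A real config.json has many more sections (mounts, capabilities, hooks, resources), which is the burden the later slides address.
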

  6. Container Runtime in Kubernetes
    [Diagram: a Control Plane (etcd, kube-apiserver,
    kube-scheduler, kube-controller-manager) and a Node. The Node
    holds kubelet, a High Level Runtime (Ex. containerd), a
    Low Level Runtime (Ex. runc), and a Pod (Sample Application).
    One area is marked "Today’s Topic".]

  7. Introduction to Embedded Container Runtime

  8. Why we need a container
    ⚫For edge, we use a container technology
    • as a light-weight SandBox
    • to package an application along with its dependencies
    [Diagram: a Server stack (Hardware, Host OS, dockerd, Apps1,
    Apps2) next to an Edge stack (Hardware, Host OS, Apps1, Apps2),
    with labels "Isolation" and "Wifi/Bluetooth/LTE/Ethernet".]

  9. Software Stack of Embedded Container Runtime
    [Diagram: software stacks compared by Isolation.
    Hypervisor: Hardware, Hypervisor, one Guest OS per app
    (Apps1, Apps2).
    Docker: Hardware, Host OS, dockerd, Apps1, Apps2.
    Host OS.]

  10. Software Stack of Embedded Container Runtime
    [Diagram: software stacks compared by Isolation.
    Hypervisor: Hardware, Hypervisor, one Guest OS per app
    (Apps1, Apps2).
    Docker: Hardware, Host OS, dockerd, Apps1, Apps2.
    Embedded Container Runtime: Hardware, Host OS, Embedded
    Container Runtime, Apps1, Apps2.
    Host OS.]

  11. Embedded Container Runtime with Kubernetes
    ⚫RuntimeClass == runC
    [Diagram: a Node with kubelet, High Level Runtime (containerd),
    Low Level Runtime (runC), and Pod (Application).]

  12. Embedded Container Runtime with Kubernetes
    ⚫RuntimeClass == Embedded Container Runtime
    [Diagram: a Node with kubelet and High Level Runtime
    (containerd); a Low Level Runtime (runC) with its Pod
    (Application); and the Embedded Container Runtime, containing a
    Container Launcher, a Low Level Runtime (runC), and a (WIP)
    Rust based Low Level Runtime, with its Pod (Application).]

  13. Main features of Embedded Container Runtime
    ⚫Config Generator
    ⚫Light-weight execution environment
    ⚫Resource Control
    ⚫Tracing
    ⚫Security
    ⚫Debug
    ⚫Fast container boot
    ⚫Flash-friendly environment
    ⚫Realtime support
    ⚫Monitoring

  14. Main features of Embedded Container Runtime
    ⚫Config Generator
    ⚫Light-weight execution environment
    ⚫Resource Control
    ⚫Tracing
    ⚫Security
    ⚫Debug
    ⚫Fast container boot
    ⚫Flash-friendly environment
    ⚫Realtime support
    ⚫Monitoring

  15. Config Generator
    ⚫Application developers want to concentrate on their
    application, but “config.json” has many items.
    [Diagram, three panels:
    Linux Capabilities: examples are CAP_NET_ADMIN (allows various
    network-related operations) and CAP_SYS_TIME (allows setting
    the system clock).
    seccomp: write() calls from process1 and process2 crossing
    from User to Kernel.
    Linux namespace: a Mount Namespace where /box1 has its own
    /bin, /usr, /sbin beside those of /, and a PID Namespace with
    its own PID numbering.]
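
To make the capability panel concrete, here is a minimal sketch that decodes a capability bitmask such as the `CapEff` line of `/proc/<pid>/status`. The bit positions come from `<linux/capability.h>`; the table is deliberately partial, and the function names are made up for this example.

```python
# Bit positions from <linux/capability.h>; only a few capabilities are listed.
CAP_BITS = {
    12: "CAP_NET_ADMIN",  # various network-related operations
    13: "CAP_NET_RAW",    # raw and packet sockets
    21: "CAP_SYS_ADMIN",
    25: "CAP_SYS_TIME",   # setting the system clock
    27: "CAP_MKNOD",
}


def decode_caps(mask):
    """Return the names of the known capabilities set in a bitmask."""
    return [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]


def effective_caps(status_text):
    """Decode the CapEff line from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(int(line.split()[1], 16))
    return []
```

On a Linux host, `effective_caps(open("/proc/self/status").read())` lists which of these capabilities the current process holds.
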

  16. Separation of concerns
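    ⚫Config Generator helps application developers set up
    containers without any OS knowledge.
    [Diagram: inside the Embedded Container Runtime, the Config
    Generator takes an Application Manifest and a System Manifest
    and produces config.json. The Container Runtime uses
    config.json and the Container’s rootfs to start the
    Container Guest.]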
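
The split between the two manifests can be sketched as below. The talk does not show the manifest formats, so every manifest key here (`command`, `env`, `capabilities`, `allowed_capabilities`, `namespaces`, `rootfs`) is a hypothetical stand-in; only the output field names follow the OCI runtime-spec.

```python
def generate_config(app_manifest, system_manifest):
    """Merge a (hypothetical) application manifest and system manifest
    into an OCI-style config dict."""
    requested = set(app_manifest.get("capabilities", []))
    allowed = set(system_manifest.get("allowed_capabilities", []))
    rejected = requested - allowed
    if rejected:
        # The system side, not the application, has the final say.
        raise ValueError("capabilities not allowed: " + ", ".join(sorted(rejected)))
    caps = sorted(requested)
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": list(app_manifest["command"]),
            "env": list(app_manifest.get("env", [])),
            "cwd": "/",
            "capabilities": {"bounding": caps, "effective": caps, "permitted": caps},
        },
        "root": {"path": system_manifest.get("rootfs", "rootfs"), "readonly": True},
        "linux": {
            "namespaces": [{"type": t} for t in system_manifest.get("namespaces", [])]
        },
    }


app = {"command": ["/bin/app"], "capabilities": ["CAP_NET_RAW"]}
system = {"allowed_capabilities": ["CAP_NET_RAW", "CAP_SYS_TIME"],
          "namespaces": ["pid", "mount"]}
generated = generate_config(app, system)
```

The application author only states what the application needs; OS-level choices such as namespaces stay in the system manifest.
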

  17. Light-weight execution environment
    ⚫Resource constraints in the embedded system
    • CPU load
    • storage size
    • memory size
    ⚫In particular, container images are large.

  18. What technologies help optimization
    ⚫CPU
    • Config Generator checks the configuration to improve
    performance.
    –Some settings degrade system performance for
    processes inside the container.
    ⚫Storage
    • Bind mount
    –Reduces storage use by sharing the same files
    • (WIP) Deduplicate storage for Embedded Container Platform
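
A bind mount in config.json could look like the sketch below, which shares host directories read-only with the container instead of copying them into its image. The `mounts` entry fields follow the OCI runtime-spec; the helper name and the choice of `/usr/lib` are assumptions for illustration.

```python
def add_shared_bind_mounts(config, host_dirs):
    """Append read-only bind mounts so the container reuses host files."""
    mounts = config.setdefault("mounts", [])
    for path in host_dirs:
        mounts.append({
            "destination": path,         # path seen inside the container
            "type": "bind",
            "source": path,              # same path on the host
            "options": ["rbind", "ro"],  # recursive bind, read-only
        })
    return config


shared = add_shared_bind_mounts({"ociVersion": "1.0.2"}, ["/usr/lib"])
```

Because the same files back every container, they are stored once and their page cache is shared as well.
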

  19. What technologies help optimization
    ⚫Memory
    • Bind mount
    –Reduces memory use by sharing the page cache
    • KSM (Kernel Samepage Merging)
    –Reduces memory use by sharing identical anonymous pages
    –(WIP) Control KSM for fine grained scan.
    –(WIP) How to set up madvise(2)
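
KSM only scans memory that a process has opted in with madvise(2). A minimal sketch of that opt-in, assuming Linux and Python 3.8 or later (which exposes `mmap.madvise` and `MADV_MERGEABLE`); how the talk's runtime sets this up is not shown, so this is only the basic mechanism.

```python
import mmap


def mark_mergeable(size):
    """Map private anonymous memory and ask KSM to consider it for merging.

    Returns (region, ok). ok is False when the advice is unavailable,
    for example on a kernel built without KSM.
    """
    # KSM works on private anonymous pages, so do not use the MAP_SHARED default.
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advice = getattr(mmap, "MADV_MERGEABLE", None)
    if advice is None:
        return region, False
    try:
        region.madvise(advice)
    except OSError:
        return region, False
    return region, True


region, ok = mark_mergeable(mmap.PAGESIZE * 4)
```

Merging also requires the KSM daemon to be running, which is controlled through `/sys/kernel/mm/ksm/run`.
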

  20. Resource Control
    ⚫Use cgroupv1
    • We use cgroup as is.
    • But we sometimes ran into issues in our use case.
    –See https://speakerdeck.com/kentatada/container-tracer-using-oci-hooks-on-kubernetes?slide=17
    ⚫(WIP) cgroupv2 support for Embedded Container Runtime
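
Supporting both versions starts with telling them apart. One common check, sketched here, is that a cgroup v2 (unified) mount has a `cgroup.controllers` file at its root while a v1 mount does not. The function name and the temporary directories used for the demonstration are assumptions.

```python
import os
import tempfile


def cgroup_version(mount_point="/sys/fs/cgroup"):
    """Return 2 for a unified (v2) hierarchy, 1 for v1, None if not mounted."""
    if os.path.exists(os.path.join(mount_point, "cgroup.controllers")):
        return 2
    if os.path.isdir(mount_point):
        return 1
    return None


# Demonstration on fake directory trees instead of the real /sys/fs/cgroup.
with tempfile.TemporaryDirectory() as fake_v2, tempfile.TemporaryDirectory() as fake_v1:
    open(os.path.join(fake_v2, "cgroup.controllers"), "w").close()
    detected_v2 = cgroup_version(fake_v2)
    detected_v1 = cgroup_version(fake_v1)
    detected_none = cgroup_version(os.path.join(fake_v1, "missing"))
```

Control file names also differ between the versions; for example, the memory limit is `memory.limit_in_bytes` in v1 and `memory.max` in v2.
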

  21. Tracing
    ⚫We want to dynamically know the behavior of the
    application for security and safety.
    • Needed Linux Capabilities
    • Correct file permissions
    • Executed syscall list to set up seccomp
    • Page fault occurrence in the critical code
    ⚫We need a light-weight and secure tool for embedded.
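
As an example of how a traced syscall list feeds seccomp, the sketch below turns the observed syscall names into an allowlist in the OCI `linux.seccomp` format. The tracer's real output format is not shown in the talk, so the input here is assumed to be a plain list of names.

```python
def seccomp_profile_from_trace(traced_syscalls):
    """Build a deny-by-default seccomp section that allows only the
    syscalls seen while tracing the application."""
    names = sorted(set(traced_syscalls))  # drop duplicates, keep output stable
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [{"names": names, "action": "SCMP_ACT_ALLOW"}],
    }


profile = seccomp_profile_from_trace(["write", "read", "openat", "read"])
```

An allowlist built this way is only as complete as the traced run, so rarely used code paths need to be exercised during tracing.
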

  22. Light-weight secure tracer for container
    ⚫ftrace-based rootless tracer using OCI hook
    • This tracer sets up ftrace in the prestart hook of the Container Runtime.
    –See https://speakerdeck.com/kentatada/debug-application-inside-kubernetes-using-linux-kernel-tools
    • Other events can be traced in the same way as syscalls.
    ⚫Support operations per container
    • We had no way to specify OCI hook per container on Kubernetes.
    • We merged the patch for operations per container to mainline.
    –https://github.com/containerd/cri/pull/1436
    • We can control our tracer per container since containerd 1.4.0
    release.
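
For readers unfamiliar with OCI hooks, this sketch shows the two ends of one: registering a prestart hook in config.json, and the hook reading the container state that the runtime passes on stdin. The `hooks.prestart` entry and the state fields (`id`, `pid`) follow the OCI runtime-spec; the hook path and arguments are hypothetical, and the actual ftrace setup is left out.

```python
import io
import json


def add_prestart_hook(config, path, args):
    """Register a program to run at the runtime's prestart hook point."""
    hooks = config.setdefault("hooks", {})
    hooks.setdefault("prestart", []).append({"path": path, "args": list(args)})
    return config


def container_pid_from_state(stream):
    """Inside the hook: parse the container state JSON given on stdin."""
    state = json.load(stream)
    # The pid identifies the container's init process, which is what a
    # tracer would need to know in order to filter events.
    return state["id"], state["pid"]


hooked = add_prestart_hook({}, "/usr/local/bin/tracer-hook", ["tracer-hook", "start"])
sample_state = io.StringIO(
    '{"ociVersion": "1.0.2", "id": "c1", "status": "created", "pid": 4242, "bundle": "/b"}'
)
container_id, container_pid = container_pid_from_state(sample_state)
```

In a real hook the stream would be `sys.stdin` rather than the sample text used here.
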

  23. Security
    ⚫Embedded systems also need root privilege.
    ⚫On the other hand, we don’t want to give every
    application root privilege.
    ⚫In particular, embedded applications sometimes access
    devices directly.
    –mount(2)
    –mknod(2)
    –Access GPIO

  24. Transition of our secure settings
    ⚫root privilege + Linux Capabilities + prior seccomp
    • Linux Capabilities cannot provide fine-grained access control.
    –Ex. Both ping and ARP spoofing need CAP_NET_RAW
    • seccomp only allows or denies syscalls; it does not grant privileges.
    ⚫User namespace
    • And what is needed to provide correct access control???
    [Diagram: Apps on a Host OS on Hardware, shown without and with
    Isolation, with the labels "Linux Capabilities" and
    "user namespace".]

  25. Fine Grained Access Control
    ⚫Implement Fine Grained Access Control (FGAC) using seccomp notifier
    ⚫The FGAC server decides, on behalf of applications, whether
    to allow their syscalls.
    [Diagram: App1 and App2 in user space, the Kernel below, and
    the FGAC server in user space.]
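
The kind of decision such a server could make is sketched below for socket(2), using the earlier CAP_NET_RAW example: allow the raw ICMP socket that ping needs, deny the packet sockets that ARP spoofing needs. The policy itself is an assumption for illustration, not the rule set from the talk, and the notification plumbing is omitted.

```python
import socket

SOCK_TYPE_MASK = 0xF  # the low bits of socket()'s type argument hold the socket type
AF_PACKET = getattr(socket, "AF_PACKET", 17)  # 17 on Linux


def decide_socket(domain, sock_type, protocol):
    """Return True to let a socket(2) call proceed, False to deny it."""
    if domain == AF_PACKET:
        return False  # link-layer access, as used for ARP spoofing
    if (sock_type & SOCK_TYPE_MASK) != socket.SOCK_RAW:
        return True   # ordinary stream and datagram sockets are not restricted here
    # Raw sockets are allowed only for ICMP over IPv4, which is what ping uses.
    return domain == socket.AF_INET and protocol == socket.IPPROTO_ICMP
```

A capability check alone cannot express this distinction, because both cases look the same to CAP_NET_RAW.
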

  26. Mechanism of seccomp notifier
    ⚫seccomp notifier provides a way to handle a particular
    syscall in user space.
    ⚫Advantages over ptrace
    • Performance
    • Can be used on programs that themselves use seccomp
    • Protection against PID recycling
    ⚫But process_vm_readv(2) is needed to fetch the data from
    the tracee’s address space.
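
The notification only carries raw argument values, so a pointer argument such as a path has to be read out of the other process. A minimal sketch of that read, calling process_vm_readv(2) through ctypes on Linux; the wrapper name `read_remote` is made up for this example, and the demonstration reads from the current process to stay self-contained.

```python
import ctypes
import os


class IoVec(ctypes.Structure):
    """Mirror of struct iovec from <sys/uio.h>."""
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]


def read_remote(pid, address, size):
    """Copy size bytes starting at address in process pid."""
    libc = ctypes.CDLL(None, use_errno=True)
    fn = libc.process_vm_readv
    fn.restype = ctypes.c_ssize_t
    fn.argtypes = [ctypes.c_int,
                   ctypes.POINTER(IoVec), ctypes.c_ulong,
                   ctypes.POINTER(IoVec), ctypes.c_ulong,
                   ctypes.c_ulong]
    local_buf = ctypes.create_string_buffer(size)
    local = IoVec(ctypes.addressof(local_buf), size)  # where the copy lands
    remote = IoVec(address, size)                     # where to read in the target
    copied = fn(pid, ctypes.byref(local), 1, ctypes.byref(remote), 1, 0)
    if copied < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return local_buf.raw[:copied]


# Demonstration: read a buffer that lives in this same process.
source = ctypes.create_string_buffer(b"openat-path")
```

Calling `read_remote(os.getpid(), ctypes.addressof(source), 6)` returns the first six bytes of the buffer where the syscall is permitted. In a real handler the target may have exited or been replaced in the meantime, which is why libseccomp offers `seccomp_notify_id_valid()` to re-check the notification after reading.
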

  27. How to pass the notify file descriptor to another process
    1. App1 initializes the seccomp context using seccomp_init().
    2. App1 sets up the seccomp context using seccomp_rule_add().
    3. App1 loads the seccomp context using seccomp_load().
    4. App1 gets notify fd for notification using seccomp_notify_fd().
    [Diagram: App1, FGAC server, and Kernel, showing where the
    notification fd is held at this step.]
    https://github.com/seccomp/libseccomp/pull/232#issuecomment-627731454

  28. How to pass the notify file descriptor to another process
    1. App1 initializes the seccomp context using seccomp_init().
    2. App1 sets up the seccomp context using seccomp_rule_add().
    3. App1 loads the seccomp context using seccomp_load().
    4. App1 gets notify fd for notification using seccomp_notify_fd().
    5. App1 sends notify fd to FGAC server via UNIX Domain Socket.
    [Diagram: App1, FGAC server, and Kernel, showing where the
    notification fd is held at this step.]
    https://github.com/seccomp/libseccomp/pull/232#issuecomment-627731454

  29. How to pass the notify file descriptor to another process
    1. App1 initializes the seccomp context using seccomp_init().
    2. App1 sets up the seccomp context using seccomp_rule_add().
    3. App1 loads the seccomp context using seccomp_load().
    4. App1 gets notify fd for notification using seccomp_notify_fd().
    5. App1 sends notify fd to FGAC server via UNIX Domain Socket.
    6. FGAC server receives notify fd from App1 via UNIX Domain Socket.
    [Diagram: App1, FGAC server, and Kernel, showing where the
    notification fd is held at this step.]
    https://github.com/seccomp/libseccomp/pull/232#issuecomment-627731454

  30. How to pass the notify file descriptor to another process
    1. App1 initializes the seccomp context using seccomp_init().
    2. App1 sets up the seccomp context using seccomp_rule_add().
    3. App1 loads the seccomp context using seccomp_load().
    4. App1 gets notify fd for notification using seccomp_notify_fd().
    5. App1 sends notify fd to FGAC server via UNIX Domain Socket.
    6. FGAC server receives notify fd from App1 via UNIX Domain Socket.
    7. FGAC server receives a notification from notify fd using
    seccomp_notify_receive().
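
Steps 5 and 6 rely on SCM_RIGHTS, the standard way to hand an open file descriptor to another process over a UNIX domain socket. The sketch below shows only that transfer. A pipe stands in for the seccomp notify fd, and a socket pair in one process stands in for App1 and the FGAC server; both are assumptions made so the example runs without libseccomp.

```python
import array
import os
import socket


def send_fd(sock, fd):
    """Sender side (App1): pass an open fd as SCM_RIGHTS ancillary data."""
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                           array.array("i", [fd]))])


def recv_fd(sock):
    """Receiver side (FGAC server): get the kernel-duplicated fd."""
    fds = array.array("i")
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
            return fds[0]
    raise RuntimeError("no file descriptor received")


app_side, server_side = socket.socketpair()  # AF_UNIX by default on Linux
read_end, write_end = os.pipe()              # stand-in for the notify fd
send_fd(app_side, write_end)
received_fd = recv_fd(server_side)
```

The received number usually differs from the sender's, but both refer to the same open file, so the server can then wait on it as in step 7.
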
    [Diagram: App1, FGAC server with its Userspace handler, and
    Kernel, showing the notification travelling over the fd.]
    https://github.com/seccomp/libseccomp/pull/232#issuecomment-627731454

  31. Case Study of OSS activities

  32. seccomp with speculation mitigations
    ⚫Speculation mitigations have a significant impact on
    CPU-intensive programs under default-configured Docker.
    • See http://mamememo.blogspot.com/2020/05/cpu-intensive-rubypython-code-runs.html
    • We have CPU-intensive software in the robot field.
    ⚫All speculation mitigations are automatically enabled when
    seccomp is enabled.
    ⚫But we can change the setting of seccomp with
    SECCOMP_FILTER_FLAG_SPEC_ALLOW.
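
One way to observe this from inside a container is to read `/proc/self/status`, which on kernels that export these fields reports both the seccomp mode and the Speculative Store Bypass state of the thread. A minimal parsing sketch; the sample text is made up, and the function name is an assumption.

```python
def speculation_status(status_text):
    """Extract the seccomp mode and speculation state from /proc/<pid>/status text."""
    wanted = ("Seccomp", "Speculation_Store_Bypass")
    result = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in wanted:
            result[key] = value.strip()
    return result


sample_status = "Name:\tapp\nSeccomp:\t2\nSpeculation_Store_Bypass:\tthread force mitigated\n"
observed = speculation_status(sample_status)
```

A `Seccomp` value of 2 means filter mode. Comparing the second field with and without the flag is a quick way to confirm that the setting took effect.
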

  33. How to improve the performance without speculation mitigations
    ⚫This feature requires changes to the behavior of
    Docker, runc, and the Linux Kernel.
    ⚫In addition, we must modify related libraries
    where needed.
    [Diagram: Apps, Docker, runc, Library (libseccomp-golang), and
    Linux Kernel, with the steps "2. Initialize seccomp",
    "3. Disable speculation feature", and
    "4. Set up each mitigation".]

  34. What is needed to improve the performance in OSS
    [Diagram: Apps, Docker, runc, Library (libseccomp-golang), and
    Linux Kernel, annotated with the work needed in each project.
    One annotation reads "To be determined".]
    ⚫runtime-spec and runc: implement the new option to control
    speculation mitigation
    • runtime-spec : https://github.com/opencontainers/runtime-spec/pull/1047
    • runc : https://github.com/opencontainers/runc/pull/2433
    ⚫libseccomp-golang: support SECCOMP_FILTER_FLAG_SPEC_ALLOW
    • https://github.com/seccomp/libseccomp-golang/pull/51
    ⚫Linux Kernel: fix PR_SPEC_FORCE_DISABLE
    • https://lore.kernel.org/patchwork/patch/1251849
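
At the config.json level, the new option could look like the sketch below. It assumes the option takes the form of a `flags` list in the `linux.seccomp` section, as proposed in the runtime-spec pull request linked above; the helper name is made up, and the exact behavior depends on the runtime version in use.

```python
SPEC_ALLOW = "SECCOMP_FILTER_FLAG_SPEC_ALLOW"


def allow_speculation(config):
    """Add the SPEC_ALLOW flag to an existing seccomp section, if there is one."""
    seccomp = config.setdefault("linux", {}).get("seccomp")
    if seccomp is None:
        return config  # no seccomp filter configured, so nothing to relax
    flags = seccomp.setdefault("flags", [])
    if SPEC_ALLOW not in flags:
        flags.append(SPEC_ALLOW)
    return config


relaxed = allow_speculation({"linux": {"seccomp": {"defaultAction": "SCMP_ACT_ERRNO"}}})
untouched = allow_speculation({"linux": {}})
```

Turning the mitigation off trades protection for speed, so it only makes sense for workloads where that trade is acceptable.
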

  35. Key takeaways
    ⚫Embedded systems have different constraints.
    ⚫But similar technologies are used.
    ⚫Diversity is important for OSS.
    • We need the knowledge of various software layers.
    • The perspectives from different industries make OSS great.
    • Let's boost the container community up together.

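  36. SONY is a registered trademark or trademark of Sony Corporation.
    The names of Sony products and services are registered trademarks or trademarks of Sony Corporation or its group companies. Other product and company names are the trade names, registered trademarks, or trademarks of their respective companies.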