
Embedded Container Runtime using Linux capabilities, seccomp, cgroups

Kenta Tada
September 09, 2020


Transcript

  1. Embedded Container Runtime using Linux capabilities, seccomp, cgroups
    CloudNative Days Tokyo 2020
    Kenta Tada, R&D Center, Sony Corporation
    Copyright 2020 Sony Corporation
  2. About me
    ⚫ Kenta Tada
    ⚫ Software Engineer, Sony

  3. Agenda
    ⚫ Overview of Container Runtime
    ⚫ Introduction to Embedded Container Runtime
    ⚫ Case Study of OSS activities
  4. Overview of Container Runtime

  5. What is Container Runtime
    ⚫ A container runtime spawns and runs containers according to the OCI specification.
    ⚫ OCI specification
      • https://github.com/opencontainers/runtime-spec
    ⚫ A container runtime needs
      • config.json, the configuration file for the container guest
        – environment variables, Linux namespaces, seccomp, etc.
      • rootfs for the container guest
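To make the two inputs above concrete, here is a minimal sketch of what a config.json can contain. The field names come from the OCI runtime-spec; the values (binary path, environment, `ociVersion` string, `rootfs` directory name) are illustrative assumptions, not taken from the talk.

```python
import json


def minimal_oci_config(args, env, rootfs="rootfs"):
    """Build a small config.json-like dict using OCI runtime-spec field names."""
    return {
        "ociVersion": "1.0.2",  # assumed spec version, for illustration only
        "process": {"args": list(args), "env": list(env), "cwd": "/"},
        # Path to the container's root filesystem, relative to the bundle.
        "root": {"path": rootfs, "readonly": True},
        # Namespaces the runtime should create for the container guest.
        "linux": {"namespaces": [{"type": "pid"}, {"type": "mount"}]},
    }


config = minimal_oci_config(["/bin/app"], ["PATH=/usr/bin:/bin"])
print(json.dumps(config, indent=2))
```

A real config.json usually carries many more sections (mounts, capabilities, seccomp, hooks), which is the motivation for the Config Generator described later in the deck.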
  6. Container Runtime in Kubernetes
    [Diagram: a control plane (etcd, kube-apiserver, kube-scheduler, kube-controller-manager) and a node running kubelet, a high-level runtime (e.g. containerd), a low-level runtime (e.g. runc), and a pod (sample application). Annotation: "Today's Topic".]
  7. Introduction to Embedded Container Runtime

  8. Why we need a container
    ⚫ For edge devices, we use container technology
      • as a lightweight sandbox
      • to package an application along with its dependencies
    [Diagram labels: Server (Hardware, Host OS, dockerd, Apps1, Apps2); Edge (Hardware, Host OS, Apps1, Apps2); Isolation; Wifi/Bluetooth/LTE/Ethernet.]
  9. Software Stack of Embedded Container Runtime
    [Diagram labels: a hypervisor stack (Hardware, Host OS, Hypervisor, Guest OS, Apps1, Apps2) and a Docker stack (Hardware, Host OS, dockerd, Apps1, Apps2); Isolation.]
  10. Software Stack of Embedded Container Runtime
    [Diagram labels: a hypervisor stack (Hardware, Host OS, Hypervisor, Guest OS, Apps1, Apps2), a Docker stack (Hardware, Host OS, dockerd, Apps1, Apps2), and an Embedded Container Runtime stack (Hardware, Host OS, Embedded Container Runtime, Apps1, Apps2); Isolation.]
  11. Embedded Container Runtime with Kubernetes
    [Diagram labels: Node, kubelet, High Level Runtime (containerd), Low Level Runtime (runC), Pod (Application).]
    ⚫ RuntimeClass == runC
  12. Embedded Container Runtime with Kubernetes
    [Diagram labels: Node, kubelet, High Level Runtime (containerd), Low Level Runtime (runC), Pod (Application), Embedded Container Runtime, Container Launcher, (WIP) Rust-based Low Level Runtime.]
    ⚫ RuntimeClass == Embedded Container Runtime
  13. Main features of Embedded Container Runtime
    ⚫ Config Generator
    ⚫ Lightweight execution environment
    ⚫ Resource Control
    ⚫ Tracing
    ⚫ Security
    ⚫ Debug
    ⚫ Fast container boot
    ⚫ Flash-friendly environment
    ⚫ Realtime support
    ⚫ Monitoring
  15. Config Generator
    ⚫ Application developers want to concentrate on their application, but config.json has many items.
    [Diagram: examples of those items.
      Linux capabilities: CAP_NET_ADMIN allows various network-related operations; CAP_SYS_TIME allows setting the system clock.
      Linux namespaces: a mount namespace (/box1 appears as / with its own /bin, /usr, /sbin) and a PID namespace (host PIDs 1 to 7, container PIDs 1 to 3).
      seccomp: write() calls from process1 and process2 crossing from user space to the kernel.]
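As a small illustration of how capabilities such as CAP_NET_ADMIN and CAP_SYS_TIME are represented, the sketch below decodes the hexadecimal capability mask that Linux exposes in the `CapEff:` line of `/proc/<pid>/status`. The bit numbers follow the kernel's `capability.h`; only a handful of capabilities are listed, and the helper names are made up for this example.

```python
# Bit positions from include/uapi/linux/capability.h (subset only).
CAP_BITS = {
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    25: "CAP_SYS_TIME",
    27: "CAP_MKNOD",
}


def decode_caps(hex_mask):
    """Return the known capability names set in a hex mask such as CapEff."""
    mask = int(hex_mask, 16)
    return [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]


def effective_caps(status_path="/proc/self/status"):
    """Read CapEff from a /proc status file (Linux only) and decode it."""
    with open(status_path) as status:
        for line in status:
            if line.startswith("CapEff:"):
                return decode_caps(line.split()[1])
    return []


# A mask with bits 12 and 25 set.
print(decode_caps("0000000002001000"))
```

Calling `effective_caps()` inside a container shows which of the listed capabilities the runtime actually granted to the process.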
  16. Separation of concerns
    ⚫ Config Generator helps application developers set up containers without any OS knowledge.
    [Diagram labels: Application Manifest and System Manifest feed Config Generator inside Embedded Container Runtime; config.json and the container's rootfs feed the container runtime, which starts the container guest.]
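The talk does not show the manifest formats, so the following is only a sketch of the idea under invented assumptions: the application manifest lists high-level features, the system manifest lists the capabilities the platform is willing to grant, and the generator emits the `process.capabilities` section of config.json. The feature names, manifest keys, and `generate_capabilities` are hypothetical.

```python
# Hypothetical mapping from application-level features to Linux capabilities.
FEATURE_TO_CAPS = {
    "configure-network": ["CAP_NET_ADMIN"],
    "set-clock": ["CAP_SYS_TIME"],
}


def generate_capabilities(app_manifest, system_manifest):
    """Translate requested features into an OCI capabilities section."""
    allowed = set(system_manifest["allowed_capabilities"])
    caps = []
    for feature in app_manifest["features"]:
        for cap in FEATURE_TO_CAPS[feature]:
            if cap not in allowed:
                raise PermissionError(f"{cap} is not permitted by the system manifest")
            if cap not in caps:
                caps.append(cap)
    # bounding/effective/permitted are capability sets defined by the runtime-spec.
    return {
        "process": {
            "capabilities": {
                "bounding": list(caps),
                "effective": list(caps),
                "permitted": list(caps),
            }
        }
    }


app = {"features": ["set-clock"]}
system = {"allowed_capabilities": ["CAP_SYS_TIME"]}
print(generate_capabilities(app, system))
```

The point of the split is that the application author only writes `set-clock`; the OS-specific knowledge lives in the mapping and in the system manifest.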
  17. Lightweight execution environment
    ⚫ Resource constraints in embedded systems
      • CPU load
      • storage size
      • memory size
    ⚫ In particular, container images are large.
  18. What technologies help optimization
    ⚫ CPU
      • Config Generator checks the configuration to improve performance.
        – Some configurations degrade system performance for processes inside the container.
    ⚫ Storage
      • Bind mount
        – Reduces storage use by sharing the same files
      • (WIP) Deduplicated storage for Embedded Container Platform
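One way bind mounts show up in practice is as entries in the `mounts` array of config.json. The sketch below builds read-only bind-mount entries that expose host directories inside the container instead of copying them into each image. The entry fields (`destination`, `type`, `source`, `options`) follow the OCI runtime-spec; the choice of directories and the helper name are assumptions.

```python
def shared_bind_mounts(host_dirs):
    """Build OCI mount entries that bind host directories read-only into a container."""
    return [
        {
            "destination": path,  # where the files appear inside the container
            "type": "bind",
            "source": path,       # the same path on the host
            "options": ["rbind", "ro"],
        }
        for path in host_dirs
    ]


mounts = shared_bind_mounts(["/usr/lib", "/usr/share"])
for entry in mounts:
    print(entry["source"], "->", entry["destination"], entry["options"])
```

Because every container maps the same host files, the files are stored once, and their page cache is shared as well.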
  19. What technologies help optimization
    ⚫ Memory
      • Bind mount
        – Reduces memory use by sharing page cache
      • KSM (Kernel Same-page Merging)
        – Reduces memory use by merging identical anonymous pages
        – (WIP) Control KSM for fine-grained scanning
        – (WIP) How to set up madvise(2)
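For reference, KSM only scans memory that a process has opted in with `madvise(2)` and `MADV_MERGEABLE`. The sketch below does that for a private anonymous mapping from Python (3.8 or later, Linux). It is not the setup used in the talk, which is marked WIP there; the function name is invented, and the call can fail on kernels built without KSM, which the code reports as `False`.

```python
import mmap


def mark_mergeable(size=mmap.PAGESIZE * 4):
    """Try to register a private anonymous region with KSM; return True on success."""
    advice = getattr(mmap, "MADV_MERGEABLE", None)
    if advice is None:
        return False  # platform or Python version without MADV_MERGEABLE
    # KSM merges private anonymous pages, so do not use the default MAP_SHARED.
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        region.madvise(advice)
        return True
    except OSError:
        return False  # for example, a kernel built without CONFIG_KSM
    finally:
        region.close()


print("region registered with KSM:", mark_mergeable())
```

Registration alone does not merge anything: the KSM daemon must also be enabled through `/sys/kernel/mm/ksm/run`.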
  20. Resource Control
    ⚫ Use cgroup v1
      • We use cgroup as is.
      • But we sometimes encountered issues when using it for our use case.
        – See https://speakerdeck.com/kentatada/container-tracer-using-oci-hooks-on-kubernetes?slide=17
    ⚫ (WIP) cgroup v2 support for Embedded Container Runtime
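A runtime that supports both cgroup versions first has to tell them apart. A common heuristic, sketched here, is to look for the `cgroup.controllers` file, which exists at the root of a cgroup v2 (unified) hierarchy and not in v1. The function name and the choice to return `None` when no cgroup filesystem is found are assumptions of this example.

```python
import os


def cgroup_version(root="/sys/fs/cgroup"):
    """Guess the cgroup version mounted at root: 2, 1, or None if nothing is there."""
    if not os.path.isdir(root):
        return None
    # The unified (v2) hierarchy exposes cgroup.controllers at its root.
    if os.path.exists(os.path.join(root, "cgroup.controllers")):
        return 2
    return 1


print("cgroup version:", cgroup_version())
```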
  21. Tracing
    ⚫ We want to observe the behavior of the application dynamically, for security and safety.
      • Needed Linux capabilities
      • Correct file permissions
      • List of executed syscalls, to set up seccomp
      • Page fault occurrences in critical code
    ⚫ We need a lightweight and secure tool for embedded systems.
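To show how a traced syscall list can feed seccomp, the sketch below turns a list of observed syscall names into an allowlist in the `linux.seccomp` format of the OCI runtime-spec (`defaultAction`, `syscalls`, `names`, `action`). The syscall names in the usage line are made-up sample data, not output from the tracer in the talk.

```python
def seccomp_allowlist(traced_syscalls):
    """Build an OCI seccomp section that allows only the traced syscalls."""
    return {
        # Anything not listed below fails with an error instead of running.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {
                "names": sorted(set(traced_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


profile = seccomp_allowlist(["write", "read", "openat", "write"])
print(profile["syscalls"][0]["names"])
```

A profile generated this way is only as complete as the trace, so code paths that never ran during tracing would be blocked.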
  22. Lightweight secure tracer for containers
    ⚫ ftrace-based rootless tracer using an OCI hook
      • This tracer sets up ftrace in the container runtime's prestart hook.
        – See https://speakerdeck.com/kentatada/debug-application-inside-kubernetes-using-linux-kernel-tools
      • Other events can be traced in the same way as syscalls.
    ⚫ Support operations per container
      • We had no way to specify an OCI hook per container on Kubernetes.
      • We merged a patch for per-container operations to mainline.
        – https://github.com/containerd/cri/pull/1436
      • We can control our tracer per container since the containerd 1.4.0 release.
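An OCI hook receives the container state as JSON on standard input, including the container's `pid` and `annotations` (both defined by the runtime-spec). The sketch below shows how a prestart hook could decide per container whether to trace. The annotation key `example.com/trace` and the function name are hypothetical; the talk does not say which annotation its tracer uses, and the ftrace setup itself is omitted.

```python
import json

# Hypothetical annotation that opts a container in to tracing.
TRACE_ANNOTATION = "example.com/trace"


def pid_to_trace(state_json):
    """Return the container PID if its state asks for tracing, else None."""
    state = json.loads(state_json)
    annotations = state.get("annotations", {})
    if annotations.get(TRACE_ANNOTATION) == "true":
        return state["pid"]
    return None


# In a real hook the JSON would come from sys.stdin.read().
sample_state = json.dumps({
    "ociVersion": "1.0.2",
    "id": "demo",
    "status": "created",
    "pid": 4242,
    "bundle": "/run/demo",
    "annotations": {TRACE_ANNOTATION: "true"},
})
print("trace pid:", pid_to_trace(sample_state))
```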
  23. Security
    ⚫ Embedded systems also need root privilege.
    ⚫ On the other hand, we don't want to give every application root privilege.
    ⚫ In particular, embedded applications sometimes access devices directly.
      – mount(2)
      – mknod(2)
      – GPIO access
  24. Transition of our secure settings
    ⚫ Root privilege + Linux capabilities + prior seccomp
      • Linux capabilities cannot provide fine-grained access control.
        – Ex. Both ping and ARP spoofing need CAP_NET_RAW
      • seccomp only allows or denies syscalls and does not provide privileges.
    ⚫ User namespace
      • And what is needed to provide correct access control???
    [Diagram labels: Hardware, Host OS, Apps, Isolation, Linux Capabilities, user namespace.]
  25. Fine Grained Access Control
    ⚫ Implement Fine Grained Access Control (FGAC) using the seccomp notifier
    ⚫ The FGAC server, not the application, decides whether a syscall is allowed.
    [Diagram labels: App1 and App2 in user space, Kernel, FGAC server.]
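As an illustration of the kind of decision such a server could make, the sketch below distinguishes two uses of CAP_NET_RAW by inspecting the arguments of `socket(2)`: a raw ICMP socket (what ping needs) is allowed, while packet sockets and other raw sockets (which ARP spoofing would need) are denied. This policy and the function name are invented for the example; the talk does not describe the FGAC server's actual rules.

```python
import socket

AF_PACKET = getattr(socket, "AF_PACKET", 17)  # 17 on Linux
SOCK_TYPE_MASK = 0xF  # strips SOCK_CLOEXEC / SOCK_NONBLOCK from the type argument


def allow_socket(domain, sock_type, protocol):
    """Decide whether a socket(2) call reported by the notifier may proceed."""
    kind = sock_type & SOCK_TYPE_MASK
    if domain == AF_PACKET:
        return False  # link-layer access, e.g. crafting ARP frames
    if kind == socket.SOCK_RAW:
        # Raw sockets are allowed only for ICMP over IPv4 (ping).
        return domain == socket.AF_INET and protocol == socket.IPPROTO_ICMP
    return True  # ordinary stream/datagram sockets are not restricted here


print(allow_socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP))
print(allow_socket(AF_PACKET, socket.SOCK_RAW, 0x0806))
```

With capabilities alone, both calls would succeed or both would fail; a user-space decision point is what makes the distinction possible.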
  26. Mechanism of seccomp notifier
    ⚫ The seccomp notifier provides a way to handle a particular syscall in user space.
    ⚫ Advantages over ptrace
      • Performance
      • It can run on a program that itself uses seccomp
      • Protection against PID recycling
    ⚫ But process_vm_readv(2) is needed to fetch data from the tracee's address space.
  30. How to pass the notify file descriptor to another process
    1. App1 initializes the seccomp context using seccomp_init().
    2. App1 sets up the seccomp context using seccomp_rule_add().
    3. App1 loads the seccomp context using seccomp_load().
    4. App1 gets the notify fd using seccomp_notify_fd().
    5. App1 sends the notify fd to the FGAC server via a UNIX domain socket.
    6. The FGAC server receives the notify fd from App1 via the UNIX domain socket.
    7. The FGAC server receives a notification from the notify fd using seccomp_notify_receive().
    [Diagram labels: App1, Kernel, FGAC server with userspace handler, notification, fd.]
    https://github.com/seccomp/libseccomp/pull/232#issuecomment-627731454
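Steps 5 and 6 rely on passing a file descriptor over a UNIX domain socket (SCM_RIGHTS). The sketch below shows only that mechanism, using Python 3.9's `socket.send_fds` and `socket.recv_fds`. A pipe's write end stands in for the seccomp notify fd, and both "App1" and the "FGAC server" run in one process for brevity; the seccomp setup in steps 1 to 4 is not reproduced here.

```python
import os
import socket


def pass_fd_demo():
    """Send a descriptor across a UNIX socket pair and use the received copy."""
    app_sock, server_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()  # write_end stands in for the notify fd
    try:
        # Step 5: "App1" sends the descriptor as ancillary data.
        socket.send_fds(app_sock, [b"notify-fd"], [write_end])
        # Step 6: the "FGAC server" receives a new descriptor for the same object.
        msg, fds, _flags, _addr = socket.recv_fds(server_sock, 1024, 1)
        received_fd = fds[0]
        # Prove the received descriptor refers to the same pipe.
        os.write(received_fd, b"hello")
        data = os.read(read_end, 5)
        os.close(received_fd)
        return msg, data
    finally:
        os.close(read_end)
        os.close(write_end)
        app_sock.close()
        server_sock.close()


print(pass_fd_demo())
```

The received descriptor number usually differs from the sender's, but both refer to the same kernel object, which is why the server can read notifications for a filter that App1 installed.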
  31. Case Study of OSS activities

  32. seccomp with speculation mitigations
    ⚫ Speculation mitigations have a significant impact on CPU-intensive programs that run under default-configured Docker.
      • See http://mamememo.blogspot.com/2020/05/cpu-intensive-rubypython-code-runs.html
      • We have CPU-intensive software in the robot field.
    ⚫ All speculation mitigations are automatically enabled when seccomp is enabled.
    ⚫ But we can change this seccomp behavior with SECCOMP_FILTER_FLAG_SPEC_ALLOW.
  33. How to improve the performance without speculation mitigations
    [Diagram labels: Apps, Library, Docker, runc, libseccomp-golang, Linux Kernel; steps "2. Initialize seccomp", "3. Disable speculation feature", "4. Set up each mitigation".]
    ⚫ This feature requires changing the behavior of Docker, runc, and the Linux kernel.
    ⚫ In addition, we must modify related libraries where needed.
  34. What is needed to improve the performance in OSS
    [Diagram labels: Apps, Library, Docker, runc, libseccomp-golang, Linux Kernel; one annotation reads "To be determined".]
    ⚫ Implement the new option to control speculation mitigation
      • runtime-spec: https://github.com/opencontainers/runtime-spec/pull/1047
      • runc: https://github.com/opencontainers/runc/pull/2433
    ⚫ Support SECCOMP_FILTER_FLAG_SPEC_ALLOW
      • https://github.com/seccomp/libseccomp-golang/pull/51
    ⚫ Fix PR_SPEC_FORCE_DISABLE
      • https://lore.kernel.org/patchwork/patch/1251849
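On the configuration side, the option amounts to adding a flag to the seccomp section of config.json. The sketch below assumes the `flags` list of `linux.seccomp` accepts the string `SECCOMP_FILTER_FLAG_SPEC_ALLOW`, as proposed in the runtime-spec pull request listed above. The helper name is made up, and whether a given runc version honors the flag depends on the linked changes being present.

```python
def allow_speculation(seccomp_profile):
    """Return a copy of an OCI seccomp section with the SPEC_ALLOW flag set."""
    profile = dict(seccomp_profile)
    flags = list(profile.get("flags", []))
    if "SECCOMP_FILTER_FLAG_SPEC_ALLOW" not in flags:
        # Ask the kernel not to force speculation mitigations on for this filter.
        flags.append("SECCOMP_FILTER_FLAG_SPEC_ALLOW")
    profile["flags"] = flags
    return profile


base = {"defaultAction": "SCMP_ACT_ERRNO", "syscalls": []}
print(allow_speculation(base)["flags"])
```

The flag trades the mitigations' protection for performance, so it only makes sense for workloads where that trade is acceptable.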
  35. Key takeaways
    ⚫ Embedded systems have different constraints.
    ⚫ But similar technologies are used.
    ⚫ Diversity is important for OSS.
      • We need knowledge of various software layers.
      • Perspectives from different industries make OSS great.
      • Let's boost the container community together.
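  36. SONY is a registered trademark or trademark of Sony Corporation. The product and service names of Sony products are registered trademarks or trademarks of Sony Corporation or its group companies. Other product and company names are trade names, registered trademarks, or trademarks of their respective companies.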