Slide 1

Slide 1 text

Embedded Container Runtime using Linux capabilities, seccomp, cgroups
CloudNative Days Tokyo 2020
Kenta Tada
R&D Center, Sony Corporation
Copyright 2020 Sony Corporation

Slide 2

Slide 2 text

About me
⚫Kenta Tada
⚫Software Engineer, Sony

Slide 3

Slide 3 text

Agenda
⚫Overview of Container Runtime
⚫Introduction to Embedded Container Runtime
⚫Case Study of OSS activities

Slide 4

Slide 4 text

Overview of Container Runtime

Slide 5

Slide 5 text

What is Container Runtime
⚫A container runtime spawns and runs containers according to the OCI specification.
⚫OCI specification
  • https://github.com/opencontainers/runtime-spec
⚫A container runtime needs
  • config.json, the configuration file for the container guest
    –environment variables, Linux namespaces, seccomp, etc.
  • rootfs for the container guest
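As a rough illustration of the two inputs listed on this slide, the following Python sketch builds a minimal config.json-like document that points at a rootfs directory. The field names follow the OCI runtime-spec; the concrete values (the `sh` command, the `PATH` variable, the namespace list, the allowed syscalls) are assumptions chosen for illustration, not settings from this talk.

```python
import json


def minimal_oci_config(rootfs="rootfs"):
    """Return a small config.json-like dict (illustrative values only)."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": ["sh"],                   # assumed command
            "env": ["PATH=/usr/bin:/bin"],    # environment variables
            "cwd": "/",
        },
        "root": {"path": rootfs, "readonly": True},  # rootfs for the guest
        "linux": {
            # Linux namespaces the guest is placed in
            "namespaces": [{"type": "pid"}, {"type": "mount"}],
            # seccomp: deny by default, allow a short assumed list
            "seccomp": {
                "defaultAction": "SCMP_ACT_ERRNO",
                "syscalls": [
                    {
                        "names": ["read", "write", "exit_group"],
                        "action": "SCMP_ACT_ALLOW",
                    }
                ],
            },
        },
    }


# Serialize the dict the way it would be stored as config.json.
config_text = json.dumps(minimal_oci_config(), indent=2)
```

A real config.json usually carries many more items (mounts, capabilities, resource limits), which is the motivation for the Config Generator introduced later.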

Slide 6

Slide 6 text

Container Runtime in Kubernetes
[Diagram] Control Plane: etcd, kube-apiserver, kube-scheduler, kube-controller-manager
[Diagram] Node: kubelet, High Level Runtime (Ex. containerd), Low Level Runtime (Ex. runc), Pod (Sample Application)
[Diagram] Callout: Today’s Topic

Slide 7

Slide 7 text

Introduction to Embedded Container Runtime

Slide 8

Slide 8 text

Why we need a container
⚫For edge, we use container technology
  • as a light-weight sandbox
  • to package an application along with its dependencies
[Diagram] Server: Hardware, Host OS, dockerd, Apps1, Apps2
[Diagram] Edge: Hardware, Host OS, Apps1, Apps2
[Diagram] Other labels: Wifi/Bluetooth/LTE/Ethernet, Isolation

Slide 9

Slide 9 text

Software Stack of Embedded Container Runtime
[Diagram] Hypervisor: Hardware, Host OS, Hypervisor, Guest OS (Apps1), Guest OS (Apps2)
[Diagram] Docker: Hardware, Host OS, dockerd, Apps1, Apps2
[Diagram] Other labels: Isolation

Slide 10

Slide 10 text

Software Stack of Embedded Container Runtime
[Diagram] Hypervisor: Hardware, Host OS, Hypervisor, Guest OS (Apps1), Guest OS (Apps2)
[Diagram] Docker: Hardware, Host OS, dockerd, Apps1, Apps2
[Diagram] Embedded Container Runtime: Hardware, Host OS, Apps1, Apps2
[Diagram] Other labels: Isolation

Slide 11

Slide 11 text

Embedded Container Runtime with Kubernetes
⚫RuntimeClass == runC
[Diagram] Node: kubelet, High Level Runtime (containerd), Low Level Runtime (runC), Pod (Application)

Slide 12

Slide 12 text

Embedded Container Runtime with Kubernetes
⚫RuntimeClass == Embedded Container Runtime
[Diagram] Node: kubelet, High Level Runtime (containerd), Low Level Runtime (runC), Pod (Application)
[Diagram] Embedded Container Runtime: Container Launcher, Low Level Runtime (runC), (WIP) Rust based Low Level Runtime, Pod (Application)

Slide 13

Slide 13 text

Main features of Embedded Container Runtime
⚫Config Generator
⚫Light-weight execution environment
⚫Resource Control
⚫Tracing
⚫Security
⚫Debug
⚫Fast container boot
⚫Flash-friendly environment
⚫Realtime support
⚫Monitoring

Slide 15

Slide 15 text

Config Generator
⚫Application developers want to concentrate on their application, but “config.json” has many items.
[Diagram] Linux Capabilities: CAP_NET_ADMIN allows various network-related operations; CAP_SYS_TIME allows setting the system clock
[Diagram] Linux namespace: Mount Namespace (/ with /bin, /usr, /sbin and /box1 containing its own /bin, /usr, /sbin); PID Namespace (host PIDs 1 to 7, container PIDs 1 to 3)
[Diagram] seccomp: process1 and process2 in User each call write() into Kernel
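To make the capability items in config.json a little more concrete, here is a small Python sketch that decodes a capability bitmask such as the `CapEff:` line in `/proc/<pid>/status`. The bit positions are the ones defined in the kernel's capability header for these three capabilities; the helper names are hypothetical, and only a subset of capabilities is listed.

```python
# Bit positions from the Linux capability definitions (subset only).
CAP_BITS = {
    "CAP_NET_ADMIN": 12,  # various network-related operations
    "CAP_NET_RAW": 13,    # raw sockets
    "CAP_SYS_TIME": 25,   # setting the system clock
}


def decode_caps(hex_mask):
    """Return the known capability names set in a hex bitmask string."""
    mask = int(hex_mask, 16)
    return sorted(name for name, bit in CAP_BITS.items() if mask & (1 << bit))


def effective_caps(status_text):
    """Find the CapEff line in /proc/<pid>/status text and decode it."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split()[1])
    return []
```

For the current process, `effective_caps(open("/proc/self/status").read())` shows which of these capabilities are in effect.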

Slide 16

Slide 16 text

Separation of concerns
⚫Config Generator helps application developers set up containers without any OS knowledge.
[Diagram] Application Manifest and System Manifest feed Config Generator, which produces config.json
[Diagram] Embedded Container Runtime: Container Runtime takes config.json and the container’s rootfs and runs the Container Guest
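The separation of concerns can be sketched in Python as follows. The manifest formats are not shown in the talk, so every manifest key here (`command`, `env`, `capabilities`, `rootfs`, `namespaces`, `allowed_capabilities`) is a hypothetical stand-in; only the output field names follow the OCI runtime-spec. The idea is that the application side states what it needs, the system side states what is permitted, and the generator produces config.json or rejects the request.

```python
def generate_config(app_manifest, system_manifest):
    """Merge a hypothetical application manifest and system manifest."""
    requested = set(app_manifest.get("capabilities", []))
    allowed = set(system_manifest["allowed_capabilities"])
    denied = requested - allowed
    if denied:
        # The system side has the final say on privileges.
        raise ValueError("capabilities not permitted: " + ", ".join(sorted(denied)))

    caps = sorted(requested)
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": list(app_manifest["command"]),
            "env": list(app_manifest.get("env", [])),
            "cwd": "/",
            "capabilities": {
                "bounding": caps,
                "effective": caps,
                "permitted": caps,
            },
        },
        "root": {"path": system_manifest["rootfs"]},
        "linux": {
            "namespaces": [{"type": t} for t in system_manifest["namespaces"]],
        },
    }
```

With this split, an application developer only writes something like `{"command": ["/app/main"], "capabilities": ["CAP_NET_RAW"]}` and never touches namespaces or rootfs paths.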

Slide 17

Slide 17 text

Light-weight execution environment
⚫Resource constraints in embedded systems
  • CPU load
  • storage size
  • memory size
⚫In particular, container images are large.

Slide 18

Slide 18 text

What technologies help optimization
⚫CPU
  • Config Generator checks the configuration to improve performance.
    –Some configurations degrade system performance for processes inside the container.
⚫Storage
  • Bind mount
    –Reduce storage use by sharing the same files
  • (WIP) Deduplicate storage for Embedded Container Platform

Slide 19

Slide 19 text

What technologies help optimization
⚫Memory
  • Bind mount
    –Reduce memory use by sharing the page cache
  • KSM (Kernel Samepage Merging)
    –Reduce memory use by sharing anonymous pages
    –(WIP) Control KSM for fine-grained scanning.
    –(WIP) How to set up madvise(2)
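KSM only scans memory that a process has opted in with madvise(2) and the MADV_MERGEABLE advice, and it only runs when enabled through /sys/kernel/mm/ksm/run. The Python sketch below shows that opt-in for one private anonymous mapping. It is a minimal illustration, not the runtime's own mechanism (which the slide lists as work in progress), and the function name is hypothetical.

```python
import mmap


def mark_mergeable(size=mmap.PAGESIZE * 4):
    """Map private anonymous memory and ask KSM to consider it for merging.

    Returns (region, ok); ok is False when the advice is unavailable,
    for example on a kernel built without KSM support.
    """
    # KSM applies to private anonymous pages, so MAP_PRIVATE is requested
    # explicitly (Python's default for mmap is a shared mapping).
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advice = getattr(mmap, "MADV_MERGEABLE", None)
    if advice is None:
        return region, False
    try:
        region.madvise(advice)  # madvise(2) over the whole mapping
    except OSError:
        return region, False
    return region, True
```

Pages in such a region become candidates for merging only after the KSM daemon scans them, which is why control over scan granularity matters on small devices.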

Slide 20

Slide 20 text

Resource Control
⚫Use cgroup v1
  • We use cgroup as is.
  • But we sometimes encountered issues when using it for our use case.
    –See https://speakerdeck.com/kentatada/container-tracer-using-oci-hooks-on-kubernetes?slide=17
⚫(WIP) cgroup v2 support for Embedded Container Runtime

Slide 21

Slide 21 text

Tracing
⚫We want to observe the application's behavior dynamically, for security and safety.
  • Needed Linux Capabilities
  • Correct file permissions
  • Executed syscall list to set up seccomp
  • Page fault occurrence in the critical code
⚫We need a light-weight and secure tool for embedded systems.

Slide 22

Slide 22 text

Light-weight secure tracer for container
⚫ftrace-based rootless tracer using OCI hook
  • This tracer sets up ftrace in the prestart hook of the container runtime.
    –See https://speakerdeck.com/kentatada/debug-application-inside-kubernetes-using-linux-kernel-tools
  • Other events can be traced in the same way as syscalls.
⚫Support operations per container
  • We had no way to specify an OCI hook per container on Kubernetes.
  • Our patch for per-container operations was merged upstream.
    –https://github.com/containerd/cri/pull/1436
  • We can control our tracer per container since the containerd 1.4.0 release.
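A prestart hook receives the container state as JSON on standard input, including the container's `pid`, which is enough to scope ftrace to that container. The Python sketch below parses that state and lists the tracefs writes a syscall tracer might perform. The tracefs file names are real ftrace controls, but the choice of files and the helper names are assumptions; the actual tracer from the talk may be set up differently. The sketch only builds the plan and does not write to tracefs.

```python
import io
import json

TRACEFS = "/sys/kernel/tracing"  # assumed tracefs mount point


def read_container_pid(stdin):
    """Parse the OCI container state passed to a hook on stdin."""
    state = json.load(stdin)
    return int(state["pid"])


def ftrace_plan(pid, tracefs=TRACEFS):
    """Return (path, value) pairs that would limit syscall tracing to pid."""
    return [
        (f"{tracefs}/set_event_pid", str(pid)),                    # filter events by PID
        (f"{tracefs}/events/raw_syscalls/sys_enter/enable", "1"),  # syscall entry events
        (f"{tracefs}/tracing_on", "1"),                            # start recording
    ]


# Example with a hand-written state document instead of real hook input.
example_state = io.StringIO('{"ociVersion": "1.0.2", "id": "demo", "pid": 4321}')
plan = ftrace_plan(read_container_pid(example_state))
```

Enabling a different event directory instead of `raw_syscalls/sys_enter` is how the same hook could trace other events.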

Slide 23

Slide 23 text

Security
⚫Embedded systems also need root privileges.
⚫On the other hand, we don’t want to give root privileges to every application.
⚫In particular, embedded applications sometimes access devices directly.
  –mount(2)
  –mknod(2)
  –Access GPIO

Slide 24

Slide 24 text

Transition of our secure settings
⚫root privilege + Linux Capabilities + Prior seccomp
  • Linux Capabilities cannot realize fine-grained access control.
    –Ex. Both ping and ARP spoofing need CAP_NET_RAW
  • seccomp only allows or denies syscalls and does not provide privileges.
⚫User namespace
  • And what is needed to provide correct access control???
[Diagram] Two stacks of Hardware, Host OS, Apps; labels: Isolation, Linux Capabilities, user namespace

Slide 25

Slide 25 text

Fine Grained Access Control
⚫Implement Fine Grained Access Control (FGAC) using seccomp notifier
⚫FGAC server decides, on behalf of applications, whether to allow syscalls.
[Diagram] user: App1, App2, FGAC server; kernel: Kernel
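The decision step of such a server can be sketched in Python as a policy lookup. The talk does not describe the policy format, so the table, the application names used as keys, and the `decide` function are all hypothetical. The sketch returns 0 to allow or an errno value to deny, and leaves out how the verdict is sent back through the seccomp notify fd.

```python
import errno

# Hypothetical policy: (application, syscall) -> permitted first arguments.
POLICY = {
    ("App1", "mount"): {"/dev/sda1"},
    ("App2", "mknod"): set(),  # listed, but nothing is permitted
}


def decide(app, syscall, first_arg):
    """Return 0 to allow the syscall, or an errno value to deny it."""
    permitted = POLICY.get((app, syscall))
    if permitted is not None and first_arg in permitted:
        return 0
    return errno.EPERM  # default deny
```

Unlike a capability such as CAP_SYS_ADMIN, a table like this can permit one specific mount for one application and nothing else.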

Slide 26

Slide 26 text

Mechanism of seccomp notifier
⚫seccomp notifier provides a way to handle a particular syscall in user space.
⚫Advantages over ptrace
  • Performance
  • It can be used on a program that already uses seccomp
  • Protection against PID recycling
⚫But process_vm_readv(2) is needed to fetch data from the tracee’s address space.

Slide 30

Slide 30 text

How to pass the notify file descriptor to another process
1. App1 initializes the seccomp context using seccomp_init().
2. App1 sets up the seccomp context using seccomp_rule_add().
3. App1 loads the seccomp context using seccomp_load().
4. App1 gets notify fd for notification using seccomp_notify_fd().
5. App1 sends notify fd to FGAC server via UNIX Domain Socket.
6. FGAC server receives notify fd from App1 via UNIX Domain Socket.
7. FGAC server receives a notification from notify fd using seccomp_notify_receive().
[Diagram] App1, Kernel, FGAC server (Userspace handler); notify fd is passed from App1 to FGAC server; notification comes from Kernel
https://github.com/seccomp/libseccomp/pull/232#issuecomment-627731454
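Steps 5 and 6 rely on file descriptor passing over a UNIX domain socket (SCM_RIGHTS). The Python sketch below shows only that part: a pipe's read end stands in for the seccomp notify fd, and a socket pair stands in for the connection between App1 and the FGAC server. It assumes Python 3.9 or later for `socket.send_fds` and `socket.recv_fds`; the libseccomp calls from steps 1 to 4 and 7 are not reproduced here.

```python
import os
import socket


def pass_fd_demo():
    """Send a file descriptor across a UNIX domain socket and use the copy."""
    app_sock, server_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()  # read_end is a stand-in for the notify fd
    try:
        # Step 5: "App1" sends the fd together with a short message.
        socket.send_fds(app_sock, [b"notify-fd"], [read_end])

        # Step 6: the "FGAC server" receives its own duplicate of the fd.
        msg, fds, _flags, _addr = socket.recv_fds(server_sock, 1024, 1)
        received_fd = fds[0]

        # Data written on the original side is readable through the copy.
        os.write(write_end, b"event")
        data = os.read(received_fd, 5)
        os.close(received_fd)
        return msg, data
    finally:
        os.close(read_end)
        os.close(write_end)
        app_sock.close()
        server_sock.close()
```

The received descriptor has a different number than the sender's but refers to the same kernel object, which is what lets the server read notifications for a filter that App1 installed.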

Slide 31

Slide 31 text

Case Study of OSS activities

Slide 32

Slide 32 text

seccomp with speculation mitigations
⚫Speculation mitigations have a significant impact on CPU-intensive programs when using default-configured Docker.
  • See http://mamememo.blogspot.com/2020/05/cpu-intensive-rubypython-code-runs.html
  • We have CPU-intensive software in the robot field.
⚫All speculation mitigations are automatically enabled when seccomp is enabled.
⚫But we can change this seccomp behavior with SECCOMP_FILTER_FLAG_SPEC_ALLOW.
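Two small pieces of this can be sketched in Python. The first composes the flags argument for loading a seccomp filter, using the kernel's value for SECCOMP_FILTER_FLAG_SPEC_ALLOW (bit 2). The second reads the `Speculation_Store_Bypass:` line that Linux exposes in `/proc/<pid>/status`, which is one way to check whether a mitigation is active for a process. The helper names are hypothetical, and the sketch does not load a filter itself.

```python
# Value from the kernel's seccomp UAPI header.
SECCOMP_FILTER_FLAG_SPEC_ALLOW = 1 << 2


def filter_flags(allow_speculation):
    """Return the flags to pass when loading a seccomp filter."""
    flags = 0
    if allow_speculation:
        # Ask the kernel not to enable speculation mitigations
        # as a side effect of installing the filter.
        flags |= SECCOMP_FILTER_FLAG_SPEC_ALLOW
    return flags


def ssb_state(status_text):
    """Extract the Speculative Store Bypass state from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Speculation_Store_Bypass:"):
            return line.split(":", 1)[1].strip()
    return None  # field not reported by this kernel
```

Comparing `ssb_state(open("/proc/self/status").read())` inside and outside a container is a quick way to see whether the seccomp profile changed the mitigation state.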

Slide 33

Slide 33 text

How to improve the performance without speculation mitigations
⚫This feature requires changes to the behavior of Docker, runc, and the Linux kernel.
⚫In addition, we must modify related libraries where needed.
[Diagram] Components: Apps, Docker, runc, libseccomp-golang, Library, Linux Kernel
[Diagram] Steps: 2. Initialize seccomp; 3. Disable speculation feature; 4. Set up each mitigation

Slide 34

Slide 34 text

What is needed to improve the performance in OSS
[Diagram] Components: Apps, Docker, runc, libseccomp-golang, Library, Linux Kernel
[Diagram] Status label: To be determined
⚫Implement the new option to control speculation mitigation
  • runtime-spec: https://github.com/opencontainers/runtime-spec/pull/1047
  • runc: https://github.com/opencontainers/runc/pull/2433
⚫Support SECCOMP_FILTER_FLAG_SPEC_ALLOW
  • https://github.com/seccomp/libseccomp-golang/pull/51
⚫Fix PR_SPEC_FORCE_DISABLE
  • https://lore.kernel.org/patchwork/patch/1251849
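On the configuration side, the new option amounts to listing the flag in the seccomp section of config.json. The Python sketch below adds it, assuming the `flags` list under `linux.seccomp` that the runtime-spec pull request above proposes; the function name is hypothetical, and whether a given runc or Docker version honors the flag depends on the linked changes.

```python
import copy

SPEC_ALLOW = "SECCOMP_FILTER_FLAG_SPEC_ALLOW"


def allow_speculation(config):
    """Return a copy of an OCI config with the flag added to linux.seccomp.flags."""
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    seccomp = updated.setdefault("linux", {}).setdefault("seccomp", {})
    flags = seccomp.setdefault("flags", [])
    if SPEC_ALLOW not in flags:
        flags.append(SPEC_ALLOW)
    return updated
```

Keeping the choice in config.json means it can be made per container, so only the CPU-intensive workloads give up the mitigations.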

Slide 35

Slide 35 text

Key takeaways
⚫Embedded systems have different constraints.
⚫But similar technologies are used.
⚫Diversity is important for OSS.
  • We need knowledge of various software layers.
  • Perspectives from different industries make OSS great.
  • Let's boost the container community together.

Slide 36

Slide 36 text

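SONY is a registered trademark or trademark of Sony Corporation. The names of Sony products and services are registered trademarks or trademarks of Sony Corporation or its group companies. Other product and company names are trade names, registered trademarks, or trademarks of their respective owners.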