Slide 1

Slide 1 text

Secure Container Runtime Development
Information Exchange Meeting for Container Technologies Part 15
Manabu Sugimoto
R&D Center, Sony Group Corporation
Copyright 2021 Sony Group Corporation

Slide 2

Slide 2 text

R&D Center Oct/9th/2021
About Me
- Manabu Sugimoto
  - System Software Engineer, R&D Center, Sony Group Corporation
- Interests
  - Container Virtualization, Linux Kernel, and Software Analysis

Slide 3

Slide 3 text

Outline of the Talk
- Our Rust-based secure and lightweight container runtime
  - Introduction of a presentation at Cloud Native Rust Day 2021
  - libseccomp-rs: Native Rust crate for the libseccomp library
- Development of a container runtime based on the Kata Containers agent
  - Activities in Kata Containers
  - Introduction of runk, a container runtime based on the Kata agent
- Secure containers through static system call analysis
  - Activities to enhance container security using Confine

Slide 4

Slide 4 text

Rust-based Secure and Lightweight Container Runtime for Embedded Systems
Introduction of the presentation at Cloud Native Rust Day 2021
https://sched.co/iLkx

Slide 5

Slide 5 text

Requirements of Embedded Systems
- Embedded systems have more restrictions than general-purpose systems
- Resource-constrained systems
  - Small memory size
  - Low-capacity storage
  - Low-spec CPU
- Mission-critical systems
  - Real-time application
  - Critical functionality
  - Longer life cycle

Slide 6

Slide 6 text

Containers on Embedded Systems
- Using Kubernetes or Docker on embedded systems is difficult
  - They incur performance overhead and high resource usage
  - Write operations by the daemon process shorten the lifespan of eMMC*
- We run only a low-level container runtime on these systems
*Embedded Multimedia Card

Slide 7

Slide 7 text

Problems of the Existing Container Runtimes
- Security
  - Linux capabilities do not provide fine-grained access control
    - e.g., both ping and ARP spoofing need CAP_NET_RAW
  - Rootless containers based on user namespaces are too restrictive for these systems
    - Rootless containers cannot emulate all system calls
    - However, some embedded applications need to access devices via mount(2), mknod(2), etc.
- Lightweight
  - Container startup time is not fast enough for real-time systems
  - Go-based runtimes are not suitable for resource-constrained systems
    - The application binary size is large
    - Garbage collection (GC) incurs high CPU utilization

Slide 8

Slide 8 text

Rust-based Container Runtime
- Secure and Lightweight container runtime (SL runtime)
  - SL runtime is implemented fully in Rust with modern crates
  - Minimal container runtime for embedded systems
  - OCI-compatible runtime

Slide 9

Slide 9 text

Comparison with the Existing Runtimes
[Table: Comparison Table from the Perspective of Embedded Systems]
*All binary files are stripped

Slide 10

Slide 10 text

Why Rust?
- Rust is a great fit for embedded systems
  - Performance is equivalent to C/C++
  - Memory safety without GC
  - Small application binary size
  - Awesome crates for developing the container runtime
  - FFI (Foreign Function Interface) to bind the Linux API
- Go is also a great programming language but has some limitations
  - Problems interacting with namespaces because of the Go runtime
  - The application binary size is large compared to Rust
  - Overhead from GC

Slide 11

Slide 11 text

Crates for the Container Runtime
- Many awesome crates for developing the container runtime
- Creating Container
  - capability: https://crates.io/crates/caps
  - rlimit: https://crates.io/crates/rlimit
  - cgroups: https://crates.io/crates/cgroups-rs
  - libseccomp-rs: https://crates.io/crates/libseccomp
  - passfd: https://crates.io/crates/passfd
    - This crate is used for the fine-grained access control
  - core_affinity: https://crates.io/crates/core_affinity
    - This crate is used for the real-time support
  - etc.
- Developing Runtime
  - clap: https://crates.io/crates/clap
  - serde_json: https://crates.io/crates/serde_json
  - anyhow: https://crates.io/crates/anyhow
  - etc.

Slide 12

Slide 12 text

Fast Startup Mechanism
- Launch a container quickly by leveraging a pre-created container
  - Removes the time spent initializing the runtime and creating the container
- Replace only the execution process inside the container at startup
  - Reuses the rest of the configuration, except for the execution process
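The idea of reusing a pre-created container can be pictured as a small state transition: the expensive setup is paid once, and a start request only supplies the process to run. The following is a minimal sketch of that bookkeeping, not the SL runtime's implementation. All names (`ContainerConfig`, `ProcessSpec`, `PreCreated`, `Ready`, `start_with`) are hypothetical, and no namespaces or cgroups are actually touched.

```rust
/// Hypothetical subset of the container configuration that is expensive to set up.
#[derive(Debug, Clone, PartialEq)]
pub struct ContainerConfig {
    pub rootfs: String,
    pub namespaces: Vec<String>,
    pub cgroup_path: String,
}

/// The only part that changes at startup: the process to execute.
#[derive(Debug, Clone, PartialEq)]
pub struct ProcessSpec {
    pub args: Vec<String>,
}

/// A container that was created ahead of time and is waiting for a process.
#[derive(Debug)]
pub struct PreCreated {
    config: ContainerConfig,
}

/// A container whose process has been filled in and can be started.
#[derive(Debug)]
pub struct Ready {
    pub config: ContainerConfig,
    pub process: ProcessSpec,
}

impl PreCreated {
    /// Slow path (assumed): paid once, before any start request arrives.
    pub fn create(config: ContainerConfig) -> Self {
        PreCreated { config }
    }

    /// Fast path: keep the existing configuration and replace only the process.
    pub fn start_with(self, process: ProcessSpec) -> Ready {
        Ready {
            config: self.config,
            process,
        }
    }
}
```

Because `start_with` consumes the `PreCreated` value, one pre-created container can be handed out only once in this sketch; a pool of such values would be one way to serve repeated start requests.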

Slide 13

Slide 13 text

Fine-Grained Access Control (FGAC)
- FGAC enables rootless containers to execute system calls safely
  - The FGAC server emulates the system calls in userspace on behalf of the container
  - Rootless containers can access devices safely via mount(2), mknod(2), etc.
- The FGAC mechanism is implemented using the new seccomp notify feature

Slide 14

Slide 14 text

Seccomp Notify Feature
- Provides a way to handle a particular system call in userspace
- Introduced in Linux 5.0

Slide 15

Slide 15 text

Design of FGAC
- Launch an FGAC server before starting a container
  - The server is launched as root, and only by a system administrator
- Run the container using a config.json that describes the seccomp notify
  - The OCI runtime specification already supports seccomp notify [1]
[1] https://github.com/opencontainers/runtime-spec/pull/1074
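To make the role of the FGAC server more concrete, the sketch below models only its decision step: given a forwarded mknod(2) or mount(2) request, either emulate it on behalf of the container or answer with an error. This is an illustration under assumptions, not the actual server. The types `Request`, `Decision`, and `Policy` and the policy fields are hypothetical, and the seccomp notify plumbing that would deliver the request is left out.

```rust
use std::collections::HashSet;
use std::path::Path;

/// A system call request forwarded to the server (hypothetical, simplified).
#[derive(Debug, Clone, PartialEq)]
pub enum Request {
    /// mknod(2) for a device node identified by its major/minor numbers.
    Mknod { major: u32, minor: u32 },
    /// mount(2) of a filesystem type onto a target path.
    Mount { fstype: String, target: String },
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Decision {
    /// The server performs the call on behalf of the container.
    Emulate,
    /// The server answers the container with this error number.
    Deny(i32),
}

/// EPERM on Linux.
pub const EPERM: i32 = 1;

/// Administrator-defined policy (all fields are assumptions of this sketch).
pub struct Policy {
    pub devices: HashSet<(u32, u32)>,
    pub fstypes: HashSet<String>,
    pub mount_prefix: String,
}

impl Policy {
    pub fn decide(&self, req: &Request) -> Decision {
        let allowed = match req {
            Request::Mknod { major, minor } => self.devices.contains(&(*major, *minor)),
            Request::Mount { fstype, target } => {
                // Component-wise prefix check; a real server would also have to
                // resolve symlinks and ".." before trusting the path.
                self.fstypes.contains(fstype)
                    && Path::new(target).starts_with(&self.mount_prefix)
            }
        };
        if allowed {
            Decision::Emulate
        } else {
            Decision::Deny(EPERM)
        }
    }
}
```

With a policy that allows device (1, 3), which is /dev/null on Linux, and tmpfs mounts under /mnt/data, a request to create any other device node or to mount onto /mnt/database would be denied.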

Slide 16

Slide 16 text

Evaluation
- Goals
  - Measure startup time of the containers: Normal run and Fast startup
  - Measure memory consumption of the container runtimes
- Environment
  - Host: AMD Ryzen 9 3900X 12-Core (Ubuntu 20.04)
  - Evaluated runtimes*: SL runtime, runsc, singularity, runc, crun, and railcar
- Experimental Setup
  - All the runtimes use the same config.json
  - Execute /bin/true inside the container runtimes without any client tools
*Versions of the evaluated runtimes: runsc v20201208.0, singularity v3.1.0, runc v1.0.0-rc93, crun v0.18, railcar v1.0.4

Slide 17

Slide 17 text

Results: Start Time
- SL runtime is the fastest of the evaluated runtimes
  - Normal run achieves a 7.4x speed-up compared to runc
  - Fast startup achieves a 1.5x speed-up compared to the Normal run

Slide 18

Slide 18 text

Results: Memory Usage
- SL runtime memory usage is equivalent to that of crun, which is written in C
- Rust is a great fit for resource-constrained systems

Slide 19

Slide 19 text

libseccomp-rs: Native Rust Crate for Libseccomp Library

Slide 20

Slide 20 text

Libseccomp-rs
- Native Rust crate for the libseccomp library
  - Repository: https://github.com/ManaSugi/libseccomp-rs
  - Anyone is welcome to join and contribute code, documentation, and use cases!
- A set of projects that enables developers to use the libseccomp API in Rust
  - libseccomp: High-level safe API
  - libseccomp-sys: Low-level unsafe API (automatically generated)
  - tool: Tool for generating low-level bindings using bindgen
- Make the libseccomp C library as safe as possible
  - e.g., Use NonNull to ensure the return value of seccomp_init is non-null
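The NonNull point can be illustrated without linking against libseccomp. In the sketch below, `fake_init` and `fake_release` are hypothetical stand-ins for a C constructor and destructor (such as `seccomp_init`, which returns NULL on failure), and `FilterContext` is an illustrative wrapper rather than the crate's actual type. The pattern shown is only the general one: check the pointer once at construction so the rest of the safe API never sees a null.

```rust
use std::ffi::c_void;
use std::ptr::NonNull;

/// Hypothetical stand-in for a C constructor that returns NULL on failure.
fn fake_init(succeed: bool) -> *mut c_void {
    if succeed {
        Box::into_raw(Box::new(0u8)) as *mut c_void
    } else {
        std::ptr::null_mut()
    }
}

/// Hypothetical stand-in for the matching C destructor.
fn fake_release(ptr: *mut c_void) {
    // SAFETY: `ptr` was produced by `Box::into_raw(Box::new(0u8))` in `fake_init`
    // and is released exactly once, from `Drop`.
    unsafe { drop(Box::from_raw(ptr as *mut u8)) }
}

/// Safe wrapper: holding a `NonNull` makes "context is never null" a type-level fact.
pub struct FilterContext {
    ctx: NonNull<c_void>,
}

impl FilterContext {
    /// `succeed` exists only to let the sketch exercise both outcomes.
    pub fn new(succeed: bool) -> Result<Self, &'static str> {
        NonNull::new(fake_init(succeed))
            .map(|ctx| FilterContext { ctx })
            .ok_or("could not initialize the filter context")
    }
}

impl Drop for FilterContext {
    fn drop(&mut self) {
        fake_release(self.ctx.as_ptr());
    }
}
```

`NonNull::new` returns `None` for a null pointer, so the failure case becomes an ordinary `Err` instead of a pointer that later methods would have to re-check.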

Slide 21

Slide 21 text

Example of Libseccomp-rs
- Create and load a single seccomp rule as follows

Slide 22

Slide 22 text

Discussion with Libseccomp Community
- We plan to move libseccomp-rs into the seccomp/libseccomp project
  - https://github.com/seccomp/libseccomp/issues/323
- Prepare to aggregate all tests in one tests directory of libseccomp
  - Get better functional code coverage for Golang and Rust
  - There are 58 tests

Slide 23

Slide 23 text

Development of a Container Runtime Based on Kata Containers Agent

Slide 24

Slide 24 text

Kata Containers
- Hypervisor-based containers
  - Additional isolation with a lightweight VM and individual kernels
  - The project is led by Intel, Apple, Red Hat and Ant Group
[Figure: Virtual Machines A, B, and C, each with its own guest kernel and containers, run on hardware virtualization above the host kernel]

Slide 25

Slide 25 text

Kata Containers Overview
- Kata Containers v2.0
[Figure: kubelet, CRI, containerd-shim-kata-v2 / runtime v2 (Shim API), the hypervisor (Launch VM), ttRPC, and a virtual machine containing the guest kernel, the kata-agent, and the Pod sandbox namespaces with their containers]

Slide 26

Slide 26 text

Kata Agent
- Manages container processes inside the VM
  - Supports many OCI behaviors
  - The execution unit is the sandbox
- Rewritten in Rust for the Kata Containers 2.0 release
  - Reduces the memory footprint while keeping memory safety
  - Released kata-agent-ctl, a developer tool that helps debug the agent API
- Communicates with the other Kata components over ttRPC

Slide 27

Slide 27 text

Features of Kata Agent
- OCI Behaviors
  - create/start containers
  - signal/wait process
  - exec/list process
  - IO/stream
  - cgroups
  - capabilities, rlimit, readonly path, masked path, users
  - container stats (stats_container)
  - hooks
- Agent Features & APIs
  - run agent as init (mount fs, udev, setup lo)
  - block device as root device
  - health API
  - network, interface/routes (update_container)
  - file transfer API (copy_file)
  - device APIs (reseed_random_device, online_cpu_memory, mem_hotplug_probe, set_guest_date_time)
  - VSOCK support
  - OCI spec validator
- Infrastructures
  - debug console
  - Command line

Slide 28

Slide 28 text

Our Activities to Enhance Security of Kata Agent
- Support Seccomp (WIP)
  - https://github.com/kata-containers/kata-containers/pull/1788
  - Make Kata Containers more secure inside the Pod sandbox
  - Use libseccomp-rs
- Support AppArmor (WIP)
  - https://github.com/kata-containers/kata-containers/issues/2227
  - Use prestart OCI hooks for loading the AppArmor profile in the VM
- runk: Standard OCI container runtime based on the kata-agent (WIP)

Slide 29

Slide 29 text

Kata Agent is a Process for Kata Containers
- The kata-agent is not a container runtime
  - Receives requests from containerd-shim-kata-v2 using ttRPC
  - Does not provide an OCI Command-Line Interface (CLI)
- The kata-agent has most of the features needed for a container runtime
  - OCI compatibility
- Can we develop a standard container runtime based on the kata-agent?
  - The kata-agent can be adopted in various systems
  - Makes it easier for the kata-agent to follow the latest OCI runtime specifications
  - Makes the kata-agent easier to test and debug

Slide 30

Slide 30 text

Our Proposal: runk
- runk: a Rust-based standard container runtime based on the kata-agent
  - runk spawns and runs containers directly on the host machine, like runc
  - Conforms to the OCI Container Runtime specifications
  - https://github.com/kata-containers/kata-containers/pull/2785

Slide 31

Slide 31 text

Performance of runk
- runk is faster than runc and has a lower memory footprint
  - 1.4x speed-up compared to runc

                        runk     runc     crun
time [msec]             39.01    53.09    35.35
memory footprint [MB]   4.223    16.54    3.157

This table shows the average elapsed time and memory footprint (maximum resident set size) for running 500 containers sequentially. The containers run /bin/true in detached mode; runk currently always runs containers in that mode.
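The ratios behind the table above can be recomputed directly from its numbers. The snippet is a small worked calculation, with the measurements copied from the table into constants; the helper name `ratio` is chosen only for this illustration.

```rust
/// Ratio of a baseline measurement to a candidate measurement.
/// A value above 1.0 means the candidate is smaller (faster or leaner).
pub fn ratio(baseline: f64, candidate: f64) -> f64 {
    baseline / candidate
}

// Values copied from the table above.
pub const RUNK_MSEC: f64 = 39.01;
pub const RUNC_MSEC: f64 = 53.09;
pub const CRUN_MSEC: f64 = 35.35;
pub const RUNK_MB: f64 = 4.223;
pub const RUNC_MB: f64 = 16.54;
pub const CRUN_MB: f64 = 3.157;

/// Speed-up of runk over runc: 53.09 / 39.01.
pub fn runk_vs_runc_time() -> f64 {
    ratio(RUNC_MSEC, RUNK_MSEC)
}

/// Memory reduction of runk relative to runc: 16.54 / 4.223.
pub fn runk_vs_runc_memory() -> f64 {
    ratio(RUNC_MB, RUNK_MB)
}

/// runk relative to crun on time: 35.35 / 39.01 (below 1.0 means crun is faster).
pub fn runk_vs_crun_time() -> f64 {
    ratio(CRUN_MSEC, RUNK_MSEC)
}

/// runk relative to crun on memory: 3.157 / 4.223 (below 1.0 means crun uses less).
pub fn runk_vs_crun_memory() -> f64 {
    ratio(CRUN_MB, RUNK_MB)
}
```

The time ratio against runc comes out at about 1.36, which the slide rounds to 1.4x, and the memory ratio is about 3.9. Both ratios against crun are below 1.0, so in this table crun remains slightly faster and smaller than runk.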

Slide 32

Slide 32 text

Secure Containers through Static System Call Analysis

Slide 33

Slide 33 text

Container Security
- Containers provide weaker isolation than VMs
  - Attackers can exploit kernel vulnerabilities to compromise the host
- The trusted computing base of a container comprises the entire kernel
  - The kernel code base has been expanding along with the number of system calls
[Figure: containers with trusted and untrusted applications run on a container runtime, operating system, and hardware; the untrusted application exploits the platform and compromises the other containers]

Slide 34

Slide 34 text

Countermeasure to Large Code Base of Linux Kernel
- Reduce the attack surface of the kernel
  - Remove code that is inaccessible or not needed for a given workload or configuration
  - The NIST container security guidelines [2] suggest reducing the attack surface by limiting the functionality available to containers
- Seccomp limits system calls for containers
  - Deny potentially dangerous system calls
  - Reduce the kernel code available to each container
[Figure: seccomp sits between the application and the Linux kernel and passes only a few syscalls]
[2] Karen Scarfone, Murugiah Souppaya, and John Morello. Application Container Security Guide, 2017. https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-190.pdf

Slide 35

Slide 35 text

Challenge of Seccomp
- Need to know all system calls issued by containers
  - A seccomp filter requires system call numbers to build the deny or allow list
  - Docker, containerd, and CRI-O drop potentially dangerous system calls with their default seccomp profiles
    - e.g., pivot_root, ptrace, unshare, etc.
- How do we get all system calls inside an application binary?
  - Static analysis
  - Dynamic analysis

Slide 36

Slide 36 text

Problems of Existing Approaches
- Static code analysis takes a lot of time
- Dynamic analysis does not capture all the code exhaustively
  - It misses parts of the code that are rarely executed, such as error handling routines
- Combining static and dynamic analysis is important for analyzing system calls in the application binary effectively

Slide 37

Slide 37 text

Confine [Ghavamnia et al. RAID’20]
- Automated system call policy generation for attack surface reduction [3]
  - Confine takes a container image as input and generates a system call policy
- Reduce the attack surface of the kernel by limiting system calls in containers
  - Extract the system calls automatically using static and dynamic analysis
  - Restrict containers to the extracted system calls using Seccomp
- Results of the evaluation by the authors
  - Confine can disable 145 system calls (out of 326) using 150 Docker images
  - Confine can neutralize 51 previously disclosed kernel vulnerabilities
[3] Seyedhamed Ghavamnia, Tapti Palit, Azzedine Benameur, and Michalis Polychronakis. Confine: Automated System Call Policy Generation for Container Attack Surface Reduction. In International Symposium on Research in Attacks, Intrusions and Defenses (RAID), 2020.

Slide 38

Slide 38 text

Design of Confine
- Identify all applications that may run in the container
- Identify all library functions imported by each application
- Map library functions to system calls
- Extract direct system call invocations from applications and libraries
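The last three steps of this design amount to a set computation: look up the system calls behind each imported library function, add the directly invoked ones, and deny everything else. The sketch below shows that computation only. It is not Confine's code: the table `LIB_TO_SYSCALLS` is a tiny illustrative mapping invented for this example, and the function names are hypothetical.

```rust
use std::collections::BTreeSet;

/// Illustrative library-function-to-syscall table (an assumption of this
/// sketch, not data taken from Confine).
pub const LIB_TO_SYSCALLS: &[(&str, &[&str])] = &[
    ("fopen", &["openat"]),
    ("printf", &["write"]),
    ("malloc", &["brk", "mmap"]),
];

/// Union of the syscalls reachable through imported library functions and
/// the syscalls invoked directly.
pub fn required_syscalls(
    imported: &[&str],
    direct: &[&str],
    table: &[(&str, &[&str])],
) -> BTreeSet<String> {
    let mut used = BTreeSet::new();
    for func in imported {
        for (name, calls) in table {
            if name == func {
                for call in calls.iter() {
                    used.insert(call.to_string());
                }
            }
        }
        // Functions missing from the table are silently skipped here; a real
        // tool would have to handle them conservatively.
    }
    for call in direct {
        used.insert(call.to_string());
    }
    used
}

/// Every known syscall that is not required ends up on the deny list.
pub fn deny_list(all: &[&str], used: &BTreeSet<String>) -> Vec<String> {
    all.iter()
        .filter(|name| !used.contains(**name))
        .map(|name| name.to_string())
        .collect()
}
```

For an application that imports `fopen` and `printf` and invokes `exit_group` directly, the required set is `exit_group`, `openat`, and `write`; out of a universe of `brk`, `openat`, `ptrace`, and `write`, the deny list is then `brk` and `ptrace`.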

Slide 39

Slide 39 text

Our Motivation
- Default profiles of high-level container runtimes are not accurate
  - They include many system calls that are not used in the containers
- Recently, new static system call analysis techniques have been proposed in research papers
  - We can analyze system calls accurately
If container image developers can use Confine or other new static analysis tools to extract the system calls used by their container images, they can generate default profiles for those images that are more accurate than the runtime default profiles.

Slide 40

Slide 40 text

Our Proposal: SecurityContext in OCI Image Spec
- Define a new SecurityContext media type in the OCI Image Spec (WIP)
  - Include Seccomp and Linux Capability configurations inside the image spec
  - Issue: https://github.com/opencontainers/image-spec/issues/867
- Goal
  - Allow users to choose the image default SecurityContext, including the default seccomp profiles, from container orchestration software such as Kubernetes
[Figure: the user chooses among the Runtime Default Profile, a Custom Profile, and the new Image Default SecurityContext carried in the OCI image]

Slide 41

Slide 41 text

Expected Use Cases of SecurityContext
- User side:
  - Set default seccomp profiles for a container image in the Kubernetes configuration
- Image developer side:
  - Analyze a container image using system call analysis tools such as Confine
  - Add information about Seccomp to SecurityContext in the image spec
  - Push the container image to public or private registries
Example Kubernetes configuration:
  spec:
    securityContext:
      seccompProfile:
        type: ImageDefault
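The user-side choice can be sketched as a small resolution function that picks one profile out of the three sources. This is an illustration of the proposal, not an existing Kubernetes or OCI API: the enum `ProfileChoice`, the function `resolve`, the modeling of a profile as a list of allowed system call names, and the decision to fail when an image carries no default are all assumptions of this sketch.

```rust
/// Which seccomp profile the user asked for (hypothetical model).
#[derive(Debug, Clone, PartialEq)]
pub enum ProfileChoice {
    /// Profile shipped with the container runtime.
    RuntimeDefault,
    /// Profile carried inside the image (the proposed addition).
    ImageDefault,
    /// Profile supplied by the user.
    Custom(Vec<String>),
}

/// Returns the list of allowed syscall names for the chosen profile.
/// Assumption: asking for ImageDefault on an image without a default
/// SecurityContext is an error rather than a silent fallback.
pub fn resolve(
    choice: &ProfileChoice,
    runtime_default: &[String],
    image_default: Option<&[String]>,
) -> Result<Vec<String>, String> {
    match choice {
        ProfileChoice::RuntimeDefault => Ok(runtime_default.to_vec()),
        ProfileChoice::ImageDefault => image_default
            .map(|profile| profile.to_vec())
            .ok_or_else(|| "image carries no default SecurityContext".to_string()),
        ProfileChoice::Custom(profile) => Ok(profile.clone()),
    }
}
```

Failing loudly in the missing-default case is one possible design; falling back to the runtime default would be another, at the cost of hiding which profile is actually in effect.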

Slide 42

Slide 42 text

System Overview (WIP)
- General-purpose systems
  - The container image developer runs a system call analysis tool based on Confine with Docker and K8s, generates the name list of the limited system calls, and puts the list in SecurityContext
    - We need to add the system call list to the OCI image spec: https://github.com/opencontainers/image-spec/issues/867
  - The OCI image is stored in the image registry
  - The user chooses the Image Default SecurityContext; kubelet (CRI) and containerd pull the image and generate Seccomp profiles in config.json ("seccomp": { .... }), and the container runtime (OCI) launches a secure container
    - We need to modify the K8s specification to be able to choose Seccomp profiles generated by Confine in the image spec
- Embedded systems
  - Run Confine with a "low-level" container runtime, which launches a secure container using "seccomp": { .... }
- We need to develop and modify the systems and the specifications
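The step that turns a system call list into the "seccomp" section of config.json can be sketched as follows. The field names (`defaultAction`, `architectures`, `syscalls`, `names`, `action`) and the action constants follow the OCI runtime specification; the function name and the choice of an allow list with a default of `SCMP_ACT_ERRNO` are assumptions of this sketch, and the string building assumes syscall names that need no JSON escaping.

```rust
/// Builds the value of the "seccomp" key as a compact JSON string:
/// everything is denied by default and only `allowed` is permitted.
/// Assumes `arch` and the names contain no characters that need escaping.
pub fn seccomp_allow_list_json(arch: &str, allowed: &[&str]) -> String {
    let names: Vec<String> = allowed
        .iter()
        .map(|name| format!("\"{}\"", name))
        .collect();
    format!(
        "{{\"defaultAction\":\"SCMP_ACT_ERRNO\",\"architectures\":[\"{}\"],\"syscalls\":[{{\"names\":[{}],\"action\":\"SCMP_ACT_ALLOW\"}}]}}",
        arch,
        names.join(",")
    )
}
```

A production tool would more likely serialize a typed structure with a JSON library such as serde_json; plain string formatting is used here only to keep the sketch free of external crates.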

Slide 43

Slide 43 text

Key Takeaways
- Rust is a great fit for embedded systems
  - Small memory footprint and binary size for resource-constrained systems
  - Memory safety without GC overhead for mission-critical systems
  - Awesome crates for developing container runtimes
- runk is a standard container runtime based on a modified version of the Kata Containers agent
- System call analysis tools make containers more secure
  - Various state-of-the-art system call analysis techniques have been proposed in research papers

Slide 44

Slide 44 text

SONY is a registered trademark of Sony Group Corporation. Names of Sony products and services are the registered trademarks and/or trademarks of Sony Group Corporation or its Group companies. Other company names and product names are registered trademarks and/or trademarks of the respective companies.