Secure Container Runtime Development


Manabu Sugimoto

October 09, 2021


  1. Secure Container Runtime Development

    Information Exchange Meeting for Container Technologies Part 15
    Manabu Sugimoto, R&D Center, Sony Group Corporation
    Copyright 2021 Sony Group Corporation
  2. R&D Center Oct/9th/2021 About Me

    - Manabu Sugimoto
      - System Software Engineer, R&D Center, Sony Group Corporation
    - Interests
      - Container Virtualization, Linux Kernel, and Software Analysis
  3. Outline of the Talk

    - Our Rust-based secure and lightweight container runtime
      - Introduction of a presentation at Cloud Native Rust Day 2021
    - libseccomp-rs: Native Rust crate for the libseccomp library
    - Development of a container runtime based on the Kata Containers agent
      - Activities of Kata Containers
      - Introduction of runk, a container runtime based on the Kata agent
    - Secure containers by static-analysis-based system call analysis
      - Activities of container security enhancement using Confine
  4. Rust-based Secure and Lightweight Container Runtime for Embedded Systems

    Introduction of the presentation at Cloud Native Rust Day 2021
    https://sched.co/iLkx
  5. Requirements of Embedded Systems

    - Embedded systems have more restrictions than general-purpose systems
    - Resource-constrained systems
      - Small memory size
      - Low-capacity storage
      - Low-spec CPU
    - Mission-critical systems
      - Real-time application
      - Critical functionality
      - Longer life cycle
  6. Containers on Embedded Systems

    - Using Kubernetes or Docker on embedded systems is difficult
      - They incur performance overhead and high resource usage
      - Write operations by the daemon process shorten the lifespan of eMMC*
    - We run a low-level container runtime alone on the systems

    *Embedded Multimedia Card
  7. Problems of the Existing Container Runtimes

    - Security
      - Linux capabilities do not provide fine-grained access control
        - e.g., both ping and ARP spoofing need CAP_NET_RAW
      - Rootless containers based on user namespaces are too restrictive for embedded systems
        - Rootless containers cannot emulate all system calls
        - However, some embedded applications need to access devices via mount(2), mknod(2), etc.
    - Lightweight
      - Container startup time is not fast enough for real-time systems
      - Go-based runtimes are not suitable for resource-constrained systems
        - The application binary size is big
        - Garbage Collection (GC) causes high CPU utilization
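
The coarseness of capabilities noted above can be seen in how a process's capability set is encoded: one bit per capability, so anything that needs raw sockets shares the same CAP_NET_RAW bit. The following is a minimal sketch, not code from the SL runtime. The helper names `parse_cap_eff` and `has_cap` are made up for illustration; the bit numbers are the ones defined in `<linux/capability.h>`.

```rust
/// Capability bit numbers as defined in <linux/capability.h>.
pub const CAP_NET_RAW: u32 = 13;
pub const CAP_SYS_ADMIN: u32 = 21;
pub const CAP_MKNOD: u32 = 27;

/// Extract the effective capability mask from the text of /proc/<pid>/status.
/// Returns None if there is no `CapEff:` line or the value is not valid hex.
/// (Hypothetical helper, for illustration only.)
pub fn parse_cap_eff(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("CapEff:"))
        .and_then(|hex| u64::from_str_radix(hex.trim(), 16).ok())
}

/// Check whether a single capability bit is set in the mask.
pub fn has_cap(mask: u64, cap: u32) -> bool {
    (mask & (1u64 << cap)) != 0
}
```

On Linux, the input could come from `std::fs::read_to_string("/proc/self/status")`. A mask with bit 13 set permits both a ping-like tool and an ARP spoofer to open raw sockets, which is the limitation the slide points out.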
  8. Rust-based Container Runtime

    - Secure and Lightweight container runtime (SL runtime)
      - SL runtime is implemented fully in Rust with modern crates
      - Minimal container runtime for embedded systems
      - OCI-compatible runtime
  9. Comparison with the Existing Runtimes

    Comparison Table from the Perspective of Embedded Systems
    *All binary files are stripped
  10. Why Rust?

    - Rust is a great fit for embedded systems
      - Performance is equivalent to C/C++
      - Memory safety without GC
      - Small application binary size
      - Awesome crates for developing the container runtime
      - FFI (Foreign Function Interface) to bind the Linux API
    - Go is also a great programming language but has some limitations
      - Problems interacting with namespaces caused by the Go runtime
      - The application binary size is big compared to Rust
      - Overhead caused by GC
  11. Crates for the Container Runtime

    - Many awesome crates for developing the container runtime
    - Creating Container
      - capability: https://crates.io/crates/caps
      - rlimit: https://crates.io/crates/rlimit
      - cgroups: https://crates.io/crates/cgroups-rs
      - libseccomp-rs: https://crates.io/crates/libseccomp
      - passfd: https://crates.io/crates/passfd
        - This crate is used for the fine-grained access control
      - core_affinity: https://crates.io/crates/core_affinity
        - This crate is used for the real-time support
      - etc.
    - Developing Runtime
      - clap: https://crates.io/crates/clap
      - serde_json: https://crates.io/crates/serde_json
      - anyhow: https://crates.io/crates/anyhow
      - etc.
  12. Fast Startup Mechanism

    - Launch a container quickly by leveraging a pre-created container
      - Removes the time for initializing the runtime and creating the container
      - Replaces only the execution process inside the container at startup
      - Reuses the rest of the configuration, except for the execution process
  13. Fine-Grained Access Control (FGAC)

    - FGAC enables rootless containers to execute system calls safely
      - The FGAC server emulates the system calls in userspace on behalf of the container
      - Rootless containers can access devices safely via mount(2), mknod(2), etc.
    - The FGAC mechanism is built on the new seccomp notify feature
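
To make the idea of a server deciding on behalf of the container more concrete, here is a simplified model of the decision step only. It is a sketch under stated assumptions, not the FGAC implementation: the `Notification`, `Decision`, and `Policy` types and the `example_policy` contents are hypothetical, and a real server would receive notifications from the kernel through the seccomp notify file descriptor instead of plain structs.

```rust
use std::collections::HashSet;

/// Simplified stand-in for one intercepted system call (hypothetical type).
pub struct Notification<'a> {
    pub syscall: &'a str,
    pub path: &'a str,
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    /// The server performs the call on behalf of the container.
    Emulate,
    /// The server answers with an errno instead of performing the call.
    Deny(i32),
}

/// Hypothetical policy: which calls may be emulated, and for which paths.
pub struct Policy {
    allowed_syscalls: HashSet<&'static str>,
    allowed_path_prefixes: Vec<&'static str>,
}

impl Policy {
    pub fn decide(&self, n: &Notification) -> Decision {
        const EPERM: i32 = 1;
        let syscall_ok = self.allowed_syscalls.contains(n.syscall);
        // Naive prefix match; a real policy would canonicalize paths first.
        let path_ok = self
            .allowed_path_prefixes
            .iter()
            .any(|prefix| n.path.starts_with(*prefix));
        if syscall_ok && path_ok {
            Decision::Emulate
        } else {
            Decision::Deny(EPERM)
        }
    }
}

/// Example policy (assumed values): allow mount and mknod under /dev/ only.
pub fn example_policy() -> Policy {
    Policy {
        allowed_syscalls: ["mount", "mknod"].into_iter().collect(),
        allowed_path_prefixes: vec!["/dev/"],
    }
}
```

The point of the model is that the decision can depend on arguments such as the target path, which is finer-grained than a per-capability switch.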
  14. Seccomp Notify Feature

    - Provides a way to handle a particular system call in userspace
    - Introduced in Linux 5.0
  15. Design of FGAC

    - Launch a FGAC server before starting a container
      - The server is launched as root, and only by a system administrator
    - Run the container using a config.json that describes the seccomp notify
      - The OCI runtime specification already supports seccomp notify [1]

    [1] https://github.com/opencontainers/runtime-spec/pull/1074
  16. Evaluation

    - Goals
      - Measure startup time of the containers: Normal run and Fast startup
      - Measure memory consumption of the container runtimes
    - Environment
      - Host: AMD Ryzen 9 3900X 12-Core (Ubuntu 20.04)
      - Evaluated runtimes*: SL runtime, runsc, singularity, runc, crun, and railcar
    - Experimental Setup
      - All the runtimes use the same config.json
      - Execute /bin/true inside the container runtimes without any client tools

    *Versions of the evaluated runtimes: runsc: v20201208.0, singularity: v3.1.0, runc: v1.0.0-rc93, crun: v0.18, railcar: v1.0.4
  17. Results: Start Time

    - SL runtime is the fastest among the evaluated runtimes
      - Normal run achieves a 7.4x speed-up compared to runc
      - Fast startup achieves a 1.5x speed-up compared to the Normal run
  18. Results: Memory Usage

    - SL runtime memory usage is equivalent to that of crun, which is written in C
    - Rust is a great fit for resource-constrained systems
  19. libseccomp-rs: Native Rust Crate for Libseccomp

  20. Libseccomp-rs

    - Native Rust crate for the libseccomp library
      - Repository: https://github.com/ManaSugi/libseccomp-rs
      - Anyone is welcome to join and contribute code, documentation, and use cases!
    - A set of projects that enables developers to use the libseccomp API in Rust
      - libseccomp: High-level safe API
      - libseccomp-sys: Low-level unsafe API (automatically generated)
      - tool: Tool for generating low-level bindings using bindgen
    - Make the libseccomp C library as safe as possible
      - e.g., use NonNull to ensure the return value of seccomp_init is non-null
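
The NonNull point can be illustrated without linking against libseccomp. The sketch below is not the crate's actual code: `RawCtx` and `raw_init` are stand-ins invented here for an opaque C context and a C constructor that returns NULL on failure, and `FilterContext` is a hypothetical safe wrapper. Only `std::ptr::NonNull` is the real API being shown.

```rust
use std::ptr::NonNull;

/// Stand-in for an opaque C context (hypothetical, not the libseccomp type).
#[repr(C)]
pub struct RawCtx {
    default_action: u32,
}

/// Stand-in for a C constructor that returns NULL on failure.
fn raw_init(default_action: u32, fail: bool) -> *mut RawCtx {
    if fail {
        std::ptr::null_mut()
    } else {
        Box::into_raw(Box::new(RawCtx { default_action }))
    }
}

/// Safe wrapper: once constructed, the pointer is known to be non-null.
pub struct FilterContext {
    ctx: NonNull<RawCtx>,
}

impl FilterContext {
    pub fn new(default_action: u32, fail: bool) -> Result<Self, String> {
        // NonNull::new returns None for a null pointer, so the NULL check
        // happens exactly once, at the FFI boundary.
        NonNull::new(raw_init(default_action, fail))
            .map(|ctx| FilterContext { ctx })
            .ok_or_else(|| "init returned NULL".to_string())
    }

    pub fn default_action(&self) -> u32 {
        // SAFETY: ctx is non-null and points to an allocation owned by self.
        unsafe { self.ctx.as_ref().default_action }
    }
}

impl Drop for FilterContext {
    fn drop(&mut self) {
        // SAFETY: the pointer came from Box::into_raw and is freed only here.
        unsafe { drop(Box::from_raw(self.ctx.as_ptr())) };
    }
}
```

With this shape, callers never see a raw pointer, and a failed initialization surfaces as an `Err` rather than a later null dereference.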
  21. Example of Libseccomp-rs

    Create and load a single seccomp rule as follows
  22. Discussion with Libseccomp Community

    - We plan to move libseccomp-rs into the seccomp/libseccomp project
      - https://github.com/seccomp/libseccomp/issues/323
    - Prepare for aggregating all tests in one tests directory of libseccomp
      - Get better functional code coverage for Golang and Rust
      - There are 58 tests
  23. Development of a Container Runtime Based on Kata Containers Agent
  24. Kata Containers

    - Hypervisor-based containers
      - Additional isolation with a lightweight VM and individual kernels
      - The project is led by Intel, Apple, Red Hat and Ant Group

    [Figure: Virtual Machines A, B, and C on the Host Kernel, each using Hardware Virtualization, each with its own Guest Kernel and two Containers]
  25. Kata Containers Overview

    - Kata Containers v2.0

    [Figure: Kata Containers overview. Labels: kubelet, CRI, containerd-shim-kata-v2 / runtime v2 (Shim API), Launch VM, ttRPC, Hypervisor, Virtual Machine, Guest Kernel, kata-agent, Pod Sandbox, Namespaces, Container, Container]
  26. Kata Agent

    - Manages container processes inside the VM
      - Supports many OCI behaviors
      - The execution unit is the sandbox
    - Rewritten in Rust for the Kata Containers 2.0 release
      - Reduces the memory footprint while keeping memory safety
      - Releases kata-agent-ctl, a useful developer tool to aid in agent API debugging
    - Communicates with the other Kata components over ttRPC
  27. Features of Kata Agent

    - OCI Behaviors
      - create/start containers
      - signal/wait process
      - exec/list process
      - I/O stream
      - cgroups
      - capabilities, rlimit, readonly path, masked path, users
      - container stats (stats_container)
      - hooks
    - Agent Features & APIs
      - run agent as init (mount fs, udev, setup lo)
      - block device as root device
      - health API
      - network, interface/routes (update_container)
      - file transfer API (copy_file)
      - device APIs (reseed_random_device, online_cpu_memory, mem_hotplug_probe, set_guest_date_time)
      - VSOCK support
      - OCI spec validator
    - Infrastructures
      - debug console
      - Command line
  28. Our Activities to Enhance Security of Kata Agent

    - Support Seccomp (WIP)
      - https://github.com/kata-containers/kata-containers/pull/1788
      - Make Kata Containers more secure inside the Pod sandbox
      - Use libseccomp-rs
    - Support AppArmor (WIP)
      - https://github.com/kata-containers/kata-containers/issues/2227
      - Use prestart OCI hooks for loading the AppArmor profile in the VM
    - runk: Standard OCI container runtime based on the kata-agent (WIP)
  29. Kata Agent is a Process for Kata Containers

    - The kata-agent is not a container runtime
      - Receives requests from containerd-shim-kata-v2 using ttRPC
      - Does not provide an OCI Command-Line Interface (CLI)
    - The kata-agent has most of the features needed for a container runtime
      - OCI compatibility
    - Can we develop a standard container runtime based on the kata-agent?
      - The kata-agent can be adopted in various systems
      - Make it easier for the kata-agent to follow the latest OCI runtime specifications
      - Make the kata-agent easier to test and debug
  30. Our Proposal: runk

    - runk: a Rust-based standard container runtime based on the kata-agent
      - runk spawns and runs containers directly on the host machine, like runc
      - Conforms to the OCI Container Runtime specification
      - https://github.com/kata-containers/kata-containers/pull/2785
  31. Performance of runk

    - runk is faster than runc and has a lower memory footprint
      - 1.4x speed-up compared to runc

                               runk    runc    crun
      time [msec]              39.01   53.09   35.35
      memory footprint [MB]    4.223   16.54   3.157

    This table shows the average elapsed time and memory footprint (maximum resident set size) for running 500 containers sequentially. The containers run /bin/true in detached mode; runk currently always runs containers in this mode.
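
The quoted speed-up can be rechecked from the table values. The snippet below is only a worked calculation; the function and constant names are made up for illustration, and the numbers are copied from the table above.

```rust
/// Values copied from the table: (runtime, time in msec, max RSS in MB).
pub const MEASUREMENTS: [(&str, f64, f64); 3] = [
    ("runk", 39.01, 4.223),
    ("runc", 53.09, 16.54),
    ("crun", 35.35, 3.157),
];

/// Ratio of a baseline measurement to a candidate measurement.
/// Values above 1.0 mean the candidate is faster (or smaller).
pub fn improvement(baseline: f64, candidate: f64) -> f64 {
    baseline / candidate
}

/// Returns (time ratio, memory ratio) of runc relative to runk.
pub fn runc_vs_runk() -> (f64, f64) {
    let (_, runk_ms, runk_mb) = MEASUREMENTS[0];
    let (_, runc_ms, runc_mb) = MEASUREMENTS[1];
    (improvement(runc_ms, runk_ms), improvement(runc_mb, runk_mb))
}
```

The time ratio comes out at roughly 1.36, which rounds to the 1.4x on the slide, and the memory ratio at roughly 3.9. By the same table, crun is still slightly ahead of runk on both metrics.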
  32. Secure Containers by Static-based System Call Analysis Technique
  33. Container Security

    - Containers provide weaker isolation than VMs
      - Attackers can exploit kernel vulnerabilities to compromise the host
      - The trusted computing base of a container comprises the entire kernel
      - The code base of the kernel has been expanding along with the number of system calls

    [Figure: an Untrusted Application in one container exploits the platform and compromises the containers running Trusted Applications. Labels: Hardware, Operating System, Container Platform, Container Runtime, Container, Exploit, Compromise]
  34. Countermeasure to Large Code Base of Linux Kernel

    - Reduce the attack surface of the kernel
      - Remove code that is inaccessible or not needed for a given workload or configuration
      - The NIST container security guidelines [2] suggest reducing the attack surface by limiting the functionality available to containers
    - Seccomp limits system calls for containers
      - Deny potentially dangerous system calls
      - Reduce the kernel code available to each container

    [Figure: seccomp between the Application and the Linux Kernel, passing only a few syscalls]

    [2] Murugiah Souppaya, John Morello, and Karen Scarfone. Application Container Security Guide, 2017. https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-190.pdf
  35. Challenge of Seccomp

    - Need to know all system calls issued by containers
      - A seccomp filter requires system call numbers to build the deny list or allow list
      - Docker, containerd, and CRI-O drop potentially dangerous system calls with their default seccomp profiles
        - e.g., pivot_root, ptrace, unshare, etc.
    - How do we get all system calls inside an application binary?
      - Static analysis
      - Dynamic analysis
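
A seccomp profile can be thought of as a default action plus per-system-call overrides; a deny list is the case where the default is to allow. The following is a simplified model of that evaluation, not how any particular runtime implements it. The `Action` and `Profile` types are hypothetical, and the model matches on names for readability, whereas a real filter matches on architecture-specific system call numbers.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Action {
    Allow,
    /// Fail the call with the given errno.
    Errno(i32),
}

/// Hypothetical profile: a default action plus per-syscall overrides.
pub struct Profile {
    default_action: Action,
    rules: Vec<(String, Action)>,
}

impl Profile {
    /// Build a deny list: everything is allowed except the named calls.
    pub fn deny_list(denied: &[&str]) -> Self {
        const EPERM: i32 = 1;
        Profile {
            default_action: Action::Allow,
            rules: denied
                .iter()
                .map(|name| (name.to_string(), Action::Errno(EPERM)))
                .collect(),
        }
    }

    /// Return the action for one system call: the first matching rule wins,
    /// otherwise the default action applies.
    pub fn evaluate(&self, syscall: &str) -> Action {
        self.rules
            .iter()
            .find(|(name, _)| name.as_str() == syscall)
            .map(|(_, action)| *action)
            .unwrap_or(self.default_action)
    }
}
```

An allow list would use the same structure with the default action set to an errno and the rules set to `Allow`, which is why knowing the complete set of system calls an application issues matters.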
  36. Problems of Existing Approaches

    - Static code analysis takes a lot of time
    - Dynamic analysis does not exhaustively capture all the code
      - It misses parts of the code that are rarely executed, such as error handling routines
    - Combining static and dynamic analysis is important for analyzing the system calls in an application binary effectively
  37. Confine [Ghavamnia et al. RAID’20]

    - Automated system call policy generation for attack surface reduction [3]
      - Confine takes a container image as input and generates a system call policy
    - Reduce the attack surface of the kernel by limiting system calls in containers
      - Extract the system calls automatically using static and dynamic analysis
      - Restrict containers to the extracted system calls using Seccomp
    - Results of the evaluation by the authors
      - Confine can disable 145 system calls (out of 326) using 150 Docker images
      - Confine can neutralize 51 previously disclosed kernel vulnerabilities

    [3] Seyedhamed Ghavamnia, Tapti Palit, Azzedine Benameur, and Michalis Polychronakis. Confine: Automated System Call Policy Generation for Container Attack Surface Reduction. In International Symposium on Research in Attacks, Intrusions and Defenses (RAID), 2020.
  38. Design of Confine

    - Identify all applications that may run on the container
    - Identify all library functions imported by each application
    - Map library functions to system calls
    - Extract direct system call invocations from applications and libraries
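
The last three steps above amount to a set union followed by a set difference. The sketch below shows only that final combination and is not Confine's implementation: the function names are hypothetical, and the inputs (imported functions, the library-function-to-syscall map, direct invocations) are assumed to have been produced by earlier analysis.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Union of the syscalls reachable through imported library functions and
/// the syscalls invoked directly. Functions missing from the map are skipped
/// in this sketch; a real tool would have to treat them conservatively.
pub fn required_syscalls(
    imported_functions: &[&str],
    function_to_syscalls: &BTreeMap<&str, Vec<&str>>,
    direct_syscalls: &[&str],
) -> BTreeSet<String> {
    let mut required = BTreeSet::new();
    for function in imported_functions {
        if let Some(syscalls) = function_to_syscalls.get(*function) {
            required.extend(syscalls.iter().map(|s| s.to_string()));
        }
    }
    required.extend(direct_syscalls.iter().map(|s| s.to_string()));
    required
}

/// Everything the kernel offers minus what the image needs: the calls that
/// a generated seccomp policy could disable. Order follows `all_syscalls`.
pub fn deny_list(all_syscalls: &[&str], required: &BTreeSet<String>) -> Vec<String> {
    all_syscalls
        .iter()
        .filter(|s| !required.contains(**s))
        .map(|s| s.to_string())
        .collect()
}
```

For example, with an illustrative (made-up) map in which `fopen` reaches `openat` and `fstat` and `puts` reaches `write`, plus a direct `exit_group`, the required set has four entries and every other system call lands on the deny list.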
  39. Our Motivation

    - Default profiles of high-level container runtimes are not accurate
      - They include many system calls that are not used in the containers
    - Recently, new static-analysis-based system call analysis techniques have been proposed in research papers
      - We can analyze system calls accurately

    If container image developers can use Confine or other new static analysis tools to extract the system calls used by container images, they can generate more accurate default profiles for those images than the runtime default profiles.
  40. Our Proposal: SecurityContext in OCI Image Spec

    - Define a new SecurityContext media type in the OCI Image Spec (WIP)
      - Include Seccomp and Linux Capability configurations inside the image spec
      - Issue: https://github.com/opencontainers/image-spec/issues/867
    - Goal
      - Allow users to choose the image default SecurityContext, including the default seccomp profiles, from container orchestration software such as Kubernetes

    [Figure: the User chooses among the Image Default SecurityContext (new, carried in the OCI Image), the Runtime Default Profile, and a Custom Profile]
  41. Expected Use Cases of SecurityContext

    - User side:
      - Set default seccomp profiles for a container image in the Kubernetes configuration
    - Image developer side:
      - Analyze a container image using system call analysis tools such as Confine
      - Add information about Seccomp to the SecurityContext in the image spec
      - Push the container image to public or private registries

      spec:
        securityContext:
          seccompProfile:
            type: ImageDefault
  42. System Overview (WIP)

    [Figure: system overview covering General-purpose Systems and Embedded Systems. Labels: Container Image Developer, System Call Analysis Tool based on Confine, OCI Image, Image Default SecurityContext, Image Registry, kubelet, CRI, containerd, Container Runtime, OCI, Secure Container, User]

    - Container image developer
      - Run Confine with Docker and K8s, or with a “low-level” container runtime
      - Generate a limited list of system call names
      - Put the system call list in SecurityContext
      - Store the image in the image registry
    - Launching a container
      - Pull the image
      - Generate Seccomp profiles in config.json (“seccomp”: { .... })
      - Launch a container
    - Remaining work
      - We need to add the system call list to the OCI image spec: https://github.com/opencontainers/image-spec/issues/867
      - We need to modify the K8s specification to be able to choose Seccomp profiles generated by Confine in the image spec
      - We need to develop and modify the systems and the specifications
  43. Key Takeaways

    - Rust is a great fit for embedded systems
      - Small memory footprint and binary size for resource-constrained systems
      - Memory safety without any overhead for mission-critical systems
      - Awesome crates for developing container runtimes
    - runk is a standard container runtime based on a modified version of the Kata Containers agent
    - System call analysis tools make containers more secure
      - Various state-of-the-art system call analysis techniques have been proposed in research papers
  44. SONY is a registered trademark of Sony Group Corporation. Names of Sony products and services are the registered trademarks and/or trademarks of Sony Group Corporation or its Group companies. Other company names and product names are registered trademarks and/or trademarks of the respective companies.