Send in the chown()s - systemd containers in user namespaces

Linux container escapes continue to affect Kubernetes and derived products. User namespaces are one technology that can mitigate the risk. In this presentation I will explain the past, present and future of user namespace support in Kubernetes, and discuss how to run systemd-based containers in user namespaces. And why you would even want to try. There will be demos! Attendees will learn about what containers are, the technologies that underpin Linux containers, and how Kubernetes actually runs containers.

"systemd in a container? What! Why!?" We had our Reasons, and I'll even explain them. But more interesting than the "why" is the "how", and that's what this talk is about. Come and learn about the upcoming and in-development Kernel and Kubernetes security features that will enable better container isolation and secure deployment of systemd-based workloads.

This is a talk about what happened when a handful of complete container newbies tried to port their massive, complex, legacy application to Kubernetes. As a monolithic container. Based on systemd.

The runtime shunned our container and refused to execute it. Cloud engineers recoiled in horror at our architecture. With astounding hubris we ignored their admonitions and doubled down. If the container runtime won't run our application, well, we'll just modify the container runtime!

And so we did. Our journey took us into the darkest corners of container runtimes, Kubernetes and systemd. And we have emerged to tell you the tale. There will be demos.

Attendees will learn about the security technologies that underpin Linux containers, including namespaces and cgroups, as well as the behaviour of systemd in containers. I will also discuss the recent and planned changes in Kubernetes to provide official support for running containers in user namespaces.

Fraser Tweedale

February 04, 2023

Transcript

  1. Agenda
    Containers and container standards
    Kubernetes and OpenShift
    User namespaces and cgroups
    systemd-based workloads on Kubernetes/OpenShift
  2. What is a container?
    A process isolation and confinement abstraction
    Most commonly: OS-level virtualisation (shared kernel)
      e.g. FreeBSD jails, Solaris zones
    Container image defines filesystem contents
  3. Containers on Linux
    Some combination of the following security mechanisms:
      namespaces(7): pid, mount, network, cgroup, ...
      restricted capabilities(7) and/or seccomp(2) profile
      SELinux or AppArmor MAC policy
      cgroups(7) for resource limits
    Not necessarily all of these at the same time...
  4. Container standards
    Open Container Initiative (OCI) [1]
    Runtime Specification [2] - low-level runtime interface
      Linux, Solaris, Windows, VMs, ...
    Implementations [3]: runc [4] (reference implementation), crun [5], Kata Containers [6]
    [1] https://opencontainers.org
    [2] https://github.com/opencontainers/runtime-spec
    [3] https://github.com/opencontainers/runtime-spec/blob/main/implementations.md
    [4] https://github.com/opencontainers/runc
    [5] https://github.com/containers/crun
    [6] https://katacontainers.io/
  5. OCI Runtime Specification
    JSON configuration (example [7])
      mounts, process and environment, lifecycle hooks, ...
      Linux-specific: capabilities, namespaces, cgroup, sysctls, seccomp profile
    [7] https://github.com/opencontainers/runtime-spec/blob/main/config.md#configuration-schema-example
  6. OCI Runtime Specification
    {
      "process": {
        "user": { "uid": 0, "gid": 0 },
        "args": [ "/sbin/init" ],
        "env": [ ... ],
        ...
      },
      "root": { "path": "/home/ftweedal/scratch/fs", "readonly": false },
      "hostname": "runc",
      "mounts": [ ... ],
      "linux": {
        "namespaces": [
          { "type": "pid" },
          { "type": "ipc" },
          { "type": "uts" },
          { "type": "mount" },
          { "type": "cgroup" }
        ],
        "cgroupsPath": "user.slice:runc:sandbox"
      }
    }
  7. Kubernetes - container orchestration
    A container orchestration system
    Declarative configuration of container-based applications
    Integration with many cloud providers
    https://kubernetes.io/
    https://github.com/kubernetes/
  8. Kubernetes - terminology
    Container: isolated/confined process [tree]
    Pod: group (1+) of related Containers (e.g. HTTP app + database)
    Namespace: object and auth[nz] scope, such as for a team/project
    Node: a machine in the cluster; where Pods are executed
  9. Kubernetes - more terminology
    Kubelet [8]: agent that executes Pods on Nodes
    Sandbox: isolation/confinement mechanism(s); one per Pod
    Container Runtime Interface (CRI) [9]: interface used by Kubelet to create/start/stop/destroy Sandboxes and Containers
      CRI-O [10]
      containerd [11]
    [8] https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/
    [9] https://kubernetes.io/docs/concepts/architecture/cri/
    [10] https://cri-o.io/
    [11] https://containerd.io/
  10. Kubernetes - Pod definition
    apiVersion: v1
    kind: Pod
    metadata:
      name: fedora
      labels:
        app: fedora
    spec:
      containers:
      - name: fedora
        image: registry.fedoraproject.org/fedora:35-x86_64
        command: ["sleep", "3600"]
        env:
        - name: DEBUG
          value: "1"
  11. OpenShift [13]
    a.k.a. OpenShift Container Platform (OCP)
    An enterprise-ready Kubernetes container platform
    Commercially supported by Red Hat
    Community "upstream" distribution: OKD [12]
    Uses CRI-O and runc
    Latest stable release: 4.12
    [12] https://www.okd.io/
    [13] https://openshift.com/
  12. OpenShift runtime environment (default)
    Confinement: SELinux, namespaces (cgroup, pid, mount, uts, network)
    Each namespace gets assigned a unique uid range
    Containers run as a uid from that range
    Circumvent via RunAsUser and Security Context Constraints (SCCs)
      ... which is a bad idea
  13. User namespaces - why
    Increase workload isolation and confinement
    Run applications that require/assume specific uid(s)
  14. User namespaces - why
    CVE                description                                       CVSS [14]
    CVE-2019-5736      host runc binary overwritten from container       8.6
    CVE-2021-25741     host fs access via symlink exchange attack        8.1 / 8.6
    CVE-2021-30465     host fs access via symlink exchange attack        8.5
    CVE-2017-1002101   host fs access via subpath volume mount           9.6 / 8.8
    CVE-2016-8867      privesc due to excessive ambient capabilities     7.5
    CVE-2018-15664     host fs access via subpath volume mount           7.5
    [14] NIST or NIST / CNA Kubernetes
  15. User namespaces - how (Linux)
    user_namespaces(7)
    unshare(2)
    unshare(1)
      % id -u
      1000
      % unshare --user --map-root-user id -u
      0
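The mapping that `unshare --map-root-user` sets up can be inspected from inside the namespace by reading /proc/self/uid_map, whose lines hold three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range (see user_namespaces(7)). The following is a minimal Python sketch of parsing that format. The names `IdMapEntry` and `parse_id_map` are illustrative and not part of any tool mentioned in the talk.

```python
from typing import List, NamedTuple


class IdMapEntry(NamedTuple):
    """One line of /proc/<pid>/uid_map or gid_map (hypothetical helper type)."""
    inside_id: int   # first ID as seen inside the user namespace
    outside_id: int  # first ID as seen from the outer namespace
    length: int      # number of consecutive IDs covered


def parse_id_map(text: str) -> List[IdMapEntry]:
    """Parse the whitespace-separated, three-column id map format."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
        inside, outside, length = (int(f) for f in fields)
        entries.append(IdMapEntry(inside, outside, length))
    return entries


# For the slide's example (uid 1000 mapped to root), the map has one entry.
example = parse_id_map("         0       1000          1\n")
```

On a Linux host, `parse_id_map(open("/proc/self/uid_map").read())` would show the mapping of the current process.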
  16. User namespaces - how (OCI runtime)
    {
      ...
      "linux": {
        ...
        "namespaces": [
          { "type": "user" },
          ...
        ],
        "uidMappings": [
          { "containerID": 0, "hostID": 1000000, "size": 65536 }
        ],
        "gidMappings": [
          { "containerID": 0, "hostID": 1000000, "size": 65536 }
        ]
      }
    }
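To make the mapping semantics concrete, here is a small Python sketch that translates a container UID to the host UID using entries shaped like the `uidMappings` above. The function name `to_host_id` is an assumption made for illustration; an OCI runtime hands the mapping to the kernel rather than computing it this way.

```python
from typing import Dict, List, Optional

# The mapping from the slide: container IDs 0..65535 -> host IDs 1000000..1065535
UID_MAPPINGS = [{"containerID": 0, "hostID": 1000000, "size": 65536}]


def to_host_id(container_id: int, mappings: List[Dict[str, int]]) -> Optional[int]:
    """Return the host ID for container_id, or None if it is unmapped."""
    for m in mappings:
        offset = container_id - m["containerID"]
        if 0 <= offset < m["size"]:
            return m["hostID"] + offset
    return None
```

With this mapping, root in the container (UID 0) is host UID 1000000, an unprivileged account on the host, and any container UID of 65536 or above has no host equivalent.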
  17. User namespaces - how (OpenShift)
    Implemented in CRI-O
      OpenShift 4.7 (with non-default cluster config)
      OpenShift 4.10 (out of the box)
    Opt-in via Pod annotations
    Requires anyuid SCC (or equivalent) for admission
    Some workloads may still require non-default cluster configuration
  18. User namespaces - how (OpenShift)
    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
      labels:
        app: nginx
      annotations:
        io.openshift.builder: "true"
        io.kubernetes.cri-o.userns-mode: "auto:size=65536"
    spec:
      containers:
      - name: nginx
        image: quay.io/ftweedal/test-nginx:latest
        tty: true
  19. User namespaces - how (Kubernetes)
    KEP [15]-127 [16]
    Initial support delivered in k8s v1.25 (OpenShift 4.12)
      Alpha; feature gate: UserNamespacesStatelessPodsSupport
      Supported volume types: emptyDir, configmap, secret
    Pod must opt in by setting spec.hostUsers: false
    Fixed mapping size (65536); unique to Pod
    Support for more volume types deferred to later phase
    [15] Kubernetes Enhancement Proposal
    [16] https://github.com/kubernetes/enhancements/pull/3065
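As an illustration of the opt-in, the Python sketch below builds a Pod manifest with `spec.hostUsers` set to false and serialises it as JSON, which Kubernetes accepts as well as YAML. The Pod name `userns-demo` is hypothetical; the image and command are borrowed from the earlier Pod definition slide. It assumes a cluster with the feature gate above enabled.

```python
import json

# Hypothetical Pod that opts in to a user namespace via spec.hostUsers.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "userns-demo"},  # illustrative name
    "spec": {
        "hostUsers": False,  # request a user namespace for this Pod
        "containers": [
            {
                "name": "fedora",
                "image": "registry.fedoraproject.org/fedora:35-x86_64",
                "command": ["sleep", "3600"],
            }
        ],
    },
}

manifest_json = json.dumps(pod_manifest, indent=2)
```

The resulting JSON could be written to a file and submitted to the cluster in the usual way.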
  20. User namespaces - how (Kubernetes) - challenges
    shared / persistent volumes require ID-mapped mounts
    simple heuristics for ID range assignment → lower number of pods with unique user namespaces
    other mount point and file ownership issues (e.g. cgroupfs)
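The link between range assignment and pod count is simple arithmetic: if every pod receives its own fixed-size block of host IDs, the size of the host ID space caps the number of pods. The Python sketch below shows one such scheme. It is an assumption made for illustration, not the kubelet's actual allocation algorithm, and `first_host_id` and `id_space` are hypothetical parameters.

```python
RANGE_SIZE = 65536  # fixed mapping size per Pod, as stated on the previous slide


def host_range_for_slot(slot: int, first_host_id: int = RANGE_SIZE,
                        size: int = RANGE_SIZE) -> tuple:
    """Return the (first, last) host ID of the block assigned to a slot.

    Assumes blocks are laid out back to back, starting at first_host_id so
    that the lowest block stays reserved for the host (an assumption).
    """
    if slot < 0:
        raise ValueError("slot must be non-negative")
    start = first_host_id + slot * size
    return (start, start + size - 1)


def max_unique_ranges(id_space: int = 2**32, first_host_id: int = RANGE_SIZE,
                      size: int = RANGE_SIZE) -> int:
    """Upper bound on non-overlapping blocks; ignores any reserved IDs."""
    return (id_space - first_host_id) // size
```

Under these assumptions a 32-bit ID space yields at most 65535 blocks, and any heuristic that restricts the usable part of the ID space lowers that bound further.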
  21. cgroups
    OpenShift creates a unique cgroup [17] for each container
    cgroup namespace [18] makes it the "root" namespace inside the container
    cgroupfs mounts it at /sys/fs/cgroup
    systemd needs write access... but doesn't have it
    [17] cgroups(7)
    [18] cgroup_namespaces(7)
  22. cgroups - send in the chown(2)s
    Solution: modify runtime to chown the cgroup to the container process UID
    But first: extend OCI Runtime Spec with semantics for cgroup ownership [19]
    runc pull request [20]
      Merged; released in OpenShift 4.11
    [19] https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#cgroup-ownership
    [20] https://github.com/opencontainers/runc/pull/3057
  23. cgroups - OCI runtime semantics
    chown container's cgroup to host UID matching the process UID in container's user namespace, if and only if...
      cgroups v2 in use; and
      container has its own cgroup namespace; and
      cgroupfs is mounted read/write
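The three conditions and the UID translation can be written as a single decision function. The Python sketch below illustrates the rule as stated on the slide; it is not code from runc, and the function and parameter names are assumptions.

```python
from typing import Dict, List, Optional


def cgroup_owner_host_uid(process_uid: int,
                          uid_mappings: List[Dict[str, int]],
                          cgroup_v2: bool,
                          own_cgroup_ns: bool,
                          cgroupfs_rw: bool) -> Optional[int]:
    """Return the host UID that should own the container's cgroup.

    Returns None when the cgroup should not be chown'd: either one of the
    three conditions fails, or the process UID has no host mapping.
    """
    if not (cgroup_v2 and own_cgroup_ns and cgroupfs_rw):
        return None
    for m in uid_mappings:  # entries shaped like OCI "uidMappings"
        offset = process_uid - m["containerID"]
        if 0 <= offset < m["size"]:
            return m["hostID"] + offset
    return None
```

For a container whose init process runs as UID 0 under the mapping shown earlier (container 0 to host 1000000), the cgroup would be owned by host UID 1000000 when all three conditions hold.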
  24. cgroups - OCI runtime semantics
    Only the cgroup directory itself, and the files mentioned in /sys/kernel/cgroup/delegate, should be chown'd:
      cgroup.procs
      cgroup.threads
      cgroup.subtree_control
      memory.oom.group [21]
      memory.reclaim [21]
    [21] depends on kernel version
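A runtime following this rule has to turn the contents of /sys/kernel/cgroup/delegate into a list of paths to chown. The Python sketch below shows one way to compute that list. Skipping files that do not exist in the cgroup directory is an assumption made here (not every listed file is present in every cgroup), and `chown_targets` is a hypothetical name.

```python
import os
import posixpath
from typing import Callable, List


def chown_targets(cgroup_dir: str, delegate_text: str,
                  exists: Callable[[str], bool] = os.path.exists) -> List[str]:
    """List the paths to chown: the cgroup directory plus delegated files.

    delegate_text is the content of /sys/kernel/cgroup/delegate, one file
    name per line. Files absent from cgroup_dir are skipped (assumption).
    """
    targets = [cgroup_dir]
    for name in delegate_text.split():
        path = posixpath.join(cgroup_dir, name)
        if exists(path):
            targets.append(path)
    return targets
```

Each returned path would then be passed to `os.chown` with the host UID computed for the container process. The `exists` parameter is there only so the function can be exercised without a real cgroup filesystem.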
  25. cgroups - OpenShift
    cgroups v2 not the default (yet)
      but it works and is supported
    annotation required to activate the ownership semantics:
      io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw: "true"
  26. © 2023 Red Hat, Inc.
    Except where otherwise noted this work is licensed under http://creativecommons.org/licenses/by/4.0/
    Slides: speakerdeck.com/frasertweedale
    Blog: frasertweedale.github.io/blog-redhat
    Email: [email protected]
    Fediverse: @[email protected]