Send in the chown()s - systemd containers in user namespaces

Linux container escapes continue to affect Kubernetes and derived products. User namespaces are one technology that can mitigate the risk. In this presentation I will explain the past, present and future of user namespace support in Kubernetes, and discuss how to run systemd-based containers in user namespaces. And why you would even want to try. There will be demos! Attendees will learn about what containers are, the technologies that underpin Linux containers, and how Kubernetes actually runs containers.

"systemd in a container? What! Why!?" We had our Reasons, and I'll even explain them. But more interesting than the "why" is the "how", and that's what this talk is about. Come and learn about the upcoming and in-development Kernel and Kubernetes security features that will enable better container isolation and secure deployment of systemd-based workloads.

This is a talk about what happened when a handful of complete container newbies tried to port their massive, complex, legacy application to Kubernetes. As a monolithic container. Based on systemd.

The runtime shunned our container and refused to execute it. Cloud engineers recoiled in horror at our architecture. With astounding hubris we ignored their admonitions and doubled down. If the container runtime won't run our application, well, we'll just modify the container runtime!

And so we did. Our journey took us into the darkest corners of container runtimes, Kubernetes and systemd. And we have emerged to tell you the tale. There will be demos.

Attendees will learn about the security technologies that underpin Linux containers, including namespaces and cgroups, as well as the behaviour of systemd in containers. I will also discuss the recent and planned changes in Kubernetes to provide official support for running containers in user namespaces.

Fraser Tweedale

February 04, 2023

Transcript

  1. send in the chown(2)s
    systemd containers in user namespaces
    Fraser Tweedale
    @[email protected]
    February 4, 2023

  2. Agenda
    Containers and container standards
    Kubernetes and OpenShift
    User namespaces and cgroups
    systemd-based workloads on Kubernetes/OpenShift

  3. What is a container?
    A process isolation and confinement abstraction
    Most commonly: OS-level virtualisation (shared kernel)
    e.g. FreeBSD jails, Solaris zones
    Container image defines filesystem contents

  4. Containers on Linux
    Some combination of the following security mechanisms:
    namespaces(7): pid, mount, network, cgroup, . . .
    restricted capabilities(7) and/or seccomp(2) profile
    SELinux or AppArmor MAC policy
    cgroups(7) for resource limits
    Not necessarily all of these at the same time. . .
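
A minimal sketch of how to see which namespaces a process belongs to, assuming a Linux /proc layout where each entry under /proc/<pid>/ns is a symlink such as "pid:[4026531836]". The function name and the proc_root parameter are hypothetical, chosen for illustration only.

```python
import os


def list_namespaces(pid="self", proc_root="/proc"):
    """Return {namespace_name: identifier} read from <proc_root>/<pid>/ns.

    proc_root is a hypothetical parameter so the function can be pointed
    at a directory other than the real /proc.
    """
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        # Each entry is a symlink; its target names the namespace type
        # and an inode number that identifies the namespace instance.
        result[name] = os.readlink(os.path.join(ns_dir, name))
    return result
```

Two processes share a namespace when the identifiers for that namespace name are equal, so comparing list_namespaces(1) with list_namespaces("self") shows which namespaces a container process has of its own.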

  5. Container standards
    Open Container Initiative (OCI) [1]
    Runtime Specification [2] - low level runtime interface
    Linux, Solaris, Windows, VMs, . . .
    Implementations [3]: runc [4] (reference implementation), crun [5],
    Kata Containers [6]
    [1] https://opencontainers.org
    [2] https://github.com/opencontainers/runtime-spec
    [3] https://github.com/opencontainers/runtime-spec/blob/main/implementations.md
    [4] https://github.com/opencontainers/runc
    [5] https://github.com/containers/crun
    [6] https://katacontainers.io/

  6. OCI Runtime Specification
    JSON configuration (example [7])
    mounts, process and environment, lifecycle hooks, . . .
    Linux-specific: capabilities, namespaces, cgroup, sysctls, seccomp profile
    [7] https://github.com/opencontainers/runtime-spec/blob/main/config.md#configuration-schema-example

  7. OCI Runtime Specification
    {
      "process": {
        "user": { "uid": 0, "gid": 0 },
        "args": [ "/sbin/init" ],
        "env": [ ... ],
        ...
      },
      "root": { "path": "/home/ftweedal/scratch/fs", "readonly": false },
      "hostname": "runc",
      "mounts": [ ... ],
      "linux": {
        "namespaces": [
          { "type": "pid" }, { "type": "ipc" }, { "type": "uts" },
          { "type": "mount" }, { "type": "cgroup" }
        ],
        "cgroupsPath": "user.slice:runc:sandbox"
      }
    }
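
As a rough illustration, a fragment like the one above can be generated rather than written by hand. This sketch covers only the fields shown on the slide; it is not a complete config.json (a real one needs further fields, such as ociVersion), and the function name and parameters are hypothetical.

```python
import json


def oci_config_fragment(rootfs, args, namespace_types, cgroups_path,
                        hostname="runc"):
    """Build a partial OCI runtime config mirroring the slide's fields."""
    return {
        "process": {
            "user": {"uid": 0, "gid": 0},
            "args": list(args),
        },
        "root": {"path": rootfs, "readonly": False},
        "hostname": hostname,
        "linux": {
            # One {"type": ...} object per namespace the runtime should create.
            "namespaces": [{"type": t} for t in namespace_types],
            "cgroupsPath": cgroups_path,
        },
    }


fragment = oci_config_fragment(
    rootfs="/home/ftweedal/scratch/fs",
    args=["/sbin/init"],
    namespace_types=["pid", "ipc", "uts", "mount", "cgroup"],
    cgroups_path="user.slice:runc:sandbox",
)
text = json.dumps(fragment, indent=2)
```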

  8. Kubernetes and OpenShift

  9. Kubernetes - container orchestration
    A container orchestration system
    Declarative configuration of container-based applications
    Integration with many cloud providers
    https://kubernetes.io/
    https://github.com/kubernetes/

  10. Kubernetes - terminology
    Container: isolated/confined process [tree]
    Pod: group (1+) of related Containers (e.g. HTTP app + database)
    Namespace: object and auth[nz] scope, such as for a team/project
    Node: a machine in the cluster; where Pods are executed

  11. Kubernetes - more terminology
    Kubelet [8]: agent that executes Pods on Nodes
    Sandbox: isolation/confinement mechanism(s); one per Pod
    Container Runtime Interface (CRI) [9]: interface used by Kubelet to
    create/start/stop/destroy Sandboxes and Containers
    CRI-O [10]
    containerd [11]
    [8] https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/
    [9] https://kubernetes.io/docs/concepts/architecture/cri/
    [10] https://cri-o.io/
    [11] https://containerd.io/

  12. Kubernetes - Container Runtime Interface

  13. Kubernetes - Container Runtime Interface - CRI-O

  14. Kubernetes - Container Runtime Interface - CRI-O + runc

  15. Kubernetes - Pod definition
    apiVersion: v1
    kind: Pod
    metadata:
      name: fedora
      labels:
        app: fedora
    spec:
      containers:
      - name: fedora
        image: registry.fedoraproject.org/fedora:35-x86_64
        command: ["sleep", "3600"]
        env:
        - name: DEBUG
          value: "1"

  16. OpenShift [13]
    a.k.a. OpenShift Container Platform (OCP)
    An enterprise-ready Kubernetes container platform
    Commercially supported by Red Hat
    Community “upstream” distribution: OKD [12]
    Uses CRI-O and runc
    Latest stable release: 4.12
    [12] https://www.okd.io/
    [13] https://openshift.com/

  17. OpenShift runtime environment (default)
    Confinement: SELinux, namespaces (cgroup, pid, mount, uts, network)
    Each namespace gets assigned a unique uid range
    Containers run as a uid from that range
    Circumvent via RunAsUser and Security Context Constraints (SCCs)
    . . . which is a bad idea

  18. User namespaces

  19. User namespaces - why
    Increase workload isolation and confinement
    Run applications that require/assume specific uid(s)

  20. User namespaces - why
    CVE                 description                                      CVSS [14]
    CVE-2019-5736       host runc binary overwritten from container      8.6
    CVE-2021-25741      host fs access via symlink exchange attack       8.1 / 8.6
    CVE-2021-30465      host fs access via symlink exchange attack       8.5
    CVE-2017-1002101    host fs access via subpath volume mount          9.6 / 8.8
    CVE-2016-8867       privesc due to excessive ambient capabilities    7.5
    CVE-2018-15664      host fs access via subpath volume mount          7.5
    [14] NIST or NIST / CNA Kubernetes

  21. User namespaces - what

  22. User namespaces - how (Linux)
    user_namespaces(7)
    unshare(2)
    unshare(1)
    % id -u
    1000
    % unshare --user --map-root-user id -u
    0
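
The mapping behind that result is visible in /proc/<pid>/uid_map, where each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range (see user_namespaces(7)). A minimal parsing sketch follows; the function name is hypothetical.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(f) for f in fields)
        entries.append((inside, outside, length))
    return entries


# What the session above would be expected to produce for uid 1000 with
# --map-root-user: uid 0 inside maps to uid 1000 outside, range length 1.
example = parse_id_map("         0       1000          1\n")
```

Inside the namespace, reading /proc/self/uid_map and passing its content to parse_id_map gives the same kind of list for the live process.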

  23. User namespaces - how (OCI runtime)
    {
      ...
      "linux": {
        ...
        "namespaces": [
          { "type": "user" },
          ...
        ],
        "uidMappings": [
          { "containerID": 0, "hostID": 1000000, "size": 65536 }
        ],
        "gidMappings": [
          { "containerID": 0, "hostID": 1000000, "size": 65536 }
        ]
      }
    }
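
To make the mapping arithmetic concrete, here is a small sketch that translates IDs in both directions using mappings in the shape shown above. The helper names are hypothetical; only the containerID, hostID and size keys come from the configuration.

```python
def to_host_id(mappings, container_id):
    """Translate an ID inside the user namespace to the host ID, or None."""
    for m in mappings:
        offset = container_id - m["containerID"]
        if 0 <= offset < m["size"]:
            return m["hostID"] + offset
    return None  # ID is not mapped in this namespace


def to_container_id(mappings, host_id):
    """Translate a host ID to the ID seen inside the namespace, or None."""
    for m in mappings:
        offset = host_id - m["hostID"]
        if 0 <= offset < m["size"]:
            return m["containerID"] + offset
    return None  # host ID falls outside every mapped range


uid_mappings = [{"containerID": 0, "hostID": 1000000, "size": 65536}]
```

With this mapping, root in the container (uid 0) is host uid 1000000, an unprivileged ID on the host, and host uid 0 has no counterpart inside the container at all.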

  24. User namespaces - how (OpenShift)
    Implemented in CRI-O
    OpenShift 4.7 (with non-default cluster config)
    OpenShift 4.10 (out of the box)
    Opt-in via Pod annotations
    Requires anyuid SCC (or equivalent) for admission
    Some workloads may still require non-default cluster configuration

  25. User namespaces - how (OpenShift)
    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
      labels:
        app: nginx
      annotations:
        io.openshift.builder: "true"
        io.kubernetes.cri-o.userns-mode: "auto:size=65536"
    spec:
      containers:
      - name: nginx
        image: quay.io/ftweedal/test-nginx:latest
        tty: true

  26. User namespaces - how (Kubernetes)
    KEP-127 [15] [16]
    Initial support delivered in k8s v1.25 (OpenShift 4.12)
    Alpha; feature gate: UserNamespacesStatelessPodsSupport
    Supported volume types: emptyDir, configmap, secret
    Pod must opt in by setting spec.hostUsers: false
    Fixed mapping size (65536); unique to Pod
    Support for more volume types deferred to later phase
    [15] Kubernetes Enhancement Proposal
    [16] https://github.com/kubernetes/enhancements/pull/3065
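
A sketch of what the opt-in looks like in a Pod definition, built here as a Python dict and serialised to JSON (Kubernetes accepts JSON manifests as well as YAML). The helper name is hypothetical, and the pod name and image are reused from the earlier Pod example purely for illustration.

```python
import json


def userns_pod(name, image, command=None):
    """Build a Pod manifest that opts in to a user namespace."""
    container = {"name": name, "image": image}
    if command is not None:
        container["command"] = list(command)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            # The opt-in field; it has an effect only when the
            # feature gate is enabled on the cluster.
            "hostUsers": False,
            "containers": [container],
        },
    }


manifest = userns_pod(
    "fedora",
    "registry.fedoraproject.org/fedora:35-x86_64",
    command=["sleep", "3600"],
)
manifest_json = json.dumps(manifest, indent=2)
```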

  27. User namespaces - how (Kubernetes) - challenges
    shared / persistent volumes require ID-mapped mounts
    simple heuristics for ID range assignment → lower number of pods with
    unique user namespaces
    other mount point and file ownership issues (e.g. cgroupfs)
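
Some back-of-the-envelope arithmetic for the ID range point above. This is not the heuristic Kubernetes uses; it only shows the upper bound on disjoint 65536-ID ranges in a 32-bit ID space, under the assumption (mine, for illustration) that IDs 0 to 65535 stay reserved for the host. All names are hypothetical.

```python
# IDs are 32-bit on Linux, and the all-ones value (uid_t)-1 is not a
# valid ID, so the usable IDs are 0 .. 2**32 - 2.
USABLE_IDS = 2**32 - 1


def max_disjoint_ranges(range_size=65536, first_free_id=65536):
    """Upper bound on non-overlapping ranges above first_free_id."""
    return (USABLE_IDS - first_free_id) // range_size


def range_for_index(index, range_size=65536, first_free_id=65536):
    """Return (start, size) of the index-th range in a naive linear layout."""
    start = first_free_id + index * range_size
    if index < 0 or start + range_size > USABLE_IDS:
        raise ValueError("no such range in the ID space")
    return (start, range_size)
```

Any simpler scheme that leaves gaps or reserves larger blocks per node ends up well below this bound, which is the trade-off the slide refers to.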

  28. cgroups
    OpenShift creates a unique cgroup [17] for each container
    cgroup namespace [18] makes it the “root” namespace inside the container
    cgroupfs mounts it at /sys/fs/cgroup
    systemd needs write access. . . but doesn’t have it
    [17] cgroups(7)
    [18] cgroup_namespaces(7)

  29. cgroups - send in the chown(2)s
    Solution: modify runtime to chown the cgroup to the container process UID
    But first: extend OCI Runtime Spec with semantics for cgroup ownership [19]
    runc pull request [20]
    Merged; released in OpenShift 4.11
    [19] https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#cgroup-ownership
    [20] https://github.com/opencontainers/runc/pull/3057

  30. cgroups - OCI runtime semantics
    chown container’s cgroup to host UID matching the process UID in container’s
    user namespace, if and only if. . .
    cgroups v2 in use; and
    container has its own cgroup namespace; and
    cgroupfs is mounted read/write
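
The rule above can be sketched as a single decision function. This is an illustration of the semantics as stated on the slide, not runc's code; the function and parameter names are hypothetical.

```python
def cgroup_owner(process_uid, uid_mappings, *, cgroup_v2,
                 own_cgroup_namespace, cgroupfs_read_write):
    """Return the host UID that should own the cgroup, or None for no chown."""
    # All three conditions must hold, otherwise ownership is left alone.
    if not (cgroup_v2 and own_cgroup_namespace and cgroupfs_read_write):
        return None
    # Translate the container process UID to the matching host UID.
    for m in uid_mappings:
        offset = process_uid - m["containerID"]
        if 0 <= offset < m["size"]:
            return m["hostID"] + offset
    return None  # process UID is not mapped


mappings = [{"containerID": 0, "hostID": 1000000, "size": 65536}]
owner = cgroup_owner(0, mappings, cgroup_v2=True,
                     own_cgroup_namespace=True, cgroupfs_read_write=True)
```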

  31. cgroups - OCI runtime semantics
    Only the cgroup directory itself, and the files mentioned in
    /sys/kernel/cgroup/delegate, should be chown’d:
    cgroup.procs
    cgroup.threads
    cgroup.subtree_control
    memory.oom.group [21]
    memory.reclaim [21]
    [21] depends on kernel version
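
A rough sketch of the chown step itself, assuming the delegate file lists one file name per line. It is not the runc implementation. The fallback list used when the delegate file cannot be read is an assumption on my part, as are the function names and the delegate_path parameter.

```python
import os

# Assumption: a conservative fallback if the delegate file is unreadable.
FALLBACK_FILES = ["cgroup.procs", "cgroup.subtree_control", "cgroup.threads"]


def delegated_files(delegate_path="/sys/kernel/cgroup/delegate"):
    """Return the file names the kernel lists as safe to delegate."""
    try:
        with open(delegate_path) as f:
            return f.read().split()
    except OSError:
        return list(FALLBACK_FILES)


def chown_cgroup(cgroup_dir, uid, delegate_path="/sys/kernel/cgroup/delegate"):
    """Chown the cgroup directory and its delegated files; return the paths."""
    os.chown(cgroup_dir, uid, -1)  # gid -1 leaves the group unchanged
    changed = [cgroup_dir]
    for name in delegated_files(delegate_path):
        path = os.path.join(cgroup_dir, name)
        # Some listed files exist only on newer kernels, so skip missing ones.
        if os.path.exists(path):
            os.chown(path, uid, -1)
            changed.append(path)
    return changed
```

Every other file in the directory keeps its original owner, so resource limits set by the runtime stay out of the container's reach.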

  32. cgroups - OpenShift
    cgroups v2 not the default (yet)
    but it works and is supported
    annotation required to activate the ownership semantics:
    io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw: "true"

  33. Demo

  34. Links / resources
    My blog: https://frasertweedale.github.io/blog-redhat/tags/containers.html
    Demo: https://www.youtube.com/watch?v=OGAVvIJwmd0
    KEP-127:
    https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/127-user-namespaces
    OCI Runtime Specification:
    https://github.com/opencontainers/runtime-spec

  35. © 2023 Red Hat, Inc.
    Except where otherwise noted this work is licensed under
    http://creativecommons.org/licenses/by/4.0/
    Slides speakerdeck.com/frasertweedale
    Blog frasertweedale.github.io/blog-redhat
    Email [email protected]
    Fediverse @[email protected]
