Linux Container Primitives: cgroups, namespaces, and more! (LinuxFest Northwest 2020)

In this session, we’ll explore the Linux primitives that underlie container runtimes like Docker, including cgroups, namespaces, and union filesystems. We’ll see how Docker uses these primitives and how the OCI standard makes it possible to customize how your containers run. We’ll also discuss alternative container runtimes like CRI-O, rkt, and systemd-nspawn, and what makes them different. This will be an interactive session with a live demo and open questions.

This session is a repeat of the session from last year.

Samuel Karp

May 08, 2020

Transcript

  1. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Samuel Karp, Amazon Web Services – @samuelkarp
    LinuxFest Northwest 2020 – Online
    Linux Container Primitives
    cgroups, namespaces, and more!

  2. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    LinuxFest Northwest – Online!
    Q&A at https://discuss.lfnw.org
    Also available on Twitter – @samuelkarp

  3.
    Agenda
    Container primitives overview
    Control groups (cgroups)
    Namespaces
    Union filesystems
    Runtimes

  4.
    Agenda
    Container primitives overview
    Control groups (cgroups)
    Namespaces
    Union filesystems
    Capabilities
    Runtimes

  5.
    Linux container primitives

  6.
    Containers are an abstraction over several different Linux technologies

  7.
    [Diagram: layered stack showing the Linux kernel, its primitives (namespaces, control groups, union filesystem), the container runtime, and Containers 1 through 6]

  8.
    Control groups

  9.
    What do control groups (cgroups) do?
    • Organize all processes in the system
    • Account for resource usage and gather utilization data
    • Limit or prioritize resource utilization

  10.
    Subsystems
    • Control group system is an abstract framework
    • Subsystems are concrete implementations
    • Different subsystems can organize processes separately
    • Most subsystems are resource controllers
    Examples of subsystems:
    • Memory
    • CPU time
    • Block I/O
    • Number of discrete processes (pids)
    • CPU & memory pinning
    • Freezer (used by docker pause)
    • Devices
    • Network priority

  11.
    Hierarchical representation
    • Independent subsystem hierarchies
    • Every pid is represented exactly once in each subsystem
    • New processes inherit cgroups from their parents
    ├── blkio
    │   └── docker
    │       └── b211c37
    ├── cpu,cpuacct
    │   └── docker
    │       └── b211c37
    ├── cpuset
    │   └── docker
    │       └── b211c37
    ├── devices
    │   └── docker
    │       └── b211c37
    ├── freezer
    │   └── docker
    │       └── b211c37
    ├── hugetlb
    │   └── docker
    │       └── b211c37
    ├── memory
    │   └── docker
    │       └── b211c37
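
One way to see this per-subsystem membership for a single process is to read /proc/<pid>/cgroup. The sketch below is a minimal parser for the cgroup v1 line format "hierarchy-ID:controller-list:cgroup-path"; the function name and the sample text are illustrative assumptions modeled on the tree above, not part of the talk.

```python
from typing import Dict


def parse_proc_cgroup(text: str) -> Dict[str, str]:
    """Map each cgroup v1 controller name to the cgroup path of a process.

    Each line of /proc/<pid>/cgroup has the form
    "hierarchy-ID:controller-list:cgroup-path".
    """
    membership: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        _hierarchy_id, controllers, path = line.split(":", 2)
        # Co-mounted controllers such as "cpu,cpuacct" share one hierarchy,
        # so every controller in the list gets the same path.
        for controller in controllers.split(","):
            membership[controller] = path
    return membership


# Hypothetical sample modeled on the tree above (not captured from a real host).
SAMPLE = """\
4:memory:/docker/b211c37
3:cpu,cpuacct:/docker/b211c37
2:freezer:/docker/b211c37
"""
```

On a live system the same function could be fed the contents of /proc/self/cgroup; each controller appears exactly once, matching the "represented exactly once" rule.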

  12.
    cgroup virtual filesystem
    • Typically mounted at /sys/fs/cgroup
    • tasks virtual file holds all pids in the cgroup
    • Other files have settings and utilization data
    ├── cgroup.clone_children
    ├── cgroup.procs
    ├── cgroup.sane_behavior
    ├── cpuacct.stat
    ├── cpuacct.usage
    ├── cpuacct.usage_all
    ├── cpuacct.usage_percpu
    ├── cpuacct.usage_percpu_sys
    ├── cpuacct.usage_percpu_user
    ├── cpuacct.usage_sys
    ├── cpuacct.usage_user
    ├── cpu.cfs_period_us
    ├── cpu.cfs_quota_us
    ├── cpu.rt_period_us
    ├── cpu.rt_runtime_us
    ├── cpu.shares
    ├── cpu.stat
    ├── notify_on_release
    ├── release_agent
    └── tasks
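
As an example of reading settings from these files, the sketch below turns cpu.cfs_quota_us and cpu.cfs_period_us into a CPU limit expressed in cores. It assumes the cgroup v1 convention that a quota of -1 means "no limit"; the helper names are illustrative.

```python
from typing import Optional


def cpu_limit_in_cores(cfs_quota_us: int, cfs_period_us: int) -> Optional[float]:
    """Return the CFS bandwidth limit in CPU cores, or None if unlimited.

    Assumes the cgroup v1 convention that cpu.cfs_quota_us is -1 when no
    quota is set.
    """
    if cfs_quota_us < 0:
        return None
    if cfs_period_us <= 0:
        raise ValueError("cfs_period_us must be positive")
    return cfs_quota_us / cfs_period_us


def read_int(path: str) -> int:
    """Read a single integer from a cgroup virtual file."""
    with open(path) as f:
        return int(f.read().strip())
```

For instance, a quota of 50000 with a period of 100000 corresponds to half a core. On a host, the two values could be obtained with read_int() from the cpu.cfs_quota_us and cpu.cfs_period_us files in the container's cgroup directory.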

  13.
    Demo

  14.
    What can you use cgroups for?
    • cgroups can be used independently of containers
    • cgroups control resource limits for processes
    • Monitor processes and organize them
    • Be careful not to break any assumptions your container runtime or orchestrator might have
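
A rough sketch of using a cgroup without any container runtime: create a directory in the memory hierarchy, write a limit, and move a process in by writing its pid to tasks. The function name and the "lfnw-demo" cgroup name are hypothetical, the file names assume cgroup v1, and running it against the real /sys/fs/cgroup/memory requires root. The demo at the bottom only exercises the file handling in a temporary directory.

```python
import os
import tempfile


def add_process_to_cgroup(hierarchy_root: str, name: str, pid: int,
                          memory_limit_bytes: int) -> str:
    """Create a memory cgroup, set its limit, and move a process into it."""
    cgroup_dir = os.path.join(hierarchy_root, name)
    # On a real cgroup filesystem, mkdir creates the cgroup and the kernel
    # populates its control files automatically.
    os.makedirs(cgroup_dir, exist_ok=True)
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(memory_limit_bytes))
    # Writing a pid to "tasks" moves that task into this cgroup.
    with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
        f.write(str(pid))
    return cgroup_dir


if __name__ == "__main__":
    # Dry run against a scratch directory instead of /sys/fs/cgroup/memory.
    scratch = tempfile.mkdtemp()
    print(add_process_to_cgroup(scratch, "lfnw-demo", os.getpid(), 64 * 1024 * 1024))
```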

  15.
    Further reading
    • Linux: Documentation/cgroup-v1

  16.
    Namespaces

  17.
    What do namespaces do?
    • Isolation mechanism for resources
    • Changes to resources within a namespace can be invisible outside the namespace
    • Resource mapping with permission changes

  18.
    What namespaces are available?
    • Network
    • Filesystem (mounts)
    • Processes (pid)
    • Inter-process communication (ipc)
    • Hostname and domain name (uts)
    • User and group IDs
    • cgroup

  19.
    Namespace sharing
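    [Diagram: Processes A through D mapped onto pid namespaces (pid:[1], pid:[2], pid:[3]), network namespaces (net:[4], net:[5], net:[6]), and mount namespaces (mount:[7], mount:[8]); some namespaces are shared by more than one process]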

  20.
    Network namespace
    • Frequently used in containers
    • veth devices can connect different namespaces
    • docker run uses a separate network namespace per container
    • Multiple containers can share a network namespace
        • Kubernetes pods
        • Amazon ECS tasks with the awsvpc networking mode

  21.
    Mount namespace
    • Used for giving containers their own filesystem
    • Container image is mounted as the root filesystem
    • “Volumes” to share data between containers or the host
    • More about filesystems to come!
    bash-4.2# mount
    overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/Q5EBZ7CIJYELLG2MBKZIRRFWW6:/var/lib/docker/overlay2/l/PKATP76T57BQZ5D44JXYFIB26E,upperdir=/var/lib/docker/overlay2/88816f9510a9ff38b31eaaceccbef6ffc9cc3c06bcc451f9684850db5ee1b152/diff,workdir=/var/lib/docker/overlay2/88816f9510a9ff38b31eaaceccbef6ffc9cc3c06bcc451f9684850db5ee1b152/work)
    proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
    tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755)
    devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
    sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
    tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,relatime,mode=755)

  22.
    procfs virtual filesystem
    • Namespaces are visible in /proc
    • Files are symbolic links to the namespace
    • The link contains the namespace type and inode number to identify the namespace
    $ readlink /proc/$$/ns/*
    cgroup:[4026531835]
    ipc:[4026531839]
    mnt:[4026531840]
    net:[4026531993]
    pid:[4026531836]
    user:[4026531837]
    uts:[4026531838]
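
Because each link target encodes the namespace type and inode number, two processes share a namespace exactly when their link targets match. A small sketch of parsing and comparing those targets follows; the function names are illustrative, and the strings would come from os.readlink("/proc/<pid>/ns/<type>") on a real host.

```python
import re
from typing import Tuple

# Link targets look like "net:[4026531993]".
_NS_LINK = re.compile(r"^(?P<type>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> Tuple[str, int]:
    """Split a /proc/<pid>/ns/* link target into (type, inode number)."""
    match = _NS_LINK.match(target.strip())
    if match is None:
        raise ValueError(f"not a namespace link target: {target!r}")
    return match.group("type"), int(match.group("inode"))


def same_namespace(link_a: str, link_b: str) -> bool:
    """Two processes share a namespace when type and inode both match."""
    return parse_ns_link(link_a) == parse_ns_link(link_b)
```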

  23.
    Creating namespaces
    • clone(2) and unshare(2)
    • CLONE_NEW* flags to specify which namespaces
    • clone(2) is for new processes to create new namespaces
    • unshare(2) is for existing processes to create new namespaces
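
The CLONE_NEW* flags are plain bit values that are ORed together to select which namespaces to create. The sketch below builds such a mask from namespace names; the constant values are taken from the Linux sched.h header as I understand it, and the dictionary keys (matching the /proc/<pid>/ns file names) and the function name are illustrative choices.

```python
from typing import Iterable

# Flag values as defined in <linux/sched.h>; keys follow /proc/<pid>/ns names.
CLONE_FLAGS = {
    "mnt": 0x00020000,     # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "user": 0x10000000,    # CLONE_NEWUSER
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
}


def unshare_flags(namespaces: Iterable[str]) -> int:
    """OR together the CLONE_NEW* flags for the requested namespace types."""
    flags = 0
    for name in namespaces:
        flags |= CLONE_FLAGS[name]
    return flags
```

The resulting integer is the kind of value that would be passed as the flags argument of unshare(2) or clone(2).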

  24.
    Persisting namespaces
    • The kernel automatically garbage-collects namespaces by reference-counting
    • New namespace remains open as long as
        • a process runs or
        • a mount is open
    • Bind-mount a file in /proc/$$/ns to another place on the filesystem
    $ mount \
        --bind /proc/$$/ns/net \
        /var/run/netns/lfnw

  25.
    Entering namespaces
    • Open a file from /proc/$$/ns (or a bind-mount)
    • Pass to setns(2) to enter the existing namespace
    • Namespace remains open as long as the process is running, even if the original file goes away
    • nsenter(1) is a command for doing this interactively
    • ip-netns(8) works specifically for network namespaces

  26.
    Demo

  27.
    How can you leverage this?
    • Use nsenter or ip netns to troubleshoot container networking
    • Monitor containers by entering the pid namespace
    • Access binaries in your containers with the mount namespace
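
For the troubleshooting case, a script can assemble an nsenter(1) command line that targets a container's main process. The sketch below only builds the argument list, using nsenter's long options; the function name, the pid 1234, and the "ip addr" command are hypothetical examples.

```python
from typing import List, Sequence

# Long options of nsenter(1), keyed by the /proc/<pid>/ns file name.
_NSENTER_FLAGS = {
    "mnt": "--mount",
    "uts": "--uts",
    "ipc": "--ipc",
    "net": "--net",
    "pid": "--pid",
    "user": "--user",
    "cgroup": "--cgroup",
}


def nsenter_argv(target_pid: int, namespaces: Sequence[str],
                 command: Sequence[str]) -> List[str]:
    """Build an nsenter command that runs `command` in a process's namespaces."""
    argv = ["nsenter", "--target", str(target_pid)]
    for ns in namespaces:
        argv.append(_NSENTER_FLAGS[ns])
    argv.append("--")  # everything after this is the command to run
    argv.extend(command)
    return argv
```

The list could then be handed to subprocess.run(); entering only the network namespace keeps the host's own debugging binaries available while inspecting the container's interfaces.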

  28.
    Further reading
    • man 7 namespaces
    • man 7 pid_namespaces
    • man 7 user_namespaces

  29.
    Images, layers, and union filesystems

  30.
    Filesystem images
    • Images are representations of a filesystem
    • Images are popular for virtualization and container systems
    • Docker helped popularize the concept of layers

  31.
    How Docker layers work
    • A copy-on-write view of your files
    • New files exist only in the top layer
    • When a file is modified, it is copied up to the top layer
    • Unmodified files remain in whatever layer they were added or last modified
    • Deleted files are hidden, but still exist in the lower layers
    [Diagram: stack of layers, from top to bottom: Top layer (read-write), Intermediate layer (read-only), Base layer (read-only)]
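
The layer rules above can be modeled with a few dictionaries. This is a toy in-memory model, not how Docker stores layers: each layer maps a path to its content, lookups search from the top layer down, and a deletion is recorded as a marker in the top layer. All names are illustrative.

```python
from typing import Dict, List, Optional

# Marker stored in a layer to record that a path was deleted there.
WHITEOUT = object()

Layer = Dict[str, object]


def lookup(layers: List[Layer], path: str) -> Optional[object]:
    """Resolve a path through layers ordered bottom to top."""
    for layer in reversed(layers):
        if path in layer:
            content = layer[path]
            # A whiteout hides the file even if a lower layer still has it.
            return None if content is WHITEOUT else content
    return None


def write(layers: List[Layer], path: str, content: object) -> None:
    """New and modified files land only in the top (read-write) layer."""
    layers[-1][path] = content


def delete(layers: List[Layer], path: str) -> None:
    """Deleting hides the file; lower layers are never modified."""
    layers[-1][path] = WHITEOUT
```

A real union filesystem copies the whole file up before modifying it; the model skips that step because it replaces content wholesale.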

  32.
    Union filesystems
    • Unified view of two (or more) filesystems
    • Popular in container runtimes (like Docker) to implement layers
    • Efficient use of storage when making minor modifications to images
    • Efficient use of storage when starting multiple containers with identical images

  33.
    Overlay filesystem
    • Joins two directories (upper and lower) to form a union
    • Uses file name to describe the files
    • When writing to the overlay
        • lowerdir is not modified, all changes go to upperdir
        • Existing files are copied up to the upperdir for modification
        • Whole file is copied, not just blocks
    • “Deleting” a file in the upperdir creates a whiteout
        • Files: character devices with 0/0 device number
        • Directories: xattr “trusted.overlay.opaque” set to “y”
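
Since a file whiteout is a character device with device number 0/0, whiteouts can be recognized from stat data alone. The sketch below assumes Linux, where a 0/0 device number is encoded as st_rdev == 0; the function names are illustrative, and walking a real upperdir usually requires root.

```python
import os
import stat
from typing import Iterator


def is_whiteout(st_mode: int, st_rdev: int) -> bool:
    """True for a character device whose device number is 0/0.

    Assumes Linux, where major 0 / minor 0 is encoded as st_rdev == 0.
    """
    return stat.S_ISCHR(st_mode) and st_rdev == 0


def find_whiteouts(upperdir: str) -> Iterator[str]:
    """Yield the paths under an overlay upperdir that are file whiteouts."""
    for dirpath, _dirnames, filenames in os.walk(upperdir):
        # os.walk lists every non-directory entry, including device nodes,
        # in `filenames`.
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            st = os.lstat(full_path)
            if is_whiteout(st.st_mode, st.st_rdev):
                yield full_path
```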

  34.
    Overlay filesystem (continued)
    • An upperdir can have multiple lowerdirs
    • Overlay filesystems can be created with mount(2)
    • You can examine the mounts with
        • mount(8)
        • /proc/mounts
        • /proc/$$/mountinfo
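
To examine overlay mounts programmatically, the options column of /proc/mounts can be split into lowerdir, upperdir, and workdir. A minimal sketch follows, assuming paths without commas or spaces (the kernel escapes those, which this parser ignores); the function names and sample paths are illustrative.

```python
from typing import Dict, List, Union

OverlayInfo = Dict[str, Union[str, List[str]]]


def parse_overlay_options(options: str) -> OverlayInfo:
    """Extract lowerdir (a list), upperdir, and workdir from mount options."""
    parsed: OverlayInfo = {}
    for option in options.split(","):
        key, _sep, value = option.partition("=")
        if key == "lowerdir":
            # Multiple lower directories are separated by colons.
            parsed[key] = value.split(":")
        elif key in ("upperdir", "workdir"):
            parsed[key] = value
    return parsed


def overlay_mounts(proc_mounts_text: str) -> List[OverlayInfo]:
    """Return one entry per overlay mount found in /proc/mounts content."""
    mounts: List[OverlayInfo] = []
    for line in proc_mounts_text.splitlines():
        # Fields: device, mount point, filesystem type, options, dump, pass.
        fields = line.split()
        if len(fields) < 4 or fields[2] != "overlay":
            continue
        entry: OverlayInfo = {"mountpoint": fields[1]}
        entry.update(parse_overlay_options(fields[3]))
        mounts.append(entry)
    return mounts
```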

  35.
    Docker’s overlay driver
    • Docker’s default layer storage uses the overlay filesystem
    • upperdir, lowerdir, and diff directories are in /var/lib/docker/overlay2

  36.
    Demo

  37.
    How can you leverage this?
    • Locate files in your layers
    • Examine which files and layers contribute to your disk usage
    • Understand the impact of writable files in your containers
    # du -h . | sort -hr
    753M .
    211M ./e33f37/diff
    211M ./e33f37
    204M ./e33f37/diff/usr
    169M ./f87973/diff

    # ls ./f87973
    diff link                 <- Base layer!
    # ls ./e33f37
    diff link lower work      <- Intermediate layer!
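
The same inspection can be scripted. The sketch below reports, for each layer directory, the apparent size of its diff directory and whether it looks like a base layer, using the heuristic suggested by the listing above (a layer without a "lower" file has nothing beneath it). The function names are illustrative, sizes are apparent byte counts rather than the disk blocks du reports, and reading /var/lib/docker/overlay2 normally requires root.

```python
import os
import stat
import tempfile
from typing import Dict, Tuple


def directory_size(path: str) -> int:
    """Sum the apparent size in bytes of all regular files under path."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
    return total


def layer_report(overlay_root: str) -> Dict[str, Tuple[int, bool]]:
    """Map each layer directory to (size of diff/ in bytes, is_base_layer)."""
    report: Dict[str, Tuple[int, bool]] = {}
    for entry in sorted(os.listdir(overlay_root)):
        layer_dir = os.path.join(overlay_root, entry)
        diff_dir = os.path.join(layer_dir, "diff")
        if not os.path.isdir(diff_dir):
            continue  # not a layer directory
        has_lower = os.path.exists(os.path.join(layer_dir, "lower"))
        report[entry] = (directory_size(diff_dir), not has_lower)
    return report


if __name__ == "__main__":
    # Dry run on an empty scratch directory; point it at the overlay2
    # directory on a real host instead.
    print(layer_report(tempfile.mkdtemp()))
```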

  38.
    Further reading
    • Linux: Documentation/filesystems/overlay.txt

  39.
    Capabilities

  40.
    Traditional UNIX permissions
    • Privileged operations restricted to UID 0 (root)
    • Non-privileged operations available to all users
    • Privileged processes bypass all permission checks
    • Unprivileged processes are subject to permission checks (UID/GID)

  41.
    Increased granularity
    • Grant some permissions without root
    • Deny permissions even to root processes
    • 38 distinct capabilities
    • Varying degrees of granularity
    • CAP_SYS_ADMIN is very broad
    • CAP_SYS_TIME is comparatively narrow
    • Capabilities set on threads and files

  42.
    Thread capabilities
    • Capabilities are set on threads; different threads of the same process can have different capabilities
    • Threads can raise or lower privileges at runtime
    • Effective – used by the kernel for permission checks
    • Permitted – limiting superset of effective capabilities
    • Inheritable – persist across execve(2) for root
    • Ambient – persist across execve(2) for non-root
    • Bounding – limits permissions across execve(2)

  43.
    File capabilities
    • File capabilities + thread capabilities determine capabilities after execve(2)
    • Permitted – automatically permitted, regardless of inheritable
    • Inheritable – ANDed with thread inheritable set to determine which capabilities are enabled after execve(2)
    • Effective – whether permitted capabilities are automatically enabled

  44.
    Transforming capabilities with execve
    P = thread capability set before execve(2), P' = thread capability set after, F = file capability set
    P'(ambient)     = (file is privileged) ? 0 : P(ambient)
    P'(permitted)   = (P(inheritable) & F(inheritable)) |
                      (F(permitted) & P(bounding)) | P'(ambient)
    P'(effective)   = F(effective) ? P'(permitted) : P'(ambient)
    P'(inheritable) = P(inheritable) [i.e., unchanged]
    P'(bounding)    = P(bounding) [i.e., unchanged]
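
The transformation rules translate almost directly into set operations. The sketch below models each capability set as a frozenset of names and applies the formulas above; the class and function names are illustrative, the caller decides whether the file counts as privileged, and the special treatment of root is deliberately left out.

```python
from dataclasses import dataclass
from typing import FrozenSet

Caps = FrozenSet[str]


@dataclass(frozen=True)
class ThreadCaps:
    permitted: Caps
    inheritable: Caps
    effective: Caps
    bounding: Caps
    ambient: Caps


@dataclass(frozen=True)
class FileCaps:
    permitted: Caps = frozenset()
    inheritable: Caps = frozenset()
    effective: bool = False  # the file effective "set" is a single bit


def execve_transform(p: ThreadCaps, f: FileCaps,
                     file_is_privileged: bool) -> ThreadCaps:
    """Apply the execve(2) capability rules (ignoring root special cases)."""
    ambient = frozenset() if file_is_privileged else p.ambient
    permitted = (
        (p.inheritable & f.inheritable)
        | (f.permitted & p.bounding)
        | ambient
    )
    effective = permitted if f.effective else ambient
    return ThreadCaps(
        permitted=permitted,
        inheritable=p.inheritable,  # unchanged
        effective=effective,
        bounding=p.bounding,        # unchanged
        ambient=ambient,
    )
```

Working through a case by hand with this model shows why the bounding set matters: a capability in the file's permitted set is dropped if the thread's bounding set does not contain it.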

  61.
    Special treatment for root
    • Preserve traditional UNIX semantics
    • setuid root with file capabilities
    • Still bound by the Bounding set

  62.
    Syscalls
    • prctl(2): control ambient and bounding capabilities, “no new privileges”, “keep capabilities”, etc.
    • capget(2)/cap_get_proc(3) & capset(2)/cap_set_proc(3): control effective, permitted, and inheritable capability sets

  63.
    Tools
    • capsh(1): run processes with specified capabilities
    • getcap(8)/setcap(8): get/set file capabilities
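
Capability sets show up as hexadecimal bitmasks (for example the CapBnd and CapEff fields in /proc/<pid>/status), and these tools translate between masks and names. The sketch below does the same translation in Python. The bit positions are my reading of the kernel's capability.h for the 38 capabilities mentioned earlier and should be checked against your kernel headers; the sample set is only an illustration.

```python
from typing import Iterable, List

# Capability names indexed by bit position (assumed from linux/capability.h).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME",
    "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE",
    "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE",
    "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND",
    "CAP_AUDIT_READ",
]


def decode_caps(mask: int) -> List[str]:
    """Return the capability names whose bits are set, in bit order."""
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


def encode_caps(names: Iterable[str]) -> int:
    """Build the bitmask for a collection of capability names."""
    mask = 0
    for name in names:
        mask |= 1 << CAP_NAMES.index(name)
    return mask


# Illustrative set of capabilities to encode.
SAMPLE_SET = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER",
    "CAP_MKNOD", "CAP_NET_RAW", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETFCAP", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE",
    "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE",
]
```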

  64.
    Demo

  65.
    Challenges with the capability system
    • Capabilities were added to Linux much later than traditional UNIX permissions, and are not as widely used
    • Very complex; interactions between thread and file capabilities are hard to reason about
    • Broad capabilities make it hard to restrict privileges effectively
    • Some capabilities can be used to escalate arbitrarily

  66.
    How can you leverage this?
    • Reduce the need for setuid/setgid binaries
    • Understand how Docker uses bounded capabilities to restrict permissions
    "capabilities": {
      "bounding": [
        "CAP_CHOWN",
        "CAP_DAC_OVERRIDE",
        "CAP_FSETID",
        "CAP_FOWNER",
        "CAP_MKNOD",
        "CAP_NET_RAW",
        "CAP_SETGID",
        "CAP_SETUID",
        "CAP_SETFCAP",
        "CAP_SETPCAP",
        "CAP_NET_BIND_SERVICE",
        "CAP_SYS_CHROOT",
        "CAP_KILL",
        "CAP_AUDIT_WRITE"
      ],
      "effective": [
        "CAP_CHOWN",

  67.
    Further reading
    • capabilities(7)
    • capsh(1)
    • setcap(8)

  68.
    Runtimes

  69.
    What is a container runtime?
    • A software tool that configures Linux primitives to create and run containers on a host
    • Examples include:
        • Docker
        • containerd
        • runc
        • CRI-O
        • systemd-nspawn
    • Open Container Initiative (OCI) aims to standardize container runtimes, image format, and distribution
    • The OCI reference implementation (runc) powers Docker, containerd, and CRI-O

  70.
    OCI runtime spec
    • Containers are “bundles”
        • Filesystem
        • JSON document
    • Filesystem can be a union
    • JSON document describes
        • cgroups
        • Namespaces
        • Additional mounts
        • Linux capabilities
        • Linux security modules
        • And more
    • Hooks can modify the bundle
    {
      "ociVersion": "1.0.1",

      "root": {
        "path": "/var/lib/docker/overlay2/03004c/merged"
      },

      "hooks": {
        "prestart": [{"path": "/proc/9306/exe"}]
      },
      "linux": {
        "resources": {
          "cpu": {"shares": 0},
          "pids": {"limit": 0},

        },
        "cgroupsPath": "/docker/bd5cebc8950c",
        "namespaces": [
          {"type": "mount"},
          {"type": "network"},

        ],

      }
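
To connect the bundle's JSON document back to the primitives covered earlier, the sketch below loads a config and pulls out the root filesystem path, cgroup path, namespace types, and prestart hooks. The sample is a trimmed, syntactically complete stand-in built from the values shown on this slide, not a full config.json, and the function name is illustrative.

```python
import json
from typing import Any, Dict

# Trimmed stand-in for a bundle's config.json, using the values shown above.
SAMPLE_CONFIG = """
{
  "ociVersion": "1.0.1",
  "root": {"path": "/var/lib/docker/overlay2/03004c/merged"},
  "hooks": {"prestart": [{"path": "/proc/9306/exe"}]},
  "linux": {
    "resources": {"cpu": {"shares": 0}, "pids": {"limit": 0}},
    "cgroupsPath": "/docker/bd5cebc8950c",
    "namespaces": [{"type": "mount"}, {"type": "network"}]
  }
}
"""


def summarize_bundle(config: Dict[str, Any]) -> Dict[str, Any]:
    """Pick out the fields that map onto cgroups, namespaces, and the rootfs."""
    linux = config.get("linux", {})
    hooks = config.get("hooks", {})
    return {
        "rootfs": config.get("root", {}).get("path"),
        "cgroups_path": linux.get("cgroupsPath"),
        "namespaces": [ns["type"] for ns in linux.get("namespaces", [])],
        "prestart_hooks": [hook["path"] for hook in hooks.get("prestart", [])],
    }
```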

  71.
    OCI runtime hooks
    • Hooks run
        • Before a container starts
        • After a container starts
        • After a container stops
    • Hooks can modify the filesystem, modify the JSON file, or take other actions
    • Hooks run sequentially, in an order defined in the JSON file
    • Docker generates a bundle without hooks
    • Docker does let you specify your own runtime
    • Your runtime could inject hooks, then execute the real runtime
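
The hook-injection idea can be sketched as a pure function over the parsed config: a wrapper runtime would load the bundle's config.json, add its hook, write the file back, and then exec the real runtime. Only the injection step is shown here. The function name and hook path are hypothetical; the "path" and "args" keys follow the hook entry format of the OCI runtime spec as I understand it.

```python
import copy
from typing import Any, Dict, Optional, Sequence


def inject_prestart_hook(config: Dict[str, Any], hook_path: str,
                         args: Optional[Sequence[str]] = None) -> Dict[str, Any]:
    """Return a copy of an OCI config with one more prestart hook appended."""
    patched = copy.deepcopy(config)
    # Hooks run in list order, so appending keeps existing hooks first.
    prestart = patched.setdefault("hooks", {}).setdefault("prestart", [])
    hook: Dict[str, Any] = {"path": hook_path}
    if args:
        hook["args"] = list(args)
    prestart.append(hook)
    return patched
```

Working on a copy keeps the original bundle description intact, which makes it easy to compare what the wrapper changed before handing off to the real runtime.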

  72.
    Demo

  73.
    A brief note before we finish —
    Feedback provides valuable information to speakers!
    Feedback that is very helpful:
    • Topics you were excited to learn about
    • Suggestions for improving understanding and clarity
    Feedback that is extremely unhelpful:
    • Comments unrelated to talk content (please refer to the LinuxFest Northwest Code of Conduct)
    Reach out for Q&A (https://discuss.lfnw.org, @samuelkarp)
    For support, use the AWS Forums or contact AWS Support

  74.
    Questions?
    Q&A at https://discuss.lfnw.org
    Also available on Twitter – @samuelkarp

  75.
    Thank you!
    Samuel Karp
    @samuelkarp
