Linux Container Primitives and Runtimes (re:Invent 2018, CON407)

Samuel Karp
November 26, 2018

In this session, we'll explore the different Linux primitives that are commonly used in implementing container runtimes. Starting with Docker containers and moving down through the stack, we'll cover the underlying Linux primitives like cgroups, namespaces, and union filesystems, as well as how OCI runtimes like runc use them. We'll also discuss alternative container runtimes like CRI-O, rkt, and systemd-nspawn and what makes them different. This will be an interactive session with a live demo and open questions.

Transcript

  1. Linux Container Primitives and Runtimes
    Samuel Karp
    Senior Software Development Engineer
    Amazon Web Services – Container Services
    CON407
    © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

  2. Agenda
    Container primitives overview
    Control groups (cgroups)
    Namespaces
    Union filesystems
    Runtimes

  3. Related breakouts
    Monday, November 26
    CON355 Best Practices for Container Image Management
    10:45 a.m. – 11:45 a.m. | Venetian, Level 3, Murano 3302
    Tuesday, November 27
    CON410-R Deep Dive into Container Networking
    9:15 a.m. – 10:15 a.m. | MGM, Level 3, South Concourse 301
    Wednesday, November 28
    CON410-R1 Deep Dive into Container Networking
    6:15 p.m. – 7:15 p.m. | Aria West, Level 3, Starvine 2

  5. Containers are an abstraction over several different Linux technologies

  6. [Diagram]
    Containers: Container 1, Container 2, Container 3, Container 4, Container 5, Container 6
    Container runtime
    Linux kernel: namespaces, control groups, union filesystem

  8. What do control groups (cgroups) do?
    • Organize all processes in the system
    • Limit or prioritize resource utilization
    • Account for resource usage and gather utilization data

  9. Subsystems
    • Control groups is an abstract framework
    • Subsystems are concrete implementations
    • Different subsystems can organize processes separately
    • Most subsystems are resource controllers
    Examples of subsystems:
    • Memory
    • CPU time
    • Block I/O
    • Number of discrete processes (pids)
    • CPU & memory pinning
    • Freezer (used by docker pause)
    • Devices
    • Network priority

  10. Hierarchical representation
    • Independent subsystem hierarchies
    • Every pid is represented exactly once in each subsystem
    • New processes inherit cgroups from their parents
    ├── blkio
    │   └── docker
    │       └── b211c37
    ├── cpu,cpuacct
    │   └── docker
    │       └── b211c37
    ├── cpuset
    │   └── docker
    │       └── b211c37
    ├── devices
    │   └── docker
    │       └── b211c37
    ├── freezer
    │   └── docker
    │       └── b211c37
    ├── hugetlb
    │   └── docker
    │       └── b211c37
    ├── memory
    │   └── docker
    │       └── b211c37
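
The per-subsystem membership described on this slide is also visible per process in /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path. The following is a minimal sketch, not part of the talk, of parsing that format. The function name and the sample text are assumptions for illustration; the sample is modeled on the tree above rather than captured from a real host.

```python
from typing import Dict


def parse_proc_cgroup(text: str) -> Dict[str, str]:
    """Map each cgroup v1 subsystem name to the cgroup path of a process."""
    membership: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        _hierarchy_id, controllers, path = line.split(":", 2)
        if not controllers:
            # An empty controller list marks the cgroup v2 unified hierarchy.
            continue
        # Co-mounted controllers (for example "cpu,cpuacct") share one path.
        for controller in controllers.split(","):
            membership[controller] = path
    return membership


# Hypothetical sample modeled on the tree above; real output varies by host.
SAMPLE = """\
7:memory:/docker/b211c37
4:cpu,cpuacct:/docker/b211c37
2:freezer:/docker/b211c37
"""

for subsystem, path in sorted(parse_proc_cgroup(SAMPLE).items()):
    print(f"{subsystem:8} {path}")
```

On a Linux host, passing the contents of /proc/self/cgroup to the function should give the same kind of mapping for the current process.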

  11. cgroup virtual filesystem
    • Typically mounted at /sys/fs/cgroup
    • tasks virtual file holds all pids in the cgroup
    • Other files have settings and utilization data
    ├── cgroup.clone_children
    ├── cgroup.procs
    ├── cgroup.sane_behavior
    ├── cpuacct.stat
    ├── cpuacct.usage
    ├── cpuacct.usage_all
    ├── cpuacct.usage_percpu
    ├── cpuacct.usage_percpu_sys
    ├── cpuacct.usage_percpu_user
    ├── cpuacct.usage_sys
    ├── cpuacct.usage_user
    ├── cpu.cfs_period_us
    ├── cpu.cfs_quota_us
    ├── cpu.rt_period_us
    ├── cpu.rt_runtime_us
    ├── cpu.shares
    ├── cpu.stat
    ├── notify_on_release
    └── tasks
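
As an illustration of how the settings files listed above can be read, here is a small sketch that derives a CPU limit from cpu.cfs_quota_us and cpu.cfs_period_us. In cgroup v1, a quota of -1 means no limit; otherwise quota divided by period gives the number of CPUs' worth of time the group may use per period. The function names are assumptions, and the demo writes to a temporary directory that stands in for a real cgroup directory.

```python
import os
import tempfile
from typing import Optional


def cpu_limit(quota_us: int, period_us: int) -> Optional[float]:
    """Return the CFS bandwidth limit in CPUs, or None when unlimited."""
    if quota_us < 0:
        # cgroup v1 uses -1 to mean "no quota".
        return None
    return quota_us / period_us


def read_cpu_limit(cgroup_dir: str) -> Optional[float]:
    """Read the two CFS files from a cpu cgroup directory."""
    def read_int(name: str) -> int:
        with open(os.path.join(cgroup_dir, name)) as f:
            return int(f.read().strip())

    return cpu_limit(read_int("cpu.cfs_quota_us"), read_int("cpu.cfs_period_us"))


# Demo against a throwaway directory standing in for a real cgroup directory.
with tempfile.TemporaryDirectory() as fake_cgroup:
    for name, value in (("cpu.cfs_quota_us", "50000"), ("cpu.cfs_period_us", "100000")):
        with open(os.path.join(fake_cgroup, name), "w") as f:
            f.write(value + "\n")
    print(read_cpu_limit(fake_cgroup))  # 0.5
```

On a host using cgroup v1, the directory would typically be something like a container's directory under the cpu,cpuacct hierarchy; the exact path depends on the runtime.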

  13. What can you use cgroups for?
    • cgroups can be used independently of containers
    • cgroups control resource limits for processes
    • Monitor processes and organize them
    • Be careful not to break any assumptions your container runtime or orchestrator might have

  14. Further reading
    • Linux: Documentation/cgroup-v1

  16. What do namespaces do?
    • Isolation mechanism for resources
    • Changes to resources within a namespace are invisible outside the namespace*

  17. What namespaces are available?
    • Network
    • Filesystem (mounts)
    • Processes (pid)
    • Inter-process communication (ipc)
    • Hostname and domain name (uts)
    • User and group IDs
    • cgroup

  18. Namespace sharing
    [Diagram: Process A, Process B, Process C, and Process D, grouped by the namespaces they share]
    pid:[1] pid:[2] pid:[3]
    net:[4] net:[5] net:[6]
    mount:[7] mount:[8]

  19. Network namespace
    • Frequently used in containers
    • docker run uses a separate network namespace per container
    • Multiple containers can share a network namespace
        • Kubernetes pods
        • Amazon Elastic Container Service (Amazon ECS) tasks with the awsvpc networking mode

  20. Mount namespace
    • Used for giving containers their own filesystem
    • Container image is mounted as the root filesystem
    • More about filesystems to come!
    bash-4.2# mount
    overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/Q5EBZ7CIJYELLG2MBKZIRRFWW6:/var/lib/docker/overlay2/l/PKATP76T57BQZ5D44JXYFIB26E,upperdir=/var/lib/docker/overlay2/88816f9510a9ff38b31eaaceccbef6ffc9cc3c06bcc451f9684850db5ee1b152/diff,workdir=/var/lib/docker/overlay2/88816f9510a9ff38b31eaaceccbef6ffc9cc3c06bcc451f9684850db5ee1b152/work)
    proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
    tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755)
    devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
    sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
    tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,relatime,mode=755)

  21. procfs virtual filesystem
    • Namespaces are visible in /proc
    • Files are symbolic links to the namespace
    • The link contains the namespace type and inode number to identify the namespace
    $ readlink /proc/$$/ns/*
    cgroup:[4026531835]
    ipc:[4026531839]
    mnt:[4026531840]
    net:[4026531993]
    pid:[4026531836]
    user:[4026531837]
    uts:[4026531838]
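
Because each link target carries the namespace type and inode number, two processes are in the same namespace exactly when their links match. The sketch below, which is not from the talk, parses a link target in the type:[inode] form shown above and compares two processes. The function names are assumptions; read_ns and share_namespace need a Linux /proc and permission to read the other process's ns directory, so only the parser is exercised in the demo line.

```python
import os
import re
from typing import Tuple

# Link targets look like "net:[4026531993]".
_NS_LINK = re.compile(r"^(?P<type>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> Tuple[str, int]:
    """Split a namespace link target into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("type"), int(match.group("inode"))


def read_ns(pid: int, ns_type: str) -> Tuple[str, int]:
    """Read and parse /proc/<pid>/ns/<ns_type> (Linux only)."""
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/{ns_type}"))


def share_namespace(pid_a: int, pid_b: int, ns_type: str) -> bool:
    """True when both processes are in the same namespace of this type."""
    return read_ns(pid_a, ns_type) == read_ns(pid_b, ns_type)


print(parse_ns_link("net:[4026531993]"))  # ('net', 4026531993)
```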

  22. Creating namespaces
    • clone(2) and unshare(2)
    • CLONE_NEW* flags to specify which namespaces
    • clone(2) is for new processes to create new namespaces
    • unshare(2) is for existing processes to create new namespaces
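
To make the CLONE_NEW* flags concrete, here is a hedged sketch that combines them into the bitmask unshare(2) expects and calls unshare through libc with ctypes. The flag values are the ones defined in the Linux headers; the dictionary keys and function names are assumptions chosen to match the names under /proc/<pid>/ns. Calling unshare() usually requires root (CAP_SYS_ADMIN), so the demo line only computes the mask.

```python
import ctypes
import os
from typing import Iterable

# Flag values from the Linux headers (<linux/sched.h>).
CLONE_FLAGS = {
    "mnt": 0x00020000,     # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "user": 0x10000000,    # CLONE_NEWUSER
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
}


def unshare_flags(ns_types: Iterable[str]) -> int:
    """OR together the CLONE_NEW* flags for the requested namespace types."""
    flags = 0
    for ns_type in ns_types:
        flags |= CLONE_FLAGS[ns_type]
    return flags


def unshare(ns_types: Iterable[str]) -> None:
    """Move the calling process into new namespaces via unshare(2)."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(unshare_flags(ns_types)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


print(hex(unshare_flags(["uts", "net"])))  # 0x44000000
```

The same mask would be passed in the flags argument of clone(2) when a new process should start in new namespaces.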

  23. Persisting namespaces
    • The kernel automatically garbage-collects namespaces by reference-counting
    • New namespace remains open as long as
        • a process runs or
        • a mount is open
    • Bind-mount a file in /proc/$$/ns to another place on the filesystem
    $ mount \
        --bind /proc/$$/ns/net \
        /var/run/netns/con407

  24. Entering namespaces
    • Open a file from /proc/$$/ns (or a bind-mount)
    • Pass to setns(2) to enter the existing namespace
    • Namespace remains open as long as the process is running, even if the original file goes away
    • nsenter(1) is a command for doing this interactively
    • ip-netns(8) works specifically for network namespaces

  26. How can you leverage this?
    • Use nsenter or ip netns to troubleshoot container networking
    • Monitor containers by entering the pid namespace
    • Access binaries in your containers with the mount namespace

  27. Further reading
    • man 7 namespaces
    • man 7 pid_namespaces
    • man 7 user_namespaces

  29. Filesystem images
    • Images are representations of a filesystem
    • Images are popular for virtualization and container systems
    • Docker helped popularize the concept of layers
    • A union filesystem is one where two or more filesystems are joined together in a unified view

  30. How Docker layers work
    • A copy-on-write view of your files
    • New files exist only in the top layer
    • When a file is modified, it is copied up to the top layer
    • Unmodified files remain in whichever layer last added or modified them
    • Deleted files are hidden, but still exist
    [Diagram, top to bottom]
    Top layer (read-write)
    Intermediate layer (read-only)
    Base layer (read-only)

  31. Union filesystems
    • Popular in container runtimes (like Docker) to implement layers
    • Efficient use of storage when starting multiple containers with identical images
    • Efficient use of storage when making minor modifications to images

  32. Overlay filesystem
    • Joins two directories (upper and lower) to form a union
    • Uses file name to describe the files
    • When writing to the overlay
        • lowerdir is not modified, all changes go to upperdir
        • Existing files are copied up to the upperdir for modification
        • Whole file is copied, not just blocks
    • “Deleting” a file in the upperdir creates a whiteout
        • Files: character devices with 0/0 device number
        • Directories: xattr “trusted.overlay.opaque” set to “y”
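
The write, copy-up, and whiteout rules on this slide can be modeled in a few lines. What follows is a toy in-memory simulation, not how the kernel implements overlayfs: layers are plain dictionaries keyed by file name, and a sentinel object stands in for the 0/0 character device used as a whiteout. Directories, opaque markers, and metadata are deliberately ignored, and every name in the block is hypothetical.

```python
from typing import Dict, List, Optional

# Sentinel standing in for the whiteout marker (a 0/0 character device on disk).
WHITEOUT = object()

Layer = Dict[str, object]


def lookup(path: str, upper: Layer, lowers: List[Layer]) -> Optional[object]:
    """Resolve a name through the upper layer, then lowers from top to bottom."""
    for layer in [upper, *lowers]:
        if path in layer:
            value = layer[path]
            # A whiteout hides the name even if a lower layer still has it.
            return None if value is WHITEOUT else value
    return None


def write(path: str, data: object, upper: Layer) -> None:
    """All changes go to the upper layer; lower layers are never modified."""
    upper[path] = data


def delete(path: str, upper: Layer, lowers: List[Layer]) -> None:
    """Record a whiteout when a lower layer has the name; otherwise just remove it."""
    if any(path in lower for lower in lowers):
        upper[path] = WHITEOUT
    else:
        upper.pop(path, None)


base = {"/etc/motd": "hello", "/bin/sh": "shell"}
top: Layer = {}
write("/etc/motd", "changed", top)
delete("/bin/sh", top, [base])
print(lookup("/etc/motd", top, [base]))  # changed
print(lookup("/bin/sh", top, [base]))    # None
print(base["/bin/sh"])                   # shell
```

The last line shows the point made on the slide: the deleted file is hidden from the merged view but still exists in the lower layer.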

  33. Overlay filesystem (continued)
    • An upperdir can have multiple lowerdirs
    • Overlay filesystems can be created with mount(2)
    • You can examine the mounts with
        • mount(8)
        • /proc/mounts
        • /proc/$$/mountinfo
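
As a sketch of what an overlay mount with several lowerdirs looks like, the helper below builds the option string passed to mount, and a second helper splits an option string such as the one in the mount output on the Mount namespace slide. Multiple lowerdirs are joined with colons, with the top-most layer first. The function names and the /layers and /merged paths are assumptions for illustration, and the simple parser assumes no commas inside paths.

```python
from typing import Dict, List, Union


def overlay_options(lowerdirs: List[str], upperdir: str, workdir: str) -> str:
    """Build overlay mount options; lowerdirs are listed top-most first."""
    return f"lowerdir={':'.join(lowerdirs)},upperdir={upperdir},workdir={workdir}"


def parse_mount_options(options: str) -> Dict[str, Union[str, bool]]:
    """Split "rw,relatime,key=value" into a dict; bare flags map to True."""
    parsed: Dict[str, Union[str, bool]] = {}
    for item in options.split(","):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


# Hypothetical directories; the printed command would need root to run.
opts = overlay_options(["/layers/app", "/layers/base"], "/layers/rw/diff", "/layers/rw/work")
print(f"mount -t overlay overlay -o {opts} /merged")

parsed = parse_mount_options("rw,relatime," + opts)
print(parsed["lowerdir"].split(":"))  # ['/layers/app', '/layers/base']
```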

  34. Docker’s overlay driver
    • Docker’s default layer storage uses the overlay filesystem
    • upperdir, lowerdir, and diff directories are in /var/lib/docker/overlay2

  36. How can you leverage this?
    • Locate files in your layers
    • Examine which files and layers contribute to your disk usage
    • Understand the impact of writable files in your containers and how to reduce it
    # du -h . | sort -hr
    753M .
    211M ./e33f37/diff
    211M ./e33f37
    204M ./e33f37/diff/usr
    169M ./f87973/diff

    # ls ./f87973
    diff link
    # ls ./e33f37
    diff link lower work

  37. Further reading
    • Linux: Documentation/filesystems/overlay.txt

  39. What is a container runtime?
    • A software tool that configures Linux primitives to create and run containers on a host
    • Examples include:
        • Docker
        • containerd
        • CRI-O
        • rkt
        • systemd-nspawn
    • Open Container Initiative (OCI) aims to standardize container runtimes, image format, and distribution
    • The OCI reference implementation (runc) powers Docker, containerd, and CRI-O

  40. OCI runtime spec
    • Containers are “bundles”
        • Filesystem
        • JSON document
    • Filesystem can be a union
    • JSON document describes
        • cgroups
        • Namespaces
        • Additional mounts
        • Linux capabilities
        • Linux security modules
        • And more
    • Hooks can modify the bundle
    {
      "ociVersion": "1.0.1",

      "root": {
        "path": "/var/lib/docker/overlay2/03004c/merged"
      },

      "hooks": {
        "prestart": [{"path": "/proc/9306/exe"}]
      },
      "linux": {
        "resources": {
          "cpu": {"shares": 0},
          "pids": {"limit": 0},

        },
        "cgroupsPath": "/docker/bd5cebc8950c",
        "namespaces": [
          {"type": "mount"},
          {"type": "network"},

        ],

      }
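
For readers who want to experiment with the JSON document, here is a minimal sketch that assembles only the fields visible in the excerpt above and serializes them. It is not a complete config.json: a real bundle needs further sections (such as the process to run) before a runtime like runc will accept it. The function name, its parameters, and the example values are assumptions for illustration.

```python
import json
from typing import Dict, List


def bundle_config(rootfs: str, cgroups_path: str, namespaces: List[str],
                  cpu_shares: int, pids_limit: int) -> Dict:
    """Build the subset of an OCI runtime config shown in the excerpt above."""
    return {
        "ociVersion": "1.0.1",
        "root": {"path": rootfs},
        "linux": {
            "resources": {
                "cpu": {"shares": cpu_shares},
                "pids": {"limit": pids_limit},
            },
            "cgroupsPath": cgroups_path,
            # Each entry names one namespace type for the container.
            "namespaces": [{"type": ns_type} for ns_type in namespaces],
        },
    }


# Hypothetical values; a runtime would read the result from <bundle>/config.json.
config = bundle_config("rootfs", "/docker/example", ["mount", "network"],
                       cpu_shares=512, pids_limit=100)
print(json.dumps(config, indent=2))
```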

  41. OCI runtime hooks
    • Hooks run
        • Before a container starts
        • After a container starts
        • After a container stops
    • Hooks can modify the filesystem, modify the JSON file, or take other actions
    • Hooks run sequentially, in an order defined in the JSON file
    • Docker generates a bundle without hooks
    • Docker does let you specify your own runtime
    • Your runtime could inject hooks, then execute the real runtime
    • This is how Nvidia’s container runtime works
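
The wrapper-runtime idea on this slide comes down to editing the bundle's JSON before handing off to the real runtime. The sketch below shows only the injection step on an in-memory config: it appends a prestart hook entry without mutating the input. The function name and the hook path are hypothetical, and locating the bundle, rewriting config.json on disk, and executing the real runtime are left out.

```python
import copy
from typing import Dict, List, Optional


def inject_prestart_hook(config: Dict, hook_path: str,
                         args: Optional[List[str]] = None) -> Dict:
    """Return a copy of an OCI config with one more prestart hook appended."""
    updated = copy.deepcopy(config)
    hook: Dict = {"path": hook_path}
    if args:
        hook["args"] = list(args)
    # Hooks run in list order, so appending runs this one after existing hooks.
    updated.setdefault("hooks", {}).setdefault("prestart", []).append(hook)
    return updated


original = {"ociVersion": "1.0.1", "hooks": {"prestart": [{"path": "/proc/9306/exe"}]}}
patched = inject_prestart_hook(original, "/usr/local/bin/example-hook", ["example-hook", "prestart"])
print([hook["path"] for hook in patched["hooks"]["prestart"]])
# ['/proc/9306/exe', '/usr/local/bin/example-hook']
```

A wrapper built this way would write the patched document back to the bundle and then exec the real runtime with the arguments it received.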

  44. Thank you!
    Samuel Karp
    @samuelkarp