Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Linux Container Primitives and Runtimes
CON407
Samuel Karp
Senior Software Development Engineer
Amazon Web Services – Container Services
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 3

Slide 3 text

Agenda
• Container primitives overview
• Control groups (cgroups)
• Namespaces
• Union filesystems
• Runtimes

Slide 4

Slide 4 text

Related breakouts
Monday, November 26
CON355 Best Practices for Container Image Management
10:45 a.m. – 11:45 a.m. | Venetian, Level 3, Murano 3302
Tuesday, November 27
CON410-R Deep Dive into Container Networking
9:15 a.m. – 10:15 a.m. | MGM, Level 3, South Concourse 301
Wednesday, November 28
CON410-R1 Deep Dive into Container Networking
6:15 p.m. – 7:15 p.m. | Aria West, Level 3, Starvine 2

Slide 5

Slide 5 text


Slide 6

Slide 6 text

Containers are an abstraction over several different Linux technologies

Slide 7

Slide 7 text

[Diagram: Containers 1 through 6 run on a container runtime, which runs on the Linux kernel and uses namespaces, control groups, and a union filesystem.]

Slide 8

Slide 8 text


Slide 9

Slide 9 text

What do control groups (cgroups) do?
• Organize all processes in the system
• Limit or prioritize resource utilization
• Account for resource usage and gather utilization data

Slide 10

Slide 10 text

Subsystems
• Control groups is an abstract framework
• Subsystems are concrete implementations
• Different subsystems can organize processes separately
• Most subsystems are resource controllers
Examples of subsystems:
• Memory
• CPU time
• Block I/O
• Number of discrete processes (pids)
• CPU & memory pinning
• Freezer (used by docker pause)
• Devices
• Network priority

Slide 11

Slide 11 text

Hierarchical representation
• Independent subsystem hierarchies
• Every pid is represented exactly once in each subsystem
• New processes inherit cgroups from their parents
    ├── blkio
    │   └── docker
    │       └── b211c37
    ├── cpu,cpuacct
    │   └── docker
    │       └── b211c37
    ├── cpuset
    │   └── docker
    │       └── b211c37
    ├── devices
    │   └── docker
    │       └── b211c37
    ├── freezer
    │   └── docker
    │       └── b211c37
    ├── hugetlb
    │   └── docker
    │       └── b211c37
    ├── memory
    │   └── docker
    │       └── b211c37
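A process's place in each subsystem hierarchy can be read from /proc/<pid>/cgroup. The sketch below is illustrative only: it assumes the cgroup v1 line format "hierarchy-ID:subsystem-list:cgroup-path", and the SAMPLE text is a made-up stand-in modeled on the tree above.

```python
from pathlib import Path


def parse_proc_cgroup(text):
    """Map each cgroup v1 subsystem name to the cgroup path listed for it."""
    membership = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line is "hierarchy-ID:subsystem-list:cgroup-path".
        _hierarchy_id, subsystems, path = line.split(":", 2)
        for name in subsystems.split(","):
            if name:  # a cgroup v2 entry ("0::/path") has an empty list
                membership[name] = path
    return membership


# Hypothetical sample, not captured from a real host.
SAMPLE = """\
3:cpu,cpuacct:/docker/b211c37
2:memory:/docker/b211c37
1:freezer:/
"""

if __name__ == "__main__":
    source = Path("/proc/self/cgroup")
    text = source.read_text() if source.exists() else SAMPLE
    for subsystem, path in sorted(parse_proc_cgroup(text).items()):
        print(f"{subsystem:10} {path}")
```

Because the hierarchies are independent, the same process may show a different path for each subsystem, as the freezer line in the sample does.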

Slide 12

Slide 12 text

cgroup virtual filesystem
• Typically mounted at /sys/fs/cgroup
• tasks virtual file holds all pids in the cgroup
• Other files have settings and utilization data
    ├── cgroup.clone_children
    ├── cgroup.procs
    ├── cgroup.sane_behavior
    ├── cpuacct.stat
    ├── cpuacct.usage
    ├── cpuacct.usage_all
    ├── cpuacct.usage_percpu
    ├── cpuacct.usage_percpu_sys
    ├── cpuacct.usage_percpu_user
    ├── cpuacct.usage_sys
    ├── cpuacct.usage_user
    ├── cpu.cfs_period_us
    ├── cpu.cfs_quota_us
    ├── cpu.rt_period_us
    ├── cpu.rt_runtime_us
    ├── cpu.shares
    ├── cpu.stat
    ├── notify_on_release
    ├── release_agent
    └── tasks
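Since settings are plain files, a limit can be computed by reading them. This is a minimal sketch, assuming the cgroup v1 cpu subsystem files listed above; the demo builds a temporary stand-in directory (not a real cgroup) so it runs without touching /sys/fs/cgroup.

```python
import tempfile
from pathlib import Path


def read_int(cgroup_dir, name):
    """Read one integer-valued control file from a cgroup directory."""
    return int((Path(cgroup_dir) / name).read_text().strip())


def cpu_limit(cgroup_dir):
    """Return the CFS limit in CPUs, or None when no quota is set."""
    quota = read_int(cgroup_dir, "cpu.cfs_quota_us")
    period = read_int(cgroup_dir, "cpu.cfs_period_us")
    if quota < 0:  # a quota of -1 means "no limit"
        return None
    return quota / period


def demo():
    """Build a stand-in cgroup directory so the sketch runs anywhere."""
    with tempfile.TemporaryDirectory() as fake:
        (Path(fake) / "cpu.cfs_quota_us").write_text("50000\n")
        (Path(fake) / "cpu.cfs_period_us").write_text("100000\n")
        return cpu_limit(fake)


if __name__ == "__main__":
    print(demo())
```

On a host, the same function could be pointed at a directory such as /sys/fs/cgroup/cpu,cpuacct/docker/<container id>; that path layout is an assumption based on the hierarchy shown earlier.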

Slide 13

Slide 13 text


Slide 14

Slide 14 text

What can you use cgroups for?
• cgroups can be used independently of containers
• cgroups control resource limits for processes
• Monitor processes and organize them
• Be careful not to break any assumptions your container runtime or orchestrator might have

Slide 15

Slide 15 text

Further reading
• Linux: Documentation/cgroup-v1

Slide 16

Slide 16 text


Slide 17

Slide 17 text

What do namespaces do?
• Isolation mechanism for resources
• Changes to resources within a namespace are invisible outside the namespace*

Slide 18

Slide 18 text

What namespaces are available?
• Network
• Filesystem (mounts)
• Processes (pid)
• Inter-process communication (ipc)
• Hostname and domain name (uts)
• User and group IDs
• cgroup

Slide 19

Slide 19 text

Namespace sharing
[Diagram: Processes A, B, C, and D mapped to namespaces pid:[1], pid:[2], pid:[3], net:[4], net:[5], net:[6], mount:[7], and mount:[8]; some namespaces are shared by more than one process.]

Slide 20

Slide 20 text

Network namespace
• Frequently used in containers
• docker run uses a separate network namespace per container
• Multiple containers can share a network namespace
  • Kubernetes pods
  • Amazon Elastic Container Service (Amazon ECS) tasks with the awsvpc networking mode

Slide 21

Slide 21 text

Mount namespace
• Used for giving containers their own filesystem
• Container image is mounted as the root filesystem
• More about filesystems to come!
    bash-4.2# mount
    overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/Q5EBZ7CIJYELLG2MBKZIRRFWW6:/var/lib/docker/overlay2/l/PKATP76T57BQZ5D44JXYFIB26E,upperdir=/var/lib/docker/overlay2/88816f9510a9ff38b31eaaceccbef6ffc9cc3c06bcc451f9684850db5ee1b152/diff,workdir=/var/lib/docker/overlay2/88816f9510a9ff38b31eaaceccbef6ffc9cc3c06bcc451f9684850db5ee1b152/work)
    proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
    tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755)
    devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
    sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
    tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,relatime,mode=755)

Slide 22

Slide 22 text

procfs virtual filesystem
• Namespaces are visible in /proc
• Files are symbolic links to the namespace
• The link contains the namespace type and inode number to identify the namespace
    $ readlink /proc/$$/ns/*
    cgroup:[4026531835]
    ipc:[4026531839]
    mnt:[4026531840]
    net:[4026531993]
    pid:[4026531836]
    user:[4026531837]
    uts:[4026531838]
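The same links can be read programmatically to check whether two processes share a namespace. This is a rough sketch: the helper names are hypothetical, and reading another process's /proc/<pid>/ns entries may require elevated privileges.

```python
import os
import re

NS_LINK = re.compile(r"^(?P<type>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'net:[4026531993]' into (type, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("type"), int(match.group("inode"))


def namespaces_of(pid="self"):
    """Return {entry name: inode} for every link in /proc/<pid>/ns."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in os.listdir(ns_dir):
        _ns_type, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result


def shared(a, b):
    """List the entries for which two {name: inode} maps hold the same inode."""
    return sorted(name for name in a if b.get(name) == a[name])


if __name__ == "__main__" and os.path.isdir("/proc/self/ns"):
    # Compare this process with its parent; both usually share everything.
    print(shared(namespaces_of("self"), namespaces_of(os.getppid())))
```

Two processes are in the same namespace exactly when the inode numbers match, which is what the comparison relies on.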

Slide 23

Slide 23 text

Creating namespaces
• clone(2) and unshare(2)
• CLONE_NEW* flags to specify which namespaces
• clone(2) is for new processes to create new namespaces
• unshare(2) is for existing processes to create new namespaces
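As a hedged illustration of the unshare(2) path, the sketch below combines CLONE_NEW* flags and calls the syscall through libc with ctypes. The flag values are the ones defined in <linux/sched.h>; the dictionary keys and function names are this sketch's own, and the call normally needs root or CAP_SYS_ADMIN.

```python
import ctypes
import os

# CLONE_NEW* values from <linux/sched.h>, keyed by /proc/<pid>/ns entry name.
CLONE_FLAGS = {
    "mnt": 0x00020000,     # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "user": 0x10000000,    # CLONE_NEWUSER
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
}


def flags_for(ns_types):
    """OR together the CLONE_NEW* flags for the requested namespace types."""
    flags = 0
    for ns_type in ns_types:
        flags |= CLONE_FLAGS[ns_type]
    return flags


def unshare(ns_types):
    """Move the calling process into new namespaces of the given types."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(flags_for(ns_types)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

For example, calling unshare(["uts"]) as root and then changing the hostname would leave the host's hostname untouched. Newer Python versions also expose os.unshare, which could replace the ctypes call.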

Slide 24

Slide 24 text

Persisting namespaces
• The kernel automatically garbage-collects namespaces by reference-counting
• New namespace remains open as long as
  • a process runs or
  • a mount is open
• Bind-mount a file in /proc/$$/ns to another place on the filesystem
    $ mount \
        --bind /proc/$$/ns/net \
        /var/run/netns/con407

Slide 25

Slide 25 text

Entering namespaces
• Open a file from /proc/$$/ns (or a bind-mount)
• Pass to setns(2) to enter the existing namespace
• Namespace remains open as long as the process is running, even if the original file goes away
• nsenter(1) is a command for doing this interactively
• ip-netns(8) works specifically for network namespaces
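The open-then-setns(2) sequence can be sketched as follows. This is an illustration rather than a hardened tool: the function names are hypothetical, the call goes through libc via ctypes, and it typically requires root.

```python
import ctypes
import os


def ns_path(pid, ns_type):
    """Build the /proc path of one namespace file for a process."""
    return f"/proc/{pid}/ns/{ns_type}"


def enter_namespace(path, nstype=0):
    """Open a namespace file (or bind-mount) and join it with setns(2).

    An nstype of 0 lets the kernel accept any namespace type.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    fd = os.open(path, os.O_RDONLY)
    try:
        if libc.setns(fd, nstype) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        # The process stays in the namespace after the descriptor is closed.
        os.close(fd)
```

A call such as enter_namespace("/var/run/netns/con407") would join a network namespace persisted with a bind-mount, while enter_namespace(ns_path(pid, "net")) would join the one a running process is using.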

Slide 26

Slide 26 text


Slide 27

Slide 27 text

How can you leverage this?
• Use nsenter or ip netns to troubleshoot container networking
• Monitor containers by entering the pid namespace
• Access binaries in your containers with the mount namespace

Slide 28

Slide 28 text

Further reading
• man 7 namespaces
• man 7 pid_namespaces
• man 7 user_namespaces

Slide 29

Slide 29 text


Slide 30

Slide 30 text

Filesystem images
• Images are representations of a filesystem
• Images are popular for virtualization and container systems
• Docker helped popularize the concept of layers
• A union filesystem is one where two or more filesystems are joined together in a unified view

Slide 31

Slide 31 text

How Docker layers work
• A copy-on-write view of your files
• New files exist only in the top layer
• When a file is modified, it is copied up to the top layer
• Unmodified files exist in whatever layer they were added/modified
• Deleted files are hidden, but still exist
[Diagram: layer stack, top to bottom: Top layer (read-write), Intermediate layer (read-only), Base layer (read-only)]
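The layer rules above can be modeled with plain dictionaries. This is a toy model for intuition only, not how Docker or any union filesystem is implemented; the class and its methods are invented for the example.

```python
WHITEOUT = object()  # marker for a path deleted in the top layer


class LayeredView:
    """Toy copy-on-write view: read-only layers plus one writable top layer."""

    def __init__(self, *read_only_layers):
        # Layers are dicts of path -> content, listed base first.
        self.lower = list(read_only_layers)
        self.top = {}

    def read(self, path):
        # Search the top layer first, then lower layers from newest to oldest.
        for layer in [self.top] + self.lower[::-1]:
            if path in layer:
                if layer[path] is WHITEOUT:
                    break
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, content):
        # New and modified files land only in the top layer.
        self.top[path] = content

    def delete(self, path):
        self.read(path)  # raises if the path is not currently visible
        self.top[path] = WHITEOUT  # hides the file; lower copies still exist


if __name__ == "__main__":
    base = {"/etc/motd": "hello"}
    view = LayeredView(base, {"/bin/app": "v1"})
    view.write("/etc/motd", "changed")
    view.delete("/bin/app")
    print(view.read("/etc/motd"), base["/etc/motd"], sorted(view.top))
```

The read-only dictionaries are never mutated, which mirrors why deleting a file in a container does not shrink the image underneath it.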

Slide 32

Slide 32 text

Union filesystems
• Popular in container runtimes (like Docker) to implement layers
• Efficient use of storage when starting multiple containers with identical images
• Efficient use of storage when making minor modifications to images

Slide 33

Slide 33 text

Overlay filesystem
• Joins two directories (upper and lower) to form a union
• Identifies files by name: a file in upperdir hides a file with the same name in lowerdir
• When writing to the overlay
  • lowerdir is not modified, all changes go to upperdir
  • Existing files are copied up to the upperdir for modification
  • Whole file is copied, not just blocks
• “Deleting” a file that exists in lowerdir creates a whiteout in upperdir
  • Files: character devices with 0/0 device number
  • Directories: xattr “trusted.overlay.opaque” set to “y”
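File whiteouts can be spotted by checking for character devices with device number 0/0. The sketch below is an assumption-laden helper, not part of any overlay tooling; it covers only file whiteouts, since reading the trusted.overlay.opaque xattr on directories generally requires elevated privileges.

```python
import os
import stat


def is_whiteout(mode, rdev):
    """True for a character device whose device number is 0/0."""
    return stat.S_ISCHR(mode) and os.major(rdev) == 0 and os.minor(rdev) == 0


def find_whiteouts(upperdir):
    """Yield paths under upperdir that mark files deleted from lowerdir."""
    for root, dirs, files in os.walk(upperdir):
        for name in dirs + files:
            path = os.path.join(root, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if is_whiteout(st.st_mode, st.st_rdev):
                yield path
```

Running find_whiteouts over a container's upperdir would list the files that were deleted inside the container but still exist in a lower layer.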

Slide 34

Slide 34 text

Overlay filesystem (continued)
• A single upperdir can be stacked on multiple lowerdirs
• Overlay filesystems can be created with mount(2)
• You can examine the mounts with
  • mount(8)
  • /proc/mounts
  • /proc/$$/mountinfo
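The overlay options that appear in /proc/mounts can be split into their lowerdir, upperdir, and workdir parts. This is a minimal sketch with hypothetical function names; it assumes paths contain no commas or colons, which would otherwise need unescaping.

```python
def parse_overlay_options(options):
    """Split an overlay option string into lowerdirs, upperdir, and workdir."""
    parsed = {"lowerdir": [], "upperdir": None, "workdir": None}
    for option in options.split(","):
        key, _, value = option.partition("=")
        if key == "lowerdir":
            # Multiple lowerdirs are colon-separated, topmost first.
            parsed["lowerdir"] = value.split(":")
        elif key in ("upperdir", "workdir"):
            parsed[key] = value
    return parsed


def overlay_mounts(mounts_text):
    """Yield (mount point, parsed options) for overlay lines of /proc/mounts."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, type, options, dump, pass.
        if len(fields) >= 4 and fields[2] == "overlay":
            yield fields[1], parse_overlay_options(fields[3])


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as mounts:
            for mount_point, opts in overlay_mounts(mounts.read()):
                print(mount_point, len(opts["lowerdir"]), "lowerdir(s)")
    except FileNotFoundError:
        pass  # not a Linux host
```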

Slide 35

Slide 35 text

Docker’s overlay driver
• Docker’s default layer storage uses the overlay filesystem
• upperdir, lowerdir, and diff directories are in /var/lib/docker/overlay2

Slide 36

Slide 36 text


Slide 37

Slide 37 text

How can you leverage this?
• Locate files in your layers
• Examine which files and layers contribute to your disk usage
• Understand the impact of writable files in your containers and how to reduce it
    # du -h . | sort -hr
    753M .
    211M ./e33f37/diff
    211M ./e33f37
    204M ./e33f37/diff/usr
    169M ./f87973/diff
    …
    # ls ./f87973
    diff link
    # ls ./e33f37
    diff link lower work
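A per-layer summary similar to the du output can be produced in a few lines. This sketch assumes the layout shown above (one directory per layer, each with a diff subdirectory) and sums apparent file sizes, so its numbers can differ from du, which counts allocated blocks.

```python
import os


def tree_size(path):
    """Sum apparent file sizes under path without following symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.lstat(os.path.join(root, name)).st_size
    return total


def layer_sizes(storage_root):
    """Return [(bytes, layer)] for each layer's diff directory, largest first."""
    sizes = []
    for layer in os.listdir(storage_root):
        diff = os.path.join(storage_root, layer, "diff")
        if os.path.isdir(diff):
            sizes.append((tree_size(diff), layer))
    return sorted(sizes, reverse=True)
```

Pointing layer_sizes at /var/lib/docker/overlay2 (as root) would rank layers by how much data each one contributes.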

Slide 38

Slide 38 text

Further reading
• Linux: Documentation/filesystems/overlay.txt

Slide 39

Slide 39 text


Slide 40

Slide 40 text

What is a container runtime?
• A software tool that configures Linux primitives to create and run containers on a host
• Examples include:
  • Docker
  • containerd
  • CRI-O
  • rkt
  • systemd-nspawn
• Open Container Initiative (OCI) aims to standardize container runtimes, image format, and distribution
• The OCI reference implementation (runc) powers Docker, containerd, and CRI-O

Slide 41

Slide 41 text

OCI runtime spec
• Containers are “bundles”
  • Filesystem
  • JSON document
• Filesystem can be a union
• JSON document describes
  • cgroups
  • Namespaces
  • Additional mounts
  • Linux capabilities
  • Linux security modules
  • And more
• Hooks can modify the bundle
    {
      "ociVersion": "1.0.1",
      ⋮
      "root": {
        "path": "/var/lib/docker/overlay2/03004c/merged"
      },
      ⋮
      "hooks": {
        "prestart": [{"path": "/proc/9306/exe"}]
      },
      "linux": {
        "resources": {
          "cpu": {"shares": 0},
          "pids": {"limit": 0},
          ⋮
        },
        "cgroupsPath": "/docker/bd5cebc8950c",
        "namespaces": [
          {"type": "mount"},
          {"type": "network"},
          ⋮
        ],
        ⋮
      }
    }
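Because the bundle's JSON document is ordinary JSON, the primitives it configures can be listed with a short script. The SAMPLE_CONFIG below is a made-up, heavily trimmed document that uses only fields shown above; a real config.json carries many more.

```python
import json

# Hypothetical stand-in for a bundle's JSON document; values are invented.
SAMPLE_CONFIG = """
{
  "ociVersion": "1.0.1",
  "root": {"path": "rootfs"},
  "linux": {
    "resources": {"cpu": {"shares": 512}, "pids": {"limit": 100}},
    "cgroupsPath": "/docker/example",
    "namespaces": [{"type": "mount"}, {"type": "network"}]
  }
}
"""


def summarize(config):
    """Pull out the root filesystem, cgroup path, namespaces, and hook stages."""
    linux = config.get("linux", {})
    return {
        "rootfs": config["root"]["path"],
        "cgroups_path": linux.get("cgroupsPath"),
        "namespaces": [ns["type"] for ns in linux.get("namespaces", [])],
        "hooks": sorted(config.get("hooks", {})),
    }


if __name__ == "__main__":
    print(summarize(json.loads(SAMPLE_CONFIG)))
```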

Slide 42

Slide 42 text

OCI runtime hooks
• Hooks run
  • Before a container starts
  • After a container starts
  • After a container stops
• Hooks can modify the filesystem, modify the JSON file, or take other actions
• Hooks run sequentially, in an order defined in the JSON file
• Docker generates a bundle without hooks
• Docker does let you specify your own runtime
  • Your runtime could inject hooks, then execute the real runtime
  • This is how Nvidia’s container runtime works
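The "inject hooks, then execute the real runtime" idea can be sketched as a function that rewrites the bundle's JSON document. This is an illustration under assumptions, not Nvidia's implementation: the hook path in the usage note is hypothetical, the document is assumed to be the bundle's config.json, and the step that hands off to the real runtime is left out.

```python
import copy
import json
from pathlib import Path


def inject_prestart_hook(config, hook_path, args=None):
    """Return a copy of an OCI config dict with one extra prestart hook."""
    updated = copy.deepcopy(config)
    hook = {"path": hook_path}
    if args:
        hook["args"] = list(args)
    # Hooks run in listed order, so appending makes this one run last.
    updated.setdefault("hooks", {}).setdefault("prestart", []).append(hook)
    return updated


def rewrite_bundle(bundle_dir, hook_path):
    """Add a prestart hook to the config.json inside a bundle directory."""
    config_file = Path(bundle_dir) / "config.json"
    config = json.loads(config_file.read_text())
    updated = inject_prestart_hook(config, hook_path)
    config_file.write_text(json.dumps(updated, indent=2))
    return updated
```

A wrapper runtime built this way would call something like rewrite_bundle(bundle_dir, "/usr/local/bin/example-hook") and then execute the real runtime with its original arguments.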

Slide 43

Slide 43 text


Slide 44

Slide 44 text


Slide 45

Slide 45 text

Thank you!
Samuel Karp
@samuelkarp