Slide 1

Slide 1 text

Containers from Scratch
Eric Chiang, Senior Engineer, CoreOS
twitter.com/erchiang
github.com/ericchiang

Slide 5

Slide 5 text

Today’s agenda: build a container without using a container runtime (docker, rkt, etc.).

Slide 6

Slide 6 text

Step 1: The container image
What are you shipping around? TL;DR: it’s a tarball.
Images contain:
● App metadata (how to run your app)
● Filesystem (your app + an operating system?)
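As a rough illustration of "it's a tarball", the Python sketch below packs a filesystem directory and a small metadata file into one archive. The layout (`metadata.json` next to a `rootfs/` tree) and the names `build_image` and `read_metadata` are invented for this example; real image formats are more involved.

```python
import io
import json
import tarfile


def build_image(rootfs_dir, metadata, out_path):
    """Pack app metadata plus a filesystem tree into a gzipped tarball.

    The archive layout is hypothetical, chosen only for illustration.
    """
    with tarfile.open(out_path, "w:gz") as tar:
        # App metadata: how to run the app (command, environment, ...).
        blob = json.dumps(metadata, indent=2).encode("utf-8")
        info = tarfile.TarInfo(name="metadata.json")
        info.size = len(blob)
        tar.addfile(info, io.BytesIO(blob))
        # Filesystem: everything under rootfs_dir, stored under "rootfs/".
        tar.add(rootfs_dir, arcname="rootfs")
    return out_path


def read_metadata(image_path):
    """Return the metadata dict stored in an image built by build_image."""
    with tarfile.open(image_path, "r:gz") as tar:
        return json.load(tar.extractfile("metadata.json"))
```

For example, `build_image("rootfs", {"cmd": ["/bin/bash"]}, "image.tar.gz")` would bundle the `rootfs` directory used in the rest of the talk.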

Slide 7

Slide 7 text

Step 1: The container image
Container filesystem: something that looks like an OS, but with no kernel and no init system.

Slide 8

Slide 8 text

$ mkdir rootfs
$ sudo dnf -y \
    --installroot=$PWD/rootfs \
    --releasever=24 install \
    @development-tools \
    procps-ng \
    python3 \
    which \
    iproute \
    net-tools
$ ls rootfs

Slide 9

Slide 9 text

Step 2: chroot
The next step is to execute a process inside our filesystem, using chroot(2).
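A minimal Python sketch of what the `chroot` command does under the hood: fork, call chroot(2) in the child, then exec a program from the new root. The helper name `run_in_chroot` is made up here, and the call only succeeds when run as root against a populated rootfs.

```python
import os


def run_in_chroot(rootfs, argv):
    """Fork, chroot the child into rootfs, and exec argv there.

    Needs root (CAP_SYS_CHROOT). Returns the child's exit status,
    or 127 if the chroot or exec step failed.
    """
    pid = os.fork()
    if pid == 0:
        try:
            os.chroot(rootfs)        # change this process's idea of "/"
            os.chdir("/")            # do not keep a cwd outside the new root
            os.execv(argv[0], argv)  # argv[0] is resolved inside the new root
        except OSError as err:
            os.write(2, f"chroot/exec failed: {err}\n".encode())
        os._exit(127)
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

Run as root, `run_in_chroot("rootfs", ["/bin/bash"])` would behave roughly like `sudo chroot rootfs /bin/bash`.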

Slide 10

Slide 10 text

$ sudo chroot rootfs

Slide 11

Slide 11 text

Step 3: namespaces
The chroots of other systems: like chroot, but for other parts of the system.
● Process trees.
● Network interfaces.
● Mounted volumes.
See clone(2) and unshare(2).

Slide 12

Slide 12 text

$ sudo unshare -p -f \
    --mount-proc=$PWD/rootfs/proc \
    chroot rootfs /bin/bash

Slide 13

Slide 13 text

Step 4: entering namespaces
Namespaces are composable.
Kubernetes pod:
● Multiple processes with different chroots.
● Same network and mount namespace.
See setns(2).

Slide 14

Slide 14 text

# PID=1234
# ls /proc/$PID/ns
cgroup  ipc  mnt  net  pid  user  uts
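Each entry under `/proc/$PID/ns` is a symlink whose target identifies the namespace, so two processes share a namespace exactly when the targets match. The Python sketch below reads those links; the function names and the `proc_root` parameter are assumptions added for illustration (the parameter exists only so the code can be pointed at a test directory). Reading another user's process may require root.

```python
import os


def namespace_ids(pid="self", proc_root="/proc"):
    """Map namespace type -> identifier, e.g. {"pid": "pid:[4026531836]"}."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}


def shared_namespaces(pid_a, pid_b, proc_root="/proc"):
    """Return the namespace types that two processes have in common."""
    a = namespace_ids(pid_a, proc_root)
    b = namespace_ids(pid_b, proc_root)
    return sorted(name for name in a if a[name] == b.get(name))
```

Comparing a shell started with `unshare -p` against the host shell this way should show every type except `pid` as shared.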

Slide 15

Slide 15 text

# PID=1234
# ls /proc/$PID/ns
cgroup  ipc  mnt  net  pid  user  uts
# nsenter \
    --pid=/proc/$PID/ns/pid \
    --mount=/proc/$PID/ns/mnt \
    chroot $PWD/rootfs /bin/bash

Slide 16

Slide 16 text

Step 5: volume mounts
Let’s inject files into our chroot.
How do docker’s -v flag and Kubernetes host mounts work?

Slide 17

Slide 17 text

# nsenter --mount=/proc/$PID/ns/mnt \
    mount --bind -o ro \
    $PWD/readonlyfiles \
    $PWD/rootfs/var/readonlyfiles
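One way to confirm that a bind mount landed where expected is to read `/proc/<pid>/mountinfo` for the process in that mount namespace. The Python sketch below parses that file's documented field layout; the function names are illustrative, not part of any tool used in the talk.

```python
def parse_mountinfo(text):
    """Parse /proc/<pid>/mountinfo text into a list of dicts."""
    mounts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Fields before " - " describe the mount; fields after it
        # describe the underlying filesystem.
        left, _, right = line.partition(" - ")
        fields = left.split()
        fstype, source = right.split()[:2]
        mounts.append({
            "root": fields[3],          # directory inside the source filesystem
            "mount_point": fields[4],   # where it appears in this namespace
            "options": fields[5].split(","),
            "fstype": fstype,
            "source": source,
        })
    return mounts


def read_only_mounts(text, prefix):
    """Return mount points under prefix that are mounted read-only."""
    return [m["mount_point"] for m in parse_mountinfo(text)
            if m["mount_point"].startswith(prefix) and "ro" in m["options"]]
```

For a bind mount, the `root` field is the source directory (here it would end in `/readonlyfiles`) rather than `/`.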

Slide 18

Slide 18 text

Step 6: cgroups
cgroups: resource restrictions for processes.

Slide 19

Slide 19 text

# ls /sys/fs/cgroup/
# mkdir /sys/fs/cgroup/memory/demo
# echo $$ > /sys/fs/cgroup/memory/demo/tasks
# cat /proc/self/cgroup
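The output of `cat /proc/self/cgroup` is one `hierarchy-ID:controller-list:path` line per hierarchy. A small Python sketch for turning it into a lookup table follows; the function name and the `"unified"` key for cgroup v2 entries are labels chosen here, not kernel terminology.

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup text into {controller: cgroup path}."""
    membership = {}
    for line in text.splitlines():
        if not line:
            continue
        # maxsplit=2 because the path itself may contain ":".
        _hierarchy_id, controllers, path = line.split(":", 2)
        # A cgroup v2 entry has an empty controller list.
        names = controllers.split(",") if controllers else ["unified"]
        for name in names:
            membership[name] = path
    return membership
```

After moving the shell into the `demo` cgroup, the `memory` entry would be expected to read `/demo`.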

Slide 20

Slide 20 text

# CGROUP=/sys/fs/cgroup/memory/demo
# echo "100000000" > $CGROUP/memory.limit_in_bytes
# echo "0" > $CGROUP/memory.swappiness
# python3 hungry.py
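The slides do not show `hungry.py`. A plausible stand-in, written here as an assumption rather than the speaker's actual script, is a loop that keeps allocating memory and reports how much it holds. The `limit_mb` parameter is added only so the function can be tried safely.

```python
import os


def eat_memory(limit_mb=None, chunk_mb=10):
    """Allocate chunk_mb at a time and hold on to it; return the MB allocated.

    With limit_mb=None this loops until something (such as the kernel's
    OOM killer acting on the cgroup limit) stops the process.
    """
    hoard = []
    eaten = 0
    while limit_mb is None or eaten + chunk_mb <= limit_mb:
        # Random bytes force the pages to be written, so they count as used.
        hoard.append(os.urandom(chunk_mb * 1024 * 1024))
        eaten += chunk_mb
        print(f"{eaten}mb", flush=True)
    return eaten
```

Calling `eat_memory()` with no limit inside the `demo` cgroup should end with the process being killed once it approaches the configured `memory.limit_in_bytes`.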

Slide 21

Slide 21 text

Step 7: cgroup namespaces
Q: How do you restrict a process from re-assigning its own cgroup?
A: More namespaces!

Slide 22

Slide 22 text

$ sudo unshare -C -p -f \
    --mount-proc=rootfs/proc \
    chroot rootfs /bin/bash
# cat /proc/self/cgroup
# mkdir -p /sys/fs/cgroup
# mount -t tmpfs cgroup_root /sys/fs/cgroup
# mkdir -p /sys/fs/cgroup/memory
# mount -t cgroup memory -omemory \
    /sys/fs/cgroup/memory

Slide 23

Slide 23 text

# echo "How to remove a cgroup"
# echo "Reassign each task, remove the dir"
# echo $$ > /sys/fs/cgroup/memory/tasks
# rmdir /sys/fs/cgroup/memory/demo

Slide 24

Slide 24 text

Step 8: capabilities
“I have a co-worker who said: ‘Docker is about running random code downloaded from the Internet and running it as root.’”
- Dan Walsh (Red Hat)

Slide 25

Slide 25 text

Step 8: capabilities
This section probably should have covered:
- SELinux
- seccomp
- AppArmor
Those are hard to demo, so we’ll be covering capabilities.

Slide 26

Slide 26 text

$ go build -o /tmp/listen listen.go
$ sudo setcap cap_net_bind_service=+ep \
    /tmp/listen
$ getcap /tmp/listen
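`listen.go` itself is not shown in the slides. The Python sketch below only illustrates the check that `cap_net_bind_service` affects: whether a process may bind a low-numbered port. The helper name `can_bind` is hypothetical. File capabilities set with `setcap` apply to executables, so this is not a drop-in replacement for the Go binary, and the exact port threshold depends on kernel settings.

```python
import socket


def can_bind(port, host="127.0.0.1"):
    """Return True if this process may bind a TCP socket to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except PermissionError:
            # Typically a privileged port without CAP_NET_BIND_SERVICE.
            return False
        return True
```

As an unprivileged user, `can_bind(80)` would usually be expected to return False, while the same call as root returns True.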

Slide 27

Slide 27 text

$ sudo capsh --print
$ sudo capsh --drop=cap_chown --
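The kernel exposes the same information that `capsh --print` summarizes as hexadecimal bitmasks in `/proc/<pid>/status` (the `CapEff:` line holds the effective set). A short Python sketch for decoding such a mask follows; the table lists only a subset of capabilities, and the function names are illustrative.

```python
# Bit positions as defined in <linux/capability.h>; only a subset is listed.
CAP_NAMES = {
    0: "cap_chown",
    1: "cap_dac_override",
    5: "cap_kill",
    6: "cap_setgid",
    7: "cap_setuid",
    10: "cap_net_bind_service",
    12: "cap_net_admin",
    13: "cap_net_raw",
    18: "cap_sys_chroot",
    21: "cap_sys_admin",
}


def decode_caps(mask_hex):
    """Turn a hex capability mask into a list of capability names."""
    mask = int(mask_hex, 16)
    return [CAP_NAMES.get(bit, f"cap_{bit}")
            for bit in range(mask.bit_length()) if mask & (1 << bit)]


def effective_caps(status_text):
    """Decode the CapEff line of /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split()[1])
    return []
```

Inside a shell started with `capsh --drop=cap_chown --`, the bounding set (`CapBnd:`) should no longer have bit 0 set.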

Slide 28

Slide 28 text

Step 9: network namespaces

Slide 29

Slide 29 text

$ sudo unshare -n chroot rootfs
# ip addr
# ip link set dev lo up

Slide 30

Slide 30 text

ip link add veth0 type veth peer name veth1
ip link set veth1 netns $PID
ifconfig veth0 10.1.1.2/24 up
# (inside namespace)
ifconfig veth1 10.1.1.1/24 up
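The two ends of the veth pair can reach each other directly because both addresses sit in the same /24 network. A quick way to sanity-check an address plan like this, sketched in Python with the standard `ipaddress` module (the helper name `same_subnet` is made up for this example):

```python
import ipaddress


def same_subnet(a, b):
    """True if two interface addresses are distinct hosts on one network."""
    ia = ipaddress.ip_interface(a)
    ib = ipaddress.ip_interface(b)
    return ia.network == ib.network and ia.ip != ib.ip


# The addresses assigned to veth0 (host) and veth1 (inside the namespace).
print(same_subnet("10.1.1.2/24", "10.1.1.1/24"))
```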

Slide 31

Slide 31 text

Step 10: user namespaces
Mapping of UIDs/GIDs from the host to the container.
The container thinks it’s root when it’s not.

Slide 32

Slide 32 text

$ unshare --map-root-user
# cat /proc/self/uid_map
# capsh --print
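Each line of `/proc/self/uid_map` holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The Python sketch below applies such a map; the function names are illustrative.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, count)."""
    return [tuple(int(field) for field in line.split())
            for line in text.splitlines() if line.strip()]


def to_host_id(inside_id, id_map):
    """Translate an ID seen inside the namespace to the host ID, or None."""
    for inside_start, outside_start, count in id_map:
        if inside_start <= inside_id < inside_start + count:
            return outside_start + (inside_id - inside_start)
    return None  # this ID is not mapped in the namespace
```

With `--map-root-user`, the map is expected to contain a single entry that sends UID 0 inside the namespace to the invoking user's UID on the host, which is why the process "thinks it's root".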

Slide 33

Slide 33 text

Step 10: user namespaces
Still need a lot of permissions on the host.
● Unpacking images (device files).
● Dealing with cgroups.

Slide 34

Slide 34 text

Conclusion

Slide 35

Slide 35 text

Conclusion
“Containers” are a bunch of technologies provided by the Linux kernel.
Container runtimes are opinionated wrappers around these technologies.

Slide 36

Slide 36 text

Links
● Namespaces in operation, Michael Kerrisk: https://lwn.net/Articles/531114
● Building minimal containers, Brian Redbeard: https://github.com/brianredbeard/minimal_containers
● cgroups V1, Paul Menage: https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
● Getting Towards Real Sandbox Containers, Jessie Frazelle: https://blog.jessfraz.com/post/getting-towards-real-sandbox-containers/
(Also lots of Linux man pages)

Slide 37

Slide 37 text

QUESTIONS?
Thanks!
eric.chiang@coreos.com
twitter.com/erchiang
github.com/ericchiang
We’re hiring for my team! coreos.com/careers
LONGER CHAT? Let’s talk! More events: coreos.com/community