Slide 1

Slide 1 text

Cgroups, Namespaces, and Containers. What and How much is too much to contain?
Piyush Verma
Oogway Consulting

Slide 2

Slide 2 text

Virtual Machine

Slide 3

Slide 3 text

VM - What is it? - What is it not?

Slide 4

Slide 4 text

Containers

Slide 5

Slide 5 text

Container - What is it? - What is it not?

Slide 6

Slide 6 text

Container vs. VM

Slide 7

Slide 7 text

Time to Build

Slide 8

Slide 8 text

Time to Quarantine?

Slide 9

Slide 9 text

Container. How much to pack?

Slide 10

Slide 10 text

What is a Cgroup?
- A mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour.
- First-class citizens.
- A process-like hierarchical model, but:
  - Multiple parallel hierarchies coexist.
  - Each hierarchy is attached to one or more subsystems.

Slide 11

Slide 11 text

What is a Hierarchy?

Slide 12

Slide 12 text

What is a subsystem?
- Represents a single resource:
  - cpu
  - memory
  - blkio
  - cpuacct
  - cpuset
  - devices
  - freezer
  - ns
  - etc.
- Something that does something to a group of tasks :-/
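On a cgroup v1 system the kernel lists its subsystems in /proc/cgroups, together with the ID of the hierarchy each one is attached to. A minimal sketch for reading that table follows; the function name list_subsystems is hypothetical, and the optional file argument is an assumption added so the function can be tried on a saved copy.

```shell
# Hypothetical helper: print "name hierarchy-id enabled" for every subsystem.
# Input format (as in /proc/cgroups):
#   subsys_name  hierarchy  num_cgroups  enabled
list_subsystems() {
  # Skip the '#' header line and any malformed rows.
  awk '!/^#/ && NF >= 4 { print $1, $2, $4 }' "${1:-/proc/cgroups}"
}
```

Subsystems that print the same hierarchy ID are attached to the same hierarchy.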

Slide 13

Slide 13 text

Hierarchy Rules

Slide 14

Slide 14 text

Nomenclature
- Processes are called tasks.
- Subsystems are also called:
  - Controllers
  - Resource controllers

Slide 15

Slide 15 text

A hierarchy can be attached to multiple subsystems, as long as the next rule is not violated.

Slide 16

Slide 16 text

A subsystem can be attached to more than one hierarchy, but only as long as the next rule is not violated.

Slide 17

Slide 17 text

A subsystem cannot be attached to a second hierarchy that already has a different subsystem attached.

Slide 18

Slide 18 text

A task belongs to exactly one cgroup in each hierarchy.
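This rule is visible in /proc/<pid>/cgroup, which has one line per hierarchy in the form hierarchy-ID:controller-list:cgroup-path. A small sketch that prints it in a readable form; the name task_cgroups and the optional file argument are assumptions of this example.

```shell
# Hypothetical helper: show the single cgroup a task occupies in each hierarchy.
# Reads /proc/self/cgroup by default; pass another file (for example
# /proc/942/cgroup) to inspect a different task.
task_cgroups() (
  while IFS=: read -r id controllers path; do
    # An empty controller list is printed as "none".
    printf 'hierarchy=%s controllers=%s cgroup=%s\n' \
      "$id" "${controllers:-none}" "$path"
  done < "${1:-/proc/self/cgroup}"
)
```

Each hierarchy ID appears once, so a task can never show two paths for the same hierarchy.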

Slide 19

Slide 19 text

A child task inherits its parent’s cgroup memberships at fork, but is independent thereafter: it can be moved to other cgroups without affecting the parent.

Slide 20

Slide 20 text

Tasks within a group hold no relationship to one another.

Slide 21

Slide 21 text

The kernel maintains hierarchical constraints on limits: if the devices cgroup /child1 cannot access a disk drive, then /child1/child2 cannot give itself those rights.

Slide 22

Slide 22 text

Implications combined

Slide 23

Slide 23 text

Implications
- There is only one way that a task can be limited or affected by any single subsystem.
- You can group several subsystems together so that they affect all tasks in a single hierarchy.
- If cgroups in that hierarchy have different parameters set, their tasks are affected differently.
- Constant refactoring is required to get the best knapsack (packing).

Slide 24

Slide 24 text

How to set this up?

Slide 25

Slide 25 text

Manual
meson10@xps:~/workspace$ cgcreate -h
Usage: cgcreate [-h] [-f mode] [-d mode] [-s mode] [-t :] [-a :] -g : [-g ...]
Create control group(s)
  -a :                Owner of the group and all its files
  -d, --dperm=mode    Group directory permissions
  -f, --fperm=mode    Group file permissions
  -g :                Control group which should be added
  -h, --help          Display this help
  -s, --tperm=mode    Tasks file permissions
  -t :                Owner of the tasks file

Slide 26

Slide 26 text

Automatic
/etc/cgconfig.conf

mount {
    subsystem = /mount/point
    …
}

group {
    [] {
        = ;
        …
    }
    …
}

Slide 27

Slide 27 text

Automatic: Example conf

mount {
    cpuset  = /cgroup/cpuset;
    cpu     = /cgroup/cpu;
    cpuacct = /cgroup/cpuacct;
    memory  = /cgroup/memory;
    devices = /cgroup/devices;
    freezer = /cgroup/freezer;
    net_cls = /cgroup/net_cls;
    blkio   = /cgroup/blkio;
}

group daemons/sql {
    cpuset {
        cpuset.mems = 0;
        cpuset.cpus = 0;
    }
}

Slide 28

Slide 28 text

Automatic: Translates to
# mkdir /cgroup/cpuset
# mount -t cgroup -o cpuset cpuset /cgroup/cpuset
# mkdir -p /cgroup/cpuset/daemons/sql
# echo $(cgget -n -v -r cpuset.mems /) > /cgroup/cpuset/daemons/cpuset.mems
# echo $(cgget -n -v -r cpuset.cpus /) > /cgroup/cpuset/daemons/cpuset.cpus
# echo 0 > /cgroup/cpuset/daemons/sql/cpuset.mems
# echo 0 > /cgroup/cpuset/daemons/sql/cpuset.cpus

Slide 29

Slide 29 text

Wait, Why do I need this?

Slide 30

Slide 30 text

blkio
- Controls and monitors access to block devices.
- Policies:
  - Weight division
  - Upper-bound throttling
- Applied to all devices or per device.

Slide 31

Slide 31 text

blkio: Example
$: echo 500 > blkio.weight
$: echo 8:0 500 > blkio.weight_device
$: echo "8:0 10485760" > /cgroup/blkio/test/blkio.throttle.read_bps_device
$: echo "8:0 10" > /cgroup/blkio/test/blkio.throttle.read_iops_device
$: echo "8:0 10485760" > /cgroup/blkio/test/blkio.throttle.write_bps_device
$: echo "8:0 10" > /cgroup/blkio/test/blkio.throttle.write_iops_device
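The bps throttle values in the example above are plain bytes per second, so 10485760 is 10 MiB/s. A small sketch that builds the line from a MiB/s figure; the helper name throttle_bps_line is hypothetical.

```shell
# Hypothetical helper: build a "major:minor bytes_per_second" line suitable
# for blkio.throttle.read_bps_device or blkio.throttle.write_bps_device.
# $1 = device as major:minor, $2 = limit in MiB per second.
throttle_bps_line() {
  printf '%s %s\n' "$1" "$(( $2 * 1024 * 1024 ))"
}

throttle_bps_line 8:0 10   # prints: 8:0 10485760
```

The output can be redirected into the throttle file of the target cgroup, as root.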

Slide 32

Slide 32 text

cpu
- Controls and monitors CPU access.
- Policies:
  - Ceiling enforcement: cpu.cfs_quota_us & cpu.cfs_period_us
  - Relative shares: cpu.shares
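The two policies are easier to compare with numbers. The sketch below turns a quota/period pair into a ceiling (percent of one CPU) and a set of cpu.shares values into the fraction a group gets when every group is busy. Both function names are hypothetical; a negative quota is treated as "no ceiling", which is how cpu.cfs_quota_us reports an unlimited group.

```shell
# Hypothetical helper: ceiling as a percentage of one CPU.
# $1 = cpu.cfs_quota_us, $2 = cpu.cfs_period_us
cfs_ceiling_percent() (
  quota_us=$1 period_us=$2
  if [ "$quota_us" -lt 0 ]; then
    echo unlimited
  else
    echo $(( $quota_us * 100 / $period_us ))
  fi
)

# Hypothetical helper: percentage of CPU time under full contention.
# $1 = this group's cpu.shares, remaining args = shares of sibling groups.
shares_percent() (
  own=$1 total=0
  for s in "$@"; do          # "$@" includes the group's own shares
    total=$(( $total + $s ))
  done
  echo $(( $own * 100 / $total ))
)
```

For example, a quota of 50000 with a period of 100000 caps a group at half a CPU, while shares of 1024 against two siblings at 512 each give it half the CPU only when all three are competing.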

Slide 33

Slide 33 text

cpuacct
- Generates reports on CPU used.
- To reset:
  $: echo 0 > /cgroup/cpuacct/cpuacct.usage

Slide 34

Slide 34 text

cpuset
- Assigns individual CPUs and memory nodes to groups.
- cpuset.cpus (0-2,16) -> CPUs 0, 1, 2, and 16
- cpuset.mems (0-2,16) -> memory nodes 0, 1, 2, and 16
- cpuset.memory_migrate
- cpuset.cpu_exclusive
- cpuset.mem_exclusive
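The list syntax used by cpuset.cpus and cpuset.mems (comma-separated values and ranges) can be expanded mechanically. A minimal sketch, assuming well-formed input; the name expand_cpulist is hypothetical.

```shell
# Hypothetical helper: expand a cpuset list such as "0-2,16" into "0 1 2 16".
expand_cpulist() (
  list=$(printf '%s' "$1" | tr -d ' ')   # tolerate "0-2, 16"
  out=""
  IFS=,                                   # split the list on commas
  for part in $list; do
    case "$part" in
      *-*) start=${part%-*}; end=${part#*-} ;;   # a range "a-b"
      *)   start=$part; end=$part ;;             # a single value
    esac
    i=$start
    while [ "$i" -le "$end" ]; do
      out="${out:+$out }$i"
      i=$(( $i + 1 ))
    done
  done
  printf '%s\n' "$out"
)

expand_cpulist "0-2,16"   # prints: 0 1 2 16
```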

Slide 35

Slide 35 text

devices
- Allows or denies access to devices by groups.
- devices.allow: type major:minor access
  - type: a, b, c -> all, block, character
  - major:minor: 8:1 -> /dev/sda1
  - access: r, w, m -> read, write, mknod
- devices.deny
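An entry written to devices.allow or devices.deny is a single line of the form "type major:minor access". The sketch below assembles and sanity-checks such a line before it is written; the helper name device_rule and its argument order are assumptions of this example.

```shell
# Hypothetical helper: build a devices.allow / devices.deny entry.
# $1 = type (a, b or c), $2 = major, $3 = minor, $4 = access (subset of rwm)
device_rule() (
  dev_type=$1 major=$2 minor=$3 access=$4
  case "$dev_type" in
    a|b|c) ;;
    *) echo "bad type: $dev_type" >&2; exit 1 ;;
  esac
  case "$access" in
    ""|*[!rwm]*) echo "bad access: $access" >&2; exit 1 ;;
  esac
  printf '%s %s:%s %s\n' "$dev_type" "$major" "$minor" "$access"
)

device_rule b 8 1 rw   # prints: b 8:1 rw
```

The printed line for /dev/sda1 can then be redirected into devices.allow of the target cgroup, as root.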

Slide 36

Slide 36 text

freezer
- Suspends or resumes tasks in a cgroup.
- freezer.state:
  - FROZEN (r,w)
  - FREEZING (r)
  - THAWED (r,w)
- Helps in:
  - Process migration
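Because FREEZING is a transient, read-only state, a script that freezes a group usually writes FROZEN and then polls until the state settles. A rough sketch under that assumption; the function name set_freezer_state and the 10-second timeout are hypothetical choices.

```shell
# Hypothetical helper: request a freezer state and wait until it is reached.
# $1 = cgroup directory (for example /cgroup/freezer/test)
# $2 = FROZEN or THAWED
set_freezer_state() (
  cgroup_dir=$1 wanted=$2
  printf '%s\n' "$wanted" > "$cgroup_dir/freezer.state" || exit 1
  tries=0
  while [ "$tries" -lt 10 ]; do
    state=$(cat "$cgroup_dir/freezer.state")
    [ "$state" = "$wanted" ] && exit 0   # FREEZING has settled into FROZEN
    tries=$(( $tries + 1 ))
    sleep 1
  done
  echo "timed out, state is still $state" >&2
  exit 1
)
```

Usage would look like set_freezer_state /cgroup/freezer/test FROZEN before a migration, and the same call with THAWED afterwards.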

Slide 37

Slide 37 text

memory
- Soft limits
- Hard limits
- Charges

$: mount -t cgroup -o memory memory /cgroup/memory
$: mkdir /cgroup/memory/blue
$: cd /cgroup/memory/blue
$: echo 104857600 > memory.limit_in_bytes
$: echo $$ > tasks
$: ~/.heavy_memory.sh
Killed
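To see how close a group is to its hard limit before tasks start getting killed, the usage and limit files can be compared. A minimal sketch, assuming the cgroup v1 files memory.usage_in_bytes and memory.limit_in_bytes; the helper name memory_usage_percent is hypothetical.

```shell
# Hypothetical helper: print how much of the hard limit a cgroup is using,
# as a whole-number percentage.
# $1 = cgroup directory (for example /cgroup/memory/blue)
memory_usage_percent() (
  cgroup_dir=$1
  limit=$(cat "$cgroup_dir/memory.limit_in_bytes") || exit 1
  usage=$(cat "$cgroup_dir/memory.usage_in_bytes") || exit 1
  echo $(( $usage * 100 / $limit ))
)
```

With the 104857600-byte (100 MiB) limit from the example, a group using 50 MiB would print 50.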

Slide 38

Slide 38 text

ns
- Groups processes into namespaces.
- Processes in a namespace can see each other,
- but not processes in other namespaces.
- Often confused with containers.

Slide 39

Slide 39 text

Example Scenarios

Slide 40

Slide 40 text

Namespaces

Slide 41

Slide 41 text

Namespaces
- Kernel feature.
- The system starts with one namespace of each kind.
- 7 kinds of namespaces: cgroup, pid, user, uts, ipc, mnt, net.
- Each process is associated with one namespace of each kind.

Slide 42

Slide 42 text

System start: namespace init
Each of the namespaces is enabled at system start and assigned an inode.

meson10@xps:~$ lsns
        NS TYPE   NPROCS PID USER    COMMAND
4026531835 cgroup     49 921 meson10 /lib/systemd/systemd --user
4026531836 pid        49 921 meson10 /lib/systemd/systemd --user
4026531837 user       49 921 meson10 /lib/systemd/systemd --user
4026531838 uts        49 921 meson10 /lib/systemd/systemd --user
4026531839 ipc        49 921 meson10 /lib/systemd/systemd --user
4026531840 mnt        49 921 meson10 /lib/systemd/systemd --user
4026532009 net        49 921 meson10 /lib/systemd/systemd --user

Slide 43

Slide 43 text

Identifying Namespace
Entries (inodes) are added to /proc/<pid>/ns, one for each namespace.

meson10@xps:~$ ls -al /proc/942/ns
total 0
lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 net -> net:[4026532009]
lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 pid -> pid:[4026531836]
lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 pid_for_children -> pid:[4026531836]
lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 user -> user:[4026531837]
lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 uts -> uts:[4026531838]
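Two processes share a namespace exactly when these symlinks point at the same inode, so comparing the link targets is enough. A small sketch of that check; ns_id, same_namespace, and the PROC_ROOT override are hypothetical names introduced for this example (PROC_ROOT defaults to the real /proc).

```shell
# Hypothetical helper: print the namespace identifier, e.g. net:[4026532009].
# $1 = PID, $2 = namespace type (cgroup, ipc, mnt, net, pid, user, uts)
ns_id() {
  readlink "${PROC_ROOT:-/proc}/$1/ns/$2"
}

# Hypothetical helper: succeed if two PIDs share a namespace of the given type.
# $1 = first PID, $2 = second PID, $3 = namespace type
same_namespace() (
  a=$(ns_id "$1" "$3") || exit 2   # exit 2: could not read the link
  b=$(ns_id "$2" "$3") || exit 2
  [ "$a" = "$b" ]
)
```

Reading another user's /proc/<pid>/ns entries generally needs root, so a call such as same_namespace 942 25762 net may have to be run with sudo.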

Slide 44

Slide 44 text

Namespace utils: Unshare & nsenter

Slide 45

Slide 45 text

unshare
meson10@xps:~$ unshare --help
Options:
 -m, --mount[=]            unshare mounts namespace
 -u, --uts[=]              unshare UTS namespace (hostname etc)
 -i, --ipc[=]              unshare System V IPC namespace
 -n, --net[=]              unshare network namespace
 -p, --pid[=]              unshare pid namespace
 -U, --user[=]             unshare user namespace
 -C, --cgroup[=]           unshare cgroup namespace
 -f, --fork                fork before launching
     --mount-proc[=]       mount proc filesystem first (implies --mount)
 -r, --map-root-user       map current user to root (implies --user)
     --propagation slave|shared|private|unchanged
                           modify mount propagation in mount namespace
 -s, --setgroups allow|deny
                           control the setgroups syscall in user namespaces

Slide 46

Slide 46 text

nsenter
meson10@xps:~$ nsenter --help
Options:
 -a, --all                 enter all namespaces
 -t, --target              target process to get namespaces from
 -m, --mount[=]            enter mount namespace
 -u, --uts[=]              enter UTS namespace (hostname etc)
 -i, --ipc[=]              enter System V IPC namespace
 -n, --net[=]              enter network namespace
 -p, --pid[=]              enter pid namespace
 -C, --cgroup[=]           enter cgroup namespace
 -U, --user[=]             enter user namespace
 -S, --setuid              set uid in entered namespace
 -G, --setgid              set gid in entered namespace
     --preserve-credentials do not touch uids or gids
 -r, --root[=]             set the root directory
 -w, --wd[=]               set the working directory
 -F, --no-fork             do not fork before exec'ing

Slide 47

Slide 47 text

uts namespace
meson10@xps:~$ hostname
xps.piyushverma.net
meson10@xps:~$ sudo unshare -u /bin/bash
root@xps:~# hostname hello
root@xps:~# hostname
hello
root@xps:~# exit
meson10@xps:~$ hostname
xps.piyushverma.net

Slide 48

Slide 48 text

ipc namespace
meson10@xps:~$ ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0xef4c7712 131074     meson10    644        0            0

meson10@xps:~$ sudo unshare -i /bin/bash
root@xps:~# ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

Slide 49

Slide 49 text

net namespace
meson10@xps:~/workspace/meson10/linuxlab$ ip link
1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: wlp58s0: mtu 1500 qdisc mq state UP mode DORMANT group default qlen 1000
    link/ether 18:5e:0f:ee:d9:32 brd ff:ff:ff:ff:ff:ff
9: bridge0: mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff

Slide 50

Slide 50 text

net namespace
root@xps:~/workspace/meson10/linuxlab# ip link
1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Slide 51

Slide 51 text

net namespace
meson10@xps:~$ sudo unshare -n /bin/bash
[sudo] password for meson10:
root@xps:~/workspace/meson10/linuxlab# ip addr
1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Slide 52

Slide 52 text

net namespace
meson10@xps:~$ sudo unshare -n /bin/bash
[sudo] password for meson10:
root@xps:~/workspace/meson10/linuxlab# ip addr
1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@xps:~/workspace/meson10/linuxlab# ping localhost
connect: Network is unreachable
root@xps:~/workspace/meson10/linuxlab# ip link set dev lo up && ping localhost
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.033 ms

Slide 53

Slide 53 text

net namespace implications
- A network device can be attached to only one namespace at a time.
- If eth0 is attached to the root namespace, a newly created namespace has no internet access.

Slide 54

Slide 54 text

net namespace solution
$: sudo ip link add host type veth peer name guest
$: sudo ip link set guest netns
$: sudo ip addr add 192.168.0.2/24 dev host
$: sudo ip link set host up
$ns: ip addr add 192.168.0.1/24 dev guest
$ns: ip link set guest up

$: brctl addbr bridge0
$: ip addr add 192.168.1.2/24 dev bridge0
$: ip link set dev bridge0 up
$: brctl addif bridge0 host
$: ip link set host up
$ns: ip addr add 192.168.1.1/24 dev guest
$ns: ip link set guest up
$ns: ip route add default via 192.168.1.2

Slide 55

Slide 55 text

pid namespace
$: sudo unshare -p /bin/bash
- The child process enters a new PID namespace.
  - It gets PID 1.
- A forked process gets a PID in the namespace and a global PID.
- Signals:
  - PID 1 only receives signals it has explicitly registered handlers for.
  - Ctrl-C doesn’t work in Docker.
- When a child dies, its children (the grandchildren) are reparented to PID 1.
- If PID 1 dies:
  - Children get SIGKILL recursively.
  - The namespace is deleted.

Slide 56

Slide 56 text

bash: fork: Cannot allocate memory
- unshare execs bash.
- bash forks a subprocess.
- The first subprocess becomes PID 1.
- The subprocess exits.
- The kernel calls disable_pid_allocation,
  - which clears the PIDNS_HASH_ADDING flag.
- The next fork asks for a new PID through the alloc_pid function,
  - which fails with -ENOMEM.
- Use --fork and bash becomes PID 1.

Slide 57

Slide 57 text

mnt namespace
- Isolated list of mount points.
- unshare copies the parent’s mount points.
- Mounts may conditionally propagate.
  - Private by default.
- If the unshared namespace’s user != the parent namespace’s user, it is less privileged.
- For a less privileged namespace, shared mounts become slaves.
- Mount flags cannot be altered across less privileged mounts.

Slide 58

Slide 58 text

bind
- Shared
- Private
- Slave
- Unbindable
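The propagation type of every mount is recorded in /proc/<pid>/mountinfo, in the optional fields between the mount options and the "-" separator: shared:N marks a shared mount, master:N a slave, unbindable an unbindable one, and no optional field at all means private. A minimal sketch that prints them; the function name mount_propagation and the optional file argument are assumptions of this example.

```shell
# Hypothetical helper: print "mount-point propagation" for each mount.
# Reads /proc/self/mountinfo by default; pass a file to inspect a saved copy.
mount_propagation() {
  awk '{
    prop = ""
    # Optional fields start at field 7 and end at the "-" separator.
    for (i = 7; i <= NF && $i != "-"; i++)
      prop = (prop == "" ? $i : prop "," $i)
    print $5, (prop == "" ? "private" : prop)
  }' "${1:-/proc/self/mountinfo}"
}
```

Running it inside and outside an unshared mount namespace is a quick way to see which mounts turned from shared into slaves.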

Slide 59

Slide 59 text

Shared bind

Slide 60

Slide 60 text

Slave bind

Slide 61

Slide 61 text

Private bind

Slide 62

Slide 62 text

rbind

Slide 63

Slide 63 text

persistent namespaces
$: touch netns
$: sudo unshare --net=netns -f /bin/bash
ns$: hostname hello
ns$: Ctrl-D; exit
$: sudo nsenter --net=netns
ns$: hostname
hello

meson10@xps:~$ ls -al /proc/942/ns/net
lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 /proc/942/ns/net -> net:[4026532009]
meson10@xps:~$ sudo ls -al /proc/25762/ns/net
sudo: unable to resolve host hello: No such file or directory
lrwxrwxrwx 1 root root 0 Dec 6 15:10 /proc/25762/ns/net -> net:[4026532391]

Slide 64

Slide 64 text

Thank you!
Piyush Verma
@meson10
Oogway Consulting
http://oogway.in