Cgroups and Namespaces in Linux

Piyush Verma
December 06, 2017

Building blocks of Linux containers.

Transcript

  1. 1.

    Cgroups, Namespaces, and Containers: What, and how much, is too much to contain?

    Piyush Verma, Oogway Consulting
  2. 10.

    What is a Cgroup?

    - A mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour.
    - First-class citizens.
    - A process-like hierarchical model, but:
      - Multiple parallel hierarchies coexist.
      - Each hierarchy is attached to one or more subsystems.
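One way to see the parallel hierarchies from a task's point of view is /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:path. The sketch below parses that format; the SAMPLE text is made up for illustration (its paths reuse names from later slides) and is not output captured from the talk.

```python
from typing import List, NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: List[str]
    path: str


def parse_proc_cgroup(text: str) -> List[CgroupEntry]:
    """Parse the contents of /proc/<pid>/cgroup into one entry per hierarchy."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two colons only: the path itself may contain ':'.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = [c for c in controllers.split(",") if c]
        entries.append(CgroupEntry(int(hierarchy_id), names, path))
    return entries


# Hypothetical sample: three hierarchies, one of them with two controllers.
SAMPLE = """\
4:cpu,cpuacct:/daemons/sql
3:memory:/blue
2:devices:/child1/child2
"""

if __name__ == "__main__":
    for entry in parse_proc_cgroup(SAMPLE):
        print(entry.hierarchy_id, entry.controllers, entry.path)
```

The same task appears once per hierarchy, each time at a different path, which is what "multiple parallel hierarchies" means in practice.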
  3. 12.

    What is a subsystem?

    - Represents a single resource:
      - cpu
      - memory
      - blkio
      - cpuacct
      - cpuset
      - devices
      - freezer
      - ns
      - etc.
    - Something that does something to a group of tasks :-/
  4. 14.

    Nomenclature

    - Processes are called:
      - Tasks
    - Subsystems are also called:
      - Controllers
      - Resource controllers
  5. 21.

    The kernel maintains hierarchical constraints on limits.

    If the devices cgroup /child1 cannot access a disk drive, then /child1/child2 cannot give itself those rights.
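The constraint above can be pictured as an intersection along the path from the root to a cgroup. This is a toy model only, not how the kernel implements the devices controller; the `allowed` table and the device numbers in it are hypothetical.

```python
from typing import Dict, Set


def effective_access(path: str, allowed: Dict[str, Set[str]]) -> Set[str]:
    """Return the devices a cgroup can really use in this toy model.

    `allowed` maps a cgroup path to the set of devices it asks for.
    A cgroup never ends up with more than every one of its ancestors allows.
    """
    result = set(allowed.get("/", set()))
    current = ""
    for part in (p for p in path.split("/") if p):
        current += "/" + part
        if current in allowed:
            result &= allowed[current]
    return result


if __name__ == "__main__":
    # Hypothetical table: /child1 lost access to disk 8:0,
    # and /child1/child2 tries to ask for it again.
    table = {
        "/": {"8:0", "8:1"},
        "/child1": {"8:1"},
        "/child1/child2": {"8:0", "8:1"},
    }
    print(sorted(effective_access("/child1/child2", table)))
```

In the example table, /child1/child2 asks for 8:0 but still ends up with 8:1 only, because its parent does not have 8:0 to give.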
  6. 23.

    Implications

    - There is only one way that a task can be limited or affected by any single subsystem.
    - You can group several subsystems together so that they affect all tasks in a single hierarchy.
    - If cgroups in that hierarchy have different parameters set, their tasks are affected differently.
    - Constant refactoring is required for the best knapsack fit.
  7. 25.

    Manual

    meson10@xps:~/workspace$ cgcreate -h
    Usage: cgcreate [-h] [-f mode] [-d mode] [-s mode] [-t <tuid>:<tgid>] [-a <agid>:<auid>] -g <controllers>:<path> [-g ...]
    Create control group(s)
      -a <tuid>:<tgid>            Owner of the group and all its files
      -d, --dperm=mode            Group directory permissions
      -f, --fperm=mode            Group file permissions
      -g <controllers>:<path>     Control group which should be added
      -h, --help                  Display this help
      -s, --tperm=mode            Tasks file permissions
      -t <tuid>:<tgid>            Owner of the tasks file
  8. 26.

    Automatic: /etc/cgconfig.conf

    mount {
        subsystem = /mount/point
        …
    }
    group <name> {
        [<permissions>]
        <controller> {
            <param name> = <param value>;
            …
        }
        …
    }
  9. 27.

    Automatic: Example conf

    mount {
        cpuset  = /cgroup/cpuset;
        cpu     = /cgroup/cpu;
        cpuacct = /cgroup/cpuacct;
        memory  = /cgroup/memory;
        devices = /cgroup/devices;
        freezer = /cgroup/freezer;
        net_cls = /cgroup/net_cls;
        blkio   = /cgroup/blkio;
    }
    group daemons/sql {
        cpuset {
            cpuset.mems = 0;
            cpuset.cpus = 0;
        }
    }
  10. 28.

    Automatic: Translates to

    # mkdir /cgroup/cpuset
    # mount -t cgroup -o cpuset cpuset /cgroup/cpuset
    # mkdir -p /cgroup/cpuset/daemons/sql
    # echo $(cgget -n -v -r cpuset.mems /) > /cgroup/cpuset/daemons/cpuset.mems
    # echo $(cgget -n -v -r cpuset.cpus /) > /cgroup/cpuset/daemons/cpuset.cpus
    # echo 0 > /cgroup/cpuset/daemons/sql/cpuset.mems
    # echo 0 > /cgroup/cpuset/daemons/sql/cpuset.cpus
  11. 30.

    blkio

    - Controls and monitors access to block devices.
    - Policies:
      - Weight division
      - Upper-bound throttling
    - All devices
    - Per device
  12. 31.

    blkio: Example

    $: echo 500 > blkio.weight
    $: echo 8:0 500 > blkio.weight_device
    $: echo "8:0 10485760" > /cgroup/blkio/test/blkio.throttle.read_bps_device
    $: echo "8:0 10" > /cgroup/blkio/test/blkio.throttle.read_iops_device
    $: echo "8:0 10485760" > /cgroup/blkio/test/blkio.throttle.write_bps_device
    $: echo "8:0 10" > /cgroup/blkio/test/blkio.throttle.write_iops_device
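The throttle files in the example take a "major:minor value" line, and the 10485760 figure is 10 MiB per second written out in bytes. A minimal sketch of building such lines follows; the helper names `mib` and `throttle_rule` are made up for illustration and are not part of any cgroup tooling.

```python
def mib(count: int) -> int:
    """Convert mebibytes to bytes."""
    return count * 1024 * 1024


def throttle_rule(major: int, minor: int, value: int) -> str:
    """Build the 'major:minor value' line that the blkio.throttle.* files expect."""
    if major < 0 or minor < 0 or value < 0:
        raise ValueError("major, minor and value must be non-negative")
    return f"{major}:{minor} {value}"


if __name__ == "__main__":
    # Device 8:0 limited to 10 MiB/s, as in the slide.
    print(throttle_rule(8, 0, mib(10)))
    # Device 8:0 limited to 10 I/O operations per second.
    print(throttle_rule(8, 0, 10))
```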
  13. 32.

    cpu

    - Controls and monitors CPU access.
    - Policies:
      - Ceiling enforcement: cpu.cfs_quota_us & cpu.cfs_period_us
      - Relative shares: cpu.shares
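The two policies answer different questions: quota and period give a hard ceiling, while shares only matter when sibling groups compete for the CPU. A rough sketch of the arithmetic, assuming the cgroup v1 convention that a quota of -1 means "no ceiling"; the function names and the group names in the usage lines are hypothetical.

```python
from typing import Dict, Optional


def cpu_ceiling(quota_us: int, period_us: int) -> Optional[float]:
    """Return the ceiling in CPUs implied by cpu.cfs_quota_us / cpu.cfs_period_us.

    A negative quota is treated as unlimited and returns None.
    """
    if period_us <= 0:
        raise ValueError("period must be positive")
    if quota_us < 0:
        return None
    return quota_us / period_us


def share_fractions(shares: Dict[str, int]) -> Dict[str, float]:
    """Return each sibling group's fraction of CPU time when all are busy."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("shares must sum to a positive number")
    return {name: value / total for name, value in shares.items()}


if __name__ == "__main__":
    print(cpu_ceiling(50000, 100000))           # half a CPU per period
    print(share_fractions({"sql": 3072, "web": 1024}))
```

Shares are relative: a group with 3072 shares next to one with 1024 gets three quarters of the CPU only while both are runnable, and can use more when the other is idle.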
  14. 33.

    cpuacct

    - Generates reports on CPU used.
    - To reset:
      - $: echo 0 > /cgroup/cpuacct/cpuacct.usage
  15. 34.

    cpuset

    - Assigns individual CPUs and memory nodes to groups.
    - cpuset.cpus (0-2,16) -> 0, 1, 2, and 16
    - cpuset.mems (0-2,16) -> 0, 1, 2, and 16
    - cpuset.memory_migrate
    - cpuset.cpu_exclusive
    - cpuset.mem_exclusive
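The list format used by cpuset.cpus and cpuset.mems mixes single IDs and inclusive ranges. A small sketch of expanding it, with `expand_cpuset` being an illustrative helper name rather than an existing tool:

```python
from typing import List


def expand_cpuset(spec: str) -> List[int]:
    """Expand a cpuset list such as '0-2,16' into sorted individual IDs."""
    ids = set()
    for chunk in spec.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        if "-" in chunk:
            low, high = chunk.split("-", 1)
            # Ranges are inclusive on both ends.
            ids.update(range(int(low), int(high) + 1))
        else:
            ids.add(int(chunk))
    return sorted(ids)


if __name__ == "__main__":
    print(expand_cpuset("0-2,16"))
```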
  16. 35.

    devices

    - Allows or denies access to devices by groups.
    - devices.allow: type, major, minor, access
      - type: a, b, c -> all, block, character
      - major:minor 8:1 -> /dev/sda1
      - access: r, w, m
    - devices.deny
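Put together, the four fields form a single line of the shape "type major:minor access" that is written to devices.allow or devices.deny. The sketch below builds and validates such a line; `device_rule` is a hypothetical helper, and the use of `*` as a wildcard for major or minor follows the cgroup v1 devices controller convention.

```python
from typing import Union


def device_rule(dev_type: str, major: Union[int, str],
                minor: Union[int, str], access: str) -> str:
    """Build a 'type major:minor access' entry for devices.allow / devices.deny."""
    if dev_type not in ("a", "b", "c"):
        raise ValueError("type must be a (all), b (block) or c (character)")
    if not access or set(access) - set("rwm"):
        raise ValueError("access must be a combination of r, w and m")
    for number in (major, minor):
        if number != "*" and int(number) < 0:
            raise ValueError("major and minor must be non-negative or '*'")
    return f"{dev_type} {major}:{minor} {access}"


if __name__ == "__main__":
    # Read/write access to block device 8:1 (/dev/sda1 in the slide).
    print(device_rule("b", 8, 1, "rw"))
    # Permission to create (mknod) any character device.
    print(device_rule("c", "*", "*", "m"))
```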
  17. 36.

    freezer

    - Suspends or resumes tasks in a cgroup.
    - freezer.state
      - FROZEN (r,w)
      - FREEZING (r)
      - THAWED (r,w)
    - Helps in:
      - Process migration
  18. 37.

    memory

    - Soft limits
    - Hard limits
    - Charges

    $: mount -t cgroup -o memory memory /cgroup/memory
    $: mkdir /cgroup/memory/blue
    $: cd /cgroup/memory/blue
    $: echo 104857600 > memory.limit_in_bytes
    $: echo $$ > tasks
    $: ~/.heavy_memory.sh
    Killed
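The 104857600 written to memory.limit_in_bytes is simply 100 MiB expressed in bytes. A short sketch of that conversion, assuming binary (1024-based) K/M/G suffixes; `to_bytes` is an illustrative name, not an existing utility.

```python
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(size: str) -> int:
    """Convert a size such as '100M' or '1g' to a plain byte count."""
    text = size.strip().upper()
    if text and text[-1] in "KMG":
        number, unit = text[:-1], text[-1]
    else:
        number, unit = text, ""
    return int(number) * _UNITS[unit]


if __name__ == "__main__":
    print(to_bytes("100M"))
```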
  19. 38.

    ns

    - Groups processes into namespaces.
    - Processes in a namespace can see each other,
    - but not processes in other namespaces.
    - Often confused with containers.
  20. 41.

    Namespaces

    - A kernel feature.
    - The system starts with one namespace of each kind.
    - 7 namespaces.
    - Each process is associated with one namespace of each kind.
  21. 42.

    System start: namespace init

    Each namespace is enabled at system start and assigned an inode.

    meson10@xps:~$ lsns
            NS TYPE   NPROCS PID USER    COMMAND
    4026531835 cgroup     49 921 meson10 /lib/systemd/systemd --user
    4026531836 pid        49 921 meson10 /lib/systemd/systemd --user
    4026531837 user       49 921 meson10 /lib/systemd/systemd --user
    4026531838 uts        49 921 meson10 /lib/systemd/systemd --user
    4026531839 ipc        49 921 meson10 /lib/systemd/systemd --user
    4026531840 mnt        49 921 meson10 /lib/systemd/systemd --user
    4026532009 net        49 921 meson10 /lib/systemd/systemd --user
  22. 43.

    Identifying Namespace

    New entries (inodes) are added to /proc/<pid>/ns, one for each namespace.

    meson10@xps:~$ ls -al /proc/942/ns
    total 0
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 cgroup -> cgroup:[4026531835]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 ipc -> ipc:[4026531839]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 mnt -> mnt:[4026531840]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 net -> net:[4026532009]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 pid -> pid:[4026531836]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 pid_for_children -> pid:[4026531836]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 user -> user:[4026531837]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 uts -> uts:[4026531838]
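Because each link target has the shape kind:[inode], two processes share a namespace exactly when those inode numbers match. A minimal sketch of reading and comparing them follows; the function names are made up for illustration, and reading another user's /proc/<pid>/ns entries may need root.

```python
import os
import re
from typing import Dict, Tuple

_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> Tuple[str, int]:
    """Split a link target such as 'net:[4026532009]' into (kind, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def read_namespaces(pid: str = "self") -> Dict[str, int]:
    """Map each entry under /proc/<pid>/ns to its namespace inode (Linux only)."""
    base = f"/proc/{pid}/ns"
    return {
        name: parse_ns_link(os.readlink(os.path.join(base, name)))[1]
        for name in os.listdir(base)
    }


def same_namespace(pid_a: str, pid_b: str, kind: str) -> bool:
    """True when both processes are in the same namespace of the given kind."""
    return read_namespaces(pid_a)[kind] == read_namespaces(pid_b)[kind]


if __name__ == "__main__":
    print(parse_ns_link("net:[4026532009]"))
```

On a Linux host, `same_namespace("self", "1", "net")` would tell whether the current process shares its network namespace with PID 1.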
  23. 45.

    unshare

    meson10@xps:~$ unshare --help
    Options:
     -m, --mount[=<file>]      unshare mounts namespace
     -u, --uts[=<file>]        unshare UTS namespace (hostname etc)
     -i, --ipc[=<file>]        unshare System V IPC namespace
     -n, --net[=<file>]        unshare network namespace
     -p, --pid[=<file>]        unshare pid namespace
     -U, --user[=<file>]       unshare user namespace
     -C, --cgroup[=<file>]     unshare cgroup namespace
     -f, --fork                fork before launching <program>
         --mount-proc[=<dir>]  mount proc filesystem first (implies --mount)
     -r, --map-root-user       map current user to root (implies --user)
         --propagation slave|shared|private|unchanged
                               modify mount propagation in mount namespace
     -s, --setgroups allow|deny  control the setgroups syscall in user namespaces
  24. 46.

    nsenter

    meson10@xps:~$ nsenter --help
    Options:
     -a, --all                 enter all namespaces
     -t, --target <pid>        target process to get namespaces from
     -m, --mount[=<file>]      enter mount namespace
     -u, --uts[=<file>]        enter UTS namespace (hostname etc)
     -i, --ipc[=<file>]        enter System V IPC namespace
     -n, --net[=<file>]        enter network namespace
     -p, --pid[=<file>]        enter pid namespace
     -C, --cgroup[=<file>]     enter cgroup namespace
     -U, --user[=<file>]       enter user namespace
     -S, --setuid <uid>        set uid in entered namespace
     -G, --setgid <gid>        set gid in entered namespace
         --preserve-credentials  do not touch uids or gids
     -r, --root[=<dir>]        set the root directory
     -w, --wd[=<dir>]          set the working directory
     -F, --no-fork             do not fork before exec'ing <program>
  25. 47.

    uts namespace

    meson10@xps:~$ hostname
    xps.piyushverma.net
    meson10@xps:~$ sudo unshare -u /bin/bash
    root@xps:~# hostname hello
    root@xps:~# hostname
    hello
    root@xps:~# exit
    meson10@xps:~$ hostname
    xps.piyushverma.net
  26. 48.

    ipc namespace

    meson10@xps:~$ ipcs -q
    ------ Message Queues --------
    key        msqid      owner      perms      used-bytes   messages
    0xef4c7712 131074     meson10    644        0            0
    meson10@xps:~$ sudo unshare -i /bin/bash
    root@xps:~# ipcs -q
    ------ Message Queues --------
    key        msqid      owner      perms      used-bytes   messages
  27. 49.

    net namespace

    meson10@xps:~/workspace/meson10/linuxlab$ ip link
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    2: wlp58s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DORMANT group default qlen 1000
        link/ether 18:5e:0f:ee:d9:32 brd ff:ff:ff:ff:ff:ff
    9: bridge0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
        link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
  28. 50.

    net namespace

    root@xps:~/workspace/meson10/linuxlab# ip link
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  29. 51.

    net namespace

    meson10@xps:~$ sudo unshare -n /bin/bash
    [sudo] password for meson10:
    root@xps:~/workspace/meson10/linuxlab# ip addr
    1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  30. 52.

    net namespace

    meson10@xps:~$ sudo unshare -n /bin/bash
    [sudo] password for meson10:
    root@xps:~/workspace/meson10/linuxlab# ip addr
    1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    root@xps:~/workspace/meson10/linuxlab# ping localhost
    connect: Network is unreachable
    root@xps:~/workspace/meson10/linuxlab# ip link set dev lo up && ping localhost
    PING localhost (127.0.0.1) 56(84) bytes of data.
    64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.033 ms
  31. 53.

    net namespace implications

    - A device can be connected to only one namespace at a time.
    - If eth0 is connected to the root namespace, a newly created namespace has no internet access.
  32. 54.

    net namespace solution

    $: sudo ip link add host type veth peer name guest
    $: sudo ip link set guest netns <PID>
    $: sudo ip addr add 192.168.0.2/24 dev host
    $: sudo ip link set host up
    $ns: ip addr add 192.168.0.1/24 dev guest
    $ns: ip link set guest up

    $: brctl addbr bridge0
    $: ip addr add 192.168.1.2/24 dev bridge0
    $: ip link set dev bridge0 up
    $: brctl addif bridge0 host
    $: ip link set host up
    $ns: ip addr add 192.168.1.1/24 dev guest
    $ns: ip link set guest up
    $ns: ip route add default via 192.168.1.2
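For the default route inside the namespace to work, the guest address and the gateway (the bridge address) have to sit in the same subnet. A quick sanity check of the addresses used above, using Python's standard ipaddress module; the helper names are hypothetical.

```python
import ipaddress


def same_subnet(interface_a: str, interface_b: str) -> bool:
    """True when two 'address/prefix' strings belong to the same network."""
    net_a = ipaddress.ip_interface(interface_a).network
    net_b = ipaddress.ip_interface(interface_b).network
    return net_a == net_b


def gateway_reachable(interface: str, gateway: str) -> bool:
    """True when the gateway address lies inside the interface's own subnet."""
    network = ipaddress.ip_interface(interface).network
    return ipaddress.ip_address(gateway) in network


if __name__ == "__main__":
    # guest (inside the namespace) and bridge0 (on the host), as in the slide.
    print(same_subnet("192.168.1.1/24", "192.168.1.2/24"))
    print(gateway_reachable("192.168.1.1/24", "192.168.1.2"))
```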
  33. 55.

    pid namespace

    $: sudo unshare -p /bin/bash

    - The child process enters a new PID namespace.
      - It gets PID 1.
    - A forked process gets a PID inside the namespace and a global PID.
    - Signals:
      - Signals sent from inside the namespace reach PID 1 only if it has registered a handler for them.
      - This is why Ctrl-C may not work in Docker.
    - When a child dies, its children (the grandchildren) are reparented to PID 1.
    - If PID 1 dies:
      - its children get SIGKILL, recursively;
      - the namespace is deleted.
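The "namespace PID plus global PID" pairing is visible in the NSpid line of /proc/<pid>/status on kernels that provide it: the leftmost number is the PID as seen from the namespace of the reader's procfs, followed by the PID in each nested namespace. A small parsing sketch follows; the sample status text is invented for illustration.

```python
from typing import List


def parse_nspid(status_text: str) -> List[int]:
    """Return the PIDs on the NSpid line, outermost namespace first.

    Returns an empty list when the kernel does not expose NSpid.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(value) for value in line.split()[1:]]
    return []


# Hypothetical excerpt of /proc/<pid>/status for a shell that is
# PID 1 inside a new PID namespace and PID 25762 on the host.
SAMPLE_STATUS = "Name:\tbash\nPid:\t25762\nNSpid:\t25762\t1\n"

if __name__ == "__main__":
    print(parse_nspid(SAMPLE_STATUS))
```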
  34. 56.

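    bash: fork: Cannot allocate memory

    - unshare execs bash; without --fork, bash itself stays in the original PID namespace.
    - bash forks a subprocess.
    - The first subprocess becomes PID 1 of the new namespace.
    - That subprocess exits.
    - The kernel calls disable_pid_allocation,
      - which clears the PIDNS_HASH_ADDING flag.
    - The next fork asks for a new PID through the alloc_pid function,
      - which fails with -ENOMEM.
    - Use --fork, and bash becomes PID 1.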
  35. 57.

    mnt namespace

    - An isolated list of mount points.
    - Unshare copies the parent’s mount points.
    - Mount events may conditionally propagate.
      - Private by default.
    - If the user namespace that owns the unshared namespace differs from the one that owns the parent namespace, the new namespace is less privileged.
    - In a less privileged namespace, shared mounts become slaves.
    - Mount flags cannot be altered on mounts inherited by a less privileged namespace.
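The propagation type of each mount can be read from the optional fields of /proc/<pid>/mountinfo: a shared:N tag marks a shared mount, master:N marks a slave, and a line with neither is private. A minimal sketch of classifying one line follows; the sample lines in the usage section are illustrative and not taken from the talk.

```python
def propagation(mountinfo_line: str) -> str:
    """Classify one /proc/<pid>/mountinfo line by its propagation type."""
    fields = mountinfo_line.split()
    # Optional fields sit between the six fixed leading fields and the '-' separator.
    separator = fields.index("-")
    tags = {field.split(":")[0] for field in fields[6:separator]}
    if "shared" in tags and "master" in tags:
        return "shared,slave"
    if "shared" in tags:
        return "shared"
    if "master" in tags:
        return "slave"
    if "unbindable" in tags:
        return "unbindable"
    return "private"


if __name__ == "__main__":
    print(propagation("22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw"))
    print(propagation("36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw"))
    print(propagation("30 22 0:5 / /proc rw - proc proc rw"))
```

Comparing the output for the same mount inside and outside an unshared namespace is one way to observe shared mounts turning into slaves.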
  36. 62.
  37. 63.

    persistent namespaces

    $: touch netns
    $: sudo unshare --net=netns -f /bin/bash
    ns$: hostname hello
    ns$: Ctrl-D; exit
    $: sudo nsenter --net=netns
    ns$: hostname
    hello

    meson10@xps:~$ ls -al /proc/942/ns/net
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 /proc/942/ns/net -> net:[4026532009]
    meson10@xps:~$ sudo ls -al /proc/25762/ns/net
    sudo: unable to resolve host hello: No such file or directory
    lrwxrwxrwx 1 root root 0 Dec 6 15:10 /proc/25762/ns/net -> net:[4026532391]