Cgroups and Namespaces in Linux

Piyush Verma
December 06, 2017

Building blocks of Linux containers.

Transcript

  1. Cgroups, Namespaces,
    and Containers.
    What and How much is too much to contain?
    Piyush Verma
    Oogway Consulting

  2. Virtual Machine

  3. VM
    - What is it?
    - What is it not?

  4. Container
    - What is it?
    - What is it not?

  5. Container vs. VM

  6. Time to Build

  7. Time to Quarantine?

  8. Container. How much to pack?

  9. What is a Cgroup?
    - Mechanism for aggregating/partitioning sets of tasks, and all their future children, into
      hierarchical groups with specialized behaviour.
    - First-class citizens.
    - Process-like hierarchical model, but:
      - Multiple parallel hierarchies can coexist.
      - Each hierarchy is attached to one or more subsystems.
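
    On cgroup v1 kernels, the cgroup a task belongs to in each hierarchy can be read from /proc/<pid>/cgroup, one line per hierarchy in the form hierarchy-ID:subsystems:path. The following is a minimal sketch that parses such lines; the helper name parse_proc_cgroup and the sample text are illustrative assumptions, not output from the talk.

```python
from typing import List, Tuple


def parse_proc_cgroup(text: str) -> List[Tuple[int, List[str], str]]:
    """Parse /proc/<pid>/cgroup content into (hierarchy_id, subsystems, path) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two colons only; the cgroup path may contain ':'.
        hierarchy_id, subsystems, path = line.split(":", 2)
        names = [name for name in subsystems.split(",") if name]
        entries.append((int(hierarchy_id), names, path))
    return entries


# Hypothetical sample: two hierarchies, one with cpu and cpuacct mounted together.
SAMPLE = """\
4:cpu,cpuacct:/daemons/sql
2:memory:/blue
"""

for hierarchy_id, subsystems, path in parse_proc_cgroup(SAMPLE):
    print(hierarchy_id, ",".join(subsystems), path)
```

    On a Linux machine the same function can be fed the contents of /proc/self/cgroup instead of the sample string.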

  10. What is a Hierarchy?

  11. What is a subsystem?
    - Represents a single resource:
      - cpu
      - memory
      - blkio
      - cpuacct
      - cpuset
      - devices
      - freezer
      - ns
      - etc.
    - Something that does something to a group of tasks :-/

  12. Hierarchy Rules

  13. Nomenclature
    - Processes are called:
      - Tasks
    - Subsystems are also called:
      - Controllers
      - Resource controllers

  14. A hierarchy can be
    attached to
    multiple
    subsystems,
    as long as the next rule is not violated.

  15. A subsystem can be
    attached to more than one
    hierarchy,
    as long as the next rule is not violated.

  16. Subsystems cannot
    share hierarchies

  17. Task belongs to
    only one cgroup in
    a hierarchy

  18. A child inherits
    its parent’s
    memberships,
    but is independent thereafter.

  19. Tasks hold no
    relationship in a
    group.

  20. Kernel maintains
    hierarchical
    constraints on
    limits.
    Example: if the devices cgroup /child1 cannot access a disk
    drive, then /child1/child2 cannot give itself those
    rights.

  21. Implications combined

  22. Implications
    - There is only one way that a task can be limited or affected by any single subsystem.
    - You can group several subsystems together so that they affect all tasks in a single hierarchy.
    - If cgroups in that hierarchy have different parameters set, their tasks will be affected differently.
    - Constant refactoring is required for the best knapsack fit.

  23. How to set this up?

  24. Manual
    meson10@xps:~/workspace$ cgcreate -h
    Usage: cgcreate [-h] [-f mode] [-d mode] [-s mode] [-t <tuid>:<tgid>] [-a <agid>:<auid>] -g <controllers>:<path> [-g ...]
    Create control group(s)
    -a <agid>:<auid> Owner of the group and all its files
    -d, --dperm=mode Group directory permissions
    -f, --fperm=mode Group file permissions
    -g <controllers>:<path> Control group which should be added
    -h, --help Display this help
    -s, --tperm=mode Tasks file permissions
    -t <tuid>:<tgid> Owner of the tasks file

  25. Automatic
    /etc/cgconfig.conf
    mount {
        subsystem = /mount/point;
        ...
    }
    group <name> {
        [<permissions>]
        <controller> {
            <param name> = <param value>;
            ...
        }
        ...
    }

  26. Automatic: Example conf
    mount {
        cpuset  = /cgroup/cpuset;
        cpu     = /cgroup/cpu;
        cpuacct = /cgroup/cpuacct;
        memory  = /cgroup/memory;
        devices = /cgroup/devices;
        freezer = /cgroup/freezer;
        net_cls = /cgroup/net_cls;
        blkio   = /cgroup/blkio;
    }
    group daemons/sql {
        cpuset {
            cpuset.mems = 0;
            cpuset.cpus = 0;
        }
    }
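
    A group stanza like the one above is regular enough to generate from data. The following is a minimal sketch, assuming a hypothetical helper render_group that takes a mapping of subsystem to parameters; it only builds the text and does not write /etc/cgconfig.conf.

```python
from typing import Dict


def render_group(name: str, subsystems: Dict[str, Dict[str, object]]) -> str:
    """Render a cgconfig.conf 'group' stanza from {subsystem: {parameter: value}}."""
    lines = [f"group {name} {{"]
    for subsystem, settings in subsystems.items():
        lines.append(f"    {subsystem} {{")
        for parameter, value in settings.items():
            # Each parameter assignment ends with a semicolon.
            lines.append(f"        {parameter} = {value};")
        lines.append("    }")
    lines.append("}")
    return "\n".join(lines)


# Same values as the daemons/sql group on the slide.
print(render_group("daemons/sql", {"cpuset": {"cpuset.mems": 0, "cpuset.cpus": 0}}))
```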

  27. Automatic: Translates to
    # mkdir /cgroup/cpuset
    # mount -t cgroup -o cpuset cpuset /cgroup/cpuset
    # mkdir -p /cgroup/cpuset/daemons/sql
    # echo $(cgget -n -v -r cpuset.mems /) > /cgroup/cpuset/daemons/cpuset.mems
    # echo $(cgget -n -v -r cpuset.cpus /) > /cgroup/cpuset/daemons/cpuset.cpus
    # echo 0 > /cgroup/cpuset/daemons/sql/cpuset.mems
    # echo 0 > /cgroup/cpuset/daemons/sql/cpuset.cpus

  28. Wait, Why do I need this?

  29. blkio
    - Control and Monitor access to Block devices.
    - Policies:
    - Weight Division
    - Upper-Bound Throttling
    - All devices
    - Per device

  30. blkio: Example
    $: echo 500 > blkio.weight
    $: echo "8:0 500" > blkio.weight_device
    $: echo "8:0 10485760" > /cgroup/blkio/test/blkio.throttle.read_bps_device
    $: echo "8:0 10" > /cgroup/blkio/test/blkio.throttle.read_iops_device
    $: echo "8:0 10485760" > /cgroup/blkio/test/blkio.throttle.write_bps_device
    $: echo "8:0 10" > /cgroup/blkio/test/blkio.throttle.write_iops_device
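
    The throttle files above take one "major:minor limit" entry per device, with byte rates given as plain integers. The following is a minimal sketch, assuming a hypothetical helper throttle_entry, that builds such an entry and shows where 10485760 comes from (10 MiB per second).

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def throttle_entry(major: int, minor: int, limit: int) -> str:
    """Return the 'major:minor limit' text written to a blkio.throttle.* file."""
    if limit < 0:
        raise ValueError("limit must not be negative")
    return f"{major}:{minor} {limit}"


# 8:0 is the device used on the slide; 10 MiB/s read bandwidth cap.
print(throttle_entry(8, 0, 10 * MIB))
# The same helper covers the IOPS files, where the limit is an operation count.
print(throttle_entry(8, 0, 10))
```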

  31. cpu
    - Control and monitor CPU access.
    - Policies:
      - Relative shares: cpu.shares
      - Ceiling enforcement: cpu.cfs_quota_us & cpu.cfs_period_us
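
    The two policies are easiest to compare with numbers. The following is a minimal sketch, assuming hypothetical helpers cpu_ceiling and share_fractions: the ceiling is quota divided by period (a quota of -1 means no ceiling), while shares only set relative weights between sibling cgroups competing for the CPU.

```python
from typing import Dict, Optional


def cpu_ceiling(cfs_quota_us: int, cfs_period_us: int) -> Optional[float]:
    """Return the hard cap in CPUs, or None when the quota is -1 (unlimited)."""
    if cfs_quota_us < 0:
        return None
    return cfs_quota_us / cfs_period_us


def share_fractions(shares: Dict[str, int]) -> Dict[str, float]:
    """Return each sibling cgroup's fraction of CPU time under full contention."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


# 50 ms of CPU time in every 100 ms period: at most half a CPU.
print(cpu_ceiling(50000, 100000))
# Hypothetical sibling groups: "sql" gets twice the weight of "batch".
print(share_fractions({"sql": 1024, "batch": 512}))
```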

  32. cpuacct
    - Generates reports on CPU used.
    - To reset:
    - $: echo 0 > /cgroup/cpuacct/cpuacct.usage

  33. cpuset
    - Assigns individual CPUs and Memory nodes to groups.
    - cpuset.cpus (0-2,16) -> 0,1,2, and 16
    - cpuset.mems (0-2, 16) -> 0,1,2, and 16
    - cpuset.memory_migrate
    - cpuset.cpu_exclusive
    - cpuset.mem_exclusive
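
    The list format used by cpuset.cpus and cpuset.mems mixes single IDs and inclusive ranges. The following is a minimal sketch, assuming a hypothetical helper parse_cpu_list, that expands such a string the same way as the example on the slide.

```python
from typing import List


def parse_cpu_list(spec: str) -> List[int]:
    """Expand a cpuset list such as '0-2,16' into sorted individual IDs."""
    ids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part}")
            # Ranges are inclusive on both ends.
            ids.update(range(start, end + 1))
        else:
            ids.add(int(part))
    return sorted(ids)


print(parse_cpu_list("0-2,16"))
```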

  34. devices
    - Allows or denies access to devices by groups.
    - devices.allow: "type major:minor access"
      - type: a, b, c -> all, block, character
      - major:minor: e.g. 8:1 -> /dev/sda1
      - access: r, w, m -> read, write, mknod
    - devices.deny: same format
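
    An entry written to devices.allow or devices.deny is a single line made of the three fields listed above. The following is a minimal sketch, assuming a hypothetical helper device_rule, that validates the fields and builds the line; "*" is accepted as a wildcard for major or minor.

```python
from typing import Union

VALID_TYPES = {"a", "b", "c"}  # all, block, character
VALID_ACCESS = set("rwm")      # read, write, mknod


def device_rule(dev_type: str, major: Union[int, str], minor: Union[int, str], access: str) -> str:
    """Build a 'type major:minor access' line for devices.allow / devices.deny."""
    if dev_type not in VALID_TYPES:
        raise ValueError(f"unknown device type: {dev_type!r}")
    if not access or set(access) - VALID_ACCESS:
        raise ValueError(f"invalid access string: {access!r}")
    return f"{dev_type} {major}:{minor} {access}"


# Read/write access to block device 8:1 (/dev/sda1 on the slide).
print(device_rule("b", 8, 1, "rw"))
# Full access to every minor of character major 1.
print(device_rule("c", 1, "*", "rwm"))
```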

  35. freezer
    - Suspends or resumes tasks in a cgroup.
    - freezer.state
      - FROZEN (r,w)
      - FREEZING (r)
      - THAWED (r,w)
    - Helps in:
      - Process migration

  36. memory
    - Soft limits
    - Hard limits
    - Charges
    $: mount -t cgroup -o memory memory /cgroup/memory
    $: mkdir /cgroup/memory/blue
    $: cd /cgroup/memory/blue
    $: echo 104857600 > memory.limit_in_bytes
    $: echo $$ > tasks
    $: ~/.heavy_memory.sh
    Killed
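
    The limit 104857600 is 100 MiB written out in bytes; memory.limit_in_bytes also accepts the suffixes k, m and g (either case). The following is a minimal sketch, assuming a hypothetical helper to_bytes, that converts the suffixed form into the byte count used on the slide.

```python
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def to_bytes(size: str) -> int:
    """Convert '100M', '1g' or a plain byte count into an integer number of bytes."""
    size = size.strip()
    suffix = size[-1:].lower()
    if suffix in UNITS:
        return int(size[:-1]) * UNITS[suffix]
    return int(size)


print(to_bytes("100M"))
```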

  37. ns
    - Groups processes into namespaces.
    - Processes in a namespace can see each other,
    - but not processes in other namespaces.
    - Often confused with containers.

  38. Example Scenarios

  39. Namespaces
    - Kernel feature.
    - The system starts with one namespace of each kind.
    - 7 kinds of namespaces.
    - Each process is associated with one namespace of each kind.

  40. System start: namespace init
    Each namespace is enabled at system start and assigned an inode.
    meson10@xps:~$ lsns
            NS TYPE   NPROCS PID USER    COMMAND
    4026531835 cgroup     49 921 meson10 /lib/systemd/systemd --user
    4026531836 pid        49 921 meson10 /lib/systemd/systemd --user
    4026531837 user       49 921 meson10 /lib/systemd/systemd --user
    4026531838 uts        49 921 meson10 /lib/systemd/systemd --user
    4026531839 ipc        49 921 meson10 /lib/systemd/systemd --user
    4026531840 mnt        49 921 meson10 /lib/systemd/systemd --user
    4026532009 net        49 921 meson10 /lib/systemd/systemd --user

  41. Identifying Namespace
    New entries (inodes) are added to /proc/<pid>/ns, one for each namespace
    meson10@xps:~$ ls -al /proc/942/ns
    total 0
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 cgroup -> cgroup:[4026531835]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 ipc -> ipc:[4026531839]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 mnt -> mnt:[4026531840]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 net -> net:[4026532009]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 pid -> pid:[4026531836]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 pid_for_children -> pid:[4026531836]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 user -> user:[4026531837]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 uts -> uts:[4026531838]
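
    Each link target has the form type:[inode], and two processes share a namespace exactly when both the type and the inode match. The following is a minimal sketch, assuming hypothetical helpers parse_ns_link and same_namespace, that works on the link text shown above.

```python
import re
from typing import Tuple

# Matches targets such as 'net:[4026532009]'.
NS_LINK = re.compile(r"^(?P<type>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> Tuple[str, int]:
    """Split a /proc/<pid>/ns/* link target into (namespace type, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("type"), int(match.group("inode"))


def same_namespace(first: str, second: str) -> bool:
    """Return True when both link targets name the same namespace."""
    return parse_ns_link(first) == parse_ns_link(second)


print(parse_ns_link("net:[4026532009]"))
print(same_namespace("pid:[4026531836]", "pid:[4026531836]"))
```

    On Linux, the link text for a live process can be obtained with os.readlink on a path under /proc/<pid>/ns.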

  42. Namespace utils:
    Unshare & nsenter

  43. unshare
    meson10@xps:~$ unshare --help
    Options:
    -m, --mount[=<file>] unshare mounts namespace
    -u, --uts[=<file>] unshare UTS namespace (hostname etc)
    -i, --ipc[=<file>] unshare System V IPC namespace
    -n, --net[=<file>] unshare network namespace
    -p, --pid[=<file>] unshare pid namespace
    -U, --user[=<file>] unshare user namespace
    -C, --cgroup[=<file>] unshare cgroup namespace
    -f, --fork fork before launching <program>
    --mount-proc[=<dir>] mount proc filesystem first (implies --mount)
    -r, --map-root-user map current user to root (implies --user)
    --propagation slave|shared|private|unchanged modify mount propagation in mount namespace
    -s, --setgroups allow|deny control the setgroups syscall in user namespaces
  44. nsenter
    meson10@xps:~$ nsenter --help
    Options:
    -a, --all enter all namespaces
    -t, --target <pid> target process to get namespaces from
    -m, --mount[=<file>] enter mount namespace
    -u, --uts[=<file>] enter UTS namespace (hostname etc)
    -i, --ipc[=<file>] enter System V IPC namespace
    -n, --net[=<file>] enter network namespace
    -p, --pid[=<file>] enter pid namespace
    -C, --cgroup[=<file>] enter cgroup namespace
    -U, --user[=<file>] enter user namespace
    -S, --setuid <uid> set uid in entered namespace
    -G, --setgid <gid> set gid in entered namespace
    --preserve-credentials do not touch uids or gids
    -r, --root[=<dir>] set the root directory
    -w, --wd[=<dir>] set the working directory
    -F, --no-fork do not fork before exec'ing <program>

  45. uts namespace
    meson10@xps:~$ hostname
    xps.piyushverma.net
    meson10@xps:~$ sudo unshare -u /bin/bash
    root@xps:~# hostname hello
    root@xps:~# hostname
    hello
    root@xps:~# exit
    meson10@xps:~$ hostname
    xps.piyushverma.net

  46. ipc namespace
    meson10@xps:~$ ipcs -q
    ------ Message Queues --------
    key msqid owner perms used-bytes messages
    0xef4c7712 131074 meson10 644 0 0
    meson10@xps:~$ sudo unshare -i /bin/bash
    root@xps:~# ipcs -q
    ------ Message Queues --------
    key msqid owner perms used-bytes messages

  47. net namespace
    meson10@xps:~/workspace/meson10/linuxlab$ ip link
    1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    2: wlp58s0: mtu 1500 qdisc mq state UP mode DORMANT group default qlen 1000
        link/ether 18:5e:0f:ee:d9:32 brd ff:ff:ff:ff:ff:ff
    9: bridge0: mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
        link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff

  48. net namespace
    root@xps:~/workspace/meson10/linuxlab# ip link
    1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

  49. net namespace
    meson10@xps:~$ sudo unshare -n /bin/bash
    [sudo] password for meson10:
    root@xps:~/workspace/meson10/linuxlab# ip addr
    1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

  50. net namespace
    meson10@xps:~$ sudo unshare -n /bin/bash
    [sudo] password for meson10:
    root@xps:~/workspace/meson10/linuxlab# ip addr
    1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    root@xps:~/workspace/meson10/linuxlab# ping localhost
    connect: Network is unreachable
    root@xps:~/workspace/meson10/linuxlab# ip link set dev lo up && ping localhost
    PING localhost (127.0.0.1) 56(84) bytes of data.
    64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.033 ms

  51. net namespace implications
    - A network device can belong to only one namespace at a time.
    - If eth0 belongs to the root namespace, a newly created namespace has no internet access.

  52. net namespace solution
    $: sudo ip link add host type veth peer name guest
    $: sudo ip link set guest netns <pid>
    $: sudo ip addr add 192.168.0.2/24 dev host
    $: sudo ip link set host up
    $ns: ip addr add 192.168.0.1/24 dev guest
    $ns: ip link set guest up
    $: brctl addbr bridge0
    $: ip addr add 192.168.1.2/24 dev bridge0
    $: ip link set dev bridge0 up
    $: brctl addif bridge0 host
    $: ip link set host up
    $ns: ip addr add 192.168.1.1/24 dev guest
    $ns: ip link set guest up
    $ns: ip route add default via 192.168.1.2

  53. pid namespace
    $: sudo unshare -p /bin/bash
    - The child process enters a new PID namespace
      - and gets PID 1.
    - A forked process gets a PID in the namespace and a global PID.
    - Signals
      - PID 1 only receives signals it has explicitly registered handlers for.
      - This is why Ctrl-C doesn’t work in Docker.
    - When a child dies, its children are reparented to PID 1.
    - If PID 1 dies:
      - its children get SIGKILL recursively
      - the namespace is deleted.

  54. bash: fork: Cannot allocate memory
    - unshare execs bash without forking.
    - bash forks a subprocess.
    - The first subprocess becomes PID 1 of the new namespace.
    - That subprocess exits.
    - The kernel calls disable_pid_allocation,
      - which clears the PIDNS_HASH_ADDING flag.
    - The next fork requests a new PID through alloc_pid,
      - which fails with -ENOMEM.
    - Use --fork so that bash becomes PID 1.

  55. mnt namespace
    - Isolated list of mount points.
    - unshare copies the parent’s mount points.
    - Mount events may conditionally propagate.
    - Private by default.
    - If the new namespace is owned by a different user namespace than its parent, it is less privileged.
    - In a less privileged namespace, shared mounts become slaves.
    - Mount flags cannot be altered across less privileged mounts.

  56. bind
    - Shared
    - Private
    - Slave
    - unbindable

  57. Private bind

  58. persistent namespaces
    $: touch netns
    $: sudo unshare --net=netns -f /bin/bash
    ns$: hostname hello
    ns$: Ctrl-D; exit
    $: sudo nsenter --net=netns
    ns$: hostname
    hello
    meson10@xps:~$ ls -al /proc/942/ns/net
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 /proc/942/ns/net -> net:[4026532009]
    meson10@xps:~$ sudo ls -al /proc/25762/ns/net
    sudo: unable to resolve host hello: No such file or directory
    lrwxrwxrwx 1 root root 0 Dec 6 15:10 /proc/25762/ns/net -> net:[4026532391]

  59. Thank you!
    Piyush Verma
    @meson10
    Oogway
    Consulting
    http://oogway.in
