
Cgroups and Namespaces in Linux

Piyush Verma
December 06, 2017

Building blocks of Linux containers.

Transcript

  1. Cgroups, Namespaces, and Containers.
    What and How much is too much to contain?
    Piyush Verma
    Oogway Consulting


  2. Virtual Machine


  3. VM
    - What is it?
    - What is it not?


  4. Containers


  5. Container
    - What is it?
    - What is it not?


  6. Container vs. VM


  7. Time to Build


  8. Time to Quarantine?


  9. Container. How much to pack?


  10. What is a Cgroup?
    - A mechanism for aggregating/partitioning sets of tasks, and all their future children, into
    hierarchical groups with specialized behaviour.
    - First-class citizens
    - Process-like hierarchical model, but:
      - Multiple parallel hierarchies coexist.
      - Each hierarchy connects to one or more subsystems.


  11. What is a Hierarchy?


  12. What is a subsystem?
    - Represents a single resource:
      - cpu
      - memory
      - blkio
      - cpuacct
      - cpuset
      - devices
      - freezer
      - ns
      - etc.
    - Something that does something to a group of tasks :-/


  13. Hierarchy Rules


  14. Nomenclature
    - Processes are called:
      - Tasks
    - Subsystems are also called:
      - Controllers
      - Resource controllers


  15. A hierarchy can be attached to multiple subsystems,
    as long as the next rule is not violated.


  16. A subsystem can be attached to more than one hierarchy,
    as long as the next rule is not violated.


  17. A subsystem cannot be attached to a second hierarchy
    that already has a different subsystem attached.


  18. A task belongs to only one cgroup in a hierarchy.
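
One way to see this rule on a cgroup v1 system is /proc/<pid>/cgroup, which prints one line per hierarchy in the form hierarchy-ID:controller-list:cgroup-path. The helper below is only a sketch: the name cgroup_for is made up for illustration, and it reads from a file argument so it can also be run against a saved copy.

```shell
#!/bin/sh
# cgroup_for CONTROLLER [FILE]   (hypothetical helper, not part of libcgroup)
# Print the cgroup path a task occupies in the hierarchy that CONTROLLER
# is attached to. FILE defaults to /proc/self/cgroup and is expected to use
# the cgroup v1 layout: "hierarchy-ID:controller-list:cgroup-path".
cgroup_for() {
    awk -F: -v want="$1" '
        {
            n = split($2, ctrls, ",")
            for (i = 1; i <= n; i++) {
                if (ctrls[i] == want) {
                    # Strip the first two fields; the path itself may contain ":".
                    path = $0
                    sub(/^[^:]*:[^:]*:/, "", path)
                    print path
                    exit
                }
            }
        }' "${2:-/proc/self/cgroup}"
}
```

Each controller appears on exactly one line, so a call such as cgroup_for memory returns at most one path.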


  19. A child inherits its parent's cgroup memberships,
    but is independent of the parent thereafter.


  20. Tasks hold no relationship in a group.


  21. The kernel maintains hierarchical constraints on limits.
    If the devices cgroup /child1 cannot access a disk drive, then /child1/child2
    cannot give itself those rights.


  22. Implications combined


  23. Implications
    - There is only one way that a task can be limited or affected by any single subsystem.
    - You can group several subsystems together so that they affect all tasks in a single hierarchy.
    - If cgroups in that hierarchy have different parameters set, their tasks are affected differently.
    - Constant refactoring is required for the best knapsack.


  24. How to set this up?


  25. Manual
    $ cgcreate -h
    Usage: cgcreate [-h] [-f mode] [-d mode] [-s mode] [-t <tuid>:<tgid>] [-a <agid>:<auid>] -g <controllers>:<path> [-g ...]
    Create control group(s)
      -a <agid>:<auid>          Owner of the group and all its files
      -d, --dperm=mode          Group directory permissions
      -f, --fperm=mode          Group file permissions
      -g <controllers>:<path>   Control group which should be added
      -h, --help                Display this help
      -s, --tperm=mode          Tasks file permissions
      -t <tuid>:<tgid>          Owner of the tasks file


  26. Automatic
    /etc/cgconfig.conf
    mount {
        subsystem = /mount/point;
        ...
    }
    group <name> {
        [<permissions>]
        <controller> {
            <param name> = <param value>;
            ...
        }
        ...
    }


  27. Automatic: Example conf
    mount {
        cpuset  = /cgroup/cpuset;
        cpu     = /cgroup/cpu;
        cpuacct = /cgroup/cpuacct;
        memory  = /cgroup/memory;
        devices = /cgroup/devices;
        freezer = /cgroup/freezer;
        net_cls = /cgroup/net_cls;
        blkio   = /cgroup/blkio;
    }
    group daemons/sql {
        cpuset {
            cpuset.mems = 0;
            cpuset.cpus = 0;
        }
    }


  28. Automatic: Translates to
    # mkdir /cgroup/cpuset
    # mount -t cgroup -o cpuset cpuset /cgroup/cpuset
    # mkdir -p /cgroup/cpuset/daemons/sql
    # echo $(cgget -n -v -r cpuset.mems /) > /cgroup/cpuset/daemons/cpuset.mems
    # echo $(cgget -n -v -r cpuset.cpus /) > /cgroup/cpuset/daemons/cpuset.cpus
    # echo 0 > /cgroup/cpuset/daemons/sql/cpuset.mems
    # echo 0 > /cgroup/cpuset/daemons/sql/cpuset.cpus


  29. Wait, Why do I need this?


  30. blkio
    - Control and Monitor access to Block devices.
    - Policies:
    - Weight Division
    - Upper-Bound Throttling
    - All devices
    - Per device


  31. blkio: Example
    $: echo 500 > blkio.weight
    $: echo 8:0 500 > blkio.weight_device
    $: echo "8:0 10485760" > /cgroup/blkio/test/blkio.throttle.read_bps_device
    $: echo "8:0 10" > /cgroup/blkio/test/blkio.throttle.read_iops_device
    $: echo "8:0 10485760" > /cgroup/blkio/test/blkio.throttle.write_bps_device
    $: echo "8:0 10" > /cgroup/blkio/test/blkio.throttle.write_iops_device
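
The throttle files take rules of the form "major:minor value", and the bps files expect bytes per second. As a small sketch of where 10485760 comes from, the hypothetical helper below (throttle_rule is an illustrative name, not a libcgroup tool) converts MiB/s into such a rule.

```shell
#!/bin/sh
# throttle_rule MAJOR:MINOR MIB   (hypothetical helper)
# Print a "major:minor bytes_per_second" rule suitable for the
# blkio.throttle.read_bps_device / write_bps_device files,
# converting MiB/s to bytes/s.
throttle_rule() {
    echo "$1 $(( $2 * 1024 * 1024 ))"
}

# 10 MiB/s on device 8:0 gives the value used on the slide.
throttle_rule 8:0 10
```

The printed rule could then be redirected into a throttle file of the target cgroup, which needs root and a mounted blkio hierarchy.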


  32. cpu
    - Control and Monitor CPU Access.
    - Policies:
      - Ceiling Enforcement: cpu.cfs_quota_us & cpu.cfs_period_us
      - Relative Shares: cpu.shares
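
The arithmetic behind the two policies can be sketched as follows. A ceiling is the fraction quota/period of one CPU, and shares only matter relative to busy siblings. Both function names are hypothetical, and the 100000 microsecond default is the usual CFS period; integer arithmetic is used, so results are rounded down.

```shell
#!/bin/sh
# cfs_quota PERCENT [PERIOD_US]   (hypothetical helper)
# Value for cpu.cfs_quota_us that caps a group at PERCENT of one CPU
# (200 means two full CPUs) for the given cpu.cfs_period_us.
cfs_quota() {
    period=${2:-100000}
    echo $(( period * $1 / 100 ))
}

# share_percent MINE [SIBLING_SHARES...]   (hypothetical helper)
# Approximate CPU percentage a group with cpu.shares=MINE receives when it
# and all listed siblings are fully busy.
share_percent() {
    mine=$1
    total=0
    for s in "$@"; do
        total=$(( total + s ))
    done
    echo $(( 100 * mine / total ))
}

cfs_quota 50            # half a CPU with the default period
share_percent 512 1024  # a 512-share group next to a 1024-share sibling
```

The variables are global because POSIX sh has no local; wrap the bodies in subshells if that matters.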


  33. cpuacct
    - Generates reports on CPU used.
    - To reset:
    - $: echo 0 > /cgroup/cpuacct/cpuacct.usage


  34. cpuset
    - Assigns individual CPUs and Memory nodes to groups.
    - cpuset.cpus (0-2,16) -> 0,1,2, and 16
    - cpuset.mems (0-2, 16) -> 0,1,2, and 16
    - cpuset.memory_migrate
    - cpuset.cpu_exclusive
    - cpuset.mem_exclusive
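
The list syntax used by cpuset.cpus and cpuset.mems mixes single IDs and ranges. A minimal sketch of the expansion shown above, with a made-up function name (expand_cpulist) and no validation of the input:

```shell
#!/bin/sh
# expand_cpulist LIST   (hypothetical helper)
# Expand a cpuset-style list such as "0-2,16" into "0 1 2 16".
expand_cpulist() {
    list=$(printf '%s' "$1" | tr -d ' ')   # tolerate "0-2, 16"
    out=
    old_ifs=$IFS
    IFS=,
    for part in $list; do
        case $part in
            *-*) lo=${part%-*}; hi=${part#*-} ;;
            *)   lo=$part;      hi=$part ;;
        esac
        i=$lo
        while [ "$i" -le "$hi" ]; do
            out="$out${out:+ }$i"
            i=$(( i + 1 ))
        done
    done
    IFS=$old_ifs
    echo "$out"
}

expand_cpulist 0-2,16
```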


  35. devices
    - Allows or Denies access to devices by groups.
    - devices.allow: type major:minor access
      - type: a, b, c -> all, block, character
      - major:minor: 8:1 -> /dev/sda1
      - access: r, w, m -> read, write, mknod
    - devices.deny
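
An entry written to devices.allow or devices.deny is a single line of the form "type major:minor access". The sketch below only builds and sanity-checks such a line; device_rule is a hypothetical name, and the major:minor part is passed through unchecked (the kernel also accepts wildcards such as *:*).

```shell
#!/bin/sh
# device_rule TYPE MAJOR:MINOR ACCESS   (hypothetical helper)
# Print an entry for devices.allow / devices.deny after checking that
# TYPE is one of a, b, c and ACCESS uses only the letters r, w, m.
device_rule() {
    case $1 in
        a|b|c) ;;
        *) echo "bad type: $1" >&2; return 1 ;;
    esac
    case $3 in
        ''|*[!rwm]*) echo "bad access: $3" >&2; return 1 ;;
    esac
    echo "$1 $2 $3"
}

# Read/write access to block device 8:1 (/dev/sda1 on the slide).
device_rule b 8:1 rw
```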


  36. freezer
    - Suspends or resumes tasks in a cgroup.
    - freezer.state
    - FROZEN (r,w)
    - FREEZING (r)
    - THAWED (r,w)
    - Helps in:
    - Process migration
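
Because freezing is asynchronous (the state passes through FREEZING), a script usually writes FROZEN and then polls freezer.state. A rough sketch, assuming the caller passes the directory of a cgroup in a mounted freezer hierarchy; freeze_cgroup and thaw_cgroup are illustrative names.

```shell
#!/bin/sh
# freeze_cgroup DIR [TRIES]   (hypothetical helper)
# Request FROZEN for the cgroup at DIR and poll freezer.state once per
# second until it reads FROZEN, giving up after TRIES attempts (default 5).
freeze_cgroup() {
    tries=${2:-5}
    echo FROZEN > "$1/freezer.state" || return 1
    while [ "$tries" -gt 0 ]; do
        [ "$(cat "$1/freezer.state")" = FROZEN ] && return 0
        tries=$(( tries - 1 ))
        sleep 1
    done
    return 1
}

# thaw_cgroup DIR   (hypothetical helper)
# Resume every task in the cgroup at DIR.
thaw_cgroup() {
    echo THAWED > "$1/freezer.state"
}
```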


  37. memory
    - Soft limits
    - Hard limits
    - Charges
    $: mount -t cgroup -o memory memory /cgroup/memory
    $: mkdir /cgroup/memory/blue
    $: cd /cgroup/memory/blue
    $: echo 104857600 > memory.limit_in_bytes
    $: echo $$ > tasks
    $: ~/.heavy_memory.sh
    Killed
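
Before a task is killed like this, the gap between the hard limit and current usage can be read from the group's files. A small sketch, assuming the cgroup v1 file names memory.limit_in_bytes and memory.usage_in_bytes; mem_headroom is a made-up helper name.

```shell
#!/bin/sh
# mem_headroom DIR   (hypothetical helper)
# Print how many bytes the cgroup at DIR can still allocate before it
# reaches its hard limit (memory.limit_in_bytes - memory.usage_in_bytes).
mem_headroom() {
    limit=$(cat "$1/memory.limit_in_bytes") || return 1
    usage=$(cat "$1/memory.usage_in_bytes") || return 1
    echo $(( limit - usage ))
}
```

With the 100 MiB limit from the slide, mem_headroom /cgroup/memory/blue would shrink towards zero as the script allocates.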


  38. ns
    - Groups processes into namespaces.
    - Processes in a namespace can see each other,
      - but not processes in other namespaces.
    - Often confused with containers.


  39. Example Scenarios


  40. Namespaces


  41. Namespaces
    - Kernel Feature
    - The system starts with one namespace of each kind.
    - 7 kinds of namespaces
    - Each process is associated with one namespace of each kind.


  42. System start: namespace init
    Each of the namespaces is enabled at system start and assigned an inode
    $ lsns
    NS TYPE NPROCS PID USER COMMAND
    4026531835 cgroup 49 921 meson10 /lib/systemd/systemd --user
    4026531836 pid 49 921 meson10 /lib/systemd/systemd --user
    4026531837 user 49 921 meson10 /lib/systemd/systemd --user
    4026531838 uts 49 921 meson10 /lib/systemd/systemd --user
    4026531839 ipc 49 921 meson10 /lib/systemd/systemd --user
    4026531840 mnt 49 921 meson10 /lib/systemd/systemd --user
    4026532009 net 49 921 meson10 /lib/systemd/systemd --user


  43. Identifying Namespace
    New entries (inodes) are added to /proc/<pid>/ns, one for each namespace
    $ ls -al /proc/942/ns
    total 0
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 cgroup -> cgroup:[4026531835]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 ipc -> ipc:[4026531839]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 mnt -> mnt:[4026531840]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 net -> net:[4026532009]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 pid -> pid:[4026531836]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 pid_for_children -> pid:[4026531836]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 user -> user:[4026531837]
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 uts -> uts:[4026531838]
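
Two processes share a namespace exactly when these links carry the same inode number. The sketch below compares them; ns_id and same_ns are hypothetical names, and the optional last argument (an alternative proc root) exists only so the functions can be tried against a fake directory tree.

```shell
#!/bin/sh
# ns_id TYPE PID [PROCROOT]   (hypothetical helper)
# Print the inode number behind /proc/PID/ns/TYPE, e.g. 4026531838.
ns_id() {
    readlink "${3:-/proc}/$2/ns/$1" | sed 's/.*\[\([0-9]*\)\].*/\1/'
}

# same_ns TYPE PID1 PID2 [PROCROOT]   (hypothetical helper)
# Succeed if both processes are in the same namespace of kind TYPE.
same_ns() {
    a=$(ns_id "$1" "$2" "$4")
    b=$(ns_id "$1" "$3" "$4")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For example, same_ns net 1 $$ && echo shared reports whether the current shell still uses the network namespace of PID 1 (reading another user's /proc entries may need root).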


  44. Namespace utils: unshare & nsenter


  45. unshare
    $ unshare --help
    Options:
     -m, --mount[=<file>]      unshare mounts namespace
     -u, --uts[=<file>]        unshare UTS namespace (hostname etc)
     -i, --ipc[=<file>]        unshare System V IPC namespace
     -n, --net[=<file>]        unshare network namespace
     -p, --pid[=<file>]        unshare pid namespace
     -U, --user[=<file>]       unshare user namespace
     -C, --cgroup[=<file>]     unshare cgroup namespace
     -f, --fork                fork before launching
         --mount-proc[=<dir>]  mount proc filesystem first (implies --mount)
     -r, --map-root-user       map current user to root (implies --user)
         --propagation slave|shared|private|unchanged
                               modify mount propagation in mount namespace
     -s, --setgroups allow|deny  control the setgroups syscall in user namespaces


  46. nsenter
    $ nsenter --help
    Options:
     -a, --all              enter all namespaces
     -t, --target <pid>     target process to get namespaces from
     -m, --mount[=<file>]   enter mount namespace
     -u, --uts[=<file>]     enter UTS namespace (hostname etc)
     -i, --ipc[=<file>]     enter System V IPC namespace
     -n, --net[=<file>]     enter network namespace
     -p, --pid[=<file>]     enter pid namespace
     -C, --cgroup[=<file>]  enter cgroup namespace
     -U, --user[=<file>]    enter user namespace
     -S, --setuid <uid>     set uid in entered namespace
     -G, --setgid <gid>     set gid in entered namespace
         --preserve-credentials do not touch uids or gids
     -r, --root[=<dir>]     set the root directory
     -w, --wd[=<dir>]       set the working directory
     -F, --no-fork          do not fork before exec'ing


  47. uts namespace
    $ hostname
    xps.piyushverma.net
    $ sudo unshare -u /bin/bash
    # hostname hello
    # hostname
    hello
    # exit
    $ hostname
    xps.piyushverma.net


  48. ipc namespace
    $ ipcs -q
    ------ Message Queues --------
    key msqid owner perms used-bytes messages
    0xef4c7712 131074 meson10 644 0 0
    $ sudo unshare -i /bin/bash
    # ipcs -q
    ------ Message Queues --------
    key msqid owner perms used-bytes messages


  49. net namespace
    $ ip link
    1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    2: wlp58s0: mtu 1500 qdisc mq state UP mode DORMANT group default qlen 1000
        link/ether 18:5e:0f:ee:d9:32 brd ff:ff:ff:ff:ff:ff
    9: bridge0: mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
        link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff


  50. net namespace
    # ip link
    1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00


  51. net namespace
    $ sudo unshare -n /bin/bash
    [sudo] password for meson10:
    # ip addr
    1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00


  52. net namespace
    $ sudo unshare -n /bin/bash
    [sudo] password for meson10:
    # ip addr
    1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    # ping localhost
    connect: Network is unreachable
    # ip link set dev lo up && ping localhost
    PING localhost (127.0.0.1) 56(84) bytes of data.
    64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.033 ms


  53. net namespace implications
    - A device can be connected to only one namespace at a time.
    - If eth0 is connected to the root namespace, a newly created namespace has no internet access.


  54. net namespace solution
    $: sudo ip link add host type veth peer name guest
    $: sudo ip link set guest netns
    $: sudo ip addr add 192.168.0.2/24 dev host
    $: sudo ip link set host up
    $ns: ip addr add 192.168.0.1/24 dev guest
    $ns: ip link set guest up
    $: brctl addbr bridge0
    $: ip addr add 192.168.1.2/24 dev bridge0
    $: ip link set dev bridge0 up
    $: brctl addif bridge0 host
    $: ip link set host up
    $ns: ip addr add 192.168.1.1/24 dev guest
    $ns: ip link set guest up
    $ns: ip route add default via 192.168.1.2


  55. pid namespace
    $: sudo unshare -p /bin/bash
    - The child process enters a new PID namespace
      - and gets PID 1.
    - A forked process gets a PID inside the namespace and a global PID.
    - Signals
      - PID 1 only receives signals it has explicitly registered handlers for.
      - Hence Ctrl-C may not work in Docker.
    - When a child dies, its children (the grandchildren) are reparented to PID 1.
    - If PID 1 dies:
      - its children get SIGKILL recursively
      - the namespace is deleted.
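
The two PIDs of a namespaced process can be read from the NSpid line of /proc/<pid>/status (available since Linux 4.1), which lists the PID as seen from the reader's PID namespace first, followed by the PID in each nested namespace. A sketch with a made-up helper name (ns_pids):

```shell
#!/bin/sh
# ns_pids [STATUS_FILE]   (hypothetical helper)
# Print the PIDs from the "NSpid:" line of a /proc/<pid>/status file,
# outermost namespace first. Defaults to /proc/self/status.
ns_pids() {
    awk '$1 == "NSpid:" {
        for (i = 2; i <= NF; i++)
            printf "%s%s", $i, (i < NF ? " " : "\n")
    }' "${1:-/proc/self/status}"
}
```

For a shell started with unshare -p -f, the host would see something like "25762 1": a global PID and PID 1 inside the namespace (the numbers here are only illustrative).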


  56. bash: fork: Cannot allocate memory
    - unshare execs bash
    - bash forks a subprocess
    - The first subprocess becomes PID 1
    - The subprocess exits.
    - The kernel calls disable_pid_allocation,
      - which clears the PIDNS_HASH_ADDING flag.
    - The next fork asks for a new PID through the alloc_pid function,
      - which fails with -ENOMEM.
    - Use --fork so that bash becomes PID 1.


  57. mnt namespace
    - Isolated list of mount points
    - unshare copies the parent's mount points
    - Mounts may conditionally propagate.
    - Private by default.
    - If the user of the unshared namespace differs from the parent namespace's user, it is less privileged.
    - In a less privileged namespace, shared mounts become slaves.
    - Mount flags cannot be altered across less privileged mounts.


  58. bind
    - Shared
    - Private
    - Slave
    - Unbindable
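
The propagation type of an existing mount shows up in the optional fields of /proc/self/mountinfo: shared:N marks a shared mount, master:N a slave, unbindable an unbindable one, and no tag means private. The classifier below is a sketch over a single mountinfo line; propagation_of is an illustrative name.

```shell
#!/bin/sh
# propagation_of "MOUNTINFO LINE"   (hypothetical helper)
# Classify one line of /proc/<pid>/mountinfo. Optional fields start at
# field 7 and end at the lone "-" separator.
propagation_of() {
    printf '%s\n' "$1" | awk '
        {
            kind = "private"
            for (i = 7; i <= NF && $i != "-"; i++) {
                if ($i ~ /^shared:/)         shared = 1
                else if ($i ~ /^master:/)    slave = 1
                else if ($i == "unbindable") kind = "unbindable"
            }
            if (shared && slave) kind = "shared,slave"
            else if (shared)     kind = "shared"
            else if (slave)      kind = "slave"
            print kind
        }'
}
```

To label every mount of the current process, feed the file line by line: while IFS= read -r l; do propagation_of "$l"; done < /proc/self/mountinfo.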


  59. Shared bind


  60. Slave bind


  61. Private bind


  62. rbind


  63. persistent namespaces
    $: touch netns
    $: sudo unshare --net=netns -f /bin/bash
    ns$: hostname hello
    ns$: Ctrl-D; exit
    $: sudo nsenter --net=netns
    ns$: hostname
    hello
    $: ls -al /proc/942/ns/net
    lrwxrwxrwx 1 meson10 meson10 0 Dec 6 05:34 /proc/942/ns/net -> net:[4026532009]
    $: sudo ls -al /proc/25762/ns/net
    sudo: unable to resolve host hello: No such file or directory
    lrwxrwxrwx 1 root root 0 Dec 6 15:10 /proc/25762/ns/net -> net:[4026532391]


  64. Thank you!
    Piyush Verma
    @meson10
    Oogway Consulting
    http://oogway.in
