
cgroup v2 internals

Kenta Tada
December 05, 2020


Transcript

  1. About me
    ⚫ Kenta Tada
    ⚫ Software Engineer, Sony
    ⚫ CloudNative Days Tokyo 2020
      • https://speakerdeck.com/kentatada/embedded-container-runtime-using-linux-capabilities-seccomp-cgroups
  2. Unified hierarchy
    ⚫ Since cgroup v2, all controllers live under the same hierarchy.
    ⚫ If the parent cgroup does not enable a controller, it cannot be enabled in the descendant cgroups.
      • The cgroup.subtree_control file controls which controllers are enabled in the descendant cgroups.
    ⚫ (Diagram) /sys/fs/cgroup   /cgtest1 [cpu.*, memory.*]   /cgtest2 [cpu.*, memory.*]
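The top-down rule on this slide can be pictured as a bitmask subset check. The following is a minimal sketch, not kernel code: the `CTRL_*` bits and the `can_enable()` helper are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical controller bits, for illustration only. */
enum {
    CTRL_CPU    = 1u << 0,
    CTRL_MEMORY = 1u << 1,
    CTRL_IO     = 1u << 2,
};

/*
 * A cgroup can only turn on controllers that its parent has enabled in
 * the parent's cgroup.subtree_control. The request is valid when it
 * contains no bit that is missing from the parent's mask.
 */
static bool can_enable(uint32_t parent_subtree_control, uint32_t requested)
{
    return (requested & ~parent_subtree_control) == 0;
}
```

With a parent that enables only `cpu` and `memory`, a request for `cpu` passes and a request for `io` is rejected.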
  3. New features
    ⚫ nsdelegate and cgroup namespaces for unprivileged containers
      • But some controllers still need privileges because they use eBPF.
    ⚫ cgroup-aware OOM killer
    ⚫ PSI per cgroup
    ⚫ (NEW) Utilization clamping support
      • Bounds the computational power assigned to task groups, taking into account the actual CPU frequency, which depends on the operation of schedutil, and asymmetric-capacity systems like Arm's big.LITTLE.
      • https://github.com/torvalds/linux/commit/2480c093130f64ac3a410504fa8b3db1fc4b87ce
  4. The difference between v1 and v2 at kernel 5.9
    v1 subsystem  Kernel object name      Source                               v2 subsystem
    blkio         io_cgrp_subsys          block/blk-cgroup.c                   io
    cpuacct       cpuacct_cgrp_subsys     kernel/sched/cpuacct.c               cpu
    cpu           cpu_cgrp_subsys         kernel/sched/core.c                  cpu
    cpuset        cpuset_cgrp_subsys      kernel/cgroup/cpuset.c               cpuset
    devices       devices_cgrp_subsys     security/device_cgroup.c             Using eBPF
    freezer       freezer_cgrp_subsys     (v1) kernel/cgroup/legacy_freezer.c  freezer
                                          (v2) kernel/cgroup/freezer.c
    hugetlb       hugetlb_cgrp_subsys     mm/hugetlb_cgroup.c                  hugetlb
    memory        memory_cgrp_subsys      mm/memcontrol.c                      memory
    net_cls       net_cls_cgrp_subsys     net/core/netclassid_cgroup.c         Using eBPF
    net_prio      net_prio_cgrp_subsys    net/core/netprio_cgroup.c            Using eBPF
    perf_event    perf_event_cgrp_subsys  kernel/events/core.c                 perf_event
    pids          pids_cgrp_subsys        kernel/cgroup/pids.c                 pids
    rdma          rdma_cgrp_subsys        kernel/cgroup/rdma.c                 rdma
    https://events.static.linuxfound.org/sites/events/files/slides/cgroup_and_namespaces.pdf
  5. How to confirm the available controllers
    ⚫ cgroup.controllers
      • Ex. /sys/fs/cgroup/cgroup.controllers
      • Each cgroup has a "cgroup.controllers" file which lists all controllers available for the cgroup to enable.
    ⚫ But some controllers are not listed in that file.
    ⚫ Why?
    https://www.kernel.org/doc/html/v5.9/admin-guide/cgroup-v2.html
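Checking for a controller comes down to looking for a whole word in the space-separated contents of cgroup.controllers. A minimal sketch follows; the helper name `controller_listed()` is hypothetical, and the caller is assumed to have already read the file into a string.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Returns true if `name` appears as a whole word in a space-separated
 * list such as the contents of cgroup.controllers. Matching whole words
 * keeps "cpu" from matching "cpuset".
 */
static bool controller_listed(const char *list, const char *name)
{
    size_t n = strlen(name);
    const char *p = list;

    if (n == 0)
        return false;

    while (*p) {
        /* Skip separators between entries. */
        while (*p == ' ' || *p == '\n')
            p++;

        /* Length of the current entry. */
        size_t len = strcspn(p, " \n");

        if (len == n && strncmp(p, name, n) == 0)
            return true;
        p += len;
    }
    return false;
}
```

For a list such as "cpuset cpu io memory pids", a lookup of "cpu" succeeds while a lookup of "devices" fails, which is the situation this slide asks about.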
  6. Implicit or inhibited controllers
    ⚫ Some controllers are not supported in the default hierarchy.
      • cgrp_dfl_inhibit_ss_mask
    ⚫ Some controllers are implicitly enabled on the default hierarchy.
      • cgrp_dfl_implicit_ss_mask
    ⚫ When the system boots, the kernel sets up the above masks.
        if (ss->implicit_on_dfl)
                cgrp_dfl_implicit_ss_mask |= 1 << ss->id;
        else if (!ss->dfl_cftypes)
                cgrp_dfl_inhibit_ss_mask |= 1 << ss->id;
      https://github.com/torvalds/linux/blob/v5.9/kernel/cgroup/cgroup.c#L5740-L5743
  7. cgroup.controllers in kernel
    ⚫ cgroup core interface files are defined as below:
        .name = "cgroup.controllers",
        .seq_show = cgroup_controllers_show,
    ⚫ cgroup_controllers_show() calls cgroup_control().
    ⚫ cgroup_control() returns the visible controllers using the masks.
        static u16 cgroup_control(struct cgroup *cgrp)
        …
                if (cgroup_on_dfl(cgrp))
                        root_ss_mask &= ~(cgrp_dfl_inhibit_ss_mask |
                                          cgrp_dfl_implicit_ss_mask);
                return root_ss_mask;
        }
      https://github.com/torvalds/linux/blob/v5.9/kernel/cgroup/cgroup.c
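The masking step can be tried out in user space. The sketch below is a simplified model, not the kernel function: the `SS_*` ids and `visible_controllers()` are hypothetical, and treating devices as inhibited and perf_event as implicit is an assumption made for the example.

```c
#include <stdint.h>

/* Hypothetical subsystem ids; the kernel assigns the real ones. */
enum { SS_CPU, SS_MEMORY, SS_DEVICES, SS_PERF_EVENT };

/*
 * Model of the masking in cgroup_control(): on the default hierarchy,
 * controllers that are inhibited or implicitly enabled are removed from
 * the set that cgroup.controllers reports.
 */
static uint16_t visible_controllers(uint16_t root_ss_mask,
                                    uint16_t inhibit_mask,
                                    uint16_t implicit_mask)
{
    return (uint16_t)(root_ss_mask & ~(inhibit_mask | implicit_mask));
}
```

Starting from all four subsystems, masking out an inhibited `SS_DEVICES` and an implicit `SS_PERF_EVENT` leaves only the `SS_CPU` and `SS_MEMORY` bits, which is why some controllers never show up in the file.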
  8. Device controller in cgroup v1
    ⚫ The device controller allows or denies access to devices for each cgroup.
    ⚫ There are three files to control its behavior.
      • devices.allow is the allowlist of devices.
      • devices.deny is the denylist of devices.
      • devices.list shows available devices.
    ⚫ Interface
      • Ex. Allow cgroup 1 to read and mknod /dev/null as below:
        # echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow
    https://www.kernel.org/doc/html/v5.9/admin-guide/cgroup-v1/devices.html
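The rule string in the example has a fixed shape: device type, major:minor, then access letters (r, w, m). A minimal sketch of building such a string follows; `format_device_rule()` is a hypothetical helper, and writing the result to devices.allow is left to the caller.

```c
#include <stdio.h>

/*
 * Formats a cgroup v1 device rule such as "c 1:3 mr".
 * type:   'c' for character devices, 'b' for block devices
 * access: any combination of 'r' (read), 'w' (write), 'm' (mknod)
 * Returns the value of snprintf(), i.e. the length the rule needs.
 */
static int format_device_rule(char *buf, size_t size, char type,
                              unsigned major, unsigned minor,
                              const char *access)
{
    return snprintf(buf, size, "%c %u:%u %s", type, major, minor, access);
}
```

Calling it with 'c', 1, 3 and "mr" reproduces the /dev/null rule from the slide.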
  9. Device controller in cgroup v2
    ⚫ cgroup v2 uses eBPF for several purposes.
      • Ex. To control network access
    ⚫ The eBPF program is attached to a specific cgroup.
    ⚫ BPF_PROG_TYPE_CGROUP_DEVICE was introduced in kernel version 4.15.
  10. What is eBPF
    ⚫ eBPF is a revolutionary technology that can run sandboxed programs in the Linux kernel without changing kernel source code.
    https://ebpf.io/
    http://www.brendangregg.com/ebpf.html
  11. Attach device control settings inside kernel space
    ⚫ Attach the eBPF program to a cgroup
      • https://github.com/torvalds/linux/blob/v5.9/kernel/bpf/syscall.c#L4187
        → bpf_prog_attach at kernel/bpf/syscall.c#L2839
        → cgroup_bpf_prog_attach at kernel/bpf/cgroup.c#L762
        → cgroup_bpf_attach at kernel/cgroup/cgroup.c#L6496
        → __cgroup_bpf_attach at kernel/bpf/cgroup.c#L433
  12. __cgroup_bpf_attach
        /**
         * __cgroup_bpf_attach() - Attach the program or the link to a cgroup, and
         *                         propagate the change to descendants
         * @cgrp: The cgroup which descendants to traverse
         * @prog: A program to attach
         * @link: A link to attach
         * @replace_prog: Previously attached program to replace if BPF_F_REPLACE is set
         * @type: Type of attach operation
         * @flags: Option flags
         *
         * Exactly one of @prog or @link can be non-null.
         * Must be called with cgroup_mutex held.
         */
        int __cgroup_bpf_attach(struct cgroup *cgrp,
                                struct bpf_prog *prog, struct bpf_prog *replace_prog,
                                struct bpf_cgroup_link *link,
                                enum bpf_attach_type type, u32 flags)
  13. Check device permissions inside kernel space
    ⚫ For example, when mknod(2) is executed:
      • https://github.com/torvalds/linux/blob/v5.9/fs/namei.c#L3528
        → devcgroup_inode_mknod at include/linux/device_cgroup.h#L40
        → devcgroup_check_permission at security/device_cgroup.c#L835
        → BPF_CGROUP_RUN_PROG_DEVICE_CGROUP at include/linux/bpf-cgroup.h#L295
        → __cgroup_bpf_check_dev_permission at kernel/bpf/cgroup.c#L1125
  14. __cgroup_bpf_check_dev_permission
        int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
                                              short access, enum bpf_attach_type type)
        {
                struct cgroup *cgrp;
                struct bpf_cgroup_dev_ctx ctx = {
                        .access_type = (access << 16) | dev_type,
                        .major = major,
                        .minor = minor,
                };
                int allow = 1;

                rcu_read_lock();
                cgrp = task_dfl_cgroup(current);
                allow = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx,
                                           BPF_PROG_RUN);
                rcu_read_unlock();

                return !allow;
        }
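The `access_type` field packs two values into one 32-bit word: the access bits in the upper 16 bits and the device type in the lower 16. A minimal user-space sketch of that packing follows. The `DEVCG_*` constants mirror the values of the `BPF_DEVCG_*` constants in the UAPI header linux/bpf.h and are redefined here only so the sketch builds standalone; the helper names are hypothetical.

```c
#include <stdint.h>

/* Same values as BPF_DEVCG_ACC_* and BPF_DEVCG_DEV_* in linux/bpf.h. */
enum { DEVCG_ACC_MKNOD = 1, DEVCG_ACC_READ = 2, DEVCG_ACC_WRITE = 4 };
enum { DEVCG_DEV_BLOCK = 1, DEVCG_DEV_CHAR = 2 };

/* Pack as the kernel does: (access << 16) | dev_type. */
static uint32_t pack_access_type(uint16_t access, uint16_t dev_type)
{
    return ((uint32_t)access << 16) | dev_type;
}

/* A device program recovers the two halves like this. */
static uint16_t unpack_dev_type(uint32_t access_type)
{
    return (uint16_t)(access_type & 0xFFFF);
}

static uint16_t unpack_access(uint32_t access_type)
{
    return (uint16_t)(access_type >> 16);
}
```

Note the inverted return value in the kernel function: the attached program returns 1 to allow, while `__cgroup_bpf_check_dev_permission()` returns nonzero when access is denied.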
  15. Demo: Load device control settings from user space
    ⚫ Observe kernel functions when the kernel test code is executed.
    ⚫ The test code (test_dev_cgroup.c) only permits the devices below:
      • major: 1, minor: 5 (/dev/zero)
      • major: 1, minor: 9 (/dev/urandom)
    ⚫ bpftrace shows the values which are actually checked in the kernel.
        assert(system("mknod /tmp/test_dev_cgroup_null c 1 3"));
        assert(system("mknod /tmp/test_dev_cgroup_zero c 1 5") == 0);
        assert(system("dd if=/dev/urandom of=/dev/zero count=64") == 0);
        assert(system("dd if=/dev/urandom of=/dev/full count=64"));
        assert(system("dd if=/dev/random of=/dev/zero count=64"));
      The test code is from https://github.com/torvalds/linux/blob/v5.10-rc3/tools/testing/selftests/bpf/test_dev_cgroup.c
    [Expected outputs]
      1. major: 1, minor: 3
      2. major: 1, minor: 5
      3. major: 1, minor: 9
      4. major: 1, minor: 5
      5. major: 1, minor: 9 (if is allowed)
      6. major: 1, minor: 7 (of is forbidden)
      7. major: 1, minor: 8 (if is forbidden)
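The policy described on this slide can be written as a plain C predicate. The sketch below is a user-space model based only on the slide's description (allow character devices 1:5 and 1:9, deny everything else); the function name `allow_device()` is hypothetical, and the real selftest program may differ in detail.

```c
#include <stdint.h>

enum { DEVCG_DEV_CHAR = 2 };  /* same value as BPF_DEVCG_DEV_CHAR */

/*
 * Returns 1 to allow and 0 to deny, following the convention of cgroup
 * device programs. The device type sits in the lower 16 bits of
 * access_type.
 */
static int allow_device(uint32_t access_type, uint32_t major, uint32_t minor)
{
    uint32_t dev_type = access_type & 0xFFFF;

    /* Only character devices with major number 1 are considered. */
    if (dev_type != DEVCG_DEV_CHAR || major != 1)
        return 0;

    /* 1:5 is /dev/zero and 1:9 is /dev/urandom. */
    return minor == 5 || minor == 9;
}
```

Under this model, requests for 1:3 (/dev/null), 1:7 (/dev/full) and 1:8 (/dev/random) are denied, which matches the failing commands in the test code.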
  16. How to use the eBPF-based device controller
    ⚫ Many tools provide an abstraction layer.
    ⚫ OCI runtime spec
      • This specification was originally designed for cgroup v1.
      • But some container runtimes can translate the cgroup v1 configuration for v2.
    ⚫ libcgroup is currently developing support for the cgroup v2 interfaces.
      • https://github.com/libcgroup/libcgroup/issues/12
  17. Key takeaways
    ⚫ cgroup v2 has a lot of features.
      • Some, like uclamp, are useful not only for cloud systems but also for embedded systems.
    ⚫ cgroup v2 changed the interfaces and the way resources are controlled.
      • Some controllers are not supported in the default hierarchy.
    ⚫ eBPF is important for cgroup v2.