
cgroup v2 internals

Kenta Tada
December 05, 2020



Transcript

  1. Copyright 2020 Sony Corporation
    cgroup v2 internals
    The 13th Container Technology Information Exchange Meeting @ Online
    Kenta Tada
    R&D Center
    Sony Corporation

  2. About me
    ⚫Kenta Tada
    ⚫Software Engineer, Sony
    ⚫CloudNative Days Tokyo 2020
    • https://speakerdeck.com/kentatada/embedded-container-runtime-using-linux-capabilities-seccomp-cgroups

  3. Agenda
    ⚫Overview of cgroup v2
    ⚫Available cgroup v2 controllers
    ⚫eBPF-based device controller for cgroup v2

  4. Overview of cgroup v2

  5. Unified hierarchy
    ⚫In cgroup v2, all controllers live under a single unified hierarchy.
    ⚫If a parent cgroup disables a controller, it cannot be enabled in
    the descendant cgroups.
    • You control which controllers are enabled in the descendant cgroups by
    writing to the cgroup.subtree_control file.
    /sys/fs/cgroup
    /cgtest1
    /cpu.* /memory.*
    /cgtest2
    /cpu.* /memory.*
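
    As a rough illustration of the cgroup.subtree_control interface described above, the sketch below builds the space-separated "+name" / "-name" string that a tool would write to that file. The struct and function names are hypothetical, and the sketch only formats the string; it does not touch /sys/fs/cgroup.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper type: one requested change to cgroup.subtree_control. */
struct ctrl_change {
    const char *name;   /* controller name, e.g. "cpu" */
    int enable;         /* non-zero emits "+name", zero emits "-name" */
};

/*
 * Format the changes as a space-separated list such as "+cpu -io".
 * Returns 0 on success, or -1 if buf is too small to hold the result.
 */
int format_subtree_control(const struct ctrl_change *changes, size_t n,
                           char *buf, size_t len)
{
    size_t used = 0;

    if (len == 0)
        return -1;
    buf[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        int w = snprintf(buf + used, len - used, "%s%c%s",
                         i ? " " : "",
                         changes[i].enable ? '+' : '-',
                         changes[i].name);
        /* snprintf reports the length it wanted; detect truncation. */
        if (w < 0 || (size_t)w >= len - used)
            return -1;
        used += (size_t)w;
    }
    return 0;
}
```

    The resulting string would then be written to the parent's cgroup.subtree_control, for example /sys/fs/cgroup/cgtest1/cgroup.subtree_control, so that its children get the corresponding cpu.* and memory.* files.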

  6. New features
    ⚫nsdelegate and cgroup namespaces for unprivileged containers
    • However, some controllers still require privileges because they rely on eBPF.
    ⚫cgroup-aware OOM killer
    ⚫PSI per cgroup
    ⚫(NEW) utilization clamping support
    • Bounds the computational power assigned to task groups while accounting for the
    actual CPU frequency (which depends on how schedutil operates) and for
    asymmetric-capacity systems such as Arm's big.LITTLE.
    https://github.com/torvalds/linux/commit/2480c093130f64ac3a410504fa8b3db1fc4b87ce

  7. The difference between v1 and v2 at kernel 5.9
    v1 subsystem  Kernel object name      Source                                v2 subsystem
    blkio         io_cgrp_subsys          block/blk-cgroup.c                    io
    cpuacct       cpuacct_cgrp_subsys     kernel/sched/cpuacct.c                cpu
    cpu           cpu_cgrp_subsys         kernel/sched/core.c                   cpu
    cpuset        cpuset_cgrp_subsys      kernel/cgroup/cpuset.c                cpuset
    devices       devices_cgrp_subsys     security/device_cgroup.c              Using eBPF
    freezer       freezer_cgrp_subsys     (v1) kernel/cgroup/legacy_freezer.c   freezer
                                          (v2) kernel/cgroup/freezer.c
    hugetlb       hugetlb_cgrp_subsys     mm/hugetlb_cgroup.c                   hugetlb
    memory        memory_cgrp_subsys      mm/memcontrol.c                       memory
    net_cls       net_cls_cgrp_subsys     net/core/netclassid_cgroup.c          Using eBPF
    net_prio      net_prio_cgrp_subsys    net/core/netprio_cgroup.c             Using eBPF
    perf_event    perf_event_cgrp_subsys  kernel/events/core.c                  perf_event
    pids          pids_cgrp_subsys        kernel/cgroup/pids.c                  pids
    rdma          rdma_cgrp_subsys        kernel/cgroup/rdma.c                  rdma
    https://events.static.linuxfound.org/sites/events/files/slides/cgroup_and_namespaces.pdf

  8. Available cgroup v2 controllers

  9. How to confirm the available controllers
    ⚫cgroup.controllers
    • Ex. /sys/fs/cgroup/cgroup.controllers
    • Each cgroup has a “cgroup.controllers” file which lists all
    controllers available for the cgroup to enable
    ⚫However, some controllers are not listed in this file.
    ⚫Why?
    https://www.kernel.org/doc/html/v5.9/admin-guide/cgroup-v2.html

  10. Implicit or inhibited controllers
    ⚫Some controllers are not supported in the default hierarchy.
    • cgrp_dfl_inhibit_ss_mask
    ⚫Some controllers are implicitly enabled on the default
    hierarchy.
    • cgrp_dfl_implicit_ss_mask
    ⚫When the system boots, the kernel sets up the above masks.
        if (ss->implicit_on_dfl)
            cgrp_dfl_implicit_ss_mask |= 1 << ss->id;
        else if (!ss->dfl_cftypes)
            cgrp_dfl_inhibit_ss_mask |= 1 << ss->id;
    https://github.com/torvalds/linux/blob/v5.9/kernel/cgroup/cgroup.c#L5740-L5743
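
    To make the boot-time mask setup concrete, here is a small user-space model of the loop above. It is only a sketch: struct toy_subsys and struct dfl_masks are hypothetical stand-ins for the kernel's structures, and the has_dfl_cftypes flag models the ss->dfl_cftypes pointer being non-NULL.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cgroup_subsys: only the fields the check reads. */
struct toy_subsys {
    int id;                 /* bit position of this controller in the masks */
    bool implicit_on_dfl;   /* always enabled on the default (v2) hierarchy */
    bool has_dfl_cftypes;   /* models ss->dfl_cftypes != NULL */
};

/* Hypothetical container for the two global masks. */
struct dfl_masks {
    uint16_t implicit;  /* models cgrp_dfl_implicit_ss_mask */
    uint16_t inhibit;   /* models cgrp_dfl_inhibit_ss_mask */
};

/* Apply the same if / else-if classification to every controller. */
struct dfl_masks build_dfl_masks(const struct toy_subsys *ss, size_t n)
{
    struct dfl_masks m = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        if (ss[i].implicit_on_dfl)
            m.implicit |= (uint16_t)(1u << ss[i].id);
        else if (!ss[i].has_dfl_cftypes)
            m.inhibit |= (uint16_t)(1u << ss[i].id);
    }
    return m;
}
```

    A controller with v2 interface files that is not implicit lands in neither mask, which is what makes it an ordinary, user-controllable controller.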

  11. cgroup.controllers in kernel
    ⚫cgroup core interface files are defined as below:
        .name = "cgroup.controllers",
        .seq_show = cgroup_controllers_show,
    ⚫cgroup_controllers_show() calls cgroup_control().
    ⚫cgroup_control() returns the visible controllers using the masks.
        static u16 cgroup_control(struct cgroup *cgrp)
        {
            /* ... */
            if (cgroup_on_dfl(cgrp))
                root_ss_mask &= ~(cgrp_dfl_inhibit_ss_mask |
                                  cgrp_dfl_implicit_ss_mask);
            return root_ss_mask;
        }
    https://github.com/torvalds/linux/blob/v5.9/kernel/cgroup/cgroup.c
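
    The filtering step can be tried outside the kernel with a few lines of C. This is a sketch of only the branch shown on the slide (the root cgroup case); the function name and the on_dfl parameter, which stands in for cgroup_on_dfl(cgrp), are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the excerpt above: on the default (v2) hierarchy, controllers
 * that are inhibited or implicitly enabled are hidden from the list that
 * cgroup.controllers reports.
 */
uint16_t root_visible_controllers(uint16_t root_ss_mask,
                                  uint16_t inhibit_mask,
                                  uint16_t implicit_mask,
                                  bool on_dfl)
{
    if (on_dfl)
        root_ss_mask &= (uint16_t)~(inhibit_mask | implicit_mask);
    return root_ss_mask;
}
```

    With three controllers bound to the root (mask 0x7), one inhibited (0x4) and one implicit (0x2), only bit 0 remains visible on the default hierarchy, which is why some controllers never appear in cgroup.controllers.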

  12. eBPF-based device controller for cgroup v2

  13. Device controller in cgroup v1
    ⚫The device controller allows or denies access to devices on a
    per-cgroup basis.
    ⚫There are three files to control behavior.
    • devices.allow is the allowlist of devices.
    • devices.deny is the denylist of devices.
    • devices.list shows available devices.
    ⚫Interface
    • Ex. Allow cgroup 1 to read and mknod /dev/null as below:
    # echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow
    https://www.kernel.org/doc/html/v5.9/admin-guide/cgroup-v1/devices.html
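
    The rule string in the example above has the shape "<type> <major>:<minor> <access>". As a sketch of how that string breaks down, the following parser splits it into its fields. The struct and function are hypothetical, and the wildcard form ('*' for major or minor) is deliberately not handled.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical decoded form of one devices.allow / devices.deny entry. */
struct dev_rule {
    char type;          /* 'a' (all), 'b' (block) or 'c' (char) */
    unsigned major;
    unsigned minor;
    int can_read;       /* 'r' */
    int can_write;      /* 'w' */
    int can_mknod;      /* 'm' */
};

/* Parse e.g. "c 1:3 mr". Returns 0 on success, -1 on malformed input. */
int parse_dev_rule(const char *s, struct dev_rule *out)
{
    char access[4] = "";
    char type;
    unsigned major, minor;

    if (sscanf(s, " %c %u:%u %3s", &type, &major, &minor, access) != 4)
        return -1;
    if (type != 'a' && type != 'b' && type != 'c')
        return -1;

    memset(out, 0, sizeof(*out));
    out->type = type;
    out->major = major;
    out->minor = minor;

    /* Each access letter sets one permission; anything else is an error. */
    for (const char *p = access; *p; p++) {
        switch (*p) {
        case 'r': out->can_read = 1; break;
        case 'w': out->can_write = 1; break;
        case 'm': out->can_mknod = 1; break;
        default:  return -1;
        }
    }
    return 0;
}
```

    For 'c 1:3 mr' this yields a character device with major 1, minor 3, with read and mknod permitted and write left unset.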

  14. Device controller in cgroup v2
    ⚫cgroup v2 uses eBPF for several purposes.
    • Ex. To control network access
    ⚫The eBPF program is attached to a specific cgroup.
    ⚫BPF_PROG_TYPE_CGROUP_DEVICE was introduced in
    kernel version 4.15.

  15. What is eBPF
    ⚫eBPF is a revolutionary technology that can run sandboxed
    programs in the Linux kernel without changing kernel
    source code.
    https://ebpf.io/
    http://www.brendangregg.com/ebpf.html

  16. Attach device control settings inside kernel space
    ⚫Attach the eBPF program to a cgroup
    • https://github.com/torvalds/linux/blob/v5.9/kernel/bpf/syscall.c#L4187
    → bpf_prog_attach at kernel/bpf/syscall.c#L2839
    → cgroup_bpf_prog_attach at kernel/bpf/cgroup.c#L762
    → cgroup_bpf_attach at kernel/cgroup/cgroup.c#L6496
    → __cgroup_bpf_attach at kernel/bpf/cgroup.c#L433

  17. __cgroup_bpf_attach
    /**
     * __cgroup_bpf_attach() - Attach the program or the link to a cgroup, and
     *                         propagate the change to descendants
     * @cgrp: The cgroup which descendants to traverse
     * @prog: A program to attach
     * @link: A link to attach
     * @replace_prog: Previously attached program to replace if BPF_F_REPLACE is set
     * @type: Type of attach operation
     * @flags: Option flags
     *
     * Exactly one of @prog or @link can be non-null.
     * Must be called with cgroup_mutex held.
     */
    int __cgroup_bpf_attach(struct cgroup *cgrp,
                            struct bpf_prog *prog, struct bpf_prog *replace_prog,
                            struct bpf_cgroup_link *link,
                            enum bpf_attach_type type, u32 flags)

  18. Check device permissions inside kernel space
    ⚫For example, when mknod(2) is executed:
    • https://github.com/torvalds/linux/blob/v5.9/fs/namei.c#L3528
    → devcgroup_inode_mknod at include/linux/device_cgroup.h#L40
    → devcgroup_check_permission at security/device_cgroup.c#L835
    → BPF_CGROUP_RUN_PROG_DEVICE_CGROUP at include/linux/bpf-cgroup.h#L295
    → __cgroup_bpf_check_dev_permission at kernel/bpf/cgroup.c#L1125

  19. __cgroup_bpf_check_dev_permission
    int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
                                          short access, enum bpf_attach_type type)
    {
        struct cgroup *cgrp;
        struct bpf_cgroup_dev_ctx ctx = {
            .access_type = (access << 16) | dev_type,
            .major = major,
            .minor = minor,
        };
        int allow = 1;

        rcu_read_lock();
        cgrp = task_dfl_cgroup(current);
        allow = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx,
                                   BPF_PROG_RUN);
        rcu_read_unlock();

        return !allow;
    }
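
    The access_type field in the context above packs two 16-bit values into one 32-bit word: the access bits in the upper half and the device type in the lower half. The sketch below shows that packing and the matching unpacking an attached program would do. The constant values mirror BPF_DEVCG_DEV_* and BPF_DEVCG_ACC_* from the kernel UAPI header linux/bpf.h, but they are redefined here under local, hypothetical names so the example builds without kernel headers.

```c
#include <stdint.h>

/* Local copies of the UAPI values (assumed to match linux/bpf.h). */
enum { DEVCG_DEV_BLOCK = 1, DEVCG_DEV_CHAR = 2 };
enum { DEVCG_ACC_MKNOD = 1, DEVCG_ACC_READ = 2, DEVCG_ACC_WRITE = 4 };

/* Same layout as ".access_type = (access << 16) | dev_type". */
uint32_t pack_access_type(uint16_t access, uint16_t dev_type)
{
    return ((uint32_t)access << 16) | dev_type;
}

/* Lower 16 bits: block or character device. */
uint16_t unpack_dev_type(uint32_t access_type)
{
    return (uint16_t)(access_type & 0xFFFF);
}

/* Upper 16 bits: requested mknod / read / write access. */
uint16_t unpack_access(uint32_t access_type)
{
    return (uint16_t)(access_type >> 16);
}
```

    A mknod request on a character device therefore arrives as 0x00010002, and the program recovers both halves before comparing them with its policy.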

  20. Demo : Load device control settings from user space
    ⚫Observe kernel functions while the kernel test code is
    executed.
    ⚫The test code (test_dev_cgroup.c) permits only the following devices:
    • major: 1, minor: 5 (/dev/zero)
    • major: 1, minor: 9 (/dev/urandom)
    ⚫bpftrace shows the values that are actually checked in the kernel.
    assert(system("mknod /tmp/test_dev_cgroup_null c 1 3"));
    assert(system("mknod /tmp/test_dev_cgroup_zero c 1 5") == 0);
    assert(system("dd if=/dev/urandom of=/dev/zero count=64") == 0);
    assert(system("dd if=/dev/urandom of=/dev/full count=64"));
    assert(system("dd if=/dev/random of=/dev/zero count=64"));
    The test code is from
    https://github.com/torvalds/linux/blob/v5.10-rc3/tools/testing/selftests/bpf/test_dev_cgroup.c
    [Expected outputs]
    1. major: 1, minor: 3
    2. major: 1, minor: 5
    3. major: 1, minor: 9
    4. major: 1, minor: 5
    5. major: 1, minor: 9 (if= is allowed)
    6. major: 1, minor: 7 (of= is forbidden)
    7. major: 1, minor: 8 (if= is forbidden)
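
    The allow/deny decisions behind these outputs can be reproduced with a plain C model of the policy stated on this slide (only character devices 1:5 and 1:9 are permitted). This is not the eBPF program from the selftest; the function name and signature are hypothetical, and it only mirrors the stated rule.

```c
#include <stdbool.h>

/*
 * Model of the demo policy: allow character devices with major 1 and
 * minor 5 (/dev/zero) or minor 9 (/dev/urandom); deny everything else.
 */
bool demo_policy_allows(bool is_char_dev, unsigned major, unsigned minor)
{
    if (!is_char_dev || major != 1)
        return false;
    return minor == 5 || minor == 9;
}
```

    Under this rule 1:3 (/dev/null), 1:7 (/dev/full) and 1:8 (/dev/random) are rejected, which matches the commands the test expects to fail.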

  21. Demo : Load device control settings from user space

  22. How to use eBPF-based device controller
    ⚫Many tools provide an abstraction layer.
    ⚫OCI runtime spec
    • This specification was originally designed for cgroup v1.
    • However, some container runtimes can translate a cgroup v1
    configuration into cgroup v2 settings.
    ⚫libcgroup is currently developing support for the cgroup v2
    interfaces.
    • https://github.com/libcgroup/libcgroup/issues/12

  23. Key takeaways
    ⚫cgroup v2 offers many features.
    • Some, such as uclamp, are useful not only for cloud systems but also
    for embedded systems.
    ⚫cgroup v2 changed both the interfaces and the way resources are
    controlled.
    • Some controllers are not supported in the default
    hierarchy.
    ⚫eBPF is important for cgroup v2.

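  24. SONY is a registered trademark or trademark of Sony Corporation.
    The product and service names of Sony products are registered trademarks or trademarks of Sony Corporation or its group companies. Other product and company names are trade names, registered trademarks, or trademarks of their respective owners.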
