How cgroup-v2 and PSI Impacts Cloud Native?


@ CloudNative Days Tokyo 2019, in Toranomon


Reference sites that didn't make it onto the slides / More to read

The cgroup-related articles in 「LXCで学ぶコンテナ入門 -軽量仮想化環境を実現する技術」 ("Introduction to Containers with LXC", in Japanese)

My own blog (in Japanese)

いますぐ実践! Linux システム管理 / Vol.228 ("Hands-on Linux System Administration Right Now!", Vol. 228, in Japanese)



July 23, 2019


  1. How cgroup-v2 and PSI Impacts Cloud Native? Uchio Kondo /

    GMO Pepabo, Inc. 2019.07.23 CloudNative Days Tokyo 2019 Image from pixabay
  2. Uchio Kondo / @udzura Dev Productivity Team

    @ GMO Pepabo, Inc. RubyKaigi 2019 Local Organizer && Speaker CNDF 2019 Chair Community Organizer: #Fu-Kubernetes Duolingo heavy user (seems to use Envoy :)
  3. ToC: •What is cgroup ??? •cgroups in Docker / OCI

    •cgroup v2 •Introduction to PSI (Pressure Stall Information) •Trying PSI •Future of overload detection
  4. What is cgroup?

  5. cgroup •One of the core features of container functionality in the

    Linux kernel •Groups processes/tasks, and controls OS resources (such as CPU, memory, IO, number of processes) per group •Both limitation and statistics for resources are available •https://gihyo.jp/admin/serial/01/linux_containers
  6. Cgroupfs •Mounted by default on modern Linux distros

    •cgroup is controllable through this filesystem via ordinary file-operation syscalls, such as open(2), read(2), write(2), mkdir(2)... •No special syscall is required!
  7. Solo usage (v1)

    $ sudo mkdir /sys/fs/cgroup/cpu/cndt2019
    $ cat /sys/fs/cgroup/cpu/cndt2019/cpu.cfs_period_us
    100000
    $ echo 20000 | \
        sudo tee /sys/fs/cgroup/cpu/cndt2019/cpu.cfs_quota_us   # Set quota per period
    20000
    $ echo $$ | \
        sudo tee /sys/fs/cgroup/cpu/cndt2019/tasks              # Assign task to a group
    6351
    $ yes
  8. Limit CPU for yes(1) to 20%

  9. Detailed cgroup features in v1 •cpu •memory •blkio •pids •net_cls

    •freezer ...
  10. cpu/cpuset controller •Limit CPU time in user/sys, and get stats

    •cpuset for CPU affinity
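    For instance, a hypothetical v1 group "group-a" can be pinned to CPUs 0-1 like this (a minimal sketch; paths assume the standard cgroupfs mount):
    $ sudo mkdir /sys/fs/cgroup/cpuset/group-a
    $ echo 0-1 | sudo tee /sys/fs/cgroup/cpuset/group-a/cpuset.cpus
    $ echo 0   | sudo tee /sys/fs/cgroup/cpuset/group-a/cpuset.mems   # cpuset.mems must also be set before adding tasks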
  11. quota/period •cpu.cfs_period_us = 100000 •cpu.cfs_quota_us = 20000

    •20000 / 100000 = 20%
  12. memory controller •Limits/stats memory usage •memory / swap memory / kernel memory

    $ echo 256M > group-a/memory.limit_in_bytes
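    The v1 memory controller also exposes read-only stat files next to the limit, e.g. (paths follow the group above):
    $ cat group-a/memory.usage_in_bytes
    $ cat group-a/memory.stat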
  13. blkio controller •Limits/stats IO to block devices

    $ echo "8:0 10" > group-a/blkio.throttle.{read,write}_iops_device        # 8:0 = device major:minor (/dev/sda)
    $ echo "8:0 10485760" > group-a/blkio.throttle.{read,write}_bps_device
  14. pids controller •Limits/checks the number of tasks in a group

    •Prevents a fork-bomb attack
    $ echo 1024 > group-a/pids.max
  15. net_cls controller •Marks the packets from the group •The mark is usable

    from iptables or tc
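    A minimal sketch of how the mark gets consumed (the classid and device name are illustrative):
    $ echo 0x00100001 > group-a/net_cls.classid                      # tc reads this as class 10:1
    $ tc filter add dev eth0 parent 10: protocol ip handle 1: cgroup
    $ iptables -A OUTPUT -m cgroup --cgroup 0x00100001 -j ACCEPT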
  16. freezer controller •Freeze all of the group’s tasks •Then, resume
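    Both operations go through a single file in v1 (a minimal sketch):
    $ echo FROZEN > group-a/freezer.state   # freeze every task in the group
    $ echo THAWED > group-a/freezer.state   # resume them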

  17. perf_event •Collect and monitor perf events per group •This

    may be interesting if used with eBPF...
  18. Others •devices - device whitelist •hugetlb - limit the HugeTLB

    usage •rdma - limit RDMA/IB specific resources
  19. cgroups in Docker/containers

  20. cgroup used by Docker •Docker uses cgroup! •Once a container

    is booted, a new cgroup is created for each Docker container, named /docker/${CONTAINER_ID} •Inside a container, /proc/self/cgroup contains the /docker/* string, so some tools judge whether they are running inside a container via cgroup •Also the key to map a host PID -> container ID
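    On a v1 host this looks roughly like the following (an illustrative, shortened container ID; the exact controller list varies):
    $ cat /proc/self/cgroup
    12:cpu,cpuacct:/docker/ce8badb1a1...
    11:memory:/docker/ce8badb1a1...
    ...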

  22. In OCI spec •Mentioned as “Linux Container Configuration”

    •Details are defined per controller: CPU, Memory, Pids... •These are good documents for understanding each controller
  23. cgroup v2

  24. A short history of cgroup •“Process containers” in 2007

  25. In 2007 •Process containers -> control groups •cgroup was merged

    into the kernel mainline in version 2.6.24
  26. Redesigning •Started in 2013 •First merged in kernel version 3.16...

    and still developing •The v2 filesystem can be mounted alongside v1
  27. Features of cgroup-v2 •Unified hierarchy •Processes can belong only to

    leaves •Cgroup-aware OOM killer (OOM group) •nsdelegate •and... PSI!!
  28. Unified hierarchy

    v1: one tree per controller
      /sys/fs/cgroup/cpu/{group-a,group-b}
      /sys/fs/cgroup/memory/{group-a,group-c}
      /sys/fs/cgroup/blkio/...
    v2: one unified tree
      /sys/fs/cgroup/group-a/{cpu.*,memory.*,io.*,...}
      /sys/fs/cgroup/group-b/{cpu.*,memory.*,io.*,...}
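    On a v2 host you can see which controllers are usable directly in the tree (a minimal sketch; the mountpoint and output vary by setup):
    $ mount -t cgroup2 none /sys/fs/cgroup    # if not mounted yet
    $ cat /sys/fs/cgroup/cgroup.controllers
    cpu io memory pids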
  29. Processes can belong only to leaves

    /sys/fs/cgroup                  ⭕ (root is OK)
    /parent-a                       ⭕
    /parent-b                       ❎ (has children)
    /parent-b/child-c               ⭕
    /parent-b/child-d               ❎ (has children)
    /parent-b/child-d/grandchild-e  ⭕
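    The kernel enforces this once controllers are delegated downwards (a hedged sketch; group names follow the figure above):
    $ echo "+cpu +memory" > /sys/fs/cgroup/parent-b/cgroup.subtree_control
    $ echo $$ > /sys/fs/cgroup/parent-b/cgroup.procs
    echo: write error: Device or resource busy    # parent-b is now an internal node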
  30. Cgroup-aware OOM killer •Can kill all of the processes in a

    group when it runs out of memory •To avoid partial kills in integrated workloads
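    In v2 this is opt-in per group via a single knob (a minimal sketch):
    $ echo 1 > /sys/fs/cgroup/group-a/memory.oom.group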
  31. nsdelegate •“Consider cgroup namespaces as delegation boundaries” •Works with the cgroup

    namespace •cgroup namespace = cgroup’s “chroot” •Without nsdelegate: a process can be moved to a group outside the namespace if that group is visible
 (caused by a remaining host-namespace mount, for example) •With nsdelegate: this behavior is prevented
  32. Mounted with -o nsdelegate

    Inside a cgroup NS (unshare --cgroup): its own groups /ns-g-a, /ns-g-b are modifiable,
    but /sys/fs/cgroup/group-a, /group-b outside the namespace cannot be modified ❎
    ...regardless of file access control
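    The option is applied at mount time; on a running system it can be toggled with a remount from the init namespace (a sketch; recent systemd versions already mount cgroup2 this way):
    $ sudo mount -o remount,nsdelegate -t cgroup2 none /sys/fs/cgroup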
  33. Introduction to PSI (Pressure stall information)

  34. Pressure Stall Information •A new pressure metric for the Linux kernel

    •Developed and maintained by a team at Facebook •Available both system-wide and per cgroup-v2 group
  35. How to check the PSI •It creates files: •/proc/pressure/{cpu,memory,io} for

    the system-wide view •/cgroup/group-name/{cpu,memory,io}.pressure
 when v2 is enabled •The format shows averages over 10 sec, 60 sec and 300 sec
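    The files look like this (values illustrative; at the time, cpu.pressure reported only the "some" line):
    $ cat /proc/pressure/memory
    some avg10=0.23 avg60=1.17 avg300=0.84 total=12345678
    full avg10=0.16 avg60=0.81 avg300=0.61 total=10101010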
  36. How to understand the PSI •Picture from Facebook’s official document:

    •If (some|all) of the tasks are delayed due to lack of a resource
 for 45 sec out of 1 min, the PSI is 75.00, for example •This resembles Load Average (and PSI is available per container)
  37. Trying PSI!

  38. How to enable... cgroup-v2 and PSI •Steps: •Rebuild your Linux

    kernel at version 4.20 or later... •with CONFIG_PSI enabled •Boot the kernel with special parameters: •psi=1 to activate PSI (this is disabled... by the default config) •systemd.unified_cgroup_hierarchy=1 to disable all v1 usage in systemd!!! (BTW PSI itself is available in a mixed env)
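    On a Debian/Ubuntu-style box the whole procedure roughly boils down to (a sketch; config and grub paths vary by distro):
    $ grep CONFIG_PSI /boot/config-$(uname -r)
    CONFIG_PSI=y
    # then, in /etc/default/grub:
    GRUB_CMDLINE_LINUX="psi=1 systemd.unified_cgroup_hierarchy=1"
    $ sudo update-grub && sudo reboot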
  39. Small demo •Using Docker on a PSI-enabled kernel (5.2)

  40. Benchmarking •Running ApacheBench against Apache containers on LXC... •LXC

    fully supports cgroup v2 now •Using AB params: ab -c 300 -t 60 -n 500000 •Switching cgroup params via $N:
 echo "$N 100000" > /sys/fs/cgroup/lxc/${name}/cpu.max •And... checking /sys/fs/cgroup/.../cpu.pressure periodically
  41. Graph

  42. c.f. Load Average in host •Note that the host has … cores

  43. Consideration •When the admin distributes fewer resources to a cgroup, PSI

    tends to get a higher score. •When one container gets overloaded, LA also scores higher, but the order of magnitude is smaller. It may be ignored depending on the situation (like... a many-core host)
  44. Future of overload detection

  45. Generally, detecting a container overload is hard

  46. Key metrics/tools are changing •Host-wide: •Load Average •Memory usage •psutils, top,

    vmstat... •netstat, iostat •syslog, auditd •perf •Per-Container: •Cgroup stat •PSI (especially) •eBPF (per process) •USDT, syscalls... •sysdig/falco •perf --cgroup
  47. Detecting per-container overload •There are (or will be) some ways:

    •measurement, measurement, measurement •Existing stats are trackable with cAdvisor && Prometheus •Enable PSI and watch the score •Use perf with cgroup’s perf_event
  48. perf with containers •perf_event can be created per container (cgroup), but

    there seem to be few examples •Example by brendangregg •More examples: counting syscalls by container
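    A hedged sketch of the idea (the cgroup path is illustrative and is resolved relative to the perf_event mount):
    $ perf stat -e raw_syscalls:sys_enter -a \
        --cgroup docker/ce8badb1a1... sleep 10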
  49. eBPF •eBPF is an emerging technology for tracing and measuring

    all Linux events, and it is faster and more powerful than ever. •eBPF can trace perf_events directly !!! •BCC is a human-readable wrapper for eBPF •Available in C++, Python, Lua / being ported to mruby •e.g. snooping syscalls (such as execve/open) by container •My colleague’s tool - works by checking /proc/PID/cgroup
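    BCC ships ready-made tools for exactly this kind of snooping (paths as in the upstream install; packaging differs per distro):
    $ sudo /usr/share/bcc/tools/execsnoop    # trace new processes via exec()
    $ sudo /usr/share/bcc/tools/opensnoop    # trace open(2) calls system-wide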
  50. “Using eBPF in Kubernetes” •eBPF can control and trace networking
  51. Migrating v1 -> v2

  52. Case study •cgroup v2 migration at Google

  53. Case study •Mixing cgroupfs v1 & cgroupfs v2: finding solutions

    for container runtimes •original/Christian_Brauner_1.pdf
  54. V2 support is now under discussion •Tasks in the OCI

    runtime spec:
  55. FYI: OCI with cgroup-v2 support •crun is a small OCI

    compatible runtime, and they say it supports cgroup v2 •It defines v1 -> v2 mappings
  56. Conclusion

  57. Future of container tracing •OCI and container developers are working

    hard on cgroup-v2 •Using PSI, we can get pressure information per container in a human-readable way •But it is only available with cgroup-v2, and thus newer kernels •More tracing technology is coming; eBPF and BCC are especially interesting among them