How cgroup-v2 and PSI Impacts Cloud Native?

@ CloudNative Days Tokyo 2019, in Toranomon

----

References I forgot to put on the slides / More to read

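cgroup-related articles from the series 「LXCで学ぶコンテナ入門 -軽量仮想化環境を実現する技術」 ("Learning Containers with LXC: Technologies for Lightweight Virtualization", in Japanese)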
* https://gihyo.jp/admin/serial/01/linux_containers/0037
* https://gihyo.jp/admin/serial/01/linux_containers/0038
* https://gihyo.jp/admin/serial/01/linux_containers/0039
* https://gihyo.jp/admin/serial/01/linux_containers/0040
* https://gihyo.jp/admin/serial/01/linux_containers/0041

My blog (in Japanese)
* https://udzura.hatenablog.jp/entry/2019/02/14/194244

いますぐ実践! Linux システム管理 ("Practice Now! Linux System Administration") / Vol.228 (in Japanese)
* http://www.usupi.org/sysad/228.html

KONDO Uchio

July 23, 2019
Transcript

  1. 1.

    How cgroup-v2 and PSI Impacts Cloud Native?
    Uchio Kondo / GMO Pepabo, Inc.
    2019.07.23 CloudNative Days Tokyo 2019
    Image from pixabay: https://pixabay.com/images/id-3193865/
  2. 2.

    Uchio Kondo / @udzura
    https://blog.udzura.jp/
    Dev Productivity Team @ GMO Pepabo, Inc.
    RubyKaigi 2019 Local Organizer && Speaker
    CNDF 2019 Chair
    Community Organizer: Fu-Kubernetes
    Duolingo heavy user (seems to use Envoy :)
  3. 3.

    ToC:
    •What is cgroup ???
    •cgroups in Docker / OCI
    •cgroup v2
    •Introduction to PSI (Pressure Stall Information)
    •Trying PSI
    •Future of overload detection
  4. 5.

    cgroup
    •One of the core features behind container functionality in the Linux kernel
    •Groups processes/tasks and controls OS resources (such as CPU, memory, IO, number of processes) per group
    •Both limits and statistics for resources are available
    https://gihyo.jp/admin/serial/01/linux_containers
  5. 6.

    Cgroupfs
    •Mounted by default on modern Linux distros
    •cgroup is controlled through this filesystem with ordinary file-operation syscalls such as open(2), read(2), write(2), mkdir(2)...
    •No special syscall is required!
  6. 7.

    Solo usage (v1)
    $ sudo mkdir /sys/fs/cgroup/cpu/cndt2019
    $ cat /sys/fs/cgroup/cpu/cndt2019/cpu.cfs_period_us
    100000
    # Set quota per period
    $ echo 20000 | \
        sudo tee /sys/fs/cgroup/cpu/cndt2019/cpu.cfs_quota_us
    20000
    # Assign task to a group
    $ echo $$ | \
        sudo tee /sys/fs/cgroup/cpu/cndt2019/tasks
    6351
    $ yes
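The quota and period written above determine how much of one CPU the group may use. A minimal sketch of that arithmetic follows, assuming the cgroup v1 file names shown on the slide; `cpu_limit_percent` is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# Print the CPU limit of a cgroup v1 cpu group as a percentage of one CPU.
# $1: path to the group directory (hypothetical helper for illustration).
cpu_limit_percent() {
    dir=$1
    quota=$(cat "$dir/cpu.cfs_quota_us")
    period=$(cat "$dir/cpu.cfs_period_us")
    # A quota of -1 means "no limit" in cgroup v1.
    if [ "$quota" -lt 0 ]; then
        echo unlimited
        return 0
    fi
    # Integer arithmetic is enough for a rough figure: quota * 100 / period.
    echo $((quota * 100 / period))
}
```

With the values from the slide (quota 20000, period 100000), `cpu_limit_percent /sys/fs/cgroup/cpu/cndt2019` would print 20, so the `yes` process is held to roughly a fifth of one CPU.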
  7. 13.

    blkio controller
    •Limits/stats IO to block devices
    $ echo "8:0 10" | tee group-a/blkio.throttle.{read,write}_iops_device
    $ echo "8:0 10485760" | tee group-a/blkio.throttle.{read,write}_bps_device
  8. 14.

    pids controller
    •Limits/checks the number of tasks in a group
    •Prevents a fork-bomb attack
    $ echo 1024 > group-a/pids.max
  9. 17.
  10. 18.

    Others
    •devices - device whitelist
    •hugetlb - limit the HugeTLB usage
    •rdma - limit RDMA/IB specific resources
  11. 20.

    cgroup used by Docker
    •Docker uses cgroup!
    •Once a container is booted, a new cgroup named /docker/${CONTAINER_ID} is created for each Docker container
    •Inside a container, /proc/self/cgroup contains a /docker/* string, so some tools use it to judge whether they are running inside a container
    •It is also the key for mapping a host PID to a container ID
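The PID-to-container mapping described above can be sketched as a small lookup. This assumes the /docker/${CONTAINER_ID} layout from the slide; other cgroup drivers may name the group differently, and `docker_container_id` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the Docker container ID found in a /proc/<pid>/cgroup style file.
# $1: file to read (defaults to /proc/self/cgroup).
# Returns 1 and prints nothing when no /docker/<id> path is present.
docker_container_id() {
    file=${1:-/proc/self/cgroup}
    # Each line looks like "<hierarchy-id>:<controllers>:<path>".
    id=$(sed -n 's|^[0-9]*:[^:]*:/docker/\([0-9a-f]\{1,\}\).*$|\1|p' "$file" | head -n 1)
    if [ -n "$id" ]; then
        echo "$id"
    else
        return 1
    fi
}
```

On the host, `docker_container_id /proc/6351/cgroup` would look up the container that owns PID 6351 (the PID is only an example); inside a container, calling it without arguments doubles as an "am I in a container?" check.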
  12. 22.

    In OCI spec
    •Mentioned as “Linux Container Configuration”
    •https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#control-groups
    •Details are defined per controller: CPU, Memory, Pids...
    •These are good documents for understanding each controller
  13. 23.
  14. 25.
  15. 26.

    Redesigning
    •Started in 2013
    •First merged in kernel version 3.16... and still under development
    •The v2 filesystem can be mounted alongside v1
  16. 27.

    Features of cgroup-v2
    •Unified hierarchy
    •Processes can belong only to leaves
    •Cgroup-aware OOM killer (OOM group)
    •nsdelegate
    •and... PSI!!
  17. 28.

    Unified hierarchy
    v2 (unified): /sys/fs/cgroup
        /group-a  /cpu.* /memory.* /io.* ...
        /group-b  /cpu.* /memory.* /io.* ...
    v1 (per controller): /sys/fs/cgroup
        /cpu /memory /blkio, each with its own groups (/group-a /group-b /group-a /group-c)
  18. 29.

    Processes can belong only to leaves
    /sys/fs/cgroup ⭕ (root is OK)
    /parent-a ⭕
    /parent-b ❎
    /parent-b/child-c ⭕
    /parent-b/child-d ❎
    /parent-b/child-d/grandchild-e ⭕
  19. 30.

    Cgroup-aware OOM killer
    •Can kill all the processes in a group together when the OOM killer is triggered
    •https://lwn.net/Articles/761582/
    •memory.oom.group
    •Avoids partial kills of integrated workloads
  20. 31.

    nsdelegate
    •“Consider cgroup namespaces as delegation boundaries”
    •Works together with cgroup namespaces
    •cgroup namespace = cgroup’s “chroot”
    •Without nsdelegate: a process can be moved to a group outside the namespace if that group is visible (because a host-namespace mount remains, for example)
    •With nsdelegate: this behavior is prevented
  21. 32.

    Mounted with -o nsdelegate
    /sys/fs/cgroup: /group-a /group-b
    unshare --cgroup → Cgroup NS rooted at /: /ns-g-a /ns-g-b are modifiable
    Groups outside the Cgroup NS: cannot modify ❎ ...regardless of file access control
  22. 34.

    Pressure Stall Information
    •A new pressure metric for the Linux kernel
    •Developed and maintained by a Facebook team
    •Available both system-wide and per cgroup (v2)
    https://facebookmicrosites.github.io/psi/
  23. 35.

    How to check the PSI
    •It creates files:
    •/proc/pressure/{cpu,memory,io} for system-wide
    •/cgroup/group-name/{cpu,memory,io}.pressure when v2 is enabled
    •Format is like:
    •Averages over 10 sec, 60 sec and 300 sec
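The pressure files are plain text, so they can be read with standard tools. The sketch below assumes the line layout described in the kernel's PSI documentation (`some avg10=... avg60=... avg300=... total=...`, with a `full` line for memory and io); `psi_field` is a hypothetical helper name.

```shell
#!/bin/sh
# Print one field of a PSI file.
# $1: file, e.g. /proc/pressure/cpu or a cgroup's cpu.pressure
# $2: line kind: "some" or "full"
# $3: field name: avg10, avg60, avg300 or total
psi_field() {
    awk -v kind="$2" -v key="$3" '
        $1 == kind {
            # Fields after the kind are key=value pairs.
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) { print kv[2]; found = 1 }
            }
        }
        END { if (!found) exit 1 }
    ' "$1"
}
```

For example, `psi_field /proc/pressure/memory some avg10` would print the 10-second average for the system-wide memory pressure on a kernel with PSI enabled.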
  24. 36.

    How to understand the PSI
    •Picture from Facebook’s official document:
    •If (some|all) tasks are delayed due to a lack of resources for 45 sec out of 1 min, the PSI is 75.00, for example
    •This resembles Load Average (and PSI is available per container)
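The 75.00 figure above is simply the stalled share of the window. A minimal sketch of that arithmetic, with `stall_percent` as a hypothetical helper; the kernel's own running averages may not match this plain ratio exactly.

```shell
#!/bin/sh
# Share of a time window during which tasks were stalled, as a percentage.
# $1: stalled seconds, $2: window length in seconds
stall_percent() {
    awk -v stalled="$1" -v window="$2" \
        'BEGIN { printf "%.2f\n", stalled * 100 / window }'
}
```

`stall_percent 45 60` prints 75.00, matching the example on the slide.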
  25. 38.

    How to enable... cgroup-v2 and PSI
    •Procedure:
    •Rebuild your Linux with kernel 4.20 or newer...
    •With CONFIG_PSI enabled
    •Start the kernel with special parameters:
    •psi=1 to activate PSI (this is disabled... by default config)
    •systemd.unified_cgroup_hierarchy=1 to disable whole v1 usage in systemd!!! (BTW PSI itself is available in mixed env)
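After rebooting, it helps to confirm that the parameters actually reached the kernel. The sketch below checks a kernel command line for an exact parameter; `cmdline_has` is a hypothetical helper name, and reading /proc/cmdline is an assumption about how you would verify this, not a step from the slides.

```shell
#!/bin/sh
# Return 0 when a kernel command line contains the given parameter exactly.
# $1: parameter such as psi=1
# $2: file holding the command line (defaults to /proc/cmdline)
cmdline_has() {
    # Put one parameter per line, then look for a whole-line fixed-string match.
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

A typical check would be `cmdline_has psi=1 && [ -r /proc/pressure/cpu ] && echo "PSI is active"`.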
  26. 40.

    Benchmarking
    •Running ApacheBench (ab) against Apache containers on LXC...
    •LXC fully supports cgroup v2 now
    •Using AB params: ab -c 300 -t 60 -n 500000 http://10.0.1.2/
    •Switch cgroup params by $N: echo "$N 100000" > /sys/fs/cgroup/lxc/${name}/cpu.max
    •And... check /sys/fs/cgroup/.../cpu.pressure periodically
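The periodic check in the last step could look like the loop below. This is only a sketch of one way to sample the file, not the script used for the talk; `sample_pressure` is a hypothetical helper, and it assumes the `some avg10=...` line layout of the pressure files.

```shell
#!/bin/sh
# Sample the "some avg10" value of a pressure file several times.
# $1: pressure file, $2: number of samples, $3: seconds between samples
# Prints one "<unix-timestamp> <avg10>" line per sample.
sample_pressure() {
    file=$1
    count=$2
    interval=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        # Pull the avg10 figure out of the line that starts with "some".
        value=$(sed -n 's/^some avg10=\([0-9.]*\).*/\1/p' "$file")
        echo "$(date +%s) $value"
        i=$((i + 1))
        # Skip the sleep after the last sample.
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    return 0
}
```

During a 60-second ab run, something like `sample_pressure "/sys/fs/cgroup/lxc/${name}/cpu.pressure" 12 5` would collect one reading every five seconds for each value of $N.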
  27. 41.
  28. 43.

    Consideration
    •When the admin distributes fewer resources to a cgroup, its PSI tends to get a higher score.
    •When one container gets overloaded, LA (Load Average) also gets a higher score, but the magnitude is smaller. It may go unnoticed depending on the situation (like... a many-core host)
  29. 46.

    Key metrics/tools are changing
    Host-wide:
    •Load Average
    •Memory usage
    •psutils, top, vmstat...
    •netstat, iostat
    •syslog, auditd
    •perf
    Per-Container:
    •Cgroup stat
    •PSI (especially)
    •eBPF (per process)
    •USDT, syscalls...
    •sysdig/falco
    •perf --cgroup
  30. 47.

    Detecting per-container hype
    •There are (or will be) some ways:
    •measurement, measurement, measurement
    •Existing stats are trackable by cAdvisor && Prometheus
    •Enable PSI and watch the score
    •Using perf with cgroup’s perf_event
  31. 48.

    perf with containers
    •perf_event can be created per container (cgroup), but there seem to be few examples
    •Example by brendangregg http://www.brendangregg.com/perf.html
    •More example: count syscalls by a container
  32. 49.

    eBPF
    •eBPF is an emerging technology to trace and measure all kinds of Linux events, and it is faster and more powerful than ever.
    •eBPF can trace perf_events directly !!!
    •BCC is a human-readable wrapper for eBPF
    •Available in C++, Python, Lua / porting into mruby
    •e.g. snooping syscalls (such as execve/open) by container
    •My colleague’s tool - by detecting /proc/PID/cgroup
  33. 50.

    “Using eBPF in Kubernetes”
    •eBPF can control and trace networking
    https://kubernetes.io/blog/2017/12/using-ebpf-in-kubernetes/
  34. 53.

    Case study
    •Mixing cgroupfs v1 & cgroupfs v2: finding solutions for container runtimes
    •https://www.youtube.com/watch?v=P6Xnm0IhiSo
    •https://linuxpiter.com/system/attachments/files/000/001/342/original/Christian_Brauner_1.pdf
  35. 54.

    V2 support is now under discussion
    •https://github.com/opencontainers/runc/issues/654
    •Tasks in OCI runtime spec:
    •https://github.com/opencontainers/runtime-spec/issues/1002
  36. 55.

    FYI: OCI with cgroup-v2 support
    •crun is a small OCI-compatible runtime, and its developers say it supports cgroup v2
    •It defines mappings:
    •https://github.com/giuseppe/crun/blob/master/crun.1.md#cgroup-v2
  37. 57.

    Future of container tracing
    •OCI and container developers are working hard on cgroup-v2
    •Using PSI, we can get pressure information per container in a human-readable way
    •But it is available only with cgroup-v2, and therefore only on newer kernels
    •More tracing technology is coming; eBPF and BCC are especially interesting among them