How cgroup-v2 and PSI Impacts Cloud Native?

@ CloudNative Days Tokyo 2019, in Toranomon

----

Reference sites omitted from the slides / More to read

The cgroup articles from the series “LXCで学ぶコンテナ入門 -軽量仮想化環境を実現する技術” (Introduction to Containers with LXC: Technology for Lightweight Virtualization) (in Japanese)
* https://gihyo.jp/admin/serial/01/linux_containers/0037
* https://gihyo.jp/admin/serial/01/linux_containers/0038
* https://gihyo.jp/admin/serial/01/linux_containers/0039
* https://gihyo.jp/admin/serial/01/linux_containers/0040
* https://gihyo.jp/admin/serial/01/linux_containers/0041

My blog (in Japanese)
* https://udzura.hatenablog.jp/entry/2019/02/14/194244

“いますぐ実践! Linux システム管理” (Practice Right Now! Linux System Administration), Vol. 228
* http://www.usupi.org/sysad/228.html

KONDO Uchio

July 23, 2019

Transcript

  1. How cgroup-v2 and PSI
    Impacts Cloud Native?
    Uchio Kondo / GMO Pepabo, Inc.
    2019.07.23 CloudNative Days Tokyo 2019
    Image from pixabay: https://pixabay.com/images/id-3193865/

  2. Uchio Kondo / @udzura
    https://blog.udzura.jp/
近藤うちお (Uchio Kondo)
    Dev Productivity Team @ GMO Pepabo, Inc.
    RubyKaigi 2019 Local Organizer && Speaker
    CNDF 2019 Chair
Community Organizer: #ふくばねてす (Fu-Kubernetes)
    Duolingo heavy user (seems to use Envoy :)

  3. ToC:
    •What is cgroup?
    •cgroups in Docker / OCI
    •cgroup v2
    •Introduction to PSI (Pressure Stall Information)
    •Trying PSI
    •Future of overload detection

  4. What is cgroup?

  5. cgroup
    •One of the core features behind container functionality in the Linux kernel
    •Groups processes/tasks, and controls OS resources (such as CPU,
    memory, IO, number of processes) per group
    •Provides both resource limits and resource statistics
    Source: https://gihyo.jp/admin/serial/01/linux_containers (in Japanese)

  6. Cgroupfs
    •Mounted by default on any modern Linux distro
    •cgroup is controllable via this filesystem with file-operation syscalls
    such as open(2), read(2), write(2), mkdir(2)...
    •No special syscall is required!

  7. Solo usage (v1)
    # Set quota per period
    $ sudo mkdir /sys/fs/cgroup/cpu/cndt2019
    $ cat /sys/fs/cgroup/cpu/cndt2019/cpu.cfs_period_us
    100000
    $ echo 20000 | \
    sudo tee /sys/fs/cgroup/cpu/cndt2019/cpu.cfs_quota_us
    20000
    # Assign a task to the group
    $ echo $$ | \
    sudo tee /sys/fs/cgroup/cpu/cndt2019/tasks
    6351
    $ yes

  8. Limit CPU for yes(1) to 20%

  9. Controllers available in cgroup v1
    •cpu
    •memory
    •blkio
    •pids
    •net_cls
    •freezer ...

  10. cpu/cpuset controller
    •Limits CPU time (user/sys) and provides stats
    •cpuset handles CPU affinity (sketch below)
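    For example, pinning a group to specific CPUs with cpuset might look like
    this, a minimal sketch where the group name and CPU/node numbers are
    just examples:
    $ sudo mkdir /sys/fs/cgroup/cpuset/cndt2019
    # Restrict the group to CPUs 0-1 on memory node 0;
    # cpuset.cpus and cpuset.mems must be set before assigning tasks
    $ echo 0-1 | sudo tee /sys/fs/cgroup/cpuset/cndt2019/cpuset.cpus
    $ echo 0 | sudo tee /sys/fs/cgroup/cpuset/cndt2019/cpuset.mems
    $ echo $$ | sudo tee /sys/fs/cgroup/cpuset/cndt2019/tasks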

  11. quota/period
    cpu.cfs_period_us = 100000
    cpu.cfs_quota_us = 20000
    20000 / 100000 = 20%

  12. memory controller
    •Limits/reports memory usage
    •Covers user memory, swap, and kernel memory
    $ echo 256M > group-a/memory.limit_in_bytes

  13. blkio controller
    •Limits/reports IO to block devices
    $ echo "8:0 10" > group-a/blkio.throttle.{read,write}_iops_device
    $ echo "8:0 10485760" > group-a/blkio.throttle.{read,write}_bps_device

  14. pids controller
    •Limits/checks the number of tasks in a group
    •Prevents a fork-bomb attack
    $ echo 1024 > group-a/pids.max

  15. net_cls controller
    •Marks packets originating from the group
    •The mark can be matched from iptables or tc (sketch below)
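    A minimal sketch: the classid 0x00100001 encodes handle 10:1, and the
    group name and iptables rule are examples, not from the original deck:
    $ sudo mkdir /sys/fs/cgroup/net_cls/group-a
    $ echo 0x00100001 | sudo tee /sys/fs/cgroup/net_cls/group-a/net_cls.classid
    # Match the group's outgoing packets in iptables via the cgroup module
    $ sudo iptables -A OUTPUT -m cgroup --cgroup 0x00100001 -j DROP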

  16. freezer controller
    •Freezes all of the group’s tasks at once
    •They can then be resumed (thawed), as below
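    A minimal sketch (the group name is an example):
    $ echo FROZEN | sudo tee /sys/fs/cgroup/freezer/group-a/freezer.state
    $ cat /sys/fs/cgroup/freezer/group-a/freezer.state
    FROZEN
    # Resume all tasks in the group
    $ echo THAWED | sudo tee /sys/fs/cgroup/freezer/group-a/freezer.state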

  17. perf_event
    •Collects and monitors perf events per group
    •This may be interesting if used with eBPF...

  18. Others
    •devices - device whitelist
    •hugetlb - limit HugeTLB usage
    •rdma - limit RDMA/IB-specific resources

  19. cgroups in Docker/containers

  20. cgroup used by Docker
    •Docker uses cgroup!
    •When a container boots, a new cgroup named /docker/${CONTAINER_ID}
    is created for it
    •Inside a container, /proc/self/cgroup contains a /docker/* path, so some
    tools use cgroup to judge whether they are running inside a container
    •It is also the key for mapping a host PID to a container ID (example below)
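    For example, from inside a running container on a v1 host; the container
    ID is a placeholder and the exact controller list varies by kernel:
    $ head -3 /proc/self/cgroup
    12:pids:/docker/<container-id>
    11:memory:/docker/<container-id>
    10:cpu,cpuacct:/docker/<container-id>
    # On the host, the same group appears under
    # /sys/fs/cgroup/<controller>/docker/<container-id>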

  21. e.g. Judging Docker, in Chef (Ohai)
    https://github.com/chef/ohai/blob/master/lib/ohai/plugins/linux/virtualization.rb

  22. In OCI spec
    •Mentioned as “Linux Container Configuration”
    •https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#control-groups
    •Details are defined per controller: CPU, Memory, Pids...
    •These are good documents for understanding each controller

  23. cgroup v2

  24. A short history of cgroup
    •https://lwn.net/Articles/236038/
    •“Process containers” in 2007

  25. In 2007
    •Process containers -> control groups
    •cgroup was merged into the kernel mainline in version 2.6.24

    View Slide

  26. Redesigning
    •Started in 2013
    •First merged in kernel version 3.16... and still in development
    •The v2 filesystem can be mounted alongside v1 (sketch below)
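    A minimal sketch of mounting v2 next to an existing v1 setup; the mount
    point below is just a common convention, not mandated:
    $ sudo mkdir -p /sys/fs/cgroup/unified
    $ sudo mount -t cgroup2 none /sys/fs/cgroup/unified
    # Only controllers not already bound to a v1 hierarchy show up here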

  27. Features of cgroup-v2
    •Unified hierarchy
    •Processes can belong only to leaves
    •Cgroup-aware OOM killer (OOM group)
    •nsdelegate
    •and... PSI!!

  28. Unified hierarchy
    v1: one hierarchy per controller
    /sys/fs/cgroup
      /cpu
        /group-a
        /group-b
      /memory
        /group-a
        /group-c
      /blkio
        ...
    v2: one unified hierarchy
    /sys/fs/cgroup
      /group-a
        cpu.* / memory.* / io.* ...
      /group-b
        cpu.* / memory.* / io.* ...
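    In v2, each group's available controllers are advertised and delegated via
    cgroup.controllers and cgroup.subtree_control. A minimal sketch, where the
    group name and the listed output are illustrative:
    $ cat /sys/fs/cgroup/cgroup.controllers
    cpu io memory pids
    # Enable cpu and memory for the child groups
    $ echo "+cpu +memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
    $ sudo mkdir /sys/fs/cgroup/group-a
    $ ls /sys/fs/cgroup/group-a
    cgroup.controllers  cpu.max  cpu.pressure  memory.max  ...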

  29. Processes can belong only to leaves
    /sys/fs/cgroup                        <- root is OK
      /parent-a                           <- leaf: OK
      /parent-b                           <- has children: NG
      /parent-b/child-c                   <- leaf: OK
      /parent-b/child-d                   <- has a child: NG
      /parent-b/child-d/grandchild-e      <- leaf: OK
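    The kernel enforces this as the "no internal process" rule: a group that
    delegates controllers to its children cannot hold processes itself. A
    sketch reusing the example paths above:
    $ echo "+cpu" | sudo tee /sys/fs/cgroup/parent-b/cgroup.subtree_control
    # parent-b now distributes controllers, so attaching a process fails
    $ echo $$ | sudo tee /sys/fs/cgroup/parent-b/cgroup.procs
    tee: /sys/fs/cgroup/parent-b/cgroup.procs: Device or resource busy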

  30. Cgroup-aware OOM killer
    •Can kill all processes in a group when it runs out of memory
    •https://lwn.net/Articles/761582/
    •Enabled via memory.oom.group (sketch below)
    •Avoids partial kills inside integrated workloads
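    A minimal sketch (the group name and limit are examples):
    $ echo 256M | sudo tee /sys/fs/cgroup/group-a/memory.max
    # On OOM, kill the whole group as a unit instead of picking one victim
    $ echo 1 | sudo tee /sys/fs/cgroup/group-a/memory.oom.group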

  31. nsdelegate
    •“Consider cgroup namespaces as delegation boundaries”
    •Works together with the cgroup namespace
    •cgroup namespace = cgroup’s “chroot”
    •Without nsdelegate: a process can be moved to a group outside the
    namespace if that group is visible
    (caused, for example, by a host-namespace mount that is still around)
    •With nsdelegate: this behavior is prevented

  32. Mounted with -o nsdelegate
    Inside a cgroup NS (unshare --cgroup), the view is rooted at /:
      /ns-g-a, /ns-g-b  -> modifiable
    Groups outside the NS (e.g. host-level /group-a, /group-b under
    /sys/fs/cgroup) -> cannot be modified,
    ...regardless of file access control

  33. Introduction to PSI
    (Pressure stall information)

  34. Pressure Stall Information
    •A new set of pressure metrics for the Linux kernel
    •Developed and maintained by a Facebook team
    •Available both system-wide and per cgroup (v2)
    https://facebookmicrosites.github.io/psi/

  35. How to check the PSI
    •It creates these files:
    •/proc/pressure/{cpu,memory,io} for system-wide stats
    •<cgroup>/{cpu,memory,io}.pressure per group, when v2 is enabled
    •The format shows averages over the last 10, 60 and 300 seconds
    (plus a cumulative total), as below
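    For example (the numbers are illustrative; cpu exposes only the "some"
    line, while memory and io also expose "full"):
    $ cat /proc/pressure/memory
    some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    full avg10=0.00 avg60=0.00 avg300=0.00 total=0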

  36. How to understand the PSI
    •Picture from Facebook’s official document
    •If (some|all) tasks are delayed by lack of a resource for 45 sec
    out of 1 min, the PSI value is 75.00, for example
    •This resembles Load Average (and PSI is available per container)

  37. Trying PSI!

  38. How to enable... cgroup-v2 and PSI
    •Steps:
    •Rebuild your Linux kernel at version 4.20 or later...
    •With CONFIG_PSI enabled
    •Boot the kernel with special parameters:
    •psi=1 to activate PSI (it is disabled by the default config)
    •systemd.unified_cgroup_hierarchy=1 to disable all v1
    usage in systemd!!! (BTW, PSI itself also works in a mixed v1/v2 env)
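    A sketch for verifying the result afterwards (the kernel config file
    path varies by distro):
    $ grep CONFIG_PSI= /boot/config-$(uname -r)
    CONFIG_PSI=y
    $ cat /proc/cmdline
    ... psi=1 systemd.unified_cgroup_hierarchy=1
    # cgroup2fs means the unified hierarchy is active
    $ stat -fc %T /sys/fs/cgroup
    cgroup2fs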

  39. Small demo
    •Using Docker on a PSI-enabled kernel (5.2)

  40. Benchmarking
    •Running ApacheBench against Apache containers on LXC...
    •LXC fully supports cgroup v2 now
    •AB params: ab -c 300 -t 60 -n 500000 http://10.0.1.2/
    •Switch the cgroup params via $N:
    echo "$N 100000" > /sys/fs/cgroup/lxc/${name}/cpu.max
    •And... check /sys/fs/cgroup/.../cpu.pressure periodically (loop below)
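    The periodic check can be as simple as a loop like this (same path as
    above; the interval is arbitrary):
    $ while true; do
        cat /sys/fs/cgroup/lxc/${name}/cpu.pressure
        sleep 1
      done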

  41. Graph

  42. c.f. Load Average in host
    •Note that the host has ... cores

  43. Consideration
    •When the admin gives fewer resources to a cgroup, its PSI score tends
    to get higher.
    •When one container is overloaded, the host Load Average rises too,
    but by a smaller order of magnitude; it may go unnoticed depending on
    the situation (e.g. a many-core host)

  44. Future of overload detection

  45. Generally,
    detecting a container
    overload is hard

  46. Key metrics/tools are changing
    Host-wide:
    •Load Average
    •Memory usage
    •psutils, top, vmstat...
    •netstat, iostat
    •syslog, auditd
    •perf
    Per-Container:
    •cgroup stats
    •PSI (especially)
    •eBPF (per process)
    •USDT, syscalls...
    •sysdig/falco
    •perf --cgroup

  47. Detecting per-container hype
    •There are (or will be) some ways:
    •measurement, measurement, measurement
    •Existing stats are trackable by cAdvisor && Prometheus
    •Enable PSI and watch the score
    •Using perf with cgroup’s perf_event

  48. perf with containers
    •perf events can be created per container (cgroup), but there seem to
    be few examples
    •Example by brendangregg: http://www.brendangregg.com/perf.html
    •Another example: counting syscalls for a container (sketch below)
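    A sketch of counting syscalls for one container via the v1 perf_event
    hierarchy; the container ID is a placeholder:
    # Count syscall entries only for tasks in the container's cgroup, for 10s
    $ sudo perf stat -e raw_syscalls:sys_enter -a \
        -G docker/<container-id> -- sleep 10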

  49. eBPF
    •eBPF is an emerging technology for tracing and measuring all kinds of
    Linux events, faster and more powerful than ever
    •eBPF can trace perf_events directly !!!
    •BCC is a human-readable wrapper for eBPF
    •Available in C++, Python, Lua / being ported to mruby
    •e.g. snooping syscalls (such as execve/open) per container
    •My colleague’s tool - works by inspecting /proc/PID/cgroup
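    For instance, with the stock BCC tools (install paths vary by distro),
    execve calls can be snooped system-wide and then mapped back to a
    container via /proc/PID/cgroup:
    $ sudo /usr/share/bcc/tools/execsnoop
    PCOMM            PID    PPID   RET ARGS
    ...
    # Map a traced PID back to its container
    $ grep docker /proc/<PID>/cgroup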

  50. “Using eBPF in Kubernetes”
    •eBPF can control and trace networking
    https://kubernetes.io/blog/2017/12/using-ebpf-in-kubernetes/

  51. Migrating v1 -> v2

  52. Case study
    •cgroup v2 migration at Google
    •https://www.linuxplumbersconf.com/event/2/contributions/204/
    •https://www.linuxplumbersconf.org/event/2/contributions/204/attachments/143/378/LPC2018-cgroup-v2.pdf

  53. Case study
    •Mixing cgroupfs v1 & cgroupfs v2: finding solutions for container
    runtimes
    •https://www.youtube.com/watch?v=P6Xnm0IhiSo
    •https://linuxpiter.com/system/attachments/files/000/001/342/original/Christian_Brauner_1.pdf

  54. V2 support is now under discussion
    •https://github.com/opencontainers/runc/issues/654
    •Tasks in OCI runtime spec:
    •https://github.com/opencontainers/runtime-spec/issues/1002

  55. FYI: OCI with cgroup-v2 support
    •crun is a small OCI-compatible runtime, and it claims to support
    cgroup v2
    •It defines v1-to-v2 mappings:
    •https://github.com/giuseppe/crun/blob/master/crun.1.md#cgroup-v2

  56. Conclusion

  57. Future of container tracing
    •OCI and container developers are working hard on cgroup-v2
    •Using PSI, we can get pressure information per container in a
    human-readable way
    •But it is only available with cgroup-v2, hence newer kernels
    •More tracing technology is coming; eBPF and BCC are especially
    interesting among them
