How cgroup-v2 and PSI Impacts Cloud Native?

Uchio Kondo / GMO Pepabo, Inc.
2019.07.23 CloudNative Days Tokyo 2019
Image from pixabay: https://pixabay.com/images/id-3193865/

Uchio Kondo / @udzura
https://blog.udzura.jp/
•Dev Productivity Team @ GMO Pepabo, Inc.
•RubyKaigi 2019 Local Organizer && Speaker
•CNDF 2019 Chair
•Community Organizer: Fu-Kubernetes
•Duolingo heavy user (seems to use Envoy :)

ToC:
•What is cgroup?
•cgroups in Docker / OCI
•cgroup v2
•Introduction to PSI (Pressure Stall Information)
•Trying PSI
•Future of overload detection

What is cgroup?

cgroup
•One of the core features behind container functionality in the Linux kernel
•Groups processes/tasks and controls OS resources (such as CPU, memory, IO, and process count) per group
•Provides both limits and statistics for resources
(Reference: the "linux_containers" series on gihyo.jp)

Cgroupfs
•Mounted by default on modern Linux distros
•cgroups are controlled through this filesystem with ordinary file-operation syscalls such as open(2), read(2), write(2), mkdir(2)...
•No special syscall is required!
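Because the interface is just files, any language with basic file I/O can drive it. The following is a minimal Python sketch, assuming a v1 cpu hierarchy. The helper names `create_cpu_group` and `read_cpu_quota` and the `root` parameter are made up for illustration, and writing under the real /sys/fs/cgroup needs root privileges, so the demo runs against a temporary directory instead.

```python
import os
import tempfile


def create_cpu_group(root, name, quota_us):
    """Create a group directory and set its CPU quota using plain file I/O."""
    group = os.path.join(root, name)
    os.makedirs(group, exist_ok=True)              # mkdir(2)
    with open(os.path.join(group, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(quota_us))                     # open(2) + write(2)
    return group


def read_cpu_quota(group):
    """Read the quota back from the group's control file."""
    with open(os.path.join(group, "cpu.cfs_quota_us")) as f:
        return int(f.read().strip())               # read(2)


# Dry run: a temporary directory stands in for /sys/fs/cgroup/cpu
# (an assumption for illustration; on a real cgroupfs the kernel
# creates the control files itself when the directory is made).
with tempfile.TemporaryDirectory() as fake_root:
    group = create_cpu_group(fake_root, "cndt2019", 20000)
    print(read_cpu_quota(group))
```

On a real host, `root` would be /sys/fs/cgroup/cpu and the process would need sufficient privileges.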

Solo usage (v1)
$ sudo mkdir /sys/fs/cgroup/cpu/cndt2019
$ cat /sys/fs/cgroup/cpu/cndt2019/cpu.cfs_period_us
100000
$ echo 20000 | \
    sudo tee /sys/fs/cgroup/cpu/cndt2019/cpu.cfs_quota_us    # Set quota per period
20000
$ echo $$ | \
    sudo tee /sys/fs/cgroup/cpu/cndt2019/tasks    # Assign task to a group
6351
$ yes

Limit CPU for yes(1) to 20%

cgroup detailed features in v1
•cpu
•memory
•blkio
•pids
•net_cls
•freezer
...

cpu/cpuset controller
•Limits CPU time (user/sys) and provides stats
•cpuset controls CPU affinity

quota/period
cpu.cfs_period_us = 100000
cpu.cfs_quota_us = 20000
20000 / 100000 = 20%
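The same arithmetic as a small Python sketch. The function name `cpu_limit_percent` is hypothetical; the -1 case reflects the v1 convention that a quota of -1 means no limit.

```python
def cpu_limit_percent(quota_us, period_us):
    """Return the CPU time allowed per period, as a percentage of one core.

    A negative quota (-1 in cgroup v1) means "no limit" and yields None.
    """
    if quota_us < 0:
        return None
    return quota_us * 100 / period_us


print(cpu_limit_percent(20000, 100000))    # the slide's values
print(cpu_limit_percent(200000, 100000))   # a quota above the period spans more than one core
```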

memory controller
•Limits memory usage and provides stats
•Covers memory, swap, and kernel memory
$ echo 256M > group-a/memory.limit_in_bytes
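The kernel accepts size suffixes on write, while the file reports plain bytes when read back. A minimal sketch of that conversion, assuming binary units (K = 1024) and covering only the K/M/G suffixes; `size_to_bytes` is a hypothetical helper, not a kernel or library API.

```python
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def size_to_bytes(text):
    """Convert a size such as '256M' into a byte count (binary units)."""
    text = text.strip().upper()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)   # no suffix: the value is already in bytes


print(size_to_bytes("256M"))
```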

blkio controller
•Limits IO to block devices and provides stats
$ echo "8:0 10" | tee group-a/blkio.throttle.{read,write}_iops_device
$ echo "8:0 10485760" | tee group-a/blkio.throttle.{read,write}_bps_device

pids controller
•Limits and checks the number of tasks in a group
•Prevents fork-bomb attacks
$ echo 1024 > group-a/pids.max

net_cls controller
•Marks packets sent from the group
•The mark can be matched from iptables or tc

freezer controller
•Freezes all of the group's tasks
•Then resumes them later

perf_event
•Collects and monitors perf events per group
•This may be interesting if used with eBPF...

Others
•devices - device whitelist
•hugetlb - limits HugeTLB usage
•rdma - limits RDMA/IB-specific resources

cgroups in Docker/containers

cgroup used by Docker
•Docker uses cgroup!
•When a container boots, a new cgroup is created for it, named /docker/${CONTAINER_ID}
•Inside the container, /proc/self/cgroup contains a /docker/* string, so some tools use cgroup to judge whether they are running inside a container
•It is also the key for mapping a host PID to a container ID

e.g. Judging Docker, in chef
https://github.com/chef/ohai/blob/master/lib/ohai/plugins/linux/virtualization.rb
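The idea of reading the cgroup path to detect Docker, or to map a host PID to a container ID, can be sketched in Python as follows. This is not the chef/ohai implementation; it assumes the /docker/${CONTAINER_ID} layout described above (other cgroup drivers may name the path differently), and the function names and the sample container ID are made up for illustration.

```python
import re

_DOCKER_PATH = re.compile(r"/docker/([0-9a-f]{12,64})")


def container_id_from_cgroup(text):
    """Return the Docker container ID found in /proc/<pid>/cgroup content, or None."""
    for line in text.splitlines():
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        match = _DOCKER_PATH.search(parts[2])
        if match:
            return match.group(1)
    return None


def container_id_of_pid(pid):
    """On the host, read /proc/<pid>/cgroup and map the PID to a container ID."""
    with open(f"/proc/{pid}/cgroup") as f:
        return container_id_from_cgroup(f.read())


# Illustrative input; the container ID is a placeholder, not a real container.
FAKE_ID = "0123456789abcdef" * 4
sample = f"3:memory:/docker/{FAKE_ID}\n2:cpu,cpuacct:/docker/{FAKE_ID}\n"
print(container_id_from_cgroup(sample))
print(container_id_from_cgroup("2:cpu,cpuacct:/\n"))   # a process outside any container
```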

In OCI spec
•Mentioned as “Linux Container Configuration”
•https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#control-groups
•Details are defined per controller: CPU, Memory, Pids...
•These are good documents for understanding each controller

cgroup v2

A short history of cgroup
•https://lwn.net/Articles/236038/
•“Process containers” in 2007

In 2007
•Process containers -> control groups
•cgroup was merged into the kernel mainline in version 2.6.24

Redesigning
•Started in 2013
•First merged in kernel version 3.16... and still under development
•The v2 filesystem can be mounted alongside v1

Features of cgroup-v2
•Unified hierarchy
•Processes can belong only to leaves
•Cgroup-aware OOM killer (OOM group)
•nsdelegate
•and... PSI!!

Unified hierarchy
v1 (one hierarchy per controller):
/sys/fs/cgroup
  /cpu, /memory, /blkio
    each with its own groups (/group-a, /group-b, /group-c ...)
v2 (single hierarchy):
/sys/fs/cgroup
  /group-a
    /cpu.* /memory.* /io.* ...
  /group-b
    /cpu.* /memory.* /io.* ...

Processes can belong only to leaves
/sys/fs/cgroup ⭕ (root is OK)
/parent-a ⭕
/parent-b ❎
/parent-b/child-c ⭕
/parent-b/child-d ❎
/parent-b/child-d/grandchild-e ⭕

Cgroup-aware OOM killer
•Can kill all processes in a group when the group runs out of memory
•https://lwn.net/Articles/761582/
•memory.oom.group
•Avoids partial kills of integrated workloads

nsdelegate
•“Consider cgroup namespaces as delegation boundaries”
•Works together with cgroup namespaces
•cgroup namespace = cgroup’s “chroot”
•Without nsdelegate: a process can be moved to a group outside the namespace if that group is visible (caused by a remaining host-namespace mount, for example)
•With nsdelegate: this behavior is prevented

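Mounted with -o nsdelegate
Host view: /sys/fs/cgroup with /group-a and /group-b
Cgroup NS (created with unshare --cgroup): its root appears as /, containing /ns-g-a and /ns-g-b
•Inside the cgroup NS: modifiable
•Outside the cgroup NS: cannot modify ❎ ...regardless of file access control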

Introduction to PSI (Pressure stall information)

Pressure Stall Information
•A new set of pressure metrics for the Linux kernel
•Developed and maintained by a team at Facebook
•Available both system-wide and per cgroup (v2)
https://facebookmicrosites.github.io/psi/

How to check the PSI
•It creates files:
  •/proc/pressure/{cpu,memory,io} for system-wide
  •/cgroup/group-name/{cpu,memory,io}.pressure when v2 is enabled
•The format looks like:
  •Averages over 10 sec, 60 sec, and 300 sec
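A minimal Python sketch for reading these files, assuming the layout described in the kernel's PSI documentation: lines starting with `some` or `full`, followed by `avg10=`, `avg60=`, `avg300=`, and `total=` fields. The function names are hypothetical and the sample numbers are illustrative, not measurements.

```python
def parse_pressure(text):
    """Parse pressure-file content into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, values = fields[0], {}
        for field in fields[1:]:
            key, _, raw = field.partition("=")
            # avg* are percentages; total is cumulative stall time in microseconds.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def read_pressure(path="/proc/pressure/cpu"):
    """Read and parse one pressure file (system-wide or a cgroup's *.pressure)."""
    with open(path) as f:
        return parse_pressure(f.read())


# Illustrative content only.
sample = (
    "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
print(parse_pressure(sample)["some"]["avg10"])
```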

How to understand the PSI
•Picture from Facebook’s official document:
•If (some|all) tasks are delayed due to lack of a resource for 45 sec out of 1 min, the PSI is 75.00, for example
•This resembles Load Average (and PSI is available per container)
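The example on this slide is plain ratio arithmetic, sketched below. The kernel's avg fields are running averages, so this only shows the idealized case of a single window; `stall_percent` is a hypothetical name.

```python
def stall_percent(stalled_seconds, window_seconds):
    """Share of the window during which tasks were stalled, in percent."""
    return stalled_seconds * 100 / window_seconds


# The slide's example: stalled for 45 s out of a 60 s window.
print(f"{stall_percent(45, 60):.2f}")
```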

Trying PSI!

How to enable... cgroup-v2 and PSI
•Steps:
  •Build or install a Linux kernel 4.20 or newer...
  •With CONFIG_PSI enabled
•Start the kernel with special parameters:
  •psi=1 to activate PSI (it is disabled... by the default config)
  •systemd.unified_cgroup_hierarchy=1 to disable all v1 usage in systemd!!! (BTW PSI itself is available in a mixed env)
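To confirm that a host was booted with these parameters, the kernel command line can be inspected. A minimal sketch, with hypothetical helper names and an illustrative command line; on a real host the string would come from /proc/cmdline.

```python
def kernel_params(cmdline):
    """Split a kernel command line into a {name: value} dict (bare flags map to None)."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def psi_requested(cmdline):
    """True if psi=1 was passed at boot."""
    return kernel_params(cmdline).get("psi") == "1"


def unified_hierarchy_requested(cmdline):
    """True if systemd was asked to use only the v2 hierarchy."""
    return kernel_params(cmdline).get("systemd.unified_cgroup_hierarchy") == "1"


# Illustrative command line, not taken from the demo machine.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro psi=1 systemd.unified_cgroup_hierarchy=1"
print(psi_requested(sample), unified_hierarchy_requested(sample))
```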

Small demo
•Using Docker on a PSI-enabled kernel (5.2)

Benchmarking
•Running ApacheBench against LXC Apache containers...
•LXC fully supports cgroup v2 now
•AB params: ab -c 300 -t 60 -n 500000 http://10.0.1.2/
•Switch cgroup params by $N: echo "$N 100000" > /sys/fs/cgroup/lxc/${name}/cpu.max
•And... check /sys/fs/cgroup/.../cpu.pressure periodically
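The "check periodically" step could be automated along these lines. This is a sketch, not the script used in the talk: all function names are hypothetical, and the pressure-file layout (a `some` line with an `avg10=` field) is assumed from the kernel's PSI documentation.

```python
import time


def cpu_max_value(quota_us, period_us=100000):
    """Build the "$N 100000" string written to cpu.max (quota, then period)."""
    return f"{quota_us} {period_us}"


def avg10_some(pressure_text):
    """Extract the avg10 value from the "some" line of a pressure file."""
    for line in pressure_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "some":
            for field in fields[1:]:
                if field.startswith("avg10="):
                    return float(field[len("avg10="):])
    return None


def poll_pressure(read_text, samples, interval_sec=1.0):
    """Call read_text() `samples` times and collect the avg10 readings."""
    readings = []
    for i in range(samples):
        readings.append(avg10_some(read_text()))
        if i + 1 < samples:
            time.sleep(interval_sec)
    return readings


def file_reader(path):
    """Return a callable that reads `path`, e.g. a cgroup's cpu.pressure file."""
    def read_text():
        with open(path) as f:
            return f.read()
    return read_text
```

During a benchmark run, something like `poll_pressure(file_reader(path), samples=60)` would collect one reading per second, where `path` is the container's cpu.pressure file.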

c.f. Load Average on the host (note that the host has multiple cores)

Consideration
•When an admin allocates fewer resources to a cgroup, its PSI tends to score higher.
•When one container gets overloaded, LA also rises, but by a smaller order of magnitude. It may be overlooked depending on the situation (like... a many-core host)

Future of overload detection

Generally, detecting a container overload is hard

Key metrics/tools are changing
Host-wide:
•Load Average
•Memory usage
•psutils, top, vmstat...
•netstat, iostat
•syslog, auditd
•perf
Per-Container:
•Cgroup stat
•PSI (especially)
•eBPF (per process)
•USDT, syscalls...
•sysdig/falco
•perf --cgroup

Detecting per-container hype
•There are (or will be) some ways:
  •measurement, measurement, measurement
  •Existing stats can be tracked with cAdvisor && Prometheus
  •Enable PSI and watch the score
  •Use perf with cgroup’s perf_event

perf with containers
•perf_event can be created per container (cgroup), but there seem to be few examples
•Example by brendangregg: http://www.brendangregg.com/perf.html
•More example: count syscalls by container
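One way to count syscalls for a single container is `perf stat` with its cgroup filter. The sketch below only builds the command line; it is not the example shown in the talk. The choice of the `raw_syscalls:sys_enter` tracepoint, the helper name, and the placeholder cgroup path are assumptions, and running the command requires perf plus suitable privileges.

```python
import shlex


def perf_stat_argv(cgroup, event="raw_syscalls:sys_enter", seconds=10):
    """Build a `perf stat` command that counts one event inside one cgroup."""
    return [
        "perf", "stat",
        "-e", event,           # event to count
        "-a",                  # system-wide mode, which the cgroup filter needs
        "-G", cgroup,          # limit counting to this cgroup
        "--", "sleep", str(seconds),
    ]


# "docker/<container-id>" is a placeholder, relative to the perf_event hierarchy.
print(shlex.join(perf_stat_argv("docker/<container-id>")))
```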

eBPF
•eBPF is an emerging technology for tracing and measuring all kinds of Linux events, and it is faster and more powerful than ever.
•eBPF can trace perf_events directly !!!
•BCC is a human-readable wrapper for eBPF
  •Available in C++, Python, Lua / porting into mruby
•e.g. snooping syscalls (such as execve/open) by container
  •My colleague’s tool - by detecting /proc/PID/cgroup

“Using eBPF in Kubernetes”
•eBPF can control and trace networking
https://kubernetes.io/blog/2017/12/using-ebpf-in-kubernetes/

Migrating v1 -> v2

Case study
•cgroup v2 migration at Google
•https://www.linuxplumbersconf.com/event/2/contributions/204/
•https://www.linuxplumbersconf.org/event/2/contributions/204/attachments/143/378/LPC2018-cgroup-v2.pdf

Case study
•Mixing cgroupfs v1 & cgroupfs v2: finding solutions for container runtimes
•https://www.youtube.com/watch?v=P6Xnm0IhiSo
•https://linuxpiter.com/system/attachments/files/000/001/342/original/Christian_Brauner_1.pdf

V2 support is now under discussion
•https://github.com/opencontainers/runc/issues/654
•Tasks in OCI runtime spec:
  •https://github.com/opencontainers/runtime-spec/issues/1002

FYI: OCI with cgroup-v2 support
•crun is a small OCI-compatible runtime, and its authors say it supports cgroup v2
•They define mappings:
  •https://github.com/giuseppe/crun/blob/master/crun.1.md#cgroup-v2

Conclusion

Future of container tracing
•OCI and container developers are working hard on cgroup-v2
•Using PSI, we can get pressure information per container, in a human-readable way
•But it is only available with cgroup-v2, and therefore requires a newer kernel
•More tracing technology is coming; eBPF and BCC are especially interesting
