Slide 1

Slide 1 text

How cgroup-v2 and PSI Impacts Cloud Native? Uchio Kondo / GMO Pepabo, Inc. 2019.07.23 CloudNative Days Tokyo 2019 Image from pixabay: https://pixabay.com/images/id-3193865/

Slide 2

Slide 2 text

Uchio Kondo / @udzura https://blog.udzura.jp/ (近藤うちお) Dev Productivity Team @ GMO Pepabo, Inc. RubyKaigi 2019 Local Organizer && Speaker CNDF 2019 Chair Community Organizer: #ふくばねてす (Fu-Kubernetes) Duolingo heavy user (seems to use Envoy :)

Slide 3

Slide 3 text

ToC: •What is cgroup??? •cgroups in Docker / OCI •cgroup v2 •Introduction to PSI (Pressure Stall Information) •Trying PSI •Future of overload detection

Slide 4

Slide 4 text

What is cgroup?

Slide 5

Slide 5 text

cgroup •One of the core features of container functionality in the Linux kernel •Groups processes/tasks and controls OS resources (such as CPU, memory, IO, and the number of processes) per group •Both limits and statistics are available for those resources (Reference: gihyo.jp's “linux_containers” serial)

Slide 6

Slide 6 text

Cgroupfs •Mounted by default on modern Linux distros •cgroup is controllable via this filesystem with ordinary file-operation syscalls, such as open(2), read(2), write(2), mkdir(2)... •No special syscall is required!
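
For example, on a typical v1 host the controllers show up as separate cgroupfs mounts (a minimal check; output abbreviated, and the mount layout varies by distro):

$ mount -t cgroup
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,...,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,...,memory)
...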

Slide 7

Slide 7 text

Solo usage (v1)

$ sudo mkdir /sys/fs/cgroup/cpu/cndt2019
$ cat /sys/fs/cgroup/cpu/cndt2019/cpu.cfs_period_us
100000
$ echo 20000 | \
    sudo tee /sys/fs/cgroup/cpu/cndt2019/cpu.cfs_quota_us
20000
$ echo $$ | \
    sudo tee /sys/fs/cgroup/cpu/cndt2019/tasks
6351
$ yes

(Set quota per period / Assign a task to the group)

Slide 8

Slide 8 text

Limit CPU for yes(1) to 20%

Slide 9

Slide 9 text

cgroup controllers in v1, in detail •cpu •memory •blkio •pids •net_cls •freezer ...

Slide 10

Slide 10 text

cpu/cpuset controller •Limit CPU time in user/sys, and get stats •cpuset for CPU affinity

Slide 11

Slide 11 text

quota/period cpu.cfs_period_us = 100000 cpu.cfs_quota_us = 20000 20000 / 100000 = 20%

Slide 12

Slide 12 text

memory controller •Limits/stats memory usage •Covers user memory, swap, and kernel memory $ echo 256M > group-a/memory.limit_in_bytes

Slide 13

Slide 13 text

blkio controller •Limits/stats IO to block devices $ echo "8:0 10" > group-a/blkio.throttle.{read,write}_iops_device $ echo "8:0 10485760" > group-a/blkio.throttle.{read,write}_bps_device

Slide 14

Slide 14 text

pids controller •Limits/checks the number of tasks in a group •Prevents a fork-bomb attack $ echo 1024 > group-a/pids.max

Slide 15

Slide 15 text

net_cls controller •Marks packets sent from the group •The mark can be used by iptables or tc

Slide 16

Slide 16 text

freezer controller •Freeze all of the group’s tasks •Then, resume

Slide 17

Slide 17 text

perf_event •Collect and monitor perf events per group •This may be interesting if used with eBPF...

Slide 18

Slide 18 text

Others •devices - device whitelist •hugetlb - limit the HugeTLB usage •rdma - limit RDMA/IB specific resources

Slide 19

Slide 19 text

cgroups in Docker/containers

Slide 20

Slide 20 text

cgroup used by Docker •Docker uses cgroup! •Once a container is booted, a new cgroup named /docker/${CONTAINER_ID} is created for it •Inside a container, /proc/self/cgroup contains the /docker/* string, so some tools use cgroup to judge whether they are running inside a container •It is also the key for mapping a host PID to a container ID

Slide 21

Slide 21 text

e.g. Judging Docker, in Chef (ohai): https://github.com/chef/ohai/blob/master/lib/ohai/plugins/linux/virtualization.rb
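
A minimal shell version of the same check (works for cgroup-v1 layouts where Docker names the group /docker/<id>; ohai implements this logic in Ruby):

$ grep -q ':/docker/' /proc/self/cgroup && echo 'running inside a Docker container'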

Slide 22

Slide 22 text

In OCI spec •Mentioned as “Linux Container Configuration” •https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#control-groups •Details are defined per controller: CPU, Memory, Pids... •These are good documents for understanding each controller

Slide 23

Slide 23 text

cgroup v2

Slide 24

Slide 24 text

A short history of cgroup •https://lwn.net/Articles/236038/ •“Process containers” in 2007

Slide 25

Slide 25 text

In 2007 •Process containers -> control groups •cgroup was merged into the kernel mainline in version 2.6.24

Slide 26

Slide 26 text

Redesigning •Started in 2013 •First merged in kernel version 3.16... and still under development •The v2 filesystem can be mounted alongside v1

Slide 27

Slide 27 text

Features of cgroup-v2 •Unified hierarchy •Processes can belong only to leaves •Cgroup-aware OOM killer (OOM group) •nsdelegate •and... PSI!!

Slide 28

Slide 28 text

Unified hierarchy

v1 (one hierarchy per controller): /sys/fs/cgroup/cpu/group-a, /cpu/group-b, /memory/group-a, /blkio/group-c, ...

v2 (single unified hierarchy): /sys/fs/cgroup/group-a/{cpu.*, memory.*, io.*, ...}, /sys/fs/cgroup/group-b/{cpu.*, memory.*, io.*, ...}
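
As a rough sketch of the unified layout in practice (group names are illustrative; the per-controller files only appear in children after the controllers are enabled via cgroup.subtree_control):

$ cat /sys/fs/cgroup/cgroup.controllers
cpu io memory pids ...
$ echo "+cpu +memory +io" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
$ sudo mkdir /sys/fs/cgroup/group-a
$ ls /sys/fs/cgroup/group-a
cgroup.procs  cpu.max  cpu.pressure  memory.max  io.max  ...   # exact files depend on kernel/controllers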

Slide 29

Slide 29 text

Processes can belong only to leaves

/sys/fs/cgroup                   ⭕ (root is OK)
/parent-a                        ⭕ (leaf)
/parent-b                        ❎ (has children)
/parent-b/child-c                ⭕ (leaf)
/parent-b/child-d                ❎ (has children)
/parent-b/child-d/grandchild-e   ⭕ (leaf)

Slide 30

Slide 30 text

Cgroup-aware OOM killer •Can kill all of the processes in a group when it hits OOM •https://lwn.net/Articles/761582/ •memory.oom.group •Avoids partial kills of an integrated workload
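
A minimal sketch of turning it on for a v2 group (group-a is just an example name):

$ echo 1 | sudo tee /sys/fs/cgroup/group-a/memory.oom.group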

Slide 31

Slide 31 text

nsdelegate •“Consider cgroup namespaces as delegation boundaries” •Works together with the cgroup namespace •cgroup namespace = cgroup’s “chroot” •Without nsdelegate: a process can be moved to a group outside its namespace if that group is visible (e.g. through a leftover host-namespace mount) •With nsdelegate: this behavior is prevented
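
A sketch of mounting the v2 hierarchy with the option (the mount point is the conventional one; adjust to your setup):

$ sudo mount -t cgroup2 -o nsdelegate none /sys/fs/cgroup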

Slide 32

Slide 32 text

Mounted with -o nsdelegate

[Diagram] /sys/fs/cgroup contains /group-a, /group-b, /ns-g-a, /ns-g-b. After unshare --cgroup, groups inside the cgroup NS (/ns-g-a, /ns-g-b) are modifiable, but groups outside it (/, /group-a, /group-b) cannot be modified ❎ ...regardless of file access control

Slide 33

Slide 33 text

Introduction to PSI (Pressure stall information)

Slide 34

Slide 34 text

Pressure Stall Information •A new pressure metric for the Linux kernel •Developed and maintained by the Facebook team •Available both system-wide and per cgroup (v2) https://facebookmicrosites.github.io/psi/

Slide 35

Slide 35 text

How to check the PSI •It creates files: •/proc/pressure/{cpu,memory,io} for system-wide •/sys/fs/cgroup/group-name/{cpu,memory,io}.pressure when v2 is enabled •The format reports averages over 10 sec, 60 sec and 300 sec
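
The files look roughly like this (values are illustrative; total is cumulative stall time in microseconds, and the memory/io files also report a "full" line):

$ cat /proc/pressure/memory
some avg10=0.00 avg60=1.52 avg300=3.41 total=123456
full avg10=0.00 avg60=0.87 avg300=2.10 total=98765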

Slide 36

Slide 36 text

How to understand the PSI •Picture from Facebook’s official document •If (some|all) of the tasks are delayed due to lack of a resource for 45 sec out of 1 min, the PSI is 75.00 (45/60 = 75%), for example •This resembles Load Average (and PSI is available per container)

Slide 37

Slide 37 text

Trying PSI!

Slide 38

Slide 38 text

How to enable... cgroup-v2 and PSI •Steps: •Rebuild your Linux kernel at 4.20 or newer... •With CONFIG_PSI enabled •Start the kernel with special parameters: •psi=1 to activate PSI (it is disabled by the default config) •systemd.unified_cgroup_hierarchy=1 to stop systemd from using v1 entirely!!! (BTW PSI itself is available in a mixed env)
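
A minimal sketch of setting these via GRUB (the file path and update command assume a Debian/Ubuntu-style setup):

# /etc/default/grub  (append to any existing parameters)
GRUB_CMDLINE_LINUX="psi=1 systemd.unified_cgroup_hierarchy=1"

$ sudo update-grub && sudo reboot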

Slide 39

Slide 39 text

Small demo •Using Docker on a PSI-enabled kernel (5.2)
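
Roughly what such a demo looks like (the /sys/fs/cgroup/docker/<id> path assumes the unified hierarchy with the cgroupfs driver; with the systemd driver the group lives under system.slice/docker-<id>.scope instead):

$ docker run -d --name demo --cpus 0.2 busybox sh -c 'yes > /dev/null'
$ CID=$(docker inspect -f '{{.Id}}' demo)
$ cat /sys/fs/cgroup/docker/${CID}/cpu.pressure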

Slide 40

Slide 40 text

Benchmarking •Running Apache Bench against LXC Apache containers... •LXC fully supports cgroup v2 now •AB params: ab -c 300 -t 60 -n 500000 http://10.0.1.2/ •Switch cgroup params via $N: echo "$N 100000" > /sys/fs/cgroup/lxc/${name}/cpu.max •And... check /sys/fs/cgroup/.../cpu.pressure periodically

Slide 41

Slide 41 text

Graph

Slide 42

Slide 42 text

c.f. Load Average in host (Note that the host has [N] cores)

Slide 43

Slide 43 text

Consideration •When the admin distributes fewer resources to a cgroup, PSI tends to show a higher score. •When one container gets overloaded, the host Load Average also rises, but by a smaller order of magnitude; it may be overlooked depending on the situation (e.g. a many-core host)

Slide 44

Slide 44 text

Future of overload detection

Slide 45

Slide 45 text

Generally, detecting a container overload is hard

Slide 46

Slide 46 text

Key metrics/tools are changing

Host-wide: •Load Average •Memory usage •psutils, top, vmstat... •netstat, iostat •syslog, auditd •perf

Per-Container: •cgroup stats •PSI (especially) •eBPF (per process) •USDT, syscalls... •sysdig/falco •perf --cgroup

Slide 47

Slide 47 text

Detecting per-container overload •There are (or will be) some ways: •measurement, measurement, measurement •Existing stats are trackable with cAdvisor && Prometheus •Enable PSI and watch the score •Use perf with cgroup's perf_event controller

Slide 48

Slide 48 text

perf with containers •perf events can be collected per container (cgroup), but there seem to be few examples •Example by Brendan Gregg: http://www.brendangregg.com/perf.html •More examples: counting syscalls made by a container
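
A hedged sketch of counting syscalls for a single container with perf's cgroup filter (the docker/<id> group name is relative to the perf_event hierarchy and depends on your setup):

$ CID=$(docker inspect -f '{{.Id}}' mycontainer)
$ sudo perf stat -e 'raw_syscalls:sys_enter' -a --cgroup docker/${CID} -- sleep 10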

Slide 49

Slide 49 text

eBPF •eBPF is an emerging technology for tracing and measuring all kinds of Linux events, faster and more powerful than ever. •eBPF can trace perf_events directly!!! •BCC is a human-readable wrapper for eBPF •Available in C++, Python, Lua / being ported to mruby •e.g. snooping syscalls (such as execve/open) per container •My colleague's tool - detects containers via /proc/PID/cgroup
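
For instance, the stock BCC tools already snoop exec/open events host-wide, and a PID can then be mapped back to its container via its cgroup (a rough sketch; tool names and paths differ per distro package):

$ sudo /usr/share/bcc/tools/execsnoop            # trace new processes (execve) system-wide
$ sudo /usr/share/bcc/tools/opensnoop -p $PID    # trace open() calls of one PID
$ cat /proc/$PID/cgroup                          # map the PID to its container's cgroup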

Slide 50

Slide 50 text

“Using eBPF in Kubernetes” •eBPF can control and trace networking https://kubernetes.io/blog/2017/12/using-ebpf-in-kubernetes/

Slide 51

Slide 51 text

Migrating v1 -> v2

Slide 52

Slide 52 text

Case study •cgroup v2 migration at Google • https://www.linuxplumbersconf.com/event/2/contributions/204/ • https://www.linuxplumbersconf.org/event/2/contributions/204/attachments/143/378/LPC2018-cgroup-v2.pdf

Slide 53

Slide 53 text

Case study •Mixing cgroupfs v1 & cgroupfs v2: finding solutions for container runtimes •https://www.youtube.com/watch?v=P6Xnm0IhiSo •https://linuxpiter.com/system/attachments/files/000/001/342/original/Christian_Brauner_1.pdf

Slide 54

Slide 54 text

V2 support is now under discussion •https://github.com/opencontainers/runc/issues/654 •Tasks in OCI runtime spec: •https://github.com/opencontainers/runtime-spec/issues/1002

Slide 55

Slide 55 text

FYI: OCI with cgroup-v2 support •crun is a small OCI-compatible runtime, and they say it supports cgroup v2 •It defines mappings: •https://github.com/giuseppe/crun/blob/master/crun.1.md#cgroup-v2

Slide 56

Slide 56 text

Conclusion

Slide 57

Slide 57 text

Future of container tracing •OCI and container developers are working hard on cgroup-v2 •Using PSI, we can get pressure information per container in a human-readable way •But it is only available with cgroup-v2, and therefore a newer kernel •More tracing technology is coming; eBPF and BCC are especially interesting among them