
Resource control in production at Meta

Control groups (or cgroups for short) are one of the most fundamental technologies underpinning our modern love of containerisation and resource control. Back in 2016, we released a complete overhaul of how cgroups work internally: cgroup v2, released with Linux 4.5. This brought many new and exciting possibilities to increase system stability and throughput, but with those possibilities have also come challenges of a type which we have largely not faced in Linux before.

This talk will go into some of the challenges faced in overhauling Linux’s resource isolation and control capabilities, and how we’ve gone about fixing them. It will cover some of the most complex and counter-intuitive practical effects we’ve seen in production, how our expectations and knowledge have developed over the last five years of running this on over a million machines, and insights that are immediately applicable to anyone who runs Linux at scale.

We will also go over the state-of-the-art of resource control in the “real world” outside of companies like Meta and Google, looking at how cgroup v2 is changing the technical landscape for distributions and containerisation technologies for the better.

Chris Down

Kernel Recipes

September 29, 2023

Transcript

  1. 7 years of cgroup v2: The future of Linux resource control. Chris Down, Kernel, Meta. https://chrisdown.name
  2. In v1, each cgroup exists in the context of only one resource, so each controller has its own hierarchy:
     /sys/fs/cgroup
       memory: cgroup 1, cgroup 2, ...
       cpu: cgroup 3, cgroup 4, ...
     In v2, there is a single hierarchy, and resources are {en,dis}abled per-cgroup using cgroup.subtree_control:
     /sys/fs/cgroup
       cgroup 1, cgroup 2, cgroup 3, ... (with nested cgroup 4, cgroup 5, cgroup 6, ...)
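The per-cgroup controller toggling works roughly like this in practice. A minimal sketch, assuming root on a host with cgroup2 mounted at /sys/fs/cgroup; "demo" and "worker" are hypothetical cgroup names, and the block is guarded so it is a no-op elsewhere:

```shell
#!/bin/sh
# Enable the cpu and memory controllers for the children of a new cgroup.
CG=/sys/fs/cgroup
if [ -w "$CG/cgroup.subtree_control" ]; then
    mkdir -p "$CG/demo"
    # Controllers this cgroup could hand down to its children:
    cat "$CG/demo/cgroup.controllers"
    # Hand down cpu and memory:
    echo "+cpu +memory" > "$CG/demo/cgroup.subtree_control"
    mkdir -p "$CG/demo/worker"
    # worker now lists cpu and memory in its own cgroup.controllers:
    cat "$CG/demo/worker/cgroup.controllers"
else
    echo "skipped: need write access to the cgroup2 hierarchy"
fi
```

On most distributions systemd owns the top of the hierarchy, so in practice you would create such cgroups via a slice or scope rather than by hand.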
  3. Why do we need a single resource hierarchy?
     ▪ Memory starts to run out
     ▪ This causes us to reclaim page caches/swap, causing disk IO
  4. Why do we need a single resource hierarchy?
     ▪ Memory starts to run out
     ▪ This causes us to reclaim page caches/swap, causing disk IO
     ▪ This reclaim sometimes costs non-trivial CPU cycles
  5. How can you view memory usage for a process in Linux?
     ▪ SIKE, THIS SLIDE WAS A TRAP
  6. % cat /proc/self/cgroup
     0::/system.slice/foo.service
     % cat /sys/fs/cgroup/system.slice/foo.service/memory.current
     3786670080
     ▪ memory.current tells the truth, but the truth is sometimes complicated
     ▪ Slack grows to fill up to cgroup limits if there’s no global pressure
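The lookup the slide does by hand can be scripted. A sketch that resolves the current process's cgroup from the "0::" (v2) line in /proc/self/cgroup and reads its memory.current:

```shell
#!/bin/sh
# Resolve this process's cgroup on the unified (v2) hierarchy and print
# the total memory currently charged to it, in bytes.
cgpath=$(sed -n 's/^0:://p' /proc/self/cgroup)
f="/sys/fs/cgroup${cgpath}/memory.current"
if [ -r "$f" ]; then
    cat "$f"
else
    # The root cgroup has no memory.current, and v1-only hosts no 0:: line.
    echo "no readable memory.current at: $f"
fi
```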
  7. % time make -j4 -s
     real 3m58.050s  user 13m33.735s  sys 1m30.130s
     # Peak memory.current bytes: 803934208
  8. % sudo sh -c 'echo 600M > memory.high'
     % time make -j4 -s
     real 4m0.654s  user 13m28.493s  sys 1m31.509s
     # Peak memory.current bytes: 629116928
  9. % sudo sh -c 'echo 400M > memory.high'
     % time make -j4 -s
     real 4m3.186s  user 13m20.452s  sys 1m31.085s
     # Peak memory.current bytes: 419368960
  10. % sudo sh -c 'echo 300M > memory.high'
      % time make -j4 -s
      ^C
      real 9m9.974s  user 10m59.315s  sys 1m16.576s
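One way to reproduce the experiment above without poking the cgroup files by hand: with systemd (≥ 226) on cgroup v2, systemd-run's MemoryHigh= property maps onto memory.high for the transient unit's cgroup. A sketch under those assumptions, guarded so it does nothing without root and systemd:

```shell
#!/bin/sh
# Run a build under a 400M memory.high throttle in its own transient scope.
if command -v systemd-run >/dev/null 2>&1 && [ "$(id -u)" = 0 ]; then
    time systemd-run --scope -p MemoryHigh=400M make -j4 -s
else
    echo "skipped: needs root and systemd-run"
fi
```

Unlike memory.max, memory.high throttles and reclaims rather than invoking the OOM killer, which is why the 300M run above thrashed (9m+ of wall time) instead of dying.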
  11. % sudo senpai /sys/fs/cgroup/...
      2021-05-20 14:26:09 limit=100.00M pressure=0.00 delta=8432 integral=8432
      % make -j4 -s
      [...find the real usage...]
      2021-05-20 14:26:43 limit=340.48M pressure=0.16 delta=202 integral=202
      2021-05-20 14:26:44 limit=340.48M pressure=0.13 delta=0 integral=202
      bit.ly/cgsenpai
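The pressure= column senpai drives its limit from is PSI (pressure stall information), read from the cgroup's memory.pressure file; the same counters also exist system-wide. A sketch of reading the raw signal, which needs kernel 4.20+ with PSI enabled:

```shell
#!/bin/sh
# Print memory pressure stall information. "some" is the share of time at
# least one task stalled on memory; "full" is when all non-idle tasks did.
if [ -r /proc/pressure/memory ]; then
    cat /proc/pressure/memory
else
    echo "PSI not available (pre-4.20 kernel, or CONFIG_PSI disabled)"
fi
```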
  12. The memory/storage hierarchy: CPU cache, RAM, CXL, NVM + SSD, HDD
      (↑ high cost, low latency; ↓ low cost, high latency)
  13. The same hierarchy with zswap slotted in between RAM and CXL:
      CPU cache, RAM, zswap, CXL, NVM + SSD, HDD
      (↑ high cost, low latency; ↓ low cost, high latency)
  14. New swap algorithm in kernel 5.8+:
      ▪ Repeatedly faulting/evicting a cache page over and over? Evict a heap page instead
  15. New swap algorithm in kernel 5.8+:
      ▪ Repeatedly faulting/evicting a cache page over and over? Evict a heap page instead
      ▪ We only trade one type of paging for another: we’re not adding I/O load
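The faulting/eviction pattern being balanced is observable from userspace: since 5.8 the workingset refault counters in a non-root cgroup's memory.stat are split into anon (heap) and file (cache). A sketch, where system.slice is just an example path:

```shell
#!/bin/sh
# Show refault counters: workingset_refault_anon vs workingset_refault_file
# on 5.8+, or a single workingset_refault line on older kernels.
f=/sys/fs/cgroup/system.slice/memory.stat
grep '^workingset_refault' "$f" 2>/dev/null ||
    echo "no memory.stat at $f (different layout, or a cgroup v1 host)"
```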
  16. Effects of swap algorithm improvements:
      ▪ Decrease in heap memory
      ▪ Increase in cache memory
      ▪ Increase in web server performance
  17. Effects of swap algorithm improvements:
      ▪ Decrease in heap memory
      ▪ Increase in cache memory
      ▪ Increase in web server performance
      ▪ Decrease in disk I/O from paging activity
  18. Effects of swap algorithm improvements:
      ▪ Decrease in heap memory
      ▪ Increase in cache memory
      ▪ Increase in web server performance
      ▪ Decrease in disk I/O from paging activity
      ▪ Increase in workload stacking opportunities
  19. ▪ Memory starts to run out
      ▪ This causes us to reclaim page caches/swap, causing disk IO
      ▪ This reclaim sometimes costs non-trivial CPU cycles
  20. All the cool kids are using it. cgroup v2 users:
      ▪ containerd ≥ 1.4
      ▪ Docker/Moby ≥ 20.10
      ▪ podman ≥ 1.4.4
      ▪ runc ≥ 1.0.0
      ▪ systemd ≥ 226
      Distributions:
      ▪ Fedora uses it by default on ≥ 32
      ▪ Coming to other distributions by default soon™
  21. Try it out:
      ▪ cgroup_no_v1=all on the kernel command line
      ▪ Docs: bit.ly/cgroupv2doc
      ▪ Whitepaper: bit.ly/cgroupv2wp
      Feedback:
      ▪ E-mail: [email protected]
      ▪ Mastodon: @[email protected]
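A quick way to check whether a host is already on the unified hierarchy, before or after adding cgroup_no_v1=all:

```shell
#!/bin/sh
# On a v2-only host, /sys/fs/cgroup is itself a cgroup2 filesystem...
stat -fc %T /sys/fs/cgroup 2>/dev/null || echo "no /sys/fs/cgroup mount"
# ...and each process has exactly one membership line, prefixed "0::":
cat /proc/self/cgroup 2>/dev/null || echo "no cgroup support"
```

stat printing "cgroup2fs" means pure v2; "tmpfs" at that path indicates the legacy v1 layout.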