SIG-Node 2020-05-12, experiences of advanced resource management in Kubernetes

Alexander Kanevskiy, Krisztian Litkey, Ismo Puustinen 2020-05-12

Agenda
§ Why?
§ Demo
§ What do we know about
  – Hardware in general
  – CPUs
  – Memory
  – UX

The stock Kubernetes + stock containerd, with some add-ons

System devices topology
[Diagram: two NUMA nodes (Node 0, Node 1) and two sockets. Socket 0 and Socket 1 each contain Cores 0–11, a Memory Controller and PCIe; the sockets are connected by UPI.]

System topology in real world
[Diagram: six NUMA nodes (Node 0–5). Package 0 and Package 1 each contain Cores 0–9 and Memory Controllers, with PCIe, UPI and DMI links; attached are a Chipset, three QAT x16 devices, an I/O Hub and a 4x10G NIC.]

CPU: Things to keep in mind for CPUs
§ CPU cores vs. threads
§ CPU core frequencies: base, turbo, throttling
§ CPU usage: Shared, Exclusive, Isolated
§ Additional CPU resources: Cache, Memory Bandwidth
§ Interconnect is an expensive resource
§ “C” in “NUMA” stands for “CPU”
  – Vendors have BIOS-configurable settings that redefine what NUMA means (SNC, NPS, …)
§ Workload migration cost: low to very low
§ CPUs as seen by the Kubelet might not mean the same thing inside VM-based runtimes

Topology-Aware CPU policy
[Diagram: System root; Socket 0 with Die 0 (CPU 0, CPU 1) and Die 1 (CPU 2, CPU 3); Socket 1 with Die 0 (CPU 4, CPU 5) and Die 1 (CPU 6, CPU 7); “Group …”; labels physical_package_id, die_id, core_id.]
For each leaf node
§ Groups of CPU+Memory
§ Dynamic pools
  – Shared
  – Exclusive
  – Isolated
  – “Throttled”
  – …
Parent nodes
§ Sum of subtree resources
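The pool tree on this slide can be sketched roughly as below. This is a minimal illustration and not the cri-resource-manager implementation: `Pool` and `build_tree` are hypothetical names, and the CPU-to-die mapping is hard-coded from the diagram instead of being read from the sysfs `physical_package_id` and `die_id` attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """A node in the topology tree; leaves own CPUs, parents aggregate them."""
    name: str
    cpus: set = field(default_factory=set)        # CPUs owned directly (leaves only)
    children: list = field(default_factory=list)

    def all_cpus(self):
        """Return the CPUs of this node plus those of its whole subtree."""
        total = set(self.cpus)
        for child in self.children:
            total |= child.all_cpus()
        return total


def build_tree(cpu_topology):
    """Build root -> socket -> die pools from {cpu: (package_id, die_id)}."""
    root = Pool("root")
    sockets = {}
    dies = {}
    for cpu, (pkg, die) in sorted(cpu_topology.items()):
        if pkg not in sockets:
            sockets[pkg] = Pool(f"socket{pkg}")
            root.children.append(sockets[pkg])
        if (pkg, die) not in dies:
            dies[(pkg, die)] = Pool(f"socket{pkg}/die{die}")
            sockets[pkg].children.append(dies[(pkg, die)])
        dies[(pkg, die)].cpus.add(cpu)
    return root


# Layout from the slide: 2 sockets x 2 dies x 2 CPUs (hard-coded assumption).
topology = {0: (0, 0), 1: (0, 0), 2: (0, 1), 3: (0, 1),
            4: (1, 0), 5: (1, 0), 6: (1, 1), 7: (1, 1)}
root = build_tree(topology)
for socket in root.children:
    print(socket.name, sorted(socket.all_cpus()))
```

Only the die-level leaves hold CPUs directly; every parent computes its resources on demand as the union of its subtree, which mirrors the "sum of subtree resources" bullet.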

Memory: Things to keep in mind for Memory
§ Memory types
  – DRAM
  – “Persistent”, in “volatile, system RAM mode” (PMEM)
  – High Bandwidth (HBM)
§ Kernel’s “Normal” vs. “Movable”
§ NUMA
  – Distances
  – Have CPU
  – Have “normal” memory
§ Workload migration costs: medium to HIGH
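To make the "Distances" bullet concrete, here is a small sketch that ranks NUMA nodes by distance from a given node. It assumes each row has the space-separated format Linux exposes in `/sys/devices/system/node/node<N>/distance`; the sample values are made up, and `parse_distances` and `nearest_nodes` are hypothetical helper names.

```python
def parse_distances(rows):
    """Parse per-node distance rows (one string per node) into a matrix."""
    return [[int(value) for value in row.split()] for row in rows]


def nearest_nodes(matrix, origin):
    """Return the other node ids ordered from closest to farthest from origin."""
    others = [node for node in range(len(matrix)) if node != origin]
    return sorted(others, key=lambda node: (matrix[origin][node], node))


# Made-up sample for a four-node system; real values come from the platform.
sample_rows = ["10 21 17 28", "21 10 28 17", "17 28 10 28", "28 17 28 10"]
distances = parse_distances(sample_rows)
print(nearest_nodes(distances, 0))
```

On a real system the rows would be read from sysfs, one file per node, rather than taken from a literal list.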

MemTier Topology-Aware policy
[Diagram: System root; Socket 0; Die 0 with "IO and Memory" and CPUs (Core with two Threads, Core...), and memory types DRAM, PMEM, HBM; further dies and sockets elided; NUMA A, NUMA B, NUMA C.]
Each Node
§ CPU
  – CPU-less NUMA nodes are linked to nodes with CPUs
§ All memory types
  – DRAM
  – PMEM
  – HBM
§ Placement cost is calculated based on
  – Requested memory type(s) and amount of available memory
  – Later: BW, WSS
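The slide does not give the actual cost formula, so the following is only an invented heuristic showing how a placement cost could combine requested memory types with available memory. All names (`placement_cost`, `pick_node`, the node dictionaries) and the scoring rule itself are assumptions for illustration.

```python
def placement_cost(node, request):
    """Return a cost for placing `request` on `node`, or None if it does not fit.

    node:    {"free": {memory_type: bytes_free}}
    request: {"types": [memory types the workload accepts], "amount": bytes}
    """
    usable = sum(node["free"].get(mem_type, 0) for mem_type in request["types"])
    if usable < request["amount"]:
        return None  # not enough memory of the requested types
    # Assumed heuristic: the fuller the node would become, the higher the cost.
    return request["amount"] / usable


def pick_node(nodes, request):
    """Return the name of the cheapest node that fits, or None if none fits."""
    costs = {name: placement_cost(node, request) for name, node in nodes.items()}
    fitting = {name: cost for name, cost in costs.items() if cost is not None}
    if not fitting:
        return None
    return min(fitting, key=lambda name: (fitting[name], name))


GiB = 1 << 30
# Hypothetical free-memory figures per NUMA node.
nodes = {
    "numa-a": {"free": {"DRAM": 8 * GiB, "PMEM": 64 * GiB}},
    "numa-b": {"free": {"DRAM": 16 * GiB}},
    "numa-c": {"free": {"DRAM": 2 * GiB, "HBM": 4 * GiB}},
}
print(pick_node(nodes, {"types": ["DRAM"], "amount": 4 * GiB}))
print(pick_node(nodes, {"types": ["DRAM", "PMEM"], "amount": 32 * GiB}))
```

A request that no node can satisfy returns `None`, which corresponds to the "Reject? Rebalance?" question raised later in the deck.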

Linux OS Memory Tiering: Principle of operation
§ Memory pages promoted from PMEM to DRAM when capacity available
§ Cold pages in DRAM demoted to PMEM
[Diagram labels: Page Promotions, Page Demotions, DDR]
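The promotion and demotion principle can be illustrated with a toy model. This is not how the kernel implements tiering; it is a simplified sketch in which pages are plain ids, "coldness" is an access count below an assumed threshold, and `rebalance` is a hypothetical function name.

```python
def rebalance(dram, pmem, dram_capacity, access_counts, cold_threshold):
    """Run one toy rebalancing pass over sets of page ids.

    Demote DRAM pages accessed fewer than `cold_threshold` times, then promote
    the most-accessed PMEM pages while DRAM has free capacity.
    """
    dram, pmem = set(dram), set(pmem)
    cold = {page for page in dram if access_counts.get(page, 0) < cold_threshold}
    dram -= cold
    pmem |= cold
    # Only pages that are not cold are promotion candidates, hottest first.
    candidates = sorted(
        (page for page in pmem if access_counts.get(page, 0) >= cold_threshold),
        key=lambda page: (-access_counts.get(page, 0), page),
    )
    for page in candidates:
        if len(dram) >= dram_capacity:
            break  # no DRAM capacity left
        pmem.remove(page)
        dram.add(page)
    return dram, pmem


# Made-up access counts per page id.
counts = {1: 10, 2: 0, 3: 7, 4: 1, 5: 9}
new_dram, new_pmem = rebalance({1, 2}, {3, 4, 5}, 2, counts, 2)
print(sorted(new_dram), sorted(new_pmem))
```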

User-friendly resource controls
§ Good UX that spans public cloud, VMs, bare metal is hard
  – Especially for non-trivial resources (block I/O, caches, …)
§ Placing workloads might lead to situations where it can’t be done
  – Reject?
  – Rebalance?
§ Rebalancing of the running workloads can also be hard
  – Assigned devices
  – Memory migrations
  – Priorities
The story of jar, rocks, pebbles and sand…

What next?
UX is the key in our opinion
§ Users should expect “it just works great” by default
§ Advanced users should be able to utilize good patterns on resource groups
  – Affinity/anti-affinity pattern
  – Device pipelines
§ Solutions that we build now should be aligned with where hardware is evolving to
§ Maybe not in Kubelet…?

CRI Resource Manager: Description of the functionality
https://github.com/intel/cri-resource-manager
• is a Container Runtime Interface proxy
• sits between CRI Clients and the CRI Runtime
• applies (hardware) resource policies to containers
  CPU, Memory, Cache, Memory Bandwidth, Block I/O, …
• policies are applied by
  • modifying proxied container requests, or
  • generating container update requests, or
  • triggering extra policy-specific actions during request processing
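The "modifying proxied container requests" mechanism might look roughly like the sketch below. The request is a simplified dictionary stand-in, not the real CRI protobuf message, and `apply_policy`, `proxy_create_container`, `fixed_policy` and `fake_runtime` are hypothetical names; only the `cpuset_cpus` and `cpuset_mems` field names are borrowed from the CRI Linux container resources.

```python
import copy


def apply_policy(request, cpuset_cpus, cpuset_mems):
    """Return a copy of the request with CPU and memory pinning filled in."""
    modified = copy.deepcopy(request)  # never mutate the client's request
    resources = modified.setdefault("linux", {}).setdefault("resources", {})
    resources["cpuset_cpus"] = cpuset_cpus
    resources["cpuset_mems"] = cpuset_mems
    return modified


def proxy_create_container(request, runtime, policy):
    """Apply the policy decision, then forward the request to the runtime."""
    cpus, mems = policy(request)
    return runtime(apply_policy(request, cpus, mems))


# Stand-ins for a real policy and a real CRI runtime.
def fixed_policy(request):
    return "2-3", "0"


def fake_runtime(request):
    return {"accepted": request}


original = {"name": "demo", "linux": {"resources": {"cpu_shares": 1024}}}
response = proxy_create_container(original, fake_runtime, fixed_policy)
print(response["accepted"]["linux"]["resources"])
```

The client and the runtime both see an ordinary request and response; the policy decision is injected in between, which is what lets the proxy work with stock Kubernetes and a stock runtime.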

Legal notices and disclaimers
§ Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation.
§ Performance varies depending on system configuration.
§ No computer system can be absolutely secure.
§ Check with your system manufacturer or retailer or learn more at www.intel.com.
§ Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
§ *Other names and brands may be claimed as the property of others.
§ © Intel Corporation
