Slide 1

Block I/O and Cache controls in the Runtimes
TAG-Runtimes, 2021-05-20
Markus Lehtonen, Antti Kervinen, Alexander Kanevskiy, Krisztian Litkey

Slide 2

The challenges of Block I/O and Cache

Slide 3

Challenges: resctrl

§ Cache and Memory Bandwidth
  • Allocation and monitoring
  • Limited number of classes
  • Exclusive cache lanes
  • Node hardware specific

"intelRdt": {
  "closID": "guaranteed_group",
  "l3CacheSchema": "L3:0=7f0;1=1f",
  "memBwSchema": "MB:0=20;1=70"
}
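The l3CacheSchema above encodes, per cache ID, a hex bitmask of cache ways. As a rough illustration of why these schemas are node hardware specific, the sketch below parses such a string and counts the ways each mask grants; parseL3Schema is a hypothetical helper, not part of any runtime API.

```go
package main

import (
	"fmt"
	"math/bits"
	"strings"
)

// parseL3Schema parses a resctrl-style L3 schema string such as
// "L3:0=7f0;1=1f" into a map from cache ID to the number of cache
// ways granted by the hex bitmask. (Illustrative helper only.)
func parseL3Schema(schema string) (map[int]int, error) {
	body := strings.TrimPrefix(schema, "L3:")
	ways := make(map[int]int)
	for _, part := range strings.Split(body, ";") {
		var id int
		var mask uint64
		if _, err := fmt.Sscanf(part, "%d=%x", &id, &mask); err != nil {
			return nil, fmt.Errorf("bad schema part %q: %w", part, err)
		}
		ways[id] = bits.OnesCount64(mask)
	}
	return ways, nil
}

func main() {
	ways, err := parseL3Schema("L3:0=7f0;1=1f")
	if err != nil {
		panic(err)
	}
	// cache 0: mask 0x7f0 grants 7 ways; cache 1: mask 0x1f grants 5 ways
	fmt.Println(ways)
}
```

Writing this mask by hand per node is exactly the usability gap the class-based proposal addresses.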

Slide 4

Challenges: blkio

§ More complex resource
  • Weight is relative: it does not map to an absolute capacity
  • Weight can be per device
  • Throttling is per device
§ Node level
  • Mapping classes to actual per-device parameters

"blockIO": {
  "weight": 10,
  "weightDevice": [
    { "major": 8, "minor": 0, "weight": 500 },
    { "major": 8, "minor": 16, "weight": 400 }
  ],
  "throttleReadBpsDevice": [
    { "major": 8, "minor": 0, "rate": 600 }
  ],
  "throttleWriteIOPSDevice": [
    { "major": 8, "minor": 16, "rate": 300 }
  ]
}
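The "mapping classes to per-device parameters" step could look like the sketch below: a node-level class name expands into concrete weightDevice entries. The struct definitions are minimal local mirrors of the OCI blockIO fields shown above, and the class table is a hypothetical stand-in for configuration a runtime would load.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal local mirrors of the OCI blockIO fields shown above
// (illustrative; the real types live in the OCI runtime-spec).
type WeightDevice struct {
	Major  int64  `json:"major"`
	Minor  int64  `json:"minor"`
	Weight uint16 `json:"weight"`
}

type BlockIO struct {
	Weight       uint16         `json:"weight"`
	WeightDevice []WeightDevice `json:"weightDevice,omitempty"`
}

// expandClass maps a node-level class name to concrete per-device
// parameters. The table below is a hypothetical node configuration.
func expandClass(class string) (*BlockIO, error) {
	classes := map[string]*BlockIO{
		"highprio": {
			Weight: 10,
			WeightDevice: []WeightDevice{
				{Major: 8, Minor: 0, Weight: 500},
				{Major: 8, Minor: 16, Weight: 400},
			},
		},
	}
	io, ok := classes[class]
	if !ok {
		return nil, fmt.Errorf("unknown blockio class %q", class)
	}
	return io, nil
}

func main() {
	io, err := expandClass("highprio")
	if err != nil {
		panic(err)
	}
	out, _ := json.Marshal(io)
	fmt.Println(string(out))
}
```

The point is that major:minor numbers differ between nodes, so only the node (or its runtime) can do this expansion — a workload author should only name a class.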

Slide 5

The idea: how to make these usable…

Slide 6

Class-based resources

§ Block I/O
  • Containers can be assigned to a Block I/O class
  • Each class can define
    • Per-device throttling of I/O ops / bandwidth
    • Priorities
§ Intel RDT
  • Containers can be assigned to an RDT class
  • LLC and Memory Bandwidth
  • Shared and exclusive class allocations
  • User-friendly configuration

Slide 7

How that idea looks in reality

Slide 8

Cache configuration

§ Example of cache partitioning based on Kubernetes QoS

partitions:
  default: # Only one partition, which gets all resources
    l3Allocation: "100%"
    mbAllocation: ["100%"]
    classes:
      # L3 cache lines are partially shared between classes
      Guaranteed:
        # Guaranteed can use the full cache and mem bw
        l3Allocation: "100%"
        mbAllocation: ["100%"]
      Burstable:
        # Burstable can use 60% of the cache lines and 50% of mem bw
        l3Allocation: "60%"
        mbAllocation: ["50%"]
      Besteffort:
        # Besteffort can use 30% of the cache lines but 50% of mem bw
        l3Allocation: "30%"
        mbAllocation: ["50%"]
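Behind the scenes, a percentage like "60%" still has to become a resctrl cache-way bitmask for the node's actual cache geometry. A minimal sketch of that translation, assuming a contiguous mask and a hypothetical pctToBitmask helper (real tooling such as goresctrl also has to handle minimum-way constraints and exclusive partitions):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pctToBitmask turns an l3Allocation percentage such as "60%" into a
// contiguous resctrl cache-way bitmask for a cache with totalWays
// ways. Illustrative only.
func pctToBitmask(pct string, totalWays int) (uint64, error) {
	p, err := strconv.Atoi(strings.TrimSuffix(pct, "%"))
	if err != nil || p < 1 || p > 100 {
		return 0, fmt.Errorf("bad percentage %q", pct)
	}
	ways := totalWays * p / 100
	if ways < 1 {
		ways = 1 // resctrl requires at least one way
	}
	return (uint64(1) << uint(ways)) - 1, nil
}

func main() {
	// "60%" of a 20-way cache -> 12 ways -> mask 0xfff
	mask, err := pctToBitmask("60%", 20)
	if err != nil {
		panic(err)
	}
	fmt.Printf("0x%x\n", mask)
}
```

This is the node-specific detail the percentage-based configuration hides from the user.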

Slide 9

Cache configuration

§ Example of cache partitioning based on “Billing QoS”

partitions:
  exclusive:
    # Partition exclusively gets 60% of all cache lines
    l3Allocation: "60%"
    classes:
      gold:
        # This single class gets 100% of what was allocated
        # for the partition (i.e. 60% of all cache lines)
        l3Allocation: "100%"
  shared:
    # Partition exclusively gets 40% of all cache lines
    l3Allocation: "40%"
    classes:
      silver:
        # "silver" gets 100% of what was allocated for
        # the partition (i.e. 40% of all cache lines)
        l3Allocation: "100%"
      bronze:
        # "bronze" only gets 50% of what was allocated for
        # the partition (i.e. 20% of all cache lines)
        l3Allocation: "50%"

Slide 10

Pod annotations

§ Annotations can be
  • For the whole Pod
  • Per container (overrides the Pod-level class)

apiVersion: v1
kind: Pod
metadata:
  name: test
  annotations:
    rdt.resources.beta.kubernetes.io/pod: bronze
    rdt.resources.beta.kubernetes.io/container.container2: gold
spec:
  containers:
  - name: container1
    image: k8s.gcr.io/pause
  - name: container2
    image: k8s.gcr.io/pause
  - name: container3
    image: k8s.gcr.io/pause

Slide 11

Block I/O configuration

§ Example: prioritization and throttling

classes:
  highprio:
  - weight: 400
  throttled:
  - devices:
    - /dev/disk/by-id/!(*-part?)
    throttlereadbps: 40M
    throttlewritebps: 40M
  - devices:
    - /dev/disk/by-id/*SSD!(*-part?)
    throttlereadbps: 80M
    throttlewritebps: 80M
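Human-readable rates like 40M eventually have to become the raw bytes-per-second numbers that cgroup throttling takes. A small sketch of that conversion, assuming binary suffixes (K/M/G = 2^10/2^20/2^30 — an assumption here; the actual suffix semantics are defined by the configuration format) and a hypothetical parseRate helper:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRate converts a human-readable rate such as "40M" into bytes
// per second, assuming binary suffixes. Illustrative only.
func parseRate(s string) (int64, error) {
	mult := int64(1)
	switch {
	case strings.HasSuffix(s, "K"):
		mult, s = 1<<10, strings.TrimSuffix(s, "K")
	case strings.HasSuffix(s, "M"):
		mult, s = 1<<20, strings.TrimSuffix(s, "M")
	case strings.HasSuffix(s, "G"):
		mult, s = 1<<30, strings.TrimSuffix(s, "G")
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("bad rate %q: %w", s, err)
	}
	return n * mult, nil
}

func main() {
	bps, err := parseRate("40M")
	if err != nil {
		panic(err)
	}
	fmt.Println(bps) // 41943040 bytes/s
}
```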

Slide 12

Pod annotations: Cache and Block I/O

§ Annotations can be
  • For the whole Pod
  • Per container
    • overrides the Pod-level class

apiVersion: v1
kind: Pod
metadata:
  name: db-servers
  annotations:
    blockio.resources.beta.kubernetes.io/pod: highprio
    blockio.resources.beta.kubernetes.io/container.backup: throttled
    rdt.resources.beta.kubernetes.io/pod: gold
    rdt.resources.beta.kubernetes.io/container.backup: bronze
spec:
  containers:
  - name: server0
    image: k8s.gcr.io/pause
  - name: server1
    image: k8s.gcr.io/pause
  - name: backup
    image: k8s.gcr.io/pause
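The override rule above boils down to a two-step lookup: try the container-scoped key first, then fall back to the pod-scoped key. A sketch of that logic (classFor is a hypothetical helper, not the runtimes' actual implementation; the annotation key layout follows the example above):

```go
package main

import "fmt"

// classFor resolves the effective class for a container: a
// container-scoped annotation overrides the pod-scoped one. prefix is
// e.g. "rdt.resources.beta.kubernetes.io" or
// "blockio.resources.beta.kubernetes.io".
func classFor(annotations map[string]string, prefix, container string) (string, bool) {
	if c, ok := annotations[prefix+"/container."+container]; ok {
		return c, true
	}
	c, ok := annotations[prefix+"/pod"]
	return c, ok
}

func main() {
	ann := map[string]string{
		"blockio.resources.beta.kubernetes.io/pod":              "highprio",
		"blockio.resources.beta.kubernetes.io/container.backup": "throttled",
	}
	for _, name := range []string{"server0", "backup"} {
		c, _ := classFor(ann, "blockio.resources.beta.kubernetes.io", name)
		// server0 inherits the pod class; backup is overridden
		fmt.Println(name, "->", c)
	}
}
```

With the db-servers Pod above, server0 and server1 resolve to highprio while backup resolves to throttled (and, for RDT, to bronze instead of gold).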

Slide 13

Implementation

Slide 14

Implementation

§ We have PRs open against CRI-O and containerd
  • Container Runtimes read the RDT & Block I/O config file and configure resctrl via goresctrl
  • The Container Runtime interprets both container and Pod annotations and assigns containers to classes accordingly (modifying the generated OCI configs)
  • goresctrl library code to be moved under opencontainers, to be re-usable across runtimes
§ PRs
  • CRI-O
    • https://github.com/cri-o/cri-o/pull/4830 – RDT
    • https://github.com/cri-o/cri-o/pull/4873 – Block I/O
  • containerd
    • https://github.com/containerd/containerd/pull/5439 – RDT
    • https://github.com/containerd/containerd/pull/5490 – Block I/O

Slide 15

Next steps

§ Phase 0: initial support via annotations
  • Pod annotations and container (device plugin) annotations from the Kubelet are propagated to the Runtimes
  • Container Runtimes apply classes based on those annotations
§ Phase 1: CRI support
  • Support Cache & Block I/O classes in the CRI protocol
  • Fields in Pod and Container resources to define classes
  • Runtimes still interpret annotations when the CRI field is not populated
§ Phase 2: “first-class citizen” in the Kubernetes Pod spec
  • Deprecate annotations
  • Cache & Block I/O classes become fields in the Pod and Container scopes
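The phase-1 precedence rule can be sketched in a few lines: a populated CRI field wins, otherwise the runtime falls back to the annotation-derived class. (Hypothetical helper and names; the real CRI fields do not exist yet at this stage.)

```go
package main

import "fmt"

// resolveClass sketches the phase-1 precedence: use the class from
// the CRI field when populated, otherwise fall back to the class
// derived from annotations. Illustrative only.
func resolveClass(criField, annotationClass string) string {
	if criField != "" {
		return criField
	}
	return annotationClass
}

func main() {
	fmt.Println(resolveClass("gold", "bronze")) // CRI field wins: gold
	fmt.Println(resolveClass("", "bronze"))     // fallback to annotation: bronze
}
```

This keeps phase-0 workloads working unchanged while the CRI fields roll out.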

Slide 16
