CloudConf Italy - A Tale of Tail Latency

A Tale of Tail Latency (Understanding Kubernetes CPU Requests and Limits for Sustainability and Profit)
Ara Pulido @arapulido

Datadog is a platform that helps companies improve the observability and
security of their infrastructure and applications.

@arapulido Understanding Kubernetes CPU requests and limits for SUSTAINABILITY &
PROFIT

@arapulido All of this would be less interesting if ENERGY
WAS FREE

@arapulido Why right-sizing is so important
• Too small: CPU throttling
• Too big: cloud costs, energy waste

@arapulido CPU requests & KUBERNETES SCHEDULING

@arapulido Node
• System processes (system-reserved)
• Kubernetes processes (kube-reserved)
• eviction-threshold
• Allocatable
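
The node breakdown above can be read as simple arithmetic: what is left for pods (Allocatable) is the node capacity minus the reserved slices. The following is a minimal sketch of that subtraction and of the scheduler-style check it enables. The function names and the example figures (a 4-core node, 100m reservations) are hypothetical, and eviction thresholds are typically configured for memory and disk rather than CPU, so that term defaults to 0 here.

```python
def allocatable_millicores(capacity, system_reserved=0, kube_reserved=0,
                           eviction_threshold=0):
    """Return the CPU (in millicores) left for pods on a node.

    All arguments are millicores. The names follow the labels on the slide;
    the values passed in below are made up for illustration.
    """
    allocatable = capacity - system_reserved - kube_reserved - eviction_threshold
    return max(allocatable, 0)


def fits(pod_request, allocatable, already_requested):
    """Scheduling-style check: a pod fits if the sum of requests stays within allocatable."""
    return already_requested + pod_request <= allocatable


# Hypothetical 4-core node with 100m reserved for the system and 100m for Kubernetes.
node = allocatable_millicores(capacity=4000, system_reserved=100, kube_reserved=100)
print(node)                                      # 3800
print(fits(500, node, already_requested=3400))   # False: 3900m > 3800m
```

Note that the check uses requests, not actual usage: a node can be "full" from the scheduler's point of view while its CPUs are mostly idle.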

@arapulido CPU requests & LINUX CPU SCHEDULING

@arapulido Completely Fair Scheduler: CPU requests translate proportionally to CPU shares
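
As an illustration of "proportionally", here is a minimal sketch of the request-to-shares conversion. It is assumed to mirror the arithmetic the kubelet applies on cgroup v1 (1024 shares per full CPU, with a floor of 2 shares); the constant and function names are illustrative, and cgroup v2 expresses the same idea through a different weight scale.

```python
SHARES_PER_CPU = 1024    # shares granted for one full CPU (assumed cgroup v1 convention)
MILLI_CPU_TO_CPU = 1000  # millicores per CPU
MIN_SHARES = 2           # assumed floor, applied even when no request is set


def milli_cpu_to_shares(milli_cpu):
    """Convert a CPU request in millicores to CFS cpu.shares."""
    if milli_cpu == 0:
        # No request: the container still gets the minimum number of shares.
        return MIN_SHARES
    shares = milli_cpu * SHARES_PER_CPU // MILLI_CPU_TO_CPU
    return max(shares, MIN_SHARES)


print(milli_cpu_to_shares(250))   # 256
print(milli_cpu_to_shares(1000))  # 1024
print(milli_cpu_to_shares(0))     # 2
```

Under these assumptions a container with no CPU request is not starved completely, but it gets only a tiny weight compared with its neighbours.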

@arapulido CPU requests: used for Kubernetes scheduling and for Linux scheduling

@arapulido Linux scheduling: in case of contention, CPU time is split in proportion 1:2:1
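
A short worked example of the 1:2:1 proportion, as a sketch: when every container wants more CPU than is available, each one receives a slice proportional to its shares. The pod names, share values, and the 4-core node are hypothetical.

```python
def share_cpu_time(shares, capacity_millicores):
    """Split CPU capacity proportionally to shares, assuming every container is busy."""
    total = sum(shares.values())
    return {name: capacity_millicores * s / total for name, s in shares.items()}


# Three hypothetical pods whose requests translate to shares in a 1:2:1 ratio.
shares = {"pod1": 512, "pod2": 1024, "pod3": 512}
print(share_cpu_time(shares, capacity_millicores=4000))
# {'pod1': 1000.0, 'pod2': 2000.0, 'pod3': 1000.0}
```

The proportion only matters under contention. If pod2 is idle, the other two pods are free to use the CPU time it leaves on the table.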

@arapulido Noisy neighbours IF WE ARE NOT RIGHT-SIZED

@arapulido “Noisy” neighbours

@arapulido What if I set no requests?
Creating a pod without CPU requests or limits will effectively allow it to be scheduled on any suitable node, regardless of the amount of CPU left on that node. In practice, it will still get some minimal CPU guarantees.

@arapulido Metrics to watch
• kubernetes.cpu.usage.total: Type: gauge. Number of nanocores used. High cardinality, including pod_name, container_name, container_id.
• kubernetes.cpu.requests: Type: gauge. Number of nanocores requested. High cardinality, including pod_name, container_name, container_id.
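
One way to use these two metrics together is to compare usage against requests per container. The sketch below assumes both values have already been fetched in the same unit; the function names and the 0.5 / 1.0 thresholds are hypothetical and would need tuning for a real workload.

```python
def request_utilization(usage_nanocores, requests_nanocores):
    """Ratio of CPU used to CPU requested, or None when no request is set."""
    if requests_nanocores <= 0:
        return None
    return usage_nanocores / requests_nanocores


def classify(utilization, low=0.5, high=1.0):
    """Rough right-sizing label; the thresholds are illustrative assumptions."""
    if utilization is None:
        return "no request set"
    if utilization < low:
        return "over-provisioned"
    if utilization > high:
        return "under-provisioned"
    return "right-sized"


# A container using 0.2 cores while requesting 1 core.
ratio = request_utilization(200_000_000, 1_000_000_000)
print(ratio, classify(ratio))  # 0.2 over-provisioned
```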

@arapulido CPU limits & LINUX CFS QUOTA

@arapulido CGROUPS • Quota • Period (100ms, default)
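
To make the quota/period pair concrete, here is a minimal sketch of how a CPU limit can be turned into a CFS quota: the quota is the slice of each period the container may run. The 100ms period comes from the slide; the 1ms minimum quota and the names used are assumptions for illustration.

```python
DEFAULT_PERIOD_US = 100_000  # 100ms, the default period from the slide
MIN_QUOTA_US = 1_000         # assumed floor of 1ms per period


def cfs_quota_us(limit_millicores, period_us=DEFAULT_PERIOD_US):
    """CPU time (in microseconds) a container may use in each period.

    Returns None when no limit is set, meaning no quota is enforced.
    """
    if limit_millicores == 0:
        return None
    quota = limit_millicores * period_us // 1000
    return max(quota, MIN_QUOTA_US)


print(cfs_quota_us(500))   # 50000  -> 50ms of CPU per 100ms period
print(cfs_quota_us(2000))  # 200000 -> two full cores' worth per period
print(cfs_quota_us(0))     # None   -> unlimited
```

A quota larger than the period is normal: a limit of 2 CPUs means 200ms of CPU time per 100ms of wall-clock time, spread across threads.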

@arapulido CPU Limits CPU cycles needs

@arapulido Containers get throttled
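
This is where the tail latency in the title comes from. The sketch below is a simplified model, not the kernel's actual accounting: a single-threaded task starts at a period boundary, runs until its quota is spent, and then waits for the next period. All names and numbers are hypothetical, and times are whole milliseconds.

```python
def wall_time_ms(cpu_needed_ms, quota_ms, period_ms=100):
    """Wall-clock time to finish a single-threaded task under a CFS quota."""
    if cpu_needed_ms <= 0:
        return 0
    if quota_ms >= period_ms:
        # One thread can never use more than a full period, so it is never throttled.
        return cpu_needed_ms
    full_periods, remainder = divmod(cpu_needed_ms, quota_ms)
    if remainder == 0:
        # The task finishes exactly when the last quota slice is used up.
        return (full_periods - 1) * period_ms + quota_ms
    return full_periods * period_ms + remainder


# A request needing 30ms of CPU with a 100m limit (10ms of quota per 100ms).
print(wall_time_ms(30, quota_ms=10))  # 210
# The same request with a 500m limit (50ms of quota per 100ms).
print(wall_time_ms(30, quota_ms=50))  # 30
```

In this model a request that needs 30ms of CPU takes 210ms to complete, because the container sits throttled for 90ms in each of the first two periods.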

@arapulido Metrics to watch
• container.cpu.throttled: Type: gauge. The total CPU throttled time (nanoseconds). High cardinality, including pod_name, container_name, container_id.
• container.cpu.throttled.periods: Type: gauge. The number of periods during which the container was throttled. High cardinality, including pod_name, container_name, container_id.
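
On Linux, the kernel exposes throttling counters in each cgroup's `cpu.stat` file, and metrics like the ones above are presumably derived from counters of this kind. The following sketch parses a sample in the cgroup v1 layout and computes the fraction of periods that were throttled. The sample values are made up, and the helper names are hypothetical.

```python
def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Made-up sample in the cgroup v1 layout (throttled_time is in nanoseconds).
SAMPLE = """nr_periods 200
nr_throttled 50
throttled_time 1500000000
"""

stats = parse_cpu_stat(SAMPLE)
print(throttled_ratio(stats))  # 0.25
```

A ratio is often easier to alert on than the raw counters, since the counters only ever grow over the lifetime of the container.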

@arapulido Why set CPU limits
• Benchmarking
• Multi-tenant environments
• Predictability
• Guaranteed Quality of Service

@arapulido But wait! THERE’S MORE

@arapulido CPUManager

@arapulido CPUManager policies
none:
• All cores are shared
• A workload task/thread can migrate from one CPU to another, as the kernel scheduler sees fit
static:
• Linux CPUSets
• Exclusive access to cores
• Caveat: exclusive access only applies to containers (not system processes)

@arapulido Example of pod with exclusive CPUs
• Requests = Limits
• Both CPU and memory requests / limits are set
• CPU request is an integer
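
The three conditions above can be expressed as a small check on a container's `resources` block. This is a simplified sketch under stated assumptions: quantities are compared as plain strings, only whole-number CPU values such as "2" count as integers (so "2000m" is rejected here), and the function name is hypothetical. It also assumes the node runs the static CPUManager policy.

```python
def gets_exclusive_cpus(resources):
    """Check the slide's three conditions on a container 'resources' dict."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    for name in ("cpu", "memory"):
        # Both CPU and memory must be set on both sides...
        if name not in requests or name not in limits:
            return False
        # ...and requests must equal limits.
        if requests[name] != limits[name]:
            return False
    cpu = requests["cpu"]
    # The CPU request must be a whole number of cores.
    return cpu.isdigit() and int(cpu) >= 1


pinned = {"requests": {"cpu": "2", "memory": "1Gi"},
          "limits": {"cpu": "2", "memory": "1Gi"}}
shared = {"requests": {"cpu": "1500m", "memory": "1Gi"},
          "limits": {"cpu": "1500m", "memory": "1Gi"}}
print(gets_exclusive_cpus(pinned))  # True
print(gets_exclusive_cpus(shared))  # False: fractional CPU request
```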

@arapulido Metrics to watch
• container.cpu.limit: Type: gauge. Maximum CPU time available to the container (in nanocores). High cardinality, including pod_name, container_name, container_id.
• Computed as Min(container limit, host capacity - static assignments)
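
The Min(...) formula can be worked through with numbers. In this sketch the host size (4 cores) and the 2 pinned cores are hypothetical, as is the function name; values are in cores for readability rather than nanocores.

```python
def effective_cpu_limit(container_limit, host_capacity, static_assignments):
    """Min(container limit, host capacity - static assignments), in cores.

    container_limit may be None when the container has no CPU limit set.
    """
    shared_pool = host_capacity - static_assignments
    if container_limit is None:
        return shared_pool
    return min(container_limit, shared_pool)


# A container limited to 3 cores on a hypothetical 4-core host.
print(effective_cpu_limit(3, host_capacity=4, static_assignments=0))  # 3
# Another pod is then given exclusive access to 2 cores.
print(effective_cpu_limit(3, host_capacity=4, static_assignments=2))  # 2
```

Once cores are pinned to another pod, the effective limit of a container in the shared pool drops below its configured limit, even though its manifest has not changed.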

@arapulido How it works

@arapulido pod1=3cores (graph legend: yellow is container.cpu.limit, purple is container.cpu.usage)

@arapulido pod1=3cores, can only use up to the new limit

@arapulido Summary
• CPU requests are not only used to schedule pods on nodes. They are also used to proportionally distribute CPU time in case of contention.
• CPU limits will affect your application performance. But your container will keep running (not evicted); it will be throttled.
• CPU pinning can be helpful for certain applications. But take into account that the amount of CPU available for the rest of the applications on the same node will be reduced.

Thanks! Ara Pulido @arapulido
