instant payment platform created and managed by the monetary authority of Brazil, the Central Bank of Brazil (BCB), which enables the quick execution (max. 10 seconds) of payments and transfers 24/7.
low CPU usage
• But when there's a load spike, the CPU becomes a bottleneck
• Latencies skyrocket, and the system sometimes becomes unresponsive
• Resolution: more CPU capacity!?
drop significantly when a bottleneck is reached
The median CPU usage (blue line) was dropping over time, and the number of cores (bars) was growing at a higher pace than the load
Symptoms of a bottleneck
requests | cpu.shares | Soft limit | Prioritization when all CPUs are busy | No
CPU limits | cpu.quota | Hard limit | Enforced even if node has available CPU | Yes**
** Partially, since it's a quota and not a concurrency limit, but it's generally enough
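The footnote above (a quota is not a concurrency limit) can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes cgroup v1, where the quota lives in `cpu.cfs_quota_us` with a default 100 ms period, and it uses the commonly documented millicore conversions (1024 shares per CPU). The helper names are hypothetical, and `throttled_ms_per_period` assumes every thread is continuously runnable and the node has enough free cores.

```python
# Sketch: how CPU requests/limits (in millicores) map to cgroup v1 values,
# and why a quota alone does not cap concurrency.
# Constants are assumptions based on the commonly documented defaults.
SHARES_PER_CPU = 1024
MIN_SHARES = 2
CFS_PERIOD_US = 100_000  # default CFS period: 100 ms
MIN_QUOTA_US = 1_000


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """CPU request -> cpu.shares (soft, relative weight)."""
    if milli_cpu == 0:
        return MIN_SHARES
    return max(MIN_SHARES, milli_cpu * SHARES_PER_CPU // 1000)


def milli_cpu_to_quota_us(milli_cpu: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU limit -> quota in microseconds per period (hard cap); -1 means unlimited."""
    if milli_cpu == 0:
        return -1
    return max(MIN_QUOTA_US, milli_cpu * period_us // 1000)


def throttled_ms_per_period(limit_milli_cpu: int, busy_threads: int) -> float:
    """Time per 100 ms period during which the container is throttled.

    Assumes all threads run in parallel, non-stop, until the quota is spent.
    """
    period_ms = CFS_PERIOD_US / 1000
    quota_ms = milli_cpu_to_quota_us(limit_milli_cpu) / 1000
    time_to_exhaust_ms = quota_ms / busy_threads
    return max(0.0, period_ms - time_to_exhaust_ms)


if __name__ == "__main__":
    print(milli_cpu_to_shares(500))
    print(milli_cpu_to_quota_us(500))
    # A 2-CPU limit with 8 busy threads burns 200 ms of quota in 25 ms.
    print(throttled_ms_per_period(2000, 8))
```

Under these assumptions, eight busy threads on a 2-CPU limit spend most of each period frozen, which is the kind of stall that shows up as a latency spike while average CPU usage still looks low.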
Current schools of thought
• What about small systems with fractional quotas?
• No possibility of bursts
• k8s scheduling pressure
CPU pinning is the one true way!
Just disable limits!
What is it hiding? 🤔
01 Fine-grained perf metrics
02 Adaptive concurrency: [...] manual thread pool tuning, avoids CPU throttling on the fly
03 Reactive backpressure: rejects work above the system's capacity, avoids unbounded queuing and GC death spirals
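To illustrate the backpressure idea (reject work above capacity instead of queuing it), here is a minimal sketch. It is not the implementation from the talk: `RejectingLimiter` and `handle` are hypothetical names, and the fixed `limit` stands in for whatever value an adaptive controller would supply.

```python
import threading
from typing import Callable


class RejectingLimiter:
    """Caps in-flight work; callers over the cap are rejected, never queued."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # Non-blocking: returning False right away is what prevents
        # an unbounded queue from building up behind a saturated system.
        with self._lock:
            if self._in_flight >= self._limit:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)


def handle(limiter: RejectingLimiter, work: Callable[[], str]) -> str:
    """Runs `work` if there is capacity; otherwise fails fast."""
    if not limiter.try_acquire():
        return "rejected"  # e.g. map this to HTTP 429/503 in a real service
    try:
        return work()
    finally:
        limiter.release()


if __name__ == "__main__":
    limiter = RejectingLimiter(limit=2)
    print(handle(limiter, lambda: "ok"))
```

Failing fast keeps memory bounded: rejected requests never allocate buffers or sit in a queue, which is one way to avoid the GC death spiral mentioned above.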
escalate via exponential steps
Control loop stability
• Introduce metastable state thresholds to stabilize changes
Controlling degradation
• If there's degradation, reduce concurrency
• If threads are reliably scheduled, allow more concurrency
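The control loop described above could look roughly like the following sketch. Everything specific here is an assumption made for illustration: the `degradation` signal (a 0..1 score, for example the share of time threads waited to be scheduled), the `low`/`high` thresholds that form a dead band, and the rule of doubling the step while the loop keeps moving in the same direction. The class name `AdaptiveLimit` is hypothetical.

```python
class AdaptiveLimit:
    """Adjusts a concurrency limit from a degradation signal in [0, 1]."""

    def __init__(self, initial: int = 8, minimum: int = 1, maximum: int = 256,
                 low: float = 0.1, high: float = 0.3) -> None:
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self.low = low    # below this: threads are reliably scheduled
        self.high = high  # above this: the system is degrading
        self._step = 1
        self._last_direction = 0

    def update(self, degradation: float) -> int:
        if degradation > self.high:
            direction = -1  # degradation: reduce concurrency
        elif degradation < self.low:
            direction = 1   # healthy: allow more concurrency
        else:
            direction = 0   # inside the dead band: hold steady

        if direction == 0:
            self._step = 1
        else:
            # Escalate exponentially while moving in the same direction;
            # restart from a step of 1 after a pause or a reversal.
            if direction == self._last_direction:
                self._step *= 2
            else:
                self._step = 1
            proposed = self.limit + direction * self._step
            self.limit = min(self.maximum, max(self.minimum, proposed))

        self._last_direction = direction
        return self.limit


if __name__ == "__main__":
    ctl = AdaptiveLimit()
    for sample in (0.0, 0.0, 0.0, 0.2, 0.5, 0.5):
        print(sample, ctl.update(sample))
```

The dead band between `low` and `high` is what keeps the loop from oscillating on noisy samples, while the doubling step lets it react quickly to a sustained change in either direction.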