A JVM threading model for the containerized times
Flavio Brasil, Principal Engineer
Luiz Hespanha, Principal Engineer
Systems Performance @ Nubank
PIX
Running since the end of 2020, Pix is an instant payment platform created and managed by Brazil's monetary authority, the Central Bank of Brazil (BCB). It enables payments and transfers to execute quickly (10 seconds at most), 24/7.
The crash resolution paradox
● The system normally operates at low CPU usage
● But when there's a load spike, the CPU becomes a bottleneck
● Latencies skyrocket and the system sometimes becomes unresponsive
Resolution: more CPU capacity!?
Cluster-wide instability
● Some crashes escalated to k8s nodes becoming saturated
● A few systems consumed all CPU resources and became noisy neighbors
● Several nodes became saturated and the instability spread to colocated services
Symptoms of a bottleneck
● As described by the Universal Scalability Law (USL), efficiency can drop significantly when a bottleneck is reached
● The median CPU usage (blue line) was dropping over time, while the number of cores (bars) was growing faster than the load
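For reference, the USL is commonly written in the normalized form below. The symbols are the usual textbook ones and are not taken from the slides: N is the number of cores (or concurrent workers), \sigma models contention, and \kappa models coherency (crosstalk) cost.

```latex
C(N) = \frac{N}{1 + \sigma (N - 1) + \kappa N (N - 1)},
\qquad
\text{efficiency}(N) = \frac{C(N)}{N} = \frac{1}{1 + \sigma (N - 1) + \kappa N (N - 1)}
```

With any \sigma > 0 or \kappa > 0, efficiency falls as N grows, which matches the symptom on this slide: cores added faster than the load while per-core usage declines.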
CPU isolation

               Linux config   Nature       Mechanism                                 Prevents node saturation
CPU requests   cpu.shares     Soft limit   Prioritization when all CPUs are busy     No
CPU limits     cpu.quota      Hard limit   Enforced even if node has available CPU   Yes**

** Partially, since it's a quota and not a concurrency limit, but it's generally enough
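The footnote's point that a quota is not a concurrency limit can be illustrated with some arithmetic. The sketch below is a simplified model, not anything from the talk: it assumes every thread is CPU-bound for the whole period and that the node has enough free cores to run them all in parallel. The class and method names are hypothetical.

```java
/** Back-of-the-envelope model of a CPU quota; all names here are hypothetical. */
final class CfsQuotaSketch {

    /**
     * Wall-clock milliseconds per period during which the container can run
     * before its quota is used up, assuming every thread is busy the whole time.
     */
    static double runnableMillis(double quotaMillis, double periodMillis, int busyThreads) {
        // The quota is CPU time shared by all threads, so N busy threads burn it N times faster.
        double untilExhausted = quotaMillis / busyThreads;
        return Math.min(periodMillis, untilExhausted);
    }

    /** Milliseconds per period during which the whole container is throttled. */
    static double throttledMillis(double quotaMillis, double periodMillis, int busyThreads) {
        return periodMillis - runnableMillis(quotaMillis, periodMillis, busyThreads);
    }

    public static void main(String[] args) {
        // A limit of 2 CPUs, expressed as 200 ms of CPU time per 100 ms period.
        double quota = 200;
        double period = 100;
        for (int threads : new int[] {1, 2, 4, 8}) {
            System.out.println(threads + " busy threads: throttled "
                    + throttledMillis(quota, period, threads) + " ms per period");
        }
    }
}
```

Under these assumptions, a container limited to 2 CPUs that runs 8 busy threads spends its quota in the first quarter of each period and is paused for the rest, which is why thread counts matter even when limits are in place.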
Current schools of thought
"Just disable limits!"
● How can we prevent node saturation?
● Environment-dependent performance
"CPU pinning is the one true way!"
● What about small systems with fractional quotas?
● No possibility of bursts
● k8s scheduling pressure
What is it hiding? 🤔
Nauvoo
01 Fine-grained perf metrics: You can't improve what you don't measure!
02 Adaptive concurrency: No more manual thread pool tuning, avoids CPU throttling on the fly
03 Reactive backpressure: Rejects work above the system's capacity, avoids unbounded queuing and GC death spirals
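The idea of rejecting work instead of queuing it without bound can be sketched with the standard `java.util.concurrent` executor. This is only an illustration of the general technique, not Nauvoo's implementation; the class name, method name, and sizes are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Bounded pool + bounded queue: excess work is rejected, never queued forever. */
final class BackpressureSketch {

    /** Submits `tasks` blocking tasks and returns how many were rejected. */
    static int submitAndCountRejections(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),   // bounded queue
                new ThreadPoolExecutor.AbortPolicy());     // throw when saturated
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    // Each task blocks until released, so the pool stays saturated.
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // the caller learns immediately that capacity is exceeded
                }
            }
        } finally {
            release.countDown();
            pool.shutdown();
            try {
                pool.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 2 threads + 3 queue slots can hold 5 tasks; the other 5 are rejected.
        System.out.println("rejected = " + submitAndCountRejections(2, 3, 10));
    }
}
```

Because memory held by waiting tasks is capped at the queue size, a load spike shows up as fast rejections rather than as a growing heap, which is the failure mode the slide associates with GC death spirals.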
Controlling degradation
If there's degradation, reduce concurrency
If threads are reliably scheduled, allow more concurrency
Main challenges
- Reaction time: start with small changes and escalate via exponential steps
- Control loop stability: introduce metastable state thresholds to stabilize changes
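One way to picture such a control loop is sketched below. It is an interpretation under stated assumptions, not the controller described in the talk: the degradation signal is assumed to be a scheduling delay in milliseconds, the "metastable state thresholds" are modeled as a dead band between a low and a high threshold, and all names and numbers are hypothetical.

```java
/** Hypothetical concurrency limit controller with exponential steps and a dead band. */
final class AdaptiveLimitSketch {
    private final int min;
    private final int max;
    private final double lowDelayMs;   // below this, threads are considered reliably scheduled
    private final double highDelayMs;  // above this, the system is considered degraded
    private int limit;
    private int step = 1;
    private int lastDirection = 0;     // -1 = shrinking, 0 = holding, +1 = growing

    AdaptiveLimitSketch(int min, int max, int initial, double lowDelayMs, double highDelayMs) {
        this.min = min;
        this.max = max;
        this.limit = initial;
        this.lowDelayMs = lowDelayMs;
        this.highDelayMs = highDelayMs;
    }

    /** Feeds one measurement into the loop and returns the new concurrency limit. */
    int update(double schedulingDelayMs) {
        int direction = schedulingDelayMs > highDelayMs ? -1
                : schedulingDelayMs < lowDelayMs ? 1
                : 0;
        if (direction == 0) {
            // Inside the dead band: hold the limit and forget any momentum.
            step = 1;
            lastDirection = 0;
            return limit;
        }
        // Start small, then double the step while the signal keeps pointing the same way.
        step = (direction == lastDirection)
                ? Math.max(1, Math.min(step * 2, max - min))
                : 1;
        lastDirection = direction;
        limit = Math.max(min, Math.min(max, limit + direction * step));
        return limit;
    }

    public static void main(String[] args) {
        AdaptiveLimitSketch controller = new AdaptiveLimitSketch(1, 64, 8, 1.0, 5.0);
        double[] delays = {0.5, 0.5, 0.5, 3.0, 10.0, 10.0};
        for (double delay : delays) {
            System.out.println("delay " + delay + " ms -> limit " + controller.update(delay));
        }
    }
}
```

The dead band keeps the limit from oscillating when the signal hovers near a single threshold, and resetting the step on every direction change keeps an overshoot from being amplified.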
Demo: adapting the number of threads to a load with low CPU usage and thread blocking
● Load increasing
● Number of threads increasing to handle the load
● CPU throttling under control
Demo: adapting the number of threads to a CPU-intensive load + rejections
● Service suffering from CPU throttling
● Number of threads decreasing
● Tasks being rejected
Optimizations by several teams! Major rewrites, database migration, tunings, ...
Flavio
Nauvoo
CREDITS: This presentation template was created by Slidesgo, and includes icons by Flaticon, and infographics & images by Freepik Thanks! @fbrasisil @luiz_hespanha