
A JVM Threading Model for the containerized times

Luiz Hespanha
September 21, 2023


Presented at the Strange Loop conference in 2023.



Transcript

  1. A JVM threading model for the containerized times

    Flavio Brasil, Principal Engineer; Luiz Hespanha, Principal Engineer. Systems Performance @ Nubank ...
  2. PIX

    Running since the end of 2020, Pix is an instant payment platform created and managed by the monetary authority of Brazil, the Central Bank of Brazil (BCB), which enables the quick execution (max 10 seconds) of payments and transfers 24/7.
  3. The crash resolution paradox

    The system normally operates at a low CPU usage. But when there's a load spike, the CPU becomes a bottleneck. (Flavio)
  4. The crash resolution paradox

    The system normally operates at a low CPU usage. But when there's a load spike, the CPU becomes a bottleneck. Latencies skyrocket and the system sometimes becomes unresponsive.
  5. The crash resolution paradox

    The system normally operates at a low CPU usage. But when there's a load spike, the CPU becomes a bottleneck. Latencies skyrocket and the system sometimes becomes unresponsive. Resolution: more CPU capacity!?
  6. Cluster-wide instability

    Some crashes escalated to k8s nodes becoming saturated. A few systems consumed all CPU resources and became noisy neighbors.
  7. Cluster-wide instability

    Some crashes escalated to k8s nodes becoming saturated. A few systems consumed all CPU resources and became noisy neighbors. Several nodes became saturated and instability spread to collocated services.
  8. Symptoms of a bottleneck

    As described by the Universal Scalability Law (USL), efficiency can drop significantly when a bottleneck is reached.
  9. Symptoms of a bottleneck

    As described by the Universal Scalability Law (USL), efficiency can drop significantly when a bottleneck is reached. The median CPU usage (blue line) was dropping over time, and the number of cores (bars) was growing at a higher pace than the load.
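The efficiency drop mentioned on this slide can be illustrated with the usual USL formula, C(N) = N / (1 + α(N − 1) + βN(N − 1)). The sketch below is only an illustration: the class name and the α and β values are assumptions chosen for the example, not measurements from the talk.

```java
/** Sketch of the Universal Scalability Law (USL). Names and coefficients are illustrative. */
final class UslSketch {
    /**
     * Relative capacity C(N) = N / (1 + alpha*(N-1) + beta*N*(N-1)).
     * alpha models contention (serialized work), beta models coherency (crosstalk) cost.
     */
    static double relativeCapacity(int n, double alpha, double beta) {
        return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1.0));
    }

    /** Efficiency is the capacity obtained per unit of resource, C(N) / N. */
    static double efficiency(int n, double alpha, double beta) {
        return relativeCapacity(n, alpha, beta) / n;
    }

    public static void main(String[] args) {
        // Hypothetical coefficients, picked only to make the curve bend visibly.
        double alpha = 0.05;
        double beta = 0.001;
        for (int n : new int[] {1, 2, 4, 8, 16, 32, 64}) {
            System.out.printf("N=%d capacity=%.2f efficiency=%.2f%n",
                    n, relativeCapacity(n, alpha, beta), efficiency(n, alpha, beta));
        }
    }
}
```

With a non-zero β, capacity eventually falls as N grows, which matches the symptom described above: cores growing faster than the load they serve.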
  10. CPU isolation

    Columns: Linux config | Nature | Mechanism | Prevents node saturation
    CPU requests: cpu.shares | Soft limit
    CPU limits: cpu.quota | Hard limit
    (Hespanha)
  11. CPU isolation

    Columns: Linux config | Nature | Mechanism | Prevents node saturation
    CPU requests: cpu.shares | Soft limit | Prioritization when all CPUs are busy
    CPU limits: cpu.quota | Hard limit | Enforced even if node has available CPU
  12. CPU isolation

    Columns: Linux config | Nature | Mechanism | Prevents node saturation
    CPU requests: cpu.shares | Soft limit | Prioritization when all CPUs are busy | No
    CPU limits: cpu.quota | Hard limit | Enforced even if node has available CPU | Yes**
    ** Partially, since it's a quota and not a concurrency limit, but it's generally enough.
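The footnote about a quota not being a concurrency limit can be made concrete with some arithmetic. The sketch below assumes the conversion commonly used for cgroup v1 (1024 shares per core, a 100 ms CFS period); the class and method names are hypothetical.

```java
/** Sketch of how CPU requests and limits map to cgroup v1 values. Names are illustrative. */
final class CpuIsolationSketch {
    // Assumed default CFS period: 100 ms, expressed in microseconds.
    static final long CFS_PERIOD_US = 100_000;

    /** CPU request in millicores to cpu.shares (relative weight, 1024 per core, minimum 2). */
    static long sharesFor(long requestMillicores) {
        return Math.max(2, requestMillicores * 1024 / 1000);
    }

    /** CPU limit in millicores to CFS quota, in microseconds of CPU time per period. */
    static long quotaUsFor(long limitMillicores) {
        return limitMillicores * CFS_PERIOD_US / 1000;
    }

    /**
     * Wall-clock microseconds into a period at which the quota is exhausted when
     * runnableThreads threads all run in parallel. Returns the full period if the
     * quota is never exhausted.
     */
    static long usUntilThrottled(long quotaUs, int runnableThreads) {
        return Math.min(CFS_PERIOD_US, quotaUs / runnableThreads);
    }

    public static void main(String[] args) {
        long quota = quotaUsFor(2000); // a limit of 2 CPUs
        System.out.println("shares for 500m request: " + sharesFor(500));
        System.out.println("quota for 2000m limit:   " + quota + " us per period");
        // 8 busy threads burn a 2-CPU quota in a quarter of the period, then stall.
        System.out.println("8 threads run for " + usUntilThrottled(quota, 8)
                + " us, then are throttled for the rest of the period");
    }
}
```

The last line shows why the limit only partially prevents saturation: for the first 25 ms of each period the container can still occupy 8 cores.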
  13. Current schools of thought

    • How can we prevent node saturation?
    • Environment-dependent performance
    "CPU pinning is the one true way!" "Just disable limits!"
  14. Current schools of thought

    • How can we prevent node saturation?
    • Environment-dependent performance
    • What about small systems with fractional quotas?
    • No possibility of bursts
    • k8s scheduling pressure
    "CPU pinning is the one true way!" "Just disable limits!"
  15. Current schools of thought

    • How can we prevent node saturation?
    • Environment-dependent performance
    • What about small systems with fractional quotas?
    • No possibility of bursts
    • k8s scheduling pressure
    "CPU pinning is the one true way!" "Just disable limits!"
    What is it hiding? 🤔
  16. Nauvoo

    01 Fine-grained perf metrics: You can't improve what you don't measure!
    02 Adaptive concurrency: No more manual thread pool tuning, avoids CPU throttling on the fly
  17. Nauvoo

    01 Fine-grained perf metrics: You can't improve what you don't measure!
    02 Adaptive concurrency: No more manual thread pool tuning, avoids CPU throttling on the fly
    03 Reactive backpressure: Rejects work above the system's capacity, avoids unbounded queuing and GC death spirals
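The idea of rejecting work instead of queuing it without bound can be sketched with plain java.util.concurrent. This is not Nauvoo's implementation, which the slides do not show; it is a minimal illustration using a bounded queue and the standard AbortPolicy, with hypothetical class and method names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Sketch of load shedding with a bounded queue. Not the talk's actual code. */
final class BackpressureSketch {
    /** Fixed-size pool whose queue is bounded; work beyond capacity is rejected. */
    static ThreadPoolExecutor boundedPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
    }

    /** Submits `tasks` tasks that block until released and returns how many were rejected. */
    static int countRejections(int threads, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = boundedPool(threads, queueCapacity);
        CountDownLatch release = new CountDownLatch(1);
        Runnable blockingTask = () -> {
            try {
                release.await(); // simulates slow work that keeps the thread busy
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                pool.execute(blockingTask);
            } catch (RejectedExecutionException e) {
                rejected++; // the caller can fail fast or tell the client to back off
            }
        }
        release.countDown(); // let accepted tasks finish
        pool.shutdown();
        return rejected;
    }

    public static void main(String[] args) {
        // 2 threads + 3 queue slots accept 5 tasks; the other 5 are rejected.
        System.out.println("rejected: " + countRejections(2, 3, 10));
    }
}
```

Because the queue is bounded, memory held by pending work stays bounded too, which is the property the slide ties to avoiding GC death spirals.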
  18. Detecting degradation

    v0: check all the things
    Multiple checks:
    • CPU usage
    • Throttled %
    • Memory
    Tries to avoid degradation
  19. Detecting degradation

    v0: check all the things
    Multiple checks:
    • CPU usage
    • Throttled %
    • Memory
    Tries to avoid degradation
    v1: heartbeat mode
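One of the v0 checks, "Throttled %", can be derived from the counters the kernel exposes in the cgroup's cpu.stat file (nr_periods and nr_throttled). The sketch below parses two samples of that file and computes the share of periods that were throttled in between. How Nauvoo actually computes it is not shown in the slides; the class name and the sample values are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a throttling check based on cgroup cpu.stat counters. Names are illustrative. */
final class ThrottleCheckSketch {
    /** Parses "key value" lines such as "nr_periods 100" into a map. */
    static Map<String, Long> parseCpuStat(String contents) {
        Map<String, Long> stats = new HashMap<>();
        for (String line : contents.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                stats.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return stats;
    }

    /** Percentage of CFS periods between two samples in which the cgroup was throttled. */
    static double throttledPercent(Map<String, Long> before, Map<String, Long> after) {
        long periods = after.getOrDefault("nr_periods", 0L) - before.getOrDefault("nr_periods", 0L);
        long throttled = after.getOrDefault("nr_throttled", 0L) - before.getOrDefault("nr_throttled", 0L);
        return periods <= 0 ? 0.0 : 100.0 * throttled / periods;
    }

    public static void main(String[] args) {
        // Made-up samples; in a container these would be two reads of the cgroup's cpu.stat.
        String first = "nr_periods 100\nnr_throttled 10\n";
        String second = "nr_periods 200\nnr_throttled 35\n";
        System.out.println("throttled %: "
                + throttledPercent(parseCpuStat(first), parseCpuStat(second)));
    }
}
```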
  20. Detecting degradation

    v0: check all the things
    Multiple checks:
    • CPU usage
    • Throttled %
    • Memory
    Tries to avoid degradation
    v1: heartbeat mode
    • Inspired by jHiccup
    • while (true) { measure; Thread.sleep(1); }
    • Allows a configurable level of degradation
    • Also detects GC pauses, safepoints, allocation stalls
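The heartbeat loop on this slide can be expanded into a small runnable sketch: sleep for 1 ms, measure how long the sleep really took, and treat the excess as a hiccup. The tolerance check stands in for the "configurable level of degradation"; the class name, method names, and the 5 ms tolerance are assumptions for illustration only.

```java
/** Sketch of a jHiccup-style heartbeat. Names and thresholds are illustrative. */
final class HeartbeatSketch {
    static final long REQUESTED_SLEEP_NANOS = 1_000_000L; // Thread.sleep(1)

    /** Delay beyond the requested sleep, never negative. */
    static long hiccupNanos(long requestedNanos, long elapsedNanos) {
        return Math.max(0, elapsedNanos - requestedNanos);
    }

    /** Runs `beats` heartbeats and returns the worst hiccup observed, in nanoseconds. */
    static long worstHiccupNanos(int beats) {
        long worst = 0;
        for (int i = 0; i < beats; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            // Anything beyond 1 ms means the thread was not scheduled on time:
            // CPU throttling, GC pauses, and safepoints all show up here.
            worst = Math.max(worst, hiccupNanos(REQUESTED_SLEEP_NANOS, System.nanoTime() - start));
        }
        return worst;
    }

    /** True when the worst hiccup exceeds the tolerated level of degradation. */
    static boolean degraded(long worstHiccupNanos, long toleratedNanos) {
        return worstHiccupNanos > toleratedNanos;
    }

    public static void main(String[] args) {
        long worst = worstHiccupNanos(200);
        long tolerated = 5_000_000L; // hypothetical tolerance: 5 ms
        System.out.println("worst hiccup (ns): " + worst + ", degraded: " + degraded(worst, tolerated));
    }
}
```

A single measurement covers many causes at once, which is presumably why the heartbeat replaced the separate v0 checks.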
  21. Controlling degradation

    If there's degradation, reduce concurrency. If threads are reliably scheduled, allow more concurrency.
    Main challenges:
    - Reaction time: start with small changes and escalate via exponential steps
    - Control loop stability: introduce metastable state thresholds to stabilize changes
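The control loop described on the "Controlling degradation" slide can be sketched as follows. The slides do not give the actual algorithm, so this is one possible reading: the step doubles while the signal keeps pointing the same way, and a change is applied only after a number of consecutive confirming observations, which is how this sketch interprets the stabilizing thresholds. All names and parameters are hypothetical.

```java
/** Sketch of an adaptive concurrency limit. An interpretation, not the talk's algorithm. */
final class ConcurrencyControllerSketch {
    private final int min;
    private final int max;
    private final int confirmations; // observations required before acting
    private int limit;
    private int step = 1;            // grows exponentially while the trend persists
    private int streak = 0;          // consecutive observations pointing the same way
    private boolean lastDegraded = false;

    ConcurrencyControllerSketch(int min, int max, int initial, int confirmations) {
        this.min = min;
        this.max = max;
        this.limit = initial;
        this.confirmations = confirmations;
    }

    /** Feeds one degradation signal and returns the concurrency limit to use next. */
    int observe(boolean degraded) {
        if (degraded == lastDegraded) {
            streak++;
        } else {
            // Trend flipped: restart with the smallest step.
            lastDegraded = degraded;
            streak = 1;
            step = 1;
        }
        if (streak < confirmations) {
            return limit; // hold steady until the signal is confirmed
        }
        int delta = degraded ? -step : step; // degradation lowers the limit, health raises it
        limit = Math.max(min, Math.min(max, limit + delta));
        step = Math.min(step * 2, max); // escalate the next change
        return limit;
    }

    public static void main(String[] args) {
        ConcurrencyControllerSketch controller = new ConcurrencyControllerSketch(1, 64, 8, 2);
        boolean[] signals = {false, false, false, false, true, true, true};
        for (boolean degraded : signals) {
            System.out.println("degraded=" + degraded + " limit=" + controller.observe(degraded));
        }
    }
}
```

The degradation signal could come from a heartbeat like the one in the "Detecting degradation" slides; the confirmation count trades reaction time for stability, the two challenges the slide names.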
  24. Demo

    Adapting # of threads to a load with low CPU usage and thread blocking. Load increasing. Number of threads increasing to handle the load. CPU throttling under control.
  25. Demo

    Adapting # of threads to a CPU-intensive load + rejections. Number of threads decreasing. Service suffering with CPU throttling. Tasks being rejected.
  26. Optimizations by several teams!

    Major rewrites, database migration, tunings, ... (Flavio)
    CREDITS: This presentation template was created by Slidesgo, and includes icons by Flaticon, and infographics & images by Freepik.
  27. Optimizations by several teams!

    Major rewrites, database migration, tunings, ... (Nauvoo)
  28. Thanks!

    @fbrasisil @luiz_hespanha