Devopsdays Cape Town: Resource Saturation Monitoring and Capacity Planning on

In this talk, I present the dangers of resource saturation and show how you can use Prometheus to monitor, track, and predict potential resource saturation issues up to two weeks in advance.


Andrew Newdigate

September 05, 2019


Slide 2: Introduction

Andrew Newdigate, Distinguished Engineer, Infrastructure, GitLab

Past gigs:
• Lead, Google Cloud Migration, GitLab
• Lead, Gitaly (Git Application Infrastructure), GitLab
• Founder: (acquired by GitLab, 2017)

Twitter: @suprememoocow
Slide 4: Why do we need Capacity Planning?

Any sufficiently complex system will have bottlenecks. When these bottlenecks reach saturation, systems fail in unexpected ways.
Slide 7: Resource Saturation in Software Systems

A case study of resource saturation: Redis single-core saturation.
Slide 9: Redis Degradation Signals

• Latency alerts from multiple services (web, API, CI runners)
• Saturation across multiple components, leading to queueing, leading to more saturation
• Increased 502 error rates across multiple services
• No recent single application change that obviously caused the problem
• No recent infrastructure changes
• No unusual user activity (abuse)
Slide 10: Redis Cache CPU Saturation

• Redis uses a single-threaded model
• Redis was running on 4-core servers, with 3 of the cores ~idle
• 1 core maxed out at close to 100% CPU
• Redis cache operations were queuing, leading to slowdowns across multiple systems
Slide 11: Potential Fixes for Redis CPU Saturation

Workarounds:
• Faster servers
• Shard the Redis cache
• Application changes: move to multi-tier (L1/L2) caching on several high-traffic endpoints
• Fixed several (oldish) performance regressions

High Mean-Time-to-Recovery (MTTR): none of these potential fixes could be implemented in minutes. They all required planning and relatively long execution times.
Slide 12: Takeaways

• Our job as Infrastructure Engineers is way more fun if we avoid resource saturation
• MTTR is high on resource saturation issues
  ◦ Sometimes there are no quick fixes
• In a complex system, there are many different bottlenecks that you need to look out for:
  ◦ CPU
  ◦ Single CPU core (for single-threaded applications)
  ◦ Memory
  ◦ Database connection pools (pgbouncer) and server connections (postgres)
  ◦ Redis clients
  ◦ File descriptors
  ◦ Disk throughput / IOPS
• Saturation of any one of these could lead to an outage
Slide 14: Capacity Planning, Part 2

Goal: avoid resource saturation altogether. Create an early warning system to help us predict and avoid resource saturation issues before they become a problem.
Slide 17: Saturation Measurement Recording Rules

Set up a recording rule with two fixed dimensions (labels): service_component:saturation:ratio

Fixed dimensions:
• "service": the service reporting the resource, e.g. service="monorail" or service="postgres"
• "component": the component resource we are measuring, e.g. component="memory" or component="cpu"

All series report a ratio between 0 and 1: 0 is 0% (good), 1 is 100% saturated (bad).
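Because every saturation series shares one metric name and these two fixed labels, ad-hoc queries stay uniform across resources. A couple of illustrative queries (the service value is just an example):

```
# Highest-saturated component for each service, right now
max by (service) (service_component:saturation:ratio)

# All saturation components for a single service
service_component:saturation:ratio{service="redis"}
```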
Slide 18: Resource Saturation Being Measured

• Workers: % unicorn worker processes utilized
• Disk: % disk space utilized
• CPU: % compute utilized across all nodes in a service
• Single Node CPU: maximum compute utilization % on a single node in a service
• Single Core CPU: maximum single-core utilization for any core in a service (useful for single-threaded services like Redis, pgbouncer, etc.)
• Database Pools: % database connection pool utilization

Many others, see:
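As an illustration of how one of these might be written, here is a sketch of a disk-space rule in the same style as the examples on the following slides. It assumes node_exporter's filesystem metrics and the same service label used throughout the deck; it is not taken from the slides themselves:

```
# Disk space saturation: worst-case filesystem fill ratio per service
- record: service_component:saturation:ratio
  labels:
    component: 'disk_space'
  expr: >
    max(
      1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
        /
        node_filesystem_size_bytes{fstype!="tmpfs"}
    ) by (service)
```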
Slide 19: Single-Threaded CPU Saturation Example

For single-threaded services, aggregate with max:

- record: service_component:saturation:ratio
  labels:
    component: 'single_core_cpu'
  expr: >
    max(
      1 - rate(node_cpu_seconds_total{
        service=~"redis|pgbouncer",
        mode="idle"
      }[1m])
    ) by (service)

# service_component:saturation:ratio{component="single_core_cpu",service="patroni"} 0.972
# service_component:saturation:ratio{component="single_core_cpu",service="redis"} 0.404
Slide 20: File Descriptor Saturation Example

- record: service_component:saturation:ratio
  labels:
    component: 'open_fds'
  expr: >
    max(
      process_open_fds
      /
      process_max_fds
    ) by (service)

# service_component:saturation:ratio{component="open_fds",service="gitaly"} 0.238
# service_component:saturation:ratio{component="open_fds",service="web"} 0.054
Slide 21: Example Saturation Metrics

service_component:saturation:ratio{component="disk_space",service="gitaly"} 0.84
service_component:saturation:ratio{component="single_core_cpu",service="pgbouncer"} 0.82
service_component:saturation:ratio{component="memory",service="redis-cache"} 0.78
service_component:saturation:ratio{component="single_node_cpu",service="haproxy"} 0.71
service_component:saturation:ratio{component="memory",service="sidekiq"} 0.61
service_component:saturation:ratio{component="single_node_cpu",service="git"} 0.60
service_component:saturation:ratio{component="cgroup_memory",service="gitaly"} 0.59
service_component:saturation:ratio{component="disk_space",service="postgres"} 0.57
service_component:saturation:ratio{component="cpu",service="haproxy"} 0.56
Slide 24: Generalised Alert for All Saturation Metrics

- alert: saturation_out_of_bounds_upper_5m
  expr: |
    service_component:saturation:ratio > 0.9
  for: 5m
  annotations:
    title: |
      The `{{ $labels.service }}` service, `{{ $labels.component }}` component has a saturation exceeding 90%
Slide 30: Worst Case Prediction Calculation

Estimating a worst case with standard deviation:

1. Trend prediction: use linear prediction on our rolling 7-day average to extend our trend forward by 2 weeks
2. Std dev: calculate the standard deviation for each metric over the past week
3. Worst case: trend prediction + 2 * std dev

WorstCase = LinearPrediction(Average(LastWeeksWorthOfData)) + 2 * StdDev(LastWeeksWorthOfData)
Slide 32: Worst-Case Predictions in PromQL

# Average values for each component, over a week
- record: service_component:saturation:ratio:avg_over_time_1w
  expr: >
    avg_over_time(service_component:saturation:ratio[1w])

# Stddev for each resource saturation component, over a week
- record: service_component:saturation:ratio:stddev_over_time_1w
  expr: >
    stddev_over_time(service_component:saturation:ratio[1w])
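The ranking query later in the deck also references a predict_linear_2w series, but its recording rule is not shown on these slides. A sketch of what it might look like, assuming the 2-week horizon is expressed in seconds and the rule name follows the deck's service_component:saturation:ratio naming:

```
# Linear prediction of each 1w-averaged series, extended 2 weeks (14 * 86400s) ahead
- record: service_component:saturation:ratio:predict_linear_2w
  expr: >
    predict_linear(
      service_component:saturation:ratio:avg_over_time_1w[1w],
      14 * 86400
    )
```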
Slide 35: Top 10 Highest Priority Saturation Points

List the top 10 highest priority potential resource saturation issues:

topk(10,
  clamp_min(
    service_component:saturation:ratio:predict_linear_2w,
    0
  )
  + 2 * service_component:saturation:ratio:stddev_over_time_1w
)
Slide 37: Conclusion

• Ranking, not precision: capacity planning signals are used for ranking. This technique produces an ordered list of potential bottlenecks, from most urgent to least, which feeds into our roadmap for prioritization.
• Dogfooding: we're using these techniques on our own infrastructure, and we will be incorporating some of them into GitLab (the product) to help GitLab sysadmins and support engineers predict problems.
• Early days: we're still figuring this out. If you have ideas, please get in touch.
Slide 38: Questions?

Andrew Newdigate | @suprememoocow

Resource Saturation Monitoring and Capacity Planning rules at:

Thank you