Devopsdays Cape Town: Resource Saturation Monitoring and Capacity Planning on GitLab.com

In this talk, I present the dangers of resource saturation and show how you can use Prometheus to monitor, track and predict potential resource saturation issues up to two weeks in advance.

Andrew Newdigate

September 05, 2019

Transcript

  1. Resource Saturation Monitoring and Capacity Planning on GitLab.com

    Andrew Newdigate, GitLab
  2. Introduction: Andrew Newdigate, Distinguished Engineer, Infrastructure, GitLab. Past gigs…

    • Lead, Google Cloud Migration, GitLab.com • Lead, Gitaly (Git Application Infrastructure), GitLab • Founder: Gitter.im (acquired by GitLab, 2017) Twitter: @suprememoocow
  3. Why do we need Capacity Planning? Everything scales infinitely

    when you’re in the Cloud, right?
  4. Why do we need Capacity Planning? Any sufficiently complex

    system will have bottlenecks. When these bottlenecks reach saturation, systems fail in unexpected ways.
  5. Resource Saturation In Real Life: Gridlock Traffic

    Gridlock traffic is resource saturation
  6. Resource Saturation In Real Life: Gridlock Traffic DEADLOCK

  7. Resource Saturation in Software Systems: a case study of resource

    saturation on GitLab.com (Redis single core saturation) https://gitlab.com/gitlab-com/gl-infra/production/issues/928 https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7157
  8. Resource Saturation in Software Systems Web Performance (Apdex Score)

  9. Resource Saturation in Software Systems: Signals • Latency alerts

    from multiple services (web, api, CI runners) • Saturation across multiple components, leading to queueing, leading to more saturation • Increased 502 error rates across multiple services • No recent single application change that obviously caused the problem • No recent infrastructure changes • No unusual user activity (abuse) GitLab.com Redis Degradation
  10. Resource Saturation in Software Systems: Redis Cache CPU Saturation

    • Redis uses a single-threaded model • Redis running on 4-core servers, 3 of the cores nearly idle • 1 core maxed out at close to 100% CPU • Redis cache operations queueing, leading to slowdowns across multiple systems GitLab.com Redis Degradation
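
    For illustration, a per-core view like the following (a sketch; the instance value is hypothetical, not from the talk) makes this pattern easy to spot in Prometheus: a single-threaded hot core shows up as one series near 1.0 while the others stay low.

    # Per-core busy fraction on one host over the last 5 minutes
    # (instance label value is illustrative only)
    1 - rate(node_cpu_seconds_total{instance="redis-cache-01:9100", mode="idle"}[5m])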
  11. Resource Saturation in Software Systems: Workarounds • Faster servers

    • Shard Redis cache • Application changes: move to multi-tier (L1/L2) caching on several high-traffic endpoints • Fixed several (oldish) performance regressions High Mean-Time-to-Recovery (MTTR): none of these potential fixes could be implemented in minutes. They all required planning and relatively long execution times. Potential Fixes for Redis CPU Saturation
  12. • Our job as Infrastructure Engineers is way more fun

    if we avoid resource saturation • MTTR is high on Resource Saturation issues ◦ Sometimes there are no quick fixes • In a complex system, there are many different bottlenecks that you need to look out for: ◦ CPU ◦ Single CPU Core (for single-threaded applications) ◦ Memory ◦ Database Connection Pools (pgbouncer) and Server Connections (postgres) ◦ Redis Clients ◦ File Descriptors ◦ Disk Throughput / IOPS • Saturation of any one of these could lead to an outage Takeaways
  13. Failure is not Linear

  14. Goal: avoid resource saturation altogether Create an early warning system

    to help us predict and avoid resource saturation issues before they become a problem. Capacity Planning Part 2
  15. Measuring Saturation

  16. Saturation = (Current Resource Usage) / (Maximum Available Resource) Measuring

    Saturation
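
    For example, a disk-space saturation ratio can be computed directly from node_exporter's filesystem metrics; this is a minimal sketch (the metric selectors are illustrative, not taken from the talk):

    # Fraction of disk space used per filesystem: 0 = empty, 1 = full
    1 - (
      node_filesystem_avail_bytes{fstype!~"tmpfs|ramfs"}
        /
      node_filesystem_size_bytes{fstype!~"tmpfs|ramfs"}
    )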
  17. Set up a recording rule with two fixed dimensions (labels): service_component:saturation:ratio

    Fixed dimensions • "service": the service reporting the resource, e.g. service="monorail" or service="postgres" • "component": the component resource we are measuring, e.g. component="memory" or component="cpu" All series report a ratio between 0 and 1: 0 is 0% (good), 1 is 100% saturated (bad). Saturation Measurement Recording Rules
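
    A minimal sketch of such a recording rule for memory saturation, assuming node_exporter metrics and targets that already carry a service label (the real rules live in the runbooks repository linked below):

    - record: service_component:saturation:ratio
      labels:
        component: 'memory'
      expr: >
        max(
          1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
        ) by (service)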
  18. Resource saturation being measured

    Workers: % unicorn worker processes utilized
    Disk: % disk space utilized
    CPU: % compute utilized across all nodes in a service
    Single Node CPU: maximum compute utilization % on a single node in a service
    Single Core CPU: maximum single-core utilization for any core in a service (useful for single-threaded services like Redis, pgbouncer, etc.)
    Database Pools: % database connection pool utilization
    Many others, see: https://gitlab.com/gitlab-com/runbooks/blob/master/rules/service_saturation.yml
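
    As one more illustration of how these map to PromQL, a sketch of the single-node CPU case (again assuming targets carry a service label; not copied from the runbooks):

    - record: service_component:saturation:ratio
      labels:
        component: 'single_node_cpu'
      expr: >
        max(
          1 - avg by (service, instance) (
            rate(node_cpu_seconds_total{mode="idle"}[1m])
          )
        ) by (service)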
  19. Example: single-threaded CPU saturation (single-threaded services; aggregate with max)

    - record: service_component:saturation:ratio
      labels:
        component: 'single_core_cpu'
      expr: >
        max(
          1 - rate(node_cpu_seconds_total{ service=~"redis|pgbouncer", mode="idle" }[1m])
        ) by (service)

    # service_component:saturation:ratio{component="single_core_cpu",service="patroni"} 0.972
    # service_component:saturation:ratio{component="single_core_cpu",service="redis"} 0.404
  20. Example: file descriptor saturation

    - record: service_component:saturation:ratio
      labels:
        component: 'open_fds'
      expr: >
        max(
          process_open_fds / process_max_fds
        ) by (service)

    # service_component:saturation:ratio{component="open_fds", service="gitaly"} 0.238
    # service_component:saturation:ratio{component="open_fds", service="web"} 0.054
  21. Example: Saturation Metrics

    service_component:saturation:ratio{component="disk_space",service="gitaly"} 0.84
    service_component:saturation:ratio{component="single_core_cpu",service="pgbouncer"} 0.82
    service_component:saturation:ratio{component="memory",service="redis-cache"} 0.78
    service_component:saturation:ratio{component="single_node_cpu",service="haproxy"} 0.71
    service_component:saturation:ratio{component="memory",service="sidekiq"} 0.61
    service_component:saturation:ratio{component="single_node_cpu",service="git"} 0.60
    service_component:saturation:ratio{component="cgroup_memory",service="gitaly"} 0.59
    service_component:saturation:ratio{component="disk_space",service="postgres"} 0.57
    service_component:saturation:ratio{component="cpu",service="haproxy"} 0.56
  22. Example: Redis CPU Saturation, May - Mid July Everything fine

    Everything on fire
  23. Alerting on immediate saturation issues 90% threshold

  24. Generalised alert for all saturation metrics

    - alert: saturation_out_of_bounds_upper_5m
      expr: |
        service_component:saturation:ratio > 0.9
      for: 5m
      annotations:
        title: |
          The `{{ $labels.service }}` service, `{{ $labels.component }}` component has a saturation exceeding 90%
  25. Capacity Planning and Forecasting

  26. Can we use Linear Prediction?

  27. Linear interpolation doesn’t work well on non-linear data

  28. Linear interpolation doesn’t work well on non-linear data

  29. A hurricane warning, not a weather forecast...

  30. Estimating a worst case with standard deviation: Worst Case Prediction Calculation

    1. Trend Prediction: use linear prediction on our rolling 7-day average to extend the trend forward by 2 weeks 2. Std Dev: calculate the standard deviation of each metric over the past week 3. Worst Case: Trend Prediction + 2 * Std Dev WorstCase = LinearPrediction(Average(LastWeeksWorthOfData)) + 2 * StdDev(LastWeeksWorthOfData)
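
    As a hypothetical worked example (numbers chosen for illustration, not from the talk): if the two-week linear extrapolation of the 7-day rolling average predicts a saturation ratio of 0.70, and the metric's standard deviation over the past week is 0.05, the worst-case estimate is 0.70 + 2 * 0.05 = 0.80, i.e. 80% saturated.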
  31. Linear interpolation doesn’t work well on non-linear data

  32. Worst-Case Predictions in PromQL

    # Average values for each component, over a week
    - record: service_component:saturation:ratio:avg_over_time_1w
      expr: >
        avg_over_time(service_component:saturation:ratio[1w])

    # Stddev for each resource saturation component, over a week
    - record: service_component:saturation:ratio:stddev_over_time_1w
      expr: >
        stddev_over_time(service_component:saturation:ratio[1w])
  33. Worst-Case Predictions in PromQL

    - record: service_component:saturation:ratio:predict_linear_2w
      expr: >
        predict_linear(
          service_component:saturation:ratio:avg_over_time_1w[1w],
          86400 * 14  # 14 days, in seconds
        )
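
    The deck combines the trend and the standard deviation at query time (see the topk query two slides down); an alternative sketch, not shown in the talk, would record the combined worst case as its own series:

    - record: service_component:saturation:ratio:worst_case_2w
      expr: >
        clamp_min(service_component:saturation:ratio:predict_linear_2w, 0)
        + 2 * service_component:saturation:ratio:stddev_over_time_1w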
  34. Retrofitting the model to the data

  35. Top 10 highest priority saturation points

    topk(10,
      clamp_min(service_component:saturation:ratio:predict_linear_2w, 0)
      + 2 * service_component:saturation:ratio:stddev_over_time_1w
    )

    List the top 10 highest priority potential resource saturation issues
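
    To close the loop on the early-warning goal, a possible alert on the same expression (a sketch, not something shown in the deck) would fire when the two-week worst case for any component crosses a threshold:

    - alert: saturation_worst_case_2w_exceeds_threshold
      expr: |
        clamp_min(service_component:saturation:ratio:predict_linear_2w, 0)
          + 2 * service_component:saturation:ratio:stddev_over_time_1w
        > 0.9
      for: 1h
      annotations:
        title: |
          The `{{ $labels.service }}` service, `{{ $labels.component }}` component
          is predicted to exceed 90% saturation within two weeks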
  36. Points

  37. Conclusion: Capacity planning signals are used for ranking, not precision:

    this technique provides an ordered list of potential bottlenecks, from most urgent to least, which is fed into our roadmap for prioritization. Dogfooding: we’re using these techniques on GitLab.com, but we will be incorporating some of them into GitLab (the product) to help GitLab sysadmins and support engineers predict problems. Early days: we’re still figuring this out. If you have ideas, please get in touch.
  38. Questions? Andrew Newdigate | @suprememoocow GitLab.com Resource Saturation Monitoring and

    Capacity Planning rules at: https://gitlab.com/gitlab-com/runbooks Thank you