Devopsdays Cape Town: Resource Saturation Monitoring and Capacity Planning on GitLab.com

In this talk, I present the dangers of resource saturation, and how you can utilise Prometheus functionality to monitor, track and predict potential resource saturation issues up to two weeks in advance.

Andrew Newdigate

September 05, 2019

Transcript

  1. Resource Saturation Monitoring and Capacity Planning on GitLab.com
     Andrew Newdigate, GitLab
  2. Introduction
     Andrew Newdigate, Distinguished Engineer, Infrastructure, GitLab
     Past gigs:
     • Lead, Google Cloud Migration, GitLab.com
     • Lead, Gitaly (Git Application Infrastructure), GitLab
     • Founder: Gitter.im (acquired by GitLab, 2017)
     Twitter: @suprememoocow
  3. Why do we need Capacity Planning?
     Everything scales infinitely when you’re in the Cloud, right?
  4. Why do we need Capacity Planning?
     Any sufficiently complex system will have bottlenecks. When these bottlenecks reach saturation, systems fail in unexpected ways.
  5. Resource Saturation in Real Life: Gridlock Traffic
     Gridlock traffic is resource saturation.
  6. Resource Saturation in Real Life: Gridlock Traffic
     DEADLOCK
  7. Resource Saturation in Software Systems
     Case study: GitLab.com Redis single core saturation
     https://gitlab.com/gitlab-com/gl-infra/production/issues/928
     https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7157
  8. Resource Saturation in Software Systems
     Web Performance (Apdex Score)
  9. Resource Saturation in Software Systems: GitLab.com Redis Degradation
     Signals:
     • Latency alerts from multiple services (web, api, CI runners)
     • Saturation across multiple components, leading to queueing, leading to more saturation
     • Increased 502/503 error rates across multiple services
     • No recent single application change that obviously caused the problem
     • No recent infrastructure changes
     • No unusual user activity (abuse)
  10. Resource Saturation in Software Systems: GitLab.com Redis Degradation
     Redis Cache CPU Saturation:
     • Redis uses a single-threaded model
     • Redis running on 4-core servers; 3 of the cores were ~idle
     • 1 core maxed out at close to 100% CPU
     • Redis cache operations queueing, leading to slowdown across multiple systems
  11. Resource Saturation in Software Systems: Potential Fixes for Redis CPU Saturation
     Workarounds:
     • Faster servers
     • Shard the Redis cache
     • Application changes: move to multi-tier (L1/L2) caching on several high-traffic endpoints
     • Fixed several (oldish) performance regressions
     High Mean-Time-to-Recovery (MTTR): none of these potential fixes could be implemented in minutes. They all required planning and relatively long execution times.
  12. Takeaways
     • Our job as Infrastructure Engineers is way more fun if we avoid resource saturation
     • MTTR is high on resource saturation issues
       ◦ Sometimes there are no quick fixes
     • In a complex system, there are many different bottlenecks that you need to look out for:
       ◦ CPU
       ◦ Single CPU core (for single-threaded applications)
       ◦ Memory
       ◦ Database connection pools (pgbouncer) and server connections (postgres)
       ◦ Redis clients
       ◦ File descriptors
       ◦ Disk throughput / IOPS
     • Saturation of any one of these could lead to an outage
  13. Failure is not Linear

  14. Capacity Planning, Part 2
     Goal: avoid resource saturation altogether. Create an early warning system to help us predict and avoid resource saturation issues before they become a problem.
  15. Measuring Saturation

  16. Measuring Saturation
     Saturation = (Current Resource Usage) / (Maximum Available Resource)
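     As a concrete instance of this ratio (an illustrative query, not one from the deck), the memory saturation of a node can be computed from standard node_exporter metrics:

         # Fraction of physical memory in use on each node: 0 = idle, 1 = fully saturated
         1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)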
  17. Saturation Measurement Recording Rules
     Set up a recording rule with two fixed dimensions (labels): service_component:saturation:ratio
     Fixed dimensions:
     • "service": the service reporting the resource, e.g. service="monorail" or service="postgres"
     • "component": the component resource we are measuring, e.g. component="memory" or component="cpu"
     All series report a ratio between 0 and 1: 0 = 0% (good), 1 = 100% saturated (bad).
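     A sketch following this convention (not copied from the GitLab runbooks), using standard node_exporter metrics and assuming a "service" label has already been attached via relabelling:

         - record: service_component:saturation:ratio
           labels:
             component: 'disk_space'
           expr: >
             max(
               1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
                   / node_filesystem_size_bytes{fstype!="tmpfs"}
             ) by (service)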
  18. Resource saturation being measured
     • Workers: % unicorn worker processes utilized
     • Disk: % disk space utilized
     • CPU: % compute utilized across all nodes in a service
     • Single Node CPU: maximum compute utilization % on a single node in a service
     • Single Core CPU: maximum single core utilization for any core in a service (useful for single-threaded services like Redis, pgbouncer, etc.)
     • Database Pools: % database connection pool utilization
     Many others, see: https://gitlab.com/gitlab-com/runbooks/blob/master/rules/service_saturation.yml
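     For instance, the "Single Node CPU" variant might be recorded roughly as follows (a sketch under the same relabelling assumption, not taken from the linked runbooks file): average the per-core utilisation within each node, then take the busiest node per service.

         - record: service_component:saturation:ratio
           labels:
             component: 'single_node_cpu'
           expr: >
             max(
               avg(
                 1 - rate(node_cpu_seconds_total{mode="idle"}[1m])
               ) by (service, instance)
             ) by (service)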
  19. Example: single core CPU saturation (single-threaded services)
     Aggregate with max:

         - record: service_component:saturation:ratio
           labels:
             component: 'single_core_cpu'
           expr: >
             max(
               1 - rate(node_cpu_seconds_total{
                 service=~"redis|pgbouncer",
                 mode="idle"
               }[1m])
             ) by (service)

     # service_component:saturation:ratio{component="single_core_cpu",service="patroni"} 0.972
     # service_component:saturation:ratio{component="single_core_cpu",service="redis"} 0.404
  20. Example: file descriptor saturation

         - record: service_component:saturation:ratio
           labels:
             component: 'open_fds'
           expr: >
             max(
               process_open_fds / process_max_fds
             ) by (service)

     # service_component:saturation:ratio{component="open_fds",service="gitaly"} 0.238
     # service_component:saturation:ratio{component="open_fds",service="web"} 0.054
  21. Example: Saturation Metrics
     service_component:saturation:ratio{component="disk_space",service="gitaly"} 0.84
     service_component:saturation:ratio{component="single_core_cpu",service="pgbouncer"} 0.82
     service_component:saturation:ratio{component="memory",service="redis-cache"} 0.78
     service_component:saturation:ratio{component="single_node_cpu",service="haproxy"} 0.71
     service_component:saturation:ratio{component="memory",service="sidekiq"} 0.61
     service_component:saturation:ratio{component="single_node_cpu",service="git"} 0.60
     service_component:saturation:ratio{component="cgroup_memory",service="gitaly"} 0.59
     service_component:saturation:ratio{component="disk_space",service="postgres"} 0.57
     service_component:saturation:ratio{component="cpu",service="haproxy"} 0.56
  22. Example: Redis CPU Saturation, May - Mid July
     Chart annotations: "Everything fine" → "Everything on fire"
  23. Alerting on immediate saturation issues (90% threshold)

  24. Generalised alert for all saturation metrics

         - alert: saturation_out_of_bounds_upper_5m
           expr: |
             service_component:saturation:ratio > 0.9
           for: 5m
           annotations:
             title: |
               The `{{ $labels.service }}` service, `{{ $labels.component }}` component has a saturation exceeding 90%
  25. Capacity Planning and Forecasting

  26. Can we use Linear Prediction?

  27. Linear interpolation doesn’t work well on non-linear data

  28. Linear interpolation doesn’t work well on non-linear data

  29. A hurricane warning, not a weather forecast...

  30. Worst Case Prediction Calculation: estimating a worst case with standard deviation
     1. Trend Prediction: use linear prediction on our rolling 7-day average to extend our trend forward by 2 weeks
     2. Std Dev: calculate the standard deviation for each metric for the past week
     3. Worst Case: Trend Prediction + 2 * Std Dev
     WorstCase = LinearPrediction(Average(LastWeeksWorthOfData)) + 2 * StdDev(LastWeeksWorthOfData)
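     To make the arithmetic concrete (hypothetical numbers, not figures from the deck): if the linear trend on the rolling 7-day average predicts disk_space saturation of 0.85 in two weeks, and the past week's standard deviation for that series is 0.04, the worst-case estimate is 0.85 + 2 * 0.04 = 0.93, i.e. 93% saturated.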
  31. Linear interpolation doesn’t work well on non-linear data
  32. Worst-Case Predictions in PromQL

         # Average values for each component, over a week
         - record: service_component:saturation:ratio:avg_over_time_1w
           expr: >
             avg_over_time(service_component:saturation:ratio[1w])

         # Stddev for each resource saturation component, over a week
         - record: service_component:saturation:ratio:stddev_over_time_1w
           expr: >
             stddev_over_time(service_component:saturation:ratio[1w])
  33. Worst-Case Predictions in PromQL

         - record: service_component:saturation:ratio:predict_linear_2w
           expr: >
             predict_linear(
               service_component:saturation:ratio:avg_over_time_1w[1w],
               86400 * 14  # 14 days, in seconds
             )
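     These two series could also be combined into a single recorded worst-case value (a sketch, not a rule shown in the deck), so that dashboards and the ranking query on the next slide can reference one series directly:

         - record: service_component:saturation:ratio:worst_case_2w
           expr: >
             clamp_min(service_component:saturation:ratio:predict_linear_2w, 0)
             + 2 * service_component:saturation:ratio:stddev_over_time_1w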
  34. Retrofitting the model to the data

  35. Top 10 highest priority saturation points
     List the top 10 highest-priority potential resource saturation issues:

         topk(10,
           clamp_min(
             service_component:saturation:ratio:predict_linear_2w,
             0
           )
           + 2 * service_component:saturation:ratio:stddev_over_time_1w
         )
  36. Points

  37. Conclusion
     Capacity planning signals are used for ranking, not precision: this technique provides an ordered list of potential bottlenecks, from most urgent to least urgent, which is fed into our roadmap for prioritization.
     Dogfooding: we’re using these techniques on GitLab.com, but we will be incorporating some of them into GitLab (the product) to help GitLab sysadmins and support engineers predict problems.
     Early days: we’re still figuring this out. If you have ideas, please get in touch.
  38. Questions?
     Andrew Newdigate | @suprememoocow
     GitLab.com Resource Saturation Monitoring and Capacity Planning rules at: https://gitlab.com/gitlab-com/runbooks
     Thank you