Devopsdays Cape Town: Resource Saturation Monitoring and Capacity Planning on GitLab.com

In this talk, I present the dangers of resource saturation and show how you can use Prometheus to monitor, track and predict potential resource saturation issues up to two weeks in advance.

Andrew Newdigate

September 05, 2019

Transcript

  1. Resource Saturation Monitoring and Capacity Planning on GitLab.com

    Andrew Newdigate, GitLab
  2. Introduction: Andrew Newdigate, Distinguished Engineer, Infrastructure, GitLab. Past gigs…

    • Lead, Google Cloud Migration, GitLab.com • Lead, Gitaly (Git Application Infrastructure), GitLab • Founder: Gitter.im (acquired by GitLab, 2017) Twitter: @suprememoocow
  3. Why do we need Capacity Planning? Everything scales infinitely

    when you’re in the Cloud, right?
  4. Why do we need Capacity Planning? Any sufficiently complex

    system will have bottlenecks. When these bottlenecks reach saturation, systems fail in unexpected ways.
  5. Resource Saturation In Real Life: Gridlock Traffic

    Gridlock traffic is resource saturation
  6. Resource Saturation In Real Life: Gridlock Traffic DEADLOCK

  7. Resource Saturation in Software Systems: a case study of resource

    saturation on GitLab.com (Redis single core saturation) https://gitlab.com/gitlab-com/gl-infra/production/issues/928 https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7157
  8. Resource Saturation in Software Systems Web Performance (Apdex Score)

  9. Resource Saturation in Software Systems: Signals • Latency alerts

    from multiple services (web, api, CI runners) • Saturation across multiple components, leading to queueing, leading to more saturation • Increased 502 error rates across multiple services • No recent single application change that obviously caused the problem • No recent infrastructure changes • No unusual user activity (abuse) GitLab.com Redis Degradation
  10. Resource Saturation in Software Systems: Redis Cache CPU Saturation

    • Redis uses a single-threaded model • Redis running on 4-core servers, 3 of the cores nearly idle • 1 core maxed out at close to 100% CPU • Redis cache operations queueing, leading to slowdowns across multiple systems GitLab.com Redis Degradation
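
    For illustration, a per-core view like the following (a sketch; the instance value is hypothetical, not from the talk) makes this pattern easy to spot in Prometheus: a single-threaded hot core shows up as one series near 1.0 while the others stay low.

    # Per-core busy fraction on one host over the last 5 minutes
    # (instance label value is illustrative only)
    1 - rate(node_cpu_seconds_total{instance="redis-cache-01:9100", mode="idle"}[5m])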
  11. Resource Saturation in Software Systems: Workarounds • Faster servers

    • Shard Redis cache • Application changes: move to multi-tier (L1/L2) caching on several high-traffic endpoints • Fixed several (oldish) performance regressions High Mean-Time-to-Recovery (MTTR): none of these potential fixes could be implemented in minutes. They all required planning and relatively long execution times. Potential Fixes for Redis CPU Saturation
  12. • Our job as Infrastructure Engineers is way more fun

    if we avoid resource saturation • MTTR is high on Resource Saturation issues ◦ Sometimes there are no quick fixes • In a complex system, there are many different bottlenecks that you need to look out for: ◦ CPU ◦ Single CPU Core (for single-threaded applications) ◦ Memory ◦ Database Connection Pools (pgbouncer) and Server Connections (postgres) ◦ Redis Clients ◦ File Descriptors ◦ Disk Throughput / IOPS • Saturation of any one of these could lead to an outage Takeaways
  13. Failure is not Linear

  14. Goal: avoid resource saturation altogether Create an early warning system

    to help us predict and avoid resource saturation issues before they become a problem. Capacity Planning Part 2
  15. Measuring Saturation

  16. Saturation = (Current Resource Usage) / (Maximum Available Resource) Measuring

    Saturation
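
    For example, a disk-space saturation ratio can be computed directly from node_exporter's filesystem metrics; this is a minimal sketch (the metric selectors are illustrative, not taken from the talk):

    # Fraction of disk space used per filesystem: 0 = empty, 1 = full
    1 - (
      node_filesystem_avail_bytes{fstype!~"tmpfs|ramfs"}
        /
      node_filesystem_size_bytes{fstype!~"tmpfs|ramfs"}
    )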
  17. Set up a recording rule with two fixed dimensions (labels): service_component:saturation:ratio

    Fixed dimensions • "service": the service reporting the resource, e.g. service="monorail" or service="postgres" • "component": the component resource we are measuring, e.g. component="memory" or component="cpu" All series report a ratio between 0 and 1: 0 is 0% (good), 1 is 100% saturated (bad). Saturation Measurement Recording Rules
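
    A minimal sketch of such a recording rule for memory saturation, assuming node_exporter metrics and targets that already carry a service label (the real rules live in the runbooks repository linked below):

    - record: service_component:saturation:ratio
      labels:
        component: 'memory'
      expr: >
        max(
          1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
        ) by (service)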
  18. Resource saturation being measured

    Workers: % unicorn worker processes utilized
    Disk: % disk space utilized
    CPU: % compute utilized across all nodes in a service
    Single Node CPU: maximum compute utilization % on a single node in a service
    Single Core CPU: maximum single-core utilization for any core in a service (useful for single-threaded services like Redis, pgbouncer, etc.)
    Database Pools: % database connection pool utilization
    Many others, see: https://gitlab.com/gitlab-com/runbooks/blob/master/rules/service_saturation.yml
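
    As one more illustration of how these map to PromQL, a sketch of the single-node CPU case (again assuming targets carry a service label; not copied from the runbooks):

    - record: service_component:saturation:ratio
      labels:
        component: 'single_node_cpu'
      expr: >
        max(
          1 - avg by (service, instance) (
            rate(node_cpu_seconds_total{mode="idle"}[1m])
          )
        ) by (service)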
  19. Example: single-threaded CPU saturation (single-threaded services; aggregate with max)

    - record: service_component:saturation:ratio
      labels:
        component: 'single_core_cpu'
      expr: >
        max(
          1 - rate(node_cpu_seconds_total{ service=~"redis|pgbouncer", mode="idle" }[1m])
        ) by (service)

    # service_component:saturation:ratio{component="single_core_cpu",service="patroni"} 0.972
    # service_component:saturation:ratio{component="single_core_cpu",service="redis"} 0.404
  20. Example: file descriptor saturation

    - record: service_component:saturation:ratio
      labels:
        component: 'open_fds'
      expr: >
        max(
          process_open_fds / process_max_fds
        ) by (service)

    # service_component:saturation:ratio{component="open_fds", service="gitaly"} 0.238
    # service_component:saturation:ratio{component="open_fds", service="web"} 0.054
  21. Example: Saturation Metrics

    service_component:saturation:ratio{component="disk_space",service="gitaly"} 0.84
    service_component:saturation:ratio{component="single_core_cpu",service="pgbouncer"} 0.82
    service_component:saturation:ratio{component="memory",service="redis-cache"} 0.78
    service_component:saturation:ratio{component="single_node_cpu",service="haproxy"} 0.71
    service_component:saturation:ratio{component="memory",service="sidekiq"} 0.61
    service_component:saturation:ratio{component="single_node_cpu",service="git"} 0.60
    service_component:saturation:ratio{component="cgroup_memory",service="gitaly"} 0.59
    service_component:saturation:ratio{component="disk_space",service="postgres"} 0.57
    service_component:saturation:ratio{component="cpu",service="haproxy"} 0.56
  22. Example: Redis CPU Saturation, May - Mid July Everything fine

    Everything on fire
  23. Alerting on immediate saturation issues 90% threshold

  24. Generalised alert for all saturation metrics

    - alert: saturation_out_of_bounds_upper_5m
      expr: |
        service_component:saturation:ratio > 0.9
      for: 5m
      annotations:
        title: |
          The `{{ $labels.service }}` service, `{{ $labels.component }}` component has a saturation exceeding 90%
  25. Capacity Planning and Forecasting

  26. Can we use Linear Prediction?

  27. Linear interpolation doesn’t work well on non-linear data

  28. Linear interpolation doesn’t work well on non-linear data

  29. A hurricane warning, not a weather forecast...

  30. Estimating a worst case with standard deviation: Worst Case Prediction Calculation

    1. Trend Prediction: use linear prediction on our rolling 7-day average to extend the trend forward by 2 weeks 2. Std Dev: calculate the standard deviation of each metric over the past week 3. Worst Case: Trend Prediction + 2 * Std Dev WorstCase = LinearPrediction(Average(LastWeeksWorthOfData)) + 2 * StdDev(LastWeeksWorthOfData)
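
    As a hypothetical worked example (numbers chosen for illustration, not from the talk): if the two-week linear extrapolation of the 7-day rolling average predicts a saturation ratio of 0.70, and the metric's standard deviation over the past week is 0.05, the worst-case estimate is 0.70 + 2 * 0.05 = 0.80, i.e. 80% saturated.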
  31. Linear interpolation doesn’t work well on non-linear data

  32. Worst-Case Predictions in PromQL

    # Average values for each component, over a week
    - record: service_component:saturation:ratio:avg_over_time_1w
      expr: >
        avg_over_time(service_component:saturation:ratio[1w])

    # Stddev for each resource saturation component, over a week
    - record: service_component:saturation:ratio:stddev_over_time_1w
      expr: >
        stddev_over_time(service_component:saturation:ratio[1w])
  33. Worst-Case Predictions in PromQL

    - record: service_component:saturation:ratio:predict_linear_2w
      expr: >
        predict_linear(
          service_component:saturation:ratio:avg_over_time_1w[1w],
          86400 * 14  # 14 days, in seconds
        )
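
    The deck combines the trend and the standard deviation at query time (see the topk query two slides down); an alternative sketch, not shown in the talk, would record the combined worst case as its own series:

    - record: service_component:saturation:ratio:worst_case_2w
      expr: >
        clamp_min(service_component:saturation:ratio:predict_linear_2w, 0)
        + 2 * service_component:saturation:ratio:stddev_over_time_1w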
  34. Retrofitting the model to the data

  35. Top 10 highest priority saturation points

    topk(10,
      clamp_min(service_component:saturation:ratio:predict_linear_2w, 0)
      + 2 * service_component:saturation:ratio:stddev_over_time_1w
    )

    List the top 10 highest priority potential resource saturation issues
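
    To close the loop on the early-warning goal, a possible alert on the same expression (a sketch, not something shown in the deck) would fire when the two-week worst case for any component crosses a threshold:

    - alert: saturation_worst_case_2w_exceeds_threshold
      expr: |
        clamp_min(service_component:saturation:ratio:predict_linear_2w, 0)
          + 2 * service_component:saturation:ratio:stddev_over_time_1w
        > 0.9
      for: 1h
      annotations:
        title: |
          The `{{ $labels.service }}` service, `{{ $labels.component }}` component
          is predicted to exceed 90% saturation within two weeks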
  36. Points

  37. Conclusion: Capacity planning signals are used for ranking, not precision:

    this technique provides an ordered list of potential bottlenecks, from most urgent to least, which is fed into our roadmap for prioritization. Dogfooding: we’re using these techniques on GitLab.com, but we will be incorporating some of them into GitLab (the product) to help GitLab sysadmins and support engineers predict problems. Early days: we’re still figuring this out. If you have ideas, please get in touch.
  38. Questions? Andrew Newdigate | @suprememoocow GitLab.com Resource Saturation Monitoring and

    Capacity Planning rules at: https://gitlab.com/gitlab-com/runbooks Thank you