Slide 1

Resource Saturation Monitoring and Capacity Planning on GitLab.com
Andrew Newdigate, GitLab

Slide 2

Introduction
Andrew Newdigate, Distinguished Engineer, Infrastructure, GitLab
Past gigs:
● Lead, Google Cloud Migration, GitLab.com
● Lead, Gitaly (Git Application Infrastructure), GitLab
● Founder, Gitter.im (acquired by GitLab, 2017)
Twitter: @suprememoocow

Slide 3

Why do we need Capacity Planning?
Everything scales infinitely when you're in the Cloud, right?

Slide 4

Why do we need Capacity Planning?
Any sufficiently complex system will have bottlenecks. When these bottlenecks reach saturation, systems fail in unexpected ways.

Slide 5

Resource Saturation in Real Life: Gridlock Traffic
Gridlock traffic is resource saturation.

Slide 6

Resource Saturation in Real Life: Gridlock Traffic
DEADLOCK

Slide 7

Resource Saturation in Software Systems
Case Study: GitLab.com Redis Single-Core Saturation
https://gitlab.com/gitlab-com/gl-infra/production/issues/928
https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7157

Slide 8

Resource Saturation in Software Systems
Web Performance (Apdex Score)

Slide 9

Resource Saturation in Software Systems: GitLab.com Redis Degradation
Signals
● Latency alerts from multiple services (web, API, CI runners)
● Saturation across multiple components, leading to queueing, leading to more saturation
● Increased 502/503 error rates across multiple services
● No recent single application change which obviously caused the problem
● No recent infrastructure changes
● No unusual user activity (abuse)

Slide 10

Resource Saturation in Software Systems: GitLab.com Redis Degradation
Redis Cache CPU Saturation
● Redis uses a single-threaded model
● Redis was running on 4-core servers, with three of the cores nearly idle
● One core maxed out at close to 100% CPU
● Redis cache operations queueing, leading to slowdowns across multiple systems

Slide 11

Resource Saturation in Software Systems
Workarounds: Potential Fixes for Redis CPU Saturation
● Faster servers
● Shard the Redis cache
● Application changes: move to multi-tier (L1/L2) caching on several high-traffic endpoints
● Fix several old performance regressions
High Mean-Time-to-Recovery (MTTR): none of these potential fixes could be implemented in minutes. They all required planning and relatively long execution times.

Slide 12

Takeaways
● Our job as Infrastructure Engineers is way more fun if we avoid resource saturation
● MTTR is high on resource saturation issues
  ○ Sometimes there are no quick fixes
● In a complex system, there are many different bottlenecks that you need to look out for:
  ○ CPU
  ○ Single CPU core (for single-threaded applications)
  ○ Memory
  ○ Database connection pools (pgbouncer) and server connections (postgres)
  ○ Redis clients
  ○ File descriptors
  ○ Disk throughput / IOPS
● Saturation of any one of these could lead to an outage

Slide 13

Failure is not Linear

Slide 14

Part 2: Capacity Planning
Goal: avoid resource saturation altogether.
Create an early warning system to help us predict and avoid resource saturation issues before they become a problem.

Slide 15

Measuring Saturation

Slide 16

Measuring Saturation
Saturation = (Current Resource Usage) / (Maximum Available Resource)
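The ratio above can be sketched in plain Python. This is an illustrative helper, not part of the GitLab rules; the function name and the clamping behaviour are our own additions:

```python
def saturation_ratio(current_usage: float, maximum_available: float) -> float:
    """Resource saturation as a ratio in [0, 1].

    0.0 means completely idle (good); 1.0 means fully saturated (bad).
    The result is clamped so transient over-reporting never exceeds 1.0.
    """
    if maximum_available <= 0:
        raise ValueError("maximum_available must be positive")
    return min(max(current_usage / maximum_available, 0.0), 1.0)

# e.g. a node using 3 GiB of a 4 GiB memory limit:
print(saturation_ratio(3.0, 4.0))  # 0.75
```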

Slide 17

Saturation Measurement Recording Rules
Set up a recording rule, service_component:saturation:ratio, with two fixed dimensions (labels):
● "service": the service reporting the resource, e.g. service="monorail" or service="postgres"
● "component": the component resource we are measuring, e.g. component="memory" or component="cpu"
All series report a ratio between 0 and 1: 0 is 0% saturated (good), 1 is 100% saturated (bad).

Slide 18

Resource saturation being measured
● Workers: % of unicorn worker processes utilized
● Disk: % of disk space utilized
● CPU: % of compute utilized across all nodes in a service
● Single Node CPU: maximum compute utilization % on a single node in a service
● Single Core CPU: maximum single-core utilization for any core in a service (useful for single-threaded services like Redis, pgbouncer, etc.)
● Database Pools: % of database connection pool utilization
Many others; see: https://gitlab.com/gitlab-com/runbooks/blob/master/rules/service_saturation.yml

Slide 19

Example: Single-Threaded (Single-Core) CPU Saturation
For single-threaded services, aggregate with max:

  - record: service_component:saturation:ratio
    labels:
      component: 'single_core_cpu'
    expr: >
      max(
        1 - rate(node_cpu_seconds_total{
          service=~"redis|pgbouncer",
          mode="idle"
        }[1m])
      ) by (service)

Sample output:
  # service_component:saturation:ratio{component="single_core_cpu",service="patroni"} 0.972
  # service_component:saturation:ratio{component="single_core_cpu",service="redis"} 0.404

Slide 20

Example: File Descriptor Saturation

  - record: service_component:saturation:ratio
    labels:
      component: 'open_fds'
    expr: >
      max(
        process_open_fds / process_max_fds
      ) by (service)

Sample output:
  # service_component:saturation:ratio{component="open_fds",service="gitaly"} 0.238
  # service_component:saturation:ratio{component="open_fds",service="web"} 0.054

Slide 21

Example: Saturation Metrics

  service_component:saturation:ratio{component="disk_space",service="gitaly"} 0.84
  service_component:saturation:ratio{component="single_core_cpu",service="pgbouncer"} 0.82
  service_component:saturation:ratio{component="memory",service="redis-cache"} 0.78
  service_component:saturation:ratio{component="single_node_cpu",service="haproxy"} 0.71
  service_component:saturation:ratio{component="memory",service="sidekiq"} 0.61
  service_component:saturation:ratio{component="single_node_cpu",service="git"} 0.60
  service_component:saturation:ratio{component="cgroup_memory",service="gitaly"} 0.59
  service_component:saturation:ratio{component="disk_space",service="postgres"} 0.57
  service_component:saturation:ratio{component="cpu",service="haproxy"} 0.56

Slide 22

Example: Redis CPU Saturation, May to mid-July
(Chart annotations: "Everything fine" → "Everything on fire")

Slide 23

Alerting on immediate saturation issues (90% threshold)

Slide 24

Generalised alert for all saturation metrics

  - alert: saturation_out_of_bounds_upper_5m
    expr: |
      service_component:saturation:ratio > 0.9
    for: 5m
    annotations:
      title: |
        The `{{ $labels.service }}` service, `{{ $labels.component }}` component
        has a saturation exceeding 90%
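The `for: 5m` clause means the threshold must be breached continuously before the alert fires. A minimal Python sketch of that debounce idea (the function name and the one-sample-per-minute assumption are ours, not Prometheus internals):

```python
def sustained_breach(samples, threshold=0.9, required_consecutive=5):
    """True only if the most recent `required_consecutive` samples all
    exceed `threshold` -- a rough analogue of `expr ... for: 5m`,
    assuming one saturation sample per minute."""
    if len(samples) < required_consecutive:
        return False
    return all(s > threshold for s in samples[-required_consecutive:])

print(sustained_breach([0.95, 0.96, 0.97, 0.95, 0.94]))  # True: 5 minutes above 90%
print(sustained_breach([0.95, 0.96, 0.85, 0.95, 0.94]))  # False: dipped below threshold
```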

Slide 25

Capacity Planning and Forecasting

Slide 26

Can we use Linear Prediction?

Slide 27

Linear interpolation doesn’t work well on non-linear data

Slide 28

Linear interpolation doesn’t work well on non-linear data

Slide 29

A hurricane warning, not a weather forecast...

Slide 30

Estimating a worst case with standard deviation
Worst-Case Prediction Calculation
1. Trend prediction: use linear prediction on our rolling 7-day average to extend the trend forward by 2 weeks
2. Std dev: calculate the standard deviation of each metric over the past week
3. Worst case: trend prediction + 2 × std dev

WorstCase = LinearPrediction(Average(LastWeeksWorthOfData)) + 2 * StdDev(LastWeeksWorthOfData)
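The three steps above can be sketched in Python with a least-squares fit, which is the idea behind PromQL's `predict_linear`. Function names and sample data are illustrative; for simplicity this sketch takes the standard deviation of the same smoothed series, whereas the slides compute it on the raw metric:

```python
from statistics import mean, stdev

def linear_predict(series, steps_ahead):
    """Least-squares linear fit of `series`, evaluated `steps_ahead`
    samples past its end (a rough analogue of PromQL's predict_linear)."""
    n = len(series)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(series)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return intercept + slope * (n - 1 + steps_ahead)

def worst_case(series, steps_ahead):
    """Trend prediction plus two standard deviations of the series."""
    return linear_predict(series, steps_ahead) + 2 * stdev(series)

# Smoothed daily saturation averages for one week, trending upward:
week = [0.50, 0.52, 0.55, 0.56, 0.60, 0.61, 0.64]
print(round(worst_case(week, steps_ahead=14), 3))  # worst case two weeks out
```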

Slide 31

Linear interpolation doesn’t work well on non-linear data

Slide 32

Worst-Case Predictions in PromQL

  # Average values for each component, over a week
  - record: service_component:saturation:ratio:avg_over_time_1w
    expr: >
      avg_over_time(service_component:saturation:ratio[1w])

  # Stddev for each resource saturation component, over a week
  - record: service_component:saturation:ratio:stddev_over_time_1w
    expr: >
      stddev_over_time(service_component:saturation:ratio[1w])

Slide 33

Worst-Case Predictions in PromQL

  - record: service_component:saturation:ratio:predict_linear_2w
    expr: >
      predict_linear(
        service_component:saturation:ratio:avg_over_time_1w[1w],
        86400 * 14  # 14 days, in seconds
      )

Slide 34

Retrofitting the model to the data

Slide 35

Top 10 highest-priority saturation points
Lists the top 10 highest-priority potential resource saturation issues:

  topk(10,
    clamp_min(
      service_component:saturation:ratio:predict_linear_2w, 0
    )
    + 2 * service_component:saturation:ratio:stddev_over_time_1w
  )
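The ranking that `topk` produces can be sketched in Python over hypothetical per-component predictions; the clamping mirrors `clamp_min` in the query above:

```python
def top_saturation_risks(worst_case_by_component, k=10):
    """Rank (service, component) pairs by predicted worst-case saturation,
    most urgent first. Negative trend predictions are clamped to zero,
    mirroring clamp_min() in the PromQL query."""
    clamped = {key: max(value, 0.0)
               for key, value in worst_case_by_component.items()}
    return sorted(clamped.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical worst-case predictions per (service, component):
predictions = {
    ("gitaly", "disk_space"): 0.97,
    ("pgbouncer", "single_core_cpu"): 0.88,
    ("sidekiq", "memory"): -0.05,  # trending down; clamps to 0.0
}
print(top_saturation_risks(predictions, k=2))
```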

Slide 36


Slide 37

Conclusion
● Ranking, not precision: capacity planning signals provide an ordered list of potential bottlenecks, from most urgent to least urgent, which feeds into our roadmap for prioritization.
● Dogfooding: we’re using these techniques on GitLab.com, and we will be incorporating some of them into GitLab (the product) to help GitLab sysadmins and support engineers predict problems.
● Early days: we’re still figuring this out. If you have ideas, please get in touch.

Slide 38

Questions?
Andrew Newdigate | @suprememoocow
GitLab.com Resource Saturation Monitoring and Capacity Planning rules: https://gitlab.com/gitlab-com/runbooks
Thank you