
Using Prometheus to Provide Large Scale Server Monitoring

LINE DevDay 2020

November 25, 2020

Transcript

  1. None
  2. Introduction - Paul Traylor
     › Grew up near Raleigh, North Carolina
     › Previously worked in San Francisco, California
     › LINE Fukuoka ~4 years
     › Primarily focused on Monitoring as a Service
     › Dabbles in Swift and iOS development
  3. Agenda
     › Overview of Prometheus
     › Using Promgen to manage Prometheus
     › Alerts and Routing
     › Supporting Tenants
     › Scaling for Long Term Storage
  4. What is Prometheus?
     › Pull based monitoring system
     › Flexible alerting rules
     › Default support in Grafana
     › CNCF Graduated Project
     › Currently undergoing standardization: https://openmetrics.io/
     https://prometheus.io/docs/introduction/overview/
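As a sketch of the pull model described above: Prometheus is given a list of targets and scrapes each one on an interval (the hostname and job name here are illustrative, not from the talk):

```yaml
# Minimal Prometheus config sketch: pull metrics from one target
# every 15 seconds (the interval mentioned later in the talk).
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['node001.example.com:9100']
```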
  5. What is Promgen?
     › Provides Prometheus configuration via file discovery
     › Routes notifications from Alertmanager
     https://github.com/line/promgen
  6. Why did we build Promgen?
     › Developers do not want to manually write configurations
       • Prometheus scrape configs
       • Alert rules
       • Alertmanager routing
     › Want a web UI to quickly configure things
     › Custom Prometheus discovery is file based
     https://github.com/line/promgen
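The file-based discovery mentioned above can be sketched like this: a tool such as Promgen writes target files, and Prometheus watches them via `file_sd_configs` (the path and job name are assumptions for illustration, not Promgen's actual output layout):

```yaml
# Prometheus side: reload targets whenever files under this path change.
scrape_configs:
  - job_name: promgen
    file_sd_configs:
      - files: ['/etc/prometheus/promgen/*.json']
```

Each discovered file holds entries like `[{"targets": ["app001.example.com:9100"], "labels": {"service": "A", "project": "A1"}}]`, which is one way the `service`/`project` labels used for routing later in the talk can be attached to every scraped series.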
  7. Prometheus at LINE

  8. Prometheus at LINE
     › 46,000,000+ time series (15 second scrape interval)
     › 34,000+ targets
     › 9+ shards
     › 2M~8M time series per server
     › 2 weeks ~ 10 weeks of data per server
  9. Prometheus as a Service: LINE Scale - Challenges
     › Reliability (Sharding)
     › Performance (Alerting)
     › Multi-tenancy
     › Long term storage
  10. Clusters - Prometheus
     › Alpha Cluster
       • Testing and development
       • Mostly VMs (~10)
     › Beta Cluster
       • Integration testing
       • Monitoring for development and staging services
       • Several default rules disabled to reduce noise
     › Release Cluster
       • Monitoring for live services
  11. Sharding - Prometheus: Release Cluster - run everything in pairs
     › 20 Prometheus servers (physical servers)
       • 20 CPU / 256GB memory / 10G network / SAS 2400 GB
       • 2 Prometheus servers per shard (for HA)
     › 2 web servers
       • Nginx / Grafana / Promgen / Thanos Query
     › 4 blackbox_exporter probes
       • 2 monitoring inside private cloud (Verda)
       • 2 monitoring from external cloud (AWS)
  12. Sharding - Prometheus
     › 4 general purpose shards
       • Free for any developer to register on
     › Example team specific shards
       • LINE Shop
         • Spring Boot + Armeria
         • Large number of samples per target
       • Data Analysis
         • Hadoop + Fluentd + Kafka
         • Fewer samples per target / more targets
       • Ads platform
         • Many microservices in Go
         • Tendency to include many campaign IDs in labels
       • Securities
         • Separate environment for security reasons
  13. Alerts - Alertmanager

  14. Alerts - Routing
     › Promgen uses "service" and "project" labels to route notifications
     › Easy to accidentally erase them using "sum()" and "count()"
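A hedged sketch of how routing on those labels might look on the Alertmanager side, grouping and matching on `service` and `project` (the receiver names are hypothetical, not from the talk):

```yaml
route:
  receiver: default
  group_by: ['service', 'project']
  routes:
    - match:
        service: A
      receiver: team-a
receivers:
  - name: default
  - name: team-a
```

If a rule aggregates those labels away with a bare `sum()`, the resulting alert carries no `service`/`project` labels, matches no team route, and falls through to the default receiver.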
  15. Alerts - Routing
     some_metric
       {service="A", project="A1", instance="aaa001.example.com"} 1
       {service="A", project="A1", instance="aaa002.example.com"} 1
     sum(some_metric)
       {} 2
     sum(some_metric) by (service, project)
       {service="A", project="A1"} 2
  16. Alerts - Global Alert Rules vs Custom
     › Want to have various default rules
       • Memory
       • Disk
     › Users are free to register custom rules for their service
     › We also support overriding with custom thresholds
  17. Alerts - Parent and Child Rules
     # Global rule excludes children
     # Uses Promgen <exclude> tag internally
     example_rule{<exclude>}
     example_rule{service!~"A|B"}
     # Service A override includes self
     - example_rule{service="A"}
     # Service B override includes self, but excludes children
     - example_rule{service="B", project!~"C"}
     # Project override
     - example_rule{project="C"}
  18. Alerts - Global
     PromgenLargeMetricsIncrease
       meta:samples:sum = sum by (service, project, job) (scrape_samples_scraped)
       meta:samples:sum > 2 * meta:samples:sum offset 1w > 100000
     ExcessivelyLargeScrape
       scrape_samples_scraped > 100000
     SSLCertificateExpireSoon
       probe_ssl_earliest_cert_expiry - time() < 60 * 60 * 24 * 14
  19. Alerts - Internal
     AlertDeliveryError
       rate(alertmanager_notifications_failed_total[5m]) > 0
     PrometheusNotificationDelivery
       rate(prometheus_notifications_errors_total[5m]) > 0
     PrometheusConfigSync
       max(prometheus_config_last_reload_success_timestamp_seconds) by (project, service)
         - min(prometheus_config_last_reload_success_timestamp_seconds) by (project, service) > 120
  20. Tenants - Grafana: Main Grafana Instance
     › Shared across most teams
       • ~400 active users
       • ~50 organizations
       • ~1300 dashboards created
  21. Tenants - Grafana
     › Shared organization
       • Group permissions
       • Authed against internal GHE instance
       • Manual login for project managers and other users
     › Custom organizations
       • Custom dashboards
       • Datasources
     › Contractor organizations
       • Separate instances in Docker containers
       • Authed against specific GHE orgs
       • Should only be able to see their own data
  22. Tenants - promql-guard
     https://github.com/kfdm/promql-guard

  23. Tenants - promql-guard
     # htpasswd is used for tenant passwords
     htpasswd: guard.htpasswd
     # tenants configured with PromQL matchers
     hosts:
       - username: tenantA
         prometheus:
           upstream: https://prometheus.example.com
           matcher: '{service="tenantA"}'
       - username: tenantB
         prometheus:
           upstream: https://thanos.example.com
           matcher: '{app=~"appY|appZ"}'

     foo - bar          -> foo{service="tenantA"} - bar{service="tenantA"}
     secret{app="appX"} -> secret{app="appX", app=~"appY|appZ"}
     https://github.com/kfdm/promql-guard
  24. Scaling - Long Term Storage
     › Prometheus does a good job showing recent metrics
     › Sometimes users want to know what happened 6+ months ago
     › Need to use something else to store historical data
  25. Scaling - Thanos Components

  26. Scaling - Thanos Data Model
     › Uses object storage (e.g. S3, Minio, etc.)
       • 150+ TB in object storage
       • 600,000+ objects
       • Over 1 year of data in release environment
       • Over 2 years of data in beta environment
     › Uses Prometheus block format
       • 2h, 8h, 2d, 2w
       • Some 2w blocks are 500+ GB
  27. Scaling - Thanos at LINE
     › 9 thanos-compact nodes
       • Need lots of disk to rewrite blocks
       • Some 2w blocks are 500+ GB
       • Sometimes have to order "custom" Verda servers
     › 5 thanos-store nodes
       • Need disk for blocks
       • Need memory for index
  28. Scaling - thanos-store
     › Currently storing in a single, large bucket
     › Assigning time buckets
       • --min-time
       • --max-time
     › Splitting up the label set
       • --selector.relabel-config-file
       • --selector.relabel-config
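A sketch of the label-set split, assuming blocks carry an external `shard` label (the label name and regex ranges are assumptions for illustration): a relabel config passed via `--selector.relabel-config-file` keeps only a subset of blocks on each store node.

```yaml
# Store node 1: serve only blocks whose external "shard" label is 0-4.
# A second node would use the complementary regex ("[5-9]").
- action: keep
  source_labels: [shard]
  regex: "[0-4]"
```

The `--min-time`/`--max-time` flags split by time instead, e.g. one store node serving only blocks older than a cutoff and another serving the recent range.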
  29. Scaling - thanos-compact
     › 9 shards, matched by block number
     › Takes serial compaction and makes it parallel
       • --debug.max-compaction-level
       • --selector.relabel-config-file
       • --selector.relabel-config
  30. Summary
     › A single Prometheus server can scale pretty well
     › Careful sharding can scale further
     › Pay attention to rule queries
     › Be aware of sharing data
     › Tools like Thanos can make Prometheus scale further