
Using Prometheus to Provide Large Scale Server Monitoring

LINE DevDay 2020

November 25, 2020

Transcript

  1. None
  2. Introduction - Paul Traylor
     › Grew up near Raleigh, North Carolina
     › Previously worked in San Francisco, California
     › LINE Fukuoka ~4 years
     › Primarily focused on Monitoring as a Service
     › Dabbles in Swift and iOS development
  3. Agenda
     › Overview of Prometheus
     › Using Promgen to manage Prometheus
     › Alerts and Routing
     › Supporting Tenants
     › Scaling for Long Term Storage
  4. What is Prometheus?
     › Pull based monitoring system
     › Flexible alerting rules
     › Default support in Grafana
     › CNCF Graduated Project
     › Currently undergoing standardization: https://openmetrics.io/
     https://prometheus.io/docs/introduction/overview/
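As a sketch of the pull model described above: Prometheus is given a list of targets and scrapes each one on an interval (the hostname and job name here are illustrative, not from the talk):

```yaml
# Minimal Prometheus config sketch: pull metrics from one target
# every 15 seconds (the interval mentioned later in the talk).
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['node001.example.com:9100']
```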
  5. What is Promgen?
     › Provides Prometheus configuration via file discovery
     › Routes notifications from Alertmanager
     https://github.com/line/promgen
  6. Why did we build Promgen?
     › Developers do not want to manually write configurations
       • Prometheus scrape configs
       • Alert rules
       • Alertmanager routing
     › Want a web UI to quickly configure things
     › Custom Prometheus discovery is file based
     https://github.com/line/promgen
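The file-based discovery mentioned above can be sketched like this: a tool such as Promgen writes target files, and Prometheus watches them via `file_sd_configs` (the path and job name are assumptions for illustration, not Promgen's actual output layout):

```yaml
# Prometheus side: reload targets whenever files under this path change.
scrape_configs:
  - job_name: promgen
    file_sd_configs:
      - files: ['/etc/prometheus/promgen/*.json']
```

Each discovered file holds entries like `[{"targets": ["app001.example.com:9100"], "labels": {"service": "A", "project": "A1"}}]`, which is one way the `service`/`project` labels used for routing later in the talk can be attached to every scraped series.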
  7. Prometheus at LINE

  8. Prometheus at LINE
     › 46,000,000+ time series (15 second scrape interval)
     › 34,000+ targets
     › 9+ shards
     › 2M~8M time series per server
     › 2 weeks ~ 10 weeks of data per server
  9. Prometheus as a Service: LINE Scale - Challenges
     › Reliability (Sharding)
     › Performance (Alerting)
     › Multi-tenancy
     › Long term storage
  10. Clusters - Prometheus
     › Alpha Cluster
       • Testing and development
       • Mostly VMs (~10)
     › Beta Cluster
       • Integration testing
       • Monitoring for development and staging services
       • Several default rules disabled to reduce noise
     › Release Cluster
       • Monitoring for live services
  11. Sharding - Prometheus: Release Cluster - run everything in pairs
     › 20 Prometheus servers (physical servers)
       • 20 CPU / 256GB memory / 10G network / SAS 2400 GB
       • 2 Prometheus servers per shard (for HA)
     › 2 web servers
       • Nginx / Grafana / Promgen / Thanos Query
     › 4 blackbox_exporter probes
       • 2 monitoring inside private cloud (Verda)
       • 2 monitoring from external cloud (AWS)
  12. Sharding - Prometheus
     › 4 general purpose shards
       • Free for any developer to register on
     › Example team specific shards
       • LINE Shop
         • Spring Boot + Armeria
         • Large number of samples per target
       • Data Analysis
         • Hadoop + Fluentd + Kafka
         • Fewer samples per target / more targets
       • Ads platform
         • Many microservices in Go
         • Tendency to include many campaign IDs in labels
       • Securities
         • Separate environment for security reasons
  13. Alerts - Alertmanager

  14. Alerts - Routing
     › Promgen uses "service" and "project" labels to route notifications
     › Easy to accidentally erase them using "sum()" and "count()"
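A hedged sketch of how routing on those labels might look on the Alertmanager side, grouping and matching on `service` and `project` (the receiver names are hypothetical, not from the talk):

```yaml
route:
  receiver: default
  group_by: ['service', 'project']
  routes:
    - match:
        service: A
      receiver: team-a
receivers:
  - name: default
  - name: team-a
```

If a rule aggregates those labels away with a bare `sum()`, the resulting alert carries no `service`/`project` labels, matches no team route, and falls through to the default receiver.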
  15. Alerts - Routing
     some_metric
       {service="A", project="A1", instance="aaa001.example.com"} 1
       {service="A", project="A1", instance="aaa002.example.com"} 1
     sum(some_metric)
       {} 2
     sum(some_metric) by (service, project)
       {service="A", project="A1"} 2
  16. Alerts - Global Alert Rules vs Custom
     › Want to have various default rules
       • Memory
       • Disk
     › Users are free to register custom rules for their service
     › We also support overriding with custom thresholds
  17. Alerts - Parent and Child Rules
     # Global rule excludes children
     # Uses Promgen <exclude> tag internally
     example_rule{<exclude>}
     example_rule{service!~"A|B"}
     # Service A override includes self
     - example_rule{service="A"}
     # Service B override includes self, but excludes children
     - example_rule{service="B", project!~"C"}
     # Project override
     - example_rule{project="C"}
  18. Alerts - Global
     PromgenLargeMetricsIncrease
       meta:samples:sum = sum by (service, project, job) (scrape_samples_scraped)
       meta:samples:sum > 2 * meta:samples:sum offset 1w > 100000
     ExcessivelyLargeScrape
       scrape_samples_scraped > 100000
     SSLCertificateExpireSoon
       probe_ssl_earliest_cert_expiry - time() < 60 * 60 * 24 * 14
  19. Alerts - Internal
     AlertDeliveryError
       rate(alertmanager_notifications_failed_total[5m]) > 0
     PrometheusNotificationDelivery
       rate(prometheus_notifications_errors_total[5m]) > 0
     PrometheusConfigSync
       max(prometheus_config_last_reload_success_timestamp_seconds) by (project, service)
         - min(prometheus_config_last_reload_success_timestamp_seconds) by (project, service) > 120
  20. Tenants - Grafana: Main Grafana Instance
     › Shared across most teams
       • ~400 active users
       • ~50 organizations
       • ~1300 dashboards created
  21. Tenants - Grafana
     › Shared organization
       • Group permissions
       • Authed against internal GHE instance
       • Manual login for project managers and other users
     › Custom organizations
       • Custom dashboards
       • Datasources
     › Contractor organizations
       • Separate instances in Docker containers
       • Authed against specific GHE orgs
       • Should only be able to see their own data
  22. Tenants - promql-guard
     https://github.com/kfdm/promql-guard

  23. Tenants - promql-guard
     # htpasswd is used for tenant passwords
     htpasswd: guard.htpasswd
     # tenants configured with PromQL matchers
     hosts:
       - username: tenantA
         prometheus:
           upstream: https://prometheus.example.com
           matcher: '{service="tenantA"}'
       - username: tenantB
         prometheus:
           upstream: https://thanos.example.com
           matcher: '{app=~"appY|appZ"}'

     foo - bar          -> foo{service="tenantA"} - bar{service="tenantA"}
     secret{app="appX"} -> secret{app="appX", app=~"appY|appZ"}
     https://github.com/kfdm/promql-guard
  24. Scaling - Long Term Storage
     › Prometheus does a good job showing recent metrics
     › Sometimes users want to know what happened 6+ months ago
     › Need to use something else to store historical data
  25. Scaling - Thanos Components

  26. Scaling - Thanos Data Model
     › Uses object storage (e.g. S3, Minio, etc.)
       • 150+ TB in object storage
       • 600,000+ objects
       • Over 1 year of data in release environment
       • Over 2 years of data in beta environment
     › Uses Prometheus block format
       • 2h, 8h, 2d, 2w
       • Some 2w blocks are 500+ GB
  27. Scaling - Thanos at LINE
     › 9 thanos-compact nodes
       • Need lots of disk to rewrite blocks
       • Some 2w blocks are 500+ GB
       • Sometimes have to order "custom" Verda servers
     › 5 thanos-store nodes
       • Need disk for blocks
       • Need memory for index
  28. Scaling - thanos-store
     › Currently storing in a single, large bucket
     › Assigning time buckets
       • --min-time
       • --max-time
     › Splitting up the label set
       • --selector.relabel-config-file
       • --selector.relabel-config
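A sketch of the label-set split, assuming blocks carry an external `shard` label (the label name and regex ranges are assumptions for illustration): a relabel config passed via `--selector.relabel-config-file` keeps only a subset of blocks on each store node.

```yaml
# Store node 1: serve only blocks whose external "shard" label is 0-4.
# A second node would use the complementary regex ("[5-9]").
- action: keep
  source_labels: [shard]
  regex: "[0-4]"
```

The `--min-time`/`--max-time` flags split by time instead, e.g. one store node serving only blocks older than a cutoff and another serving the recent range.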
  29. Scaling - thanos-compact
     › 9 shards, matched by block number
     › Takes serial compaction and makes it parallel
       • --debug.max-compaction-level
       • --selector.relabel-config-file
       • --selector.relabel-config
  30. Summary
     › A single Prometheus server can scale pretty well
     › Careful sharding can scale further
     › Pay attention to rule queries
     › Be aware of sharing data
     › Tools like Thanos can make Prometheus scale further