Managing Prometheus with Promgen


Paul Traylor (LINE Fukuoka, Software Engineer)
Promgen is LINE's open source tool for managing both Prometheus configuration and alert routing. This talk introduces how Promgen was developed and how it can be used to facilitate monitoring as a service.


LINE Developers

October 07, 2020

Transcript

  1. Paul Traylor - LINE Fukuoka - 2020/10/07 Managing Prometheus with Promgen
  2. Introduction - Paul Traylor @kfdm
    • Grew up near Raleigh, North Carolina
    • Previously worked in San Francisco, California
    • LINE Fukuoka ~4 years
    • Primarily focused on Monitoring as a Service
    • Dabbles in Swift and iOS development
  3. Agenda
    • What is Prometheus?
    • What is Promgen?
    • Promgen Development History
    • Challenges
    • Future Goals
  4. What is Prometheus? https://prometheus.io/docs/introduction/overview/
    • Pull-based monitoring system
    • Flexible alerting rules
    • Default support in Grafana
    • Cloud Native Project https://www.cncf.io/projects/
    • Currently undergoing standardization https://openmetrics.io/
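Because Prometheus pulls, a target only has to expose its current values over HTTP for the server to scrape. The following is a minimal sketch of rendering the text exposition format in Python. The metric name, help text, and label values are made up for illustration, and a real service would normally rely on an official client library rather than hand-rolled formatting.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family. `samples` is a list of (labels dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter with two label sets.
text = render_metric(
    "demo_requests_total",
    "Requests handled (hypothetical metric).",
    "counter",
    [
        ({"method": "get", "code": "200"}, 1027),
        ({"method": "post", "code": "500"}, 3),
    ],
)
print(text)
```

Serving this text from a `/metrics` endpoint is all a target needs to do; scheduling, storage, and alerting stay on the Prometheus side.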
  5. Example Alerting https://prometheus.io/docs/prometheus/latest/querying/basics/
    • Watching error rate of Alertmanager notifications
      • rate(alertmanager_notifications_failed_total[5m]) > 0
    • Watching the memory of a server
      • (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes) / node_memory_MemTotal_bytes > 0.95
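To make the arithmetic behind these two expressions concrete, here is a small Python sketch of what each one computes. The sample values are hypothetical, and `simple_rate` is a deliberate simplification: Prometheus's own `rate()` also extrapolates to the edges of the time window, which this sketch does not attempt.

```python
def memory_used_ratio(total, free, cached, buffers):
    """Fraction of memory in use: (total - free - cached - buffers) / total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - free - cached - buffers) / total


def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    Counter resets (a value lower than the previous one) are treated as a
    restart from zero. Window extrapolation is intentionally left out.
    """
    if len(samples) < 2:
        return 0.0
    duration = samples[-1][0] - samples[0][0]
    if duration <= 0:
        return 0.0
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        increase += (value - previous) if value >= previous else value
        previous = value
    return increase / duration


# Hypothetical 16 GiB host, values in bytes.
MIB = 1024 ** 2
ratio = memory_used_ratio(
    total=16384 * MIB, free=256 * MIB, cached=256 * MIB, buffers=128 * MIB
)
memory_alert = ratio > 0.95

# Hypothetical failed-notification counter sampled over five minutes.
failed = [(0, 10), (60, 10), (120, 13), (300, 13)]
failure_rate = simple_rate(failed)
notification_alert = failure_rate > 0

print(ratio, memory_alert, failure_rate, notification_alert)
```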
  6. Prometheus at LINE
    • 46,000,000+ samples per scrape interval
    • 34,000+ targets
    • 8+ shards
  7. What is Promgen? https://github.com/line/promgen
    • Provide Prometheus scrape configuration via file discovery
    • Provide an easy way to manage alert rules
    • Route notifications from Alertmanager
    (Diagram: Prometheus, Promgen, Alertmanager)
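Prometheus file discovery reads JSON (or YAML) files that contain a list of target groups, each with `targets` and `labels`. The sketch below shows how a tool might generate such a file from a project inventory. The inventory layout, the host names, and the choice of `service`, `project`, and `job` labels are assumptions for illustration, not a description of Promgen's actual code.

```python
import json


def build_file_sd(projects):
    """Turn a (hypothetical) project inventory into file_sd target groups."""
    groups = []
    for project in projects:
        for exporter in project["exporters"]:
            groups.append(
                {
                    "targets": [
                        f"{host}:{exporter['port']}" for host in project["hosts"]
                    ],
                    "labels": {
                        "service": project["service"],
                        "project": project["project"],
                        "job": exporter["job"],
                    },
                }
            )
    return groups


# Hypothetical inventory: one project, two hosts, one exporter.
projects = [
    {
        "service": "shop",
        "project": "checkout",
        "hosts": ["web-1.example.com", "web-2.example.com"],
        "exporters": [{"job": "node", "port": 9100}],
    }
]

target_groups = build_file_sd(projects)
rendered = json.dumps(target_groups, indent=2, sort_keys=True)
print(rendered)
```

Writing `rendered` to a path listed under `file_sd_configs` in the Prometheus configuration is enough for Prometheus to pick up the targets; it re-reads the file when it changes.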
  8. Why did we build Promgen?
    • Prometheus has several discovery methods built in
      • Kubernetes
      • Consul
      • Docker
      • etc
    • If you need something custom, there is file discovery
  9. Why did we build Promgen?
    • Custom Prometheus discovery is via files
    • Developers do not want to manually write configurations
      • Prometheus scrape configs
      • Alert rules
      • Alertmanager routing
    • Want a web UI to quickly configure things
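One way to avoid hand-written Alertmanager routing is to point Alertmanager at a single webhook and let the receiving application choose the destinations from the alert's labels. The sketch below assumes the standard Alertmanager webhook payload field `commonLabels`; the `project` and `service` label names, the notifier table, and the destination strings are hypothetical and only illustrate the idea.

```python
def destinations_for(payload, notifiers):
    """Pick notification destinations from the labels shared by all alerts.

    `notifiers` maps (label name, label value) to a list of destinations.
    Project-level entries are checked before service-level ones, and
    duplicates are dropped while preserving order.
    """
    labels = payload.get("commonLabels", {})
    found = []
    for label in ("project", "service"):
        key = (label, labels.get(label))
        for destination in notifiers.get(key, []):
            if destination not in found:
                found.append(destination)
    return found


# Hypothetical notifier table, as it might be configured through a web UI.
notifiers = {
    ("project", "checkout"): ["slack:#checkout-alerts"],
    ("service", "shop"): ["email:shop-oncall@example.com", "slack:#checkout-alerts"],
}

# Trimmed-down example of a webhook payload.
payload = {
    "status": "firing",
    "commonLabels": {
        "alertname": "InstanceDown",
        "project": "checkout",
        "service": "shop",
    },
    "alerts": [{"status": "firing", "labels": {"instance": "web-2.example.com:9100"}}],
}

routes = destinations_for(payload, notifiers)
print(routes)
```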
  10. First version - Sinatra app
    • Great proof of concept
    • Single list of projects
    • Only one of each type of notifier
    • This was originally the version that was open sourced
    • Only supported regular targets and notifications
    • Only global alerting rule support
  11. Django Rewrite
    • Take advantage of ORM, migrations, and admin site
    • Use Celery for distributing configuration changes and queuing notifications
    • Support for multiple Prometheus servers
    • Support for monitoring other endpoints via blackbox_exporter
  12. Various Upgrades
    • Prometheus 1.x -> 2.x upgrade
      • Rule format changed
    • Shard refactoring
      • Service -> Shard migrated to Project -> Shard
    • Permissions for certain global objects
    • Silence from UI
    • Migrating from jQuery to Vue.js
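The rule format change is the kind of upgrade that is easier when rules live in a database and files are generated. Prometheus 1.x used its own `ALERT ... IF ... FOR ...` syntax, while 2.x uses YAML rule groups. The sketch below renders one rule stored as a plain dictionary into both formats. The rule contents, the group name, and the dictionary layout are assumptions for illustration; string values are emitted with `json.dumps`, relying on JSON strings being valid double-quoted YAML scalars.

```python
import json


def render_v1(rule):
    """Render a rule in the Prometheus 1.x ALERT syntax."""

    def pairs(mapping):
        return ", ".join(f"{k} = {json.dumps(v)}" for k, v in mapping.items())

    lines = [
        f"ALERT {rule['name']}",
        f"  IF {rule['expr']}",
        f"  FOR {rule['for']}",
        "  LABELS { " + pairs(rule["labels"]) + " }",
        "  ANNOTATIONS { " + pairs(rule["annotations"]) + " }",
    ]
    return "\n".join(lines) + "\n"


def render_v2(group_name, rules):
    """Render rules as a Prometheus 2.x YAML rule group."""
    lines = ["groups:", f"- name: {json.dumps(group_name)}", "  rules:"]
    for rule in rules:
        lines.append(f"  - alert: {rule['name']}")
        lines.append(f"    expr: {json.dumps(rule['expr'])}")
        lines.append(f"    for: {rule['for']}")
        lines.append("    labels:")
        lines += [f"      {k}: {json.dumps(v)}" for k, v in rule["labels"].items()]
        lines.append("    annotations:")
        lines += [f"      {k}: {json.dumps(v)}" for k, v in rule["annotations"].items()]
    return "\n".join(lines) + "\n"


# Hypothetical stored rule, reusing the expression from the alerting example.
rule = {
    "name": "AlertmanagerNotificationsFailing",
    "expr": "rate(alertmanager_notifications_failed_total[5m]) > 0",
    "for": "5m",
    "labels": {"severity": "major"},
    "annotations": {"summary": "Alertmanager is failing to send notifications"},
}

v1_text = render_v1(rule)
v2_text = render_v2("example", [rule])
print(v1_text)
print(v2_text)
```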
  13. Challenges - Terminology
    • Prometheus has many terms like “discovery”, “scrapes”, “targets” that need to be defined
    • Promgen also has many terms like “project”, “service”, “farm”
    • Some of these come from in-house usage that may not make sense to others
    • Hard to rename later
    • May be a few terms I can still clean up like “job” vs “exporter”
  14. Challenges - Education
    • There are still many developers who are new to Prometheus, so a lot of inline help is needed
    • PromQL is new to many, so examples in SQL often help illustrate points
      • PromQL `sum` and `count` with a `by` clause are similar to aggregates with MySQL `GROUP BY`
    • This becomes very important for notifications
      • `up` vs `count(up)` vs `count(up) by (service)`
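The difference between these three queries can be sketched in Python by treating each time series as a row with labels. The series below are hypothetical. The point is how many results each query returns, because an alert rule fires once per result, which directly controls how many notifications are sent.

```python
from collections import Counter

# Hypothetical `up` series: (labels, value), where 1 = reachable, 0 = down.
series = [
    ({"service": "shop", "instance": "web-1:9100"}, 1),
    ({"service": "shop", "instance": "web-2:9100"}, 0),
    ({"service": "search", "instance": "idx-1:9100"}, 1),
    ({"service": "search", "instance": "idx-2:9100"}, 0),
]


def query_up(samples):
    """`up`: one result per series, like SELECT * FROM up."""
    return list(samples)


def query_count(samples):
    """`count(up)`: a single result, like SELECT COUNT(*) FROM up."""
    return len(samples)


def query_count_by(samples, label):
    """`count(up) by (label)`: like SELECT label, COUNT(*) ... GROUP BY label."""
    return dict(Counter(labels[label] for labels, _ in samples))


# `up == 0` keeps only the series whose value is 0.
down = [(labels, value) for labels, value in series if value == 0]

per_instance = query_up(down)                   # alerting here: one alert per instance
overall = query_count(down)                     # alerting here: one alert in total
per_service = query_count_by(down, "service")   # alerting here: one alert per service

print(len(per_instance), overall, per_service)
```

With hundreds of instances behind one service, grouping by `service` turns a flood of per-instance notifications into a single message for the team that owns it.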
  15. Challenges - Day to Day tasks
    • OSS is not a full-time job, so triage of line/promgen tickets has to be balanced against other work
    • Often it’s easier to track tasks internally instead of tracking everything on GitHub
  16. Challenges - Timezones
    • OSS does not have time zone boundaries, so questions may come from anywhere in the world
    • Often there is an extra delay just due to time zones
  17. Future Goals
    • Finish jQuery -> Vue.js migration
    • Better user management / controls
    • Better inline help messages
    • Finish cleaning up API
  18. Prometheus at LINE DEVELOPER DAY 2020 https://linedevday.linecorp.com/jp/2020/
    • Using Promgen to manage Prometheus
    • Alerts and Routing
    • Supporting Tenants
    • Scaling for Long Term Storage
  19. LINE Fukuoka
    • https://linefukuoka.co.jp/ja/career