Slide 1

Slide 1 text

Paul Traylor - LINE Fukuoka - 2020/10/07 Managing Prometheus with Promgen

Slide 2

Slide 2 text

Introduction - Paul Traylor @kfdm • Grew up near Raleigh, North Carolina • Previously worked in San Francisco, California • LINE Fukuoka ~4 years • Primarily focused on Monitoring as a Service • Dabbles in Swift and iOS development

Slide 3

Slide 3 text

Agenda • What is Prometheus? • What is Promgen? • Promgen Development History • Challenges • Future Goals

Slide 4

Slide 4 text

What is Prometheus? https://prometheus.io/docs/introduction/overview/ • Pull-based monitoring system • Flexible alerting rules • Default support in Grafana • Cloud Native Project
 https://www.cncf.io/projects/ • Currently undergoing standardization
 https://openmetrics.io/

Slide 5

Slide 5 text

Example Alerting https://prometheus.io/docs/prometheus/latest/querying/basics/ • Watching error rate of Alertmanager notifications • rate(alertmanager_notifications_failed_total[5m]) > 0 • Watching the memory of a server • (node_memory_MemTotal_bytes -
 node_memory_MemFree_bytes -
 node_memory_Cached_bytes -
 node_memory_Buffers_bytes ) /
 node_memory_MemTotal_bytes > 0.95
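The memory expression above is plain arithmetic, so it can be checked by hand. The following is a minimal Python sketch; `memory_used_ratio` is a hypothetical helper and the byte values are made up for illustration, not taken from a real server.

```python
def memory_used_ratio(mem_total, mem_free, cached, buffers):
    """Mirror the PromQL expression: (total - free - cached - buffers) / total."""
    return (mem_total - mem_free - cached - buffers) / mem_total


# Illustrative values in bytes (assumed, not measured).
GIB = 1024 ** 3
ratio = memory_used_ratio(
    mem_total=16 * GIB,
    mem_free=GIB // 4,
    cached=GIB // 4,
    buffers=GIB // 8,
)

# The alert condition from the slide: fire when more than 95% is in use.
should_alert = ratio > 0.95
```

With these assumed numbers, 15.375 GiB of 16 GiB is in use, so the condition holds.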

Slide 6

Slide 6 text

Prometheus at LINE • 46,000,000+ samples per scrape interval • 34,000+ targets • 8+ shards

Slide 7

Slide 7 text

What is Promgen? https://github.com/line/promgen • Provide Prometheus scrape configuration via file discovery • Provide an easy way to manage alert rules • Route notifications from Alertmanager (Diagram: Prometheus, Promgen, Alertmanager)

Slide 8

Slide 8 text

Why did we build Promgen? • Prometheus has several discovery methods built in • Kubernetes • Consul • Docker • etc. • If you need something custom, there is file discovery
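To make file discovery concrete: Prometheus reads JSON (or YAML) files containing a list of target groups, each with `targets` and optional `labels`. The sketch below shows one way such a file could be generated; it is not Promgen's actual implementation. The helper names, hostnames, and label names are hypothetical.

```python
import json
import os
import tempfile


def render_file_sd(groups):
    """Build the list of target groups that file discovery reads.

    `groups` is a list of (labels, targets) pairs.
    """
    return [
        {"targets": sorted(targets), "labels": dict(labels)}
        for labels, targets in groups
    ]


def write_atomically(path, payload):
    """Write to a temp file, then rename, so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle, indent=2)
    os.replace(tmp_path, path)


# Hypothetical project with two node_exporter targets.
payload = render_file_sd([
    (
        {"service": "web", "project": "frontend"},
        ["web-2.example.com:9100", "web-1.example.com:9100"],
    ),
])

# Example call (the path is an assumption):
# write_atomically("/etc/prometheus/targets/promgen.json", payload)
```

Prometheus would then pick the file up through a `file_sd_configs` entry whose `files` pattern matches the path.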

Slide 9

Slide 9 text

Why did we build Promgen? • Custom Prometheus discovery is via files • Developers do not want to manually write configurations • Prometheus scrape configs • Alert rules • Alertmanager routing • Want a web UI to quickly configure things
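As an illustration of the kind of configuration developers would otherwise write by hand, here is a rough Python sketch that builds a Prometheus 2.x rule group as a plain dictionary. The function name, group name, alert name, and label values are assumptions for the example; serializing the result to a YAML rule file is left to whichever YAML library is in use.

```python
def render_rule_group(name, rules):
    """Return a dict shaped like a Prometheus 2.x rule file with one group."""
    return {
        "groups": [
            {
                "name": name,
                "rules": [
                    {
                        "alert": rule["alert"],
                        "expr": rule["expr"],
                        "for": rule.get("for", "5m"),  # default is an assumption
                        "labels": dict(rule.get("labels", {})),
                        "annotations": dict(rule.get("annotations", {})),
                    }
                    for rule in rules
                ],
            }
        ]
    }


# Hypothetical rule reusing the Alertmanager expression from the earlier slide.
rule_file = render_rule_group(
    "example-service",
    [
        {
            "alert": "AlertmanagerNotificationsFailing",
            "expr": "rate(alertmanager_notifications_failed_total[5m]) > 0",
            "labels": {"severity": "major"},
            "annotations": {"summary": "Alertmanager is failing to send notifications"},
        }
    ],
)
```

A generated file can be validated with `promtool check rules` before Prometheus loads it.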

Slide 10

Slide 10 text

First version - Sinatra app • Great proof of concept • Single list of projects • Only one of each type of notifier • This was originally the version that was open sourced • Only supported regular targets and notifications • Only global alerting rule support

Slide 11

Slide 11 text

Django Rewrite • Take advantage of ORM, migrations, and admin site • Use Celery for distributing configuration changes and queuing notifications • Support for multiple Prometheus servers • Support for monitoring other endpoints via blackbox_exporter

Slide 12

Slide 12 text

Various Upgrades • Prometheus 1.x -> 2.x upgrade • Rule format changed • Shard refactoring • Service -> Shard migrated to Project -> Shard • Permissions for certain global objects • Silence from UI • Migrating from jQuery to Vue.js

Slide 13

Slide 13 text

Challenges - Terminology • Prometheus has many terms like “discovery”, “scrapes”, “targets” that need to be defined • Promgen also has many terms like “project”, “service”, “farm” • Some of these come from in-house usage that may not make sense to others • Hard to rename later • May be a few terms I can still clean up like “job” vs “exporter”

Slide 14

Slide 14 text

Challenges - Education • There are still many developers who are new to Prometheus, so need a lot of inline help • PromQL is new to many, so often need to use examples with SQL to help illustrate points • PromQL `sum` and `count` are similar to MySQL `GROUP BY` • This becomes very important for notifications • `up` vs `count(up)` vs `count(up) by (service)`
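The difference between `up`, `count(up)`, and `count(up) by (service)` can be sketched in Python, in the same spirit as the SQL comparison. The samples and helper names below are hypothetical. The SQL analogue of `count_by` would be roughly `SELECT service, COUNT(*) FROM up GROUP BY service`.

```python
from collections import Counter

# Hypothetical `up` samples: (labels, value), where 1 means the scrape succeeded.
up_samples = [
    ({"service": "api", "instance": "api-1:9100"}, 1),
    ({"service": "api", "instance": "api-2:9100"}, 0),
    ({"service": "db", "instance": "db-1:9100"}, 1),
]


def count_all(samples):
    """Like `count(up)`: a single number, with every label aggregated away."""
    return len(samples)


def count_by(samples, label):
    """Like `count(up) by (label)`: one number per distinct label value."""
    return dict(Counter(labels[label] for labels, _ in samples))


def sum_by(samples, label):
    """Like `sum(up) by (label)`: add up the values per distinct label value."""
    totals = {}
    for labels, value in samples:
        totals[labels[label]] = totals.get(labels[label], 0) + value
    return totals


total = count_all(up_samples)
per_service = count_by(up_samples, "service")
healthy_per_service = sum_by(up_samples, "service")
```

This matters for notifications because the labels that survive aggregation are the only ones available for routing: plain `up` produces one series per instance, `count(up)` produces a single series with no labels, and `count(up) by (service)` keeps the `service` label.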

Slide 15

Slide 15 text

Challenges - Day to Day tasks • OSS is not a full-time job, so have to balance triage of line/promgen tickets • Often it’s easier to track tasks internally instead of tracking everything on GitHub

Slide 16

Slide 16 text

Challenges - Timezones • OSS does not have time zone boundaries, so questions may come from anywhere in the world • Often there is an extra delay just due to time zones

Slide 17

Slide 17 text

Future Goals • Finish jQuery -> Vue.js migration • Better user management / controls • Better inline help messages • Finish cleaning up API

Slide 18

Slide 18 text

Prometheus at LINE DEVELOPER DAY 2020 https://linedevday.linecorp.com/jp/2020/ • Using Promgen to manage Prometheus • Alerts and Routing • Supporting Tenants • Scaling for Long Term Storage

Slide 19

Slide 19 text

LINE Fukuoka • https://linefukuoka.co.jp/ja/career