Prometheus at Google NYC Tech Talks Nov 2016

Prometheus Monitoring System Cindy Sridharan 16th November, 2016 @copyconstruct

$whoami Developer - Go, Python Work - imgix

Monitoring at a startup
imgix
• 19 employees
• 7 engineers
Requirements
• Ease of Use
• Ease of Operation
• Cost Effective

Why monitoring?
• Alerting
• Debugging
• Dashboarding
• Predictive analysis
• Anomaly detection
• Analytics

Monitoring Landscape

Whitebox vs Blackbox
Whitebox
Monitoring derived from data gathered from the internals of systems
Examples
• Logs, request tracing
• Metrics/Stats
• Exception Tracking
• APM, RUM, EUM
Blackbox
Helps detect when a problem is ongoing and contributing to real symptoms
Examples
• Polling
• Uptime monitoring

What to monitor?

Four Golden Signals
1) Latency
2) Traffic
3) Errors
4) Saturation
Proposed by the book Site Reliability Engineering

What we need?
• Alerting
• Debugging
• Dashboarding
• Predictive analysis
• Anomaly detection
• Analytics

What is Prometheus?
• Based on Google’s Borgmon
• Whitebox monitoring system for metrics collection
• Does one thing (metrics-based monitoring) and does it well

Why Prometheus?
✅ It’s not Nagios
✅ Ease of Operation
✅ Ease of Use
✅ Cost effective

Ease of Operation
• Single node; no clustering
• Clustering is ridiculously hard to get right
• For HA, run two identical Prometheus servers
• No dependencies (only SSD)
• Single statically linked Go binary

Ease of Use
• Very easy for developers to instrument new applications
• Just need to expose an HTTP endpoint for Prometheus to scrape
• A Pushgateway is available for short-lived jobs
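To make "expose an HTTP endpoint" concrete, here is a minimal sketch using only the Python standard library. It is not the official client library; the counter state, the `/metrics` path handling and port 8000 are assumptions for illustration. It renders a counter in the Prometheus text exposition format and serves it over HTTP.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter state; a real client library manages this for you.
REQUESTS_TOTAL = {("GET", "200"): 0, ("POST", "500"): 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP api_http_requests_total Total HTTP requests handled.",
        "# TYPE api_http_requests_total counter",
    ]
    for (method, status), value in sorted(counters.items()):
        lines.append(
            'api_http_requests_total{method="%s",status="%s"} %d'
            % (method, status, value)
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS_TOTAL).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever serving /metrics; port 8000 is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; Prometheus would then be pointed at `host:8000/metrics` as a scrape target.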

Cost Effective

How does Prometheus work?
1) Ingestion + Metrics Collection
2) Processing + Storage
3) Visualization
4) Alerting
5) Analysis

Ingestion
• Targets discovered via service discovery
• SkyDNS, Consul, SRV records etc.

Ingestion
- Stateful clients help instrument applications directly
- Clients pre-aggregate, which is much more efficient
- Pushgateway for short-lived jobs
- Exporters: JMX exporter, node exporter etc.
- Fairly easy to write new exporters
  - https://github.com/imgix/s6_exporter
  - https://github.com/imgix/heka_exporter
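A rough sketch of what "stateful clients pre-aggregate" means: the client keeps running totals in process memory, so a thousand increments cost one number per label set at scrape time rather than a thousand events on the wire. The `Counter` class below is a hypothetical stand-in, not the API of any Prometheus client library.

```python
import threading


class Counter:
    """A minimal stateful counter: increments are aggregated in process memory."""

    def __init__(self, name):
        self.name = name
        self._values = {}
        self._lock = threading.Lock()

    def inc(self, amount=1, **labels):
        if amount < 0:
            raise ValueError("counters can only go up")
        key = tuple(sorted(labels.items()))
        with self._lock:
            self._values[key] = self._values.get(key, 0) + amount

    def collect(self):
        """Return a snapshot of (labels dict, value) pairs; called once per scrape."""
        with self._lock:
            return [(dict(key), value) for key, value in sorted(self._values.items())]


requests_total = Counter("api_http_requests_total")
for _ in range(1000):
    requests_total.inc(method="GET", status="200")
requests_total.inc(method="POST", status="500")
```

`requests_total.collect()` returns only two entries here, one per distinct label set, however many requests were counted.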

Pull Model

Does it Scale?

Pull Model
- Pull over HTTP
- Does Pull Scale? https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/

Advantages of Pull
- Monitors whether a service is up as part of gathering metrics
- No need for the app to register with a CP K/V store like ZooKeeper, Consul etc.
- With polling, a failed scrape tells you something is down
- Easier to configure/operate/scale
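The first and third points can be sketched as follows: the scraper records an `up` sample with every scrape, 1 on success and 0 on failure, so liveness falls out of metrics collection for free. The `fetch` callable and the target names are hypothetical stand-ins for an HTTP GET of a target's metrics endpoint.

```python
def scrape(target, fetch):
    """Scrape one target and return a list of (metric_name, labels, value) samples.

    `fetch` is a hypothetical callable that takes a target and returns
    {metric_name: value}, raising OSError on connection failure.
    """
    labels = {"instance": target}
    try:
        metrics = fetch(target)
    except OSError:
        # Failed scrape: no application metrics, but the failure itself is recorded.
        return [("up", labels, 0)]
    samples = [(name, labels, value) for name, value in sorted(metrics.items())]
    samples.append(("up", labels, 1))
    return samples


def fake_fetch(target):
    """Stand-in for an HTTP GET of the target's /metrics endpoint."""
    if target == "app-2:8000":
        raise ConnectionRefusedError("connection refused")
    return {"api_http_requests_total": 42}


healthy = scrape("app-1:8000", fake_fetch)
down = scrape("app-2:8000", fake_fetch)
```

The unreachable target still yields a sample, `up` with value 0, which is what an alert on instance availability would query.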

When not to Pull?

Storage

Storage
- Single node
- No clustering
- For HA, just run 2 identical Prometheus servers

2 Prometheus for HA

Storage internals

Indexing
- LevelDB for indexing for PromQL queries
- Indexes are used to fetch the time series required for evaluating a PromQL expression
- All data required for a PromQL query evaluation needs to be in memory

Storage
- Custom storage layer: 1024-byte in-memory chunks synced to disk
- Compression + batched writes to time series files on disk
- 3 different types of chunk encoding:
  - delta
  - double delta
  - variable bit width encoding
https://promcon.io/2016-berlin/talks/the-prometheus-time-series-database/
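To illustrate why delta and double-delta encoding compress time series well, here is a small sketch of the idea on scrape timestamps. It works on plain integer lists and is not the actual Prometheus chunk format, which packs values into fixed-size chunks with its own byte layout.

```python
def delta_encode(values):
    """Keep the first value, then store each value as a difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded):
    """Invert delta_encode by taking a running sum."""
    decoded = []
    total = 0
    for i, item in enumerate(encoded):
        total = item if i == 0 else total + item
        decoded.append(total)
    return decoded


def double_delta_encode(values):
    """Delta-encode the deltas: regular scrape intervals collapse to runs of zeros."""
    deltas = delta_encode(values)
    if len(deltas) < 2:
        return deltas
    return [deltas[0]] + delta_encode(deltas[1:])


def double_delta_decode(encoded):
    """Invert double_delta_encode."""
    if len(encoded) < 2:
        return list(encoded)
    return delta_decode([encoded[0]] + delta_decode(encoded[1:]))


# Timestamps from a 15s scrape interval, with the last scrape one second late.
timestamps = [1479300000, 1479300015, 1479300030, 1479300045, 1479300061]
```

For these timestamps, `delta_encode` gives `[1479300000, 15, 15, 15, 16]` and `double_delta_encode` gives `[1479300000, 15, 0, 0, 1]`; the small residuals are what make narrow bit widths possible.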

Storage

Storage
- Rule of thumb: provision at least 3x the RAM needed by the memory chunks alone
- If the number of chunks in memory exceeds the configured value by more than 10%, Prometheus starts throttling ingestion until the value is exceeded by less than 5%
- This is done by skipping scrapes
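The throttling rule is a hysteresis loop, sketched below. This reads the slide as referring to the number of chunks held in memory relative to the configured maximum; the class and method names are hypothetical and this is not Prometheus's own code.

```python
class IngestionThrottle:
    """Hysteresis: start throttling above +10%, stop once back under +5%."""

    def __init__(self, max_memory_chunks):
        # Integer arithmetic avoids floating-point edge cases at the thresholds.
        self.start_threshold = max_memory_chunks * 110 // 100
        self.stop_threshold = max_memory_chunks * 105 // 100
        self.throttled = False

    def should_skip_scrape(self, chunks_in_memory):
        if chunks_in_memory > self.start_threshold:
            self.throttled = True
        elif chunks_in_memory < self.stop_threshold:
            self.throttled = False
        # Between the two thresholds the previous state is kept.
        return self.throttled


throttle = IngestionThrottle(max_memory_chunks=1000)
decisions = [
    throttle.should_skip_scrape(n) for n in (1000, 1080, 1101, 1080, 1049, 1080)
]
```

The same reading of 1080 chunks produces a different decision depending on whether throttling was already active, which keeps the server from flapping around a single threshold.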

Storage
- The more chunks kept in memory per time series, the better the batching of writes to disk
- PromQL queries that need a large number of time series (and so a large number of chunks) make heavy use of the LevelDB cache
- storage.local.memory-chunks: number of most recently used chunks kept in memory (default 1048576); configurable
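A back-of-the-envelope sizing sketch, combining three numbers from the storage slides: 1024-byte chunks, the default of 1048576 for `storage.local.memory-chunks` (read here as a count of chunks), and the 3x RAM rule of thumb. The function name is hypothetical.

```python
CHUNK_SIZE_BYTES = 1024          # chunk size quoted on the storage slide
DEFAULT_MEMORY_CHUNKS = 1048576  # default of storage.local.memory-chunks
RAM_HEADROOM_FACTOR = 3          # "3x RAM" rule of thumb from the slides


def estimate_ram_bytes(memory_chunks=DEFAULT_MEMORY_CHUNKS):
    """Return (bytes held by chunks alone, suggested total RAM in bytes)."""
    chunk_bytes = memory_chunks * CHUNK_SIZE_BYTES
    return chunk_bytes, chunk_bytes * RAM_HEADROOM_FACTOR


chunk_bytes, suggested = estimate_ram_bytes()
print(chunk_bytes / 2**30, "GiB for chunks;", suggested / 2**30, "GiB suggested")
```

With the defaults, the chunks alone take 1 GiB, so the rule of thumb suggests about 3 GiB of RAM.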

Visualization

Visualization
- PromDash
- Grafana (a Prometheus data source is included since Grafana 2.5.0 (2015-10-28))
- Arbitrary templates using Go’s text/template package

Alerting

Alerting
• Configurable rules based on the Prometheus expression language
• Alertmanager: a system that acts upon alerts
• Alertmanager dedupes, routes alerts, makes decisions based on rules etc.
• Integrates with Slack, HipChat, PagerDuty etc.
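A toy sketch of "dedupes, routes alerts": two identical Prometheus servers fire the same alert, the duplicates are collapsed by label set, and each remaining alert goes to a receiver chosen by label matching. The data shapes, function names and receiver names are assumptions for illustration and do not reflect how Alertmanager is implemented or configured.

```python
def fingerprint(labels):
    """Identify an alert by its full label set (order-independent)."""
    return tuple(sorted(labels.items()))


def dedupe(alerts):
    """Keep the first alert seen for each distinct label set."""
    seen = {}
    for alert in alerts:
        seen.setdefault(fingerprint(alert["labels"]), alert)
    return list(seen.values())


def route(alert, routes, default_receiver):
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for matchers, receiver in routes:
        if all(alert["labels"].get(k) == v for k, v in matchers.items()):
            return receiver
    return default_receiver


# Two identical Prometheus servers (the HA setup) fire the same alert.
alerts = [
    {"labels": {"alertname": "HighErrorRate", "severity": "page"}, "source": "prom-a"},
    {"labels": {"severity": "page", "alertname": "HighErrorRate"}, "source": "prom-b"},
    {"labels": {"alertname": "DiskFilling", "severity": "warn"}, "source": "prom-a"},
]
routes = [({"severity": "page"}, "pagerduty")]
notifications = [
    (a["labels"]["alertname"], route(a, routes, "slack")) for a in dedupe(alerts)
]
```

Three incoming alerts become two notifications: the paging alert goes to the hypothetical `pagerduty` receiver once, and the warning falls through to the default.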

PromQL Alerting

Analysis with PromQL

Analysis
- PromQL
- One of the greatest strengths of Prometheus
- Non-SQL, Turing-complete query language
- Read-only: queries operate on time series loaded into memory

Multidimensional Data Model
Data model: labels, not hierarchy
Labels:
• More flexible
• More efficient
• Dimensional queries
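A small sketch of what "labels, not hierarchy" buys you: any label can be used to filter or to aggregate, with no fixed path order as in a dotted metric name. The series data and helper names are made up for illustration; `__name__` is the label under which Prometheus stores the metric name.

```python
SERIES = [
    ({"__name__": "api_http_requests_total", "method": "GET", "status": "200"}, 120),
    ({"__name__": "api_http_requests_total", "method": "GET", "status": "500"}, 3),
    ({"__name__": "api_http_requests_total", "method": "POST", "status": "200"}, 40),
]


def select(series, **matchers):
    """Return series whose labels equal every given matcher (equality matching only)."""
    return [
        (labels, value)
        for labels, value in series
        if all(labels.get(k) == v for k, v in matchers.items())
    ]


def sum_by(series, label):
    """Aggregate values by one label, in the spirit of PromQL's `sum by (label)`."""
    totals = {}
    for labels, value in series:
        key = labels.get(label, "")
        totals[key] = totals.get(key, 0) + value
    return totals


get_requests = select(SERIES, method="GET")
by_status = sum_by(SERIES, "status")
```

These correspond roughly to the PromQL expressions `api_http_requests_total{method="GET"}` and `sum by (status) (api_http_requests_total)`.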

PromQL
From Julius Volz’s PromCon presentation
https://promcon.io/2016-berlin/talks/prometheus-design-and-philosophy/

PromQL
rate(api_http_requests_total[5m])

PromQL vs SQL
PromQL
rate(api_http_requests_total[5m])
SQL
SELECT job, instance, method, status, path, rate(value, 5m) FROM api_http_requests_total
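For intuition about what `rate(...[5m])` computes for a single series, here is a simplified sketch: the per-second increase of a counter over a trailing window, treating any drop in value as a counter reset. The real `rate()` also extrapolates towards the window boundaries, which this version leaves out; the function name and sample data are made up.

```python
def simple_rate(samples, window_seconds, now):
    """Per-second increase of a counter over the trailing window.

    `samples` is a list of (timestamp, value) pairs sorted by timestamp.
    Counter resets (a value lower than its predecessor) are treated as
    restarts from zero. Returns None with fewer than two samples in the window.
    """
    window = [(t, v) for t, v in samples if now - window_seconds <= t <= now]
    if len(window) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(window, window[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = window[-1][0] - window[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# The counter restarts between t=120 and t=180 (220 -> 10).
samples = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70)]
per_second = simple_rate(samples, window_seconds=300, now=240)
```

Here the total increase is 60 + 60 + 10 + 60 = 190 over 240 seconds, so the reset does not show up as a negative rate.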

Conclusion
- Cost effectiveness is paramount
- Followed by ease of operation and use, especially when developers do “operations”
- Prometheus optimizes for both

Resources
• https://prometheus.io/blog/
• https://promcon.io/2016-berlin/schedule/
• https://prometheus.io/docs/introduction/overview/

Thank You! [email protected] Twitter - @copyconstruct
