Prometheus at Google NYC Tech Talks Nov 2016

November 2016 Google NYC SRE tech talk

Cindy Sridharan

November 16, 2016

Transcript

  1. Monitoring at a startup
    imgix
    • 19 employees
    • 7 engineers
    Requirements
    • Ease of Use
    • Ease of Operation
    • Cost Effective
  2. Whitebox vs Blackbox
    Whitebox: monitoring derived from data gathered from the internals of systems
    Examples
    • Logs, request tracing
    • Metrics/Stats
    • Exception Tracking
    • APM, RUM, EUM
    Blackbox: helps detect when a problem is ongoing and contributing to real symptoms
    Examples
    • Polling
    • Uptime monitoring
  3. Four Golden Signals
    1) Latency
    2) Traffic
    3) Errors
    4) Saturation
    Proposed by the book Site Reliability Engineering
  4. What do we need?
    • Alerting
    • Debugging
    • Dashboarding
    • Predictive analysis
    • Anomaly detection
    • Analytics
  5. What is Prometheus?
    • Inspired by Google’s Borgmon
    • Whitebox monitoring system for metrics collection
    • Does one thing (metrics-based monitoring) and does it well
  6. Ease of Operation
    • Single node; no clustering
    • Clustering is ridiculously hard to get right
    • For HA, run two identical Prometheus servers
    • No dependencies (only SSD)
    • Single statically linked Go binary
  7. Ease of Use
    • Very easy for developers to instrument new applications
    • Just need to expose an HTTP endpoint for Prometheus to scrape
    • A Pushgateway is available for short-lived jobs
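To make "expose an HTTP endpoint" concrete, here is a minimal sketch using only the Python standard library rather than an official Prometheus client. The counter values, the `REQUESTS` dict, and the port are hypothetical; the output follows the Prometheus text exposition format in its simplest form (no HELP/TYPE lines, no label escaping).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters: (metric name, label pairs) -> value.
REQUESTS = {
    ("http_requests_total", (("method", "GET"), ("status", "200"))): 12,
    ("http_requests_total", (("method", "POST"), ("status", "500"))): 1,
}


def render_metrics(counters):
    """Render counters as Prometheus text exposition lines.

    Simplification: label values are not escaped and no HELP/TYPE
    comments are emitted.
    """
    lines = []
    for (name, labels), value in sorted(counters.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter values on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever serving /metrics (port is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape job at `/metrics` would be enough for a first experiment; a real application would normally use one of the official client libraries instead.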
  8. How does Prometheus work?
    1) Ingestion + Metrics Collection
    2) Processing + Storage
    3) Visualization
    4) Alerting
    5) Analysis
  9. Ingestion
    - Stateful clients help instrument applications directly
    - Clients pre-aggregate, which is much more efficient
    - Pushgateway for short-lived jobs
    - Exporters: JMX exporter, node exporter etc.
    - Fairly easy to write new exporters
      - https://github.com/imgix/s6_exporter
      - https://github.com/imgix/heka_exporter
  10. Advantages of Pull
    - Monitors whether a service is up as a part of gathering metrics
    - No need for the app to register with a CP K/V store like ZooKeeper, Consul etc.
    - With polling, a failed scrape tells you that something is down
    - Easier to configure/operate/scale
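The "failed scrape means down" point can be sketched as follows. Prometheus records a synthetic `up` series per target (1 on a successful scrape, 0 on failure); the `scrape` function and the `fetch` callable below are hypothetical stand-ins for the real scraper, and the parser ignores timestamps and comment metadata.

```python
def scrape(fetch):
    """Scrape one target.

    `fetch` is any callable that returns exposition text or raises on
    failure (hypothetical interface, for illustration only).
    """
    try:
        text = fetch()
    except Exception:
        # A failed scrape is itself a signal: the target is down.
        return {"up": 0.0}

    samples = {"up": 1.0}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumes "name{labels} value" with no trailing timestamp.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def healthy_target():
    return 'jobs_done_total 7\nqueue_depth{queue="render"} 2.5\n'


def dead_target():
    raise ConnectionError("connection refused")
```

With this shape, `scrape(dead_target)` yields only `{"up": 0.0}`, so an alert on `up == 0` needs no extra registration or heartbeat machinery.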
  11. Storage
    - Single node
    - No clustering
    - For HA, just run 2 identical Prometheus servers
  12. Indexing
    - LevelDB for indexing for PromQL queries
    - Indexes are used to fetch the time series required for evaluating a PromQL expression
    - All data required for a PromQL query evaluation needs to be in memory
  13. Storage
    - Custom storage layer: 1024-byte in-memory chunks synced to disk
    - Compression + batched writes to time series files on disk
    - 3 different types of chunk encoding:
      - delta
      - double delta
      - variable bit width encoding
    https://promcon.io/2016-berlin/talks/the-prometheus-time-series-database/
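The idea behind delta and double-delta encoding can be illustrated with a short sketch. This shows the arithmetic only, on a hypothetical list of integer timestamps; it does not reproduce Prometheus's actual chunk byte layout.

```python
def delta_encode(values):
    """Keep the first value, then store each difference to the previous one."""
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out


def delta_decode(encoded):
    """Invert delta_encode by running a cumulative sum."""
    out = []
    total = 0
    for i, d in enumerate(encoded):
        total = d if i == 0 else total + d
        out.append(total)
    return out


def double_delta_encode(values):
    """Keep the first value and first delta, then store deltas of deltas."""
    deltas = delta_encode(values)
    if len(deltas) < 2:
        return deltas
    return [deltas[0]] + delta_encode(deltas[1:])


def double_delta_decode(encoded):
    """Invert double_delta_encode."""
    if len(encoded) < 2:
        return list(encoded)
    return delta_decode([encoded[0]] + delta_decode(encoded[1:]))


# Hypothetical scrape timestamps, roughly 15 s apart.
timestamps = [1000, 1015, 1030, 1045, 1061]
```

For regularly scraped series the double deltas are mostly zero (here `[1000, 15, 0, 0, 1]`), which is why they compress well into few bits.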
  14. Storage
    - Requires about 3x the RAM needed by the memory chunks alone
    - If the number of chunks in memory exceeds the configured value by more than 10%, Prometheus will start throttling ingestion until the configured value is exceeded by only 5%
    - This is done by skipping scrapes
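The 10%/5% behaviour is a hysteresis loop: throttling switches on above one mark and off only once usage falls back to a lower one. The sketch below is an illustration of that logic under assumed names (`IngestionThrottle`, `should_skip_scrape`), not Prometheus's actual implementation.

```python
class IngestionThrottle:
    """Hysteresis around a configured in-memory chunk budget."""

    def __init__(self, configured_chunks, high_pct=10, low_pct=5):
        # Start throttling above budget + 10%, stop at budget + 5%.
        self.high_mark = configured_chunks * (100 + high_pct) / 100
        self.low_mark = configured_chunks * (100 + low_pct) / 100
        self.throttled = False

    def should_skip_scrape(self, chunks_in_memory):
        """Return True while ingestion should be throttled."""
        if self.throttled:
            if chunks_in_memory <= self.low_mark:
                self.throttled = False
        elif chunks_in_memory > self.high_mark:
            self.throttled = True
        return self.throttled
```

The gap between the two marks keeps the server from flapping between throttled and unthrottled when chunk usage hovers near the limit.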
  15. Storage
    - The more chunks in memory per time series, the better the batching of writes to disk
    - PromQL queries that need a large number of time series (a large number of chunks) make heavy use of the LevelDB cache
    - storage.local.memory-chunks: number of most recently used chunks kept in memory (default 1048576); configurable
  16. Visualization
    - PromDash
    - Grafana (data source for Prometheus is included since Grafana 2.5.0 (2015-10-28))
    - Arbitrary templates using Go’s text/template package
  17. Alerting
    • Configurable rules based on the Prometheus expression language
    • Alertmanager: a system that acts upon alerts
    • Alertmanager dedupes, routes alerts, makes decisions based on rules etc.
    • Integrates with Slack, HipChat, PagerDuty etc.
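Deduplication and routing can be sketched in a few lines. This is a conceptual illustration only: the alert dicts, the `routes` table, and the receiver names are hypothetical, and the real Alertmanager also groups, silences, and inhibits alerts.

```python
def fingerprint(labels):
    """Identify an alert by its full label set, independent of key order."""
    return tuple(sorted(labels.items()))


def dedupe(alerts):
    """Collapse alerts with identical label sets, keeping the first seen."""
    seen = {}
    for alert in alerts:
        seen.setdefault(fingerprint(alert), alert)
    return list(seen.values())


def route(alert, routes, default):
    """Return the receiver of the first route whose matchers all match."""
    for matchers, receiver in routes:
        if all(alert.get(k) == v for k, v in matchers.items()):
            return receiver
    return default


# Two identical Prometheus servers (the HA setup) fire the same alert.
incoming = [
    {"alertname": "InstanceDown", "instance": "web-1", "severity": "page"},
    {"severity": "page", "instance": "web-1", "alertname": "InstanceDown"},
    {"alertname": "DiskFilling", "instance": "db-1", "severity": "warn"},
]

# Hypothetical routing table: (label matchers, receiver name).
routes = [({"severity": "page"}, "pagerduty")]
```

Because both HA servers attach the same labels, the duplicate collapses into one notification before routing.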
  18. Analysis
    - PromQL
    - One of the greatest strengths of Prometheus
    - Non-SQL, Turing complete query language
    - Only for reads of time series in memory
  19. Multidimensional Data Model
    Data model: labels, not hierarchy
    Labels:
    • More flexible
    • More efficient
    • Dimensional queries
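A small sketch of why labels allow dimensional queries: every series is a set of key/value pairs (Prometheus stores the metric name itself under the `__name__` label), so any label can be used to filter or aggregate. The sample data and the helper names `select` and `sum_by` are hypothetical.

```python
# Each series is (labels, current value); values are made up.
SERIES = [
    ({"__name__": "api_http_requests_total", "method": "GET", "status": "200"}, 120.0),
    ({"__name__": "api_http_requests_total", "method": "GET", "status": "500"}, 3.0),
    ({"__name__": "api_http_requests_total", "method": "POST", "status": "200"}, 40.0),
    ({"__name__": "process_open_fds", "instance": "web-1"}, 17.0),
]


def select(series, matchers):
    """Return series whose labels equal every given matcher."""
    return [
        (labels, value)
        for labels, value in series
        if all(labels.get(k) == v for k, v in matchers.items())
    ]


def sum_by(samples, label):
    """Aggregate values grouped by one label, like `sum by (label)`."""
    totals = {}
    for labels, value in samples:
        key = labels.get(label, "")
        totals[key] = totals.get(key, 0.0) + value
    return totals
```

With a hierarchical name such as `api.get.200.requests`, the same slicing would require knowing the position of each component in advance; with labels, any dimension can be picked at query time.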
  20. PromQL vs SQL
    PromQL: rate(api_http_requests_total[5m])
    SQL: SELECT job, instance, method, status, path, rate(value, 5m) FROM api_http_requests_total
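For intuition about what `rate(...[5m])` computes, here is a deliberately simplified per-second rate over a time window. It assumes samples are sorted by time, and it ignores counter resets and the extrapolation to window boundaries that the real PromQL `rate()` performs; `simple_rate` is a hypothetical helper.

```python
def simple_rate(samples, window_seconds, now):
    """Per-second increase between the first and last sample in the window.

    `samples` is a time-ordered list of (timestamp_seconds, counter_value).
    Returns None when fewer than two samples fall inside the window.
    """
    in_window = [
        (t, v) for t, v in samples if now - window_seconds <= t <= now
    ]
    if len(in_window) < 2:
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter scraped every 60 s, growing by 60 per scrape.
samples = [(0, 100), (60, 160), (120, 220), (180, 280), (240, 340), (300, 400)]
```

In PromQL the same calculation is applied to every series matching the selector, and all labels are carried through, which is what the SQL form has to spell out column by column.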
  21. Conclusion
    - Cost effectiveness is paramount
    - Followed by ease of operation and use, especially when developers do “operations”
    - Prometheus optimizes for both