Prometheus at Google NYC Tech Talks Nov 2016

November 2016 Google NYC SRE tech talk

Cindy Sridharan

November 16, 2016

Transcript

  1. Prometheus Monitoring System Cindy Sridharan 16th November, 2016 @copyconstruct

  2. $whoami Developer - Go, Python Work - imgix

  3. imgix

  4. Monitoring at a startup
    imgix • 19 employees • 7 engineers
    Requirements • Ease of Use • Ease of Operation • Cost Effective
  5. Why monitoring? • Alerting • Debugging • Dashboarding • Predictive analysis • Anomaly detection • Analytics
  6. Monitoring Landscape

  7. Monitoring Landscape

  8. Whitebox vs Blackbox
    Whitebox: monitoring derived from data gathered from the internals of systems
    Examples • Logs, request tracing • Metrics/Stats • Exception Tracking • APM, RUM, EUM
    Blackbox: helps detect when a problem is ongoing and contributing to real symptoms
    Examples • Polling • Uptime monitoring
  9. What to monitor?

  10. Four Golden Signals 1) Latency 2) Traffic 3) Errors 4) Saturation
    Proposed by the book Site Reliability Engineering
  11. What we need? • Alerting • Debugging • Dashboarding • Predictive analysis • Anomaly detection • Analytics
  12. What is Prometheus?
    • Based on Google’s Borgmon
    • Whitebox monitoring system for metrics collection
    • Does one thing (metrics based monitoring) and does it well
  13. Why Prometheus? ✅ It’s not Nagios ✅ Ease of Operation ✅ Ease of Use ✅ Cost effective
  14. Ease of Operation
    • Single node; no clustering
    • Clustering is ridiculously hard to get right
    • For HA, run two identical Prometheus servers
    • No dependencies (only SSD)
    • Single statically linked Go binary
  15. Ease of Use
    • Very easy for developers to instrument new applications
    • Just need to expose an HTTP endpoint for Prometheus to scrape
    • A “push-gateway” available for short-lived jobs
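
To make the "expose an HTTP endpoint" point concrete, here is a minimal standard-library sketch of an application serving a counter in the Prometheus text exposition format. It is not the official client library; `render_metrics`, `MetricsHandler`, `REQUESTS`, and the port are names assumed for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format.

    `counters` maps a metric name to (help text, value). The helper and
    its input shape are hypothetical, chosen for illustration only.
    """
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical application state: one counter the app would increment.
REQUESTS = {"api_http_requests_total": ("Total HTTP requests handled.", 0)}


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter values on /metrics for a scraper."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever serving /metrics; the port is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape target at `host:8000` would be enough for a sketch like this; a real application would normally use one of the official client libraries instead.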
  16. Cost Effective

  17. How does Prometheus work? 1) Ingestion + Metrics Collection 2) Processing + Storage 3) Visualization 4) Alerting 5) Analysis
  18. Ingestion
    • Targets discovered via service discovery
    • SkyDNS, Consul, SRV records etc.
  19. Ingestion
    Stateful clients help instrument applications directly
    Clients pre-aggregate: much more efficient
    Pushgateway for short-lived jobs
    Exporters: JMX exporter, node exporter etc.
    Fairly easy to write new exporters
    - https://github.com/imgix/s6_exporter
    - https://github.com/imgix/heka_exporter
  20. Pull Model

  21. Does it Scale?

  22. Pull Model
    - Pull over HTTP
    - Does Pull Scale? https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/
  23. Advantages of Pull
    - Monitors whether a service is up as a part of gathering metrics
    - No need for the app to register with a CP K/V store like ZooKeeper, Consul etc.
    - With polling, a failed scrape shows that something is down
    - Easier to configure/operate/scale
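
The "failed scrape means down" idea can be sketched as follows. This is a simplified illustration, not how the Prometheus server is implemented: `scrape` and the injected `fetch` callable are hypothetical, and the parser ignores labels with spaces and optional timestamps.

```python
def scrape(target, fetch):
    """Scrape one target and return its samples plus an `up` sample.

    `fetch` is a hypothetical callable taking a URL and returning the
    response body as text; it is expected to raise OSError on failure.
    """
    try:
        body = fetch(f"http://{target}/metrics")
    except OSError:
        # The failed scrape is itself the signal that the target is down.
        return {"up": 0.0}

    samples = {"up": 1.0}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples
```

Because health falls out of the scrape itself, the application does not need to announce its own liveness anywhere.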
  24. When not to Pull?

  25. Storage

  26. Storage
    - Single node
    - No clustering
    - For HA just run 2 identical Prometheus servers
  27. 2 Prometheus for HA

  28. Storage internals

  29. Indexing
    - LevelDB for indexing for PromQL queries
    - Indexes are used to fetch the time series required for evaluating a PromQL expression
    - All data required for a PromQL query evaluation needs to be in memory
  30. Storage
    - Custom storage layer: 1024-byte in-memory chunks synced to disk
    - Compression + batched writes to time series files on disk
    - 3 different types of chunk encoding:
      - delta
      - double delta
      - variable bit width encoding
    https://promcon.io/2016-berlin/talks/the-prometheus-time-series-database/
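
The idea behind delta and double-delta encoding can be shown on a list of integer timestamps. This is only a sketch of the general technique, not Prometheus's actual chunk format (which also packs values into fixed or variable bit widths); the function names are assumptions.

```python
def delta_encode(values):
    """Keep the first value, then store each difference to the previous one."""
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out


def double_delta_encode(values):
    """Keep the first value and first delta, then store changes in the delta."""
    if len(values) < 2:
        return list(values)
    deltas = delta_encode(values)  # [first, d1, d2, ...]
    changes = [b - a for a, b in zip(deltas[1:], deltas[2:])]
    return [deltas[0], deltas[1]] + changes


# Scrape timestamps at a nearly regular 15-second interval.
timestamps = [1000, 1015, 1030, 1045, 1061]
print(delta_encode(timestamps))         # [1000, 15, 15, 15, 16]
print(double_delta_encode(timestamps))  # [1000, 15, 0, 0, 1]
```

Regularly spaced samples turn into runs of small numbers, mostly zeros under double delta, which is what makes them cheap to store.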
  31. Storage

  32. Storage
    - Needs at least 3X the RAM required by the memory chunks alone
    - If the number of chunks in memory exceeds the configured value by more than 10%, Prometheus will start throttling ingestion until the value is exceeded by less than 5%
    - This is done by skipping scrapes
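
The sizing rule and the throttling thresholds on this slide work out as below. The sketch assumes 1024-byte chunks and reads the thresholds as "start throttling above 110% of the configured chunk count, stop once below 105%"; both function names are hypothetical and this is not Prometheus's own code.

```python
def rule_of_thumb_ram_bytes(memory_chunks, chunk_bytes=1024):
    """RAM suggested by the 3X rule: three times the raw chunk memory."""
    return 3 * memory_chunks * chunk_bytes


def should_throttle(chunks_in_memory, configured, currently_throttled):
    """Hysteresis: start above +10% of `configured`, stop below +5%.

    Integer arithmetic avoids floating point surprises at the boundaries.
    """
    if currently_throttled:
        return chunks_in_memory * 100 >= configured * 105
    return chunks_in_memory * 100 > configured * 110


# With 1048576 chunks of 1024 bytes each, the chunks alone take 1 GiB,
# so the rule of thumb suggests about 3 GiB of RAM.
print(rule_of_thumb_ram_bytes(1048576) / 2**30)  # 3.0
```

The gap between the two thresholds keeps ingestion from flapping on and off when the chunk count hovers near the limit.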
  33. Storage
    - The more chunks kept in memory per time series, the better the batching of writes to disk
    - PromQL queries that need a large number of time series (a large number of chunks) make heavy use of the LevelDB cache
    - storage.local.memory-chunks: number of most recently used chunks kept in memory (default 1048576); configurable
  34. Visualization

  35. Visualization

  36. Visualization
    - PromDash
    - Grafana (data source for Prometheus is included since Grafana 2.5.0 (2015-10-28))
    - Arbitrary templates using Go’s text/template package
  37. Alerting

  38. Alerting
    • Configurable rules based on the Prometheus expression language
    • Alertmanager: a system that acts upon alerts
    • Alertmanager dedupes, routes alerts, makes decisions based on rules etc.
    • Integrates with Slack, HipChat, PagerDuty etc.
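
Deduplication matters in particular when two identical Prometheus servers run for HA, since both fire the same alerts. The sketch below shows the general idea of deduping by label set and then grouping; it is not Alertmanager's implementation, and `fingerprint`, `dedupe`, and `group_by` are names assumed for illustration.

```python
def fingerprint(labels):
    """Identify an alert by its full label set, independent of key order."""
    return tuple(sorted(labels.items()))


def dedupe(alerts):
    """Drop alerts whose label set has already been seen, keeping order."""
    seen = set()
    unique = []
    for labels in alerts:
        key = fingerprint(labels)
        if key not in seen:
            seen.add(key)
            unique.append(labels)
    return unique


def group_by(alerts, label):
    """Bucket alerts by one label so each bucket can become one notification."""
    groups = {}
    for labels in alerts:
        groups.setdefault(labels.get(label, ""), []).append(labels)
    return groups
```

For example, the same `HighLatency` alert arriving from both servers of an HA pair collapses to a single entry before it is grouped and sent on.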
  39. PromQL Alerting

  40. PromQL Alerting

  41. PromQL Alerting

  42. Analysis with PromQL

  43. Analysis
    - PromQL
    - One of the greatest strengths of Prometheus
    - Non-SQL, Turing complete query language
    - Only for reads of time series in memory
  44. Multidimensional Data Model
    Data model: labels, not hierarchy
    Labels • More flexible • More efficient • Dimensional queries
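
A small sketch of what "labels, not hierarchy" buys: any label can be used to filter or aggregate without the metric name encoding a fixed path. The sample data, `select`, and `sum_by` are hypothetical and only mimic the idea behind label matchers and `sum by (...)`.

```python
# Each series is a label set plus its current value (made-up numbers).
SERIES = [
    ({"__name__": "api_http_requests_total", "method": "GET", "status": "200"}, 120.0),
    ({"__name__": "api_http_requests_total", "method": "GET", "status": "500"}, 3.0),
    ({"__name__": "api_http_requests_total", "method": "POST", "status": "200"}, 40.0),
]


def select(series, matchers):
    """Keep series whose labels equal every name/value pair in `matchers`."""
    return [
        (labels, value)
        for labels, value in series
        if all(labels.get(name) == want for name, want in matchers.items())
    ]


def sum_by(series, label):
    """Aggregate values along one dimension, like a sum grouped by a label."""
    totals = {}
    for labels, value in series:
        key = labels.get(label, "")
        totals[key] = totals.get(key, 0.0) + value
    return totals


print(sum_by(SERIES, "status"))  # {'200': 160.0, '500': 3.0}
print(sum_by(select(SERIES, {"status": "200"}), "method"))  # {'GET': 120.0, 'POST': 40.0}
```

With a hierarchical name such as `api.GET.200.requests`, the same two questions would need two differently shaped wildcard queries.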
  45. PromQL
    From Julius Volz’s PromCon presentation https://promcon.io/2016-berlin/talks/prometheus-design-and-philosophy/
    rate(api_http_requests_total[5m])
  46. PromQL vs SQL
    PromQL: rate(api_http_requests_total[5m])
    SQL: SELECT job, instance, method, status, path, rate(value, 5m) FROM api_http_requests_total
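
For readers unfamiliar with `rate()`, the sketch below computes a per-second rate of increase for one counter series over a window, treating any decrease as a counter reset. It is deliberately simplified: the real PromQL function also extrapolates to the window boundaries, and `simple_rate` with its arguments is a hypothetical helper.

```python
def simple_rate(samples, window_seconds, now):
    """Per-second increase of a counter over the last `window_seconds`.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by time
    with distinct timestamps. Returns None with fewer than two samples.
    """
    in_window = [(t, v) for t, v in samples if now - window_seconds <= t <= now]
    if len(in_window) < 2:
        return None

    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as increase since the reset.
        increase += cur - prev if cur >= prev else cur

    elapsed = in_window[-1][0] - in_window[0][0]
    return increase / elapsed


# One request per second for a minute gives a rate of 1.0.
print(simple_rate([(0, 0), (60, 60)], 300, 60))  # 1.0
```

In PromQL the same computation runs for every series matching the selector, which is why the single expression on the slide stands in for the longer SQL form.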
  47. Conclusion
    - Cost effectiveness is paramount
    - Followed by ease of operation and use, especially when developers do “operations”
    - Prometheus optimizes for both
  48. Resources • https://prometheus.io/blog/ • https://promcon.io/2016-berlin/schedule/ • https://prometheus.io/docs/introduction/overview/

  49. None
  50. Thank You! cindysridharan@gmail.com Twitter - @copyconstruct