Monitoring with Prometheus

Slide 1

Slide 1 text

Monitoring with Prometheus
Munich, 09 May 2016

Slide 2

Slide 2 text

Contents

Part 1 (Slides)
•  Monitoring Categories:
   –  Traditional functional monitoring vs. time series
   –  Push vs. pull
•  Best practice: Why is Nagios so successful?
•  Prometheus Overview
•  Prometheus: Four Top Features

Part 2 (Demo)
•  Instrumentation / Exporting Metrics
   –  OS Metrics
   –  Application Metrics (Java)
   –  Docker
•  Querying
•  Grafana Dashboard

09 May 2016 www.consol.de

Slide 4

Slide 4 text

Approaches to Monitoring (I)

Traditional functional monitoring
•  Idea: An alert goes off when the website goes down.
•  Example: Traditional Nagios
•  Focus: Static infrastructure

Time series
•  Idea: Generate graphs instead of alerts
•  Example: Prometheus et al.
•  Nice for dynamic infrastructure

Slide 6

Slide 6 text

Approaches to Monitoring (II)

Push
•  Example: ELK stack
•  The server does not poll, but passively accepts data
•  Adding new hosts is effortless
•  More efficient for traditional monitoring (where each pull triggers a check and waits for the check to be executed)

Pull
•  Example: Prometheus
•  High availability made easy
•  Functional sharding made easy
•  Server cannot be overwhelmed
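The pull model can be sketched with the Python standard library alone. This is a minimal illustration, not how Prometheus itself is implemented: the metric name `demo_requests_total`, the value 42, and all function names are assumptions made for the example. The target only exposes its current values on `/metrics`; the monitoring side decides when to fetch them.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process state of a monitored service.
REQUESTS_SERVED = 42


def render_metrics():
    """Target side: current values as plain text, one sample per line."""
    return f"demo_requests_total {REQUESTS_SERVED}\n"


def parse_samples(text):
    """Server side: turn scraped text into a {name: value} mapping."""
    samples = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            samples[name] = float(value)
    return samples


class MetricsHandler(BaseHTTPRequestHandler):
    """The target never contacts the monitoring server; it only answers GETs."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def scrape(url):
    """One pull: fetch the target's metrics page and parse it."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_samples(response.read().decode())


def demo():
    """Start a target on a free local port and scrape it once."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return scrape(f"http://127.0.0.1:{port}/metrics")
    finally:
        server.shutdown()
        server.server_close()
```

Because the scraper controls the schedule, a second monitoring server can scrape the same target independently, which is the intuition behind "high availability made easy".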

Slide 8

Slide 8 text

Why is Nagios (and its derivatives) so Successful?

•  Its strength is that it's really just a scheduler and event handler, and all the complicated check work is done by a highly modular and portable system of plugins.
•  Easy to write checks, as checks are stand-alone programs
•  Thriving ecosystem of checks (monitoring plugins)
•  Ubiquitous: everyone in the industry has some experience with Nagios configs.

Prometheus adopts the modular approach and takes it even further, as GUI and alerting are also independent tools. It is easy for applications to integrate with Prometheus!
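To illustrate why stand-alone checks are easy to write, here is a sketch of a disk usage check following the common Nagios plugin convention: one line of output, with the exit code carrying the status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The thresholds, the message wording, and the function names are assumptions chosen for the example.

```python
import shutil

# Exit codes by Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(used_fraction, warn=0.8, crit=0.9):
    """Map a usage fraction (0.0 to 1.0) to an exit code and a status line."""
    percent = used_fraction * 100
    if used_fraction >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent:.0f}% used"
    if used_fraction >= warn:
        return WARNING, f"DISK WARNING - {percent:.0f}% used"
    return OK, f"DISK OK - {percent:.0f}% used"


def main(path="/"):
    """Run the check once; the return value is meant to be the process exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    status, message = evaluate_disk(usage.used / usage.total)
    print(message)
    return status
```

Saved as a script, it would end with `sys.exit(main())` so that the scheduler can read the status from the exit code.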

Slide 10

Slide 10 text

Prometheus Overview

•  Prometheus is an open source project.
•  Started at SoundCloud, open-sourced in January 2015.
•  The goal at SoundCloud was to replace StatsD and Graphite.
•  Prometheus is inspired by Google's Borgmon monitoring system (the main developers at SoundCloud are ex-Googlers).
•  It is another time-series database, similar to InfluxDB, that adds alerting functionality. It includes its own poller, but can also receive pushed data (via the Pushgateway), as Graphite and InfluxDB do.
•  This new generation of tools is highly modular and composable.

Slide 11

Slide 11 text

Prometheus Top Four Features

•  Multi-dimensional data model
   –  Like OpenTSDB (Prometheus uses the same format for metric notation as OpenTSDB)
•  Operational simplicity
   –  Simple Go executable (unlike OpenTSDB, which depends on specific versions of Hadoop and HBase)
•  Scalability
   –  A single monitoring server can handle thousands of targets, hundreds of thousands of samples per second, and millions of time series.
   –  Running many servers for HA and sharding is easy.
•  Powerful query language
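The multi-dimensional data model can be pictured as follows: a time series is identified by its metric name plus its full set of label/value pairs, and queries select series by matching on labels. The class below is a toy in-memory sketch under that assumption; `TinyTSDB` and its methods are hypothetical names, not part of Prometheus.

```python
from collections import defaultdict


def series_key(name, labels):
    """A series is identified by metric name plus its complete label set."""
    return (name, frozenset(labels.items()))


class TinyTSDB:
    """Toy store: one list of (timestamp, value) samples per series."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, name, labels, timestamp, value):
        self._series[series_key(name, labels)].append((timestamp, value))

    def select(self, name, **matchers):
        """Latest value of every series with this name whose labels
        contain all of the given label=value pairs."""
        result = {}
        for (series_name, label_set), samples in self._series.items():
            labels = dict(label_set)
            if series_name != name:
                continue
            if all(labels.get(k) == v for k, v in matchers.items()):
                result[label_set] = samples[-1][1]
        return result
```

Changing any label value yields a different series, which is why one metric name can cover every device on every instance.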

Slide 14

Slide 14 text

Prometheus Ecosystem

Slide 15

Slide 15 text

Exporter vs. Instrumentation

Two ways of monitoring services:
•  Instrument the service with one of the client libraries (DIY)
   –  Official: Go, Java, Ruby, Python
   –  Unofficial: .NET/C#, Node.js, Bash
•  Exporters
   –  If you cannot or do not want to patch your services (for example, for basic Linux kernel metrics), you can use an exporter as "middleware" between your service and Prometheus.
   –  Many exporters are available (see after the demo).

Exporters will be demoed in a minute!
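What instrumenting a service amounts to can be sketched without any client library: the application keeps counters in memory and renders them in the Prometheus text exposition format (`# HELP` and `# TYPE` lines followed by one line per label combination). The `Counter` class below is a hand-rolled illustration, not the API of the official Python client; the metric name and labels are assumptions.

```python
import threading


class Counter:
    """A monotonically increasing value, tracked per label combination."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = {}
        self._lock = threading.Lock()

    def inc(self, amount=1, **labels):
        if amount < 0:
            raise ValueError("counters can only go up")
        key = tuple(sorted(labels.items()))
        with self._lock:
            self._values[key] = self._values.get(key, 0) + amount

    def render(self):
        """Render all label combinations in the text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            for key, value in sorted(self._values.items()):
                label_text = ",".join(f'{k}="{v}"' for k, v in key)
                suffix = "{" + label_text + "}" if label_text else ""
                lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"
```

An exporter produces the same kind of output, but gathers the values from a third-party system instead of from its own code.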

Slide 16

Slide 16 text

Demo: node_exporter

Slide 17

Slide 17 text

Demo: jmx_exporter

Slide 18

Slide 18 text

Demo: cAdvisor Instrumentation

Slide 19

Slide 19 text

More Exporters

Slide 20

Slide 20 text

Prometheus Ecosystem

Slide 21

Slide 21 text

Prometheus Ecosystem

Slide 22

Slide 22 text

Demo: Prometheus Server

Config:

    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: "prometheus"
        target_groups:
          - targets: ['localhost:9090']  # prometheus runs here.
      - job_name: "node"
        target_groups:
          - targets: ['localhost:9100']  # node_exporter runs here.

Slide 23

Slide 23 text

Demo: PromQL

Example Queries:

    node_network_transmit_bytes
    node_network_transmit_bytes{device=~"eth.*"}
    sum(node_network_transmit_bytes)
    sum(node_network_transmit_bytes) by (device)
    sum(node_network_transmit_bytes) by (instance)
    sum(node_network_transmit_bytes + node_network_receive_bytes) by (device)
    sum(node_network_transmit_bytes + node_network_receive_bytes) by (device) / 1024 / 1024
    # All the values we have recorded within the last 5 minutes.
    node_network_transmit_bytes[5m]
    # Per-second average over the last 5 minutes.
    rate(node_network_transmit_bytes[5m])
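The two ideas behind these queries, aggregating over a label and turning a counter into a per-second rate, can be approximated in a few lines of Python. This is a simplified sketch: the real `rate()` also extrapolates to the window boundaries, which is left out here, and the function names `sum_by` and `simple_rate` are made up for the example.

```python
def sum_by(vector, label):
    """Rough equivalent of `sum(...) by (label)`.

    vector: list of (labels_dict, value) pairs, one per time series.
    """
    totals = {}
    for labels, value in vector:
        key = labels.get(label)
        totals[key] = totals.get(key, 0.0) + value
    return totals


def simple_rate(samples, now, window_seconds):
    """Approximate per-second increase of a counter over a time window.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns None when fewer than two samples fall into the window.
    """
    in_window = [(t, v) for t, v in samples if now - window_seconds <= t <= now]
    if len(in_window) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A counter that went down was reset (e.g. process restart);
        # in that case the new value itself is the increase.
        increase += cur - prev if cur >= prev else cur
    elapsed = in_window[-1][0] - in_window[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed
```

Working on increases rather than raw values is what lets a rate survive a restart of the monitored process.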

Slide 24

Slide 24 text

PromQL Operators and Functions

Slide 25

Slide 25 text

Prometheus Ecosystem

Slide 26

Slide 26 text

Demo: Grafana

•  Visualization interface
•  Dashboard composer
•  Can use multiple backends simultaneously
•  Mix and match graphs from different tools on a single dashboard
•  Embed annotations into time-series graphs
•  Backends: Graphite, InfluxDB, Prometheus

Slide 27

Slide 27 text

Prometheus Ecosystem

Slide 28

Slide 28 text

Alert Manager

•  Same flexible query language as for graphs
•  Alerts can inherit the same labels as the time series they are based on:
   –  Which instance is failing
   –  What sub-thing in this instance is causing problems
•  Alerts can be silenced very specifically by any label set, for example: everything in a particular zone / from a particular job / with a label matching a regex / for two hours.
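Silencing by label set can be pictured as a list of matchers plus a validity period: an alert is silenced when every matcher agrees with the alert's labels and the current time lies inside the period. The sketch below illustrates that idea only; the `Silence` class, its field names, and the label names are assumptions and do not mirror the Alert Manager's actual data structures.

```python
import re
from dataclasses import dataclass


@dataclass
class Silence:
    matchers: list    # (label_name, pattern, is_regex) tuples
    starts_at: float  # seconds since some epoch
    ends_at: float

    def applies_to(self, alert_labels, now):
        """True if the silence is active and every matcher fits the alert."""
        if not (self.starts_at <= now < self.ends_at):
            return False
        for name, pattern, is_regex in self.matchers:
            value = alert_labels.get(name, "")
            if is_regex:
                # Regexes must match the whole label value.
                if re.fullmatch(pattern, value) is None:
                    return False
            elif value != pattern:
                return False
        return True
```

For example, `Silence([("zone", "eu-west", False), ("job", "api-.*", True)], 0, 7200)` would mute every alert from jobs starting with `api-` in that zone for two hours, while leaving all other alerts untouched.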

Slide 29

Slide 29 text

Alert Manager: Step 1, Step 2, Step 3

Step 1: Alerting rules are configured in Prometheus, using the same query language as for graphs.

    # Alert for any instance that has a median request latency > 1s.
    ALERT APIHighRequestLatency
      IF api_http_request_latencies_second{quantile="0.5"} > 1
      FOR 1m
      ANNOTATIONS {
        summary = "High request latency on {{ $labels.instance }}",
        description = "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)",
      }
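The effect of the `FOR 1m` clause can be sketched as a small state machine: a rule whose condition becomes true is first pending, and only turns into a firing alert once the condition has held for the whole duration. The function below is an illustrative approximation with assumed names and state labels, evaluated over a fixed list of samples rather than live data.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state after each evaluation.

    samples: list of (timestamp_seconds, value) in evaluation order.
    States: "inactive", "pending" (condition true, but not yet for long
    enough), and "firing".
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held_for = timestamp - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # Condition no longer holds: reset the timer.
            active_since = None
            states.append("inactive")
    return states
```

With a 15s evaluation interval and a threshold of 1, a latency that stays above 1s is reported as pending for four evaluations and fires on the fifth, so short spikes never page anyone.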

Slide 30

Slide 30 text

Alert Manager: Step 1, Step 2, Step 3

Step 2: Start Prometheus with the -alertmanager.url command line flag.

Step 3: Run Alert Manager and configure it with the following:
•  Grouping of alerts
•  Inhibition / notification rate limiting
•  Silencing alert dependencies

Available receivers:
•  email
•  https://www.pagerduty.com
•  https://pushover.net
•  https://slack.com
•  https://www.opsgenie.com
•  webhook

Slide 31

Slide 31 text

Prometheus Ecosystem

Slide 32

Slide 32 text

Dynamic Service Discovery

Out-of-the-box service discovery for:
•  DNS-SD: DNS Service Discovery. This is a DNS extension using DNS SRV records to get IPs and ports of current services. Used internally at SoundCloud.
•  Consul: Service discovery tool by HashiCorp.
•  Kubernetes: Container orchestration by Google.
•  Marathon: Container orchestration for Apache Mesos.
•  ZooKeeper: Java-based distributed config, service discovery via Nerve or Serversets.
•  EC2: Amazon Cloud
•  File-Watcher: A directory with config files is watched by Prometheus. New services create new files. Generic approach to plug in any kind of mechanism.
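The file-watcher approach can be fed by any script that knows the current set of services. The sketch below writes a target file as a JSON list of target groups, each with `targets` and `labels`, and swaps it into place atomically so a half-written file is never read. The function name, file name, and example hosts are assumptions; the watched file pattern is whatever the Prometheus config specifies.

```python
import json
import os
import tempfile


def write_target_file(path, job_targets):
    """Write a service discovery file describing the current targets.

    job_targets: mapping of job name to an iterable of "host:port" strings.
    Returns the list of target groups that was written.
    """
    groups = [
        {"targets": sorted(targets), "labels": {"job": job}}
        for job, targets in sorted(job_targets.items())
    ]
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first, then rename: readers see either
    # the old file or the complete new one, never a partial write.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(tmp_path, path)
    return groups
```

A deployment script could call `write_target_file("targets/node.json", {"node": ["host1:9100", "host2:9100"]})` whenever hosts are added or removed.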

Slide 33

Slide 33 text

Related Talks

•  The Open-Source Monitoring Landscape by Michael Merideth, VictorOps: http://www.slideshare.net/vo_mike/the-opensource-monitoring-landscape
•  Prometheus: A Next-Generation Monitoring System (Talk) by Julius Volz and Björn Rabenstein, SoundCloud: https://www.usenix.org/conference/srecon15europe/program/presentation/rabenstein
•  Prometheus service discovery for Kubernetes & OpenShift Origin by Jimmy Dyson, Red Hat: https://vimeo.com/139706674

Slide 34

Slide 34 text

ConSol Software GmbH
Franziskanerstraße 38
D-81669 München
Tel: +49-89-45841-100
Fax: +49-89-45841-111
[email protected]
www.consol.de