Slide 1

Slide 1 text

Performance Monitoring with the ELK Stack: collectd

Slide 2

Slide 2 text

CC-BY-ND 4.0

Why would I want to do performance monitoring with ELK?

Slide 3

Slide 3 text

Performance Metrics
• CPU
• Disk
• Memory
• Network
• More!

Slide 4

Slide 4 text

Introducing: collectd
https://collectd.org

Slide 5

Slide 5 text

For Windows users...
• SSC Serv
  – Commercial product
  – Uses the collectd protocol
  – Disk, df, CPU, Interface, Terminal Services
  – Can monitor any performance counter available via the Performance Data Helper (PDH) interface
  – http://ssc-serv.com

Slide 6

Slide 6 text

Log data...
[Figure: Logs]

Slide 7

Slide 7 text

Performance metrics...
[Figure: Metrics]

Slide 8

Slide 8 text

Correlation!
[Figure: Logs and Metrics]

Slide 9

Slide 9 text

Get the whole picture!

Slide 10

Slide 10 text

Configuring Logstash...

input {
  udp {
    host => "x.x.x.x"
    port => 25826
    buffer_size => 1452
    type => "collectd"
    codec => collectd { }
  }
}
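
The collectd codec decodes collectd's binary network protocol, in which a UDP packet is a sequence of parts, each with a 2-byte type, a 2-byte total length, and a payload. The following is only a rough Python sketch of that framing, limited to the string-valued parts; it is not the codec's actual code, and the helper names (iter_parts, string_fields, make_string_part) are made up for illustration.

```python
import struct

# Part type codes for the string-valued parts of collectd's binary protocol.
STRING_PARTS = {
    0x0000: "host",
    0x0002: "plugin",
    0x0003: "plugin_instance",
    0x0004: "type",
    0x0005: "type_instance",
}


def iter_parts(packet: bytes):
    """Yield (part_type, payload) for each part in a collectd UDP packet."""
    offset = 0
    while offset + 4 <= len(packet):
        # Header: big-endian uint16 type, uint16 length (header included).
        part_type, length = struct.unpack_from("!HH", packet, offset)
        if length < 4 or offset + length > len(packet):
            break  # malformed or truncated part: stop rather than guess
        yield part_type, packet[offset + 4:offset + length]
        offset += length


def string_fields(packet: bytes) -> dict:
    """Collect the string parts of a packet into a dict (last one wins)."""
    fields = {}
    for part_type, payload in iter_parts(packet):
        name = STRING_PARTS.get(part_type)
        if name is not None:
            # String payloads are NUL-terminated.
            fields[name] = payload.rstrip(b"\x00").decode("ascii", "replace")
    return fields


def make_string_part(part_type: int, text: str) -> bytes:
    """Hypothetical helper that builds one string part for the demo below."""
    body = text.encode("ascii") + b"\x00"
    return struct.pack("!HH", part_type, 4 + len(body)) + body


sample = (make_string_part(0x0000, "host.example.com")
          + make_string_part(0x0002, "memory"))
print(string_fields(sample))
```

Numeric parts (time, interval, values) are skipped here; a real decoder also has to handle those, plus signed and encrypted packets.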

Slide 11

Slide 11 text

Configuring Logstash...
• Authentication & Security
• NaN handling
• Interval pruning
• typesdb

Slide 12

Slide 12 text

Configuring collectd...

Hostname "host.example.com"
LoadPlugin interface
LoadPlugin load
LoadPlugin memory
LoadPlugin network
<Plugin interface>
  Interface "eth0"
  IgnoreSelected false
</Plugin>

Slide 13

Slide 13 text

Configuring collectd...
• Intervals are configurable
  – Global
  – Per plugin

Slide 14

Slide 14 text

Plugins
• df, disk
  – Disk usage statistics.
• load
  – The 1m, 5m, and 15m load averages.
• memory
  – free, buffered, cached, used, etc.
• interface
  – Per-interface network usage/traffic statistics.

Slide 15

Slide 15 text

Plugins
• ConnTrack
  – Tracks the number of entries in Linux's connection tracking table.
• ContextSwitch
  – Collects the number of context switches done by the operating system.

Slide 16

Slide 16 text

Plugins
• DBI/PostgreSQL/Oracle
  – Returns values from queries.
• Entropy
  – Collects the available entropy on a system.

Slide 17

Slide 17 text

Plugins
• memcached
  – Collects the number of connections and requests handled by the daemon, the CPU resources consumed, number of items cached, number of threads, and bytes sent and received.
• MySQL
  – Connects to a MySQL database, issues a SHOW STATUS command, and returns many of the variables.

Slide 18

Slide 18 text

Plugins
• Swap
  – Collects the amount of memory currently written onto hard disk (or whatever the system calls “swap”).
• TCPConns
  – Counts the number of TCP connections to or from a specified port. Results include each state: LISTEN, ESTABLISHED, CLOSE_WAIT, etc.

Slide 19

Slide 19 text

BIND (9.5.0+)
Global statistics
▪ OpCodes
▪ Query types (A, MX, AAAA, …)
▪ Overall server statistics (#Queries, #Responses, …)
▪ Zone maintenance statistics (#Notifications, #Updates, …)
▪ Resolver statistics (usually empty)
▪ Memory statistics
Per-view statistics
▪ Query types
▪ Resolver statistics (#Queries, #Responses, #NXDOMAIN, …)
▪ RR-set cache statistics (#entries by type)
Per-zone statistics
▪ Overall statistics (Success, #NXRRSET, …)

Slide 20

Slide 20 text

IP Tables
• Per-rule byte and packet counters, selected by:
  – Position (e.g. “the fourth rule in the ‘INPUT’ chain in the ‘filter’ table”)
  – Comment (using the “COMMENT” match)
• Low overhead
  – Uses libiptc, which communicates with the kernel directly.

Slide 21

Slide 21 text

SNMP
• Uses Net-SNMP
• Use collectd to collect stats from:
  – Switches
  – Routers
  – UPS
  – Rack monitoring systems
  – and more!

Slide 22

Slide 22 text

Custom Plugins & Extensions
• C
• Perl
• Python
• Exec
• Unix-sockets
• Java
• Java MBean support, via jcollectd

Slide 23

Slide 23 text

Logstash output

{
  "host":"host.example.com",
  "@timestamp":"2015-03-06T12:26:43.790-07:00",
  "@version":"1",
  "type":"collectd",
  "plugin":"memory",
  "collectd_type":"memory",
  "type_instance":"used",
  "value":8517087232
}

Slide 25

Slide 25 text

Logstash output

{
  "host":"host.example.com",
  "@timestamp":"2015-03-06T12:38:45.789-07:00",
  "@version":"1",
  "type":"collectd",
  "plugin":"interface",
  "plugin_instance":"eth0",
  "collectd_type":"if_packets",
  "rx":0,
  "tx":0
}
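
The two event shapes on these slides differ: the memory event carries a single "value", while the if_packets event carries "rx" and "tx". A small Python sketch of how a consumer might handle both follows. The function names and the slash-joined naming scheme are assumptions made for illustration, not anything Logstash or collectd prescribes.

```python
import json

# Events as shown on the slides (the memory event is written without the
# trailing comma so that it parses as JSON).
MEMORY_EVENT = json.loads(
    '{"host":"host.example.com","@timestamp":"2015-03-06T12:26:43.790-07:00",'
    '"@version":"1","type":"collectd","plugin":"memory",'
    '"collectd_type":"memory","type_instance":"used","value":8517087232}'
)
INTERFACE_EVENT = json.loads(
    '{"host":"host.example.com","@timestamp":"2015-03-06T12:38:45.789-07:00",'
    '"@version":"1","type":"collectd","plugin":"interface",'
    '"plugin_instance":"eth0","collectd_type":"if_packets","rx":0,"tx":0}'
)


def metric_name(event: dict) -> str:
    """Join the identifying fields into one name (hypothetical convention)."""
    parts = [event["host"], event["plugin"]]
    for key in ("plugin_instance", "collectd_type", "type_instance"):
        if event.get(key):
            parts.append(event[key])
    return "/".join(parts)


def metric_values(event: dict) -> dict:
    """Return the numeric fields: "value", or "rx"/"tx" when present."""
    if "value" in event:
        return {"value": event["value"]}
    return {key: event[key] for key in ("rx", "tx") if key in event}


for event in (MEMORY_EVENT, INTERFACE_EVENT):
    print(metric_name(event), metric_values(event))
```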

Slide 27

Slide 27 text

What now?

Slide 28

Slide 28 text

Alerting
You have all your data in Elasticsearch. Now what?

Slide 29

Slide 29 text

Brian Murphy
Elasticsearch developer
Previously at Loggly and Splunk.
[email protected]

Slide 30

Slide 30 text

Why?
● There is no point in having the data unless you can act on it.
● Dashboards are great for an overview but don't notify you when things go wrong.
● When something happens is as important as what happens.

Slide 31

Slide 31 text

Alerting vs Percolator?
Percolator is fantastic at what it does, but it has some limitations:
● It only processes a single event at a time.
● It has no access to aggregations.
● It keeps no history of what was percolated and what matched.

Slide 32

Slide 32 text

Anatomy of an Alert
An alert is defined by four key elements:
● Schedule
● Input
● Condition
● Actions

Slide 33

Slide 33 text

Schedule

Slide 34

Slide 34 text

Schedule
The schedule defines when and how often an alert should run.
● Every 5 minutes (check the load on my production web server)
● Every hour (count how many errors my database logs contain)
● On the last day of the month (check whether my traffic has gone up from last month)

Slide 35

Slide 35 text

Cron
Alert schedules can be defined using cron syntax.
● Great for people who know cron.
● Terrible for people who don't.

Slide 36

Slide 36 text

Cron
Cron syntax is very powerful but hard (at least for me).

0 0,12 1 */2 *
Run at 12am and 12pm on the 1st day of every 2nd month.

* 12 16 * Mon
Run every minute of the 12:00 hour on Monday, but only if the day is the 16th of the month.

Slide 38

Slide 38 text

Simpler
We will support a simplified syntax for defining schedules.

"hourly" : { "minute" : 30 }
Run every hour on the 30th minute.

"daily" : { "at" : [ "midnight", "noon", "17:00" ] }
Run at 00:00, 12:00 and 17:00 every day.

"interval" : "5m"
Run every 5 minutes.
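
To make the simplified schedules concrete, here is a small Python sketch of how a next run time could be computed for the "interval" and "hourly" forms. It is an illustration only, not the product's scheduler: next_run and parse_interval are hypothetical names, the unit suffixes other than "m" are assumed, and the "daily" form is left out for brevity.

```python
from datetime import datetime, timedelta

# Assumed unit suffixes; only "m" appears on the slide.
UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_interval(text: str) -> timedelta:
    """Turn a string such as "5m" into a timedelta."""
    return timedelta(**{UNITS[text[-1]]: int(text[:-1])})


def next_run(schedule: dict, now: datetime) -> datetime:
    """Return the next time an alert with this schedule should fire."""
    if "interval" in schedule:
        return now + parse_interval(schedule["interval"])
    if "hourly" in schedule:
        minute = schedule["hourly"]["minute"]
        candidate = now.replace(minute=minute, second=0, microsecond=0)
        if candidate <= now:
            # That minute has already passed in this hour.
            candidate += timedelta(hours=1)
        return candidate
    raise ValueError("unsupported schedule in this sketch")


now = datetime(2015, 3, 6, 12, 26, 43)
print(next_run({"interval": "5m"}, now))
print(next_run({"hourly": {"minute": 30}}, now))
```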

Slide 39

Slide 39 text

Input

Slide 40

Slide 40 text

Input
The input generates or loads the data that will be used by a running alert.
● Run an aggregation on my collectd index to get my load average over the last 5 minutes.
● Count how many errors my log index had for my database server.
● Run a date histogram aggregation to get my web traffic for the last two months.

Slide 41

Slide 41 text

Input
Inputs can be (for now) Elasticsearch searches or static data.
Searches give you full access to the Elasticsearch query DSL and can span multiple indices.
Searches can be templated with access to two special variables:
1. The time the alert ran.
2. The time the alert was scheduled to run.

Slide 42

Slide 42 text

Input

"query" : {
  "filtered": {
    "query": {
      "match": { "response": 404 }
    },
    "filter": {
      "range": {
        "@timestamp" : {
          "from": "{{ctx.scheduled_fire_time}}||-5m",
          "to": "{{ctx.scheduled_fire_time}}"
        }
      }
    }
  }
}
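
The templated search above can be pictured as plain substitution: the scheduled fire time replaces the placeholder, and Elasticsearch date math ("||-5m") turns it into a five-minute window. The Python sketch below shows only that substitution step. The render function and the ISO 8601 timestamp format are assumptions; the real template engine and timestamp format may differ.

```python
import json
from datetime import datetime, timezone

# The query from the slide, wrapped in braces so that it parses as JSON.
QUERY_TEMPLATE = """{
  "query": {"filtered": {
      "query": {"match": {"response": 404}},
      "filter": {"range": {"@timestamp": {
          "from": "{{ctx.scheduled_fire_time}}||-5m",
          "to": "{{ctx.scheduled_fire_time}}"}}}}}
}"""


def render(template: str, scheduled_fire_time: datetime) -> dict:
    """Substitute the scheduled fire time and parse the resulting query."""
    stamp = scheduled_fire_time.isoformat()  # assumed timestamp format
    return json.loads(template.replace("{{ctx.scheduled_fire_time}}", stamp))


fired = datetime(2015, 3, 6, 12, 30, tzinfo=timezone.utc)
query = render(QUERY_TEMPLATE, fired)
print(query["query"]["filtered"]["filter"]["range"]["@timestamp"])
```

Using the scheduled time rather than the actual run time keeps consecutive windows contiguous even when an alert starts a little late.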

Slide 43

Slide 43 text

Condition

Slide 44

Slide 44 text

Condition
The condition decides whether the alert actions should be executed.
● Is the load result from the input over a threshold?
● Does the count from the input mean that I need to be paged?
● Is the trend in a date histogram unexpected?

Slide 45

Slide 45 text

Condition
Conditions can be evaluated using scripts.

"script" : "return ctx.payload.total_hits > 5"

"script" : "ok_count = 0.0; error_count = 0.0; for (bucket in ctx.payload.aggregations.response.buckets) { if (bucket.key < 400) { ok_count += bucket.doc_count; } else { error_count += bucket.doc_count; } }; return error_count / (ok_count + 1) >= 0.1;"
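
The second script is easier to follow when written out. The Python sketch below mirrors its logic: buckets with a key under 400 count as OK, everything else as an error, and the condition matches when errors reach 10% of (OK + 1). The function name and the sample payloads are invented for illustration; the alerting feature itself runs the script shown on the slide, not Python.

```python
def error_ratio_exceeded(payload: dict, threshold: float = 0.1) -> bool:
    """Mirror the slide's script: compare error responses to OK responses."""
    ok_count = 0.0
    error_count = 0.0
    for bucket in payload["aggregations"]["response"]["buckets"]:
        if bucket["key"] < 400:
            ok_count += bucket["doc_count"]
        else:
            error_count += bucket["doc_count"]
    # The +1 avoids dividing by zero when there are no OK responses.
    return error_count / (ok_count + 1) >= threshold


# Made-up payloads shaped like a terms aggregation on the response code.
noisy = {"aggregations": {"response": {"buckets": [
    {"key": 200, "doc_count": 90},
    {"key": 404, "doc_count": 5},
    {"key": 500, "doc_count": 5},
]}}}
quiet = {"aggregations": {"response": {"buckets": [
    {"key": 200, "doc_count": 100},
    {"key": 500, "doc_count": 5},
]}}}

print(error_ratio_exceeded(noisy), error_ratio_exceeded(quiet))
```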

Slide 46

Slide 46 text

Condition
Even simpler if you use on-disk or indexed scripts.

"script" : "hit_checker",
"type" : "indexed",
"params" : { "threshold" : 5 }

Slide 47

Slide 47 text

Actions

Slide 48

Slide 48 text

Actions
The actions take the result of the alert and deliver it to external and internal systems.
● Email the sysadmin to let him know that load on the cluster is too high.
● Generate a PagerDuty API call to all database administrators.
● Index the result of the alert.

Slide 49

Slide 49 text

Actions
The email and webhook actions support templating.

"email" : {
  "to" : "[email protected]",
  "subject" : "{{alert_name}} has triggered with {{ctx.payload.hits.total}} results",
  "body" : "The {{alert_name}} found errors on {{#ctx.payload.aggregations.names}} {{name}}, {{/ctx.payload.aggregations.names}} servers. "
}

Slide 50

Slide 50 text

Actions
The email and webhook actions support templating.

"webhook" : {
  "method" : "POST",
  "url" : "http://host.domain/third-party-system/{{alert_name}}",
  "body" : "Encountered {{ctx.payload.hits.total}} errors"
}
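
As a rough picture of what a templated webhook action amounts to, the Python sketch below fills in the placeholders and builds (but does not send) the HTTP request. The render and build_webhook_request helpers, the flat placeholder lookup, and the sample values are all assumptions for illustration; the real feature uses its own template engine.

```python
import urllib.request

# The webhook action from the slide, as a Python dict.
WEBHOOK = {
    "method": "POST",
    "url": "http://host.domain/third-party-system/{{alert_name}}",
    "body": "Encountered {{ctx.payload.hits.total}} errors",
}


def render(template: str, values: dict) -> str:
    """Replace each {{name}} placeholder with its value (flat lookup only)."""
    for name, value in values.items():
        template = template.replace("{{" + name + "}}", str(value))
    return template


def build_webhook_request(action: dict, values: dict) -> urllib.request.Request:
    """Build the request for a webhook action without sending it."""
    return urllib.request.Request(
        url=render(action["url"], values),
        data=render(action["body"], values).encode("utf-8"),
        method=action["method"],
    )


# Hypothetical alert name and hit count.
request = build_webhook_request(
    WEBHOOK, {"alert_name": "error_watch", "ctx.payload.hits.total": 7}
)
print(request.get_method(), request.full_url, request.data)
```

Passing the request to urllib.request.urlopen would actually deliver it; that step is omitted so the sketch has no network side effects.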

Slide 51

Slide 51 text

Alerts and Elasticsearch

Slide 52

Slide 52 text

Alerts and Elasticsearch
Alerts are indexed Elasticsearch documents.
Every time an alert runs, an `alert_history` document is generated in a time-based index.
This document contains all the information from the alert run, along with whether or not the condition matched and the status of the actions.

Slide 53

Slide 53 text

Alerts and Elasticsearch
Since alerts and alert runs are indexed documents in Elasticsearch, you can build Kibana dashboards of your alerts and run alerts on alerts.
● Run an alert every day that checks the number of triggered alerts that failed to execute their actions.
● Run an alert every day that checks that the expected number of alerts ran.
● Run an alert that checks whether one node is triggering alerts more than others.

Slide 54

Slide 54 text

When?
Soon. There will be a beta; if you are interested, please let our product team know.
