Slide 1

Slide 1 text

Monitoring Sucks And what you can do about it Bacon 21st April 2012 gareth rushgrove | morethanseven.net http://www.flickr.com/photos/map408/2412123378

Slide 2

Slide 2 text

Me

Slide 3

Slide 3 text

Gareth Rushgrove @garethr

Slide 4

Slide 4 text

Blog at morethanseven.net

Slide 5

Slide 5 text

Curate devopsweekly.com

Slide 6

Slide 6 text

Work at UK Government Digital Service

Slide 7

Slide 7 text

Serious Government Business

Slide 8

Slide 8 text

The talk

Slide 9

Slide 9 text

I want to convince you that...
- Monitoring running applications is interesting
- Most monitoring tools suck
http://www.flickr.com/photos/iancarroll/5027441664

Slide 10

Slide 10 text

...and
- Monitoring running applications is interesting
- Most monitoring tools suck
http://www.flickr.com/photos/iancarroll/5027441664

Slide 11

Slide 11 text

What we have

Slide 12

Slide 12 text

Clunky user interfaces

Slide 13

Slide 13 text

Verbose configuration
# Example configuration file for Munin, generated by ‘make build’
# The next three variables specifies where the location of the RRD
# databases, the HTML output, and the logs, severally. They all
# must be writable by the user running munin-cron.
dbdir /var/lib/munin
htmldir /var/www/munin
logdir /var/log/munin
rundir /var/run/munin
# Where to look for the HTML templates
tmpldir /etc/munin/templates
# Make graphs show values per minute instead of per second
#graph_period minute
# Drop [email protected] and [email protected] an email everytime
# something changes (OK -> WARNING, CRITICAL -> OK, etc)
contact.yourname.command mail -s “MUNIN – [${var:host}] ~ ${var:graph_title} ~ warnings: ${loop<,>:wfields ${var:label}=${var:value}} ~ criticals: ${loop<,>:cfields ${var:label}=${var:value}}” [email protected]
#
#
# For those with Nagios, the following might come in handy. In addition,
# the services must be defined in the Nagios server as well.
#contact.nagios.command /usr/sbin/send_nsca -H nagios.host.com -c /etc/send_nsca.cfg
# a simple host tree
[location1-wms1.otherdomain.tld]
address 169.254.30.86
use_node_name yes
load.load.warning 15
load.load.critical 30
memory.apps.warning 6442450944
memory.committed.warning 8589934592
# memory.committed.warn 8589934592
memory.committed.critical 17179869184
df._dev_cciss_c0d0p1.warning 75
df._dev_mapper_VolGroup00_LogVol00.warning 90
df._dev_mapper_VolGroup00_LogVol01.warning 90
df._dev_mapper_VolGroup00_LogVol02.warning 90
df._dev_mapper_VolGroup00_LogVol04.warning 90
df._dev_mapper_VolGroup01_LogVol00.warning 90
df._dev_mapper_VolGroup02_LogVol00.warning 90
df._dev_mapper_VolGroup03_LogVol00.warning 90
df._dev_cciss_c0d0p1.critical 95
df._dev_mapper_VolGroup00_LogVol00.critical 95
df._dev_mapper_VolGroup00_LogVol01.critical 95
df._dev_mapper_VolGroup00_LogVol02.critical 95
df._dev_mapper_VolGroup00_LogVol04.critical 95
df._dev_mapper_VolGroup01_LogVol00.critical 95
df._dev_mapper_VolGroup02_LogVol00.critical 95
df._dev_mapper_VolGroup03_LogVol00.critical 95
[location1-wms2.otherdomain.tld]
address 169.254.30.88
use_node_name yes
load.load.warning 15
load.load.critical 30
memory.apps.warning 6442450944
memory.committed.warning 8589934592
memory.committed.critical 17179869184
df._dev_cciss_c0d0p1.critical 95
df._dev_mapper_VolGroup00_LogVol00.critical 95
df._dev_mapper_VolGroup00_LogVol01.critical 95
df._dev_mapper_VolGroup00_LogVol02.critical 95
df._dev_mapper_VolGroup00_LogVol04.critical 95
df._dev_mapper_VolGroup01_LogVol00.critical 95
df._dev_mapper_VolGroup02_LogVol00.critical 95
df._dev_mapper_VolGroup03_LogVol00.critical 95
[location2-wms2.otherdomain.tld]
address 169.254.20.22
use_node_name yes
load.load.warning 15
load.load.critical 30
memory.apps.warning 6442450944
memory.committed.warning 8589934592
memory.committed.critical 17179869184
df._dev_cciss_c0d0p1.warning 75
df._dev_mapper_VolGroup00_LogVol00.warning 90
df._dev_mapper_VolGroup00_LogVol01.warning 90
df._dev_mapper_VolGroup00_LogVol02.warning 90
df._dev_mapper_VolGroup00_LogVol04.warning 90
df._dev_mapper_VolGroup01_LogVol00.warning 90
df._dev_mapper_VolGroup02_LogVol00.warning 90
df._dev_mapper_VolGroup03_LogVol00.warning 90
df._dev_cciss_c0d0p1.critical 95
df._dev_mapper_VolGroup00_LogVol00.critical 95
df._dev_mapper_VolGroup00_LogVol01.critical 95
df._dev_mapper_VolGroup00_LogVol02.critical 95
df._dev_mapper_VolGroup00_LogVol04.critical 95
df._dev_mapper_VolGroup01_LogVol00.critical 95
df._dev_mapper_VolGroup02_LogVol00.critical 95
df._dev_mapper_VolGroup03_LogVol00.critical 95
[location2-ts1.otherdomain.tld]
address 169.254.20.24
use_node_name no
memory.swap.label swap
memory.swap.draw STACK
memory.swap.info Swap memory used
[location2-ts2.otherdomain.tld]
address 169.254.20.26
use_node_name no
memory.swap.label swap
memory.swap.draw STACK
memory.swap.info Swap memory used
[location2-mfc1.otherdomain.tld]
address 169.254.20.28
use_node_name no
memory.swap.label swap
memory.swap.draw STACK
memory.swap.info Swap memory used
[location2-mfc2.otherdomain.tld]
address 169.254.20.30
use_node_name no
memory.swap.label swap
memory.swap.draw STACK
memory.swap.info Swap memory used
[otherdomain.tld;Totals]
update no
load1.graph_title Loads-WMS1
load1.graph_order location1wms1=location1wms1.otherdomain.tld:load.load location2-wms1=location2-wms1.otherdomain.tld:load.load
df._dev_cciss_c0d0p1.warning 75
df._dev_mapper_VolGroup00_LogVol00.warning 90
df._dev_mapper_VolGroup00_LogVol01.warning 90
df._dev_mapper_VolGroup00_LogVol02.warning 90
df._dev_mapper_VolGroup00_LogVol04.warning 90
df._dev_mapper_VolGroup01_LogVol00.warning 90
df._dev_mapper_VolGroup02_LogVol00.warning 90
df._dev_mapper_VolGroup03_LogVol00.warning 90
df._dev_cciss_c0d0p1.critical 95
df._dev_mapper_VolGroup00_LogVol00.critical 95
df._dev_mapper_VolGroup00_LogVol01.critical 95
df._dev_mapper_VolGroup00_LogVol02.critical 95
df._dev_mapper_VolGroup00_LogVol04.critical 95
df._dev_mapper_VolGroup01_LogVol00.critical 95
df._dev_mapper_VolGroup02_LogVol00.critical 95
df._dev_mapper_VolGroup03_LogVol00.critical 95
[location1-ts1.otherdomain.tld]
address 169.254.30.90
use_node_name no
memory.swap.label swap
memory.swap.draw STACK
memory.swap.info Swap memory used
[location1m-fc1.otherdomain.tld]
address 169.254.30.94
use_node_name no
memory.swap.label swap
memory.swap.draw STACK
memory.swap.info Swap memory used
[location1-mfc2.otherdomain.tld]
address 169.254.30.96
use_node_name no
memory.swap.label swap
memory.swap.draw STACK
memory.swap.info Swap memory used
[location1-ts2.otherdomain.tld]
address 169.254.30.92
use_node_name no
memory.swap.label swap
memory.swap.draw STACK
memory.swap.info Swap memory used
memory.apps.label usage
memory.unused.label pagefile
[location2-wms1.otherdomain.tld]
address 169.254.20.20
use_node_name yes
load.load.warning 15
load.load.critical 30
memory.apps.warning 6442450944
memory.committed.warning 8589934592
memory.committed.critical 17179869184
df._dev_cciss_c0d0p1.warning 75
df._dev_mapper_VolGroup00_LogVol00.warning 90
df._dev_mapper_VolGroup00_LogVol01.warning 90
df._dev_mapper_VolGroup00_LogVol02.warning 90
df._dev_mapper_VolGroup00_LogVol04.warning 90
df._dev_mapper_VolGroup01_LogVol00.warning 90
df._dev_mapper_VolGroup02_LogVol00.warning 90
df._dev_mapper_VolGroup03_LogVol00.warning 90
load2.graph_title Loads-WMS2
load2.graph_order location1wms2=location1wms2.otherdomain.tld:load.load location2-wms2=location2-wms2.otherdomain.tld:load.load
load3.graph_title Loads on top of each other
load3.dummy_field.stack location1wms1=location1wms1.otherdomain.tld:load.load location2-wms1=location2-wms1.otherdomain.tld:load.load location1wms2=location1wms2.otherdomain.tld:load.load location2-wms2=location2-wms2.otherdomain.tld:load.load
load3.dummy_field.draw AREA # We want area instead the default LINE2.
load3.dummy_field.label dummy # This is needed. Silly, really.
memory1.graph_title Memory SWAP WMS
memory1.graph_order location1wms1=location1wms1.otherdomain.tld:memory.swap location2-wms1=location2-wms1.otherdomain.tld:memory.swap location1wms2=location1wms2.otherdomain.tld:memory.swap location2-wms2=location2-wms2.otherdomain.tld:memory.swap
memory2.graph_title Memory Committed WMS
memory2.graph_order location1wms1=location1wms1.otherdomain.tld:memory.committed location2-wms1=location2-wms1.otherdomain.tld:memory.committed location1wms2=location1wms2.otherdomain.tld:memory.committed location2-wms2=location2-wms2.otherdomain.tld:memory.committed
# load3.graph_title Loads summarised
# load3.combined_loads.sum location1wms1.otherdomain.tld:load.load location2-wms1.otherdomain.tld:load.load
# load3.combined_loads.label Combined loads # Must be set, as this is
# # not a dummy field!
[ip-wms1.domain.tld]
address 127.0.0.1
use_node_name yes
load.load.warning 15
load.load.critical 30
memory.apps.warning 6442450944
memory.committed.warning 8589934592
memory.committed.critical 17179869184
[ip-wms2.domain.tld]
address 192.168.101.51
use_node_name yes
load.load.warning 15
load.load.critical 30
memory.apps.warning 6442450944
memory.committed.warning 8589934592
memory.committed.critical 17179869184
[windows-pc.domain.tld]
address 192.168.101.26
use_node_name yes
memory.swap.label swap
memory.swap.draw STACK
memory.swap.info Swap memory used

Slide 14

Slide 14 text

Long check intervals

Slide 15

Slide 15 text

Host centric world view

Slide 16

Slide 16 text

Another source of truth

Slide 17

Slide 17 text

Not just me

Slide 18

Slide 18 text

A Tweet

Slide 19

Slide 19 text

A blog post

Slide 20

Slide 20 text

An IRC room ##monitoringsucks

Slide 21

Slide 21 text

A Twitter hashtag #monitoringsucks

Slide 22

Slide 22 text

A GitHub repository

Slide 23

Slide 23 text

What we want (really really want)

Slide 24

Slide 24 text

Metrics and graphs

Slide 25

Slide 25 text

System AND business data

Slide 26

Slide 26 text

Log streams

Slide 27

Slide 27 text

APIs
{
  "service_key": "e93facc04764012d7bfb002500d5d1a6",
  "incident_key": "srv01/HTTP",
  "event_type": "trigger",
  "description": "FAILURE on machine srv01.acme.com",
  "details": {
    "ping time": "1500ms",
    "load avg": 0.75
  }
}
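A minimal sketch of how a check script might build an event like the one on this slide. The field names come from the slide; the helper name `build_trigger_event` is a hypothetical one chosen for illustration, and nothing here sends the event anywhere.

```python
import json


def build_trigger_event(service_key, incident_key, description, details):
    """Return a trigger event shaped like the JSON on the slide.

    The helper name is hypothetical; only the field names are from the slide.
    """
    return {
        "service_key": service_key,
        "incident_key": incident_key,
        "event_type": "trigger",
        "description": description,
        "details": details,
    }


event = build_trigger_event(
    service_key="e93facc04764012d7bfb002500d5d1a6",
    incident_key="srv01/HTTP",
    description="FAILURE on machine srv01.acme.com",
    details={"ping time": "1500ms", "load avg": 0.75},
)

# Serialise to the body an HTTP client would send to the alerting API.
payload = json.dumps(event, indent=2)
print(payload)
```

Keeping the event a plain dictionary until the last step means the same data could be logged, graphed or posted to a different service without changes.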

Slide 28

Slide 28 text

Alerts

Slide 29

Slide 29 text

Dashboards

Slide 30

Slide 30 text

Goings on (just a quick sample)

Slide 31

Slide 31 text

Naming things (is hard)
- Metric: a numeric or boolean data point
- Context: metadata about a metric
- Resource: the source of a metric
- Event: metric combined with context
- Action: a response to a given metric
- Collection: getting the metrics
- Event processing: taking action
- Presentation: graphs, emails, dashboards, etc.
- Analytics: correlation
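One way to read the first few terms as data structures. This is only a sketch: the class layout, the `action_for` helper and the threshold of 15 (borrowed from the `load.load.warning` value in the Munin config earlier) are assumptions made for illustration, not part of any agreed vocabulary.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Metric:
    """A numeric or boolean data point."""
    name: str
    value: Union[float, bool]


@dataclass
class Event:
    """A metric combined with its context (metadata) and resource (source)."""
    metric: Metric
    resource: str
    context: Dict[str, str] = field(default_factory=dict)


def action_for(event, warning=15.0):
    """An action is a response to a given metric; here it is just a label."""
    return "alert" if event.metric.value >= warning else "ok"


event = Event(
    metric=Metric("load", 0.75),
    resource="srv01.acme.com",
    context={"unit": "load average"},
)
print(action_for(event))
```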

Slide 32

Slide 32 text

Sharing setups

Slide 33

Slide 33 text

Low latency message based tools

Slide 34

Slide 34 text

Monitoring == Testing

Slide 35

Slide 35 text

Monitoring unit tests
Scenario: check that calendars works correctly
  Given I am testing "calendars"
  Then I should be able to visit:
    | Path |
    | /when-do-the-clocks-change |
    | /bank-holidays |
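The same idea as the calendars scenario, sketched without Cucumber: visit each path and report any that do not answer with HTTP 200. The base URL is a hypothetical placeholder, and the demo at the bottom uses canned status codes instead of real requests so it runs offline.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx responses; the code is still the answer.
        return error.code


def failing_paths(fetch_status, base_url, paths):
    """Return the paths that did not answer with HTTP 200."""
    return [path for path in paths if fetch_status(base_url + path) != 200]


BASE_URL = "https://www.example.test"  # hypothetical host, not from the talk
PATHS = ["/when-do-the-clocks-change", "/bank-holidays"]

# Canned responses stand in for the network; pass http_status for real checks.
fake_statuses = {
    BASE_URL + "/when-do-the-clocks-change": 200,
    BASE_URL + "/bank-holidays": 500,
}
broken = failing_paths(fake_statuses.get, BASE_URL, PATHS)
print(broken)
```

Passing the fetch function in as an argument keeps the check itself testable, which fits the "Monitoring == Testing" point.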

Slide 36

Slide 36 text

For one of my colleagues, Mat
Scenario: check we don't get results for cheese
  Given I am testing "search"
  When I search for "cheese"
  Then I should receive no results

Slide 37

Slide 37 text

monitors.txt

Slide 38

Slide 38 text

JSON example
"homepage performance": {
  "visit": "http://monitorstxt.org",
  "page": {
    "should have": {
      "download time": {
        "maximum": "0.5 seconds"
      }
    }
  },
  "assets": {
    "should have": {
      "download time": {
        "maximum": "2 seconds"
      }
    }
  }
},
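A rough sketch of how any monitoring tool could consume a description like this. The fragment from the slide is wrapped in braces to make it valid JSON; the helper names, the assumption that limits always look like "<number> seconds", and the measured timings are all invented for illustration.

```python
import json

SPEC = json.loads("""
{
  "homepage performance": {
    "visit": "http://monitorstxt.org",
    "page": {"should have": {"download time": {"maximum": "0.5 seconds"}}},
    "assets": {"should have": {"download time": {"maximum": "2 seconds"}}}
  }
}
""")


def parse_seconds(text):
    """Turn a string such as '0.5 seconds' into a float (assumed format)."""
    number, unit = text.split()
    if unit not in ("second", "seconds"):
        raise ValueError(f"unsupported unit: {unit}")
    return float(number)


def failures(check, measured):
    """Return the parts ('page', 'assets') whose download time is too slow."""
    failed = []
    for part, seconds in measured.items():
        limit = parse_seconds(
            check[part]["should have"]["download time"]["maximum"]
        )
        if seconds > limit:
            failed.append(part)
    return failed


check = SPEC["homepage performance"]
# Measured timings are made-up sample values, not real measurements.
print(failures(check, {"page": 0.3, "assets": 2.4}))
```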

Slide 39

Slide 39 text

Monitoring system agnostic
"homepage performance": {
  "visit": "http://monitorstxt.org",
  "page": {
    "should have": {
      "download time": {
        "maximum": "0.5 seconds"
      }
    }
  },
  "assets": {
    "should have": {
      "download time": {
        "maximum": "2 seconds"
      }
    }
  }
},

Slide 40

Slide 40 text

Open Source

Slide 41

Slide 41 text

Graphite

Slide 42

Slide 42 text

GDash

Slide 43

Slide 43 text

Statsd

Slide 44

Slide 44 text

Ruby counter
@statsd = Statsd.new('statsd.example.com', 1234)
@statsd.increment('foo.bar')

Slide 45

Slide 45 text

Java counter
StatsdClient client = new StatsdClient("host", 1234);
client.increment("foo.bar");
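Both clients above do very little work: a StatsD counter update is a short text datagram of the form `bucket:value|c` sent over UDP. The sketch below shows that wire format in Python; the function names and the default host and port are placeholders, not part of any client library.

```python
import socket


def counter_packet(bucket, delta=1):
    """Format a StatsD counter update, e.g. b'foo.bar:1|c'."""
    return f"{bucket}:{delta}|c".encode("ascii")


def increment(bucket, host="127.0.0.1", port=1234):
    """Send a counter increment over UDP and return the bytes sent.

    Host and port are placeholders. UDP is fire-and-forget, so the
    application is not held up waiting for the metrics server.
    """
    packet = counter_packet(bucket)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))
    return packet


print(counter_packet("foo.bar"))
```

Because nothing waits for a reply, adding a counter to application code costs very little, which is part of why this style of instrumentation is easy to adopt.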

Slide 46

Slide 46 text

Logster

Slide 47

Slide 47 text

Point at log files
⚡ logster --output=ganglia \
    NginxLogster \
    /var/log/nginx/access.log

Slide 48

Slide 48 text

Get metrics in Ganglia or Graphite
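To make the log-to-metric step concrete, here is a small sketch of the kind of parsing involved: read access log lines and count responses per status class. It is not Logster's code; the regular expression assumes the common "combined" access log format, and the metric names and sample lines are invented.

```python
import re
from collections import Counter

# Finds the status code that follows the quoted request in a combined-format
# access log line. The log format is an assumption; adjust for your own logs.
STATUS = re.compile(r'" (\d{3}) ')


def status_class_counts(lines):
    """Count responses per status class (2xx, 3xx, 4xx, 5xx)."""
    counts = Counter()
    for line in lines:
        match = STATUS.search(line)
        if match:
            counts[f"http_{match.group(1)[0]}xx"] += 1
    return counts


# Invented sample lines for illustration.
sample = [
    '127.0.0.1 - - [21/Apr/2012:10:00:00 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.21"',
    '127.0.0.1 - - [21/Apr/2012:10:00:01 +0000] "GET /a HTTP/1.1" 502 173 "-" "curl/7.21"',
    '127.0.0.1 - - [21/Apr/2012:10:00:02 +0000] "GET /b HTTP/1.1" 404 169 "-" "curl/7.21"',
]
counts = status_class_counts(sample)
print(dict(counts))
```

Counts like these are what would then be shipped to Ganglia or Graphite on each run.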

Slide 49

Slide 49 text

Graylog2

Slide 50

Slide 50 text

Logstash

Slide 51

Slide 51 text

Riemann

Slide 52

Slide 52 text

Clojure stream parsing
(streams prn partial prn "this event is interesting:")
(where (state "error")
  (fn [event] (info event)))
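The idea behind the snippet above is that every event flows through a set of stream functions, and a filter such as `where` only passes matching events on to its children. The Python below is an analogy for that idea, not Riemann's API; the `where` helper and the sample events are assumptions made for illustration.

```python
def where(predicate, *children):
    """Build a stream that forwards an event to children when predicate holds."""
    def stream(event):
        if predicate(event):
            for child in children:
                child(event)
    return stream


interesting = []
streams = [
    print,  # like prn: print every event
    where(lambda e: e.get("state") == "error", interesting.append),
]

# Invented sample events.
for event in [
    {"service": "nginx", "state": "ok"},
    {"service": "nginx", "state": "error"},
]:
    for stream in streams:
        stream(event)
```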

Slide 53

Slide 53 text

Configuration management

Slide 54

Slide 54 text

Or Configuration management

Slide 55

Slide 55 text

Automate checks
@@nagios_service { "check_nginx_5xx_on_${::hostname}":
  use => 'generic-service',
  check_command => 'check_ganglia_metric!nginx_http_5xx!0.05!0.1',
  service_description => 'check nginx error rate',
  host_name => "${::govuk_class}-${::hostname}",
  target => '/etc/nagios3/conf.d/nagios_service.cfg',
}

Slide 56

Slide 56 text

Defined outside monitoring systems
@@nagios_service { "check_nginx_5xx_on_${::hostname}":
  use => 'generic-service',
  check_command => 'check_ganglia_metric!nginx_http_5xx!0.05!0.1',
  service_description => 'check nginx error rate',
  host_name => "${::govuk_class}-${::hostname}",
  target => '/etc/nagios3/conf.d/nagios_service.cfg',
}

Slide 57

Slide 57 text

Automate graylog collection
graylogtail::collect { 'graylogtail-access':
  log_file => '/var/log/nginx/access.log',
  facility => $name,
}

Slide 58

Slide 58 text

Automate logster collection
cron { 'logster-nginx':
  command => '/usr/sbin/logster NginxLogster /var/log/nginx/access.log',
  user => root,
  minute => '*/2'
}

Slide 59

Slide 59 text

SAAS (pay nice people for software)

Slide 60

Slide 60 text

New Relic

Slide 61

Slide 61 text

New Relic dashboard

Slide 62

Slide 62 text

Librato Metrics

Slide 63

Slide 63 text

Librato Graphs

Slide 64

Slide 64 text

Splunk

Slide 65

Slide 65 text

Splunk dashboard

Slide 66

Slide 66 text

PagerDuty

Slide 67

Slide 67 text

PagerDuty scheduler

Slide 68

Slide 68 text

Boundary

Slide 69

Slide 69 text

Network traffic analysis

Slide 70

Slide 70 text

Takeaway (if all you remember is)

Slide 71

Slide 71 text

Admit we have a problem

Slide 72

Slide 72 text

Help build stuff

Slide 73

Slide 73 text

Lots of links
- https://github.com/monitoringsucks/
- http://graylog2.org
- http://logstash.net
- https://github.com/etsy/logster
- https://github.com/etsy/statsd
- http://aphyr.github.com/riemann/
- http://graphite.wikidot.com/
- http://monitorstxt.org/
- http://auxesis.github.com/cucumber-nagios/

Slide 74

Slide 74 text

The End

Slide 75

Slide 75 text

One more thing
http://www.flickr.com/photos/benterrett/6852348725/

Slide 76

Slide 76 text

Questions?
http://flickr.com/photos/psd/102332391/