Monitoring Sucks

Talk from the #devslovebacon conference about why monitoring sucks and what people are doing about it.

Gareth Rushgrove

April 21, 2012

Transcript

  1. 1.

    Monitoring Sucks
    And what you can do about it

    Bacon
    21st April 2012
    gareth rushgrove | morethanseven.net
    http://www.flickr.com/photos/map408/2412123378
  2. 2.

    Me

  3. 9.

    I want to convince you that...

    - Monitoring running applications is interesting
    - Most monitoring tools suck

    http://www.flickr.com/photos/iancarroll/5027441664
  4. 13.

    Verbose configuration

        # Example configuration file for Munin, generated by 'make build'

        # The next three variables specifies where the location of the RRD
        # databases, the HTML output, and the logs, severally. They all
        # must be writable by the user running munin-cron.
        dbdir /var/lib/munin
        htmldir /var/www/munin
        logdir /var/log/munin
        rundir /var/run/munin

        # Where to look for the HTML templates
        tmpldir /etc/munin/templates

        # Make graphs show values per minute instead of per second
        #graph_period minute

        # Drop somejuser@fnord.comm and anotheruser@blibb.comm an email everytime
        # something changes (OK -> WARNING, CRITICAL -> OK, etc)
        contact.yourname.command mail -s "MUNIN - [${var:host}] ~ ${var:graph_title} ~ warnings: ${loop<,>:wfields ${var:label}=${var:value}} ~ criticals: ${loop<,>:cfields ${var:label}=${var:value}}" your.email@domain.tld
        #
        #
        # For those with Nagios, the following might come in handy. In addition,
        # the services must be defined in the Nagios server as well.
        #contact.nagios.command /usr/sbin/send_nsca -H nagios.host.com -c /etc/send_nsca.cfg

        # a simple host tree
        [location1-wms1.otherdomain.tld]
        address 169.254.30.86
        use_node_name yes
        load.load.warning 15
        load.load.critical 30
        memory.apps.warning 6442450944
        memory.committed.warning 8589934592
        # memory.committed.warn 8589934592
        memory.committed.critical 17179869184
        df._dev_cciss_c0d0p1.warning 75
        df._dev_mapper_VolGroup00_LogVol00.warning 90
        df._dev_mapper_VolGroup00_LogVol01.warning 90
        df._dev_mapper_VolGroup00_LogVol02.warning 90
        df._dev_mapper_VolGroup00_LogVol04.warning 90
        df._dev_mapper_VolGroup01_LogVol00.warning 90
        df._dev_mapper_VolGroup02_LogVol00.warning 90
        df._dev_mapper_VolGroup03_LogVol00.warning 90
        df._dev_cciss_c0d0p1.critical 95
        df._dev_mapper_VolGroup00_LogVol00.critical 95
        df._dev_mapper_VolGroup00_LogVol01.critical 95
        df._dev_mapper_VolGroup00_LogVol02.critical 95
        df._dev_mapper_VolGroup00_LogVol04.critical 95
        df._dev_mapper_VolGroup01_LogVol00.critical 95
        df._dev_mapper_VolGroup02_LogVol00.critical 95
        df._dev_mapper_VolGroup03_LogVol00.critical 95

        [location1-wms2.otherdomain.tld]
        address 169.254.30.88
        use_node_name yes
        load.load.warning 15
        load.load.critical 30
        memory.apps.warning 6442450944
        memory.committed.warning 8589934592
        memory.committed.critical 17179869184
        df._dev_cciss_c0d0p1.critical 95
        df._dev_mapper_VolGroup00_LogVol00.critical 95
        df._dev_mapper_VolGroup00_LogVol01.critical 95
        df._dev_mapper_VolGroup00_LogVol02.critical 95
        df._dev_mapper_VolGroup00_LogVol04.critical 95
        df._dev_mapper_VolGroup01_LogVol00.critical 95
        df._dev_mapper_VolGroup02_LogVol00.critical 95
        df._dev_mapper_VolGroup03_LogVol00.critical 95

        [location2-wms2.otherdomain.tld]
        address 169.254.20.22
        use_node_name yes
        load.load.warning 15
        load.load.critical 30
        memory.apps.warning 6442450944
        memory.committed.warning 8589934592
        memory.committed.critical 17179869184
        df._dev_cciss_c0d0p1.warning 75
        df._dev_mapper_VolGroup00_LogVol00.warning 90
        df._dev_mapper_VolGroup00_LogVol01.warning 90
        df._dev_mapper_VolGroup00_LogVol02.warning 90
        df._dev_mapper_VolGroup00_LogVol04.warning 90
        df._dev_mapper_VolGroup01_LogVol00.warning 90
        df._dev_mapper_VolGroup02_LogVol00.warning 90
        df._dev_mapper_VolGroup03_LogVol00.warning 90
        df._dev_cciss_c0d0p1.critical 95
        df._dev_mapper_VolGroup00_LogVol00.critical 95
        df._dev_mapper_VolGroup00_LogVol01.critical 95
        df._dev_mapper_VolGroup00_LogVol02.critical 95
        df._dev_mapper_VolGroup00_LogVol04.critical 95
        df._dev_mapper_VolGroup01_LogVol00.critical 95
        df._dev_mapper_VolGroup02_LogVol00.critical 95
        df._dev_mapper_VolGroup03_LogVol00.critical 95

        [location2-ts1.otherdomain.tld]
        address 169.254.20.24
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used

        [location2-ts2.otherdomain.tld]
        address 169.254.20.26
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used

        [location2-mfc1.otherdomain.tld]
        address 169.254.20.28
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used

        [location2-mfc2.otherdomain.tld]
        address 169.254.20.30
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used

        [otherdomain.tld;Totals]
        update no
        load1.graph_title Loads-WMS1
        load1.graph_order location1wms1=location1wms1.otherdomain.tld:load.load location2-wms1=location2-wms1.otherdomain.tld:load.load
        df._dev_cciss_c0d0p1.warning 75
        df._dev_mapper_VolGroup00_LogVol00.warning 90
        df._dev_mapper_VolGroup00_LogVol01.warning 90
        df._dev_mapper_VolGroup00_LogVol02.warning 90
        df._dev_mapper_VolGroup00_LogVol04.warning 90
        df._dev_mapper_VolGroup01_LogVol00.warning 90
        df._dev_mapper_VolGroup02_LogVol00.warning 90
        df._dev_mapper_VolGroup03_LogVol00.warning 90
        df._dev_cciss_c0d0p1.critical 95
        df._dev_mapper_VolGroup00_LogVol00.critical 95
        df._dev_mapper_VolGroup00_LogVol01.critical 95
        df._dev_mapper_VolGroup00_LogVol02.critical 95
        df._dev_mapper_VolGroup00_LogVol04.critical 95
        df._dev_mapper_VolGroup01_LogVol00.critical 95
        df._dev_mapper_VolGroup02_LogVol00.critical 95
        df._dev_mapper_VolGroup03_LogVol00.critical 95

        [location1-ts1.otherdomain.tld]
        address 169.254.30.90
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used

        [location1m-fc1.otherdomain.tld]
        address 169.254.30.94
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used

        [location1-mfc2.otherdomain.tld]
        address 169.254.30.96
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used

        [location1-ts2.otherdomain.tld]
        address 169.254.30.92
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used
        memory.apps.label usage
        memory.unused.label pagefile

        [location2-wms1.otherdomain.tld]
        address 169.254.20.20
        use_node_name yes
        load.load.warning 15
        load.load.critical 30
        memory.apps.warning 6442450944
        memory.committed.warning 8589934592
        memory.committed.critical 17179869184
        df._dev_cciss_c0d0p1.warning 75
        df._dev_mapper_VolGroup00_LogVol00.warning 90
        df._dev_mapper_VolGroup00_LogVol01.warning 90
        df._dev_mapper_VolGroup00_LogVol02.warning 90
        df._dev_mapper_VolGroup00_LogVol04.warning 90
        df._dev_mapper_VolGroup01_LogVol00.warning 90
        df._dev_mapper_VolGroup02_LogVol00.warning 90
        df._dev_mapper_VolGroup03_LogVol00.warning 90
        load2.graph_title Loads-WMS2
        load2.graph_order location1wms2=location1wms2.otherdomain.tld:load.load location2-wms2=location2-wms2.otherdomain.tld:load.load
        load3.graph_title Loads on top of each other
        load3.dummy_field.stack location1wms1=location1wms1.otherdomain.tld:load.load location2-wms1=location2-wms1.otherdomain.tld:load.load location1wms2=location1wms2.otherdomain.tld:load.load location2-wms2=location2-wms2.otherdomain.tld:load.load
        load3.dummy_field.draw AREA # We want area instead the default LINE2.
        load3.dummy_field.label dummy # This is needed. Silly, really.
        memory1.graph_title Memory SWAP WMS
        memory1.graph_order location1wms1=location1wms1.otherdomain.tld:memory.swap location2-wms1=location2-wms1.otherdomain.tld:memory.swap location1wms2=location1wms2.otherdomain.tld:memory.swap location2-wms2=location2-wms2.otherdomain.tld:memory.swap
        memory2.graph_title Memory Committed WMS
        memory2.graph_order location1wms1=location1wms1.otherdomain.tld:memory.committed location2-wms1=location2-wms1.otherdomain.tld:memory.committed location1wms2=location1wms2.otherdomain.tld:memory.committed location2-wms2=location2-wms2.otherdomain.tld:memory.committed
        # load3.graph_title Loads summarised
        # load3.combined_loads.sum location1wms1.otherdomain.tld:load.load location2-wms1.otherdomain.tld:load.load
        # load3.combined_loads.label Combined loads # Must be set, as this is
        #                                           # not a dummy field!

        [ip-wms1.domain.tld]
        address 127.0.0.1
        use_node_name yes
        load.load.warning 15
        load.load.critical 30
        memory.apps.warning 6442450944
        memory.committed.warning 8589934592
        memory.committed.critical 17179869184

        [ip-wms2.domain.tld]
        address 192.168.101.51
        use_node_name yes
        load.load.warning 15
        load.load.critical 30
        memory.apps.warning 6442450944
        memory.committed.warning 8589934592
        memory.committed.critical 17179869184

        [windows-pc.domain.tld]
        address 192.168.101.26
        use_node_name yes
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used
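Most of the repetition in a configuration like this is mechanical, so one way to tame it is to generate the per-host threshold lines from a compact description. The following is a minimal Python sketch, not a Munin feature: the names `VOLUMES`, `threshold_lines` and `host_section` are hypothetical, and only the field names and values visible on the slide are used.

```python
# Volume names copied from the slide's df.* lines; helper names are hypothetical.
VOLUMES = ["_dev_cciss_c0d0p1"] + [
    f"_dev_mapper_VolGroup0{group}_LogVol0{vol}"
    for group, vol in [(0, 0), (0, 1), (0, 2), (0, 4), (1, 0), (2, 0), (3, 0)]
]


def threshold_lines(warning=90, critical=95, overrides=None):
    """Return df.<volume>.<level> lines, warnings first, then criticals."""
    overrides = overrides or {}
    lines = []
    for level, default in (("warning", warning), ("critical", critical)):
        for volume in VOLUMES:
            value = overrides.get((volume, level), default)
            lines.append(f"df.{volume}.{level} {value}")
    return lines


def host_section(name, address):
    """Render one host block with the disk thresholds used on the slide."""
    header = [f"[{name}]", f"address {address}", "use_node_name yes"]
    # The slide warns at 75 for the cciss partition and at 90 elsewhere.
    overrides = {("_dev_cciss_c0d0p1", "warning"): 75}
    return "\n".join(header + threshold_lines(overrides=overrides))
```

Calling `host_section("location1-wms1.otherdomain.tld", "169.254.30.86")` would produce the header and the sixteen `df.*` lines for that host; the load and memory thresholds are left out to keep the sketch short.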
  5. 27.

    APIs

        {
          "service_key": "e93facc04764012d7bfb002500d5d1a6",
          "incident_key": "srv01/HTTP",
          "event_type": "trigger",
          "description": "FAILURE on machine srv01.acme.com",
          "details": {
            "ping time": "1500ms",
            "load avg": 0.75
          }
        }
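An event like the one on this slide is easy to build programmatically, which is the appeal of an API over a hand-edited config file. A minimal Python sketch, assuming a hypothetical helper `build_trigger_event`; the field names and values are the ones on the slide, and the step of sending the body to a service is left out because the slide does not name an endpoint.

```python
import json


def build_trigger_event(service_key, incident_key, description, details):
    """Assemble a trigger event with the fields shown on the slide."""
    return {
        "service_key": service_key,
        "incident_key": incident_key,
        "event_type": "trigger",
        "description": description,
        "details": details,
    }


event = build_trigger_event(
    service_key="e93facc04764012d7bfb002500d5d1a6",
    incident_key="srv01/HTTP",
    description="FAILURE on machine srv01.acme.com",
    details={"ping time": "1500ms", "load avg": 0.75},
)

# Bytes suitable for use as an HTTP POST body.
body = json.dumps(event).encode("utf-8")
```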
  6. 31.

    Naming things (is hard)

    - Metric: a numeric or boolean data point
    - Context: metadata about a metric
    - Resource: the source of a metric
    - Event: metric combined with context
    - Action: a response to a given metric
    - Collection: getting the metrics
    - Event processing: taking action
    - Presentation: graphs, emails, dashboards, etc.
    - Analytics: correlation
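The vocabulary on this slide maps fairly directly onto a small data model. The Python sketch below is one possible reading, not a standard: the class and function names are hypothetical, and the example metric name and threshold are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union


@dataclass(frozen=True)
class Metric:
    """A numeric or boolean data point."""
    name: str
    value: Union[float, bool]


@dataclass(frozen=True)
class Event:
    """A metric combined with its context and the resource it came from."""
    metric: Metric
    resource: str
    context: Dict[str, str] = field(default_factory=dict)


def process(events: List[Event],
            should_act: Callable[[Event], bool],
            action: Callable[[Event], str]) -> List[str]:
    """Event processing: run the action for every event that needs one."""
    return [action(event) for event in events if should_act(event)]


events = [
    Event(Metric("nginx_http_5xx", 0.01), resource="web-1", context={"unit": "errors/s"}),
    Event(Metric("nginx_http_5xx", 0.20), resource="web-2", context={"unit": "errors/s"}),
]
alerts = process(
    events,
    should_act=lambda e: e.metric.value > 0.1,
    action=lambda e: f"alert: {e.metric.name} on {e.resource}",
)
```

Collection, presentation and analytics would sit on either side of this: something fills `events`, and something else graphs or correlates them.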
  7. 35.

    Monitoring unit tests

        Scenario: check that calendars works correctly
          Given I am testing "calendars"
          Then I should be able to visit:
            | Path                       |
            | /when-do-the-clocks-change |
            | /bank-holidays             |
  8. 36.

    For one of my colleagues, Mat

        Scenario: check we don't get results for cheese
          Given I am testing "search"
          When I search for "cheese"
          Then I should receive no results
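A scenario that visits a list of paths boils down to a loop over HTTP requests. The Python sketch below shows that idea using only the standard library; it is not how cucumber-nagios implements its steps, and `fetch_status`, `check_paths` and the injectable `fetch` parameter are hypothetical names chosen so the check can be exercised without a network.

```python
from urllib.request import urlopen


def fetch_status(url):
    """Return the HTTP status code for a URL (raises on 4xx/5xx and network errors)."""
    with urlopen(url, timeout=10) as response:
        return response.status


def check_paths(base_url, paths, fetch=fetch_status):
    """Visit each path and return a list of (path, reason) for the ones that fail."""
    failures = []
    for path in paths:
        try:
            status = fetch(base_url + path)
        except OSError as error:  # URLError and HTTPError are OSError subclasses
            failures.append((path, str(error)))
            continue
        if status != 200:
            failures.append((path, f"status {status}"))
    return failures
```

An empty list means every path was reachable, which is the condition a monitoring system would treat as OK.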
  9. 38.

    JSON example

        "homepage performance": {
          "visit": "http://monitorstxt.org",
          "page": {
            "should have": {
              "download time": { "maximum": "0.5 seconds" }
            }
          },
          "assets": {
            "should have": {
              "download time": { "maximum": "2 seconds" }
            }
          }
        },
  10. 39.

    Monitoring system agnostic

        "homepage performance": {
          "visit": "http://monitorstxt.org",
          "page": {
            "should have": {
              "download time": { "maximum": "0.5 seconds" }
            }
          },
          "assets": {
            "should have": {
              "download time": { "maximum": "2 seconds" }
            }
          }
        },
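Because the definition is plain data, any tool can read it and decide how to enforce it. The Python sketch below shows one way a consumer might interpret the fragment; the wrapping braces, the function names `max_seconds` and `evaluate`, and the idea of passing in already-measured times are all assumptions made for illustration, not part of any published format.

```python
import json

# The slide shows a fragment; it is wrapped in braces here to make it valid JSON.
SPEC = json.loads("""
{
  "homepage performance": {
    "visit": "http://monitorstxt.org",
    "page": {"should have": {"download time": {"maximum": "0.5 seconds"}}},
    "assets": {"should have": {"download time": {"maximum": "2 seconds"}}}
  }
}
""")


def max_seconds(rule):
    """Extract the maximum download time, in seconds, from a 'should have' rule."""
    text = rule["should have"]["download time"]["maximum"]
    number, unit = text.split()
    if unit != "seconds":
        raise ValueError(f"unsupported unit: {unit}")
    return float(number)


def evaluate(monitor, measured):
    """Compare measured download times (seconds) against the monitor's limits."""
    return {
        target: measured[target] <= max_seconds(monitor[target])
        for target in ("page", "assets")
    }
```

For example, `evaluate(SPEC["homepage performance"], {"page": 0.4, "assets": 2.5})` reports the page as within its limit and the assets as over theirs.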
  11. 52.

    Clojure stream parsing

        (streams
          ; print every event
          prn
          ; print every event with a prefix
          (partial prn "this event is interesting:")
          ; log only the events whose state is "error"
          (where (state "error")
            (fn [event] (info event))))
  12. 55.

    Automate checks

        @@nagios_service { "check_nginx_5xx_on_${::hostname}":
          use                 => 'generic-service',
          check_command       => 'check_ganglia_metric!nginx_http_5xx!0.05!0.1',
          service_description => 'check nginx error rate',
          host_name           => "${::govuk_class}-${::hostname}",
          target              => '/etc/nagios3/conf.d/nagios_service.cfg',
        }
  13. 56.

    Defined outside monitoring systems

        @@nagios_service { "check_nginx_5xx_on_${::hostname}":
          use                 => 'generic-service',
          check_command       => 'check_ganglia_metric!nginx_http_5xx!0.05!0.1',
          service_description => 'check nginx error rate',
          host_name           => "${::govuk_class}-${::hostname}",
          target              => '/etc/nagios3/conf.d/nagios_service.cfg',
        }
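Nagios separates a check command from its arguments with `!`, so the `check_command` string above carries a command name and three arguments. The Python sketch below splits it apart and applies the thresholds; it assumes the arguments are the metric name, the warning level and the critical level in that order, and that higher values are worse. Neither assumption is stated on the slide, and the helper names are hypothetical.

```python
def parse_check_command(check_command):
    """Split a Nagios check_command into the command name and its arguments."""
    command, *arguments = check_command.split("!")
    return command, arguments


def classify(value, warning, critical):
    """Map a metric value to a Nagios-style state, assuming higher is worse."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


command, arguments = parse_check_command(
    "check_ganglia_metric!nginx_http_5xx!0.05!0.1"
)
metric = arguments[0]
# Assumed order: warning threshold first, then critical.
warning, critical = (float(argument) for argument in arguments[1:3])
```

Under those assumptions, an error rate of 0.07 would be reported as a warning and 0.2 as critical.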
  14. 58.

    Automate logster collection

        cron { 'logster-nginx':
          command => '/usr/sbin/logster NginxLogster /var/log/nginx/access.log',
          user    => root,
          minute  => '*/2'
        }
  15. 73.

    Lots of links

    - https://github.com/monitoringsucks/
    - http://graylog2.org
    - http://logstash.net
    - https://github.com/etsy/logster
    - https://github.com/etsy/statsd
    - http://aphyr.github.com/riemann/
    - http://graphite.wikidot.com/
    - http://monitorstxt.org/
    - http://auxesis.github.com/cucumber-nagios/
  16. 74.