Monitoring Sucks

Talk from the #devslovebacon conference about why monitoring sucks and what people are doing about it.

Gareth Rushgrove

April 21, 2012

Transcript

  1. Monitoring Sucks: And what you can do about it

    Bacon, 21st April 2012
    gareth rushgrove | morethanseven.net
    http://www.flickr.com/photos/map408/2412123378
  2. Me

  3. I want to convince you that...

    - Monitoring running applications is interesting
    - Most monitoring tools suck

    http://www.flickr.com/photos/iancarroll/5027441664
  4. Verbose configuration

    # Example configuration file for Munin, generated by 'make build'

    # The next three variables specifies where the location of the RRD
    # databases, the HTML output, and the logs, severally. They all
    # must be writable by the user running munin-cron.
    dbdir /var/lib/munin
    htmldir /var/www/munin
    logdir /var/log/munin
    rundir /var/run/munin

    # Where to look for the HTML templates
    tmpldir /etc/munin/templates

    # Make graphs show values per minute instead of per second
    #graph_period minute

    # Drop [email protected] and [email protected] an email everytime
    # something changes (OK -> WARNING, CRITICAL -> OK, etc)
    contact.yourname.command mail -s "MUNIN - [${var:host}] ~ ${var:graph_title} ~ warnings: ${loop<,>:wfields ${var:label}=${var:value}} ~ criticals: ${loop<,>:cfields ${var:label}=${var:value}}" [email protected]
    #
    #
    # For those with Nagios, the following might come in handy. In addition,
    # the services must be defined in the Nagios server as well.
    #contact.nagios.command /usr/sbin/send_nsca -H nagios.host.com -c /etc/send_nsca.cfg

    # a simple host tree
    [location1-wms1.otherdomain.tld]
        address 169.254.30.86
        use_node_name yes
        load.load.warning 15
        load.load.critical 30
        memory.apps.warning 6442450944
        memory.committed.warning 8589934592
        # memory.committed.warn 8589934592
        memory.committed.critical 17179869184
        df._dev_cciss_c0d0p1.warning 75
        df._dev_mapper_VolGroup00_LogVol00.warning 90
        df._dev_mapper_VolGroup00_LogVol01.warning 90
        df._dev_mapper_VolGroup00_LogVol02.warning 90
        df._dev_mapper_VolGroup00_LogVol04.warning 90
        df._dev_mapper_VolGroup01_LogVol00.warning 90
        df._dev_mapper_VolGroup02_LogVol00.warning 90
        df._dev_mapper_VolGroup03_LogVol00.warning 90
        df._dev_cciss_c0d0p1.critical 95
        df._dev_mapper_VolGroup00_LogVol00.critical 95
        df._dev_mapper_VolGroup00_LogVol01.critical 95
        df._dev_mapper_VolGroup00_LogVol02.critical 95
        df._dev_mapper_VolGroup00_LogVol04.critical 95
        df._dev_mapper_VolGroup01_LogVol00.critical 95
        df._dev_mapper_VolGroup02_LogVol00.critical 95
        df._dev_mapper_VolGroup03_LogVol00.critical 95

    [location1-wms2.otherdomain.tld]
        address 169.254.30.88
        use_node_name yes
        load.load.warning 15
        load.load.critical 30
        memory.apps.warning 6442450944
        memory.committed.warning 8589934592
        memory.committed.critical 17179869184
        df._dev_cciss_c0d0p1.critical 95
        df._dev_mapper_VolGroup00_LogVol00.critical 95
        df._dev_mapper_VolGroup00_LogVol01.critical 95
        df._dev_mapper_VolGroup00_LogVol02.critical 95
        df._dev_mapper_VolGroup00_LogVol04.critical 95
        df._dev_mapper_VolGroup01_LogVol00.critical 95
        df._dev_mapper_VolGroup02_LogVol00.critical 95
        df._dev_mapper_VolGroup03_LogVol00.critical 95

    [location2-wms2.otherdomain.tld]
        address 169.254.20.22
        use_node_name yes
        load.load.warning 15
        load.load.critical 30
        memory.apps.warning 6442450944
        memory.committed.warning 8589934592
        memory.committed.critical 17179869184
        df._dev_cciss_c0d0p1.warning 75
        df._dev_mapper_VolGroup00_LogVol00.warning 90
        df._dev_mapper_VolGroup00_LogVol01.warning 90
        df._dev_mapper_VolGroup00_LogVol02.warning 90
        df._dev_mapper_VolGroup00_LogVol04.warning 90
        df._dev_mapper_VolGroup01_LogVol00.warning 90
        df._dev_mapper_VolGroup02_LogVol00.warning 90
        df._dev_mapper_VolGroup03_LogVol00.warning 90
        df._dev_cciss_c0d0p1.critical 95
        df._dev_mapper_VolGroup00_LogVol00.critical 95
        df._dev_mapper_VolGroup00_LogVol01.critical 95
        df._dev_mapper_VolGroup00_LogVol02.critical 95
        df._dev_mapper_VolGroup00_LogVol04.critical 95
        df._dev_mapper_VolGroup01_LogVol00.critical 95
        df._dev_mapper_VolGroup02_LogVol00.critical 95
        df._dev_mapper_VolGroup03_LogVol00.critical 95

    [location2-ts1.otherdomain.tld]
        address 169.254.20.24
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used

    [location2-ts2.otherdomain.tld]
        address 169.254.20.26
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used

    [location2-mfc1.otherdomain.tld]
        address 169.254.20.28
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used

    [location2-mfc2.otherdomain.tld]
        address 169.254.20.30
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used

    [otherdomain.tld;Totals]
        update no
        load1.graph_title Loads-WMS1
        load1.graph_order location1wms1=location1wms1.otherdomain.tld:load.load location2-wms1=location2-wms1.otherdomain.tld:load.load
        df._dev_cciss_c0d0p1.warning 75
        df._dev_mapper_VolGroup00_LogVol00.warning 90
        df._dev_mapper_VolGroup00_LogVol01.warning 90
        df._dev_mapper_VolGroup00_LogVol02.warning 90
        df._dev_mapper_VolGroup00_LogVol04.warning 90
        df._dev_mapper_VolGroup01_LogVol00.warning 90
        df._dev_mapper_VolGroup02_LogVol00.warning 90
        df._dev_mapper_VolGroup03_LogVol00.warning 90
        df._dev_cciss_c0d0p1.critical 95
        df._dev_mapper_VolGroup00_LogVol00.critical 95
        df._dev_mapper_VolGroup00_LogVol01.critical 95
        df._dev_mapper_VolGroup00_LogVol02.critical 95
        df._dev_mapper_VolGroup00_LogVol04.critical 95
        df._dev_mapper_VolGroup01_LogVol00.critical 95
        df._dev_mapper_VolGroup02_LogVol00.critical 95
        df._dev_mapper_VolGroup03_LogVol00.critical 95

    [location1-ts1.otherdomain.tld]
        address 169.254.30.90
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used

    [location1m-fc1.otherdomain.tld]
        address 169.254.30.94
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used

    [location1-mfc2.otherdomain.tld]
        address 169.254.30.96
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used

    [location1-ts2.otherdomain.tld]
        address 169.254.30.92
        use_node_name no
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used
        memory.apps.label usage
        memory.unused.label pagefile

    [location2-wms1.otherdomain.tld]
        address 169.254.20.20
        use_node_name yes
        load.load.warning 15
        load.load.critical 30
        memory.apps.warning 6442450944
        memory.committed.warning 8589934592
        memory.committed.critical 17179869184
        df._dev_cciss_c0d0p1.warning 75
        df._dev_mapper_VolGroup00_LogVol00.warning 90
        df._dev_mapper_VolGroup00_LogVol01.warning 90
        df._dev_mapper_VolGroup00_LogVol02.warning 90
        df._dev_mapper_VolGroup00_LogVol04.warning 90
        df._dev_mapper_VolGroup01_LogVol00.warning 90
        df._dev_mapper_VolGroup02_LogVol00.warning 90
        df._dev_mapper_VolGroup03_LogVol00.warning 90
        load2.graph_title Loads-WMS2
        load2.graph_order location1wms2=location1wms2.otherdomain.tld:load.load location2-wms2=location2-wms2.otherdomain.tld:load.load
        load3.graph_title Loads on top of each other
        load3.dummy_field.stack location1wms1=location1wms1.otherdomain.tld:load.load location2-wms1=location2-wms1.otherdomain.tld:load.load location1wms2=location1wms2.otherdomain.tld:load.load location2-wms2=location2-wms2.otherdomain.tld:load.load
        load3.dummy_field.draw AREA # We want area instead the default LINE2.
        load3.dummy_field.label dummy # This is needed. Silly, really.
        memory1.graph_title Memory SWAP WMS
        memory1.graph_order location1wms1=location1wms1.otherdomain.tld:memory.swap location2-wms1=location2-wms1.otherdomain.tld:memory.swap location1wms2=location1wms2.otherdomain.tld:memory.swap location2-wms2=location2-wms2.otherdomain.tld:memory.swap
        memory2.graph_title Memory Committed WMS
        memory2.graph_order location1wms1=location1wms1.otherdomain.tld:memory.committed location2-wms1=location2-wms1.otherdomain.tld:memory.committed location1wms2=location1wms2.otherdomain.tld:memory.committed location2-wms2=location2-wms2.otherdomain.tld:memory.committed
        # load3.graph_title Loads summarised
        # load3.combined_loads.sum location1wms1.otherdomain.tld:load.load location2-wms1.otherdomain.tld:load.load
        # load3.combined_loads.label Combined loads # Must be set, as this is
        # # not a dummy field!

    [ip-wms1.domain.tld]
        address 127.0.0.1
        use_node_name yes
        load.load.warning 15
        load.load.critical 30
        memory.apps.warning 6442450944
        memory.committed.warning 8589934592
        memory.committed.critical 17179869184

    [ip-wms2.domain.tld]
        address 192.168.101.51
        use_node_name yes
        load.load.warning 15
        load.load.critical 30
        memory.apps.warning 6442450944
        memory.committed.warning 8589934592
        memory.committed.critical 17179869184

    [windows-pc.domain.tld]
        address 192.168.101.26
        use_node_name yes
        memory.swap.label swap
        memory.swap.draw STACK
        memory.swap.info Swap memory used
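Much of the bulk in a file like this is the same threshold lines repeated per host. As a rough illustration only, the repetition could be generated rather than hand-written. The sketch below assumes uniform `df` thresholds for every device (the slide's config actually uses 75 for one of them), and `host_stanza`, `DEVICES` and `HOSTS` are hypothetical names, not part of Munin.

```python
def host_stanza(name, address, devices, warning=90, critical=95):
    """Render one Munin-style host section with df thresholds per device."""
    lines = [
        "[%s]" % name,
        "    address %s" % address,
        "    use_node_name yes",
    ]
    # One warning line and one critical line per device, as in the slide.
    lines += ["    df.%s.warning %d" % (d, warning) for d in devices]
    lines += ["    df.%s.critical %d" % (d, critical) for d in devices]
    return "\n".join(lines)


# Hypothetical inventory; hostnames and addresses are copied from the slide.
DEVICES = [
    "_dev_mapper_VolGroup00_LogVol00",
    "_dev_mapper_VolGroup00_LogVol01",
]
HOSTS = {
    "location1-wms1.otherdomain.tld": "169.254.30.86",
    "location2-wms1.otherdomain.tld": "169.254.20.20",
}

config = "\n\n".join(
    host_stanza(name, address, DEVICES)
    for name, address in sorted(HOSTS.items())
)
print(config)
```

Two hosts and two devices already produce eight threshold lines, which is the growth the slide is complaining about.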
  5. APIs

    {
      "service_key": "e93facc04764012d7bfb002500d5d1a6",
      "incident_key": "srv01/HTTP",
      "event_type": "trigger",
      "description": "FAILURE on machine srv01.acme.com",
      "details": {
        "ping time": "1500ms",
        "load avg": 0.75
      }
    }
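To make the API point concrete, here is a minimal sketch of building that same payload in code. The field names and values come from the slide; the helper names `build_trigger_event` and `encode_event` are made up for illustration, and the HTTP call is left out because the slide does not name an endpoint.

```python
import json


def build_trigger_event(service_key, incident_key, description, details):
    """Return an event dict shaped like the payload on the slide."""
    return {
        "service_key": service_key,
        "incident_key": incident_key,
        "event_type": "trigger",
        "description": description,
        "details": dict(details),
    }


def encode_event(event):
    """Serialise the event to JSON, ready to use as a request body."""
    return json.dumps(event, sort_keys=True)


event = build_trigger_event(
    service_key="e93facc04764012d7bfb002500d5d1a6",
    incident_key="srv01/HTTP",
    description="FAILURE on machine srv01.acme.com",
    details={"ping time": "1500ms", "load avg": 0.75},
)
body = encode_event(event)
print(body)
```

Because the event is plain data, any check script that can produce a dict can raise an incident, which is the appeal of an API over a hand-edited config file.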
  6. Naming things (is hard)

    - Metric: a numeric or boolean data point
    - Context: metadata about a metric
    - Resource: the source of a metric
    - Event: metric combined with context
    - Action: a response to a given metric
    - Collection: getting the metrics
    - Event processing: taking action
    - Presentation: graphs, emails, dashboards, etc.
    - Analytics: correlation
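The vocabulary above maps fairly directly onto a few small types. This is only a sketch of one possible reading: the class and function names are invented here, and treating a boolean metric of `False` as unhealthy is an assumption, not something the slide states.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Metric:
    """A numeric or boolean data point."""
    name: str
    value: Union[float, bool]


@dataclass(frozen=True)
class Event:
    """A metric combined with its context and the resource it came from."""
    metric: Metric
    resource: str
    context: Dict[str, str] = field(default_factory=dict)


def process(event, threshold):
    """Event processing: pick an action name for a single event."""
    value = event.metric.value
    if isinstance(value, bool):
        # Assumption: a False boolean metric means the check failed.
        breached = not value
    else:
        breached = value > threshold
    return "alert" if breached else "ignore"


load = Event(Metric("load avg", 0.75), "srv01.acme.com", {"unit": "load"})
http_ok = Event(Metric("http ok", True), "srv01.acme.com")
print(process(load, threshold=0.5))
print(process(http_ok, threshold=0))
```

Collection, presentation and analytics would sit on either side of `process`; they are left out to keep the sketch short.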
  7. Monitoring unit tests

    Scenario: check that calendars works correctly
      Given I am testing "calendars"
      Then I should be able to visit:
        | Path                       |
        | /when-do-the-clocks-change |
        | /bank-holidays             |
  8. For one of my colleagues, Mat

    Scenario: check we don't get results for cheese
      Given I am testing "search"
      When I search for "cheese"
      Then I should receive no results
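Stripped of the Cucumber syntax, a scenario like "I should be able to visit these paths" is a small loop. The sketch below is not cucumber-nagios; it is a plain Python analogue in which the base URL `https://example.test` is a placeholder and `fetch_status` is injected so the example runs without a network.

```python
def failing_paths(base_url, paths, fetch_status):
    """Return the paths that do not answer with HTTP 200.

    fetch_status is any callable taking a URL and returning a status code.
    """
    return [path for path in paths if fetch_status(base_url + path) != 200]


def fake_fetch_status(url):
    """Stand-in for a real HTTP client, used only for this example."""
    known = {
        "https://example.test/when-do-the-clocks-change": 200,
        "https://example.test/bank-holidays": 200,
    }
    return known.get(url, 404)


PATHS = ["/when-do-the-clocks-change", "/bank-holidays"]
failures = failing_paths("https://example.test", PATHS, fake_fetch_status)
print(failures)
```

An empty list means the scenario passes; a monitoring wrapper could map a non-empty list to a critical state.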
  9. JSON example

    "homepage performance": {
      "visit": "http://monitorstxt.org",
      "page": {
        "should have": {
          "download time": { "maximum": "0.5 seconds" }
        }
      },
      "assets": {
        "should have": {
          "download time": { "maximum": "2 seconds" }
        }
      }
    },
  10. Monitoring system agnostic

    "homepage performance": {
      "visit": "http://monitorstxt.org",
      "page": {
        "should have": {
          "download time": { "maximum": "0.5 seconds" }
        }
      },
      "assets": {
        "should have": {
          "download time": { "maximum": "2 seconds" }
        }
      }
    },
  11. Clojure stream parsing

    (streams prn partial prn "this event is interesting:")
    (where (state "error")
      (fn [event] (info event))
    )
  12. Automate checks

    @@nagios_service { "check_nginx_5xx_on_${::hostname}":
      use                 => 'generic-service',
      check_command       => 'check_ganglia_metric!nginx_http_5xx!0.05!0.1',
      service_description => 'check nginx error rate',
      host_name           => "${::govuk_class}-${::hostname}",
      target              => '/etc/nagios3/conf.d/nagios_service.cfg',
    }
  13. Defined outside monitoring systems

    @@nagios_service { "check_nginx_5xx_on_${::hostname}":
      use                 => 'generic-service',
      check_command       => 'check_ganglia_metric!nginx_http_5xx!0.05!0.1',
      service_description => 'check nginx error rate',
      host_name           => "${::govuk_class}-${::hostname}",
      target              => '/etc/nagios3/conf.d/nagios_service.cfg',
    }
  14. Automate logster collection

    cron { 'logster-nginx':
      command => '/usr/sbin/logster NginxLogster /var/log/nginx/access.log',
      user    => root,
      minute  => '*/2'
    }
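What a log parser run from cron like this has to do is read new access log lines and turn them into counts. The sketch below is not logster's parser interface; it is a standalone illustration that assumes nginx's combined log format, and the metric names (`http_2xx` and so on) and sample lines are invented.

```python
import re
from collections import Counter

# In the combined log format the status code follows the quoted request line.
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_status_classes(lines):
    """Count responses per status class, e.g. http_2xx or http_5xx."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts["http_%sxx" % match.group(1)[0]] += 1
    return counts


SAMPLE = [
    '127.0.0.1 - - [21/Apr/2012:10:00:00 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/7.21"',
    '127.0.0.1 - - [21/Apr/2012:10:00:01 +0000] "GET /search HTTP/1.1" 502 173 "-" "curl/7.21"',
    '127.0.0.1 - - [21/Apr/2012:10:00:02 +0000] "GET /x HTTP/1.1" 404 162 "-" "curl/7.21"',
]

counts = count_status_classes(SAMPLE)
print(dict(counts))
```

Dividing the 5xx count by the total gives an error rate of the kind a threshold check could alert on.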
  15. Lots of links

    - https://github.com/monitoringsucks/
    - http://graylog2.org
    - http://logstash.net
    - https://github.com/etsy/logster
    - https://github.com/etsy/statsd
    - http://aphyr.github.com/riemann/
    - http://graphite.wikidot.com/
    - http://monitorstxt.org/
    - http://auxesis.github.com/cucumber-nagios/