Monitoring Sucks

Talk from the #devslovebacon conference, all about why monitoring sucks and what people are doing about it.

Gareth Rushgrove

April 21, 2012

Transcript

  1. Monitoring Sucks: And what you can do about it. Bacon, 21st April 2012. gareth rushgrove | morethanseven.net http://www.flickr.com/photos/map408/2412123378

  2. Me

  3. Gareth Rushgrove @garethr

  4. Blog at morethanseven.net

  5. Curate devopsweekly.com

  6. Work at UK Government Digital Service

  7. Serious Government Business

  8. The talk

  9. I want to convince you that... - Monitoring running applications is interesting - Most monitoring tools suck http://www.flickr.com/photos/iancarroll/5027441664

  10. ...and - Monitoring running applications is interesting - Most monitoring tools suck http://www.flickr.com/photos/iancarroll/5027441664

  11. What we have

  12. Clunky user interfaces

  13. Verbose configuration

    # Example configuration file for Munin, generated by 'make build'

    # The next three variables specifies where the location of the RRD
    # databases, the HTML output, and the logs, severally. They all
    # must be writable by the user running munin-cron.
    dbdir /var/lib/munin
    htmldir /var/www/munin
    logdir /var/log/munin
    rundir /var/run/munin

    # Where to look for the HTML templates
    tmpldir /etc/munin/templates

    # Make graphs show values per minute instead of per second
    #graph_period minute

    # Drop somejuser@fnord.comm and anotheruser@blibb.comm an email everytime
    # something changes (OK -> WARNING, CRITICAL -> OK, etc)
    contact.yourname.command mail -s "MUNIN - [${var:host}] ~ ${var:graph_title} ~ warnings: ${loop<,>:wfields ${var:label}=${var:value}} ~ criticals: ${loop<,>:cfields ${var:label}=${var:value}}" your.email@domain.tld
    #
    #
    # For those with Nagios, the following might come in handy. In addition,
    # the services must be defined in the Nagios server as well.
    #contact.nagios.command /usr/sbin/send_nsca -H nagios.host.com -c /etc/send_nsca.cfg

    # a simple host tree
    [location1-wms1.otherdomain.tld]
    address 169.254.30.86
    use_node_name yes
    load.load.warning 15
    load.load.critical 30
    memory.apps.warning 6442450944
    memory.committed.warning 8589934592
    # memory.committed.warn 8589934592
    memory.committed.critical 17179869184
    df._dev_cciss_c0d0p1.warning 75
    df._dev_mapper_VolGroup00_LogVol00.warning 90
    df._dev_mapper_VolGroup00_LogVol01.warning 90
    df._dev_mapper_VolGroup00_LogVol02.warning 90
    df._dev_mapper_VolGroup00_LogVol04.warning 90
    df._dev_mapper_VolGroup01_LogVol00.warning 90
    df._dev_mapper_VolGroup02_LogVol00.warning 90
    df._dev_mapper_VolGroup03_LogVol00.warning 90
    df._dev_cciss_c0d0p1.critical 95
    df._dev_mapper_VolGroup00_LogVol00.critical 95
    df._dev_mapper_VolGroup00_LogVol01.critical 95
    df._dev_mapper_VolGroup00_LogVol02.critical 95
    df._dev_mapper_VolGroup00_LogVol04.critical 95
    df._dev_mapper_VolGroup01_LogVol00.critical 95
    df._dev_mapper_VolGroup02_LogVol00.critical 95
    df._dev_mapper_VolGroup03_LogVol00.critical 95

    [location1-wms2.otherdomain.tld]
    address 169.254.30.88
    use_node_name yes
    load.load.warning 15
    load.load.critical 30
    memory.apps.warning 6442450944
    memory.committed.warning 8589934592
    memory.committed.critical 17179869184
    df._dev_cciss_c0d0p1.critical 95
    df._dev_mapper_VolGroup00_LogVol00.critical 95
    df._dev_mapper_VolGroup00_LogVol01.critical 95
    df._dev_mapper_VolGroup00_LogVol02.critical 95
    df._dev_mapper_VolGroup00_LogVol04.critical 95
    df._dev_mapper_VolGroup01_LogVol00.critical 95
    df._dev_mapper_VolGroup02_LogVol00.critical 95
    df._dev_mapper_VolGroup03_LogVol00.critical 95

    [location2-wms2.otherdomain.tld]
    address 169.254.20.22
    use_node_name yes
    load.load.warning 15
    load.load.critical 30
    memory.apps.warning 6442450944
    memory.committed.warning 8589934592
    memory.committed.critical 17179869184
    df._dev_cciss_c0d0p1.warning 75
    df._dev_mapper_VolGroup00_LogVol00.warning 90
    df._dev_mapper_VolGroup00_LogVol01.warning 90
    df._dev_mapper_VolGroup00_LogVol02.warning 90
    df._dev_mapper_VolGroup00_LogVol04.warning 90
    df._dev_mapper_VolGroup01_LogVol00.warning 90
    df._dev_mapper_VolGroup02_LogVol00.warning 90
    df._dev_mapper_VolGroup03_LogVol00.warning 90
    df._dev_cciss_c0d0p1.critical 95
    df._dev_mapper_VolGroup00_LogVol00.critical 95
    df._dev_mapper_VolGroup00_LogVol01.critical 95
    df._dev_mapper_VolGroup00_LogVol02.critical 95
    df._dev_mapper_VolGroup00_LogVol04.critical 95
    df._dev_mapper_VolGroup01_LogVol00.critical 95
    df._dev_mapper_VolGroup02_LogVol00.critical 95
    df._dev_mapper_VolGroup03_LogVol00.critical 95

    [location2-ts1.otherdomain.tld]
    address 169.254.20.24
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used

    [location2-ts2.otherdomain.tld]
    address 169.254.20.26
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used

    [location2-mfc1.otherdomain.tld]
    address 169.254.20.28
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used

    [location2-mfc2.otherdomain.tld]
    address 169.254.20.30
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used

    [otherdomain.tld;Totals]
    update no
    load1.graph_title Loads-WMS1
    load1.graph_order location1wms1=location1wms1.otherdomain.tld:load.load location2-wms1=location2-wms1.otherdomain.tld:load.load
    df._dev_cciss_c0d0p1.warning 75
    df._dev_mapper_VolGroup00_LogVol00.warning 90
    df._dev_mapper_VolGroup00_LogVol01.warning 90
    df._dev_mapper_VolGroup00_LogVol02.warning 90
    df._dev_mapper_VolGroup00_LogVol04.warning 90
    df._dev_mapper_VolGroup01_LogVol00.warning 90
    df._dev_mapper_VolGroup02_LogVol00.warning 90
    df._dev_mapper_VolGroup03_LogVol00.warning 90
    df._dev_cciss_c0d0p1.critical 95
    df._dev_mapper_VolGroup00_LogVol00.critical 95
    df._dev_mapper_VolGroup00_LogVol01.critical 95
    df._dev_mapper_VolGroup00_LogVol02.critical 95
    df._dev_mapper_VolGroup00_LogVol04.critical 95
    df._dev_mapper_VolGroup01_LogVol00.critical 95
    df._dev_mapper_VolGroup02_LogVol00.critical 95
    df._dev_mapper_VolGroup03_LogVol00.critical 95

    [location1-ts1.otherdomain.tld]
    address 169.254.30.90
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used

    [location1m-fc1.otherdomain.tld]
    address 169.254.30.94
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used

    [location1-mfc2.otherdomain.tld]
    address 169.254.30.96
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used

    [location1-ts2.otherdomain.tld]
    address 169.254.30.92
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used
    memory.apps.label usage
    memory.unused.label pagefile

    [location2-wms1.otherdomain.tld]
    address 169.254.20.20
    use_node_name yes
    load.load.warning 15
    load.load.critical 30
    memory.apps.warning 6442450944
    memory.committed.warning 8589934592
    memory.committed.critical 17179869184
    df._dev_cciss_c0d0p1.warning 75
    df._dev_mapper_VolGroup00_LogVol00.warning 90
    df._dev_mapper_VolGroup00_LogVol01.warning 90
    df._dev_mapper_VolGroup00_LogVol02.warning 90
    df._dev_mapper_VolGroup00_LogVol04.warning 90
    df._dev_mapper_VolGroup01_LogVol00.warning 90
    df._dev_mapper_VolGroup02_LogVol00.warning 90
    df._dev_mapper_VolGroup03_LogVol00.warning 90
    load2.graph_title Loads-WMS2
    load2.graph_order location1wms2=location1wms2.otherdomain.tld:load.load location2-wms2=location2-wms2.otherdomain.tld:load.load
    load3.graph_title Loads on top of each other
    load3.dummy_field.stack location1wms1=location1wms1.otherdomain.tld:load.load location2-wms1=location2-wms1.otherdomain.tld:load.load location1wms2=location1wms2.otherdomain.tld:load.load location2-wms2=location2-wms2.otherdomain.tld:load.load
    load3.dummy_field.draw AREA # We want area instead the default LINE2.
    load3.dummy_field.label dummy # This is needed. Silly, really.
    memory1.graph_title Memory SWAP WMS
    memory1.graph_order location1wms1=location1wms1.otherdomain.tld:memory.swap location2-wms1=location2-wms1.otherdomain.tld:memory.swap location1wms2=location1wms2.otherdomain.tld:memory.swap location2-wms2=location2-wms2.otherdomain.tld:memory.swap
    memory2.graph_title Memory Committed WMS
    memory2.graph_order location1wms1=location1wms1.otherdomain.tld:memory.committed location2-wms1=location2-wms1.otherdomain.tld:memory.committed location1wms2=location1wms2.otherdomain.tld:memory.committed location2-wms2=location2-wms2.otherdomain.tld:memory.committed
    # load3.graph_title Loads summarised
    # load3.combined_loads.sum location1wms1.otherdomain.tld:load.load location2-wms1.otherdomain.tld:load.load
    # load3.combined_loads.label Combined loads # Must be set, as this is
    # # not a dummy field!

    [ip-wms1.domain.tld]
    address 127.0.0.1
    use_node_name yes
    load.load.warning 15
    load.load.critical 30
    memory.apps.warning 6442450944
    memory.committed.warning 8589934592
    memory.committed.critical 17179869184

    [ip-wms2.domain.tld]
    address 192.168.101.51
    use_node_name yes
    load.load.warning 15
    load.load.critical 30
    memory.apps.warning 6442450944
    memory.committed.warning 8589934592
    memory.committed.critical 17179869184

    [windows-pc.domain.tld]
    address 192.168.101.26
    use_node_name yes
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used

  14. Long check intervals

  15. Host centric world view

  16. Another source of truth

  17. Not just me

  18. A Tweet

  19. A blog post

  20. An IRC room ##monitoringsucks

  21. A Twitter hashtag #monitoringsucks

  22. A GitHub repository

  23. What we want (really really want)

  24. Metrics and graphs

  25. System AND business data

  26. Log streams

  27. APIs

    {
      "service_key": "e93facc04764012d7bfb002500d5d1a6",
      "incident_key": "srv01/HTTP",
      "event_type": "trigger",
      "description": "FAILURE on machine srv01.acme.com",
      "details": {
        "ping time": "1500ms",
        "load avg": 0.75
      }
    }
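
As a rough illustration of the API-driven alerting that the event JSON on the APIs slide points at, the sketch below builds the same trigger event as a Python dictionary and serialises it to JSON. The `build_trigger_event` helper is hypothetical, and the sketch deliberately stops before sending anything, because the slide gives neither an endpoint nor a transport.

```python
import json


def build_trigger_event(service_key, incident_key, description, details):
    """Return a trigger event shaped like the JSON on the slide (hypothetical helper)."""
    return {
        "service_key": service_key,
        "incident_key": incident_key,
        "event_type": "trigger",
        "description": description,
        "details": details,
    }


# Values copied from the slide; nothing is sent over the network.
event = build_trigger_event(
    service_key="e93facc04764012d7bfb002500d5d1a6",
    incident_key="srv01/HTTP",
    description="FAILURE on machine srv01.acme.com",
    details={"ping time": "1500ms", "load avg": 0.75},
)
payload = json.dumps(event, indent=2)
print(payload)
```

Keeping the payload construction separate from delivery makes it easy to unit test the alert content without a live service.
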
  28. Alerts

  29. Dashboards

  30. Goings on (just a quick sample)

  31. Naming things (is hard)

    - Metric: a numeric or boolean data point
    - Context: metadata about a metric
    - Resource: the source of a metric
    - Event: metric combined with context
    - Action: a response to a given metric
    - Collection: getting the metrics
    - Event processing: taking action
    - Presentation: graphs, emails, dashboards, etc.
    - Analytics: correlation
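
One way to make the naming vocabulary from the slide concrete is as a few small data types. This is only a sketch under assumptions: the class and field names, the `process` function, and the threshold value are all invented for illustration and do not come from the talk.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Metric:
    """A numeric or boolean data point."""
    name: str
    value: float


@dataclass(frozen=True)
class Event:
    """A metric combined with its context (metadata) and resource (source)."""
    metric: Metric
    resource: str
    context: dict = field(default_factory=dict)


def process(event, threshold):
    """Event processing: choose an action in response to a given event."""
    return "alert" if event.metric.value > threshold else "ignore"


# Hypothetical sample data.
event = Event(Metric("load.avg", 0.75), resource="srv01.acme.com",
              context={"unit": "1 minute average"})
print(process(event, threshold=0.5))
```

Collection, presentation and analytics would sit on either side of this: something produces `Event` objects, and something else graphs or correlates them.
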
  32. Sharing setups

  33. Low latency message based tools

  34. Monitoring == Testing

  35. Monitoring unit tests

    Scenario: check that calendars works correctly
      Given I am testing "calendars"
      Then I should be able to visit:
        | Path                       |
        | /when-do-the-clocks-change |
        | /bank-holidays             |

  36. For one of my colleagues, Mat

    Scenario: check we don't get results for cheese
      Given I am testing "search"
      When I search for "cheese"
      Then I should receive no results
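
The "monitoring as testing" idea in the scenarios can also be sketched without a Cucumber toolchain. The Python below is an assumption-laden illustration, not the step definitions used in the talk: `check_paths` and `fetch_status` are hypothetical names, and the base URL is a placeholder. The fetch function is injectable so the check can be exercised without network access.

```python
from urllib.request import urlopen


def fetch_status(url):
    """Return the HTTP status code for url (performs a real request)."""
    with urlopen(url, timeout=10) as response:
        return response.status


def check_paths(base_url, paths, fetch=fetch_status):
    """Return the paths that did not answer with HTTP 200."""
    failures = []
    for path in paths:
        try:
            ok = fetch(base_url + path) == 200
        except OSError:  # URLError and HTTPError are subclasses of OSError
            ok = False
        if not ok:
            failures.append(path)
    return failures


# Dry run with a fake fetcher; https://example.test is a placeholder host.
fake = {"https://example.test/bank-holidays": 200}
failures = check_paths(
    "https://example.test",
    ["/when-do-the-clocks-change", "/bank-holidays"],
    fetch=lambda url: fake.get(url, 404),
)
print(failures)
```

Run on a schedule against production, the same function doubles as a monitor; run in a build, it is a smoke test.
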
  37. monitors.txt

  38. JSON example

    "homepage performance": {
      "visit": "http://monitorstxt.org",
      "page": {
        "should have": {
          "download time": { "maximum": "0.5 seconds" }
        }
      },
      "assets": {
        "should have": {
          "download time": { "maximum": "2 seconds" }
        }
      }
    },

  39. Monitoring system agnostic

    "homepage performance": {
      "visit": "http://monitorstxt.org",
      "page": {
        "should have": {
          "download time": { "maximum": "0.5 seconds" }
        }
      },
      "assets": {
        "should have": {
          "download time": { "maximum": "2 seconds" }
        }
      }
    },
  40. Open Source

  41. Graphite

  42. GDash

  43. Statsd

  44. Ruby counter

    @statsd = Statsd.new('statsd.example.com', 1234)
    @statsd.increment('foo.bar')

  45. Java counter

    StatsdClient client = new StatsdClient("host", 1234);
    client.increment("foo.bar");
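
Under the client libraries, a StatsD counter increment is just a small text datagram of the form `bucket:value|c` sent over UDP. The Python sketch below shows that wire format; the function names are hypothetical, and the host and port defaults (8125 is the usual StatsD port) are assumptions rather than values from the slides.

```python
import socket


def counter_packet(bucket, delta=1):
    """Format a StatsD counter update, for example foo.bar:1|c."""
    return "{}:{}|c".format(bucket, delta).encode("ascii")


def increment(bucket, host="127.0.0.1", port=8125):
    """Send a single counter increment over UDP and return the bytes sent."""
    packet = counter_packet(bucket)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))
    return packet


# Only format the packet here; call increment() to actually send it.
print(counter_packet("foo.bar"))
```

UDP is fire-and-forget, which is why instrumenting application code this way adds very little overhead or risk.
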
  46. Logster

  47. Point at log files

    ⚡ logster --output=ganglia \
        NginxLogster \
        /var/log/nginx/access.log

  48. Get metrics in Ganglia or Graphite
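
The general idea of turning log lines into metrics can be sketched in a few lines of Python. This is not Logster's parser API; it is a standalone illustration that assumes the common "combined" access log format, and the `status_classes` function and the sample lines are invented.

```python
import re
from collections import Counter

# Matches the quoted request followed by the three-digit status code.
STATUS_RE = re.compile(r'"\S+ \S+ \S+" (?P<status>\d{3}) ')


def status_classes(lines):
    """Count responses per status class, using keys such as 'http_5xx'."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts["http_{}xx".format(match.group("status")[0])] += 1
    return counts


# Made-up sample lines in combined log format.
sample = [
    '203.0.113.7 - - [21/Apr/2012:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "curl"',
    '203.0.113.7 - - [21/Apr/2012:10:00:01 +0000] "GET /search HTTP/1.1" 502 0 "-" "curl"',
    '203.0.113.7 - - [21/Apr/2012:10:00:02 +0000] "GET /bank-holidays HTTP/1.1" 200 900 "-" "curl"',
]
counts = status_classes(sample)
error_rate = counts["http_5xx"] / len(sample)
print(dict(counts), round(error_rate, 2))
```

A rate like `error_rate` is the sort of value that can then be pushed to Ganglia or Graphite and alerted on.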

  49. Graylog2

  50. Logstash

  51. Riemann

  52. Clojure stream parsing

    (streams prn partial prn "this event is interesting:")
    (where (state "error")
      (fn [event] (info event))
    )
  53. Configuration management

  54. Or Configuration management

  55. Automate checks

    @@nagios_service { "check_nginx_5xx_on_${::hostname}":
      use                 => 'generic-service',
      check_command       => 'check_ganglia_metric!nginx_http_5xx!0.05!0.1',
      service_description => 'check nginx error rate',
      host_name           => "${::govuk_class}-${::hostname}",
      target              => '/etc/nagios3/conf.d/nagios_service.cfg',
    }

  56. Defined outside monitoring systems

    @@nagios_service { "check_nginx_5xx_on_${::hostname}":
      use                 => 'generic-service',
      check_command       => 'check_ganglia_metric!nginx_http_5xx!0.05!0.1',
      service_description => 'check nginx error rate',
      host_name           => "${::govuk_class}-${::hostname}",
      target              => '/etc/nagios3/conf.d/nagios_service.cfg',
    }

  57. Automate graylog collection

    graylogtail::collect { 'graylogtail-access':
      log_file => '/var/log/nginx/access.log',
      facility => $name,
    }

  58. Automate logster collection

    cron { 'logster-nginx':
      command => '/usr/sbin/logster NginxLogster /var/log/nginx/access.log',
      user    => root,
      minute  => '*/2'
    }

  59. SAAS (pay nice people for software)

  60. New Relic

  61. New Relic dashboard

  62. Librato Metrics

  63. Librato Graphs

  64. Splunk

  65. Splunk dashboard

  66. PagerDuty

  67. PagerDuty scheduler

  68. Boundary

  69. Network traffic analysis

  70. Takeaway (if all you remember is)

  71. Admit we have a problem

  72. Help build stuff

  73. Lots of links

    - https://github.com/monitoringsucks/
    - http://graylog2.org
    - http://logstash.net
    - https://github.com/etsy/logster
    - https://github.com/etsy/statsd
    - http://aphyr.github.com/riemann/
    - http://graphite.wikidot.com/
    - http://monitorstxt.org/
    - http://auxesis.github.com/cucumber-nagios/

  74. The End

  75. One more thing http://www.flickr.com/photos/benterrett/6852348725/

  76. Questions? http://flickr.com/photos/psd/102332391/