Monitoring Sucks

A talk from the #devslovebacon conference on why monitoring sucks and what people are doing about it.

Gareth Rushgrove

April 21, 2012

Transcript

  1. Monitoring Sucks. And what you can do about it. Bacon, 21st April 2012. gareth rushgrove | morethanseven.net http://www.flickr.com/photos/map408/2412123378
  2. Me

  3. Gareth Rushgrove @garethr

  4. Blog at morethanseven.net

  5. Curate devopsweekly.com

  6. Work at UK Government Digital Service

  7. Serious Government Business

  8. The talk

  9. I want to convince you that...

    - Monitoring running applications is interesting
    - Most monitoring tools suck

    http://www.flickr.com/photos/iancarroll/5027441664
  10. ...and

    - Monitoring running applications is interesting
    - Most monitoring tools suck

    http://www.flickr.com/photos/iancarroll/5027441664
  11. What we have

  12. Clunky user interfaces

  13. Verbose configuration

    # Example configuration file for Munin, generated by 'make build'
    # The next three variables specifies where the location of the RRD
    # databases, the HTML output, and the logs, severally. They all
    # must be writable by the user running munin-cron.
    dbdir /var/lib/munin
    htmldir /var/www/munin
    logdir /var/log/munin
    rundir /var/run/munin
    # Where to look for the HTML templates
    tmpldir /etc/munin/templates
    # Make graphs show values per minute instead of per second
    #graph_period minute
    # Drop somejuser@fnord.comm and anotheruser@blibb.comm an email everytime
    # something changes (OK -> WARNING, CRITICAL -> OK, etc)
    contact.yourname.command mail -s "MUNIN - [${var:host}] ~ ${var:graph_title} ~ warnings: ${loop<,>:wfields ${var:label}=${var:value}} ~ criticals: ${loop<,> :cfields ${var:label}=${var:value}}" your.email@domain.tld
    #
    #
    # For those with Nagios, the following might come in handy. In addition,
    # the services must be defined in the Nagios server as well.
    #contact.nagios.command /usr/sbin/send_nsca -H nagios.host.com -c /etc/send_nsca.cfg
    # a simple host tree
    [location1-wms1.otherdomain.tld]
    address 169.254.30.86
    use_node_name yes
    load.load.warning 15
    load.load.critical 30
    memory.apps.warning 6442450944
    memory.committed.warning 8589934592
    # memory.committed.warn 8589934592
    memory.committed.critical 17179869184
    df._dev_cciss_c0d0p1.warning 75
    df._dev_mapper_VolGroup00_LogVol00.warning 90
    df._dev_mapper_VolGroup00_LogVol01.warning 90
    df._dev_mapper_VolGroup00_LogVol02.warning 90
    df._dev_mapper_VolGroup00_LogVol04.warning 90
    df._dev_mapper_VolGroup01_LogVol00.warning 90
    df._dev_mapper_VolGroup02_LogVol00.warning 90
    df._dev_mapper_VolGroup03_LogVol00.warning 90
    df._dev_cciss_c0d0p1.critical 95
    df._dev_mapper_VolGroup00_LogVol00.critical 95
    df._dev_mapper_VolGroup00_LogVol01.critical 95
    df._dev_mapper_VolGroup00_LogVol02.critical 95
    df._dev_mapper_VolGroup00_LogVol04.critical 95
    df._dev_mapper_VolGroup01_LogVol00.critical 95
    df._dev_mapper_VolGroup02_LogVol00.critical 95
    df._dev_mapper_VolGroup03_LogVol00.critical 95
    [location1-wms2.otherdomain.tld]
    address 169.254.30.88
    use_node_name yes
    load.load.warning 15
    load.load.critical 30
    memory.apps.warning 6442450944
    memory.committed.warning 8589934592
    memory.committed.critical 17179869184
    df._dev_cciss_c0d0p1.critical 95
    df._dev_mapper_VolGroup00_LogVol00.critical 95
    df._dev_mapper_VolGroup00_LogVol01.critical 95
    df._dev_mapper_VolGroup00_LogVol02.critical 95
    df._dev_mapper_VolGroup00_LogVol04.critical 95
    df._dev_mapper_VolGroup01_LogVol00.critical 95
    df._dev_mapper_VolGroup02_LogVol00.critical 95
    df._dev_mapper_VolGroup03_LogVol00.critical 95
    [location2-wms2.otherdomain.tld]
    address 169.254.20.22
    use_node_name yes
    load.load.warning 15
    load.load.critical 30
    memory.apps.warning 6442450944
    memory.committed.warning 8589934592
    memory.committed.critical 17179869184
    df._dev_cciss_c0d0p1.warning 75
    df._dev_mapper_VolGroup00_LogVol00.warning 90
    df._dev_mapper_VolGroup00_LogVol01.warning 90
    df._dev_mapper_VolGroup00_LogVol02.warning 90
    df._dev_mapper_VolGroup00_LogVol04.warning 90
    df._dev_mapper_VolGroup01_LogVol00.warning 90
    df._dev_mapper_VolGroup02_LogVol00.warning 90
    df._dev_mapper_VolGroup03_LogVol00.warning 90
    df._dev_cciss_c0d0p1.critical 95
    df._dev_mapper_VolGroup00_LogVol00.critical 95
    df._dev_mapper_VolGroup00_LogVol01.critical 95
    df._dev_mapper_VolGroup00_LogVol02.critical 95
    df._dev_mapper_VolGroup00_LogVol04.critical 95
    df._dev_mapper_VolGroup01_LogVol00.critical 95
    df._dev_mapper_VolGroup02_LogVol00.critical 95
    df._dev_mapper_VolGroup03_LogVol00.critical 95
    [location2-ts1.otherdomain.tld]
    address 169.254.20.24
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used
    [location2-ts2.otherdomain.tld]
    address 169.254.20.26
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used
    [location2-mfc1.otherdomain.tld]
    address 169.254.20.28
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used
    [location2-mfc2.otherdomain.tld]
    address 169.254.20.30
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used
    [otherdomain.tld;Totals]
    update no
    load1.graph_title Loads-WMS1
    load1.graph_order location1wms1=location1wms1.otherdomain.tld:load.load location2-wms1=location2-wms1.otherdomain.tld:load.load
    df._dev_cciss_c0d0p1.warning 75
    df._dev_mapper_VolGroup00_LogVol00.warning 90
    df._dev_mapper_VolGroup00_LogVol01.warning 90
    df._dev_mapper_VolGroup00_LogVol02.warning 90
    df._dev_mapper_VolGroup00_LogVol04.warning 90
    df._dev_mapper_VolGroup01_LogVol00.warning 90
    df._dev_mapper_VolGroup02_LogVol00.warning 90
    df._dev_mapper_VolGroup03_LogVol00.warning 90
    df._dev_cciss_c0d0p1.critical 95
    df._dev_mapper_VolGroup00_LogVol00.critical 95
    df._dev_mapper_VolGroup00_LogVol01.critical 95
    df._dev_mapper_VolGroup00_LogVol02.critical 95
    df._dev_mapper_VolGroup00_LogVol04.critical 95
    df._dev_mapper_VolGroup01_LogVol00.critical 95
    df._dev_mapper_VolGroup02_LogVol00.critical 95
    df._dev_mapper_VolGroup03_LogVol00.critical 95
    [location1-ts1.otherdomain.tld]
    address 169.254.30.90
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used
    [location1m-fc1.otherdomain.tld]
    address 169.254.30.94
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used
    [location1-mfc2.otherdomain.tld]
    address 169.254.30.96
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used
    [location1-ts2.otherdomain.tld]
    address 169.254.30.92
    use_node_name no
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used
    memory.apps.label usage
    memory.unused.label pagefile
    [location2-wms1.otherdomain.tld]
    address 169.254.20.20
    use_node_name yes
    load.load.warning 15
    load.load.critical 30
    memory.apps.warning 6442450944
    memory.committed.warning 8589934592
    memory.committed.critical 17179869184
    df._dev_cciss_c0d0p1.warning 75
    df._dev_mapper_VolGroup00_LogVol00.warning 90
    df._dev_mapper_VolGroup00_LogVol01.warning 90
    df._dev_mapper_VolGroup00_LogVol02.warning 90
    df._dev_mapper_VolGroup00_LogVol04.warning 90
    df._dev_mapper_VolGroup01_LogVol00.warning 90
    df._dev_mapper_VolGroup02_LogVol00.warning 90
    df._dev_mapper_VolGroup03_LogVol00.warning 90
    load2.graph_title Loads-WMS2
    load2.graph_order location1wms2=location1wms2.otherdomain.tld:load.load location2-wms2=location2-wms2.otherdomain.tld:load.load
    load3.graph_title Loads on top of each other
    load3.dummy_field.stack location1wms1=location1wms1.otherdomain.tld:load.load location2-wms1=location2-wms1.otherdomain.tld:load.load location1wms2=location1wms2.otherdomain.tld:load.load location2-wms2=location2-wms2.otherdomain.tld:load.load
    load3.dummy_field.draw AREA # We want area instead the default LINE2.
    load3.dummy_field.label dummy # This is needed. Silly, really.
    memory1.graph_title Memory SWAP WMS
    memory1.graph_order location1wms1=location1wms1.otherdomain.tld:memory.swap location2-wms1=location2-wms1.otherdomain.tld:memory.swap location1wms2=location1wms2.otherdomain.tld:memory.swap location2-wms2=location2-wms2.otherdomain.tld:memory.swap
    memory2.graph_title Memory Committed WMS
    memory2.graph_order location1wms1=location1wms1.otherdomain.tld:memory.committed location2-wms1=location2-wms1.otherdomain.tld:memory.committed location1wms2=location1wms2.otherdomain.tld:memory.committed location2-wms2=location2-wms2.otherdomain.tld:memory.committed
    # load3.graph_title Loads summarised
    # load3.combined_loads.sum location1wms1.otherdomain.tld:load.load location2-wms1.otherdomain.tld:load.load
    # load3.combined_loads.label Combined loads # Must be set, as this is
    # # not a dummy field!
    [ip-wms1.domain.tld]
    address 127.0.0.1
    use_node_name yes
    load.load.warning 15
    load.load.critical 30
    memory.apps.warning 6442450944
    memory.committed.warning 8589934592
    memory.committed.critical 17179869184
    [ip-wms2.domain.tld]
    address 192.168.101.51
    use_node_name yes
    load.load.warning 15
    load.load.critical 30
    memory.apps.warning 6442450944
    memory.committed.warning 8589934592
    memory.committed.critical 17179869184
    [windows-pc.domain.tld]
    address 192.168.101.26
    use_node_name yes
    memory.swap.label swap
    memory.swap.draw STACK
    memory.swap.info Swap memory used
  14. Long check intervals

  15. Host centric world view

  16. Another source of truth

  17. Not just me

  18. A Tweet

  19. A blog post

  20. An IRC room ##monitoringsucks

  21. A Twitter hashtag #monitoringsucks

  22. A GitHub repository

  23. What we want (really really want)

  24. Metrics and graphs

  25. System AND business data

  26. Log streams

  27. APIs

    {
      "service_key": "e93facc04764012d7bfb002500d5d1a6",
      "incident_key": "srv01/HTTP",
      "event_type": "trigger",
      "description": "FAILURE on machine srv01.acme.com",
      "details": {
        "ping time": "1500ms",
        "load avg": 0.75
      }
    }
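The payload on the APIs slide is plain JSON, so a check script can produce one with very little code. A minimal Python sketch, assuming a hypothetical helper named `build_trigger_event`; only the field names and sample values come from the slide, and nothing here sends the event anywhere.

```python
import json


def build_trigger_event(service_key, incident_key, description, details):
    """Build a trigger event shaped like the payload on the slide.

    build_trigger_event is a hypothetical helper for illustration; only the
    field names are taken from the slide.
    """
    return {
        "service_key": service_key,
        "incident_key": incident_key,
        "event_type": "trigger",
        "description": description,
        "details": details,
    }


event = build_trigger_event(
    service_key="e93facc04764012d7bfb002500d5d1a6",
    incident_key="srv01/HTTP",
    description="FAILURE on machine srv01.acme.com",
    details={"ping time": "1500ms", "load avg": 0.75},
)
payload = json.dumps(event, indent=2)
print(payload)
```

Keeping the payload as a plain dictionary until the last moment makes it easy to test the check logic without talking to any alerting service.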
  28. Alerts

  29. Dashboards

  30. Goings on (just a quick sample)

  31. Naming things (is hard)

    - Metric: a numeric or boolean data point
    - Context: metadata about a metric
    - Resource: the source of a metric
    - Event: metric combined with context
    - Action: a response to a given metric
    - Collection: getting the metrics
    - Event processing: taking action
    - Presentation: graphs, emails, dashboards, etc.
    - Analytics: correlation
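One way to make the vocabulary on this slide concrete is to write it down as types. The following is a rough Python sketch, not anything shown in the talk: the class names follow the slide's terms, while the field names, the `process` function and the threshold rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union


@dataclass
class Metric:
    """A numeric or boolean data point."""
    name: str
    value: Union[float, bool]


@dataclass
class Event:
    """A metric combined with its context (metadata) and resource (source)."""
    metric: Metric
    resource: str
    context: Dict[str, str] = field(default_factory=dict)


# An action is a response to a given metric; here, any callable taking an event.
Action = Callable[[Event], None]


def process(events: List[Event], threshold: float, action: Action) -> int:
    """Event processing: run the action for every event above the threshold."""
    fired = 0
    for event in events:
        if event.metric.value > threshold:
            action(event)
            fired += 1
    return fired


alerts: List[str] = []
events = [
    Event(Metric("load", 0.75), resource="srv01", context={"unit": "load avg"}),
    Event(Metric("load", 31.0), resource="srv02", context={"unit": "load avg"}),
]
fired = process(events, threshold=30, action=lambda e: alerts.append(e.resource))
print(fired, alerts)
```

Collection, presentation and analytics are left out on purpose; in this picture they would be whatever produces the list of events and whatever consumes the results.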
  32. Sharing setups

  33. Low latency message based tools

  34. Monitoring == Testing

  35. Monitoring unit tests

    Scenario: check that calendars works correctly
      Given I am testing "calendars"
      Then I should be able to visit:
        | Path                       |
        | /when-do-the-clocks-change |
        | /bank-holidays             |
  36. For one of my colleagues, Mat

    Scenario: check we don't get results for cheese
      Given I am testing "search"
      When I search for "cheese"
      Then I should receive no results
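The same "monitoring as tests" idea can be sketched without Cucumber. The Python below is an illustrative assumption rather than the setup from the talk: `check_paths` and `fake_fetch_status` are hypothetical names, `example.test` is a placeholder host, and the HTTP client is passed in so the example runs offline.

```python
from typing import Callable, Dict, List


def check_paths(base_url: str, paths: List[str],
                fetch_status: Callable[[str], int]) -> Dict[str, bool]:
    """Return, for each path, whether fetching it gave HTTP 200."""
    results = {}
    for path in paths:
        results[path] = fetch_status(base_url + path) == 200
    return results


def fake_fetch_status(url: str) -> int:
    """Stand-in for a real HTTP client so the sketch runs offline."""
    known = {
        "https://example.test/when-do-the-clocks-change": 200,
        "https://example.test/bank-holidays": 200,
    }
    return known.get(url, 404)


results = check_paths(
    "https://example.test",
    ["/when-do-the-clocks-change", "/bank-holidays", "/missing"],
    fake_fetch_status,
)
failed = [path for path, ok in results.items() if not ok]
print(failed)
```

In a real check, `fetch_status` would wrap an HTTP request and the list of failed paths would feed whatever alerting is in place.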
  37. monitors.txt
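Because a monitors.txt entry is only data, any monitoring system can read it and decide how to measure. A minimal Python sketch of such a consumer, under stated assumptions: the entry is the one from the talk's JSON example wrapped in an outer object so that it parses, `parse_seconds` and `evaluate` are hypothetical helpers, and the measured timings are made up.

```python
import json

# The entry from the talk's JSON example, wrapped in braces (an assumption)
# so that it is a complete JSON document.
MONITORS = json.loads("""
{
  "homepage performance": {
    "visit": "http://monitorstxt.org",
    "page": {"should have": {"download time": {"maximum": "0.5 seconds"}}},
    "assets": {"should have": {"download time": {"maximum": "2 seconds"}}}
  }
}
""")


def parse_seconds(text):
    """Turn a string such as '0.5 seconds' into the float 0.5."""
    number, unit = text.split()
    if unit not in ("second", "seconds"):
        raise ValueError("unsupported unit: " + unit)
    return float(number)


def evaluate(monitor, measured):
    """Return the targets whose measured download time exceeds the maximum.

    `measured` maps "page" and "assets" to timings in seconds; how they are
    collected is left to whichever monitoring system reads the file.
    """
    failures = []
    for target in ("page", "assets"):
        limit = parse_seconds(
            monitor[target]["should have"]["download time"]["maximum"])
        if measured[target] > limit:
            failures.append(target)
    return failures


monitor = MONITORS["homepage performance"]
print(evaluate(monitor, {"page": 0.3, "assets": 2.4}))
```

The thresholds live in the file and the measurement lives in the tool, which is the sense in which the format stays monitoring system agnostic.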

  38. JSON example

    "homepage performance": {
      "visit": "http://monitorstxt.org",
      "page": {
        "should have": {
          "download time": { "maximum": "0.5 seconds" }
        }
      },
      "assets": {
        "should have": {
          "download time": { "maximum": "2 seconds" }
        }
      }
    },
  39. Monitoring system agnostic

    "homepage performance": {
      "visit": "http://monitorstxt.org",
      "page": {
        "should have": {
          "download time": { "maximum": "0.5 seconds" }
        }
      },
      "assets": {
        "should have": {
          "download time": { "maximum": "2 seconds" }
        }
      }
    },
  40. Open Source

  41. Graphite

  42. GDash

  43. Statsd

  44. Ruby counter

    @statsd = Statsd.new('statsd.example.com', 1234)
    @statsd.increment('foo.bar')
  45. Java counter

    StatsdClient client = new StatsdClient("host", 1234);
    client.increment("foo.bar");
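Both counters above come down to one small UDP datagram, which is why StatsD clients exist for so many languages. A minimal Python sketch of that datagram, using StatsD's `name:value|c` counter format; `counter_packet` and `increment` are hypothetical helper names, and 8125 is the usual StatsD port rather than the 1234 used on the slides.

```python
import socket


def counter_packet(name: str, count: int = 1) -> bytes:
    """Format a StatsD counter datagram, e.g. b'foo.bar:1|c'."""
    return f"{name}:{count}|c".encode("ascii")


def increment(name: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one counter increment over UDP, fire-and-forget."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(counter_packet(name), (host, port))
    finally:
        sock.close()


print(counter_packet("foo.bar"))
```

Calling `increment("foo.bar", "statsd.example.com", 1234)` would mirror the Ruby and Java slides. Because the transport is UDP, the application neither waits for a reply nor learns whether the datagram was received.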
  46. Logster

  47. Point at log files

    ⚡ logster --output=ganglia \
        NginxLogster \
        /var/log/nginx/access.log
  48. Get metrics in Ganglia or Graphite
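To show the kind of work a log-to-metrics tool does, here is a rough Python sketch that counts HTTP status classes in access log lines and renders them in Graphite's plaintext `path value timestamp` format. It does not use logster's own parser API: `count_status_classes`, `graphite_lines`, the `nginx` prefix and the sample log lines are all assumptions made for illustration.

```python
import re
from collections import Counter

# Matches the status code that follows the quoted request in the common
# "combined" access log layout: ... "GET / HTTP/1.1" 200 612 ...
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_status_classes(lines):
    """Count responses per status class (2xx, 3xx, 4xx, 5xx)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


def graphite_lines(counts, prefix, timestamp):
    """Render counts in Graphite's plaintext format: 'path value timestamp'."""
    return [f"{prefix}.http_{cls} {n} {timestamp}"
            for cls, n in sorted(counts.items())]


sample = [
    '127.0.0.1 - - [21/Apr/2012:10:00:00 +0000] "GET / HTTP/1.1" 200 612 "-" "curl"',
    '127.0.0.1 - - [21/Apr/2012:10:00:01 +0000] "GET /x HTTP/1.1" 404 169 "-" "curl"',
    '127.0.0.1 - - [21/Apr/2012:10:00:02 +0000] "GET /y HTTP/1.1" 502 172 "-" "curl"',
]
counts = count_status_classes(sample)
for line in graphite_lines(counts, "nginx", 1335002400):
    print(line)
```

Run from cron every couple of minutes against the newly written part of a log, output like this is enough to graph an error rate.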

  49. Graylog2

  50. Logstash

  51. Riemann

  52. Clojure stream parsing

    (streams
      prn
      (partial prn "this event is interesting:")
      (where (state "error")
        (fn [event] (info event))))
  53. Configuration management

  54. Or Configuration management

  55. Automate checks

    @@nagios_service { "check_nginx_5xx_on_${::hostname}":
      use                 => 'generic-service',
      check_command       => 'check_ganglia_metric!nginx_http_5xx!0.05!0.1',
      service_description => 'check nginx error rate',
      host_name           => "${::govuk_class}-${::hostname}",
      target              => '/etc/nagios3/conf.d/nagios_service.cfg',
    }
  56. Defined outside monitoring systems

    @@nagios_service { "check_nginx_5xx_on_${::hostname}":
      use                 => 'generic-service',
      check_command       => 'check_ganglia_metric!nginx_http_5xx!0.05!0.1',
      service_description => 'check nginx error rate',
      host_name           => "${::govuk_class}-${::hostname}",
      target              => '/etc/nagios3/conf.d/nagios_service.cfg',
    }
  57. Automate graylog collection

    graylogtail::collect { 'graylogtail-access':
      log_file => '/var/log/nginx/access.log',
      facility => $name,
    }
  58. Automate logster collection

    cron { 'logster-nginx':
      command => '/usr/sbin/logster NginxLogster /var/log/nginx/access.log',
      user    => root,
      minute  => '*/2'
    }
  59. SAAS (pay nice people for software)

  60. New Relic

  61. New Relic dashboard

  62. Librato Metrics

  63. Librato Graphs

  64. Splunk

  65. Splunk dashboard

  66. PagerDuty

  67. PagerDuty scheduler

  68. Boundary

  69. Network traffic analysis

  70. Takeaway (if all you remember is)

  71. Admit we have a problem

  72. Help build stuff

  73. Lots of links

    - https://github.com/monitoringsucks/
    - http://graylog2.org
    - http://logstash.net
    - https://github.com/etsy/logster
    - https://github.com/etsy/statsd
    - http://aphyr.github.com/riemann/
    - http://graphite.wikidot.com/
    - http://monitorstxt.org/
    - http://auxesis.github.com/cucumber-nagios/
  74. The End

  75. One more thing http://www.flickr.com/photos/benterrett/6852348725/

  76. Questions? http://flickr.com/photos/psd/102332391/