Practical examples with Sensu Monitoring framework

An overview of Sensu followed by some practical examples. Presented at CentOS Dojo Phoenix, May 2013.

joemiller

May 10, 2013
Transcript

  1. Who am I?

    • Joe Miller, Ops & Engineering @ Pantheon (getpantheon.com)
    • @miller_joe / github.com/joemiller
    • Sensu user since 2011
    • Native Phoenician! Living in SF since 2012
    • (it's good to be back in PHX, but we should do this in April next time)
  2. "The monitoring router"

    A framework for building a monitoring system:
    • scalable, malleable
    • compose a system to meet your needs
    • integrates well with CM (Chef, Puppet, etc.)
    • "cloud" friendly: new nodes automatically know what checks to run and become monitored nodes automatically
    • integrity: excellent test coverage (rspec)
  3. Benefits over other monitoring tools

    • problem: discovery is slow and expensive
    • problem: APIs for registering nodes/services are OK but still require extra work and moving parts
    • easier: a node connects to a message queue, subscribes to the topic "webserver", and is immediately monitored the same way as all other webservers
  4. Quick History

    • started as a part-time project by Sean Porter (@portertech) while working at Sonian
    • released as open source (MIT license) in late 2011
    • @portertech now working full-time on Sensu through Heavy Water Operations (as of Spring 2013)
    • commercial support available through Heavy Water Operations (http://sensuapp.org/support)
  5. Plugins

    • write in any language
    • uses the Nagios plugin protocol!
    • sensu-community-plugins (GitHub): ~130
    • Nagios plugins: 1 billion?
    • re-use Nagios plugins where you can!!
  6. "omnibus" package

    • one package to install: rpm, deb
    • "main" (stable) + "unstable" channels
    • installs everything it needs in /opt/sensu: ruby 1.9.x, gems, supporting tools
    • tested on: debian, ubuntu, fedora, centos/rhel
    • ships with: initd (default), upstart, systemd
    • 0.9.12+ ships with integrated runit (optional) (`sensu-ctl`)
  7. Configuration Management

    • leverage your existing CM to automatically attach checks to components in your infrastructure
    • community CM modules available:
      ◦ sensu-chef (LWRPs): https://github.com/sensu/sensu-chef
      ◦ chef-monitor (recipes built with sensu-chef LWRPs): https://github.com/portertech/chef-monitor
      ◦ sensu-puppet: https://github.com/sensu/sensu-puppet
    • ... or roll your own to fit your existing CM model
  8. sensu-admin

    • distributed as a separate project
    • stateful, more features: users, roles, scheduling downtimes, etc.
    • https://github.com/sensu/sensu-admin
    • https://github.com/sensu/sensu-admin-chef
  9. Checks

    • output data to STDOUT or STDERR
    • exit status indicates severity:
      ◦ 0 = OK
      ◦ 1 = WARNING
      ◦ 2 = CRITICAL
      ◦ 3+ = UNKNOWN or CUSTOM
    • Look familiar? Same protocol as Nagios checks (re-use Nagios checks whenever possible!)
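The exit-status convention above can be sketched as a small check. This is a minimal illustration and not a plugin from sensu-community-plugins: the function names, the disk-usage metric, and the 80/90 percent thresholds are all assumptions chosen for the example.

```python
import shutil
import sys

# Exit codes shared by Nagios-style and Sensu checks.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk_usage(percent_used, warn=80.0, crit=90.0):
    """Map a disk-usage percentage to (exit_status, message).

    The thresholds and message wording are illustrative assumptions.
    """
    if percent_used is None:
        return UNKNOWN, "UNKNOWN: could not read disk usage"
    if percent_used >= crit:
        return CRITICAL, f"CRITICAL: disk {percent_used:.0f}% full"
    if percent_used >= warn:
        return WARNING, f"WARNING: disk {percent_used:.0f}% full"
    return OK, f"OK: disk {percent_used:.0f}% full"


def main(argv):
    """Measure usage of the given path, print the message, return the status."""
    path = argv[1] if len(argv) > 1 else "/"
    try:
        usage = shutil.disk_usage(path)
        percent = 100.0 * usage.used / usage.total
    except OSError:
        percent = None
    status, message = evaluate_disk_usage(percent)
    print(message)  # check output goes to STDOUT
    return status


# When saved as a check script, finish the file with:
#     sys.exit(main(sys.argv))
```

The severity travels in the process exit status and the human-readable text in STDOUT, which is why existing Nagios plugins can be reused unchanged.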
  10. Events

    {
      client: {
        name: "host01",
        address: "10.2.1.11",
        subscriptions: ["application"],
        timestamp: 1364274222
      },
      check: {
        name: "frontend_http",
        command: "check_http -u http://example.com",
        subscribers: ["application"],
        handlers: ["pagerduty"],
        interval: 60,
        output: "HTTP/1.1 503 Service Temporarily Unavailable",
        status: 2,
        history: [0, 2],
        flapping: false,
        issued: 1364274239,
        executed: 1364274240
      },
      occurrences: 1,
      action: "create"
    }

    http://docs.sensuapp.org/0.9/events.html
  11. Handlers

    • take action on event data
    • events are passed to handlers on STDIN
    • only if the exit status is not OK
    • except metric checks, which are always passed to handlers
    • several types: [pipe, tcp, udp, amqp, set]
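A pipe handler, then, is any program that reads one JSON event from STDIN and acts on it. The sketch below assumes the event fields shown on the Events slide. The function names and the summary format are made up for illustration, and `io.StringIO` stands in for STDIN so the example runs on its own.

```python
import io
import json

# Names for the well-known status codes; anything else is treated as UNKNOWN.
SEVERITIES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def read_event(stream):
    """Parse the JSON event a pipe handler receives on its input stream."""
    return json.load(stream)


def summarize(event):
    """Build a one-line summary from the client and check sections."""
    check = event["check"]
    severity = SEVERITIES.get(check["status"], "UNKNOWN")
    return "{} {}/{}: {}".format(
        severity, event["client"]["name"], check["name"], check["output"]
    )


# Stand-in for STDIN; a real handler would call read_event(sys.stdin).
sample = io.StringIO(json.dumps({
    "client": {"name": "host01"},
    "check": {
        "name": "frontend_http",
        "status": 2,
        "output": "HTTP/1.1 503 Service Temporarily Unavailable",
    },
    "occurrences": 1,
    "action": "create",
}))
print(summarize(read_event(sample)))
```

What the handler does with the summary (page someone, open a ticket, write a log line) is up to you.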
  12. Check request (diagram)

    • components: sensu-server, rabbitmq, three sensu-clients, each with subscriptions: [ webservers ]
    • check request: check: "nginx_service", subscribers: webservers, interval: 60
    • check: check_http
  13. Check response (diagram)

    • components: sensu-server, rabbitmq, three sensu-clients, each with subscriptions: [ webservers ]
    • check response: status: "2", output: "CRITICAL: port 80 timed out"
    • handler: pagerduty.rb
    • handler: mail.rb
  14. Check subscriptions

    • checks are assigned to subscribers
    • clients have subscriptions: ['nginx', 'frontend', 'base']
    • sensu-server sends check requests to subscribers
    • works well when mapped to server roles in CM
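The effect of subscriptions can be modelled in a few lines. This sketch only shows the outcome (which clients end up running a check). In Sensu itself the delivery goes through the message queue, and the function name and sample data below are illustrative assumptions.

```python
def clients_for_check(check, clients):
    """Return the names of clients whose subscriptions overlap the check's subscribers."""
    wanted = set(check["subscribers"])
    return [c["name"] for c in clients if wanted & set(c["subscriptions"])]


# Hypothetical inventory: subscriptions mirror server roles from CM.
clients = [
    {"name": "web01", "subscriptions": ["nginx", "frontend", "base"]},
    {"name": "db01", "subscriptions": ["mysql", "base"]},
]

nginx_check = {"name": "nginx_service", "subscribers": ["nginx"], "interval": 60}
base_check = {"name": "disk_check", "subscribers": ["base"], "interval": 60}

print(clients_for_check(nginx_check, clients))
print(clients_for_check(base_check, clients))
```

A new node that comes up with the `base` subscription is covered by every check assigned to `base` without any registration step.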
  15. Standalone checks

    • scheduled and defined only on the client
    • execute on their own schedule; sensu-server does not request execution
    • easier to deploy in some CM systems
  16. Client socket

    • sensu-client listens on localhost:3030 (tcp and udp)
    • push events from apps, scripts, etc.

    echo '{ "handlers": ["default"], "name": "my_app_healthcheck", "output": "CRITICAL: MyApp is broke!", "status": 2 }' | nc -w1 127.0.0.1 3030
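The same push can be done from application code instead of `nc`. A minimal sketch, assuming a sensu-client is listening on TCP port 3030 on localhost as described above. `build_result` and `send_result` are hypothetical helper names, not part of any Sensu library.

```python
import json
import socket


def build_result(name, output, status, handlers=("default",)):
    """Serialize a check result in the shape used by the echo example."""
    return json.dumps({
        "handlers": list(handlers),
        "name": name,
        "output": output,
        "status": status,
    })


def send_result(payload, host="127.0.0.1", port=3030, timeout=1.0):
    """Write the JSON payload to the local client socket over TCP."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload.encode("utf-8"))


payload = build_result("my_app_healthcheck", "CRITICAL: MyApp is broke!", 2)
print(payload)
# send_result(payload)  # uncomment on a host running sensu-client
```

The payload reuses the check fields (`name`, `output`, `status`, `handlers`), so an application-generated result follows the same path as one produced by a scheduled check.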
  17. Config files

    • one or more config files
    • designed with CM in mind
    • /etc/sensu/config.json
    • /etc/sensu/conf.d/*.json
    • deep merging
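Deep merging is what lets each CM recipe drop its own file into conf.d without clobbering the others. The sketch below shows the general technique. The slides do not say how Sensu resolves conflicting scalars or arrays, so the "later value wins" rule here is an assumption, not Sensu's documented behaviour.

```python
def deep_merge(base, override):
    """Recursively merge two dicts; nested dicts are merged key by key.

    Assumption: for non-dict values (scalars, lists) the override wins.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Two hypothetical files, e.g. config.json and conf.d/chef.json.
main_config = {
    "client": {"name": "frontend01.dom.com"},
    "checks": {"disk_check": {"command": "check-disk.rb", "interval": 60}},
}
chef_config = {
    "checks": {"chef_client": {"command": "check-chef-client.rb", "interval": 60}},
}

merged = deep_merge(main_config, chef_config)
print(sorted(merged["checks"]))
```

A shallow merge would have replaced the whole `checks` section with the second file's copy. The deep merge keeps both checks.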
  18. Client config

    {
      "client": {
        "name": "frontend01.dom.com",
        "address": "188.12.11.2",
        "subscriptions": [ "production", "webserver" ]
      }
    }

    http://docs.sensuapp.org/0.9/clients.html
  19. Check config

    {
      "checks": {
        "chef_client": {
          "command": "check-chef-client.rb",
          "subscribers": [ "production" ],
          "interval": 60
        }
      }
    }

    http://docs.sensuapp.org/0.9/checks.html
  20. Check config (standalone)

    {
      "checks": {
        "chef_client": {
          "command": "check-chef-client.rb",
          "interval": 60,
          "standalone": true
        }
      }
    }

    http://docs.sensuapp.org/0.9/checks.html
  21. Check config (metric)

    {
      "checks": {
        "foo": {
          "handler": "graphite",
          "type": "metric",
          "command": "echo metric 42 `date +%s`",
          "standalone": true,
          "interval": 10
        }
      }
    }

    http://docs.sensuapp.org/0.9/checks.html
  22. Handler config (pipe)

    {
      "handlers": {
        "pagerduty": {
          "type": "pipe",
          "command": "pagerduty.rb"
        }
      }
    }

    http://docs.sensuapp.org/0.9/handlers.html
  23. Handler config (tcp)

    {
      "handlers": {
        "graphite": {
          "type": "tcp",
          "socket": {
            "host": "127.0.0.1",
            "port": 2003
          },
          "mutator": "only_check_output"
        }
      }
    }
  24. Check config

    • no additional check is needed: all sensu-clients send heartbeats to sensu-server
    • if a client disappears for 180 seconds, sensu-server generates a "keepalive" critical event
    • keepalive events are sent to the 'default' handler* (*configurable in 0.9.13+)
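The keepalive rule amounts to comparing the time since the last heartbeat against a threshold. A minimal sketch of that comparison follows. The function name is hypothetical, and only the 180-second figure comes from the slide.

```python
import time

KEEPALIVE_CRITICAL_SECONDS = 180  # threshold quoted on the slide


def keepalive_is_critical(last_keepalive, now=None,
                          threshold=KEEPALIVE_CRITICAL_SECONDS):
    """Return True when no heartbeat has arrived for at least `threshold` seconds.

    Both timestamps are Unix epoch seconds, like the event timestamps.
    """
    if now is None:
        now = time.time()
    return (now - last_keepalive) >= threshold


last_seen = 1364274222
print(keepalive_is_critical(last_seen, now=last_seen + 60))
print(keepalive_is_critical(last_seen, now=last_seen + 240))
```

Because the resulting event goes to an ordinary handler, a terminated cloud instance can trigger cleanup logic as well as a page.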
  25. Handler config - default

    {
      "handlers": {
        "default": {
          "type": "set",
          "handlers": [ "awsdecommission", "pagerduty" ]
        }
      }
    }
  26. Handler config - awsdecommission

    {
      "handlers": {
        "awsdecommission": {
          "type": "pipe",
          "command": "/etc/sensu/handlers/awsdecomm.rb",
          "severities": [ "ok", "warning", "critical" ]
        }
      }
    }
  27. awsdecommission config

    {
      "awsdecomm": {
        "chef_server_host": "127.0.0.1",
        "chef_server_port": "4000",
        "chef_server_version": "0.10.16.2",
        "chef_client_user": "sensu",
        "chef_client_key_dir": "/etc/sensu/conf.d/handlers/sensu.pem",
        "access_key_id": "ACCESS_KEY_ID",
        "secret_access_key": "SECRET_ACCESS_KEY",
        "mail_from": "[email protected]",
        "mail_to": "[email protected]",
        "smtp_address": "localhost",
        "smtp_port": "25",
        "smtp_domain": "localhost"
      }
    }
  28. In the beginning, maybe you start simple

    "handlers": {
      "severity1": {
        "type": "set",
        "handlers": [ "pagerduty" ]
    ...
    "checks": {
      "mysql_healthcheck": {
        ...
        "handlers": ["severity1"],
        ...
  29. Then you decide to copy all events to your log system

    "handlers": {
      "severity1": {
        "type": "set",
        "handlers": [ "pagerduty", "logstash" ]   <- add here
    ...
    "checks": {
      "mysql_healthcheck": {
        ...
        "handlers": ["severity1"],   <- no change to check configs
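The indirection a "set" handler provides can be sketched as a lookup that expands a set name into its member handlers. This is an illustration of the idea only and not Sensu's implementation. It expands one level, since the slides do not say whether sets may be nested, and the function name is hypothetical.

```python
def expand_handlers(names, definitions):
    """Replace each 'set' handler with its members, keeping order and dropping duplicates."""
    resolved = []
    for name in names:
        definition = definitions.get(name, {})
        if definition.get("type") == "set":
            members = definition.get("handlers", [])
        else:
            members = [name]
        for member in members:
            if member not in resolved:
                resolved.append(member)
    return resolved


handler_definitions = {
    "severity1": {"type": "set", "handlers": ["pagerduty", "logstash"]},
    "pagerduty": {"type": "pipe", "command": "pagerduty.rb"},
}

# The check keeps pointing at "severity1"; only the set definition changed.
print(expand_handlers(["severity1"], handler_definitions))
```

Adding a destination means editing one handler definition instead of every check that uses it.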
  30. Keys in check definition are available to handlers

    "checks": {
      "mysql_healthcheck": {
        ...
        "handlers": ["pagerduty"],
        "pagerduty_team": "dbas"
        ...

    "handlers": {
      "pagerduty": {
        "command": "pagerduty.rb"
        ...

    # /etc/sensu/handlers/pagerduty.rb
    # implement your logic to look at event['check']['pagerduty_team'] and route to the appropriate pagerduty escalation group
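The routing logic described for pagerduty.rb could look roughly like the following, shown in Python for brevity. The team-to-key mapping, the fallback team, and the function name are hypothetical. Only the `event['check']['pagerduty_team']` lookup comes from the slide.

```python
# Hypothetical mapping from team name to a PagerDuty service key.
SERVICE_KEYS = {
    "dbas": "DBA_SERVICE_KEY",
    "ops": "OPS_SERVICE_KEY",
}
DEFAULT_TEAM = "ops"  # assumption: unrouted alerts fall back to ops


def service_key_for(event, keys=SERVICE_KEYS, default_team=DEFAULT_TEAM):
    """Pick the escalation target from the custom pagerduty_team check attribute."""
    team = event.get("check", {}).get("pagerduty_team", default_team)
    return keys.get(team, keys[default_team])


event = {
    "client": {"name": "db01"},
    "check": {"name": "mysql_healthcheck", "status": 2, "pagerduty_team": "dbas"},
}
print(service_key_for(event))
```

Checks without the attribute, or with an unknown team, fall back to the default instead of being dropped.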
  31. Some alerts need immediate attention, others can wait

    "checks": {
      "mysql_healthcheck": {
        ...
        "handlers": ["pagerduty"],
        ...
      "disk_check": {
        ...
        "handlers": ["jira"],
        ...

    • some alerts need immediate attention: call out!
    • some alerts can be deferred until business hours: open a ticket instead
  32. Demo environment

    • the demo environment is deployed at a cloud provider that charges significantly less when instances are not running
    • it is used by sales staff, who only work M-F 9-5
    • shut the environment down during non-business hours to save money, but don't alert on it
    • two possible ways to approach this:
  33. Method 1 - delete demo nodes via the API

    # demo_shutdown_cron.sh
    ...
    for node in demoweb1 demodb1; do
      cloud-cli shutdown "$node"
      curl -X DELETE "http://$SENSU_API_URL/client/$node"
    done
    ...

    When the demo environment spins up, the nodes will automatically re-register with Sensu.
  34. Method 2 - set custom client attribute

    {
      "client": {
        "name": "demoweb1",
        "handler": "pagerduty",
        "environment": "demo",
        ...

    Add custom logic to your handlers to ignore clients in the demo environment outside of business hours.
  35. Each alert should have a document ("playbook") describing the alert and possible actions to take

    Again, we're leveraging the ability to set arbitrary key/values on checks:

    "checks": {
      "disk_check": {
        ...
        "playbook": "http://wiki.getpantheon.com/playbook/disk_check",
        ...
  36. Handler extensions

    • run inside the sensu-server process, no forking
    • tcp, udp handlers are implemented as extensions
    • risky!! do not block the reactor!
    • only use when very high throughput is needed!
    • alternatively, implement your handler as a separate process and send events to it via the tcp handler
  37. more experiments with multi-datacenter setups to find best practices

    • document HA best practices around redis, rabbitmq
    • "out of box experience": make it dead simple to sell to your boss & peers
    • things are always improving
  38. • IRC: #sensu on freenode

    • mailing lists (google groups):
      ◦ sensu-user
      ◦ sensu-dev
    • sensu docs
      ◦ http://docs.sensuapp.org
    • "Why Switch? (Nagios to Sensu)"
      ◦ http://www.slideshare.net/jeremy_carroll/sensu-14485155
    • HA Sensu (from @failshell on #sensu)
      ◦ https://blog.theroux.ca/sensu/high-availability-sensu/
  39. Sandbox

    • vagrant .box, built with chef-monitor recipes
    • runs all sensu components
    • repos.sensuapp.org/box/sandbox.box