(getpantheon.com)
- @miller_joe / github.com/joemiller
- Sensu user since 2011
- Native Phoenician! Living in SF since 2012 (it's good to be back in PHX, but we should do this in April next time)
scalable, malleable
- compose a system to meet your needs
- integrates well with CM (Chef, Puppet, etc.)
- "cloud" friendly: new nodes automatically know what checks to run and become monitored nodes automatically
- integrity: excellent test coverage (rspec)
expensive
- problem: APIs for registering nodes/services are OK but still require extra work and moving parts
- easier: connect to a message queue, subscribe to the topic "webserver", and be monitored immediately, the same as all other webservers
while working at Sonian
- released open source (MIT license) in late 2011
- @portertech now working full-time on Sensu through Heavy Water Operations (as of Spring 2013)
- commercial support available through Heavy Water Operations (http://sensuapp.org/support)
to components in your infrastructure
- community CM modules available:
  - sensu-chef (LWRPs): https://github.com/sensu/sensu-chef
  - chef-monitor (recipes built with sensu-chef LWRPs): https://github.com/portertech/chef-monitor
  - sensu-puppet: https://github.com/sensu/sensu-puppet
- ... or roll your own to fit your existing CM model
to sensu-server.
- if they disappear for 180 seconds, sensu-server generates a "keepalive" critical event
- keepalive events are sent to the 'default' handler* (*configurable in 0.9.13+)
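The keepalive rule above can be sketched in a few lines of Ruby. This is an illustration of the idea only, not Sensu's actual implementation; the function name and the shape of the returned hash are assumptions made for the example.

```ruby
# Illustrative sketch of the keepalive rule; not Sensu's real code.
KEEPALIVE_THRESHOLD = 180 # seconds, the figure quoted on the slide

# Returns an event-like hash if the client has been silent for at least
# `threshold` seconds, otherwise nil. Names here are hypothetical.
def keepalive_event(client_name, last_seen, now = Time.now, threshold = KEEPALIVE_THRESHOLD)
  silent_for = (now - last_seen).to_i
  return nil if silent_for < threshold

  {
    'client' => { 'name' => client_name },
    'check'  => {
      'name'    => 'keepalive',
      'status'  => 2, # 2 = critical in the Nagios-style exit code convention
      'output'  => "No keepalive from #{client_name} for #{silent_for} seconds",
      'handler' => 'default'
    }
  }
end
```

A client last seen 100 seconds ago yields nil; one last seen 180 or more seconds ago yields a critical event routed to the 'default' handler.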
keys in check definition are available to handlers
  "handlers": {
    "pagerduty": {
      "command": "pagerduty.rb"
      ...
  # /etc/sensu/handlers/pagerduty.rb
  # implement your logic to look at event['check']['pagerduty_team']
  # and route to the appropriate pagerduty escalation group
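A minimal sketch of the routing logic such a handler might contain. The team names, the placeholder service keys, and the fallback team are all hypothetical; only the event['check']['pagerduty_team'] lookup comes from the slide. A pipe handler would typically obtain the event by parsing the JSON it receives on STDIN, which is left out here to keep the example small.

```ruby
# Hypothetical mapping from team name to a PagerDuty service key.
# The keys below are placeholders, not real credentials.
TEAM_SERVICE_KEYS = {
  'ops' => 'OPS_SERVICE_KEY',
  'dba' => 'DBA_SERVICE_KEY'
}.freeze

# Team to page when the check names no team, or an unknown one (assumption).
DEFAULT_TEAM = 'ops'

# Picks the service key for an event based on the custom
# 'pagerduty_team' key set in the check definition.
def service_key_for(event, keys = TEAM_SERVICE_KEYS, default_team = DEFAULT_TEAM)
  team = event.fetch('check', {})['pagerduty_team']
  keys.fetch(team) { keys.fetch(default_team) }
end
```

Falling back to a default team rather than raising means a typo in a check definition still pages someone.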
  ...
  "handler": ["jira"],
  ...
- some alerts need immediate attention: call out!
- some alerts can be deferred until business hours: open a ticket instead
significantly less when instances are not running
- demo environment used by sales staff, which only works M-F 9-5
- shut the environment down during non-business hours to save money, but don't alert on it
- two possible ways to approach this:
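demo_shutdown_cron.sh
  ...
  # node names are space-separated; a comma would become part of the name
  for node in demoweb1 demodb1; do
    cloud-cli shutdown "$node"
    # remove the client from sensu so no keepalive alert fires
    curl -X DELETE "http://$SENSU_API_URL/client/$node"
  done
  ...
- when the demo environment spins up, the nodes will automatically re-register with sensu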
"name": "demoweb1", "handler": "pagerduty", "environment": "demo", ... add custom logic to your handlers to ignore clients in the demo environment outside of business hours
and possible actions to take
- again, we're leveraging the ability to set arbitrary key/values on checks
  "checks": {
    "disk_check": {
      ...
      "playbook": "http://wiki.getpantheon.com/playbook/disk_check",
      ...
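A handler can then surface that custom key in whatever it sends out. A minimal sketch, assuming the event carries the check's "playbook" key under event['check']; the function name and message format are invented for illustration.

```ruby
# Builds a one-line notification and appends the playbook link when the
# check definition provides one. Format and name are hypothetical.
def notification_text(event)
  client = event.fetch('client', {})['name']
  check  = event.fetch('check', {})
  text = "#{client}/#{check['name']}: #{check['output']}"
  playbook = check['playbook']
  text += " (playbook: #{playbook})" if playbook
  text
end
```

Checks without a playbook key produce the same message minus the link, so the key stays optional.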
handlers implemented as extensions
- risky!! do not block the reactor!
- only use when very high throughput is needed!
- alternatively, implement your handler as a separate process and send events to it via the tcp handler
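The "separate process" alternative could be as small as the sketch below: a plain TCP listener that parses each event outside the Sensu reactor. It assumes the sender writes one JSON event per connection and then closes it; the function names, bind address, and port are choices made for the example.

```ruby
require 'json'
require 'socket'

# Reads one JSON-encoded event from an IO-like object, up to EOF.
def read_event(io)
  JSON.parse(io.read)
end

# Accepts connections forever and yields each parsed event to the block.
# Slow work in the block delays only this process, not sensu-server.
def serve(port, host = '127.0.0.1')
  server = TCPServer.new(host, port)
  loop do
    client = server.accept
    begin
      yield read_event(client)
    rescue JSON::ParserError => e
      warn "discarding malformed event: #{e.message}"
    ensure
      client.close
    end
  end
end
```

Usage would be along the lines of serve(9000) { |event| puts event['check']['name'] }, with the port matching whatever the tcp handler definition points at.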