Practical examples with Sensu Monitoring framework

Monitoring with Sensu CentOS Dojo - Phoenix - May 2013

Who am I?
• Joe Miller
• Ops & Engineering @ Pantheon (getpantheon.com)
• @miller_joe / github.com/joemiller
• Sensu user since 2011
• Native Phoenician! Living in SF since 2012 (it's good to be back in PHX, but we should do this in April next time)

Agenda
• What is Sensu?
• How does it work?
• Cool use cases

What is Sensu?

"The monitoring router" A framework for building a monitoring system
scalable, malleable compose a system to meet your needs integrate well with CM (Chef, Puppet, etc) “cloud” friendly. (new nodes automatically know what checks to run. become monitored nodes automatically) integrity. excellent test coverage (rspec)

benefits over other monitoring tools
• problem: discovery is slow and expensive
• problem: APIs for registering nodes/services are OK but still require extra work and moving parts
• easier: connect to a message queue, subscribe to topic "webserver", and immediately be monitored the same as all other webservers

Quick History
• started as a part-time project by Sean Porter (@portertech) while working at Sonian
• released as open source (MIT license) in late 2011
• @portertech now working full-time on Sensu through Heavy Water Operations (as of Spring 2013)
• commercial support available through Heavy Water Operations (http://sensuapp.org/support)

The Stack
• Ruby 2.0 (EventMachine)
• RabbitMQ
• Redis
• JSON (everywhere)

Plugins
• write in any language
• uses nagios plugin protocol!
• sensu-community-plugins (github): ~130
• nagios plugins: 1 billion?
• re-use nagios plugins where you can!!

"omnibus" package one package to install - rpm, deb "main"
(stable) + "unstable" channels installs everything it needs in /opt/sensu Ruby 2.x, gems, supporting tools. tested on: debian, ubuntu, fedora, centos/rhel ships with: initd (default), upstart, systemd 0.9.12+ ships with integrated runit (optional) (`sensu-ctl`)

Configuration Management
• leverage your existing CM to automatically attach checks to components in your infrastructure
• community CM modules available:
  ◦ sensu-chef (LWRPs): https://github.com/sensu/sensu-chef
  ◦ chef-monitor (recipes built with sensu-chef LWRPs): https://github.com/portertech/chef-monitor
  ◦ sensu-puppet: https://github.com/sensu/sensu-puppet
• ... or roll your own to fit your existing CM model

sensu-dashboard
• bundled with the omnibus package
• simple, stateless
• no user authentication, roles, etc.

sensu-admin
• distributed as a separate project
• stateful, more features: users, roles, scheduling downtimes, etc.
• https://github.com/sensu/sensu-admin
• https://github.com/sensu/sensu-admin-chef

How does it work?

basics: checks, events, handlers

[Diagram: check, event, handler]

Checks
• output data to STDOUT or STDERR
• exit status indicates severity:
  0 = OK
  1 = WARNING
  2 = CRITICAL
  3 = UNKNOWN or CUSTOM
• Look familiar? Same protocol as nagios checks (re-use nagios checks whenever possible!)
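The exit-status convention above can be sketched as a small check written in Python (the deck notes plugins can be written in any language). This is a minimal sketch, not a plugin from sensu-community-plugins: the function names and the 80/90 percent thresholds are assumptions made for illustration.

```python
import shutil

# Exit statuses from the slide above (same convention as nagios plugins).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(used_pct, warn=80.0, crit=90.0):
    """Map a disk usage percentage to (exit_status, output_line).

    The warn/crit thresholds are illustrative defaults, not Sensu settings.
    """
    if used_pct >= crit:
        return CRITICAL, f"CRITICAL: disk {used_pct:.0f}% used"
    if used_pct >= warn:
        return WARNING, f"WARNING: disk {used_pct:.0f}% used"
    return OK, f"OK: disk {used_pct:.0f}% used"


def check_disk(path="/"):
    """Measure usage of `path`; report UNKNOWN if it cannot be read."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as err:
        return UNKNOWN, f"UNKNOWN: {err}"
    return evaluate_disk(100.0 * usage.used / usage.total)
```

A script built on this would print the output line to STDOUT and pass the status to `sys.exit()`, which is all the protocol asks of a check.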

Events
{
  "client": {
    "name": "host01",
    "address": "10.2.1.11",
    "subscriptions": ["application"],
    "timestamp": 1364274222
  },
  "check": {
    "name": "frontend_http",
    "command": "check_http -u http://example.com",
    "subscribers": ["application"],
    "handlers": ["pagerduty"],
    "interval": 60,
    "output": "HTTP/1.1 503 Service Temporarily Unavailable",
    "status": 2,
    "history": [0, 2],
    "flapping": false,
    "issued": 1364274239,
    "executed": 1364274240
  },
  "occurrences": 1,
  "action": "create"
}
http://docs.sensuapp.org/0.9/events.html

Handlers
• take action on event data
• events passed to handlers on STDIN
• only if exit status is not OK
• except metric checks, which are always passed to handlers
• several types: [pipe, tcp, udp, amqp, set]
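A pipe handler receives the event as JSON on STDIN, so it can be sketched in a few lines of Python. This is a minimal sketch assuming the event layout shown on the Events slide; `summarize` and `handle` are hypothetical names, and a real handler would send the summary somewhere instead of returning it.

```python
import json
import sys

STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def summarize(event):
    """Build a one-line summary from an event dict."""
    client = event["client"]["name"]
    check = event["check"]
    # Any status outside 0-2 is reported as UNKNOWN here.
    status = STATUS_NAMES.get(check["status"], "UNKNOWN")
    return f"{status} {client}/{check['name']}: {check['output'].strip()}"


def handle(stream=sys.stdin):
    """Read one JSON event from `stream` and return its summary."""
    return summarize(json.load(stream))
```

When used as a handler script, the last step would be something like `print(handle())`, with sensu-server supplying the event on STDIN.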

message flow 1:N

[Diagram: "Check request". sensu-server sends a check request (check: "nginx_service", subscribers: webservers, interval: 60) through rabbitmq to three sensu-clients, each with subscriptions: [ webservers ]. Diagram label: check: check_http]

[Diagram: "Check response". A sensu-client with subscriptions: [ webservers ] sends a check response (status: "2", output: "CRITICAL: port 80 timed out") through rabbitmq to sensu-server. Diagram labels: handler: pagerduty.rb, handler: mail.rb]

Check subscriptions
• checks are assigned to subscribers
• clients have subscriptions: ['nginx', 'frontend', 'base']
• sensu-server sends check requests to subscribers
• works well when mapped to server roles in CM
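The relationship between a check's subscribers and a client's subscriptions can be sketched as a set intersection. This is only a conceptual model in Python: in Sensu the matching happens through message queue topics rather than a function call, and `checks_for_client` is a hypothetical helper.

```python
def checks_for_client(client, checks):
    """Return the names of checks whose subscribers overlap the client's subscriptions."""
    subs = set(client.get("subscriptions", []))
    return sorted(
        name
        for name, check in checks.items()
        if subs & set(check.get("subscribers", []))
    )
```

Mapping subscriptions to CM roles means a node that gains the `webserver` role picks up every check that lists `webserver` as a subscriber.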

Standalone checks
• scheduled and defined only on the client
• execute on their own schedule; sensu-server does not request execution
• easier to deploy in some CM systems

Client socket
• sensu-client listens on localhost:3030 (tcp and udp)
• push events from apps, scripts, etc.

echo '{ "handlers": ["default"], "name": "my_app_healthcheck", "output": "CRITICAL: MyApp is broke!", "status": 2 }' | nc -w1 127.0.0.1 3030
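The same push can be done from application code instead of `nc`. The sketch below, in Python, assumes the client socket accepts a single JSON document over TCP on 127.0.0.1:3030 as in the example above; `build_event` and `send_event` are hypothetical helper names.

```python
import json
import socket


def build_event(name, output, status, handlers=("default",)):
    """Serialize a check result in the shape used by the nc example."""
    return json.dumps(
        {"handlers": list(handlers), "name": name, "output": output, "status": status}
    )


def send_event(payload, host="127.0.0.1", port=3030, timeout=1.0):
    """Send one JSON payload to the local sensu-client socket over TCP."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload.encode("utf-8"))
```

For example, `send_event(build_event("my_app_healthcheck", "CRITICAL: MyApp is broke!", 2))` mirrors the shell one-liner, provided a sensu-client is running locally.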

configuration

Config files
• one or more config files
• designed with CM in mind
• /etc/sensu/config.json
• /etc/sensu/conf.d/*.json
• deep merging
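Deep merging means nested objects from several files are combined key by key instead of one file replacing another. The Python sketch below illustrates the idea only: how Sensu itself treats arrays and conflicting values is not stated in the slides, so here the later value simply wins, and `deep_merge` is a hypothetical helper.

```python
def deep_merge(base, override):
    """Return a new dict with `override` merged into `base`, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Assumption: non-dict values (including lists) are replaced, not combined.
            merged[key] = value
    return merged
```

This is what lets CM drop one small file per check or handler into conf.d without rewriting a single large config file.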

Client config
{
  "client": {
    "name": "frontend01.dom.com",
    "address": "188.12.11.2",
    "subscriptions": [ "production", "webserver" ]
  }
}
http://docs.sensuapp.org/0.9/clients.html

Check config
{
  "checks": {
    "chef_client": {
      "command": "check-chef-client.rb",
      "subscribers": [ "production" ],
      "interval": 60
    }
  }
}
http://docs.sensuapp.org/0.9/checks.html

Check config (standalone)
{
  "checks": {
    "chef_client": {
      "command": "check-chef-client.rb",
      "interval": 60,
      "standalone": true
    }
  }
}
http://docs.sensuapp.org/0.9/checks.html

Check config (metric)
{
  "checks": {
    "foo": {
      "handler": "graphite",
      "type": "metric",
      "command": "echo metric 42 `date +%s`",
      "standalone": true,
      "interval": 10
    }
  }
}
http://docs.sensuapp.org/0.9/checks.html

Handler config (pipe)
{
  "handlers": {
    "pagerduty": {
      "type": "pipe",
      "command": "pagerduty.rb"
    }
  }
}
http://docs.sensuapp.org/0.9/handlers.html

Handler config (tcp)
{
  "handlers": {
    "graphite": {
      "type": "tcp",
      "socket": {
        "host": "127.0.0.1",
        "port": 2003
      },
      "mutator": "only_check_output"
    }
  }
}

example use cases

auto-decommission: if my EC2 instances are terminated, clean them up and remove them from sensu

check config
• no additional check is needed: all sensu-clients send heartbeats to sensu-server
• if a client disappears for 180 seconds, sensu-server generates a "keepalive" critical event
• keepalive events are sent to the 'default' handler* (*configurable in 0.9.13+)
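The keepalive rule above amounts to comparing the age of a client's last heartbeat with a threshold. A minimal Python sketch under that assumption follows; the function name and message wording are illustrative and not taken from sensu-server.

```python
def keepalive_status(last_seen, now, threshold=180):
    """Return (status, output) for a client last heard from at `last_seen` (Unix seconds)."""
    age = now - last_seen
    if age >= threshold:
        # 2 = CRITICAL, matching the check exit-status convention.
        return 2, f"No keep-alive sent from client in over {threshold} seconds"
    return 0, f"Keep-alive sent from client {age} seconds ago"
```

Because the result is an ordinary critical event, it flows through handlers like any other check result, which is what the decommission handler relies on.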

handler config - default
{
  "handlers": {
    "default": {
      "type": "set",
      "handlers": [ "awsdecommission", "pagerduty" ]
    }
  }
}

handler config - awsdecommission
{
  "handlers": {
    "awsdecommission": {
      "type": "pipe",
      "command": "/etc/sensu/handlers/awsdecomm.rb",
      "severities": [ "ok", "warning", "critical" ]
    }
  }
}

awsdecommission config
{
  "awsdecomm": {
    "chef_server_host": "127.0.0.1",
    "chef_server_port": "4000",
    "chef_server_version": "0.10.16.2",
    "chef_client_user": "sensu",
    "chef_client_key_dir": "/etc/sensu/conf.d/handlers/sensu.pem",
    "access_key_id": "ACCESS_KEY_ID",
    "secret_access_key": "SECRET_ACCESS_KEY",
    "mail_from": "[email protected]",
    "mail_to": "[email protected]",
    "smtp_address": "localhost",
    "smtp_port": "25",
    "smtp_domain": "localhost"
  }
}

use handler set to simplify changes

in the beginning, maybe you start simple

"handlers": {
  "severity1": {
    "type": "set",
    "handlers": [ "pagerduty" ]
...
"checks": {
  "mysql_healthcheck": {
    ...
    "handlers": ["severity1"],
    ...

then you decide to copy all events to your log system

"handlers": {
  "severity1": {
    "type": "set",
    "handlers": [ "pagerduty", "logstash" ]   <- add here
...
"checks": {
  "mysql_healthcheck": {
    ...
    "handlers": ["severity1"],   <- no change to check configs
    ...
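The indirection a handler set provides can be sketched as a lookup that replaces a set name with its members. This Python sketch expands one level only and assumes sets are not nested; `expand_handlers` is a hypothetical helper, not sensu-server code.

```python
def expand_handlers(names, handlers):
    """Resolve handler names, replacing any 'set' handler with its member handlers."""
    resolved = []
    for name in names:
        definition = handlers.get(name, {})
        if definition.get("type") == "set":
            members = definition.get("handlers", [])
        else:
            members = [name]
        for member in members:
            if member not in resolved:  # avoid running the same handler twice
                resolved.append(member)
    return resolved
```

Checks keep naming `severity1`, and only the set definition changes when a destination is added or removed.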

route check results to different teams

"checks": { "mysql_healthcheck": { ... "handler": ["pagerduty"], "pagerduty_team": "dbas" ...
keys in check definition are available to handlers "handlers": { "pagerduty": { "command": "pagerduty.rb" ... # /etc/sensu/handlers/pagerduty.rb # implement your logic to look at event['check']['pagerduty_team'] and route to the appropriate pagerduty escalation group
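The routing logic described for pagerduty.rb can be sketched in Python as a lookup on the custom `pagerduty_team` key. The service keys and the fallback to a `default` team are assumptions for illustration; the original handler is Ruby and its actual logic is not shown in the slides.

```python
# Hypothetical mapping from team name to a PagerDuty service key.
SERVICE_KEYS = {"dbas": "DBA_SERVICE_KEY", "default": "OPS_SERVICE_KEY"}


def service_key_for(event, keys=SERVICE_KEYS):
    """Pick the service key for an event based on check['pagerduty_team']."""
    team = event.get("check", {}).get("pagerduty_team", "default")
    # Unknown teams fall back to the default key rather than dropping the alert.
    return keys.get(team, keys["default"])
```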

route sev 1 tickets to pagerduty, route sev 2 to ticket system

"checks": { "mysql_healthcheck": { ... "handler": ["pagerduty"], ... "disk_check": {
... "handler": ["jira"], ... some alerts need immediate attention - call out ! some alerts can be deferred until business hours - open a ticket instead

only alert on demo environment during M-F, 9-5

• demo environment is deployed at a cloud provider that charges significantly less when instances are not running
• demo environment is used by sales staff, who only work M-F 9-5
• shut the environment down during non-business hours to save money, but don't alert on it
• two possible ways to approach this:

method 1 - delete demo nodes via the api

# demo_shutdown_cron.sh
...
for node in demoweb1 demodb1; do
  cloud-cli shutdown "$node"
  curl -X DELETE "http://$SENSU_API_URL/client/$node"
done
...

when the demo environment spins up, the nodes will automatically re-register with sensu

method 2 - set custom client attribute

{
  "client": {
    "name": "demoweb1",
    "handler": "pagerduty",
    "environment": "demo",
    ...

add custom logic to your handlers to ignore clients in the demo environment outside of business hours

link checks to wiki docs (runbook-y stuff)

• each alert should have a document ("playbook") describing the alert and possible actions to take
• again, we're leveraging the ability to set arbitrary key/values on checks

"checks": {
  "disk_check": {
    ...
    "playbook": "http://wiki.getpantheon.com/playbook/disk_check",
    ...

the 'playbook' attribute follows the event through the system. available
to handlers, GUIs, sensu API, etc

experimental features

handler extensions
• run inside the sensu-server process, no forking
• tcp, udp handlers are implemented as extensions
• risky!! do not block the reactor!
• only use when very high throughput is needed!
• alternatively, implement your handler as a separate process and send events to it via the tcp handler

room for improvement

• more experiments with multi-datacenter setups to find best practices
• document HA best practices around redis, rabbitmq
• "out of box experience" - make it dead simple to sell to your boss & peers
• things are always improving

MOAR DOCUMENTATION!!
• http://docs.sensuapp.org
• fork it and help, please! https://github.com/sensu/sensu-docs

other great resources

• IRC: #sensu on freenode
• mailing lists (google groups):
  ◦ sensu-user
  ◦ sensu-dev
• sensu docs
  ◦ http://docs.sensuapp.org
• "Why Switch? (Nagios to Sensu)"
  ◦ http://www.slideshare.net/jeremy_carroll/sensu-14485155
• HA Sensu (from @failshell on #sensu)
  ◦ https://blog.theroux.ca/sensu/high-availability-sensu/

sandbox
• vagrant .box, built with chef-monitor recipes
• runs all sensu components
• repos.sensuapp.org/box/sandbox.box
