Slide 1

Slide 1 text

Monitoring with Sensu CentOS Dojo - Phoenix - May 2013

Slide 2

Slide 2 text

Who am I?
Joe Miller
Ops & Engineering @ Pantheon (getpantheon.com)
@miller_joe / github.com/joemiller
Sensu user since 2011
Native Phoenician! living in SF since 2012 (it's good to be back in PHX, but we should do this in April next time)

Slide 3

Slide 3 text

Agenda What is Sensu? How does it work? Cool use cases

Slide 4

Slide 4 text

What is Sensu?

Slide 5

Slide 5 text

"The monitoring router" A framework for building a monitoring system scalable, malleable compose a system to meet your needs integrate well with CM (Chef, Puppet, etc) “cloud” friendly. (new nodes automatically know what checks to run. become monitored nodes automatically) integrity. excellent test coverage (rspec)

Slide 6

Slide 6 text

benefits over other monitoring tools
problem: discovery is slow and expensive
problem: APIs for registering nodes/services are OK but still require extra work and moving parts
easier: connect to a message queue, subscribe to topic "webserver", and the node is immediately monitored the same as all other webservers

Slide 7

Slide 7 text

Quick History
started as a part-time project by Sean Porter (@portertech) while working at Sonian
released open source (MIT license) in late 2011
@portertech now working full-time on Sensu through Heavy Water Operations (as of Spring 2013)
commercial support available through Heavy Water Operations (http://sensuapp.org/support)

Slide 8

Slide 8 text

The Stack Ruby 2.0 (EventMachine) RabbitMQ Redis JSON (everywhere)

Slide 9

Slide 9 text

Plugins
write in any language
uses the nagios plugin protocol!
sensu-community-plugins (github): ~130
nagios plugins: 1 billion?
re-use nagios plugins where you can!!

Slide 10

Slide 10 text

"omnibus" package one package to install - rpm, deb "main" (stable) + "unstable" channels installs everything it needs in /opt/sensu Ruby 2.x, gems, supporting tools. tested on: debian, ubuntu, fedora, centos/rhel ships with: initd (default), upstart, systemd 0.9.12+ ships with integrated runit (optional) (`sensu-ctl`)

Slide 11

Slide 11 text

Configuration Management
leverage your existing CM to automatically attach checks to components in your infrastructure
community CM modules available:
    sensu-chef (LWRPs)
    chef-monitor (recipes built with sensu-chef LWRPs)
    sensu-puppet
... or roll your own to fit your existing CM model.
https://github.com/sensu/sensu-chef
https://github.com/portertech/chef-monitor
https://github.com/sensu/sensu-puppet

Slide 12

Slide 12 text

GUIs

Slide 13

Slide 13 text

sensu-dashboard bundled with the omnibus package simple, stateless no user authentication, roles, etc

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

sensu-admin distributed as a separate project stateful, more features users, roles, scheduling downtimes, etc https://github.com/sensu/sensu-admin https://github.com/sensu/sensu-admin-chef

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

How does it work?

Slide 18

Slide 18 text

basics: checks, events, handlers

Slide 19

Slide 19 text

check handler event

Slide 20

Slide 20 text

Checks
output data to STDOUT or STDERR
exit status indicates severity:
    0 = OK
    1 = WARNING
    2 = CRITICAL
    3 = UNKNOWN or CUSTOM
Look familiar? Same protocol as nagios checks (re-use nagios checks whenever possible!)
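
A minimal sketch of a check that follows the exit-status protocol above, written in Python since plugins can be in any language. The function names and the 80/90 percent thresholds are hypothetical, chosen only for illustration; nothing here comes from sensu-community-plugins.

```python
import shutil

# Exit codes from the slide: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk_usage(used_percent, warn=80.0, crit=90.0):
    """Map a disk usage percentage to (exit_status, output_line).

    The thresholds are hypothetical defaults, not Sensu settings.
    """
    if used_percent is None:
        return UNKNOWN, "UNKNOWN: could not read disk usage"
    if used_percent >= crit:
        return CRITICAL, f"CRITICAL: disk usage {used_percent:.0f}%"
    if used_percent >= warn:
        return WARNING, f"WARNING: disk usage {used_percent:.0f}%"
    return OK, f"OK: disk usage {used_percent:.0f}%"


def main(path="/"):
    """Print the check output to STDOUT and return the exit status."""
    try:
        usage = shutil.disk_usage(path)
        used_percent = 100.0 * usage.used / usage.total
    except OSError:
        used_percent = None
    status, output = evaluate_disk_usage(used_percent)
    print(output)
    return status
```

When run as a script, the check would finish with `sys.exit(main())` so that the exit status carries the severity.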

Slide 21

Slide 21 text

Events
    {
      client: {
        name: "host01",
        address: "10.2.1.11",
        subscriptions: ["application"],
        timestamp: 1364274222
      },
      check: {
        name: "frontend_http",
        command: "check_http -u http://example.com",
        subscribers: ["application"],
        handlers: ["pagerduty"],
        interval: 60,
        output: "HTTP/1.1 503 Service Temporarily Unavailable",
        status: 2,
        history: [0, 2],
        flapping: false,
        issued: 1364274239,
        executed: 1364274240
      },
      occurrences: 1,
      action: "create"
    }
http://docs.sensuapp.org/0.9/events.html

Slide 22

Slide 22 text

Handlers
take action on event data
events are passed to handlers on STDIN
only if exit status is not OK, except metric checks, which are always passed to handlers
several types: [pipe, tcp, udp, amqp, set]
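
As a rough sketch of a pipe handler, the Python below reads one event as JSON text and turns it into a one-line summary. The field names follow the event example in this deck; the function names and the summary format are assumptions made for illustration.

```python
import json
import sys


def summarize(event):
    """Build a one-line summary from the client and check parts of an event."""
    client = event["client"]["name"]
    check = event["check"]
    output = check["output"].strip()
    return f"{client}/{check['name']} status={check['status']}: {output}"


def handle(raw_event):
    """Parse the JSON text that a pipe handler receives and summarize it."""
    return summarize(json.loads(raw_event))


def main():
    # A pipe handler receives the event on STDIN.
    print(handle(sys.stdin.read()))
```

A real handler would replace the `print` with whatever action it takes, such as calling a paging or ticketing API.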

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

message flow 1:N

Slide 25

Slide 25 text

[Diagram: sensu-server publishes a check request through rabbitmq to three sensu-clients, each with subscriptions: [ webservers ]. Check request: check: "nginx_service", subscribers: webservers, interval: 60. Arrow label: Check request, check: check_http]

Slide 26

Slide 26 text

[Diagram: sensu-server, rabbitmq, and three sensu-clients, each with subscriptions: [ webservers ]. Check response: status: "2", output: "CRITICAL: port 80 timed out". Arrow label: Check response. Handlers shown: pagerduty.rb, mail.rb]

Slide 27

Slide 27 text

Check subscriptions
checks are assigned to subscribers
clients have subscriptions: ['nginx', 'frontend', 'base']
sensu-server sends check requests to subscribers
works well when mapped to server roles in CM
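
The matching between a check's subscribers and a client's subscriptions can be modelled in a few lines of Python. This is only a conceptual sketch: in Sensu the routing happens through the message queue rather than through a lookup like this, and the client names below are made up.

```python
def clients_for_check(check, clients):
    """Return the names of clients that share a subscription with the check."""
    wanted = set(check["subscribers"])
    return [c["name"] for c in clients if wanted & set(c["subscriptions"])]


# Hypothetical inventory used only for illustration.
CLIENTS = [
    {"name": "web01", "subscriptions": ["nginx", "frontend", "base"]},
    {"name": "db01", "subscriptions": ["mysql", "base"]},
]
```

For example, a check with subscribers `["nginx"]` would reach only `web01`, while one with subscribers `["base"]` would reach both clients.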

Slide 28

Slide 28 text

Standalone checks scheduled and defined only on client execute on their own schedule, sensu-server does not request execution easier to deploy in some CM systems

Slide 29

Slide 29 text

Client socket
sensu-client listens on localhost:3030 (tcp and udp)
push events from apps, scripts, etc
    echo '{ "handlers": ["default"], "name": "my_app_healthcheck", "output": "CRITICAL: MyApp is broke!", "status": 2 }' | nc -w1 127.0.0.1 3030
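
The same push can be done from application code. The sketch below builds the JSON payload shown in the `nc` example and writes it to the client socket over TCP; the function names are hypothetical, and `send_result` assumes a sensu-client is listening on 127.0.0.1:3030.

```python
import json
import socket


def build_result(name, output, status, handlers=("default",)):
    """Serialize a check result in the shape used by the nc example."""
    return json.dumps({
        "handlers": list(handlers),
        "name": name,
        "output": output,
        "status": status,
    })


def send_result(payload, host="127.0.0.1", port=3030, timeout=1.0):
    """Send the payload to the local client socket over TCP."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload.encode("utf-8"))
```

An application health check could then call `send_result(build_result("my_app_healthcheck", "CRITICAL: MyApp is broke!", 2))`.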

Slide 30

Slide 30 text

configuration

Slide 31

Slide 31 text

Config files
one or more config files
designed with CM in mind
/etc/sensu/config.json
/etc/sensu/conf.d/*.json
deep merging
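
To illustrate what deep merging means for config fragments, here is a small Python sketch. It assumes that nested objects are merged key by key and that any other value from the later file replaces the earlier one; how Sensu itself treats arrays and conflicts is not covered by this deck, so treat those details as assumptions.

```python
def deep_merge(base, override):
    """Recursively merge two dicts without modifying either input.

    Assumption: nested dicts merge key by key; any other value in
    `override` replaces the value in `base`.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```

With this behaviour, a `conf.d` fragment that only defines `checks` can sit next to one that only defines `client`, and a fragment that adds one key under `client` does not wipe out the rest of the client definition.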

Slide 32

Slide 32 text

Client config
    {
      "client": {
        "name": "frontend01.dom.com",
        "address": "188.12.11.2",
        "subscriptions": [ "production", "webserver" ]
      }
    }
http://docs.sensuapp.org/0.9/clients.html

Slide 33

Slide 33 text

Check config
    {
      "checks": {
        "chef_client": {
          "command": "check-chef-client.rb",
          "subscribers": [ "production" ],
          "interval": 60
        }
      }
    }
http://docs.sensuapp.org/0.9/checks.html

Slide 34

Slide 34 text

Check config (standalone)
    {
      "checks": {
        "chef_client": {
          "command": "check-chef-client.rb",
          "interval": 60,
          "standalone": true
        }
      }
    }
http://docs.sensuapp.org/0.9/checks.html

Slide 35

Slide 35 text

Check config (metric)
    {
      "checks": {
        "foo": {
          "handler": "graphite",
          "type": "metric",
          "command": "echo metric 42 `date +%s`",
          "standalone": true,
          "interval": 10
        }
      }
    }
http://docs.sensuapp.org/0.9/checks.html
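
The `echo` command in the metric check prints a line of the form "path value timestamp", which is the Graphite plaintext format. A metric check written in Python might build that line as sketched below; the function name is hypothetical.

```python
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric as '<path> <value> <unix timestamp>'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}"
```

A metric check would print one such line per metric to STDOUT and exit with status 0.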

Slide 36

Slide 36 text

Handler config (pipe)
    {
      "handlers": {
        "pagerduty": {
          "type": "pipe",
          "command": "pagerduty.rb"
        }
      }
    }
http://docs.sensuapp.org/0.9/handlers.html

Slide 37

Slide 37 text

Handler config (tcp)
    {
      "handlers": {
        "graphite": {
          "type": "tcp",
          "socket": {
            "host": "127.0.0.1",
            "port": 2003
          },
          "mutator": "only_check_output"
        }
      }
    }

Slide 38

Slide 38 text

example use cases

Slide 39

Slide 39 text

auto-decommission: if my EC2 instances are terminated, clean them up and remove them from sensu

Slide 40

Slide 40 text

check config
no additional check is needed: all sensu-clients send heartbeats to sensu-server. if a client disappears for 180 seconds, sensu-server generates a "keepalive" critical event
keepalive events are sent to the 'default' handler* (*configurable in 0.9.13+)
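
The keepalive rule can be sketched as a simple age comparison. This is an illustrative model only: the 180-second threshold is the figure from the slide, while the function name and the output strings are assumptions rather than Sensu's actual messages.

```python
def keepalive_status(last_seen, now, threshold=180):
    """Return (status, output) for a client based on its last heartbeat.

    `last_seen` and `now` are Unix timestamps in seconds.
    """
    age = now - last_seen
    if age >= threshold:
        return 2, f"CRITICAL: no keepalive from client for {age} seconds"
    return 0, f"OK: keepalive received {age} seconds ago"
```

A status of 2 here corresponds to the "keepalive" critical event that gets routed to the default handler.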

Slide 41

Slide 41 text

handler config - default
    {
      "handlers": {
        "default": {
          "type": "set",
          "handlers": [ "awsdecommission", "pagerduty" ]
        }
      }
    }

Slide 42

Slide 42 text

handler config - awsdecommission
    {
      "handlers": {
        "awsdecommission": {
          "type": "pipe",
          "command": "/etc/sensu/handlers/awsdecomm.rb",
          "severities": [ "ok", "warning", "critical" ]
        }
      }
    }

Slide 43

Slide 43 text

awsdecommission config
    {
      "awsdecomm": {
        "chef_server_host": "127.0.0.1",
        "chef_server_port": "4000",
        "chef_server_version": "0.10.16.2",
        "chef_client_user": "sensu",
        "chef_client_key_dir": "/etc/sensu/conf.d/handlers/sensu.pem",
        "access_key_id": "ACCESS_KEY_ID",
        "secret_access_key": "SECRET_ACCESS_KEY",
        "mail_from": "[email protected]",
        "mail_to": "[email protected]",
        "smtp_address": "localhost",
        "smtp_port": "25",
        "smtp_domain": "localhost"
      }
    }

Slide 44

Slide 44 text

use handler set to simplify changes

Slide 45

Slide 45 text

in the beginning, maybe you start simple
    "handlers": {
      "severity1": {
        "type": "set",
        "handlers": [ "pagerduty" ]
        ...
    "checks": {
      "mysql_healthcheck": {
        ...
        "handler": ["severity1"],
        ...

Slide 46

Slide 46 text

then you decide to copy all events to your log system
    "handlers": {
      "severity1": {
        "type": "set",
        "handlers": [ "pagerduty", "logstash" ]   <- add here
        ...
    "checks": {
      "mysql_healthcheck": {
        ...
        "handler": ["severity1"],   <- no change to check configs
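
A small sketch of why the set helps: the check only names "severity1", and the set is resolved to concrete handlers at handling time. The Python below models that resolution under the assumption that sets are expanded one level deep; it is not Sensu's implementation, and the function name is hypothetical.

```python
def expand_handlers(names, handlers):
    """Resolve handler names, expanding any handler of type 'set'.

    Assumption: sets are expanded one level deep and duplicates are dropped.
    """
    resolved = []
    for name in names:
        definition = handlers.get(name, {})
        if definition.get("type") == "set":
            targets = definition.get("handlers", [])
        else:
            targets = [name]
        for target in targets:
            if target not in resolved:
                resolved.append(target)
    return resolved
```

Adding "logstash" to the set changes the result for every check that references "severity1", with no edit to the checks themselves.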

Slide 47

Slide 47 text

route check results to different teams

Slide 48

Slide 48 text

"checks": { "mysql_healthcheck": { ... "handler": ["pagerduty"], "pagerduty_team": "dbas" ... keys in check definition are available to handlers "handlers": { "pagerduty": { "command": "pagerduty.rb" ... # /etc/sensu/handlers/pagerduty.rb # implement your logic to look at event['check']['pagerduty_team'] and route to the appropriate pagerduty escalation group

Slide 49

Slide 49 text

route sev 1 tickets to pagerduty route sev 2 to ticket system

Slide 50

Slide 50 text

"checks": { "mysql_healthcheck": { ... "handler": ["pagerduty"], ... "disk_check": { ... "handler": ["jira"], ... some alerts need immediate attention - call out ! some alerts can be deferred until business hours - open a ticket instead

Slide 51

Slide 51 text

only alert on demo environment during M-F, 9-5

Slide 52

Slide 52 text

demo environment is deployed at a cloud provider that charges significantly less when instances are not running
demo environment is used by sales staff, who only work M-F 9-5
shut the environment down during non-business hours to save money, but don't alert on it
two possible ways to approach this:

Slide 53

Slide 53 text

method 1 - delete demo nodes via the api
    # demo_shutdown_cron.sh
    ...
    for node in demoweb1 demodb1; do
      cloud-cli shutdown "$node"
      curl -X DELETE "http://$SENSU_API_URL/client/$node"
    done
    ...
when the demo environment spins up, the nodes will automatically re-register with sensu

Slide 54

Slide 54 text

method 2 - set custom client attribute
    {
      "client": {
        "name": "demoweb1",
        "handler": "pagerduty",
        "environment": "demo",
        ...
add custom logic to your handlers to ignore clients in the demo environment outside of business hours
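
The custom handler logic for method 2 might look like the Python sketch below. It reads the custom `environment` attribute from the client section of the event and suppresses alerts for the demo environment outside Monday to Friday, 9 to 5. The function names are hypothetical, and the sketch assumes `now` is already in the sales team's local time.

```python
from datetime import datetime


def in_business_hours(now):
    """True on Monday to Friday between 09:00 and 17:00 (local time assumed)."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def should_alert(event, now):
    """Suppress alerts for demo clients outside business hours."""
    environment = event.get("client", {}).get("environment")
    if environment == "demo":
        return in_business_hours(now)
    return True
```

A handler would call `should_alert(event, datetime.now())` first and exit quietly when it returns False.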

Slide 55

Slide 55 text

link checks to wiki docs (runbook-y stuff)

Slide 56

Slide 56 text

each alert should have a document ("playbook") describing the alert and possible actions to take
again, we're leveraging the ability to set arbitrary key/values on checks
    "checks": {
      "disk_check": {
        ...
        "playbook": "http://wiki.getpantheon.com/playbook/disk_check",
        ...

Slide 57

Slide 57 text

the 'playbook' attribute follows the event through the system. available to handlers, GUIs, sensu API, etc
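
Because custom check attributes travel with the event, a handler can read them directly. The sketch below uses the `playbook` attribute from this example and the `pagerduty_team` attribute from the team-routing slide; the service key table, its values, and the function names are placeholders invented for illustration.

```python
# Hypothetical mapping from team name to a PagerDuty service key.
SERVICE_KEYS = {
    "dbas": "HYPOTHETICAL_DBA_KEY",
    "default": "HYPOTHETICAL_DEFAULT_KEY",
}


def service_key_for(event, keys=SERVICE_KEYS):
    """Pick a service key based on the check's pagerduty_team attribute."""
    team = event.get("check", {}).get("pagerduty_team", "default")
    return keys.get(team, keys["default"])


def describe(event):
    """Build an alert description, appending the playbook link when present."""
    check = event["check"]
    text = f"{check['name']}: {check['output'].strip()}"
    playbook = check.get("playbook")
    if playbook:
        text += f" (playbook: {playbook})"
    return text
```

Checks that do not set either attribute still work: they fall back to the default key and get a description without a link.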

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

experimental features

Slide 60

Slide 60 text

handler extensions
run inside the sensu-server process, no forking
tcp, udp handlers are implemented as extensions
risky!! do not block the reactor!
only use when very high throughput is needed!
alternatively, implement your handler as a separate process and send events to it via the tcp handler

Slide 61

Slide 61 text

room for improvement

Slide 62

Slide 62 text

more experiments with multi-datacenter setups to find best practices
document HA best practices around redis, rabbitmq
"out of box experience" - make it dead simple to sell to your boss & peers
things are always improving

Slide 63

Slide 63 text

MOAR DOCUMENTATION !! http://docs.sensuapp.org fork it and help, please! https://github.com/sensu/sensu-docs

Slide 64

Slide 64 text

other great resources

Slide 65

Slide 65 text

● IRC: #sensu on freenode
● mailing lists (google groups):
  ○ sensu-user
  ○ sensu-dev
● sensu docs
  ○ http://docs.sensuapp.org
● "Why Switch? (Nagios to Sensu)"
  ○ http://www.slideshare.net/jeremy_carroll/sensu-14485155
● HA Sensu (from @failshell on #sensu)
  ○ https://blog.theroux.ca/sensu/high-availability-sensu/

Slide 66

Slide 66 text

sandbox vagrant .box, built with chef-monitor recipes runs all sensu components repos.sensuapp.org/box/sandbox.box