WTF is Sensu - A DevOps guide to Monitoring

by Toby Jackson - Systems Engineer at Future PLC

Abstract
Find out how to monitor the health of your applications and services with a modern, scalable monitoring solution that keeps both Developers and Operators happy. This presentation will introduce some core monitoring concepts, and the architecture and configuration of Sensu in a number of different scenarios.

Bio
Toby Jackson is a Software Developer turned Systems Engineer at Future PLC with experience in everything from racking servers to developing simple cloud-based web applications. If he's not experimenting with new software, he'll be holding a soldering iron, starting another IoT project.

Hello Future

October 21, 2015

Transcript

  1. .WTF/whois
    self:
      author: ‘Toby Jackson <[email protected]>’
      role: ‘Operations Engineer’
      twitter: ‘@warmfusion’
      github: ‘github.com/warmfusion’
      employer: ‘www.futureplc.com/yourfuturejob/’
  2. .WTF/is/monitoring?experience
    • Developer turned Engineer
    • Implemented Sensu at Future PLC
      ◦ 340+ hosts, vms, switches etc
    • Helped shape our approach to monitoring
  3. .WTF/is/monitoring?_index
    Why do we monitor our systems?
    What should we look for?
    How can Sensu help us?
    Questions…?
  4. .WTF/is/monitoring?why
    • Client - Are they down, or is it just me?
    • CEO - Are we making money?
    • Manager - Are we meeting SLA agreements?
    • Engineer - Am I woken up for the right reasons?
    • Developer - Did my deploy work?
    • Everyone...
      ◦ What’s happening in our environment?
  5. .WTF/is/monitoring?why_tomorrow
    • Client - Is maintenance going to happen soon?
    • CEO - Are we going to keep making money?
    • Manager - Can we meet new SLA agreements?
    • Engineer - Why might I get woken up tonight?
    • Developer - When do I need to optimise?
    • Everyone...
      ◦ What’s going to happen in our environment?
  6. .WTF/is/monitoring?principles
    Focus on your customers
    Use a couple of monitoring systems
    De-couple your checks from your code
    Remember workflow events
    Many simple checks > Fewer clever checks
    Don’t wake me up if it can wait
  7. .WTF/is/monitoring?first_steps
    • Look for the big impact entry points
    • Review past incidents for danger zones
    • Don’t be afraid to admit that risky code exists
  8. .WTF/is/monitoring?common
    • Disk, Ram, Load, Network
    • Patches available
    • Uptime
    • Logged in users
    • Config Management status
  9. .WTF/is/monitoring?services
    • Create http status endpoints
    • JSON is great
    • 200 OK / 503 Service Unavailable
    • Lightweight
    • Downstream dependencies?
    • Service metrics?
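A status endpoint along the lines of slide 9 might look like the sketch below. It is not taken from the talk: the `/status` path, the `check_dependencies` probes, and the JSON field names are all illustrative assumptions, and only the Python standard library is used.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_dependencies():
    """Hypothetical downstream probes; replace with real checks."""
    return {"database": True, "cache": True}


def build_status(dependencies):
    """Return (http_code, json_body) for a dict of dependency results."""
    healthy = all(dependencies.values())
    code = 200 if healthy else 503  # 200 OK / 503 Service Unavailable
    body = json.dumps(
        {"status": "ok" if healthy else "unavailable",
         "dependencies": dependencies},
        sort_keys=True,
    )
    return code, body


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the (assumed) /status path is served; everything else is a 404.
        if self.path != "/status":
            self.send_error(404)
            return
        code, body = build_status(check_dependencies())
        payload = body.encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host="127.0.0.1", port=8080):
    """Start the status server; blocks until interrupted."""
    HTTPServer((host, port), StatusHandler).serve_forever()
```

Calling `serve()` and requesting `http://127.0.0.1:8080/status` returns 200 while every dependency reports healthy and 503 otherwise, so a monitoring check only has to look at the status code.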
  10. .WTF/is/monitoring?company
    • Programmatic goals can be monitored
    • See if revenue, purchases or direct customer interactions can be watched
    • Watch for social media mentions
  11. .WTF/is/monitoring?practise_simple
    • nginx & php running
    • Balancer: 200 OK
    • nginx: 200 OK
    • Cron: ignore for now
    [Diagram: Web Load Balancer in front of Web01 (nginx, php, cron) and Web02 (nginx, php)]
  12. .WTF/is/monitoring?practise_adv
    • Balancer >50% backends up
    • Nginx < 200ms response
    • Cron err log empty && <1hr old
    [Diagram: Web Load Balancer in front of Web01 (nginx, php, cron) and Web02 (nginx, php)]
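The three conditions on slide 12 can each be written as a small, independent predicate, in keeping with the "many simple checks" principle. The sketch below is an illustration only: the function names and the way the inputs are obtained (backend counts, a measured response time, a log path) are assumptions, not part of the talk.

```python
import os
import time


def balancer_ok(backends_up, backends_total):
    """True when strictly more than 50% of the backends are up."""
    return backends_total > 0 and backends_up * 2 > backends_total


def response_ok(elapsed_seconds, limit_seconds=0.2):
    """True when the measured response time is under 200 ms."""
    return elapsed_seconds < limit_seconds


def cron_log_ok(path, now=None, max_age_seconds=3600):
    """True when the cron error log is empty and was modified within the last hour."""
    now = time.time() if now is None else now
    info = os.stat(path)
    return info.st_size == 0 and (now - info.st_mtime) < max_age_seconds
```

Each predicate would sit behind its own check, so a failing balancer and a stale cron log surface as separate events rather than one combined alert.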
  13. .WTF/is/monitoring?practise_clever
    • Spike in traffic
    • Failure counts above thresholds
    • Response sizes are curiously large
    • Lots of (valid) API Auth requests
    [Diagram: Web Load Balancer in front of Web01 (nginx, php, cron) and Web02 (nginx, php)]
  14. .WTF/is/monitoring?what
    Your users matter
      Know when they’re in pain
    Develop a standardised app status page
      Conventional checks are used more frequently
    Check lots of small things
      Scales better and helps to isolate incidents quickly
  15. .WTF/is/sensu?introduction
    “New generation” of monitoring solutions
    Open source, with a paid-for Enterprise edition
    Site: sensuapp.org
    GitHub: github.com/sensu
    IRC: freenode - #sensu
  16. .WTF/is/sensu?what
    Consistent way to describe a service check
    Executes those checks as required
    Reliably handles events (and metrics)
  17. .WTF/is/sensu?why
    • Tries to do one thing well; handle events
    • Compatible with existing check scripts
    • Large active open-source community
    • Scales effectively
  18. .WTF/is/sensu?experience
    • Replaced Nagios, crons etc
    • Raised visibility of monitoring
    • Devolved control to development
    • 340 (ish) hosts, vms, switches, firewalls etc
    • Managed exclusively through Puppet
    • Developed custom plugins and extensions
  19. .WTF/is/sensu?how
    The Sensu Standalone Check Process:
      a. Sensu-Client runs a script with 1 line output and an exit code
      b. Sensu-Client converts the event into JSON and puts it on RabbitMQ
      c. Sensu-Server reads the event and sends it to handlers
      d. Handlers process the event, performing some action
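Step (a) of the process above can be sketched as follows. The exit codes follow the Nagios plugin convention that Sensu checks also use (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the `CheckDisk` name, the thresholds, and the choice of disk usage as the thing being checked are assumptions made for illustration.

```python
import shutil

# Exit codes in the Nagios plugin convention that Sensu checks follow.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(used_fraction, warn=0.80, crit=0.90):
    """Map a used-disk fraction to (exit_code, one_line_output)."""
    if used_fraction >= crit:
        return CRITICAL, f"CheckDisk CRITICAL: {used_fraction:.0%} used"
    if used_fraction >= warn:
        return WARNING, f"CheckDisk WARNING: {used_fraction:.0%} used"
    return OK, f"CheckDisk OK: {used_fraction:.0%} used"


def main(path="/"):
    """Print exactly one line of output and return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"CheckDisk UNKNOWN: {exc}")
        return UNKNOWN
    if usage.total == 0:
        print("CheckDisk UNKNOWN: filesystem reports zero size")
        return UNKNOWN
    code, line = evaluate_disk(usage.used / usage.total)
    print(line)
    return code
```

To install this as a check script, end the file with `raise SystemExit(main())` so the return value becomes the process exit code that the client reads.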
  20. .WTF/is/sensu?standalone_check
    • Describes
      ◦ what check to run
      ◦ how to handle events
    • Runs at a given interval (default 60s)
    • sensu-client handles output and emits events over message brokers
    • Can include custom configuration, which is passed along in the event sent to handlers

    sensu::checks:
      'sensu-server':
        command: 'check-procs.rb -p bin/sensu-server -c 1'
        handlers: ['high', 'pagerduty']
        custom:
          runbook: 'https://wiki.ftr.com/x/4oqq'
          tip: 'Check /var/log/sensu-server.log'
          slack:
            channels:
              - '#craggyisland'
  21. .WTF/is/sensu?runbook
    URI to page summary of
    • Impacted services
    • Troubleshooting
    • Common problems
    • How to fix
    • Who to talk to
    • References to other information
  22. .WTF/is/sensu?handler
    • Process events
    • Perform some (or no) action
    • Typically used to send alerts or emails

    sensu::handler:
      slack:
        type: 'pipe'
        command: 'slack.rb'
        config:
          webhook_token: 'SECRET/KEY'
          bot_name: 'sensu'
          channel: '#alerts'
      pagerduty:
        type: 'pipe'
        command: 'pagerduty.rb'
        severities: ['ok', 'critical']
        config:
          api_key: SECRET_TOKEN_HERE
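A `pipe` handler is a command that receives the event as JSON on standard input. The sketch below shows roughly what such a command might do. It is not the `slack.rb` or `pagerduty.rb` plugin from the slide: the message format is invented, and it assumes that custom check attributes such as `runbook` appear directly on the event's `check` object.

```python
import json
import sys

STATUS_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def format_alert(event):
    """Build a one-line alert message from a Sensu-style event dict."""
    client = event.get("client", {}).get("name", "unknown-client")
    check = event.get("check", {})
    name = check.get("name", "unknown-check")
    status = STATUS_NAMES.get(check.get("status"), "UNKNOWN")
    output = check.get("output", "").strip()
    line = f"{status}: {client}/{name}: {output}"
    # Assumption: custom attributes (e.g. runbook) sit on the check itself.
    runbook = check.get("runbook")
    if runbook:
        line += f" (runbook: {runbook})"
    return line


def handle(stream=sys.stdin):
    """Read one event from the stream and print the alert line."""
    event = json.load(stream)
    print(format_alert(event))
```

A real handler would post the line to a chat webhook or paging service instead of printing it; calling `handle()` at the bottom of the file turns the sketch into a command that can be tested by piping a saved event into it.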
  23. .WTF/is/sensu?standalone_metrics
    • The same as checks but...
    • handlers: [‘metrics’]
      ◦ A special handler for this kind of result
    • type: metric
      ◦ Tells sensu to always send the output to the handler

    sensu::checks:
      cpu-pcnt-usage-metrics:
        command: 'cpu-pcnt-usage-metrics.rb'
        handlers: ['metrics']
        type: metric
  24. .WTF/is/sensu?metric_example
    Key                     Value  Timestamp
    ix-sensu01.cpu.user     70.92  1440425049
    ix-sensu01.cpu.nice      0.00  1440425049
    ix-sensu01.cpu.system    8.16  1440425049
    ix-sensu01.cpu.idle     19.90  1440425049
    ix-sensu01.cpu.iowait    0.00  1440425049
    ix-sensu01.cpu.irq       0.00  1440425049
    ix-sensu01.cpu.softirq   1.02  1440425049
    ix-sensu01.cpu.steal     0.00  1440425049
    ix-sensu01.cpu.guest     0.00  1440425049
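The metric output above is the Graphite plaintext format: a dotted key, a value, and a Unix timestamp, separated by single spaces. A metric check only needs to print lines of that shape. The sketch below is illustrative; the function name and the idea of passing percentages in as a dict are assumptions, and how the CPU figures are collected is left out.

```python
import time


def graphite_lines(host, cpu_percentages, timestamp=None):
    """Format {metric_name: value} as 'key value timestamp' lines."""
    timestamp = int(time.time()) if timestamp is None else int(timestamp)
    return [
        f"{host}.cpu.{name} {value:.2f} {timestamp}"
        for name, value in cpu_percentages.items()
    ]
```

Printing each returned line to standard output from a check defined with `type: metric` is enough for the output to reach the metrics handler.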
  25. .WTF/is/sensu?issues
    • Uchiwa isn’t perfect
    • Sensu-API can crash sometimes
    • No maintained history (over 20 events)
    • Check dependencies are handled on clients
    • Redis for datastore
      ◦ Redundancy is a little harder (for me at least)
  26. .WTF/is/sensu?wins
    • Alerts into Slack channels
    • Handles network partitions really well
    • Easy to create new checks and handlers
  27. .WTF/is/monitoring?further_reading
    Programmatic Alert Correlation - Elik Eizenberg: youtu.be/EXk19d09n54
    Effective Incident Communication - Scott Klein: youtu.be/ySSdqfZlC7Y
    Search for Operability 2015 on YouTube
  28. .WTF/whois?q=
    self:
      author: ‘Toby Jackson <[email protected]>’
      role: ‘Operations Engineer’
      twitter: ‘@warmfusion’
      github: ‘github.com/warmfusion’
      employer: ‘www.futureplc.com/yourfuturejob/’
    Any Questions…?