Observability: the new incompetence

Why you need monitoring, and how to get started with Sensu and Graphite.

Elliot Murphy

August 15, 2013

Transcript

  1. 2.

    Basic competency as a product dev team
    • Version control
    • Appropriate databases
    • Test suite; green, deployable trunk
    • Secure handling of passwords, credit cards
    • Timely security patching
  2. 5.

    Hey, the site is down!
    Uhh...our website?
    (45 minutes later) OK I made more room on the disk and restarted.
  3. 6.

    “BTW, looks like our storage costs are growing at $1500 a month based on a 3 month average, and we’ll need a redesign of the storage system in 18 months”
    Wow, you are dextrous & proactive, a consummate professional! Have some more money and servers.
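The projection quoted on this slide is simple arithmetic once the bills are recorded. A minimal sketch, assuming four consecutive monthly storage bills and a budget ceiling; all figures and function names here are hypothetical and chosen only so the result lines up with the slide's numbers:

```python
import math


def average_monthly_growth(monthly_costs):
    """Average month-over-month change across consecutive bills."""
    if len(monthly_costs) < 2:
        raise ValueError("need at least two bills to measure growth")
    deltas = [later - earlier
              for earlier, later in zip(monthly_costs, monthly_costs[1:])]
    return sum(deltas) / len(deltas)


def months_until(limit, current, growth_per_month):
    """Whole months until `current` reaches `limit` at constant growth.

    Returns None when costs are flat or shrinking.
    """
    if growth_per_month <= 0:
        return None
    return max(0, math.ceil((limit - current) / growth_per_month))


# Hypothetical bills: four months give three month-over-month deltas.
bills = [10000, 11400, 13100, 14500]
growth = average_monthly_growth(bills)          # 1500.0 dollars per month
runway = months_until(41500, bills[-1], growth)  # 18 months
```

A linear projection like this is only a first approximation; it is enough to start the conversation about a redesign before the disk is full.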
  4. 7.

    Hey, users are reporting a crash when updating their profile...
    Uhh...it’s working in dev. Are you sure they are using it correctly?
    Yup, we have seen 37 crashes this morning. Looks like users with Cyrillic names are affected. ETA on a fix is less than one hour.
  5. 8.

    The site seems slow
    Uhh...well we’ve been writing a lot of code and not paying attention
    Hmm. Our 99th percentile page load time has actually improved by 10% over the last 3 months. Show me the page that seems slow and we will analyze what’s going on.
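To make the "99th percentile" answer concrete, here is a minimal sketch of a nearest-rank percentile over recorded page load times, plus the relative improvement between two periods. The function names and sample values are hypothetical; a real setup would more likely read these numbers from the time series database than compute them by hand.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def relative_improvement(old, new):
    """Fraction by which `new` is lower than `old` (0.10 means 10% faster)."""
    return (old - new) / old


# Hypothetical load times in milliseconds: 1, 2, ..., 100.
load_times_ms = list(range(1, 101))
p99 = percentile(load_times_ms, 99)
```

The 99th percentile is used here rather than the mean because a handful of very slow page loads can hide behind a healthy-looking average.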
  6. 9.

    You need monitoring
    MONITORS, LOTS OF MONITORS.
    MONITORS ARE FOR EXPLORATION/ANALYSIS.
    HUMANS STINK AT WATCHING COMPUTERS.
    SHOULD BE IMPOSSIBLE FOR BOSS, INVESTORS, USERS TO BE THE FIRST TO NOTIFY YOU.
  7. 11.

    What is monitoring
    • Critical/Warn/OK monitoring
    • Trending over time, event correlation, capacity planning
    • Alerting - putting an event in the audit log, waking someone up for an emergency, opening a ticket to be addressed next week.
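A Critical/Warn/OK check can be very small. The sketch below assumes the Nagios-style convention that Sensu checks also follow: the check prints one line and reports 0 for OK, 1 for warning, 2 for critical. The thresholds and function names are hypothetical.

```python
import shutil

# Nagios-style status codes, also understood by Sensu.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(used_percent, warn_at=80.0, crit_at=90.0):
    """Map a disk usage percentage to a status code (thresholds are assumed)."""
    if used_percent >= crit_at:
        return CRITICAL
    if used_percent >= warn_at:
        return WARNING
    return OK


def check_disk(path="/"):
    """Return (status, message) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total
    status = classify(used_percent)
    return status, f"CheckDisk {LABELS[status]}: {used_percent:.1f}% used on {path}"


def main():
    """Print the one-line result and return the status code."""
    status, message = check_disk()
    print(message)
    return status
```

To run this as a standalone check script, finish the file with `import sys` and `sys.exit(main())` so the status code becomes the process exit code.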
  8. 12.

    What can we watch?
    • Business or application level metrics - revenue, signups, cancellations, engagement
    • Raw server health (disk space, memory, IO)
    • Application health (open DB connections, page render time, rate of each HTTP status code, did backups happen last night?)
    • User experience (javascript errors, app server exceptions, load times)
    • Vacuum metrics from other places into your system (YouTube likes, AWS Load Balancers)
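As one example of an application health metric from the list above, the "rate of each HTTP status code" can be derived from the status codes observed during a sampling interval. This is a minimal sketch; the function name, the input shape, and the sample data are assumptions for illustration.

```python
from collections import Counter


def status_rates(status_codes, interval_seconds):
    """Requests per second for each HTTP status code seen in the interval."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    counts = Counter(status_codes)
    return {str(code): count / interval_seconds
            for code, count in sorted(counts.items())}


# Hypothetical sample: four responses observed over two seconds.
rates = status_rates([200, 200, 404, 500], 2.0)
```

Tracking each code separately makes a spike in 500s visible even when overall traffic looks normal.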
  9. 14.

    Do you want to spend time or money?
    • “lean”, maybe only contract devs - just use a bunch of SaaS products. Valid approach.
    • Bootstrapping?
    • HIPAA or PCI data protection?
    • Any fulltime devs?
    • Consider running some of your own monitoring tools. Business folks love ‘em too
  10. 15.

    Running your own
    • Sensu
    • Sensu-community-plugins
    • Graphite, Descartes, Tasseo
    • Pagerduty or OpsGenie for alerting
    • Logstash, Kibana (http://kibana.org/infrastructure.html, needs sensu)
    • Vagrant+Chef for config management
    Start HERE
  11. 16.

    Part 2: Sensu+Graphite
    • Sensu is a monitoring framework. Successor to Nagios, emerged from needs of cloudy app architecture. #monitoringsucks
    • Graphite is a time series database. Successor to RRD, originally written at orbitz.com. Amazingly flexible, surging in popularity (Github, hosted services)
  12. 17.

    a monitoring framework
    [Diagram labels: Server; Clients (each "Runs checks"); OK/Warn/Crit; Current metric count; Pagerduty Handler; Ticket Handler; Graphite Handler; API; Dashboard (disable checks)]
  13. 19.

    Recording events
    • deploy happened
    • marketing email sends
    • press release
    • server went offline
    [Diagram labels: Event pushed from; Clients (each "Runs checks"); OK/Warn/Crit]
    What do we do with events?
    • Recording: audit log, ticket
    • Sense-making: overlay on graph
    • Escalate: pagerduty
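The three uses for events listed on this slide can be sketched as a small routing function. This is not Sensu's handler API; the `Event` class, the action names, and the severity rules are hypothetical and only illustrate the idea that every event is recorded while only some are escalated.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    status: int  # 0 = OK, 1 = warning, 2 = critical
    timestamp: float = field(default_factory=time.time)


def route(event):
    """Decide what to do with an event (hypothetical policy)."""
    # Recording and sense-making apply to every event, even a plain deploy.
    actions = ["audit_log", "graph_overlay"]
    if event.status == 1:
        actions.append("ticket")   # address it next week
    elif event.status >= 2:
        actions.append("page")     # wake someone up
    return actions


deploy_actions = route(Event("deploy happened", 0))
outage_actions = route(Event("server went offline", 2))
```

Keeping the policy in one place makes it easy to review which events are allowed to wake someone up.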
  14. 20.

    Recording metrics
    • free disk space
    • page load times
    • new signups rate
    • Facebook likes
    [Diagram label: Current metric count]
    What do we do with metrics?
    • Record: time series db
    • Sense-making: Draw graphs
    • Remix: derivative, sum, time shift
    • Inception: alert on thresholds (disk full, error rate changing too rapidly)
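Recording a metric in Graphite can be as simple as writing one line of text. The sketch below formats data points for Carbon's plaintext protocol (`path value timestamp`, one per line, by default on TCP port 2003) and shows a hand-rolled rate of change of the kind the "Remix" bullet refers to. The metric path, host, and helper names are assumptions; in practice Graphite's own render functions would usually do the remixing.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point for Carbon's plaintext protocol."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n"


def send_to_graphite(lines, host="localhost", port=2003):
    """Send preformatted lines to a Carbon listener (host is an assumption)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


def per_second_change(points):
    """Rate of change between consecutive (timestamp, value) points."""
    return [(t1, (v1 - v0) / (t1 - t0))
            for (t0, v0), (t1, v1) in zip(points, points[1:])]


# Hypothetical metric path and values.
line = graphite_line("servers.web1.disk.free_bytes", 1024, 1376524800)
rates = per_second_change([(0, 10), (10, 30), (20, 30)])
```

A threshold alert on the output of `per_second_change` is one way to catch an error rate that is "changing too rapidly" before the absolute number looks alarming.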
  15. 21.

    Publishing metrics from your app
    Run statsd in front of Graphite. This is a statistics aggregator that makes it easier to measure correctly: counters (gives you rate+count), sampling, timers with histograms, gauges, uniques.
    https://github.com/reinh/statsd (statsd-ruby gem)
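The slide points at the statsd-ruby gem; the wire format underneath is language independent, so the same idea is sketched here in Python without a client library. Each metric is a small UDP datagram of the form `name:value|type`, where the type is `c` (counter), `ms` (timer), `g` (gauge), or `s` (set, for uniques), with an optional `|@rate` suffix for sampled counters. The metric names, host, and helper names are assumptions; 8125 is the usual statsd port.

```python
import socket


def statsd_datagram(name, value, kind, sample_rate=1.0):
    """Build one statsd datagram, e.g. 'signups:1|c'."""
    if kind not in ("c", "ms", "g", "s"):
        raise ValueError(f"unknown metric type: {kind}")
    payload = f"{name}:{value}|{kind}"
    if sample_rate < 1.0:
        payload += f"|@{sample_rate}"
    return payload


def send(payload, host="localhost", port=8125):
    """Fire-and-forget UDP send; a missing statsd server will not block the app."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))


# Hypothetical metric names.
counter = statsd_datagram("app.signups", 1, "c")
timer = statsd_datagram("app.page_render", 320, "ms")
sampled = statsd_datagram("app.page_views", 1, "c", sample_rate=0.1)
```

Because the transport is UDP, instrumenting a hot code path costs very little, which is part of why putting statsd in front of Graphite makes it easier to measure correctly.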
  16. 22.
  17. 23.
  18. 24.
  19. 25.
  20. 26.
  21. 27.

    Cassandra, Apache, DNS, elasticsearch, files, graphite, haproxy, hbase, java, logging, lxc, memcached, mongodb, pingdom, opsgenie, percona, postfix, snmp, solr, twilio, youtube, aws, varnish, postgres, redis, riak, rabbitmq, processes
  22. 28.

    • This is a lot of moving parts
    • You will never set up monitoring infra
    • You will never keep it updated
    • Unless it is *easy*
    • 50 lines of chef-solo and 10 minutes later the entire system springs to life
    Use configuration management