Observability: the new incompetence

Part 1: Open your eyes

Basic competency as a product dev team
• Version control
• Appropriate databases
• Test suite; green, deployable trunk
• Secure handling of passwords, credit cards
• Timely security patching

Pointless if you can’t keep the product running.

Examples (yes, I’ve made every one of these mistakes)

Hey, the site is down!
Uhh...our website?
(45 minutes later) OK I made more room on the disk and restarted.

“BTW, looks like our storage costs are growing at $1500 a month based on a 3 month average, and we’ll need a redesign of the storage system in 18 months”
Wow, you are dextrous & proactive, a consummate professional! Have some more money and servers.

Hey, users are reporting a crash when updating their profile...
Uhh...it’s working in dev. Are you sure they are using it correctly?
Yup, we have seen 37 crashes this morning. Looks like users with Cyrillic names are affected. ETA on a fix is less than one hour.

The site seems slow
Uhh...well we’ve been writing a lot of code and not paying attention
Hmm. Our 99th percentile page load time has actually improved by 10% over the last 3 months. Show me the page that seems slow and we will analyze what’s going on.
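A 99th percentile figure like the one quoted here comes from raw load-time samples. The sketch below uses the nearest-rank method, which is one of several common percentile definitions; the function name and the sample values are made up for illustration.

```python
def percentile(samples, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, so there is no float rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical page load times in milliseconds.
load_times_ms = [120, 95, 110, 2300, 105, 98, 101, 130, 99, 102]
p50 = percentile(load_times_ms, 50)
p99 = percentile(load_times_ms, 99)
```

With these samples the median looks healthy while the 99th percentile exposes the single very slow page load, which is why the tail is worth tracking separately from the average.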

You need monitoring
MONITORS, LOTS OF MONITORS.
MONITORS ARE FOR EXPLORATION/ANALYSIS.
HUMANS STINK AT WATCHING COMPUTERS.
SHOULD BE IMPOSSIBLE FOR BOSS, INVESTORS, USERS TO BE THE FIRST TO NOTIFY YOU.

YOUR JOB IS TO RESPOND, NOT TO REACT.

What is monitoring
• Critical/Warn/OK monitoring
• Trending over time, event correlation, capacity planning
• Alerting - putting an event in the audit log, waking someone up for an emergency, opening a ticket to be addressed next week.
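Trending and capacity planning can start very small. As a rough sketch, assuming one disk-usage sample per day and roughly linear growth, a least-squares line gives a projected number of days until the disk is full. The function name and the sample numbers are hypothetical.

```python
def days_until_full(daily_used_gb, capacity_gb):
    """Fit a straight line to daily usage and project when it hits capacity.

    Returns None if usage is flat or shrinking.
    """
    n = len(daily_used_gb)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_used_gb) / n
    # Ordinary least-squares slope: GB of growth per day.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_used_gb))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    if slope <= 0:
        return None
    # Project from the most recent sample, not from the fitted line.
    remaining_gb = capacity_gb - daily_used_gb[-1]
    return remaining_gb / slope


usage_gb = [400, 402, 404, 406, 408]  # hypothetical: GB used, one sample per day
projection = days_until_full(usage_gb, 500)
```

A projection like this is the kind of number that belongs in a ticket to be addressed next week, long before it becomes a page at 3am.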

What can we watch?
• Business or application level metrics - revenue, signups, cancellations, engagement
• Raw server health (disk space, memory, IO)
• Application health (open DB connections, page render time, rate of each HTTP status code, did backups happen last night?)
• User experience (JavaScript errors, app server exceptions, load times)
• Vacuum metrics from other places into your system (YouTube likes, AWS Load Balancers)

A blank sheet of paper is scary, paralyzing

Do you want to spend time or money?
• “lean”, maybe only contract devs - just use a bunch of SaaS products. Valid approach.
• Bootstrapping?
• HIPAA or PCI data protection?
• Any fulltime devs?
• Consider running some of your own monitoring tools. Business folks love ‘em too

Running your own
• Sensu
• Sensu-community-plugins
• Graphite, Descartes, Tasseo
• Pagerduty or OpsGenie for alerting
• Logstash, Kibana (http://kibana.org/infrastructure.html, needs sensu)
• Vagrant+Chef for config management
Start HERE

Part 2: Sensu+Graphite
• Sensu is a monitoring framework. Successor to Nagios, emerged from needs of cloudy app architecture. #monitoringsucks
• Graphite is a time series database. Successor to RRD, originally written at orbitz.com. Amazingly flexible, surging in popularity (Github, hosted services)

a monitoring framework
[Diagram: several Clients, each of which runs checks, report OK/Warn/Crit results and current metric counts to the Server. The Server passes them to a Pagerduty Handler, a Ticket Handler, and a Graphite Handler. An API and Dashboard can disable checks.]
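Sensu checks follow the Nagios plugin convention: a check is an ordinary program that prints one line of output and exits 0 for OK, 1 for warning, or 2 for critical. The sketch below is a minimal disk check in that style; the thresholds, function names, and message wording are illustrative assumptions, not taken from sensu-community-plugins.

```python
import shutil

OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to a check status (thresholds are assumptions)."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (status, one-line message) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    percent_used = 100.0 * usage.used / usage.total
    status = classify(percent_used, warn, crit)
    return status, f"CheckDisk {LABELS[status]}: {path} is {percent_used:.1f}% full"


def main():
    """Print the check output and return the exit status."""
    status, message = check_disk()
    print(message)
    return status

# When installed as a check script, finish with: sys.exit(main())
```

Because the contract is just stdout plus an exit status, the same script can be run by hand while debugging and by the client on a schedule.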

time series DB

Recording events
• deploy happened
• marketing email sends
• press release
• server went offline
What do we do with events?
• Recording: audit log, ticket
• Sense-making: overlay on graph
• Escalate: pagerduty
[Diagram: event pushed from a Client that runs checks (OK/Warn/Crit)]
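An event such as "deploy happened" can be pushed from a deploy script rather than discovered by a scheduled check. The sketch below assumes a Sensu client is running locally with its input socket on 127.0.0.1 port 3030 and that it accepts a JSON document with `name`, `output`, and `status` fields; verify both against the Sensu version you run. The helper names and the event text are hypothetical.

```python
import json
import socket


def build_event(name, output, status=0):
    """Serialize an event; status uses the 0/1/2 OK/Warn/Crit convention."""
    return json.dumps({"name": name, "output": output, "status": status})


def push_event(payload, host="127.0.0.1", port=3030, timeout=2.0):
    """Send the JSON payload over TCP to the (assumed) local client socket."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload.encode("utf-8"))


# Hypothetical event recorded at the end of a deploy script.
deploy_event = build_event("deploy", "deploy finished", status=0)
```

Calling `push_event(deploy_event)` from the last step of a deploy hands the event to the handlers, which can then record it, overlay it on a graph, or escalate it.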

Recording metrics
• free disk space
• page load times
• new signups rate
• Facebook likes
What do we do with metrics?
• Record: time series db
• Sense-making: Draw graphs
• Remix: derivative, sum, time shift
• Inception: alert on thresholds (disk full, error rate changing too rapidly)
[Diagram: current metric count]
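Recording a metric in Graphite can be as simple as writing one line of text. Graphite's plaintext protocol is `path value timestamp` followed by a newline, sent to the carbon listener, which conventionally runs on port 2003. The sketch below assumes that default port and a local carbon instance; the metric path is made up.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one sample in Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="127.0.0.1", port=2003, timestamp=None):
    """Send one sample to carbon over TCP (host and port are assumptions)."""
    line = graphite_line(path, value, timestamp)
    with socket.create_connection((host, port), timeout=2.0) as conn:
        conn.sendall(line.encode("ascii"))


# Hypothetical metric path and value.
example = graphite_line("servers.web01.disk.free_gb", 42.5, 1700000000)
```

Once a series exists, the "remix" step maps onto Graphite render functions such as `derivative()`, `sumSeries()`, and `timeShift()`, applied at graph time rather than at write time.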

Publishing metrics from your app
Run statsd in front of Graphite. This is a statistics aggregator that makes it easier to measure correctly: counters (gives you rate+count), sampling, timers with histograms, gauges, uniques.
https://github.com/reinh/statsd (statsd-ruby gem)
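The slide points at the statsd-ruby gem; the wire format underneath is plain text over UDP, so it can be sketched in any language. The example below builds statsd lines by hand in Python for each metric type listed above, assuming statsd listens on its usual default of UDP port 8125. The helper names and metric names are hypothetical, and a real app would normally use a client library instead.

```python
import socket


def statsd_line(name, value, kind, sample_rate=None):
    """Format one statsd datagram: name:value|type[|@sample_rate]."""
    line = f"{name}:{value}|{kind}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line


def send(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send (host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


counter = statsd_line("signups", 1, "c")                       # counter
sampled = statsd_line("page.views", 1, "c", sample_rate=0.1)   # sampled counter
timer = statsd_line("page.render", 320, "ms")                  # timer, milliseconds
gauge = statsd_line("db.open_connections", 12, "g")            # gauge
unique = statsd_line("visitors", "user42", "s")                # set of uniques
```

UDP is a deliberate choice here: if statsd is down, the app loses some metrics but never blocks or fails a request because of monitoring.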

Cassandra Apache DNS elasticsearch files graphite haproxy hbase java logging
lxc memcached mongodb pingdom opsgenie percona postfix snmp solr twilio youtube aws varnish postgres redis riak rabbitmq processes

Use configuration management
• This is a lot of moving parts
• You will never set up monitoring infra
• You will never keep it updated
• Unless it is *easy*
• 50 lines of chef-solo and 10 minutes later the entire system springs to life

links
https://speakerdeck.com/statik/observability-the-new-incompetence
http://sensuapp.com/
https://github.com/sensu/sensu-community-plugins
http://graphite.wikidot.com/
http://animals-riding-animals.tumblr.com/

Elliot Murphy
