Observability: the new incompetence

Slide 1

Slide 1 text

Observability: The new incompetence
Part 1: Open your eyes

Slide 2

Slide 2 text

Basic competency as a product dev team
• Version control
• Appropriate databases
• Test suite; green, deployable trunk
• Secure handling of passwords, credit cards
• Timely security patching

Slide 3

Slide 3 text

Pointless if you can’t keep the product running.

Slide 4

Slide 4 text

Examples (yes, I’ve made every one of these mistakes)

Slide 5

Slide 5 text

Hey, the site is down!
Uhh...our website?
(45 minutes later) OK I made more room on the disk and restarted.

Slide 6

Slide 6 text

“BTW, looks like our storage costs are growing at $1500 a month based on a 3 month average, and we’ll need a redesign of the storage system in 18 months”
Wow, you are dextrous & proactive, a consummate professional! Have some more money and servers.

Slide 7

Slide 7 text

Hey, users are reporting a crash when updating their profile...
Uhh...it’s working in dev. Are you sure they are using it correctly?
Yup, we have seen 37 crashes this morning. Looks like users with Cyrillic names are affected. ETA on a fix is less than one hour.

Slide 8

Slide 8 text

The site seems slow
Uhh...well we’ve been writing a lot of code and not paying attention
Hmm. Our 99th percentile page load time has actually improved by 10% over the last 3 months. Show me the page that seems slow and we will analyze what’s going on.
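The 99th percentile figure in the good answer can be computed from raw page load samples. A minimal sketch, assuming the nearest-rank definition of a percentile; the sample values and the name `load_times_ms` are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical page load times in milliseconds (not real data).
load_times_ms = [120, 95, 110, 130, 2400, 105, 98, 115, 101, 125]

p50 = percentile(load_times_ms, 50)   # the typical request
p99 = percentile(load_times_ms, 99)   # the slow tail that users complain about
```

The median hides the one 2400 ms request; the 99th percentile surfaces it, which is why tail percentiles are the number to track over time.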

Slide 9

Slide 9 text

You need monitoring
MONITORS, LOTS OF MONITORS.
MONITORS ARE FOR EXPLORATION/ANALYSIS.
HUMANS STINK AT WATCHING COMPUTERS.
SHOULD BE IMPOSSIBLE FOR BOSS, INVESTORS, USERS TO BE THE FIRST TO NOTIFY YOU.

Slide 10

Slide 10 text

YOUR JOB IS TO RESPOND, NOT TO REACT.

Slide 11

Slide 11 text

What is monitoring
• Critical/Warn/OK monitoring
• Trending over time, event correlation, capacity planning
• Alerting: putting an event in the audit log, waking someone up for an emergency, opening a ticket to be addressed next week.
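A Critical/Warn/OK check can be sketched as a function that maps a measurement to a status. This is an illustration only: the thresholds (80% and 90%) and the function names are assumptions, and the numeric statuses follow the Nagios-style exit code convention (0 OK, 1 warning, 2 critical) that check scripts commonly use.

```python
import shutil

# Nagios-style statuses; a check script would exit with one of these.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(used_pct, warn_at=80.0, crit_at=90.0):
    """Map a disk usage percentage to a status (thresholds are assumed)."""
    if used_pct >= crit_at:
        return CRITICAL
    if used_pct >= warn_at:
        return WARNING
    return OK


def check_disk(path="/"):
    """Measure disk usage for path, print a one-line summary, return status."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    status = classify(used_pct)
    print(f"CheckDisk {LABELS[status]}: {used_pct:.1f}% used on {path}")
    return status

# As a standalone check script, finish with: sys.exit(check_disk())
```

Keeping `classify` separate from the measurement makes the thresholds easy to test without touching a real disk.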

Slide 12

Slide 12 text

What can we watch?
• Business or application level metrics - revenue, signups, cancellations, engagement
• Raw server health (disk space, memory, IO)
• Application health (open DB connections, page render time, rate of each HTTP status code, did backups happen last night?)
• User experience (JavaScript errors, app server exceptions, load times)
• Vacuum metrics from other places into your system (YouTube likes, AWS Load Balancers)

Slide 13

Slide 13 text

A blank sheet of paper is scary, paralyzing

Slide 14

Slide 14 text

Do you want to spend time or money?
• “lean”, maybe only contract devs - just use a bunch of SaaS products. Valid approach.
• Bootstrapping?
• HIPAA or PCI data protection?
• Any full-time devs?
• Consider running some of your own monitoring tools. Business folks love ’em too

Slide 15

Slide 15 text

Running your own
• Sensu
• Sensu-community-plugins
• Graphite, Descartes, Tasseo
• PagerDuty or OpsGenie for alerting
• Logstash, Kibana (http://kibana.org/infrastructure.html, needs sensu)
• Vagrant+Chef for config management
Start HERE

Slide 16

Slide 16 text

Part 2: Sensu+Graphite
• Sensu is a monitoring framework. Successor to Nagios, emerged from needs of cloudy app architecture. #monitoringsucks
• Graphite is a time series database. Successor to RRD, originally written at orbitz.com. Amazingly flexible, surging in popularity (GitHub, hosted services)

Slide 17

Slide 17 text

a monitoring framework
[Diagram labels: Server; Clients (each runs checks) reporting OK/Warn/Crit and current metric count; Pagerduty Handler; Ticket Handler; Graphite Handler; API; Dashboard (disable checks)]

Slide 18

Slide 18 text

time series DB

Slide 19

Slide 19 text

Recording events
• deploy happened
• marketing email sends
• press release
• server went offline
What do we do with events?
• Recording: audit log, ticket
• Sense-making: overlay on graph
• Escalate: pagerduty
[Diagram labels: Event pushed from; Clients (each runs checks); OK/Warn/Crit]
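The three things done with an event (record, sense-making, escalate) can be sketched as a small router. This is not how Sensu handlers are implemented; the `Event` and `Router` classes and the severity strings are hypothetical, and the lists stand in for a real audit log, graph annotation store, and paging service.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    severity: str  # assumed values: "ok", "warn" or "crit"


@dataclass
class Router:
    audit_log: list = field(default_factory=list)    # Recording
    annotations: list = field(default_factory=list)  # Sense-making
    pages: list = field(default_factory=list)        # Escalate

    def handle(self, event):
        # Every event is recorded so there is an audit trail.
        self.audit_log.append(event)
        # Every event becomes a marker that can be overlaid on graphs.
        self.annotations.append(event.name)
        # Only critical events wake someone up.
        if event.severity == "crit":
            self.pages.append(event)


router = Router()
router.handle(Event("deploy happened", "ok"))
router.handle(Event("server went offline", "crit"))
```

After these two calls the audit log holds both events, while only the offline server has been escalated.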

Slide 20

Slide 20 text

Recording metrics
• free disk space
• page load times
• new signups rate
• Facebook likes
What do we do with metrics?
• Record: time series db
• Sense-making: Draw graphs
• Remix: derivative, sum, time shift
• Inception: alert on thresholds (disk full, error rate changing too rapidly)
[Diagram label: Current metric count]
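Recording a metric in Graphite can be as simple as writing one line of text to Carbon's plaintext listener (`path value timestamp`, port 2003 by default). The sketch below formats such lines and also shows, in plain Python, what "derivative" and "alert on thresholds" mean for a series. The metric path, host, and helper names are assumptions for illustration; Graphite's own `derivative()` render function works on stored series rather than Python lists.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric in Carbon's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_to_graphite(lines, host="localhost", port=2003):
    """Send formatted lines to a Carbon plaintext listener (assumed host)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


def derivative(values):
    """Change between consecutive samples (one fewer value than the input)."""
    return [b - a for a, b in zip(values, values[1:])]


def breaches(values, threshold):
    """Samples above the threshold, e.g. an error rate rising too fast."""
    return [v for v in values if v > threshold]


line = graphite_line("servers.web1.disk.free_bytes", 1024, 1700000000)
error_counts = [10, 12, 20, 21]        # hypothetical cumulative error counts
deltas = derivative(error_counts)      # errors added per interval
alerts = breaches(deltas, threshold=5)
```

Calling `send_to_graphite([line])` would ship the sample to a local Carbon instance if one is listening.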

Slide 21

Slide 21 text

Publishing metrics from your app
Run statsd in front of Graphite. It is a statistics aggregator that makes it easier to measure correctly.
Supports counters (gives you rate+count), sampling, timers with histograms, gauges, uniques.
https://github.com/reinh/statsd (statsd-ruby gem)
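Under the hood, a statsd client just sends short text packets over UDP (port 8125 by default). The sketch below builds those packets by hand in Python to show the wire format for each metric type; it is not the statsd-ruby gem's API, and the helper and metric names are made up for illustration.

```python
import socket


def counter(name, n=1, sample_rate=1.0):
    """Counter packet. With sample_rate < 1 the caller is expected to send
    only that fraction of events; statsd scales the count back up."""
    packet = f"{name}:{n}|c"
    if sample_rate < 1.0:
        packet += f"|@{sample_rate}"
    return packet


def timer(name, ms):
    """Timing packet in milliseconds; statsd derives percentiles from these."""
    return f"{name}:{ms}|ms"


def gauge(name, value):
    """Gauge packet: the latest value wins."""
    return f"{name}:{value}|g"


def unique(name, member):
    """Set packet: statsd counts distinct members per flush interval."""
    return f"{name}:{member}|s"


def send(packet, host="localhost", port=8125):
    """Fire-and-forget UDP send (host is an assumption)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet.encode("ascii"), (host, port))
```

Because the transport is UDP, instrumenting the app this way does not block or fail a request when the statsd daemon is down.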

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

Cassandra, Apache, DNS, elasticsearch, files, graphite, haproxy, hbase, java, logging, lxc, memcached, mongodb, pingdom, opsgenie, percona, postfix, snmp, solr, twilio, youtube, aws, varnish, postgres, redis, riak, rabbitmq, processes

Slide 28

Slide 28 text

Use configuration management
• This is a lot of moving parts
• You will never set up monitoring infra
• You will never keep it updated
• Unless it is *easy*
• 50 lines of chef-solo and 10 minutes later the entire system springs to life

Slide 29

Slide 29 text

links
https://speakerdeck.com/statik/observability-the-new-incompetence
http://sensuapp.com/
https://github.com/sensu/sensu-community-plugins
http://graphite.wikidot.com/
http://animals-riding-animals.tumblr.com/