Basic competency as a product dev team • Version control • Appropriate databases • Test suite; green, deployable trunk • Secure handling of passwords, credit cards • Timely security patching
“BTW, looks like our storage costs are growing at $1500 a month based on a 3-month average, and we’ll need a redesign of the storage system in 18 months.” Wow, you are dextrous & proactive, a consummate professional! Have some more money and servers.
Hey, users are reporting a crash when updating their profile... Uhh...it’s working in dev. Are you sure they are using it correctly? Yup, we have seen 37 crashes this morning. Looks like users with Cyrillic names are affected. ETA on a fix is less than one hour.
The site seems slow. Uhh...well, we’ve been writing a lot of code and not paying attention. Hmm. Our 99th percentile page load time has actually improved by 10% over the last 3 months. Show me the page that seems slow and we will analyze what’s going on.
You need monitoring. MONITORS, LOTS OF MONITORS. MONITORS ARE FOR EXPLORATION/ANALYSIS. HUMANS STINK AT WATCHING COMPUTERS. IT SHOULD BE IMPOSSIBLE FOR BOSS, INVESTORS, USERS TO BE THE FIRST TO NOTIFY YOU.
What is monitoring • Critical/Warn/OK monitoring • Trending over time, event correlation, capacity planning • Alerting - putting an event in the audit log, waking someone up for an emergency, opening a ticket to be addressed next week.
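A Critical/Warn/OK check can be as small as a script that measures one thing and reports a status. The sketch below assumes the Nagios plugin exit-code convention (0 = OK, 1 = warning, 2 = critical), which Sensu checks also follow. The 80% and 90% thresholds and the function names are illustrative, not taken from any particular plugin.

```python
import shutil

# Exit codes follow the Nagios plugin convention, which Sensu checks reuse.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to (exit_code, label). Thresholds are illustrative."""
    if used_percent >= crit:
        return CRITICAL, "CRITICAL"
    if used_percent >= warn:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path="/"):
    """Measure disk usage for `path`, print one status line, return the exit code."""
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total
    code, label = classify(used_percent)
    print(f"CheckDisk {label}: {used_percent:.1f}% used on {path}")
    return code


# In a standalone check script, the last line would be:
#   sys.exit(check_disk())
```

The monitoring framework only needs the exit code and the one line of output; what happens next (log it, open a ticket, page someone) is decided by the alerting layer, not by the check.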
What can we watch? • Business or application level metrics - revenue, signups, cancellations, engagement • Raw server health (disk space, memory, IO) • Application health (open DB connections, page render time, rate of each HTTP status code, did backups happen last night?) • User experience (JavaScript errors, app server exceptions, load times) • Vacuum metrics from other places into your system (YouTube likes, AWS Load Balancers)
Do you want to spend time or money? • “lean”, maybe only contract devs - just use a bunch of SaaS products. Valid approach. • Bootstrapping? • HIPAA or PCI data protection? • Any full-time devs? • Consider running some of your own monitoring tools. Business folks love ‘em too
Running your own • Sensu • Sensu-community-plugins • Graphite, Descartes, Tasseo • PagerDuty or OpsGenie for alerting • Logstash, Kibana (http://kibana.org/infrastructure.html, needs Sensu) • Vagrant+Chef for config management Start HERE
Part 2: Sensu+Graphite • Sensu is a monitoring framework. Successor to Nagios, emerged from needs of cloudy app architecture. #monitoringsucks • Graphite is a time series database. Successor to RRD, originally written at orbitz.com. Amazingly flexible, surging in popularity (GitHub, hosted services)
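To make "time series database" concrete: Graphite's Carbon daemon accepts a plaintext protocol where each data point is one line of the form `<metric path> <value> <unix timestamp>`. The sketch below assumes Carbon is listening on its default plaintext port 2003 on localhost; the metric name and helper functions are made up for illustration.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol: path, value, timestamp."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="localhost", port=2003):
    """Send a single data point to Carbon over TCP.

    host and port are assumptions: 2003 is Carbon's default plaintext port.
    """
    line = format_metric(path, value)
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))


# Example line, using a hypothetical metric name:
print(format_metric("servers.web1.disk.free_bytes", 1024, 1700000000), end="")
```

Dots in the metric path become folders in Graphite's tree, so a naming scheme like `servers.<host>.<subsystem>.<metric>` keeps related series together.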
Recording metrics • free disk space • page load times • new signups rate • Facebook likes Current metric count What do we do with metrics? • Record: time series db • Sense-making: Draw graphs • Remix: derivative, sum, time shift • Inception: alert on thresholds (disk full, error rate changing too rapidly)
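The "remix" and "alert on thresholds" steps can be sketched in a few lines of plain Python. Graphite ships its own render functions for this (`derivative`, `sumSeries`, `timeShift`); the versions below are simplified stand-ins written for illustration, and the error-counter data is invented.

```python
def derivative(series):
    """Change between consecutive points of a [(timestamp, value), ...] series."""
    return [(t2, v2 - v1) for (_, v1), (t2, v2) in zip(series, series[1:])]


def time_shift(series, seconds):
    """Move a series forward in time so it can be overlaid on a later period."""
    return [(t + seconds, v) for t, v in series]


def sum_series(a, b):
    """Add two series point by point, assuming they share timestamps."""
    return [(t, va + vb) for (t, va), (_, vb) in zip(a, b)]


def breaches(series, threshold):
    """Return the points whose value exceeds the threshold."""
    return [(t, v) for t, v in series if v > threshold]


# Cumulative error counter sampled once a minute (made-up data).
errors = [(0, 100), (60, 102), (120, 104), (180, 150)]

rate = derivative(errors)             # errors added per interval
alerts = breaches(rate, threshold=10)  # "error rate changing too rapidly"
print(alerts)
```

Alerting on the derivative rather than the raw counter is what turns "the number is big" into "the number is growing too fast".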
Publishing metrics from your app Run statsd in front of Graphite. It is a statistics aggregator that makes it easier to measure correctly. It supports counters (which give you rate + count), sampling, timers with histograms, gauges, and uniques. https://github.com/reinh/statsd statsd-ruby gem
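Under the hood, a statsd client just sends small text datagrams over UDP in the form `name:value|type`, with an optional `|@rate` suffix for sampling. The sketch below builds those lines by hand to show the wire format. It assumes statsd on its default port 8125 on localhost, and the metric names are hypothetical. In a real app you would normally use a client library such as the statsd-ruby gem instead.

```python
import socket


def statsd_line(name, value, kind, sample_rate=1.0):
    """Format one statsd datagram.

    kind: 'c' counter, 'ms' timer, 'g' gauge, 's' set (uniques).
    """
    line = f"{name}:{value}|{kind}"
    if sample_rate < 1.0:
        # Tells statsd this metric was sampled so it can scale the count back up.
        line += f"|@{sample_rate}"
    return line


def send(line, host="localhost", port=8125):
    """Fire-and-forget over UDP; host and port are assumed defaults."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


print(statsd_line("signups", 1, "c"))             # counter
print(statsd_line("page.render", 320, "ms"))      # timer, in milliseconds
print(statsd_line("db.open_connections", 12, "g"))  # gauge
print(statsd_line("signups", 1, "c", 0.1))        # counter sampled at 10%
```

Because the transport is UDP, the app does not block or fail if the metrics server is down, which is why instrumenting hot code paths this way is considered safe.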
Use configuration management • This is a lot of moving parts • You will never set up monitoring infra • You will never keep it updated • Unless it is *easy* • 50 lines of chef-solo and 10 minutes later the entire system springs to life