It's not in production unless it's monitored.

Talk given at RailsConf 2012

In the 21st century, successful teams are data-driven. We’ll present a complete introduction to everything you need to start monitoring your service at every level, from business drivers to per-request metrics in Rails/Rack, down to server memory and CPU. The talk provides a high-level overview of the fundamental components that make up a holistic monitoring system, then drills into real-world examples with tools like ActiveSupport::Notifications, statsd/rack-statsd, and CollectD. It also covers best practices for active alerting on custom monitoring data.

Joseph Ruscio

April 25, 2012

Transcript

  1. It’s not in production unless it’s monitored. Wednesday, April 25, 2012
  2. ‣ @josephruscio ‣ Co-Founder/CTO, Librato ‣ I <3 graphs <me> </me>

  4. SaaS 2002 ‣ Seed Round: $1.5M USD ‣ Infrastructure: CAPEX ‣ Dedicated Ops Team ‣ Custom Software Stack
  5. SaaS 2012 ‣ Seed Round: $20K USD ‣ Infrastructure: OPEX ‣ <=1 Ops Person ‣ OSS, External Services
  6. ‣ agile infrastructure ‣ ephemeral infrastructure ‣ more change, worse tools!
  7. Continuous Deployment ‣ continuous integration ‣ one-click deploy ‣ feature-flagging ‣ monitoring ‣ alerting

  9. Chunky Bacon!!

  10. Graphite, StatsD, OpenTSDB, Cube, d3.js

  11. Anti-Pattern: Nagios, Ganglia, and RRD/Cacti each collecting their own overlapping stats • Custom Stats • MySQL threads • VMstat • .... Storage • CPU • Interface • Memory • Ping • Battery charge • .... Storage • Ping • CPU • Memory • Disks • SNMP Service • .... Storage ...
  12. #monitoringsucks

  13. We need a better model

  14. Metrics ‣ business drivers ‣ application performance ‣ system resources ‣ network
  15. Collection, Storage, Aggregation, Analysis

  16. Separation of Concerns

  17. Collection

  18. Logging ‣ etsy/logster ‣ logstash/logstash ‣ Papertrail et al.
  19. AS::Notifications ‣ pub/sub instrumentation ‣ mattmatt/lograge ‣ twinturbo/harness
  20. eric/metriks ‣ Ruby instrumentation ‣ counters, meters, timers ‣ multiple reporters
  21. Aggregation

  22. etsy/statsd ‣ ~319 SLOC Node.js ‣ counters, timers, gauges ‣ UDP
  23. Nginx, Unicorn, StatsD, Front-End

  24. StatsD Clients ‣ zebrafishlabs/nginx-statsd ‣ github/rack-statsd ‣ shopify/statsd-instrument
  25. StatsD Servers

  26. Storage

  27. RRDTool ‣ Round-Robin Database Tool ‣ constant storage size ‣ rollups
  28. Graphite ‣ Whisper RRD ‣ flat.hierarchical.namespace ‣ HTTP queries
  29. OpenTSDB ‣ HBase ‣ multiple dimensions ‣ HTTP queries
  30. SaaS ‣ Librato Metrics et al. ‣ JSON over HTTP ‣ rollups ‣ interactive front-ends
  31. Visualization

  32. Correlation ‣ metrics ‣ annotations ‣ arbitrary combinations

  35. Dashboards ‣ shared understanding ‣ aberration detection ‣ fire-fighting manual

  38. Alerting

  39. Tuning Alerts ‣ trigger threshold ‣ cancel threshold ‣ re-arm window ‣ function ‣ window

  41. Aberrant Behavior

  42. ‣ separation of concerns ‣ monitoring == tests ‣ arbitrary correlations ‣ dashboards ‣ living alerts
  43. fin
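
Slides 15 and 16 split a monitoring system into collection, aggregation, storage, and analysis, each a separate concern. As a minimal sketch of that separation (all class, function, and metric names here are hypothetical, not taken from any tool in the talk), each stage below talks to the next only through a small interface, so any one of them could be replaced by StatsD, Graphite, or a hosted service without touching the others:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Summary = Dict[str, float]


class Aggregator:
    """Aggregation: buffer raw samples, emit one summary per metric per flush."""

    def __init__(self) -> None:
        self._samples: Dict[str, List[float]] = defaultdict(list)

    def record(self, name: str, value: float) -> None:
        self._samples[name].append(value)

    def flush(self) -> Dict[str, Summary]:
        summaries = {
            name: {"count": len(vals), "sum": sum(vals), "min": min(vals), "max": max(vals)}
            for name, vals in self._samples.items()
        }
        self._samples.clear()
        return summaries


class InMemoryStore:
    """Storage: append (timestamp, summary) points per metric name."""

    def __init__(self) -> None:
        self.series: Dict[str, List[Tuple[int, Summary]]] = defaultdict(list)

    def write(self, timestamp: int, summaries: Dict[str, Summary]) -> None:
        for name, summary in summaries.items():
            self.series[name].append((timestamp, summary))


def overall_mean(store: InMemoryStore, name: str) -> float:
    """Analysis: mean across all stored intervals, rebuilt from sums and counts."""
    points = store.series.get(name, [])
    total = sum(summary["sum"] for _, summary in points)
    count = sum(summary["count"] for _, summary in points)
    return total / count if count else float("nan")


# Collection: application code only ever calls record().
agg, store = Aggregator(), InMemoryStore()
agg.record("web.request_ms", 120.0)
agg.record("web.request_ms", 80.0)
store.write(1000, agg.flush())  # first flush interval
agg.record("web.request_ms", 100.0)
store.write(1010, agg.flush())  # second flush interval
print(overall_mean(store, "web.request_ms"))  # 100.0
```

Storing sums and counts rather than per-interval averages is what lets the analysis stage compute a correct overall mean across intervals of different sizes.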
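
Slide 18 lists log-based collection tools such as etsy/logster, which turn log lines into metrics. The sketch below is not logster's code; it assumes an access log in Common Log Format and shows only the core idea: parse each line, then count responses by status class so the counts can be shipped to a metrics store at the end of each interval. The regex and function name are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Captures the status code that follows the quoted request in a Common Log
# Format line such as:
#   127.0.0.1 - - [25/Apr/2012:10:00:00 -0400] "GET / HTTP/1.1" 200 512
STATUS_FIELD = re.compile(r'"[^"]*" (\d{3}) ')


def status_class_counts(lines: Iterable[str]) -> Dict[str, int]:
    """Count responses per status class ("2xx", "4xx", ...); skip unparseable lines."""
    counts: Counter = Counter()
    for line in lines:
        match = STATUS_FIELD.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return dict(counts)


sample_log = [
    '127.0.0.1 - - [25/Apr/2012:10:00:00 -0400] "GET / HTTP/1.1" 200 512',
    '127.0.0.1 - - [25/Apr/2012:10:00:01 -0400] "GET /missing HTTP/1.1" 404 0',
    '127.0.0.1 - - [25/Apr/2012:10:00:02 -0400] "POST /orders HTTP/1.1" 500 0',
    '127.0.0.1 - - [25/Apr/2012:10:00:03 -0400] "GET /about HTTP/1.1" 200 1024',
    "not an access log line",
]
print(status_class_counts(sample_log))  # {'2xx': 2, '4xx': 1, '5xx': 1}
```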
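
Slide 19 describes ActiveSupport::Notifications as pub/sub instrumentation: code runs inside an instrumented block, and any number of subscribers receive the event name, start and finish times, and a payload. The Python sketch below mimics that shape with a hypothetical `Notifier` class; it is an analogy, not a port of the Rails API (it omits the event id, for instance), and the event name is borrowed from Rails only for illustration.

```python
import re
import time
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator, List, Tuple, Union

Payload = Dict[str, Any]
Listener = Callable[[str, float, float, Payload], None]
NamePattern = Union[str, re.Pattern]


class Notifier:
    """Hypothetical pub/sub instrumentation bus, loosely shaped like AS::Notifications."""

    def __init__(self) -> None:
        self._subscribers: List[Tuple[NamePattern, Listener]] = []

    def subscribe(self, pattern: NamePattern, listener: Listener) -> None:
        """Listen for an exact event name (str) or any name matching a regex."""
        self._subscribers.append((pattern, listener))

    @staticmethod
    def _matches(pattern: NamePattern, name: str) -> bool:
        if isinstance(pattern, str):
            return pattern == name
        return pattern.search(name) is not None

    @contextmanager
    def instrument(self, name: str, payload: Payload) -> Iterator[Payload]:
        """Time the wrapped block, then publish (name, start, finish, payload)."""
        start = time.monotonic()
        try:
            yield payload  # the instrumented code may add fields to the payload
        except Exception as exc:
            payload["exception"] = repr(exc)  # subscribers still hear about failures
            raise
        finally:
            finish = time.monotonic()
            for pattern, listener in self._subscribers:
                if self._matches(pattern, name):
                    listener(name, start, finish, payload)


bus = Notifier()
events = []
bus.subscribe(
    re.compile(r"\.action_controller$"),
    lambda name, start, finish, payload: events.append(
        (name, payload.get("status"), (finish - start) * 1000.0)
    ),
)

with bus.instrument("process_action.action_controller", {"path": "/"}) as payload:
    payload["status"] = 200  # stand-in for real request handling

print(events[0][:2])  # ('process_action.action_controller', 200)
```

The point of the pattern is that the instrumented code never knows who is listening: a logger, a StatsD client, or nothing at all can subscribe without changing the request path.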
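
Slide 20 names the three instrument types metriks offers: counters, meters, and timers. A simplified sketch of what each one tracks follows. These hypothetical classes report only a mean rate and a nearest-rank percentile, whereas metriks-style libraries typically also keep exponentially decaying rates and sampled histograms. The meter takes an injectable clock so its rate is deterministic in the example.

```python
import math
import time
from typing import Callable, List


class Counter:
    """A value that moves up or down, e.g. jobs currently queued."""

    def __init__(self) -> None:
        self.count = 0

    def increment(self, n: int = 1) -> None:
        self.count += n

    def decrement(self, n: int = 1) -> None:
        self.count -= n


class Meter:
    """Counts events and reports the mean rate (events/second) since creation."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._start = clock()
        self.count = 0

    def mark(self, n: int = 1) -> None:
        self.count += n

    def mean_rate(self) -> float:
        elapsed = self._clock() - self._start
        return self.count / elapsed if elapsed > 0 else 0.0


class Timer:
    """Records durations (milliseconds here) and summarizes them."""

    def __init__(self) -> None:
        self.durations: List[float] = []

    def update(self, ms: float) -> None:
        self.durations.append(ms)

    def mean(self) -> float:
        return sum(self.durations) / len(self.durations) if self.durations else 0.0

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile for 0 < p <= 100."""
        if not self.durations:
            return 0.0
        ordered = sorted(self.durations)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


queued = Counter()
queued.increment(3)
queued.decrement()

fake_now = [0.0]  # injectable clock so the example is deterministic
requests = Meter(clock=lambda: fake_now[0])
requests.mark(30)
fake_now[0] = 10.0
print(requests.mean_rate())  # 3.0

render = Timer()
for ms in (100, 200, 300, 400):
    render.update(ms)
print(render.mean(), render.percentile(50), render.percentile(99))  # 250.0 200 400
```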
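
Slide 22 summarizes StatsD: a small Node.js daemon that accepts counters, timers, and gauges over UDP. Clients send one plain-text line per metric in the form `<bucket>:<value>|<type>`, with `c`, `ms`, and `g` as the type codes and an optional `|@<rate>` suffix when the client samples; 8125 is the conventional port. A minimal client sketch follows (function and bucket names are hypothetical); a throwaway local socket stands in for the server so it runs anywhere.

```python
import socket
from typing import Optional

METRIC_TYPES = {"c", "ms", "g"}  # counter, timer (milliseconds), gauge


def statsd_line(bucket: str, value: float, metric_type: str,
                sample_rate: Optional[float] = None) -> str:
    """Format one metric as <bucket>:<value>|<type>[|@<sample_rate>]."""
    if metric_type not in METRIC_TYPES:
        raise ValueError(f"unsupported StatsD type: {metric_type!r}")
    line = f"{bucket}:{value}|{metric_type}"
    if sample_rate is not None and sample_rate < 1:
        line += f"|@{sample_rate}"  # lets the server compensate for client-side sampling
    return line


def send_statsd(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP datagram: no connection and no acknowledgement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


# Local stand-in for a StatsD server so the example is self-contained.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)
send_statsd(statsd_line("web.requests", 1, "c", sample_rate=0.1),
            port=receiver.getsockname()[1])
datagram, _ = receiver.recvfrom(1024)
receiver.close()
print(datagram.decode("utf-8"))  # web.requests:1|c|@0.1
```

Because UDP is connectionless, a dead or overloaded StatsD server costs the application a dropped packet rather than a blocked request, which is the usual argument for it in this position.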
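
Slide 27 highlights the two properties that define RRDTool-style storage: the database has a constant size, and older data survives only as coarser rollups. The sketch below models that with two fixed-length ring buffers, rolling every `factor` raw points into one averaged point (similar in spirit to RRDTool's AVERAGE consolidation function). Class names and sizes are hypothetical.

```python
from collections import deque
from typing import Deque, List


class RoundRobinArchive:
    """A fixed number of slots; once full, each new point overwrites the oldest."""

    def __init__(self, slots: int) -> None:
        self.points: Deque[float] = deque(maxlen=slots)

    def add(self, value: float) -> None:
        self.points.append(value)


class RoundRobinDB:
    """A full-resolution archive plus one rollup archive of averaged points."""

    def __init__(self, fine_slots: int, coarse_slots: int, factor: int) -> None:
        self.fine = RoundRobinArchive(fine_slots)
        self.coarse = RoundRobinArchive(coarse_slots)
        self.factor = factor  # raw points consolidated into each rollup point
        self._pending: List[float] = []

    def update(self, value: float) -> None:
        self.fine.add(value)
        self._pending.append(value)
        if len(self._pending) == self.factor:
            self.coarse.add(sum(self._pending) / self.factor)
            self._pending.clear()


# 6 slots of raw data, plus 4 slots where each point averages 3 raw points
db = RoundRobinDB(fine_slots=6, coarse_slots=4, factor=3)
for v in range(1, 13):
    db.update(float(v))
print(list(db.fine.points))    # [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
print(list(db.coarse.points))  # [2.0, 5.0, 8.0, 11.0]
```

However many updates arrive, this structure never holds more than `fine_slots + coarse_slots + factor - 1` values; the cost is that history older than the fine archive exists only at reduced resolution.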
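
Slide 28 mentions Graphite's flat.hierarchical.namespace. Carbon, Graphite's ingest daemon, accepts a plain-text protocol of one `<metric path> <value> <unix timestamp>` line per data point, conventionally on TCP port 2003. In this sketch the function names are hypothetical, and the rule for sanitizing path segments (replacing anything other than letters, digits, `_`, and `-`, so a hostname's dots don't create extra levels) is an assumption, not a Graphite requirement.

```python
import re
import socket
import time
from typing import Iterable, Optional

# Assumption: anything but letters, digits, "_" and "-" inside a segment is
# replaced, so a dotted hostname doesn't add extra levels to the hierarchy.
UNSAFE = re.compile(r"[^A-Za-z0-9_-]+")


def metric_path(*segments: str) -> str:
    """Join segments into a flat.hierarchical.namespace path."""
    return ".".join(UNSAFE.sub("_", segment) for segment in segments)


def plaintext_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """One data point in Carbon's plaintext protocol, newline-terminated."""
    ts = int(time.time()) if timestamp is None else timestamp
    return f"{path} {value} {ts}\n"


def send_to_carbon(lines: Iterable[str], host: str = "127.0.0.1", port: int = 2003) -> None:
    """Ship a batch of lines to a Carbon listener over TCP."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("utf-8"))


path = metric_path("app", "web-01.example.com", "requests")
print(path)  # app.web-01_example_com.requests
print(repr(plaintext_line(path, 42, 1335312000)))  # 'app.web-01_example_com.requests 42 1335312000\n'
```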
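
Slide 39 lists the knobs for tuning an alert: a trigger threshold, a separate cancel threshold, a re-arm window, and a function applied over a window of samples. One way these could fit together is sketched below; the semantics are assumptions rather than any particular product's behavior. The alert fires when the windowed value reaches the trigger, clears only once it falls to the cancel level (the gap between the two keeps a noisy metric from flapping), and will not notify again until the re-arm period has passed.

```python
from collections import deque
from statistics import mean
from typing import Callable, Deque, Iterable, List, Optional, Tuple


class Alert:
    """Windowed threshold alert with hysteresis and a re-arm period (hypothetical)."""

    def __init__(self, trigger: float, cancel: float, window: int, rearm_seconds: float,
                 function: Callable[[Iterable[float]], float] = mean) -> None:
        if cancel > trigger:
            raise ValueError("cancel threshold must not exceed the trigger threshold")
        if window < 1:
            raise ValueError("window must hold at least one sample")
        self.trigger, self.cancel = trigger, cancel
        self.rearm_seconds = rearm_seconds
        self.function = function  # e.g. mean, max, or min over the window
        self.samples: Deque[float] = deque(maxlen=window)
        self.firing = False
        self.last_notified: Optional[float] = None

    def observe(self, ts: float, value: float) -> Optional[str]:
        """Feed one sample; return "ALERT", "CLEAR", or None (no notification)."""
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return None  # window not full yet
        level = self.function(self.samples)
        if level >= self.trigger:
            rearmed = (self.last_notified is None
                       or ts - self.last_notified >= self.rearm_seconds)
            if rearmed:
                self.firing = True
                self.last_notified = ts
                return "ALERT"
        elif self.firing and level <= self.cancel:
            self.firing = False
            return "CLEAR"
        # Still re-arming, inside the hysteresis band, or already clear.
        return None


alert = Alert(trigger=90, cancel=70, window=3, rearm_seconds=300)
samples = [(0, 80), (60, 95), (120, 100), (180, 100), (240, 60),
           (300, 40), (360, 95), (420, 100), (480, 100)]
notifications: List[Tuple[float, str]] = []
for ts, value in samples:
    result = alert.observe(ts, value)
    if result:
        notifications.append((ts, result))
print(notifications)  # [(120, 'ALERT'), (300, 'CLEAR'), (480, 'ALERT')]
```

At t=180 the metric is still high but the re-arm window suppresses a duplicate page; at t=240 the windowed mean (about 86.7) sits between the thresholds, so the alert neither repeats nor clears.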
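
Slide 41 turns to aberrant behavior: flagging values that depart from a metric's recent pattern instead of crossing a fixed threshold. RRDTool's own support for this is built on Holt-Winters forecasting; the sketch below swaps in a simpler technique, an exponentially weighted moving average with an exponentially weighted mean absolute deviation, and flags points outside `mean ± width × deviation`. The class name and the parameter defaults are arbitrary assumptions.

```python
from typing import List, Optional


class EwmaDetector:
    """Flags values outside mean ± width × deviation, both tracked as EWMAs."""

    def __init__(self, alpha: float = 0.3, width: float = 4.0, warmup: int = 5) -> None:
        self.alpha = alpha    # smoothing factor: higher reacts faster to change
        self.width = width    # band half-width, in units of mean absolute deviation
        self.warmup = warmup  # points to observe before flagging anything
        self.mean: Optional[float] = None
        self.deviation = 0.0
        self.seen = 0

    def update(self, value: float) -> bool:
        """Return True if `value` is aberrant, then fold it into the baseline."""
        if self.mean is None:
            self.mean = value
            self.seen = 1
            return False
        error = value - self.mean
        aberrant = self.seen >= self.warmup and abs(error) > self.width * self.deviation
        self.mean += self.alpha * error
        self.deviation += self.alpha * (abs(error) - self.deviation)
        self.seen += 1
        return aberrant


detector = EwmaDetector()
latency_ms = [100, 102, 98, 101, 99, 100, 102, 98, 100, 200]
flags: List[bool] = [detector.update(v) for v in latency_ms]
print(flags.index(True))  # 9
```

Folding flagged points into the baseline means a genuine level shift eventually becomes the new normal; skipping updates on flagged points would instead keep alerting until someone intervenes.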