It's not in production unless it's monitored.

Talk given at RailsConf 2012

In the 21st century, successful teams are data-driven. This talk is a complete introduction to monitoring your service at every level, from business drivers to per-request metrics in Rails/Rack, down to server memory and CPU. It gives a high-level overview of the fundamental components of a holistic monitoring system, then drills into real-world examples with tools like ActiveSupport::Notifications, statsd/rack-statsd, and CollectD. It also covers best practices for active alerting on custom monitoring data.

Joseph Ruscio

April 25, 2012

Transcript

  1. It’s not in production
    unless it’s monitored.
    Wednesday, April 25, 2012

  2. ‣ @josephruscio
    ‣ Co-Founder/CTO Librato
    ‣ I <3 graphs

  3. [no text on this slide]

  4. SaaS 2002
    ‣ Seed Round: $1.5M USD
    ‣ Infrastructure: CAPEX
    ‣ Dedicated Ops Team
    ‣ Custom Software Stack

  5. SaaS 2012
    ‣ Seed Round: $20K USD
    ‣ Infrastructure: OPEX
    ‣ <=1 Ops Person
    ‣ OSS, External Services

  6. ‣ agile infrastructure
    ‣ ephemeral infrastructure
    ‣ more change, worse tools!

  7. Cont. Deployment
    ‣ continuous integration
    ‣ one-click deploy
    ‣ feature-flagging
    ‣ monitoring
    ‣ alerting

  8. [no text on this slide]

  9. Chunky Bacon!!

  10. Graphite
    StatsD
    OpenTSDB
    Cube
    d3.js

  11. Anti-Pattern
    Nagios, Ganglia, RRD/Cacti (separate stacks, each with its own Storage):
    • Custom Stats, MySQL threads, VMstat, ....
    • CPU, Interface, Memory, Ping, Battery charge, ....
    • Ping, CPU, Memory, Disks, SNMP Service, ....
    ...

  12. #monitoringsucks

  13. We need a better model

  14. Metrics
    ‣ business drivers
    ‣ application performance
    ‣ system resources
    ‣ network

  15. Collection
    Storage
    Aggregation
    Analysis

  16. Separation of Concerns

  17. Collection

  18. Logging
    ‣ etsy/logster
    ‣ logstash/logstash
    ‣ Papertrail et al.

  19. AS::Notifications
    ‣ pub/sub instrumentation
    ‣ mattmatt/lograge
    ‣ twinturbo/harness
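
The pub/sub instrumentation pattern on this slide can be sketched in plain Ruby. This is a toy stand-in and not the ActiveSupport::Notifications implementation: the `TinyNotifications` module and the event name `render.demo` are hypothetical, chosen only to show how instrumented code publishes timed events that any number of subscribers can consume.

```ruby
# Toy pub/sub instrumentation bus (hypothetical; not ActiveSupport itself).
module TinyNotifications
  @subscribers = Hash.new { |hash, name| hash[name] = [] }

  class << self
    # Register a block to be called with (name, duration_seconds, payload).
    def subscribe(name, &block)
      @subscribers[name] << block
    end

    # Time the given block, then publish the event to every subscriber.
    def instrument(name, payload = {})
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      result = yield
      duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      @subscribers[name].each { |subscriber| subscriber.call(name, duration, payload) }
      result
    end
  end
end

# A subscriber that records events; a real one might log or forward metrics.
events = []
TinyNotifications.subscribe("render.demo") do |name, duration, payload|
  events << { name: name, duration: duration, payload: payload }
end

value = TinyNotifications.instrument("render.demo", { template: "index" }) { 21 * 2 }
```

The instrumented code knows nothing about who listens, which is what lets projects such as those listed on the slide attach logging or metrics without touching application code.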

  20. eric/metriks
    ‣ Ruby instrumentation
    ‣ counters, meters, timers
    ‣ multiple reporters

  21. Aggregation

  22. etsy/statsd
    ‣ ~319 SLOC Node.js
    ‣ counters, timers, gauges
    ‣ UDP
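
The three metric types on this slide map onto StatsD's one-line text protocol, sent as UDP datagrams. Below is a minimal sketch using only Ruby's standard library. The metric names are hypothetical, and the host and port are assumptions (8125 is StatsD's conventional default); a real application would more likely use one of the existing client libraries.

```ruby
require "socket"

# Format one metric in the StatsD line protocol: "<name>:<value>|<type>".
# Types: "c" = counter, "ms" = timer (milliseconds), "g" = gauge.
def statsd_line(name, value, type)
  "#{name}:#{value}|#{type}"
end

# Fire-and-forget over UDP: there is no delivery guarantee, so a slow or
# absent StatsD server does not block the request being measured.
def send_metric(line, host: "127.0.0.1", port: 8125)
  socket = UDPSocket.new
  socket.send(line, 0, host, port)
rescue SystemCallError, SocketError
  nil # a metrics failure should never take the application down
ensure
  socket&.close
end

counter = statsd_line("signups", 1, "c")
timer   = statsd_line("request.render", 320, "ms")
gauge   = statsd_line("queue.depth", 12, "g")

[counter, timer, gauge].each { |line| send_metric(line) }
```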

  23. Nginx Unicorn
    StatsD
    Front-End

  24. StatsD Clients
    ‣ zebrafishlabs/nginx-statsd
    ‣ github/rack-statsd
    ‣ shopify/statsd-instrument

  25. StatsD Servers

  26. Storage

  27. RRDTool
    ‣ Round-Robin Database Tool
    ‣ constant storage size
    ‣ rollups
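
The "constant storage size" and "rollups" ideas can be illustrated with a small ring buffer. This is only a sketch of the round-robin concept; it does not reproduce RRDTool's file format or its consolidation functions, and the `RoundRobinArchive` class and `rollup` helper are hypothetical names.

```ruby
# Fixed-size ring buffer: storage never grows, the oldest point is overwritten.
class RoundRobinArchive
  def initialize(slots)
    @slots = Array.new(slots)
    @writes = 0
  end

  def push(value)
    @slots[@writes % @slots.length] = value
    @writes += 1
  end

  # Stored values, oldest first.
  def values
    count = [@writes, @slots.length].min
    start = @writes - count
    (start...@writes).map { |i| @slots[i % @slots.length] }
  end
end

# Roll raw points up into coarser averages, one per `factor` raw points.
def rollup(points, factor)
  points.each_slice(factor).map { |group| group.sum.to_f / group.length }
end

raw = RoundRobinArchive.new(4)
[10, 20, 30, 40, 50, 60].each { |v| raw.push(v) }

recent = raw.values           # the two oldest points have been overwritten
averaged = rollup(recent, 2)  # half the resolution, half the points
```

With four slots and six writes, only the last four points survive, and the rollup keeps a lower-resolution view of them.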

  28. Graphite
    ‣ Whisper RRD
    ‣ flat.hierarchical.namespace
    ‣ HTTP queries
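
As a rough illustration of the dotted namespace and HTTP queries: Graphite's Carbon daemon accepts plaintext lines of the form `<path> <value> <timestamp>`, and the web app serves data through a `/render` endpoint. The sketch below only builds those strings. The metric path, hostname, and timestamp are hypothetical, and the helper names are not part of any Graphite client.

```ruby
require "uri"

# One datapoint in Carbon's plaintext protocol: "<path> <value> <unix_timestamp>\n".
def graphite_line(path, value, timestamp)
  "#{path} #{value} #{timestamp.to_i}\n"
end

# Build a render-API URL; wildcards in the target match one namespace level.
def render_url(base, target, from: "-1h")
  query = URI.encode_www_form({ "target" => target, "from" => from, "format" => "json" })
  "#{base}/render?#{query}"
end

line = graphite_line("prod.web01.requests.count", 42, 1335312000)
url = render_url("http://graphite.example.com", "prod.*.requests.count")
```

Every dimension (environment, host, metric) has to be encoded as a position in the dotted path, which is the trade-off the next slide's tag-based model addresses.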

  29. OpenTSDB
    ‣ HBase
    ‣ multiple dimensions
    ‣ HTTP queries
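
"Multiple dimensions" refers to OpenTSDB's tags: a datapoint carries a metric name plus key=value pairs rather than a single dotted path. A minimal sketch of the telnet-style `put` line follows; the metric name, tags, and the `opentsdb_put` helper are hypothetical.

```ruby
# One datapoint in OpenTSDB's telnet-style "put" format:
# "put <metric> <unix_timestamp> <value> <tagk=tagv> ..."
# OpenTSDB expects at least one tag per datapoint.
def opentsdb_put(metric, timestamp, value, tags)
  tag_text = tags.map { |key, val| "#{key}=#{val}" }.join(" ")
  "put #{metric} #{timestamp.to_i} #{value} #{tag_text}"
end

line = opentsdb_put("requests.count", 1335312000, 42, { "env" => "prod", "host" => "web01" })
```

Because `env` and `host` are separate tags, a query can aggregate or filter on either one without relying on their position in a name.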

  30. SaaS
    ‣ Librato Metrics et al.
    ‣ JSON over HTTP
    ‣ rollups
    ‣ interactive front-ends

  31. Visualization

  32. Correlation
    ‣ metrics
    ‣ annotations
    ‣ arbitrary combinations

  33. [no text on this slide]

  34. [no text on this slide]

  35. Dashboards
    ‣ shared understanding
    ‣ aberration detection
    ‣ fire-fighting manual

  36. [no text on this slide]

  37. [no text on this slide]

  38. Alerting

  39. Tuning Alerts
    ‣ trigger threshold
    ‣ cancel threshold
    ‣ re-arm window
    ‣ function
    ‣ window
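
One way the five knobs on this slide might fit together is sketched below. The slide does not define them, so the semantics here are assumptions: the function reduces the last `window` samples to one number, the alert fires at the trigger threshold, clears only at the lower cancel threshold (hysteresis), and repeats a notification no more often than the re-arm window allows. The `ThresholdAlert` class is hypothetical.

```ruby
# Threshold alert with hysteresis (a sketch; the knob semantics are assumptions).
class ThresholdAlert
  attr_reader :notifications

  def initialize(trigger:, cancel:, rearm:, window:, function:)
    @trigger = trigger    # fire when the aggregate rises to this value
    @cancel = cancel      # clear only when the aggregate falls to this value
    @rearm = rearm        # minimum seconds between repeat notifications
    @window = window      # number of recent samples to aggregate
    @function = function  # callable that reduces the window to one number
    @samples = []
    @active = false
    @last_notified = nil
    @notifications = []
  end

  def observe(time, value)
    @samples << value
    @samples.shift while @samples.length > @window
    aggregate = @function.call(@samples)

    if @active && aggregate <= @cancel
      @active = false
      @notifications << [time, :cleared]
    elsif aggregate >= @trigger
      @active = true
      if @last_notified.nil? || time - @last_notified >= @rearm
        @last_notified = time
        @notifications << [time, :triggered]
      end
    end
  end
end

average = ->(samples) { samples.sum.to_f / samples.length }
alert = ThresholdAlert.new(trigger: 90, cancel: 70, rearm: 120, window: 2, function: average)

# Samples as [seconds, value]: a sustained spike followed by recovery.
[[0, 95], [60, 95], [120, 95], [180, 50], [240, 50]].each do |time, value|
  alert.observe(time, value)
end
```

In this run the alert notifies at 0s, stays quiet at 60s because of the re-arm window, repeats at 120s, stays active at 180s (the average of 72.5 is below the trigger but still above the cancel threshold), and clears at 240s.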

  40. [no text on this slide]

  41. Aberrant Behavior

  42. ‣ separation of concerns
    ‣ monitoring == tests
    ‣ arbitrary correlations
    ‣ dashboards
    ‣ living alerts

  43. fin