Data Driven Monitoring

At Etsy we are big fans of graphing and monitoring all the things. We deploy our main site several times a day, and our monitoring provides the tight feedback loop we need to make this possible. The same goes for infrastructure changes, which are deployed in a similar fashion: small and frequent. For this to work, we have built up monitoring that tracks changes and possible problems in every nook and cranny of the Etsy stack, be it a network change, system- or application-level performance, or how bad the last week of on-call rotation was.

The flip side of monitoring all the things, however, is that we have a myriad of graphs and alerts that can potentially be important and page the on-call engineer at any given time. As the monitoring system keeps growing, so does the risk of alert fatigue and of a normalization of deviance caused by poorly scoped checks. This is why we also continuously monitor our monitoring system and ask whether we have all the information at hand when we get paged, whether certain alerts actually need to wake someone up, and whether they are needed at all.

I will give a quick overview of how our monitoring stack is built and then give insights into how we gather data about it and use it to make things better, both for site operations and, more importantly, for the human getting paged when something does go wrong.

Daniel Schauenberg

October 09, 2014

Transcript

  1. Data Driven Monitoring
    Daniel Schauenberg
    [email protected]
    @mrtazz

  4. Item by TheBackPackShoppe

  5. http://www.flickr.com/photos/brianglanz/1095706242

  7. How comfortable are you deploying a change right now?

  8. “If this is your first day at Etsy, you deploy the site”

  12. Ganglia
    • System level metrics
    • Instance per DC/environment
    • > 220k RRD files
    • Fully configured through Chef role attributes

  13. Rainbow Graphs!

  14. StatsD

  15. Graphite
    • Application level metrics
    • 96G RAM, 20 Cores, 7.3T SSD RAID 10
    • 525k metrics/minute
    • Mirrored Primary/Primary Setup
    • Functionally sharded relays
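
For readers unfamiliar with the two wire formats behind the StatsD and Graphite slides, here is a minimal Python sketch of a StatsD counter datagram and a Graphite plaintext-protocol line. The metric names, the `RELAYS` table and the `pick_relay` helper are hypothetical: they illustrate routing by metric-name prefix as one possible reading of "functionally sharded relays", and the slides do not show Etsy's actual relay rules.

```python
import time
from typing import Optional


def statsd_counter(name: str, value: int = 1) -> bytes:
    """Build a StatsD counter datagram: <name>:<value>|c"""
    return f"{name}:{value}|c".encode("ascii")


def graphite_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Build one Graphite plaintext-protocol line: <path> <value> <timestamp>"""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


# Hypothetical prefix-to-relay table; an assumption, not taken from the slides.
RELAYS = {
    "stats.": "relay-app.example.com",
    "ganglia.": "relay-sys.example.com",
}
DEFAULT_RELAY = "relay-default.example.com"


def pick_relay(path: str) -> str:
    """Route a metric to a relay based on its name prefix."""
    for prefix, relay in RELAYS.items():
        if path.startswith(prefix):
            return relay
    return DEFAULT_RELAY
```

In a default setup the counter datagram would be sent over UDP to StatsD (port 8125) and the plaintext line over TCP to a Graphite relay or carbon daemon (port 2003).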

  18. nagios

  19. <3 nagios

  21. Nagios
    • 2 instances in each DC/environment
    • Fully Chef generated configuration
    • Service checks and contacts in git
    • Notifications via email->SMS gateway
    • ~75% ops on-call
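
To make "fully generated configuration" concrete, the sketch below renders a Nagios `define service` object from a plain data structure of the kind that could live in git. It is written in Python for illustration only; the slides say the real configuration is generated by Chef, and the host, check and contact group names here are hypothetical.

```python
# Directives emitted for every service, in a fixed order.
SERVICE_KEYS = ("host_name", "service_description", "check_command", "contact_groups")


def render_service(check: dict) -> str:
    """Render one Nagios service object definition from a dict."""
    lines = ["define service {"]
    for key in SERVICE_KEYS:
        # Left-align the directive name so the values line up in a column.
        lines.append(f"    {key:<24}{check[key]}")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical check data, as it might be stored in version control.
example_check = {
    "host_name": "web01.example.com",
    "service_description": "HTTP",
    "check_command": "check_http",
    "contact_groups": "ops-oncall",
}

print(render_service(example_check))
```

Keeping checks as data and rendering them this way means every change to an alert or its recipients goes through the same review and history as code.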

  22. github.com/lozzd/nagdash

  24. Much more…
    • Syslog-ng
    • Logstash
    • Logster
    • Supergrep
    • Eventinator

  25. Information Overload
    Image by http://jasoncasteel.deviantart.com/

  26. Alert Fatigue

  27. We have the data
    We can make it better
    Item by PicksFromThePast

  29. nagios-herald

  32. Diagram: Failed Check → nagios-herald → Formatter → Helpers (Graphite, Ganglia, Logstash) → Message
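
The flow on this slide (a failed check passes through a formatter, which calls helpers for extra context and produces the notification message) can be sketched as follows. This is a Python illustration of the idea only, not the nagios-herald API; the helper functions, field names and the `graphite.example.com` host are all assumptions.

```python
from typing import Callable, Dict, List


# Hypothetical helpers: each takes the failed check and returns one line of context.
def graphite_helper(check: dict) -> str:
    return f"Graph: https://graphite.example.com/render?target={check['metric']}"


def logstash_helper(check: dict) -> str:
    return f"Logs: search for host:{check['host']}"


HELPERS: Dict[str, Callable[[dict], str]] = {
    "graphite": graphite_helper,
    "logstash": logstash_helper,
}


def format_alert(check: dict, helper_names: List[str]) -> str:
    """Build a notification: the raw check result plus context from each helper."""
    header = f"{check['host']}/{check['service']} is {check['state']}: {check['output']}"
    context = [HELPERS[name](check) for name in helper_names]
    return "\n".join([header, *context])
```

The point of the design is that the person being paged receives the graph and log pointers inside the alert itself instead of having to look them up after waking.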

  34. github.com/etsy/nagios-herald

  35. opsweekly

  38. Alert categorization
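
As an illustration of what categorized alerts make possible, the sketch below computes, per check, the share of alerts that the on-call engineer marked as needing no action. The record layout and the `"no_action"` category label are hypothetical and are not Opsweekly's actual schema.

```python
from collections import Counter
from typing import Dict, Iterable


def no_action_rate_by_check(alerts: Iterable[dict]) -> Dict[str, float]:
    """Return, for each check, the fraction of its alerts tagged 'no_action'."""
    totals: Counter = Counter()
    no_action: Counter = Counter()
    for alert in alerts:
        totals[alert["check"]] += 1
        if alert["category"] == "no_action":
            no_action[alert["check"]] += 1
    return {check: no_action[check] / totals[check] for check in totals}
```

Checks with a high rate are candidates for a better threshold, a lower-urgency notification, or removal.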

  39. Wearables!
    Item by JennysTrinketShoppe

  40. Sleep tracking
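
One way to combine sleep data with alert data is to find the pages that arrived while the on-call engineer was asleep. The sketch assumes page timestamps and sleep periods are already available as `datetime` values; where they come from (a wearable export, a paging log) is left open, and the function name is hypothetical.

```python
from datetime import datetime
from typing import List, Tuple


def pages_during_sleep(
    pages: List[datetime],
    sleep_periods: List[Tuple[datetime, datetime]],
) -> List[datetime]:
    """Return the pages whose timestamp falls inside any [start, end) sleep period."""
    return [
        page
        for page in pages
        if any(start <= page < end for start, end in sleep_periods)
    ]
```

Counting these per on-call week gives a rough measure of how disruptive a rotation was, beyond the raw number of alerts.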

  41. github.com/etsy/opsweekly

  42. Summary
    • Set of trusted tools for monitoring
    • Always experiment
    • Always learn
    • Always improve
    • Use the data, Luke

  43. Shout out to @lozzd and @Ryan_Frantz

  44. codeascraft.com
    etsy.com/codeascraft/talks
    etsy.github.com
    etsy.com/careers

  45. Questions?

  46. Data Driven Monitoring
    Daniel Schauenberg
    [email protected]
    @mrtazz
