Data Driven Monitoring

At Etsy we are big fans of graphing and monitoring all the things. We deploy our main site several times a day, and our monitoring provides the tight feedback loop that makes this possible. The same goes for infrastructure changes, which are likewise deployed as small and frequent changes. For this to work we have built up monitoring that tracks changes and possible problems in every nook and cranny of the Etsy stack, be it a network change, system or application level performance, or how bad the last week of on-call rotation was.

The flip side of monitoring all the things, however, is that we have a myriad of graphs and alerts, any of which can potentially be important and page the on-call engineer at any given time. As the monitoring system keeps growing, so does the risk of alert fatigue and of a normalization of deviance caused by poorly scoped checks. This is why we also continuously monitor our monitoring system and ask whether we have all the information at hand when we get paged, whether certain alerts actually need to wake someone up, and whether they are needed at all.

I will give a quick overview of how our monitoring stack is built and then give insights into how we gather data about it and use it to make things better, both for site operations and, more importantly, for the human getting paged when something does go wrong.

Daniel Schauenberg

October 09, 2014

  1. @mrtazz Ganglia • System level metrics • Instance per DC/environment • > 220k RRD files • Fully configured through Chef role attributes
  2. Graphite • Application level metrics • 96G RAM, 20 cores, 7.3T SSD RAID 10 • 525k metrics/minute • Mirrored primary/primary setup • Functionally sharded relays
  3. Nagios • 2 instances in each DC/environment • Fully Chef generated configuration • Service checks and contacts in git • Notifications via email->SMS gateway • ~75% ops on-call
  4. Summary • Set of trusted tools for monitoring • Always experiment • Always learn • Always improve • Use the data, Luke
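
The slides do not show how the alert data is analyzed, so the following is only a minimal sketch of what "use the data" could look like for the question of whether an alert really needs to wake someone up. Everything in it is an assumption made for illustration: the sample records, the check names, the `is_night` and `summarize_pages` helpers, and the 22:00 to 07:00 window treated as sleeping hours. It counts pages per service check and how many of them arrived at night.

```python
from collections import Counter
from datetime import datetime

# Hypothetical notification records: (ISO timestamp, service check name).
# In practice these would be exported from the monitoring system's
# notification history; the values here are made up for illustration.
SAMPLE_PAGES = [
    ("2014-10-01T03:12:00", "disk_space"),
    ("2014-10-01T14:40:00", "http_latency"),
    ("2014-10-02T02:05:00", "disk_space"),
    ("2014-10-02T23:30:00", "disk_space"),
    ("2014-10-03T11:15:00", "http_latency"),
]


def is_night(ts, start_hour=22, end_hour=7):
    """Return True if the timestamp falls in the assumed sleeping hours."""
    hour = datetime.fromisoformat(ts).hour
    return hour >= start_hour or hour < end_hour


def summarize_pages(pages):
    """Count total and night-time pages per service check."""
    total = Counter()
    night = Counter()
    for ts, check in pages:
        total[check] += 1
        if is_night(ts):
            night[check] += 1
    return total, night


total, night = summarize_pages(SAMPLE_PAGES)
# Noisiest checks first, with the share that interrupted someone's sleep.
for check, count in total.most_common():
    print(f"{check}: {count} pages, {night[check]} at night")
```

A check that pages often and mostly at night is a natural candidate for a closer look at its scope and thresholds, or at whether it should page at all.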