Data Driven Monitoring

At Etsy we are big fans of graphing and monitoring all the things. We deploy our main site several times a day, and our monitoring provides the tight feedback loop that makes this possible. The same goes for infrastructure changes, which are deployed in a similar fashion of small and frequent changes. To make this work we have built up monitoring that tracks changes and potential problems in every nook and cranny of the Etsy stack, be it a network change, systems- or application-level performance, or how bad the last week of on-call rotation was.

The flip side of monitoring all the things, however, is that we have a myriad of graphs and alerts, any of which can page the on-call engineer at any given time. As the monitoring system keeps growing, so does the risk of alert fatigue and of normalizing deviance through improperly scoped checks. This is why we also continuously monitor our monitoring system itself and keep asking whether we have all the information at hand when we get paged, whether certain alerts actually need to wake someone up, and whether they are needed at all.

I will give a quick overview of how our monitoring stack is built and then share how we gather data about it and use that data to make things better — both for site operations and, more importantly, for the human getting paged when something goes wrong.


Daniel Schauenberg

October 09, 2014


  1. Data Driven Monitoring Daniel Schauenberg @mrtazz


  4. Item by TheBackPackShoppe



  7. How comfortable are you deploying a change right now?

  8. “If this is your first day at Etsy, you deploy the site”

  12. Ganglia • System level metrics • Instance per DC/environment • > 220k RRD files • Fully configured through Chef role attributes
  13. Rainbow Graphs!

  14. StatsD
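StatsD's wire format is simple enough to sketch from scratch: one UDP datagram per metric, in the form `<name>:<value>|<type>`. The metric names below are made up for illustration, and the host/port are the common defaults rather than anything from the talk.

```python
import socket

# StatsD speaks a line-based UDP protocol: "<name>:<value>|<type>", where the
# type is "c" for counters, "ms" for timers, or "g" for gauges.
def statsd_send(metric, host="127.0.0.1", port=8125):
    payload = metric.encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))  # fire-and-forget; UDP never blocks the app
    sock.close()
    return payload  # returned so callers can inspect what was sent

statsd_send("deploys.web:1|c")           # counter: one deploy happened
statsd_send("pages.render_time:320|ms")  # timer: a render took 320 ms
```

The fire-and-forget UDP design is what makes it safe to instrument hot code paths: a dead StatsD daemon costs the application nothing.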

  15. Graphite • Application level metrics • 96G RAM, 20 Cores, 7.3T SSD RAID 10 • 525k metrics/minute • Mirrored Primary/Primary Setup • Functionally sharded relays
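Graphite's plaintext ingestion protocol is just as minimal: one line per datapoint sent over TCP. The metric path and hostname below are illustrative assumptions, not Etsy's actual naming scheme.

```python
import time

# Graphite's plaintext protocol: "<metric.path> <value> <unix_timestamp>\n",
# conventionally sent to TCP port 2003 on a carbon relay or cache.
def graphite_line(path, value, ts=None):
    ts = int(time.time()) if ts is None else int(ts)
    return "%s %s %d\n" % (path, value, ts)

line = graphite_line("servers.web01.cpu.load", 0.42, ts=1412812800)
# Shipping it would look like (hostname is an assumption):
#   sock = socket.create_connection(("graphite.example.com", 2003))
#   sock.sendall(line.encode("ascii"))
```

At 525k metrics/minute, batching many such lines per TCP write and sharding the relays by metric prefix is what keeps ingestion tractable.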

  18. nagios

  19. <3 nagios


  21. Nagios • 2 instances in each DC/environment • Fully Chef generated configuration • Service checks and contacts in git • Notifications via email->SMS gateway • ~75% ops on-call
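Every one of those service checks is a plugin that follows the standard Nagios convention: exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN, plus one line of output. A minimal sketch of that convention (the check and its label are made up):

```python
# Nagios plugin exit-code convention: the exit status encodes the state,
# and the first line of stdout becomes the alert text.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_threshold(value, warn, crit, label="metric"):
    """Compare a measured value against warning/critical thresholds."""
    if value >= crit:
        return CRITICAL, "CRITICAL - %s is %s (>= %s)" % (label, value, crit)
    if value >= warn:
        return WARNING, "WARNING - %s is %s (>= %s)" % (label, value, warn)
    return OK, "OK - %s is %s" % (label, value)

status, message = check_threshold(91, warn=80, crit=90, label="disk_used_pct")
print(message)  # Nagios shows this line in the UI and in notifications
# A real plugin would finish with: sys.exit(status)
```

Because the contract is just "exit code plus one line", checks can be written in any language and kept in git alongside the Chef-generated configuration.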

  24. Much more… • Syslog-ng • Logstash • Logster • Supergrep • Eventinator
  25. Information Overload

  26. Alert Fatigue

  27. We have the data. We can make it better. Item by PicksFromThePast
  29. nagios-herald

  30. nagios-herald

  31. nagios-herald

  32. Failed Check → nagios-herald → Formatter → Helpers (Graphite / Ganglia / Logstash) → Message
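The pipeline on that slide — a formatter enriching a failed check with context from Graphite, Ganglia, or Logstash before a human sees it — can be sketched as follows. This is a Python sketch of the idea, not nagios-herald's actual Ruby API; the Graphite URL pattern and hostnames are assumptions.

```python
# Sketch of the nagios-herald idea: instead of a bare "CHECK CRITICAL" page,
# the formatter attaches context the responder would otherwise look up at 3am.
def format_alert(check_name, hostname, state,
                 graphite_base="https://graphite.example.com"):
    # Hypothetical render URL; real deployments map checks to metric paths.
    graph_url = "%s/render?target=servers.%s.%s&from=-1h" % (
        graphite_base, hostname, check_name)
    return {
        "subject": "%s on %s is %s" % (check_name, hostname, state),
        "body": "State: %s\nContext graph: %s" % (state, graph_url),
    }

msg = format_alert("disk_usage", "web01", "CRITICAL")
```

The payoff is that the page itself answers "how bad is it, and since when?", instead of merely announcing that something is wrong.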


  35. opsweekly


  37. Opsweekly

  38. Alert categorization
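Categorizing alerts is where the "use the data" part becomes concrete: if the on-call engineer tags every page during the weekly review, a simple tally shows which alerts earn their keep. The pages and category names below are illustrative, not Opsweekly's actual schema.

```python
from collections import Counter

# Each page the on-call person received, tagged with what it required.
# Hypothetical week of data for illustration.
pages = [
    ("disk_usage on web01", "action_taken"),
    ("cron_wrapper flapping", "no_action_needed"),
    ("disk_usage on web02", "action_taken"),
    ("stale_check on db03", "needs_tuning"),
]

by_category = Counter(category for _, category in pages)
# Alerts that never require action are candidates for re-scoping or deletion.
noise_ratio = by_category["no_action_needed"] / float(len(pages))
```

Tracked week over week, a falling noise ratio is direct evidence that pruning and re-scoping checks is working.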

  39. Wearables! Item by JennysTrinketShoppe

  40. Sleep tracking


  42. Summary • Set of trusted tools for monitoring • Always experiment • Always learn • Always improve • Use the data, Luke
  43. Shout out to @lozzd and @Ryan_Frantz


  45. Questions?

  46. Data Driven Monitoring Daniel Schauenberg @mrtazz