At Etsy we are big fans of graphing and monitoring all the things. We deploy our main site several times a day, and our monitoring provides the tight feedback loop we need to make this possible. The same goes for infrastructure changes, which are deployed in a similar fashion: small and frequent. For this to work we have built up monitoring that tracks changes and possible problems in every nook and cranny of the Etsy stack, be it a network change, system or application level performance, or how bad the last week of on-call rotation was.
The flipside of monitoring all the things, however, is that we have a myriad of graphs and alerts that can potentially be important and page the on-call engineer at any given time. The risk of running into alert fatigue and a normalization of deviance through improperly scoped checks rises as the monitoring system keeps growing. This is why we also continuously monitor our monitoring system and ask whether we have all the information at hand when we get paged, whether certain alerts actually need to wake someone up, and whether they are needed at all.
I will give a quick overview of how our monitoring stack is built and then give insights into how we gather data about it and use it to make things better, both for site operations and, more importantly, for the human getting paged when something does go wrong.