A Whirlwind Tour of Etsy's Monitoring Stack

It's no secret that at Etsy we are big fans of small, incremental, frequent changes and tight feedback loops. This is how we deploy changes to our main codebase more than 50 times a day and safely apply changes to our infrastructure in a continuous fashion, which lets us rapidly fix bugs and roll out features in both our application stack and our infrastructure. None of this would be possible without a myriad of monitoring tools that keep us informed about changes and possible problems in every nook and cranny of the Etsy stack, whether it's a network change event, system- or application-level performance, or how bad the last week of the on-call rotation was.

Daniel Schauenberg

May 06, 2014

Transcript

  1. A Whirlwind Tour of Etsy's Monitoring Stack Daniel Schauenberg dschauenberg@etsy.com @mrtazz
  2. None
  3. None
  4. None
  5. Item by TheBackPackShoppe

  6. How comfortable are you deploying a change right now?

  7. “If this is your first day at Etsy, you deploy the site”
  8. None
  9. None
  10. Ganglia • System level metrics • Instance per DC/environment • > 220k RRD files • Fully configured through Chef role attributes
  11. Rainbow Graphs!

  12. StatsD • Single instance on one server • Traffic mostly from 70 Web & 24 API servers • Node.js • Heavy Sampling • Graphite as backend
  13. None
  14. Graphite • Application level metrics • 96G RAM, 20 Cores, 7.3T SSD RAID 10 • 525k metrics/minute • Mirrored Master/Master Setup • Functionally sharded relays
  15. Diagram labels: CNAME • relays • relays • caches • caches • statsdtimers • statsdcounts • statsd • chef • logster • fqld • search • generic
  16. None
  17. None
  18. Syslog-Ng • Web, Search, Gearman, Photos, Nagios, Network, VPN • 1.2GB written/minute • Chef role attribute based config • Rule ordering!
  19. None
  20. github.com/etsy/logster • Extract metrics from log files • Written in Python • Runs every minute via cron
  21. Splunk • Indexes all of our log files • Easy search for patterns • Saved searches for interesting ones • Basically using it as a glorified grep
  22. Logstash • Experiment status • Makes it easier to integrate different sources • Easy to set up in dev environment • Trying to figure out where/how it fits into our infrastructure
  23. Eventinator • Tracks all events in our infrastructure • Chef runs and changes • DNS changes • Network • Deploys • Server provisioning and decommissioning • ~ 12 million events in the last 2 years
  24. None
  25. Chef • rules everything around me • Same cookbooks on prod and dev • every node runs Chef every 10 minutes • ton of knife plugins and handlers
  26. None
  27. > 120 recipes

  28. None
  29. Nagios

  30. Nagios • 2 instances in each DC/environment • Fully Chef generated configuration • Service checks and contacts in git • Notifications via email->SMS gateway • ~75% ops on-call
  31. github.com/lozzd/nagdash

  32. None
  33. None
  34. None

  35. Nagios Herald • Add context to nagios alerts • What are the first 5 things you do when you get paged? • You already have the phone in your hand • nagios notification handler
  36. None
  37. The Toys are real

  38. There’s another side of heaven

  39. Ops Weekly

  40. Ops Weekly

  41. Summary • Set of trusted tools • Enhance where they fall short • Try out new things • Write tools where applicable • Continuous monitoring and adaptation
  42. codeascraft.com • etsy.com/codeascraft/talks • etsy.github.com • etsy.com/careers

  43. Questions?

  44. A Whirlwind Tour of Etsy's Monitoring Stack Daniel Schauenberg dschauenberg@etsy.com @mrtazz
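
Slide 12 mentions "Heavy Sampling" for StatsD without showing what that looks like on the wire. The following is a minimal sketch, not Etsy's actual client code, assuming the plain StatsD UDP line protocol in which a counter is sent as `name:value|c` and a sampled counter carries an `|@rate` suffix. The metric name `web.requests`, the helper names and the host are hypothetical; 8125 is the conventional StatsD port.

```python
import random
import socket


def format_counter(name, value=1, sample_rate=1.0):
    """Build a StatsD counter line such as 'web.requests:1|c|@0.1'."""
    line = f"{name}:{value}|c"
    if sample_rate < 1.0:
        # The rate suffix lets the server scale the count back up.
        line += f"|@{sample_rate}"
    return line


def maybe_send(name, sample_rate, send, rng=random.random):
    """Send roughly `sample_rate` of all increments and drop the rest.

    `send` is any callable that takes the formatted line; `rng` is
    injectable so the sampling decision can be tested deterministically.
    """
    if sample_rate < 1.0 and rng() >= sample_rate:
        return False  # dropped on the client, no packet leaves the host
    send(format_counter(name, 1, sample_rate))
    return True


def udp_sender(host="127.0.0.1", port=8125):
    """Return a fire-and-forget UDP sender (host and port are assumptions)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(line):
        sock.sendto(line.encode("ascii"), (host, port))

    return send
```

With a sample rate of 0.1, a call such as `maybe_send("web.requests", 0.1, udp_sender())` emits a packet for about one in ten increments, which is one way a single StatsD instance could absorb traffic from many web and API servers.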