A Whirlwind Tour of Etsy's Monitoring Stack

A Whirlwind Tour of Etsy's Monitoring Stack Daniel Schauenberg [email protected]
@mrtazz

Item by TheBackPackShoppe

How comfortable are you deploying a change right now?

“If this is your first day at Etsy, you deploy the site”

Ganglia
• System level metrics
• Instance per DC/environment
• > 220k RRD files
• Fully configured through Chef role attributes

Rainbow Graphs!

StatsD
• Single instance on one server
• Traffic mostly from 70 Web & 24 API servers
• Node.js
• Heavy Sampling
• Graphite as backend
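Heavy sampling means a client sends only a fraction of its events and tags each datagram with the sample rate, so the server can scale the counts back up. The following is a minimal Python sketch of that idea using the StatsD counter wire format (`name:value|c|@rate`). The metric name, host, and port are placeholders chosen for illustration, not Etsy's actual configuration.

```python
import random
import socket


def statsd_counter(name, value=1, sample_rate=1.0, rng=random.random):
    """Build a StatsD counter datagram, or return None if this event is sampled out."""
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    # Drop the event with probability (1 - sample_rate).
    if sample_rate < 1 and rng() >= sample_rate:
        return None
    payload = f"{name}:{value}|c"
    # Tag sampled datagrams so the server can scale the count back up.
    if sample_rate < 1:
        payload += f"|@{sample_rate}"
    return payload


def send_udp(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port here are assumed defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric name; roughly 1 in 10 calls produces a datagram.
    datagram = statsd_counter("web.requests", sample_rate=0.1)
    if datagram is not None:
        send_udp(datagram)
```

Because UDP is connectionless, a sampled-out or lost datagram costs the web request nothing, which is one reason this pattern suits a single StatsD instance fed by many servers.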


Graphite
• Application level metrics
• 96G RAM, 20 Cores, 7.3T SSD RAID 10
• 525k metrics/minute
• Mirrored Master/Master Setup
• Functionally sharded relays

[Diagram: CNAME, relays, caches. Labels: statsdtimers, statsdcounts, statsd, chef, logster, fqld, search, generic]
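Functional sharding can be pictured as routing each metric to a destination based on its name rather than on a hash. Graphite's carbon-relay can do this kind of routing from a rules file; the standalone Python sketch below only illustrates first-match routing. The shard names are taken from the diagram labels, but the metric prefixes attached to them are assumptions made for illustration.

```python
import re

# (shard name, pattern) pairs, checked in order; the first match wins.
# The prefixes are hypothetical; only the shard names come from the slide.
RULES = [
    ("statsdtimers", re.compile(r"^stats\.timers\.")),
    ("statsdcounts", re.compile(r"^stats_counts\.")),
    ("statsd", re.compile(r"^stats\.")),
    ("chef", re.compile(r"^chef\.")),
    ("logster", re.compile(r"^logster\.")),
]
DEFAULT_SHARD = "generic"


def pick_shard(metric_path):
    """Return the name of the shard that should receive this metric."""
    for shard, pattern in RULES:
        if pattern.search(metric_path):
            return shard
    return DEFAULT_SHARD


def plaintext_line(metric_path, value, timestamp):
    """Format one data point in Graphite's plaintext protocol."""
    return f"{metric_path} {value} {int(timestamp)}\n"


if __name__ == "__main__":
    for path in ("stats.timers.api.latency.mean", "stats.web.requests", "servers.web01.load"):
        print(pick_shard(path), plaintext_line(path, 1, 1700000000), end="")
```

Rule order matters here: the more specific `stats.timers.` pattern has to come before the broader `stats.` pattern, otherwise timers would land on the general StatsD shard.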


Syslog-Ng
• Web, Search, Gearman, Photos, Nagios, Network, VPN
• 1.2GB written/minute
• Chef role attribute based config
• Rule ordering!

github.com/etsy/logster
• Extract metrics from log files
• Written in Python
• Runs every minute via cron
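The idea of turning log lines into metrics can be sketched in a few lines of Python. This example does not use logster's own parser classes; it is a standalone illustration that counts HTTP status classes in access-log lines and converts them to per-second rates. The log format, metric names, and the 60-second interval are assumptions.

```python
import re
from collections import Counter

# Matches the status code that follows the quoted request in a
# common-log-style line, e.g. ... "GET / HTTP/1.1" 200 512
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_status_classes(lines):
    """Count log lines per HTTP status class (http_2xx, http_5xx, ...)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[f"http_{match.group(1)[0]}xx"] += 1
    return counts


def per_second(counts, duration_seconds):
    """Convert raw counts for one interval into per-second rates."""
    return {name: count / duration_seconds for name, count in counts.items()}


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [01/Jan/2014:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
        '10.0.0.2 - - [01/Jan/2014:00:00:02 +0000] "POST /cart HTTP/1.1" 503 0',
    ]
    # 60 seconds mirrors a once-a-minute cron run.
    print(per_second(count_status_classes(sample), 60))
```

A real run would also need to remember how far into the file the previous run read, so that each cron invocation only processes new lines.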

Splunk
• Indexes all of our log files
• Easy search for patterns
• Saved searches for interesting ones
• Basically using it as a glorified grep

Logstash
• Experiment status
• Makes it easier to integrate different sources
• Easy to set up in dev environment
• Trying to figure out where/how it fits into our infrastructure

Eventinator
• Tracks all events in our infrastructure
• Chef runs and changes
• DNS changes
• Network
• Deploys
• Server provisioning and decommissioning
• ~ 12 million events in the last 2 years


Chef
• rules everything around me
• Same cookbooks on prod and dev
• every node runs Chef every 10 minutes
• ton of knife plugins and handlers

> 120 recipes


Nagios
• 2 instances in each DC/environment
• Fully Chef generated configuration
• Service checks and contacts in git
• Notifications via email->SMS gateway
• ~75% ops on-call

github.com/lozzd/nagdash

Nagios Herald
• Add context to Nagios alerts
• What are the first 5 things you do when you get paged?
• You already have the phone in your hand
• Nagios notification handler
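The "add context" idea can be illustrated as a notification handler that runs the first few diagnostic steps itself and appends the results to the alert text. The Python sketch below is not Nagios Herald's actual API; the alert fields and the context providers are hypothetical and exist only to show the shape of the approach.

```python
def enrich_alert(alert, context_providers):
    """Build a notification body: the alert summary plus one section per provider.

    `alert` is a dict with host, service, state, and output keys (assumed fields).
    `context_providers` is a list of (title, callable) pairs; each callable takes
    the alert and returns a string of extra context.
    """
    sections = [
        f"{alert['host']}/{alert['service']} is {alert['state']}: {alert['output']}"
    ]
    for title, provider in context_providers:
        try:
            body = provider(alert)
        except Exception as exc:
            # A failing provider must never prevent the page from going out.
            body = f"(context unavailable: {exc})"
        sections.append(f"{title}:\n{body}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    alert = {"host": "web01", "service": "disk", "state": "CRITICAL", "output": "/ 95% full"}
    # Hypothetical provider; a real one might run df or fetch a graph.
    providers = [("Disk usage", lambda a: "/dev/sda1 95% /")]
    print(enrich_alert(alert, providers))
```

Catching provider errors is the key design choice: extra context is a convenience, and the alert itself should still be delivered when a lookup fails.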

The Toys are real

There’s another side of heaven

Ops Weekly

Summary
• Set of trusted tools
• Enhance where they fall short
• Try out new things
• Write tools where applicable
• Continuous monitoring and adaptation

codeascraft.com
etsy.com/codeascraft/talks
etsy.github.com
etsy.com/careers

Questions?
