A Whirlwind Tour of Etsy's Monitoring Stack

It's no secret that at Etsy we are big fans of small, incremental, frequent changes and tight feedback loops. This is how we deploy changes to our main codebase more than 50 times a day and safely apply changes to our infrastructure in a continuous fashion, which lets us rapidly fix bugs and roll out features across our application stack and infrastructure. None of this would be possible without a myriad of monitoring tools that keep us informed about changes and possible problems in every nook and cranny of the Etsy stack, whether it's a network change event, system- or application-level performance, or how bad the last week of on-call rotation was.

Daniel Schauenberg

May 06, 2014

Transcript

  1. A Whirlwind Tour of Etsy's Monitoring Stack
    Daniel Schauenberg
    [email protected]
    @mrtazz

  5. Item by TheBackPackShoppe

  6. How comfortable are you deploying a change right now?

  7. “If this is your first day at Etsy, you deploy the site”

  10. Ganglia
    • System level metrics
    • Instance per DC/environment
    • > 220k RRD files
    • Fully configured through Chef role attributes

  11. Rainbow Graphs!

  12. StatsD
    • Single instance on one server
    • Traffic mostly from 70 Web & 24 API servers
    • Node.js
    • Heavy Sampling
    • Graphite as backend
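
The "Heavy Sampling" bullet can be illustrated with a minimal Python sketch of client-side sampling. It assumes the standard StatsD counter line format (`name:value|c|@rate`); the metric name `web.requests`, the helper names, and the host and port arguments are hypothetical and do not come from the talk.

```python
import random
import socket
from typing import Callable, Optional


def format_counter(name: str, value: int = 1, sample_rate: float = 1.0) -> str:
    """Build a StatsD counter line such as 'web.requests:1|c|@0.1'."""
    line = f"{name}:{value}|c"
    if sample_rate < 1.0:
        # The rate tells the server how much to scale the count back up.
        line += f"|@{sample_rate}"
    return line


def sampled_counter(
    name: str,
    sample_rate: float,
    rng: Callable[[], float] = random.random,
) -> Optional[str]:
    """Return the line to send, or None when this event is sampled out."""
    if rng() >= sample_rate:
        return None
    return format_counter(name, 1, sample_rate)


def send(line: str, host: str, port: int) -> None:
    """Fire-and-forget UDP send (host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

With a rate of 0.1, roughly one in ten events reaches the StatsD server, which is one way a single instance can keep up with traffic from many web and API servers.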

  14. Graphite
    • Application level metrics
    • 96G RAM, 20 Cores, 7.3T SSD RAID 10
    • 525k metrics/minute
    • Mirrored Master/Master Setup
    • Functionally sharded relays
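
One plausible reading of "functionally sharded relays" is that each metric is routed to a relay by its namespace. The Python sketch below shows first-match prefix routing under that assumption. The destination names are the labels that appear on the relay diagram slide; the prefix patterns themselves are hypothetical and not taken from the talk.

```python
from typing import List, Tuple

# (prefix, destination) pairs, checked in order. Prefixes are assumptions;
# more specific prefixes must come before more general ones.
ROUTES: List[Tuple[str, str]] = [
    ("stats.timers.", "statsdtimers"),
    ("stats_counts.", "statsdcounts"),
    ("stats.", "statsd"),
    ("chef.", "chef"),
    ("logster.", "logster"),
    ("fqld.", "fqld"),
    ("search.", "search"),
]
DEFAULT_DESTINATION = "generic"


def route(metric_path: str) -> str:
    """Return the destination for a metric; the first matching prefix wins."""
    for prefix, destination in ROUTES:
        if metric_path.startswith(prefix):
            return destination
    return DEFAULT_DESTINATION
```

Ordering matters here for the same reason it does in any first-match rule list: a general prefix placed above a specific one would shadow it.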

  15. [Diagram labels: CNAME; relays (x2); caches (x2); statsdtimers, statsdcounts, statsd, chef, logster, fqld, search, generic]

  18. Syslog-Ng
    • Web, Search, Gearman, Photos, Nagios, Network, VPN
    • 1.2GB written/minute
    • Chef role attribute based config
    • Rule ordering!

  20. github.com/etsy/logster
    • Extract metrics from log files
    • Written in Python
    • Runs every minute via cron
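
The idea of extracting metrics from log files can be sketched in plain Python. This is not logster's own parser API; it is a standalone illustration that counts HTTP status classes in access-log lines and formats them in Graphite's plaintext format (`path value timestamp`). The log format, the `logster.web` prefix, and the function names are assumptions.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches the status code that follows the quoted request in a common
# access-log line, e.g. ... "GET / HTTP/1.1" 200 512
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_status_classes(lines: Iterable[str]) -> Counter:
    """Count lines per HTTP status class (2xx, 5xx, ...); skip non-matching lines."""
    counts: Counter = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[f"http_{match.group(1)[0]}xx"] += 1
    return counts


def to_graphite(counts: Counter, prefix: str, timestamp: int) -> List[str]:
    """Format counts as Graphite plaintext lines, sorted by metric name."""
    return [
        f"{prefix}.{name} {value} {timestamp}"
        for name, value in sorted(counts.items())
    ]
```

A cron job in this style would read only the lines written since its previous run, compute the counts, and ship the resulting lines to the metrics backend.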

  21. Splunk
    • Indexes all of our log files
    • Easy search for patterns
    • Saved searches for interesting ones
    • Basically using it as a glorified grep

  22. Logstash
    • Experiment status
    • Makes it easier to integrate different sources
    • Easy to set up in dev environment
    • Trying to figure out where/how it fits into our infrastructure

  23. Eventinator
    • Tracks all events in our infrastructure
    • Chef runs and changes
    • DNS changes
    • Network
    • Deploys
    • Server provisioning and decommissioning
    • ~ 12 million events in the last 2 years

  25. Chef
    • rules everything around me
    • Same cookbooks on prod and dev
    • Every node runs Chef every 10 minutes
    • A ton of knife plugins and handlers

  27. > 120 recipes

  29. Nagios

  30. Nagios
    • 2 instances in each DC/environment
    • Fully Chef generated configuration
    • Service checks and contacts in git
    • Notifications via email->SMS gateway
    • ~75% ops on-call

  31. github.com/lozzd/nagdash

  35. Nagios Herald
    • Add context to Nagios alerts
    • What are the first 5 things you do when you get paged?
    • You already have the phone in your hand
    • Nagios notification handler

  37. The Toys are real

  38. There’s another side of heaven

  39. Ops Weekly

  40. Ops Weekly

  41. Summary
    • Set of trusted tools
    • Enhance where they fall short
    • Try out new things
    • Write tools where applicable
    • Continuous monitoring and adaptation

  42. codeascraft.com

    etsy.com/codeascraft/talks

    etsy.github.com

    etsy.com/careers

  43. Questions?
