Development, Deployment & Collaboration at Etsy

At Etsy, about 150 engineers deploy a single monolithic application more than 60 times a day. Continuously deploying small changesets lets us build up and release robust features, and detect and fix bugs extremely fast, all while serving over a billion page views per month. Developing and deploying at such a high velocity, however, only works because product developers and designers, infrastructure and operations engineers, and the security team work closely together. We have an extremely open culture of sharing (inside and outside the company) and keep surprises to a minimum by getting everybody on the same page about changes.

To explain how we make this work at Etsy, I will describe how the general development process is laid out. A huge part of this is the setup of our development environment. Each engineer has their own VM, which runs a slimmed-down version of the Etsy stack. We use Chef to keep our infrastructure in sync, and the developer VMs are no exception: they run the same cookbooks as the production infrastructure. This is paramount in making sure features are developed in an environment as close to production as possible.

Our whole development process is wrapped in a tight feedback loop, with our CI cluster and our monitoring stack at its center. The CI system has two central tasks. The first is to run the full suite of tests before deployment, plus smoker tests against staging and production. The second, which is much more resource intensive, is to let engineers test their work-in-progress changes against the whole test suite with a single command-line script. I will go into detail on how our setup, which currently consists of about 250 Jenkins build slaves, enables quick feedback and how we continuously work on keeping it fast.
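As a rough illustration of the second task, the sketch below packages an uncommitted diff so that a CI job could apply it to a clean checkout and run the test suites against it. This is not Etsy's actual script: `TryRequest`, `build_try_request`, the suite names and the idea of a short content hash are all assumptions made up for this example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TryRequest:
    """A work-in-progress change packaged for a CI run (hypothetical shape)."""
    user: str
    patch: str
    patch_id: str
    suites: tuple


def build_try_request(user, patch, suites=("unit", "integration")):
    """Package an uncommitted diff so CI can apply it to a clean checkout."""
    if not patch.strip():
        raise ValueError("nothing to try: the working copy has no changes")
    # A short content hash lets repeated submissions of the same patch be recognized.
    patch_id = hashlib.sha1(patch.encode("utf-8")).hexdigest()[:12]
    return TryRequest(user=user, patch=patch, patch_id=patch_id, suites=tuple(suites))
```

In a real command-line script the patch text would typically come from something like `git diff HEAD`, and the request would be submitted to the CI server as job parameters. Both steps are left out here because they depend on the local setup.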

Once changes are in production, we have a big set of dashboards, log parsing and alerting tools to make sure we can detect regressions and bugs as fast as possible and fix them with the next deploy. Besides providing a quick way to detect problems, our many dashboards also provide a way to quickly share the current state of etsy.com and enable efficient, productive discussions within and across teams by sharing a simple URL in IRC. I will talk about how we use those tools every day, how everybody sits down and investigates what is going on after a faulty deploy, and how we all learn from those incidents by sharing successes and failures openly.
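One simple way to think about detecting a regression after a deploy is to compare the error rate in a window before the deploy with the rate in a window after it. The sketch below does exactly that. The function names, the `factor` threshold and the `min_rate` floor are illustrative assumptions, not a description of Etsy's alerting rules.

```python
def error_rate(errors, requests):
    """Errors per request; an idle window counts as 0.0."""
    return errors / requests if requests else 0.0


def looks_like_regression(before, after, factor=2.0, min_rate=0.001):
    """Compare two (errors, requests) windows around a deploy.

    Flags the deploy when the error rate afterwards is at least `factor`
    times the rate before and also at least `min_rate`, so that noise on a
    near-zero baseline does not page anyone.
    """
    rate_before = error_rate(*before)
    rate_after = error_rate(*after)
    if rate_after < min_rate:
        return False
    return rate_after >= factor * rate_before
```

For example, `looks_like_regression((10, 10000), (50, 10000))` flags a fivefold jump, while a move from 10 to 12 errors over the same traffic stays below the threshold.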

At Etsy it is every engineer's responsibility to deploy their own changes using Deployinator, a one-button deployment system we have written and open sourced. The system is integrated into the company-wide IRC network, serves as the canonical way to deploy changes, and provides a set of features for gaining confidence in the changeset that is about to go live. I will give insights into how the system works and how it has changed over time to accommodate use cases we saw for communicating change better and for giving people an efficient discussion and a proper view of the current state when something doesn't go according to plan.
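The idea of a push-button deploy that announces itself in chat can be modeled in a few lines. The toy sketch below is not Deployinator's code or API: the `DeployPipeline` class, its method names, the message format and the rule that production only receives what is already on staging are all assumptions chosen for illustration.

```python
class DeployPipeline:
    """Two buttons: staging first, then production (hypothetical model)."""

    def __init__(self, announce):
        self.announce = announce  # callable that posts one line to a chat channel
        self.staged = None        # revision currently on staging
        self.live = None          # revision currently in production

    def push_to_staging(self, user, revision):
        self.staged = revision
        self.announce(f"{user} pushed {revision} to staging")

    def push_to_production(self, user):
        # Only a revision that has already been on staging can go live.
        if self.staged is None:
            raise RuntimeError("nothing on staging to promote")
        self.live = self.staged
        self.announce(f"{user} pushed {self.live} to production")
```

Passing `print` (or a function that writes to an IRC bot) as `announce` means every button press leaves a visible record, which is the property that makes a shared channel useful during an incident.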

Continuous deployment and ongoing collaboration across teams in engineering and operations are the foundation of moving fast and iterating on products and features. We have a strong culture of taking responsibility and sharing knowledge, successes and failures to build a successful and resilient engineering team. This talk will give deep insights into how we develop software at Etsy and which tools and processes we use to achieve our goals.

This is a revised version of my talk from QCon London, March 2014.

Daniel Schauenberg

June 19, 2014

Transcript

  1. Development, Deployment and Collaboration at Etsy Daniel Schauenberg dschauenberg@etsy.com @mrtazz

  2. None
  3. @mrtazz Etsy Stats

  4. @mrtazz Etsy Stats

  5. @mrtazz Item by TheBackPackShoppe

  6. http://www.flickr.com/photos/brianglanz/1095706242

  7. avg 50 deploys/day

  8. avg n > m deploys/day

  9. How comfortable are you deploying a change right now?

  10. @mrtazz http://www.flickr.com/photos/renaissancechambara/2349811492 small change

  11. Config Flags Item by RocajoStudio

  12. None
  13. “If this is your first day at Etsy, you deploy the site”
  14. Developer VMs

  15. @mrtazz Developer VMs • KVM • Every engineer has one • Fully Chef’d with the Etsy Stack • Different sizes and Chef roles
  16. None
  17. Continuous Integration

  18. None
  19. @mrtazz Continuous Integration • Run set of tests before each deploy • Full QA suite • Princess/Production smoker tests • Try (yup, there is one)
  20. http://www.flickr.com/photos/egfocus/6962179321

  21. @mrtazz The Bobs • LXC virtualized hosts • 14/physical hosts • Spread over 3 SSDs • Most of them attached to try
  22. None
  23. Item by decomodwalls

  24. Deployinator

  25. @mrtazz Deployinator • 2 Buttons, no ambiguity • Overview of current state of deploy • Links to Logwatcher and Dashboards • Easy to add stacks for new tools to deploy
  26. http://www.flickr.com/photos/jbgeronimi/6363087361

  27. None
  28. Monitoring

  29. @mrtazz shouldigraphit.com

  30. @mrtazz Monitoring • Devs do their feature monitoring • Everybody can access all the graphs • Dashboard All The Things! • Stream All The Logs!
  31. None
  32. None
  33. None
  34. On Call

  35. If you are writing code, you are on-call

  36. @mrtazz On-Call Schedules • ops on-call • dev on-call • payments on-call • support on-call
  37. None
  38. @mrtazz Dev On-Call • On-call for 3 days • All developers who are not in another rotation • L1 and L2 escalations • L1 if it’s your first time
  39. Incident Response

  40. @mrtazz Incident Response • “This graph looks funny” • “Hey I just got paged for elevated error rate after deploys” • “Supergrep is going crazy!!”
  41. Is the site down?

  42. None
  43. #warroom

  44. @mrtazz #warroom • only outage related conversations • coordinate investigation, communication, countermeasures and monitoring • good place to lurk for new engineers
  45. Post Mortems

  46. blameless

  47. Everybody’s invited

  48. Learning Opportunity

  49. Summary

  50. @mrtazz Summary • These are things that work for *us* • Culture is an on-going effort • Share everything • Encourage learning/teaching
  51. @mrtazz Summary • Lunch ’n learns • DC visits • On-call for a day • Bootcamps/Senior rotations
  52. codeascraft.com etsy.com/codeascraft/talks etsy.github.com etsy.com/careers

  53. Questions?

  54. Development, Deployment and Collaboration at Etsy Daniel Schauenberg dschauenberg@etsy.com