At Etsy, about 150 engineers deploy a single monolithic application more than 60 times a day. Deploying small changesets continuously lets us build and release robust features and detect and fix bugs extremely fast, all while serving over a billion page views per month. Developing and deploying at such high velocity, however, only works because product developers and designers, infrastructure and operations engineers, and the security team work closely together. We have an extremely open culture of sharing (inside and outside the company) and keep surprises to a minimum by getting everybody on the same page about changes.
To explain how we make this work at Etsy, I will describe how the general development process is laid out. A huge part of this is the setup of our development environment. Each engineer has their own VM, which runs a slimmed-down version of the Etsy stack. We use Chef to keep our infrastructure in sync, and the developer VMs are no exception: they run the same cookbooks as the production infrastructure. This is paramount in making sure features are developed in an environment as close to production as possible.
Our whole development process is wrapped in a tight feedback loop, with our CI cluster and our monitoring stack at its center. The CI system has two central tasks. The first is to run the full suite of tests before deployment, plus smoker tests against staging and production. The second, which is much more resource intensive, is to let engineers test their work-in-progress changes against the whole test suite with a single command-line script. I will go into detail about how our setup, which currently consists of about 250 Jenkins build slaves, enables quick feedback and how we continuously work on keeping it fast.
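How the test suite is divided among the build slaves is not covered in this abstract. As a rough illustration of why a large pool of workers shortens feedback, here is a minimal sketch that assumes tests are independent and that historical run times are known. The function names, test names and durations are hypothetical and are not taken from Etsy's setup.

```python
import heapq


def shard_tests(durations, workers):
    """Assign tests to workers, longest test first, always to the least-loaded worker.

    durations: dict mapping test name -> expected run time in seconds (assumed known).
    workers:   number of build workers available (hypothetical parameter).
    """
    if workers < 1:
        raise ValueError("need at least one worker")
    # Heap entries are (seconds assigned so far, worker index).
    heap = [(0.0, i) for i in range(workers)]
    heapq.heapify(heap)
    shards = [[] for _ in range(workers)]
    # Sort by descending duration; break ties by name for a stable result.
    for name, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))
    return shards


def wall_clock(durations, shards):
    """Time until the slowest worker finishes its shard."""
    return max((sum(durations[t] for t in shard) for shard in shards), default=0.0)


if __name__ == "__main__":
    timings = {"unit": 40, "integration": 30, "functional": 20, "lint": 10}
    for n in (1, 2, 4):
        print(n, "workers:", wall_clock(timings, shard_tests(timings, n)), "seconds")
```

The greedy split is only one possible strategy; it is used here because it is short and makes the effect of adding workers easy to see.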
Once changes are in production, we rely on a large set of dashboards, log parsing and alerting tools to detect regressions and bugs as fast as possible and fix them with the next deploy. Beyond detecting problems quickly, our many dashboards also provide a way to share the current state of etsy.com and enable efficient and productive discussions within and across teams by sharing a simple URL in IRC. I will talk about how we use those tools every day, how everybody sits down and investigates what's going on when a deploy goes wrong, and how we all learn from those incidents by sharing successes and failures openly.
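To make the idea of spotting a regression right after a deploy concrete, the following sketch compares error counts per minute before and after a deploy. It is an assumption-laden illustration, not Etsy's alerting logic: the function name, the `tolerance` multiplier and the `floor` value are all hypothetical.

```python
from statistics import mean


def error_rate_regressed(before, after, tolerance=1.5, floor=1.0):
    """Return True if the mean error rate after a deploy clearly exceeds the baseline.

    before, after: lists of errors-per-minute samples around the deploy.
    tolerance:     how many times the baseline counts as a regression (hypothetical).
    floor:         minimum baseline, so a quiet period does not flag tiny changes.
    """
    if not before or not after:
        raise ValueError("need samples on both sides of the deploy")
    baseline = max(mean(before), floor)
    return mean(after) > tolerance * baseline


if __name__ == "__main__":
    print(error_rate_regressed([2, 3, 2, 3], [2, 3, 3, 2]))
    print(error_rate_regressed([2, 3, 2, 3], [9, 12, 10, 11]))
```

A check like this only says that something changed. Deciding what went wrong still means looking at the graphs and logs together.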
At Etsy it is every engineer's responsibility to deploy their own changes using Deployinator, a one-button deployment system we have written and open sourced. The system is integrated into the company-wide IRC network, serves as the canonical way to deploy changes, and provides a set of features for gaining confidence in the changeset that is about to go live. I will give insights into how the system works and how it has changed over time to accommodate use cases we saw for better communicating change and for giving people an efficient discussion and a proper view of the current state when something doesn't go according to plan.
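The shape of such a push-button flow can be sketched as a tiny state machine. This is a toy model, not Deployinator's code or API: the class, the method names and the message format are hypothetical, and the chat integration is reduced to a callable that receives each announcement.

```python
class DeployPipeline:
    """Toy model of a two-step deploy: staging first, then production."""

    def __init__(self, announce):
        self.announce = announce  # callable taking one message string, e.g. a chat hook
        self.staged = None        # revision currently on staging
        self.live = None          # revision currently in production

    def push_to_staging(self, user, revision):
        """First button: put a revision on staging and tell everyone."""
        self.staged = revision
        self.announce(f"{user} pushed {revision} to staging")

    def push_to_production(self, user):
        """Second button: promote whatever is on staging, never an arbitrary revision."""
        if self.staged is None:
            raise RuntimeError("nothing on staging to promote")
        self.live = self.staged
        self.announce(f"{user} pushed {self.live} to production")


if __name__ == "__main__":
    pipeline = DeployPipeline(print)
    pipeline.push_to_staging("alice", "abc123")
    pipeline.push_to_production("alice")
```

Routing every announcement through one callable keeps the sketch independent of any particular chat system. Promoting only what is already on staging is the design choice that removes ambiguity about what goes live.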
Continuous deployment and ongoing collaboration across teams in engineering and operations are the foundation for moving fast and iterating on products and features. We have a strong culture of taking responsibility and sharing knowledge, successes and failures to build a successful and resilient engineering team. This talk will give deep insights into how we develop software at Etsy and which tools and processes we use to achieve our goals.
This is a revised version of my talk from QCon London, March 2014.
and Collaboration at Etsy
Item by TheBackPackShoppe
avg 50 deploys/
avg n > m deploys/
are you deploying
a change right
Item by RocajoStudio
“If this is your first day at Etsy, you deploy the site”
• Every engineer has one
• Fully Chef’d with the Etsy Stack
• Different sizes and Chef roles
• Run set of tests before each deploy
• Full QA suite
• Princess/Production smoker tests
• Try (yup, there is one)
• LXC virtualized hosts
• 14/physical hosts
• Spread over 3 SSDs
• Most of them attached to try
Item by decomodwalls
• 2 Buttons, no ambiguity
• Overview of current state of deploy
• Links to Logwatcher and Dashboards
• Easy to add stacks for new tools to deploy
• Devs do their feature monitoring
• Everybody can access all the graphs
• Dashboard All The Things!
• Stream All The Logs!
If you are writing
code, you are
• ops on-call
• dev on-call
• payments on-call
• support on-call
• On-call for 3 days
• All developers who are not in another
• L1 and L2 escalations
• L1 if it’s your first time
• “This graph looks funny”
• “Hey I just got paged for elevated error rate
• “Supergrep is going crazy!!”
Is the site down?
• only outage related conversations
• coordinate investigation, communication, countermeasures and monitoring
• good place to lurk for new engineers
• These are things that work for *us*
• Culture is an on-going effort
• Share everything
• Encourage learning/teaching
• Lunch ’n learns
• DC visits
• On-call for a day
• Bootcamps/Senior rotations