Slide 1

Slide 2

So hey, I’m Dan McKinley, I’m visiting you from Los Angeles

Slide 3

I was an early employee at Etsy and worked there for seven years. I’ve worked for Stripe since, and I cofounded a continuous delivery platform company.

Slide 4

Along the way I’ve talked to a lot of companies that have been interested in continuous delivery.

Slide 5

After I left Etsy, I had this notion that you could take the tools that we used there and just drop them into another company to get it onto the golden path.

Slide 6

But my experience has been that that’s not the case at all. I think one of our mistakes when talking about continuous delivery, at Etsy anyway, was putting the emphasis on the tools.

Slide 7

Continuous delivery is not a set of tools you can buy or fork, or at least it’s not just that. I think all of this work was really important to give the movement credibility. But I think we failed to communicate clearly what it was like to live within the system, and what it was like to keep it working. I wanted to address that with this talk.

Slide 8

One of the key things that made the system work was that we really were trying to get the most out of every engineer we had.

Slide 9

And we were working backwards from the goals of changing a lot of stuff, and doing it safely.

Slide 10

And as we scaled up, we tried really hard not to give up on the idea that everyone had a stake in fixing production when it was broken.

Slide 11

I haven’t seen a lot of material about how you actually go about organizing high numbers of safe deploys every day.

Slide 12

It turns out that this is really more of a human orchestration problem than it is a technical problem. So it may be less straightforward than a rant about rsync, but I’m going to try to do it.

Slide 13

How do you even get to the point of deploying 40 or 60 times a day?

Slide 14

We got there on purpose, at the same time we were growing as an organization. We believed that deploying as often as possible would lead to safety and development velocity, and I think we were vindicated in that. But ramping that up wasn’t immediate. It went like this.

Slide 15

Our company wasn’t conceived as a continuous delivery engineering organization. Pretty much the opposite actually. So when we decided we were going to aim for deploying many times a day, there was a lot of pre-existing process that had to be destroyed. I’ll talk about some of that.

Slide 16

After that there was a middle epoch where we improved our process rapidly, and got a lot more deploys per day.

Slide 17

But as we were doing this our engineering team grew from a few dozen folks up to well over a hundred. And it’s nontrivial to go from a handful of people deploying 40 times a day up to a hundred people deploying 40 times per day.

Slide 18

A great deal of effort went into sustaining that velocity once we had it.

Slide 19

We had a lot of places along the way where there were challenges or the process broke down entirely. With this talk I wanted to put together a narrative of how we fought through some of the key problems we had.

Slide 20

I was with the company up until we had about 150 people deploying regularly.

Slide 21

Beyond that I imagine you may need different things.

Slide 22

Ok so like I said, we had to destroy some process to get started.

Slide 23

Namely, we had a lot of process that was prophylactic. It was built with the intent of finding production problems before they reached production.

Slide 24

Let me tell you a quick story. In 2008, Etsy’s engineering founders parted ways with the company. And those folks meant well, but they had been gatekeeping production pretty hard. We had not really been able to touch the site, and now we were expected to. There was also no production monitoring to speak of. We suddenly found ourselves out to sea.

Slide 25

My reaction to this was to start writing some integration tests, using Selenium, that web driver toolkit Zenefits used to break the law. That seemed sensible as a way to have at least a modicum of safety as we started changing things. It was also nice that this was a proactive thing I could do with literally nobody helping me.

Slide 26

The thing you should know about writing web-driving integration tests is that doing this well is at least as hard as writing a multithreaded program. And that’s a domain where human competence isn’t estimated to be high. The other problem with testing across process boundaries is that failure is a thing that can’t be entirely avoided. So what you tend to wind up with there is a lot of tests that work most of the time. Even if you’re really good and prevent most race conditions, you’re still going to have some defects. Given enough tests, that means that one is always failing.

Slide 27

We solved that by deleting all 1500 tests, representing several human years of effort.

Slide 28

We went back to the drawing board here. We realized that the whole point of tests is to gain confidence before changing things. And to the extent that there are false negatives in the tests, or the tests gum up the works, they’re doing the opposite of that. Tests are one way to gain some confidence that a change is safe. But that’s all they are. Just one way.

Slide 29

Ramping up code very gradually in production is another way.
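
To make the rampup idea concrete, here’s a minimal sketch of percentage-based ramping. This isn’t Etsy’s actual feature API, just an illustration with invented names (FEATURE_RAMP, feature_enabled): a config value says what fraction of users get the new code path, and a stable hash keeps each user in the same bucket from request to request.

```python
import hashlib

# Hypothetical runtime config, deployed like any other change.
FEATURE_RAMP = {
    "new_search_backend": 5,  # percent of users routed to the new code path
}

def feature_enabled(feature, user_id):
    """True for a stable slice of users, sized by the ramp percentage."""
    percent = FEATURE_RAMP.get(feature, 0)
    if percent <= 0:
        return False
    if percent >= 100:
        return True
    # Hash the feature name and user id together so each feature gets an
    # independent, but stable, slice of the user base.
    digest = hashlib.sha1(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent
```

Bumping that 5 to 25, and eventually to 100, over successive config deploys is the rampup; setting it back to 0 is the rollback.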

Slide 30

Deploying code in smaller and smaller pieces is another way. In the abstract, every single line of code you deploy has some probability of breaking the site. So if you deploy a lot of lines of code at once, you’re just going to break the site. And you stand a better chance of inspecting code for correctness the less of it there is.

Slide 31

Deploying code so that users can’t see it and then using feature flags to test it in production is another good way to gain confidence.

Slide 32

Over time we did write a lot more tests. Unit tests that finish quickly and don’t have false negatives are a really great confidence-builder.

Slide 33

You may have tests, but you must have monitoring. You can write an infinite number of tests and only asymptotically approach zero problems in production. So at least when building a website, your priority should be knowing things are broken and having the capability to fix them as quickly as possible. Preventing problems is a distant second.

Slide 34

Baroque source control rituals were another pre-existing thing we had that we had to give up on in order to ship code frequently. I’ll just summarize this by saying that git is bad.

Slide 35

Git of course is not really bad. But it was created to support software that’s nothing at all like web software. The Linux kernel has many concurrent supported versions. A website, on the other hand, doesn’t really have a version at all. It has a current state.

Slide 36

GitHub is likewise oriented around open source libraries, and it may be great for that, but there’s no reason to suspect that things you do there should work as well for building an online application. But the tendency among engineers is to start with GitHub, and all of the workflow and cultural baggage that surrounds it. Then they bang on stuff from there until something’s in production.

Slide 37

We eventually came to a realization that there was no point to the rituals we were performing with revision control. It’s better to conceive of a website as a living organism where all of the versions of it are all jumbled together.

Slide 38

The production codebase is your development branch. You write your development code in the production codebase, inside of an if block that turns it off. And you ship that development code to production as a matter of routine.
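As a minimal sketch of what that looks like (hypothetical names, not our actual codebase): the in-progress code lives next to the live code and ships to production, but it sits behind a flag that defaults to off, so nothing changes for users.

```python
# Hypothetical flag store; in practice this would be the shared config system.
FLAGS = {"new_checkout": False}  # ships dark; flip to True to exercise it in production

def legacy_checkout(cart, user):
    ...  # the current live behavior

def new_checkout(cart, user):
    ...  # development code; safe to deploy because it never runs while the flag is off

def checkout(cart, user):
    if FLAGS["new_checkout"]:
        return new_checkout(cart, user)
    return legacy_checkout(cart, user)
```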

Slide 39

So after making those leaps we were able to deploy more often. But new problems arose as we tried to dramatically increase speed.

Slide 40

The story so far has been about actively destroying process and ceremony around deploys. But those were both cases of deciding to destroy process as a team. It’s another thing entirely for individuals to yolo their own destruction of process. One major reason that happens is because the deploy process can be too slow.

Slide 41

If the deploy tooling isn’t made fast, there’s probably a faster and more dangerous way to do things and people will do that. They’ll replace running Docker containers by hand. They’ll hand-edit files on the hosts. I want to stipulate that this doesn’t happen because people are evil, it happens because they’re people and they follow the path of least resistance.

Slide 42

You also will occasionally need to deploy in a big hurry. Maybe you’re experiencing a SQL injection attack, or what have you. You don’t want to be trying to use a different set of fast deployment methods in a crisis.

Slide 43

You want to be exercising the fast methods all of the time. That’s how you can know that it works.

Slide 44

Do you really need to run that whitespace linter in the critical path? Maybe you don’t. The compelling way to deploy should be the right way to deploy.

Slide 45

Eventually we got fast deploys working reliably enough that people started to treat them as routine. But that also introduces some new problems.

Slide 46

If we ship code very rarely, shipping code is a notable event. People will pay a lot of attention to it.

Slide 47

On the other hand, if we ship code a bunch of times every single weekday, then shipping code is no longer special. This all sounds very tautological. But it has implications.

Slide 48

People will stand up from their desks and wander off while they’re deploying. They’ll go get coffee. They’ll go on a walk. The mountains will call them.

Slide 49

It’s entirely possible to automate the whole deployment pipeline. In fact a lot of people think that that is what continuous deployment and/or delivery is. We didn’t do this, and in general I think it’s a bad idea.

Slide 50

Let me give you the following analogy. A while ago Uber yolo’d a self-driving car trial in downtown San Francisco. It ended abruptly, right after a video surfaced of one of the cars rolling right through a pedestrian crossing against a red light.

Slide 51

It’s important to note that there was a human sitting in the driver’s seat, but that person didn’t intervene. Uber blamed that person for the incident. But that’s the wrong way to look at it. The automation was capable enough that the human’s attention very understandably lapsed, but not capable enough to replace the human. The human and the car are, together, the system. Things you do to automate the car affect the human.

Slide 52

Automating deploys is often just like this. You can deploy the code safely, but you can’t accurately predict what it’ll do to the database. And so on. Use taste when it comes to automation.

Slide 53

Make people actively press a button to deploy, even if 99% of your deploys are routine. Because in the rare case that something goes wrong, the human part of the system will be ready to react. Automation that only mostly works is often worse than no automation at all.

Slide 54

Another problem we ran into as we ramped up to lots of deploys was that people skipped steps while deploying.

Slide 55

If you deploy from a command line tool, this tends to be all of the context you have while you’re doing it. This is a popular choice these days, but I think this is a bad idea.

Slide 56

That’s because you need to know what’s going on at the moment you’re deploying. You need to know what you’re pushing, if anyone else is trying to push at the same time, whether or not the site is currently on fire, and so on.

Slide 57

I get that we’re all nerds here and we love command line tools, but my unpopular opinion here is that you want a web cockpit.

Slide 58

Web cockpits can surface all of that status to you in a digestible way, and they can lay out a workflow that comprises multiple steps. They’re also more accessible to people outside of engineering who might also want to deploy things. Which is great, because then you can build a culture where support folks can ship knowledge base changes, or designers can ship CSS changes.

Slide 59

Another thing to keep in mind is that when we live in the world where shipping code isn’t special, that’s not just a problem for the people doing the deployments. It’s a problem for anyone that has to react to deploys.

Slide 60

Releasing code at my first job was a nightmare in every respect. You’d show up on Saturday morning, and spend several days on it. It sucked. But one criticism you couldn’t make of this is that nobody knew it was happening. Brutality isn’t a great ethos, but it is at least an ethos.

Slide 61

Deploying code once a year is one way to achieve awareness. But if we want to deploy every day, we have to get awareness on purpose. We have to intentionally build it.

Slide 62

One thing you need is an easily accessible display of the deployment history, and the current status. You need to know what’s happening now, and what was deployed at any given time.

Slide 63

Automatically drawing deploy lines on all your graphs is also handy, so that you can correlate things happening in production to deploy times.
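
Here’s a rough sketch of what that can look like, assuming a Graphite-style events endpoint; the host and helper name are made up. The deploy tooling posts an event for every deploy, and the dashboards overlay those events as vertical lines on the graphs.

```python
import time
import requests  # assumes the requests library is available

GRAPHITE_EVENTS_URL = "http://graphite.example.com/events/"  # hypothetical host

def record_deploy(sha, deployer):
    """Post a deploy event so dashboards can draw a line at this moment in time."""
    requests.post(GRAPHITE_EVENTS_URL, json={
        "what": "deploy",
        "tags": ["deploy", "web"],
        "when": int(time.time()),
        "data": f"{sha} deployed by {deployer}",
    })
```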

Slide 64

Another thing that goes wrong here, which I’ve seen many other teams doing, is prematurely siloing all of their Slack communications about pushes.

Slide 65

What you get when you do this are incidents where a deploy by one team triggers alerts in the channel of some other team, who isn’t aware that a deploy just happened. You lose the obvious connection between the deploy notification and the alert if you do this.

Slide 66

I think you should coordinate in one push channel, and push that as far as you possibly can. That gives you a shared sense of status.

Slide 67

So those were some of the things that got us deploying safely and often. But eventually, overall developer velocity hits an upper bound. You can’t deploy faster, but your organization continues to grow.

Slide 68

The returns on making deploys faster eventually diminish, and you can only fit so many end-to-end deploys in a day. Something has to give.

Slide 69

One approach, which I definitely pursued for a while, is to make the day longer. Tech employees rise late in my experience, so you can get up at the crack of dawn and ship a bunch of stuff. That’s a bad solution though.

Slide 70

We could split up deployables, but that’s taking on a great deal of operational debt. And it comes with a lot of other baggage, which I’ll avoid ranting about. Suffice it to say I think we should try to exhaust other options before we do that.

Slide 71

What we can do is try to find some common deployment patterns that are pretty safe, and figure out ways to extract them into faster pipelines.

Slide 72

First one: if we’re doing flag-driven development right, many of the changes being made look like this. Just changing one config setting, turning things on or off or doing rampups. These are safe, or at least quickly reversible. We can take these and make a faster deploy lane for them that runs in parallel and skips the tests.
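One way to route those pushes, sketched here with invented paths and helper names: look at the diff being deployed, and if it only touches flag files, send it down the config lane that skips the test suite.

```python
import subprocess

# Hypothetical convention: feature flags live under config/, everything else is app code.
CONFIG_PREFIXES = ("config/",)

def changed_files(old_sha, new_sha):
    out = subprocess.check_output(
        ["git", "diff", "--name-only", f"{old_sha}..{new_sha}"], text=True)
    return [line for line in out.splitlines() if line]

def is_config_only(old_sha, new_sha):
    """True when every changed file is a flag/config file, so the push can take
    the parallel fast lane instead of the full test-running deploy."""
    files = changed_files(old_sha, new_sha)
    return bool(files) and all(f.startswith(CONFIG_PREFIXES) for f in files)
```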

Slide 73

Digging deeper, a lot of what you push if you are branching in code is code that isn’t executed in production. Conceptually, or even literally, you’re pushing code like this. These are really safe deploys, because the code is just dead weight that isn’t executed at all.

Slide 74

We can just invent a convention around that, where engineers can mark individual commits as being this pattern. What we did was just preface them with the word DARK in all caps.
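A sketch of how tooling might pick up that convention (again, invented helpers rather than our actual deploy code): scan the commits in the range being pushed and split out the ones whose messages start with DARK.

```python
import subprocess

def commit_subjects(old_sha, new_sha):
    """List (sha, subject) pairs for every commit in the range being pushed."""
    out = subprocess.check_output(
        ["git", "log", "--format=%H %s", f"{old_sha}..{new_sha}"], text=True)
    return [line.split(" ", 1) for line in out.splitlines() if " " in line]

def split_dark_commits(old_sha, new_sha):
    """Separate DARK-prefixed commits from the ones that change live behavior."""
    dark, live = [], []
    for sha, subject in commit_subjects(old_sha, new_sha):
        (dark if subject.startswith("DARK") else live).append((sha, subject))
    return dark, live
```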

Slide 75

So in cases like this where you’d have four people trying to make changes,

Slide 76

Maybe a few of them are DARK changes so you don’t need four deploys to ship them. The DARK deploys can just get shipped by the operator paying attention to his or her live changes.

Slide 77

Those things help. But eventually those workarounds hit the same limits. So to keep sustaining your deployment velocity, you need to get folks cooperating.

Slide 78

Ultimately you need to figure out how to do this safely. One deploy, with two people making live changes at the same time. This is the sort of thing that will often go fine with no coordination. But when something does break, these people need to be in communication.

Slide 79

So we need a system for keeping those folks in touch. And that presents a real challenge.

Slide 80

This is an example of a place where we did write a decent amount of code to solve a problem. We wrote a chatbot to help us coordinate. Someone was nice enough to port it to hubot. You can find it on npm now.

Slide 81

We conceptually have a bunch of people at any given time that want to get some code out. As we noted before, they all gather in the push channel.

Slide 82

We decided to divide these up into train cars of an arbitrarily chosen size. I think the size changed over time but for the purposes of demonstration let’s say we pick three people at a time to deploy together.

Slide 83

The train cars move through the queue one at a time. So the car at the head is deploying, the rest are waiting.

Slide 84

New people can hop into the channel and join at the back of the train. Or add new cars to it if the last one’s full.

Slide 85

Within the first train car, the person at the head of the queue is designated the leader. By convention. This is the person that will be in charge of the deploy and will push all the buttons.
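
Here’s a toy model of that queue, just to make the mechanics concrete. It isn’t the actual Hubot plugin and the names are invented: people join at the back, cars of three move through one at a time, and the first person in the front car is the leader.

```python
from collections import deque

CAR_SIZE = 3  # arbitrary; whatever keeps one car's worth of changes easy to reason about

class PushTrain:
    """People join at the back; the front car deploys; its first member leads."""

    def __init__(self):
        self.queue = deque()  # everyone waiting to push, in join order

    def join(self, person):
        self.queue.append(person)

    def cars(self):
        people = list(self.queue)
        return [people[i:i + CAR_SIZE] for i in range(0, len(people), CAR_SIZE)]

    def deploying_car(self):
        cars = self.cars()
        return cars[0] if cars else []

    def leader(self):
        car = self.deploying_car()
        return car[0] if car else None  # by convention, this person pushes the buttons

    def finish_deploy(self):
        # The front car leaves the train once its deploy is out.
        for _ in self.deploying_car():
            self.queue.popleft()
```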

Slide 86

The first thing the people do when the deploy starts is push their code.

Slide 87

Then they all tell the bot that their code is in and they’re ready to go. Like this.

Slide 88

Then the deploy leader pushes the button to deploy to staging. Or QA, if you’ve got QA, which we didn’t. But never mind.

Slide 89

Then the next thing the deploy team does is manually verify their changes. That might mean different things for each of them depending on what their changes were.

Slide 90

When they’re done doing that, they signal to the bot that they’re done. When everyone is done the bot pings the deploy leader and tells them that the deploy is ready to move to the next step.

Slide 91

Which is to deploy to prod.

Slide 92

That’s how it goes in the best case. But things can go wrong. Someone can also tell the bot that something’s wrong, and the bot will stop the train.

Slide 93

If that happens, since we coordinate in one channel, a lot of folks are in some state of paying attention and are available to help.

Slide 94

Ok, so these are some of the main social hacks that got us somewhere north of 100 engineers on a single repo.

Slide 95

I wanted to give a sense for what it feels like to evolve alongside a continuous delivery effort. I don’t think you should take this as a set of instructions per se. It’s a toolbox you can use, but the details of your situation will differ.

Slide 96

If you’re building a Mars rover, for example, the failure tolerances you have and the tradeoffs you will want to make will be different than ours were.

Slide 97

It’s notable that almost all of the hard things we dealt with were social problems. Some of these solutions involved writing code, but the hard part was the human organization: maintaining a sense of community ownership over the state of the whole system.

Slide 98

Since in order to deploy software we have to write it first, we tend to start with things that make software easiest to write. Or the specific load of development baggage that we’re most comfortable with. Then we hope that we can bang on some tooling and get working software in the end. There’s no reason to expect that this approach should work, or result in anything good.

Slide 99

This is the approach that does work. Consider what your goals are, and what operates in production. Then work backwards from those things to the methods that you use.

Slide 100

If you’re trying to increase developer velocity and it looks like the problem is interesting technically, you might want to stop and ask yourself if you’re in the weeds. The tendency once you’ve programmed yourself into a serious hole is to keep programming. Maybe you should stop trying to program your way out of difficult situations.

Slide 101

fin