
The Push Train


A talk about the human side of continuous delivery.

Dan McKinley

May 07, 2017

Transcript


  2. So hey, I’m Dan McKinley, I’m visiting you from Los Angeles


  3. I was an early employee at Etsy and worked there for seven years. I’ve worked for Stripe since, and I cofounded a
    continuous delivery platform company.


  4. Along the way I’ve talked to a lot of companies that have been interested in continuous delivery.


  5. After I left Etsy, I had this notion that you could take the tools that we used there and just drop them into another company
    to get it onto the golden path.


  6. But my experience has been that that’s not the case at all.
    I think one of our mistakes when talking about continuous delivery, at Etsy anyway, was to talk with an emphasis on the
    tools.


  7. Continuous delivery is not a set of tools you can buy or fork, or at least it’s not just that. I think all of this work was really
    important to give the movement credibility.
    But I think we failed to communicate clearly what it was like to live within the system, and what it was like to keep it
    working. I wanted to address that with this talk.


  8. One of the key things that made the system work was that we really were trying to get the most out of every engineer we
    had.


  9. And we were working backwards from the goals of changing a lot of stuff, and doing it safely.


  10. And as we scaled up, we tried really hard not to give up on the idea that everyone had a stake in fixing production when it
    was broken.


  11. I haven’t seen a lot of material about how you actually go about organizing high numbers of safe deploys every day.


  12. It turns out that this is really more of a human orchestration problem than it is a technical problem. So it may be less
    straightforward than a rant about rsync, but I’m going to try to do it.


  13. How do you even get to the point of deploying 40 or 60 times a day?


  14. We got there on purpose, at the same time we were growing as an organization. We believed that deploying as often as
    possible would lead to safety and development velocity, and I think we were vindicated in that.
    But ramping that up wasn’t immediate. It went like this.


  15. Our company wasn’t conceived as a continuous delivery engineering organization. Pretty much the opposite actually. So
    when we decided we were going to aim for deploying many times a day, there was a lot of pre-existing process that had to
    be destroyed. I’ll talk about some of that.


  16. After that there was a middle epoch where we improved our process rapidly, and got a lot more deploys per day very quickly.


  17. But as we were doing this our engineering team grew from a few dozen folks up to well over a hundred. And it’s nontrivial
    to go from a handful of people deploying 40 times a day up to a hundred people deploying 40 times per day.


  18. A great deal of effort went into sustaining that velocity once we had it.


  19. We had a lot of places along the way where there were challenges or the process broke down entirely. With this talk I
    wanted to put together a narrative of how we fought through some of the key problems we had.


  20. I was with the company up until we had about 150 people deploying regularly.


  21. Beyond that I imagine you may need different things.


  22. Ok so like I said, we had to destroy some process to get started.


  23. Namely, we had a lot of process that was prophylactic. It was built with the intent of finding production problems before
    production.


  24. Let me tell you a quick story. In 2008, Etsy’s engineering founders parted ways with the company. And those folks meant
    well, but they had been gatekeeping production pretty hard. We had not really been able to touch the site, and now we
    were expected to.
    There was also no production monitoring to speak of. We suddenly found ourselves out to sea.


  25. My reaction to this was to start writing some integration tests. Using Selenium, that web driver toolkit Zenefits used to
    break the law. That seemed sensible as a way to have at least a modicum of safety as we started changing things.
    It was also nice that this was a proactive thing I could do with literally nobody helping me.


  26. The thing you should know about writing web-driving integration tests is that doing this well is at least as hard as writing a
    multithreaded program. And that’s a domain where human competence isn’t estimated to be high. The other problem with
    testing across process boundaries is that failure is a thing that can’t be entirely avoided.
    So what you tend to wind up with there is a lot of tests that work most of the time. Even if you’re really good and prevent
    most race conditions, you’re still going to have some defects. Given enough tests, that means that one is always failing.


  27. We solved that by deleting all 1500 tests, representing several human years of effort.


  28. We went back to the drawing board here. We realized that the whole point of tests is to gain confidence before changing
    things. And to the extent that there are false negatives in the tests, or the tests gum up the works, they’re doing the
    opposite of that.
    Tests are one way to gain some confidence that a change is safe. But that’s all they are. Just one way.


  29. Ramping up code very gradually in production is another way
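    To make the ramp-up idea concrete, here’s an illustrative sketch (not Etsy’s actual implementation) of the usual approach: bucket users deterministically, so raising the ramp percentage only ever adds users and a given user’s experience stays stable.

    ```python
    import zlib

    def is_enabled(feature, user_id, ramp_percent):
        # Hash the feature and user together into one of 100 buckets.
        # The same user always lands in the same bucket, so ramping
        # 1% -> 10% -> 100% only ever adds users, never flip-flops them.
        bucket = zlib.crc32(f"{feature}:{user_id}".encode()) % 100
        return bucket < ramp_percent
    ```

    At 0% nobody sees the feature, at 100% everybody does, and in between you watch your graphs.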


  30. Deploying code in smaller and smaller pieces is another way. In the abstract, every single line of code you deploy has some
    probability of breaking the site. So if you deploy a lot of lines of code at once, you’re just going to break the site.
    And you stand a better chance of inspecting code for correctness the less of it there is.
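    That reasoning can be made concrete. If each line independently carries some small probability p of breaking the site, a deploy of n lines breaks it with probability 1 − (1 − p)^n, which climbs toward certainty as n grows. A toy calculation (the per-line probability here is made up for illustration):

    ```python
    def p_break(n_lines, p_per_line=1e-4):
        # Probability that at least one of n independently-risky lines
        # breaks the site: the complement of "every line is fine".
        return 1 - (1 - p_per_line) ** n_lines
    ```

    At p = 0.0001, a 100-line deploy carries about a 1% chance of breakage, while a 10,000-line deploy carries about 63%. Small deploys are how you keep that number down.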


  31. Deploying code so that users can’t see it and then using feature flags to test it in production is another good way to gain
    confidence.


  32. Over time we did write a lot more tests. Unit tests that finish quickly and don’t have false negatives are a really great
    confidence-builder.


  33. You may have tests, but you must have monitoring. You can write an infinite number of tests and only asymptotically
    approach zero problems in production.
    So at least when building a website, your priority should be knowing things are broken and having the capability to fix
    them as quickly as possible. Preventing problems is a distant second.


  34. Baroque source control rituals were another pre-existing thing we had that we had to give up on in order to ship code
    frequently. I’ll just summarize this by saying that git is bad.


  35. Git of course is not really bad. But it was created to support software that’s nothing at all like web software. The Linux
    kernel has many concurrent supported versions.
    A website, on the other hand, doesn’t really have a version at all. It has a current state.


  36. GitHub is likewise oriented around open source libraries, and it may be great for that, but there’s no reason to suspect that
    things you do there should work as well for building an online application.
    But the tendency among engineers is to start with GitHub, and all of the workflow and cultural baggage that surrounds it.
    Then they bang on stuff from there until something’s in production.


  37. We eventually came to a realization that there was no point to the rituals we were performing with revision control. It’s
    better to conceive of a website as a living organism where all of the versions of it are all jumbled together.


  38. The production codebase is your development branch. You write your development code in the production codebase,
    inside of an if block that turns it off. And you ship that development code to production as a matter of routine.
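    In code, that pattern is just this — a minimal sketch, where the checkout functions are hypothetical stand-ins for real application code:

    ```python
    # Off in production; flipped on later by a config change, not a code deploy.
    FEATURE_FLAGS = {"new_checkout": False}

    def old_checkout(cart):
        return f"old checkout for {len(cart)} items"

    def new_checkout(cart):
        return f"new checkout for {len(cart)} items"

    def checkout(cart):
        # Development code ships to production inside an if block
        # that keeps it turned off until the flag is ramped up.
        if FEATURE_FLAGS["new_checkout"]:
            return new_checkout(cart)
        return old_checkout(cart)
    ```

    The new code is deployed, reviewed, and merged continuously, but it’s dormant until the flag flips.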

    View Slide

  39. So after making those leaps we were able to deploy more often. But new problems arose as we tried to dramatically
    increase speed.


  40. The story so far has been about actively destroying process and ceremony around deploys. But those were both cases of
    deciding to destroy process as a team.
    It’s another thing entirely for individuals to yolo their own destruction of process.
    One major reason that happens is because the deploy process can be too slow.


  41. If the deploy tooling isn’t made fast, there’s probably a faster and more dangerous way to do things and people will do
    that. They’ll replace running docker containers by hand. They’ll hand-edit files on the hosts.
    I want to stipulate that this doesn’t happen because people are evil, it happens because they’re people and they follow the
    path of least resistance.


  42. You also will occasionally need to deploy in a big hurry. Maybe you’re experiencing a SQL injection attack, or what have
    you.
    You don’t want to be trying to use a different set of fast deployment methods in a crisis.


  43. You want to be exercising the fast methods all of the time. That’s how you can know that it works.


  44. Do you really need to run that whitespace linter in the critical path? Maybe you don’t. The compelling way to deploy
    should be the right way to deploy.


  45. Eventually we got fast deploys working reliably enough that people started to treat them as routine. But that also
    introduces some new problems.


  46. If we ship code very rarely, shipping code is a notable event. People will pay a lot of attention to it.


  47. On the other hand, if we ship code a bunch of times every single weekday, then shipping code is no longer special.
    This all sounds very tautological. But it has implications.


  48. People will stand up from their desks and wander off while they’re deploying. They’ll go get coffee. They’ll go on a walk.
    The mountains will call them.


  49. It’s entirely possible to automate the whole deployment pipeline. In fact a lot of people think that that is what continuous
    deployment and/or delivery is.
    We didn’t do this, and in general I think it’s a bad idea.


  50. Let me give you the following analogy. A while ago Uber yolo’d a self-driving car trial in downtown San Francisco. It ended
    abruptly right after a video surfaced of one rolling right through a pedestrian intersection during a red light.


  51. It’s important to note that there was a human sitting in the driver’s seat, but that person didn’t intervene.
    Uber blamed that person for the incident. But that’s the wrong way to look at it. The automation was capable enough that
    the human’s attention very understandably lapsed, but not capable enough to replace the human.
    The human and the car are, together, the system. Things you do to automate the car affect the human.


  52. Automating deploys is often just like this. You can deploy the code safely, but you can’t accurately predict what it’ll do to
    the database. And so on.
    Use taste when it comes to automation.


  53. Make people actively press a button to deploy, even if 99% of your deploys are routine. Because in the rare case that
    something goes wrong, the human part of the system will be ready to react.
    Automation that only mostly works is often worse than no automation at all.


  54. Another problem we ran into as we ramped up to lots of deploys was that people skipped steps while deploying.


  55. If you deploy from a command line tool, this tends to be all of the context you have while you’re doing it. This is a popular
    choice these days, but I think this is a bad idea.


  56. That’s because you need to know what’s going on at the moment you’re deploying.
    You need to know what you’re pushing, if anyone else is trying to push at the same time, whether or not the site is
    currently on fire, and so on.


  57. I get that we’re all nerds here and we love command line tools, but my unpopular opinion here is that you want a web
    cockpit.


  58. Web cockpits can surface all of that status to you in a digestible way, and they can lay out a workflow that comprises
    multiple steps.
    They’re also more accessible to people outside of engineering that might also want to deploy things. Which is great,
    because then you can build a culture where support folk can ship knowledge base changes, or designers can ship CSS
    changes.


  59. Another thing to keep in mind is that when we live in the world where shipping code isn’t special, that’s not just a problem
    for the people doing the deployments. It’s a problem for anyone that has to react to deploys.


  60. Releasing code at my first job was a nightmare in every respect. You’d show up on Saturday morning, and spend several
    days on it. It sucked.
    But one criticism you couldn’t make of this is that nobody knew it was happening. Brutality isn’t a great ethos, but it is at
    least an ethos.


  61. Deploying code once a year is one way to achieve awareness. But if we want to deploy every day, we have to get
    awareness on purpose. We have to intentionally build it.


  62. One thing you need is an easily accessible display of the deployment history, and the current status. You need to know
    what’s happening now, and what was deployed at any given time.


  63. Automatically drawing deploy lines on all your graphs is also handy, so that you can correlate things happening in
    production to deploy times.


  64. Another thing that goes wrong here, which I’ve seen many other teams doing, is prematurely siloing all of their slack
    communications about pushes.


  65. What you get when you do this are incidents where a deploy by one team triggers alerts in the channel of some other
    team, who isn’t aware that a deploy just happened. You lose the obvious connection between the deploy notification and
    the alert if you do this.


  66. I think you should coordinate in one push channel, and push that as far as you possibly can. That gives you a shared
    sense of status.


  67. So those were some of the things that got us deploying safely and often.
    But eventually, overall developer velocity hits an upper bound. You can’t deploy faster, but your organization continues to
    grow.


  68. The returns on making deploys faster eventually diminish, and you can only fit so many end-to-end deploys in a day.
    Something has to give.


  69. One approach, which I definitely pursued for a while, is to make the day longer. Tech employees rise late in my
    experience, so you can get up at the crack of dawn and ship a bunch of stuff. That’s a bad solution though.


  70. We could split up deployables, but that’s taking on a great deal of operational debt. And comes with a lot of other
    baggage, which I’ll avoid ranting about.
    Suffice it to say I think we should try to exhaust other options before we do that.


  71. What we can do is try to find some common deployment patterns that are pretty safe, and figure out ways to extract them
    into faster pipelines.


  72. First one: if we’re doing flag-driven development right, many of the changes being made look like this. Just changing one
    config setting, turning things on or off or doing rampups. These are safe, or at least quickly reversible.
    We can take these and make a faster deploy lane for them that runs in parallel and skips the tests.
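    One way to route those changes into the fast lane, sketched with hypothetical file paths: if a push touches nothing but config, it can skip the full test suite.

    ```python
    # Hypothetical repo layout: flag flips and rampups live under these paths.
    SAFE_PREFIXES = ("config/", "flags/")

    def fast_lane_eligible(changed_files):
        # Config-only pushes are safe (or at least quickly reversible),
        # so they can take the parallel fast lane and skip the tests.
        return bool(changed_files) and all(
            f.startswith(SAFE_PREFIXES) for f in changed_files
        )
    ```

    Anything touching real code falls back to the normal pipeline.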


  73. Digging deeper, if you are branching in code, a lot of what you push is code that isn’t executed in production.
    Conceptually, or even literally, you’re pushing code like this.
    These are really safe deploys, because the code is just dead weight that isn’t executed at all.


  74. We can just invent a convention around that, where engineers can mark individual commits as being this pattern. What we
    did was just preface them with the word DARK in all caps.
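    Tooling can then pick the DARK commits out of the push queue mechanically. A minimal sketch of that convention:

    ```python
    def is_dark(message):
        # Convention from the talk: commits whose message starts with
        # "DARK" touch only unexecuted code and are safe to piggyback
        # on someone else's deploy.
        return message.lstrip().startswith("DARK")

    def split_commits(messages):
        dark = [m for m in messages if is_dark(m)]
        live = [m for m in messages if not is_dark(m)]
        return dark, live
    ```

    The live changes get a watchful operator; the dark ones just ride along.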


  75. So in cases like this where you’d have four people trying to make changes,


  76. Maybe a few of them are DARK changes so you don’t need four deploys to ship them. The DARK deploys can just get
    shipped by the operator paying attention to his or her live changes.


  77. Those things help. But eventually those workarounds hit the same limits. So to keep sustaining your deployment velocity,
    you need to get folks cooperating.


  78. Ultimately you need to figure out how to do this safely. One deploy, with two people making live changes at the same
    time.
    This is the sort of thing that will often go fine with no coordination. But when something does break, these people need to
    be in communication.


  79. So we need a system for keeping those folks in touch. And that presents a real challenge.


  80. This is an example of a place where we did write a decent amount of code to solve a problem. We wrote a chatbot to help
    us coordinate.
    Someone was nice enough to port it to hubot. You can find it on npm now.


  81. We conceptually have a bunch of people at any given time that want to get some code out. As we noted before, they all
    gather in the push channel.


  82. We decided to divide these up into train cars of an arbitrarily chosen size. I think the size changed over time but for the
    purposes of demonstration let’s say we pick three people at a time to deploy together.


  83. The train cars move through the queue one at a time. So the car at the head is deploying, the rest are waiting.


  84. New people can hop into the channel and join at the back of the train. Or add new cars to it if the last one’s full.


  85. Within the first train car, the person at the head of the queue is designated the leader. By convention. This is the person
    that will be in charge of the deploy and will push all the buttons.
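    The queue mechanics the bot managed can be modeled like this — an illustrative sketch, not the actual bot:

    ```python
    from collections import deque

    CAR_SIZE = 3  # arbitrary; the real size changed over time

    class PushTrain:
        def __init__(self):
            self.cars = deque()  # each car is a list of engineers

        def join(self, engineer):
            # Hop into the last car, or add a new car if it's full.
            if not self.cars or len(self.cars[-1]) == CAR_SIZE:
                self.cars.append([])
            self.cars[-1].append(engineer)

        def leader(self):
            # By convention, the head of the first car runs the deploy.
            return self.cars[0][0] if self.cars else None

        def depart(self):
            # The head car finishes its deploy and leaves the queue.
            return self.cars.popleft() if self.cars else None
    ```

    The head car deploys, everyone behind it waits, and new arrivals tack onto the back.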


  86. The first thing the people do when the deploy starts is push their code.


  87. Then they all tell the bot that their code is in and they’re ready to go. Like this.


  88. Then the deploy leader pushes the button to deploy to staging. Or QA, if you’ve got QA, which we didn’t. But never mind.


  89. Then the next thing the deploy team does is manually verify their changes. That might mean different things for each of
    them depending on what their changes were.


  90. When they’re done doing that, they signal to the bot that they’re done. When everyone is done the bot pings the deploy
    leader and tells them that the deploy is ready to move to the next step.
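    The bot’s bookkeeping for that step is simple set arithmetic. A sketch of what it might track:

    ```python
    class DeployStep:
        def __init__(self, members):
            # Everyone in the car who still needs to verify their changes.
            self.pending = set(members)

        def mark_done(self, member):
            # Each person signals when their manual verification is done;
            # once the set is empty, the bot pings the leader to proceed.
            self.pending.discard(member)
            return not self.pending  # True -> ready for the next step
    ```

    The same structure works for the earlier “code is in” step as for verification.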


  91. Which is to deploy to prod.


  92. That’s how it goes in the best case. But things can go wrong. Someone can also tell the bot that something’s wrong, and
    the bot will stop the train.


  93. If that happens, since we coordinate in one channel, a lot of folks are in some state of paying attention and are available to
    help.


  94. Ok, so these are some of the main social hacks that got us somewhere north of 100 engineers on a single repo.


  95. I wanted to give a sense for what it feels like to evolve alongside a continuous delivery effort.
    I don’t think you should take this as a set of instructions per se. It’s a toolbox you can use, but the details of your situation
    will differ.


  96. If you’re building a Mars rover, for example, the failure tolerances you have and the tradeoffs you will want to make will be
    different than ours were.


  97. It’s notable that almost all of the hard things we dealt with were social problems. Some of these solutions involved writing
    code, but the hard part was the human organization.
    The hard parts were in maintaining a sense of community ownership over the state of the whole system.


  98. Since in order to deploy software we have to write it first, we tend to start with things that make software easiest to write.
    Or the specific load of development baggage that we’re most comfortable with.
    Then we hope that we can bang on some tooling and get working software in the end. There’s no reason to expect that
    this approach should work, or result in anything good.


  99. This is the approach that does work. Consider what your goals are, and what operates in production. Then work
    backwards from those things to the methods that you use.


  100. If you’re trying to increase developer velocity and it looks like the problem is interesting technically, you might want to stop
    and ask yourself if you’re in the weeds.
    The tendency once you’ve programmed yourself into a serious hole is to keep programming. Maybe you should stop
    trying to program your way out of difficult situations.


  101. fin
