Slide 1

Slide 1 text

Migrating a monolith to Kubernetes
DevOps Enterprise Summit 2017
Jesse Newland

Slide 2

Slide 2 text

Hi!

Slide 3

Slide 3 text

I’m Jesse Newland

Slide 4

Slide 4 text

@jnewland

Slide 5

Slide 5 text

16 years in web operations

Slide 6

Slide 6 text

6 years at GitHub

Slide 7

Slide 7 text

Engineering / Management

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

Technical leadership from Austin, TX

Slide 11

Slide 11 text

Why am I here? Kubernetes? Monoliths? DevOps? ENTERPRISE?

Slide 12

Slide 12 text

My job is to effect change in a technical organization

Slide 13

Slide 13 text

GitHub is growing, maturing, & evolving

Slide 14

Slide 14 text

Our solutions often don’t scale to fit the needs of our growing organization

Slide 15

Slide 15 text

On a journey of continuous improvement

Slide 16

Slide 16 text

We are more alike, my friends, than we are unalike. Maya Angelou

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

https://githubengineering.com/kubernetes-at-github/

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications
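
For readers new to Kubernetes, here is a minimal sketch of what that automation looks like, using the official Python client (the kubernetes package); every name and image below is a hypothetical placeholder, not anything from GitHub's setup. You declare a Deployment, and the cluster works to keep the declared number of replicas running:

    # Hypothetical example: declare a Deployment; Kubernetes then
    # starts 3 pods and replaces any that die.
    from kubernetes import client, config

    config.load_kube_config()  # authenticate with the local kubeconfig

    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="hello-web"),
        spec=client.V1DeploymentSpec(
            replicas=3,
            selector=client.V1LabelSelector(match_labels={"app": "hello-web"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "hello-web"}),
                spec=client.V1PodSpec(
                    containers=[
                        client.V1Container(
                            name="web",
                            image="example.com/hello-web:latest",
                            ports=[client.V1ContainerPort(container_port=8080)],
                        )
                    ]
                ),
            ),
        ),
    )

    client.AppsV1Api().create_namespaced_deployment(
        namespace="default", body=deployment
    )

The same declaration is more often written as YAML and applied with kubectl; the point is that desired state is data, and the platform converges on it.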

Slide 21

Slide 21 text

Kubernetes builds upon 15 years of experience of running production workloads at Google, combined with best-of-breed ideas and practices from the community

Slide 22

Slide 22 text

I’m not here to tell you that you should adopt Kubernetes

Slide 23

Slide 23 text

Or even to go too deep into the technical details of our migration

Slide 24

Slide 24 text

https://githubengineering.com/kubernetes-at-github/ @jnewland

Slide 25

Slide 25 text

Kubernetes is a technology

Slide 26

Slide 26 text

Kubernetes is a super dope technology

Slide 27

Slide 27 text

Not a panacea

Slide 28

Slide 28 text

Use what’s right for you

Slide 29

Slide 29 text

I’d like to share an anecdote from our ongoing journey

Slide 30

Slide 30 text

The only slide with bullets, I promise!
• Why we migrated our monolith to Kubernetes
• How we approached a large cross-team project
• Where we are today
• What we learned in the process
• Where we’re headed

Slide 31

Slide 31 text

Why?

Slide 32

Slide 32 text

Context

Slide 33

Slide 33 text

The monolith

Slide 34

Slide 34 text

Ruby on Rails

Slide 35

Slide 35 text

github.com/github/github

Slide 36

Slide 36 text

GitHub dot com the website

Slide 37

Slide 37 text

10 years old

Slide 38

Slide 38 text

Extremely important to early velocity

Slide 39

Slide 39 text

Increasing complexity

Slide 40

Slide 40 text

Diffusion of responsibility

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

Incredibly high-performance hardware

Slide 43

Slide 43 text

Incredibly reliable hardware

Slide 44

Slide 44 text

Incredibly low-latency networking

Slide 45

Slide 45 text

Incredibly high-throughput networking

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

Unit of compute == instance

Slide 48

Slide 48 text

Instance setup tightly coupled with configuration management

Slide 49

Slide 49 text

API-driven, testable, but brutal feedback loop

Slide 50

Slide 50 text

Human-managed provisioning and load balancing config

Slide 51

Slide 51 text

High level of effort required to get a service into production

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

Our customer base is growing

Slide 54

Slide 54 text

Our customers are growing

Slide 55

Slide 55 text

Our ecosystem is growing

Slide 56

Slide 56 text

Our organization is growing

Slide 57

Slide 57 text

We’re shipping new products

Slide 58

Slide 58 text

We’re improving existing products

Slide 59

Slide 59 text

Our customers expect increasing speed and reliability

Slide 60

Slide 60 text

We saw indications that our approach was struggling to deal with these forces

Slide 61

Slide 61 text

The engineering culture at GitHub was attempting to evolve to encourage individual teams to act as maintainers of their own services

Slide 62

Slide 62 text

SRE's tools and practices for running services had not yet evolved to match

Slide 63

Slide 63 text

Easier to add functionality to an existing service

Slide 64

Slide 64 text

Unsurprisingly, the monolith kept growing

Slide 65

Slide 65 text

Increasing CI duration

Slide 66

Slide 66 text

Increasing deploy duration

Slide 67

Slide 67 text

Inflexible infrastructure

Slide 68

Slide 68 text

Inefficient infrastructure

Slide 69

Slide 69 text

Private cloud lock-in

Slide 70

Slide 70 text

Developer and user experience trending downward

Slide 71

Slide 71 text

The planets aligned in a way that made all of these problems visible all at once

Slide 72

Slide 72 text

Hack week

Slide 73

Slide 73 text

Given a week to ship something new and innovative, what might we expect engineers to do?

Slide 74

Slide 74 text

1) spend ~1 day on Puppet, provisioning, and load balancing config

Slide 75

Slide 75 text

2) reach out to SRE on Thursday and ask for our help?

Slide 76

Slide 76 text

3) build hack week features as a PR against the monolith

Slide 77

Slide 77 text

Microcosm of the larger problems with our approach

Slide 78

Slide 78 text

Incentives not aligned with the outcomes we desired

Slide 79

Slide 79 text

No content

Slide 80

Slide 80 text

Our on-ramp went in the wrong direction

Slide 81

Slide 81 text

High effort required

Slide 82

Slide 82 text

No content

Slide 83

Slide 83 text

We decided to make an investment in our tools

Slide 84

Slide 84 text

We decided to make an investment in our processes

Slide 85

Slide 85 text

We decided to make an investment in our technology

Slide 86

Slide 86 text

To support the other ongoing changes in our organization, we decided that we would work to level the playing field

Slide 87

Slide 87 text

To support the decomposition of the monolith, we decided that we would work to provide a better experience for new services

Slide 88

Slide 88 text

To enable SRE to spend more time on interesting services, we decided to work to reduce the amount of time we needed to spend on boring services

Slide 89

Slide 89 text

To reduce the time we spent on boring services, we decided to work to make the service provisioning process entirely self-service

Slide 90

Slide 90 text

To tighten the infrastructure-building feedback loop, we decided to base this new future on a container orchestration platform

Slide 91

Slide 91 text

To leverage the experience of Google and the strength of the community, we decided to build this new approach with Kubernetes

Slide 92

Slide 92 text

How?

Slide 93

Slide 93 text

okay sorry, a few more bullets
• Passion team
• Prototype
• Pick an impactful and visible target
• Product vision and project plan
• Pwork
• Pause and regroup

Slide 94

Slide 94 text

Passion team

Slide 95

Slide 95 text

https://github.com/blog/2316-organize-your-experts-with-ad-hoc-teams

Slide 96

Slide 96 text

Intentionally curate a diverse set of skills

Slide 97

Slide 97 text

Intentionally curate a diverse set of experience

Slide 98

Slide 98 text

Intentionally curate a diverse set of knowledge

Slide 99

Slide 99 text

Intentionally curate a diverse set of perspectives

Slide 100

Slide 100 text

SRE + Developer Experience + Platform Engineering

Slide 101

Slide 101 text

Project scoped team

Slide 102

Slide 102 text

@github/kubernetes github/kube #kube

Slide 103

Slide 103 text

Prototype

Slide 104

Slide 104 text

A strategy for not crying under the bed during hack week

Slide 105

Slide 105 text

Prototype Goals

Slide 106

Slide 106 text

Kubernetes cluster, load balancing, deployment strategy, docs
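
For the load balancing piece, the usual Kubernetes primitive is a Service: a stable virtual IP that spreads traffic across whatever pods match a label selector. A hedged sketch with the official Python client; the names are hypothetical placeholders, not GitHub's actual prototype:

    # Hypothetical example: a Service balancing port 80 traffic
    # across all pods labeled app=hello-web, on their port 8080.
    from kubernetes import client, config

    config.load_kube_config()

    service = client.V1Service(
        metadata=client.V1ObjectMeta(name="hello-web"),
        spec=client.V1ServiceSpec(
            selector={"app": "hello-web"},
            ports=[client.V1ServicePort(port=80, target_port=8080)],
        ),
    )

    client.CoreV1Api().create_namespaced_service(namespace="default", body=service)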

Slide 107

Slide 107 text

Leverage the hack week standard of quality

Slide 108

Slide 108 text

Validate our hypothesis that we could provide a new and better experience with minimal effort

Slide 109

Slide 109 text

Validate our hypothesis that if provided with another option, engineers would flock to it

Slide 110

Slide 110 text

Learn more about Kubernetes

Slide 111

Slide 111 text

Seek feedback from engineers that used the new approach

Slide 112

Slide 112 text

Internal marketing

Slide 113

Slide 113 text

Wild success

Slide 114

Slide 114 text

No content

Slide 115

Slide 115 text

Handful of projects launched with very little SRE involvement

Slide 116

Slide 116 text

Positive feedback

Slide 117

Slide 117 text

Learned a ton about an engineer’s perspective

Slide 118

Slide 118 text

Several of these projects still exist, and are maintained by their creating teams

Slide 119

Slide 119 text

Pick a big target

Slide 120

Slide 120 text

We decided to migrate the monolith

Slide 121

Slide 121 text

Why?

Slide 122

Slide 122 text

Pros

Slide 123

Slide 123 text

We wanted to validate something larger following our positive experience with smaller-scale apps during hack week

Slide 124

Slide 124 text

A well-worn path

Slide 125

Slide 125 text

We were confident in the testing strategies available to us

Slide 126

Slide 126 text

We had an overlapping need for dynamic lab environments

Slide 127

Slide 127 text

And an overlapping need for more flexibility to handle peaks and valleys of demand

Slide 128

Slide 128 text

Cons

Slide 129

Slide 129 text

It might not work

Slide 130

Slide 130 text

We might make things worse

Slide 131

Slide 131 text

We decided to put together a project plan and see if it felt viable

Slide 132

Slide 132 text

Vision and plan

Slide 133

Slide 133 text

Tons of high-impact, visible work ahead

Slide 134

Slide 134 text

Communication was crucial

Slide 135

Slide 135 text

Key elements of communicating change at GitHub

Slide 136

Slide 136 text

Know your goal

Slide 137

Slide 137 text

…and lead with it

Slide 138

Slide 138 text

Don’t mince words

Slide 139

Slide 139 text

Write conversationally

Slide 140

Slide 140 text

Include the alternatives you’ve considered

Slide 141

Slide 141 text

Doing nothing is always an alternative

Slide 142

Slide 142 text

Consider the production impact

Slide 143

Slide 143 text

Give it a URL

Slide 144

Slide 144 text

Pull request

Slide 145

Slide 145 text

Repeat the message using different mediums

Slide 146

Slide 146 text

Communication had the desired impact

Slide 147

Slide 147 text

Executive support

Slide 148

Slide 148 text

Additional engineering resources

Slide 149

Slide 149 text

Project management resources

Slide 150

Slide 150 text

Now all we had to do was not be wrong

Slide 151

Slide 151 text

How’d it go?

Slide 152

Slide 152 text

No content

Slide 153

Slide 153 text

No content

Slide 154

Slide 154 text

One big container

Slide 155

Slide 155 text

1.1 GB image

Slide 156

Slide 156 text

100-second image build

Slide 157

Slide 157 text

it's fine

Slide 158

Slide 158 text

Review lab

Slide 159

Slide 159 text

50 times per day

Slide 160

Slide 160 text

Staff opt-in

Slide 161

Slide 161 text

Controlled experiments
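
A hedged sketch of what staff opt-in plus controlled experiments can look like in code (illustrative only, not GitHub's actual router): staff always take the new path, and a deterministic hash ramps a configurable percentage of everyone else onto it, so each user stays in the same cohort across requests:

    # Hypothetical rollout gate; user_id, is_staff, and percent are
    # placeholders for whatever your request context provides.
    import hashlib

    def use_kubernetes_backend(user_id: str, is_staff: bool, percent: int) -> bool:
        """Should this request hit the Kubernetes-backed pool?"""
        if is_staff:
            return True  # staff opt-in: employees see the new path first
        # Deterministic bucketing keeps each user in one cohort, so
        # error rates and latency compare apples to apples.
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return bucket < percent

    # Ramp percent from 0 toward 100 as confidence grows.
    print(use_kubernetes_backend("user-42", is_staff=False, percent=10))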

Slide 162

Slide 162 text

No content

Slide 163

Slide 163 text

No content

Slide 164

Slide 164 text

~100% of github.com web requests served by application processes running on Kubernetes

Slide 165

Slide 165 text

Most of the functionality we built to support the monolith is available to other services

Slide 166

Slide 166 text

~20% of all services are running on Kubernetes clusters

Slide 167

Slide 167 text

What'd we learn?

Slide 168

Slide 168 text

Positive outcomes

Slide 169

Slide 169 text

Reduced level of effort for new service setup

Slide 170

Slide 170 text

New services regularly deployed with little-to-no SRE involvement

Slide 171

Slide 171 text

APIs to query the running state of our system

Slide 172

Slide 172 text

APIs to mutate the running state of our system
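
A short sketch of what those query and mutate APIs buy you, again via the official Python client; the deployment name and namespace are hypothetical placeholders:

    # Query: what is actually running right now?
    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()

    dep = apps.read_namespaced_deployment("hello-web", "default")
    print(f"{dep.status.ready_replicas}/{dep.spec.replicas} replicas ready")

    # Mutate: declare a new desired state and let the cluster converge.
    apps.patch_namespaced_deployment(
        "hello-web", "default", body={"spec": {"replicas": 5}}
    )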

Slide 173

Slide 173 text

Cloud-native platform to build against

Slide 174

Slide 174 text

Open

Slide 175

Slide 175 text

Emerging as a standard

Slide 176

Slide 176 text

Reduce lock-in

Slide 177

Slide 177 text

Commoditize compute providers

Slide 178

Slide 178 text

More OSS-friendly than configuration management and glue

Slide 179

Slide 179 text

provider automation, config management, packages, & operating system

Slide 180

Slide 180 text

container images, resources, & APIs

Slide 181

Slide 181 text

No content

Slide 182

Slide 182 text

No content

Slide 183

Slide 183 text

Challenges

Slide 184

Slide 184 text

SRE

Slide 185

Slide 185 text

Operationalizing a new platform

Slide 186

Slide 186 text

Docker instability

Slide 187

Slide 187 text

Changing the expectations of application engineers

Slide 188

Slide 188 text

Application Engineering

Slide 189

Slide 189 text

Change

Slide 190

Slide 190 text

Learning curve

Slide 191

Slide 191 text

Shorter process lifetimes

Slide 192

Slide 192 text

What happens on process shutdown?

Slide 193

Slide 193 text

What happens during ungraceful shutdown?
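
Both questions fall out of how Kubernetes stops pods: the kubelet sends SIGTERM, waits out a termination grace period (30 seconds by default), then sends SIGKILL to whatever remains. A minimal Python sketch of handling the graceful half; a process that outlives the grace period gets no chance to clean up, which is the ungraceful case:

    # Minimal sketch: trap SIGTERM, stop taking new work, drain, exit.
    import signal
    import sys
    import time

    shutting_down = False

    def handle_sigterm(signum, frame):
        # Flip a flag instead of exiting immediately so in-flight
        # work can finish or be re-queued.
        global shutting_down
        shutting_down = True

    signal.signal(signal.SIGTERM, handle_sigterm)

    while not shutting_down:
        time.sleep(1)  # stand-in for serving requests or working jobs

    # Drain and exit within the grace period; after it, SIGKILL
    # terminates the process with no cleanup at all.
    sys.exit(0)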

Slide 194

Slide 194 text

Things I’d do again

Slide 195

Slide 195 text

Passion team

Slide 196

Slide 196 text

Prioritize communication

Slide 197

Slide 197 text

Network effect via highly visible work

Slide 198

Slide 198 text

Gradual rollout

Slide 199

Slide 199 text

Things I’d do differently

Slide 200

Slide 200 text

More consciously consider the handoff phase

Slide 201

Slide 201 text

Document this approach to help it feel more regular

Slide 202

Slide 202 text

More open source

Slide 203

Slide 203 text

What’s next?

Slide 204

Slide 204 text

Seek feedback from engineers

Slide 205

Slide 205 text

Seek feedback from SREs

Slide 206

Slide 206 text

Seek feedback from leadership

Slide 207

Slide 207 text

Relentlessly focus on automating work that scales with traffic or organizational size

Slide 208

Slide 208 text

Build services that leverage the platform

Slide 209

Slide 209 text

Focus SRE efforts on improvements that benefit all services

Slide 210

Slide 210 text

Focus SRE efforts on improvements that benefit everyone

Slide 211

Slide 211 text

Keep improving

Slide 212

Slide 212 text

Thanks!

Slide 213

Slide 213 text

@jnewland