Migrating a monolith to Kubernetes

Jesse Newland
November 14, 2017

Last year, a small team at GitHub set out to migrate a large portion of the application that serves GitHub.com to Kubernetes. This application fits the classic definition of a monolith: a large codebase, developed over many years, containing contributions from hundreds of engineers (many of whom have moved on to other things). In this presentation, we'll cover our motivations for this migration, the factors that led us to choose Kubernetes, and the strategies we used to empower a small team to make a change that affected a large engineering organization, and we'll reflect on what we learned in the process.

Transcript

  1. Migrating a monolith to Kubernetes DevOps Enterprise Summit 2017 Jesse

    Newland
  2. Hi!

  3. I’m Jesse Newland

  4. @jnewland

  5. 16 years in web operations

  6. 6 years at GitHub

  7. Engineering / Management

  8. None
  9. None
  10. Technical leadership from Austin, TX

  11. Why am I here? Kubernetes? Monoliths? DevOps? ENTERPRISE?

  12. My job is to effect change in a technical organization

  13. GitHub is growing, maturing, & evolving

  14. Our solutions often don’t scale to fit the needs of

    our growing organization
  15. On a journey of continuous improvement

  16. We are more alike, my friends, than we are unalike.

    Maya Angelou
  17. None
  18. https://githubengineering.com/kubernetes-at-github/

  19. Kubernetes is an open-source system for automating deployment, scaling,

    and management of containerized applications
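
To make that definition concrete, here is a minimal sketch, not taken from the talk, of asking Kubernetes to deploy and manage a containerized application through its API using the official Python client; the image, names, and namespace are hypothetical placeholders.

    # Minimal sketch (not from the talk): declare a Deployment that asks
    # Kubernetes to run and manage three replicas of a containerized app.
    # The image, names, and namespace below are hypothetical placeholders.
    from kubernetes import client, config

    config.load_kube_config()  # use local kubeconfig credentials

    deployment = client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name="example-web", namespace="default"),
        spec=client.V1DeploymentSpec(
            replicas=3,
            selector=client.V1LabelSelector(match_labels={"app": "example-web"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "example-web"}),
                spec=client.V1PodSpec(
                    containers=[
                        client.V1Container(
                            name="web",
                            image="example/web:latest",
                            ports=[client.V1ContainerPort(container_port=8080)],
                        )
                    ]
                ),
            ),
        ),
    )

    # Kubernetes handles scheduling, restarts, and scaling from here.
    client.AppsV1Api().create_namespaced_deployment(
        namespace="default", body=deployment
    )
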
  20. Kubernetes builds upon 15 years of experience of running production

    workloads at Google, combined with best-of-breed ideas and practices from the community
  21. I’m not here to tell you that you should adopt

    Kubernetes
  22. Or even to go too deep into the technical details

    of our migration
  23. https://githubengineering.com/kubernetes-at-github/ @jnewland

  24. Kubernetes is a technology

  25. Kubernetes is a super dope technology

  26. Not a panacea

  27. Use what’s right for you

  28. I’d like to share an anecdote from our ongoing journey

  29. The only slide with bullets, I promise! • Why we

    migrated our monolith to Kubernetes • How we approached a large cross-team project • Where we are today • What we learned in the process • Where we’re headed
  30. Why?

  31. Context

  32. The monolith

  33. Ruby on Rails

  34. github.com/github/github

  35. GitHub dot com the website

  36. 10 years old

  37. Extremely important to early velocity

  38. Increasing complexity

  39. Diffusion of responsibility

  40. None
  41. Incredibly high performance hardware

  42. Incredibly reliable hardware

  43. Incredibly low latency networking

  44. Incredibly high throughput networking

  45. None
  46. Unit of compute == instance

  47. Instance setup tightly coupled with configuration management

  48. API-driven, testable, but brutal feedback loop

  49. Human-managed provisioning and load balancing config

  50. High level of effort required to get a service into

    production
  51. None
  52. Our customer base is growing

  53. Our customers are growing

  54. Our ecosystem is growing

  55. Our organization is growing

  56. We’re shipping new products

  57. We’re improving existing products

  58. Our customers expect increasing speed and reliability

  59. We saw indications that our approach was struggling to deal

    with these forces
  60. The engineering culture at GitHub was attempting to evolve to

    encourage individual teams to act as maintainers of their own services
  61. SRE's tools and practices for running services had not yet

    evolved to match
  62. Easier to add functionality to an existing service

  63. Unsurprisingly, the monolith kept growing

  64. Increasing CI duration

  65. Increasing deploy duration

  66. Inflexible infrastructure

  67. Inefficient infrastructure

  68. Private cloud lock-in

  69. Developer and user experience trending downward

  70. The planets aligned in a way that made all of these

    problems visible all at once
  71. Hack week

  72. Given a week to ship something new and innovative, what

    might we expect engineers to do?
  73. 1) spend ~1 day on Puppet, provisioning, and load balancing

    config
  74. 2) reach out to SRE on Thursday and ask for

    our help?
  75. 3) build hack week features as a PR against the

    monolith
  76. Microcosm of the larger problems with our approach

  77. Incentives not aligned with the outcomes we desired

  78. None
  79. Our on-ramp went in the wrong direction

  80. High effort required

  81. None
  82. We decided to make an investment in our tools

  83. We decided to make an investment in our processes

  84. We decided to make an investment in our technology

  85. To support the other ongoing changes in our organization, we

    decided that we would work to level the playing field
  86. To support the decomposition of the monolith, we decided that

    we would work to provide a better experience for new services
  87. To enable SRE to spend more time on interesting services,

    we decided to work to reduce the amount of time we needed to spend on boring services
  88. To reduce the time we spent on boring services, we

    decided to work to make the service provisioning process entirely self-service
  89. To bring the infrastructure-building feedback loop down, we decided

    to base this new future on a container orchestration platform
  90. To leverage the experience of Google and the strength of

    the community, we decided to build this new approach with Kubernetes
  91. How?

  92. okay sorry, a few more bullets • Passion team •

    Prototype • Pick an impactful and visible target • Product vision and project plan • Pwork • Pause and regroup
  93. Passion team

  94. https://github.com/blog/2316-organize-your-experts-with-ad-hoc-teams

  95. Intentionally curate a diverse set of skills

  96. Intentionally curate a diverse set of experience

  97. Intentionally curate a diverse set of knowledge

  98. Intentionally curate a diverse set of perspectives

  99. SRE + Developer Experience + Platform Engineering

  100. Project scoped team

  101. @github/kubernetes github/kube #kube

  102. Prototype

  103. A strategy for not crying under the bed during hack

    week
  104. Prototype Goals

  105. Kubernetes cluster, load balancing, deployment strategy, docs
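
As a rough illustration of the load-balancing piece of that prototype scope (the talk does not show GitHub's actual configuration), a Kubernetes Service can spread traffic across the pods of a deployment like the hypothetical one sketched earlier:

    # Rough illustration (not GitHub's actual config): expose the pods of the
    # hypothetical "example-web" deployment behind a load-balanced Service.
    from kubernetes import client, config

    config.load_kube_config()

    service = client.V1Service(
        api_version="v1",
        kind="Service",
        metadata=client.V1ObjectMeta(name="example-web", namespace="default"),
        spec=client.V1ServiceSpec(
            type="LoadBalancer",              # or NodePort/ClusterIP on bare metal
            selector={"app": "example-web"},  # matches the deployment's pod labels
            ports=[client.V1ServicePort(port=80, target_port=8080)],
        ),
    )

    client.CoreV1Api().create_namespaced_service(namespace="default", body=service)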

  106. Leverage hack week standard of quality

  107. Validate our hypothesis that we could provide a new and

    better experience with minimal effort
  108. Validate our hypothesis that if provided with another option, engineers

    would flock to it
  109. Learn more about Kubernetes

  110. Seek feedback from engineers that used the new approach

  111. Internal marketing

  112. Wild success

  113. None
  114. Handful of projects launched with very little SRE involvement

  115. Positive feedback

  116. Learned a ton about an engineer’s perspective

  117. Several of these projects still exist, and are maintained by

    the teams that created them
  118. Pick a big target

  119. We decided to migrate the monolith

  120. Why?

  121. Pros

  122. We wanted to validate something larger following our positive experience

    with smaller scale apps during Hack Week
  123. A well worn path

  124. We were confident in the testing strategies available to us

  125. We had an overlapping need for dynamic lab environments

  126. And an overlapping need for more flexibility to handle peaks

    and valleys of demand
  127. Cons

  128. It might not work

  129. We might make things worse

  130. We decided to put together a project plan and see

    if it felt viable
  131. Vision and plan

  132. Tons of high-impact, visible work ahead

  133. Communication was crucial

  134. Key elements of communicating change at GitHub

  135. Know your goal

  136. …and lead with it

  137. Don’t mince words

  138. Write conversationally

  139. Include the alternatives you’ve considered

  140. Doing nothing is always an alternative

  141. Consider the production impact

  142. Give it a URL

  143. Pull request

  144. Repeat the message using different mediums

  145. Communication had the desired impact

  146. Executive support

  147. Additional engineering resources

  148. Project management resources

  149. Now all we had to do was not be wrong

  150. How’d it go?

  151. None
  152. None
  153. One big container

  154. 1.1 GB image

  155. 100-second image build

  156. it's fine

  157. Review lab

  158. 50 times per day

  159. Staff opt-in

  160. Controlled experiments

  161. None
  162. None
  163. ~100% of github.com web requests served by application processes running

    on Kubernetes
  164. Most of the functionality we built to support the monolith

    is available to other services
  165. ~20% of all services are running on Kubernetes clusters

  166. What'd we learn?

  167. Positive outcomes

  168. Reduced level of effort for new service setup

  169. New services regularly deployed with little-to-no SRE involvement

  170. APIs to query the running state of our system

  171. APIs to mutate the running state of our system
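
A minimal sketch of what those query and mutate APIs look like in practice, using the official Kubernetes Python client rather than GitHub's own tooling; the deployment name and namespace are hypothetical:

    # Hedged sketch, not GitHub's tooling: query the running state of the
    # cluster, then mutate it by scaling a hypothetical deployment.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()
    apps = client.AppsV1Api()

    # Query: list the pods currently running in a namespace.
    for pod in core.list_namespaced_pod(namespace="default").items:
        print(pod.metadata.name, pod.status.phase)

    # Mutate: scale the hypothetical "example-web" deployment to five replicas.
    apps.patch_namespaced_deployment_scale(
        name="example-web",
        namespace="default",
        body={"spec": {"replicas": 5}},
    )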

  172. Cloud-native platform to build against

  173. Open

  174. Emerging as a standard

  175. Reduce lock-in

  176. Commoditize compute providers

  177. More OSS friendly than configuration management and glue

  178. provider automation, config management, packages, & operating system

  179. container images, resources, & apis

  180. None
  181. None
  182. Challenges

  183. SRE

  184. Operationalizing a new platform

  185. Docker instability

  186. Changing the expectations of application engineers

  187. Application Engineering

  188. Change

  189. Learning curve

  190. Shorter process lifetimes

  191. What happens on process shutdown?

  192. What happens during ungraceful shutdown?
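
In Kubernetes, the shutdown question largely comes down to handling SIGTERM before the pod's termination grace period (30 seconds by default) expires and SIGKILL arrives. A minimal, hypothetical worker loop, sketched in Python rather than the monolith's Ruby:

    # Minimal, hypothetical worker sketch (Python rather than the monolith's Ruby).
    # Kubernetes sends SIGTERM on pod shutdown, then SIGKILL once the pod's
    # terminationGracePeriodSeconds (30s by default) expires. Ungraceful shutdown
    # is what happens when draining takes longer than that.
    import signal
    import sys
    import time

    shutting_down = False

    def handle_sigterm(signum, frame):
        global shutting_down
        shutting_down = True  # stop taking new work; finish what's in flight

    signal.signal(signal.SIGTERM, handle_sigterm)

    while not shutting_down:
        # ... pull and process one unit of work ...
        time.sleep(1)

    # Drain: flush buffers, close connections, checkpoint progress, then exit
    # cleanly before the grace period runs out.
    sys.exit(0)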

  193. Things I’d do again

  194. Passion team

  195. Prioritize communication

  196. Network effect via highly visible work

  197. Gradual rollout

  198. Things I’d do differently

  199. More consciously consider the handoff phase

  200. Document this approach to help it feel more regular

  201. More open source

  202. What’s next?

  203. Seek feedback from engineers

  204. Seek feedback from SREs

  205. Seek feedback from leadership

  206. Relentlessly focus on automating work that scales with traffic or

    organizational size
  207. Build services that leverage the platform

  208. Focus SRE efforts on improvements that benefit all services

  209. Focus SRE efforts on improvements that benefit everyone

  210. Keep improving

  211. Thanks!

  212. @jnewland