Migrating a monolith to Kubernetes

Jesse Newland
November 14, 2017

Last year, a small team at GitHub set out to migrate a large portion of the application that serves GitHub.com to Kubernetes. This application fits the classic definition of a monolith: a large codebase, developed over many years, containing contributions from hundreds of engineers (many of whom have moved on to other things). In this presentation, we'll cover our motivations for this migration, the factors that led us to choose Kubernetes, and the strategies we used to empower a small team to make a change that affected a large engineering organization, then reflect on what we learned in the process.

Transcript

  1. Migrating a monolith
    to Kubernetes
    DevOps Enterprise Summit 2017
    Jesse Newland

  2. Hi!

  3. I’m
    Jesse Newland

  4. @jnewland

  5. 16 years in web
    operations

  6. 6 years at GitHub

  7. Engineering /
    Management

  10. Technical leadership
    from Austin, TX

  11. Why am I here?
    Kubernetes? Monoliths? DevOps? ENTERPRISE?

  12. My job is to affect
    change in a technical
    organization

  13. GitHub is
    growing,
    maturing,
    & evolving

  14. Our solutions often
    don’t scale to fit the
    needs of our growing
    organization

  15. On a journey of
    continuous
    improvement

  16. We are more alike,
    my friends,
    than we are unalike.
    Maya Angelou

  18. https://githubengineering.com/kubernetes-at-github/

  20. Kubernetes is an open-
    source system for
    automating deployment,
    scaling, and management of
    containerized applications
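
A minimal sketch of what "automating deployment" looks like in practice: you describe the desired state of a containerized application in a manifest and Kubernetes keeps reality matching it. The names, image, and replica count below are hypothetical, not GitHub's actual configuration.

```yaml
# Hypothetical Deployment: asks Kubernetes to keep three replicas of a web
# container running and to replace any that fail.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-web
  template:
    metadata:
      labels:
        app: example-web
    spec:
      containers:
        - name: web
          image: registry.example.com/example-web:latest  # placeholder image
          ports:
            - containerPort: 8080
```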

  21. Kubernetes builds upon 15 years
    of experience of running
    production workloads at Google,
    combined with best-of-breed
    ideas and practices from the
    community

  22. I’m not here to tell
    you that you should
    adopt Kubernetes

  23. Or even to go too deep
    into the technical
    details of our
    migration

  24. https://githubengineering.com/kubernetes-at-github/
    @jnewland

  25. Kubernetes is a
    technology

  26. Kubernetes is a
    super dope
    technology

  27. Not a panacea

  28. Use what’s right
    for you

  29. I’d like to share an anecdote
    from our ongoing journey

  30. The only slide with bullets, I promise!
    • Why we migrated our monolith to Kubernetes
    • How we approached a large cross-team project
    • Where we are today
    • What we learned in the process
    • Where we’re headed

  31. Why?

  32. Context

  33. The monolith

  34. Ruby on Rails

  35. github.com/
    github/
    github

  36. GitHub dot com
    the website

  37. 10 years old

  38. Extremely
    important to
    early velocity

  39. Increasing
    complexity

  40. Diffusion of
    responsibility

  42. Incredibly high
    performance
    hardware

  43. Incredibly reliable
    hardware

  44. Incredibly low
    latency
    networking

  45. Incredibly high
    throughput
    networking

  47. Unit of compute
    ==
    instance

  48. Instance setup tightly
    coupled with
    configuration
    management

  49. API-driven,
    testable, but brutal
    feedback loop

  50. Human-managed provisioning and
    load balancing config

  51. High level of effort
    required to get a
    service into
    production

  53. Our customer
    base is growing

  54. Our customers are
    growing

  55. Our ecosystem is
    growing

  56. Our organization is
    growing

  57. We’re shipping
    new products

  58. We’re improving
    existing products

  59. Our customers
    expect increasing
    speed and reliability

  60. We saw indications that our
    approach was struggling to
    deal with these forces

  61. The engineering culture at
    GitHub was attempting to
    evolve to encourage individual
    teams to act as maintainers of
    their own services

  62. SRE's tools and practices for running services
    had not yet evolved to match

  63. Easier to add functionality to an existing service

  64. Unsurprisingly, the
    monolith kept
    growing

  65. Increasing CI duration

  66. Increasing deploy duration

  67. Inflexible
    infrastructure

  68. Inefficient infrastructure

  69. Private
    cloud
    lock-in

  70. Developer and user experience trending downward

  71. The planets aligned in a way
    that made all of these
    problems visible all at once

  72. Hack week

  73. Given a week to ship
    something new and
    innovative, what might we
    expect engineers to do?

  74. 1) spend ~1 day on
    Puppet, provisioning,
    and load balancing
    config

  75. 2) reach out to SRE
    on Thursday and
    ask for our help?

  76. 3) build hack week
    features as a PR
    against the monolith

  77. Microcosm of the larger problems
    with our approach

  78. Incentives
    not aligned
    with the outcomes
    we desired

  80. Our on-ramp went
    in the
    wrong direction

  81. High
    effort
    required

  83. We decided to make
    an investment in
    our tools

  84. We decided to make
    an investment in
    our processes

  85. We decided to make
    an investment in
    our technology

  86. To support the other ongoing
    changes in our organization,
    we decided that we would work to
    level the playing field

  87. To support the decomposition
    of the monolith, we decided
    that we would work to
    provide a better experience
    for new services

  88. To enable SRE to spend more
    time on interesting services,
    we decided to work to reduce
    the amount of time we needed
    to spend on boring services

  89. To reduce the time we spent
    on boring services, we
    decided to work to make the
    service provisioning process
    entirely self-service

  90. To shorten the
    infrastructure-building
    feedback loop, we decided
    to base this new future on
    a container orchestration
    platform

  91. To leverage the experience
    of Google and the strength
    of the community, we
    decided to build this new
    approach with Kubernetes

  92. How?

  93. okay sorry, a few more bullets
    • Passion team
    • Prototype
    • Pick an impactful and visible target
    • Product vision and project plan
    • Pwork
    • Pause and regroup

  94. Passion team

  95. https://github.com/blog/2316-organize-your-experts-with-ad-hoc-teams

  96. Intentionally
    curate a diverse
    set of skills

  97. Intentionally
    curate a diverse
    set of experience

  98. Intentionally
    curate a diverse
    set of knowledge

  99. Intentionally
    curate a diverse
    set of perspectives

  100. SRE
    +
    Developer Experience
    +
    Platform Engineering

  101. Project scoped
    team

  102. @github/kubernetes
    github/kube
    #kube

  103. Prototype

  104. A strategy for not
    crying under the bed
    during hack week

  105. Prototype Goals

  106. Kubernetes cluster,
    load balancing,
    deployment strategy,
    docs
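
For the load-balancing piece, the usual Kubernetes building block is a Service: a stable virtual IP that spreads traffic across whichever pods match its label selector. A hedged sketch, with illustrative names and ports rather than the prototype's actual config:

```yaml
# Hypothetical Service: load-balances across pods labeled app: example-web,
# such as those created by a Deployment like the earlier sketch.
apiVersion: v1
kind: Service
metadata:
  name: example-web
spec:
  selector:
    app: example-web    # traffic goes to ready pods carrying this label
  ports:
    - port: 80          # port clients connect to
      targetPort: 8080  # port the container listens on
```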

  107. Leverage
    hack week
    standard of
    quality

  108. Validate our hypothesis
    that we could provide a
    new and better experience
    with minimal effort

  109. Validate our hypothesis
    that if provided with
    another option, engineers
    would flock to it

  110. Learn more about
    Kubernetes

  111. Seek feedback from
    engineers that used
    the new approach

  112. Internal marketing

  113. Wild success

  115. Handful of projects
    launched with very
    little SRE involvement

  116. Positive feedback

  117. Learned a ton
    about an engineer’s
    perspective

  118. Several of these
    projects still exist, and
    are maintained by
    their creating teams

  119. Pick a big target

  120. We decided to migrate the monolith

  121. Why?

  122. Pros

  123. We wanted to validate
    something larger following
    our positive experience with
    smaller scale apps during
    Hack Week

  124. A well worn path

  125. We were confident in
    the testing strategies
    available to us

  126. We had an overlapping
    need for dynamic lab
    environments

  127. And an overlapping
    need for more flexibility
    to handle peaks and
    valleys of demand

  128. Cons

  129. It might not work

  130. We might make
    things worse

  131. We decided to put
    together a project plan
    and see if it felt viable

  132. Vision and plan

  133. Tons of high-impact,
    visible work ahead

  134. Communication
    was crucial

  135. Key elements of
    communicating change at GitHub

  136. Know your goal

  137. …and lead with it

  138. Don’t mince words

  139. Write
    conversationally

  140. Include the
    alternatives you’ve
    considered

  141. Doing nothing is
    always an
    alternative

  142. Consider the
    production impact

  143. Give it a URL

  144. Pull request

  145. Repeat the
    message using
    different mediums

  146. Communication had
    the desired impact

  147. Executive support

  148. Additional
    engineering
    resources

  149. Project
    management
    resources

  150. Now all we had to do was
    not be wrong

  151. How’d it go?

  154. One
    big
    container

  155. 1.1 GB image

  156. 100-second image build

  157. it's fine

  158. Review lab

  159. 50 times per day

  160. Staff opt-in

  161. Controlled experiments
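
One common way to run this kind of controlled experiment on plain Kubernetes (a sketch of the general pattern, not necessarily the mechanism GitHub used): run a stable and an experimental Deployment whose pods share the label a Service selects on, and shift the traffic share by adjusting replica counts.

```yaml
# Hypothetical canary split: a Service selecting app: example-web (as in the
# earlier sketch) routes to pods from both Deployments, so traffic divides
# roughly 9:1 by replica count.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-web-stable
spec:
  replicas: 9
  selector:
    matchLabels: {app: example-web, track: stable}
  template:
    metadata:
      labels: {app: example-web, track: stable}
    spec:
      containers:
        - name: web
          image: registry.example.com/example-web:stable
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-web-canary
spec:
  replicas: 1
  selector:
    matchLabels: {app: example-web, track: canary}
  template:
    metadata:
      labels: {app: example-web, track: canary}
    spec:
      containers:
        - name: web
          image: registry.example.com/example-web:canary
```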

  164. ~100% of github.com
    web requests served by
    application processes
    running on Kubernetes

  165. Most of the functionality
    we built to support the
    monolith is available to
    other services

  166. ~20% of all services
    are running on
    Kubernetes clusters

  167. What'd we learn?

  168. Positive outcomes

  169. Reduced level of
    effort for new
    service setup

  170. New services regularly
    deployed with little-to-
    no SRE involvement

  171. APIs to query the
    running state of
    our system

  172. APIs to mutate the
    running state of
    our system

  173. Cloud-native
    platform to build
    against

  174. Open

  175. Emerging
    as a
    standard

  176. Reduce
    lock-in

  177. Commoditize
    compute
    providers

  178. More OSS friendly
    than configuration
    management and glue

  179. provider automation,
    config management,
    packages, &
    operating system

  180. container images,
    resources, &
    apis

  183. Challenges

  184. SRE

  185. Operationalizing a
    new platform

  186. Docker instability

  187. Changing the
    expectations of
    application engineers

  188. Application
    Engineering

  189. Change

  190. Learning curve

  191. Shorter
    process
    lifetimes

  192. What happens on
    process
    shutdown?

  193. What happens
    during ungraceful
    shutdown?
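
Context for both questions: when Kubernetes stops a pod it runs any preStop hook, sends SIGTERM to the containers, and sends SIGKILL once the grace period expires, so the application has to drain in-flight work and exit within that window. A hedged sketch of the relevant settings; the values are illustrative only.

```yaml
# Hypothetical pod spec fragment showing the shutdown-related knobs.
apiVersion: v1
kind: Pod
metadata:
  name: example-web
spec:
  terminationGracePeriodSeconds: 30   # total time allowed before SIGKILL
  containers:
    - name: web
      image: registry.example.com/example-web:latest
      lifecycle:
        preStop:
          exec:
            # runs before SIGTERM, e.g. to stop accepting new requests
            command: ["sh", "-c", "sleep 5"]
```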

  194. Things I’d do again

  195. Passion team

  196. Prioritize
    communication

  197. Network effect via
    highly visible work

  198. Gradual rollout

  199. Things I’d do
    differently

  200. More consciously
    consider the
    handoff phase

  201. Document this
    approach to help it
    feel more regular

  202. More open source

  203. What’s next?

  204. Seek feedback
    from engineers

  205. Seek feedback
    from SREs

  206. Seek feedback
    from leadership

  207. Relentlessly focus on
    automating work that
    scales with traffic or
    organizational size

  208. Build services that
    leverage the
    platform

  209. Focus SRE efforts on
    improvements that
    benefit all services

  210. Focus SRE efforts on
    improvements that
    benefit everyone

  211. Keep improving

  212. Thanks!

  213. @jnewland
