KubeCon Europe 2018: Switching Horses Midstream: The Challenges of Migrating 150+ Microservices to Kubernetes

The FT’s content platform team put our first containers live in mid-2015 and migrated the rest of our services over by April 2016. At that point, we weren't using Kubernetes - and much of what we were using, we built ourselves.

At the end of 2016, we decided we wanted to benefit from the work other people were doing and switch over to Kubernetes. But it's not easy to do that kind of move when you have 150+ microservices and you need to keep the existing platform running in parallel while you do the migration.

This talk covers that migration and the challenges we faced.

Sarah Wells

May 03, 2018

Transcript

  1. Switching horses midstream: the challenge of migrating 150+ services to

    kubernetes Sarah Wells Technical Director for Operations and Reliability, Financial Times @sarahjwells
  2. None
  3. None
  4. None
  5. None
  6. The FT’s Content platform

  7. This is what it really looks like…

  8. @sarahjwells Why *did* we migrate to k8s?

  9. @sarahjwells Mid 2015: a hand-rolled container stack

  10. @sarahjwells https://medium.com/wardleymaps

  11. @sarahjwells Spend your innovation tokens wisely

  12. @sarahjwells ~80% reduction in EC2 costs

  13. @sarahjwells Many fewer steps to start running a new service

    in production
  14. @sarahjwells But: supportability of an in-house platform is a challenge

  15. @sarahjwells http://mcfunley.com/choose-boring-technology Choose boring technology

  16. @sarahjwells By late 2016, tools were maturing

  17. @sarahjwells https://medium.com/wardleymaps

  18. @sarahjwells The FT is not a cluster orchestration company

  19. @sarahjwells Late 2016: Consider the alternatives

  20. @sarahjwells Metrics for success: - amount of time spent keeping

    cluster healthy - number of sarcastic comments on slack
  21. @sarahjwells Opted for kubernetes

  22. None
  23. @sarahjwells Using leading edge technologies requires you to be comfortable

    with change
  24. @sarahjwells Shouldn’t be (too) scared about making the wrong decision

    http://uk.businessinsider.com/jeff-bezos-on-type-1-and-type-2-decisions-2016-4
  25. @sarahjwells Switching horses midstream

  26. @sarahjwells At the start of this migration we had 150

    services
  27. @sarahjwells Lots of other work going on at the same

    time
  28. @sarahjwells Complications of running in parallel

  29. @sarahjwells We had well over 2000 code releases while running

    at least part of the stack in parallel
  30. @sarahjwells Decisions, decisions, decisions…

  31. @sarahjwells Separate branches vs if/else in code

  32. @sarahjwells Separate deployment mechanisms vs a single deployment mechanism

  33. @sarahjwells Risk-based approach to testing

  34. @sarahjwells Doing anything 150 times takes time

  35. @sarahjwells Changes per service weren’t *that* big

  36. @sarahjwells Migrating from systemd service files to helm charts

  37. @sarahjwells Integrating the service into a templated jenkins pipeline

  38. @sarahjwells Good to get everyone involved - “Helm days”

  39. @sarahjwells Discovered a lot of ‘broken’ things

  40. @sarahjwells Services that hadn’t been built for a long time

  41. @sarahjwells A standard that isn’t enforced will not be

    complied with: - healthcheck timeouts
  42. @sarahjwells - /__gtg endpoints

  43. @sarahjwells Making sure a service will recover if k8s moves

    it elsewhere
  44. @sarahjwells Easy to get sucked into making things better

  45. @sarahjwells Would have been better if…

  46. @sarahjwells We’d swarmed on the work

  47. @sarahjwells The longer you run in parallel, the more overhead

    for releasing code changes
  48. @sarahjwells and the higher the costs

  49. @sarahjwells Not just AWS costs either

  50. @sarahjwells Going live

  51. @sarahjwells Doing the migration

  52. @sarahjwells The results

  53. @sarahjwells A more stable platform

  54. @sarahjwells Something where we can learn from others

  55. Reduction in hosting and support costs

  56. None
  57. @sarahjwells Thank you! We’re hiring: https://aboutus.ft.com/careers/
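
Slides 41 to 43 mention healthcheck timeouts and /__gtg endpoints without showing what one might look like. The following is a minimal sketch only, not the FT's implementation. It assumes the convention that /__gtg answers 200 when the service is good to go and 503 otherwise, and that each dependency check must finish within a shared deadline. The names `gtg_status`, `GtgHandler` and `CHECK_TIMEOUT_SECONDS`, and the two-second value, are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CheckTimeout
from http.server import BaseHTTPRequestHandler

# Assumed value for illustration; the slides do not state a timeout.
CHECK_TIMEOUT_SECONDS = 2.0


def gtg_status(checks, timeout_seconds=CHECK_TIMEOUT_SECONDS):
    """Return (http_status, body): 200 only if every check passes in time.

    `checks` maps a name to a zero-argument callable returning True or False.
    """
    executor = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    try:
        # Start all checks at once so they share a single deadline.
        futures = {name: executor.submit(check) for name, check in checks.items()}
        deadline = time.monotonic() + timeout_seconds
        for name, future in futures.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                if not future.result(timeout=remaining):
                    return 503, f"{name} failed"
            except CheckTimeout:
                return 503, f"{name} timed out"
            except Exception as exc:
                return 503, f"{name} errored: {exc}"
        return 200, "OK"
    finally:
        # Do not block the response on a check that is still hanging.
        executor.shutdown(wait=False)


class GtgHandler(BaseHTTPRequestHandler):
    """Serves /__gtg; `checks` is a hypothetical registry set by the service."""

    checks = {}

    def do_GET(self):
        if self.path != "/__gtg":
            self.send_error(404)
            return
        status, body = gtg_status(self.checks)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Cache-Control", "no-store")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

A service could pass `GtgHandler` to `http.server.HTTPServer` and point its orchestrator's health probe at /__gtg. Bounding every check with a timeout means a hung dependency produces a prompt 503 rather than a probe that itself hangs.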