KubeCon Europe 2018: Switching Horses Midstream: The Challenges of Migrating 150+ Microservices to Kubernetes

The FT’s content platform team put our first containers live in mid-2015 and migrated the rest of our services over by April 2016. At that point, we weren't using Kubernetes - and much of what we were using, we built ourselves.

At the end of 2016, we decided we wanted to benefit from the work other people were doing and switch over to Kubernetes. But it's not easy to do that kind of move when you have 150+ microservices and you need to keep the existing platform running in parallel while you do the migration.

This talk covers that migration and the challenges we faced.

Sarah Wells

May 03, 2018

Transcript

  1. Switching horses midstream:
    the challenge of migrating
    150+ services to kubernetes
    Sarah Wells
    Technical Director for Operations and Reliability, Financial Times
    @sarahjwells

  2. The FT’s Content platform

  3. This is what it really looks like…

  4. Why *did* we migrate to k8s?

  5. Mid 2015: a hand-rolled
    container stack

  6. https://medium.com/wardleymaps

  7. Spend your innovation tokens wisely

  8. ~80% reduction in EC2 costs

  9. Many fewer steps to start running a new service in
    production

  10. But: supportability of an in-house platform is a
    challenge

  11. http://mcfunley.com/choose-boring-technology
    Choose boring technology

  12. By late 2016, tools were
    maturing

  13. https://medium.com/wardleymaps

  14. The FT is not a cluster orchestration company

  15. Late 2016: Consider the alternatives

  16. Metrics for success:
    - amount of time spent keeping cluster healthy
    - number of sarcastic comments on slack

  17. Opted for kubernetes

  18. Using leading edge technologies
    requires you to be comfortable
    with change

  19. Shouldn’t be (too) scared about
    making the wrong decision
    http://uk.businessinsider.com/jeff-bezos-on-type-1-and-type-2-decisions-2016-4

  20. Switching horses midstream

  21. At the start of this migration we had 150 services

  22. Lots of other work going
    on at the same time

  23. Complications of running in
    parallel

  24. We had well over 2000 code releases while running
    at least part of the stack in parallel

  25. Decisions, decisions, decisions…

  26. Separate branches vs if/else in code

  27. Separate deployment mechanisms vs a single
    deployment mechanism

  28. Risk-based approach to testing

  29. Doing anything 150 times takes
    time

  30. Changes per service weren’t *that* big

  31. Migrating from systemd
    service files to helm
    charts

  32. Integrating the service
    into a templated jenkins
    pipeline

  33. Good to get everyone involved - “Helm days”

  34. Discovered a lot of ‘broken’
    things

  35. Services that hadn’t been built for a long time

  36. A standard that isn’t
    enforced may not be
    complied with:
    - healthcheck timeouts

  37. - /__gtg endpoints

  38. Making sure a service will recover if k8s moves it
    elsewhere

  39. Easy to get sucked into making things better

  40. Would have been better if…

  41. We’d swarmed on the work

  42. The longer you run in parallel, the more overhead for
    releasing code changes

  43. and the higher the costs

  44. Not just AWS costs either

  45. Going live

  46. Doing the migration

  47. The results

  48. A more stable platform

  49. Something where we can learn from others

  50. Reduction in hosting and support costs

  51. Thank you!
    We’re hiring: https://aboutus.ft.com/careers/
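
Slides 36 and 37 name two standards that went unenforced: healthcheck timeouts and /__gtg endpoints. The deck does not show how the FT services implement either, so what follows is only a minimal sketch of the general idea, assuming "gtg" means "good to go" and that the endpoint should answer within a fixed deadline even when a dependency hangs. The names `gtg_status`, `GTGHandler`, `serve`, `CHECK_TIMEOUT_SECONDS` and the port number are all hypothetical and not taken from the talk.

```python
import concurrent.futures
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical deadline; the talk does not state what timeout the FT uses.
CHECK_TIMEOUT_SECONDS = 2.0


def gtg_status(checks, timeout=CHECK_TIMEOUT_SECONDS):
    """Run every check concurrently and return (http_status, message).

    `checks` maps a check name to a zero-argument callable that returns a
    truthy value when the dependency is healthy. A check that raises,
    returns a falsy value, or misses the shared deadline makes the
    service report 503.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(len(checks), 1))
    try:
        futures = {pool.submit(check): name for name, check in checks.items()}
        # One shared deadline for all checks, so the endpoint itself
        # responds in roughly `timeout` seconds at worst.
        done, not_done = concurrent.futures.wait(futures, timeout=timeout)
        if not_done:
            names = sorted(futures[f] for f in not_done)
            return 503, "timed out: " + ", ".join(names)
        failed = sorted(
            futures[f] for f in done
            if f.exception() is not None or not f.result()
        )
        if failed:
            return 503, "failed: " + ", ".join(failed)
        return 200, "OK"
    finally:
        # Do not block the response on checks that are still hanging.
        pool.shutdown(wait=False)


class GTGHandler(BaseHTTPRequestHandler):
    # Hypothetical registry of dependency checks for this service.
    checks = {}

    def do_GET(self):
        if self.path != "/__gtg":
            self.send_error(404)
            return
        status, message = gtg_status(self.checks)
        body = message.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the sketch server; the port is an arbitrary choice."""
    HTTPServer(("", port), GTGHandler).serve_forever()
```

A service would register its own checks in `GTGHandler.checks` and call `serve()`. An endpoint of this shape could then be polled by an orchestrator's health probe, although the deck does not say how the FT wired theirs up.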