Slide 1

Slide 1 text

Switching horses midstream: the challenge of migrating 150+ services to Kubernetes
Sarah Wells, Technical Director for Operations and Reliability, Financial Times
@sarahjwells

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

The FT’s Content platform

Slide 7

Slide 7 text

This is what it really looks like…

Slide 8

Slide 8 text

Why *did* we migrate to k8s?

Slide 9

Slide 9 text

Mid 2015: a hand-rolled container stack

Slide 10

Slide 10 text

https://medium.com/wardleymaps

Slide 11

Slide 11 text

Spend your innovation tokens wisely

Slide 12

Slide 12 text

~80% reduction in EC2 costs

Slide 13

Slide 13 text

Many fewer steps to start running a new service in production

Slide 14

Slide 14 text

But: supportability of an in-house platform is a challenge

Slide 15

Slide 15 text

Choose boring technology: http://mcfunley.com/choose-boring-technology

Slide 16

Slide 16 text

By late 2016, tools were maturing

Slide 17

Slide 17 text

https://medium.com/wardleymaps

Slide 18

Slide 18 text

The FT is not a cluster orchestration company

Slide 19

Slide 19 text

Late 2016: Consider the alternatives

Slide 20

Slide 20 text

Metrics for success:
- amount of time spent keeping the cluster healthy
- number of sarcastic comments on Slack

Slide 21

Slide 21 text

Opted for Kubernetes

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

Using leading-edge technologies requires you to be comfortable with change

Slide 24

Slide 24 text

Shouldn’t be (too) scared about making the wrong decision: http://uk.businessinsider.com/jeff-bezos-on-type-1-and-type-2-decisions-2016-4

Slide 25

Slide 25 text

Switching horses midstream

Slide 26

Slide 26 text

At the start of this migration we had 150 services

Slide 27

Slide 27 text

Lots of other work going on at the same time

Slide 28

Slide 28 text

Complications of running in parallel

Slide 29

Slide 29 text

We had well over 2000 code releases while running at least part of the stack in parallel

Slide 30

Slide 30 text

Decisions, decisions, decisions…

Slide 31

Slide 31 text

Separate branches vs if/else in code
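
To make the "if/else in code" option concrete, here is a minimal sketch of a service that runs from a single branch on both stacks and picks its behaviour at startup. It relies on the fact that Kubernetes sets `KUBERNETES_SERVICE_HOST` in every pod's environment. The function names and the two config-source labels are hypothetical and do not come from the talk.

```python
import os


def detect_platform(env):
    """Return 'kubernetes' when running in a pod, otherwise 'legacy'."""
    # Kubernetes injects KUBERNETES_SERVICE_HOST into every container it starts.
    return "kubernetes" if "KUBERNETES_SERVICE_HOST" in env else "legacy"


def config_source(env):
    """Choose where to read configuration from (labels are hypothetical)."""
    if detect_platform(env) == "kubernetes":
        return "mounted-config"      # assumption: config mounted into the pod
    return "legacy-config-store"     # assumption: whatever the old stack used


if __name__ == "__main__":
    print(config_source(os.environ))
```

The trade-off is that every release exercises both code paths from one branch, and the conditional has to be removed once the old stack is switched off.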

Slide 32

Slide 32 text

Separate deployment mechanisms vs a single deployment mechanism

Slide 33

Slide 33 text

Risk-based approach to testing

Slide 34

Slide 34 text

Doing anything 150 times takes time

Slide 35

Slide 35 text

Changes per service weren’t *that* big

Slide 36

Slide 36 text

Migrating from systemd service files to Helm charts

Slide 37

Slide 37 text

Integrating the service into a templated Jenkins pipeline

Slide 38

Slide 38 text

Good to get everyone involved - “Helm days”

Slide 39

Slide 39 text

Discovered a lot of ‘broken’ things

Slide 40

Slide 40 text

Services that hadn’t been built for a long time

Slide 41

Slide 41 text

A standard that isn’t enforced will not be complied with:
- healthcheck timeouts

Slide 42

Slide 42 text

- /__gtg endpoints
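
The two standards above (healthcheck timeouts and /__gtg endpoints) can be illustrated together. The sketch below shows the decision logic a good-to-go handler might use: run each dependency check, but give up after a fixed deadline so the endpoint itself never hangs. The slides do not spell out the /__gtg contract, so the 200/503 status codes, the function name `good_to_go`, and the shape of `checks` are all assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def good_to_go(checks, timeout_seconds=2.0):
    """Run named checks under one shared deadline; return (status, body).

    `checks` maps a name to a zero-argument callable returning True or False.
    Status codes are an assumption: 200 when all pass, 503 otherwise.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    try:
        futures = {name: pool.submit(check) for name, check in checks.items()}
        deadline = time.monotonic() + timeout_seconds
        for name, future in futures.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                ok = future.result(timeout=remaining)
            except FutureTimeout:
                return 503, f"{name}: timed out"
            except Exception as exc:  # a crashing check counts as a failure
                return 503, f"{name}: {exc}"
            if not ok:
                return 503, f"{name}: failed"
        return 200, "OK"
    finally:
        # Do not block the response on checks that are still running.
        pool.shutdown(wait=False)
```

Wiring this into an HTTP route is left out; the point is that the timeout lives in the service, so a slow dependency produces a fast, explicit "not good to go" rather than a hung probe.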

Slide 43

Slide 43 text

Making sure a service will recover if k8s moves it elsewhere
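
One aspect of recovering after a move is startup: a rescheduled pod starts fresh on another node, and its dependencies may not be reachable straight away. The sketch below retries a connection with exponential backoff instead of crashing on the first failure. It is an illustration under that assumption, not the approach described in the talk; `connect_with_retry` and its parameters are hypothetical.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `connect` until it succeeds, backing off between failures.

    `sleep` is injectable so the backoff can be tested without waiting.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Delays double each time: base_delay, 2x, 4x, ...
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(
        f"dependency unavailable after {attempts} attempts"
    ) from last_error
```

If all attempts fail, the function raises, so the process exits and the orchestrator can restart it rather than leaving a half-started service in place.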

Slide 44

Slide 44 text

Easy to get sucked into making things better

Slide 45

Slide 45 text

Would have been better if…

Slide 46

Slide 46 text

We’d swarmed on the work

Slide 47

Slide 47 text

The longer you run in parallel, the more overhead for releasing code changes

Slide 48

Slide 48 text

and the higher the costs

Slide 49

Slide 49 text

Not just AWS costs either

Slide 50

Slide 50 text

Going live

Slide 51

Slide 51 text

Doing the migration

Slide 52

Slide 52 text

The results

Slide 53

Slide 53 text

A more stable platform

Slide 54

Slide 54 text

Something where we can learn from others

Slide 55

Slide 55 text

Reduction in hosting and support costs

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

@sarahjwells Thank you! We’re hiring: https://aboutus.ft.com/careers/