QCon London 2019: Mature microservices and how to operate them

Slide 1

Slide 1 text

Mature microservices and how to operate them
Sarah Wells
Technical Director for Operations & Reliability, The Financial Times
@sarahjwells

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

https://www.ft.com/stream/c47f4dfc-6879-4e95-accf-ca8cbe6a1f69

Slide 4

Slide 4 text

https://www.ft.com/companies

Slide 5

Slide 5 text

Problem: we’d set up a redirect to a page which didn’t exist

Slide 6

Slide 6 text

We weren’t sure how to fix the data via the URL management tool

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

We got it fixed

Slide 10

Slide 10 text

Polyglot architectures are great - until you need to work out how *this* database is backed up

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Microservices are more complicated to operate and maintain

Slide 15

Slide 15 text

Why bother?

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

“Experiment” for most organizations really means “try”
Linda Rising, Experiments: the Good, the Bad and the Beautiful

Slide 19

Slide 19 text

Overlap tests by componentising the barrier

Slide 20

Slide 20 text

Releasing changes frequently doesn’t just ‘happen’

Slide 21

Slide 21 text

Done right, microservices enable this

Slide 22

Slide 22 text

The team that builds the system *has* to operate it too

Slide 23

Slide 23 text

What happens when teams move on to new projects?

Slide 24

Slide 24 text

Your next legacy system will be microservices not a monolith

Slide 25

Slide 25 text

Optimising for speed
Operating microservices
When people move on

Slide 26

Slide 26 text

Optimising for speed

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Measure                   High performers
Delivery lead time

Slide 29

Slide 29 text

Measure                   High performers
Delivery lead time        Less than one hour

“How long would it take you to release a single line of code to production?”

Slide 30

Slide 30 text

Measure                   High performers
Delivery lead time        Less than one hour
Deployment frequency

Slide 31

Slide 31 text

Measure                   High performers
Delivery lead time        Less than one hour
Deployment frequency      On demand

Slide 32

Slide 32 text

Measure                   High performers
Delivery lead time        Less than one hour
Deployment frequency      On demand
Time to restore service

Slide 33

Slide 33 text

Measure                   High performers
Delivery lead time        Less than one hour
Deployment frequency      On demand
Time to restore service   Less than one hour

Slide 34

Slide 34 text

Measure                   High performers
Delivery lead time        Less than one hour
Deployment frequency      On demand
Time to restore service   Less than one hour
Change fail rate

Slide 35

Slide 35 text

Measure                   High performers
Delivery lead time        Less than one hour
Deployment frequency      On demand
Time to restore service   Less than one hour
Change fail rate          0 - 15%
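
The measures on this slide can be turned into a small check. The sketch below is illustrative only: the function names and the input shapes are assumptions made for this example, and "on demand" deployment frequency is left out because it is a qualitative measure.

```python
from datetime import timedelta

# Thresholds copied from the slide. Deployment frequency ("on demand") is
# qualitative, so it is deliberately not modelled here.
MAX_LEAD_TIME = timedelta(hours=1)
MAX_RESTORE_TIME = timedelta(hours=1)
MAX_CHANGE_FAIL_RATE = 0.15


def change_fail_rate(release_failed_flags):
    """Fraction of releases that failed.

    release_failed_flags is a list of booleans, True meaning the release
    failed (a hypothetical input shape chosen for this sketch).
    """
    if not release_failed_flags:
        return 0.0
    return sum(release_failed_flags) / len(release_failed_flags)


def meets_high_performer_bar(lead_time, restore_time, fail_rate):
    """Compare measured values against the slide's high-performer thresholds."""
    return (
        lead_time < MAX_LEAD_TIME
        and restore_time < MAX_RESTORE_TIME
        and fail_rate <= MAX_CHANGE_FAIL_RATE
    )
```

For example, a team with a 20 minute lead time, a 30 minute restore time and one failed release in a hundred would pass this check; a team releasing with a month-long lead time would not.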

Slide 36

Slide 36 text

High performing organisations release changes frequently

Slide 37

Slide 37 text

Continuous delivery is the foundation

Slide 38

Slide 38 text

“If it hurts, do it more frequently, and bring the pain forward.”

Slide 39

Slide 39 text

Our old build and deployment process was very manual…

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

You can’t experiment when you do 12 releases a year

Slide 42

Slide 42 text

1. An automated build and release pipeline

Slide 43

Slide 43 text

2. Automated testing, integrated into the pipeline

Slide 44

Slide 44 text

3. Continuous integration

Slide 45

Slide 45 text

If you aren’t releasing multiple times a day, consider what is stopping you

Slide 46

Slide 46 text

You’ll probably have to change the way you architect things

Slide 47

Slide 47 text

Zero downtime deployments:
- sequential deployments
- schemaless databases
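
One reading of "sequential deployments" is that instances are upgraded one at a time so the rest keep serving traffic. The sketch below illustrates that idea under stated assumptions: the instance dictionaries, the `is_healthy` callback and the rollback behaviour are all hypothetical and are not taken from the talk.

```python
def sequential_deploy(instances, new_version, is_healthy):
    """Upgrade instances one at a time so the others keep serving.

    instances: list of dicts with "version" and "in_service" keys
               (a hypothetical shape for this sketch).
    is_healthy: callable taking an instance dict and returning True or False.
    Returns True if every instance ended up on new_version.
    """
    if len(instances) < 2:
        raise ValueError("need at least two instances to stay available")
    for instance in instances:
        instance["in_service"] = False       # drain: stop routing traffic here
        previous = instance["version"]
        instance["version"] = new_version
        if not is_healthy(instance):
            instance["version"] = previous   # roll back just this instance
            instance["in_service"] = True
            return False                     # stop; later instances are untouched
        instance["in_service"] = True        # back in rotation before moving on
    return True
```

Stopping at the first unhealthy instance keeps the damage to one node; instances upgraded earlier stay on the new version, so a real pipeline would also need a policy for that mixed state.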

Slide 48

Slide 48 text

In-hours releases mean the people who can help are there

Slide 49

Slide 49 text

You need to be able to test and deploy your changes independently

Slide 50

Slide 50 text

You need systems - and teams - to be loosely coupled

Slide 51

Slide 51 text

Done right, microservices are loosely coupled

Slide 52

Slide 52 text

Processes also have to change

Slide 53

Slide 53 text

Often there is ‘process theatre’ around things and this can safely be removed

Slide 54

Slide 54 text

Change approval boards don’t reduce the chance of failure

Slide 55

Slide 55 text

Filling out a form for each change takes too long

Slide 56

Slide 56 text

How fast are we moving?

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

Releasing 250 times as often

Slide 60

Slide 60 text

Changes are small, easy to understand, independent and reversible

Slide 61

Slide 61 text

<1% failure rate
~16% failure rate

Slide 62

Slide 62 text

Optimising for speed
Operating microservices

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

There are patterns and approaches that help

Slide 65

Slide 65 text

DevOps is essential for success

Slide 66

Slide 66 text

You can’t hand things off to another team when they change multiple times a day

Slide 67

Slide 67 text

High performing teams get to make their own decisions about tools and technology

Slide 68

Slide 68 text

Delegating tool choice to teams makes it hard for central teams to support everything

Slide 69

Slide 69 text

Make it someone else’s problem

Slide 70

Slide 70 text

https://medium.com/wardleymaps

Slide 71

Slide 71 text

Buy rather than build, unless it’s critical to your business

Slide 72

Slide 72 text

Work out what level of risk you’re comfortable with

Slide 73

Slide 73 text

“We’re not a hospital or a power station”

Slide 74

Slide 74 text

We value releasing often so we can experiment frequently

Slide 75

Slide 75 text

Accept that you will generally be in a state of ‘grey failure’

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

Retry on failure:
- backoff before retrying
- give up if it’s taking too long
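
A minimal sketch of the two rules on this slide, backing off between attempts and giving up once a time budget is spent. The function name, the default delays and the use of exponential backoff with jitter are assumptions made for illustration; the slide does not prescribe a particular backoff scheme.

```python
import random
import time


class GaveUp(Exception):
    """Raised when the retry time budget is exhausted."""


def retry_with_backoff(call, max_elapsed=5.0, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call `call` until it succeeds or the time budget runs out.

    sleep and clock are injectable so the behaviour can be tested without
    real waiting. For brevity this retries on any exception; a real client
    would retry only on errors it knows to be transient.
    """
    start = clock()
    attempt = 0
    while True:
        try:
            return call()
        except Exception as exc:
            # Exponential backoff capped at max_delay, with random jitter so
            # many clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            delay = random.uniform(0, ceiling)
            # Give up if waiting again would take us past the budget.
            if clock() - start + delay > max_elapsed:
                raise GaveUp(f"gave up after {attempt + 1} attempts") from exc
            sleep(delay)
            attempt += 1
```

Bounding the total elapsed time, rather than only the number of attempts, stops a slow dependency from tying up the caller indefinitely.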

Slide 78

Slide 78 text

Mitigate now, fix tomorrow

Slide 79

Slide 79 text

How do you know something’s wrong?

Slide 80

Slide 80 text

Concentrate on the business capabilities

Slide 81

Slide 81 text

Synthetic monitoring

Slide 82

Slide 82 text

No content

Slide 83

Slide 83 text

No content

Slide 84

Slide 84 text

No content

Slide 85

Slide 85 text

No content

Slide 86

Slide 86 text

No data fixtures required

Slide 87

Slide 87 text

Also helps us know things are broken even if no user is currently doing anything
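
A synthetic check can be sketched as a probe that exercises one business capability end to end on a schedule and records whether it worked, regardless of whether any user is active. Everything named below (`CheckResult`, `run_synthetic_check`, the `action` callback) is hypothetical and only illustrates the shape of such a probe; it is not the FT's implementation.

```python
import time
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    ok: bool
    latency_s: float
    detail: str = ""


def run_synthetic_check(name, action, timeout_s=10.0, clock=time.monotonic):
    """Run one end-to-end probe and report whether it succeeded in time.

    action is a callable that performs the capability being checked (for
    example, publish something and read it back) and raises on failure.
    """
    start = clock()
    try:
        action()
    except Exception as exc:
        # Any failure in the probe counts as the capability being broken.
        return CheckResult(name, False, clock() - start, repr(exc))
    elapsed = clock() - start
    if elapsed > timeout_s:
        # The probe worked, but too slowly to count as healthy.
        return CheckResult(name, False, elapsed, "too slow")
    return CheckResult(name, True, elapsed)
```

A scheduler would call this every minute or so and alert when `ok` is false for several consecutive runs.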

Slide 88

Slide 88 text

Make sure you know whether *real* things are working in production

Slide 89

Slide 89 text

Our editorial team is inventive

Slide 90

Slide 90 text

What does it mean for a publish to be ‘successful’?

Slide 91

Slide 91 text

No content

Slide 92

Slide 92 text

No content

Slide 93

Slide 93 text

No content

Slide 94

Slide 94 text

No content

Slide 95

Slide 95 text

Build observability into your system

Slide 96

Slide 96 text

Observability: can you infer what’s going on in the system by looking at its external outputs?

Slide 97

Slide 97 text

Log aggregation

Slide 98

Slide 98 text

No content

Slide 99

Slide 99 text

Metrics

Slide 100

Slide 100 text

Keep it simple:
- request rate
- latency
- error rate
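
The three metrics named on this slide can be derived from one stream of request records. The sketch below keeps samples in memory and summarises a sliding window; the class name, the 60 second window and the choice of a 95th percentile for latency are assumptions for illustration, and a real service would normally use a metrics library instead.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RequestMetrics:
    window_s: float = 60.0
    samples: list = field(default_factory=list)  # (timestamp, latency_s, is_error)

    def record(self, timestamp, latency_s, is_error=False):
        self.samples.append((timestamp, latency_s, is_error))

    def summary(self, now):
        """Request rate, p95 latency and error rate over the last window."""
        recent = [s for s in self.samples if 0 <= now - s[0] <= self.window_s]
        if not recent:
            return {"request_rate": 0.0, "p95_latency_s": None, "error_rate": 0.0}
        latencies = sorted(s[1] for s in recent)
        p95_index = math.ceil(0.95 * len(latencies)) - 1  # nearest-rank percentile
        return {
            "request_rate": len(recent) / self.window_s,   # requests per second
            "p95_latency_s": latencies[p95_index],
            "error_rate": sum(1 for s in recent if s[2]) / len(recent),
        }
```

Recording four requests, one of them an error, and calling `summary` inside the window gives an error rate of 0.25 and a request rate of four divided by the window length.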

Slide 101

Slide 101 text

You’ll always be migrating *something*

Slide 102

Slide 102 text

Doing anything 150 times is painful

Slide 103

Slide 103 text

Deployment pipelines need to be templated

Slide 104

Slide 104 text

Use a service mesh

Slide 105

Slide 105 text

You’ll have services that haven’t been released for years

Slide 106

Slide 106 text

But you don’t want to find out your service can’t be released when you most need to do it

Slide 107

Slide 107 text

Build everything overnight?

Slide 108

Slide 108 text

Optimising for speed
Operating microservices
When people move on

Slide 109

Slide 109 text

Every system must be owned

Slide 110

Slide 110 text

If you won’t invest enough to keep it running properly, shut it down

Slide 111

Slide 111 text

Keeping documentation up to date is a challenge

Slide 112

Slide 112 text

We started with a searchable runbook library

Slide 113

Slide 113 text

No content

Slide 114

Slide 114 text

System codes are very helpful

Slide 115

Slide 115 text

We needed to represent this stuff as a graph

Slide 116

Slide 116 text

No content

Slide 117

Slide 117 text

No content

Slide 118

Slide 118 text

Helps if you can give people something in return

Slide 119

Slide 119 text

No content

Slide 120

Slide 120 text

No content

Slide 121

Slide 121 text

Practice

Slide 122

Slide 122 text

“If it hurts, do it more frequently, and bring the pain forward.”

Slide 123

Slide 123 text

Failovers, database restores

Slide 124

Slide 124 text

Chaos engineering https://principlesofchaos.org/

Slide 125

Slide 125 text

Understand your steady state
Look at what you can change - minimise the blast radius
Work out what you expect to see happen
Run the experiment and see if you were right
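
The four steps above can be sketched as a tiny experiment harness. This is an illustration under stated assumptions, not a chaos engineering tool: `run_chaos_experiment`, its callbacks and the single numeric steady-state measure are all hypothetical names and simplifications.

```python
def run_chaos_experiment(measure, inject_fault, remove_fault, tolerance):
    """Run one small experiment and report whether the expectation held.

    measure: callable returning a number that describes steady state
             (for example, the fraction of successful requests).
    inject_fault / remove_fault: callables that apply and undo one small,
             contained change (the blast radius).
    tolerance: how far the measure may move while still counting as steady;
             this encodes what you expect to see happen.
    """
    baseline = measure()          # understand the steady state
    inject_fault()                # change one thing, keeping the blast radius small
    try:
        observed = measure()      # run the experiment
    finally:
        remove_fault()            # always undo the change, even on error
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_held": abs(observed - baseline) <= tolerance,
    }
```

With three replicas and one switched off, a success-rate measure should stay within tolerance; with a single replica the same experiment shows the hypothesis failing, which is the useful finding.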