RebelCon 2019: Mature microservices and how to operate them

Slide 1

Slide 1 text

Mature microservices and how to operate them
Sarah Wells
Technical Director for Operations & Reliability, The Financial Times
@sarahjwells

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

https://www.ft.com/stream/c47f4dfc-6879-4e95-accf-ca8cbe6a1f69

Slide 4

Slide 4 text

https://www.ft.com/companies

Slide 5

Slide 5 text

Problem: we’d set up a redirect to a page which didn’t exist

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Using the right tool for the job is great - until you need to work out how *this* database is backed up

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Microservices are more complicated to operate and maintain

Slide 13

Slide 13 text

Why bother?

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

“Experiment” for most organizations really means “try”
Linda Rising, Experiments: the Good, the Bad and the Beautiful

Slide 17

Slide 17 text

Overlap tests by componentising the barrier

Slide 18

Slide 18 text

Releasing changes frequently doesn’t just ‘happen’

Slide 19

Slide 19 text

Done right, microservices enable this

Slide 20

Slide 20 text

What happens when teams move on to new projects?

Slide 21

Slide 21 text

Your next legacy system will be microservices not a monolith

Slide 22

Slide 22 text

Optimising for speed
Operating microservices
When people move on

Slide 23

Slide 23 text

Optimising for speed

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Measure | High performers
Delivery lead time |
Data from Accelerate: Forsgren, Humble, Kim

Slide 26

Slide 26 text

Measure | High performers
Delivery lead time | Less than one hour
Data from Accelerate: Forsgren, Humble, Kim

Slide 27

Slide 27 text

Measure | High performers
Delivery lead time | Less than one hour
Deployment frequency |
Data from Accelerate: Forsgren, Humble, Kim

Slide 28

Slide 28 text

Measure | High performers
Delivery lead time | Less than one hour
Deployment frequency | On demand
Data from Accelerate: Forsgren, Humble, Kim

Slide 29

Slide 29 text

Measure | High performers
Delivery lead time | Less than one hour
Deployment frequency | On demand
Time to restore service |
Data from Accelerate: Forsgren, Humble, Kim

Slide 30

Slide 30 text

Measure | High performers
Delivery lead time | Less than one hour
Deployment frequency | On demand
Time to restore service | Less than one hour
Data from Accelerate: Forsgren, Humble, Kim

Slide 31

Slide 31 text

Measure | High performers
Delivery lead time | Less than one hour
Deployment frequency | On demand
Time to restore service | Less than one hour
Change fail rate |
Data from Accelerate: Forsgren, Humble, Kim

Slide 32

Slide 32 text

Measure | High performers
Delivery lead time | Less than one hour
Deployment frequency | On demand
Time to restore service | Less than one hour
Change fail rate | 0 - 15%
Data from Accelerate: Forsgren, Humble, Kim
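The four measures on this slide can be computed from a simple record of deployments. The following is only a sketch under stated assumptions: the `Deployment` record, its field names and the `delivery_metrics` function are hypothetical and are not taken from the talk or from Accelerate, and using the median as the summary statistic is a choice made for this illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape, invented for this sketch.
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did the change degrade service?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def delivery_metrics(deployments, period_days):
    """Summarise the four measures over a period of `period_days` days."""
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "median_lead_time": median(lead_times) if lead_times else None,
        "deploys_per_day": len(deployments) / period_days,
        "median_time_to_restore": median(restores) if restores else None,
        "change_fail_rate": len(failures) / len(deployments) if deployments else None,
    }


# Example data: three deployments in one day, one of which failed.
start = datetime(2019, 5, 1, 9, 0)
example = [
    Deployment(start, start + timedelta(minutes=30)),
    Deployment(start, start + timedelta(minutes=20)),
    Deployment(start, start + timedelta(minutes=50), failed=True,
               restored_at=start + timedelta(minutes=90)),
]
metrics = delivery_metrics(example, period_days=1)
```

For the example data, the median lead time is 30 minutes and one change in three failed, so this hypothetical team would meet the lead-time threshold above but not the change fail rate.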

Slide 33

Slide 33 text

High performing organisations release changes frequently

Slide 34

Slide 34 text

Continuous delivery is the foundation

Slide 35

Slide 35 text

“If it hurts, do it more frequently, and bring the pain forward.”

Slide 36

Slide 36 text

Our old build and deployment process was very manual…

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

You can’t experiment when you do 12 releases a year

Slide 39

Slide 39 text

What does continuous delivery involve?

Slide 40

Slide 40 text

1. An automated build and release pipeline

Slide 41

Slide 41 text

2. Automated testing, integrated into the pipeline

Slide 42

Slide 42 text

3. Continuous integration

Slide 43

Slide 43 text

If you aren’t releasing multiple times a day, consider what is stopping you

Slide 44

Slide 44 text

You’ll probably have to change the way you architect things

Slide 45

Slide 45 text

Zero downtime deployments:
- sequential deployments
- schemaless databases
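One way to read "sequential deployments" is updating instances one at a time, so the remaining instances keep serving traffic. The sketch below illustrates that reading only. The `rolling_deploy` function and its `deploy` and `is_healthy` callbacks are hypothetical names invented here, not the FT's tooling.

```python
def rolling_deploy(instances, deploy, is_healthy):
    """Update instances one at a time and stop at the first unhealthy one.

    `deploy` and `is_healthy` are caller-supplied callables (assumptions of
    this sketch); each takes a single instance identifier.
    """
    updated = []
    for instance in instances:
        deploy(instance)
        if not is_healthy(instance):
            # Stop here: instances not yet touched still run the old version.
            return {"updated": updated, "failed": instance}
        updated.append(instance)
    return {"updated": updated, "failed": None}
```

Stopping at the first unhealthy instance leaves the untouched instances on the previous version, which is what keeps the service available while the problem is investigated.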

Slide 46

Slide 46 text

You need to be able to test and deploy your changes independently

Slide 47

Slide 47 text

You need systems - and teams - to be loosely coupled

Slide 48

Slide 48 text

Done right, microservices are loosely coupled

Slide 49

Slide 49 text

Processes also have to change

Slide 50

Slide 50 text

Often there is ‘process theatre’ around things and this can safely be removed

Slide 51

Slide 51 text

Change approval boards don’t reduce the chance of failure

Slide 52

Slide 52 text

Filling out a form for each change takes too long

Slide 53

Slide 53 text

How often do we release code at the FT?

Slide 54

Slide 54 text

Content platform releases, 2017

Slide 55

Slide 55 text

Content platform releases, 2014

Slide 56

Slide 56 text

Releasing 250 times as often

Slide 57

Slide 57 text

Changes are small, easy to understand, independent and reversible

Slide 58

Slide 58 text

<1% failure rate ~16% failure rate

Slide 59

Slide 59 text

Optimising for speed
Operating microservices

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

There are patterns and approaches that help

Slide 62

Slide 62 text

DevOps is essential for success

Slide 63

Slide 63 text

The team that builds the system *has* to operate it too

Slide 64

Slide 64 text

You can’t hand things off to another team when they change multiple times a day

Slide 65

Slide 65 text

High performing teams get to make their own decisions about tools and technology

Slide 66

Slide 66 text

Delegating tool choice to teams makes it hard for central teams to support everything

Slide 67

Slide 67 text

Make it someone else’s problem

Slide 68

Slide 68 text

https://medium.com/wardleymaps

Slide 69

Slide 69 text

Buy rather than build, unless it’s critical to your business

Slide 70

Slide 70 text

Work out what level of risk you’re comfortable with

Slide 71

Slide 71 text

“We’re not a hospital or a power station”

Slide 72

Slide 72 text

We value releasing often so we can experiment frequently

Slide 73

Slide 73 text

Accept that you will generally be in a state of ‘grey failure’

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

Retry on failure:
- backoff before retrying
- give up if it’s taking too long
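The two retry rules on this slide can be sketched as a small helper. This is an illustrative sketch, not the FT's implementation. The function name `call_with_retry`, its parameters and default values, and the use of exponential backoff with random jitter are all assumptions made for the example.

```python
import random
import time


class RetryDeadlineExceeded(Exception):
    """Raised when the operation did not succeed within the time budget."""


def call_with_retry(operation, *, max_elapsed=5.0, base_delay=0.1, max_delay=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call operation() until it succeeds or the time budget is used up.

    All parameter names and defaults are hypothetical. `sleep` and `clock`
    are injectable so the helper can be exercised without real waiting.
    """
    start = clock()
    attempt = 0
    while True:
        try:
            return operation()
        except Exception as exc:  # real code would catch only retryable errors
            attempt += 1
            # Back off before retrying: the cap on the delay doubles each
            # attempt, and the delay is randomised so that many clients do
            # not all retry at the same moment.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            # Give up if the next wait would take us past the time budget.
            if clock() - start + delay > max_elapsed:
                raise RetryDeadlineExceeded(
                    f"gave up after {attempt} attempts") from exc
            sleep(delay)
```

Bounding the total elapsed time, rather than only the number of attempts, is what implements "give up if it's taking too long": a caller further up the chain has usually stopped waiting by then.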

Slide 76

Slide 76 text

Mitigate now, fix tomorrow

Slide 77

Slide 77 text

How do you know something’s wrong?

Slide 78

Slide 78 text

Concentrate on the business capabilities

Slide 79

Slide 79 text

Synthetic monitoring

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

No content

Slide 82

Slide 82 text

No content

Slide 83

Slide 83 text

No content

Slide 84

Slide 84 text

Also helps us know things are broken even if no user is currently doing anything
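A synthetic check exercises a business capability on a schedule, so a failure shows up even when no real user is active. The sketch below assumes the capability is "publish something and read it back". The `run_synthetic_check` function and its `publish` and `read_back` callbacks are hypothetical and stand in for whatever the real system exposes.

```python
import time
import uuid


def run_synthetic_check(publish, read_back, *, timeout=5.0, poll_interval=0.5,
                        sleep=time.sleep, clock=time.monotonic):
    """Publish a uniquely named marker, then poll until it can be read back.

    Returns True if the marker became readable within `timeout` seconds,
    False otherwise. `publish` and `read_back` are caller-supplied callables
    (assumptions of this sketch), each taking the marker string.
    """
    marker = f"synthetic-{uuid.uuid4()}"
    publish(marker)
    deadline = clock() + timeout
    while clock() < deadline:
        if read_back(marker):
            return True
        sleep(poll_interval)
    return False
```

A scheduler would call this every few minutes and raise an alert when it returns False. The unique marker keeps one run from being satisfied by the result of an earlier run.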

Slide 85

Slide 85 text

Make sure you know whether *real* things are working in production

Slide 86

Slide 86 text

Our editorial team is inventive

Slide 87

Slide 87 text

What does it mean for a publish to be ‘successful’?

Slide 88

Slide 88 text

No content

Slide 89

Slide 89 text

No content

Slide 90

Slide 90 text

No content

Slide 91

Slide 91 text

No content

Slide 92

Slide 92 text

Build observability into your system

Slide 93

Slide 93 text

Observability: can you infer what’s going on in the system by looking at its external outputs?

Slide 94

Slide 94 text

Log aggregation

Slide 95

Slide 95 text

No content

Slide 96

Slide 96 text

Metrics

Slide 97

Slide 97 text

Keep it simple:
- request rate
- error rate
- duration
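The three metrics on this slide can be derived from a plain list of request records. This is a minimal sketch. The `Request` record, the `red_metrics` function, treating HTTP 5xx responses as errors, and summarising duration as a 95th percentile are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record shape, invented for this sketch.
    duration_ms: float  # how long the request took
    status: int         # HTTP status code of the response


def red_metrics(requests, window_seconds):
    """Compute request rate, error rate and a duration percentile.

    `requests` holds every request seen in a window of `window_seconds`.
    """
    total = len(requests)
    if total == 0:
        return {"request_rate": 0.0, "error_rate": 0.0, "p95_duration_ms": None}
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: the smallest value with at least
    # 95% of observations at or below it.
    rank = max(1, math.ceil(0.95 * total))
    return {
        "request_rate": total / window_seconds,  # requests per second
        "error_rate": errors / total,            # fraction of requests that failed
        "p95_duration_ms": durations[rank - 1],
    }
```

A percentile is used for duration because an average hides the slow requests that users actually notice.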

Slide 98

Slide 98 text

You’ll always be migrating *something*

Slide 99

Slide 99 text

Doing anything 150 times is painful

Slide 100

Slide 100 text

Deployment pipelines need to be templated

Slide 101

Slide 101 text

Use a service mesh

Slide 102

Slide 102 text

You’ll have services that haven’t been released for years

Slide 103

Slide 103 text

Build everything overnight?

Slide 104

Slide 104 text

Optimising for speed
Operating microservices
When people move on

Slide 105

Slide 105 text

Every system must be owned

Slide 106

Slide 106 text

If you won’t invest enough to keep it running properly, shut it down

Slide 107

Slide 107 text

Keeping documentation up to date is a challenge

Slide 108

Slide 108 text

We started with a searchable runbook library

Slide 109

Slide 109 text

System codes are very helpful

Slide 110

Slide 110 text

We needed to represent this stuff as a graph

Slide 111

Slide 111 text

No content

Slide 112

Slide 112 text

No content

Slide 113

Slide 113 text

No content

Slide 114

Slide 114 text

Helps if you can give people something in return

Slide 115

Slide 115 text

No content

Slide 116

Slide 116 text

No content

Slide 117

Slide 117 text

runbooks.md

Slide 118

Slide 118 text

No content

Slide 119

Slide 119 text

No content

Slide 120

Slide 120 text

Practice

Slide 121

Slide 121 text

“If it hurts, do it more frequently, and bring the pain forward.”

Slide 122

Slide 122 text

Failovers, database restores

Slide 123

Slide 123 text

Chaos engineering https://principlesofchaos.org/

Slide 124

Slide 124 text

Understand your steady state
Look at what you can change - minimise the blast radius
Work out what you expect to see happen
Run the experiment and see if you were right
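The four steps on this slide map onto a small experiment harness. The sketch below is illustrative only. The `run_experiment` function and its parameters are hypothetical names, and it assumes the steady state can be summarised as a single number, such as the fraction of successful requests.

```python
def run_experiment(measure, inject_failure, rollback, expected, tolerance):
    """Run one chaos experiment and report whether the prediction held.

    measure        -- returns a number describing the steady state
    inject_failure -- introduces the fault, scoped to a small blast radius
    rollback       -- removes the fault again
    expected       -- the value we predict we will observe during the fault
    tolerance      -- how far from `expected` still counts as being right
    All of these are caller-supplied and are assumptions of this sketch.
    """
    baseline = measure()      # understand your steady state
    inject_failure()          # change one small thing
    try:
        observed = measure()  # run the experiment
    finally:
        rollback()            # always undo the fault, even if measuring fails
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_held": abs(observed - expected) <= tolerance,
    }
```

Writing `expected` down before the run is the point of the third step: a surprising result is only informative if the prediction was made in advance.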