Slide 1

Slide 1 text

Microservices and scale
Sarah Wells, Principal Engineer, Financial Times (@sarahjwells)

Slide 2

Slide 2 text

But I’m not here to talk about that

Slide 3

Slide 3 text

I’m not dealing with this kind of scaling challenge

Slide 4

Slide 4 text

The FT’s Universal Publishing Platform

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

1

Slide 7

Slide 7 text

1 2

Slide 8

Slide 8 text

Not publishing at huge scale: around 7000 publishes a day

Slide 9

Slide 9 text

1 2 3

Slide 10

Slide 10 text

14 million concepts in our graph database, or ~20GB

Slide 11

Slide 11 text

1 2 3 4

Slide 12

Slide 12 text

Around 180,000 API requests an hour

Slide 13

Slide 13 text

We’re not doing microservices to help with scale

Slide 14

Slide 14 text

They let us move faster

Slide 15

Slide 15 text

Deploys to production last year

Slide 16

Slide 16 text

Deploys to production of the monolith

Slide 17

Slide 17 text

Releasing nearly 190 times as often

Slide 18

Slide 18 text

So why am I talking about scale at all?

Slide 19

Slide 19 text

*Operating* microservices is a scale challenge

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

150+ microservices: we need to automate things

Slide 22

Slide 22 text

The challenges:
1. Provisioning and deployment
2. Monitoring and alerting
3. Logging
4. Service documentation

Slide 23

Slide 23 text

Provisioning and deployment

Slide 24

Slide 24 text

Provisioning time scale

Slide 25

Slide 25 text

Provisioning needs to take minutes

Slide 26

Slide 26 text

Deployment must be (almost entirely) automated

Slide 27

Slide 27 text

Our old process was very manual…

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

Setting up new deployment pipelines has to be quick

Slide 31

Slide 31 text

We need to be able to make global changes to the pipelines

Slide 32

Slide 32 text

Monitoring and alerting can be very noisy

Slide 33

Slide 33 text

With resilience, we have 568 instances

Slide 34

Slide 34 text

If we checked each service every minute…

Slide 35

Slide 35 text

817,920 checks per day
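The daily figure follows from one check per instance per minute. A quick arithmetic sketch (the variable names are illustrative only):

```python
INSTANCES = 568             # service instances, from the earlier slide
MINUTES_PER_DAY = 60 * 24   # one check per instance per minute

service_checks_per_day = INSTANCES * MINUTES_PER_DAY
print(f"{service_checks_per_day:,} checks per day")  # 817,920 checks per day
```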

Slide 36

Slide 36 text

One service per VM, 20 system checks, running every minute…

Slide 37

Slide 37 text

16,358,400 checks per day

Slide 38

Slide 38 text

“One-in-a-million” issues would hit us 16 times every day
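The system-check total and the "16 times every day" figure can be reproduced as follows, assuming one VM per service instance (568 VMs) and treating every check as an independent one-in-a-million chance of a false alarm. The variable names are illustrative only:

```python
INSTANCES = 568              # one VM per service instance
SYSTEM_CHECKS_PER_VM = 20    # system checks running on each VM
MINUTES_PER_DAY = 60 * 24    # each check runs once a minute

system_checks_per_day = INSTANCES * SYSTEM_CHECKS_PER_VM * MINUTES_PER_DAY
# Expected number of one-in-a-million events across all checks in a day
expected_rare_hits = system_checks_per_day / 1_000_000

print(f"{system_checks_per_day:,} checks per day")   # 16,358,400 checks per day
print(f"~{expected_rare_hits:.1f} rare hits per day")  # ~16.4 rare hits per day
```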

Slide 39

Slide 39 text

Which is why we don’t have one service per VM…

Slide 40

Slide 40 text

Running containers on shared VMs reduces this to 92,160 system checks per day

Slide 41

Slide 41 text

Still a total of 910,080 checks per day
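The total is the per-service checks plus the reduced system checks. A short sketch of the sum, which also converts the system checks to a per-minute rate (the slides do not say how many shared VMs this corresponds to, so that is left out):

```python
SERVICE_CHECKS_PER_DAY = 817_920   # one check per service instance per minute
SYSTEM_CHECKS_PER_DAY = 92_160     # system checks on the shared VMs
MINUTES_PER_DAY = 60 * 24

total_checks_per_day = SERVICE_CHECKS_PER_DAY + SYSTEM_CHECKS_PER_DAY
system_checks_per_minute = SYSTEM_CHECKS_PER_DAY // MINUTES_PER_DAY

print(f"{total_checks_per_day:,} checks per day")           # 910,080 checks per day
print(f"{system_checks_per_minute} system checks per minute")  # 64 system checks per minute
```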

Slide 42

Slide 42 text

Logging

Slide 43

Slide 43 text

~50,000 log lines per minute

Slide 44

Slide 44 text

Service documentation

Slide 45

Slide 45 text

The service registry… who owns what

Slide 46

Slide 46 text

Lots of information per service

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

Our GDPR process meant receiving 150 Google Forms…

Slide 51

Slide 51 text

How can you solve the operational scale issues?

Slide 52

Slide 52 text

1. Provisioning and deployment
2. Monitoring and alerting
3. Logging
4. Service documentation

Slide 53

Slide 53 text

Provisioning and deployment: automation and tooling

Slide 54

Slide 54 text

Invest in automation of provisioning

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

Deployment: move away from Jenkins

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

To set up deployment for a new service…

Slide 59

Slide 59 text

1. Configure CircleCI
2. Configure Docker Hub
3. Add service files to a services repo

Slide 60

Slide 60 text

Not perfect…

Slide 61

Slide 61 text

Looking at templated pipelines…
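The slides do not show what a templated pipeline would look like. As a rough sketch of the idea only: one shared template is rendered per service, so a global change means editing the template once. The template text, the registry name, and the service names below are all hypothetical, and this is not a real CircleCI configuration schema.

```python
from string import Template

# Hypothetical shared pipeline template (not a real CircleCI schema).
PIPELINE_TEMPLATE = Template(
    "service: $service\n"
    "steps:\n"
    "  - build image $registry/$service\n"
    "  - run tests\n"
    "  - deploy $service\n"
)

def render_pipeline(service: str, registry: str = "example-registry") -> str:
    """Render the pipeline definition for one service from the shared template."""
    return PIPELINE_TEMPLATE.substitute(service=service, registry=registry)

# Hypothetical service names; every pipeline comes from the same template.
pipelines = {name: render_pipeline(name) for name in ["content-api", "concept-search"]}
print(pipelines["content-api"])
```

Changing a step in PIPELINE_TEMPLATE and re-rendering updates every service's pipeline at once, which is the "global changes" property asked for earlier in the talk.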

Slide 62

Slide 62 text

Monitoring and alerting: focus on what matters

Slide 63

Slide 63 text

It’s the business functionality you should care about

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

Logging: log aggregation and transaction ids

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

Effective log aggregation needs a way to find all related logs

Slide 68

Slide 68 text

Transaction ids tie all microservices together
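A minimal sketch of how a transaction id can tie logs together: each service reuses the id it received, or creates one if there is none, writes it into every log line, and passes it on to downstream calls. The header name `X-Request-Id`, the `tid_` prefix, and all function names here are assumptions for illustration; the slides do not specify them.

```python
import logging
import uuid

TRANSACTION_HEADER = "X-Request-Id"  # assumed header name, not confirmed by the slides

def get_or_create_transaction_id(headers: dict) -> str:
    """Reuse the caller's transaction id, or mint a new one at the edge."""
    tid = headers.get(TRANSACTION_HEADER)  # note: plain dict lookup is case-sensitive
    if not tid:
        tid = "tid_" + uuid.uuid4().hex[:10]  # assumed id format
    return tid

def outgoing_headers(tid: str) -> dict:
    """Headers to attach to every downstream request."""
    return {TRANSACTION_HEADER: tid}

def log_event(logger: logging.Logger, tid: str, message: str) -> None:
    """Put the id in every log line so the aggregator can find related logs."""
    logger.info("transaction_id=%s %s", tid, message)

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("example-service")

incoming = {"X-Request-Id": "tid_abc123"}
tid = get_or_create_transaction_id(incoming)
log_event(log, tid, "received publish request")
downstream = outgoing_headers(tid)
```

Searching the aggregated logs for `transaction_id=tid_abc123` would then return the lines from every service that handled the request.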

Slide 69

Slide 69 text

No content

Slide 70

Slide 70 text

Documentation: standards, templates, automation, tooling

Slide 71

Slide 71 text

Executable documentation

Slide 72

Slide 72 text

Healthchecks

Slide 73

Slide 73 text

The FT healthcheck standard
GET http://{service}/__health

Slide 74

Slide 74 text

The FT healthcheck standard
GET http://{service}/__health
returns 200 if the service can run the healthcheck

Slide 75

Slide 75 text

The FT healthcheck standard
GET http://{service}/__health
returns 200 if the service can run the healthcheck
each check will return "ok": true or "ok": false
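A minimal sketch of the behaviour described above: the endpoint answers 200 whenever it was able to run its checks, and the health of each dependency is reported per check through an "ok" flag. Only the `/__health` path, the 200 status, and the "ok" field come from the slides; the other field names, the function names, and the example checks are assumptions.

```python
import json
from typing import Callable, Dict, Tuple

def run_healthcheck(name: str, checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run every check and build the response for GET /__health."""
    results = []
    for check_name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that raises is reported as failing
        results.append({"name": check_name, "ok": ok})
    # 200 means "the healthcheck ran", not "everything is healthy"
    return 200, json.dumps({"name": name, "checks": results})

def database_reachable() -> bool:   # hypothetical check
    return True

def queue_reachable() -> bool:      # hypothetical check that fails
    raise ConnectionError("queue down")

status, body = run_healthcheck(
    "example-service",
    {"database": database_reachable, "queue": queue_reachable},
)
print(status, body)
```

Keeping the status code at 200 lets a monitoring tool tell "the service is down" (no response) apart from "the service is up but a dependency is failing" (a check with "ok": false).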

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

No content

Slide 78

Slide 78 text

Healthchecks are unit tested
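The slides do not show the tests themselves. As an illustration of the idea only, here is a hypothetical check written as a pure function so that both its passing and failing results can be tested without a running service:

```python
import unittest

def disk_space_check(free_bytes: int, minimum_bytes: int) -> dict:
    """Hypothetical healthcheck: ok when free space meets the minimum."""
    return {"name": "disk-space", "ok": free_bytes >= minimum_bytes}

class DiskSpaceCheckTest(unittest.TestCase):
    def test_ok_when_enough_space(self):
        self.assertTrue(disk_space_check(2_000, 1_000)["ok"])

    def test_not_ok_when_space_is_low(self):
        self.assertFalse(disk_space_check(500, 1_000)["ok"])

# Run the tests programmatically so the example works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DiskSpaceCheckTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```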

Slide 79

Slide 79 text

Keeping information near to the code

Slide 80

Slide 80 text

Update automatically on deploy

Slide 81

Slide 81 text

Other teams need to adapt too

Slide 82

Slide 82 text

Change and release management

Slide 83

Slide 83 text

2256 releases = 53 working days doing CRs
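The slide does not state how long one change request (CR) takes. Working backwards from the two figures given, and assuming an 8-hour working day (an assumption, not from the slides), the implied time per CR is a little over eleven minutes:

```python
RELEASES = 2256              # releases, each needing a change request
WORKING_DAYS = 53            # total time spent on CRs, from the slide
HOURS_PER_WORKING_DAY = 8    # assumption: not stated in the slides

minutes_per_cr = WORKING_DAYS * HOURS_PER_WORKING_DAY * 60 / RELEASES
print(f"~{minutes_per_cr:.1f} minutes per change request")  # ~11.3 minutes per change request
```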

Slide 84

Slide 84 text

Automation, again

Slide 85

Slide 85 text

No content

Slide 86

Slide 86 text

No content

Slide 87

Slide 87 text

GitHub webhook for our CRs
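The slides do not show how the webhook is wired up. One possible shape, as a sketch: verify GitHub's `X-Hub-Signature-256` header, then turn the event payload into a change-request record. Using the `release` event, the shape of the CR record, and the example values are all assumptions; the payload fields read here (`repository.full_name`, `release.tag_name`, `sender.login`) are part of GitHub's release event.

```python
import hashlib
import hmac
import json

def signature_is_valid(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Check the X-Hub-Signature-256 header GitHub sends with each delivery."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)

def change_request_from_release(event: dict) -> dict:
    """Build a hypothetical CR record from a GitHub release event."""
    return {
        "system": event["repository"]["full_name"],
        "version": event["release"]["tag_name"],
        "requested_by": event["sender"]["login"],
    }

# Simulated delivery with made-up values, signed the way GitHub signs it.
secret = b"example-secret"
payload = json.dumps({
    "action": "published",
    "release": {"tag_name": "v1.2.3"},
    "repository": {"full_name": "example-org/example-service"},
    "sender": {"login": "example-user"},
}).encode()
signature = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()

cr = None
if signature_is_valid(secret, payload, signature):
    cr = change_request_from_release(json.loads(payload))
    print(cr)
```

In a real setup the resulting record would be sent to whatever change-management system is in use; that call is omitted here because the slides do not name it.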

Slide 88

Slide 88 text

No content

Slide 89

Slide 89 text

First line support

Slide 90

Slide 90 text

There are many different technologies for them to understand now

Slide 91

Slide 91 text

Our development teams don’t know the whole system either…

Slide 92

Slide 92 text

No content

Slide 93

Slide 93 text

Operating microservices *is* a challenge

Slide 94

Slide 94 text

The benefits can be worth it…

Slide 95

Slide 95 text

Deploys to production last year

Slide 96

Slide 96 text

Deploys to production of the monolith

Slide 97

Slide 97 text

But you have to be prepared to pay the cost

Slide 98

Slide 98 text

Thank you!