Slide 1

Slide 1 text

@nickywrightson An Engineer’s Guide to a Good Night’s Sleep By Nicky Wrightson @nickywrightson

Slide 2

Slide 2 text

@nickywrightson

Slide 3

Slide 3 text

@nickywrightson

Slide 4

Slide 4 text

@nickywrightson

Slide 5

Slide 5 text

@nickywrightson We are building REALLY complicated distributed systems

Slide 6

Slide 6 text

@nickywrightson “You need a mature operations team to manage lots of services, which are being redeployed regularly” Martin Fowler https://martinfowler.com/articles/microservice-trade-offs.html

Slide 7

Slide 7 text

@nickywrightson

Slide 8

Slide 8 text

@nickywrightson Empowered teams means the team also control the support

Slide 9

Slide 9 text

@nickywrightson
2014 Consumers add a caching layer to protect against our outages
2017 Our services were given an SLA of 15 mins recovery time
2018 Migration to Kubernetes completed
2019 Out of hours calls to 3rd line have all but disappeared

Slide 10

Slide 10 text

@nickywrightson 5 approaches to reduce the risk of being called

Slide 11

Slide 11 text

@nickywrightson Engineer’s mindset 1

Slide 12

Slide 12 text

@nickywrightson 1

Slide 13

Slide 13 text

@nickywrightson Enable teams to own their own support models 1

Slide 14

Slide 14 text

@nickywrightson Operations Support Team A Support Team B 1

Slide 15

Slide 15 text

@nickywrightson The team triages issues during the day 1

Slide 16

Slide 16 text

@nickywrightson Engineers need to think about that out of hours call with every error condition 1

Slide 17

Slide 17 text

@nickywrightson Design the severity levels within your service 1
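One way to read this slide, as a minimal sketch: decide for each error condition whether it justifies an out of hours call, an office-hours ticket, or only a log line. The `Severity` names, the `ErrorCondition` fields and the routing rule in `classify` are illustrative assumptions, not something shown in the talk.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    PAGE = "page"      # call someone out of hours
    TICKET = "ticket"  # triage during the day
    LOG = "log"        # record only


@dataclass(frozen=True)
class ErrorCondition:
    name: str
    customer_facing: bool  # assumption: does a customer notice this failure?
    self_recovering: bool  # assumption: will a retry or restart clear it?


def classify(cond: ErrorCondition) -> Severity:
    """Hypothetical rule: page only when customers are hurt and nothing will self-heal."""
    if cond.customer_facing and not cond.self_recovering:
        return Severity.PAGE
    if cond.customer_facing or not cond.self_recovering:
        return Severity.TICKET
    return Severity.LOG
```

Writing the rule down as code forces the question from the previous slide to be answered for every error condition, rather than leaving it to whoever configures the alerting later.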

Slide 18

Slide 18 text

@nickywrightson “The quality of a system will appear to be declining unless it is rigorously maintained” Lehman’s Laws of Software Evolution, “Declining Quality” (1996) 1

Slide 19

Slide 19 text

@nickywrightson “As a system evolves, its complexity increases unless work is done to maintain or reduce it” Lehman’s Laws of Software Evolution cont., “Increasing Complexity” (1974) 1

Slide 20

Slide 20 text

@nickywrightson Engineer’s mindset 1

Slide 21

Slide 21 text

@nickywrightson Don’t get called for issues that could have been caught in office hours 2

Slide 22

Slide 22 text

@nickywrightson Releases during the day should never wake you up at night 2

Slide 23

Slide 23 text

@nickywrightson Can our deployment times help this? 2

Slide 24

Slide 24 text

@nickywrightson Quick deployment 2

Slide 25

Slide 25 text

@nickywrightson 2 VERIFY VERIFY VERIFY

Slide 26

Slide 26 text

@nickywrightson By Cindy Sridharan (@copyconstruct) 2

Slide 27

Slide 27 text

@nickywrightson 3am batch jobs are a guarantee to get an overnight call at some point 2

Slide 28

Slide 28 text

@nickywrightson 2

Slide 29

Slide 29 text

@nickywrightson 2

Slide 30

Slide 30 text

@nickywrightson Don’t get called for issues that could have been caught in office hours 2

Slide 31

Slide 31 text

@nickywrightson Automate failure recovery where possible 3

Slide 32

Slide 32 text

@nickywrightson Let your platform recover for you 3

Slide 33

Slide 33 text

@nickywrightson Applications need to cope with change: Graceful Termination, Transactional, Clean restarts, Stateless, Queue Backed, Idempotent 3
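The "graceful termination" property on this slide can be sketched as follows, assuming the platform sends SIGTERM before it stops a process (as Kubernetes does). The worker finishes the job in hand and then exits instead of dying mid-job. The names `request_shutdown`, `install_handlers` and `run_worker` are hypothetical.

```python
import queue
import signal
import threading

# Set once the platform has asked this process to stop.
shutting_down = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: only record the request, never do work here."""
    shutting_down.set()


def install_handlers():
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, request_shutdown)


def run_worker(jobs: queue.Queue, handle) -> int:
    """Process jobs until shutdown is requested; return how many were handled."""
    handled = 0
    while not shutting_down.is_set():
        try:
            job = jobs.get(timeout=0.1)
        except queue.Empty:
            continue
        handle(job)  # the current job always runs to completion
        handled += 1
    return handled
```

Jobs still in the queue when the worker stops are left for the next instance, which is why the slide pairs this property with "Queue Backed" and "Idempotent".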

Slide 34

Slide 34 text

@nickywrightson Make your system idempotent so you can automatically replay failed events 3
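A minimal sketch of what idempotent replay can look like, assuming every event carries a unique ID: applying the same event twice leaves the state unchanged, so a batch of failed events can be replayed automatically without checking which ones partly succeeded. `Event`, `Ledger` and `replay` are hypothetical names, and the in-memory `seen` set stands in for whatever durable store a real service would use.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_id: str  # assumption: unique per event
    amount: int


@dataclass
class Ledger:
    balance: int = 0
    seen: set = field(default_factory=set)  # IDs of events already applied

    def apply(self, event: Event) -> bool:
        """Apply an event at most once; return False if it was a duplicate."""
        if event.event_id in self.seen:
            return False
        self.balance += event.amount
        self.seen.add(event.event_id)
        return True


def replay(ledger: Ledger, events: list[Event]) -> int:
    """Replay a batch and return how many events were newly applied."""
    return sum(ledger.apply(e) for e in events)
```

Replaying the same batch a second time returns 0 and leaves `balance` untouched, which is the property that makes automatic replay safe.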

Slide 35

Slide 35 text

@nickywrightson Multi region automatic system failovers 3

Slide 36

Slide 36 text

@nickywrightson Multi region automatic system failovers 3

Slide 37

Slide 37 text

@nickywrightson 3

Slide 38

Slide 38 text

@nickywrightson Our EU stack went 3

Slide 39

Slide 39 text

@nickywrightson 3

Slide 40

Slide 40 text

@nickywrightson 3

Slide 41

Slide 41 text

@nickywrightson 3

Slide 42

Slide 42 text

@nickywrightson Automate failure recovery where possible 3

Slide 43

Slide 43 text

@nickywrightson Understand what your customers really care about 4

Slide 44

Slide 44 text

@nickywrightson You want to be the first to know about a critical failure 4

Slide 45

Slide 45 text

@nickywrightson “Only have alerts that you need to action” Sarah Wells - Director of Operations and Reliability at FT 4

Slide 46

Slide 46 text

@nickywrightson Not all services are equal: Service that cleans old images from the repo != Service that takes payments 4

Slide 47

Slide 47 text

@nickywrightson Synthetic Requests 4
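A synthetic request is a known-good request sent on a schedule so that a critical flow is exercised even when no real customer is using it. The sketch below is an assumption about how such a probe might be structured, not the tooling from the talk: `send` is a hypothetical callable that performs the request and returns an HTTP status code, which keeps the example free of any network dependency.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProbeResult:
    ok: bool
    latency_s: float
    detail: str


def run_probe(send: Callable[[], int], max_latency_s: float = 1.0) -> ProbeResult:
    """Send one synthetic request and judge it on status and latency."""
    start = time.monotonic()
    try:
        status = send()
    except Exception as exc:  # any failure to get a response is a failed probe
        return ProbeResult(False, time.monotonic() - start, f"error: {exc}")
    latency = time.monotonic() - start
    if status != 200:
        return ProbeResult(False, latency, f"unexpected status {status}")
    if latency > max_latency_s:
        return ProbeResult(False, latency, "too slow")
    return ProbeResult(True, latency, "ok")
```

In use, `send` would wrap a real call against the flow customers care about (publishing an article, taking a payment), and a failed `ProbeResult` would feed the alerting path.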

Slide 48

Slide 48 text

@nickywrightson Use tracing to monitor your critical flows (Ben Sigelman @ this morning’s keynote) 4

Slide 49

Slide 49 text

@nickywrightson 4

Slide 50

Slide 50 text

@nickywrightson 4

Slide 51

Slide 51 text

@nickywrightson 4

Slide 52

Slide 52 text

@nickywrightson 4

Slide 53

Slide 53 text

@nickywrightson We are now flagging important events close to the code 4

Slide 54

Slide 54 text

@nickywrightson Understand what your customers really care about 4

Slide 55

Slide 55 text

@nickywrightson Break things and practice everything 5

Slide 56

Slide 56 text

@nickywrightson “a method of experimenting on infrastructure that lets you expose weaknesses before they become a real problem.” 5

Slide 57

Slide 57 text

@nickywrightson Monolith to microservice timeline 5

Slide 58

Slide 58 text

@nickywrightson When can we release the chaos monkeys? 5

Slide 59

Slide 59 text

@nickywrightson Manual simulation of outages works too 5

Slide 60

Slide 60 text

@nickywrightson Spot the SPOF 5

Slide 61

Slide 61 text

@nickywrightson Multi region automatic system failovers 5

Slide 62

Slide 62 text

@nickywrightson Multi region automatic system failovers 5

Slide 63

Slide 63 text

@nickywrightson Fixing things in office hours builds the team’s confidence to support out of hours 5

Slide 64

Slide 64 text

@nickywrightson Manual intervention should be simple FIX IT! 5

Slide 65

Slide 65 text

@nickywrightson 5

Slide 66

Slide 66 text

@nickywrightson Make sure your alerts have all the relevant information to action the event 5
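One way to enforce this, sketched under assumptions: treat the alert as a structured record and refuse to ship alert definitions that leave a field empty. The field list (`impact`, `runbook_url`, `dashboard_url` and so on) is a guess at what "relevant information" means for a given team, not a list from the talk.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Alert:
    service: str
    summary: str
    impact: str         # what the customer is experiencing
    runbook_url: str    # where the fix-it steps live
    dashboard_url: str  # where to look first


def missing_fields(alert: Alert) -> list[str]:
    """Return the names of fields left blank, in declaration order."""
    return [f.name for f in fields(alert) if not getattr(alert, f.name).strip()]


def render(alert: Alert) -> str:
    """Format the alert for the person on call; reject incomplete alerts."""
    missing = missing_fields(alert)
    if missing:
        raise ValueError(f"alert is missing: {', '.join(missing)}")
    return (
        f"[{alert.service}] {alert.summary}\n"
        f"Impact: {alert.impact}\n"
        f"Runbook: {alert.runbook_url}\n"
        f"Dashboard: {alert.dashboard_url}"
    )
```

Running `missing_fields` over every alert definition during the day, for example in a build step, catches the incomplete ones before they reach someone at 3am.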

Slide 67

Slide 67 text

@nickywrightson Failed requests 5

Slide 68

Slide 68 text

@nickywrightson At 3am just get the system to limp into office hours 5

Slide 69

Slide 69 text

@nickywrightson Break things and practice everything 5

Slide 70

Slide 70 text

@nickywrightson Engineer’s mindset 1

Slide 71

Slide 71 text

@nickywrightson Don’t get called for issues that could have been caught in office hours 2

Slide 72

Slide 72 text

@nickywrightson Automate failure recovery where possible 3

Slide 73

Slide 73 text

@nickywrightson Understand what your customers really care about 4

Slide 74

Slide 74 text

@nickywrightson Break things and practice everything 5

Slide 75

Slide 75 text

@nickywrightson The engineers are the ones called at 3am We now own this!

Slide 76

Slide 76 text

@nickywrightson Thanks!

Slide 77

Slide 77 text

@nickywrightson Resources
Testing Microservices, the sane way by Cindy Sridharan https://medium.com/@copyconstruct/testing-microservices-the-sane-way-9bb31d158c16
Microservice Trade-Offs by Martin Fowler https://martinfowler.com/articles/microservice-trade-offs.html
https://medium.com/netflix-techblog/vizceral-open-source-acc0c32113fe