Slide 1

Slide 1 text

An Engineer’s Guide to a Good Night’s Sleep, by Nicky Wrightson (@nickywrightson)

Slide 2

Slide 2 text


Slide 3

Slide 3 text


Slide 4

Slide 4 text


Slide 5

Slide 5 text

We are building REALLY complicated distributed systems

Slide 6

Slide 6 text

“You need a mature operations team to manage lots of services, which are being redeployed regularly”
Martin Fowler, https://martinfowler.com/articles/microservice-trade-offs.html

Slide 7

Slide 7 text

Empowered teams means the team also controls the support

Slide 8

Slide 8 text

2014: Consumers add a caching layer to protect against our outages
2017: Our services were given an SLA of 15 minutes recovery time
2018: Migration to Kubernetes completed
2019: Out of hours calls to 3rd line have all but disappeared

Slide 9

Slide 9 text

5 approaches to reduce the risk of being called

Slide 10

Slide 10 text

Engineer’s mindset 1

Slide 11

Slide 11 text


Slide 12

Slide 12 text

Enable teams to own their own support models 1

Slide 13

Slide 13 text

Operations Support Team A Support Team B 1

Slide 14

Slide 14 text

The team triages issues during the day 1

Slide 15

Slide 15 text

Engineers need to think about that out of hours call with every error condition 1

Slide 16

Slide 16 text

Design the severity levels within your service 1
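
One way to picture severity levels designed into a service is the minimal sketch below. The level names and the paging rule are assumptions made for illustration; the slides do not define them.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical severity levels; the names are illustrative only."""
    CRITICAL = 1  # a customer-facing flow is down
    DEGRADED = 2  # the service is limping but still serving
    INFO = 3      # worth logging, nothing to action


def should_page_out_of_hours(severity: Severity) -> bool:
    # Assumed rule: only CRITICAL errors justify an out of hours call;
    # everything else waits for the team's daytime triage.
    return severity is Severity.CRITICAL
```

Deciding the level where the error is raised forces the out of hours question to be answered for every error condition.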

Slide 17

Slide 17 text

Engineer’s mindset 1

Slide 18

Slide 18 text

Don’t get called for issues that could have been caught in office hours 2

Slide 19

Slide 19 text

Releases during the day should never wake you up at night 2

Slide 20

Slide 20 text

Can our deployment times help this? 2

Slide 21

Slide 21 text

Get your deployment system to do automatic rollbacks 2
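
A minimal sketch of the automatic rollback idea, assuming the deployment system can be reduced to three callables. The names `deploy`, `verify` and `rollback` are hypothetical and do not belong to any real tool.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], None],
    verify: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Deploy, verify the release, and roll back if verification fails."""
    deploy()
    try:
        healthy = verify()
    except Exception:
        # A verification step that crashes is treated as a failed release.
        healthy = False
    if not healthy:
        rollback()
    return healthy
```

The point of the sketch is that the rollback decision is made by the pipeline straight after the release, not by an engineer hours later.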

Slide 22

Slide 22 text

VERIFY VERIFY VERIFY 2

Slide 23

Slide 23 text

By Cindy Sridharan (@copyconstruct) 2

Slide 24

Slide 24 text

3am batch jobs are guaranteed to get you an overnight call at some point 2

Slide 25

Slide 25 text

Don’t get called for issues that could have been caught in office hours 2

Slide 26

Slide 26 text

Automate failure recovery where possible 3

Slide 27

Slide 27 text

Let your platform recover for you 3

Slide 28

Slide 28 text

Applications need to cope with change: graceful termination, transactional, clean restarts, stateless, queue backed, idempotent 3
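
As an illustration of the idempotent, queue-backed property, here is a minimal sketch of a message handler that can safely see the same message twice. The class name, the message id and the in-memory set are assumptions; a real service would need a durable store shared between instances.

```python
from typing import Any, Callable, Set


class IdempotentHandler:
    """Applies each message at most once, keyed by a message id."""

    def __init__(self, apply: Callable[[Any], None]) -> None:
        self._apply = apply
        # Assumption: an in-memory set stands in for durable storage.
        self._seen: Set[str] = set()

    def handle(self, message_id: str, payload: Any) -> bool:
        """Return True if the payload was applied, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._apply(payload)
        # Mark as seen only after a successful apply, so a crash
        # part-way through leaves the message eligible for redelivery.
        self._seen.add(message_id)
        return True
```

With a handler like this, a platform restart that causes the queue to redeliver a message does not apply it twice.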

Slide 29

Slide 29 text

Multi region automatic system failovers 3

Slide 30

Slide 30 text


Slide 31

Slide 31 text

Multi region automatic system failovers 3

Slide 32

Slide 32 text

Healthchecks and liveness probes may not tell the whole story

Slide 33

Slide 33 text


Slide 34

Slide 34 text

Automate failure recovery where possible 3

Slide 35

Slide 35 text

Understand what your customers really care about 4

Slide 36

Slide 36 text

You want to be the first to know about a critical failure 4

Slide 37

Slide 37 text

“Only have alerts that you need to action”
Sarah Wells, Director of Operations and Reliability at FT 4

Slide 38

Slide 38 text

Synthetic Requests 4
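
A synthetic request can be sketched as a scripted call through a critical flow that is judged the way a customer would judge it. In the minimal sketch below, `send_request` is a hypothetical callable that performs the request and returns an HTTP status code; the 200 check and the time budget are assumptions, not rules from the talk.

```python
import time
from typing import Callable


def synthetic_check(send_request: Callable[[], int], max_seconds: float) -> bool:
    """Return True if the synthetic request succeeded within the time budget."""
    start = time.monotonic()
    try:
        status = send_request()
    except Exception:
        # Any failure to complete the request counts as a failed check.
        return False
    elapsed = time.monotonic() - start
    return status == 200 and elapsed <= max_seconds
```

Run on a schedule, a check like this can raise an alert for a broken flow before a customer reports it.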

Slide 39

Slide 39 text

Use tracing to monitor your critical flows 4
Ben Sigelman, “Restoring Confidence in Microservices: Tracing That's More Than Traces”

Slide 40

Slide 40 text


Slide 41

Slide 41 text


Slide 42

Slide 42 text


Slide 43

Slide 43 text


Slide 44

Slide 44 text

Understand what your customers really care about 4

Slide 45

Slide 45 text

Break things and practice everything 5

Slide 46

Slide 46 text

“a method of experimenting on infrastructure that lets you expose weaknesses before they become a real problem.” 5

Slide 47

Slide 47 text

Monolith to microservice timeline 5

Slide 48

Slide 48 text

When can we release the chaos monkeys? 5

Slide 49

Slide 49 text

Manual simulation of outages works too 5

Slide 50

Slide 50 text

Spot the SPOF 5

Slide 51

Slide 51 text

Multi region automatic system failovers 5

Slide 52

Slide 52 text

Multi region automatic system failovers 5

Slide 53

Slide 53 text

Manual intervention should be simple FIX IT! 5

Slide 54

Slide 54 text


Slide 55

Slide 55 text

Provide all the information to action an alert 5

Slide 56

Slide 56 text

At 3am just get the system to limp into office hours 5

Slide 57

Slide 57 text

Break things and practice everything 5

Slide 58

Slide 58 text

Engineer’s mindset 1

Slide 59

Slide 59 text

Don’t get called for issues that could have been caught in office hours 2

Slide 60

Slide 60 text

Automate failure recovery where possible 3

Slide 61

Slide 61 text

Understand what your customers really care about 4

Slide 62

Slide 62 text

Break things and practice everything 5

Slide 63

Slide 63 text

The engineers are the ones called at 3am. We now own this!

Slide 64

Slide 64 text

Thanks! https://speakerdeck.com/nickywrightson We are hiring!

Slide 65

Slide 65 text

Resources
Testing Microservices, the sane way by Cindy Sridharan: https://medium.com/@copyconstruct/testing-microservices-the-sane-way-9bb31d158c16
Microservices trade offs by Martin Fowler: https://martinfowler.com/articles/microservice-trade-offs.html
Ben Sigelman @ QCon 2019: https://www.infoq.com/presentations/microservices-distributed-tracing?itm_source=infoq&itm_medium=QCon_EarlyAccessVideos&itm_campaign=QConLondon2019
James Governor on progressive delivery: https://redmonk.com/jgovernor/2018/08/06/towards-progressive-delivery/
Charity Majors on Friday freezes: https://charity.wtf/2019/05/01/friday-deploy-freezes-are-exactly-like-murdering-puppies/
Skyscanner hiring link: https://grnh.se/579803f21