Slide 1

Slide 1 text

An Engineer’s Guide to a Good Night’s Sleep, by Nicky Wrightson (@nickywrightson)

Slide 2

Slide 2 text


Slide 3

Slide 3 text


Slide 4

Slide 4 text


Slide 5

Slide 5 text

We are building REALLY complicated distributed systems

Slide 6

Slide 6 text

“You need a mature operations team to manage lots of services, which are being redeployed regularly”
Martin Fowler, https://martinfowler.com/articles/microservice-trade-offs.html

Slide 7

Slide 7 text


Slide 8

Slide 8 text

Empowered teams means the team also controls the support

Slide 9

Slide 9 text

2014: Consumers add a caching layer to protect against our outages
2017: Our services were given an SLA of 15 mins recovery time
2018: Migration to Kubernetes completed
2019: Out-of-hours calls to 3rd line have all but disappeared

Slide 10

Slide 10 text

5 approaches to reduce the risk of being called

Slide 11

Slide 11 text

Engineer’s mindset 1

Slide 12

Slide 12 text

1

Slide 13

Slide 13 text

Enable teams to own their own support models 1

Slide 14

Slide 14 text

Operations Support Team A Support Team B 1

Slide 15

Slide 15 text

The team triages issues during the day 1

Slide 16

Slide 16 text

Engineers need to think about that out-of-hours call with every error condition 1

Slide 17

Slide 17 text

Design the severity levels within your service 1
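
One way to picture designing severity levels inside a service is to classify every error condition up front, so that only some of them can trigger an out-of-hours call. The sketch below is an illustration only: the `Severity` levels, the `ErrorCondition` fields, and the policy in `classify` are assumptions, not something taken from the talk.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    PAGE = "page"      # call someone out of hours
    TICKET = "ticket"  # triage during office hours
    LOG = "log"        # record only


@dataclass(frozen=True)
class ErrorCondition:
    name: str
    customer_facing: bool  # do customers notice this failure?
    self_healing: bool     # will the platform recover without a human?


def classify(condition: ErrorCondition) -> Severity:
    """Hypothetical policy: page only for customer-facing failures that will not heal themselves."""
    if condition.customer_facing and not condition.self_healing:
        return Severity.PAGE
    if condition.customer_facing or not condition.self_healing:
        return Severity.TICKET
    return Severity.LOG
```

With a policy like this, adding a new error condition forces the question "would this be worth a 3am call?" at design time rather than during an incident.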

Slide 18

Slide 18 text

Engineer’s mindset 1

Slide 19

Slide 19 text

Don’t get called for issues that could have been caught in office hours 2

Slide 20

Slide 20 text

Releases during the day should never wake you up at night 2

Slide 21

Slide 21 text

Can our deployment times help this? 2

Slide 22

Slide 22 text

Quick deployment 2

Slide 23

Slide 23 text

2 Get your deployment system to do automatic rollbacks
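
The idea of an automatic rollback can be sketched as a small control loop: deploy, verify a few times, and redeploy the previous version on the first failed check. This is a minimal sketch under stated assumptions; `deploy_with_rollback`, its parameters, and the injected `deploy` and `verify` callables are hypothetical stand-ins for whatever a real deployment system provides.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[str], None],
    verify: Callable[[], bool],
    new_version: str,
    previous_version: str,
    checks: int = 3,
) -> str:
    """Deploy new_version and verify it; on the first failed check, redeploy previous_version.

    Returns the version that is left running.
    """
    deploy(new_version)
    for _ in range(checks):
        if not verify():
            # Verification failed: go back to the last known good version.
            deploy(previous_version)
            return previous_version
    return new_version
```

Injecting `deploy` and `verify` keeps the loop testable in office hours without touching a real cluster.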

Slide 24

Slide 24 text

2 VERIFY VERIFY VERIFY

Slide 25

Slide 25 text

By Cindy Sridharan (@copyconstruct) 2

Slide 26

Slide 26 text

3am batch jobs are guaranteed to get you an overnight call at some point 2

Slide 27

Slide 27 text

2

Slide 28

Slide 28 text

2

Slide 29

Slide 29 text

Don’t get called for issues that could have been caught in office hours 2

Slide 30

Slide 30 text

Automate failure recovery where possible 3

Slide 31

Slide 31 text

Let your platform recover for you 3

Slide 32

Slide 32 text

Applications need to cope with change 3
Graceful Termination
Transactional
Clean restarts
Stateless
Queue Backed
Idempotent
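
Of the properties listed above, "Queue Backed" and "Idempotent" work together: if the platform restarts a pod mid-message, the queue redelivers, and an idempotent consumer makes the redelivery harmless. The sketch below is illustrative only. `IdempotentConsumer` is a hypothetical name, and the in-memory set stands in for the durable store a real service would need.

```python
from typing import Callable, Set


class IdempotentConsumer:
    """Wraps a message handler so that redelivered messages are processed at most once."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        # In-memory for illustration; this would not survive a restart.
        self._seen: Set[str] = set()

    def handle(self, message_id: str, payload: dict) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Record the id only after the handler succeeds, so a crash or
        # exception mid-processing leaves the message eligible for redelivery.
        self._seen.add(message_id)
        return True
```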

Slide 33

Slide 33 text

Multi region automatic system failovers 3

Slide 34

Slide 34 text

3

Slide 35

Slide 35 text

Multi region automatic system failovers 3

Slide 36

Slide 36 text

Healthchecks and liveness probes may not tell the whole story

Slide 37

Slide 37 text


Slide 38

Slide 38 text

Automate failure recovery where possible 3

Slide 39

Slide 39 text

Understand what your customers really care about 4

Slide 40

Slide 40 text

You want to be the first to know about a critical failure 4

Slide 41

Slide 41 text

“Only have alerts that you need to action” Sarah Wells - Director of Operations and Reliability at FT 4

Slide 42

Slide 42 text

Synthetic Requests 4
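
A synthetic request exercises a critical customer flow on a schedule, so a failure is noticed even when no real customer is using the system. The sketch below shows one possible shape for such a check. `run_synthetic_check`, `SyntheticResult`, the expected status of 200, and the latency budget are all assumptions made for illustration; the `request` callable stands in for a real call against the flow being monitored.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class SyntheticResult:
    ok: bool
    latency_s: float
    detail: str


def run_synthetic_check(
    request: Callable[[], int],
    max_latency_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
) -> SyntheticResult:
    """Run one synthetic request; `request` returns an HTTP-style status code."""
    start = clock()
    try:
        status = request()
    except Exception as exc:  # any failure of the flow counts as a failed check
        return SyntheticResult(False, clock() - start, f"request raised {exc!r}")
    latency = clock() - start
    if status != 200:
        return SyntheticResult(False, latency, f"unexpected status {status}")
    if latency > max_latency_s:
        return SyntheticResult(False, latency, "too slow")
    return SyntheticResult(True, latency, "ok")
```

The `clock` parameter exists only so the latency rule can be exercised with a fake clock instead of real waiting.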

Slide 43

Slide 43 text

Use tracing to monitor your critical flows 4
Ben Sigelman, “Restoring Confidence in Microservices: Tracing That's More Than Traces”

Slide 44

Slide 44 text

4

Slide 45

Slide 45 text

4

Slide 46

Slide 46 text

4

Slide 47

Slide 47 text

4

Slide 48

Slide 48 text

We are now flagging important events close to the code 4

Slide 49

Slide 49 text

Understand what your customers really care about 4

Slide 50

Slide 50 text

Break things and practice everything 5

Slide 51

Slide 51 text

“a method of experimenting on infrastructure that lets you expose weaknesses before they become a real problem.” 5

Slide 52

Slide 52 text

Monolith to microservice timeline 5

Slide 53

Slide 53 text

When can we release the chaos monkeys? 5

Slide 54

Slide 54 text

Manual simulation of outages works too 5

Slide 55

Slide 55 text

Spot the SPOF 5

Slide 56

Slide 56 text

Multi region automatic system failovers 5

Slide 57

Slide 57 text

Multi region automatic system failovers 5

Slide 58

Slide 58 text

Fixing things in office hours builds the team's confidence to support out of hours 5

Slide 59

Slide 59 text

Manual intervention should be simple FIX IT! 5

Slide 60

Slide 60 text

5

Slide 61

Slide 61 text

Make sure your alerts have all the relevant information to action the event 5
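
As a rough illustration of an alert that carries what is needed to action it, the sketch below bundles the summary with impact, a runbook link, a dashboard link, and recent deploys. The `Alert` class, its fields, and the rendered layout are assumptions, not the format used in the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alert:
    summary: str
    service: str
    impact: str          # what customers are experiencing
    runbook_url: str     # where the fix steps live
    dashboard_url: str   # where to look first
    recent_deploys: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the alert so the on-call engineer does not have to hunt for context."""
        lines = [
            f"{self.service}: {self.summary}",
            f"Impact: {self.impact}",
            f"Runbook: {self.runbook_url}",
            f"Dashboard: {self.dashboard_url}",
        ]
        if self.recent_deploys:
            lines.append("Recent deploys: " + ", ".join(self.recent_deploys))
        return "\n".join(lines)
```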

Slide 62

Slide 62 text

Failed requests 5

Slide 63

Slide 63 text

At 3am, just get the system to limp into office hours 5

Slide 64

Slide 64 text

Break things and practice everything 5

Slide 65

Slide 65 text

Engineer’s mindset 1

Slide 66

Slide 66 text

Don’t get called for issues that could have been caught in office hours 2

Slide 67

Slide 67 text

Automate failure recovery where possible 3

Slide 68

Slide 68 text

Understand what your customers really care about 4

Slide 69

Slide 69 text

Break things and practice everything 5

Slide 70

Slide 70 text

The engineers are the ones called at 3am. We now own this!

Slide 71

Slide 71 text

Thanks! https://speakerdeck.com/nickywrightson https://grnh.se/579803f21

Slide 72

Slide 72 text

Resources
Testing Microservices, the sane way by Cindy Sridharan: https://medium.com/@copyconstruct/testing-microservices-the-sane-way-9bb31d158c16
Microservices trade offs by Martin Fowler: https://martinfowler.com/articles/microservice-trade-offs.html
Ben Sigelman @ QCon 2019: https://www.infoq.com/presentations/microservices-distributed-tracing?itm_source=infoq&itm_medium=QCon_EarlyAccessVideos&itm_campaign=QConLondon2019
James Governor on progressive delivery: https://redmonk.com/jgovernor/2018/08/06/towards-progressive-delivery/
Charity Majors on Friday freezes: https://charity.wtf/2019/05/01/friday-deploy-freezes-are-exactly-like-murdering-puppies/