Slide 1

Slide 1 text

You build it, you run it
Why Developers should also be on-call
Chris O’Dell | @ChrisAnnODell

Slide 2

Slide 2 text

Who has been on-call?

Slide 3

Slide 3 text

Chris O’Dell
@ChrisAnnODell
Backend Engineer at Monzo
13+ years professional development inc:
● 3 years unofficial second line (startups, yo)
● 3 years dev on call supporting my own apps
● Soon to be on call again…

Slide 4

Slide 4 text

Books
http://cdwithwindows.net
http://releasabilitybook.com

Slide 5

Slide 5 text


Slide 6

Slide 6 text

The Continuous Delivery Pipeline

Slide 7

Slide 7 text


Slide 8

Slide 8 text


Slide 9

Slide 9 text


Slide 10

Slide 10 text


Slide 11

Slide 11 text


Slide 12

Slide 12 text

Dev Ops

Slide 13

Slide 13 text

Dev Ops

Slide 14

Slide 14 text

Dev Ops

Slide 15

Slide 15 text


Slide 16

Slide 16 text

Dev Ops

Slide 17

Slide 17 text

Telemetry
Coda Hale – Metrics, Metrics, Everywhere https://www.youtube.com/watch?v=czes-oa
Jordan Sissel - logging: logstash and other things https://www.youtube.com/watch?v=RuUFnog29M4

Slide 18

Slide 18 text

https://twitter.com/samnewman/status/862268242030718976

Slide 19

Slide 19 text


Slide 20

Slide 20 text

Metrics Driven Development - The use of real-time metrics to drive rapid, precise, and granular software iterations
https://sookocheff.com/post/mdd/mdd/

Slide 21

Slide 21 text

Coda Hale – Metrics, Metrics, Everywhere https://www.youtube.com/watch?v=czes-oa0yi

Slide 22

Slide 22 text

Developers On-Call is an evolution of Metrics Driven Development

Slide 23

Slide 23 text

After a point, software and performance bugs become more common than low-level infrastructure ones

Slide 24

Slide 24 text

“When things are broken, we want people with the best context trying to fix things.”
– Blake Scrivener, Netflix SRE Manager
“Who Owns On Call” http://increment.com/on-call/who-owns-on-call/

Slide 25

Slide 25 text

Dev Ops

Slide 26

Slide 26 text

[having a separate operations team] “creates a divide and simply doesn’t scale, it puts the onus of responsibility for fixing an issue on the wrong team.”
– Joey Parsons, Airbnb SRE manager
“Who Owns On Call” http://increment.com/on-call/who-owns-on-call/

Slide 27

Slide 27 text

Own your product for its entire lifetime.

Slide 28

Slide 28 text

Ownership
•Ownership is a prerequisite of Autonomy -> Mastery -> Purpose
Dan Pink - Drive https://www.youtube.com/watch?v=u6XAPnuFjJc

Slide 29

Slide 29 text

“With great power comes great responsibility.”
– Uncle Ben, Spider-Man (2002)

Slide 30

Slide 30 text

Ownership
Become experts in supporting your existing tooling to feed your future choices

Slide 31

Slide 31 text

Operating in Production is an oft-neglected use case of a Product

Slide 32

Slide 32 text

Ownership
Being on call for your own product is more about risk management than control

Slide 33

Slide 33 text

Being on-call

Slide 34

Slide 34 text

On-call engineers are the “first aiders” of software engineering

Slide 35

Slide 35 text

ABC
Airways
Breathing
Circulation

Slide 36

Slide 36 text

ABC
Assess - Triage the incoming alerts
Blast radius - What applications are failing?
Compensate - Apply mitigating actions

Slide 37

Slide 37 text

Compensating actions
•Turn off a feature flag
•Apply graceful degradation
•Redeploy a known good version
•Turn on load shedding
•Many more…
Ines Sombra - Architectural Patterns of Resilient Distributed Systems http://www.youtube.com/watch?v=ohvPnJYUW1E
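
As a rough illustration of two of the compensating actions listed on this slide, the sketch below puts a non-essential feature behind a feature flag and sheds load above a concurrency limit. Every name here (`FLAGS`, `MAX_IN_FLIGHT`, `handle_request`, the `recommendations` feature) is hypothetical and not taken from the talk; a real system would more likely use a flag service and shed load in middleware or at the load balancer.

```python
# Hypothetical in-memory flag store. In practice flags usually live in a
# flag service or config store so they can be flipped without a deploy.
FLAGS = {"recommendations": True}

# Assumed capacity limit, chosen only for this sketch.
MAX_IN_FLIGHT = 100


def handle_request(in_flight: int) -> dict:
    """Serve a request, degrading or shedding instead of failing outright."""
    # Load shedding: reject early when the service is over capacity.
    if in_flight >= MAX_IN_FLIGHT:
        return {"status": 503, "body": "busy, retry later"}

    body = {"core": "account summary"}
    # Feature flag and graceful degradation: the optional feature is simply
    # left out of the response when its flag is off.
    if FLAGS.get("recommendations", False):
        body["recommendations"] = ["..."]
    return {"status": 200, "body": body}
```

During an incident, setting `FLAGS["recommendations"] = False` removes the optional feature from responses while the core response keeps working.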

Slide 38

Slide 38 text

Do not investigate the root cause during an incident

Slide 39

Slide 39 text

Follow up
•Hold a blameless post-mortem soon after the event
•Mitigating fixes go to the top of the workstream
•Run Show & Tells of incidents

Slide 40

Slide 40 text

“MTTR is more important than MTBF (for most types of F)”
– John Allspaw, Author of Web Operations
http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/
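
One way to put numbers on this is the standard steady-state availability formula, MTBF / (MTBF + MTTR). The sketch below is an illustration and not part of the talk; the function name and the example figures are assumptions. Availability depends only on the ratio of the two values, so halving the time to recover buys the same availability as doubling the time between failures.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative figures only: one failure every 100 hours, 1 hour to recover.
baseline = availability(100, 1)
faster_recovery = availability(100, 0.5)   # halve MTTR
fewer_failures = availability(200, 1)      # double MTBF
```

Here `faster_recovery` and `fewer_failures` come out equal, and both are higher than `baseline`.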

Slide 41

Slide 41 text

Ownership of runtime success belongs to the team

Slide 42

Slide 42 text

Collaborate with Ops
•Improve your understanding of how applications operate in production
•Improve your knowledge of highly available systems to feed back into the development of your product

Slide 43

Slide 43 text

Learn from others’ mistakes

Slide 44

Slide 44 text

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

Slide 45

Slide 45 text

https://aws.amazon.com/message/41926/

Slide 46

Slide 46 text

https://www.youtube.com/watch?v=OUYTNywPk-s

Slide 47

Slide 47 text

Never fail the same way twice

Slide 48

Slide 48 text

Promote a sense of safety

Slide 49

Slide 49 text

Push the alerts
•Do not expect engineers to sit and watch logs
•Use an alerting tool with built-in rotations & escalations such as PagerDuty or OpsGenie

Slide 50

Slide 50 text

http://lightsecond.com/ascii_stereogram.html

Slide 51

Slide 51 text

Signal to Noise
•Alerts should only be used when we would be happy waking someone up!
•Informative & Actionable
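
A minimal sketch of that rule, assuming a hypothetical `Alert` record and `route` function (neither comes from the talk or from any alerting product): only alerts that are both urgent and actionable page a person, and everything else becomes a ticket for working hours.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    summary: str      # informative: says what is wrong, in plain words
    urgent: bool      # would we be happy waking someone up for this?
    actionable: bool  # is there something the responder can actually do?


def route(alert: Alert) -> str:
    """Return "page" only for urgent, actionable alerts, otherwise "ticket"."""
    if alert.urgent and alert.actionable:
        return "page"
    return "ticket"
```

For example, `route(Alert("Payments error rate above 5%", True, True))` pages, while a disk that is slowly filling but needs no action tonight is routed to a ticket.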

Slide 52

Slide 52 text

Actively combat alert fatigue

Slide 53

Slide 53 text

Running an on call rota

Slide 54

Slide 54 text

Running an on call rota
•No more than 1 week at a time
•Changeover on Tuesdays
•Have an onboarding process for new engineers
https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
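
The first two rules above can be sketched as a small schedule generator. This is an illustration under stated assumptions, not a tool from the talk: the `weekly_rota` function and the engineer names are hypothetical, and in practice a tool such as PagerDuty or OpsGenie would manage the schedule.

```python
from datetime import date, timedelta
from itertools import cycle


def weekly_rota(engineers, start, weeks):
    """Return (shift_start, shift_end, engineer) tuples for one-week shifts.

    Each shift begins on a Tuesday; the first one is the first Tuesday on or
    after `start`. Engineers are assigned in round-robin order.
    """
    # date.weekday(): Monday is 0, Tuesday is 1.
    first_tuesday = start + timedelta(days=(1 - start.weekday()) % 7)
    people = cycle(engineers)
    rota = []
    for i in range(weeks):
        shift_start = first_tuesday + timedelta(weeks=i)
        shift_end = shift_start + timedelta(weeks=1)  # next changeover
        rota.append((shift_start, shift_end, next(people)))
    return rota


# Hypothetical team of three, planning four weeks from Monday 1 January 2024.
rota = weekly_rota(["ana", "ben", "cy"], date(2024, 1, 1), 4)
```

With three engineers, each person is on call one week in three, and the fourth shift returns to the first engineer.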

Slide 55

Slide 55 text

Running an on call rota
•Agree a reasonable SLA for alert acknowledgement
•Have escalation policies to provide support when needed
https://blog.hinterlands.org/2010/07/running-an-oncall-rota/

Slide 56

Slide 56 text

Agree responsibilities
Example
•Devs responsible for App health & performance
•Ops responsible for underlying infra and monitoring stack

Slide 57

Slide 57 text

Prevent burnout

Slide 58

Slide 58 text

“Burnout is killing us.”
– John Willis, Co-Author of The DevOps Handbook
http://itrevolution.com/karojisatsu/

Slide 59

Slide 59 text

Be mindful of the cognitive weight of on-call

Slide 60

Slide 60 text

Ensure team size allows for rest, sickness and holidays

Slide 61

Slide 61 text

Prevent Burnout
•Allow engineers to cover for each other and swap shifts between themselves

Slide 62

Slide 62 text

Prevent Burnout
•Ensure the engineers are empowered and supported to improve the applications, and thus improve the on-call experience

Slide 63

Slide 63 text

Compensation over reward

Slide 64

Slide 64 text

Devs on call is fast becoming an industry standard

Slide 65

Slide 65 text

Take pride in your services…

Slide 66

Slide 66 text

…without destroying your team

Slide 67

Slide 67 text

You build it, you run it.

Slide 68

Slide 68 text

Thank you!
Chris O’Dell
@ChrisAnnODell

Slide 69

Slide 69 text

Attributions
Iceberg - https://www.flickr.com/photos/14730981@N08/28803627705/
Pipeline - https://www.flickr.com/photos/cantoni/4426784542/
CD Pipeline Photos - www.wocintechchat.com
EKG - https://www.flickr.com/photos/vandalog/9445960751/
Bottleneck - https://www.flickr.com/photos/aidan_jones/1691801119
Mind the Gap - https://www.flickr.com/photos/christopherbrown/10135180454/
Door key - https://www.flickr.com/photos/alancleaver/5577108264/
Tyre stack - https://www.flickr.com/photos/markusspiske/14605397426/
Punch Clock - https://www.flickr.com/photos/tjblackwell/5659432136/
Carrot - https://www.flickr.com/photos/80375783@N00/3392828213/
Sick pilot - https://twitter.com/AviatorInsp/status/975542614714757121
Shift happens - https://www.flickr.com/photos/pilottheatre/9254122019
Butting Heads - https://www.flickr.com/photos/jamiedfw/5423425957/

Slide 70

Slide 70 text

http://lightsecond.com/ascii_stereogram.html

Slide 71

Slide 71 text

http://lightsecond.com/ascii_stereogram.html