Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

About me
Chris O’Dell, @ChrisAnnODell
Lead for Build Engineering at Skelton Thatcher Consulting
skeltonthatcher.com

Slide 3

Slide 3 text

Books

Slide 4

Slide 4 text

Team-first digital transformation
30+ organisations
UK, EU, India, China
skeltonthatcher.com

Slide 5

Slide 5 text

Why Developers should also be on-call

Slide 6

Slide 6 text

Agenda
Prerequisites:
• CD Pipeline
• Telemetry
• Metrics Driven Development
• Product ownership

Slide 7

Slide 7 text

Agenda
Techniques:
• On-call as “First Aid”
• Effective incident follow-up
• Running an on-call rota
• Preventing burnout

Slide 8

Slide 8 text

8

Slide 9

Slide 9 text

9

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

[having a separate operations team] “creates a divide and simply doesn’t scale, it puts the onus of responsibility for fixing an issue on the wrong team.”
– Joey Parsons, Airbnb SRE manager
“Who Owns On Call” http://increment.com/on-call/who-owns-on-call/

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

Coda Hale – Metrics, Metrics, Everywhere https://www.youtube.com/watch?v=czes-oa0yik
Jordan Sissel – logging: logstash and other things https://www.youtube.com/watch?v=RuUFnog29M4

Slide 21

Slide 21 text

https://twitter.com/samnewman/status/862268242030718976

Slide 22

Slide 22 text

22

Slide 23

Slide 23 text

Metrics Driven Development
The use of real-time metrics to drive rapid, precise, and granular software iterations.
https://sookocheff.com/post/mdd/mdd/

Slide 24

Slide 24 text

Coda Hale – Metrics, Metrics, Everywhere https://www.youtube.com/watch?v=czes-oa0yik

Slide 25

Slide 25 text

Developers On-Call is an evolution of Metrics Driven Development

Slide 26

Slide 26 text

26

Slide 27

Slide 27 text

Bottleneck
• After a point, software & performance bugs become more common than low-level infra ones
• Businesses find it difficult to hire Ops at the same rate as Dev

Slide 28

Slide 28 text

“When things are broken, we want people with the best context trying to fix things.”
– Blake Scrivener, Netflix SRE Manager
“Who Owns On Call” http://increment.com/on-call/who-owns-on-call/

Slide 29

Slide 29 text

Collaborate with Ops
• Improve your understanding of how applications operate in production
• Improve your knowledge of highly available systems to feed back into the development of your product

Slide 30

Slide 30 text

Ownership
• Own your product for its entire lifetime: from creation, through execution, to deletion.

Slide 31

Slide 31 text

“With great power, comes great responsibility.”
– Uncle Ben, Spider-Man (2002)

Slide 32

Slide 32 text

Ownership
• Become experts in supporting your existing tooling to feed your future choices

Slide 33

Slide 33 text

Operating in Production is an oft-neglected use case of a Product

Slide 34

Slide 34 text

Ownership
• Being on call for your own product is more about risk management than control

Slide 35

Slide 35 text

35

Slide 36

Slide 36 text

ABC
Airway
Breathing
Circulation

Slide 37

Slide 37 text

ABC
Assess - Triage the incoming alerts
Blast radius - What applications are failing?
Compensate - Apply mitigating actions

Slide 38

Slide 38 text

Compensating actions
• Turn off a feature flag
• Apply graceful degradation
• Redeploy a known good version
• Turn on load shedding
• Many more…
Ines Sombra - Architectural Patterns of Resilient Distributed Systems http://www.youtube.com/watch?v=ohvPnJYUW1E
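
Three of the compensating actions above (a feature flag, graceful degradation, and load shedding) can be sketched in a few lines of Python. This is an illustration only and is not taken from the talk: the names `FeatureFlags`, `LoadShedder`, `handle_request`, and the "recommendations" flag are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """In-memory flag store (assumption: a real system would use a flag service)."""
    flags: dict = field(default_factory=dict)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so a missing flag fails safe.
        return self.flags.get(name, False)

    def turn_off(self, name: str) -> None:
        self.flags[name] = False


@dataclass
class LoadShedder:
    """Rejects new work once too many requests are in flight."""
    max_in_flight: int
    in_flight: int = 0
    enabled: bool = False  # switched on by the on-call engineer during an incident

    def try_acquire(self) -> bool:
        if self.enabled and self.in_flight >= self.max_in_flight:
            return False  # shed this request instead of queueing it
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight = max(0, self.in_flight - 1)


def handle_request(flags: FeatureFlags, shedder: LoadShedder) -> str:
    if not shedder.try_acquire():
        return "503 busy"
    try:
        if flags.is_enabled("recommendations"):
            return "200 full page"
        # Graceful degradation: serve the page without the optional feature.
        return "200 basic page"
    finally:
        shedder.release()
```

Both switches change behaviour without a redeploy, which is what makes them usable as first aid during an incident.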

Slide 39

Slide 39 text

Do not investigate the root cause during an incident

Slide 40

Slide 40 text

Follow up
• Hold a blameless post-mortem soon after the event
• Mitigating fixes go to the top of the workstream
• Run Show & Tells of incidents

Slide 41

Slide 41 text

“MTTR is more important than MTBF (for most types of F)”
– John Allspaw, Author of Web Operations
http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/
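
For readers unfamiliar with the two terms, here is a small Python sketch of how MTTR and MTBF might be computed from an incident log. The incident data and function names are hypothetical. It assumes MTBF is measured as the uptime between one recovery and the next failure, and uses the common steady-state formula availability = MTBF / (MTBF + MTTR).

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (started, resolved) pairs.
INCIDENTS = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 11, 10, 30), datetime(2024, 1, 11, 11, 30)),
    (datetime(2024, 1, 21, 11, 30), datetime(2024, 1, 21, 12, 0)),
]


def mttr(incidents):
    """Mean time to recovery: average of (resolved - started)."""
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def mtbf(incidents):
    """Mean time between failures: average uptime from a recovery to the next failure."""
    ordered = sorted(incidents)
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


def availability(mtbf_value, mttr_value):
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf_value / (mtbf_value + mttr_value)


if __name__ == "__main__":
    print("MTTR:", mttr(INCIDENTS))
    print("MTBF:", mtbf(INCIDENTS))
    print("Availability:", availability(mtbf(INCIDENTS), mttr(INCIDENTS)))
```

MTTR is the number a team on call can influence most directly, through faster detection and well-rehearsed compensating actions.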

Slide 42

Slide 42 text

Ownership of runtime success belongs to the team

Slide 43

Slide 43 text

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

Slide 44

Slide 44 text

https://aws.amazon.com/message/41926/

Slide 45

Slide 45 text

Never fail the same way twice

Slide 46

Slide 46 text

46

Slide 47

Slide 47 text

Push the alerts
• Do not expect engineers to sit and watch logs
• Use an alerting tool with built-in rotations & escalations, such as PagerDuty or OpsGenie

Slide 48

Slide 48 text

http://lightsecond.com/ascii_stereogram.html

Slide 49

Slide 49 text

Signal to Noise
• Only raise an alert when we would be happy to wake someone up for it!
• Alerts must be informative & actionable
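
One way to enforce the two rules above is to validate alert definitions in code. The Python sketch below is a hypothetical illustration, not a real alerting tool's API: the `Alert` class, its fields, and the "page"/"ticket" destinations are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    summary: str                  # informative: what is wrong, in plain words
    urgent: bool                  # would we be happy waking someone up for this?
    action: Optional[str] = None  # actionable: what the responder should do first

    def __post_init__(self):
        # Assumed rule: anything that can page a human must say what to do.
        if self.urgent and not self.action:
            raise ValueError(f"urgent alert {self.summary!r} has no action")


def route(alert: Alert) -> str:
    """Urgent alerts page the on-call engineer; the rest wait for working hours."""
    return "page" if alert.urgent else "ticket"
```

For example, `route(Alert("disk 70% full", urgent=False))` goes to the ticket queue, while defining an urgent alert without an `action` fails immediately instead of producing a page nobody can act on.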

Slide 50

Slide 50 text

Actively combat alert fatigue

Slide 51

Slide 51 text

51

Slide 52

Slide 52 text

Running an on-call rota
• No more than 1 week at a time
• Changeover on Tuesdays
• Have an onboarding process for new engineers
https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
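
The first two rules above (one-week shifts, Tuesday changeover) are easy to express in code. The Python sketch below is an assumption-laden illustration: the function names and the simple round-robin assignment are hypothetical, and a tool such as PagerDuty or OpsGenie would normally manage this for you.

```python
from datetime import date, timedelta
from itertools import cycle

TUESDAY = 1  # date.weekday() counts Monday as 0


def next_tuesday(day: date) -> date:
    """Return `day` if it is a Tuesday, otherwise the following Tuesday."""
    return day + timedelta(days=(TUESDAY - day.weekday()) % 7)


def build_rota(engineers, start: date, weeks: int):
    """Assign one engineer per week, round-robin, with shifts running Tuesday to Tuesday."""
    shift_start = next_tuesday(start)
    rota = []
    for _, engineer in zip(range(weeks), cycle(engineers)):
        shift_end = shift_start + timedelta(weeks=1)
        rota.append((engineer, shift_start, shift_end))
        shift_start = shift_end
    return rota


if __name__ == "__main__":
    for engineer, begins, ends in build_rota(["Ana", "Ben", "Cat"], date(2024, 1, 1), 4):
        print(f"{engineer}: {begins} to {ends}")
```

A midweek changeover keeps handovers away from weekends and Monday catch-up.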

Slide 53

Slide 53 text

Running an on-call rota
• Agree a reasonable SLA for alert acknowledgement
• Have escalation policies to provide support when needed
https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
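
An acknowledgement SLA and an escalation policy fit together as a list of steps, each with its own time limit. The Python sketch below is hypothetical: the contacts, the time limits, and the names `EscalationStep` and `who_to_notify` are assumptions chosen for illustration, not values recommended by the talk.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class EscalationStep:
    contact: str
    ack_within: timedelta  # acknowledgement SLA for this step


# Hypothetical policy: values are placeholders to be agreed by the team.
POLICY = [
    EscalationStep("primary on-call", timedelta(minutes=15)),
    EscalationStep("secondary on-call", timedelta(minutes=15)),
    EscalationStep("engineering manager", timedelta(minutes=30)),
]


def who_to_notify(policy, elapsed: timedelta) -> str:
    """Return who should hold an unacknowledged alert `elapsed` after it fired."""
    deadline = timedelta()
    for step in policy:
        deadline += step.ack_within
        if elapsed < deadline:
            return step.contact
    # Past the final deadline: stay with the last contact in the chain.
    return policy[-1].contact
```

With this policy, an alert still unacknowledged after 20 minutes has moved from the primary to the secondary on-call engineer.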

Slide 54

Slide 54 text

Agree responsibilities
• Agree areas of responsibility for Dev & Ops
• E.g. Devs responsible for app health & performance
• Ops responsible for underlying infra and monitoring stack

Slide 55

Slide 55 text

55

Slide 56

Slide 56 text

“Burnout is killing us.”
– John Willis, Co-Author of The DevOps Handbook
http://itrevolution.com/karojisatsu/

Slide 57

Slide 57 text

57

Slide 58

Slide 58 text

Prevent Burnout
• Ensure your team is large enough to allow for rest, sickness and holidays
• Allow engineers to cover for each other and swap shifts between themselves

Slide 59

Slide 59 text

Prevent Burnout
• Ensure the engineers have the authority and space to improve the applications, and thus improve on-call

Slide 60

Slide 60 text

60

Slide 61

Slide 61 text

Developers On Call

Slide 62

Slide 62 text

Take pride in your services…

Slide 63

Slide 63 text

…without destroying your team

Slide 64

Slide 64 text

You build it, you run it.

Slide 65

Slide 65 text

Upcoming book: Team Guide to Build & Release Engineering by Chris O’Dell & Manuel Pais
releasabilitybook.com

Slide 66

Slide 66 text

Thank you
Chris O’Dell, @ChrisAnnODell
skeltonthatcher.com

Slide 67

Slide 67 text

Attributions
Iceberg - https://www.flickr.com/photos/14730981@N08/28803627705/
Pipeline - https://www.flickr.com/photos/cantoni/4426784542/
CD Pipeline Photos - www.wocintechchat.com
EKG - https://www.flickr.com/photos/vandalog/9445960751/
Bottleneck - https://www.flickr.com/photos/aidan_jones/1691801119
Mind the Gap - https://www.flickr.com/photos/christopherbrown/10135180454/
Tyre stack - https://www.flickr.com/photos/markusspiske/14605397426/
Punch Clock - https://www.flickr.com/photos/tjblackwell/5659432136/
Carrot - https://www.flickr.com/photos/80375783@N00/3392828213/