You build it, you run it [RebelCon 2017]

About me
Chris O’Dell, @ChrisAnnODell
Lead for Build Engineering at Skelton Thatcher Consulting
skeltonthatcher.com

Books

Team-first digital transformation
30+ organisations
UK, EU, India, China
skeltonthatcher.com

Why Developers should also be on-call

Agenda
Prerequisites:
• CD Pipeline
• Telemetry
• Metrics Driven Development
• Product ownership

Agenda
Techniques:
• On-call as “First Aid”
• Effective incident follow-up
• Running an on-call rota
• Preventing burnout

[having a separate operations team] “creates a divide and simply doesn’t scale, it puts the onus of responsibility for fixing an issue on the wrong team.”
– Joey Parsons, Airbnb SRE manager
“Who Owns On Call” http://increment.com/on-call/who-owns-on-call/

Coda Hale – Metrics, Metrics, Everywhere
https://www.youtube.com/watch?v=czes-oa0yik
Jordan Sissel – logging: logstash and other things
https://www.youtube.com/watch?v=RuUFnog29M4

https://twitter.com/samnewman/status/862268242030718976

Metrics Driven Development
The use of real-time metrics to drive rapid, precise, and granular software iterations.
https://sookocheff.com/post/mdd/mdd/
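As a minimal sketch of what instrumenting code for Metrics Driven Development could look like, the Python below counts outcomes and times a business operation. The `Metrics` class, the `checkout` function, and the metric names are hypothetical in-memory stand-ins, not something shown in the talk; a real service would send these values to its telemetry system.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Tiny in-memory registry (hypothetical; stands in for a real metrics library)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def increment(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timer(self, name):
        # Record elapsed wall-clock time even if the wrapped block raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)


metrics = Metrics()


def checkout(basket):
    """Hypothetical business operation instrumented with a counter and a timer."""
    with metrics.timer("checkout.duration_seconds"):
        if not basket:
            metrics.increment("checkout.failed")
            raise ValueError("empty basket")
        metrics.increment("checkout.succeeded")
        return sum(basket)
```

The point of the sketch is that success, failure, and latency are recorded at the business-operation level, which is the kind of signal an on-call developer can act on.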

Coda Hale – Metrics, Metrics, Everywhere
https://www.youtube.com/watch?v=czes-oa0yik

Developers On-Call is an evolution of Metrics Driven Development

Bottleneck
• After a point, Software & Perf bugs become more common than low-level infra ones
• Businesses find it difficult to hire Ops at the same rate as Dev

“When things are broken, we want people with the best context trying to fix things.”
– Blake Scrivener, Netflix SRE Manager
“Who Owns On Call” http://increment.com/on-call/who-owns-on-call/

Collaborate with Ops
• Improve your understanding of how applications operate in production
• Improve your knowledge of highly available systems to feed back into the development of your product

Ownership
• Own your product for its entire lifetime: from creation, through execution, to deletion.

“With great power, comes great responsibility.”
– Uncle Ben, Spider-Man (2002)

Ownership
• Become experts in supporting your existing tooling to feed your future choices

Operating in Production is an oft-neglected use case of a Product

Ownership
• Being on call for your own product is more about risk management than control

ABC
Airways
Breathing
Circulation

ABC
Assess - Triage the incoming alerts
Blast radius - What applications are failing?
Compensate - Apply mitigating actions

Compensating actions
• Turn off a feature flag
• Apply graceful degradation
• Redeploy a known good version
• Turn on load shedding
• Many more…
Ines Sombra – Architectural Patterns of Resilient Distributed Systems
http://www.youtube.com/watch?v=ohvPnJYUW1E
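To make three of the compensating actions above concrete, here is a rough Python sketch of a feature flag, graceful degradation, and load shedding. Everything in it (the `FLAGS` dictionary, `fetch_recommendations`, `LoadShedder`, and the limits used) is a hypothetical illustration under simple assumptions, not code from the talk or from any particular library.

```python
FLAGS = {"recommendations": True}  # hypothetical in-memory flag store


def fetch_recommendations(user_id):
    """Hypothetical downstream call that is failing during the incident."""
    raise TimeoutError("recommendation service unavailable")


def recommendations_for(user_id):
    # Feature flag: the on-call engineer can switch the feature off without a deploy.
    if not FLAGS["recommendations"]:
        return []
    try:
        return fetch_recommendations(user_id)
    except TimeoutError:
        # Graceful degradation: serve the page without recommendations.
        return []


class LoadShedder:
    """Rejects new work once a fixed number of requests are in flight."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self):
        # Shed load: refuse the request instead of queueing it.
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight = max(0, self.in_flight - 1)
```

Each action trades some functionality for stability and can be applied quickly, which fits the "first aid" framing: stop the bleeding first, investigate later.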

Do not investigate the root cause during an incident

Follow up
• Hold a blameless post-mortem soon after the event
• Mitigating fixes go to the top of the workstream
• Run Show & Tells of incidents

“MTTR is more important than MTBF (for most types of F)”
– John Allspaw, author of Web Operations
http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/
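One way to see why recovery time matters so much is the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The worked example below uses that formula with made-up numbers; it is an illustration and not necessarily the argument made in the linked post.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only: one failure every 30 days, 4 hours to recover.
baseline = availability(mtbf_hours=720, mttr_hours=4)

# Option A: halve the failure rate (double MTBF), same recovery time.
fewer_failures = availability(mtbf_hours=1440, mttr_hours=4)

# Option B: same failure rate, but recover in 1 hour instead of 4.
faster_recovery = availability(mtbf_hours=720, mttr_hours=1)

for label, value in [
    ("baseline", baseline),
    ("fewer failures", fewer_failures),
    ("faster recovery", faster_recovery),
]:
    print(f"{label}: {value:.4%}")
```

With these assumed numbers, cutting recovery time from four hours to one improves availability more than halving the failure rate does.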

Ownership of runtime success belongs to the team

https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

https://aws.amazon.com/message/41926/

Never fail the same way twice

Push the alerts
• Do not expect engineers to sit and watch logs
• Use an alerting tool with built-in rotations & escalations such as PagerDuty or OpsGenie

http://lightsecond.com/ascii_stereogram.html

Signal to Noise
• Alerts should only be used when we would be happy to wake someone up!
• Informative & Actionable

Actively combat alert fatigue

Running an on call rota
• No more than 1 week at a time
• Changeover on Tuesdays
• Have an onboarding process for new engineers
https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
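The first two rota rules above (shifts of at most one week, changeover on Tuesdays) can be sketched as a simple round-robin schedule. The function names and the engineer names below are hypothetical; alerting tools with built-in rotations would normally handle this for you.

```python
from datetime import date, timedelta
from itertools import cycle

TUESDAY = 1  # date.weekday() counts Monday as 0


def next_tuesday(day):
    """Return `day` if it is a Tuesday, otherwise the following Tuesday."""
    return day + timedelta(days=(TUESDAY - day.weekday()) % 7)


def build_rota(engineers, start, weeks):
    """Round-robin rota of one-week shifts, each starting on a Tuesday.

    Returns a list of (shift_start, shift_end, engineer) tuples, where
    shift_end is the changeover day and the start of the next shift.
    """
    shift_start = next_tuesday(start)
    rota = []
    for _, engineer in zip(range(weeks), cycle(engineers)):
        shift_end = shift_start + timedelta(days=7)
        rota.append((shift_start, shift_end, engineer))
        shift_start = shift_end
    return rota


if __name__ == "__main__":
    for start, end, who in build_rota(["Ana", "Ben", "Caro"], date(2017, 6, 15), 4):
        print(f"{start} to {end}: {who}")
```

A round-robin is the simplest possible policy; it does not account for holidays, sickness, or swaps, which the burnout slides later in the deck treat as essential.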

Running an on call rota
• Agree a reasonable SLA for alert acknowledgement
• Have escalation policies to provide support when needed
https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
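As a rough sketch of how an acknowledgement SLA and an escalation policy fit together, the Python below pages the next person in the chain each time the SLA window passes without an acknowledgement. The `EscalationPolicy` class, the role names, and the 15-minute SLA are assumptions for illustration; this is not the API of PagerDuty, OpsGenie, or any other tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class EscalationPolicy:
    levels: list        # who to page, in escalation order
    ack_sla: timedelta  # agreed time allowed to acknowledge an alert

    def who_to_page(self, raised_at, now, acknowledged=False):
        """Return who should currently be paged, or None once acknowledged."""
        if acknowledged:
            return None
        # Number of full SLA windows that have passed without an acknowledgement.
        missed_windows = (now - raised_at) // self.ack_sla
        # Stay on the last level once the chain is exhausted.
        return self.levels[min(missed_windows, len(self.levels) - 1)]


policy = EscalationPolicy(
    levels=["primary on-call", "secondary on-call", "engineering manager"],
    ack_sla=timedelta(minutes=15),
)
raised = datetime(2017, 6, 20, 3, 0)
```

For example, `policy.who_to_page(raised, raised + timedelta(minutes=20))` returns the secondary on-call, because one full 15-minute window has passed without an acknowledgement.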

Agree responsibilities
• Agree areas of responsibility for Dev & Ops
• E.g. Devs responsible for App health & performance
• Ops responsible for underlying infra and monitoring stack

“Burnout is killing us.”
– John Willis, co-author of The DevOps Handbook
http://itrevolution.com/karojisatsu/

Prevent Burnout
• Ensure your team size is large enough to allow for rest, sickness and holidays
• Allow engineers to cover for each other and swap shifts between themselves

Prevent Burnout
• Ensure the engineers have the authority and space to improve the applications and thus improve on-call

Developers On Call

Take pride in your services…

…without destroying your team

You build it, you run it.

Upcoming book: Team Guide to Build & Release Engineering
by Chris O’Dell & Manuel Pais
releasabilitybook.com

Thank you
Chris O’Dell, @ChrisAnnODell
skeltonthatcher.com

Attributions
Iceberg - https://www.flickr.com/photos/14730981@N08/28803627705/
Pipeline - https://www.flickr.com/photos/cantoni/4426784542/
CD Pipeline Photos - www.wocintechchat.com
EKG - https://www.flickr.com/photos/vandalog/9445960751/
Bottleneck - https://www.flickr.com/photos/aidan_jones/1691801119
Mind the Gap - https://www.flickr.com/photos/christopherbrown/10135180454/
Tyre stack - https://www.flickr.com/photos/markusspiske/14605397426/
Punch Clock - https://www.flickr.com/photos/tjblackwell/5659432136/
Carrot - https://www.flickr.com/photos/80375783@N00/3392828213/
