You build it, you run it

July 04, 2018

  1. You build it, you run it
    Why Developers should also be on-call
    Chris O’Dell | @ChrisAnnODell
  2. Chris O’Dell @ChrisAnnODell
    Backend Engineer at Monzo
    13+ years professional development, including:
    • 3 years unofficial second line (startups, yo)
    • 3 years dev on call supporting my own apps
    • Soon to be on call again…
  10. Telemetry
    Coda Hale – Metrics, Metrics, Everywhere https://www.youtube.com/watch?v=czes-oa
    Jordan Sissel - logging: logstash and other things https://www.youtube.com/watch?v=RuUFnog29M4
  12. Metrics Driven Development - The use of real-time metrics to drive rapid, precise, and granular software iterations
    https://sookocheff.com/post/mdd/mdd/
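
As a rough illustration of what instrumenting an application for metrics can look like, here is a minimal in-process sketch. The `Metrics` class, the metric names, and the `checkout` function are all hypothetical and not from the talk; a real service would more likely use an established metrics library and ship the values to a monitoring stack.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Hypothetical in-memory registry of counters and timings."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, amount=1):
        # Count an event, e.g. a request or an error.
        self.counters[name] += amount

    @contextmanager
    def timer(self, name):
        # Record how long the wrapped block took, even if it raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)


metrics = Metrics()


def checkout(basket):
    """Hypothetical business operation, instrumented where it matters."""
    with metrics.timer("checkout.duration_seconds"):
        metrics.incr("checkout.attempts")
        if not basket:
            metrics.incr("checkout.errors")
            raise ValueError("empty basket")
        return sum(basket)
```

With counters and timings recorded at the point where the business logic runs, an error rate (`checkout.errors` over `checkout.attempts`) and a latency distribution are available to whoever is investigating an incident.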
  13. After a point, Software & Perf bugs become more common than low-level infra ones
  14. “When things are broken, we want people with the best context trying to fix things.”
    – Blake Scrivener, Netflix SRE Manager
    “Who Owns On Call” http://increment.com/on-call/who-owns-on-call/
  15. [having a separate operations team] “creates a divide and simply doesn’t scale, it puts the onus of responsibility for fixing an issue on the wrong team.”
    – Joey Parsons, Airbnb SRE manager
    “Who Owns On Call” http://increment.com/on-call/who-owns-on-call/
  16. Ownership
    • Ownership is a prerequisite of Autonomy -> Mastery -> Purpose
    Dan Pink - Drive https://www.youtube.com/watch?v=u6XAPnuFjJc
  17. Ownership
    Being on call for your own product is more about risk management than control
  18. ABC
    Assess - Triage the incoming alerts
    Blast radius - What applications are failing?
    Compensate - Apply mitigating actions
  19. Compensating actions
    • Turn off a feature flag
    • Apply graceful degradation
    • Redeploy a known good version
    • Turn on load shedding
    • Many more…
    Ines Sombra - Architectural Patterns of Resilient Distributed Systems http://www.youtube.com/watch?v=ohvPnJYUW1E
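
Three of the compensating actions above (feature flags, graceful degradation, load shedding) could be sketched as follows. This is only an illustration under stated assumptions: `FeatureFlags`, `LoadShedder`, `handle_request`, and the `recommendations` flag are hypothetical names, and a production system would usually keep flags in a shared config service rather than in memory.

```python
import threading


class FeatureFlags:
    """Hypothetical in-memory flag store that can be flipped at runtime."""

    def __init__(self, **flags):
        self._flags = dict(flags)
        self._lock = threading.Lock()

    def is_enabled(self, name):
        with self._lock:
            # Unknown flags default to off, the safer choice in an incident.
            return self._flags.get(name, False)

    def set(self, name, enabled):
        with self._lock:
            self._flags[name] = enabled


class LoadShedder:
    """Rejects new work once too many requests are already in flight."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)


def handle_request(flags, shedder):
    """Shed load first, then serve a full or degraded response."""
    if not shedder.try_acquire():
        return "503 shed"
    try:
        if flags.is_enabled("recommendations"):
            return "200 full page"
        # Graceful degradation: the page still renders without the extra feature.
        return "200 degraded page"
    finally:
        shedder.release()
```

During an incident, the on-call engineer could call `flags.set("recommendations", False)` to take a struggling dependency out of the request path without a redeploy.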
  20. Follow up
    • Hold a blameless post-mortem soon after the event
    • Mitigating fixes go to the top of the workstream
    • Run Show & Tells of incidents
  21. “MTTR is more important than MTBF (for most types of F)”
    – John Allspaw, Author of Web Operations
    http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/
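
One way to put numbers on the relationship between the two measures is the standard steady-state availability formula, MTBF / (MTBF + MTTR). The sketch below is an illustration only; the figures are made up and the formula is a common textbook approximation, not something taken from the quoted post.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Made-up baseline: one failure every 100 hours, one hour to recover.
baseline = availability(mtbf_hours=100, mttr_hours=1)

# Two ways to improve on the baseline.
fail_half_as_often = availability(mtbf_hours=200, mttr_hours=1)
recover_twice_as_fast = availability(mtbf_hours=100, mttr_hours=0.5)
```

Under this formula, halving recovery time buys the same availability as doubling the time between failures. Recovery time is often the cheaper of the two to improve, through faster detection, practised rollbacks, and compensating actions.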
  22. Collaborate with Ops
    • Improve your understanding of how applications operate in production
    • Improve your knowledge of highly available systems to feed back into the development of your product
  23. Push the alerts
    • Do not expect engineers to sit and watch logs
    • Use an alerting tool with built-in rotations & escalations such as PagerDuty or OpsGenie
  24. Signal to Noise
    • Alerts should only be used when we would be happy waking someone up!
    • Informative & Actionable
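
The "would we wake someone up for this?" test can be written down as a routing rule. The sketch below is an assumption about how a team might encode it: the `Alert` fields, the three destinations, and the requirement that a page carries a runbook link are all hypothetical, not a feature of any particular alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert shape used only for this illustration."""
    name: str
    user_impacting: bool
    actionable: bool
    runbook_url: str = ""


def route(alert):
    """Return 'page', 'ticket', or 'dashboard' for an alert."""
    # Page only when users are affected, someone can act on it,
    # and the alert tells them where to start (informative).
    if alert.user_impacting and alert.actionable and alert.runbook_url:
        return "page"
    # Actionable but not urgent: handle it in working hours.
    if alert.actionable:
        return "ticket"
    # Nothing to do about it: keep it visible, but wake no one.
    return "dashboard"
```

For example, `route(Alert("disk 70% full", user_impacting=False, actionable=True))` returns `"ticket"`, so it waits for working hours instead of paging.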
  25. Running an on call rota
    • No more than 1 week at a time
    • Changeover on Tuesdays
    • Have an onboarding process for new engineers
    https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
  26. Running an on call rota
    • Agree a reasonable SLA for alert acknowledgement
    • Have escalation policies to provide support when needed
    https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
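
The rota rules from the two slides above (one-week shifts, Tuesday changeover, an acknowledgement SLA with escalation) could be sketched like this. The function names, the engineer names, and the 15-minute SLA are assumptions for illustration; tools such as PagerDuty or OpsGenie provide this as configuration rather than code.

```python
from datetime import date, datetime, timedelta
from typing import Sequence


def shift_start(day: date) -> date:
    """Most recent Tuesday on or before `day` (weekday(): Monday=0, Tuesday=1)."""
    return day - timedelta(days=(day.weekday() - 1) % 7)


def on_call(day: date, engineers: Sequence[str], epoch: date) -> str:
    """Engineer on call for `day`, rotating weekly from the week containing `epoch`."""
    weeks = (shift_start(day) - shift_start(epoch)).days // 7
    return engineers[weeks % len(engineers)]


def should_escalate(triggered_at: datetime, now: datetime, acknowledged: bool,
                    ack_sla: timedelta = timedelta(minutes=15)) -> bool:
    """Escalate when an alert is still unacknowledged after the agreed SLA."""
    return not acknowledged and now - triggered_at > ack_sla
```

With `engineers = ["ana", "ben", "cy"]` and `epoch = date(2018, 7, 3)` (a Tuesday), `on_call(date(2018, 7, 9), engineers, epoch)` is still `"ana"`, and the shift passes to `"ben"` on Tuesday 10 July.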
  27. Agree responsibilities
    Example
    • Devs responsible for App health & performance
    • Ops responsible for underlying infra and monitoring stack
  28. “Burnout is killing us.”
    – John Willis, Co-Author of The DevOps Handbook
    http://itrevolution.com/karojisatsu/
  29. Prevent Burnout
    • Ensure the engineers are empowered and supported to improve the applications, and thus improve the on-call experience
  30. Attributions
    Iceberg - https://www.flickr.com/photos/14730981@N08/28803627705/
    Pipeline - https://www.flickr.com/photos/cantoni/4426784542/
    CD Pipeline Photos – www.wocintechchat.com
    EKG - https://www.flickr.com/photos/vandalog/9445960751/
    Bottleneck - https://www.flickr.com/photos/aidan_jones/1691801119
    Mind the Gap – https://www.flickr.com/photos/christopherbrown/10135180454/
    Door key - https://www.flickr.com/photos/alancleaver/5577108264/
    Tyre stack – https://www.flickr.com/photos/markusspiske/14605397426/
    Punch Clock - https://www.flickr.com/photos/tjblackwell/5659432136/
    Carrot - https://www.flickr.com/photos/80375783@N00/3392828213/
    Sick pilot - https://twitter.com/AviatorInsp/status/975542614714757121
    Shift happens - https://www.flickr.com/photos/pilottheatre/9254122019
    Butting Heads - https://www.flickr.com/photos/jamiedfw/5423425957/