You build it, you run it

Chris
July 04, 2018

Transcript

  1. You build it, you run it

    Why Developers should also be on-call
    Chris O’Dell | @ChrisAnnODell
  2. Who has been on-call?

  3. Chris O’Dell @ChrisAnnODell

    Backend Engineer at Monzo
    13+ years professional development inc:
    • 3 years unofficial second line (startups, yo)
    • 3 years dev on call supporting my own apps
    • Soon to be on call again…
  4. Books

    http://cdwithwindows.net
    http://releasabilitybook.com

  6. The Continuous Delivery Pipeline

  12. Dev Ops

  13. Dev Ops

  14. Dev Ops

  16. Dev Ops

  17. Telemetry

    Coda Hale – Metrics, Metrics, Everywhere https://www.youtube.com/watch?v=czes-oa
    Jordan Sissel - logging: logstash and other things https://www.youtube.com/watch?v=RuUFnog29M4
  18. https://twitter.com/samnewman/status/862268242030718976

  20. Metrics Driven Development - The use of real-time metrics to drive rapid, precise, and granular software iterations

    https://sookocheff.com/post/mdd/mdd/
  21. Coda Hale – Metrics, Metrics, Everywhere https://www.youtube.com/watch?v=czes-oa0yi

  22. Developers On-Call is an evolution of Metrics Driven Development

  23. After a point, Software & Perf bugs become more common than low level infra ones

  24. “When things are broken, we want people with the best context trying to fix things.”

    – Blake Scrivener, Netflix SRE Manager
    “Who Owns On Call” http://increment.com/on-call/who-owns-on-call/
  25. Dev Ops

  26. [having a separate operations team] “creates a divide and simply doesn’t scale, it puts the onus of responsibility for fixing an issue on the wrong team.”

    – Joey Parsons, Airbnb SRE manager
    “Who Owns On Call” http://increment.com/on-call/who-owns-on-call/
  27. Own your product for its entire lifetime.

  28. Ownership

    • Ownership is a prerequisite of Autonomy -> Mastery -> Purpose
    Dan Pink - Drive https://www.youtube.com/watch?v=u6XAPnuFjJc
  29. “With great power, comes great responsibility.”

    – Uncle Ben, Spider-Man (2002)
  30. Ownership

    Become experts in supporting your existing tooling to feed your future choices
  31. Operating in Production is an oft neglected use-case of a Product

  32. Ownership

    Being on call for your own product is more about risk management than control
  33. Being on-call

  34. On-call engineers are the “first aiders” of software engineering

  35. ABC

    Airways
    Breathing
    Circulation
  36. ABC

    Assess - Triage the incoming alerts
    Blast radius - What applications are failing?
    Compensate - Apply mitigating actions
  37. Compensating actions

    • Turn off a feature flag
    • Apply graceful degradation
    • Redeploy a known good version
    • Turn on load shedding
    • Many more…
    Ines Sombra - Architectural Patterns of Resilient Distributed Systems http://www.youtube.com/watch?v=ohvPnJYUW1E
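
Two of the compensating actions above, turning off a feature flag and load shedding, can be sketched in a few lines. This is a minimal illustration and not code from the talk: the `Service` class, the `recommendations` flag, and the threshold of 100 in-flight requests are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Service:
    """Hypothetical request handler; all names and values here are illustrative."""

    flags: Dict[str, bool] = field(default_factory=lambda: {"recommendations": True})
    max_in_flight: int = 100  # assumed load-shedding threshold
    in_flight: int = 0        # requests currently being processed

    def handle(self) -> str:
        # Load shedding: reject early instead of letting every request slow down.
        if self.in_flight >= self.max_in_flight:
            return "503 shed"
        # Graceful degradation: with the flag off, serve the core response only.
        if not self.flags.get("recommendations", False):
            return "200 core only"
        return "200 core + recommendations"


svc = Service()
print(svc.handle())                    # normal operation
svc.flags["recommendations"] = False   # compensating action: turn off a feature flag
print(svc.handle())                    # degraded but still serving
svc.in_flight = svc.max_in_flight      # simulate saturation
print(svc.handle())                    # excess load is shed
```

Both actions are cheap to apply and to reverse, so the on-call engineer can stabilise the service first and leave the investigation for later.
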
  38. Do not investigate the root cause during an incident

  39. Follow up

    • Hold a blameless post-mortem soon after the event
    • Mitigating fixes go to the top of the workstream
    • Run Show & Tells of incidents
  40. “MTTR is more important than MTBF (for most types of F)”

    – John Allspaw, Author of Web Operations
    http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/
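
One way to see why recovery time deserves attention is the standard steady-state availability formula, MTBF / (MTBF + MTTR), treating MTBF as the mean uptime between failures. Under that formula, halving MTTR improves availability exactly as much as doubling MTBF. The sketch below uses made-up numbers (500 hours between failures, 2 hours to recover) purely for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only: a failure every 500 hours, 2 hours to recover.
baseline = availability(500, 2)
halved_mttr = availability(500, 1)     # recover twice as fast
doubled_mtbf = availability(1000, 2)   # fail half as often

print(f"baseline:     {baseline:.4%}")
print(f"halved MTTR:  {halved_mttr:.4%}")
print(f"doubled MTBF: {doubled_mtbf:.4%}")
```

The two improvements are worth the same on paper. Which one is cheaper to achieve depends on the system; the quoted slide argues for investing in recovery.
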
  41. Ownership of runtime success belongs to the team

  42. Collaborate with Ops

    • Improve your understanding of how applications operate in production
    • Improve your knowledge of highly available systems to feed back into the development of your product
  43. Learn from others’ mistakes

  44. https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

  45. https://aws.amazon.com/message/41926/

  46. https://www.youtube.com/watch?v=OUYTNywPk-s

  47. Never fail the same way twice

  48. Promote a sense of safety

  49. Push the alerts

    • Do not expect engineers to sit and watch logs
    • Use an alerting tool with built-in rotations & escalations such as PagerDuty or OpsGenie
  50. http://lightsecond.com/ascii_stereogram.html

  51. Signal to Noise

    • Alerts should only be used when we would be happy waking someone up!
    • Informative & Actionable
  52. Actively combat alert fatigue

  53. Running an on call rota

  54. Running an on call rota

    • No more than 1 week at a time
    • Changeover on Tuesdays
    • Have an onboarding process for new engineers
    https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
  55. Running an on call rota

    • Agree a reasonable SLA for alert acknowledgement
    • Have escalation policies to provide support when needed
    https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
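
The rota rules above (one-week shifts, Tuesday changeover, an acknowledgement SLA with escalation) can be sketched as follows. Tools such as PagerDuty or OpsGenie provide this out of the box; the code below is not their API. The team names, the start date, and the 15-minute SLA are assumptions made for the example.

```python
from datetime import date
from typing import List

ENGINEERS: List[str] = ["alice", "bob", "carol", "dave"]  # hypothetical team
ROTA_START = date(2018, 7, 3)  # a Tuesday; arbitrary start of the first shift
ACK_SLA_MINUTES = 15           # assumed acknowledgement SLA


def primary_on_call(day: date) -> str:
    """Each shift lasts one week and changes over on Tuesday."""
    weeks_elapsed = (day - ROTA_START).days // 7
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]


def who_to_page(day: date, minutes_unacknowledged: int) -> str:
    """Escalate to next week's engineer once the acknowledgement SLA is missed."""
    primary = primary_on_call(day)
    if minutes_unacknowledged < ACK_SLA_MINUTES:
        return primary
    secondary_index = (ENGINEERS.index(primary) + 1) % len(ENGINEERS)
    return ENGINEERS[secondary_index]


print(primary_on_call(date(2018, 7, 9)))   # Monday, still the first shift
print(primary_on_call(date(2018, 7, 10)))  # Tuesday, changeover to the next engineer
print(who_to_page(date(2018, 7, 3), 20))   # SLA missed, so the alert escalates
```

Using next week's engineer as the escalation target is one possible policy; a team lead or a dedicated secondary rota would work equally well.
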
  56. Agree responsibilities

    Example
    • Devs responsible for App health & performance
    • Ops responsible for underlying infra and monitoring stack
  57. Prevent burnout

  58. “Burnout is killing us.”

    – John Willis, Co-Author of The DevOps Handbook
    http://itrevolution.com/karojisatsu/
  59. Be mindful of the cognitive weight of on-call

  60. Ensure team size allows for rest, sickness and holidays

  61. Prevent Burnout

    • Allow for engineers to cover and swap between themselves
  62. Prevent Burnout

    • Ensure the engineers are empowered and supported to improve the applications, and thus improve on-call experience
  63. Compensation over reward

  64. Devs on call is fast becoming an industry standard

  65. Take pride in your services…

  66. …without destroying your team

  67. You build it, you run it.

  68. Thank you! Chris O’Dell @ChrisAnnODell

  69. Attributions

    Iceberg - https://www.flickr.com/photos/14730981@N08/28803627705/
    Pipeline - https://www.flickr.com/photos/cantoni/4426784542/
    CD Pipeline Photos - www.wocintechchat.com
    EKG - https://www.flickr.com/photos/vandalog/9445960751/
    Bottleneck - https://www.flickr.com/photos/aidan_jones/1691801119
    Mind the Gap - https://www.flickr.com/photos/christopherbrown/10135180454/
    Door key - https://www.flickr.com/photos/alancleaver/5577108264/
    Tyre stack - https://www.flickr.com/photos/markusspiske/14605397426/
    Punch Clock - https://www.flickr.com/photos/tjblackwell/5659432136/
    Carrot - https://www.flickr.com/photos/80375783@N00/3392828213/
    Sick pilot - https://twitter.com/AviatorInsp/status/975542614714757121
    Shift happens - https://www.flickr.com/photos/pilottheatre/9254122019
    Butting Heads - https://www.flickr.com/photos/jamiedfw/5423425957/
  70. http://lightsecond.com/ascii_stereogram.html

  71. http://lightsecond.com/ascii_stereogram.html