You build it, you run it [RebelCon 2017]

Chris
June 16, 2017

Transcript

  2. About me: Chris O’Dell, @ChrisAnnODell, Lead for Build Engineering at Skelton Thatcher Consulting, skeltonthatcher.com
  3. Books
  4. Team-first digital transformation. 30+ organisations: UK, EU, India, China. skeltonthatcher.com
  5. Why Developers should also be on-call
  6. Agenda. Prerequisites:
    •CD Pipeline
    •Telemetry
    •Metrics Driven Development
    •Product ownership
  7. Agenda. Techniques:
    •On-call as “First Aid”
    •Effective incident follow-up
    •Running an on-call rota
    •Preventing burnout
  18. [having a separate operations team] “creates a divide and simply doesn’t scale, it puts the onus of responsibility for fixing an issue on the wrong team.” – Joey Parsons, Airbnb SRE manager, “Who Owns On Call” http://increment.com/on-call/who-owns-on-call/
  20. Coda Hale – Metrics, Metrics, Everywhere https://www.youtube.com/watch?v=czes-oa0yik
    Jordan Sissel – logging: logstash and other things https://www.youtube.com/watch?v=RuUFnog29M4
  21. https://twitter.com/samnewman/status/862268242030718976

  23. Metrics Driven Development: the use of real-time metrics to drive rapid, precise, and granular software iterations. https://sookocheff.com/post/mdd/mdd/
  24. Coda Hale – Metrics, Metrics, Everywhere https://www.youtube.com/watch?v=czes-oa0yik
  25. Developers On-Call is an evolution of Metrics Driven Development

  27. Bottleneck
    •After a point, software and performance bugs become more common than low-level infra ones
    •Businesses find it difficult to hire Ops at the same rate as Dev
  28. “When things are broken, we want people with the best context trying to fix things.” – Blake Scrivener, Netflix SRE Manager, “Who Owns On Call” http://increment.com/on-call/who-owns-on-call/
  29. Collaborate with Ops
    •Improve your understanding of how applications operate in production
    •Improve your knowledge of highly available systems to feed back into the development of your product
  30. Ownership
    •Own your product for its entire lifetime: from creation, through execution, to deletion.
  31. “With great power, comes great responsibility.” – Uncle Ben, Spider-Man (2002)
  32. Ownership
    •Become experts in supporting your existing tooling to feed your future choices
  33. Operating in Production is an oft-neglected use case of a Product
  34. Ownership
    •Being on call for your own product is more about risk management than control
  36. ABC: Airway, Breathing, Circulation
  37. ABC
    Assess - Triage the incoming alerts
    Blast radius - What applications are failing?
    Compensate - Apply mitigating actions
  38. Compensating actions
    •Turn off a feature flag
    •Apply graceful degradation
    •Redeploy a known good version
    •Turn on load shedding
    •Many more…
    Ines Sombra - Architectural Patterns of Resilient Distributed Systems http://www.youtube.com/watch?v=ohvPnJYUW1E
  39. Do not investigate the root cause during an incident
  40. Follow up
    •Hold a blameless post-mortem soon after the event
    •Mitigating fixes go to the top of the workstream
    •Run Show & Tells of incidents
  41. “MTTR is more important than MTBF (for most types of F)” – John Allspaw, author of Web Operations http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/
  42. Ownership of runtime success belongs to the team

  43. https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
  44. https://aws.amazon.com/message/41926/
  45. Never fail the same way twice

  47. Push the alerts
    •Do not expect engineers to sit and watch logs
    •Use an alerting tool with built-in rotations & escalations such as PagerDuty or OpsGenie
  48. http://lightsecond.com/ascii_stereogram.html
  49. Signal to Noise
    •Alerts should only be used when we would be happy waking someone up!
    •Informative & Actionable
  50. Actively combat alert fatigue

  52. Running an on-call rota
    •No more than 1 week at a time
    •Changeover on Tuesdays
    •Have an onboarding process for new engineers
    https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
  53. Running an on-call rota
    •Agree a reasonable SLA for alert acknowledgement
    •Have escalation policies to provide support when needed
    https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
  54. Agree responsibilities
    •Agree areas of responsibility for Dev & Ops
    •E.g. Devs responsible for App health & performance
    •Ops responsible for underlying infra and monitoring stack
  56. “Burnout is killing us.” – John Willis, co-author of The DevOps Handbook http://itrevolution.com/karojisatsu/
  58. Prevent Burnout
    •Ensure your team size is large enough to allow for rest, sickness and holidays
    •Allow for engineers to cover and swap between themselves
  59. Prevent Burnout
    •Ensure the engineers have the authority and space to improve the applications and thus improve on-call
  61. Developers On Call
  62. Take pride in your services…
  63. …without destroying your team
  64. You build it, you run it.
  65. Upcoming book: Team Guide to Build & Release Engineering by Chris O’Dell & Manuel Pais. releasabilitybook.com
  66. Thank you. Chris O’Dell, @ChrisAnnODell, skeltonthatcher.com
  67. Attributions
    Iceberg - https://www.flickr.com/photos/14730981@N08/28803627705/
    Pipeline - https://www.flickr.com/photos/cantoni/4426784542/
    CD Pipeline Photos - www.wocintechchat.com
    EKG - https://www.flickr.com/photos/vandalog/9445960751/
    Bottleneck - https://www.flickr.com/photos/aidan_jones/1691801119
    Mind the Gap - https://www.flickr.com/photos/christopherbrown/10135180454/
    Tyre stack - https://www.flickr.com/photos/markusspiske/14605397426/
    Punch Clock - https://www.flickr.com/photos/tjblackwell/5659432136/
    Carrot - https://www.flickr.com/photos/80375783@N00/3392828213/
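
The rota rules on slide 52 (shifts of no more than one week, changeover on Tuesdays) can be sketched in a few lines of Python. This is only an illustration under stated assumptions and is not taken from the talk: the function names `next_tuesday` and `build_rota`, the simple round-robin ordering, and the engineer names are all hypothetical. A real team would more likely configure this in an alerting tool such as PagerDuty or OpsGenie, and would layer swaps and cover on top.

```python
from datetime import date, timedelta
from itertools import cycle

TUESDAY = 1  # date.weekday() counts Monday as 0


def next_tuesday(day: date) -> date:
    """Return `day` itself if it is a Tuesday, otherwise the next Tuesday."""
    return day + timedelta(days=(TUESDAY - day.weekday()) % 7)


def build_rota(engineers, start, weeks):
    """Assign one engineer per week-long shift, cycling through the team in order.

    Hypothetical helper: returns a list of (shift_start, shift_end, engineer)
    tuples, where every shift starts and ends on a Tuesday.
    """
    first_changeover = next_tuesday(start)
    on_call = cycle(engineers)
    rota = []
    for week in range(weeks):
        shift_start = first_changeover + timedelta(weeks=week)
        shift_end = shift_start + timedelta(weeks=1)  # handover on the following Tuesday
        rota.append((shift_start, shift_end, next(on_call)))
    return rota


if __name__ == "__main__":
    # Made-up team of three, starting from the date of the talk.
    team = ["Asha", "Ben", "Chloe"]
    for shift_start, shift_end, engineer in build_rota(team, date(2017, 6, 16), 4):
        print(f"{shift_start} to {shift_end}: {engineer}")
```

With a team of three, each engineer is on call one week in three. Slide 58's point about team size shows up directly here: the shorter the `engineers` list, the more often each name comes round.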