You build it, you run it [RebelCon 2017]

Chris
June 16, 2017

Transcript

  1. 1.
  2. 2.

    About me
    Chris O’Dell, @ChrisAnnODell
    Lead for Build Engineering at Skelton Thatcher Consulting
    skeltonthatcher.com
  3. 3.
  4. 7.

    Agenda
    Techniques:
    • On-call as “First Aid”
    • Effective incident follow-up
    • Running an on-call rota
    • Preventing burnout
  5. 8.

  6. 9.

  7. 10.
  8. 11.
  9. 12.
  10. 13.
  11. 14.
  12. 15.
  13. 16.
  14. 17.
  15. 18.

    [having a separate operations team] “creates a divide and simply doesn’t scale, it puts the onus of responsibility for fixing an issue on the wrong team.”
    – Joey Parsons, Airbnb SRE manager
    “Who Owns On Call” http://increment.com/on-call/who-owns-on-call/
  16. 19.
  17. 20.

    Coda Hale - Metrics, Metrics, Everywhere https://www.youtube.com/watch?v=czes-oa0yik
    Jordan Sissel - logging: logstash and other things https://www.youtube.com/watch?v=RuUFnog29M4
  18. 22.

  19. 23.

    Metrics Driven Development
    The use of real-time metrics to drive rapid, precise, and granular software iterations.
    https://sookocheff.com/post/mdd/mdd/
  20. 26.

  21. 27.

    Bottleneck
    • After a point, software and performance bugs become more common than low-level infrastructure ones
    • Businesses find it difficult to hire Ops at the same rate as Dev
  22. 28.

    “When things are broken, we want people with the best context trying to fix things.”
    – Blake Scrivener, Netflix SRE Manager
    “Who Owns On Call” http://increment.com/on-call/who-owns-on-call/
  23. 29.

    Collaborate with Ops
    • Improve your understanding of how applications operate in production
    • Improve your knowledge of highly available systems to feed back into the development of your product
  24. 34.

    Ownership
    • Being on call for your own product is more about risk management than control
  25. 35.

  26. 37.

    ABC
    Assess - Triage the incoming alerts
    Blast radius - What applications are failing?
    Compensate - Apply mitigating actions
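
The "Blast radius" step above can be sketched as a walk over a service dependency map. This is only an illustration: the `DEPENDENTS` map, the service names, and the `blast_radius` function are hypothetical, and the talk does not prescribe any particular tooling for this step.

```python
from collections import deque

# Hypothetical map: service -> services that call it directly.
DEPENDENTS = {
    "database": ["orders-api", "accounts-api"],
    "orders-api": ["checkout-web"],
    "accounts-api": ["checkout-web", "admin-web"],
}


def blast_radius(dependents, failing):
    """Return every service that may be affected when `failing` is down."""
    affected = {failing}
    queue = deque([failing])
    while queue:
        service = queue.popleft()
        for caller in dependents.get(service, ()):
            if caller not in affected:  # visit each service once
                affected.add(caller)
                queue.append(caller)
    return affected
```

Calling `blast_radius(DEPENDENTS, "accounts-api")` lists the applications to check before choosing a compensating action.
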
  27. 38.

    Compensating actions
    • Turn off a feature flag
    • Apply graceful degradation
    • Redeploy a known good version
    • Turn on load shedding
    • Many more…
    Ines Sombra - Architectural Patterns of Resilient Distributed Systems http://www.youtube.com/watch?v=ohvPnJYUW1E
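
Three of the compensating actions above (feature flag, graceful degradation, load shedding) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the `Service` class, the `recommendations` flag, and the response strings are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical request handler with two on-call switches."""
    max_in_flight: int = 100        # capacity before shedding applies
    load_shedding: bool = False     # compensating action: reject excess load
    in_flight: int = 0              # requests currently being served
    flags: dict = field(default_factory=lambda: {"recommendations": True})

    def handle(self, path: str) -> str:
        # Load shedding: fail fast instead of letting the whole service stall.
        if self.load_shedding and self.in_flight >= self.max_in_flight:
            return "503 shed"
        # Feature flag off: degrade gracefully to the core response.
        if not self.flags["recommendations"]:
            return f"200 {path} (core only)"
        return f"200 {path} (with recommendations)"
```

The point of the design is that each switch can be flipped by whoever is on call without a redeploy, so the mitigation is quick and easy to reverse.
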
  28. 40.

    Follow up
    • Hold a blameless post-mortem soon after the event
    • Mitigating fixes go to the top of the workstream
    • Run Show & Tells of incidents
  29. 41.

    “MTTR is more important than MTBF (for most types of F)”
    – John Allspaw, author of Web Operations
    http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/
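
The relationship between the two measures in the quote above can be made concrete with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The figures below are illustrative assumptions, not data from the talk.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only: a failure every 100 hours, 2 hours to recover.
baseline = availability(mtbf_hours=100, mttr_hours=2)
halved_mttr = availability(mtbf_hours=100, mttr_hours=1)
doubled_mtbf = availability(mtbf_hours=200, mttr_hours=2)
```

In this simple model, halving the time to recover yields the same availability as doubling the time between failures. The formula says nothing about which of the two is cheaper to achieve; that judgment is left to the team.
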
  30. 46.

  31. 47.

    Push the alerts
    • Do not expect engineers to sit and watch logs
    • Use an alerting tool with built-in rotations & escalations such as PagerDuty or OpsGenie
  32. 49.

    Signal to Noise
    • Alerts should only be used when we would be happy waking someone up!
    • Informative & Actionable
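
One way to apply the rule above is to decide, per alert, whether it pages someone or simply opens a ticket. The sketch below is an assumption about what "informative and actionable" could mean in practice: the `Alert` fields and the `route` function are hypothetical, and real criteria will differ per team.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    name: str
    user_impacting: bool         # assumption: we can tell whether users are affected
    runbook_url: Optional[str]   # assumption: "actionable" means a runbook exists


def route(alert: Alert) -> str:
    """Return "page" only for alerts worth waking someone up for."""
    if alert.user_impacting and alert.runbook_url:
        return "page"
    return "ticket"  # everything else waits for working hours
```
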
  33. 51.

  34. 52.

    Running an on call rota
    • No more than 1 week at a time
    • Changeover on Tuesdays
    • Have an onboarding process for new engineers
    https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
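
The first two rules above (one-week shifts, Tuesday changeover) can be sketched as a small rota generator. The function names and the engineer names are hypothetical, and a hosted alerting tool would normally manage this schedule for you.

```python
from datetime import date, timedelta
from itertools import cycle

TUESDAY = 1  # date.weekday() counts Monday as 0


def next_tuesday(start: date) -> date:
    """Return `start` if it is a Tuesday, otherwise the following Tuesday."""
    return start + timedelta(days=(TUESDAY - start.weekday()) % 7)


def build_rota(engineers, start: date, weeks: int):
    """Return (engineer, shift_start, shift_end) tuples, one week per shift."""
    shift_start = next_tuesday(start)
    rota = []
    for _, engineer in zip(range(weeks), cycle(engineers)):
        shift_end = shift_start + timedelta(days=7)  # hand over the next Tuesday
        rota.append((engineer, shift_start, shift_end))
        shift_start = shift_end
    return rota
```

For example, `build_rota(["ana", "ben", "cat"], date(2017, 6, 16), 4)` starts the first shift on Tuesday 20 June 2017 and cycles back to the first engineer in week four.
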
  35. 53.

    Running an on call rota
    • Agree a reasonable SLA for alert acknowledgement
    • Have escalation policies to provide support when needed
    https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
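
An acknowledgement SLA combined with an escalation policy can be sketched as follows. The `EscalationLevel` type, the `who_to_page` function, and the minute values are hypothetical; this is not the API of PagerDuty, OpsGenie, or any other tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    contact: str
    ack_sla_minutes: int  # how long this level has to acknowledge


def who_to_page(policy, minutes_unacknowledged: int) -> str:
    """Return the contact responsible after the alert has gone unacknowledged."""
    deadline = 0
    for level in policy:
        deadline += level.ack_sla_minutes
        if minutes_unacknowledged < deadline:
            return level.contact
    return policy[-1].contact  # stay with the last level once all SLAs lapse


# Hypothetical policy: primary, then secondary, then the team lead.
POLICY = [
    EscalationLevel("primary on-call", 15),
    EscalationLevel("secondary on-call", 15),
    EscalationLevel("team lead", 30),
]
```

With this policy, an alert left unacknowledged for 20 minutes has already moved to the secondary on-call engineer.
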
  36. 54.

    Agree responsibilities
    • Agree areas of responsibility for Dev & Ops
    • E.g. Devs responsible for App health & performance
    • Ops responsible for underlying infra and monitoring stack
  37. 55.

  38. 56.

    “Burnout is killing us.”
    – John Willis, co-author of The DevOps Handbook
    http://itrevolution.com/karojisatsu/
  39. 57.

  40. 58.

    Prevent Burnout
    • Ensure your team size is large enough to allow for rest, sickness and holidays
    • Allow for engineers to cover and swap between themselves
  41. 59.

    Prevent Burnout
    • Ensure the engineers have the authority and space to improve the applications and thus improve on-call
  42. 60.

  43. 67.

    Attributions
    Iceberg - https://www.flickr.com/photos/14730981@N08/28803627705/
    Pipeline - https://www.flickr.com/photos/cantoni/4426784542/
    CD Pipeline Photos - www.wocintechchat.com
    EKG - https://www.flickr.com/photos/vandalog/9445960751/
    Bottleneck - https://www.flickr.com/photos/aidan_jones/1691801119
    Mind the Gap - https://www.flickr.com/photos/christopherbrown/10135180454/
    Tyre stack - https://www.flickr.com/photos/markusspiske/14605397426/
    Punch Clock - https://www.flickr.com/photos/tjblackwell/5659432136/
    Carrot - https://www.flickr.com/photos/80375783@N00/3392828213/