
You build it, you run it [RebelCon 2017]

Chris
June 16, 2017


Transcript

  2. About me
    Chris O’Dell
    @ChrisAnnODell
    Lead for Build Engineering at
    Skelton Thatcher Consulting
    skeltonthatcher.com

  3. Books

  4. Team-first digital transformation
    30+ organisations
    UK, EU, India, China
    skeltonthatcher.com

  5. Why Developers should also
    be on-call

  6. Agenda
    Prerequisites:
    •CD Pipeline
    •Telemetry
    •Metrics Driven Development
    •Product ownership

  7. Agenda
    Techniques:
    •On-call as “First Aid”
    •Effective incident follow up
    •Running an on-call rota
    •Preventing burnout

  18. [having a separate operations team]
    “creates a divide and simply
    doesn’t scale, it puts the onus of
    responsibility for fixing an issue on
    the wrong team.”
    – Joey Parsons,
    Airbnb SRE manager
    “Who Owns On Call” http://increment.com/on-call/who-owns-on-call/

  20. Coda Hale – Metrics, Metrics, Everywhere
    https://www.youtube.com/watch?v=czes-oa0yik
    Jordan Sissel - logging: logstash and other things
    https://www.youtube.com/watch?v=RuUFnog29M4

  21. https://twitter.com/samnewman/status/862268242030718976

  23. Metrics Driven Development
    The use of real-time metrics to
    drive rapid, precise, and granular
    software iterations.
    https://sookocheff.com/post/mdd/mdd/

  24. Coda Hale – Metrics, Metrics, Everywhere
    https://www.youtube.com/watch?v=czes-oa0yik

  25. Developers On-Call is an
    evolution of Metrics Driven
    Development

  27. Bottleneck
    •After a point, software and performance
    bugs become more common
    than low-level infrastructure ones
    •Businesses find it difficult to hire
    Ops at the rate of Dev

  28. “When things are broken, we want
    people with the best context trying
    to fix things.”
    – Blake Scrivener,
    Netflix SRE Manager
    “Who Owns On Call” http://increment.com/on-call/who-owns-on-call/

  29. Collaborate with Ops
    •Improve your understanding of how
    applications operate in production
    •Improve your knowledge of highly
    available systems to feed back into
    the development of your product

  30. Ownership
    •Own your product for its entire
    lifetime: from creation, through
    execution, to deletion.

  31. “With great power, comes great
    responsibility.”
    – Uncle Ben
    Spider-Man 2002

  32. Ownership
    •Become experts in supporting
    your existing tooling to feed your
    future choices

  33. Operating in Production is
    an oft-neglected use case of
    a Product

  34. Ownership
    •Being on call for your own product
    is more about risk management
    than control

  36. ABC
    Airways
    Breathing
    Circulation

  37. ABC
    Assess
    - Triage the incoming alerts
    Blast radius
    - What applications are failing?
    Compensate
    - Apply mitigating actions

  38. Compensating actions
    •Turn off a feature flag
    •Apply graceful degradation
    •Redeploy a known good version
    •Turn on load shedding
    •Many more…
    Ines Sombra - Architectural Patterns of Resilient Distributed Systems
    http://www.youtube.com/watch?v=ohvPnJYUW1E
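
Two of the compensating actions above, turning off a feature flag and turning on load shedding, can be sketched roughly as follows. This is a minimal illustration, not something from the talk: the `FeatureFlags` class, the `recommendations` flag name, and the `should_shed` and `handle_request` helpers are all hypothetical, and a real service would normally use a dedicated flag service and a proper load signal.

```python
import random


class FeatureFlags:
    """Hypothetical in-memory flag store (stand-in for a real flag service)."""

    def __init__(self, flags):
        self._flags = dict(flags)

    def is_enabled(self, name):
        # Unknown flags are treated as off.
        return self._flags.get(name, False)

    def turn_off(self, name):
        self._flags[name] = False


def should_shed(current_load, capacity, shed_fraction, rng=random.random):
    """Reject roughly `shed_fraction` of requests once load exceeds capacity."""
    if current_load <= capacity:
        return False
    return rng() < shed_fraction


def handle_request(flags, current_load, capacity, shed_fraction=1.0):
    # Load shedding: fail fast instead of letting every request time out.
    if should_shed(current_load, capacity, shed_fraction):
        return "503 shed"
    # Graceful degradation: skip the optional feature when its flag is off.
    if flags.is_enabled("recommendations"):
        return "200 full page"
    return "200 degraded page"
```

The point of both actions is that the on-call engineer can apply them in seconds, without a new build, and leave the real fix for working hours.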


  39. Do not investigate the root
    cause during an incident

  40. Follow up
    •Hold a blameless post-mortem
    soon after the event
    •Mitigating fixes go to the top of
    the workstream
    •Run Show & Tells of incidents

  41. “MTTR is more important than MTBF
    (for most types of F)”
    – John Allspaw
    Author of Web Operations
    http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/
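
One way to see why recovery time matters so much is the common steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is an illustration under that assumption, with made-up numbers; it is not taken from the linked post.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only: a failure every 1000 hours, 4 hours to recover.
baseline = availability(1000, 4)
faster_recovery = availability(1000, 2)   # halve MTTR
fewer_failures = availability(2000, 4)    # double MTBF
```

Under this formula, halving MTTR yields the same availability as doubling MTBF, and recovering faster is often the cheaper of the two to achieve.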

  42. Ownership of runtime
    success belongs to the team

  43. https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

  44. https://aws.amazon.com/message/41926/

  45. Never fail the same way
    twice

  47. Push the alerts
    •Do not expect engineers to sit and
    watch logs
    •Use an alerting tool with built-in
    rotations & escalations such as
    PagerDuty or OpsGenie

  48. http://lightsecond.com/ascii_stereogram.html

  49. Signal to Noise
    •Alerts should only be used when
    we would be happy waking
    someone up!
    •Informative & Actionable
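
The "wake someone up" test can be sketched as a routing rule. This is a hypothetical illustration: the `Alert` fields and the `route` function are invented here, and real alerting tools express the same idea through their own severity and routing configuration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    summary: str            # informative: what is wrong, in plain words
    user_impacting: bool    # is it worth waking someone up for?
    runbook_url: str = ""   # actionable: where the responder should start


def route(alert):
    """Page only for alerts that are both urgent and actionable."""
    if alert.user_impacting and alert.runbook_url:
        return "page"
    # Everything else waits for working hours as a ticket.
    return "ticket"
```

An alert that cannot pass this rule is a candidate for rewording, downgrading, or deletion.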

  50. Actively combat alert
    fatigue

  52. Running an on-call rota
    •No more than 1 week at a time
    •Changeover on Tuesdays
    •Have an onboarding process for
    new engineers
    https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
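
The first two rules (one-week shifts, Tuesday changeover) are easy to generate mechanically. A minimal sketch, assuming a simple round-robin over a list of engineers; the function names and the example names are hypothetical, and tools such as PagerDuty or OpsGenie provide this scheduling built in.

```python
from datetime import date, timedelta
from itertools import cycle, islice

TUESDAY = 1  # date.weekday() counts Monday as 0


def next_tuesday(start):
    """Return `start` if it is a Tuesday, otherwise the following Tuesday."""
    return start + timedelta(days=(TUESDAY - start.weekday()) % 7)


def build_rota(engineers, start, weeks):
    """Round-robin rota of one-week shifts that change over on Tuesdays."""
    first = next_tuesday(start)
    rota = []
    for i, engineer in enumerate(islice(cycle(engineers), weeks)):
        shift_start = first + timedelta(weeks=i)
        shift_end = shift_start + timedelta(weeks=1)
        rota.append((shift_start, shift_end, engineer))
    return rota


rota = build_rota(["ana", "ben", "cai"], date(2017, 6, 16), weeks=4)
```

A midweek changeover keeps handovers away from weekends and from Mondays, when the previous week's incidents are still being picked up.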


  53. Running an on-call rota
    •Agree a reasonable SLA for alert
    acknowledgement
    •Have escalation policies to
    provide support when needed
    https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
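
An acknowledgement SLA combined with an escalation policy amounts to a small rule: each time the SLA elapses without an acknowledgement, the alert moves one step up the chain. The sketch below assumes a 15-minute SLA and a three-step chain purely for illustration; `who_to_page` is a hypothetical helper, not the API of any alerting product.

```python
from datetime import timedelta


def who_to_page(elapsed, ack_sla, chain):
    """Return who should hold the alert after `elapsed` time unacknowledged."""
    # timedelta // timedelta gives the number of whole SLA periods missed.
    level = elapsed // ack_sla
    # Stay with the last person in the chain once it is exhausted.
    return chain[min(level, len(chain) - 1)]


chain = ["primary", "secondary", "engineering manager"]
sla = timedelta(minutes=15)
```

With these values, an alert left unacknowledged for 20 minutes goes to the secondary, and one left for 90 minutes sits with the engineering manager.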


  54. Agree responsibilities
    •Agree areas of responsibility for
    Dev & Ops
    •E.g. Devs responsible for App
    health & performance
    •Ops responsible for underlying
    infra and monitoring stack

  56. “Burnout is killing us.”
    – John Willis
    Co-Author of The DevOps Handbook
    http://itrevolution.com/karojisatsu/

  58. Prevent Burnout
    •Ensure your team size is large
    enough to allow for rest, sickness
    and holidays
    •Allow for engineers to cover and
    swap between themselves

  59. Prevent Burnout
    •Ensure the engineers have the
    authority and space to improve
    the applications and thus improve
    on-call

  61. Developers On Call

  62. Take pride in your services…

  63. …without destroying
    your team

  64. You build it, you run it.

  65. releasabilitybook.com
    Upcoming book:
    Team Guide to Build &
    Release Engineering
    by Chris O’Dell & Manuel Pais

  66. Thank you
    Chris O’Dell
    @ChrisAnnODell
    skeltonthatcher.com

  67. Attributions
    Iceberg - https://www.flickr.com/photos/[email protected]/28803627705/
    Pipeline - https://www.flickr.com/photos/cantoni/4426784542/
    CD Pipeline Photos – www.wocintechchat.com
    EKG - https://www.flickr.com/photos/vandalog/9445960751/
    Bottleneck - https://www.flickr.com/photos/aidan_jones/1691801119
    Mind the Gap - https://www.flickr.com/photos/christopherbrown/10135180454/
    Tyre stack - https://www.flickr.com/photos/markusspiske/14605397426/
    Punch Clock - https://www.flickr.com/photos/tjblackwell/5659432136/
    Carrot - https://www.flickr.com/photos/[email protected]/3392828213/