You build it, you run it [RebelCon 2017]

Chris
June 16, 2017

Transcript

  1. About me
    Chris O’Dell
    @ChrisAnnODell
    Lead for Build Engineering at
    Skelton Thatcher Consulting
    skeltonthatcher.com

  2. Team-first digital transformation
    30+ organisations
    UK, EU, India, China
    skeltonthatcher.com

  3. Why Developers should also
    be on-call

  4. Agenda
    Prerequisites:
    •CD Pipeline
    •Telemetry
    •Metrics Driven Development
    •Product ownership

  5. Agenda
    Techniques:
    •On-call as “First Aid”
    •Effective incident follow up
    •Running an on-call rota
    •Preventing burnout

  6. [having a separate operations team]
    “creates a divide and simply
    doesn’t scale, it puts the onus of
    responsibility for fixing an issue on
    the wrong team.”
    – Joey Parsons,
    Airbnb SRE manager
    “Who Owns On Call” http://increment.com/on-call/who-owns-on-call/

  7. Coda Hale – Metrics, Metrics, Everywhere
    https://www.youtube.com/watch?v=czes-oa0yik
    Jordan Sissel - logging: logstash and other things
    https://www.youtube.com/watch?v=RuUFnog29M4

  8. https://twitter.com/samnewman/status/862268242030718976

  9. Metrics Driven Development
    The use of real-time metrics to
    drive rapid, precise, and granular
    software iterations.
    https://sookocheff.com/post/mdd/mdd/

  10. Coda Hale – Metrics, Metrics, Everywhere
    https://www.youtube.com/watch?v=czes-oa0yik

  11. Developers On-Call is an
    evolution of Metrics Driven
    Development

  12. Bottleneck
    •After a point, software and performance
    bugs become more common
    than low-level infra ones
    •Businesses find it difficult to hire
    Ops at the same rate as Dev

  13. “When things are broken, we want
    people with the best context trying
    to fix things.”
    – Blake Scrivener,
    Netflix SRE Manager
    “Who Owns On Call” http://increment.com/on-call/who-owns-on-call/

  14. Collaborate with Ops
    •Improve your understanding of how
    applications operate in production
    •Improve your knowledge of highly
    available systems to feed back into
    the development of your product

  15. Ownership
    •Own your product for its entire
    lifetime: from creation, through
    execution, to deletion.

  16. “With great power comes great
    responsibility.”
    – Uncle Ben
    Spider-Man (2002)

  17. Ownership
    •Become experts in supporting
    your existing tooling to feed your
    future choices

  18. Operating in Production is
    an oft-neglected use case of
    a Product

  19. Ownership
    •Being on call for your own product
    is more about risk management
    than control

  20. ABC
    Airway
    Breathing
    Circulation

  21. ABC
    Assess
    - Triage the incoming alerts
    Blast radius
    - What applications are failing?
    Compensate
    - Apply mitigating actions

  22. Compensating actions
    •Turn off a feature flag
    •Apply graceful degradation
    •Redeploy a known good version
    •Turn on load shedding
    •Many more…
    Ines Sombra - Architectural Patterns of Resilient Distributed Systems
    http://www.youtube.com/watch?v=ohvPnJYUW1E
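
Three of the compensating actions listed above (turning off a feature flag, graceful degradation, load shedding) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the talk: the `FeatureFlags` and `LoadShedder` classes, the `recommendations` flag name, and the response strings are all hypothetical.

```python
"""Sketch of compensating actions: a feature-flag kill switch and load shedding.

Every name here (FeatureFlags, LoadShedder, the "recommendations" flag) is
hypothetical and only illustrates the idea.
"""


class FeatureFlags:
    """In-memory flag store; a real system would back this with shared config."""

    def __init__(self, flags):
        self._flags = dict(flags)

    def is_enabled(self, name):
        # Unknown flags default to off, so a missing flag fails safe.
        return self._flags.get(name, False)

    def turn_off(self, name):
        self._flags[name] = False


class LoadShedder:
    """Rejects new work once the number of in-flight requests hits a limit."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self):
        if self.in_flight >= self.max_in_flight:
            return False  # shed this request instead of queueing it
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight = max(0, self.in_flight - 1)


def handle_request(flags, shedder):
    """Return an HTTP-like status and body for one request."""
    if not shedder.try_acquire():
        return 503, "overloaded, retry later"
    try:
        if flags.is_enabled("recommendations"):
            return 200, "page with recommendations"
        # Graceful degradation: serve the core page without the optional feature.
        return 200, "page without recommendations"
    finally:
        shedder.release()
```

The point of the sketch is that each action is a switch the on-call engineer can flip quickly (`flags.turn_off("recommendations")`, or lowering `max_in_flight`) without needing to understand the underlying fault first.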

  23. Do not investigate the root
    cause during an incident

  24. Follow up
    •Hold a blameless post-mortem
    soon after the event
    •Mitigating fixes go to the top of
    the workstream
    •Run Show & Tells of incidents

  25. “MTTR is more important than MTBF
    (for most types of F)”
    – John Allspaw
    Author of Web Operations
    http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/
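
One way to put numbers on the MTTR and MTBF trade-off is the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is an illustration only: the hour values are made up and come from neither the talk nor the linked post.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only: one failure every 30 days, 2 hours to recover.
baseline = availability(mtbf_hours=720, mttr_hours=2)

# Halving recovery time and doubling the time between failures give the same
# availability, because only the ratio MTTR / MTBF matters in this formula.
faster_recovery = availability(mtbf_hours=720, mttr_hours=1)
fewer_failures = availability(mtbf_hours=1440, mttr_hours=2)

print(f"baseline:        {baseline:.5f}")
print(f"faster recovery: {faster_recovery:.5f}")
print(f"fewer failures:  {fewer_failures:.5f}")
```

Under this formula the two improvements are equivalent, so the choice comes down to which is cheaper to achieve for a given type of failure.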

  26. Ownership of runtime
    success belongs to the team

  27. https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

  28. https://aws.amazon.com/message/41926/

  29. Never fail the same way
    twice

  30. Push the alerts
    •Do not expect engineers to sit and
    watch logs
    •Use an alerting tool with built-in
    rotations & escalations such as
    PagerDuty or OpsGenie

  31. http://lightsecond.com/ascii_stereogram.html

  32. Signal to Noise
    •Alerts should only be used when
    we would be happy waking
    someone up!
    •Informative & Actionable

  33. Actively combat alert
    fatigue

  34. Running an on-call rota
    •No more than 1 week at a time
    •Changeover on Tuesdays
    •Have an onboarding process for
    new engineers
    https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
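
The first two rules above (shifts of at most one week, changeover on Tuesdays) are easy to encode. The following is a minimal sketch assuming a simple round-robin order; the function names `next_tuesday` and `build_rota` and the engineer names are hypothetical, and a real rota would also need to handle holidays and swaps.

```python
from datetime import date, timedelta
from itertools import cycle

TUESDAY = 1  # date.weekday() counts Monday as 0


def next_tuesday(start):
    """Return `start` if it is a Tuesday, otherwise the next Tuesday after it."""
    return start + timedelta(days=(TUESDAY - start.weekday()) % 7)


def build_rota(engineers, start, weeks):
    """Assign one-week shifts, round-robin, each running Tuesday to Tuesday."""
    shift_start = next_tuesday(start)
    rota = []
    for _, engineer in zip(range(weeks), cycle(engineers)):
        shift_end = shift_start + timedelta(days=7)
        rota.append((engineer, shift_start, shift_end))
        shift_start = shift_end
    return rota


if __name__ == "__main__":
    # Placeholder names; the start date is the date of the talk.
    for engineer, begins, ends in build_rota(["ana", "ben", "cho"], date(2017, 6, 16), 4):
        print(f"{engineer}: {begins} -> {ends}")
```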

  35. Running an on-call rota
    •Agree a reasonable SLA for alert
    acknowledgement
    •Have escalation policies to
    provide support when needed
    https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
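
An acknowledgement SLA and an escalation policy fit together naturally: if the current contact does not acknowledge within the agreed window, the page moves to the next person. The sketch below illustrates that logic only. It is not the API of any alerting tool, and the `EscalationStep` class, the contacts, and the 15-minute windows are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    contact: str
    ack_timeout_minutes: int  # how long this contact has to acknowledge


def who_to_page(steps, minutes_since_alert):
    """Return the contact who should hold an unacknowledged page right now.

    Each step gets its own acknowledgement window; once it lapses the alert
    moves on. The last step keeps the page indefinitely.
    """
    elapsed = 0
    for step in steps[:-1]:
        elapsed += step.ack_timeout_minutes
        if minutes_since_alert < elapsed:
            return step.contact
    return steps[-1].contact


# Hypothetical policy: a 15-minute acknowledgement SLA at each level.
POLICY = [
    EscalationStep("primary on-call developer", 15),
    EscalationStep("secondary on-call developer", 15),
    EscalationStep("engineering manager", 15),
]
```

For example, `who_to_page(POLICY, 20)` returns the secondary on-call developer, because the primary's 15-minute window has lapsed.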

  36. Agree responsibilities
    •Agree areas of responsibility for
    Dev & Ops
    •E.g. Devs responsible for App
    health & performance
    •Ops responsible for underlying
    infra and monitoring stack

  37. “Burnout is killing us.”
    – John Willis
    Co-Author of The DevOps Handbook
    http://itrevolution.com/karojisatsu/

  38. Prevent Burnout
    •Ensure your team size is large
    enough to allow for rest, sickness
    and holidays
    •Allow engineers to cover for each other
    and swap shifts between themselves

  39. Prevent Burnout
    •Ensure the engineers have the
    authority and space to improve
    the applications and thus improve
    on-call

  40. Developers On Call

  41. Take pride in your services…

  42. …without destroying
    your team

  43. You build it, you run it.

  44. releasabilitybook.com
    Upcoming book:
    Team Guide to Build &
    Release Engineering
    by Chris O’Dell & Manuel Pais

  45. Thank you
    Chris O’Dell
    @ChrisAnnODell
    skeltonthatcher.com

  46. Attributions
    Iceberg - https://www.flickr.com/photos/14730981@N08/28803627705/
    Pipeline - https://www.flickr.com/photos/cantoni/4426784542/
    CD Pipeline Photos - www.wocintechchat.com
    EKG - https://www.flickr.com/photos/vandalog/9445960751/
    Bottleneck - https://www.flickr.com/photos/aidan_jones/1691801119
    Mind the Gap - https://www.flickr.com/photos/christopherbrown/10135180454/
    Tyre stack - https://www.flickr.com/photos/markusspiske/14605397426/
    Punch Clock - https://www.flickr.com/photos/tjblackwell/5659432136/
    Carrot - https://www.flickr.com/photos/80375783@N00/3392828213/