
You build it, you run it

Chris
July 04, 2018

Transcript

  1. You build it, you run it
    Chris O’Dell | @ChrisAnnODell
    Why Developers should also be on-call

  2. Who has been on-call?

  3. Chris O’Dell
    @ChrisAnnODell
    Backend Engineer at Monzo
    13+ years of professional development, including:
    ● 3 years unofficial second line (startups, yo)
    ● 3 years dev on call supporting my own apps
    ● Soon to be on call again…

  4. Books
    http://cdwithwindows.net
    http://releasabilitybook.com

  6. The Continuous Delivery
    Pipeline

  12. Dev Ops

  17. Telemetry
    Coda Hale – Metrics, Metrics, Everywhere
    https://www.youtube.com/watch?v=czes-oa
    Jordan Sissel - logging: logstash and other things
    https://www.youtube.com/watch?v=RuUFnog29M4

  18. https://twitter.com/samnewman/status/862268242030718976

  20. Metrics Driven Development
    - The use of real-time metrics to drive
    rapid, precise, and granular software
    iterations
    https://sookocheff.com/post/mdd/mdd/
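
As a rough illustration of what "real-time metrics" can look like at the code level, here is a minimal sketch of an in-process counter and timer. The `Metrics` class, the metric names, and the `handle_payment` handler are all hypothetical and not from the talk; a real service would normally use an existing metrics library and ship the values to a backend.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Tiny in-process registry (illustrative only, not a real library)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, amount=1):
        # Count how often something happens.
        self.counters[name] += amount

    @contextmanager
    def timer(self, name):
        # Record how long the wrapped block took, even if it returns early or raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)


metrics = Metrics()


def handle_payment(amount_pence):
    # Hypothetical request handler instrumented with a counter and a timer.
    with metrics.timer("payment.duration_seconds"):
        if amount_pence <= 0:
            metrics.incr("payment.rejected")
            return False
        metrics.incr("payment.accepted")
        return True
```

Watching `payment.rejected` and the recorded durations after each release is the kind of feedback loop the definition above describes.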

  21. Coda Hale – Metrics, Metrics, Everywhere
    https://www.youtube.com/watch?v=czes-oa0yi

  22. Developers On-Call is an
    evolution of Metrics Driven
    Development

  23. After a point, Software & Perf
    bugs become more common
    than low level infra ones

  24. “When things are broken, we want
    people with the best context trying to
    fix things.”
    – Blake Scrivener,
    Netflix SRE Manager
    “Who Owns On Call” http://increment.com/on-call/who-owns-on-call/

  25. Dev Ops

  26. [having a separate operations team]
    “creates a divide and simply doesn’t
    scale, it puts the onus of
    responsibility for fixing an issue on
    the wrong team.”
    – Joey Parsons,
    Airbnb SRE manager
    “Who Owns On Call” http://increment.com/on-call/who-owns-on-call/

  27. Own your product for its entire
    lifetime.

  28. Ownership
    •Ownership is a prerequisite of
    Autonomy -> Mastery -> Purpose
    Dan Pink - Drive
    https://www.youtube.com/watch?v=u6XAPnuFjJc

  29. “With great power, comes great
    responsibility.”
    – Uncle Ben
    Spider-Man 2002

  30. Ownership
    Become experts in supporting your
    existing tooling to feed your future
    choices

  31. Operating in Production is an
    oft-neglected use-case of a
    Product

  32. Ownership
    Being on call for your own product
    is more about risk management
    than control

  33. Being on-call

  34. On-call engineers are the
    “first aiders” of software
    engineering

  35. ABC
    Airways
    Breathing
    Circulation

  36. ABC
    Assess
    - Triage the incoming alerts
    Blast radius
    - What applications are failing?
    Compensate
    - Apply mitigating actions

  37. Compensating actions
    •Turn off a feature flag
    •Apply graceful degradation
    •Redeploy a known good version
    •Turn on load shedding
    •Many more…
    Ines Sombra - Architectural Patterns of Resilient Distributed Systems
    http://www.youtube.com/watch?v=ohvPnJYUW1E
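
To make three of the compensating actions above concrete, here is a minimal sketch of a feature flag, graceful degradation, and load shedding. Everything in it is an assumption for illustration: the in-memory `FLAGS` store, the `MAX_IN_FLIGHT` threshold, and the failing `fetch_recommendations` stand-in are hypothetical, and a real system would usually use a flag service and a proper concurrency limiter.

```python
FLAGS = {"recommendations": True}  # hypothetical in-memory flag store
MAX_IN_FLIGHT = 100                # hypothetical load-shedding threshold


class Overloaded(Exception):
    """Raised when a request is shed to protect the service."""


def fetch_recommendations(user_id):
    # Stand-in for a call to a downstream service that is currently failing.
    raise TimeoutError("recommendation service timed out")


def recommendations_for(user_id):
    # Feature flag: on-call can switch the feature off without a deploy.
    if not FLAGS["recommendations"]:
        return []
    try:
        return fetch_recommendations(user_id)
    except TimeoutError:
        # Graceful degradation: serve an empty list rather than fail the page.
        return []


def admit(in_flight):
    # Load shedding: reject work beyond capacity instead of queueing it.
    if in_flight >= MAX_IN_FLIGHT:
        raise Overloaded("shedding load")
    return True
```

In this sketch, setting `FLAGS["recommendations"] = False` is the whole mitigation during an incident; finding out why the downstream service timed out can wait for the follow-up.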

  38. Do not investigate the root
    cause during an incident

  39. Follow up
    •Hold a blameless post-mortem soon
    after the event
    •Mitigating fixes go to the top of the
    workstream
    •Run Show & Tells of incidents

  40. “MTTR is more important than MTBF
    (for most types of F)”
    – John Allspaw
    Author of Web Operations
    http://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/
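
One way to see why recovery time matters so much is the common steady-state approximation availability = MTBF / (MTBF + MTTR), treating MTBF as the mean running time between failures. The formula and the figures below are illustrative assumptions, not numbers from the talk or from the linked post.

```python
def availability(mtbf_hours, mttr_hours):
    # Fraction of time the service is up under the steady-state approximation.
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Assumed figures: one failure every 500 hours, 2 hours to recover.
baseline = availability(mtbf_hours=500, mttr_hours=2)

# Option A: recover twice as fast.
faster_recovery = availability(mtbf_hours=500, mttr_hours=1)

# Option B: fail half as often.
fewer_failures = availability(mtbf_hours=1000, mttr_hours=2)
```

With these assumed figures the baseline is about 99.60% and both options reach about 99.80%. Halving MTTR buys the same availability as doubling MTBF, and it is often the cheaper of the two to achieve.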

  41. Ownership of runtime success
    belongs to the team

  42. Collaborate with Ops
    •Improve your understanding of how
    applications operate in production
    •Improve your knowledge of highly
    available systems to feed back into
    the development of your product

  43. Learn from others’ mistakes

  44. https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/

  45. https://aws.amazon.com/message/41926/

  46. https://www.youtube.com/watch?v=OUYTNywPk-s

  47. Never fail the same way twice

  48. Promote a sense of safety

  49. Push the alerts
    •Do not expect engineers to sit and
    watch logs
    •Use an alerting tool with built-in
    rotations & escalations such as
    PagerDuty or OpsGenie

  50. http://lightsecond.com/ascii_stereogram.html

  51. Signal to Noise
    •Alerts should only be used when
    we would be happy waking
    someone up!
    •Informative & Actionable
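
The two rules above can be read as a routing decision: only alerts that are both urgent and actionable should page a human. The sketch below is one possible encoding of that idea; the `Alert` fields, the three destinations, and the example alerts are hypothetical and not taken from the talk or from any alerting tool.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    urgent: bool      # would we be happy waking someone up for this?
    action: str = ""  # what the responder should do; empty means nothing

    
def route(alert):
    """Return "page", "ticket", or "dashboard" for an alert."""
    if not alert.action:
        # Not actionable: in this sketch it is never allowed to page.
        return "dashboard"
    if alert.urgent:
        return "page"
    # Actionable but can wait until working hours.
    return "ticket"


checkout_down = Alert("checkout 5xx rate", urgent=True, action="roll back last deploy")
disk_filling = Alert("disk 70% full", urgent=False, action="expand volume this week")
cpu_spike = Alert("CPU above 80%", urgent=False)
```

Here only `checkout_down` pages; `disk_filling` becomes a ticket and `cpu_spike` stays on a dashboard. Reviewing which alerts landed in which bucket is one practical way to keep the signal-to-noise ratio high.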

  52. Actively combat alert fatigue

  53. Running an on call rota

  54. Running an on call rota
    •No more than 1 week at a time
    •Changeover on Tuesdays
    •Have an onboarding process for
    new engineers
    https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
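
The first two rules above (shifts of at most one week, changing over on Tuesdays) are easy to generate mechanically. This is a minimal sketch using only the standard library; the engineer names and the `build_rota` helper are hypothetical, and tools such as PagerDuty or OpsGenie provide their own scheduling.

```python
from datetime import date, timedelta
from itertools import cycle


def next_tuesday(day):
    # date.weekday(): Monday is 0, so Tuesday is 1. Returns `day` itself if it is a Tuesday.
    return day + timedelta(days=(1 - day.weekday()) % 7)


def build_rota(engineers, start, weeks):
    """Yield (engineer, shift_start, shift_end) for one-week, Tuesday-to-Tuesday shifts."""
    shift_start = next_tuesday(start)
    for engineer, _ in zip(cycle(engineers), range(weeks)):
        shift_end = shift_start + timedelta(weeks=1)
        yield engineer, shift_start, shift_end
        shift_start = shift_end


# Hypothetical team, starting from the first Tuesday on or after 4 July 2018.
rota = list(build_rota(["ana", "ben", "cho"], date(2018, 7, 4), weeks=4))
```

Each engineer in this three-person example is on call one week in three, which makes the effect of team size on individual load easy to see.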

  55. Running an on call rota
    •Agree a reasonable SLA for alert
    acknowledgement
    •Have escalation policies to
    provide support when needed
    https://blog.hinterlands.org/2010/07/running-an-oncall-rota/
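
An acknowledgement SLA and an escalation policy fit together naturally: each time the SLA elapses without an acknowledgement, the next person in the chain is paged. The sketch below assumes a 15-minute SLA and a three-step chain; both values and the `who_to_page` helper are hypothetical and should be replaced by whatever the team agrees.

```python
ACK_SLA_MINUTES = 15  # hypothetical acknowledgement SLA agreed by the team

# Hypothetical escalation chain, in the order people are paged.
ESCALATION_CHAIN = ["primary on-call", "secondary on-call", "engineering manager"]


def who_to_page(minutes_unacknowledged):
    """Return everyone who should have been paged after this many whole minutes without an ack."""
    # One extra level per elapsed SLA window, capped at the end of the chain.
    level = min(minutes_unacknowledged // ACK_SLA_MINUTES, len(ESCALATION_CHAIN) - 1)
    return ESCALATION_CHAIN[: level + 1]
```

For example, `who_to_page(20)` returns the primary and the secondary, so the primary is supported rather than replaced when an alert goes unanswered.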

  56. Agree responsibilities
    Example
    •Devs responsible for App health &
    performance
    •Ops responsible for underlying infra
    and monitoring stack

  57. Prevent burnout

  58. “Burnout is killing us.”
    – John Willis
    Co-Author of The DevOps Handbook
    http://itrevolution.com/karojisatsu/

  59. Be mindful of the cognitive
    weight of on-call

  60. Ensure team size allows for
    rest, sickness and holidays

  61. Prevent Burnout
    •Allow for engineers to cover and
    swap between themselves

  62. Prevent Burnout
    •Ensure the engineers are
    empowered and supported to
    improve the applications, and thus
    improve on-call experience

  63. Compensation over reward

  64. Devs on call is fast becoming
    an industry standard

  65. Take pride in your services…

  66. …without destroying your team

  67. You build it, you run it.

  68. Thank you!
    Chris O’Dell
    @ChrisAnnODell

  69. Attributions
    Iceberg - https://www.flickr.com/photos/14730981@N08/28803627705/
    Pipeline - https://www.flickr.com/photos/cantoni/4426784542/
    CD Pipeline Photos - www.wocintechchat.com
    EKG - https://www.flickr.com/photos/vandalog/9445960751/
    Bottleneck - https://www.flickr.com/photos/aidan_jones/1691801119
    Mind the Gap - https://www.flickr.com/photos/christopherbrown/10135180454/
    Door key - https://www.flickr.com/photos/alancleaver/5577108264/
    Tyre stack - https://www.flickr.com/photos/markusspiske/14605397426/
    Punch Clock - https://www.flickr.com/photos/tjblackwell/5659432136/
    Carrot - https://www.flickr.com/photos/80375783@N00/3392828213/
    Sick pilot - https://twitter.com/AviatorInsp/status/975542614714757121
    Shift happens - https://www.flickr.com/photos/pilottheatre/9254122019
    Butting Heads - https://www.flickr.com/photos/jamiedfw/5423425957/
