Building a Culture of Reliability - SRECon EMEA 2017

Getting customers to care about Reliability is hard. Getting stakeholders to care about Reliability is harder. Getting the entire company to care about Reliability is even harder.

In this talk, I will cover the steps that every leader in any organization can take to get more people to care about Reliability. Because people tend to notice Reliability only when it gets worse, it can be hard to show its value and why it is so important.

We will walk through cultural and management changes, metrics to watch and obsess over, and some tooling that can help along the way.

Video Available Here: https://www.youtube.com/watch?v=xH4FqAHeq08

Arup Chakrabarti

August 31, 2017

Transcript

  1. Arup Chakrabarti
    Director of Engineering, PagerDuty
    Building a Culture of Reliability
    SRECON EMEA 2017
    @arupchak

  2. Disclaimers

  3. I work with smrt smart people

  4. You are not PagerDuty

  5. We get this wrong too

  6. Definitions

  7. Reliability

  8. Probability that your software works*

  9. What every CTO claims they want because numbers

  10. Culture

  11. Social behavior and norms for a group of people

  12. A way to get your colleagues to behave the way you want them to
    without staring at them all the time

  13. Metrics

  14. “Show me the business impact”
    -Your Pointy Haired Manager

  15. “Here is a graph of open File Descriptors
    going through the roof”
    -Frustrated Engineer

  16. “What the $%#! is a File Descriptor?”
    -Your Pointy Haired Manager

  17. Business Metrics Managers Care About

  18. Metrics Your Customers Care About

  19. Two Types of Online Businesses
    • Individual Transaction Businesses
    • Subscription Businesses

  20. Individual Transaction Business
    [Chart: $$$ per Minute, Monday through Friday, y-axis from $0 to $90]

  26. Subscription Businesses
    • Cannot solely measure when you make money
    • Poor Reliability erodes trust and will cause you to lose revenue
    • Need to find something between how money is made and what customers care about

  27. Subscription Businesses
    Incidents Resolved per Hour - July 2017

  28. Finding the right metrics is hard

  29. But still worth looking for

  30. More People On-Call

  31. Customers do not care who gets paged

  32. Customers just want to use your service

  33. Centralized Operations Engineering Org

  36. Distributed Operations Engineering Org

  38. Distributed Operations Engineering Org
    Eng, Product, HR, UX, Marketing, Execs

  39. Distributed Operations Org
    • Sets expectations around availability of people
    • More small incidents rather than a single major incident
    • Builds empathy and an understanding of why Reliability is hard

  40. Tooling and Processes

  41. “If we just install Nagios, everything will be
    fine and all of our problems will be solved”
    -Arup in 2002

  42. “We humans co-evolve with our tools. We
    change the tools, and the tools change us,
    and that cycle repeats.”
    -Jeff Bezos

  43. Failure Friday (Process)

  44. Started Small

  46. Got Bigger and Smarter

  48. ?

  51. Reboot Roulette (Tool)

  53. Major Incident Response (Process and Tooling)

  54. Started Really Poorly

  55. Got A Little Better Each Time

  57. Still Not Perfect

  59. Internal Liaison Role (Process)

  60. Over-communicate during Major Incidents

  62. Improving Reliability means constantly failing, constantly
    recovering, and constantly learning

  63. Yes, it can be exhausting, but it is worth it

  64. Improving Culture means constantly failing, constantly
    recovering, and constantly learning

  65. Yes, it can be even more exhausting, but it is really
    really really worth it

  66. Arup Chakrabarti
    Director of Engineering, PagerDuty
    Thank You
    WE ARE HIRING PAGERDUTY.COM/CAREERS
    [email protected]
    @arupchak

  67. Related Reading
    • https://response.pagerduty.com/
    • https://www.pagerduty.com/blog/intern-insights-on-call-experience/
    • https://www.pagerduty.com/blog/failure-fridays-four-years/
    • https://speakerdeck.com/arupchak
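
The definition on slide 8 and the dollars-per-minute chart on slide 20 can be made concrete with a little arithmetic. The sketch below is not from the talk. It assumes that reliability can be estimated as the fraction of attempts that succeed, and that revenue lost during an incident is the per-minute shortfall against a flat baseline. The function names, the flat-baseline assumption, and all sample numbers are hypothetical.

```python
from typing import Sequence


def reliability(successes: int, attempts: int) -> float:
    """Fraction of attempts that worked: an empirical estimate of the
    probability that the software works (an assumption of this sketch)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must be between 0 and attempts")
    return successes / attempts


def revenue_lost(baseline_per_minute: float,
                 observed_per_minute: Sequence[float]) -> float:
    """Sum the per-minute shortfall against a flat baseline.

    Minutes that beat the baseline count as zero loss instead of
    offsetting the shortfall in other minutes.
    """
    return sum(max(0.0, baseline_per_minute - observed)
               for observed in observed_per_minute)


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the talk.
    print(f"reliability: {reliability(9_990, 10_000):.3%}")
    # A five-minute incident against an assumed $45/minute baseline.
    print(f"revenue lost: ${revenue_lost(45.0, [45, 20, 0, 0, 30]):.2f}")
```

A real transaction business would likely replace the flat baseline with the same minute from a comparable day or week, since the chart on slide 20 suggests revenue varies across the week.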
