Building a Culture of Reliability - SRECon EMEA 2017

Getting customers to care about Reliability is hard. Getting stakeholders to care about Reliability is harder. Getting the entire company to care about Reliability is even harder.

In this talk, I will cover the steps that every leader in any organization can take to get more people to care about Reliability. Because Reliability is one of those things that people only notice when it goes in the wrong direction, it can be hard to show its value and why it is so important.

We will walk through cultural and management changes, metrics to watch and obsess over, and some tooling that can help along the way.

Video Available Here: https://www.youtube.com/watch?v=xH4FqAHeq08

Arup Chakrabarti

August 31, 2017

Transcript

  1. Arup Chakrabarti Director of Engineering, PagerDuty Building a Culture of Reliability SRECON EMEA 2017 @arupchak

  2. @arupchak Disclaimers

  3. @arupchak I work with smrt smart people

  4. @arupchak You are not PagerDuty

  5. @arupchak We get this wrong too

  6. @arupchak Definitions

  7. @arupchak Reliability

  8. @arupchak Probability that your software works*

  9. @arupchak What every CTO claims they want because numbers

  10. @arupchak Culture

  11. @arupchak Social behavior and norms for a group of people

  12. @arupchak A way to get your colleagues to behave the way you want them to without staring at them all the time

  13. @arupchak Metrics

  14. @arupchak “Show me the business impact” -Your Pointy Haired Manager

  15. @arupchak “Here is a graph of open File Descriptors going through the roof” -Frustrated Engineer

  16. @arupchak “What the $%#! is a File Descriptor?” -Your Pointy Haired Manager

  17. @arupchak Business Metrics Managers Care About

  18. @arupchak Metrics Your Customers Care About

  19. @arupchak Two Types of Online Businesses • Individual Transaction Businesses • Subscription Businesses

  20. @arupchak Individual Transaction Business (chart: $$$ per Minute, Monday to Friday, axis $0 to $90)

  21. @arupchak Individual Transaction Business (chart: $$$ per Minute, Monday to Friday, axis $0 to $90)

  22. @arupchak Individual Transaction Business (chart: $$$ per Minute, Monday to Friday, axis $0 to $90)

  23. @arupchak Individual Transaction Business (chart: $$$ per Minute, Monday to Friday, axis $0 to $90)

  24. @arupchak Individual Transaction Business (chart: $$$ per Minute, Monday to Friday, axis $0 to $90; annotation: $)

  25. @arupchak Individual Transaction Business (chart: $$$ per Minute, Monday to Friday, axis $0 to $90; annotations: $ €)

  26. @arupchak Subscription Businesses • Cannot solely measure when you make money • Poor Reliability erodes trust and will cause you to lose revenue • Need to find something between how money is made and what customers care about

  27. @arupchak Subscription Businesses Incidents Resolved per Hour - July 2017

  28. @arupchak Finding the right metrics is hard

  29. @arupchak But still worth looking for

  30. @arupchak More People On-Call

  31. @arupchak Customers do not care who gets paged

  32. @arupchak Customers just want to use your service

  33. @arupchak Centralized Operations Engineering Org

  34. @arupchak Centralized Operations Engineering Org

  35. @arupchak Centralized Operations Engineering Org

  36. @arupchak Distributed Operations Engineering Org

  37. @arupchak Distributed Operations Engineering Org

  38. @arupchak Distributed Operations Engineering Org: Eng, Product, HR, UX, Marketing, Execs

  39. @arupchak Distributed Operations Org • Sets expectations around availability of people • More small incidents over single major incident • Builds empathy and an understanding of why Reliability is hard

  40. @arupchak Tooling and Processes

  41. @arupchak “If we just install Nagios, everything will be fine and all of our problems will be solved” -Arup in 2002

  42. @arupchak “We humans co-evolve with our tools. We change the tools, and the tools change us, and that cycle repeats.” -Jeff Bezos

  43. @arupchak Failure Friday (Process)

  44. @arupchak Started Small

  45. @arupchak

  46. @arupchak Got Bigger and Smarter

  47. @arupchak

  48. @arupchak ?

  49. @arupchak ?

  50. @arupchak ?

  51. @arupchak Reboot Roulette (Tool)

  52. @arupchak

  53. @arupchak Major Incident Response (Process and Tooling)

  54. @arupchak Started Really Poorly

  55. @arupchak Got A Little Better Each Time

  56. @arupchak

  57. @arupchak Still Not Perfect

  58. @arupchak

  59. @arupchak Internal Liaison Role (Process)

  60. @arupchak Over-communicate during Major Incidents

  61. @arupchak

  62. @arupchak Improving Reliability means constantly failing, constantly recovering, and constantly learning

  63. @arupchak Yes, it can be exhausting, but it is worth it

  64. @arupchak Improving Culture means constantly failing, constantly recovering, and constantly learning

  65. @arupchak Yes, it can be even more exhausting, but it is really really really worth it

  66. Arup Chakrabarti Director of Engineering, PagerDuty Thank You WE ARE HIRING PAGERDUTY.COM/CAREERS ARUP@PAGERDUTY.COM @arupchak

  67. @arupchak Related Reading • https://response.pagerduty.com/ • https://www.pagerduty.com/blog/intern-insights-on-call-experience/ • https://www.pagerduty.com/blog/failure-fridays-four-years/ • https://speakerdeck.com/arupchak