Suffer Better

j.hand
June 08, 2018

Drawing on the perspective of four ultra-endurance athletes, who also happen to be experts in building resiliency and continuous improvement into systems both digital and human, I’ll share some of the most critical aspects of site reliability engineering.

From preparation to pushing known limits to learning and improving, there is much that can be learned about how we approach building resiliency into our systems.

I will share a three-tiered approach to site reliability:

Observability (from the customer’s perspective)
Chaos Engineering (proactively understanding reality and setting expectations)
GameDay & Incident Management (preparation and practice of important roles and procedures)

Audience members will walk away with a better understanding of Observability and where monitoring fits into it. They will also leave with actionable ideas they can implement quickly within their own organization to begin their own SRE initiative. Fears and confusion surrounding Chaos Engineering and testing “in Prod” will be addressed, and the value of such efforts made clear.

By sharing the stories of four athletes and the extreme (100+ mile) races they train for, execute, and learn from, I hope to lay out a clear approach to increasing the uptime of systems while continuously delivering the products and services our customers value most.
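
The first tier, observability from the customer's perspective, depends on knowing what "normal" looks like for something customers actually feel. The sketch below is one simple way to derive a baseline and an alert threshold from measurements. It is an illustration only: the function names, the sample latencies, and the choice of mean plus three standard deviations are assumptions, not something taken from the talk.

```python
from statistics import mean, stdev


def baseline_threshold(samples, k=3.0):
    """Return (normal, threshold) for a list of measurements.

    "Normal" is the mean; the threshold sits k standard deviations above it.
    Both choices are assumptions made for this sketch.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    normal = mean(samples)
    return normal, normal + k * stdev(samples)


def is_abnormal(value, threshold):
    """True when a new measurement exceeds the threshold."""
    return value > threshold


# Hypothetical customer-facing latencies in milliseconds.
checkout_latency_ms = [120, 130, 125, 135, 128, 122, 131, 127]
normal, threshold = baseline_threshold(checkout_latency_ms)

print(f"normal={normal:.2f} ms, threshold={threshold:.2f} ms")
print("400 ms abnormal?", is_abnormal(400, threshold))
```

A fixed mean-and-deviation rule is only a starting point; real traffic is often skewed or seasonal, so percentiles or time-of-day baselines may fit better.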

Transcript

  1. 1 — @jasonhand | @victorops

  2. Jason
    Hand
    @jasonhand
    VictorOps
    2 — @jasonhand | @victorops

  3. Modern Business IT Challenges?
    3 — @jasonhand | @victorops

  4. Modern IT Challenges?
    ๏ Longer release cycles
    ๏ Broken deployments
    ๏ Slow time to resolve
    4 — @jasonhand | @victorops

  5. Modern IT Challenges?
    ๏ Low visibility
    ๏ Unnecessary gating
    ๏ Minimal feedback
    5 — @jasonhand | @victorops

  6. “Reliability is the single most
    important feature we
    provide.”
    — Dan Jones (CTO VictorOps)
    6 — @jasonhand | @victorops

  7. Value
    7 — @jasonhand | @victorops

  8. How To:
    Build Resilient Systems
    8 — @jasonhand | @victorops

  9. "Avoid shortcuts and embrace the pain it takes to
    reach our goals"
    9 — @jasonhand | @victorops

  10. Step One: Avoid Shortcuts
    10 — @jasonhand | @victorops

  11. Step Two: Embrace Pain
    11 — @jasonhand | @victorops

  12. If it hurts ... do it
    more often
    — Jez Humble
    12 — @jasonhand | @victorops

  13. Step Three: Reach Goals
    13 — @jasonhand | @victorops

  14. Agenda
    14 — @jasonhand | @victorops

  15. Start?
    15 — @jasonhand | @victorops

  16. Assess
    Examine Health | Accept Reality | Plan
    16 — @jasonhand | @victorops

  17. Exercise
    Push | Fail | Measure
    17 — @jasonhand | @victorops

  18. Rehearse
    Practice | Learn From Failure | Improve
    18 — @jasonhand | @victorops

  19. 19 — @jasonhand | @victorops

  20. Finish?
    20 — @jasonhand | @victorops

  21. There is no
    Finish Line
    21 — @jasonhand | @victorops

  22. Continuous
    Improvement
    22 — @jasonhand | @victorops

  23. You can't improve
    what you don't measure
    23 — @jasonhand | @victorops

  24. Marathon
    24 — @jasonhand | @victorops

  25. Erin
    Osgood
    25 — @jasonhand | @victorops

  26. 2014:
    Goal: 3:35
    (Beat the Boston Qualifying Time)
    26 — @jasonhand | @victorops

  27. 2014:
    Goal: 3:35
    (Beat the Boston Qualifying Time)
    Time: 3:37:27
    27 — @jasonhand | @victorops

  28. 28 — @jasonhand | @victorops

  29. 2015:
    Goal: 3:35
    (Beat the Boston Qualifying Time)
    29 — @jasonhand | @victorops

  30. 2015:
    Goal: 3:35
    (Beat the Boston Qualifying Time)
    Time: 3:27:01
    30 — @jasonhand | @victorops

  31. 31 — @jasonhand | @victorops

  32. 2016:
    Goal: Finish
    32 — @jasonhand | @victorops

  33. 2016:
    Goal: Finish
    Time: 3:31:12
    33 — @jasonhand | @victorops

  34. 34 — @jasonhand | @victorops

  35. 2017:
    Goal: 3:27:01
    (Better than 2015)
    35 — @jasonhand | @victorops

  36. 2017:
    Goal: 3:27:01
    (Better than 2015)
    DNF
    36 — @jasonhand | @victorops

  37. 37 — @jasonhand | @victorops

  38. 2018:
    Goal: 3:00
    (less than)
    38 — @jasonhand | @victorops

  39. 2018:
    Goal: 3:00
    (less than)
    Time: 3:11:15
    39 — @jasonhand | @victorops

  40. 40 — @jasonhand | @victorops

  41. Erin's Advice
    "Improvement Requires Setbacks"
    ๏ Stretch Goals
    ๏ Learn From Failure
    ๏ Measure & Accept "Reality"
    41 — @jasonhand | @victorops

  42. Reality
    42 — @jasonhand | @victorops

  43. 43 — @jasonhand | @victorops

  44. Ultra
    Marathon
    44 — @jasonhand | @victorops

  45. Tom
    Hart
    45 — @jasonhand | @victorops

  46. Ridge Challenge
    112-mile single-stage ultra marathon.
    Ireland's toughest single-stage foot race.
    46 — @jasonhand | @victorops

  47. Rebecca
    Boozan
    47 — @jasonhand | @victorops

  48. Leadwoman
    3rd place - 2016
    26.2-mile trail run +
    50-mile mountain bike +
    100-mile mountain bike +
    10k run +
    and 100-mile trail run
    48 — @jasonhand | @victorops

  49. Cordis
    Hall
    49 — @jasonhand | @victorops

  50. Cruel Jewel
    ๏ 106 miles
    ๏ 33,000 ft elevation change
    ๏ 5th Overall
    ๏ 28:30 final time
    ๏ Qualifier for the Hardrock 100 and Ultra-Trail du Mont-Blanc
    50 — @jasonhand | @victorops

  51. Transgrancanaria
    ๏ 80 miles across the island of Gran Canaria (Canary Islands)
    ๏ 26,000 ft elevation change
    ๏ 17 hours final time
    ๏ 63rd Overall
    ๏ 52nd Male
    ๏ 2nd American
    51 — @jasonhand | @victorops

  52. Trade
    Offs
    52 — @jasonhand | @victorops

  53. Speed
    vs
    Quality
    53 — @jasonhand | @victorops

  54. Building
    A Culture of
    Reliability
    54 — @jasonhand | @victorops

  55. 55 — @jasonhand | @victorops

  56. SRE
    @ VictorOps
    56 — @jasonhand | @victorops

  57. jhand.co/SRE_Book
    57 — @jasonhand | @victorops

  58. SRE
    Council
    58 — @jasonhand | @victorops

  59. Council
    Members
    Engineering, Support, Product
    59 — @jasonhand | @victorops

  60. Facilitating the culture of SRE:
    Empower each engineer’s “reliability feels,” so they
    can take ownership of improvements
    60 — @jasonhand | @victorops

  61. Facilitating the culture of SRE:
    Proactively expose dependencies across systems
    starting with dialogue and data
    61 — @jasonhand | @victorops

  62. Facilitating the culture of SRE:
    The council would serve as the point of contact
    for reliability conversations
    62 — @jasonhand | @victorops

  63. What Keeps Us Up At Night?
    (From The Customer's Point Of View)
    63 — @jasonhand | @victorops

  64. Council
    Concerns
    64 — @jasonhand | @victorops

  65. 65 — @jasonhand | @victorops

  66. Themes
    - Broken Deployments
    - Slow Time To Recover
    - Cost of Downtime
    - Low Incident Visibility
    - Unhappy Customers
    - Long Release Cycles
    66 — @jasonhand | @victorops

  67. How Can We
    Answer These
    Questions?
    67 — @jasonhand | @victorops

  68. Observability
    68 — @jasonhand | @victorops

  69. Monitoring vs. Observability
    Monitoring is an action we take on a system.
    Observability is a property of a system.
    69 — @jasonhand | @victorops

  70. 70 — @jasonhand | @victorops

  71. Unknown
    Unknowns
    71 — @jasonhand | @victorops

  72. Assess
    Is it doing what it is supposed to be doing?
    Determine what "normal" is and how to keep tabs
    on it in real time. Where are we now?
    72 — @jasonhand | @victorops

  73. 73 — @jasonhand | @victorops

  74. Prioritize on the
    User Perspective
    (Proactive)
    74 — @jasonhand | @victorops

  75. Consumer Value
    Alerting
    75 — @jasonhand | @victorops

  76. Phases of
    Incidents
    76 — @jasonhand | @victorops

  77. Learning
    & Feedback
    jhand.co/PIR_book
    77 — @jasonhand | @victorops

  78. 78 — @jasonhand | @victorops

  79. 79 — @jasonhand | @victorops

  80. Chaos Engineering
    Stretching, exercising, or otherwise pushing the
    system to its limits to know where those
    limitations exist.
    80 — @jasonhand | @victorops

  81. Chaos
    principlesofchaos.org
    81 — @jasonhand | @victorops

  82. Chaos Engineering
    ๏ Reduce Impact of Injury
    ๏ Flush out Unknown Unknowns (check
    conditions)
    82 — @jasonhand | @victorops

  83. 83 — @jasonhand | @victorops

  84. 84 — @jasonhand | @victorops

  85. GameDays
    Using knowledge and a structured plan, routinely
    perform the actions. Expanding and improving
    current PRs
    85 — @jasonhand | @victorops

  86. GameDays
    ๏ Coordinate Strengths/Weaknesses/
    Opportunities
    ๏ Establish Training Program (simulation)
    86 — @jasonhand | @victorops

  87. 87 — @jasonhand | @victorops

  88. 88 — @jasonhand | @victorops

  89. “We need to create a culture that
    reinforces the value of taking risks
    and learning from failure and the need
    for repetition and practice to create
    mastery.”
    — Gene Kim (Co-author, Phoenix Project)
    89 — @jasonhand | @victorops

  90. Systems Thinking | Feedback | Experiment & Learn
    90 — @jasonhand | @victorops

  91. It's not just a technical solution.
    It's not just a procedure problem.
    91 — @jasonhand | @victorops

  92. DevOps
    An approach to our "work" where we continuously
    look for methods to evaluate and improve the
    technology, process, and people as they relate to
    building, deploying, operating, securing, and
    supporting the "value" our organization provides.
    92 — @jasonhand | @victorops

  93. Holistic
    93 — @jasonhand | @victorops

  94. 94 — @jasonhand | @victorops

  95. Now What?
    Understand where you are right now.
    Make more of the system knowable from the
    customer's perspective.
    95 — @jasonhand | @victorops

  96. Now What?
    Push the limits of your systems and use metrics to
    determine normal and thresholds
    96 — @jasonhand | @victorops

  97. Now What?
    Establish regime and workout routine to
    constantly work muscles and build intuition
    97 — @jasonhand | @victorops

  98. 98 — @jasonhand | @victorops

  99. Building
    A Culture of
    Reliability
    99 — @jasonhand | @victorops

  100. SRE
    @ VictorOps
    100 — @jasonhand | @victorops

  101. 101 — @jasonhand | @victorops

  102. 102 — @jasonhand | @victorops

  103. 103 — @jasonhand | @victorops

  104. Thank
    You
    104 — @jasonhand | @victorops

  105. 105 — @jasonhand | @victorops

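The Chaos Engineering slides describe pushing a system to its limits to learn where those limits are. As a rough illustration of that idea, the sketch below injects failures into a stand-in dependency and measures how often a retrying caller still succeeds. Everything here is hypothetical: the class and function names, the failure rates, and the retry policy are assumptions for the example, not part of the talk or of any particular chaos tool.

```python
import random


class FlakyDependency:
    """Wraps a callable and fails a given fraction of calls (the injected fault)."""

    def __init__(self, func, failure_rate, rng):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = rng

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return self.func(*args, **kwargs)


def call_with_retries(func, attempts=3):
    """Call func up to `attempts` times (attempts >= 1), re-raising the last error."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as err:
            last_error = err
    raise last_error


def run_experiment(failure_rate, requests=1000, attempts=3, seed=42):
    """Return the fraction of requests that succeed under the injected failure rate."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    dependency = FlakyDependency(lambda: "ok", failure_rate, rng)
    successes = 0
    for _ in range(requests):
        try:
            call_with_retries(dependency, attempts)
            successes += 1
        except ConnectionError:
            pass
    return successes / requests


for rate in (0.0, 0.3, 0.6, 0.9):
    print(f"failure_rate={rate:.1f} -> success ratio {run_experiment(rate):.3f}")
```

Stepping the failure rate up in small increments shows where the success ratio starts to drop below what customers would tolerate, which is the kind of limit the slides suggest finding before a real incident does.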