Suffer Better

j.hand
June 08, 2018

Drawing on the perspectives of 4 ultra-endurance athletes, who coincidentally are experts in building resiliency and constant improvement into systems both digital and human, I’ll share some of the most critical aspects of site reliability engineering.

From preparation to pushing known limits to learning and improving, there is much that can be learned about how we approach building resiliency into our systems.

I will share a 3-tiered approach towards site reliability including:

Observability (from the customer’s perspective)
Chaos Engineering (proactively understanding reality and setting expectations)
GameDay & Incident Management (preparation and practice of important roles and procedures)

Audience members will walk away with a better understanding of Observability and where monitoring plays a role in it. They will also leave with actionable ideas they can implement quickly within their own organization to begin their own SRE initiative. Fears and confusion surrounding Chaos Engineering and QA testing “in Prod” will be clarified, and the value of such efforts will be made clear.

By sharing the stories of 4 athletes and the extreme (100+ mile) races they train for, execute, and learn from… I hope to expose a clear approach to increasing the uptime of systems while continuously delivering products and services our customers value the most.

Transcript

  1. 1 — @jasonhand | @victorops

  2. Jason Hand @jasonhand VictorOps

  3. Modern Business IT Challenges?

  4. Modern IT Challenges? ๏ Longer release cycles ๏ Broken deployments ๏ Slow time to resolve

  5. Modern IT Challenges? ๏ Low visibility ๏ Unnecessary gating ๏ Minimal feedback

  6. “Reliability is the single most important feature we provide.” — Dan Jones (CTO VictorOps)

  7. Value

  8. How To: Build Resilient Systems

  9. "Avoid shortcuts and embrace the pain it takes to reach our goals"

  10. Step One: Avoid Shortcuts

  11. Step Two: Embrace Pain

  12. If it hurts .. do it more often — Jez Humble

  13. Step Three: Reach Goals

  14. Agenda

  15. Start?

  16. Assess: Examine Health | Accept Reality | Plan

  17. Exercise: Push | Fail | Measure

  18. Rehearse: Practice | Learn From Failure | Improve

  20. Finish?

  21. There is no Finish Line

  22. Continuous Improvement

  23. You can't improve what you don't measure

  24. Marathon

  25. Erin Osgood

  26. 2014: Goal: 3:35 (Beat the Boston Qualifying Time)

  27. 2014: Goal: 3:35 (Beat the Boston Qualifying Time) Time: 3:37:27

  29. 2015: Goal: 3:35 (Beat the Boston Qualifying Time)

  30. 2015: Goal: 3:35 (Beat the Boston Qualifying Time) Time: 3:27:01

  32. 2016: Goal: Finish

  33. 2016: Goal: Finish Time: 3:31:12

  35. 2017: Goal: 3:27:01 (Better than 2015)

  36. 2017: Goal: 3:27:01 (Better than 2015) DNF

  38. 2018: Goal: 3:00 (less than)

  39. 2018: Goal: 3:00 (less than) Time: 3:11:15

  41. Erin's Advice: "Improvement Requires Setbacks" ๏ Stretch Goals ๏ Learn From Failure ๏ Measure & Accept "Reality"

  42. Reality

  44. Ultra Marathon

  45. Tom Hart

  46. Ridge Challenge: 112-mile single stage ultra marathon. Ireland's toughest single stage foot race.

  47. Rebecca Boozan

  48. Leadwoman 3rd place - 2016: 26.2-mile trail run + 50-mile mountain bike + 100-mile mountain bike + 10k run + 100-mile trail run

  49. Cordis Hall

  50. Cruel Jewel ๏ 106 miles ๏ 33,000 ft elevation change ๏ 5th Overall ๏ 28:30 final time ๏ Qualifier for the Hardrock 100 and Ultra Trail de Mont Blanc

  51. Transgrancanaria ๏ 80 miles across the island of Gran Canaria (Canary Islands) ๏ 26,000 ft elevation change ๏ 17 hours final time ๏ 63rd Overall ๏ 52nd Male ๏ 2nd American

  52. Trade Offs

  53. Speed vs Quality

  54. Building A Culture of Reliability

  56. SRE @ VictorOps

  57. jhand.co/SRE_Book

  58. SRE Council

  59. Council Members: Engineering, Support, Product

  60. Facilitating the culture of SRE: Empower each engineer’s “reliability feels,” so they can take ownership of improvements

  61. Facilitating the culture of SRE: Proactively expose dependencies across systems starting with dialogue and data

  62. Facilitating the culture of SRE: The council would serve as the point of contact for reliability conversations

  63. What Keeps Us Up At Night? (From The Customer's Point Of View)

  64. Council Concerns

  66. Themes - Broken Deployments - Slow Time To Recover - Cost of Downtime - Low Incident Visibility - Unhappy Customers - Long Release Cycles

  67. How Can We Answer These Questions?

  68. Observability

  69. Monitoring vs. Observability: Monitoring is an action we take on a system. Observability is a property of a system.

  71. Unknown Unknowns

  72. Assess: Is it doing what it is supposed to be doing? Determine what "normal" is and how to keep tabs on it in real time. Where are we now?

  74. Prioritize on the User Perspective (Proactive)

  75. Consumer Value Alerting

  76. Phases of Incidents

  77. Learning & Feedback jhand.co/PIR_book

  80. Chaos Engineering: Stretching, exercising, or otherwise pushing the system to its limits to know where those limitations exist.

  81. Chaos principlesofchaos.org

  82. Chaos Engineering ๏ Reduce Impact of Injury ๏ Flush out Unknown Unknowns (check conditions)

  85. GameDays: Using knowledge and a structured plan, routinely perform the actions. Expanding and improving current PRs

  86. GameDays ๏ Coordinate Strengths/Weaknesses/Opportunities ๏ Establish Training Program (simulation)

  89. “We need to create a culture that reinforces the value of taking risks and learning from failure and the need for repetition and practice to create mastery.” — Gene Kim (Co-author, Phoenix Project)

  90. Systems Thinking | Feedback | Experiment & Learn

  91. It's not just a technical solution. It's not just a procedure problem.

  92. DevOps: An approach to our "work" where we continuously look for methods to evaluate and improve the technology, process, and people as they relate to building, deploying, operating, securing, and supporting the "value" our organization provides.

  93. Holistic

  95. Now What? Understand where you are right now. Make more of the system knowable from the customer's perspective.

  96. Now What? Push the limits of your systems and use metrics to determine normal and thresholds

  97. Now What? Establish regime and workout routine to constantly work muscles and build intuition

  99. Building A Culture of Reliability

  100. SRE @ VictorOps

  104. Thank You
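
The "Assess" and "Now What?" slides suggest using metrics to determine what "normal" is and where thresholds sit. The deck does not say how, so the following is only a minimal sketch of one common approach: treat the mean of measurements taken during normal operation as the baseline and flag anything more than `k` standard deviations above it. The function names, the choice of `k = 3`, and the sample response times are all assumptions made for illustration, not something taken from the talk.

```python
from statistics import mean, stdev


def baseline(samples):
    """Return (mean, sample standard deviation) of measurements
    collected while the system was behaving normally."""
    return mean(samples), stdev(samples)


def threshold(samples, k=3.0):
    """Upper limit for 'normal': baseline mean plus k standard deviations.
    k = 3 is an assumed, commonly used starting point, not a rule."""
    mu, sigma = baseline(samples)
    return mu + k * sigma


def breaches(measurements, limit):
    """Return the measurements that exceed the limit."""
    return [m for m in measurements if m > limit]


# Hypothetical customer-facing response times in milliseconds.
normal_ms = [100, 102, 98, 101, 99, 100, 103, 97]
limit = threshold(normal_ms)

# New measurements, for example collected while pushing the system.
observed_ms = [101, 99, 250, 104]
flagged = breaches(observed_ms, limit)
print(f"limit={limit:.1f} ms, flagged={flagged}")
```

A fixed multiple of the standard deviation is only a starting point. Measurements gathered during chaos experiments or GameDays can be fed back in to revisit the baseline as the system changes.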