Suffer Better

June 08, 2018

Drawing on the perspectives of four ultra-endurance athletes, who coincidentally are also experts in building resiliency and continuous improvement into systems both digital and human, I’ll share some of the most critical aspects of site reliability engineering.

From preparation to pushing known limits to learning and improving, there is much that can be learned about how we approach building resiliency into our systems.

I will share a three-tiered approach to site reliability, including:

Observability (from the customer’s perspective)
Chaos Engineering (proactively understanding reality and setting expectations)
GameDay & Incident Management (preparation and practice of important roles and procedures)

Audience members will walk away with a better understanding of Observability and where monitoring fits within it, along with actionable ideas they can implement quickly to start an SRE initiative in their own organization. Fears and confusion surrounding Chaos Engineering and QA testing “in Prod” will be addressed, and the value of such efforts made clear.

By sharing the stories of four athletes and the extreme (100+ mile) races they train for, execute, and learn from, I hope to expose a clear approach to increasing the uptime of systems while continuously delivering the products and services our customers value most.



  1. Modern IT Challenges?
    ๏ Longer release cycles
    ๏ Broken deployments
    ๏ Slow time to resolve
    4 — @jasonhand | @victorops
  2. Modern IT Challenges?
    ๏ Low visibility
    ๏ Unnecessary gating
    ๏ Minimal feedback
  3. “Reliability is the single most important feature we provide.” — Dan Jones (CTO VictorOps)
  4. "Avoid shortcuts and embrace the pain it takes to reach our goals"
  5. If it hurts ... do it more often — Jez Humble
  6. Erin's Advice: "Improvement Requires Setbacks"
    ๏ Stretch Goals
    ๏ Learn From Failure
    ๏ Measure & Accept "Reality"
  7. Leadwoman: 3rd place, 2016
    26.2-mile trail run + 50-mile mountain bike + 100-mile mountain bike + 10k run + 100-mile trail run
  8. Cruel Jewel
    ๏ 106 miles
    ๏ 33,000' elevation change
    ๏ 5th Overall
    ๏ 28:30 final time
    ๏ Qualifier for the Hardrock 100 and Ultra-Trail du Mont-Blanc
  9. Transgrancanaria
    ๏ 80 miles across the island of Gran Canaria (Canary Islands)
    ๏ 26,000' elevation change
    ๏ 17 hours final time
    ๏ 63rd Overall
    ๏ 52nd Male
    ๏ 2nd American
  10. Facilitating the culture of SRE: Empower each engineer’s “reliability feels,” so they can take ownership of improvements
  11. Facilitating the culture of SRE: Proactively expose dependencies across systems, starting with dialogue and data
  12. Facilitating the culture of SRE: The council would serve as the point of contact for reliability conversations
  13. What Keeps Us Up At Night? (From The Customer's Point Of View)
  14. Themes
    - Broken Deployments
    - Slow Time To Recover
    - Cost of Downtime
    - Low Incident Visibility
    - Unhappy Customers
    - Long Release Cycles
  15. Monitoring vs. Observability
    Monitoring is an action we take on a system. Observability is a property of a system.
  16. Assess
    Is it doing what it is supposed to be doing? Determine what "normal" is and how to keep tabs on it in real time. Where are we now?
  17. Chaos Engineering
    Stretching, exercising, or otherwise pushing the system to its limits to know where those limitations exist.
  18. Chaos Engineering
    ๏ Reduce Impact of Injury
    ๏ Flush out Unknown Unknowns (check conditions)
  19. GameDays
    Using knowledge and a structured plan, routinely perform the actions. Expanding and improving current PRs.
  20. “We need to create a culture that reinforces the value of taking risks and learning from failure and the need for repetition and practice to create mastery.” — Gene Kim (Co-author, Phoenix Project)
  21. It's not just a technical solution. It's not just a procedure problem.
  22. DevOps
    An approach to our "work" where we continuously look for methods to evaluate and improve the technology, process, and people as they relate to building, deploying, operating, securing, and supporting the "value" our organization provides.
  23. Now What?
    Understand where you are right now. Make more of the system knowable from the customer's perspective.
  24. Now What?
    Push the limits of your systems and use metrics to determine normal and thresholds
  25. Now What?
    Establish regime and workout routine to constantly work muscles and build intuition
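
The "Assess" and "Now What?" slides talk about using metrics to determine what "normal" is and where thresholds sit. The deck does not say how to compute either, so the following is only a minimal sketch under stated assumptions: the samples are hypothetical request latencies in milliseconds taken during a steady period, "normal" is taken to be their median, and the threshold is placed three standard deviations above the mean. The names `learn_normal` and `is_abnormal` are invented for illustration.

```python
import statistics


def learn_normal(samples, k=3.0):
    """Summarize a steady-state window of measurements.

    Returns (normal, threshold): the median of the samples, and an alert
    threshold set k population standard deviations above the mean.
    The choice of median and of k=3 is an assumption, not from the talk.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    normal = statistics.median(samples)
    threshold = statistics.mean(samples) + k * statistics.pstdev(samples)
    return normal, threshold


def is_abnormal(value, threshold):
    """True when a new measurement exceeds the learned threshold."""
    return value > threshold


# Hypothetical latencies (ms), measured the way a customer would experience them.
steady_state_ms = [100, 102, 98, 101, 99]
normal, threshold = learn_normal(steady_state_ms)
print(f"normal={normal} ms, threshold={threshold:.1f} ms")
print(is_abnormal(150, threshold))
```

A fixed multiple of the standard deviation is only one possible rule; a percentile of observed values may suit skewed metrics such as latency better. Either way, the baseline would be re-learned after each exercise that pushes the system's limits.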