What is SRE

Tammy Bryant Butow (Principal Site Reliability Engineer @ Gremlin) answers the question: what is SRE (Site Reliability Engineering)?

Go to https://gremlin.com/talks for free Gremlin stickers as a gift for watching this talk

Tammy Bryant Butow

May 14, 2021

Transcript

  1. Agenda: (1) What is SRE? (2) SRE Phases (3) SRE Use Cases (4) SRE Success Stories

    Product Development, Capacity Planning, Testing + Release Procedures, Postmortem Analysis, Incident Response, Monitoring. @tammybutow
  2. What is SRE? Site Reliability Engineering (SRE) is a software engineering strategy and methodology. The term SRE was coined by Ben Treynor (Google) in 2003.

    Site Reliability Engineering involves both ops work -- tickets, on-call & manual tasks -- and development work -- internal tooling, SRE tools and building automatic systems. The percentage of time spent on ops/development depends on the needs of your organisation. It’s an important metric to track! Over time the ops % for each system should decrease.
  3. “Our work is like being a part of the world’s most intense pit crew. We change the tires of a race car as it’s going 100mph.” - Andrew Widdowson (SRE @ Google)
  4. What is SRE? A day in the life of an SRE: Ops 50% of time, Dev 50% of time.
  5. What is SRE? A day in the life of an SRE: Ops 25%, Dev 50%, ? 25%. “This is time I can potentially share with another team!”
  6. Product Development, Capacity Planning, Testing + Release Procedures, Postmortem Analysis, Incident Response, Monitoring. Callouts: 1, 2, 3.
  7. SRE Use Case 1: Incident Response

    SEV lifecycle:
    DETECTION: Alert & page for SEV
    DIAGNOSIS: Discover source of SEV
    MITIGATION: Introduce fix and mitigate impact of SEV
    PREVENTION: GameDay to replicate SEV and confirm fix is reliable
    CLOSURE: Understand root cause and complete all SEV action items

    Metrics: TTD (Time to Detection), TTI (Total time of Impact), TTR (Time to Recovery), TBF (Time between failures)

    ROLES & RESPONSIBILITIES: Incident Manager On-Call (IMOC), Tech Lead On-Call (TLOC). The IMOC leads and coordinates the SEV team through the SEV lifecycle. The TLOC settles in the trenches and stays laser-focused on technical problem solving.
  8. SRE Use Case 2: Postmortem Analysis

    Postmortem: SEV 0 Slow Walrus
    Owner: IMOC (), TLOC ()
    Status: Final/Draft
    Incident Date:
    Published Date:
    Executive Summary: Impact; Root causes
    Problem Summary: Duration of problem; Product(s) affected; % of product affected; User Impact; Revenue Impact; Detection; Resolution
    Root Causes & Trigger:
    Timeline / Recovery efforts:
    Lessons Learned: What went well? What went poorly? (Outage, Recovery) Where did we get lucky?
    Action Items:
    Glossary:
    Appendix:
  9. SRE Use Case 2: Postmortem Analysis. Incident Database, Postmortem Analysis Dashboard, Postmortems, Postmortem Database.
  10. SRE Use Case 3: Incident Reproduction. Postmortem, Gremlin Scenarios, Incident Reproduction Results, Automate Gremlin Scenarios.
  11. SRE Success Stories: Dropbox. 10x reduction in incidents in 3 months; no SEV 0s for 12+ months; reduction in on-call time %; increase in team engagement.
  12. SRE Success Stories: Gremlin. Regular monthly GameDays; identification of 10+ critical issues; reduction in on-call training time; increase in team knowledge.
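
The metrics named in the slides (the ops/dev time split from "What is SRE?" and TTD, TTR and TBF from the Incident Response slide) can be sketched in a few lines of Python. This is a minimal illustration and not something taken from the talk: the `Incident` record, its field names, the sample timestamps and the exact start and end points chosen for each metric are all assumptions, and teams define these intervals differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical SEV record; the field names are assumptions."""
    started: datetime    # when impact began
    detected: datetime   # when the alert/page fired
    recovered: datetime  # when impact ended


def time_to_detection(incident: Incident) -> timedelta:
    """TTD: start of impact until the alert fired."""
    return incident.detected - incident.started


def time_to_recovery(incident: Incident) -> timedelta:
    """TTR. Assumption: measured from start of impact, not from detection."""
    return incident.recovered - incident.started


def times_between_failures(incidents: list[Incident]) -> list[timedelta]:
    """TBF. Assumption: gap from one recovery to the start of the next SEV."""
    ordered = sorted(incidents, key=lambda i: i.started)
    return [nxt.started - prev.recovered for prev, nxt in zip(ordered, ordered[1:])]


def ops_share(ops_hours: float, dev_hours: float) -> float:
    """Percentage of tracked time spent on ops work."""
    total = ops_hours + dev_hours
    if total == 0:
        raise ValueError("no hours logged")
    return 100.0 * ops_hours / total


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2021, 5, 1, 10, 0), datetime(2021, 5, 1, 10, 5),
             datetime(2021, 5, 1, 10, 45)),
    Incident(datetime(2021, 5, 8, 10, 45), datetime(2021, 5, 8, 10, 47),
             datetime(2021, 5, 8, 11, 0)),
]

print(time_to_detection(incidents[0]))       # 0:05:00
print(time_to_recovery(incidents[0]))        # 0:45:00
print(times_between_failures(incidents)[0])  # 7 days, 0:00:00
print(ops_share(ops_hours=10, dev_hours=30)) # 25.0
```

Recording `ops_share` per system at a regular interval would give the trend the deck says should decrease over time.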