What is SRE

Tammy Bryant Butow (Principal Site Reliability Engineer @ Gremlin) answers the question: what is SRE (Site Reliability Engineering)?

Go to https://gremlin.com/talks for free Gremlin stickers as a gift for watching this talk

Tammy Bryant Butow

May 14, 2021

Transcript

  1. What is SRE? Tammy Butow, Principal SRE @ Gremlin

  2. Agenda: 1. What is SRE? 2. SRE Phases 3. SRE Use Cases 4. SRE Success Stories. Product Development, Capacity Planning, Testing + Release Procedures, Postmortem Analysis, Incident Response, Monitoring. @tammybutow
  3. What is SRE?

  4. What is SRE? Site Reliability Engineering (SRE) is a software engineering strategy and methodology. The term SRE was coined by Ben Treynor (Google) in 2003. Site Reliability Engineering involves both ops work (tickets, on-call and manual tasks) and development work (internal tooling, SRE tools and building automatic systems). The percentage of time spent on ops versus development depends on the needs of your organisation. It’s an important metric to track! Over time, the ops % for each system should decrease.
  5. “Our work is like being a part of the world’s most intense pit crew. We change the tires of a race car as it’s going 100mph.” - Andrew Widdowson (SRE @ Google)
  6. What is SRE? A day in the life of an SRE: Ops 50% of time, Dev 50% of time.
  7. What is SRE? A day in the life of an SRE: Ops 80% of time, Dev 20% of time.
  8. What is SRE? A day in the life of an SRE: Ops 25%, Dev 50%.
  9. What is SRE? A day in the life of an SRE: Ops 25%, Dev 50%, ? 25%. This is time I can potentially share with another team!
  10. SRE Phases

  11. SRE Phases: Plan, Code, Test, Build, Deploy, Operate, Productionize, Integration, Monitor
  12. SRE Use Cases

  13. Product Development, Capacity Planning, Testing + Release Procedures, Postmortem Analysis, Incident Response, Monitoring. Numbered markers: 1, 2, 3.
  14. SRE Use Case 1: Incident Response

  15. SRE Use Case 1: Incident Response
    DETECTION: Alert & page for SEV
    DIAGNOSIS: Discover source of SEV
    MITIGATION: Introduce fix and mitigate impact of SEV
    PREVENTION: GameDay to replicate SEV and confirm fix is reliable
    CLOSURE: Understand root cause and complete all SEV action items
    Metrics: TTD (Time to Detection), TTI (Total time of Impact), TTR (Time to Recovery), TBF (Time between failures)
    ROLES & RESPONSIBILITIES: Incident Manager On-Call (IMOC), Tech Lead On-Call (TLOC). The IMOC leads and coordinates the SEV team through the SEV lifecycle. The TLOC settles in the trenches and stays laser-focused on technical problem solving.
  16. SRE Use Case 2: Postmortem Analysis

  17. SRE Use Case 2: Postmortem Analysis
    Postmortem: SEV 0 Slow Walrus
    Owner: IMOC (), TLOC ()
    Status: Final/Draft
    Incident Date:
    Published Date:
    Executive Summary
    Impact:
    Root causes:
    Problem Summary
    Duration of problem:
    Product(s) affected:
    % of product affected:
    User Impact:
    Revenue Impact:
    Detection:
    Resolution:
    Root Causes & Trigger:
    Timeline / Recovery efforts:
    Lessons Learned:
    What went well?
    What went poorly? • Outage • Recovery
    Where did we get lucky?
    Action Items:
    Glossary:
    Appendix:
  18. SRE Use Case 2: Postmortem Analysis. Incident Database, Postmortem Analysis Dashboard, Postmortems, Postmortem Database
  19. SRE Use Case 3: Incident Reproduction

  20. SRE Use Case 3: Incident Reproduction. Postmortem, Gremlin Scenarios, Incident Reproduction Results, Automate Gremlin Scenarios
  21. SRE Success Stories

  22. SRE Success Stories: Dropbox. 10x reduction in incidents in 3 months; no SEV 0s for 12+ months; reduction in on-call time %; increase in team engagement.
  23. SRE Success Stories: Gremlin. Regular monthly GameDays; identification of 10+ critical issues; reduction in on-call training time; increase in team knowledge.
  24. Join the community: gremlin.com/slack

  25. Thank You. tammy@gremlin.com, linkedin.com/in/tammybutow/, @tammybutow
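
The metrics named on slides 4 and 15 (ops %, TTD, TTI, TTR, TBF) can be computed from a few timestamps per incident. The following is a minimal Python sketch, not something shown in the talk. The `Incident` record, its field names, and all function names are hypothetical. It also assumes that TTR is measured from detection to recovery and that TBF is measured from one recovery to the start of the next SEV; teams define both differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    """One SEV. Field names are illustrative, not taken from the talk."""
    started: datetime    # when impact began
    detected: datetime   # when the alert or page fired
    recovered: datetime  # when impact ended


def time_to_detection(incident: Incident) -> timedelta:
    """TTD: delay between the start of impact and the alert."""
    return incident.detected - incident.started


def total_time_of_impact(incident: Incident) -> timedelta:
    """TTI: full span during which users were affected."""
    return incident.recovered - incident.started


def time_to_recovery(incident: Incident) -> timedelta:
    """TTR, assumed here to run from detection to recovery."""
    return incident.recovered - incident.detected


def times_between_failures(incidents: List[Incident]) -> List[timedelta]:
    """TBF, assumed here to run from one recovery to the next start."""
    ordered = sorted(incidents, key=lambda i: i.started)
    return [nxt.started - prev.recovered
            for prev, nxt in zip(ordered, ordered[1:])]


def ops_share(ops_hours: float, dev_hours: float) -> float:
    """Percentage of tracked time spent on ops work (slide 4)."""
    total = ops_hours + dev_hours
    if total == 0:
        raise ValueError("no hours tracked")
    return 100.0 * ops_hours / total


if __name__ == "__main__":
    sev = Incident(
        started=datetime(2021, 5, 1, 10, 0),
        detected=datetime(2021, 5, 1, 10, 5),
        recovered=datetime(2021, 5, 1, 10, 45),
    )
    print("TTD:", time_to_detection(sev))
    print("TTI:", total_time_of_impact(sev))
    print("TTR:", time_to_recovery(sev))
    print("Ops %:", ops_share(ops_hours=32, dev_hours=8))
```

Tracking `ops_share` per system over time gives a direct check on the slide 4 claim that the ops % should decrease.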