Slide 1

Slide 1 text

What is SRE?
Tammy Butow, Principal SRE @ Gremlin

Slide 2

Slide 2 text

Agenda
1. What is SRE?
2. SRE Phases
3. SRE Use Cases
4. SRE Success Stories
Product Development, Capacity Planning, Testing + Release Procedures, Postmortem Analysis, Incident Response, Monitoring
@tammybutow

Slide 3

Slide 3 text

What is SRE?

Slide 4

Slide 4 text

What is SRE?
Site Reliability Engineering (SRE) is a software engineering strategy and methodology. The term SRE was coined by Ben Treynor (Google) in 2003.
Site Reliability Engineering involves both ops work (tickets, on-call and manual tasks) and development work (internal tooling, SRE tools and building automated systems).
The percentage of time spent on ops versus development depends on the needs of your organisation. It’s an important metric to track! Over time, the ops % for each system should decrease.
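As a minimal sketch of tracking the ops/dev split described above, the Python below computes the ops percentage per month for one system and checks whether it is trending down. The log format, the system name `example-service`, and all hours are hypothetical and chosen only for illustration; a real team would pull these figures from its own ticketing or time-tracking data.

```python
from collections import defaultdict

# Hypothetical time log entries: (system, month, category, hours).
# The system name and every number here are made up for illustration.
TIME_LOG = [
    ("example-service", "2024-01", "ops", 80.0),
    ("example-service", "2024-01", "dev", 80.0),
    ("example-service", "2024-02", "ops", 60.0),
    ("example-service", "2024-02", "dev", 100.0),
    ("example-service", "2024-03", "ops", 40.0),
    ("example-service", "2024-03", "dev", 120.0),
]


def ops_percentage_by_month(log, system):
    """Return {month: ops hours as a percentage of ops + dev hours} for one system."""
    totals = defaultdict(lambda: {"ops": 0.0, "dev": 0.0})
    for name, month, category, hours in log:
        if name != system:
            continue
        if category not in ("ops", "dev"):
            raise ValueError(f"unknown category: {category!r}")
        totals[month][category] += hours

    percentages = {}
    for month in sorted(totals):
        total = totals[month]["ops"] + totals[month]["dev"]
        percentages[month] = 100.0 * totals[month]["ops"] / total if total else 0.0
    return percentages


def is_decreasing(values):
    """True if every value is strictly lower than the one before it."""
    values = list(values)
    return all(later < earlier for earlier, later in zip(values, values[1:]))


if __name__ == "__main__":
    trend = ops_percentage_by_month(TIME_LOG, "example-service")
    for month, pct in trend.items():
        print(f"{month}: ops {pct:.1f}%")
    print("ops % decreasing:", is_decreasing(trend.values()))
```

Keeping the calculation per system, rather than per team, matches the slide's point that the ops % for each system should decrease over time.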

Slide 5

Slide 5 text

“Our work is like being a part of the world’s most intense pit crew. We change the tires of a race car as it’s going 100mph.”
- Andrew Widdowson (SRE @ Google)

Slide 6

Slide 6 text

What is SRE?
A day in the life of an SRE
Ops: 50% time
Dev: 50% time

Slide 7

Slide 7 text

What is SRE?
A day in the life of an SRE
Ops: 80% time
Dev: 20% time

Slide 8

Slide 8 text

What is SRE?
A day in the life of an SRE
Ops: 25%
Dev: 50%

Slide 9

Slide 9 text

What is SRE?
A day in the life of an SRE
Ops: 25%
Dev: 50%
?: 25% (This is time I can potentially share with another team!)

Slide 10

Slide 10 text

SRE Phases

Slide 11

Slide 11 text

SRE Phases
Plan, Code, Test, Build, Deploy, Operate, Productionize, Integration, Monitor

Slide 12

Slide 12 text

SRE Use Cases

Slide 13

Slide 13 text

Product Development, Capacity Planning, Testing + Release Procedures, Postmortem Analysis, Incident Response, Monitoring
Markers on the diagram: 1, 2, 3

Slide 14

Slide 14 text

SRE Use Case 1: Incident Response

Slide 15

Slide 15 text

SRE Use Case 1: Incident Response
SEV lifecycle:
DETECTION: Alert & page for SEV
DIAGNOSIS: Discover source of SEV
MITIGATION: Introduce fix and mitigate impact of SEV
PREVENTION: GameDay to replicate SEV and confirm fix is reliable
CLOSURE: Understand root cause and complete all SEV action items
Metrics:
TTD (Time to Detection)
TTI (Total time of Impact)
TTR (Time to Recovery)
TBF (Time between failures)
ROLES & RESPONSIBILITIES
Incident Manager On-Call (IMOC): The IMOC leads and coordinates the SEV team through the SEV lifecycle.
Tech Lead On-Call (TLOC): The TLOC settles in the trenches and stays laser-focused on technical problem solving.
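The four metrics named on this slide (TTD, TTI, TTR, TBF) can be derived from a handful of timestamps per SEV. The sketch below shows one way to do that in Python. The slide gives only the metric names, so the exact definitions are assumptions: TTD is measured from start of impact to detection, TTR from detection to recovery, TTI from start of impact to recovery, and TBF from the recovery of one SEV to the start of the next. The `Sev` class and the example timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sev:
    """Hypothetical record of one SEV; field names are illustrative."""
    started: datetime    # impact begins
    detected: datetime   # alert fires and on-call is paged
    recovered: datetime  # impact ends


def ttd(sev: Sev) -> timedelta:
    """Time to Detection (assumed: start of impact until detection)."""
    return sev.detected - sev.started


def ttr(sev: Sev) -> timedelta:
    """Time to Recovery (assumed: detection until recovery)."""
    return sev.recovered - sev.detected


def tti(sev: Sev) -> timedelta:
    """Total time of Impact (assumed: start of impact until recovery)."""
    return sev.recovered - sev.started


def tbf(previous: Sev, current: Sev) -> timedelta:
    """Time between failures (assumed: previous recovery until next start)."""
    return current.started - previous.recovered


# Made-up example data for two consecutive SEVs.
sev_a = Sev(
    started=datetime(2024, 3, 1, 10, 0),
    detected=datetime(2024, 3, 1, 10, 5),
    recovered=datetime(2024, 3, 1, 10, 45),
)
sev_b = Sev(
    started=datetime(2024, 3, 11, 10, 45),
    detected=datetime(2024, 3, 11, 10, 47),
    recovered=datetime(2024, 3, 11, 11, 0),
)

if __name__ == "__main__":
    print("TTD:", ttd(sev_a))
    print("TTR:", ttr(sev_a))
    print("TTI:", tti(sev_a))
    print("TBF:", tbf(sev_a, sev_b))
```

Some teams measure TTR from the start of impact rather than from detection; whichever definition is chosen, applying it consistently across SEVs is what makes the trend meaningful.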

Slide 16

Slide 16 text

SRE Use Case 2: Postmortem Analysis

Slide 17

Slide 17 text

SRE Use Case 2: Postmortem Analysis
Postmortem: SEV 0 Slow Walrus
Owner: IMOC (), TLOC ()
Status: Final/Draft
Incident Date:
Published Date:
Executive Summary
Impact:
Root causes:
Problem Summary:
Duration of problem:
Product(s) affected:
% of product affected:
User Impact:
Revenue Impact:
Detection:
Resolution:
Root Causes & Trigger:
Timeline / Recovery efforts:
Lessons Learned:
What went well?
What went poorly?
● Outage
● Recovery
Where did we get lucky?
Action Items:
Glossary:
Appendix:
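One way to make a template like this easier to enforce is to check which sections are still empty before a postmortem moves from Draft to Final. The Python sketch below does that with a plain dictionary. The section names are copied from the template; the rule that every listed section must be filled in before the status becomes Final is an assumption made for illustration, not something the slide states, and the example values are hypothetical.

```python
# Section names copied from the postmortem template above.
TEMPLATE_SECTIONS = [
    "Impact",
    "Root causes",
    "Duration of problem",
    "Product(s) affected",
    "% of product affected",
    "User Impact",
    "Revenue Impact",
    "Detection",
    "Resolution",
    "Root Causes & Trigger",
    "Timeline / Recovery efforts",
    "Lessons Learned",
    "Action Items",
]


def missing_sections(postmortem):
    """Return template sections that are absent or blank, in template order."""
    return [
        name
        for name in TEMPLATE_SECTIONS
        if not str(postmortem.get(name, "")).strip()
    ]


def suggested_status(postmortem):
    """Assumed rule: a postmortem is 'Final' only when no section is missing."""
    return "Draft" if missing_sections(postmortem) else "Final"


# Hypothetical, partially filled postmortem.
draft = {
    "Impact": "Example impact description",
    "Detection": "Example detection description",
}

if __name__ == "__main__":
    print("Status:", suggested_status(draft))
    for name in missing_sections(draft):
        print("Missing:", name)
```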

Slide 18

Slide 18 text

SRE Use Case 2: Postmortem Analysis
Incident Database Postmortem Analysis Dashboard Postmortems Postmortem Database

Slide 19

Slide 19 text

SRE Use Case 3: Incident Reproduction

Slide 20

Slide 20 text

SRE Use Case 3: Incident Reproduction
Postmortem Gremlin Scenarios Incident Reproduction Results Automate Gremlin Scenarios

Slide 21

Slide 21 text

SRE Success Stories

Slide 22

Slide 22 text

SRE Success Stories: Dropbox
10x reduction in incidents in 3 months
No SEV 0s for 12+ months
Reduction in on-call time %
Increase in team engagement

Slide 23

Slide 23 text

SRE Success Stories: Gremlin
Regular monthly GameDays
Identification of 10+ critical issues
Reduction in on-call training time
Increase in team knowledge

Slide 24

Slide 24 text

Join the community: gremlin.com/slack

Slide 25

Slide 25 text

Thank You
tammy@gremlin.com
linkedin.com/in/tammybutow/
@tammybutow