Slide 1

Slide 1 text

SRE NEXT 2022
Sensible Incident Management for Software Startups
Takayuki Watanabe @Launchable, Inc.

Slide 2

Slide 2 text

Who?
Name: Takayuki Watanabe
Affiliation: Launchable, Inc.
Role: Software Engineer
SNS:
- Blog:
- GitHub: takanabe
- Twitter: @takanabe_w
Interests:
- Developer Productivity
- Site Reliability Engineering
- Sustainability Engineering
SRE NEXT 2022 / presented by Takayuki Watanabe@Launchable, Inc. (May 15, 2022)

Slide 3

Slide 3 text

Your takeaways
You can understand:
• Incident management has a life cycle.
• Incident response roles and structures exist to embody 3T mental models.
• Choosing strategies and tools makes incident management at startups sensible.

Slide 4

Slide 4 text

Out of scope
• Fundamental SRE terminology (e.g. SLO, SLI, Error budget, Postmortem)

Slide 5

Slide 5 text

Disclaimer
• This session refers to many existing incident management and SRE practices.
• But it also contains a lot of opinionated ideas and philosophy.
• So, some ideas might contradict your own.
• Let's discuss on Twitter using #srenext with @takanabe_w

Slide 6

Slide 6 text

Today's agenda
• About Launchable
• Does a startup need incident management?
• Dissect incident management practices.
• 3T mental models and life cycles
• How can we improve incident management?
• Choosing right strategies and tools

Slide 7

Slide 7 text

Chapter 1: About Launchable

Slide 8

Slide 8 text


Slide 9

Slide 9 text

What is Launchable?
A SaaS accelerating software development cycles.

Slide 10

Slide 10 text

What is Launchable?
Current focus is machine learning based test selection by:
• Predicting a meaningful subset of tests.
• Identifying flaky tests.
• Visualizing test trends with metrics.

Slide 11

Slide 11 text

What is Launchable?
e.g. Reordering tests based on likelihood of failures.

Slide 12

Slide 12 text

Our team size
• Launchable is a startup
• 2 CEOs + 15 employees
  • Software engineer (7 people)
  • Product manager
  • Marketing
  • Sales
  • etc...

Slide 13

Slide 13 text

Phases and the number of software engineers
Note: the numbers are estimated by the presenter based on previous experiences.
• Phase 0: Founding ~ 4 software engineers
• Phase 1: 5 ~ 10 software engineers
• Phase 2: 11 ~ software engineers

Slide 14

Slide 14 text

My SRE NEXT 2022 is about ...

Slide 15

Slide 15 text

Incident management at software startups

Slide 16

Slide 16 text

Does a startup need incident management?

Slide 17

Slide 17 text

Yes, it's obvious if products have customers.

Slide 18

Slide 18 text

Do you have enough engineering members?

Slide 19

Slide 19 text

No! but...

Slide 20

Slide 20 text

Learning from previous careers [1]
• I've worked at companies of various sizes and stages.
  • Company A: +300,000 people
  • Company B: +400 people (joined when they had +300 people)
  • Company C: +150 people (joined when they had fewer than 10 people)
• Product development is always the highest-priority concern.
• Operation improvement != Product development velocity degradation.
• We will never have enough engineering members to improve operations. Never.
[1] SRE NEXT 2020: Designing fault-tolerant microservices with SRE and circuit breaker centric architecture

Slide 21

Slide 21 text

Are speed and quality a trade-off?
• I personally don't think so [2][3].
• I believe sensible incident management accelerates our development velocity.
[2] Martin Fowler, Is High Quality Software Worth the Cost?
[3] A Philosophy of Software Design, Chapter 3: Working Code Isn't Enough, pp. 13-18.

Slide 22

Slide 22 text

Can we reframe the original question?
• We want to reframe "Does a startup need incident management?" to:
  • Which incident management processes won't change even for rapid developments?
  • Which processes should we improve?
• Let's dissect incident management practices in the industry.

Slide 23

Slide 23 text

Chapter 2: Dissect incident management practices

Slide 24

Slide 24 text

What is incident management?
Incident management
• The high-level, overall process for handling incidents in an organization.
Incident response
• The part of incident management that covers the actual technical steps during incidents, including detection, reporting, mitigation, and recovery.

Slide 25

Slide 25 text

Existing practices

Slide 26

Slide 26 text

e.g. Terminology

Slide 27

Slide 27 text

Examples of terminology [4]
• CAN Reports
• Deputy
• Executive Swoop
• Grenade Thrower
• Incident Commander (IC)
• Resolver
• Severity
• Scribe
• Subject Matter Expert (SME)
[4] https://

Slide 28

Slide 28 text

e.g. Roles

Slide 29

Slide 29 text

Examples of roles at Google [5][6]
[5] Google SRE Workbook, Chapter 9: Incident Response
[6] Anatomy of an Incident: Google's Approach to Incident Management for Production Services, Chapter 4: Mitigation and Recovery, pp. 31-32.

Slide 30

Slide 30 text

Examples of roles at PagerDuty [7][8]
[7] PagerDuty Incident Response Documentation, Different Roles
[8] Google SRE Workbook, Chapter 9: Incident Response

Slide 31

Slide 31 text

Too much!

Slide 32

Slide 32 text

Can we translate these practices into higher-level concepts?

Slide 33

Slide 33 text

Chapter 3: 3T mental models and life cycles

Slide 34

Slide 34 text

Examples of roles at PagerDuty

Slide 35

Slide 35 text

Command role
• Responsibility: manage incident response so that the organization stays aligned.
  • Understand ongoing operations.
  • Understand who is doing what.
  • Delegate sub-commander responsibility to others if necessary.
• Make incident response tangible.

Slide 36

Slide 36 text

Liaison role
• Responsibility: smooth reporting and communications.
  • Both internally and externally.
• Make incident response transparent.

Slide 37

Slide 37 text

Operation role
• Responsibility: the actual technical activities to solve issues.
  • Focus on triage, analysis, mitigation, and recovery.
  • Communication with the rest of the organization is not a primary concern.
• In many cases, operators introduced the root causes of incidents, but don't blame them.
  • Nobody wants to cause incidents.
• All participants focus on their assigned roles based on a chain of trust.

Slide 38

Slide 38 text

3T mental models for incident response
The incident response roles embody 3T mental models.
• Transparency
  • Keep information on incident responses reachable for everybody.
• Tangibility
  • Manage status of incidents.
  • Manage who handles what.
• Trust
  • Believe everybody makes best efforts during incidents.
  • Don't blame anybody because nobody wants to cause incidents.

Slide 39

Slide 39 text

High level view of incident management cycles

Slide 40

Slide 40 text

High level view of incident management cycles

Slide 41

Slide 41 text

High level view of incident management cycles
Examples:
• Incident management policy
• Documentation
• Reporting mechanism
• Observability
• policy
• Incident response training

Slide 42

Slide 42 text

High level view of incident management cycles
Examples:
• Alerting
• Triage
• Root-cause analysis
• Escalations
• Opening war rooms

Slide 43

Slide 43 text

High level view of incident management cycles
Examples:
• Rollback deployment (mitigation)
• Kill slow queries (mitigation)
• Fix bug (recovery)
• Add index to tables (recovery)

Slide 44

Slide 44 text

High level view of incident management cycles
Examples:
• Additional triage
• Preparation for postmortems
• Postmortems
• Handle action items raised at postmortems

Slide 45

Slide 45 text

Postmortem vs FtS
https://twitt

Slide 46

Slide 46 text

Chapter 4: How can we improve incident management?

Slide 47

Slide 47 text

Where should we invest our time?

Slide 48

Slide 48 text

Where should we invest our time?

Slide 49

Slide 49 text

Key times of incident response

Slide 50

Slide 50 text

Key times of incident response
• Time to detect (TTD)
• Time to engagement (TTE)
• Time to fix (TTF)
• Time to repair/recovery (TTR)
• Time between failures (TBF)

Slide 51

Slide 51 text

Time to detect (TTD)

Slide 52

Slide 52 text

Time to engagement (TTE)

Slide 53

Slide 53 text

Time to fix (TTF)

Slide 54

Slide 54 text

Time to recovery (TTR)

Slide 55

Slide 55 text

Time between failures (TBF)

Slide 56

Slide 56 text

Which time do we want to improve?

Slide 57

Slide 57 text

TTD at Launchable
Current status
• We already have several detection mechanisms using Datadog and Sentry.
Solution
• Introducing SLOs and error budgets makes our alerting criteria clearer.
• But don't forget the law of diminishing returns when making decisions.
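The link between an SLO and alerting criteria can be made concrete with a small error budget calculation. The following is a minimal sketch, assuming an availability-style SLO measured over a fixed window; the function names and the sample numbers are illustrative assumptions, not values from Launchable.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Return the allowed downtime in minutes for an availability SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining_ratio(
    slo_target: float, window_days: int, downtime_minutes: float
) -> float:
    """Return the fraction of the error budget that is still unspent."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999, 30), 1))
    # After 21.6 minutes of downtime, about half of the budget remains.
    print(round(budget_remaining_ratio(0.999, 30, 21.6), 2))
```

An alert could then fire when the remaining ratio drops below a threshold the team agrees on, rather than on every individual error.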

Slide 58

Slide 58 text

TTE at Launchable
Current status
• Incidents are easy enough to notice in Slack channels during office hours.
• We don't have on-call rotations at the moment, which makes TTE uncontrollable.
Solution
• Apply a follow-the-sun strategy to cover a wide range of hours.
• Introduce on-call rotations and a pager.
  • But we don't feel it's necessary now.

Slide 59

Slide 59 text

TTF at Launchable
Current status
• We don't have enough observability mechanisms.
• TTF depends on each developer's debugging skill.
• During this window, developers cannot spend time on product development.
Solution
• Introduce more team-shared observability dashboards.
• Introduce more observability mechanisms to drill down into root causes.

Slide 60

Slide 60 text

Which time do we want to improve?
TTF improvement brings us high returns with small efforts.

Slide 61

Slide 61 text

Which do we optimize? MTTR vs MTBF
• Short MTTR and long MTBF are the best.
• Short MTTR but short MTBF = Incidents occur frequently but are recovered quickly.
• Long MTTR but long MTBF = Incidents don't occur frequently, but once they occur, they aren't recovered soon.
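The two metrics above can be computed from a list of incident start and recovery timestamps. This is a minimal sketch under stated assumptions: MTTR is taken as the mean of (recovered - started), and MTBF as the mean gap between one incident's recovery and the next incident's start. The `Incident` type and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    started: datetime
    recovered: datetime


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to recovery: average duration from start to recovery."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i.recovered - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents: list[Incident]) -> timedelta:
    """Mean time between failures: average gap between consecutive incidents."""
    if len(incidents) < 2:
        raise ValueError("at least two incidents are required")
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [
        later.started - earlier.recovered
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return sum(gaps, timedelta()) / len(gaps)


if __name__ == "__main__":
    history = [
        Incident(datetime(2022, 5, 1, 10, 0), datetime(2022, 5, 1, 10, 30)),
        Incident(datetime(2022, 5, 3, 10, 30), datetime(2022, 5, 3, 11, 30)),
        Incident(datetime(2022, 5, 4, 11, 30), datetime(2022, 5, 4, 12, 0)),
    ]
    print("MTTR:", mttr(history))
    print("MTBF:", mtbf(history))
```

Tracking both numbers over time shows which of the two quadrants a team is drifting toward.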

Slide 62

Slide 62 text

Startups should focus on MTTR improvement
• There is no evolution without high cadence iterations at startups.
• TTD and TTE are difficult to improve for us.
• Reducing TTF results in reducing MTTR.

Slide 63

Slide 63 text

Do we have other times we haven't articulated?

Slide 64

Slide 64 text

Hidden key times of incident life cycles

Slide 65

Slide 65 text

Hidden key times of incident life cycles

Slide 66

Slide 66 text

Hidden key times of incident life cycles

Slide 67

Slide 67 text

Hidden key times of incident life cycles

Slide 68

Slide 68 text

Hidden key times of incident life cycles

Slide 69

Slide 69 text

Hidden key times of incident life cycles
Don't underestimate the time we spend on post-incident activities.
• Time to (additional) triage (TTT)
• Time to learn (TTL)
• Time to improvement (TTI)
• Time to preparation (TTP)

Slide 70

Slide 70 text

Power question: Which process do you hate?

Slide 71

Slide 71 text

Which process do you hate?
I personally don't want to spend time on the following processes.
• Additional triage to dig into root causes.
• Preparation for learning (this does not mean I dislike joining postmortem sessions).
• Maintenance of incident management processes.

Slide 72

Slide 72 text

Why additional triage?
• Startups don't have enough observability mechanisms.
• We sometimes cannot find root causes (this is acceptable).
• We tend to spend a lot of time on triage in that situation.

Slide 73

Slide 73 text

Why preparation for learning?
• Preparation for team-wide learning sessions takes time.
  • Documenting for postmortems.
  • Copy-and-paste dances to build a timeline from information scattered across various places.
  • The timeline needs to account for time zones.
• There is a gravity that prevents people from announcing incidents casually.
  • For startups, the most important activity is learning as a team.
  • If TTL is long, people cannot announce incidents casually.
  • As a result, postmortems ruin short MTTR with high cadence learning iterations.
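The time-zone part of building a timeline can be sketched in a few lines. This is a minimal example, assuming each event was recorded as an ISO 8601 timestamp that carries its own UTC offset; the event notes and the function name are hypothetical.

```python
from datetime import datetime, timezone


def build_utc_timeline(events: list[tuple[str, str]]) -> list[str]:
    """Normalize (iso_timestamp, note) pairs to UTC and sort them by time."""
    normalized = []
    for stamp, note in events:
        local = datetime.fromisoformat(stamp)
        if local.tzinfo is None:
            # Refuse naive timestamps: their time zone would be a guess.
            raise ValueError(f"timestamp lacks a UTC offset: {stamp}")
        normalized.append((local.astimezone(timezone.utc), note))
    normalized.sort(key=lambda pair: pair[0])
    return [f"{when:%Y-%m-%d %H:%M} UTC  {note}" for when, note in normalized]


if __name__ == "__main__":
    # Hypothetical events recorded by teammates in Japan (+09:00) and the US (-07:00).
    events = [
        ("2022-05-15T10:40:00+09:00", "Recovered"),
        ("2022-05-14T18:20:00-07:00", "Rollback started"),
        ("2022-05-15T10:05:00+09:00", "Alert fired"),
    ]
    for line in build_utc_timeline(events):
        print(line)
```

Rejecting timestamps without an offset is a deliberate choice here: a silent default would put events in the wrong order without anyone noticing.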

Slide 74

Slide 74 text

Why maintenance of incident management processes?
• Maintenance of incident management processes includes:
  • Updating the incident management policy.
  • Improving incident management structures.
  • Updating documents.
  • Training people to align with the updates.
• By nature, incidents don't occur frequently.
  • It is too tough for everybody to memorize incident response processes.
  • In urgent situations, people don't read documents.

Slide 75

Slide 75 text

Can we reduce TTI?
• It depends on the action items coming from postmortems.
• No team can handle all the action items discussed during postmortems.
• Common anti-pa

Slide 76

Slide 76 text

Approach for unbalanced action items
• Think of engineering members' capacity
• Prioritize and classify the work [9][10]
[9] Postmortem Action Items: Plan the Work and Work the Plan, USENIX SRECon 2017
[10] Anatomy of an incident management, Chapter 5

Slide 77

Slide 77 text

Importance / Size / Urgency (ISU) Matrix
• Assignee's confidence is also valuable to declare.
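One way to turn an ISU classification into an ordered backlog is sketched below. The ordering rule (importance first, then urgency, then smaller size) and all field names are assumptions made for illustration; the slide itself does not prescribe a scoring formula.

```python
from dataclasses import dataclass

# Hypothetical three-step scales for each axis of the matrix.
LEVELS = {"low": 1, "medium": 2, "high": 3}
SIZES = {"small": 1, "medium": 2, "large": 3}


@dataclass(frozen=True)
class ActionItem:
    title: str
    importance: str
    size: str
    urgency: str
    confidence: str = "medium"  # the assignee's declared confidence


def prioritize(items: list[ActionItem]) -> list[ActionItem]:
    """Order items by importance, then urgency, then smallest size first."""
    return sorted(
        items,
        key=lambda i: (-LEVELS[i.importance], -LEVELS[i.urgency], SIZES[i.size]),
    )


if __name__ == "__main__":
    backlog = [
        ActionItem("Fix typo in runbook", "low", "small", "low"),
        ActionItem("Rewrite job queue", "high", "large", "low", confidence="low"),
        ActionItem("Add index to slow table", "high", "small", "high"),
    ]
    for item in prioritize(backlog):
        print(item.title, f"(confidence: {item.confidence})")
```

Confidence is carried along as a label rather than folded into the sort key, so that a low-confidence item stays visible for discussion instead of being silently deprioritized.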

Slide 78

Slide 78 text

ISU Matrix on GitHub Projects

Slide 79

Slide 79 text

My focus is reduction of TTT, TTL, and TTP

Slide 80

Slide 80 text

Chapter 5: Choosing right strategies and tools

Slide 81

Slide 81 text

Phases and the number of software engineers
Note: the numbers are estimated by the presenter based on previous experiences.
• Phase 0: Founding ~ 4 software engineers
• Phase 1: 5 ~ 10 software engineers
• Phase 2: 11 ~ software engineers

Slide 82

Slide 82 text

Let's reframe the original question again
• Reframe "Does a startup need incident management?"
• At startups, how can we:
  • Build an incident management structure enforcing the 3T mental models?
  • Improve the "times" of the incident management life cycle?

Slide 83

Slide 83 text

Evolution of incident management at Launchable

Improvement target | Actions from phase 0 to 1 | Actions from phase 1 to 2
Transparency | Encourage push communication | Encourage pull communication; Create war rooms; Share status pages
Tangibility | Automate parts of incident response flow | Automate entire incident response flow; Introduce incident lead role
Trust | Introduce blameless culture | Split lead and operation roles for complex incidents
Time to Engagement (TTE) | Automate incident announcements | Automate entire incident response flow; Introduce on-call rotations; Expand follow-the-sun coverage
Time to Fix (TTF) | Introduce observability | Improve observability
Time to Triage (TTT) | Introduce observability | Improve observability
Time to Learn (TTL) | Introduce postmortem template | Generate postmortem
Time to Preparation (TTP) | Create incident management policies | Enforce incident management policies; Self-service incident response trainings

Slide 84

Slide 84 text

Phase 0: Founding ~ 4 software engineers

Slide 85

Slide 85 text

No strategy
• The product does not have customers yet.
• We don't need incident response.
• Build the incident management structure as the product grows.
• All members do everything if necessary.

Slide 86

Slide 86 text

Incident management system

Slide 87

Slide 87 text

Phase 1: 5 ~ 10 software engineers

Slide 88

Slide 88 text

Environmental changes from phase 0 to 1
• When products have customers, we need incident management.
• The more software engineers join, the more incidents happen.

Slide 89

Slide 89 text

Strategy
Make everything simple and easy to follow

Slide 90

Slide 90 text

Incident management changes from phase 0 to 1

Improvement target | Actions from phase 0 to 1 | Actions from phase 1 to 2
Transparency | Encourage push communication | Encourage pull communication; Create war rooms; Share status pages
Tangibility | Automate parts of incident response flow | Automate entire incident response flow; Introduce incident lead role
Trust | Introduce blameless culture | Split lead and operation roles for complex incidents
Time to Engagement (TTE) | Automate incident announcements | Automate entire incident response flow; Introduce on-call rotations; Expand follow-the-sun coverage
Time to Fix (TTF) | Introduce observability | Improve observability
Time to Triage (TTT) | Introduce observability | Improve observability
Time to Learn (TTL) | Introduce postmortem template | Generate postmortem
Time to Preparation (TTP) | Create incident management policies | Enforce incident management policies; Self-service incident response trainings

Slide 91

Slide 91 text

Incident management system (phase 0 to 1)

Slide 92

Slide 92 text

Incident management system (phase 0 to 1)

Slide 93

Slide 93 text

Foundation of incident management policies
• We maintain policies on Confluence.

Slide 94

Slide 94 text

Foundation of incident management policies
• We maintain policies on Confluence.

Slide 95

Slide 95 text

Automation of incident escalations
• We escalate incidents using Slack Workflow.
• We handle incidents in a Slack channel and Google Meet.
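An escalation like this usually boils down to posting a small structured message. The following is a minimal sketch that only builds the JSON body for such a message; the field names (`incident_title`, `severity`, `reporter`) and the severity levels are hypothetical and would have to match whatever variables the workflow author defined. It does not send anything.

```python
import json

# Hypothetical severity levels; a real policy would define its own.
ALLOWED_SEVERITIES = {"low", "medium", "high"}


def build_escalation_payload(title: str, severity: str, reporter: str) -> str:
    """Return a JSON body describing an incident announcement."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    if not title.strip():
        raise ValueError("title must not be empty")
    payload = {
        "incident_title": title.strip(),
        "severity": severity,
        "reporter": reporter,
    }
    return json.dumps(payload)


if __name__ == "__main__":
    print(build_escalation_payload("API latency spike", "high", "takanabe"))
```

Validating the fields before sending keeps the announcement format consistent, which matters when people skim the channel during an incident.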

Slide 96

Slide 96 text

Automation of incident escalations
• We escalate incidents using Slack Workflow.
• We handle incidents in a Slack channel and Google Meet.

Slide 97

Slide 97 text

Automation of incident escalations
• We escalate incidents using Slack Workflow.
• We handle incidents in a Slack channel and Google Meet.

Slide 98

Slide 98 text

Introduction of postmortem
• We keep all postmortems on Confluence.
• We create a new postmortem page using a Confluence template feature.

Slide 99

Slide 99 text

Introduction of postmortem
• We keep all postmortems on Confluence.
• We create a new postmortem page using a Confluence template feature.

Slide 100

Slide 100 text

Very simple and easy to follow

Slide 101

Slide 101 text

Postmortems as strong fact-based data
• Even if we cannot resolve root causes, we can use the postmortems as data.

Slide 102

Slide 102 text

Can we improve incident management?

Slide 103

Slide 103 text

e.g. Is this something humans should take care of?

Slide 104

Slide 104 text

e.g. Is this something humans should take care of?
• We don't have a solid policy, but policy does not scale.
• Employees live in Japan and the US.
• Information shared only on Slack is easy to miss.

Slide 105

Slide 105 text

e.g. Do we need roles?

Slide 106

Slide 106 text

Phase 2: 11 ~ ?? software engineers

Slide 107

Slide 107 text

Environmental changes from phase 1 to 2
• Our products have more customers.
• The more software engineers join, the more incidents happen.
• More employees and wider time-zone gaps make sync and push-style communication tough.
  • In the first place, Launchable encourages async and written communication.

Slide 108

Slide 108 text

Strategy
• Enforce incident management policies by software, not by documents.
• Involve appropriate people based on pull-style communication.
• Use the tool chains the company already has.
  • Too many new tools degrade teams' performance.
  • Use Slack as the interactive communication place to keep flow info.
  • Use Confluence to keep stock info (non-urgent communication).

Slide 109

Slide 109 text

Incident management system (phase 1 to 2)

Slide 110

Slide 110 text

Incident management system (phase 1 to 2)

Slide 111

Slide 111 text

Incident management system (phase 1 to 2)

Slide 112

Slide 112 text

SaaS:
• https://

Slide 113

Slide 113 text

SaaS: Blameless
• https://

Slide 114

Slide 114 text

SaaS: Datadog Incident
• https://

Slide 115

Slide 115 text

SaaS: Grafana Incident
• https://

Slide 116

Slide 116 text

OSS: monzo/response
OSS version of
• https://
• https://

Slide 117

Slide 117 text

In-house tool: Slack App + Web App
• It's not so difficult to implement a Slack App and a Web App for this purpose.
• But... I want to use my time for other stuff.

Slide 118

Slide 118 text

SaaS vs OSS vs in-house tool
• We want to maximize developers' disposable time for product development.
• We don't want to increase cognitive load.
  • OSS and in-house tools need code and document maintenance.
  • OSS and in-house tools need evangelism activities.
• Use SaaS if money allows (Buy, Not Build)
  • Salaries for software engineers are way more expensive than SaaS cost.
  • SaaS vendors improve their features as their business.
  • SaaS vendors maintain documents as product features.

Slide 119

Slide 119 text

The tool covers wide improvement targets

Improvement target | Actions from phase 0 to 1 | Actions from phase 1 to 2
Transparency | Encourage push communication | Encourage pull communication; Create war rooms; Share status pages
Tangibility | Automate parts of incident response flow | Automate entire incident response flow; Introduce incident lead role
Trust | Introduce blameless culture | Split lead and operation roles for complex incidents
Time to Engagement (TTE) | Automate incident announcements | Automate entire incident response flow; Introduce on-call rotations; Expand follow-the-sun coverage
Time to Fix (TTF) | Introduce observability | Improve observability
Time to Triage (TTT) | Introduce observability | Improve observability
Time to Learn (TTL) | Introduce postmortem template | Generate postmortem
Time to Preparation (TTP) | Create incident management policies | Enforce incident management policies; Self-service incident response trainings

Slide 120

Slide 120 text

Central channel for all incidents
The tool can share all incidents in the specified Slack channel.

Slide 121

Slide 121 text

Dedicated war rooms (Slack channel)
The tool handles all tasks we want to complete for incident response initialization.

Slide 122

Slide 122 text

Dedicated war rooms (Slack channel)
The tool can assist incident responses.

Slide 123

Slide 123 text

Dedicated Slack channel (closing incident)
At the end of incident response, the tool tells us what needs to be done next.

Slide 124

Slide 124 text

Status updates at central channel
The tool automatically syncs the latest status of incidents to the central channel.

Slide 125

Slide 125 text

Postmortem generation
The tool can collect timelines from war rooms and generate postmortems.

Slide 126

Slide 126 text

Postmortem generation
We can generate postmortem documents using

Slide 127

Slide 127 text

Postmortem generation
We can collect timelines from dedicated Slack channels.
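Collecting a timeline from channel messages can be sketched as follows. This is a minimal illustration, not how any particular product does it: it assumes messages arrive as dictionaries with `ts` (epoch seconds as a string), `user`, and `text` fields, and the document layout is a made-up template.

```python
from datetime import datetime, timezone


def render_postmortem(title: str, messages: list[dict]) -> str:
    """Render a plain-text postmortem skeleton with a UTC timeline."""
    lines = [f"Postmortem: {title}", "", "Timeline (UTC)"]
    # Sort by timestamp so the timeline reads in chronological order.
    for msg in sorted(messages, key=lambda m: float(m["ts"])):
        when = datetime.fromtimestamp(float(msg["ts"]), tz=timezone.utc)
        lines.append(f"- {when:%Y-%m-%d %H:%M} {msg['user']}: {msg['text']}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical messages exported from a dedicated incident channel.
    sample = [
        {"ts": "1652577600.000100", "user": "U2", "text": "Rolled back"},
        {"ts": "1652576700.000200", "user": "U1", "text": "Alert fired"},
    ]
    print(render_postmortem("API latency spike", sample))
```

Generating the skeleton automatically removes the copy-and-paste step, which leaves the postmortem session free for analysis and learning.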

Slide 128

Slide 128 text

Postmortem generation

Slide 129

Slide 129 text

Self-training mode
The tool has a mode to walk through dummy incident responses on Slack.

Slide 130

Slide 130 text

Introduction of lead role
• We need communication leads when incidents are complex.
• However, for most incidents, a single person can be responsible for both operations and communications.
• So, adding only a lead role is prudent, so that we don't make incident management overly complex.

Slide 131

Slide 131 text

We have more room to improve!

Slide 132

Slide 132 text

Recap
• Incident management has a life cycle.
  • Preparation -> Detection -> Recovery -> Post-incident actions -> Preparation
• Incident response roles and structures exist to embody 3T.
  • Transparency
  • Tangibility
  • Trust
• Choosing strategy and tools makes incident management at startups sensible.

Slide 133

Slide 133 text

Thanks

Slide 134

Slide 134 text

References

Slide 135

Slide 135 text

Incident management
1. Atlassian: Understanding incident response roles and responsibilities, https:// response/roles-responsibilities
2. PagerDuty Incident Response Training, https://
3. Anatomy of an Incident, Ayelet Sachto, Adrienne Walcer, and Jessie Yang, 2022.
4. US Federal Emergency Management Agency, Emergency Management Institute ICS Resource Center, https:// emiweb/is/icsresource/.
5. The National Institute of Standards and Technology SP 800-61, Computer Security Incident Handling Guide, http:// 10.6028/NIST.SP.800-61r2.
6. Introduction: Incident Response overview, Gov UK National Cyber Security Centre, https:// management/incident-response
7. Incident Review and Postmortem Best Practices, https://
8. Incident Review Practices [The Pragmatic Engineer Newsletter], https:// l2H05iKOUbpkrqwlZ61ZCJDnwY5iE8LtRM/edit#gid=0

Slide 136

Slide 136 text

SRE
1. Google SRE book Chapter 14 - Managing Incidents, https://
2. Postmortem Action Items: Plan the Work and Work the Plan, Sue Lueder and Betsy Beyer (Google), USENIX SRECon 2017, https://
3. Google SRE book Chapter 15 - Postmortem Culture: Learning from Failure, https://
4. Postmortem Metadata Index, https://
5. The Art of SLOs, Google Site Reliability Engineering, https://
6. danluu/post-mortems: A collection of postmortems, https://
7. Great Incident Review Examples, The Pragmatic Engineer, https:// incident-review-examples

Slide 137

Slide 137 text

DevOps performance metrics
1. Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations, 2018.
2. GoogleCloudPlatform/fourkeys, https://
3. Are you an Elite DevOps performer? Find out with the Four Keys Project, Google Cloud, https:// devops-sre/using-the-four-keys-to-measure-your-devops-performance
4. DORA DevOps Quick Check, https://

Slide 138

Slide 138 text

SaaS and OSS
1. Datadog, https://
2., https://
3. jeli, https://
4. monzo/response, https://
5. Etsy/morgue, https://