life cycle.
• Incident response roles and structures exist to embody the 3T mental models.
• Choosing strategies and tools makes incident management at startups sensible.
SRE NEXT 2022 / presented by Takayuki Watanabe@Launchable, Inc. (May 15, 2022) 3
management and SRE practices.
• But it also contains a lot of opinionated ideas and philosophy.
• So, the ideas might contradict some people's views.
• Let's discuss on Twitter using #srenext with @takanabe_w
incident management?
• Dissect incident management practices.
• 3T mental models and life cycles
• How can we improve incident management?
• Choosing the right strategies and tools
selections by:
• Predicting a meaningful subset of tests.
• Identifying flaky tests.
• Visualizing test trends with metrics.
and stages.
• Company A: +300,000 people
• Company B: +400 people (joined when they had +300 people)
• Company C: +150 people (joined when they had fewer than 10 people)
• Product development is always the highest-priority concern.
• Operation improvement != Product development velocity degradation.
• We will never have enough engineering members to improve operations. Never.
1 SRE NEXT 2020: Designing fault-tolerant microservices with SRE and circuit breaker centric architecture
so (footnotes 2 and 3).
• I believe sensible incident management accelerates our development velocity.
2 martinFowler.com: Is High Quality Software Worth the Cost?
3 A Philosophy of Software Design, Chapter 3: Working Code Isn't Enough, pp. 13-18.
reframe "Does a startup need incident management?" to:
• Which incident management processes won't change even with rapid development?
• Which processes should we improve?
• Let's dissect incident management practices in the industry.
overall process for handling incidents in an organization.
Incident response
• The part of incident management that covers the actual technical steps during incidents: detection, reporting, mitigation, and recovery.
an Incident: Google’s Approach to Incident Management for Production Services, Chapter 4: Mitigation and Recovery, pp. 31-32.
5 Google SRE Workbook, Chapter 9: Incident Response
in organizations.
• Understand ongoing operations.
• Understand who is doing what.
• Delegate sub-commander responsibility to others if necessary.
• Make incident response tangible.
Both internally and externally.
• Make incident response transparent.
issues.
• Focus on triage, analysis, mitigation, and recovery.
• Communication with the rest of the organization is not a primary concern.
• In many cases, operators introduce the root causes of incidents, but don't blame them.
• Nobody wants to cause incidents.
• All participants focus on their assigned roles based on a chain of trust.
embody 3T mental models.
• Transparency
  • Keep information about incident responses reachable for everybody.
• Tangibility
  • Manage the status of incidents.
  • Manage who handles what.
• Trust
  • Believe everybody makes their best effort during incidents.
  • Don't blame anybody because nobody wants to cause incidents.
• Time to engagement (TTE)
• Time to fix (TTF)
• Time to repair/recovery (TTR)
• Time between failures (TBF)
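As a rough illustration of how these per-incident times could be derived from timestamps, here is a minimal Python sketch. The `Incident` class, its field names, and the exact phase boundaries (TTD from impact start to detection, TTR from impact start to recovery, TBF from the previous recovery to the next impact start) are assumptions made for illustration, not definitions taken from the slides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record of one incident's key timestamps."""
    started: datetime    # impact began
    detected: datetime   # an alert fired or someone noticed
    engaged: datetime    # a responder started working on it
    recovered: datetime  # service was back to normal

    @property
    def ttd(self) -> timedelta:
        """Time to detect (assumed: impact start until detection)."""
        return self.detected - self.started

    @property
    def tte(self) -> timedelta:
        """Time to engagement (assumed: detection until a responder engages)."""
        return self.engaged - self.detected

    @property
    def ttf(self) -> timedelta:
        """Time to fix (assumed: engagement until recovery)."""
        return self.recovered - self.engaged

    @property
    def ttr(self) -> timedelta:
        """Time to recovery (assumed: impact start until recovery)."""
        return self.recovered - self.started


def time_between_failures(previous: Incident, current: Incident) -> timedelta:
    """TBF (assumed: previous recovery until the next impact start)."""
    return current.started - previous.recovered


# Made-up example data.
first = Incident(
    started=datetime(2022, 5, 1, 10, 0),
    detected=datetime(2022, 5, 1, 10, 5),
    engaged=datetime(2022, 5, 1, 10, 20),
    recovered=datetime(2022, 5, 1, 11, 0),
)
second = Incident(
    started=datetime(2022, 5, 8, 11, 0),
    detected=datetime(2022, 5, 8, 11, 2),
    engaged=datetime(2022, 5, 8, 11, 10),
    recovered=datetime(2022, 5, 8, 11, 30),
)
```

Under these assumed boundaries, `first.ttr` equals `first.ttd + first.tte + first.ttf`, so shortening any one phase shortens the overall recovery time.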
detection mechanisms using Datadog and Sentry.
Solution
• Introducing SLOs and error budgets makes our alerting criteria clearer.
• But keep the "law of diminishing returns" in mind when making decisions.
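To make the error budget idea concrete, the following is a minimal sketch for a time-based availability SLO. The function names, the 99.9% target, and the 30-day window are hypothetical values chosen for illustration; the slides do not state Launchable's actual SLOs.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed downtime within `window` for an availability SLO such as 0.999."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta,
                     downtime: timedelta) -> timedelta:
    """Budget left after `downtime`; a negative value means the SLO is violated."""
    return error_budget(slo_target, window) - downtime


# Hypothetical numbers: a 99.9% target over a 30-day window,
# with 30 minutes of downtime already spent.
budget = error_budget(0.999, timedelta(days=30))
left = budget_remaining(0.999, timedelta(days=30), timedelta(minutes=30))
```

With these numbers the budget is about 43 minutes per 30 days. One possible alerting criterion is to page only when the remaining budget is being consumed faster than the window allows.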
during office hours in Slack channels.
• We don't have on-call rotations at the moment, which makes TTE uncontrollable.
Solution
• Apply a follow-the-sun strategy to cover a wide range of hours.
• Introduce on-call rotations and a pager.
• But we don't feel it's necessary now.
observability mechanisms
• Depends on each developer's debugging skill.
• During this window, developers cannot spend time on product development.
Solution
• Introduce more team-shared observability dashboards.
• Introduce more observability mechanisms to drill down into root causes.
and long MTBF are the best.
• Short MTTR but short MTBF = incidents occur frequently but are recovered quickly.
• Long MTTR but long MTBF = incidents don't occur frequently, but once they occur, recovery takes a long time.
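The trade-off between MTTR and MTBF can be sketched numerically. The snippet below is an illustration under stated assumptions: incidents are given as hypothetical `(start, end)` pairs, MTBF is measured from one recovery to the next start, and the steady-state availability estimate `MTBF / (MTBF + MTTR)` is the standard reliability formula rather than something stated in the slides.

```python
from datetime import datetime, timedelta


def mean(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to recovery over (start, end) pairs."""
    return mean([end - start for start, end in incidents])


def mtbf(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean gap between one incident's end and the next incident's start."""
    ordered = sorted(incidents)
    gaps = [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    if not gaps:
        raise ValueError("need at least two incidents to compute MTBF")
    return mean(gaps)


def availability(incidents: list[tuple[datetime, datetime]]) -> float:
    """Steady-state availability estimate: MTBF / (MTBF + MTTR)."""
    up, down = mtbf(incidents), mttr(incidents)
    return up / (up + down)


# Made-up history: three incidents, ten days apart.
history = [
    (datetime(2022, 1, 1, 0, 0), datetime(2022, 1, 1, 1, 0)),
    (datetime(2022, 1, 11, 1, 0), datetime(2022, 1, 11, 2, 0)),
    (datetime(2022, 1, 21, 2, 0), datetime(2022, 1, 21, 5, 0)),
]
```

Because availability depends on the ratio of the two means, frequent but quickly recovered incidents and rare but slowly recovered incidents can yield similar availability while feeling very different to operate.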
evolution without high-cadence iterations at startups.
• TTD and TTE are difficult for us to improve.
• Reducing TTF reduces MTTR.
times we spend on post-incident activities.
• Time to (additional) triage (TTT)
• Time to learn (TTL)
• Time to improvement (TTI)
• Time to preparation (TTP)
spend time on the following processes.
• Additional triage to dig into root causes.
• Preparation for learning (!= I don't like joining postmortem sessions).
• Maintenance of incident management processes.
• We sometimes cannot find root causes (this is acceptable).
• We tend to spend a lot of time here in that situation.
take time.
• Documenting postmortems.
• Copy-and-paste dances to build a timeline from information scattered across various places.
• The timeline needs to account for time zones.
• There is a gravity that prevents people from announcing incidents casually.
• For startups, the most important activity is learning as a team.
• If TTL is long, people cannot announce incidents casually.
• As a result, postmortems ruin short MTTR with high-cadence learning iterations.
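The timeline chore described above can be partly automated. The sketch below merges events from two hypothetical sources recorded in different time zones into one UTC-ordered timeline. The source names, event texts, and fixed UTC offsets (JST at +09:00, PDT at -07:00) are assumptions for illustration; a real tool would likely use `zoneinfo.ZoneInfo` (Python 3.9+) so daylight saving time is handled.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets chosen for illustration only (no daylight saving handling).
JST = timezone(timedelta(hours=9), "JST")
PDT = timezone(timedelta(hours=-7), "PDT")

Event = tuple[datetime, str]


def merge_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from several sources into one timeline sorted in UTC."""
    merged = []
    for source in sources:
        for timestamp, text in source:
            if timestamp.tzinfo is None:
                # Naive timestamps are ambiguous across offices, so reject them.
                raise ValueError(f"naive timestamp for event: {text!r}")
            merged.append((timestamp.astimezone(timezone.utc), text))
    return sorted(merged, key=lambda event: event[0])


# Made-up events: chat messages from Japan and a deploy log from the US.
slack_events = [
    (datetime(2022, 5, 15, 10, 0, tzinfo=JST), "alert fired"),
    (datetime(2022, 5, 15, 10, 45, tzinfo=JST), "rollback finished"),
]
deploy_log = [
    (datetime(2022, 5, 14, 17, 50, tzinfo=PDT), "deploy started"),
]

timeline = merge_timeline(slack_events, deploy_log)
for timestamp, text in timeline:
    print(timestamp.isoformat(), text)
```

Normalizing everything to UTC before sorting keeps the order correct even when the local calendar dates differ, as they do for the deploy entry here.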
management processes contains:
• Updating incident management policy.
• Improving incident management structures.
• Updating documents.
• Training people to align with the updates.
• Characteristically, incidents don't occur frequently.
• It is too tough for everybody to memorize incident response processes.
• In urgent situations, people don't read documents.
coming from postmortems.
• No team can handle all the action items discussed during postmortems.
• Common anti-pattern: people create too many action items and assign them without priorities.
capacity
• Prioritize and classify the work (footnotes 9 and 10).
9 Postmortem Action Items: Plan the Work and Work the Plan, USENIX SRECon 2017
10 Anatomy of an incident management, Chapter 5
startup need incident management?"
• At startups, how can we:
• Build an incident management structure enforcing the 3T mental models?
• Improve the "times" of the incident management life cycle?
don't need incident responses.
• Build the incident management structure based on product growth.
• All members do everything if necessary.
have customers, we need incident management.
• The more software engineers join, the more incidents happen.
• We create a new postmortem page using the Confluence template feature.
don't have solid policy but policy does not scale.
• Employees live in Japan and the US.
• Information shared only on Slack is easy to miss.
have more customers.
• The more software engineers join, the more incidents happen.
• A growing number of employees and time-zone gaps make synchronous, push-style communication tough.
• In the first place, Launchable encourages async and written communication.
documents.
• Involve appropriate people based on pull-style communication.
• Use the current tool chains in the company.
• Too many new tools degrade teams' performance.
• Use Slack as the interactive communication place to keep flow info.
• Use Confluence to keep stock info (non-urgent communication).
so difficult to implement a Slack app and a web app for this purpose.
• But... I want to use my time for other stuff.
maximize developers' disposable time for product development.
• We don't want to increase cognitive load.
• OSS and in-house tools need code and document maintenance.
• OSS and in-house tools of this type need evangelism.
• Use SaaS if money allows (Buy, Not Build).
• Salaries for software engineers are way more expensive than SaaS costs.
• SaaS vendors improve their features as their business.
• SaaS vendors maintain documentation as a product feature.
incidents are complex.
• However, for most incidents, a single person can be responsible for operations and communications.
• So, adding only a lead role is prudent so that we don't make incident management overly complex.
https://sre.google/sre-book/managing-incidents/
2. Postmortem Action Items: Plan the Work and Work the Plan, Sue Lueder and Betsy Beyer (Google), USENIX SRECon 2017, https://www.usenix.org/conference/srecon17americas/program/presentation/lueder
3. Google SRE book Chapter 15 - Postmortem Culture: Learning from Failure, https://sre.google/sre-book/postmortem-culture/
4. Postmortem Metadata Index, https://postmortems.app/
5. The Art of SLOs, Google Site Reliability Engineering, https://sre.google/resources/practices-and-processes/art-of-slos/
6. danluu/post-mortems: A collection of postmortems, https://github.com/danluu/post-mortems
7. Great Incident Review Examples, The Pragmatic Engineer, https://blog.pragmaticengineer.com/postmortem-best-practices/#great-incident-review-examples
and DevOps: Building and Scaling High Performing Technology Organizations, 2018.
2. GoogleCloudPlatform/fourkeys, https://github.com/GoogleCloudPlatform/fourkeys
3. Are you an Elite DevOps performer? Find out with the Four Keys Project, Google Cloud, https://cloud.google.com/blog/products/devops-sre/using-the-four-keys-to-measure-your-devops-performance
4. DORA DevOps Quick Check, https://www.devops-research.com/quickcheck.html