SRE NEXT 2022: Sensible Incident Management for Software Startups

Transcript

  1. SRE NEXT 2022
    Sensible Incident Management for Software Startups
    Takayuki Watanabe
    @Launchable, Inc.

  2. Who?
    Name: Takayuki Watanabe
    Affiliation: Launchable, Inc.
    Role: Software Engineer
    SNS:
    Blog: blog.takanabe.tokyo
    GitHub: takanabe
    Twitter: @takanabe_w
    Interests:
    - Developer Productivity
    - Site Reliability Engineering
    - Sustainability Engineering
    SRE NEXT 2022 / presented by Takayuki Watanabe@Launchable, Inc. (May 15, 2022) 2

  3. Your takeaways
    You will understand that:
    • Incident management has a life cycle.
    • Incident response roles and structures exist to embody the 3T mental models.
    • Choosing the right strategies and tools makes incident management at startups sensible.

  4. Out of scope
    • Fundamental SRE terminology (e.g. SLO, SLI, Error budget, Postmortem)

  5. Disclaimer
    • This session refers to many existing incident management and SRE practices.
    • But it also contains many opinionated ideas and philosophies.
    • So the ideas might contradict some people's views.
    • Let's discuss on Twitter using #srenext with @takanabe_w

  6. Today's agenda
    • About Launchable
    • Does a startup need incident management?
    • Dissect incident management practices.
    • 3T mental models and life cycles
    • How can we improve incident management?
    • Choosing the right strategies and tools

  7. Chapter 1:
    About Launchable

  9. What is Launchable?
    A SaaS accelerating software development cycles.

  10. What is Launchable?
    Our current focus is machine-learning-based test selection by:
    • Predicting a meaningful subset of tests.
    • Identifying flaky tests.
    • Visualizing test trends with metrics.

  11. What is Launchable?
    e.g. Reordering tests based on likelihood of failures.

  12. Our team size
    • Launchable is a startup
    • 2 CEOs + 15 employees
    • Software engineers (7 people)
    • Product manager
    • Marketing
    • Sales
    • etc...

  13. Phases and the number of software engineers
    Note: the numbers are estimated by the presenter based on previous experience.
    • Phase 0: Founding ~ 4 software engineers
    • Phase 1: 5 ~ 10 software engineers
    • Phase 2: 11 ~ software engineers

  14. My SRE NEXT 2022 is about ...

  15. Incident management at software startups

  16. Does a startup need incident management?

  17. Yes, it's obvious if products have customers.

  18. Do you have enough engineering members?

  19. No! but...

  20. Learning from previous careers 1
    • I've worked at companies of various sizes and stages.
    • Company A: +300,000 people
    • Company B: +400 people (Joined when they had +300 people)
    • Company C: +150 people (Joined when they had fewer than 10 people)
    • Product development is always the highest-priority concern.
    • Operation improvement != Product development velocity degradation.
    • We will never have enough engineering members to improve operations.
    Never.
    1 SRE NEXT 2020: Designing fault-tolerant microservices with SRE and circuit breaker centric architecture

  21. Are speed and quality a trade-off?
    • I personally don't think so 2 3.
    • I believe sensible incident management accelerates our development velocity.
    2 martinFowler.com: Is High Quality Software Worth the Cost?
    3 A Philosophy of Software Design, Chapter 3: Working Code Isn't Enough, pp. 13 - 18.

  22. Can we reframe the original question?
    • We want to reframe "Does a startup need incident management?" to:
    • Which incident management processes won't change even under rapid development?
    • Which processes should we improve?
    • Let's dissect incident management practices in the industry.

  23. Chapter 2:
    Dissect incident management practices

  24. What is incident management?
    Incident management
    • High-level, overall process for handling incidents in an organization.
    Incident response
    • The part of incident management covering the actual technical steps, including detection, reporting, mitigation, and recovery during incidents.

  25. Existing practices

  26. e.g. Terminology

  27. Examples of terminology 4
    • CAN Reports
    • Deputy
    • Executive Swoop
    • Grenade Thrower
    • Incident Commander (IC)
    • Resolver
    • Severity
    • Scribe
    • Subject Matter Expert (SME)
    4 https://response.pagerduty.com/training/glossary/

  28. e.g. Roles

  29. Examples of roles at Google 5 6
    5 Google SRE Workbook, Chapter 9: Incident Response
    6 Anatomy of an Incident: Google's Approach to Incident Management for Production Services, Chapter 4: Mitigation and Recovery, pp. 31-32.

  30. Examples of roles at PagerDuty 7 8
    7 PagerDuty Incident Response Documentation, Different Roles -
    8 Google SRE Workbook, Chapter 9: Incident Response

  31. Too much!

  32. Can we translate these practices
    into higher-level concepts?

  33. Chapter 3:
    3T mental models and life cycles

  34. Examples of roles at PagerDuty

  35. Command role
    • Responsibility: manage the incident response so that the organization stays aligned.
    • Understand ongoing operations
    • Understand who is doing what
    • Delegate sub-commander responsibility to others if necessary.
    • Make incident response tangible.

  36. Liaison role
    • Responsibility: smooth reporting and communication.
    • Both internally and externally.
    • Make incident response transparent.

  37. Operation role
    • Responsibility: the actual technical activities that solve the issue.
    • Focus on triage, analysis, mitigation, and recovery.
    • Communication with the rest of the organization is not a primary concern.
    • In many cases operators introduced the root causes of incidents, but don't blame them.
    • Nobody wants to cause incidents.
    • All participants focus on their assigned roles based on a chain of trust.

  38. 3T mental models for incident response
    The incident response roles embody 3T mental models.
    • Transparency
    • Keep incident response information reachable for everybody.
    • Tangibility
    • Manage status of incidents.
    • Manage who handles what.
    • Trust
    • Believe that everybody makes their best effort during incidents.
    • Don't blame anybody because nobody wants to cause incidents.

  39. High level view of incident management cycles

  40. High level view of incident management cycles

  41. High level view of incident management cycles
    Examples:
    • Incident management policy
    • Documentation
    • Reporting mechanism
    • Observability
    • Alerting policy
    • Incident response training

  42. High level view of incident management cycles
    Examples:
    • Alerting
    • Triage
    • Root-cause analysis
    • Escalations
    • Opening war rooms

  43. High level view of incident management cycles
    Examples:
    • Rollback deployment (mitigation)
    • Kill slow queries (mitigation)
    • Fix bug (recovery)
    • Add index to tables (recovery)

  44. High level view of incident management cycles
    Examples:
    • Additional triage
    • Preparation for postmortems
    • Postmortems
    • Handle action items raised at postmortems

  45. Postmortem vs FtS
    https://twitter.com/takanabe_w/status/1510943694467186699

  46. Chapter 4:
    How can we improve incident management?

  47. Where should we invest our time?

  48. Where should we invest our time?

  49. Key times of incident response

  50. Key times of incident response
    • Time to detect (TTD)
    • Time to engagement (TTE)
    • Time to fix (TTF)
    • Time to repair/recovery (TTR)
    • Time between failures (TBF)
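
    As a minimal sketch, the key times listed above could be derived from four timestamps recorded per incident. The `Incident` fields and the exact interval boundaries are assumptions made for illustration, since the transcript does not define them in text: TTR is taken here as the sum of TTD, TTE, and TTF, and TBF as the gap between one recovery and the next failure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Incident:
    """Timestamps of one incident. Field names are hypothetical."""

    started_at: datetime    # when the failure began
    detected_at: datetime   # when an alert or a person noticed it
    engaged_at: datetime    # when a responder started working on it
    recovered_at: datetime  # when the service was healthy again


def key_times(incident: Incident) -> dict[str, timedelta]:
    """Split one incident into TTD, TTE, TTF and TTR (assumed definitions)."""
    ttd = incident.detected_at - incident.started_at
    tte = incident.engaged_at - incident.detected_at
    ttf = incident.recovered_at - incident.engaged_at
    return {"TTD": ttd, "TTE": tte, "TTF": ttf, "TTR": ttd + tte + ttf}


def time_between_failures(previous: Incident, current: Incident) -> timedelta:
    """TBF, assumed to run from the previous recovery to the next failure."""
    return current.started_at - previous.recovered_at


def _at(day: int, hour: int, minute: int) -> datetime:
    """Build a UTC timestamp in May 2022 for the example data below."""
    return datetime(2022, 5, day, hour, minute, tzinfo=timezone.utc)


# Two made-up incidents, one week apart.
first = Incident(_at(1, 9, 0), _at(1, 9, 5), _at(1, 9, 15), _at(1, 10, 0))
second = Incident(_at(8, 9, 0), _at(8, 9, 2), _at(8, 9, 10), _at(8, 9, 30))

for name, value in key_times(first).items():
    print(f"{name}: {value}")
print("TBF:", time_between_failures(first, second))
```

    Keeping the raw timestamps rather than precomputed durations makes it possible to change these definitions later without re-collecting data.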

  51. Time to detect (TTD)

  52. Time to engagement (TTE)

  53. Time to fix (TTF)

  54. Time to recovery (TTR)

  55. Time between failures (TBF)

  56. Which time do we want to improve?

  57. TTD at Launchable
    Current status
    • We already have several detection mechanisms using Datadog and Sentry.
    Solution
    • Introducing SLOs and error budgets makes our alerting criteria clearer.
    • But don't forget the law of diminishing returns when making decisions.
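
    To illustrate how an error budget can turn an SLO into an explicit alerting criterion, here is a small sketch in plain arithmetic. It does not use the Datadog or Sentry APIs; the function names and the 50% alert threshold are hypothetical choices, not Launchable's actual rule.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget (negative when overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


def should_alert(slo_target: float, total: int, failed: int,
                 min_remaining: float = 0.5) -> bool:
    """Alert once less than `min_remaining` of the budget is left (hypothetical rule)."""
    return error_budget_remaining(slo_target, total, failed) < min_remaining


# Example: a 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
print(error_budget_remaining(0.999, 1_000_000, 250))
print(should_alert(0.999, 1_000_000, 250))
print(should_alert(0.999, 1_000_000, 600))
```

    A single threshold like this is deliberately simple; tightening it further costs more engineering time for smaller gains, which is where the law of diminishing returns applies.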

  58. TTE at Launchable
    Current status
    • Easy enough to notice during office hours in Slack channels.
    • We don't have on-call rotations ATM, which makes TTE uncontrollable.
    Solution
    • Apply a follow-the-sun strategy to cover a wide range of hours.
    • Introduce on-call rotations and a pager.
    • But we don't feel it's necessary now.

  59. TTF at Launchable
    Current status
    • We don't have enough observability mechanisms.
    • We depend on each developer's debugging skill.
    • During this window, developers cannot spend time on product development.
    Solution
    • Introduce more team-shared observability dashboards.
    • Introduce more observability mechanisms to drill down into root causes.

  60. Which time do we want to improve?
    TTF improvement brings us high returns for a small effort.

  61. Which do we optimize? MTTR vs MTBF
    • Short MTTR and long MTBF are the best
    • Short MTTR but short MTBF
    = Incidents frequently occur but are recovered quickly.
    • Long MTTR but long MTBF
    = Incidents don't occur frequently, but once they occur, they aren't recovered soon.
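
    The two averages compared above can be computed from a list of outage windows, as in the sketch below. It assumes each outage is a `(started_at, recovered_at)` pair and that MTBF is measured from one recovery to the next failure; other definitions (for example start to start) exist, and the sample data is made up.

```python
from datetime import datetime, timedelta

Outage = tuple[datetime, datetime]  # (started_at, recovered_at)


def mean_time_to_recovery(outages: list[Outage]) -> timedelta:
    """Average duration from failure start to recovery."""
    if not outages:
        raise ValueError("need at least one outage")
    total = sum((end - start for start, end in outages), timedelta())
    return total / len(outages)


def mean_time_between_failures(outages: list[Outage]) -> timedelta:
    """Average healthy gap between one recovery and the next failure."""
    if len(outages) < 2:
        raise ValueError("need at least two outages")
    ordered = sorted(outages)
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Three made-up outages, roughly one week apart.
outages = [
    (datetime(2022, 5, 1, 9, 0), datetime(2022, 5, 1, 10, 0)),
    (datetime(2022, 5, 8, 9, 0), datetime(2022, 5, 8, 9, 30)),
    (datetime(2022, 5, 15, 10, 0), datetime(2022, 5, 15, 10, 30)),
]
print("MTTR:", mean_time_to_recovery(outages))
print("MTBF:", mean_time_between_failures(outages))
```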

  62. Startups should focus on MTTR improvement
    • There is no evolution without high-cadence iterations at startups.
    • TTD and TTE are difficult to improve for us.
    • Reducing TTF results in reducing MTTR.

  63. Do we have other times
    we haven't articulated?

  64. Hidden key times of incident life cycles

  65. Hidden key times of incident life cycles

  66. Hidden key times of incident life cycles

  67. Hidden key times of incident life cycles

  68. Hidden key times of incident life cycles

  69. Hidden key times of incident life cycles
    Don't underestimate the time we spend on post-incident activities.
    • Time to (additional) triage (TTT)
    • Time to learn (TTL)
    • Time to improvement (TTI)
    • Time to preparation (TTP)

  70. Power question:
    Which process do you hate?

  71. Which process do you hate?
    I personally don't want to spend time on the following processes.
    • Additional triage to dig into root causes.
    • Preparation for learning (!= I don't like joining postmortem sessions).
    • Maintenance of incident management processes.

  72. Why additional triage?
    • Startups don't have enough observability mechanisms.
    • We sometimes cannot find root causes (this is acceptable).
    • We tend to spend a lot of time here in that situation.

  73. Why preparation for learning?
    • Preparation for team-wide learning sessions takes time.
    • Documenting for postmortems.
    • Copy & paste dances to build a timeline from events scattered across various places.
    • Timelines need to account for time zones.
    • There is a gravity that prevents people from announcing incidents casually.
    • For startups, the most important activity is learning as a team.
    • If TTL is long, people cannot announce incidents casually.
    • As a result, postmortems ruin short MTTR with high-cadence learning iterations.
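
    The timeline work described above (collecting events from several places and reconciling time zones) can be sketched as follows. The event sources, the sample events, and the fixed `JST` and `PDT` offsets are assumptions for illustration; a real tool would read from Slack or monitoring exports and use a full time zone database.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets keep the sketch dependency-free; both names are illustrative.
JST = timezone(timedelta(hours=9), "JST")
PDT = timezone(timedelta(hours=-7), "PDT")

Event = tuple[datetime, str]  # (timestamp, description)


def merge_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from several sources and order them by absolute instant."""
    events = [event for source in sources for event in source]
    for timestamp, _ in events:
        if timestamp.tzinfo is None:
            raise ValueError("every timestamp must carry a time zone")
    return sorted(events, key=lambda event: event[0])


def render(events: list[Event], zones: list[timezone]) -> list[str]:
    """Render each event once, showing its time in every requested zone."""
    lines = []
    for timestamp, text in events:
        stamps = " / ".join(
            timestamp.astimezone(zone).strftime("%Y-%m-%d %H:%M %Z") for zone in zones
        )
        lines.append(f"{stamps}  {text}")
    return lines


# Hypothetical events copied from two different places.
chat_events = [
    (datetime(2022, 5, 15, 10, 5, tzinfo=JST), "Incident announced in Slack"),
    (datetime(2022, 5, 15, 10, 40, tzinfo=JST), "Rollback finished"),
]
monitoring_events = [
    (datetime(2022, 5, 14, 18, 0, tzinfo=PDT), "Error rate alert fired"),
]

timeline = merge_timeline(chat_events, monitoring_events)
for line in render(timeline, [JST, PDT]):
    print(line)
```

    Sorting on time-zone-aware timestamps means an alert recorded in US Pacific time lands before a Slack message recorded five minutes later in Japan, even though the calendar dates differ.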

  74. Why maintenance of incident management processes?
    • Maintenance of incident management processes includes:
    • Updating incident management policy.
    • Improving incident management structures.
    • Updating documents.
    • Training people to align with the updates.
    • Characteristically, incidents don't occur frequently.
    • It is too tough for everybody to memorize incident response processes.
    • In urgent situations, people don't read documents.

  75. Can we reduce TTI?
    • It depends on the action items coming from postmortems.
    • No team can handle all the action items discussed during postmortems.
    • Common anti-papriorities.

  76. Approach for unbalanced action items
    • Think of engineering members' capacity
    • Prioritize and classify the work 9 10
    9 Postmortem Action Items: Plan the Work and Work the Plan, USENIX SREcon 2017
    10 Anatomy of an Incident, Chapter 5

  77. Importance / Size / Urgency (ISU) Matrix
    • Assignee's confidence is also valuable to declare.
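
    One possible way to turn an ISU matrix into an ordered backlog is sketched below. The three-level scales, the `ActionItem` fields, and the ordering rule (importance first, then urgency, then smaller size, then higher assignee confidence) are all assumptions for illustration; the slide does not specify how the matrix is scored.

```python
from dataclasses import dataclass

LEVELS = {"low": 1, "medium": 2, "high": 3}
SIZES = {"small": 1, "medium": 2, "large": 3}


@dataclass(frozen=True)
class ActionItem:
    """A postmortem action item classified on the ISU axes (hypothetical model)."""

    title: str
    importance: str            # "low" | "medium" | "high"
    size: str                  # "small" | "medium" | "large"
    urgency: str               # "low" | "medium" | "high"
    confidence: str = "medium"  # assignee's declared confidence


def priority_key(item: ActionItem) -> tuple[int, int, int, int]:
    """Sort key: important and urgent first, then small, then confident."""
    return (
        -LEVELS[item.importance],
        -LEVELS[item.urgency],
        SIZES[item.size],
        -LEVELS[item.confidence],
    )


def prioritize(items: list[ActionItem]) -> list[ActionItem]:
    """Return the items in the order the team would pick them up."""
    return sorted(items, key=priority_key)


# Made-up action items from a postmortem.
items = [
    ActionItem("Rewrite deployment pipeline", "high", "large", "low"),
    ActionItem("Fix typo in runbook", "low", "small", "low"),
    ActionItem("Add dashboard for queue depth", "high", "medium", "high", "high"),
    ActionItem("Add index to slow table", "high", "small", "high"),
]
for item in prioritize(items):
    print(item.title)
```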

  78. ISU Matrix on GitHub Projects

  79. My focus is reduction of TTT, TTL, and TTP

  80. Chapter 5:
    Choosing the right strategies and tools

  81. Phases and the number of software engineers
    Note: the numbers are estimated by the presenter based on previous experience.
    • Phase 0: Founding ~ 4 software engineers
    • Phase 1: 5 ~ 10 software engineers
    • Phase 2: 11 ~ software engineers

  82. Let's reframe the original question again
    • Reframe "Does a startup need incident management?"
    • At startups, how can we:
    • Build an incident management structure enforcing the 3T mental models?
    • Improve the "times" of the incident management life cycle?

  83. Evolution of incident management at Launchable
    Improvement target | Actions from phase 0 to 1 | Actions from phase 1 to 2
    Transparency | - Encourage push communication | - Encourage pull communication
    - Create war rooms
    - Share status pages
    Tangibility | - Automate parts of incident response flow | - Automate entire incident response flow
    - Introduce incident lead role
    Trust | - Introduce blameless culture | - Split lead and operation roles for complex incidents
    Time to Engagement (TTE) | - Automate incident announcements | - Automate entire incident response flow
    - Introduce on-call rotations
    - Expand follow-the-sun coverage
    Time to Fix (TTF) | - Introduce observability | - Improve observability
    Time to Triage (TTT) | - Introduce observability | - Improve observability
    Time to Learn (TTL) | - Introduce postmortem template | - Generate postmortem
    Time to Preparation (TTP) | - Create incident management policies | - Enforce incident management policies
    - Self-service incident response trainings

  84. Phase 0: Founding ~ 4 software engineers

  85. No strategy
    • Product does not have customers.
    • We don't need incident responses.
    • Build incident management structure based on product growth.
    • All members do everything if necessary.

  86. Incident management system

  87. Phase 1: 5 ~ 10 software engineers

  88. Environmental changes from phase 0 to 1
    • When products have customers, we need incident management.
    • The more software engineers join, the more incidents happen.

  89. Strategy
    Make everything simple and easy to follow

  90. Incident management changes from phase 0 to 1
    Improvement target | Actions from phase 0 to 1 | Actions from phase 1 to 2
    Transparency | - Encourage push communication | - Encourage pull communication
    - Create war rooms
    - Share status pages
    Tangibility | - Automate parts of incident response flow | - Automate entire incident response flow
    - Introduce incident lead role
    Trust | - Introduce blameless culture | - Split lead and operation roles for complex incidents
    Time to Engagement (TTE) | - Automate incident announcements | - Automate entire incident response flow
    - Introduce on-call rotations
    - Expand follow-the-sun coverage
    Time to Fix (TTF) | - Introduce observability | - Improve observability
    Time to Triage (TTT) | - Introduce observability | - Improve observability
    Time to Learn (TTL) | - Introduce postmortem template | - Generate postmortem
    Time to Preparation (TTP) | - Create incident management policies | - Enforce incident management policies
    - Self-service incident response trainings

  91. Incident management system (phase 0 to 1)

  92. Incident management system (phase 0 to 1)

  93. Foundation of incident management policies
    • We maintain policies on Confluence.

  94. Foundation of incident management policies
    • We maintain policies on Confluence.

  95. Automation of incident escalations
    • We escalate incidents using Slack Workflow.
    • We handle incidents in a Slack channel and Google Meet.

  96. Automation of incident escalations
    • We escalate incidents using Slack Workflow.
    • We handle incidents in a Slack channel and Google Meet.

  97. Automation of incident escalations
    • We escalate incidents using Slack Workflow.
    • We handle incidents in a Slack channel and Google Meet.

  98. Introduction of postmortems
    • We keep all postmortems on Confluence.
    • We create a new postmortem page using a Confluence template feature.

  99. Introduction of postmortems
    • We keep all postmortems on Confluence.
    • We create a new postmortem page using a Confluence template feature.

  100. Very simple and easy to follow

  101. Postmortems as strong fact-based data
    • Even if we cannot solve root causes, we can use the postmortems as data.

  102. Can we improve the incident management?

  103. e.g. Is this what humans should take care of?

  104. e.g. Is this what humans should take care of?
    • We don't have a solid policy, but policy does not scale.
    • Employees live in Japan and the US.
    • Information shared only on Slack is easy to miss.

  105. e.g. Do we need roles?

  106. Phase 2: 11 ~ ?? software engineers

  107. Environmental changes from Phase 1 to 2
    • Our products have more customers.
    • The more software engineers join, the more incidents happen.
    • The increase in employees and time-zone gaps makes sync and push-style communication tough.
    • In the first place, Launchable encourages async and written communication.

  108. Strategy
    • Enforce incident management policies by software, not by documents.
    • Involve appropriate people based on pull-style communication.
    • Use the current tool chains in the company.
    • Too many new tools degrade teams' performance.
    • Use Slack as the interactive communication place to keep flow info.
    • Use Confluence to keep stock info (non-urgent communication).

  109. Incident management system (phase 1 to 2)

  110. Incident management system (phase 1 to 2)

  111. Incident management system (phase 1 to 2)

  112. SaaS: incident.io
    • https://incident.io/

  113. SaaS: Blameless
    • https://www.blameless.com/product/incident-resolution

  114. SaaS: Datadog Incident
    • https://www.datadoghq.com/blog/incident-response-with-datadog/

  115. SaaS: Grafana Incident
    • https://go2.grafana.com/incident-beta-interest.html

  116. OSS: monzo/response
    OSS version of incident.io
    • https://github.com/monzo/response
    • https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents/

  117. In-house tool: Slack App + Web App
    • It's not so difficult to implement a Slack App and a Web App for this purpose.
    • But... I want to use my time for other stuff.

  118. SaaS vs OSS vs in-house tool
    • We want to maximize developers' disposable time for product development.
    • We don't want to increase cognitive load.
    • OSS and in-house tools need code and document maintenance.
    • OSS and in-house tools need evangelism within the team.
    • Use SaaS if money allows (Buy, Not Build)
    • Salaries for software engineers are way more expensive than SaaS costs.
    • SaaS vendors improve their features as their business.
    • SaaS vendors maintain documents as product features.

  119. incident.io covers wide improvement targets
    Improvement target | Actions from phase 0 to 1 | Actions from phase 1 to 2
    Transparency | - Encourage push communication | - Encourage pull communication
    - Create war rooms
    - Share status pages
    Tangibility | - Automate parts of incident response flow | - Automate entire incident response flow
    - Introduce incident lead role
    Trust | - Introduce blameless culture | - Split lead and operation roles for complex incidents
    Time to Engagement (TTE) | - Automate incident announcements | - Automate entire incident response flow
    - Introduce on-call rotations
    - Expand follow-the-sun coverage
    Time to Fix (TTF) | - Introduce observability | - Improve observability
    Time to Triage (TTT) | - Introduce observability | - Improve observability
    Time to Learn (TTL) | - Introduce postmortem template | - Generate postmortem
    Time to Preparation (TTP) | - Create incident management policies | - Enforce incident management policies
    - Self-service incident response trainings

  120. Central channel for all incidents
    incident.io can share all incidents in the specified Slack channel.

  121. Dedicated war rooms (Slack channel)
    incident.io handles all the tasks we want completed when initializing an incident response.

  122. Dedicated war rooms (Slack channel)
    incident.io can assist incident responses.

  123. Dedicated Slack channel (closing incident)
    At the end of an incident response, incident.io tells us what needs to be done next.

  124. Status updates at central channel
    incident.io automatically syncs the latest status of incidents to the central channel.

  125. Postmortem generation
    incident.io can collect timelines from war rooms and generate postmortems.
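The idea of turning a war-room timeline into a postmortem draft can be sketched as follows. This is not incident.io's implementation; the `TimelineEntry` type, the `render_postmortem` function, and the draft layout are hypothetical, and the entries are assumed to have been exported from the dedicated Slack channel already.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEntry:
    """One message taken from the war room (hypothetical shape)."""
    timestamp: datetime
    author: str
    message: str


def render_postmortem(title: str, entries: list[TimelineEntry]) -> str:
    """Render a plain-text postmortem draft with the timeline in chronological order."""
    ordered = sorted(entries, key=lambda entry: entry.timestamp)
    lines = [f"Postmortem: {title}", "", "Timeline:"]
    for entry in ordered:
        lines.append(f"- {entry.timestamp:%Y-%m-%d %H:%M} {entry.author}: {entry.message}")
    # Sections a human still has to fill in after the incident.
    lines += ["", "Root cause: TODO", "Action items: TODO"]
    return "\n".join(lines)
```

The draft only pre-fills the timeline; analysis and action items remain manual work for the incident review.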

  126. Postmortem generation
    We can generate postmortem documents using incident.io.

  127. Postmortem generation
    We can collect timelines from dedicated Slack channels.

  128. Postmortem generation

  129. Self-training mode
    incident.io has a mode to walk through dummy incident responses on Slack.

  130. Introduction of lead role
    • We need communication leads when incidents are complex.
    • However, for most incidents, a single person can be responsible for operations
    and communications.
    • So, adding only a lead role is prudent; it keeps incident management from becoming
    overly complex.

  131. We have more room to improve!

  132. Recap
    • Incident management has a life cycle.
    • Preparation -> Detection -> Recovery -> Post-incident actions -> Preparation
    • Incident response roles and structures exist to embody 3T.
    • Transparency
    • Tangibility
    • Trust
    • Choosing strategies and tools makes incident management at startups sensible.
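The life cycle in the recap is a loop rather than a straight line, which can be sketched as a small cyclic state machine. The `Phase` enum and `next_phase` helper are illustrative names only; the phase labels come from the slide.

```python
from enum import Enum


class Phase(Enum):
    PREPARATION = "Preparation"
    DETECTION = "Detection"
    RECOVERY = "Recovery"
    POST_INCIDENT_ACTIONS = "Post-incident actions"


def next_phase(current: Phase) -> Phase:
    """Return the phase that follows `current`, wrapping back to Preparation."""
    phases = list(Phase)  # Enum members iterate in definition order
    return phases[(phases.index(current) + 1) % len(phases)]
```

The wrap-around from post-incident actions back to preparation is the point: what is learned from one incident feeds the preparation for the next.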

  133. Thanks

  134. References

  135. Incident management
    1. Atlassian: Understanding incident response roles and responsibilities, https://www.atlassian.com/incident-management/incident-response/roles-responsibilities
    2. PagerDuty Incident Response Training, https://response.pagerduty.com/training/overview/
    3. Anatomy of an Incident, Ayelet Sachto, Adrienne Walcer, and Jessie Yang, 2022
    4. US Federal Emergency Management Agency, Emergency Management Institute ICS Resource Center, https://training.fema.gov/emiweb/is/icsresource/
    5. The National Institute of Standards and Technology SP 800-61, Computer Security Incident Handling Guide, http://dx.doi.org/10.6028/NIST.SP.800-61r2
    6. Introduction: Incident Response overview, Gov UK National Cyber Security Centre, https://www.ncsc.gov.uk/collection/incident-management/incident-response
    7. Incident Review and Postmortem Best Practices, https://newsletter.pragmaticengineer.com/p/incident-review-best-practices
    8. Incident Review Practices [The Pragmatic Engineer Newsletter], https://docs.google.com/spreadsheets/d/1GPINipdf-l2H05iKOUbpkrqwlZ61ZCJDnwY5iE8LtRM/edit#gid=0

  136. SRE
    1. Google SRE book Chapter 14 - Managing Incidents, https://sre.google/sre-book/managing-incidents/
    2. Postmortem Action Items: Plan the Work and Work the Plan, Sue Lueder and Betsy Beyer (Google), USENIX SRECon 2017, https://www.usenix.org/conference/srecon17americas/program/presentation/lueder
    3. Google SRE book Chapter 15 - Postmortem Culture: Learning from Failure, https://sre.google/sre-book/postmortem-culture/
    4. Postmortem Metadata Index, https://postmortems.app/
    5. The Art of SLOs, Google Site Reliability Engineering, https://sre.google/resources/practices-and-processes/art-of-slos/
    6. danluu/post-mortems: A collection of postmortems, https://github.com/danluu/post-mortems
    7. Great Incident Review Examples, The Pragmatic Engineer, https://blog.pragmaticengineer.com/postmortem-best-practices/#great-incident-review-examples

  137. DevOps performance metrics
    1. Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations, 2018
    2. GoogleCloudPlatform/fourkeys, https://github.com/GoogleCloudPlatform/fourkeys
    3. Are you an Elite DevOps performer? Find out with the Four Keys Project, Google Cloud, https://cloud.google.com/blog/products/devops-sre/using-the-four-keys-to-measure-your-devops-performance
    4. DORA DevOps Quick Check, https://www.devops-research.com/quickcheck.html

  138. SaaS and OSS
    1. Datadog, https://www.datadoghq.com/blog/incident-response-with-datadog/
    2. incident.io, https://incident.io/
    3. jeli, https://www.jeli.io/
    4. monzo/response, https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents
    5. Etsy/morgue, https://github.com/etsy/morgue