SRE NEXT 2022: Sensible Incident Management for Software Startups

Transcript

  1. SRE NEXT 2022
    Sensible Incident Management for Software Startups
    Takayuki Watanabe
    @Launchable, Inc.

  2. Who?
    Name: Takayuki Watanabe
    Affiliation: Launchable, Inc.
    Role: Software Engineer
    SNS:
    Blog: blog.takanabe.tokyo
    GitHub: takanabe
    Twitter: @takanabe_w
    Interests:
    - Developer Productivity
    - Site Reliability Engineering
    - Sustainability Engineering
    SRE NEXT 2022 / presented by Takayuki Watanabe@Launchable, Inc. (May 15, 2022) 2


  3. Your takeaways
    You will understand that:
    • Incident management has a life cycle.
    • Incident response roles and structures exist to embody the 3T mental models.
    • Choosing the right strategies and tools makes incident management at startups sensible.

  4. Out of scope
    • Fundamental SRE terminology (e.g. SLO, SLI, Error budget, Postmortem)

  5. Disclaimer
    • This session refers to many existing incident management and SRE practices.
    • But it also contains many opinionated ideas and philosophy.
    • So the ideas might contradict some people's views.
    • Let's discuss on Twitter using #srenext with @takanabe_w

  6. Today's agenda
    • About Launchable
    • Does a startup need incident management?
    • Dissect incident management practices.
    • 3T mental models and life cycles
    • How can we improve incident management?
    • Choosing the right strategies and tools

  7. Chapter 1:
    About Launchable

  9. What is Launchable?
    A SaaS that accelerates software development cycles.

  10. What is Launchable?
    The current focus is machine-learning-based test selection by:
    • Predicting a meaningful subset of tests.
    • Identifying flaky tests.
    • Visualizing test trends with metrics.

  11. What is Launchable?
    e.g. Reordering tests based on likelihood of failures.

  12. Our team size
    • Launchable is a startup
    • 2 CEOs + 15 employees
    • Software engineers (7 people)
    • Product manager
    • Marketing
    • Sales
    • etc...

  13. Phases and the number of software engineers
    Note: the numbers are estimated by the presenter based on previous experiences.
    • Phase 0: Founding ~ 4 software engineers
    • Phase 1: 5 ~ 10 software engineers
    • Phase 2: 11 ~ software engineers

  14. My SRE NEXT 2022 is about ...

  15. Incident management at software startups

  16. Does a startup need incident management?

  17. Yes, it's obvious if products have customers.

  18. Do you have enough engineering members?

  19. No! but...

  20. Learning from previous careers 1
    • I've worked at companies of various sizes and stages.
      • Company A: +300,000 people
      • Company B: +400 people (joined when they had +300 people)
      • Company C: +150 people (joined when they had fewer than 10 people)
    • Product development is always the highest-priority concern.
    • Operation improvement != Product development velocity degradation.
    • We will never have enough engineering members to improve operations. Never.
    1 SRE NEXT 2020: Designing fault-tolerant microservices with SRE and circuit breaker centric architecture

  21. Are speed and quality a trade-off?
    • I personally don't think so 2 3.
    • I believe sensible incident management accelerates our development velocity.
    2 martinFowler.com: Is High Quality Software Worth the Cost?
    3 A Philosophy of Software Design, Chapter 3: Working Code Isn't Enough, pp. 13-18.

  22. Can we reframe the original question?
    • We want to reframe "Does a startup need incident management?" to:
      • Which incident management processes won't change even under rapid development?
      • Which processes should we improve?
    • Let's dissect incident management practices in the industry.

  23. Chapter 2:
    Dissect incident management practices

  24. What is incident management?
    Incident management
    • The high-level, overall process for handling incidents in an organization.
    Incident response
    • The part of incident management that covers the actual technical steps during incidents, including detection, reporting, mitigation, and recovery.

  25. Existing practices

  26. e.g. Terminology

  27. Examples of terminology 4
    • CAN Reports
    • Deputy
    • Executive Swoop
    • Grenade Thrower
    • Incident Commander (IC)
    • Resolver
    • Severity
    • Scribe
    • Subject Matter Expert (SME)
    4 https://response.pagerduty.com/training/glossary/

  28. e.g. Roles

  29. Examples of roles at Google 5 6
    5 Google SRE Workbook, Chapter 9: Incident Response
    6 Anatomy of an Incident: Google's Approach to Incident Management for Production Services, Chapter 4: Mitigation and Recovery, pp. 31-32.

  30. Examples of roles at PagerDuty 7 8
    7 PagerDuty Incident Response Documentation, Different Roles -
    8 Google SRE Workbook, Chapter 9: Incident Response

  31. Too much!

  32. Can we translate these practices
    into higher-level concepts?

  33. Chapter 3:
    3T mental models and life cycles

  34. Examples of roles at PagerDuty

  35. Command role
    • Responsible for managing the incident response so that the organization stays aligned.
    • Understand ongoing operations.
    • Understand who is doing what.
    • Delegate sub-commander responsibility to others if necessary.
    • Make incident response tangible.

  36. Liaison role
    • Responsible for smooth reporting and communications.
    • Both internally and externally.
    • Make incident response transparent.

  37. Operation role
    • Responsible for the actual technical activities to solve issues.
    • Focus on triage, analysis, mitigation, and recovery.
    • Communication with the rest of the organization is not a primary concern.
    • In many cases, operators are the ones who introduced the root causes of incidents, but don't blame them.
    • Nobody wants to cause incidents.
    • All participants focus on their assigned roles based on a chain of trust.

  38. 3T mental models for incident response
    The incident response roles embody the 3T mental models.
    • Transparency
      • Keep information about incident responses reachable for everybody.
    • Tangibility
      • Manage the status of incidents.
      • Manage who handles what.
    • Trust
      • Believe everybody makes their best effort during incidents.
      • Don't blame anybody, because nobody wants to cause incidents.

  39. High level view of incident management cycles

  40. High level view of incident management cycles

  41. High level view of incident management cycles
    Examples:
    • Incident management policy
    • Documentation
    • Reporting mechanism
    • Observability
    • Alerting policy
    • Incident response training

  42. High level view of incident management cycles
    Examples:
    • Alerting
    • Triage
    • Root-cause analysis
    • Escalations
    • Opening war rooms

  43. High level view of incident management cycles
    Examples:
    • Rollback deployment (mitigation)
    • Kill slow queries (mitigation)
    • Fix bug (recovery)
    • Add index to tables (recovery)

  44. High level view of incident management cycles
    Examples:
    • Additional triage
    • Preparation for postmortems
    • Postmortems
    • Handle action items raised at postmortems

  45. Postmortem vs FtS
    https://twitter.com/takanabe_w/status/1510943694467186699

  46. Chapter 4:
    How can we improve incident management?

  47. Where should we invest our time?

  48. Where should we invest our time?

  49. Key times of incident response

  50. Key times of incident response
    • Time to detect (TTD)
    • Time to engagement (TTE)
    • Time to fix (TTF)
    • Time to repair/recovery (TTR)
    • Time between failures (TBF)
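
One way to make these key times concrete is to derive them from timestamps recorded during an incident. The sketch below is illustrative only. The slides define each time with diagrams, so the boundaries used here are assumptions: TTD runs from the start of the failure to detection, TTE from detection to a responder engaging, TTF from engagement to recovery, and TTR covers the whole span. `IncidentTimestamps`, `key_times`, and `time_between_failures` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class IncidentTimestamps:
    """Hypothetical record of the moments that bound each key time."""
    started: datetime    # the failure begins
    detected: datetime   # an alert fires or someone notices
    engaged: datetime    # a responder starts working on the incident
    recovered: datetime  # the service is back to normal


def key_times(t: IncidentTimestamps) -> dict[str, timedelta]:
    """Split one incident into key times, using the assumed boundaries above."""
    return {
        "TTD": t.detected - t.started,
        "TTE": t.engaged - t.detected,
        "TTF": t.recovered - t.engaged,
        "TTR": t.recovered - t.started,
    }


def time_between_failures(previous_recovered: datetime, next_started: datetime) -> timedelta:
    """Time between the end of one incident and the start of the next."""
    return next_started - previous_recovered


if __name__ == "__main__":
    utc = timezone.utc
    incident = IncidentTimestamps(
        started=datetime(2022, 5, 15, 1, 0, tzinfo=utc),
        detected=datetime(2022, 5, 15, 1, 5, tzinfo=utc),
        engaged=datetime(2022, 5, 15, 1, 15, tzinfo=utc),
        recovered=datetime(2022, 5, 15, 2, 0, tzinfo=utc),
    )
    for name, value in key_times(incident).items():
        print(name, value)
```

Under these assumed boundaries, TTR is simply TTD + TTE + TTF, which is why shortening any one of them shortens the time to recovery.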

  51. Time to detect (TTD)

  52. Time to engagement (TTE)

  53. Time to fix (TTF)

  54. Time to recovery (TTR)

  55. Time between failures (TBF)

  56. Which time do we want to improve?

  57. TTD at Launchable
    Current status
    • We already have several detection mechanisms using Datadog and Sentry.
    Solution
    • Introducing SLOs and error budgets makes our alerting criteria clearer.
    • But keep the law of diminishing returns in mind when making decisions.
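
As a rough illustration of how an SLO and its error budget can turn into an alerting criterion, here is a minimal sketch. It assumes a request-based SLO (good requests divided by total requests), which the talk does not specify. The function names and the paging threshold of 10 are hypothetical, not Launchable's actual settings.

```python
def _check_target(slo_target: float) -> None:
    """Reject targets that leave no error budget or make no sense."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    _check_target(slo_target)
    if total == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


def burn_rate(slo_target: float, total: int, failed: int) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    _check_target(slo_target)
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - slo_target)


def should_page(rate: float, threshold: float = 10.0) -> bool:
    """Page only when the budget burns much faster than planned (assumed threshold)."""
    return rate >= threshold


if __name__ == "__main__":
    print(error_budget_remaining(0.999, 1_000_000, 250))
    print(burn_rate(0.999, 1_000_000, 250))
```

With a 99.9% target, 1,000,000 requests, and 250 failures, about 75% of the budget remains and the burn rate is about 0.25, so nothing pages. The threshold is where the law of diminishing returns shows up: a lower value catches more incidents but also wakes people up more often.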

  58. TTE at Launchable
    Current status
    • During office hours, incidents are easy enough to notice in Slack channels.
    • We don't have on-call rotations at the moment, which makes TTE uncontrollable.
    Solution
    • Apply a follow-the-sun strategy to cover a wide range of hours.
    • Introduce on-call rotations and a pager.
    • But we don't feel it's necessary now.
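
A follow-the-sun strategy can be reasoned about by checking which offices are inside their office hours at a given moment. The sketch below is a simplified illustration. The talk mentions elsewhere that employees live in Japan and the US, but the specific time zones, the 09:00 to 18:00 office hours, and the name `offices_on_duty` are assumptions.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

# Assumed office time zones and hours; adjust to the real team layout.
OFFICES = {
    "Japan": ZoneInfo("Asia/Tokyo"),
    "US": ZoneInfo("America/Los_Angeles"),
}
OFFICE_START = time(9, 0)
OFFICE_END = time(18, 0)


def offices_on_duty(now: datetime) -> list[str]:
    """Return offices whose local time falls within weekday office hours.

    `now` must be timezone-aware.
    """
    on_duty = []
    for name, zone in OFFICES.items():
        local = now.astimezone(zone)
        is_weekday = local.weekday() < 5
        if is_weekday and OFFICE_START <= local.time() < OFFICE_END:
            on_duty.append(name)
    return on_duty


if __name__ == "__main__":
    for hour in (1, 12, 17):
        moment = datetime(2022, 5, 16, hour, 0, tzinfo=timezone.utc)
        print(moment.isoformat(), offices_on_duty(moment))
```

An empty result marks a coverage gap: with only these two assumed offices, 12:00 UTC on a Monday falls outside both office windows, which is where an on-call rotation would eventually be needed.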

  59. TTF at Launchable
    Current status
    • We don't have enough observability mechanisms.
    • Fixing depends on each developer's debugging skill.
    • During this window, developers cannot spend time on product development.
    Solution
    • Introduce more team-shared observability dashboards.
    • Introduce more observability mechanisms to drill down to root causes.

  60. Which time do we want to improve?
    Improving TTF brings high returns for a small effort.

  61. Which do we optimize? MTTR vs MTBF
    • Short MTTR and long MTBF are the best.
    • Short MTTR but short MTBF
      = Incidents occur frequently but are recovered quickly.
    • Long MTTR but long MTBF
      = Incidents don't occur frequently, but once they occur, they aren't recovered soon.
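
The two mixed profiles above can be compared numerically. This sketch computes MTTR and MTBF from a list of (start, end) incident windows and applies the standard steady-state availability formula MTBF / (MTBF + MTTR). Treating MTBF as the gap between one recovery and the next failure is an assumption, and the function names and sample numbers are hypothetical.

```python
from datetime import datetime, timedelta

Window = tuple[datetime, datetime]  # (incident start, incident end)


def _mean(deltas: list[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("need at least one interval")
    return sum(deltas, timedelta()) / len(deltas)


def mttr(incidents: list[Window]) -> timedelta:
    """Mean time from the start of an incident to its recovery."""
    return _mean([end - start for start, end in incidents])


def mtbf(incidents: list[Window]) -> timedelta:
    """Mean gap between one recovery and the next failure (needs two or more incidents)."""
    ordered = sorted(incidents)
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    return _mean(gaps)


def availability(mean_between: timedelta, mean_to_recover: timedelta) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mean_between / (mean_between + mean_to_recover)


if __name__ == "__main__":
    # Frequent but short incidents vs rare but long incidents (made-up numbers).
    frequent_short = availability(timedelta(hours=24), timedelta(minutes=10))
    rare_long = availability(timedelta(days=30), timedelta(hours=5))
    print(f"{frequent_short:.4f} {rare_long:.4f}")
```

With these made-up numbers both profiles come out at roughly 99.3% availability, so availability alone does not settle which one to optimize for.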

  62. Startups should focus on MTTR improvement
    • At startups, there is no evolution without high-cadence iterations.
    • TTD and TTE are difficult for us to improve.
    • Reducing TTF reduces MTTR.

  63. Do we have other times
    we haven't articulated?

  64. Hidden key times of incident life cycles

  65. Hidden key times of incident life cycles

  66. Hidden key times of incident life cycles

  67. Hidden key times of incident life cycles

  68. Hidden key times of incident life cycles

  69. Hidden key times of incident life cycles
    Don't underestimate the time we spend on post-incident activities.
    • Time to (additional) triage (TTT)
    • Time to learn (TTL)
    • Time to improvement (TTI)
    • Time to preparation (TTP)

  70. Power question:
    Which process do you hate?

  71. Which process do you hate?
    I personally don't want to spend time on the following processes.
    • Additional triage to dig into root causes.
    • Preparation for learning (!= I don't like joining postmortem sessions).
    • Maintenance of incident management processes.

  72. Why additional triage?
    • Startups don't have enough observability mechanisms.
    • We sometimes cannot find root causes (this is acceptable).
    • We tend to spend a lot of time here in that situation.

  73. Why preparation for learning?
    • Preparation for team-wide learning sessions takes time.
      • Documenting for postmortems.
      • Copy-and-paste dances to build a timeline from events scattered across various places.
      • The timeline needs to account for time zones.
    • There is a gravity that prevents people from announcing incidents casually.
      • For startups, the most important activity is learning as a team.
      • If TTL is long, people cannot announce incidents casually.
      • As a result, postmortems ruin short MTTR with high-cadence learning iterations.
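
The copy-and-paste work of assembling a timeline is the kind of step that a small script can take over. The following sketch merges events from several sources into one chronologically sorted timeline and shows each entry in UTC plus the team's local time zones. The event sources, the zone labels, and the function name `build_timeline` are hypothetical, and real tools would pull events from their APIs rather than from a hand-written list.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

Event = tuple[datetime, str, str]  # (timezone-aware time, source, message)


def build_timeline(events: list[Event], zones: dict[str, ZoneInfo]) -> list[str]:
    """Sort events by instant and render each one in UTC and every listed zone."""
    lines = []
    for when, source, message in sorted(events, key=lambda event: event[0]):
        utc = when.astimezone(timezone.utc)
        local = " / ".join(
            f"{utc.astimezone(zone):%H:%M} {label}" for label, zone in zones.items()
        )
        lines.append(f"{utc:%Y-%m-%d %H:%M} UTC ({local}) [{source}] {message}")
    return lines


if __name__ == "__main__":
    jst = ZoneInfo("Asia/Tokyo")
    pacific = ZoneInfo("America/Los_Angeles")
    events = [
        (datetime(2022, 5, 15, 10, 5, tzinfo=jst), "Slack", "Customer reports errors"),
        (datetime(2022, 5, 14, 18, 2, tzinfo=pacific), "Datadog", "Error-rate monitor fires"),
    ]
    for line in build_timeline(events, {"JST": jst, "PT": pacific}):
        print(line)
```

Because the events carry their own time zones, the Datadog entry recorded in Pacific time sorts before the Slack entry recorded in Japan time, even though its wall-clock date looks earlier by a day.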

  74. Why maintenance of incident management processes?
    • Maintenance of incident management processes includes:
      • Updating the incident management policy.
      • Improving incident management structures.
      • Updating documents.
      • Training people to align with the updates.
    • By nature, incidents don't occur frequently.
      • It is too hard for everybody to memorize incident response processes.
      • In urgent situations, people don't read documents.

  75. Can we reduce TTI?
    • It depends on the action items coming from postmortems.
    • No team can handle all the action items discussed during postmortems.
    • Common anti-papriorities.

  76. Approach for unbalanced action items
    • Think of engineering members' capacity.
    • Prioritize and classify the work 9 10
    9 Postmortem Action Items: Plan the Work and Work the Plan, USENIX SRECon 2017
    10 Anatomy of an Incident, Chapter 5

  77. Importance / Size / Urgency (ISU) Matrix
    • Assignee's confidence is also valuable to declare.
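
To show how importance, size, and urgency might be combined when ordering postmortem action items, here is a small sketch. The slide presents the matrix as an image, so the 1 to 3 scales, the scoring formula (importance times urgency divided by size), and all names here are assumptions made for illustration rather than the presenter's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionItem:
    """Hypothetical action item rated on assumed 1 (low) to 3 (high) scales."""
    title: str
    importance: int
    size: int        # 1 = small effort, 3 = large effort
    urgency: int
    confidence: str  # the assignee's declared confidence, kept for context

    def score(self) -> float:
        """Favor important, urgent, and cheap items (illustrative formula)."""
        return self.importance * self.urgency / self.size


def prioritize(items: list[ActionItem]) -> list[ActionItem]:
    """Order items from highest to lowest score."""
    return sorted(items, key=lambda item: item.score(), reverse=True)


if __name__ == "__main__":
    backlog = [
        ActionItem("Rewrite job scheduler", importance=3, size=3, urgency=1, confidence="low"),
        ActionItem("Add alert on queue depth", importance=3, size=1, urgency=3, confidence="high"),
        ActionItem("Document rollback steps", importance=2, size=1, urgency=2, confidence="high"),
    ]
    for item in prioritize(backlog):
        print(f"{item.score():.1f} {item.title} (confidence: {item.confidence})")
```

The confidence field is not part of the score here. It is carried along so that the team can see it when deciding how many items fit into the available capacity.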

  78. ISU Matrix on GitHub Projects

  79. My focus is reduction of TTT, TTL, and TTP

  80. Chapter 5:
    Choosing the right strategies and tools

  81. Phases and the number of software engineers
    Note: the numbers are estimated by the presenter based on previous experiences.
    • Phase 0: Founding ~ 4 software engineers
    • Phase 1: 5 ~ 10 software engineers
    • Phase 2: 11 ~ software engineers

  82. Let's reframe the original question again
    • Reframe "Does a startup need incident management?"
    • At startups, how can we:
      • Build an incident management structure enforcing the 3T mental models?
      • Improve the "times" of the incident management life cycle?

  83. Evolution of incident management at Launchable
    Improvement target Actions from phase 0 to 1 Actions from phase 1 to 2
    Transparency - Encourage push communication - Encourage pull communication
    - Create war rooms
    - Share status pages
    Tangibility - Automate parts of incident response flow - Automate entire incident response flow
    - Introduce incident lead role
    Trust - Introduce blameless culture - Split lead and operation roles for complex incidents
    Time to Engagement (TTE) - Automate incident announcements - Automate entire incident response flow
    - Introduce on-call rotations
    - Expand follow-the-sun coverage
    Time to Fix (TTF) - Introduce observability - Improve observability
    Time to Triage (TTT) - Introduce observability - Improve observability
    Time to Learn (TTL) - Introduce postmortem template - Generate postmortem
    Time to Preparation (TTP) - Create incident management policies - Enforce incident management policies
    - Self-service incident response training

  84. Phase 0: Founding ~ 4 software engineers

  85. No strategy
    • The product does not have customers.
    • We don't need incident response.
    • Build the incident management structure as the product grows.
    • All members do everything if necessary.

  86. Incident management system

  87. Phase 1: 5 ~ 10 software engineers

  88. Environmental changes from phase 0 to 1
    • When products have customers, we need incident management.
    • The more software engineers join, the more incidents happen.

  89. Strategy
    Make everything simple and easy to follow

  90. Incident management changes from phase 0 to 1
    Improvement target Actions from phase 0 to 1 Actions from phase 1 to 2
    Transparency - Encourage push communication - Encourage pull communication
    - Create war rooms
    - Share status pages
    Tangibility - Automate parts of incident response flow - Automate entire incident response flow
    - Introduce incident lead role
    Trust - Introduce blameless culture - Split lead and operation roles for complex incidents
    Time to Engagement (TTE) - Automate incident announcements - Automate entire incident response flow
    - Introduce on-call rotations
    - Expand follow-the-sun coverage
    Time to Fix (TTF) - Introduce observability - Improve observability
    Time to Triage (TTT) - Introduce observability - Improve observability
    Time to Learn (TTL) - Introduce postmortem template - Generate postmortem
    Time to Preparation (TTP) - Create incident management policies - Enforce incident management policies
    - Self-service incident response training

  91. Incident management system (phase 0 to 1)

  92. Incident management system (phase 0 to 1)

  93. Foundation of incident management policies
    • We maintain policies on Confluence.

  94. Foundation of incident management policies
    • We maintain policies on Confluence.

  95. Automation of incident escalations
    • We escalate incidents using Slack Workflow.
    • We handle incidents in a Slack channel and Google Meet.

  96. Automation of incident escalations
    • We escalate incidents using Slack Workflow.
    • We handle incidents in a Slack channel and Google Meet.

  97. Automation of incident escalations
    • We escalate incidents using Slack Workflow.
    • We handle incidents in a Slack channel and Google Meet.
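
The talk uses Slack Workflow, a no-code feature, for escalation. For readers who prefer code, a comparable announcement can be sketched with a Slack incoming webhook, which accepts a JSON body with a `text` field. This swaps in a different mechanism from the one in the talk. The message layout, the field names, and the functions `build_announcement` and `post_to_slack` are assumptions, and the webhook URL has to come from your own Slack workspace.

```python
import json
import urllib.request


def build_announcement(title: str, severity: str, reporter: str, meet_url: str) -> dict:
    """Build the JSON payload for an incident announcement (illustrative layout)."""
    text = (
        f":rotating_light: *Incident declared: {title}*\n"
        f"Severity: {severity}\n"
        f"Reported by: {reporter}\n"
        f"Call: {meet_url}"
    )
    return {"text": text}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to an incoming webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    payload = build_announcement(
        title="API 5xx spike",
        severity="high",
        reporter="alice",
        meet_url="https://meet.example.com/abc",  # placeholder URL
    )
    print(json.dumps(payload, indent=2))
```

Keeping payload construction separate from sending makes the announcement format easy to review and test without touching Slack.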

  98. Introduction of postmortems
    • We keep all postmortems on Confluence.
    • We create a new postmortem page using a Confluence template feature.

  99. Introduction of postmortems
    • We keep all postmortems on Confluence.
    • We create a new postmortem page using a Confluence template feature.
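
The idea of starting every postmortem from a template can be illustrated outside Confluence with a plain-text template. This is only a stand-in for the Confluence template feature the talk uses. The section names, the fields, and the function `render_postmortem` are hypothetical and do not reproduce Launchable's actual template.

```python
from string import Template

# Hypothetical section layout; a real template would follow the team's policy.
POSTMORTEM_TEMPLATE = Template("""\
Postmortem: $title
Date: $date
Incident lead: $lead

Summary
$summary

Timeline
$timeline

Action items
$action_items
""")


def render_postmortem(
    title: str,
    date: str,
    lead: str,
    summary: str,
    timeline: list[str],
    action_items: list[str],
) -> str:
    """Fill the template so that authors start from a consistent skeleton."""
    return POSTMORTEM_TEMPLATE.substitute(
        title=title,
        date=date,
        lead=lead,
        summary=summary,
        timeline="\n".join(f"- {entry}" for entry in timeline),
        action_items="\n".join(f"- [ ] {item}" for item in action_items) or "- (none yet)",
    )


if __name__ == "__main__":
    print(render_postmortem(
        title="API 5xx spike",
        date="2022-05-15",
        lead="alice",
        summary="Elevated error rate after a deployment.",
        timeline=["01:02 UTC monitor fired", "01:20 UTC deployment rolled back"],
        action_items=["Add index to the slow query's table"],
    ))
```

A fixed skeleton shortens the preparation step, since authors only fill in facts instead of deciding on structure each time.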

  100. Very simple and easy to follow

  101. Postmortems as strong fact-based data
    • Even if we cannot resolve the root causes, we can use the postmortems as data.

  102. Can we improve the incident management?

  103. e.g. Is this something humans should take care of?

  104. e.g. Is this something humans should take care of?
    • We don't have a solid policy, but policy does not scale.
    • Employees live in Japan and the US.
    • Information shared only on Slack is easy to miss.

  105. e.g. Do we need roles?

  106. Phase 2: 11 ~ ?? software engineers

  107. Environmental changes from Phase 1 to 2
    • Our products have more customers.
    • The more software engineers join, the more incidents happen.
    • More employees and wider time-zone gaps make sync and push-style communications tough.
    • In the first place, Launchable encourages async and written communications.

  108. Strategy
    • Enforce incident management policies by software, not by documents.
    • Involve appropriate people based on pull-style communications.
    • Use the company's current tool chains.
      • Too many new tools degrade teams' performance.
      • Use Slack as the interactive communication place to keep flow info.
      • Use Confluence to keep stock info (non-urgent communications).
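
What "enforce policies by software, not by documents" can look like in practice is sketched below as a tiny state machine. It refuses status changes that would break two example rules: an incident needs a lead before it can progress, and it needs a postmortem link before it can be closed. The statuses, the rules, and every name here are hypothetical examples, not the policy described in the talk or the behavior of any specific tool.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical life cycle: which status may follow which.
ALLOWED = {
    "open": {"mitigated", "resolved"},
    "mitigated": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),
}


class PolicyViolation(Exception):
    """Raised when a requested change would break the incident policy."""


@dataclass
class Incident:
    title: str
    lead: Optional[str] = None
    postmortem_url: Optional[str] = None
    status: str = "open"
    log: list[str] = field(default_factory=list)

    def transition(self, new_status: str) -> None:
        """Move to a new status only if the example policy rules are satisfied."""
        if new_status not in ALLOWED[self.status]:
            raise PolicyViolation(f"cannot go from {self.status} to {new_status}")
        if self.lead is None:
            raise PolicyViolation("assign an incident lead first")
        if new_status == "closed" and not self.postmortem_url:
            raise PolicyViolation("link a postmortem before closing")
        self.log.append(f"{self.status} -> {new_status}")
        self.status = new_status


if __name__ == "__main__":
    incident = Incident("API 5xx spike")
    try:
        incident.transition("mitigated")
    except PolicyViolation as error:
        print("blocked:", error)
    incident.lead = "alice"
    incident.transition("mitigated")
    print(incident.status, incident.log)
```

Because the rules live in code, nobody has to remember them during an urgent situation: the tool rejects the step and says what is missing.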

  109. Incident management system (phase 1 to 2)

  110. Incident management system (phase 1 to 2)

  111. Incident management system (phase 1 to 2)

  112. SaaS: incident.io
    • https://incident.io/

  113. SaaS: Blameless
    • https://www.blameless.com/product/incident-resolution

  114. SaaS: Datadog Incident
    • https://www.datadoghq.com/blog/incident-response-with-datadog/

  115. SaaS: Grafana Incident
    • https://go2.grafana.com/incident-beta-interest.html

  116. OSS: monzo/response
    OSS version of incident.io
    • https://github.com/monzo/response
    • https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents/

  117. in-house tool: Slack App + Web App
    • It's not so difficult to implement a Slack App and a Web App for this purpose.
    • But... I want to use my time for other stuff.

  118. SaaS vs OSS vs in-house tool
    • We want to maximize developers' disposable time for product development.
    • We don't want to increase cognitive load.
    • OSS and in-house tools need code and documentation maintenance.
    • OSS and in-house tools need evangelism for this type of tool.
    • Use SaaS if money allows (Buy, Not Build)
    • Salaries for software engineers are far more expensive than SaaS costs.
    • SaaS vendors improve their features as their core business.
    • SaaS vendors maintain documentation as part of the product.

  119. incident.io covers a wide range of improvement targets
    Improvement target | Actions from phase 0 to 1 | Actions from phase 1 to 2
    Transparency | Encourage push communication | Encourage pull communication; create war rooms; share status pages
    Tangibility | Automate parts of incident response flow | Automate entire incident response flow; introduce incident lead role
    Trust | Introduce blameless culture | Split lead and operation roles for complex incidents
    Time to Engagement (TTE) | Automate incident announcements | Automate entire incident response flow; introduce on-call rotations; expand follow-the-sun coverage
    Time to Fix (TTF) | Introduce observability | Improve observability
    Time to Triage (TTT) | Introduce observability | Improve observability
    Time to Learn (TTL) | Introduce postmortem template | Generate postmortem
    Time to Preparation (TTP) | Create incident management policies | Enforce incident management policies; self-service incident response training

  120. Central channel for all incidents
    incident.io can share all incidents in the specified Slack channel.

  121. Dedicated war rooms (Slack channel)
    incident.io handles all the tasks we want completed when initializing an incident response.

  122. Dedicated war rooms (Slack channel)
    incident.io can assist incident responses.

  123. Dedicated Slack channel (closing incident)
    At the end of an incident response, incident.io tells us what needs to be done next.

  124. Status updates at central channel
    incident.io automatically syncs the latest status of incidents to the central channel.

  125. Postmortem generation
    incident.io can collect timelines from war rooms and generate postmortems.
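
To make the idea concrete, here is a minimal sketch of turning war-room messages into a postmortem draft. It does not reflect how incident.io implements this; `TimelineEvent`, `render_postmortem`, and the draft layout are assumptions invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEvent:
    """One war-room message worth keeping; a hypothetical shape for this sketch."""
    at: datetime
    author: str
    message: str


def render_postmortem(title: str, events: list[TimelineEvent]) -> str:
    """Render a plain-text postmortem draft with a chronological timeline."""
    lines = [f"Postmortem: {title}", "", "Timeline:"]
    # Sort by timestamp so the draft reads in the order things happened.
    for event in sorted(events, key=lambda e: e.at):
        lines.append(f"- {event.at:%Y-%m-%d %H:%M} {event.author}: {event.message}")
    # Sections a human still has to fill in after the incident.
    lines += ["", "Root cause: TBD", "Action items: TBD"]
    return "\n".join(lines)
```

The point of the sketch is that the timeline comes for free from the chat history, so responders only need to write the analysis.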

  126. Postmortem generation
    We can generate postmortem documents using incident.io.

  127. Postmortem generation
    We can collect timelines from dedicated Slack channels.

  128. Postmortem generation

  129. Self-training mode
    incident.io has a mode to walk through dummy incident responses on Slack.

  130. Introduction of lead role
    • We need communication leads when incidents are complex.
    • However, for most incidents, a single person can be responsible for both operations
    and communications.
    • So, adding only a lead role is prudent: it keeps incident management from becoming
    overly complex.

  131. We have more room to improve!

  132. Recap
    • Incident management has a life cycle.
      • Preparation -> Detection -> Recovery -> Post-incident actions -> Preparation
    • Incident response roles and structures exist to embody 3T.
      • Transparency
      • Tangibility
      • Trust
    • Choosing strategies and tools makes incident management at startups sensible.
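
The life cycle in the recap is a loop, which a small sketch can make explicit. The `Phase` enum and `next_phase` helper below are hypothetical names for illustration only; the phase names are the ones listed on this slide.

```python
from enum import Enum


class Phase(Enum):
    """Incident management life cycle phases, in order."""
    PREPARATION = "Preparation"
    DETECTION = "Detection"
    RECOVERY = "Recovery"
    POST_INCIDENT_ACTIONS = "Post-incident actions"


def next_phase(phase: Phase) -> Phase:
    """Return the phase that follows `phase`, wrapping back to Preparation."""
    order = list(Phase)  # Enum members iterate in definition order.
    return order[(order.index(phase) + 1) % len(order)]
```

Calling `next_phase(Phase.POST_INCIDENT_ACTIONS)` returns `Phase.PREPARATION`, reflecting that what is learned after one incident feeds preparation for the next.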

  133. Thanks

  134. References

  135. Incident management
    1. Atlassian: Understanding incident response roles and responsibilities, https://www.atlassian.com/incident-management/incident-response/roles-responsibilities
    2. PagerDuty Incident Response Training, https://response.pagerduty.com/training/overview/
    3. Anatomy of an Incident, Ayelet Sachto, Adrienne Walcer, and Jessie Yang, 2022.
    4. US Federal Emergency Management Agency, Emergency Management Institute ICS Resource Center, https://training.fema.gov/emiweb/is/icsresource/
    5. The National Institute of Standards and Technology SP 800-61, Computer Security Incident Handling Guide, http://dx.doi.org/10.6028/NIST.SP.800-61r2
    6. Introduction: Incident Response overview, Gov UK National Cyber Security Centre, https://www.ncsc.gov.uk/collection/incident-management/incident-response
    7. Incident Review and Postmortem Best Practices, https://newsletter.pragmaticengineer.com/p/incident-review-best-practices
    8. Incident Review Practices [The Pragmatic Engineer Newsletter], https://docs.google.com/spreadsheets/d/1GPINipdf-l2H05iKOUbpkrqwlZ61ZCJDnwY5iE8LtRM/edit#gid=0

  136. SRE
    1. Google SRE book Chapter 14 - Managing Incidents, https://sre.google/sre-book/managing-incidents/
    2. Postmortem Action Items: Plan the Work and Work the Plan, Sue Lueder and Betsy Beyer (Google), USENIX SRECon 2017, https://www.usenix.org/conference/srecon17americas/program/presentation/lueder
    3. Google SRE book Chapter 15 - Postmortem Culture: Learning from Failure, https://sre.google/sre-book/postmortem-culture/
    4. Postmortem Metadata Index, https://postmortems.app/
    5. The Art of SLOs, Google Site Reliability Engineering, https://sre.google/resources/practices-and-processes/art-of-slos/
    6. danluu/post-mortems: A collection of postmortems, https://github.com/danluu/post-mortems
    7. Great Incident Review Examples, The Pragmatic Engineer, https://blog.pragmaticengineer.com/postmortem-best-practices/#great-incident-review-examples

  137. DevOps performance metrics
    1. Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations, 2018.
    2. GoogleCloudPlatform/fourkeys, https://github.com/GoogleCloudPlatform/fourkeys
    3. Are you an Elite DevOps performer? Find out with the Four Keys Project, Google Cloud, https://cloud.google.com/blog/products/devops-sre/using-the-four-keys-to-measure-your-devops-performance
    4. DORA DevOps Quick Check, https://www.devops-research.com/quickcheck.html

  138. SaaS and OSS
    1. Datadog, https://www.datadoghq.com/blog/incident-response-with-datadog/
    2. incident.io, https://incident.io/
    3. jeli, https://www.jeli.io/
    4. monzo/response, https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents
    5. Etsy/morgue, https://github.com/etsy/morgue