Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Observability — Extending Into Incident Response
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Narimichi Takamura
October 27, 2025
Technology
2
1k
Observability — Extending Into Incident Response
Observability Conference Tokyo 2025の登壇資料です。
https://o11ycon.jp/
Narimichi Takamura
October 27, 2025
Tweet
Share
More Decks by Narimichi Takamura
See All by Narimichi Takamura
インシデントキーメトリクスによるインシデント対応の改善 / Improving Incident Response using Incident Key Metrics
nari_ex
1
12k
組織的なインシデント対応を目指して〜成熟度評価と改善のステップ〜 / Towards an Organized Incident Response - Maturity Assessment and Improvement Steps -
nari_ex
7
9.4k
Waroomの開発モチベーションと今後のロードマップ / Waroom development motivation and roadmap
nari_ex
1
1.7k
Engineering with Business Impact
nari_ex
2
330
How We Foster Reliability in Diversity
nari_ex
14
13k
SRE Practices in Organizations
nari_ex
16
11k
Hardening におけるトラブルシューティング / Troubleshooting in Hardening
nari_ex
1
380
私が Engineering Manager になるまでに経験してきたこと、大切にしてきたこと / Lecture materials for Introduction to Venture Business at UEC
nari_ex
0
260
運用技術者組織の設計と運用 / Design and operation of operational engineer organization
nari_ex
11
10k
Other Decks in Technology
See All in Technology
モダンUIでフルサーバーレスなAIエージェントをAmplifyとCDKでサクッとデプロイしよう
minorun365
2
120
分析画面のクリック操作をそのままコード化 ! エンジニアとビジネスユーザーが共存するAI-ReadyなBI基盤
ikumi
1
230
ClickHouseはどのように大規模データを活用したAIエージェントを全社展開しているのか
mikimatsumoto
0
190
学生・新卒・ジュニアから目指すSRE
hiroyaonoe
2
530
名刺メーカーDevグループ 紹介資料
sansan33
PRO
0
1k
Context Engineeringの取り組み
nutslove
0
270
顧客との商談議事録をみんなで読んで顧客解像度を上げよう
shibayu36
0
140
usermode linux without MMU - fosdem2026 kernel devroom
thehajime
0
210
Contract One Engineering Unit 紹介資料
sansan33
PRO
0
13k
Data Hubグループ 紹介資料
sansan33
PRO
0
2.7k
Bill One急成長の舞台裏 開発組織が直面した失敗と教訓
sansantech
PRO
1
220
2026年、サーバーレスの現在地 -「制約と戦う技術」から「当たり前の実行基盤」へ- /serverless2026
slsops
2
200
Featured
See All Featured
What’s in a name? Adding method to the madness
productmarketing
PRO
24
3.9k
Information Architects: The Missing Link in Design Systems
soysaucechin
0
770
The untapped power of vector embeddings
frankvandijk
1
1.6k
Let's Do A Bunch of Simple Stuff to Make Websites Faster
chriscoyier
508
140k
BBQ
matthewcrist
89
10k
Java REST API Framework Comparison - PWX 2021
mraible
34
9.1k
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
21
1.4k
Fashionably flexible responsive web design (full day workshop)
malarkey
408
66k
Why Your Marketing Sucks and What You Can Do About It - Sophie Logan
marketingsoph
0
71
SEO in 2025: How to Prepare for the Future of Search
ipullrank
3
3.3k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
141
34k
Navigating Weather and Climate Data
rabernat
0
97
Transcript
None
2
גࣜձࣾTopotalʢͱΆͨΔʣ • h#ps:/ /topotal.com • SREΛओ࣠ʹͨ͠ελʔτΞοϓ • 2ࣄۀΛӡӦ • SRE
as a Service • SaaS for SREʢWaroomʣ • ຊΠϕϯτͷεϙϯαʔ • ϒʔεͰSaaSͷσϞΛͬͯΔͷ Ͱɺͥͻཱ͓ͪدΓ͍ͩ͘͞ʂ 3
SRE as a Service • h#ps:/ /sre-as-a-service.com • SREʹಛԽٕͨ͠ज़ࢧԉαʔϏε •
ࢧԉͷྫ • SLI/SLOͷಋೖɾӡ༻վળ • ΦϒβʔόϏϦςΟͷઃܭɾ࣮ • ΠϯγσϯτϚωδϝϯτͷվળ 4
WaroomʢΘΔʔΉʣ • h#ps:/ /waroom.com • ৫తʹΠϯγσϯτରԠΛߦ͏ͨΊ ͷSaaS • Slack ϕʔεͷରԠʹ߹ΘͤͯࣗಈԽɾ
লྗԽ͕Ͱ͖Δ 5
6
7
8
ηογϣϯ֓ཁ • ΦϒβʔόϏϦςΟʢo11yʣʹΑΔվળޮՌͷྫͱͯ͠ɺΠϯγσϯ τϨεϙϯεʢIRʣͷվળ͕ڍ͛ΒΕΔ • ମײͰվળͯͦ͠͏͕ͩɺͦͷޮՌΛఆྔతʹࣔ͢͜ͱΉ͔͍ͣ͠ => IR SaaSͷ࡞Γख /
SRE ͱͯ͠ɺIRΛఆྔతʹվળ͢ΔͨΊͷϓϥΫς Οεʢ࣮ફతͳTTXϝτϦΫεʣ ʹ͍͓ͭͯ͠·͢ɻ => ऴ൫ͰʢιϑτΣΞͰͳ͘ʣIRϓϩηεͷՄ؍ଌੑΛߴΊΔ ͱ ͍͏ςʔϚʹ౿ΈࠐΜͰ͓͠·͢ɻ 9
ຊެԋͷλʔήοτ • o11y ͷվળޮՌΛఆྔతʹࣔ͢ϓϥΫςΟεʹڵຯ͕͋Δํ • IR ͷՄࢹԽʹڵຯ͕͋Δํ • ʮo11y Λ
IR ͷྖҬ֦ு͢Δ͜ͱʯʹڵຯ͕͋Δํ 10
ΞδΣϯμ 1. Ϟνϕʔγϣϯ 2. MTTRͷ 3. ࣮ફతͳ TTX ϝτϦΫεͷఆٛ 4.
TTX ϝτϦΫεͷ׆༻ 5. o11y ΛΠϯγσϯτϨεϙϯεͷྖҬద༻͢Δ 11
1. Ϟνϕʔγϣϯ 12
͍: ͦͷԾઆຊͳͷ͔ 1. γεςϜͷՄ؍ଌੑΛվળ͢Δ 2. ෳࡶͳγεςϜͷ෦ঢ়ଶΛਪଌɾѲͰ͖ΔΑ͏ʹͳΔ 3. ൃੜ࣌ʹݪҼಛఆ͕ਝʹͳΓ෮چ͕࣌ؒ͘ͳΔ ← ί
Ϩ 13
Γ͍ͨ͜ͱ2͚ͭͩ • Where: Ͳ͜ ͕վળͨ͠ͷ͔ • How much: Ͳͷఔ վળͨ͠ͷ͔
14
ΦϒβʔόϏϦςΟʹΑͬͯߦΘΕͨ ΠϯγσϯτରԠͷվળޮՌΛ ఆྔతʹදݱ͍ͨ͠ 15
෮چ࣌ؒͷॖʹޮՌ͕͋Δͣ → MTTR Λଌఆ͢Ε͍͍ͷͰʁ 16
2. MTTRͷ 17
MTTRʢฏۉ෮چ࣌ؒʣ ͱ • ো͕ൃੜ͔ͯ͠Βम෮·ͨ෮چ͢ Δ·Ͱͷฏۉ࣌ؒͷ͜ͱ • Mean Time To Recovery(Repair,
Resolve, Restore)ͷུ • ࢉग़ํ๏1 • MTTR = ૯मཧ࣌ؒ / ނোճ 1 MTTRʢฏۉ෮چ࣌ؒʣͱʁܭࢉํ๏ͱMTBFͱͷނোɾՔಇʹ ͓͚Δؔ 18
19
SREs should move away from defaul/ng to the assump/on that
MTTX can be useful. 20
MTTRͷ༗ޮੑͷݕূ • Ծઆ • MTTR͕༗ޮͳࢦඪͳΒɺTTRΛॖ͢ΕMTTRॖ͞Ε Δͣ 21
MTTRͷ༗ޮੑͷݕূ 1. Πϯγσϯτͷσʔληοτ2ΛϥϯμϜʹ2ׂ͢Δ 2. ยํͷσʔληοτͷम෮࣌ؒ(TTR)Λ10%ݮΒ͢ 3. ֤σʔληοτͷMTTR(ฏۉम෮࣌ؒ)Λܭࢉ͢Δ 4. σʔληοτؒͷMTTRͷࠩΛऔΔ •
diff = MTTR(unmodified)- MTTR(modified) 5. MTTRͷॖׂ߹(%)Λࢉग़͢Δ • = diff/MTTR(unmodified) 6. 1ʙ4Λ10ສճ܁Γฦ͢ 2 Unveiling the black box with observability stack 22
23
݁Ռ: MTTR͕10%Ҏ্վળ͢Δͷ50ʙ60% 24
֤ΠϯγσϯτΛվળͯ͠MTTR͕վળ͠ͳ͍ཧ༝ • MTTRͷΈʹऑ͍ • ҰํͰɺΠϯγσϯτσʔλ"Β͖ͭ"͕ܹ͍͠ 25
Πϯγσϯτσʔλͷಛ3 • େ͔ͳΓૣ͘ऩଋ͢Δ • Ұ෦൵ࢂͳΠϯγσϯτʹͳΔ • → ແ࡞ҝʹσʔληοτΛׂ͢Δ ͱɺ൵ࢂͳΠϯγσϯτͷภΓ͕MTTR ͷࢉग़ʹେ͖ͳӨڹΛٴ΅͢
• ex. ෮چʹ5000ஹ͔͔࣌ؒΔΠϯγσ ϯτͷৼΓ͚ઌ͕ͲͪΒʹͳΔ͔Ͱ MTTRͷվળ۩߹େ෯ʹมΘΔ 3 The VOID Report 26
ࢀߟ: म෮࣌ؒΛมߋͤͣʹγϛϡϨʔγϣϯͨ݁͠Ռ → վળ׆ಈͷ༗ແʹ͔͔ΘΒͣɺMTTRσʔληοτ࣍ୈͰվળ or ѱԽ͢Δ 27
Incident Metrics in SRE ͷओு • γϛϡϨʔγϣϯ͔ΒΘ͔ͬͨ͜ͱ • ΠϯγσϯτނোظؒͷΒ͖͕ͭେ͖͍ͨΊɺվળ͕ MTTR
ʹө͞ΕͮΒ͍ • ex. ʮࡢൺMTTR10%վળʂʯظԽͨ͠Πϯγσϯτ͕গͳ͔͚͔ͬͨͩ • ※ ຖ·ͬͨ͘ಉ͡ྔɾ෮چ࣌ؒͷΠϯγσϯτ͕ى͖ΔͳΒՁ͕͋Δ(ϜϦ) • ݁ • MTTR վળͷධՁࢦඪͱͯ͠ʹཱͨͳ͍ • MTTRͷΈʹऑ͘ɺΠϯγσϯτσʔλΒ͖͕ܹ͍͔ͭ͠Β 28
ͳʹ͕ͩͬͨͷʁ ֤ཁૉͳ͍ • Πϯγσϯτظؒͷมಈੑ͕ߴ͍͜ͱ • MTTRΛͳΜΒ͔ͷࢦඪʹ͢Δ͜ͱ • ࢦඪΛͱʹվળͷՌΛ֬ೝ͢Δ͜ͱ → తͱࢦඪ͕טΈ߹͍ͬͯͳ͍͜ͱ͕
29
σʔλੳʢԾઆݕূܕʣͷྲྀΕ 30
MTTRΛࢦඪʹ͢Δͱ͖ͷࢥߟͷྲྀΕ 31
ى͖͍ͯͨ͜ͱ: ԾઆݕূϩδοΫͷෆ߹ 32
ղܾࡦ: վળՕॴΛ໌Β͔ʹ͠ɺมಈੑΛ͑Δ 33
ղܾࡦ: վળՕॴΛ໌Β͔ʹ͠ɺมಈੑΛ͑Δ 34
͜͜·Ͱͷ·ͱΊ • MTTR(෮چ࣌ؒ)σʔλมಈੑ͕ߴ͍ͨΊվળࢦඪʹෆద • վળՕॴΛ໌֬Խ͠ɺΑΓࡉ͔͍ TTX ϝτϦΫεΛར༻͢Δ͜ ͱͰɺมಈੑΛ͑Δ͜ͱ͕Մೳ → TTRΑΓࡉ͔͍ϝτϦΫεͷधཁ͕ग़ͯ͘Δ
35
3. ࣮ફతͳ TTX ϝτϦΫε 36
Waroom͕ߟ͑Δ࣮ફతͳϝτϦΫεͱ • ཏతͰ͋Δ • ཻ͕ࡉ͔͍ • ऩू͕ݱ࣮తͰ͋Δ 37
ͲΜͳTTXϝτϦΫεΛ ऩू͢ΔͱΑ͍ͩΖ͏͔ 38
39
TTXϝτϦΫεͷ՝ײ • ੈͷதʹࣄྫ͍͔ͭ͋͘Δ͕ɺఆٛ౷Ұ͞Ε͍ͯͳ͍ • ࣄྫಉ࢜ΛΈ߹ΘͤΑ͏ͱͯ͠ɺॏෳෆ͕ੜ͡Δ • → ஶ໊ͳจݙΛϕʔεʹɺࡉ͔͘ɺཏతͳఆٛΛࢦ͢ 40
TTXϝτϦΫεఆٛͷྲྀΕ 1. ϕετϓϥΫςΟεΛֶͿ 2. ΠϯγσϯτεςʔλεΛఆٛ͢Δ 3. ΠϯγσϯτϚΠϧετʔϯ(εςʔλεͷڥ)Λఆٛ͢Δ 4. TTXϝτϦΫεΛఆٛ͢Δ 41
ϕετϓϥΫςΟεΛֶͿ 42
ΠϯγσϯτεςʔλεΛͬ͘͟Γఆٛ͢Δ 43
44
45
ϚΠϧετʔϯΛͱʹ TTXʹམͱ͠ࠐΉ 46
47
ϝτϦΫεऩू͍ͨΜ • ࡉ͔ͳϝτϦΫεΛఆٛ͢ΔͱɺϚΠϧετʔϯΛ͑Δ͝ͱ ʹλΠϜελϯϓΛه͢Δඞཁ͕͋Δ • ରԠதʹ͍͍ͪͪਓ͕ؒଧࠁ͢Δͷඇݱ࣮త • → Waroom ͰSlack
BotͰࣗಈऩू͍ͯ͠·͢ 48
ରԠதͷΠϕϯτΛτϦΨʔʹࣗಈऩू͢Δྫ ϚΠϧετʔϯ ରԠதͷΠϕϯτ Detectedʢݕʣ Ξϥʔτൃੜ௨ Acknowledgedʢೝʣ νϟϯωϧ࡞ɺΠϯγσϯτىථ Iden.fiedʢղܾࡦͷಛఆʣ RunbookͷϑΣʔζ͚ʢPrecheck ͱResolu.onʣ
Recoveredʢ෮چʣ SlackͷΓͱΓ͔ΒAI͕அ͢Δ 49
4. TTXϝτϦΫεͷ׆༻ 50
ϝτϦΫεΛޮՌతʹ͏ͨΊʹ ੳͷతͱϝτϦΫεͷಛΛ߹ͤ͞Δ 51
52
ϝτϦΫεͱվળࢪࡦͷྫ TTX ՝ վળࢪࡦ TTDetectʢݕʣ ൃੜ͔ͯ͠Βݕ·Ͱʹ࣌ ͕͔͔ؒΔ ϞχλϦϯάͷվળ TTEngageʢνʔϜߏʣ ରԠνʔϜΛߏஙʹ͕࣌ؒ
͔͔Δ γϑτׂͷ໌֬ԽɺΦ ϯίʔϧ੍ͷಋೖ TTInves-gateʢௐࠪʣ োΓ͚ʹ͕͔͔࣌ؒ Δ RunbookͷμογϡϘʔυͷ උ TTFixʢम෮ʣ োͷम෮ʹ͕͔͔࣌ؒΔ ϩʔϧόοΫͷߴԽ 53
54
യવͱͨ͠ԾઆΛͱʹɺ͔Β՝Λݟ͚ͭΔ Ծઆ ৽ͨʹൃݟͨ͠՝ͷྫ ڞ௨ͷڥͳͷͰɺ৫ͷ֤ TTXͷҰఆͷͣ αʔϏενʔϜʹΑͬͯύϑ ΥʔϚϯε͕ҟͳΔ ֤TTXఆʹ͍ۙͣ ʢex. TTAͳΒ10Ҏ͘Β
͍ʣ ʢ࣮ʣணख͕શମతʹ͍ɺ ղܾࡦͷಛఆ͕શମతʹ͍ 55
56
57
5. o11y ΛΠϯγσϯτϨεϙϯεʹద༻͢ Δ 58
o11yΛIRద༻͢Δ2 • ΠϯγσϯτϨεϙϯεͷ෦ߏͷ Մ؍ଌੑΛ͞ΒʹߴΊΔ • TTXͷఆٛʹΑͬͯɺϝτϦΫεͳ Μͱͳ͘ಋೖࡁΈ • ϝτϦΫεɺϩάɺτϨʔεͷϓϥΫ ςΟεΛ׆༻͢Δ͜ͱͰվળͰ͖ͳ͍
ͩΖ͏͔ 2 Unveiling the black box with observability stack 59
Metrics 60
ബͬ͢ΒͱΔ"ยखམͪ"ײ • հͨ͠TTXϝτϦΫεɺ͍ͣΕTTRΛղ͚ͨͩ͠ • ͭ·ΓɺγεςϜ෮چ࣌ؒͷॖ ʹ͚ͩয͕͍ͨͬͯΔ • SREࢹͰ αʔϏεͷ৴པੑ ͷ؍͕ॏཁ
• ex. ֶͼ͋Δ͔ɺ࠶ൃࢭ͞ΕΔ͔ • ϓϩμΫτӡӦࢹͰ ސ٬ͷ৴པੑ ͷ؍͕ॏཁ • ex. ސ٬ରԠेʹߦΘΕ͍ͯΔ͔ => Մ؍ଌੑΛߴΊΔʹɺΑΓଟ֯తͳରԠϓϩηεͷϝτϦΫε͕ඞཁ 61
γεςϜ෮چରԠͱฒߦ͍ͯͬͯ͠Δ͜ͱ • ސ٬ͷઆ໌ɾࣄͷڞ༗ • Πϯγσϯτͷใࠂɾੳ • ࠜຊରࡦͷݕ౼ɾ࣮ࢪ => ݱঢ়ͩͱɺ্هͷ׆ಈͷ؍ଌείʔϓ֎ʹͳ͍ͬͯΔ 62
TTXϝτϦΫεͷԠ༻: ؍ଌൣғͷ֦େ ؍ଌൣғΛΠϯγσϯτରԠશମʹ֦ு͠ɺվળࢦඪͱͳΔϝτϦΫεΛఆٛ͢Δ ϝτϦΫε໊ త Incident Response Metrics ७ਮͳ෮چରԠͷ՝ಛఆɾվળࢦඪ Customer
Reliability Metrics ސ٬ରԠͷ՝ಛఆɾվળࢦඪ Learning Metrics ৫ֶ͕ͼΛಘΔ·Ͱͷ׆ಈͷτϥοΩϯά Improvement Metrics ࠜຊରࡦͷ࣮ࢪঢ়گͷੳ => ࠓճɺCustomer Reliability Metrrics ͷྫΛհ 63
64
Log 65
ରԠதͷΠϕϯτΛه͢Δ • ऩू • ୭͕ɾ͍ͭɾͲͷίϚϯυɾͲͷ அΛߦ͔ͬͨΛߏԽϩάԽ • ex. νϟοτɺεςʔλεมߋɺ֎෦ πʔϧʹΑΔΠϕϯτ࿈ܞ
• ׆༻ྫ • λΠϜϥΠϯੜεςʔλεϖʔ δΛࣗಈੜ ! 66
WaroomͷཪଆͰ४උ͕ਐߦத... 67
Trace 68
ରԠϓϩηεͷྲྀΕɺґଘؔ Λ؍ଌ͢Δ • ऩू • Πϯγσϯτεςʔλε୯ҐͰεύϯԽ • ݕʙ෮چ·ͰΛ1ຊͷτϨʔεͱͯ͠ཧ • ΞΫγϣϯ͝ͱʹࡉԽͯ͠౷߹
• ׆༻ྫ • εςʔλεҠߦؒͰߦΘΕͨॲཧͱॴཁ࣌ ؒΛՄࢹԽ ! • ରԠͷϘτϧωοΫʹͳͬͨఔΛಛఆ ✨ 69
πʔϧ͕ԣஅ͢ΔதͰΠϕϯτΛͲ͏औಘ͢Δ͔ • ෮چରԠ࣌ʹ֤छπʔϧΛԣஅతʹར༻͢Δ͜ͱ͕ଟ͍ • ex. PagerDuty → Slack → Datadog
→ AWS → GitHub... • ݱঢ়ɺ୯ҰΠϯγσϯτͷͨΊʹߦͬͨ͜ͱΛ͍ͬͯΔͷରԠ ऀͷΈ • ରԠऀ͕खಈͰMELTΛอଘ͢Δͷඇݱ࣮త → AIϕʔεͰରԠΛ͢ΔੈքઢͰɺΑΓଟ͘ͷใ͕औಘՄೳʹʂ 70
AIϕʔεͷΠϯγσϯτϨε ϙϯε • AI͕ࣗવݴޠͰୡ͞Εͨ༰Λͱ ʹɺMCPαʔόʔ֎෦πʔϧͱ࿈ܞ ͠ͳ͕Β͞·͟·ͳૢ࡞Λߦ͏ • → ৗʹWaroomΛܦ༝ͯ͠ΞΫγϣϯ ͕ߦΘΕΔΑ͏ʹͳΓɺࡉ͔ͳΠϕϯ
τΛࣗಈతʹอଘͰ͖Δ 71
·ͱΊ 1. վળࢦඪͱͯ͠MTTRཱͨͳ͍ 2. ϝτϦΫε׆༻ɺతʙσʔλੳʹࢸΔ·Ͱͷ߹ੑ͕ॏཁ 3. มಈੑΛ͑ΔͨΊʹɺ͍ͷ۩ମԽͱϝτϦΫεͷࡉԽ͕ॏ ཁ 4. TTXϝτϦΫεͷఆٛաఔͱ׆༻ํ๏
5. o11yͷϓϥΫςΟεΛ࣋ͪࠐΉ͜ͱͰɺΑΓแׅతͳ؍ଌʹۙͮ͘ 72
͍͞͝ʹ • ϝτϦΫεͷࣗಈऩूͷΈΛ࡞Δ ͷ͍ͨΜ • ͞ΒʹɺՄࢹԽج൫ͷߏங͍ͨΜ • ͞ΒʹɺϝτϦΫεΛΧςΰϦϥϕ ϧͰ෦நग़͢Δͷ͍ͨΜ •
→ ͥͻ Waroom Λ͝׆༻͍ͩ͘͞ • ڵຯ͕༙͍ͨํ Topotal ͷϒʔε ʂ 73
͋Γ͕ͱ͏͍͟͝·ͨ͠