Observability — Extending Into Incident Response
Narimichi Takamura
October 27, 2025
Technology
Slides from my talk at Observability Conference Tokyo 2025.
https://o11ycon.jp/
Transcript
Topotal, Inc.
• https://topotal.com
• A startup with SRE at its core
• Runs two businesses
  • SRE as a Service
  • SaaS for SRE (Waroom)
• Sponsor of this event
• We are demoing the SaaS at our booth, so please stop by!
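
SRE as a Service
• https://sre-as-a-service.com
• A technical support service specializing in SRE
• Examples of support
  • Introducing SLIs/SLOs and improving how they are operated
  • Designing and implementing observability
  • Improving incident management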

Waroom
• https://waroom.com
• A SaaS for responding to incidents as an organization
• Automates and reduces the effort of Slack-based incident response
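
Session overview
• Improved incident response (IR) is often cited as an example of what observability (o11y) buys you
• It feels as if things improve, but showing the effect quantitatively is hard
=> As the builder of an IR SaaS and as an SRE, I will talk about a practice for improving IR quantitatively (practical TTX metrics).
=> Toward the end, I will also go into the theme of raising the observability of the IR process itself (not of the software).

Who this talk is for
• People interested in practices for showing the effect of o11y improvements quantitatively
• People interested in visualizing IR
• People interested in "extending o11y into the domain of IR"

Agenda
1. Motivation
2. The problem with MTTR
3. Defining practical TTX metrics
4. Putting TTX metrics to use
5. Applying o11y to the domain of incident response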
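
1. Motivation

Question: is this hypothesis actually true?
1. Improve the observability of the system
2. It becomes possible to infer and grasp the internal state of a complex system
3. When a problem occurs, the cause is identified faster and recovery time gets shorter ← this one

There are only two things we want to know
• Where: what part improved
• How much: to what degree it improved

We want to express quantitatively how much observability improved incident response

It should help shorten recovery time → so why not just measure MTTR?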
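
2. The problem with MTTR

What is MTTR (mean time to recovery)?
• The average time from when a failure occurs until it is repaired or recovered
• Short for Mean Time To Recovery (also Repair, Resolve, Restore)
• How it is calculated [1]
  • MTTR = total repair time / number of failures
[1] "What is MTTR (mean time to recovery)? How to calculate it, how it differs from MTBF, and how it relates to failure rate and availability"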
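
As a minimal sketch of the formula above, assuming each incident is stored as a (start, recovered) pair of timestamps. The function name and the sample data are made up for illustration and do not come from the talk.

```python
from datetime import datetime, timedelta


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """MTTR = total repair time / number of failures."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents lasting 30, 60 and 90 minutes.
base = datetime(2025, 1, 1, 9, 0)
incidents = [
    (base, base + timedelta(minutes=30)),
    (base, base + timedelta(minutes=60)),
    (base, base + timedelta(minutes=90)),
]
print(mttr(incidents))  # 1:00:00
```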

SREs should move away from defaulting to the assumption that MTTX can be useful.
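
Testing whether MTTR is effective
• Hypothesis
  • If MTTR is an effective metric, shortening TTR should shorten MTTR as well

Testing whether MTTR is effective
1. Randomly split the incident dataset [2] into two
2. Reduce the time to repair (TTR) of one of the two datasets by 10%
3. Calculate the MTTR (mean time to repair) of each dataset
4. Take the difference in MTTR between the datasets
   • diff = MTTR(unmodified) - MTTR(modified)
5. Calculate the MTTR reduction as a percentage
   • = diff / MTTR(unmodified)
6. Repeat steps 1 to 4 100,000 times
[2] Unveiling the black box with observability stack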
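
The procedure above can be sketched roughly as follows. This is an illustration under stated assumptions, not the talk's actual code: the durations are synthetic log-normal values rather than the referenced dataset, the function name is hypothetical, and fewer runs are used to keep it quick.

```python
import random
import statistics


def simulate_mttr_improvement(durations, reduction=0.10, runs=100_000, seed=0):
    """Return the relative MTTR reduction observed in each simulated run."""
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        shuffled = durations[:]
        rng.shuffle(shuffled)                      # step 1: random split
        half = len(shuffled) // 2
        unmodified = shuffled[:half]
        modified = [t * (1 - reduction) for t in shuffled[half:]]  # step 2
        mttr_unmodified = statistics.fmean(unmodified)             # step 3
        mttr_modified = statistics.fmean(modified)
        diff = mttr_unmodified - mttr_modified                     # step 4
        results.append(diff / mttr_unmodified)                     # step 5
    return results


# Synthetic heavy-tailed durations in minutes (an assumption for illustration).
data_rng = random.Random(42)
durations = [data_rng.lognormvariate(3.5, 1.2) for _ in range(200)]

improvements = simulate_mttr_improvement(durations, runs=2_000)
share = sum(r >= 0.10 for r in improvements) / len(improvements)
print(f"runs showing >=10% MTTR improvement: {share:.0%}")
```

Calling the same function with `reduction=0.0` gives a control run in which nothing was improved, which is a quick way to see how much of the measured change comes from the split alone.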
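
Result: MTTR improved by 10% or more in 50 to 60% of runs

Why MTTR does not improve even when every incident is improved
• MTTR is sensitive to skew
• Meanwhile, incident data varies wildly

Characteristics of incident data [3]
• Most incidents are resolved fairly quickly
• A few turn into disasters
• → When the dataset is split at random, where the disastrous incidents land has a large effect on the computed MTTR
  • e.g. the MTTR improvement changes drastically depending on which half receives an incident that takes 5 quadrillion hours to recover
[3] The VOID Report

Reference: result of simulating without changing the repair times
→ Whether or not any improvement work was done, MTTR improves or worsens depending on the dataset

What "Incident Metrics in SRE" argues
• What the simulation showed
  • Incident durations vary so much that improvements are hard to see in MTTR
  • e.g. "MTTR improved 10% year over year!" may only mean there were fewer prolonged incidents
  • Note: MTTR would be valuable if the same number of incidents with the same recovery times occurred every year (which is not going to happen)
• Conclusion
  • MTTR is not useful as a metric for evaluating improvement
  • Because MTTR is sensitive to skew and incident data varies wildly

So what was the problem? Each element on its own is fine
• Incident durations being highly variable
• Using MTTR as some kind of metric
• Checking the results of improvement against a metric
→ The problem is that the purpose and the metric do not fit together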
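
To make the sensitivity to a single prolonged incident concrete, here is a toy example. The durations are invented, and the median is shown only for contrast; the talk itself does not propose it as a replacement.

```python
import statistics

# Hypothetical durations in minutes: most incidents end quickly, one is a disaster.
durations = [12, 15, 18, 20, 22, 25, 30, 35, 40, 2880]

mean_all = statistics.fmean(durations)
mean_without_outlier = statistics.fmean(durations[:-1])
median_all = statistics.median(durations)

print(f"mean with outlier:    {mean_all:.1f} min")              # 309.7 min
print(f"mean without outlier: {mean_without_outlier:.1f} min")  # 24.1 min
print(f"median:               {median_all:.1f} min")            # 23.5 min
```

One two-day incident moves the mean by more than a factor of ten, so which half of a random split it falls into decides the comparison.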
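
Flow of data analysis (hypothesis-testing type)

Flow of reasoning when MTTR is used as the metric

What was happening: an inconsistency in the hypothesis-testing logic

Solution: make clear what is being improved, and reduce variability

Summary so far
• MTTR (recovery time) data is highly variable, so it is unsuitable as an improvement metric
• By making clear what is being improved and using finer-grained TTX metrics, variability can be reduced
→ This creates demand for metrics finer than TTR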
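
3. Practical TTX metrics

What Waroom considers a practical metric
• Comprehensive
• Fine-grained
• Realistic to collect

Which TTX metrics should we collect?

The challenge with TTX metrics
• There are several published examples, but the definitions are not unified
• Trying to combine the examples leads to overlaps and gaps
• → Aim for a fine-grained, comprehensive definition based on well-known literature

Flow for defining TTX metrics
1. Learn the best practices
2. Define incident statuses
3. Define incident milestones (the boundaries between statuses)
4. Define TTX metrics

Learn the best practices

Roughly define the incident statuses

Turn the milestones into TTX metrics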
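
One way to picture the last step: if a timestamp is recorded at each milestone, each TTX metric is the gap between two milestones. The sketch below assumes a simple linear milestone order and a hypothetical "occurred" starting point; the exact milestone-to-TTX mapping used in the talk is on a slide that did not survive in the transcript.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed order for illustration; "occurred" is a hypothetical starting milestone.
MILESTONES = ["occurred", "detected", "acknowledged", "identified", "recovered"]


@dataclass
class Incident:
    timestamps: dict[str, datetime]

    def ttx(self) -> dict[str, timedelta]:
        """Time between each pair of consecutive milestones that were both recorded."""
        result = {}
        for prev, curr in zip(MILESTONES, MILESTONES[1:]):
            if prev in self.timestamps and curr in self.timestamps:
                result[f"{prev}->{curr}"] = self.timestamps[curr] - self.timestamps[prev]
        return result


t0 = datetime(2025, 10, 27, 10, 0)
incident = Incident({
    "occurred": t0,
    "detected": t0 + timedelta(minutes=4),
    "acknowledged": t0 + timedelta(minutes=9),
    "identified": t0 + timedelta(minutes=35),
    "recovered": t0 + timedelta(minutes=50),
})
for name, delta in incident.ttx().items():
    print(name, delta)
```

When every milestone is recorded, the segments add up to the overall time to recovery, so each one isolates a part of the response that can be improved on its own.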
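
Collecting metrics is hard work
• Once fine-grained metrics are defined, a timestamp has to be recorded every time a milestone is crossed
• Having a human punch in each time during a response is unrealistic
• → Waroom collects them automatically with a Slack bot

Example of automatic collection triggered by events during a response

| Milestone | Event during the response |
|---|---|
| Detected | Alert notification |
| Acknowledged | Channel created, incident filed |
| Identified (fix identified) | Runbook phase split (Precheck and Resolution) |
| Recovered | AI judges from the Slack conversation |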
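
A rough sketch of event-triggered collection follows. The event names, the class, and the mapping are hypothetical stand-ins loosely modelled on the milestones and triggers listed above; this is not Waroom's implementation and does not use any real Slack API.

```python
from datetime import datetime, timezone

# Hypothetical event names mapped to the milestone they signal.
EVENT_TO_MILESTONE = {
    "alert_fired": "detected",
    "incident_channel_created": "acknowledged",
    "runbook_resolution_started": "identified",
    "recovery_confirmed": "recovered",
}


class MilestoneRecorder:
    def __init__(self):
        self.timestamps = {}

    def handle(self, event_type, at=None):
        """Record the first time each milestone is reached; ignore other events."""
        milestone = EVENT_TO_MILESTONE.get(event_type)
        if milestone is None or milestone in self.timestamps:
            return None
        self.timestamps[milestone] = at if at is not None else datetime.now(timezone.utc)
        return milestone


recorder = MilestoneRecorder()
t0 = datetime(2025, 10, 27, 10, 0, tzinfo=timezone.utc)
recorder.handle("alert_fired", at=t0)
recorder.handle("chat_message", at=t0)                                 # unrelated, ignored
recorder.handle("incident_channel_created", at=t0.replace(minute=3))
recorder.handle("incident_channel_created", at=t0.replace(minute=8))   # repeat, ignored
print(sorted(recorder.timestamps))  # ['acknowledged', 'detected']
```

Keeping only the first occurrence means responders never have to stop and record a time themselves.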
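
4. Putting TTX metrics to use

To use metrics effectively, align the purpose of the analysis with the characteristics of the metric

Examples of metrics and improvement measures

| TTX | Issue | Improvement measure |
|---|---|---|
| TTDetect (detection) | It takes a long time from occurrence to detection | Improve monitoring |
| TTEngage (team formation) | It takes a long time to assemble the response team | Clarify shifts and roles, introduce an on-call rotation |
| TTInvestigate (investigation) | It takes a long time to isolate the failure | Maintain runbooks and dashboards |
| TTFix (repair) | It takes a long time to repair the failure | Speed up rollbacks |

Start from a vague hypothesis and find issues through analysis

| Hypothesis | Example of a newly found issue |
|---|---|
| The environment is shared, so each TTX value should be constant across the organization | Performance differs by service and team |
| Each TTX should be close to its expected value (e.g. TTA within about 10 minutes) | (In fact) work starts late across the board, and identifying the fix is slow across the board |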
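
Checking hypotheses like these amounts to grouping a TTX metric by team or service and comparing it with the expected value. A minimal sketch, assuming invented team names and records; using the median per group is this example's choice, not something the talk prescribes.

```python
import statistics
from collections import defaultdict

# Hypothetical records: (team, metric, minutes).
records = [
    ("payments", "TTA", 4), ("payments", "TTA", 6), ("payments", "TTA", 5),
    ("search", "TTA", 18), ("search", "TTA", 25), ("search", "TTA", 14),
]
# The hypothesis from the talk's example: TTA should be within about 10 minutes.
EXPECTED_MINUTES = {"TTA": 10}


def median_by_team(records, metric):
    """Group one metric by team and return the median value per team."""
    grouped = defaultdict(list)
    for team, name, minutes in records:
        if name == metric:
            grouped[team].append(minutes)
    return {team: statistics.median(values) for team, values in grouped.items()}


medians = median_by_team(records, "TTA")
over = {team: m for team, m in medians.items() if m > EXPECTED_MINUTES["TTA"]}
print(medians)  # {'payments': 5, 'search': 18}
print(over)     # {'search': 18}
```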
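
5. Applying o11y to incident response

Applying o11y to IR [2]
• Raise the observability of the internal structure of incident response even further
• Thanks to the TTX definitions, metrics are already more or less in place
• Could the practices around metrics, logs, and traces be used to improve it?
[2] Unveiling the black box with observability stack

Metrics

A nagging sense that the picture is one-sided
• The TTX metrics introduced so far all just break TTR down
• In other words, they focus only on shortening system recovery time
• From an SRE point of view, the reliability of the service matters
  • e.g. Did we learn anything? Will recurrence be prevented?
• From a product operations point of view, customer reliability matters too
  • e.g. Is the customer response being handled adequately?
=> Raising observability requires metrics that cover the response process from more angles

What goes on in parallel with system recovery
• Explaining to customers and sharing what happened
• Reporting on and analyzing the incident
• Considering and implementing root-cause countermeasures
=> At present, these activities are outside the scope of observation

Applying TTX metrics: widening the scope of observation
Extend the scope of observation to the whole incident response, and define metrics that serve as improvement indicators

| Metric name | Purpose |
|---|---|
| Incident Response Metrics | Identifying issues in, and tracking improvement of, the recovery work itself |
| Customer Reliability Metrics | Identifying issues in, and tracking improvement of, the customer response |
| Learning Metrics | Tracking the activities up to the point where the organization has learned from the incident |
| Improvement Metrics | Analyzing how far root-cause countermeasures have been implemented |

=> This time, I will show an example of Customer Reliability Metrics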
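
Log

Record events during the response
• Collection
  • Turn who did what, when, with which command, and on which decision into structured logs
  • e.g. chat, status changes, events integrated from external tools
• Example uses
  • Automatically generate timelines and status pages

Preparation is under way behind the scenes at Waroom...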
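
As a sketch of what structured response logs and a generated timeline could look like: the field names, helper functions, and sample events are illustrative assumptions, not Waroom's schema.

```python
import json
from datetime import datetime, timezone


def log_event(actor, action, detail, at):
    """Serialize one response event as a JSON line (field names are illustrative)."""
    return json.dumps({
        "timestamp": at.isoformat(),
        "actor": actor,
        "action": action,
        "detail": detail,
    })


def build_timeline(lines):
    """Parse JSON lines and return human-readable entries in chronological order."""
    events = sorted(
        (json.loads(line) for line in lines),
        key=lambda e: datetime.fromisoformat(e["timestamp"]),
    )
    return [f'{e["timestamp"]} {e["actor"]}: {e["action"]} ({e["detail"]})' for e in events]


def minute(m):
    return datetime(2025, 10, 27, 10, m, tzinfo=timezone.utc)


# Hypothetical events, deliberately logged out of order.
lines = [
    log_event("bob", "command", "rollback api to previous release", minute(20)),
    log_event("alice", "status_change", "investigating", minute(5)),
    log_event("pagerduty", "external_event", "alert triggered", minute(1)),
]
for entry in build_timeline(lines):
    print(entry)
```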
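
Trace

Observe the flow of the response process and its dependencies
• Collection
  • Create one span per incident status
  • Manage everything from detection to recovery as a single trace
  • Break it down by action and integrate
• Example uses
  • Visualize what was done between status transitions and how long it took
  • Identify the step that became the bottleneck of the response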
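
The trace idea can be sketched with a small span tree: one root span per incident, one child span per status, and optional grandchildren per action. The class, status names, and timings below are hypothetical and independent of any tracing library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Span:
    name: str
    start: datetime
    end: datetime
    children: list["Span"] = field(default_factory=list)

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def bottleneck(trace: Span) -> Span:
    """Return the longest direct child span (one span per incident status)."""
    return max(trace.children, key=lambda s: s.duration)


t0 = datetime(2025, 10, 27, 10, 0)


def at(minutes):
    return t0 + timedelta(minutes=minutes)


# Hypothetical incident: detection to recovery as a single trace.
trace = Span("incident", at(0), at(50), children=[
    Span("detected", at(0), at(5)),
    Span("investigating", at(5), at(35), children=[
        Span("check dashboard", at(5), at(15)),
        Span("read deploy log", at(15), at(35)),
    ]),
    Span("fixing", at(35), at(50)),
])
slowest = bottleneck(trace)
print(slowest.name, slowest.duration)  # investigating 0:30:00
```

Applying `bottleneck` again to the slowest status narrows the search down to the individual action that took the longest.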
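
How do we capture events when the work cuts across tools?
• During recovery, responders often move across many tools
  • e.g. PagerDuty → Slack → Datadog → AWS → GitHub...
• At present, only the responders know everything that was done for a single incident
• Having responders save MELT by hand is unrealistic
→ In a world where response is AI-based, far more information becomes obtainable!

AI-based incident response
• Based on what it is told in natural language, the AI carries out various operations while working with MCP servers and external tools
• → Actions always go through Waroom, so fine-grained events can be saved automatically

Summary
1. MTTR is not useful as an improvement metric
2. When using metrics, consistency from purpose through to data analysis is what matters
3. To reduce variability, make the question concrete and break the metrics down
4. The process for defining TTX metrics and how to use them
5. Bringing in o11y practices gets us closer to more comprehensive observation

In closing
• Building a mechanism that collects metrics automatically is hard work
• On top of that, building a visualization platform is hard work
• On top of that, slicing metrics by category and label is hard work
• → Please give Waroom a try
• If this caught your interest, come to the Topotal booth!

Thank you very much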