Data definition: modeling incidents
• Attributes of an Incident
• Title
• Status
• State Machine (described later)
• Severity
• SEV 1~3(IRF2.0)
• Direct Cause
• Described later
• Direct Cause System
• Groups of components, split along meaningful units of microservices
• Direct Cause Workload
• Online Service, Offline Pipeline, …
Define Enums wherever possible
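The attribute list above can be sketched as a small typed model. This is only an illustration of the "Enums wherever possible" idea, not the actual schema: the members of `DirectCause` and `DirectCauseSystem` are hypothetical placeholders (the slides define the real values elsewhere), `Status` is assumed to mirror the incident state machine, and `parse_workload` is a made-up helper name.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


class Status(Enum):
    # Assumption: Status follows the incident state machine.
    OCCURRED = "occurred"
    DETECTED = "detected"
    DECLARED = "declared"
    MITIGATED = "mitigated"
    RESOLVED = "resolved"


class DirectCause(Enum):
    # Hypothetical values, for illustration only.
    DEPLOYMENT = "deployment"
    CONFIG_CHANGE = "config_change"
    DEPENDENCY_FAILURE = "dependency_failure"


class DirectCauseSystem(Enum):
    # Hypothetical component groups, for illustration only.
    CHECKOUT = "checkout"
    SEARCH = "search"


class DirectCauseWorkload(Enum):
    ONLINE_SERVICE = "online_service"
    OFFLINE_PIPELINE = "offline_pipeline"


@dataclass(frozen=True)
class Incident:
    title: str  # the only free-form field
    status: Status
    severity: Severity
    direct_cause: DirectCause
    direct_cause_system: DirectCauseSystem
    direct_cause_workload: DirectCauseWorkload


def parse_workload(raw: str) -> DirectCauseWorkload:
    """Map user input onto the enum; unknown values raise ValueError."""
    return DirectCauseWorkload(raw.strip().lower())
```

Because every analytical field is an Enum, grouping incidents by cause, system, or workload yields a bounded number of buckets, and input that does not match a known value is rejected at entry time instead of becoming one more unique string.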
2-4. P2: Fundamental improvement: seeing incidents at higher resolution
Free-form input drives cardinality too high for any meaningful analysis
Incident modeling: Direct Cause
The cause of the incident. Defined so that analysis leads to something actionable
Observing key metrics: measuring MTTR and breaking it down
1. Occurred
2. Detected
3. Declared
4. Mitigated
5. Resolved
Time To Detect
Time To Mitigate
Time To Resolve
Now we can see where the time goes and where we currently stand!
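The five numbered states above can be sketched as a small state machine that records when each state was entered. This is a minimal illustration under two assumptions that the slide does not state: states are entered strictly in the listed order, and timestamps never go backwards. `IncidentTimeline` and `enter` are hypothetical names.

```python
from datetime import datetime
from enum import IntEnum
from typing import Dict, Optional


class State(IntEnum):
    OCCURRED = 1
    DETECTED = 2
    DECLARED = 3
    MITIGATED = 4
    RESOLVED = 5


class IncidentTimeline:
    """Records the time at which each state was entered, in order."""

    def __init__(self) -> None:
        self.entered_at: Dict[State, datetime] = {}

    @property
    def current(self) -> Optional[State]:
        # The latest state entered so far, or None before OCCURRED.
        return max(self.entered_at) if self.entered_at else None

    def enter(self, state: State, at: datetime) -> None:
        order = list(State)
        if len(self.entered_at) == len(order):
            raise ValueError("incident is already resolved")
        expected = order[len(self.entered_at)]
        if state is not expected:
            raise ValueError(f"expected {expected.name}, got {state.name}")
        previous = self.current
        if previous is not None and at < self.entered_at[previous]:
            raise ValueError("timestamps must not go backwards")
        self.entered_at[state] = at
```

Keeping one timestamp per state is what makes the intervals between states measurable later, rather than a single start-to-end duration.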
Raising the resolution of MTTR: the state machine
1. Occurred
2. Detected
3. Declared
4. Mitigated
5. Resolved
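Each interval matters, and each calls for different countermeasures!
Time To Detect
• The time it takes to detect the incident.
• Mainly the domain of alerting.
Time To Mitigate
• The time from detection to mitigation (stopping the bleeding).
• The most critical interval, but also an easy one to work on.
• The domain of the IRF.
Time To Resolve
• The time from mitigation through the root-cause fix and follow-up work such as corrections.
• The bleeding has already stopped, so accuracy matters more than speed.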
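The three intervals described above can be computed directly from the state timestamps. The sketch below assumes the boundaries given on this slide (occurrence to detection, detection to mitigation, mitigation to resolution); the function names `breakdown` and `mean_delta` and the sample timestamps in the usage note are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def breakdown(
    occurred: datetime,
    detected: datetime,
    mitigated: datetime,
    resolved: datetime,
) -> Dict[str, timedelta]:
    """Split one incident's duration into the three intervals."""
    return {
        "time_to_detect": detected - occurred,
        "time_to_mitigate": mitigated - detected,
        "time_to_resolve": resolved - mitigated,
    }


def mean_delta(deltas: List[timedelta]) -> timedelta:
    """Mean of a non-empty list of durations (MTTD, MTTM, ...)."""
    return sum(deltas, timedelta()) / len(deltas)
```

For an incident that occurred at 10:00, was detected at 10:12, mitigated at 10:42, and resolved at 13:00, `breakdown` yields 12 minutes, 30 minutes, and 2 hours 18 minutes. Averaging each key separately across incidents shows which interval dominates, which a single MTTR figure hides.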
3-2. Approaches to incident resolution time
Approaches to MTTD (Mean Time To Detect)
Getting alerts into shape
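• "Add a new alert every time something went undetected or was detected late" does not work well
• "over-monitoring is a harder problem to solve than under-monitoring." (SRE: How Google Runs Production Systems)
• Too many false-positive alerts, and the real ones get buried
• Shift to alerts based on SLO + Error Budget
• The ideal: if the pager goes off, it is an incident
• This cannot be done overnight!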
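One common way to alert on SLO + Error Budget is burn-rate alerting. The slide does not say which method is used, so the sketch below is an assumption rather than the presenter's implementation. Burn rate is the observed error rate divided by the error budget (1 minus the SLO target). The 14.4 threshold and the long and short window pair follow a widely cited example and are illustrative only; `burn_rate` and `should_page` are hypothetical names.

```python
from typing import Tuple

Window = Tuple[int, int]  # (bad_events, total_events) observed in a window


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    error_budget = 1.0 - slo_target
    return error_rate / error_budget


def should_page(
    long_window: Window,
    short_window: Window,
    slo_target: float,
    threshold: float = 14.4,  # illustrative value, tune per service
) -> bool:
    """Page only when both windows burn budget faster than the threshold."""
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )
```

Requiring both windows to exceed the threshold is one way to cut false positives: the long window shows the burn is significant, and the short window shows it is still happening, which brings a page closer to meaning "this is an incident".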
A remaining challenge. We will cover it in Chapter 4!