• SREϕετϓϥΫςΟεΛ৫ʹΠϯετʔϧ͢Δ • վળରKPI: • Mean Time Between Failure(MTBF) / Change Failure Rate(CFR) ʹΠϯγσϯτ • Mean Time to Recover(MTTR) ʹΠϯγσϯτղܾ࣌ؒ “զʑͳͥ͜͜ʹ͍Δͷ͔” ͷݴޠԽʂ VPoE͕͏·ͬͯ͘͘Ε·ͨ͠
ޙड़ • Severity • SEV 1~3(IRF2.0) • Direct Cause • ޙड़ • Direct Cause System • MicroServiceͷҙຯ୯ҐͰͷίϯϙʔωϯτ܈ • Direct Cause Workload • Online Service, Offline Pipeline, … ՄೳͳݶΓEnumΛఆٛ͢Δ 46 2-4. P2: ࠜຊվળ Πϯγσϯτͷղ૾ΛߴΊΔ ࣗ༝ೖྗͰΧʔσΟφϦςΟ͕ߴ͘ͳΓ͗ͯ͢·ͱͳੳ͕Ͱ͖ͳ͍
Resolved Time To Detect Time To Resolve Time To Mitigate ͦΕͧΕॏཁɺରࡦ͕ҟͳΔʂ ݕʹ͔͔Δ࣌ؒɻ ओʹAlertingͷྖҬ ݕʙࢭ݂ʹ͔͔Δ࣌ؒɻ ࠷ΫϦςΟΧϧ͕ͩɺΞϓϩʔν͍͢͠ IRFͷྖҬ ࢭ݂ʙࠜຊରԠ/ิਖ਼ͳͲޙॲཧʹ͔͔Δ࣌ؒɻ ͢Ͱʹࢭ݂͞Ε͍ͯΔͷͰɺ͞ΑΓਖ਼֬͞ ͕ٻΊΒΕΔ 3-2. Πϯγσϯτղܾ࣌ؒͷΞϓϩʔν
“over-monitoring is a harder problem to solve than under-monitoring.” — SRE: How Google Runs Production System • False Positive Ξϥʔτଟ͗͢ɺຒΕΔ • SLO + Error Budget ʹΑΔΞϥʔτͷγϑτ • Pager͞ΕͨΒΠϯγσϯτɺ͕ཧ • ҰேҰ༦ʹ͍͔ͳ͍ʂ ͞Εͨ՝ɻ̐ষͰ͓͠·͢ʂ 3-2. Πϯγσϯτղܾ࣌ؒͷΞϓϩʔν