Let's Talk About "Alerting": Looking at the Gap with the State of the Art in SREcon Talks, Papers, and More

Honmadekka SRE Study Group #1 (ホンマでっかSRE勉強会 #1)
https://honmadekka.connpass.com/event/395755/

Yuuki Tsubouchi (yuuk1)

June 25, 2026

Transcript

  1. Ironies of Automation
    Won't "alerts" still end up landing on humans?
    The irony: the more advanced a control system becomes, the more important the human operator's contribution becomes.
    Bainbridge, L. "Ironies of Automation." Automatica, 19(6):775–779, 1983.
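  2. Today's talk
    State of the art: SREcon and academic papers (ICSE, ASE, etc.). Split the alerting lifecycle into four phases and extract design principles.
    Current position: Grafana Alerting for the supercomputer service that yuuk1 works on.
    Goal: close the gap between the two.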
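  3. The four-phase model of alerting
    PHASE A Assurance → PHASE B Suppression → PHASE C Delivery optimization → PHASE D Resolution support
    Phase A: What do we alert on? Does what should alert actually alert?
    Phase B: How do we reduce unnecessary alerts?
    Phase C: Who receives the alert, and how is it delivered?
    Phase D: What does the recipient do?
    Cross-cutting: How do we measure alert quality?
    If Phase A is broken, no amount of improvement to Phase B and later matters.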
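  4. Gaps against the four phases
    Phase | Gap | Status
    A. Assurance | SLO-based alerts not yet introduced; no CI validation of rule health, no standing canary, … | ⚠ Insufficient
    B. Suppression | Current alert volume is 10 alerts/day or fewer. | ✓ OK
    C. Delivery optimization | Channels are split by impact scope × urgency, among other measures | ✓ OK
    D. Resolution support | Investigation automation / RCA / agents not yet started | ⚠ Insufficient
    Cross-cutting (quality) | Quality of Alerts / cost model not yet introduced | ⚠ Insufficient
    Techniques that assume large-scale microservices are excluded (papers propose many of these, such as using machine learning to suppress alert storms or optimize escalation).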
  5. Gaps against the four phases (same table as slide 4)
    Today we look at Phase A and the cross-cutting quality axis.
  6. PHASE A: symptom-based alerts
    ✕ Cause-based: CPU > 90%, Disk > 85%, OOM detected → often not actionable
    ✓ Symptom-based: SLO-based alerts ("users are being affected") → guarantees that action is required
    Techniques: multi-burn-rate SLO alerts, the 4 Golden Signals
    Thurgood, S., et al. "Alerting on SLOs." In Beyer, B. et al. (eds.), The Site Reliability Workbook, ch. 5. O'Reilly Media, 2018.
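The multi-burn-rate idea on slide 6 can be sketched as follows. This is a minimal illustration, not the speaker's configuration: the window pairs and thresholds (14.4, 6, 1) are the ones suggested in the cited Workbook chapter for a 30-day SLO period, while the names `BurnRateRule`, `burn_rate`, and `evaluate`, and the idea of passing pre-computed error ratios as a dict, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str   # window that proves the burn is sustained
    short_window: str  # window that proves the burn is still happening
    threshold: float   # burn-rate multiple that triggers the rule
    severity: str


# Window pairs and thresholds as suggested in the SRE Workbook, ch. 5,
# for a 30-day SLO period. Severity names are illustrative.
RULES = [
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def evaluate(error_ratios: dict[str, float], slo_target: float) -> list[tuple[str, str]]:
    """Return (severity, long_window) for every rule whose two windows both exceed the threshold."""
    fired = []
    for rule in RULES:
        long_br = burn_rate(error_ratios[rule.long_window], slo_target)
        short_br = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_br > rule.threshold and short_br > rule.threshold:
            fired.append((rule.severity, rule.long_window))
    return fired
```

Requiring both windows to exceed the threshold is what keeps a short spike from paging anyone: the long window filters out blips, and the short window lets the alert clear soon after the errors stop.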
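  7. Gap ①: symptom-based alerts
    ・ Today, cause-based and symptom-based alerts are mixed together
    ・ SLIs/SLOs have not been introduced yet
    Symptom-based: we would like to implement SLIs such as the following. The design is hard because there are many boundary surfaces with users…
    • Job submission availability
    • GPU compute capacity
    • Storage availability
    • Reachability of interactive nodes
    Cause-based: alerts added after an incident or a near miss are mostly cause-based. Automating alert investigation in PHASE D should make them easier to retire.
    • Job scheduler systemd startup
    • NTP time synchronization delay
    • Duplicate IP address configuration monitoring
    • …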
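  8. PHASE A: the alert rule does not exist in the first place
    40.41% of detection failures: the largest cause of detection failure is that "no corresponding alert was ever written" (empirical study at Microsoft [Ganatra+, FSE2023])
    Automatic recommendation from the alert rules of other microservices that share dependencies
    Automatic recommendation from past incident tickets
    Ganatra, V. et al. "Detection Is Better Than Cure: A Cloud Incidents Perspective." ESEC/FSE, pp. 310–321, 2023.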
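  9. PHASE A: alerts that somehow failed to fire
    The too-noisy problem (false alarm) has been discussed exhaustively
    The silently-broken problem (silent failure) gets little discussion
    ↳ e.g. an incident where a query starts returning an empty result the moment a label is changed
    CI: keep broken rules from being merged in the first place
    Daemon: detect rules that gradually break after they start running
    cloudflare/pint: a linter for Prometheus rules
    Resident canary: send vector(1) through every route to continuously verify that the notification path is alive
    Lukasz Mierzwa, "Monitoring our Monitoring: How we validate our Prometheus alert rules," The Cloudflare Blog, 2022, https://blog.cloudflare.com/monitoring-our-monitoring/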
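The silent-failure case on slide 9 (a label changes and the query quietly returns nothing) can be illustrated with a deliberately simplified check. This is a sketch under strong assumptions, not how cloudflare/pint works internally: rules are modeled as a metric name plus equality matchers rather than parsed PromQL, and `Rule`, `matches`, and `find_silent_rules` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    name: str
    metric: str
    # Equality matchers only; real PromQL selectors also allow !=, =~ and !~.
    matchers: dict[str, str] = field(default_factory=dict)


def matches(series: dict[str, str], rule: Rule) -> bool:
    """True if one series (a label set including __name__) satisfies the rule's selector."""
    if series.get("__name__") != rule.metric:
        return False
    return all(series.get(label) == value for label, value in rule.matchers.items())


def find_silent_rules(rules: list[Rule], known_series: list[dict[str, str]]) -> list[str]:
    """Names of rules whose selector matches no currently known series, so they can never fire."""
    return [
        rule.name
        for rule in rules
        if not any(matches(series, rule) for series in known_series)
    ]
```

Run in CI against a snapshot of series, a check like this catches rules that are broken before merge; run periodically against live data, it plays the role of the daemon that catches rules that break later.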
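  10. Phase B to D: current state and future
    B. Suppression ✓
    pending / keep_firing_for / silence / mute timing are each used where appropriate
    C. Delivery optimization ✓
    Alerts missing required labels are quarantined on the default route
    Alerts carrying the specific label lifecycle=experimental are kept out of production channels
    The four quadrants of impact scope × urgency decide the Slack channel and whether to page with @here
    D. Resolution support △
    Runbooks for cause-based alerts are not yet in place
    Future: automated evidence collection / RCA / agent handling
    Out of scope: ML aggregation, ranking, dynamic causal routing (not cost-effective at the scale of a single team)
    Iwahori, S. "Runbookに何を書き、どのようにアラートを振り分けるか？" [What to write in a runbook, and how to route alerts?] SRE NEXT 2023, Tokyo, 2023.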
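The Phase C routing rules on slide 10 can be sketched as a small decision function. Only the three rules themselves come from the slide (quarantine on missing required labels, `lifecycle=experimental` kept out of production, four quadrants of impact scope × urgency). The label names `impact` and `urgency`, their values, and every channel name are hypothetical placeholders.

```python
# Hypothetical label names; the slide does not say which labels are required.
REQUIRED_LABELS = ("impact", "urgency")

# Hypothetical quadrant table: (impact, urgency) -> (Slack channel, page with @here?)
QUADRANTS = {
    ("wide", "immediate"): ("#alerts-critical", True),
    ("wide", "deferred"): ("#alerts-critical", False),
    ("narrow", "immediate"): ("#alerts-team", True),
    ("narrow", "deferred"): ("#alerts-team", False),
}

DEFAULT_ROUTE = ("#alerts-default", False)
EXPERIMENTAL_ROUTE = ("#alerts-experimental", False)


def route(labels: dict[str, str]) -> tuple[str, bool]:
    """Return (channel, page_with_here) for one alert's label set."""
    # Alerts missing a required label are quarantined on the default route.
    if any(name not in labels for name in REQUIRED_LABELS):
        return DEFAULT_ROUTE
    # Experimental alerts never reach production channels.
    if labels.get("lifecycle") == "experimental":
        return EXPERIMENTAL_ROUTE
    # Unknown label values also fall back to the default route.
    return QUADRANTS.get((labels["impact"], labels["urgency"]), DEFAULT_ROUTE)
```

Sending unlabeled alerts to a quarantine route rather than dropping them keeps a mislabeled rule visible, which ties back to the silent-failure concern in Phase A.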
  11. Cross-cutting: how to measure alert quality ① QoA (Quality of Alerts), three axes
    Indicativeness: does the alert show what is happening?
    Precision: how accurate is it?
    Handleability: can the recipient handle it?
    Anti-pattern | QoA axis that degrades
    A1 Unclear Name/Description | Handleability
    A2 Misleading Severity | Precision
    A3 Improper/Outdated Generation Rule | Indicativeness
    A4 Transient/Toggling Alerts | Precision / Handleability
    A5 Repeating Alerts | Indicativeness
    A6 Cascading Alerts | Indicativeness / Precision
    [Yang+, DSN2022] Yang, T., Shen, J., Su, Y., Ren, X., Yang, Y., Lyu, M. R. "Characterizing and Mitigating Anti-patterns of Alerts in Industrial Cloud Systems." DSN, pp. 1–12, 2022.
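  12. Cross-cutting: how to measure alert quality ② Cost model [Zadka, SREcon2022]
    A reversal of the usual framing: measure alert quality as the sum of costs (anti-quality)
    Cost of responding to false alerts = people × time × inconvenience factor (e.g. whether it is outside working hours)
    Cost of handling true alerts = cost of recovery work
    Damage from missing alerts = additional time to recovery × scale of impact, plus the growth in damage caused by delayed detection
    Timeline: occurrence → detection → acknowledgement → diagnosis → recovery, measured as four intervals ① ② ③ ④
    Zadka, M. "Modeling Alert Quality." USENIX SREcon22 Americas, 2022.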
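The cost model on slide 12 can be written out as a few lines of arithmetic. This is a sketch of the slide's formulas only, not of the cited talk's full model: treating the total as a plain sum of the three terms, adding the detection-delay damage to the missing-alert term, and the names `Effort`, `missing_alert_damage`, and `total_alert_cost` are all assumptions, as are the units (person-hours and an abstract impact unit).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effort:
    people: int
    hours: float
    # > 1.0 when the work is more disruptive, e.g. outside working hours.
    inconvenience: float = 1.0

    def cost(self) -> float:
        """people × time × inconvenience factor, as on the slide."""
        return self.people * self.hours * self.inconvenience


def missing_alert_damage(extra_hours_to_recovery: float,
                         impact_per_hour: float,
                         delay_damage: float = 0.0) -> float:
    """Additional time to recovery × scale of impact, plus damage growth from late detection."""
    return extra_hours_to_recovery * impact_per_hour + delay_damage


def total_alert_cost(false_alerts: list[Effort],
                     true_alerts: list[Effort],
                     missing_damage: float) -> float:
    """Alert 'anti-quality' as a sum of costs; lower is better."""
    false_cost = sum(e.cost() for e in false_alerts)
    true_cost = sum(e.cost() for e in true_alerts)  # recovery work
    return false_cost + true_cost + missing_damage
```

For example, one off-hours false page (`Effort(1, 0.5, 2.0)`) costs as much as two people spending half an hour on it during working hours, which is the kind of comparison the inconvenience factor is meant to make explicit.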
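  13. Summary
    Motivation: alerting looks set to remain in the AI era, yet its own progress is hard to see. Phase D (resolution support) is Agentic SRE.
    Claim: decompose alert management into a four-phase model and compare each phase with the state of the art
    Insight: Phase A (assurance) and alert quality evaluation have the most open issues
    Solution: SLO adoption, CI for rule health, organizing alert anti-patterns, and a cost model
    With a survey of the state-of-the-art literature, can we move from ad hoc improvement to an approach based on gap analysis against the ideal?
    Catching up with the state of the art is now realistic with tools such as an LLM Wiki
    List of SRE-related international conferences: https://gist.github.com/yuuki/60b768fcb6bdf3f3552ee59f5a9e4972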