Slide 1

Slide 1 text

#SRE論文紹介 (SRE paper introduction)
Detection is Better Than Cure: A Cloud Incidents Perspective
V. Ganatra et al., ESEC/FSE’23
Yuuki Tsubouchi / @yuuk1t, Technology Advisor, Topotal
Waroom Meetup #1, 2024/06/04

Slide 2

Slide 2 text

Learning an engineering approach to incident response from research papers

Slide 3

Slide 3 text

Metadata of the paper
- Authors: affiliated with Microsoft India, China, and US respectively
- Venue: ESEC/FSE, a top conference in software engineering (CORE Rank A)
- Year: 2023
Ganatra V, Parayil A, Ghosh S, Kang Y, Ma M, Bansal C, Nath S, Mace J. Detection Is Better Than Cure: A Cloud Incidents Perspective. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering 2023 (pp. 1891-1902).
Note: Among the mega cloud vendors, Microsoft has published many papers in the field of incident management.

Slide 4

Slide 4 text

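Problem statement for incident detection: the Monitoring Gap
(1) Monitors are missing (miss-detection)
- The early symptoms of an incident cannot be detected
(2) Unnecessary monitors exist
- Monitors tend to be created ad hoc after an incident is resolved
- This causes alert storms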

Slide 5

Slide 5 text

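Is anyone doing something similar?
Empirical studies of large-scale cloud services
- Identified common root causes of software bugs by analyzing Azure incidents [Liu+, HotOS2019]
- Identified common root causes and mitigations by analyzing Teams incidents [Ghosh+, SoCC2022]
Empirical studies analyzing alert storms [Zhao+, ICSE-SEIP 2020]
- Analyzed alert storms in a large-scale banking system
- Proposed an algorithm for selecting representative alerts
There is no empirical study on miss-detection.
[Liu+, HotOS2019]: What bugs cause production cloud incidents?
[Ghosh+, SoCC2022]: How to fight production incidents? an empirical study on a large-scale cloud service
[Zhao+, ICSE-SEIP 2020]: Understanding and handling alert storm for online service systems.
https://x.com/yuuk1t/status/1648558134481547264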

Slide 6

Slide 6 text

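What did this paper do?
Research objective: analyze the monitoring systems of more than 300 Microsoft services and understand the common causes of miss-detection.
- Investigated more than 950 incidents that show evidence of a monitor having been modified
- Sorted them using manual labeling and pseudo-labeling with GPT-3.5
- Classified the main causes of miss-detection and derived insights
Omitted in this talk.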

Slide 7

Slide 7 text

How frequent is miss-detection?
- 96.7% of incidents were detected by monitors
- However, there were also cases in which the "early symptoms" could not be detected proactively

Slide 8

Slide 8 text

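The six main causes of miss-detection
(1) Missing/improper signal: the required telemetry does not exist
(2) Missing monitor/alert: the telemetry exists, but there is no monitor or alert
(3) Improper monitor coverage: the monitor does not cover the incident
(4) Incorrect alerting logic: the logic is inappropriate, for example a threshold that is too high
(5) Buggy monitor: a bug in the monitor configuration (for example, not using the new version of a metric)
(6) Others: the alert documentation (runbook) is missing or incorrect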
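As a small illustration of how these six classes could be used when tallying labeled incidents, here is a minimal Python sketch. The class names follow the slide; the `cause_shares` helper and the example labels are hypothetical and are not data or code from the paper.

```python
from collections import Counter
from enum import Enum


class MissDetectionCause(Enum):
    """The six miss-detection classes listed on the slide."""
    MISSING_OR_IMPROPER_SIGNAL = "Missing/improper signal"
    MISSING_MONITOR_OR_ALERT = "Missing monitor/alert"
    IMPROPER_MONITOR_COVERAGE = "Improper monitor coverage"
    INCORRECT_ALERTING_LOGIC = "Incorrect alerting logic"
    BUGGY_MONITOR = "Buggy monitor"
    OTHERS = "Others"


def cause_shares(labels):
    """Return the fraction of labeled incidents per cause (hypothetical helper)."""
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        return {cause: 0.0 for cause in MissDetectionCause}
    # Counter returns 0 for causes that never appear in the labels.
    return {cause: counts[cause] / total for cause in MissDetectionCause}


# Made-up labels for illustration only; NOT data from the paper.
example_labels = [
    MissDetectionCause.MISSING_MONITOR_OR_ALERT,
    MissDetectionCause.MISSING_MONITOR_OR_ALERT,
    MissDetectionCause.MISSING_OR_IMPROPER_SIGNAL,
    MissDetectionCause.INCORRECT_ALERTING_LOGIC,
]
shares = cause_shares(example_labels)
```

Keeping the classes in an enum makes it harder for free-text labels to drift apart when several people label incidents.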

Slide 9

Slide 9 text

Which causes of miss-detection are most common?
- Missing monitors/alerts account for more than 40%, by far the largest share
- Missing telemetry comes next

Slide 10

Slide 10 text

How large is the impact of miss-detection?
- 27.5% of miss-detections led to a service outage
- For incorrect alert logic and incorrect documentation, more than 40% led to an outage
Figure 5: (a) Proportion of incidents from each miss-detection class that led to outages

Slide 11

Slide 11 text

How large is the impact of miss-detection?
- TTD is largest when monitors or telemetry are missing
- TTM becomes especially high when telemetry or documentation is missing
Figure 5: (b) Time to Detect (TTD) and Time to Mitigate (TTM) for cloud incidents that were not detected properly.
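For readers who want to compute these two metrics from their own incident records, the following is a minimal sketch. It assumes that both TTD and TTM are measured from the start of the incident; the paper may define the intervals differently. The `Incident` record, its field names, and the timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started_at: datetime
    detected_at: datetime
    mitigated_at: datetime

    @property
    def ttd(self) -> timedelta:
        # Time to Detect: assumed to run from incident start to detection.
        return self.detected_at - self.started_at

    @property
    def ttm(self) -> timedelta:
        # Time to Mitigate: assumed to run from incident start to mitigation.
        return self.mitigated_at - self.started_at


def median_minutes(durations):
    """Median of a collection of timedeltas, in minutes."""
    return median(d.total_seconds() / 60 for d in durations)


# Made-up timestamps for illustration only.
incidents = [
    Incident(datetime(2024, 6, 4, 10, 0), datetime(2024, 6, 4, 10, 5),
             datetime(2024, 6, 4, 10, 35)),
    Incident(datetime(2024, 6, 4, 12, 0), datetime(2024, 6, 4, 12, 25),
             datetime(2024, 6, 4, 13, 30)),
]
median_ttd = median_minutes(i.ttd for i in incidents)
median_ttm = median_minutes(i.ttm for i in incidents)
```

Grouping incidents by miss-detection class before calling `median_minutes` would give a per-class comparison similar in spirit to the figure.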

Slide 12

Slide 12 text

Other analyses of miss-detection causes
- By service type
- By service maturity
- By number of dependent services
- By presence or absence of an SLA
- By SLI cluster

Slide 13

Slide 13 text

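How can we do better monitoring?
First incident
1. A DB in one region went down
2. Services in the same region could no longer open new connections to the DB
3. DB connection failures were not being monitored, so the incident could not be detected
4. A DB connection monitor was added as an action item
Second incident
1. In a different service, enqueueing took a long time and jobs got stuck
2. The service ran into SQL timeouts
3. The incident was not detected by a monitor and was observed manually
This service shared more than 40% of its dependencies with the first service and had a large number of monitors in common with it.
Improvement: using knowledge of the dependencies and monitors common to the two services, adding the monitor could have been suggested automatically.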
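The idea of suggesting monitors from shared dependencies can be sketched roughly as follows. This is not the paper's method: the Jaccard overlap measure, the 0.4 threshold (loosely echoing the "more than 40%" figure on the slide), the function names, and the service, dependency, and monitor names are all assumptions made for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_monitors(target, others, dependencies, monitors, min_overlap=0.4):
    """Suggest monitors that similar services have but `target` lacks.

    Hypothetical helper: two services count as similar when the Jaccard
    overlap of their dependency sets is at least `min_overlap`.
    Returns (monitor, overlap) pairs, highest overlap first.
    """
    suggestions = {}
    for other in others:
        overlap = jaccard(dependencies[target], dependencies[other])
        if overlap < min_overlap:
            continue
        for monitor in monitors[other] - monitors[target]:
            suggestions[monitor] = max(suggestions.get(monitor, 0.0), overlap)
    return sorted(suggestions.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up services, dependencies, and monitors for illustration only.
dependencies = {
    "service-a": {"sql-db", "queue", "auth"},
    "service-b": {"sql-db", "queue", "cache"},
}
monitors = {
    "service-a": {"cpu", "latency", "db-connection-failures"},
    "service-b": {"cpu", "latency"},
}
result = suggest_monitors("service-b", ["service-a"], dependencies, monitors)
```

In this toy setup the two services share two of four distinct dependencies, so the monitor added after the first incident is proposed for the second service before it has an incident of its own.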

Slide 14

Slide 14 text

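Thoughts: toward better incident management
Current practice: monitors are added ad hoc based on retrospectives
Diagram labels: Monitoring, Detection, Retrospective, Incident response, Resolution, On-call, Monitor configuration, Alert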

Slide 15

Slide 15 text

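Thoughts: toward better incident management
Diagram labels: Monitoring, Detection, Retrospective, Incident response, Resolution, On-call, Monitor configuration, Alert
- Accumulate data
- Incident metrics: number of wasted alerts/calls, number of miss-detections
- Actions: automatic suggestion of useful/unnecessary monitors, etc.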

Slide 16

Slide 16 text

From ad hoc improvement to software engineering

Slide 17

Slide 17 text

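Summary
- Introduced an empirical research paper on the incident miss-detection problem at Microsoft
- Most incidents were detected by monitors, but missed early symptoms not infrequently led to system outages
- As for the causes of miss-detection, "missing monitors/alerts account for 40% of the total"
- A smarter monitoring framework could, for example, automatically suggest the monitors that are missing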

Slide 18

Slide 18 text

Appendix: how to find and read papers
- At SRE NEXT 2023, I introduced how to find and read SRE papers
https://blog.yuuk.io/entry/2023/srenext2023