#SRE論文紹介 (SRE paper introduction): Detection is Better Than Cure: A Cloud Incidents Perspective
V. Ganatra et al., ESEC/FSE’23

Waroom Meetup #1

Ganatra V, Parayil A, Ghosh S, Kang Y, Ma M, Bansal C, Nath S, Mace J. Detection Is Better Than Cure: A Cloud Incidents Perspective. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering 2023 (pp. 1891-1902).

Yuuki Tsubouchi (yuuk1)

June 04, 2024

  1. #SRE論文紹介 (SRE paper introduction). Detection is Better Than Cure: A Cloud Incidents Perspective. V. Ganatra et al., ESEC/FSE’23.

    Yuuki Tsubouchi / @yuuk1t, Technology Advisor at Topotal. Waroom Meetup #1, 2024/06/04.
  2. Paper metadata

    - Authors: affiliated with Microsoft India, China, and US, respectively
    - Venue: ESEC/FSE, a top conference in software engineering (CORE Rank A)
    - Year: 2023
    - Citation: Ganatra V, Parayil A, Ghosh S, Kang Y, Ma M, Bansal C, Nath S, Mace J. Detection Is Better Than Cure: A Cloud Incidents Perspective. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering 2023 (pp. 1891-1902).

    Note: among the mega cloud vendors, Microsoft has published a large number of papers in the field of incident management.
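  3. Is anyone else doing something similar?

    - Empirical studies of large-scale cloud services
      - Identified common root causes of software bugs by analyzing Azure incidents. [Liu+,HotOS2019]: What bugs cause production cloud incidents?
      - Identified common root causes and mitigations by analyzing Teams incidents. [Ghosh+, SoCC2022]: How to fight production incidents? an empirical study on a large-scale cloud service
    - Empirical study analyzing alert storms. [Zhao+,ICSE/SEP’2020]. Understanding and handling alert storm for online service systems.
      - Analyzed alert storms in a large-scale banking system
      - Proposed an algorithm for selecting representative alerts
      - https://x.com/yuuk1t/status/1648558134481547264
    - There is no empirical study on miss-detection (incidents that monitoring failed to catch).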
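  4. The six main causes of miss-detection

    - (1) Missing/improper signal: the required telemetry does not exist
    - (2) Missing monitor/alert: the telemetry exists, but there is no monitor or alert on it
    - (3) Improper monitor coverage: the monitor does not cover the incident
    - (4) Incorrect alerting logic: the logic is inappropriate, for example a threshold that is too high
    - (5) Buggy monitor: a bug in the monitor configuration (for example, it cannot use the new version of a metric)
    - (6) Others: the alert documentation (runbook) is missing or wrong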
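
To make the taxonomy concrete, the sketch below walks the six categories as a triage checklist. It is not taken from the paper: the `classify` function, its boolean inputs, and the order in which the checks are applied are all assumptions made for illustration.

```python
from enum import Enum


class MissDetectionCause(Enum):
    """The six categories listed on the slide."""
    MISSING_OR_IMPROPER_SIGNAL = 1
    MISSING_MONITOR_OR_ALERT = 2
    IMPROPER_MONITOR_COVERAGE = 3
    INCORRECT_ALERTING_LOGIC = 4
    BUGGY_MONITOR = 5
    OTHERS = 6


def classify(has_telemetry: bool, has_monitor: bool, covers_incident: bool,
             logic_correct: bool, monitor_works: bool) -> MissDetectionCause:
    """Hypothetical triage order: check the earliest missing layer first."""
    if not has_telemetry:
        return MissDetectionCause.MISSING_OR_IMPROPER_SIGNAL
    if not has_monitor:
        return MissDetectionCause.MISSING_MONITOR_OR_ALERT
    if not covers_incident:
        return MissDetectionCause.IMPROPER_MONITOR_COVERAGE
    if not logic_correct:
        return MissDetectionCause.INCORRECT_ALERTING_LOGIC
    if not monitor_works:
        return MissDetectionCause.BUGGY_MONITOR
    # Everything technical was in place, e.g. the runbook was missing or wrong.
    return MissDetectionCause.OTHERS


# Telemetry exists but nobody defined a monitor on it: category (2).
print(classify(True, False, False, False, False).name)
```

The checks go from the bottom of the monitoring stack upward, so an incident with several problems is assigned to the earliest missing layer.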
  5. How large is the impact of miss-detection?

    Figure 5: (b) Time to Detect (TTD) and Time to Mitigate (TTM) for cloud incidents that were not detected properly.

    - TTD is largest when the monitor or telemetry is missing.
    - TTM is especially high when telemetry or documentation is missing.
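
As a worked illustration of the two metrics, the sketch below computes the median TTD and TTM per miss-detection cause. The `Incident` record, the timestamps, and the cause labels are hypothetical. The code also assumes that both TTD and TTM are measured from the start of impact, which the slide does not state.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Incident:
    cause: str                # miss-detection category (hypothetical label)
    impact_start: datetime    # when customer impact began
    detected_at: datetime     # when the incident was detected
    mitigated_at: datetime    # when the impact was mitigated

    @property
    def ttd(self) -> timedelta:
        # Assumption: TTD is measured from the start of impact.
        return self.detected_at - self.impact_start

    @property
    def ttm(self) -> timedelta:
        # Assumption: TTM is also measured from the start of impact.
        return self.mitigated_at - self.impact_start


def median_minutes_by_cause(incidents):
    """Return {cause: (median TTD in minutes, median TTM in minutes)}."""
    groups = defaultdict(list)
    for inc in incidents:
        groups[inc.cause].append(inc)
    return {
        cause: (
            median(i.ttd.total_seconds() / 60 for i in group),
            median(i.ttm.total_seconds() / 60 for i in group),
        )
        for cause, group in groups.items()
    }


t0 = datetime(2024, 1, 1, 0, 0)
minutes = lambda n: timedelta(minutes=n)
incidents = [
    Incident("missing_monitor", t0, t0 + minutes(30), t0 + minutes(90)),
    Incident("missing_monitor", t0, t0 + minutes(50), t0 + minutes(120)),
    Incident("incorrect_alerting_logic", t0, t0 + minutes(10), t0 + minutes(40)),
]
print(median_minutes_by_cause(incidents))
# {'missing_monitor': (40.0, 105.0), 'incorrect_alerting_logic': (10.0, 40.0)}
```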
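  6. How can we monitor better?

    First incident:
    1. A DB in one region went down.
    2. Services in the same region could no longer open new connections to the DB.
    3. DB connection failures were not being monitored, so the incident could not be detected.
    4. A monitor for DB connections was added as an action item.

    Second incident:
    1. In a different service, enqueuing took a long time and jobs piled up.
    2. The service ran into SQL timeouts.
    3. The incident was not detected by a monitor and was noticed manually.

    The second service shared more than 40% of its dependencies with the first service and had a large number of monitors in common with it.

    Improvement: using knowledge of the dependencies and monitors common to the two services, adding the monitor could have been suggested automatically.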
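
The improvement described above could look roughly like the following sketch. It is not the paper's method: the service names, the data layout (a set of dependencies plus a mapping from monitor name to the dependency it watches), and the use of 40% as a cut-off are all assumptions made for illustration.

```python
def dependency_overlap(target_deps: set, peer_deps: set) -> float:
    """Fraction of the target's dependencies that the peer also depends on."""
    if not target_deps:
        return 0.0
    return len(target_deps & peer_deps) / len(target_deps)


def suggest_monitors(target: dict, peer: dict, threshold: float = 0.4) -> list:
    """Suggest monitors the peer has on shared dependencies and the target lacks.

    Each service is a dict with "dependencies" (set of names) and
    "monitors" (dict: monitor name -> dependency it watches).
    The 0.4 threshold is a hypothetical choice echoing the 40% figure.
    """
    shared = target["dependencies"] & peer["dependencies"]
    if dependency_overlap(target["dependencies"], peer["dependencies"]) < threshold:
        return []
    return sorted(
        name
        for name, dep in peer["monitors"].items()
        if dep in shared and name not in target["monitors"]
    )


# Hypothetical data: the first service added a DB connection monitor
# after its incident; the second service never received one.
first_service = {
    "dependencies": {"regional-db", "cache", "auth", "queue"},
    "monitors": {"db_connection_failures": "regional-db",
                 "cache_hit_rate": "cache"},
}
second_service = {
    "dependencies": {"regional-db", "queue", "blob-store"},
    "monitors": {"queue_depth": "queue"},
}
print(suggest_monitors(second_service, first_service))
# ['db_connection_failures']
```

Only monitors that watch a shared dependency are suggested, so `cache_hit_rate` is skipped because the second service does not depend on the cache.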
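  7. Thoughts: toward better incident management

    Diagram labels:
    - Monitoring, detection, incident response, resolution, retrospective
    - On-call, monitor configuration, alerts
    - Accumulate data: incident metrics, number of miss-detections, number of wasted alerts/calls
    - Actions: automatic suggestion of useful/unneeded monitors, and so on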