A Survey of Research on Anomaly Cause Discovery Centered on Service Level Objectives (SLOs) / SLO-based Causality Discovery in Distributed Applications

Yuuki Tsubouchi (yuuk1)

July 07, 2020

Transcript

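  1. A Survey of Research on Anomaly Cause Discovery Centered on Service Level Objectives (SLOs)
    Yuuki Tsubouchi, D1, Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University
    Okabe-Miyazaki Laboratory research meeting
    July 7, 2020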
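  2. Recap of the previous talk: a fast causal discovery method for anomalies in distributed applications
    - Background: As distributed applications grow more complex, identifying the cause of an anomaly has become difficult.
    - Problem: Existing methods build a causal graph from time-series metrics when the service level degrades. However, the administrator must select in advance which kinds of metrics to search.
    - Proposal: Assume that all metric series are searched, and make the search fast with the following kinds of techniques.
      - GPGPU, reuse of past search results, use of the system change history, detection and exclusion of replicated hosts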
  3. Contents
    1. Service level indicators and service level objectives
    2. Causality of anomalies in distributed applications
    3. Summary and future plans

  4. 1. Service Level Indicators and Service Level Objectives

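  5. Conventional operations: making sure the system never goes down
    - Conventional operations management aims to reduce system errors to zero.
    - The most common cause of system failures is configuration errors made by administrators; hardware faults account for only 10 to 25% *1.
    - Administrators therefore try to prevent errors by not changing the system.
    - As a result, hardware failure rates rise, new service features are delayed, or software vulnerabilities are left in place.
    *1 David Oppenheimer, Archana Ganapathi, and David A. Patterson: "Why Do Internet Services Fail, and What Can Be Done About It?", USENIX Symposium on Internet Technologies and Systems (USITS), 2003.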
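  6. Do not aim for 100% reliability
    Reliability is the probability that a system performs its required function without failure, under stated conditions, for a stated period of time *2.
    - Service Level Indicator (SLI)
    - Service Level Objective (SLO)
    Many service providers operating in the cloud use SLIs and SLOs to measure reliability quantitatively and use the results in decision-making.
    Once a reliability indicator and its target value are decided, the service provider can actively change the system as long as the indicator does not fall below the target during the measurement period.
    Note: An SLA (Service Level Agreement) is a business contract and includes items such as compensation for user dissatisfaction.
    *2 P. O'Connor, A. Kleyner. Practical Reliability Engineering, 5th edition, Wiley 2012.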
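  7. Types of systems
    | System type | Description |
    |---|---|
    | Request-driven | The user creates some type of event and expects a response to come back |
    | Pipeline | Takes records as input, transforms them, and outputs them somewhere else |
    | Storage | Accepts data (byte sequences, records, files, and so on) and makes it available to be retrieved later |
    [Figure: service example, quoted from *3, Figure 2-1. Architecture for an example mobile phone game]
    *3 Betsy Beyer et al., The Site Reliability Workbook: Practical Ways to Implement SRE, O'Reilly Media, Inc. 2018.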
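  8. Definition of an SLI and example SLI types
    Excerpted from *3, Table 2-1. Potential SLIs for different types of components
    | System type | SLI type | Description |
    |---|---|---|
    | Request-driven | Availability | Proportion of requests that resulted in a successful response |
    | Request-driven | Latency | Proportion of requests processed faster than a threshold |
    | Pipeline | Freshness | Proportion of data updated more recently than a time threshold |
    | Pipeline | Correctness | Proportion of input records to the pipeline that led to the correct value being output |
    | Storage | Durability | Proportion of written records that can be read back correctly |
    SLI = (good events / valid events) × 100
    *3 Betsy Beyer et al., The Site Reliability Workbook: Practical Ways to Implement SRE, O'Reilly Media, Inc. 2018.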
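
The ratio on this slide can be written down directly. The following is a minimal sketch, not part of the original talk; the function name `sli_percent` and the event counts are assumptions chosen for illustration.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Return the SLI as the percentage of valid events that were good."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events / valid_events * 100


# Hypothetical counts for a request-driven service: 10,000 valid requests,
# of which 9,970 received a successful response.
availability = sli_percent(good_events=9_970, valid_events=10_000)
print(f"availability SLI: {availability:.1f}%")
```

Restricting the denominator to valid events keeps requests that should not count (for example, malformed ones) from distorting the indicator.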
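  9. Example SLI and SLO settings
    Excerpted from *3, Appendix A: Example of SLO Document
    | Category | SLI | SLO |
    |---|---|---|
    | Availability | Proportion of successful requests, measured from the load balancer metrics | 97% success rate |
    | Latency | Proportion of sufficiently fast requests, measured from the load balancer metrics | 90% of requests at or below 400 ms; 99% of requests at or below 850 ms |
    | Freshness | Proportion of records read from the league table that were updated recently, where "recently" is defined as within 1 minute or within 10 minutes | 90% of reads use data written within the last 1 minute |
    *3 Betsy Beyer et al., The Site Reliability Workbook: Practical Ways to Implement SRE, O'Reilly Media, Inc. 2018.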
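
As a rough illustration of how the latency objectives in this table could be evaluated, here is a small sketch. The `LatencySLO` class, the helper names, and the sample latencies are hypothetical; only the two targets (90% at or below 400 ms, 99% at or below 850 ms) come from the slide.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencySLO:
    target_percent: float  # required share of sufficiently fast requests
    threshold_ms: float    # a request is "fast enough" at or below this latency


def fast_enough_percent(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Latency SLI: percentage of requests served at or below the threshold."""
    if not latencies_ms:
        raise ValueError("no requests were observed")
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return fast / len(latencies_ms) * 100


def is_met(slo: LatencySLO, latencies_ms: Sequence[float]) -> bool:
    """True when the measured SLI reaches the SLO target."""
    return fast_enough_percent(latencies_ms, slo.threshold_ms) >= slo.target_percent


# The two latency objectives from the table.
SLOS = [LatencySLO(90.0, 400.0), LatencySLO(99.0, 850.0)]

# Hypothetical sample: 92 fast, 6 medium, and 2 slow requests.
sample = [120.0] * 92 + [600.0] * 6 + [1200.0] * 2
results = [is_met(slo, sample) for slo in SLOS]
print(results)
```

With this sample the 400 ms objective holds while the 850 ms objective does not, which shows why the two thresholds are tracked separately.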
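  10. How alerting changes as SLIs and SLOs spread
    - Conventionally, monitoring was configured per individual component, as follows:
      - Check monitoring (binary results such as PING responses and whether a port or process is alive)
      - Resource monitoring (thresholds on the consumption of resources such as memory)
    - A component's resource usage exceeding a threshold does not necessarily mean that a serious event requiring action is occurring.
    - To reduce noise, alert on degradation of the SLI instead.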

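  11. SLO-based alerting
    Let error budget = 1 - SLO target.
    When the error budget is being consumed quickly, a prompt response is needed; when it is being consumed slowly, it can be handled on a scale of days.
    [Figure: two plots of remaining budget (%) against days (0 to 10), one for slow budget consumption and one for fast budget consumption]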
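
The error budget definition above can be turned into a small calculation. This is a sketch under stated assumptions: the "burn rate" formulation (observed error ratio divided by the error budget), the function names, and the threshold of 10 that separates fast from slow consumption are illustrative choices, not values from the talk.

```python
def error_budget(slo_target: float) -> float:
    """Error budget as a fraction of events, per the slide: 1 - SLO target."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def budget_remaining_percent(slo_target: float, bad_events: int,
                             expected_events: int) -> float:
    """Share of the error budget still unspent over the SLO window, in percent."""
    allowed_bad = error_budget(slo_target) * expected_events
    return max(0.0, 1.0 - bad_events / allowed_bad) * 100


def burn_rate(slo_target: float, bad_events: int, valid_events: int) -> float:
    """Observed error ratio divided by the error budget.

    A value of 1.0 spends the budget exactly over the SLO window;
    larger values exhaust it sooner.
    """
    return (bad_events / valid_events) / error_budget(slo_target)


def urgency(rate: float, fast_threshold: float = 10.0) -> str:
    """Map a burn rate to a response style (the threshold is an assumption)."""
    return "respond promptly" if rate >= fast_threshold else "respond within days"


# Hypothetical measurements against the 97% availability SLO.
slow = burn_rate(0.97, bad_events=6, valid_events=1_000)
fast = burn_rate(0.97, bad_events=450, valid_events=1_000)
print(urgency(slow), "/", urgency(fast))
```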
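  12. An approach to system anomalies centered on SLIs and SLOs
    - Prediction: find behavior similar to the behavior observed just before the SLI degraded in the past
    - Anomaly localization: search for the anomalous location when the SLO is violated
    - Root-cause investigation: (under construction)
    - Recovery: feedback control that targets the SLI
    - SLI substitution: combine existing metrics to create a substitute indicator for the SLI
    [Figure: diagram relating prediction, detection (SLO-based alerting), anomaly localization, root-cause investigation, and recovery, with the current focus of interest marked]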
  13. 2. Perspectives on Approaching Anomalies in Distributed Applications

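  14. Symptoms and causes
    - Symptom: what is broken
      - An SLI expresses a symptom.
    - Cause: why it is broken
      - During investigation, look for information closely related to the anomaly: metrics, logs, configuration files, source code, and so on.
    Excerpted from *4, Table 6-1. Example symptoms and causes
    | Symptom | Cause |
    |---|---|
    | HTTP 500 or 400 responses are being returned | The database server is refusing connections |
    | Responses are slow | The CPU is overloaded by a bogosort, or an Ethernet cable is pinched under a rack, for example |
    *4 Betsy Beyer et al., Site Reliability Engineering: How Google Runs Production Systems, O'Reilly Media, Inc. 2016.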
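  15. Relativity of symptoms and causes
    - Between a symptom and its root cause there may be several superficial causes.
    | Level | Symptom | Cause |
    |---|---|---|
    | 1 | HTTP 500 or 400 responses are being returned | The database server is refusing connections |
    | 2 | The database server is refusing connections | The database server's disk is full |
    | 3 | The database server's disk is full | The query log file size is growing rapidly |
    | 4 | The query log file size is growing rapidly | … |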
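
One way to picture the table above is as a chain in which each cause becomes the next symptom. The sketch below is illustrative only: the mapping `CAUSE_OF` and the function `trace_causes` are hypothetical names, and the chain ends where the slide leaves the next cause unspecified.

```python
from typing import Dict, List

HTTP_ERRORS = "HTTP 500 or 400 responses are being returned"
DB_REFUSING = "the database server is refusing connections"
DISK_FULL = "the database server's disk is full"
QUERY_LOG = "the query log file size is growing rapidly"

# Each symptom maps to its immediate (possibly superficial) cause.
CAUSE_OF: Dict[str, str] = {
    HTTP_ERRORS: DB_REFUSING,
    DB_REFUSING: DISK_FULL,
    DISK_FULL: QUERY_LOG,
}


def trace_causes(symptom: str, cause_of: Dict[str, str]) -> List[str]:
    """Follow symptom -> cause links until no further cause is known."""
    chain = [symptom]
    seen = {symptom}
    while chain[-1] in cause_of:
        next_cause = cause_of[chain[-1]]
        if next_cause in seen:  # guard against accidental cycles
            break
        chain.append(next_cause)
        seen.add(next_cause)
    return chain


chain = trace_causes(HTTP_ERRORS, CAUSE_OF)
for level, item in enumerate(chain, start=1):
    print(level, item)
```

The last element is only the deepest cause known so far; whether it is the root cause depends on how far the investigation has gone.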
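  16. Distance between symptoms and causes
    | Distance | Description |
    |---|---|
    | Temporal distance | The difference between the time the symptom was found and the time the cause occurred. With SLO-based alerting, the temporal distance becomes large when the budget is consumed slowly. |
    | Spatial distance | The distance between the location that reports the symptom and the location where the cause is occurring |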

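  17. Causal propagation models
    | Model | Description |
    |---|---|
    | Network communication model | Causality propagates through network communication at each layer: the application layer, the transport layer, and the network layer. |
    | Resource co-location model | Causality propagates through a resource when several different processes share the same resource. This occurs when hosts are co-located through server virtualization or container virtualization. |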
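
The two models suggest where causal links may exist before any statistics are computed. The following sketch enumerates candidate component pairs from both models; the function name, the component names, and the host placement are all hypothetical, and the pairs are left undirected because the slide does not fix a direction.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def candidate_pairs(calls: Iterable[Tuple[str, str]],
                    host_of: Dict[str, str]) -> Dict[FrozenSet[str], Set[str]]:
    """Return component pairs through which an anomaly might propagate.

    Each pair is labelled with the model(s) that suggest it:
    "network" for communicating components, "colocation" for components
    that share a host (and therefore its resources).
    """
    pairs: Dict[FrozenSet[str], Set[str]] = {}
    for caller, callee in calls:
        pairs.setdefault(frozenset((caller, callee)), set()).add("network")

    by_host: Dict[str, list] = {}
    for component, host in host_of.items():
        by_host.setdefault(host, []).append(component)
    for components in by_host.values():
        for a, b in combinations(sorted(components), 2):
            pairs.setdefault(frozenset((a, b)), set()).add("colocation")
    return pairs


# Hypothetical topology: web -> app -> db, with a batch job sharing the db host.
calls = [("web", "app"), ("app", "db")]
host_of = {"web": "host1", "app": "host2", "db": "host3", "batch": "host3"}
pairs = candidate_pairs(calls, host_of)
for pair, models in pairs.items():
    print(sorted(pair), sorted(models))
```

Here the batch job never talks to the database over the network, yet the co-location model still flags the pair as a possible propagation path.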
  18. Focusing on statistical causal discovery
    - Based on the causal propagation models, build a causal graph from which spurious correlations have been excluded.
    - In the last few years, several related studies *5, *6, *7 have been reported in the context of microservices.
    - Look for unsolved problems in light of each perspective (system type, data sources that indicate the cause, relativity, distance, and causal propagation model) and of requirements such as execution speed, resource consumption, and accuracy.
    *5 Ma, Meng, et al., AutoMAP: Diagnose Your Microservice-based Web Applications Automatically, Web Conference, pp. 246-258, 2020.
    *6 Lin, JinJin, Chen, Pengfei, Zheng, Zibin, Microscope: Pinpoint performance issues with causal graphs in micro-service environments, International Conference on Service-Oriented Computing, pp. 3-20, 2018.
    *7 Qiu, Juan, et al., A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications, Applied Sciences, 10.6: 2166, 2020.
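
To make "excluding spurious correlations" concrete, here is a minimal sketch based on first-order partial correlation. It is not the method of the talk or of the cited papers, which may use different tests; the metric names and values are made up so that two metrics are both driven by a third.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return sxy / sqrt(sxx * syy)


def partial_corr(xs: Sequence[float], ys: Sequence[float],
                 zs: Sequence[float]) -> float:
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical metrics: request rate drives both web latency and DB CPU.
request_rate = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
web_latency = [0.5, 0.5, 2.5, 2.5, 4.5, 4.5, 6.5, 6.5]
db_cpu = [0.5, 1.5, 1.5, 2.5, 4.5, 5.5, 5.5, 6.5]

raw = pearson(web_latency, db_cpu)
conditioned = partial_corr(web_latency, db_cpu, request_rate)
print(f"raw={raw:.2f} conditioned on request rate={conditioned:.2f}")
```

The raw correlation between the two metrics is high, but it shrinks toward zero once the shared driver is conditioned on, which is the kind of edge a causal graph construction would drop.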
  19. 3. Summary and Future Plans

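  20. Summary
    - Introduced the idea of approaching system anomalies with SLIs and SLOs as the axis.
    - To make up for a lack of knowledge about statistical causal discovery, the work will proceed jointly with Tsurube-san (@tsurubee3) of Sakura Internet.
    - If it comes together well, the aim is to submit to IOTS2020 (deadline in early September).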