A Survey of Research on Finding the Causes of Anomalies Based on Service Level Objectives (SLOs) / SLO-based Causality Discovery in Distributed Applications

Yuuki Tsubouchi (yuuk1)

July 07, 2020

Transcript

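  1. Traditional operations management: making sure the system never goes down

    - Traditional operations management aims to reduce system errors to zero.
    - The most common cause of system failures is configuration error by administrators; hardware faults account for only 10 to 25% of failures [*1].
    - Administrators therefore try to avoid errors by not making changes to the system.
    - As a result, hardware failure rates rise, new service features are delayed, or software vulnerabilities are left unpatched.

    *1 David Oppenheimer, Archana Ganapathi, and David A. Patterson: "Why Do Internet Services Fail, and What Can Be Done About It?", USENIX Symposium on Internet Technologies and Systems (USITS), 2003.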
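  2. What is reliability?

    Reliability is the probability that a system performs its required function, under stated conditions and for a stated period of time, without failure [*2].

    - Service Level Indicator (SLI)
    - Service Level Objective (SLO)

    Do not aim for 100% reliability. Many service providers operating in the cloud measure reliability quantitatively with SLIs and SLOs and use the results for decision-making. Once a reliability indicator and its target value are set, the service provider can actively change the system as long as the indicator does not fall below the target during the measurement period.

    Note: an SLA (Service Level Agreement) is a business contract and includes terms such as compensation for user dissatisfaction.

    *2 P. O'Connor, A. Kleyner. Practical Reliability Engineering, 5th edition, Wiley, 2012.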
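  3. Types of systems

    - Request-driven: a system in which the user creates some type of event and expects a response.
    - Pipeline: a system that takes records as input, transforms them, and outputs them somewhere else.
    - Storage: a system that accepts data (bytes, records, files, and so on) and makes it available to be retrieved later.

    Example service: Figure 2-1, "Architecture for an example mobile phone game", quoted from [*3].

    *3 Betsy Beyer et al., The Site Reliability Workbook: Practical Ways to Implement SRE, O'Reilly Media, Inc., 2018.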
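  4. SLI definition and example SLI types

    | Type of system | Type of SLI | Description |
    |---|---|---|
    | Request-driven | Availability | Proportion of requests that resulted in a successful response |
    | Request-driven | Latency | Proportion of requests that were processed faster than a threshold |
    | Pipeline | Data freshness | Proportion of data that was updated more recently than a time threshold |
    | Pipeline | Data correctness | Proportion of input records to the pipeline that resulted in a correct output value |
    | Storage | Durability | Proportion of written records that can be read back correctly |

    SLI = (good events / valid events) × 100

    Excerpted from Table 2-1, "Potential SLIs for different types of components", in [*3].

    *3 Betsy Beyer et al., The Site Reliability Workbook: Practical Ways to Implement SRE, O'Reilly Media, Inc., 2018.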
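As a minimal sketch of the SLI formula above: the function names `sli_percent` and `meets_slo` and the request counts are hypothetical, chosen only for illustration. The 97% target is the availability SLO that appears in the deck's own example.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events / valid events x 100."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return 100.0 * good_events / valid_events


def meets_slo(sli: float, slo_target_percent: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target_percent


# Hypothetical counts: 9,850 successful responses out of 10,000 valid requests.
availability = sli_percent(good_events=9_850, valid_events=10_000)
print(f"availability SLI = {availability:.2f}%")
print("meets 97% SLO:", meets_slo(availability, 97.0))
```

Counting only valid events in the denominator lets an operator exclude traffic that should not count against the service, which is the reason the formula uses "valid" rather than "total" events.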
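  5. Example SLI and SLO settings

    | Category | SLI | SLO |
    |---|---|---|
    | Availability | Proportion of successful requests, as measured from load balancer metrics | 97% success rate |
    | Latency | Proportion of sufficiently fast requests, as measured from load balancer metrics | 90% of requests in 400 ms or less; 99% of requests in 850 ms or less |
    | Data freshness | Proportion of records read from the league table that were updated recently, where "recently" is defined as within 1 minute or within 10 minutes | 90% of reads use data written within the previous 1 minute |

    Excerpted from Appendix A, "Example of SLO Document", in [*3].

    *3 Betsy Beyer et al., The Site Reliability Workbook: Practical Ways to Implement SRE, O'Reilly Media, Inc., 2018.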
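  6. SLO-based alerting

    Define the error budget as: error budget = 1 - SLO target.

    When the error budget is being consumed quickly, a rapid response is required; when it is being consumed slowly, the response can be handled on a scale of days.

    [Two plots: "Slow budget consumption" and "Fast budget consumption". Vertical axis: remaining budget (%), 0 to 100. Horizontal axis: days, 0 to 10.]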
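The error-budget definition above can be sketched in a few lines. This is an illustrative sketch, not the deck's alerting method: the burn-rate calculation (observed error rate divided by the error budget), the `fast_threshold` default of 10.0, and all function names are assumptions introduced here.

```python
def error_budget(slo_target: float) -> float:
    """Error budget = 1 - SLO target, with the target given as a fraction (0.97 = 97%)."""
    if not 0.0 <= slo_target < 1.0:
        raise ValueError("slo_target must be in [0, 1)")
    return 1.0 - slo_target


def budget_remaining_percent(slo_target: float, good: int, valid: int) -> float:
    """Percentage of the error budget still unspent for the events seen so far."""
    allowed_bad = error_budget(slo_target) * valid
    bad = valid - good
    return max(0.0, 100.0 * (1.0 - bad / allowed_bad))


def burn_rate(slo_target: float, observed_error_rate: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return observed_error_rate / error_budget(slo_target)


def response_urgency(rate: float, fast_threshold: float = 10.0) -> str:
    """Hypothetical policy: fast burn needs a rapid response, slow burn can wait days."""
    return "respond immediately" if rate >= fast_threshold else "handle within days"


# Hypothetical numbers: a 97% SLO and a 6% observed error rate.
rate = burn_rate(0.97, 0.06)
print(f"burn rate = {rate:.1f}x ->", response_urgency(rate))
```

A burn rate of 1 means the budget would be exhausted exactly at the end of the measurement period; higher values correspond to the steeper curve in the fast-consumption plot.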
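  7. Symptoms and causes

    - Symptom: what is broken. An SLI expresses a symptom.
    - Cause: why it is broken. When investigating, look for information closely related to the anomaly: metrics, logs, configuration files, source code, and so on.

    | Symptom | Cause |
    |---|---|
    | HTTP 500s or 400s are being returned | The database server is refusing connections |
    | Responses are slow | The CPU is overloaded by a bogosort, or an Ethernet cable is crimped under a rack, and so on |

    Excerpted from Table 6-1, "Example symptoms and causes", in [*4].

    *4 Betsy Beyer et al., Site Reliability Engineering: How Google Runs Production Systems, O'Reilly Media, Inc., 2016.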
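  8. The relativity of symptoms and causes

    - Between a symptom and its root cause there may be several surface-level causes: the cause at one level becomes the symptom at the next.

    | Level | Symptom | Cause |
    |---|---|---|
    | 1 | HTTP 500s or 400s are being returned | The database server is refusing connections |
    | 2 | The database server is refusing connections | The database server's disk is full |
    | 3 | The database server's disk is full | The query log file size is growing rapidly |
    | 4 | The query log file size is growing rapidly | … |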
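The level table above is a chain in which each cause is looked up again as a symptom. A minimal sketch follows, with the table encoded as a plain dictionary; the names `CAUSE_OF` and `trace_causes` are hypothetical, and the level-4 cause is left out because the slide elides it.

```python
# Symptom -> cause pairs taken from levels 1 to 3 of the table.
CAUSE_OF = {
    "HTTP 500s or 400s are being returned": "The database server is refusing connections",
    "The database server is refusing connections": "The database server's disk is full",
    "The database server's disk is full": "The query log file size is growing rapidly",
}


def trace_causes(symptom: str, cause_of: dict[str, str]) -> list[str]:
    """Follow symptom -> cause links until no deeper cause is known."""
    chain = [symptom]
    seen = {symptom}
    while chain[-1] in cause_of:
        deeper = cause_of[chain[-1]]
        if deeper in seen:  # guard against accidental cycles in the mapping
            break
        chain.append(deeper)
        seen.add(deeper)
    return chain


for level, item in enumerate(trace_causes("HTTP 500s or 400s are being returned", CAUSE_OF), start=1):
    print(level, item)
```

The last element of the chain is only the deepest cause known so far, not necessarily the root cause, which matches the open-ended "…" in the table.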
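  9. Focusing on statistical causal discovery

    - Build a causal graph, based on a causal propagation model, from which spurious correlations have been excluded.
    - In the past few years, several related studies [*5, *6, *7] have been reported in the context of microservices.
    - Look for unsolved problems by considering the type of system, the data sources that indicate the cause, relativity, distance, and the causal propagation model, together with requirements such as execution speed, resource consumption, and accuracy.

    *5 Ma, Meng, et al., AutoMAP: Diagnose Your Microservice-based Web Applications Automatically, Web Conference, pp. 246-258, 2020.
    *6 Lin, JinJin, Chen, Pengfei, Zheng, Zibin, Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments, International Conference on Service-Oriented Computing, pp. 3-20, 2018.
    *7 Qiu, Juan, et al., A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications, Applied Sciences, 10(6): 2166, 2020.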
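One way to see what "excluding spurious correlations" means is a conditional independence check using partial correlation. The sketch below is not the method of any of the cited papers; it is a generic illustration on synthetic data. The metric names, the noise levels, and the 0.15 threshold are all assumptions.

```python
import math
import random


def pearson(xs: list[float], ys: list[float]) -> float:
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def partial_corr(xs: list[float], ys: list[float], zs: list[float]) -> float:
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


def is_spurious(xs, ys, zs, threshold: float = 0.15) -> bool:
    """Hypothetical rule: drop the x-y edge if it vanishes once z is controlled for."""
    return abs(partial_corr(xs, ys, zs)) < threshold


# Synthetic metrics: disk I/O drives both latencies, which do not affect each other.
rng = random.Random(0)
disk_io = [rng.gauss(0, 1) for _ in range(2000)]
db_latency = [z + rng.gauss(0, 0.5) for z in disk_io]
api_latency = [z + rng.gauss(0, 0.5) for z in disk_io]

print("raw correlation:", round(pearson(db_latency, api_latency), 3))
print("partial correlation given disk_io:", round(partial_corr(db_latency, api_latency, disk_io), 3))
print("edge db_latency - api_latency is spurious:", is_spurious(db_latency, api_latency, disk_io))
```

Here the two latency metrics are strongly correlated only because they share a common driver, so a causal graph built from raw correlations would contain an edge that the conditional check removes. Partial correlation captures only linear dependence, which is a limitation of this sketch.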