A Survey of Research on Discovering the Causes of Anomalies, Centered on Service Level Objectives (SLOs) / SLO-based Causality Discovery in Distributed Applications

Yuuki Tsubouchi (yuuk1)

July 07, 2020

Transcript

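  1. A Survey of Research on Discovering the Causes of Anomalies,
    Centered on Service Level Objectives (SLOs)
    Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University
    Yuuki Tsubouchi, D1 (first-year doctoral student)
    Okabe-Miyazaki Laboratory research meeting
    July 7, 2020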

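  2. Review of the previous session
    A fast causal discovery method for anomalies in distributed applications
    ・Background: As distributed applications grow more complex, identifying the cause of an anomaly has become difficult.
    ・Problem: Existing methods build a causal graph from time-series metrics when the service level degrades. However, the administrator must select in advance which kinds of metrics to search.
    ・Proposal: Assume that all metric series are searched, and make the search fast by means of some technique (left as ○○ on the slide).
      ・GPGPU, reuse of past search results, use of the system change history, detection and exclusion of replicated hosts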

  3. Contents
    1. Service level indicators and service level objectives
    2. Causality of anomalies in distributed applications
    3. Summary and future plans

  4. 1.
    Service level indicators
    Service level objectives

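  5. Conventional operations management: never let the system go down
    ・Conventional operations management aims to reduce system errors to zero.
    ・The most common cause of system failures is configuration errors made by administrators; hardware faults account for only 10 to 25% *1.
    ・Administrators try to prevent errors by not making changes to the system.
    ・As a result, hardware failure rates rise, new service features are delayed, or software vulnerabilities are left in place.
    *1 David Oppenheimer, Archana Ganapathi, and David A. Patterson, "Why Do Internet Services Fail, and What Can Be Done About It?", USENIX Symposium on Internet Technologies and Systems (USITS), 2003.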

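  6. Do not aim for 100% reliability
    Reliability: the probability that a system performs its required function without failure, under stated conditions, for a stated period of time *2
    ・Service Level Indicator (SLI)
    ・Service Level Objective (SLO)
    Many service providers operating in the cloud measure reliability quantitatively with SLIs and SLOs and use the results in decision making.
    Once a reliability indicator and its target value are decided, the service provider can actively change the system as long as the indicator does not fall below the target during the measurement period.
    Note: An SLA (Service Level Agreement) is a business contract and includes items such as compensation for user dissatisfaction.
    *2 P. O'Connor, A. Kleyner, Practical Reliability Engineering, 5th edition, Wiley, 2012.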

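  7. Types of systems
    ・Request-driven: a system in which the user creates some kind of event and expects a response to be returned
    ・Pipeline: a system that takes records as input, transforms them, and outputs them somewhere else
    ・Storage: a system that accepts data (byte sequences, records, files, etc.) and keeps it so that it can be retrieved later
    Example service: quoted from *3, Figure 2-1, "Architecture for an example mobile phone game"
    *3 Betsy Beyer et al., The Site Reliability Workbook: Practical Ways to Implement SRE, O'Reilly Media, Inc., 2018.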

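  8. Definition of the SLI and example SLI types
    System type | SLI type | Description
    Request-driven | Availability | Proportion of requests that received a successful response
    Request-driven | Latency | Proportion of requests processed faster than a threshold
    Pipeline | Data freshness | Proportion of data updated more recently than a time threshold
    Pipeline | Data correctness | Proportion of input records to the pipeline that led to a correct output value
    Storage | Durability | Proportion of written records that can be read back correctly
    SLI = (good events / valid events) × 100
    Excerpted from *3, Table 2-1, "Potential SLIs for different types of components"
    *3 Betsy Beyer et al., The Site Reliability Workbook: Practical Ways to Implement SRE, O'Reilly Media, Inc., 2018.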
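
The SLI definition on this slide is a simple ratio, so it can be sketched in a few lines. The function name `sli_percent` and the request counts below are assumptions made for illustration; they do not come from the slides.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events / valid events x 100."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events / valid_events * 100


# Hypothetical counts for a request-driven service over one measurement window.
availability = sli_percent(good_events=9_970, valid_events=10_000)
print(f"availability SLI: {availability:.2f}%")
```

Counting only valid events in the denominator keeps requests that should not count against the service (for example, malformed ones) out of the indicator.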

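  9. Example SLI and SLO settings
    Category | SLI | SLO
    Availability | Proportion of successful requests, measured from load balancer metrics | 97% success rate
    Latency | Proportion of sufficiently fast requests, measured from load balancer metrics | 90% of requests in 400 ms or less; 99% of requests in 850 ms or less
    Data freshness | Proportion of records read from the league table that were updated recently, where "recently" is defined as within 1 minute or within 10 minutes | 90% of reads use data written within the last 1 minute
    Excerpted from *3, Appendix A: Example SLO Document
    *3 Betsy Beyer et al., The Site Reliability Workbook: Practical Ways to Implement SRE, O'Reilly Media, Inc., 2018.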
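
As a rough illustration of how the latency SLOs in this table could be checked, the sketch below evaluates a list of request latencies against the two targets (90% within 400 ms, 99% within 850 ms). The helper names and the sample latencies are hypothetical; in practice the values would come from load balancer metrics, as the slide states.

```python
# (threshold in ms, required fraction of requests) taken from the slide's latency SLOs.
LATENCY_SLOS = [(400, 0.90), (850, 0.99)]


def fraction_within(latencies_ms, threshold_ms):
    """Fraction of requests that completed within the threshold (the latency SLI)."""
    return sum(1 for v in latencies_ms if v <= threshold_ms) / len(latencies_ms)


def check_latency_slos(latencies_ms):
    """Map each threshold to (measured SLI, whether the SLO target is met)."""
    results = {}
    for threshold_ms, target in LATENCY_SLOS:
        sli = fraction_within(latencies_ms, threshold_ms)
        results[threshold_ms] = (sli, sli >= target)
    return results


# Hypothetical sample of 100 requests: 92 fast, 6 moderate, 2 slow.
sample = [120] * 92 + [600] * 6 + [900] * 2
report = check_latency_slos(sample)
for threshold_ms, (sli, ok) in report.items():
    print(f"<= {threshold_ms} ms: {sli:.0%} ({'met' if ok else 'violated'})")
```

With this sample the 400 ms objective is met while the 850 ms objective is violated, which shows why the two thresholds are tracked separately.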

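  10. How alerting changes as SLIs and SLOs take hold
    ・Conventionally, the following kinds of monitoring were configured for each individual component:
      ・Check monitoring (binary values such as a PING response, or whether a port or process is alive)
      ・Resource monitoring (thresholds set on the consumption of resources such as memory)
    ・A component exceeding its resource-usage threshold does not necessarily mean that a serious event requiring action is occurring.
    ・To reduce noise, alert on degradation of the SLI instead.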

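  11. SLO-based alerting
    Error budget = 1 - SLO target
    When the error budget is being consumed rapidly, a prompt response is required; when it is being consumed slowly, it can be handled on a scale of days.
    [Figure: two plots of remaining budget (%, 0 to 100) against days (0 to 10), one for slow budget consumption and one for rapid budget consumption]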
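
The error budget definition above can be turned into a small calculation of how quickly the budget runs out. This is only a sketch: the notion of a "burn rate" (observed error rate divided by the error budget) and the 10-day window are assumptions added here, chosen to match the 0 to 10 day axis of the plots, and the error rates are made up.

```python
def error_budget(slo_target: float) -> float:
    """Error budget as defined on the slide: 1 - SLO target."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Speed of budget consumption; 1.0 uses the budget up exactly over the window."""
    return error_rate / error_budget(slo_target)


def days_until_exhausted(error_rate: float, slo_target: float, window_days: float) -> float:
    """Days until the budget is gone if the current error rate persists."""
    rate = burn_rate(error_rate, slo_target)
    return float("inf") if rate == 0 else window_days / rate


SLO_TARGET = 0.97   # the 97% availability example from the SLO table
WINDOW_DAYS = 10    # assumed measurement window

slow = days_until_exhausted(0.03, SLO_TARGET, WINDOW_DAYS)  # slow consumption
fast = days_until_exhausted(0.30, SLO_TARGET, WINDOW_DAYS)  # rapid consumption
print(f"slow: about {slow:.1f} days left, fast: about {fast:.1f} days left")
```

Under these assumptions the slow case leaves about ten days before the budget is exhausted, so it can be handled on a scale of days, while the rapid case leaves about one day and calls for a prompt response.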

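  12. An approach to system anomalies centered on SLIs and SLOs
    ・Prediction: find behavior similar to the behavior seen just before the SLI degraded in the past
    ・Localizing the anomaly: search for the anomalous location when the SLO is violated
    ・Root-cause investigation: (under construction)
    ・Recovery: feedback control that targets the SLI
    ・SLI substitutes: combine existing metrics to create a proxy indicator for the SLI
    [Figure: stages labeled Prediction, Localizing the anomaly, Root-cause investigation, and Recovery, together with the labels "Detection", "SLO-based alerting", and "Current focus of interest"]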

  13. 2.
    Perspectives on approaches to
    anomalies in distributed applications

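  14. Symptoms and causes
    ・Symptom: what is broken?
      ・An SLI expresses a symptom.
    ・Cause: why is it broken?
      ・During investigation, look for information closely related to the anomaly: metrics, logs, configuration files, source code, and so on.
    Symptom | Cause
    HTTP 500 or 400 responses are being returned | The database server is refusing connections
    Responses are slow | The CPU is overloaded by a bogosort, or an Ethernet cable is pinched under a rack, etc.
    Excerpted from *4, Table 6-1, "Example symptoms and causes"
    *4 Betsy Beyer et al., Site Reliability Engineering: How Google Runs Production Systems, O'Reilly Media, Inc., 2016.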

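  15. The relativity of symptoms and causes
    ・Between a symptom and its root cause there may be several superficial causes.
    Level | Symptom | Cause
    1 | HTTP 500 or 400 responses are being returned | The database server is refusing connections
    2 | The database server is refusing connections | The database server's disk is full
    3 | The database server's disk is full | The query log file is growing rapidly
    4 | The query log file is growing rapidly | …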
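
The table above reads as a chain in which each cause becomes the symptom of the next level. A minimal sketch of following such a chain is shown below; the dictionary `CAUSE_OF` and the function `trace_causes` are hypothetical names, and the chain stops where the slide's own table trails off.

```python
# Each symptom maps to its immediate (possibly superficial) cause, as in the table.
CAUSE_OF = {
    "HTTP 500 or 400 responses are being returned":
        "The database server is refusing connections",
    "The database server is refusing connections":
        "The database server's disk is full",
    "The database server's disk is full":
        "The query log file is growing rapidly",
    # The slide leaves the cause at level 4 open, so the chain ends here.
}


def trace_causes(symptom, cause_of):
    """Follow symptom -> cause links until no further cause is known."""
    chain = [symptom]
    seen = {symptom}
    while chain[-1] in cause_of:
        nxt = cause_of[chain[-1]]
        if nxt in seen:  # guard against accidental cycles in the mapping
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain


chain = trace_causes("HTTP 500 or 400 responses are being returned", CAUSE_OF)
for level, item in enumerate(chain, start=1):
    print(level, item)
```

The last element is only the deepest cause known so far; whether it is the root cause depends on what lies beyond level 4.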

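  16. The distance between symptoms and causes
    Temporal distance: the difference between the time the symptom becomes known and the time the cause occurred. With SLO-based alerting, the temporal distance becomes large when the budget is consumed slowly.
    Spatial distance: the distance between the location that reports the symptom and the location where the cause occurs.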

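  17. Causal propagation models
    Network communication model: causality propagates through network communication at each of the application, transport, and network layers.
    Resource co-location model: causality propagates via a resource when multiple different processes share the same resource. This occurs when server virtualization or container virtualization places them on the same host.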
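
One way to read the two propagation models is as two sources of candidate links along which an anomaly might spread. The sketch below builds such a set of links from a list of network calls and a process-to-host placement. All process names, host names, and the function `candidate_links` are hypothetical; this is not a method described in the slides.

```python
from itertools import combinations


def candidate_links(calls, placement):
    """Collect undirected candidate links from the two propagation models.

    calls: iterable of (caller, callee) pairs that communicate over the network.
    placement: dict mapping each process to the host it runs on.
    """
    links = set()
    # Network communication model: processes that talk to each other.
    for caller, callee in calls:
        links.add((frozenset((caller, callee)), "network"))
    # Resource co-location model: processes that share the same host.
    by_host = {}
    for process, host in placement.items():
        by_host.setdefault(host, []).append(process)
    for processes in by_host.values():
        for a, b in combinations(sorted(processes), 2):
            links.add((frozenset((a, b)), "co-location"))
    return links


calls = [("web", "api"), ("api", "db")]
placement = {"web": "host1", "api": "host2", "db": "host3", "batch": "host3"}
links = candidate_links(calls, placement)
for pair, model in sorted((sorted(p), m) for p, m in links):
    print(pair, model)
```

Here `batch` never communicates with `db`, yet the two are linked because they share `host3`, which is the kind of path the resource co-location model is meant to capture.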

  18. Focusing on statistical causal discovery
    ・Build a causal graph, based on the causal propagation models, from which spurious correlations have been excluded.
    ・In the last few years, several related studies *5, *6, *7 have been reported in the context of microservices.
    ・Look for open problems in light of each perspective (system type, data sources that indicate the cause, relativity, distance, and causal propagation model) and of requirements such as execution speed, resource consumption, and accuracy.
    *5 Meng Ma et al., "AutoMAP: Diagnose Your Microservice-based Web Applications Automatically", Web Conference, pp. 246-258, 2020.
    *6 JinJin Lin, Pengfei Chen, Zibin Zheng, "Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments", International Conference on Service-Oriented Computing, pp. 3-20, 2018.
    *7 Juan Qiu et al., "A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications", Applied Sciences, 10(6): 2166, 2020.
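
To make "excluding spurious correlation" concrete, the sketch below uses partial correlation, a standard conditional independence check. It is an illustration only and is not claimed to be the procedure of the cited studies. The three metric series are synthetic and hypothetical: `x` and `y` are both driven by a common metric `z` and are otherwise unrelated.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic series: z is a shared driver (for example, request load);
# x and y each add their own independent variation on top of 2 * z.
z = [1, 1, 1, 1, -1, -1, -1, -1]
x = [3, 3, 1, 1, -1, -1, -3, -3]
y = [3, 1, 3, 1, -1, -3, -1, -3]

r_xy = pearson(x, y)
r_xy_given_z = partial_corr(x, y, z)
print(f"corr(x, y) = {r_xy:.3f}, corr(x, y | z) = {r_xy_given_z:.3f}")
```

The plain correlation between `x` and `y` is high, but it vanishes once `z` is controlled for, so an edge between `x` and `y` would be dropped from the causal graph as spurious.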

  19. 3.
    Summary and future plans

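  20. Summary
    ・Introduced the idea of approaching system anomalies with SLIs and SLOs at the center.
    ・To make up for a lack of knowledge about statistical causal discovery, the plan is to proceed jointly with Tsurubee (@tsurubee3) of SAKURA internet.
    ・If the work comes together well, the aim is to submit to IOTS2020 (deadline in early September).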
