Toward a Method for Immediately Diagnosing the Causes of Anomalies in Distributed Applications / Causality Tracing in Distributed Applications

Yuuki Tsubouchi (yuuk1)

Transcript

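  1. Toward a Method for Immediately Diagnosing the Causes of Anomalies in Distributed Applications
    Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University
    Yuuki Tsubouchi, D1 (first-year doctoral student)
    Okabe-Miyazaki Laboratory research meeting
    May 7, 2020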

  2. Contents
    1. Background and objectives
    2. Related work
    3. Proposed method
    4. Planned experiments
    5. Summary and future plans

  3. 1. Background and objectives

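  4. The difficulty of identifying the causes of anomalies in distributed applications
    ・Distributed architectures that combine small services, rather than building one huge application, are on the rise
    ・Because many components communicate with one another to operate, identifying the cause of an anomaly becomes difficult
      ・Complex network dependencies
      ・Dynamic changes in dependencies
      ・A large number of metrics to monitor for each component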

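  5. Requirements for identifying the cause of an anomaly
    ・Low noise: system administrators should be notified only of alerts for symptoms that adversely affect users
      ・Even 100% CPU utilization does not necessarily affect users
      ・As noise grows, the administrators' cognitive load rises, which leads to overlooked alerts
    ・Immediacy: once an anomaly is detected, the cause should be identified immediately
      ・Immediacy should be maintained even as the number of components and metrics grows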

  6. Issues with prior work
    ・Root cause analysis by per-request tracing [1]
      ・Instrumentation code must be added to the application
    ・Root cause analysis using the dependency graph between components and metrics [2][3]
      ・Whenever the dependencies or the workload change, the dependency extraction must be redone => immediacy is not satisfied
      ・Alternatively, an indicator representing the service level (SLO) must be designed and monitored for every component, which takes effort
    [1] H. Jayathilaka, C. Krintz, and R. Wolski, "Performance Monitoring and Root Cause Analysis for Cloud-hosted Web Applications," WWW, pp. 469-478, 2017.
    [2] J. Thalheim, A. Rodrigues, I. Akkus, et al., "Sieve: Actionable Insights from Monitored Metrics in Distributed Systems," ACM/IFIP/USENIX Middleware, pp. 14-27, 2017.
    [3] J. Lin, P. Chen, and Z. Zheng, "Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments," ICSOC, pp. 3-20, 2018.

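  7. Research objectives
    1. Low noise, and less effort spent designing indicators for each component
      ↪ Monitor only the indicators that represent the service (response time, response error rate, etc.)
        On a monitoring alert, explore the TCP connection dependency graph and the metrics on each node
        Treat metrics whose time-series features correlate with the representative indicator as candidate causes
    2. Immediacy
      Analyzing every metric on every node would make processing slow
      ↪ Speed up metric retrieval by keeping the metrics distributed across the nodes and querying them there
      ↪ Speed up the time-series correlation analysis with GPGPU
      ↪ Reuse past processing results for fast decisions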

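  8. Expected contributions
    ・As long as system administrators collect metrics on each component in advance and set monitoring alerts on only a small number of representative indicators, the location of the cause can be inferred immediately when an alert fires
    ・The proposed method achieves 〇〇% accuracy on anomaly scenarios that occurred in a real environment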

  9. 2. Related work

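  10. Related work ①: Per-request tracing
    ・Assign an identifier to each request at the application layer, embed it in the follow-on requests, and propagate it to the downstream processes
    ・Using the identifier, the latency between any pair of components can be measured
    ・Advantage: detailed tracing inside the application is possible
    ・Issue: the application code must be modified for instrumentation
    B. H. Sigelman, et al., "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure," Technical report, Google, 2010.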

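The identifier propagation described on slide 10 can be sketched roughly as follows. This is a minimal illustration and not Dapper's API: the header name `X-Trace-Id`, the `handle` helper, and the three components (`frontend`, `backend`, `database`) are all assumptions made for this example.

```python
import time
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, chosen for this sketch
SPANS = []  # collected (trace_id, component, elapsed_seconds) records


def handle(component, headers, downstream=None):
    """Handle a request in `component`, propagating the trace identifier."""
    # Reuse the incoming identifier, or assign a new one at the entry point.
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    start = time.perf_counter()
    if downstream is not None:
        # Embed the identifier in the follow-on request.
        downstream({TRACE_HEADER: trace_id})
    SPANS.append((trace_id, component, time.perf_counter() - start))
    return trace_id


def database(headers):
    return handle("db", headers)


def backend(headers):
    return handle("backend", headers, downstream=database)


def frontend(headers):
    return handle("frontend", headers, downstream=backend)


trace_id = frontend({})
for tid, component, elapsed in SPANS:
    print(tid[:8], component, f"{elapsed:.6f}s")
```

Every record shares one identifier, so the latency between any two components can be derived by joining on it. Every function on the request path had to be changed to pass the header along, which is the drawback the slide points out.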

  11. Related work ②: Sieve
    ・Proposes a system that derives actionable insights from an existing monitoring infrastructure
    ・Identifies dependencies between components through time-series cluster analysis
    ・Issue: to keep up with configuration changes and workload changes, the analysis steps must be rerun from the beginning
    [2] J. Thalheim, A. Rodrigues, I. Akkus, et al., "Sieve: Actionable Insights from Monitored Metrics in Distributed Systems," ACM/IFIP/USENIX Middleware, pp. 14-27, 2017.

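  12. Related work ③: Microscope
    ・Efficiently builds a causality graph between components without adding instrumentation code to the microservices, and infers the cause of an anomaly in real time
    ・Issue 1: it narrows down candidates by looking at fluctuations in each component's representative metric, but does not output other metrics that may be the cause
    ・Issue 2: application knowledge is required to configure the representative metric of each component
    [3] J. Lin, P. Chen, and Z. Zheng, "Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments," ICSOC, pp. 3-20, 2018.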

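  13. Summary of issues in related work
    ・Related work ③, Microscope, sets monitoring alerts only on the representative indicator (SLO) of the service as a whole, so it satisfies the low-noise requirement
    ・Microscope satisfies immediacy by parallelizing the construction of the causal graph
    ・However, representative indicators must be designed and collected as time-series data not only for the service as a whole but also for each component
      ・The more components there are, the higher the cost of designing the indicators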

  14. 3. Proposed method

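  15. Assumptions of the proposed method
    ・Build on the causal inference method of related work ③, Microscope
    ・Define indicators only for the service as a whole, and set monitoring alerts on them
    ・All metrics on each host
    ・Build the causality graph of an anomaly using a complete network dependency graph [4] whose nodes are processes and whose edges are TCP connections
    [4] 坪内佑樹, 古川雅大, 松本亮介, "Transtracer: 分散システムにおけるTCP/UDP通信の終端点の監視によるプロセス間依存関係の自動追跡" [Transtracer: automatic tracking of inter-process dependencies by monitoring the endpoints of TCP/UDP communication in distributed systems], インターネットと運用技術シンポジウム論文集 [Proceedings of the Internet and Operation Technology Symposium], 2019, pp. 64-71 (2019-11-28), December 2019.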

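A dependency graph with processes as nodes and TCP connections as edges, as assumed on slide 15, could be represented along the following lines. This sketch does not use Transtracer or reproduce its data model: the `process@host` labels, the `connections` list, and `build_dependency_graph` are hypothetical names introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical observed TCP connections: (client process, server process).
connections = [
    ("nginx@web1", "app@app1"),
    ("nginx@web1", "app@app2"),
    ("app@app1", "mysqld@db1"),
    ("app@app2", "mysqld@db1"),
    ("app@app1", "mysqld@db1"),  # a repeated connection maps to the same edge
]


def build_dependency_graph(conns):
    """Return adjacency sets: process -> set of processes it connects to."""
    graph = defaultdict(set)
    for client, server in conns:
        graph[client].add(server)
        graph[server]  # touch the key so leaf processes also appear as nodes
    return dict(graph)


graph = build_dependency_graph(connections)
for process, servers in sorted(graph.items()):
    print(process, "->", sorted(servers))
```

Storing the edges as sets keeps the graph stable when the same connection is observed repeatedly, and the adjacency form is what a traversal from the front node needs.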

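  16. Workflow of the proposed method (figure)
    [Figure: nodes labeled Frontend and Component; each node has multiple metrics (reqs/sec, errors/sec, latency, CPU usage). Step 1: detect an anomaly in the representative indicator. Step 2: build the causality graph of the anomaly from the dependency graph. Output: a list of candidate causes as (component, metric, score) tuples.]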

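  17. Workflow of the proposed method
    0. Monitor the representative metrics of the front node and collect the dependency graph
    1. Detect an SLO violation
    2. Run causal inference based on the dependency graph
      ・Get the nodes adjacent to the front node
      ・On each node, test whether it holds a metric that correlates with the time-series features of the representative indicator of the service as a whole
      ・If the threshold is exceeded, add the (node, metric) pair to the list of candidate causes
      ・Keep traversing the dependency graph until every node has been visited
      ・Rank the candidate list using the degree of correlation as the score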

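Step 2 of the workflow on slide 17 can be sketched as a breadth-first traversal that scores each metric against the representative indicator. This is a simplified illustration under stated assumptions, not the proposed method itself: it uses the plain Pearson correlation in place of a statistical test, it also scores the front node's own metrics, and the names `rank_candidates`, `pearson`, the 0.8 threshold, and all sample data are made up for the example.

```python
from collections import deque
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_candidates(graph, metrics, front, indicator, threshold=0.8):
    """Traverse the dependency graph from `front` and rank correlated metrics."""
    candidates = []
    seen = {front}
    queue = deque([front])
    while queue:  # continue until every reachable node has been visited
        node = queue.popleft()
        for name, series in metrics.get(node, {}).items():
            score = abs(pearson(indicator, series))
            if score >= threshold:
                candidates.append((node, name, score))
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    # A higher correlation ranks higher in the candidate list.
    return sorted(candidates, key=lambda c: c[2], reverse=True)


# Hypothetical data: service latency rises together with disk wait on the db.
graph = {"frontend": ["app"], "app": ["db"], "db": []}
latency = [10, 11, 10, 30, 45, 50]
metrics = {
    "frontend": {"cpu": [20, 21, 19, 22, 20, 21]},
    "app": {"threads": [5, 6, 5, 9, 20, 12]},
    "db": {"disk_wait": [1, 1, 1, 8, 12, 13], "mem": [50, 49, 51, 50, 49, 50]},
}
ranked = rank_candidates(graph, metrics, "frontend", latency)
for component, metric, score in ranked:
    print(component, metric, round(score, 3))
```

The output has the same `(component, metric, score)` shape as the candidate list in the workflow figure. Correlation alone does not establish causality, which is why the slides go on to consider causal graph inference and independence tests.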

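  18. Candidate algorithms for causal graph inference and testing
    ・Causal graph inference
      ・PC algorithm: handles only the co-occurrence of events
      ・Bayesian method: handles the temporal order between events in addition to their co-occurrence
      ・If the series of the metrics are time-synchronized, the Bayesian method may be more appropriate
    ・Conditional independence tests for time-series data
      ・G2 test, etc.
    ・Use the correlation coefficient as the ranking score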

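As a rough illustration of the G2 test mentioned on slide 18, the statistic G² = 2 Σ O ln(O / E) can be computed over the contingency table of two discretized series, and summed over the strata of a conditioning variable for the conditional case. The discretization by the mean, the function names, and the sample series are assumptions for this sketch; the slides do not specify how the series would be discretized.

```python
from collections import Counter, defaultdict
from math import log


def g2_statistic(xs, ys):
    """G^2 = 2 * sum(O * ln(O / E)) over the contingency table of xs and ys."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row, col = Counter(xs), Counter(ys)
    g2 = 0.0
    for (x, y), observed in joint.items():  # empty cells contribute nothing
        expected = row[x] * col[y] / n
        g2 += 2.0 * observed * log(observed / expected)
    return g2


def conditional_g2(xs, ys, zs):
    """Sum the G^2 statistic of xs vs ys within each stratum of zs."""
    strata = defaultdict(lambda: ([], []))
    for x, y, z in zip(xs, ys, zs):
        strata[z][0].append(x)
        strata[z][1].append(y)
    return sum(g2_statistic(sx, sy) for sx, sy in strata.values())


def binarize(series):
    """Discretize a series into 0/1 by comparing each point with the mean."""
    mean = sum(series) / len(series)
    return [int(v > mean) for v in series]


# Hypothetical series: latency moves with disk wait but not with memory usage.
latency = binarize([10, 11, 10, 30, 45, 50])
disk_wait = binarize([1, 1, 1, 8, 12, 13])
mem = binarize([50, 49, 51, 50, 49, 50])

# 3.841 is the 5% critical value of the chi-square distribution with 1 degree
# of freedom, which applies to a 2x2 table.
CRITICAL_VALUE = 3.841
dependent = g2_statistic(latency, disk_wait) > CRITICAL_VALUE
independent = g2_statistic(latency, mem) <= CRITICAL_VALUE
```

Under independence the statistic approximately follows a chi-square distribution, but that approximation is only reliable with far more samples than the six points used here.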

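  19. Proposed speed-ups (Just Idea)
    ・Speed up the tests on time-series data
      ・Exploit the data parallelism across series and process them in parallel on a GPU
    ・Speed up metric retrieval
      ・Keep only the most recent data locally on each node and run distributed queries
    ・Based on past inference results, decide whether the anomaly is the same case as a previous one, and if so reuse the past inference result
      ・Assuming anomaly cases are few, deliberately inject anomalies into the real environment so that they are learned in advance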

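The idea of keeping only recent data locally on each node and querying it on demand could look something like the sketch below. It is only one possible reading of the slide: `NodeMetricStore`, `query_all`, and the in-process dictionary standing in for network calls are hypothetical, and the GPU and result-reuse ideas are not covered here.

```python
from collections import deque


class NodeMetricStore:
    """Keeps only the most recent points of each metric on one node."""

    def __init__(self, max_points=600):
        self.max_points = max_points
        self.series = {}

    def append(self, metric, timestamp, value):
        # A bounded deque discards the oldest point once it is full.
        buf = self.series.setdefault(metric, deque(maxlen=self.max_points))
        buf.append((timestamp, value))

    def query(self, since):
        """Return {metric: [values]} for points with timestamp >= since."""
        return {
            metric: [v for t, v in buf if t >= since]
            for metric, buf in self.series.items()
        }


def query_all(stores, since):
    """Stand-in for a distributed query: ask every node for its recent data."""
    return {node: store.query(since) for node, store in stores.items()}


stores = {
    "app1": NodeMetricStore(max_points=3),
    "db1": NodeMetricStore(max_points=3),
}
for t in range(5):
    stores["app1"].append("cpu", t, t * 10)
    stores["db1"].append("disk_wait", t, t)
recent = query_all(stores, since=3)
print(recent)
```

Because each node answers only for its own short window, the amount of data moved at diagnosis time depends on the window length rather than on the total retention of a central time-series database.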

  20. 4. Planned experiments

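  21. Evaluation items
    ・Immediacy
      ・How quickly candidate cause locations can be found after an anomaly is detected
      ・Evaluate how the execution time changes as the number of components and metrics grows
    ・Accuracy of cause inference
      ・Compare with related work ③, Microscope, and see whether there is a difference in accuracy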

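  22. Evaluation baselines
    ・Immediacy
      ・Related work ③, Microscope, takes 12 seconds when the analysis runs on a single physical machine
      ・Outage impact time is generally reported to users in units of minutes, so a recovery delay of up to 60 seconds is acceptable
    ・Precision and recall
      ・Related work ③, Microscope, reports 88% precision and 85% recall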

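For reference, precision and recall over a candidate list can be computed as below. The candidate and ground-truth sets are invented for the example and are unrelated to the Microscope figures quoted on the slide.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) of predicted root causes against the truth."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical (component, metric) pairs reported by a diagnosis run.
predicted = [
    ("db", "disk_wait"),
    ("app", "threads"),
    ("web", "cpu"),
    ("app", "gc_pause"),
]
# Hypothetical ground truth for the same incident.
actual = [("db", "disk_wait"), ("app", "threads"), ("cache", "evictions")]

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four reported pairs are correct (precision 0.5) and two of the three true causes are found (recall about 0.67).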

  23. 5. Summary and future plans

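  24. Summary and future plans
    ・Proposed a method that, when an anomaly occurs in a distributed application, infers the location of the cause by designing and monitoring only a small number of representative indicators, while achieving both immediacy and low noise
    ・Future plans
      ・Run preliminary experiments to see how far the inference time of the proposed method can be reduced
      ・Organize the progress made by then and present it at IOT50 (submission deadline June 18)
      ・If the results are good, present at IOTS2020 (submission deadline early September)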

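  25. Future plans
    ・Run preliminary experiments to see how far the inference time of the proposed method can be reduced
    ・Organize the progress made by then and present it at IOT50 (submission deadline June 18)
    ・If the results are good, present at IOTS2020 (submission deadline early September)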
