A Concept for a Method to Immediately Diagnose the Causes of Anomalies in Distributed Applications / Causality Tracing in Distributed Applications

Transcript

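  1. A Concept for a Method to Immediately Diagnose the Causes of Anomalies in Distributed Applications. Yuuki Tsubouchi (D1), Okabe-Miyazaki Laboratory, Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University. Research meeting, May 7, 2020.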

  2. Contents: 1. Background and objectives 2. Related work 3. Proposed method 4. Planned experiments 5. Summary and future plans
  3. 1. Background and objectives

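  4. Why identifying the cause of anomalies in distributed applications is hard
     - Distributed architectures that combine small services, rather than building one huge application, are on the rise.
     - Many components communicate with one another to operate, so identifying the cause of an anomaly becomes difficult:
       - complex network dependencies;
       - dynamic changes to those dependencies;
       - a large number of metrics monitored per component.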

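  5. Requirements for identifying the cause of anomalies
     - Low noise: system administrators want to be notified only of alerts for symptoms that harm users.
       - Even 100% CPU utilization does not necessarily affect users.
       - More noise raises the administrators' cognitive load and leads to alerts being overlooked.
     - Immediacy: once an anomaly is detected, the cause should be identified immediately.
       - Immediacy should hold even as the number of components and metrics grows.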
  6. Issues in prior work
     - Root cause analysis based on request-level tracing [1]
       - Instrumentation code must be added to the application.
     - Root cause analysis using a dependency graph between components together with metrics [2][3]
       - When dependencies or the workload change, dependency extraction must be redone, so immediacy is not satisfied.
       - Alternatively, an indicator representing the service level (SLO) must be designed and monitored for every component, which takes effort.

     [1]: H. Jayathilaka, C. Krintz, and R. Wolski, "Performance monitoring and root cause analysis for cloud-hosted web applications," WWW, pp. 469-478, 2017.
     [2]: J. Thalheim, A. Rodrigues, I. Akkus, et al., "Sieve: Actionable Insights from Monitored Metrics in Distributed Systems," ACM/IFIP/USENIX Middleware, pp. 14-27, 2017.
     [3]: J. Lin, P. Chen, and Z. Zheng, "Microscope: Pinpoint performance issues with causal graphs in micro-service environments," ICSOC, pp. 3-20, 2018.
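  7. Research objectives
     1. Low noise, with less effort spent designing indicators per component
        - Monitor only the indicators that represent the service (response time, response error rate, and so on).
        - Triggered by a monitoring alert, explore the TCP connection dependency graph and the metrics on each node.
        - Treat metrics whose time-series characteristics correlate with the representative indicator as cause candidates.
     2. Immediacy: analyzing every metric on every node would make processing slow, so
        - speed up metric retrieval by keeping metrics distributed across the nodes and querying them there;
        - speed up time-series correlation analysis with GPGPU;
        - reuse past results to reach a decision quickly.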
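  8. Expected research contributions
     - As long as system administrators collect metrics on each component in advance and set monitoring alerts only on a small number of representative indicators, the location of the cause can be inferred immediately when an alert fires.
     - On anomaly scenarios that occurred in a real environment, the proposed method achieves XX% accuracy.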

  9. 2. Related work

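  10. Related work (1): request-level tracing
      - An identifier is assigned to each application-layer request, embedded in subsequent requests, and propagated to downstream processes.
      - Using the identifier, the latency between arbitrary components can be measured.
      - Advantage: detailed tracing inside the application is possible.
      - Issue: the application code must be changed to take the measurements.

      B. H. Sigelman, et al., "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure," Technical report, Google, 2010.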
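The ID propagation described on this slide can be sketched as follows. This is a minimal illustration, not Dapper's implementation: the header name `X-Trace-Id`, the `handle_request` function, and the nested-list topology standing in for real RPC calls are all hypothetical.

```python
import time
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, not taken from Dapper


def handle_request(headers, service, downstream, spans):
    """Record a span for `service` and propagate the trace ID downstream.

    `downstream` is a list of (service_name, downstream_list) pairs that
    stands in for real RPC calls (an assumption made for this sketch).
    """
    # Assign an ID at the entry point; otherwise reuse the propagated one.
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    start = time.perf_counter()
    for child_service, child_downstream in downstream:
        # Embed the same ID in every subsequent request.
        handle_request({TRACE_HEADER: trace_id}, child_service,
                       child_downstream, spans)
    spans.append({
        "trace_id": trace_id,
        "service": service,
        "duration": time.perf_counter() - start,
    })
    return trace_id


spans = []
topology = [("api", [("db", [])]), ("cache", [])]
trace_id = handle_request({}, "frontend", topology, spans)
```

Every span carries the same `trace_id`, so per-component durations can be joined after the fact. The cost is visible in the sketch too: each service has to run code like `handle_request`, which is the application change the slide lists as the issue.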
  11. Related work (2): Sieve
      - Proposes a system that derives actionable hints from an existing monitoring infrastructure.
      - Identifies dependencies between components through time-series cluster analysis.
      - Issue: to follow configuration changes and workload changes, the analysis steps must be rerun from the beginning.

      [2]: J. Thalheim, A. Rodrigues, I. Akkus, et al., "Sieve: Actionable Insights from Monitored Metrics in Distributed Systems," ACM/IFIP/USENIX Middleware, pp. 14-27, 2017.
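  12. Related work (3): Microscope
      - Efficiently builds a causal graph between components without adding instrumentation code to the microservices, and infers the cause of an anomaly in real time.
      - Issue 1: it narrows down candidates by looking at fluctuations in each component's representative metric, but does not output other metrics that may be the cause.
      - Issue 2: application knowledge is needed to choose a representative metric for each component.

      [3]: J. Lin, P. Chen, and Z. Zheng, "Microscope: Pinpoint performance issues with causal graphs in micro-service environments," ICSOC, pp. 3-20, 2018.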
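  13. Summary of the issues in related work
      - Related work (3), Microscope, sets monitoring alerts only on the representative indicator (SLO) of the service as a whole, so it satisfies the low-noise requirement.
      - Microscope satisfies immediacy by parallelizing causal graph construction.
      - However, representative indicators must be designed and collected as time-series data not only for the whole service but also for each component.
      - The more components there are, the higher the cost of designing the indicators.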

  14. 3. Proposed method

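  15. Premises of the proposed method
      - Build on the causal inference method of related work (3), Microscope.
      - Define indicators only for the service as a whole and set monitoring alerts on them.
      - All metrics on the hosts.
      - Use a complete network dependency graph [4], with processes as nodes and TCP connections as edges, to build the causal graph of the anomaly.

      [4]: Y. Tsubouchi, M. Furukawa, and R. Matsumoto, "Transtracer: Automatic tracing of inter-process dependencies by monitoring the endpoints of TCP/UDP communication in distributed systems" (in Japanese), Proceedings of the Internet and Operation Technology Symposium, 2019, pp. 64-71, Dec. 2019.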
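  16. Workflow of the proposed method (figure). A Frontend node and Component nodes each hold multiple metrics (reqs/sec, errors/sec, latency, CPU usage, …). Step 1: detect an anomaly in the representative indicator. Step 2: build the causal graph of the anomaly from the dependency graph. Output: a cause candidate list of (component, metric, score) tuples.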
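  17. Workflow of the proposed method
      0. Monitor the representative metric of the front node and collect the dependency graph.
      1. Detect an SLO violation.
      2. Run causal inference based on the dependency graph.
         - Get the nodes adjacent to the front node.
         - On each node, test whether it holds a metric that correlates with the time-series characteristics of the representative indicator of the whole service.
         - If the threshold is exceeded, add the pair of that node and metric to the cause candidate list.
         - Keep exploring the dependency graph until every node has been visited.
         - Rank the candidate list using the degree of correlation as the score.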
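The workflow above can be sketched as a breadth-first search over the dependency graph. This is a minimal sketch under stated assumptions, not the author's implementation: Pearson correlation stands in for the statistical test (the slides leave the choice of test open), the 0.8 threshold is arbitrary, and `pearson`, `rank_candidates`, and the sample data are hypothetical.

```python
import math
from collections import deque


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_candidates(graph, metrics, front, slo_series, threshold=0.8):
    """Explore the dependency graph from `front` and rank correlated metrics.

    Returns (component, metric, score) tuples sorted by descending score.
    The threshold value is an assumption made for this sketch.
    """
    candidates = []
    visited = {front}
    queue = deque([front])
    while queue:  # continue until every reachable node has been explored
        node = queue.popleft()
        for name, series in metrics.get(node, {}).items():
            score = abs(pearson(slo_series, series))
            if score >= threshold:
                candidates.append((node, name, score))
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return sorted(candidates, key=lambda c: c[2], reverse=True)


# Hypothetical data: latency of the whole service jumps at the fourth sample.
slo_latency = [10, 11, 10, 30, 32, 31]
graph = {"frontend": ["api"], "api": ["db"], "db": []}
metrics = {
    "frontend": {"cpu": [5, 5, 5, 5, 5, 5]},
    "api": {"cpu": [20, 21, 20, 22, 21, 20]},
    "db": {"disk_wait": [1, 1, 1, 9, 10, 9]},
}
candidates = rank_candidates(graph, metrics, "frontend", slo_latency)
```

With this sample data only `("db", "disk_wait", …)` passes the threshold, because that series jumps at the same time as the service latency.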
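  18. Candidate algorithms for causal graph inference and testing
      - Causal graph inference
        - PC algorithm: handles only co-occurrence relations between events.
        - Bayesian method: handles the temporal order between events in addition to their co-occurrence.
        - If the series of the metrics are time-synchronized, the Bayesian method may be the more appropriate choice.
      - Conditional independence tests for time-series data
        - G2 test and similar.
        - Use the correlation coefficient as the ranking score.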
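As an illustration of the G2 test named above, the statistic for testing whether X and Y are independent given Z on discrete samples is G^2 = 2 Σ O ln(O / E), summed over the observed cells. The sketch below is a generic textbook form, not the author's implementation: it assumes the metric series have already been discretized (for example into "normal" and "anomalous" bins), and `g2_statistic` is a hypothetical name.

```python
import math
from collections import Counter


def g2_statistic(x, y, z=None):
    """Return (G^2, degrees of freedom) for X independent of Y given Z.

    x, y, z are equal-length sequences of discrete values. With z=None the
    test is unconditional. Discretizing continuous metrics beforehand is an
    assumption of this sketch.
    """
    if z is None:
        z = [0] * len(x)
    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)
    g2 = 0.0
    for (xi, yi, zi), observed in n_xyz.items():
        # Expected count under conditional independence.
        expected = n_xz[(xi, zi)] * n_yz[(yi, zi)] / n_z[zi]
        g2 += 2.0 * observed * math.log(observed / expected)
    dof = (len(set(x)) - 1) * (len(set(y)) - 1) * len(set(z))
    return g2, dof


# Fully dependent series versus independent series (hypothetical data).
g2_dep, dof_dep = g2_statistic([0, 0, 1, 1], [0, 0, 1, 1])
g2_ind, dof_ind = g2_statistic([0, 0, 1, 1], [0, 1, 0, 1])
```

The statistic is compared against a chi-squared distribution with `dof` degrees of freedom; a large value rejects independence, which in a PC-style procedure keeps the edge between the two metrics.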
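  19. Proposed speed-ups (just ideas)
      - Speed up the tests on time-series data.
        - Exploit the data parallelism across series and process them in parallel on a GPU.
      - Speed up metric retrieval.
        - Keep only the most recent data locally on each node and query the nodes in a distributed fashion.
      - Based on past inference results, decide whether an anomaly is the same case as a previous one and, if it is, reuse the past inference result.
        - Assuming that anomaly cases are rare, deliberately inject anomalies into the real environment so that the cases can be learned in advance.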
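The idea of keeping only recent data on each node and querying it in a distributed fashion can be sketched with a bounded buffer per metric. This is an illustrative sketch only: `LocalMetricStore`, `query_all`, and the window size are hypothetical, and the in-process dictionary of stores stands in for real network queries to each node.

```python
from collections import deque


class LocalMetricStore:
    """Per-node store that retains only the latest `window` samples per metric."""

    def __init__(self, window):
        self.window = window
        self.series = {}

    def append(self, metric, value):
        # deque(maxlen=...) silently discards the oldest sample when full.
        if metric not in self.series:
            self.series[metric] = deque(maxlen=self.window)
        self.series[metric].append(value)

    def query(self, metric):
        return list(self.series.get(metric, ()))


def query_all(stores, metric):
    """Fan a query out to every node's local store.

    In a real deployment this would be a network call per node; here the
    stores live in one process (an assumption made for this sketch).
    """
    return {node: store.query(metric) for node, store in stores.items()}


stores = {"api": LocalMetricStore(window=3), "db": LocalMetricStore(window=3)}
for value in range(1, 6):
    stores["api"].append("cpu", value)
stores["db"].append("cpu", 42)
recent = query_all(stores, "cpu")
```

Because each node answers only for its own short window, the amount of data moved at diagnosis time stays bounded by the window size rather than by the retention period of a central time-series database.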
  20. 4. Planned experiments

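  21. Evaluation items
      - Immediacy
        - How quickly cause candidates can be found after an anomaly is detected.
        - Evaluate how execution time changes as the number of components and metrics grows.
      - Inference accuracy of the cause
        - Compare with related work (3), Microscope, to see whether accuracy differs.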

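  22. Evaluation baselines
      - Immediacy
        - Related work (3), Microscope, takes 12 seconds with the analysis run on a single physical machine.
        - Outage impact time is generally reported to users in units of minutes, so a recovery delay of up to 60 seconds is acceptable.
      - Precision and recall
        - Related work (3), Microscope, reports 88% precision and 85% recall.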

  23. 5. Summary and future plans

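  24. Summary and future plans
      - Proposed a method that, when an anomaly occurs in a distributed application, infers the location of the cause while achieving both immediacy and low noise, and requires designing and monitoring only a small number of representative indicators.
      - Future plans
        - Run preliminary experiments on how much the inference time of the proposed method can be shortened.
        - Organize the progress made by then and present it at IOT50 (deadline June 18).
        - If the results are good, present at IOTS2020 (deadline early September).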

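  25. Future plans
      - Run preliminary experiments on how much the inference time of the proposed method can be shortened.
      - Organize the progress made by then and present it at IOT50 (deadline June 18).
      - If the results are good, present at IOTS2020 (deadline early September).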