A Concept for a Method to Instantly Infer Causal Relationships among Anomalies in Distributed Applications / On-time Causal Tracing for System Failures

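Yuuki Tsubouchi, "A Concept for a Method to Instantly Infer Causal Relationships among Anomalies in Distributed Applications," 6th WebSystemArchitecture Workshop (第6回WebSystemArchitecture研究会), 2020/04.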
https://websystemarchitecture.hatenablog.jp/entry/2019/12/11/165624

Yuuki Tsubouchi (yuuk1)

April 26, 2020

Transcript

  1. A Concept for a Method to Instantly Infer Causal Relationships among Anomalies in Distributed Applications. Yuuki Tsubouchi @ SAKURA internet. 6th WebSystemArchitecture Workshop, 2020.04.26

  2. 1. Introduction

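  3. Growing complexity of distributed applications, and monitoring
     - Demands on the reliability of information systems are rising.
     - To detect system failures, operators use a system that automatically monitors system state (a monitoring system).
     - Alerts are sent to system administrators by email or chat tools.
     - Changes in system state indicators (metrics) are visualized as time-series graphs.
     - With the rise of microservices and multi-region configurations, the number of components is growing, and the number of monitored targets is growing with it.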

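  4. Monitoring problems as the number of components grows
     - Longer visual search: system administrators take longer to find the component and metric responsible by inspecting graphs by eye.
     - As a result, recovery from failures is delayed.
     - An automated approach is needed to shorten the time system administrators spend searching for the cause of a failure.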

  5. Issues with prior methods
     - Root cause analysis based on request tracing [1]
       - Requires adding instrumentation code to the application.
     - Root cause analysis based on a dependency graph and metrics [2]
       - When dependencies or the workload change, the dependency extraction has to be redone.
     - Real-time causal inference from SLO metrics and a causality graph [3]
       - Inference at component granularity is possible, but metrics are out of scope.
     [1]: H. Jayathilaka, C. Krintz, and R. Wolski, "Performance monitoring and root cause analysis for cloud-hosted web applications," WWW, pp. 469-478, 2017.
     [2]: J. Thalheim, A. Rodrigues, I. Akkus, and others, "Sieve: Actionable Insights from Monitored Metrics in Distributed Systems," ACM/IFIP/USENIX Middleware, pp. 14-27, 2017.
     [3]: J. Lin, P. Chen, and Z. Zheng, "Microscope: Pinpoint performance issues with causal graphs in micro-service environments," ICSOC, pp. 3-20, 2018.

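  6. Research objective
     To address the issues of prior work, we propose a foundation for causal inference triggered by SLO metrics that satisfies the following requirements:
     1. It can infer both the component and the metric that caused the problem.
     2. No SLO needs to be configured for any component other than the front component.
     3. It combines scalability with immediacy.
        - Even as the number of components and metrics grows, the cause can be inferred immediately.
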
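  7. Expected contributions
     - System administrators no longer need to locate the anomaly among a large number of time-series graphs, so candidate causes can be narrowed down quickly.
     - The effort system administrators spend on monitoring configuration can be reduced.
     - If only the SLO metrics of the front component are set appropriately, system administrators can infer causality without knowledge of the application.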

  8. 2. Related work

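  9. State-of-the-art method: Sieve [2]
     - Proposes a system that derives actionable insights from an existing monitoring infrastructure.
     - Identifies dependencies between components through time-series cluster analysis.
     - Application 1: use each cluster's representative metric as an autoscaling indicator.
     - Application 2: use the topology graph for root cause analysis.
     [2]: J. Thalheim, A. Rodrigues, I. Akkus, and others, "Sieve: Actionable Insights from Monitored Metrics in Distributed Systems," ACM/IFIP/USENIX Middleware, pp. 14-27, 2017.
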
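  10. Issues with the state-of-the-art method
     - Keeping up with configuration changes or workload changes requires redoing the steps from the beginning.
     - It cannot satisfy the immediacy requirement.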

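  11. State-of-the-art method: Microscope [3]
     - Efficiently builds a causality graph of services without instrumenting microservice code, and infers the cause of an anomaly in real time.
     - Issue 1: it narrows down candidates by looking at fluctuations in each component's SLO metric, but does not output any other metric as a candidate cause.
     - Issue 2: setting the SLOs requires knowledge of the services.
     [3]: J. Lin, P. Chen, and Z. Zheng, "Microscope: Pinpoint performance issues with causal graphs in micro-service environments," ICSOC, pp. 3-20, 2018.
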
  12. 3. Proposed method

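  13. Requirements for the proposed method
     1. It can infer both the component and the metric that caused the problem.
     2. No SLO needs to be configured for any component other than the front component.
     3. It combines scalability with immediacy.
        - Even as the number of components and metrics grows, the cause can be inferred immediately.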

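  14. Workflow of the proposed causal inference system
     0. Collect the front component's SLO metrics and continuously build the dependency graph.
     1. Detect an SLO violation.
     2. Run causal inference based on the dependency graph.
        - Get the nodes adjacent to the front component.
        - On each node, search for metrics that correlate with the time-series features of the SLO metric.
        - When one is found, add the (node, metric) pair to the cause candidate list.
        - Traverse the dependency graph until every node has been searched.
        - Rank the candidate list by score.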
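
The traversal in the workflow above can be sketched roughly as follows. This is only an illustrative sketch, not the author's implementation: the slides do not specify the correlation measure, so absolute Pearson correlation is assumed here as the score, and the `threshold` value, the function names, and the sample graph and metric values are all hypothetical.

```python
import math
from collections import deque


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def infer_cause_candidates(graph, metrics, front, slo_series, threshold=0.8):
    """Breadth-first traversal from the front component (step 2 of the workflow).

    graph:   component -> list of components it depends on
    metrics: component -> {metric name -> time series}
    Returns (component, metric, score) tuples, highest score first.
    """
    candidates = []
    visited = {front}
    queue = deque([front])
    while queue:
        node = queue.popleft()
        if node != front:  # the front component only supplies the SLO metric
            for name, series in metrics.get(node, {}).items():
                score = abs(pearson(slo_series, series))  # assumed score
                if score >= threshold:  # assumed cut-off
                    candidates.append((node, name, score))
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return sorted(candidates, key=lambda c: c[2], reverse=True)


# Hypothetical sample data: front-end latency rises together with DB metrics.
graph = {"frontend": ["api"], "api": ["db", "cache"], "db": [], "cache": []}
slo_latency = [100, 102, 101, 180, 250, 300]
metrics = {
    "api": {"cpu_usage": [30, 31, 30, 32, 31, 30]},
    "db": {
        "disk_io_wait": [5, 5, 6, 20, 35, 44],
        "slow_queries": [0, 0, 1, 3, 4, 9],
        "connections": [10, 10, 10, 10, 10, 10],
    },
    "cache": {"hit_ratio": [0.90, 0.91, 0.90, 0.89, 0.90, 0.91]},
}
ranked = infer_cause_candidates(graph, metrics, "frontend", slo_latency)
for component, metric, score in ranked:
    print(component, metric, round(score, 3))
```

With this sample data only the two `db` metrics that move with the front-end latency pass the threshold; the constant and weakly correlated series are dropped.
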
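  15. Causal inference workflow, illustrated
     [Figure: the Frontend Component and each component node hold metrics (reqs/sec, errors/sec, latency, CPU usage, …). Step 1: SLO violation detection. Step 2: traversal of the dependency graph. Output: a cause candidate list of (component, metric, score) entries.]
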
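  16. Usage example of the proposed causal inference system
     [Figure: set an alert condition that indicates an SLO violation; on detection and notification, inference runs over the topology graph and produces a list of candidate components and metrics for the cause; the result is shown on the alert page.]
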
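  17. How to make causal inference fast (Just Idea)
     - Process the graph search in parallel (already proposed in prior work).
     - Process the correlation analysis in parallel, one metric at a time.
     - Use on-demand scaling of GPUs or FaaS.
     - Learn in advance which metrics tend to correlate with the front SLO.
     - Run causal inference every time a fluctuation that does not amount to an SLO violation is detected.
     - To reduce the wait perceived by system administrators, return the pre-learned result first, then return the full-search inference result later.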
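
Two of the ideas above, per-metric parallel correlation and returning a pre-learned answer before the full search, might look roughly like this. It is a sketch under stated assumptions, not the author's design: a local thread pool stands in for the GPU or FaaS workers mentioned on the slide, absolute Pearson correlation is assumed as the score, and `score_metrics`, `infer_in_two_phases`, `prelearned_keys`, and the sample data are hypothetical names and values.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def score_metrics(slo_series, named_series, max_workers=4):
    """Score each (component, metric) series as an independent parallel task."""
    items = list(named_series.items())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(lambda kv: abs(pearson(slo_series, kv[1])), items))
    scored = [(comp, metric, s) for ((comp, metric), _), s in zip(items, scores)]
    return sorted(scored, key=lambda c: c[2], reverse=True)


def infer_in_two_phases(slo_series, named_series, prelearned_keys):
    """Yield a quick ranking over the pre-learned shortlist, then the full one."""
    shortlist = {k: named_series[k] for k in prelearned_keys if k in named_series}
    yield "quick", score_metrics(slo_series, shortlist)
    yield "full", score_metrics(slo_series, named_series)


# Hypothetical sample data keyed by (component, metric).
slo_latency = [100, 102, 101, 180, 250, 300]
named_series = {
    ("api", "cpu_usage"): [30, 31, 30, 32, 31, 30],
    ("db", "disk_io_wait"): [5, 5, 6, 20, 35, 44],
    ("db", "slow_queries"): [0, 0, 1, 3, 4, 9],
}
prelearned_keys = [("db", "disk_io_wait")]  # assumed output of prior learning
phases = list(infer_in_two_phases(slo_latency, named_series, prelearned_keys))
for phase, ranking in phases:
    print(phase, [(c, m, round(s, 3)) for c, m, s in ranking])
```

Because each metric is scored independently, the same function shape could be handed to a process pool or remote workers; the generator lets a caller display the quick result while the full ranking is still being computed.
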
  18. 5. Summary

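  19. Summary
     - We proposed a method that, when a failure occurs in a distributed application, infers candidate causes at metric granularity without requiring knowledge of the application.
     - Future plans
       - Clarify the differences from the state of the art.
       - Estimate how much the execution time of causal inference can be shortened.
       - Make use of the correlation between dependencies and the time-series features of metrics.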