A Concept of a Method for Instantly Inferring the Causal Relationships of Anomalies in Distributed Applications / On-time Causal Tracing for System Failures

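Yuuki Tsubouchi, "A Concept of a Method for Instantly Inferring the Causal Relationships of Anomalies in Distributed Applications," 6th WebSystemArchitecture Research Meeting (WebSystemArchitecture研究会), 2020/04.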
https://websystemarchitecture.hatenablog.jp/entry/2019/12/11/165624

Yuuki Tsubouchi (yuuk1)

April 26, 2020

Transcript

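  1. A Concept of a Method for Instantly Inferring the Causal Relationships of Anomalies in Distributed Applications
    Yuuki Tsubouchi @ SAKURA internet
    6th WebSystemArchitecture Research Meeting (WebSystemArchitecture研究会), 2020.04.26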

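  2. Growing complexity of distributed applications and monitoring
    ・Demands on the reliability of information systems are rising.
    ・To detect system failures, operators use systems that automatically monitor system state (monitoring systems).
      ・Alerts are sent to system administrators by email or chat tools.
      ・Changes in system state indicators (metrics) are visualized as time-series graphs.
    ・With the rise of microservices and multi-region architectures, the number of components has grown, and the number of monitoring targets has grown with it.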

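  3. Problems in monitoring as the number of components grows
    ・Longer visual search: system administrators need more time to find, by eye, the component and metric that caused a failure.
    ・As a result, recovery from failures is delayed.
    An automated approach is needed to shorten the time system administrators spend searching for the cause of a failure.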

  4. Issues with prior methods
    ・Request-tracing-based root cause analysis [1]
      ・Requires adding instrumentation code to the application.
    ・Root cause analysis based on a dependency graph and metrics [2]
      ・When dependencies or the workload change, the dependency extraction has to be redone.
    ・Real-time causal inference with SLO metrics and a causal graph [3]
      ・Inference at the component level is possible, but metrics are out of scope.
    [1]: H. Jayathilaka, C. Krintz, and R. Wolski, "Performance Monitoring and Root Cause Analysis for Cloud-hosted Web Applications," WWW, pp. 469–478, 2017.
    [2]: J. Thalheim, A. Rodrigues, I. Akkus, et al., "Sieve: Actionable Insights from Monitored Metrics in Distributed Systems," ACM/IFIP/USENIX Middleware, pp. 14–27, 2017.
    [3]: J. Lin, P. Chen, and Z. Zheng, "Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments," ICSOC, pp. 3–20, 2018.

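  5. Research objective
    To address the issues in prior work, we propose a platform for causal inference triggered by SLO metrics that satisfies the following requirements:
    1. Causal inference can identify both the component and the metric that caused the failure.
    2. No SLO has to be configured for any component other than the front component.
    3. Scalability and immediacy are both achieved.
      ・The location of the cause can be inferred immediately even as the number of components and metrics grows.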

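  6. Expected research contributions
    ・System administrators no longer have to pinpoint the anomaly in a large number of time-series graphs, so candidate causes can be narrowed down quickly.
    ・The effort system administrators spend on configuring monitoring can be reduced.
      ・As long as the SLO metric of the front component is set appropriately, system administrators can infer causality without knowledge of the application.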

  7. 2. Related work

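  8. State-of-the-art method: Sieve
    ・Proposes a system that derives actionable insights from an existing monitoring infrastructure.
    ・Identifies dependencies between components through time-series cluster analysis.
    ・Application 1: use the representative metric of each cluster as an autoscaling indicator.
    ・Application 2: use the topology graph for root cause analysis.
    [2]: J. Thalheim, A. Rodrigues, I. Akkus, et al., "Sieve: Actionable Insights from Monitored Metrics in Distributed Systems," ACM/IFIP/USENIX Middleware, pp. 14–27, 2017.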

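  9. Issues with the state-of-the-art method
    ・Keeping up with configuration changes or workload changes requires redoing the steps from the beginning.
    ・The immediacy requirement cannot be met.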

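  10. State-of-the-art method: Microscope
    ・Efficiently builds a causal graph of services without instrumenting the microservice code, and infers the cause of an anomaly in real time.
    ・Issue 1: it narrows down candidates by looking at fluctuations in each component's SLO metric, but it does not output any other metrics that may be the cause.
    ・Issue 2: configuring the SLOs requires knowledge of the services.
    [3]: J. Lin, P. Chen, and Z. Zheng, "Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments," ICSOC, pp. 3–20, 2018.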

  11. 3. Proposed method

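  12. Requirements for the proposed method
    1. Causal inference can identify both the component and the metric that caused the failure.
    2. No SLO has to be configured for any component other than the front component.
    3. Scalability and immediacy are both achieved.
      ・The location of the cause can be inferred immediately even as the number of components and metrics grows.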

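  13. Workflow of the proposed causal inference system
    0. Collect the SLO metric of the front component and continuously build the dependency graph.
    1. Detect an SLO violation.
    2. Run causal inference based on the dependency graph.
      ・Get the nodes adjacent to the front component.
      ・On each node, search for metrics that correlate with the time-series features of the SLO metric.
      ・When one is found, add the (node, metric) pair to the cause candidate list.
      ・Traverse the dependency graph until every node has been explored.
      ・Rank the candidate list by score.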
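
The workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the slides do not say how "correlation with time-series features" is computed, so plain Pearson correlation is swapped in, and the names `pearson`, `rank_cause_candidates`, the `threshold` parameter, and the sample data are all hypothetical.

```python
from collections import deque
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_cause_candidates(graph, metrics, front, slo_series, threshold=0.8):
    """Breadth-first traversal of the dependency graph, starting at `front`.

    graph:   {component: [components it depends on]}
    metrics: {component: {metric_name: [samples]}}
    Returns (component, metric, score) tuples, highest score first.
    Assumption: score = |Pearson correlation| with the front SLO series.
    """
    candidates = []
    visited = {front}
    queue = deque([front])
    while queue:
        node = queue.popleft()
        if node != front:  # the front component only supplies the SLO series
            for name, series in metrics.get(node, {}).items():
                score = abs(pearson(slo_series, series))
                if score >= threshold:
                    candidates.append((node, name, score))
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return sorted(candidates, key=lambda c: c[2], reverse=True)


# Hypothetical sample data: front latency jumps at the third sample.
slo = [100, 110, 300, 320, 310]
graph = {"front": ["api"], "api": ["db", "cache"], "db": [], "cache": []}
metrics = {
    "api": {"cpu": [10, 11, 30, 33, 31]},
    "db": {"latency": [5, 5, 50, 52, 51], "reqs": [7, 7, 7, 7, 7]},
    "cache": {"cpu": [3, 9, 4, 8, 5]},
}
ranked = rank_cause_candidates(graph, metrics, "front", slo)
for component, metric, score in ranked:
    print(component, metric, round(score, 3))
```

With this sample data only the `api` CPU metric and the `db` latency metric pass the threshold; the constant `reqs` series and the uncorrelated `cache` CPU series are dropped.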

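  14. Diagram of the causal inference workflow
    [Figure: a dependency graph rooted at the Frontend Component. Each component node has metrics (reqs/sec, errors/sec, latency, CPU usage). Step 1, SLO violation detection, is followed by step 2, traversal of the dependencies, which fills the cause candidate list with (component, metric, score) entries.]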

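  15. Usage example of the proposed causal inference system
    Set an alert condition that indicates an SLO violation
    Detection, notification
    Inference
    - Topology graph
    - List of candidate components and metrics for the location of the cause
    Show the results on the alert page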
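
The flow on this slide (alert condition, detection, inference, results on the alert page) might be wired together as in the following sketch. Everything here is an assumption made for illustration: the slides do not define the alert condition, so a latency threshold with a tolerated fraction of slow requests is used, and `violates_slo`, `build_alert`, and the `infer` callback are hypothetical names.

```python
def violates_slo(latencies_ms, threshold_ms=500.0, max_bad_fraction=0.01):
    """True when more than `max_bad_fraction` of requests exceed `threshold_ms`."""
    if not latencies_ms:
        return False
    bad = sum(1 for v in latencies_ms if v > threshold_ms)
    return bad / len(latencies_ms) > max_bad_fraction


def build_alert(latencies_ms, infer, **slo):
    """Return None while the SLO holds; otherwise an alert with inference results.

    `infer` is a zero-argument callable standing in for the inference step;
    it is expected to return a list of (component, metric, score) tuples.
    """
    if not violates_slo(latencies_ms, **slo):
        return None
    return {
        "title": "SLO violation on front component",
        "candidates": infer(),  # shown on the alert page next to the alert
    }


# Hypothetical usage: 5% of requests are slower than 500 ms.
window = [100.0] * 95 + [900.0] * 5
alert = build_alert(window, lambda: [("db", "latency", 0.99)])
print(alert)
```

Passing the inference step as a callback keeps detection cheap: inference only runs once the alert condition has actually fired.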

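  16. How to make causal inference fast (Just Idea)
    ・Process the graph search in parallel (already proposed in prior methods).
    ・Process the correlation analysis in parallel, one metric at a time.
      ・Use GPUs or the on-demand scaling of FaaS.
    ・Learn in advance which metrics tend to correlate with the front SLO.
      ・Run the causal inference every time a fluctuation too small to violate the SLO is detected.
      ・To reduce the wait perceived by system administrators, return the pre-learned results first, then return the inference results of the full search later.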
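
Two of these ideas, per-metric parallel correlation and returning pre-learned results before the full search, can be combined in a small sketch. It rests on several assumptions that are not in the slides: Pearson correlation as the score, a local thread pool as a stand-in for GPU or FaaS workers (in CPython, threads do not speed up CPU-bound work, so the pool only illustrates the fan-out), and the hypothetical names `score_metric`, `infer_two_phase`, and `prelearned`.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def score_metric(task):
    """One unit of parallel work: score a single (component, metric) pair."""
    component, name, series, slo_series = task
    return (component, name, abs(pearson(slo_series, series)))


def infer_two_phase(metrics, slo_series, prelearned, threshold=0.8, workers=4):
    """Yield ("prelearned", ranked) first, then ("full", ranked).

    prelearned: (component, metric) pairs learned earlier to correlate with
    the front SLO (assumed to come from runs on sub-violation fluctuations).
    """
    def rank(tasks):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            scored = list(pool.map(score_metric, tasks))
        hits = [c for c in scored if c[2] >= threshold]
        return sorted(hits, key=lambda c: c[2], reverse=True)

    # Phase 1: only the pre-learned pairs, so the first answer arrives quickly.
    fast = [(c, m, metrics[c][m], slo_series)
            for c, m in prelearned if m in metrics.get(c, {})]
    yield "prelearned", rank(fast)

    # Phase 2: every metric of every component.
    full = [(c, m, s, slo_series)
            for c, ms in metrics.items() for m, s in ms.items()]
    yield "full", rank(full)


# Hypothetical sample data.
slo = [100, 110, 300, 320, 310]
metrics = {
    "api": {"cpu": [10, 11, 30, 33, 31]},
    "db": {"latency": [5, 5, 50, 52, 51], "reqs": [7, 7, 7, 7, 7]},
}
phases = list(infer_two_phase(metrics, slo, prelearned=[("db", "latency")]))
for phase, ranked in phases:
    print(phase, [(c, m) for c, m, _ in ranked])
```

Because `infer_two_phase` is a generator, a caller can render the first phase as soon as it is yielded and replace it with the full result when the second phase arrives.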

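  17. Summary
    ・Proposed a method that, when a failure occurs in a distributed application, infers candidate causes at the metric level without knowledge of the application.
    ・Future plans
      ・Clarify the difference from the state of the art.
      ・Estimate how much the execution time of causal inference can be shortened.
    ・Uses the dependencies and the correlation between time-series features of metrics.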
