A Concept for a Method of Instantly Inferring Causal Relationships among Anomalies in Distributed Applications / On-time Causal Tracing for System Failures

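Yuuki Tsubouchi, "A Concept for a Method of Instantly Inferring Causal Relationships among Anomalies in Distributed Applications," 6th WebSystemArchitecture Workshop (第6回WebSystemArchitecture研究会), April 2020.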
https://websystemarchitecture.hatenablog.jp/entry/2019/12/11/165624

Yuuki Tsubouchi (yuuk1)

April 26, 2020

Transcript

  1. 5.

    Challenges of prior methods

    - Root cause analysis based on request tracing [1]
      - Requires adding instrumentation code to the application
    - Root cause analysis based on a dependency graph and metrics [2]
      - When the dependencies or the workload change, the dependency extraction has to be redone
    - Real-time causal inference with SLO metrics and a causal relationship graph [3]
      - Inference at component granularity is possible, but individual metrics are out of scope

    [1]: H. Jayathilaka, C. Krintz, and R. Wolski, "Performance monitoring and root cause analysis for cloud-hosted web applications," WWW, pp. 469–478, 2017.
    [2]: J. Thalheim, A. Rodrigues, I. Akkus, and others, "Sieve: Actionable Insights from Monitored Metrics in Distributed Systems," ACM/IFIP/USENIX Middleware, pp. 14–27, 2017.
    [3]: J. Lin, P. Chen, and Z. Zheng, "Microscope: Pinpoint performance issues with causal graphs in micro-service environments," ICSOC, pp. 3–20, 2018.
  2. 15.

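    Causal inference workflow, illustrated

    [Diagram: a graph of component nodes (labels: Frontend, Component). Each component node has metrics: reqs/sec, errors/sec, latency, CPU usage, …]

    1. Detect an SLO violation
    2. Traverse the dependencies

    Output: a root cause candidate list of (component, metric, score) entries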
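
The workflow on slide 15 (detect an SLO violation, traverse the dependencies, and emit a list of (component, metric, score) entries) can be sketched in Python roughly as follows. This is only a minimal illustration, not the method proposed in the talk: the slides do not say how the score is computed or in what order the graph is traversed, so the z-score against a baseline, the breadth-first traversal, the threshold, the component names, and all sample values below are assumptions made for the example.

```python
from collections import deque
from statistics import mean, pstdev

# Hypothetical dependency graph: caller -> callees (names are illustrative).
DEPENDENCIES = {
    "frontend": ["api"],
    "api": ["db", "cache"],
    "db": [],
    "cache": [],
}

# Hypothetical metrics per component node: metric -> (baseline samples, current value).
METRICS = {
    "frontend": {"latency_ms": ([100, 110, 90, 100], 400)},
    "api": {
        "latency_ms": ([50, 55, 45, 50], 300),
        "cpu_usage": ([30, 32, 28, 30], 31),
    },
    "db": {"cpu_usage": ([40, 42, 38, 40], 95)},
    "cache": {"reqs_per_sec": ([200, 210, 190, 200], 205)},
}


def violates_slo(component, metric, threshold):
    """Step 1: SLO violation detection (assumed here to be a simple upper bound)."""
    _, current = METRICS[component][metric]
    return current > threshold


def anomaly_score(baseline, current):
    """Placeholder score (an assumption): absolute z-score against the baseline."""
    spread = pstdev(baseline)
    if spread == 0:
        return 0.0
    return abs(current - mean(baseline)) / spread


def trace_causes(start, slo_metric, slo_threshold):
    """Return (component, metric, score) candidates, highest score first."""
    if not violates_slo(start, slo_metric, slo_threshold):
        return []

    candidates = []
    seen = {start}
    queue = deque([start])
    # Step 2: traverse the dependencies breadth-first from the violating component.
    while queue:
        component = queue.popleft()
        for metric, (baseline, current) in METRICS[component].items():
            if component == start and metric == slo_metric:
                continue  # the violated SLO metric is the symptom, not a candidate
            candidates.append((component, metric, anomaly_score(baseline, current)))
        for dependency in DEPENDENCIES.get(component, []):
            if dependency not in seen:
                seen.add(dependency)
                queue.append(dependency)

    return sorted(candidates, key=lambda entry: entry[2], reverse=True)


if __name__ == "__main__":
    for component, metric, score in trace_causes("frontend", "latency_ms", 200):
        print(f"{component:10s} {metric:14s} {score:8.2f}")
```

Because candidates are scored per metric rather than per component, the resulting list can point at a specific metric on a specific component, which is the gap noted for prior work [3] on slide 5.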