
AIOps Research Notes: Automated Root Cause Diagnosis of System Failures for SREs / SRE NEXT 2022





Transcript

  1. AIOps Research Notes: Automated Root Cause Diagnosis of System Failures for SREs
     yuuk1 @ SAKURA internet Research Center
     2022/05/15 SRE NEXT 2022 ONLINE
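  2. Profile: yuuk1 (Yuuki Tsubouchi)
     Researcher, SAKURA internet Research Center
     Third-year doctoral student, Graduate School of Informatics, Kyoto University
     Technology Advisor, Topotal
     Worked as a web operations and SRE engineer for five years after graduating
     https://yuuk.io/ @yuuk1t
     Moved to SAKURA internet three years ago and entered the world of research and development
     Entered the doctoral program two years ago
     Keynote speaker at SRE NEXT 2020 IN TOKYO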
  3. SRE DIVERSITY
     2013: web operations / SRE
     2018: research on efficient observation of operational data [Y. Tsubouchi 2021], [Y. Tsubouchi 2022]
     2020 onward: AIOps research
     [Y. Tsubouchi 2021]: Yuuki Tsubouchi, Asato Wakisaka, Ken Hamada, Masayuki Matsuki, Takahiro Kobayashi, Hiroshi Abe, Ryosuke Matsumoto, HeteroTSDB: A High-Performance Time Series Database with Automatic Tiering across Heterogeneous Distributed KVSs (in Japanese), IPSJ Journal, Vol.62, No.3, pp.818-828, March 2021.
     [Y. Tsubouchi 2022]: Yuuki Tsubouchi, Masayoshi Furukawa, Ryosuke Matsumoto, Low Overhead TCP/UDP Socket-based Tracing for Discovering Network Services Dependencies, Journal of Information Processing, Vol.30, pp.260-268, 2022.
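  4. Rather than a systematic survey of AIOps research, this talk covers the trial and error I went through on problems I ran into in research practice, starting without data science experience.
     I cannot offer knowledge or techniques that you can use tomorrow in software development and operations, but I hope you take home something that gives a glimpse of the near future.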

  5. 1. SRE and AI

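  6. HAL 9000, "2001: A Space Odyssey"
     Quoted from Chapter 18, "Introduction to Machine Learning for SRE"
     "I've just picked up a fault in the AE35 unit. It's going to go 100% failure within 72 hours." (HAL 9000, "2001: A Space Odyssey")
     "It was Arthur C. Clarke who, with foresight, conceived the future this film depicts, combining AI with a fully automated service able to predict system and hardware failures hours in advance. HAL 9000 is humanity's dream (or nightmare) of an autonomous, self-regulating, flawless machine, serving both the spaceship's crew and the mission in order to achieve goals defined by humans."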
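  7. Tracing the origins of applying AI to the operation of information systems
     ・In the 1980s, the possibility of applying knowledge-based AI and neural-network-based AI to network management was already being discussed [Cebulka 1989]:
       ・Initial design of a network to support a specific service
       ・Tactical equipment planning between central offices
       ・Monitoring and diagnosis of messages from switches
     ・Since the early 1990s, several online failure prediction models for software and hardware have been proposed. Other failure prevention methods date from the same period [Notaro 2021].
     [Cebulka 1989]: Cebulka KD, et al., Applications of artificial intelligence for meeting network management challenges in the 1990s, IEEE GLOBECOM 1989.
     [Notaro 2021]: Notaro P, et al., A Survey of AIOps Methods for Failure Management. ACM TIST, 2021.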
  8. Contribution areas of AIOps today
     Figure reproduced from [Notaro '20] Fig. 2, "Taxonomy of AIOps as observed in the identified contributions". Two branches: research on failure management, and research on optimization such as resource allocation.
     [Notaro '20]: Notaro P, Cardoso J, Gerndt M. "A Systematic Mapping Study in AIOps." ICSOC. Springer, Cham, 2020.
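  9. Number of papers per AIOps research area
     ・Number of AIOps-related papers: 670
     ・62.1% of the 670 relate to Failure Management
     ・Failure prediction (26.4%), failure detection (33.7%), root cause analysis (26.7%)
     The number of papers is on an upward trend.
     [Notaro '20]: Notaro P, Cardoso J, Gerndt M. "A Systematic Mapping Study in AIOps." ICSOC. Springer, Cham, 2020.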
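  10. Which area of AIOps should we work on?
     Failure Management is more directly tied to reliability.
     Stages: prediction/prevention, detection, cause diagnosis, mitigation, root cause analysis, repair
     Detection is what people first associate with AIOps (?)
     But if system monitoring is mature, do failures really go unnoticed?
     Answering "what is the cause of the failure?" is the harder problem.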
  11. 2. A Concept for Automating the Root Cause Diagnosis of System Failures

  12. Challenges after a system failure is detected
     Alert storm:
       CRITICAL: front-end - http_request latency_95
       CRITICAL: user - latency_95
       CRITICAL: user-db_memory_usage
       CRITICAL: user-db_cpu_user_usage
       CRITICAL: orders - jvm_heap_memory_usage
       CRITICAL: orders-db - network_transmit_bytes
       CRITICAL: payment - http_request_error_5xx
       CRITICAL: front-end - http_request_latency_50
       CRITICAL: front-end - http_request_latency_90
       CRITICAL: user-db - cpu_system_usage
     Browsing a large number of metrics
     Both increase cognitive load.
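  13. Why does an alert storm occur?
     Because symptoms and causes are alerted at the same time.
     Symptom: what is broken? Cause: why is it broken?
     Symptoms and causes are relative to each other, so distinguishing them is actually hard:
       1. Symptom: HTTP 500 or 400 is being returned. Cause: the database server is refusing connections.
       2. Symptom: the database server is refusing connections. Cause: the database server's disk is full.
       3. Symptom: the database server's disk is full. Cause: the query log file size is growing rapidly.
       4. Symptom: the query log file size is growing rapidly. Cause: …
     Wouldn't it be enough to alert only on symptoms?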
  14. "Alert symptoms, diagnose causes"
     SLO-based alerting ※ [SRWbook 18]
     Alerts on symptoms of the service as a whole: Four Golden Signals (Latency, Traffic, Errors, Saturation) / RED (Rate, Errors, Duration)
     The alert triggers cause diagnosis by AI: automated analysis of operational data (metrics, logs, traces, events), with results delivered to SREs.
     [SRWbook 18]: Chapter 5 "Alerting on SLOs", Beyer B, et al., The Site Reliability Workbook: Practical ways to implement SRE. O'Reilly Media, Inc.; 2018.
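  15. Predecessors with similar ideas
     Causal inference triggered by detecting a violation of an SLO metric.
     CauseInfer (2014)
     ・Measures, per component, the TCP latency estimated from packet arrival times
     ・Generates a causal graph with the PC algorithm from the field of statistical causal discovery (described later)
     Figure partially reproduced from CauseInfer (2014) Fig. 2.
     Chen P, et al., Causeinfer: Automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems. IEEE INFOCOM 2014.
     Microscope (2018)
     Compared with CauseInfer, also considers dependencies that are not communication relationships.
     ・Generates a causal graph in the same way as CauseInfer
     ・In microservices, response time is measured at each service
     Figure reproduced from Microscope (2018) Fig. 2.
     Lin J, et al., Microscope: Pinpoint performance issues with causal graphs in micro-service environments. ICSOC, 2018.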
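  16. Predecessors with similar ideas: advanced edition
     Methods that output a ranking of paths in a causal graph
     ・AutoMAP (2020): composes individual causal graphs obtained per microservice from seven kinds of metrics, not only response time
     ・FluxInfer (2020): uses a weighted undirected dependency graph plus PageRank instead of the PC algorithm
     ・MicroCause (2020): an improvement of the PC algorithm that takes time series lags into account
     Methods that output a ranking of root-cause metrics
     ・PatternMatcher (2021): classifies anomaly patterns of metrics with a CNN
     ・FluxRank (2019): clusters similar time series, then ranks them with logistic regression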
  17. Further prior work
     List of root cause diagnosis papers: https://github.com/dreamhomes/RCAPapers
     Tsinghua University NETMAN LAB: https://netman.aiops.org/publications/
     International conference IEEE CLOUD: https://blog.yuuk.io/entry/2020/ieeecloud2020
     AIOps-related survey papers:
     ・Notaro P, Cardoso J, Gerndt M. A Survey of AIOps Methods for Failure Management. ACM Transactions on Intelligent Systems and Technology (TIST). 2021 Nov 30;12(6):1-45.
     ・Lyu Y, Rajbahadur GK, Lin D, Chen B, Jiang ZM. Towards a Consistent Interpretation of AIOps Models. ACM Transactions on Software Engineering and Methodology (TOSEM). 2021 Nov 15;31(1):1-38.
     ・Soldani J, Brogi A. Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey. ACM Computing Surveys (CSUR). 2022 Feb 3;55(3):1-39.
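  18. Looking for unsolved areas in prior work
     Underlying question:
     ・We have become able to collect a large number of metrics
     ・Even so, are we actually making use of those metrics?
     Issues in prior work:
     ・Issue 1: the kinds and number of input metrics must be specified in advance.
     ・Issue 2: for each specified variable, an appropriate function must be chosen and tuned individually.
     ・Issue 3: metrics alone are not enough to explain the estimated result.
     Idea: generate a causal graph with all metrics observed in the system as input.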
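  19. Research problem setting: preprocessing for root cause diagnosis
     Feeding all metrics into root cause diagnosis makes processing time long.
     No individual assumptions are placed on heterogeneous metrics.
     Flow: failure detection → fetch a fixed-width window of the most recent metrics → proposal: reduce the number of time series (offline analysis) → cause diagnosis (causal graph generation)
     Execution time on the order of minutes.
     Fast root cause diagnosis becomes possible without specifying metrics in advance.
     Yuuki Tsubouchi et al., TSifter: A Dimensionality Reduction Method for Time Series Data Suited to Rapid Diagnosis of Performance Anomalies in Microservices (in Japanese), Proceedings of the Internet and Operation Technology Symposium, 2020.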
  20. 3. Trial and Error in Time Series Analysis and Causal Graph Generation

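  21. Tracing the flow of an operator's cognitive processing
     (1) Find the time series that contain anomalies within the period → Phase 1: offline anomaly detection
     (2) Group the time series whose graphs have similar shapes → Phase 2: clustering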
  22. Preprocessing: Phase 1 (offline anomaly detection) → Phase 2 (shape clustering)
     Root cause diagnosis: causal graph generation

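  23. Phase 1: focusing on the anomalousness of univariate time series
     Example in which anomaly patterns of time series are classified into 13 patterns [PatternMatcher 2021]
     Analysis over a short period of about 30 minutes is sufficient, so seasonality and changes in the normal mode caused by releases need not be considered.
     The failure has already been detected, so offline anomaly detection is sufficient.
     [PatternMatcher 2021]: Wu C, Zhao N, Wang L, Yang X, Li S, Zhang M, Jin X, Wen X, Nie X, Zhang W, Sui K. Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems.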
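  24. A test for the difference between the distributions of two samples: the K-S test
     Tests whether sample X and sample Y arise from the distribution of the same population.
     The time series is split into a normal period and a test period, and the difference in distribution between the two periods is tested (the approach adopted in [PatternMatcher 2021]).
     Cases that did not work well:
     ・Time series containing salient outliers such as a short spike (p-values in the examples: 0.11, 0.51)
     ・A small number of outliers is not enough for the distributions to be regarded as different
     [PatternMatcher 2021]: Wu C, Zhao N, Wang L, Yang X, Li S, Zhang M, Jin X, Wen X, Nie X, Zhang W, Sui K. Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems.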
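The failure mode on this slide can be reproduced with a small sketch. The code below is an illustrative pure-Python two-sample K-S test, not the implementation used in the talk or in PatternMatcher; the synthetic series, the helper names, and the asymptotic p-value approximation are all assumptions made for the example.

```python
import math

def ks_statistic(x, y):
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d

def ks_pvalue(d, n, m):
    """Asymptotic p-value from the Kolmogorov distribution (an approximation)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:  # the series below converges poorly here; p is ~1 anyway
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam) for k in range(1, 101))
    return max(0.0, min(1.0, 2 * s))

# Hypothetical metric: a "normal period" and two candidate "test periods".
normal_period = [100.0 + (i % 5) for i in range(60)]
spike_period = list(normal_period)
spike_period[30] = 500.0                      # one short spike
shift_period = normal_period[:30] + [v + 50 for v in normal_period[30:]]  # level shift

p_spike = ks_pvalue(ks_statistic(normal_period, spike_period), 60, 60)
p_shift = ks_pvalue(ks_statistic(normal_period, shift_period), 60, 60)
print(f"spike: p={p_spike:.3f}  level shift: p={p_shift:.2e}")
```

A single spike moves the empirical CDF by only one sample, so the test does not reject, whereas a level shift affecting half the window is rejected decisively.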
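  25. Fitting the time series property of "stationarity" (skipped in the talk)
     ・Intuition: the characteristics of the time series do not change as time passes
     ・Definition: the mean and variance of the series are constant regardless of time, and the autocovariance depends only on the time difference
     ・A series that does not fit stationarity is regarded as anomalous
     Stationarity can be tested with the ADF test.
     Cases that did not work well:
     ・A short spike is judged to be stationary
     ・The ADF test checks whether the differenced series (first-order difference) satisfies stationarity
     ・A spike reverts to the mean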
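  26. Using methods that are widely known for anomaly detection
     Basic steps of statistical anomaly detection, assuming that the normal pattern can be expressed by a probability distribution:
     1. Estimate the distribution: assume a probability distribution with unknown parameters as the normal model, and estimate the unknown parameters from data
     2. Compute the anomaly score: define in advance a measure of deviation from normal, and compute it
     3. Set a threshold: judge anomalies by setting a threshold on the anomaly score
     Reference: Ide T., 入門 機械学習による異常検知 (Introduction to Anomaly Detection with Machine Learning), Section 7.3, Corona Publishing, 2015.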
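  27. Outlier detection: Hotelling's $T^2$ method
     Assumes that each point of the time series follows a normal distribution. Example: a metric of network transmitted bytes, converted to a frequency distribution.
     1. Distribution estimation, with unknown parameters $\mu$ and $\sigma^2$:
        $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
        Sample mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
        Sample variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)^2$
     2. Anomaly score: $a(x) = \left(\frac{x-\mu}{\sigma}\right)^2$
     3. Threshold: the anomaly score follows a chi-squared distribution with 1 degree of freedom, so the threshold can be chosen as a probability (0.01, 0.05, etc.)
     Weakness: sensitive to noise.
     Reference: Ide T., 入門 機械学習による異常検知 (Introduction to Anomaly Detection with Machine Learning), Section 7.3, Corona Publishing, 2015.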
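The three steps above can be sketched for a single metric as follows. This is a minimal illustration using only the Python standard library; the series is synthetic, and the function names are hypothetical. The chi-squared threshold for 1 degree of freedom is obtained from the squared normal quantile, which is an identity for that specific case.

```python
from statistics import NormalDist

def hotelling_scores(series):
    """Steps 1 and 2: estimate mu and sigma^2, then a(x) = ((x - mu) / sigma)^2."""
    n = len(series)
    mu = sum(series) / n
    var = sum((x - mu) ** 2 for x in series) / n  # assumes the series is not constant
    return [(x - mu) ** 2 / var for x in series]

def chi2_threshold_df1(p):
    """Step 3: upper-p quantile of chi-squared with 1 degree of freedom.

    If Z is standard normal, Z^2 is chi-squared(1), so the quantile is z_{1-p/2}^2.
    """
    z = NormalDist().inv_cdf(1 - p / 2)
    return z * z

# Hypothetical metric with one short spike at index 30.
series = [100.0 + (i % 5) for i in range(60)]
series[30] = 500.0

scores = hotelling_scores(series)
threshold = chi2_threshold_df1(0.01)
anomalies = [i for i, a in enumerate(scores) if a > threshold]
print(anomalies, round(threshold, 3))
```

Note that the spike itself inflates the estimated variance, which is one reason this kind of pointwise score can behave poorly on noisy series.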
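  28. Anomaly detection with an autoregressive model (AR model)
     ・Autoregressive model: a model that regresses the value at time $t$ on the values before time $t$
     ・Intuition: assume that the future can be predicted from past values
     Given a time series $y_1, \ldots, y_N$:
        $y_n = \sum_{t=1}^{r} a_t\, y_{n-t} + v_n$
     where $r$ is the lag order, $a_t$ are the autoregressive coefficients, and $v_n$ is white noise following a normal distribution with mean 0 and variance $\sigma^2$.
     ・Decide $r$, then estimate the coefficients $a_t$ and the variance $\sigma^2$ from the observed values
     ・$r$ is commonly decided by the Akaike information criterion (AIC)
     ・Approximate values of $a_t$ and $\sigma^2$ are obtained by maximum likelihood estimation via least squares
     Anomaly score = prediction error (predicted value minus observed value)
     Figure: observed values (blue), predicted values (orange), and the anomaly score.
     Reference: Ide T., 入門 機械学習による異常検知 (Introduction to Anomaly Detection with Machine Learning), Section 7.3, Corona Publishing, 2015.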
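As a rough sketch of the AR-based score, the code below fits the coefficients by least squares on a training window and scores later points by squared prediction error. It is an assumption-laden illustration: the lag order is fixed by hand rather than chosen by AIC, there is no intercept term, the series is a synthetic sine wave, and the threshold of 1.0 is arbitrary.

```python
import math

def fit_ar(y, r):
    """Least-squares estimate of AR(r) coefficients a_1..a_r (no intercept)."""
    rows = [[y[n - t] for t in range(1, r + 1)] for n in range(r, len(y))]
    targets = [y[n] for n in range(r, len(y))]
    # Normal equations (X^T X) a = X^T y, solved by Gaussian elimination.
    A = [[sum(row[i] * row[j] for row in rows) for j in range(r)] for i in range(r)]
    b = [sum(row[i] * t for row, t in zip(rows, targets)) for i in range(r)]
    for col in range(r):
        pivot = max(range(col, r), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for k in range(col + 1, r):
            f = A[k][col] / A[col][col]
            for j in range(col, r):
                A[k][j] -= f * A[col][j]
            b[k] -= f * b[col]
    coef = [0.0] * r
    for i in reversed(range(r)):
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, r))) / A[i][i]
    return coef

def anomaly_scores(y, coef):
    """Squared one-step prediction error for n = r .. len(y) - 1."""
    r = len(coef)
    scores = []
    for n in range(r, len(y)):
        pred = sum(coef[t - 1] * y[n - t] for t in range(1, r + 1))
        scores.append((y[n] - pred) ** 2)
    return scores

# A sine wave satisfies an exact AR(2) relation: y_n = 2*cos(w)*y_{n-1} - y_{n-2}.
clean = [math.sin(0.3 * n) for n in range(100)]
coef = fit_ar(clean[:50], 2)          # estimate on the training period only

observed = list(clean)
observed[80] += 5.0                   # inject a short spike in the validation period
scores = anomaly_scores(observed, coef)
flagged = [k + 2 for k, s in enumerate(scores) if s > 1.0]  # k + r maps back to n
print(coef, flagged)
```

The spike at n = 80 also contaminates the predictions for the next two points, because it enters the lagged inputs; this is worth keeping in mind when reading pointwise AR scores.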
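  29. Autoregressive model: in-sample prediction and out-of-sample prediction
     In-sample prediction error: the unknown parameters (coefficients $a_t$) are estimated on all of the observed data.
     Out-of-sample prediction error: the data is split into a training period and a validation period (1:1 in the figure); the unknown parameters are estimated from the training data and the prediction error against the validation data is computed.
     Results noted on the figures:
     ・○ Salient outliers
     ・○ White noise
     ・✗ Overfits to a level shift, so the error is small
     ・✗ False detections caused by reacting to noise (the same as Hotelling...?)
     ・○ Level shift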
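  30. What to do next (unresolved at the time of recording)
     It is hard to capture both short-lived anomalies and other kinds of anomalies well.
     ・Rather than looking at pointwise outliers, adopt an anomaly score that looks at accumulated error, such as MSE (mean squared prediction error)
     ・Combine multiple methods
     Figure reproduced from [PatternMatcher 2021] Fig. 2.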
  31. Phase 2: shape clustering (skipped in the talk)
     Preprocessing: Phase 1 (offline anomaly detection) → Phase 2 (shape clustering)
     Root cause diagnosis: causal graph generation
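  32. Phase 2: grouping time series with similar shapes (skipped in the talk)
     ・In machine learning, this task is called "clustering"
     ・How similar are individual data items? (similarity and distance measures)
     ・By what procedure are clusters discovered?
     Example: metrics for network send/receive bandwidth and packet counts were grouped into one cluster.
     Choosing a cluster representative: the time series nearest to the centroid, or the time series with the largest integral.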
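  33. Scope of clustering (skipped in the talk)
     Cluster within the same component (figure: Proxy, App, DB; Cluster 1, Cluster 2).
     ・Clustering metrics across components loses nodes that the causal graph needs
     ・For example, cases where a symptom metric and a cause metric happen to have similar shapes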
  34. The world of time series clustering (skipped in the talk)
     Focus: whole time series clustering, with a shape-based approach.
     When a human looks at time series graphs and judges them similar, the judgment is presumably based on shape.
     Figures reproduced from [Paparrizos 15] Fig. 1, "Time-series clustering taxonomy", and Fig. 2, "The time-series clustering approaches".
     [Paparrizos 15]: Paparrizos J, Gravano L. k-Shape: Efficient and accurate clustering of time series. SIGMOD 2015.
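  35. Shape similarity in time series clustering (skipped in the talk)
     ・Scaling invariance: are the series similar when stretched or compressed along the vertical axis? Achieved with the z-score transformation (transform to mean 0, standard deviation 1).
     ・Shifting invariance: are the series similar when shifted along the horizontal axis?
     ・ED (Euclidean distance): does not account for shifts along the horizontal axis.
     ・DTW (dynamic time warping, the mainstream): computes the distances between all pairs of points of the two time series, so the computational cost is high.
     Figures partially excerpted from [k-Shape 2015] Figure 1 and Figure 2, "Similarity computation".
     [k-Shape 2015]: Paparrizos J, Gravano L. k-Shape: Efficient and accurate clustering of time series. SIGMOD 2015.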
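A small sketch makes the two invariances concrete. The series and helper names below are made up for illustration: z-score normalization removes differences in scale and offset, but the Euclidean distance still treats a horizontally shifted copy as far away.

```python
import math

def z_normalize(x):
    """Rescale to mean 0 and standard deviation 1 (assumes x is not constant)."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd for v in x]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

base = [0, 0, 1, 3, 1, 0, 0, 0]
scaled = [10 + 5 * v for v in base]      # same shape, different scale and offset
shifted = [0, 0, 0, 0, 1, 3, 1, 0]       # same shape, moved two steps to the right

d_scaled = euclidean(z_normalize(base), z_normalize(scaled))
d_shifted = euclidean(z_normalize(base), z_normalize(shifted))
print(d_scaled, d_shifted)
```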
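  36. SBD (Shape-Based Distance) (skipped in the talk)
     Based on cross-correlation.
     Inner product of vectors: takes a large value when the angle between the vectors is small (when $\theta$ is close to 0):
        $x \cdot y = |x||y|\cos\theta$
     Regard one time series as one vector: $x = (x_1, \ldots, x_m)$, $y = (y_1, \ldots, y_m)$.
     Compute the inner product while sliding the vectors against each other, and find the shift $w$ that maximizes it:
        $CC_w(x, y) = R_{w-m}(x, y)$
        $R_k(x, y) = \begin{cases} \sum_{l=1}^{m-k} x_{l+k} \cdot y_l, & k \ge 0 \\ R_{-k}(y, x), & k < 0 \end{cases}$
     $CC_w(x, y)$ is the similarity with $x$ when $y$ is translated to the right $k = w - m$ times.
     $CC_w(x, y)$ is computed for $w \in \{1, 2, \ldots, 2m-1\}$.
     The cost of computing $CC_w(x, y)$ goes from $O(m^2)$ to $O(m \log m)$ with the fast Fourier transform.
     Paparrizos J, Gravano L. k-Shape: Efficient and accurate clustering of time series. SIGMOD 2015.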
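The definitions above translate fairly directly into code. The sketch below uses the naive $O(m^2)$ sliding inner product rather than the FFT, and normalizes the maximum cross-correlation by $\sqrt{R_0(x,x)\,R_0(y,y)}$ as in the k-Shape paper, so that the distance is 0 for identical shapes. The example series are invented, and z-normalization (normally applied first) is omitted to keep the zero baseline.

```python
import math

def cross_correlation(x, y, k):
    """R_k(x, y): sum of x[l + k] * y[l] for k >= 0, and R_{-k}(y, x) for k < 0."""
    if k < 0:
        return cross_correlation(y, x, -k)
    return sum(x[l + k] * y[l] for l in range(len(x) - k))

def sbd(x, y):
    """Shape-based distance: 1 - max over shifts of the normalized cross-correlation."""
    m = len(x)
    norm = math.sqrt(cross_correlation(x, x, 0) * cross_correlation(y, y, 0))
    best = max(cross_correlation(x, y, k) for k in range(-(m - 1), m))
    return 1.0 - best / norm

base = [0, 0, 1, 3, 1, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 3, 1, 0]   # same bump, two steps later
step = [0, 0, 0, 0, 1, 1, 1, 1]      # a different shape

print(sbd(base, shifted), sbd(base, step))
```

Unlike the Euclidean distance, SBD treats the shifted bump as identical to the original, while the step-shaped series remains clearly distant.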
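  37. Considering the clustering procedure (skipped in the talk)
     Clustering
     ・Hierarchical clustering
       ・Agglomerative: start with each data item as its own cluster and merge clusters step by step (high computational cost)
       ・Divisive: start with the whole data set as one cluster and split clusters step by step (high computational cost)
     ・Partitional (optimization-based) clustering: define a function that expresses the quality of clusters, and find the clusters that optimize it
       ・The number of clusters must be decided in advance
       ・The well-known k-means belongs here
       ・Low computational cost
     Reference: Toshihiro Kamishima, Clustering, https://www.kamishima.net/archive/clustering.pdf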
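  38. Speed of hierarchical clustering (skipped in the talk)
     ・Partitional clustering: must be run repeatedly to determine the optimal number of clusters
     ・Hierarchical clustering (single linkage): the number of clusters is determined by setting a distance threshold in advance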
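Cutting a single-linkage dendrogram at a distance threshold is equivalent to taking the connected components of the graph that links every pair closer than the threshold, which gives a compact sketch. The function below is an illustration with hypothetical names; the `dist` argument could be a shape distance such as SBD, but plain scalars and absolute difference are used here to keep the example small.

```python
def single_linkage_clusters(items, dist, threshold):
    """Group items so that any two items joined by a chain of pairs with
    distance <= threshold end up in the same cluster (single linkage cut)."""
    parent = list(range(len(items)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if dist(items[i], items[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

points = [0.0, 0.1, 0.25, 5.0, 5.05]
clusters = single_linkage_clusters(points, lambda a, b: abs(a - b), threshold=0.2)
print(clusters)
```

The number of clusters falls out of the threshold in a single pass, with no need to rerun the algorithm for different cluster counts.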

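  39. Phase 2: clustering summary (skipped in the talk)
     ・Group time series with similar shapes
     ・Adopt the cross-correlation-based SBD as the distance measure
     ・For speed, adopt hierarchical clustering with single linkage
     ・No open issues at this point, supposedly...
     I noticed that [FluxRank 2019] adopts time series clustering based on density-based clustering plus Pearson correlation, so a comparison is needed.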
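  40. Root cause diagnosis: causal graph generation
     Preprocessing: Phase 1 (offline anomaly detection) → Phase 2 (shape clustering)
     Root cause diagnosis: causal graph generation
     Goal: generate a causal graph and check it.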
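  41. Applying causal discovery to root cause diagnosis of metrics
     Input: M time series, each with N samples
     ・M random variables
     ・N samples per random variable
     Output: a DAG (directed acyclic graph)
     Example nodes: front-end:latency (symptom metric), user:latency, orders:latency, user:cpu_usage, user-db:cpu_usage, orders:network_transmit_bytes, orders-db:network_receive_bytes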
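  42. The world of statistical causal discovery
     Estimating causal relationships from observational data.
     Nonparametric (no assumptions placed):
     ・Constraint-based: build a causal graph using conditional independence between variables as constraints
       ・PC algorithm (1991)
       ・FCI algorithm (1995): accounts for unobserved common causes
     ・Score-based: evaluate the goodness of the model for each Markov equivalence class, that is, each set of causal graphs that gives the same conditional independences
       ・GES algorithm (2003)
     Parametric and semiparametric:
     ・Structural-equation-based: with cause X and effect Y, express the structure as equations; places assumptions on the functional form and on the distribution of the error variables
       ・LiNGAM (2006)
     Glymour C, et al., Review of causal discovery methods based on graphical models. Frontiers in genetics. 2019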
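  43. PC algorithm
     1. Initialization: build a complete graph over the M random variables.
     2. Edge removal: for each pair of adjacent variables, if conditional independence holds, remove the edge between the two variables.
     3. Rule-based orientation (v-structures, orientation rules).
     Independence: estimated by, for example, looking at the correlation between variables.
     Conditional independence: the independence of two variables when conditioned on some other variable.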
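Steps 1 and 2 can be sketched in a few lines. The code below is a simplified illustration and not the implementation used in the talk: it only tries conditioning sets of size 0 and 1, it uses a fixed threshold on the (partial) correlation instead of a proper statistical test such as Fisher's z, and it stops before the orientation step. The data is a synthetic chain X → Y → Z built from mutually orthogonal sine components.

```python
import math
from itertools import combinations

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def partial_corr(a, b, c):
    """First-order partial correlation of a and b given c."""
    rab, rac, rbc = corr(a, b), corr(a, c), corr(b, c)
    return (rab - rac * rbc) / math.sqrt((1 - rac ** 2) * (1 - rbc ** 2))

def pc_skeleton(data, threshold=0.1):
    """Edge-removal phase only, with conditioning sets of size 0 and 1."""
    names = sorted(data)
    adj = {v: set(names) - {v} for v in names}      # step 1: complete graph
    sepset = {}
    for x, y in combinations(names, 2):             # step 2, order 0
        if abs(corr(data[x], data[y])) < threshold:
            adj[x].discard(y); adj[y].discard(x)
            sepset[(x, y)] = ()
    for x, y in combinations(names, 2):             # step 2, order 1
        if y not in adj[x]:
            continue
        for z in sorted((adj[x] | adj[y]) - {x, y}):
            if abs(partial_corr(data[x], data[y], data[z])) < threshold:
                adj[x].discard(y); adj[y].discard(x)
                sepset[(x, y)] = (z,)
                break
    return adj, sepset

N = 200
wave = lambda f: [math.sin(2 * math.pi * f * i / N) for i in range(N)]
X = wave(3)
Y = [a + b for a, b in zip(X, wave(7))]     # Y depends on X
Z = [a + b for a, b in zip(Y, wave(11))]    # Z depends on Y only

adj, sepset = pc_skeleton({"X": X, "Y": Y, "Z": Z})
print(adj, sepset)
```

The edge between X and Z is removed because the two become uncorrelated once Y is conditioned on, leaving the chain X - Y - Z as the undirected skeleton.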
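  44. Original extension of the PC algorithm using prior knowledge (skipped in the talk)
     1. Initialization: a complete graph has many edges, so edge removal takes a long time.
        (1) Unconditionally remove edges between variables that do not communicate directly over the network.
     2. Edge removal
     3. Rule-based orientation: the determined direction is sometimes wrong.
        (2) Determine or correct the direction based on collected communication data.
            L4 communication: direction in which the TCP/UDP connection is initiated, lagged correlation of packets
            L7 communication: direction of HTTP requests
            …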
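Extension (1) amounts to replacing the complete graph with a sparser starting graph. The following sketch shows one way this could look; the `service:metric` naming follows the metric names on the slides, while the call list, the function name, and the choice to keep edges between metrics of the same component are assumptions made for illustration.

```python
from itertools import combinations

def initial_edges(metrics, calls):
    """Candidate undirected edges: metrics of the same component, or of two
    components that communicate directly. Everything else is dropped up front."""
    linked = {frozenset(pair) for pair in calls}
    edges = set()
    for a, b in combinations(sorted(metrics), 2):
        comp_a, comp_b = a.split(":")[0], b.split(":")[0]
        if comp_a == comp_b or frozenset((comp_a, comp_b)) in linked:
            edges.add((a, b))
    return edges

metrics = [
    "front-end:latency", "user:latency", "orders:latency", "user:cpu_usage",
    "user-db:cpu_usage", "orders:network_transmit_bytes",
    "orders-db:network_receive_bytes",
]
# Hypothetical communication pairs, e.g. as discovered by network tracing.
calls = [("front-end", "user"), ("front-end", "orders"),
         ("user", "user-db"), ("orders", "orders-db")]

edges = initial_edges(metrics, calls)
complete = len(metrics) * (len(metrics) - 1) // 2
print(f"{len(edges)} candidate edges instead of {complete}")
```

Fewer starting edges means fewer conditional independence tests in the edge-removal phase, which is where the processing time goes.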
  45. Message
     Example of a generated causal graph

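  46. Issues with the conditional independence test of the PC algorithm
     The path between the cause metric and the symptom metric gets cut.
     Graph nodes: front-end:latency (symptom metric), user:latency, orders:latency, user:memory_usage (cause metric), user-db:cpu_usage, orders:network_receive_bytes, orders-db:network_transmit_bytes
     Investigation:
     ・The conditional independence test removes the influence of the conditioning variable and then looks at the correlation of the two variables (partial correlation)
     ・When there are many series with similar fluctuations at the time of a failure, edges appear to be cut by mistake more easily
     Example (front-end:latency, orders-db:network_transmit_bytes, user:memory_usage, user:latency): the edges among three variables that show similar fluctuations are cut for some reason (✗).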
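One possible mechanism behind this observation can be illustrated numerically. This is a hypothetical construction, not an analysis of the talk's data: three series share the same failure-time fluctuation, and the conditioning variable follows that shared pattern more cleanly than the other two. The pairwise correlation stays high, yet the partial correlation drops below a typical cut-off.

```python
import math

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def partial_corr(a, b, c):
    """Correlation of a and b after removing the linear influence of c."""
    rab, rac, rbc = corr(a, b), corr(a, c), corr(b, c)
    return (rab - rac * rbc) / math.sqrt((1 - rac ** 2) * (1 - rbc ** 2))

N = 200
wave = lambda f: [math.sin(2 * math.pi * f * i / N) for i in range(N)]
common = wave(2)                                  # shared failure-time fluctuation

# Synthetic stand-ins for three metrics that move together during a failure.
x = [c + 0.30 * e for c, e in zip(common, wave(7))]
y = [c + 0.30 * e for c, e in zip(common, wave(11))]
z = [c + 0.05 * e for c, e in zip(common, wave(13))]   # tracks the pattern closely

r_xy = corr(x, y)
r_xy_given_z = partial_corr(x, y, z)
print(f"corr(x, y) = {r_xy:.3f}, partial corr given z = {r_xy_given_z:.3f}")
```

With a cut-off such as 0.1, the x-y edge would be removed even though the two series are strongly correlated, because almost all of their shared variation is also explained by z.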
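  47. How should the issues of the PC algorithm be solved? (unresolved at the time of recording)
     Remove edges using only independence tests between two variables
     ・There are many strongly correlated variables, so the causal graph becomes huge
     ・Focus on the "starting point of change", cluster by the position of the starting point, and build a causal graph per cluster
     Take the choice of conditioning variables into account
     ・Exclude variables with a similar change trend from the candidates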

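  48. 3. Summary: organizing the individual topics and the academic contribution
     ・No "knowledge" that runs through all the individual topics has been found yet
     ・Aim to propose a preprocessing framework suited to root cause diagnosis
     Preprocessing: Phase 1 (offline anomaly detection) → Phase 2 (shape clustering)
     Root cause diagnosis: causal graph generation (the PC algorithm has limitations)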
  49. 4. Summary

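  50. Summary
     1. AIOps research falls into two areas: failure management and resource provisioning
     2. The idea of "Alert symptoms, diagnose causes" and the prior work
     3. Trial and error aimed at building a preprocessing framework for root cause diagnosis
     4. Causal graph generation and the issues of traditional methods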
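  51. What turned out to be hard in practice
     ・Knowledge of one specific area of data science is not enough on its own, so there is a lot to learn
     ・No books on the AIOps field exist, even in the English-speaking world, so it is hard to build a mental map of the field
     ・It is hard to improve model performance without arbitrary parameter settings
     ・Without visualization, you cannot tell whether the results are correct
     ・Real-environment data usable for experiments is hard to obtain
     ・In Japan, few people work on AIOps, so it is hard to climb the mountain together while sharing information in a community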
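  52. What I did not talk about
     Meltria: a system for dynamically generating datasets
     ・Kubernetes
     ・Sock Shop: microservice application
     ・LitmusChaos: fault injection
     ・Argo Workflows: scheduler
     https://github.com/ai4sre/meltria
     Yuuki Tsubouchi, Masaya Aoyama, Meltria: A Dynamic Dataset Generation System for Anomaly Detection and Root Cause Analysis in Microservices (in Japanese), Proceedings of the Internet and Operation Technology Symposium, 2021, pp.63-70, November 2021.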
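  53. Future outlook
     I would like to foster an AIOps community around failure management.
     ・Sharing incidents and postmortems
       ・What the root-cause metrics and logs were
       ・In what order the data was examined up to the provisional recovery from the failure
     ・Sharing operational data such as metrics
       ・Even if sharing all data is difficult, at least the data of specific metrics
     ・Fault injection and load testing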

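  54. To adapt to the operation of complex software, human cognitive processing is automated by AI.
     At the same time, we end up newly operating AI, software with a different kind of complexity.
     How humanity will resolve this grand case of the Ironies of Automation remains unknown.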