Meltria: A System for Dynamically Generating Datasets for Anomaly Detection and Root Cause Analysis in Microservices / Meltria in IOTS2021

https://www.iot.ipsj.or.jp/symposium/iots2021-program/
(9) Meltria: A System for Dynamically Generating Datasets for Anomaly Detection and Root Cause Analysis in Microservices
◎坪内佑樹 (SAKURA internet, Kyoto University), 青山真也 (SAKURA internet)

Yuuki Tsubouchi (yuuk1)

November 26, 2021

Transcript

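  1. (p. 2) Growing complexity of cloud applications and AIOps
     Monolithic architecture: developers find it hard to change the code.
     Microservice architecture: the application is distributed by unit of function [Soldani 21].
     ‣ Higher frequency of change
     ‣ More complex dependencies
     ‣ Larger volume of monitoring data
     Incident response that relies on operators' experience and intuition becomes harder as cognitive load grows. Statistical analysis and machine learning are used to address this (AIOps).
     [Soldani 21]: Soldani, J. and Brogi, A., Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey, arXiv preprint 2021.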
  2. (p. 3) Challenges with operational datasets for evaluation
     ・Training and evaluating AI models requires datasets that contain anomalies.
     ・Companies are reluctant to publish operational data for privacy and security reasons.
     Public datasets [Loghub 20] [Exathlon 20]
     ・They contain only a limited set of anomaly patterns.
     ・Models may overfit to a particular dataset.
     [LogAD 21]
     ・Avoiding overfitting requires training and evaluation on a rich variety of anomaly patterns.
     ・Building, in advance, a dataset that covers every anomaly pattern is difficult.
     [Loghub 20]: He, Shilin, et al. "Loghub: A large collection of system log datasets towards automated log analytics." arXiv preprint 2020.
     [Exathlon 20]: Jacob, Vincent, et al. "Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series." arXiv preprint 2020.
     [LogAD 21]: Zhao, Nengwen, et al. "An empirical investigation of practical log anomaly detection for online service systems." ACM ESEC/FSE. 2021.
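  3. (p. 4) Proposal: a system that generates datasets dynamically
     Input: conditions for reproducing anomalies, data measurement conditions. Output of the generation system: a dataset.
     Dynamic
     ・The anomaly patterns and the data measurement conditions can be varied, which helps to find and avoid overfitting.
     ・Training and evaluating AI models requires context (data labels) such as "whether an anomaly is present and where".
     ・The effort of labeling on every generation should be reduced => automation is proposed.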
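  4. (p. 6) Setting the scope of the operational dataset
     A survey of 11 papers on anomaly detection and root cause analysis for microservices.
     Type of operational data: metrics (numeric time series), logs, traces (request execution paths)
     Target application: Sock Shop (small, 8 services), Train Ticket (medium, 41 services), Pymicro (simulator, 16 services)
     Paper counts, in transcript order: 7, 5, 3, 2, 2, 5
     Type of injected fault: faults in computing resources, success/failure and latency of inter-service requests, microservice-specific faults
     Paper counts, in transcript order: 7, 2, 1
     The scope is set to the most common case on each axis: metrics, faults in computing resources, and Sock Shop.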
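  5. (p. 7) Comparison with related technology
     Chaos Engineering: a discipline of experimentation for gaining confidence that a distributed system can withstand unexpected conditions [Basiri 16].
     Tools for practicing Chaos Engineering: LitmusChaos, Chaos Mesh
     ‣ Fault injection: faults in computing resources are supported
     ‣ Excessive use of CPU, memory, and disk, packet loss, and so on
     ‣ Scheduled execution of fault injection
     Features that Chaos Engineering itself does not need, such as managing and labeling operational data, are not included.
     [Basiri 16]: Basiri, A., et al., Chaos Engineering, IEEE Software, 2016.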
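  6. (p. 8) Design of the dynamic dataset generation system
     Input: fault injection conditions, data measurement conditions. The generation workflow is followed by labeling. Output: a dataset.
     First design criterion: scheduling of fault injection that includes data management
     ‣ Extends Chaos Engineering
     ‣ No overlapping fault injections in the same time window
     ‣ Data can be tied to the single injection it corresponds to
     ‣ Contains both normal and anomalous data
     Second design criterion: automated labeling. Each series in the dataset is labeled with:
     ‣ Whether the fault injection had an effect
     ‣ Whether an unexpected anomaly is present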
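  7. (p. 9) First design criterion: scheduling requirements for fault injection
     Diagram: a timeline divided into time slots for Component A, Component B, and Component C, marking the span where Fault α is injected, the span where Fault β is injected, and the recovery time of Component A.
     ‣ The time slot is the basic unit of data.
     ‣ One fault per slot.
     ‣ Faults are injected once for each component × fault combination.
     ‣ Normal-state data is also needed, so the system waits between injections.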
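The slot-based scheduling rule above can be sketched as follows. This is a minimal illustration, not Meltria's actual scheduler: the `Slot` class, the `build_schedule` function, the component and fault names, and the 30-minute slot length in the example are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import product


@dataclass(frozen=True)
class Slot:
    """One time slot: exactly one fault injected into one component (hypothetical type)."""
    component: str
    fault: str
    start: datetime
    end: datetime


def build_schedule(components, faults, start, slot_length):
    """Give every (component, fault) pair its own non-overlapping slot."""
    slots = []
    for i, (component, fault) in enumerate(product(components, faults)):
        slot_start = start + i * slot_length
        slots.append(Slot(component, fault, slot_start, slot_start + slot_length))
    return slots


# Example inputs are placeholders, not values taken from the slides.
schedule = build_schedule(
    components=["A", "B", "C"],
    faults=["alpha", "beta"],
    start=datetime(2021, 11, 26, 0, 0),
    slot_length=timedelta(minutes=30),
)
```

Because slots are laid end to end, no two injections share a time window, so any sample can be traced back to a single injection by its timestamp.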
  8. (p. 10) First design criterion: system configuration
     Components: Workflow Scheduler, Load Generator, Target Application, Operational Data Storage, Datasets Repository
     ① Inject faults.
     ② Pick the latest data (the data of the slot) into the datasets.
     ③ Wait until the application recovers, that is, until the next injection time.
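The three-step loop of the workflow scheduler can be sketched like this. It is a hedged illustration only: `run_workflow` and its `inject`, `collect`, and `sleep` parameters are hypothetical callables standing in for the chaos tool, the operational data storage, and the wait, and do not reflect Meltria's real interfaces.

```python
import time


def run_workflow(plan, inject, collect, wait_seconds, sleep=time.sleep):
    """Run inject -> collect -> wait once per planned (component, fault) pair."""
    datasets = []
    for component, fault in plan:
        inject(component, fault)            # step 1: inject the fault
        data = collect(component, fault)    # step 2: pick the slot's latest data
        datasets.append({"component": component, "fault": fault, "data": data})
        sleep(wait_seconds)                 # step 3: wait for recovery / next slot
    return datasets


# Dry run with fake callables so the control flow can be checked without a cluster.
log = []
result = run_workflow(
    plan=[("A", "alpha"), ("B", "beta")],
    inject=lambda c, f: log.append(("inject", c, f)),
    collect=lambda c, f: [0.1, 0.2],
    wait_seconds=0,
    sleep=lambda s: log.append(("wait", s)),
)
```

Passing the side-effecting steps in as callables keeps the loop testable in isolation, which is a design choice of this sketch rather than something stated on the slide.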
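  9. (p. 11) Second design criterion: labeling the dataset
     (1) Select the series to label. Only the propagation path of the anomaly is labeled.
         s1: application-level metric
         s2: metric at the level of the microservice where the fault was injected
         s3: metric most closely related to the injected fault
     (2) Classify each series into three classes by the presence and position of an anomaly.
         NOT_FOUND
         FOUND_INSIDE_ANOMALY (the expected state)
         FOUND_OUTSIDE_ANOMALY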
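One way the three-class labeling could work is sketched below. The slide names the classes but not the detection rule, so everything else here is an assumption: the three-sigma threshold, the use of the first `n_baseline` samples as the normal baseline, and the reading that any outlier outside the injected span yields FOUND_OUTSIDE_ANOMALY. The function name `classify_series` is hypothetical.

```python
from statistics import mean, stdev


def classify_series(values, inject_start, inject_end, n_baseline, k=3.0):
    """Label one series by whether and where k-sigma outliers appear.

    values       -- samples of one metric over a slot
    inject_start -- index of the first sample inside the injected span
    inject_end   -- index one past the last sample inside the injected span
    n_baseline   -- number of leading samples assumed to be normal
    """
    baseline = values[:n_baseline]
    mu, sigma = mean(baseline), stdev(baseline)
    outliers = [
        i for i, v in enumerate(values)
        if i >= n_baseline and abs(v - mu) > k * sigma
    ]
    if not outliers:
        return "NOT_FOUND"
    if all(inject_start <= i < inject_end for i in outliers):
        return "FOUND_INSIDE_ANOMALY"
    return "FOUND_OUTSIDE_ANOMALY"


# Synthetic example data, not measurements from the experiment.
normal = [10.0, 10.5, 9.5, 10.0, 10.2, 9.8]
inside = classify_series(normal + [10, 30, 32, 31, 10, 10], 7, 10, n_baseline=6)
flat = classify_series(normal + [10, 10, 10, 10, 10, 10], 7, 10, n_baseline=6)
outside = classify_series(normal + [10, 30, 32, 31, 10, 40], 7, 10, n_baseline=6)
```

A threshold of this kind is insensitive to faults that shift a metric only slightly, so series with small variation may be labeled NOT_FOUND even when the injection had an effect.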
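  10. (p. 13) Setup of the evaluation experiment
     Data measurement conditions
     ・Data is sampled every 15 seconds
     ・Each slot lasts 30 minutes
     90 fault injections in total
     ・8 kinds of containers in Sock Shop
     ・CPU overuse and memory leak
     ・5 injections each
     Series selection
     ・s1: average response time of the front-end container
     ・s2: average response time of the fault-injected service
     ・s3: fault-related metrics (user-space CPU usage, memory usage)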
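The measurement conditions imply a fixed amount of data per slot, which the short calculation below works out. It assumes one slot per injection and slots running back to back; the variable names are illustrative.

```python
from datetime import timedelta

sampling_interval = timedelta(seconds=15)   # from the slide
slot_length = timedelta(minutes=30)         # from the slide
injections = 90                             # from the slide

# Dividing one timedelta by another with // yields an int.
samples_per_slot = slot_length // sampling_interval

# Assumption: one slot per injection, with no gaps between slots.
total_duration = injections * slot_length
total_hours = total_duration.total_seconds() / 3600

print(samples_per_slot, total_hours)
```

Under these assumptions each series contributes 120 samples per slot, and the whole run takes 45 hours.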
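  11. (p. 17) Summary
     Meltria, a system that generates datasets dynamically, is proposed so that overfitting can be avoided through a rich variety of anomaly patterns.
     ① Scheduling of fault injection "that includes data management"
     ② Automated labeling of "the presence and position of anomalies" using the rule of thumb for the normal distribution
     The implementation is published on GitHub: https://github.com/ai4sre/meltria
     Experiment with 90 fault injections
     ① There were 2 kinds of cases in which the expected anomaly did not occur. They can be handled by adding operation guarantees to Meltria.
     ② Labeling accuracy was 85%. Misclassification was caused by cases where the variation was small.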