Meltria: A System for Dynamically Generating Datasets
for Anomaly Detection and Root Cause Analysis
in Microservices / Meltria in IOTS2021

https://www.iot.ipsj.or.jp/symposium/iots2021-program/
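(9) Meltria: A System for Dynamically Generating Datasets for Anomaly Detection and Root Cause Analysis in Microservices
◎Yuuki Tsubouchi (SAKURA internet, Kyoto University), Masaya Aoyama (SAKURA internet)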

Yuuki Tsubouchi (yuuk1)

November 26, 2021

Transcript

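  1. Meltria: A System for Dynamically Generating Datasets for Anomaly Detection and Root Cause Analysis in Microservices
     Yuuki Tsubouchi (SAKURA internet, Kyoto University)
     Masaya Aoyama (SAKURA internet)
     IPSJ 14th Internet and Operation Technology Symposium (IOTS2021), November 26, 2021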
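  2. Growing complexity of cloud applications and AIOps
     Monolithic architecture: code changes are difficult for developers.
     Microservice architecture: the application is distributed by function. [Soldani 21]
     ‣ Higher change frequency
     ‣ More complex dependencies
     ‣ Larger volume of monitoring data
     The resulting cognitive load makes incident response that relies on operators' experience and intuition harder.
     Solve this with statistical analysis and machine learning (AIOps).
     [Soldani 21]: Soldani, J. and Brogi, A., Anomaly Detection and Failure Root Cause Analysis in (Micro)Service-Based Cloud Applications: A Survey, arXiv preprint 2021.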
  3. Challenges with operational datasets for evaluation
     ・Training and evaluating AI models requires datasets that contain anomalies.
     ・Companies are reluctant to publish operational data for privacy and security reasons.
     Public datasets [Loghub 20] [Exathlon 20]
     ・They contain only a limited set of anomaly patterns.
     ・Models risk overfitting to a specific dataset.
     [LogAD 21]
     ・Avoiding overfitting requires training and evaluation on a rich variety of anomaly patterns.
     ・Preparing in advance a dataset that covers every anomaly pattern is difficult.
     [Loghub 20]: He, Shilin, et al. "Loghub: A large collection of system log datasets towards automated log analytics." arXiv preprint 2020.
     [Exathlon 20]: Jacob, Vincent, et al. "Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series." arXiv preprint 2020.
     [LogAD 21]: Zhao, Nengwen, et al. "An empirical investigation of practical log anomaly detection for online service systems." ACM ESEC/FSE. 2021.
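  4. Proposal: a system that generates datasets dynamically
     Dynamic
     ・The anomaly patterns and the data measurement conditions are variable.
     Input: anomaly reproduction conditions and data measurement conditions
     Generation system
     Output: dataset, used to discover and avoid overfitting
     ・Training and evaluating AI models requires context (data labels) such as "whether an anomaly is present and where".
     ・We want to reduce the labor of labeling every time a dataset is generated => we propose automation.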
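  5. Contributions of this work
     1. Novelty of generating datasets dynamically
        Datasets are normally static and laborious to create.
        Experiments with the prototype we developed revealed two kinds of cases that did not behave as expected.
        How far dynamic generation prevents overfitting still needs to be evaluated.
     2. Automated labeling of the presence and position of anomalies
        In the AIOps context there has been almost no discussion of automatically labeling datasets.
        In our experiments, the labeling accuracy was 85%.
     To secure the general applicability of AIOps approaches, this work takes a first step toward research that focuses on data.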
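  6. Setting the scope of the operational dataset
     We surveyed 11 papers on anomaly detection and root cause analysis for microservices.
     Types of operational data
     ・Metrics: time-series numerical data
     ・Logs
     ・Traces: execution paths of requests
     Target applications
     ・Sock Shop: small scale (8 services)
     ・Train Ticket: medium scale (41 services)
     ・Pymicro: simulator (16 services)
     Types of injected faults
     ・Faults related to computing resources
     ・Success or failure and latency of inter-service requests
     ・Faults specific to microservices
     Paper counts in the order they appear on the slide: 7, 5, 3, 2, 2, 5, 7, 2, 1
     The scope is set to the most frequent case in each category: metrics, faults related to computing resources, and Sock Shop.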
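  7. Comparison with related technology
     Chaos Engineering: a discipline of experimentation for gaining confidence that a distributed system can withstand unexpected conditions. [Basiri 16]
     Tools for practicing Chaos Engineering: LitmusChaos, Chaos Mesh
     ‣ Fault injection: faults related to computing resources are supported
     ‣ Excessive use of CPU, memory, or disk, packet loss, and so on
     ‣ Scheduled execution of fault injection
     Features that Chaos Engineering itself does not need, such as managing and labeling operational data, are not included.
     [Basiri 16]: Basiri, A., et al., Chaos Engineering, IEEE Software, 2016.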
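  8. Design of the dynamic dataset generation system
     Input: fault injection conditions and data measurement conditions
     Generation workflow, followed by labeling
     Output: dataset
     First design criterion: scheduling of fault injection that includes data management
     ‣ Extends Chaos Engineering
     ‣ No overlapping fault injections in the same time period
     ‣ The data corresponding to each injection can be linked to it
     ‣ Contains both normal and anomalous data
     Second design criterion: automated labeling
     For each series in the dataset, attach:
     ‣ Whether the fault injection had an effect
     ‣ Whether an unexpected anomaly is present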
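  9. First design criterion: scheduling requirements for fault injection
     Time slot: the basic unit of data
     Figure labels: Component A, Component B, Component C, Time Slot, Fault α Injected Span, Fault β Injected Span, Recovery time
     ・One fault per slot
     ・Inject once for each component × fault combination
     ・Wait between injections, because normal-time data is also needed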
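The scheduling requirement on this slide (one fault per slot, one slot per component × fault combination, no overlap in time) can be illustrated with a small slot planner. This is a minimal sketch under stated assumptions, not Meltria's implementation: `Slot`, `plan_slots`, and the `repeats` parameter are hypothetical names introduced here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Slot:
    index: int
    component: str
    fault: str
    start_s: int  # slot start, in seconds from the beginning of the run
    end_s: int    # slot end (exclusive)


def plan_slots(components, faults, slot_s, repeats=1):
    """Assign every component x fault combination to its own time slot.

    Slots are laid out back to back, so no two injections share a slot.
    """
    combos = list(product(components, faults)) * repeats
    return [
        Slot(i, component, fault, i * slot_s, (i + 1) * slot_s)
        for i, (component, fault) in enumerate(combos)
    ]


# Hypothetical example: three components, two faults, 30-minute slots.
plan = plan_slots(["A", "B", "C"], ["alpha", "beta"], slot_s=30 * 60)
```

Because each slot starts exactly where the previous one ends, the data for one injection can be looked up by its slot boundaries alone.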
  10. First design criterion: system configuration
     Components: Workflow Scheduler, Operational Data Storage, Load Generator, Target Application, Datasets Repository
     1. Inject faults (① inject a fault)
     2. Pick latest data to datasets (② collect the data of the slot)
     3. Wait until the application recovers (③ wait until the next injection time)
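The three workflow steps on this slide can be sketched as a loop. This is only an illustration: `run_workflow` and every callable passed to it (`inject`, `collect`, `is_recovered`, `store`) are hypothetical stand-ins for the scheduler, the operational data storage, and the datasets repository, and the polling parameters are assumptions.

```python
import time


def run_workflow(plan, inject, collect, is_recovered, store,
                 sleep=time.sleep, poll_s=10.0, max_polls=60):
    """Run one injection per (component, fault) pair in `plan`."""
    for component, fault in plan:
        inject(component, fault)   # 1. inject the fault
        data = collect()           # 2. pick the latest slot data
        store({"component": component, "fault": fault, "data": data})
        polls = 0
        # 3. wait until the application recovers (bounded, to avoid hanging)
        while not is_recovered() and polls < max_polls:
            sleep(poll_s)
            polls += 1


# Usage with in-memory fakes instead of a real cluster.
injected, stored = [], []
checks = {"count": 0}


def fake_is_recovered():
    checks["count"] += 1
    return checks["count"] % 2 == 0  # recovers on every second check


run_workflow(
    [("A", "cpu"), ("B", "memory")],
    inject=lambda c, f: injected.append((c, f)),
    collect=lambda: [0.1, 0.2, 0.3],
    is_recovered=fake_is_recovered,
    store=stored.append,
    sleep=lambda seconds: None,
)
```

Passing the side effects in as callables keeps the loop testable without a running application.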
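  11. Second design criterion: labeling the dataset
     (1) Selecting the series to label
         s1: application-level metric
         s2: metric at the level of the microservice where the fault was injected
         s3: metric most closely related to the injected fault
         Only the propagation path of the anomaly is labeled.
     (2) Classifying each series
         Three classes, by the presence and position of an anomaly:
         NOT_FOUND
         FOUND_INSIDE_ANOMALY (the expected state)
         FOUND_OUTSIDE_ANOMALY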
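  12. Second design criterion: method for classifying into the three labels
     The 68-95-99 rule of the normal distribution (three-sigma rule)
     68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean, respectively. (Figure quoted from Wikipedia, "68–95–99.7 rule".)
     Series points outside the range of 2 standard deviations are treated as anomalous.
     The position of the data points judged anomalous determines whether the label is FOUND_INSIDE_ANOMALY or FOUND_OUTSIDE_ANOMALY.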
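The classification rule on this slide can be sketched roughly as follows. The slide does not say which data the mean and standard deviation are computed from, or how a series with anomalous points both inside and outside the injected span is handled. This sketch assumes the statistics are taken over the whole series and that any anomalous point outside the span yields FOUND_OUTSIDE_ANOMALY. The function `classify` and its parameters are hypothetical.

```python
from statistics import mean, pstdev


def classify(series, inj_start, inj_end, k=2.0):
    """Label a series by where its k-sigma outliers fall.

    inj_start / inj_end are indices of the injected span (end exclusive).
    """
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return "NOT_FOUND"  # a flat series has no outliers
    outliers = [i for i, x in enumerate(series) if abs(x - mu) > k * sigma]
    if not outliers:
        return "NOT_FOUND"
    if all(inj_start <= i < inj_end for i in outliers):
        return "FOUND_INSIDE_ANOMALY"
    return "FOUND_OUTSIDE_ANOMALY"


# Hypothetical series: 100 samples with a spike at indices 80-84.
spiky = [1.0] * 80 + [10.0] * 5 + [1.0] * 15
label = classify(spiky, inj_start=80, inj_end=100)
```

A fixed threshold like this makes series whose deviation sits close to `k * sigma` sensitive to small changes, which is worth keeping in mind when reading the accuracy results.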
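  13. Settings of the evaluation experiment
     Data measurement conditions
     ・Data is collected at 15-second intervals
     ・Each slot lasts 30 minutes
     Fault injection
     ・8 kinds of containers in Sock Shop
     ・CPU overuse and memory leak
     ・5 injections each
     90 fault injections in total
     Series selection
     s1: average response time of the front-end container
     s2: average response time of the service where the fault was injected
     s3: fault-related metrics
     ・User-space CPU usage
     ・Memory usage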
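  14. Evaluation of the first design criterion
     We expect s1, s2, and s3 to all be FOUND_INSIDE_ANOMALY.
     The cases where the classification by visual inspection did not match FOUND_INSIDE_ANOMALY were of the following two kinds:
     1. The fault injection failed and no fault occurred
        ・The cause of the failed injections is under investigation
        ・Failed injections need to be detected and labeled as failures
     2. The effect of the preceding fault injection was mixed in
        ・A mechanism that guarantees the application has recovered is needed
     A quantitative evaluation has not yet been carried out.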
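  15. Evaluation of the second design criterion: labeling accuracy
     257 series classified by visual inspection were used as the ground truth.
     The automated labeling achieved 85% accuracy.
     Examples of misclassification
     ・Fault injection failed (a), (b)
     ・Spike fluctuation during normal operation (c)
     ・Effect of the previous injection mixed in (d)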
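  16. Labeling challenges
     Labeling accuracy should ideally be on par with manual labeling, which includes human error.
     Human error is unlikely to be as high as 15%, so higher performance is needed.
     ・The 68-95-99 rule decides by a threshold, so series near the threshold are easily misclassified
     ・Some series are hard even for humans to judge as anomalous or not
     ・Even similar fluctuation trends depend partly on the type of metric
     Ideas for improving accuracy
     ・Score the degree of "lack of confidence" and use it in labeling
     ・Consider supervised learning, since training data is easy to create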
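  17. Summary
     We proposed Meltria, a system that generates datasets dynamically, so that overfitting can be avoided through a rich variety of anomaly patterns.
     ① Scheduling of fault injection "that includes data management"
     ② Automated labeling of "the presence and position of anomalies" using an empirical rule of the normal distribution
     The implementation is available on GitHub: https://github.com/ai4sre/meltria
     Experiment with 90 fault injections
     ① There were two kinds of cases in which the expected anomaly did not occur. They can be handled by adding operational guarantees to Meltria.
     ② The labeling accuracy was 85%. Misclassification was caused by cases with small fluctuations.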
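  18. Future work
     Expanding functionality
     ・Generate datasets for data types other than metrics
     ・Support larger applications
     ・Support a wider variety of faults
     ・Shorten the time needed to generate a dataset
     Improving academic rigor
     ・Evaluate how far dynamic generation prevents overfitting
     ・Evaluate how useful the proposed labeling is
     Evolve Meltria into a system that automatically evaluates a variety of AIOps methods against arbitrary real applications.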