Slide 1

Slide 1 text

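Meltria: A Dynamic Dataset Generation System for Anomaly Detection and Root Cause Analysis in Microservices
Yuuki Tsubouchi (SAKURA internet, Kyoto University)
Masaya Aoyama (SAKURA internet)
IPSJ 14th Internet and Operation Technology Symposium (IOTS2021), November 26, 2021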

Slide 2

Slide 2 text

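Growing complexity of cloud applications, and AIOps

Monolithic architecture: it is hard for developers to change the code.
→ Distribute the application by function.
Microservice architecture [Soldani 21]:
‣ Higher frequency of changes
‣ Complex dependencies
‣ Larger volume of monitoring data

Result: increased cognitive load. Incident response that relies on operators' experience and intuition becomes harder.
Approach: solve this with statistical analysis and machine learning (AIOps).

[Soldani 21]: Soldani, J. and Brogi, A., Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey, arXiv preprint, 2021.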

Slide 3

Slide 3 text

Challenges of operational datasets for evaluation

・Training and evaluating AI models requires datasets that contain anomalies.
・Companies are reluctant to publish operational data for privacy and security reasons.

Public datasets [Loghub 20] [Exathlon 20]
・They contain only a limited set of anomaly patterns.
・Methods may overfit to a specific dataset.

Implications [LogAD 21]
・Avoiding overfitting requires training and evaluation on a rich set of anomaly patterns.
・Building a dataset in advance that covers every anomaly pattern is difficult.

[Loghub 20]: He, Shilin, et al. "Loghub: A large collection of system log datasets towards automated log analytics." arXiv preprint, 2020.
[Exathlon 20]: Jacob, Vincent, et al. "Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series." arXiv preprint, 2020.
[LogAD 21]: Zhao, Nengwen, et al. "An empirical investigation of practical log anomaly detection for online service systems." ACM ESEC/FSE, 2021.

Slide 4

Slide 4 text

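Proposal: a system that generates datasets dynamically

Dynamic
・The anomaly patterns and the data measurement conditions are variable.

[Diagram] Input (anomaly reproduction conditions, data measurement conditions) → Generation system → Output (dataset) → toward detecting and avoiding overfitting

・Training and evaluating AI models requires context attached to the data (data labels), such as whether an anomaly is present and where it is located.
・We want to reduce the effort of labeling every time a dataset is generated => we propose automating the labeling.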

Slide 5

Slide 5 text

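Contributions of this work

1. The novelty of generating datasets dynamically
・Datasets are usually static and laborious to create.
・Experiments with the prototype we developed revealed two kinds of cases that did not behave as expected.
・How much dynamic generation prevents overfitting still needs to be evaluated in future work.

2. Automated labeling of anomaly presence and location
・There is almost no discussion of automatically labeling datasets in the AIOps context.
・In our experiments, the labeling accuracy was 85%.

To secure the general applicability of AIOps approaches, we took a first step toward research that focuses on the data.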

Slide 6

Slide 6 text

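Setting the scope of the operational dataset

We surveyed 11 papers on anomaly detection and root cause analysis for microservices.

Types of operational data
・Metrics: time-series numerical data
・Logs
・Traces: the execution paths of requests

Target applications
・Sock Shop: small scale (8 services)
・Train Ticket: medium scale (41 services)
・Pymicro: simulator (16 services)

Types of injected faults
・Faults related to computing resources
・Success/failure and latency of inter-service requests
・Faults specific to microservices

Paper counts as they appear in the extracted slide text (the pairing with the items above is not recoverable from this transcript): 7, 5, 3, 2, 2, 5, then 7, 2, 1.

We set the scope to the most common case in each category: metrics, faults related to computing resources, and Sock Shop.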

Slide 7

Slide 7 text

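Comparison with related technology

Chaos Engineering [Basiri 16]
A discipline of experimentation for gaining confidence that a distributed system can withstand unexpected conditions.

Tools for practicing Chaos Engineering: LitmusChaos, Chaos Mesh
‣ Fault injection: faults related to computing resources are supported
‣ Excessive use of CPU, memory, or disk, packet loss, and so on
‣ Scheduled execution of fault injection

These tools do not include features that Chaos Engineering itself does not need, such as managing or labeling operational data.

[Basiri 16]: Basiri, A., et al., Chaos Engineering, IEEE Software, 2016.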

Slide 8

Slide 8 text

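Design of the dynamic dataset generation system

[Diagram] Input (fault injection conditions, data measurement conditions) → Generation workflow → Dataset → Labeling → Output

First design criterion: scheduling of fault injection, including data management
‣ Extends Chaos Engineering
‣ No overlapping fault injections in the same time window
‣ The data corresponding to each single injection can be linked to it
‣ Contains both normal and anomalous data

Second design criterion: automated labeling
For each series in the dataset, assign:
‣ whether the fault injection had an effect
‣ whether an unexpected anomaly is present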

Slide 9

Slide 9 text

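First design criterion: scheduling requirements for fault injection

[Timeline figure] Rows: Component A, Component B, Component C. Each Time Slot is the basic unit of data and contains one injected span (Fault α Injected Span, Fault β Injected Span) followed by a recovery time.

・One fault per slot
・Wait after each injection, because normal-state data is also needed
・Inject once for each component × fault combination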
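
The scheduling requirements above can be sketched as follows. This is a minimal illustration, not Meltria's actual implementation: the names `Slot` and `build_schedule`, the assumption that the injected span comes first in each slot, and the durations (300 s of injection, 1500 s of waiting) are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Slot:
    index: int
    component: str
    fault: str
    start_s: int       # slot start, in seconds from the beginning of the run
    inject_end_s: int  # end of the fault-injected span
    end_s: int         # slot end (injected span + waiting time)


def build_schedule(components, faults, inject_s, wait_s):
    """Assign exactly one (component, fault) pair to each consecutive slot."""
    slot_s = inject_s + wait_s
    schedule = []
    for i, (component, fault) in enumerate(product(components, faults)):
        start = i * slot_s
        schedule.append(
            Slot(i, component, fault, start, start + inject_s, start + slot_s)
        )
    return schedule


# Hypothetical example: 3 components x 2 faults, one fault per slot.
schedule = build_schedule(["A", "B", "C"], ["alpha", "beta"],
                          inject_s=300, wait_s=1500)
for slot in schedule:
    print(slot)
```

Because slots are laid out back to back, no two injections share a time window, and each slot holds both anomalous data (the injected span) and normal data (the waiting time).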

Slide 10

Slide 10 text

First design criterion: system architecture

Components: Workflow Scheduler, Operational Data Storage, Load Generator, Target Application, Datasets Repository

Workflow:
1. Inject faults
2. Pick the latest data (the data of the slot) into the datasets
3. Wait until the application recovers, that is, until the next injection time
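
One way to picture the three workflow steps is the loop below. It is a sketch under stated assumptions: `run_workflow`, `inject_fault`, and `collect_slot_data` are hypothetical names, the callables stand in for the fault injector and the operational data storage, and `inject_fault` is assumed to block until the injection has finished.

```python
import time


def run_workflow(plan, inject_fault, collect_slot_data, slot_seconds,
                 sleep=time.sleep, clock=time.monotonic):
    """Run one slot per (component, fault) pair: inject, collect, wait."""
    datasets = []
    for component, fault in plan:
        slot_start = clock()
        inject_fault(component, fault)                         # step 1: inject the fault
        datasets.append(collect_slot_data(component, fault))   # step 2: pick the slot's data
        remaining = slot_seconds - (clock() - slot_start)
        if remaining > 0:
            sleep(remaining)                                   # step 3: wait for the next slot
    return datasets
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting; a real scheduler would more likely delegate these steps to a workflow engine.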

Slide 11

Slide 11 text

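Second design criterion: labeling the dataset

(1) Selecting the series to label: only the propagation path of the anomaly is labeled
・s1: application-level metric
・s2: service-level metric of the microservice into which the fault was injected
・s3: the metric most closely related to the injected fault

(2) Classifying each series: three classes, by anomaly presence and location
・NOT_FOUND
・FOUND_INSIDE_ANOMALY (the expected state)
・FOUND_OUTSIDE_ANOMALY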

Slide 12

Slide 12 text

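Second design criterion: how series are classified into the 3 labels

The 68-95-99.7 rule (three-sigma rule) of the normal distribution: 68%, 95%, and 99.7% of the values lie within 1, 2, and 3 standard deviations of the mean, respectively. (Figure cited from Wikipedia, “68–95–99.7 rule”.)

・Points of a series that fall outside 2 standard deviations are treated as anomalous.
・The position of the points judged anomalous determines whether the label is FOUND_INSIDE_ANOMALY or FOUND_OUTSIDE_ANOMALY.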
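
The classification described above might look like the following sketch. Several details are assumptions, since the slide does not state them: the mean and standard deviation are computed over the whole slot, the injected span is given as an index range, and any anomalous point outside that span yields FOUND_OUTSIDE_ANOMALY. The function name `classify_series` is hypothetical.

```python
from statistics import mean, stdev

NOT_FOUND = "NOT_FOUND"
FOUND_INSIDE_ANOMALY = "FOUND_INSIDE_ANOMALY"
FOUND_OUTSIDE_ANOMALY = "FOUND_OUTSIDE_ANOMALY"


def classify_series(values, inject_start, inject_end, n_sigma=2.0):
    """Label a series by whether and where points leave mean +/- n_sigma * std.

    values: samples of one slot (at least two).
    [inject_start, inject_end): index range of the fault-injected span.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return NOT_FOUND  # a constant series has no outliers
    outliers = [i for i, v in enumerate(values) if abs(v - mu) > n_sigma * sigma]
    if not outliers:
        return NOT_FOUND
    # Assumption: a single outlier outside the injected span is enough
    # to mark the series as FOUND_OUTSIDE_ANOMALY.
    if any(i < inject_start or i >= inject_end for i in outliers):
        return FOUND_OUTSIDE_ANOMALY
    return FOUND_INSIDE_ANOMALY


# Hypothetical series: flat at 10 with a jump to 50 at indices 80..89.
series = [10.0] * 80 + [50.0] * 10 + [10.0] * 10
print(classify_series(series, inject_start=80, inject_end=90))
```

A fixed threshold like this is simple, but series whose extreme points sit close to two standard deviations can flip between labels.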

Slide 13

Slide 13 text

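Setup of the evaluation experiment

Data measurement conditions
・Data is sampled every 15 seconds
・Each slot lasts 30 minutes
・8 kinds of containers in Sock Shop
・CPU overuse and memory leak
・5 injections each
→ 90 fault injections in total

Selection of series
・s1: average response time of the front-end container
・s2: average response time of the service into which the fault was injected
・s3: fault-related metric (user-space CPU utilization, or memory usage)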

Slide 14

Slide 14 text

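Evaluation of the first design criterion

Expectation: s1, s2, and s3 are all FOUND_INSIDE_ANOMALY.
A quantitative evaluation has not yet been carried out.

Cases in which the classification by visual inspection did not match FOUND_INSIDE_ANOMALY fall into the following two kinds:

1. The fault injection failed, and no fault occurred
・The cause of the injection failures is under investigation.
・Injection failures need to be detected and labeled as failures.

2. The effect of the immediately preceding fault injection was mixed in
・A mechanism that guarantees the application has recovered is needed.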

Slide 15

Slide 15 text

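Evaluation of the second design criterion: labeling accuracy

We visually classified 257 series and used that classification as the ground truth.
The automated labeling achieved an accuracy of 85%.

Examples of misclassification
・Fault injection failed: (a), (b)
・Spike fluctuation during normal operation: (c)
・Effect of the previous injection mixed in: (d)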
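
For reference, an accuracy figure of this kind can be computed by comparing the automated labels with the visually assigned ones, as in the sketch below. The function name `labeling_accuracy` and the toy label lists are hypothetical; they are not the experiment's data.

```python
from collections import Counter


def labeling_accuracy(manual, automatic):
    """Return (accuracy, counts of (manual, automatic) mismatches)."""
    if len(manual) != len(automatic):
        raise ValueError("label lists must have the same length")
    if not manual:
        raise ValueError("label lists must not be empty")
    pairs = list(zip(manual, automatic))
    correct = sum(m == a for m, a in pairs)
    mismatches = Counter((m, a) for m, a in pairs if m != a)
    return correct / len(pairs), mismatches


# Toy labels only, to show the call.
manual = ["FOUND_INSIDE_ANOMALY", "FOUND_INSIDE_ANOMALY", "NOT_FOUND", "FOUND_OUTSIDE_ANOMALY"]
automatic = ["FOUND_INSIDE_ANOMALY", "NOT_FOUND", "NOT_FOUND", "FOUND_OUTSIDE_ANOMALY"]
accuracy, mismatches = labeling_accuracy(manual, automatic)
print(accuracy, dict(mismatches))
```

Counting the mismatches per label pair shows which kind of misclassification dominates, not just how many there are.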

Slide 16

Slide 16 text

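Challenges in labeling

The labeling accuracy should ideally be on par with manual labeling, which itself includes human error.
Since human error is unlikely to be as high as 15%, higher performance is needed.

Ideas for improving accuracy
・The 68-95-99.7 rule decides by a threshold, so series near the threshold are easily misclassified.
・Some series are hard even for a human to judge as anomalous or not.
・It would help to score the degree of "lack of confidence" and attach it to the label.
・Since training data is easy to create, supervised learning is also worth considering.
・Even with similar fluctuation trends, part of the judgment depends on the type of metric.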
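
One possible way to score the "lack of confidence" mentioned above is to measure how close the most extreme point of a series lies to the decision threshold. The sketch below is an assumption for illustration only; the slides do not specify a scoring method, and `uncertainty_score` and its formula are hypothetical.

```python
from statistics import mean, stdev


def uncertainty_score(values, n_sigma=2.0):
    """Return a score in [0, 1]; higher means the series is closer to the threshold.

    The score is 1 / (1 + |max_z - n_sigma|), where max_z is the largest
    absolute z-score in the series. A constant series scores 0.0.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return 0.0
    max_z = max(abs(v - mu) / sigma for v in values)
    return 1.0 / (1.0 + abs(max_z - n_sigma))


# Hypothetical series: one with an extreme spike, one with a milder shift.
clear_case = [0.0] * 99 + [100.0]
borderline = [10.0] * 80 + [50.0] * 10 + [10.0] * 10
print(uncertainty_score(clear_case), uncertainty_score(borderline))
```

Series with a high score could then be routed to a human for review instead of being labeled automatically.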

Slide 17

Slide 17 text

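Summary

To avoid overfitting by providing a rich set of anomaly patterns, we proposed Meltria, a system that generates datasets dynamically.
① Scheduling of fault injection that includes data management
② Automated labeling of anomaly presence and location, based on the empirical rule of the normal distribution
The implementation is published on GitHub: https://github.com/ai4sre/meltria

Experiment with 90 fault injections
① There were two kinds of cases in which the expected anomaly did not occur. These can be handled by adding operational guarantees to Meltria.
② The labeling accuracy was 85%. Misclassification was caused by cases in which the fluctuation was small.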

Slide 18

Slide 18 text

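Future work

Extending functionality
・Generating datasets for data types other than metrics
・Supporting larger-scale applications
・Supporting a wider variety of faults
・Shortening the time needed to generate a dataset

Improving academic rigor
・Evaluating how much dynamic generation prevents overfitting
・Evaluating how useful the proposed labeling is

Goal: evolve Meltria into a system that automatically evaluates a variety of AIOps methods against any real application.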