Slide 1

Slide 1 text

Research Trends in AIOps and Research on Dynamic Generation of Datasets for AIOps
Yuuki Tsubouchi @yuuk1t
The 15th Sakura Internet Research Meeting, October 7, 2021

Slide 2

Slide 2 text

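Purpose of this talk
・AIOps, the application of AI to the operation of IT services, is attracting attention
・After sharing the latest AIOps trends, this talk introduces recent research focused on creating datasets for AIOps
・The aim is to lead into examining whether AIOps can be offered to Sakura Internet's customers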

Slide 3

Slide 3 text

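Contents
1. Research trends in AIOps
2. Survey of research on anomaly detection and root cause analysis in microservices
3. Meltria: a system for dynamically generating datasets
4. Summary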

Slide 4

Slide 4 text

1. Research trends in AIOps

Slide 5

Slide 5 text

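Growing complexity of distributed applications in the cloud
Distributed applications that make up media sites, e-commerce sites, SNS, IoT, and so on
・Growth in access
・Growth in the number of hosts through scale-out
・Growth in features
・Larger-scale distributed architectures
・Growth in middleware
[Figure labels: RDB server, cache server, search server, web server, network service, TCP/UDP]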

Slide 6

Slide 6 text

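Growth in tedious management work and cognitive load
・IT operators have to carry out tedious management tasks and tasks with high cognitive load by hand
  ・Scaling out and scaling in according to load
  ・Managing alerting and responding to incidents
・Efforts to apply AI (artificial intelligence), including statistical analysis and machine learning, to the management and improvement of IT services are attracting attention: AIOps (Artificial Intelligence for IT Operations)
[Notaro '20]: Notaro, P., Cardoso, J., and Gerndt, M. "A Systematic Mapping Study in AIOps." ICSOC. Springer, Cham, 2020.
[Dang '19]: Dang, Y., Lin, Q., and Huang, P. "AIOps: Real-World Challenges and Research Innovations." ICSE-Companion. IEEE, 2019.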

Slide 7

Slide 7 text

Areas AIOps contributes to (quoted from Fig. 2 of [Notaro '20])
・Failure management: research on ways to deal with undesirable behavior in the delivery of IT services
・Resource provisioning: research on allocating energy, compute, storage, and time resources so that IT services are delivered optimally

Slide 8

Slide 8 text

Number of papers per AIOps research area [Notaro '20]
・Number of AIOps-related papers: 670
・62.1% of the 670 papers relate to Failure Management
・Online failure prediction (26.4%), failure detection (33.7%), root cause analysis (26.7%)

Slide 9

Slide 9 text

Trend in the number of AIOps-related papers [Notaro '20]
・The number of papers is on an upward trend
・Failure Detection (anomaly detection) accounts for 71 papers in 2018-2019
・By number of contributions, Resource Provisioning is large, while Failure Prevention and Remediation have the fewest contributions

Slide 10

Slide 10 text

AIOps services offered as SaaS
・Zebrium: https://www.zebrium.com/
・Datadog: https://www.datadoghq.com/solutions/machine-learning/
・NewRelic: https://newrelic.com/platform/applied-intelligence
・PagerDuty: https://www.pagerduty.com/resources/learn/what-is-aiops/
・Splunk: https://www.splunk.com/ja_jp/data-insider/ai-for-it-operations-aiops.html
・Mackerel: https://mackerel.io/docs/entry/howto/anomaly-detection-for-roles
Cases of AIOps features being used within Japan are still few.
AIOps features are commonly offered as part of system monitoring SaaS products.

Slide 11

Slide 11 text

2. Research overview of anomaly detection and root cause analysis for microservices

Slide 12

Slide 12 text

Terminology for anomaly detection and root cause analysis
[Figure: relationship between fault, anomaly, and failure, adapted by the author from Figure 3.8: Error propagation]
Avizienis, A., et al. "Basic concepts and taxonomy of dependable and secure computing." IEEE Transactions on Dependable and Secure Computing, 2004.

Slide 13

Slide 13 text

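Anomaly detection and root cause analysis in AIOps
Anomaly detection
・Automatically detecting that the system is showing symptoms of an anomaly
・After detection, a response by an operator may be requested
・"Immediacy" and "low noise, high signal" are required
Root cause analysis
・Estimating the fault after an anomaly has been detected
・Estimation has levels of granularity
  ・Component level < metric level or log line level
・"Immediacy" and "accuracy" are required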

Slide 14

Slide 14 text

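Data sources used for anomaly detection and root cause analysis [Karumuri 21]
・Metrics: time-series numerical data
・Events: highly structured data about system events
  ・From here on, events are folded into logs
・Logs: unstructured string logs
・Traces: graphs of the execution path of a request
The initials are sometimes combined and called "MELT".
[Karumuri 21] Karumuri, S., Solleza, F., Zdonik, S., and Tatbul, N., Towards Observability Data Management at Scale, ACM SIGMOD Record, Vol. 49, No. 4, pp. 18-23, 2021.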

Slide 15

Slide 15 text

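Anomaly detection for microservices
Unsupervised learning trains on normal-time data and computes the degree of anomaly from how far new data deviates from normal.
Approaches by data source: log-based, trace-based, metric-based
・Unsupervised learning is what is normally used.
・Unsupervised learning: DNNs, principal component analysis, and similar
・Supervised learning: the training data is either data known to be anomalous in a production environment, or data obtained by feeding in test cases and deliberately injecting faults
・Detect events that should have been logged but were not, or the occurrence of unexpected logs. Convert logs into templates and check whether they match the order in which templates appear in normal operation.
・Unsupervised learning: cluster multiple metrics and judge anomalies by the shift in the position of the cluster centroids. Some studies adaptively update a model once it has been trained.
・Supervised learning: deliberately inject faults and use the result as training data
Soldani, J., and Brogi, A., "Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey." arXiv preprint arXiv:2105.12378 (2021).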
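A minimal sketch of the unsupervised idea on this slide: learn from normal-time data, then score how far new data deviates from it. It uses a plain z-score against the normal mean and standard deviation, which is an assumption made for brevity and not one of the DNN or PCA methods the slide names. The function names and sample values are hypothetical.

```python
import statistics

def fit_normal_profile(normal_values):
    """Learn the mean and standard deviation of normal-time data."""
    mean = statistics.fmean(normal_values)
    stdev = statistics.stdev(normal_values)
    return mean, stdev

def anomaly_score(value, profile):
    """Distance from the normal mean, in units of normal standard deviations."""
    mean, stdev = profile
    if stdev == 0:
        return 0.0 if value == mean else float("inf")
    return abs(value - mean) / stdev

# Hypothetical CPU utilization samples (percent) taken during normal operation.
normal_cpu = [20.0, 22.0, 19.0, 21.0, 20.0, 18.0, 21.0, 19.0]
profile = fit_normal_profile(normal_cpu)

# Score new observations: the larger the score, the further from normal.
new_samples = [20.5, 35.0]
scores = [anomaly_score(v, profile) for v in new_samples]
```

A real detector would also need a threshold on the score; where to put it is the trade-off between immediacy and low noise.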

Slide 16

Slide 16 text

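Root cause analysis for microservices
Most studies assume that root cause analysis starts when triggered by an anomaly detection event.
Approaches by data source: log-based, trace-based, metric-based
・Treat logs as time-series data and infer causality between the series
・Direct analysis: directly analyze the response times contained in traces, for example by looking at the coefficient of variation
・Topology-based analysis: build the chain of service calls for the same user request
・Causal-graph-based analysis: on the causal graph, estimate the log that is the cause with a search algorithm such as random walk. Estimate the cause with PageRank or random walk.
・Direct analysis: look at the similarity between the metric showing the symptom and the other metrics. Align anomaly scores computed from logs with metrics and run correlation analysis.
・Causal-graph-based analysis: treat each metric as one variable and infer causality among the variables (PC algorithm)
Soldani, J., and Brogi, A., "Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey." arXiv preprint arXiv:2105.12378 (2021).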
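The metric-based direct analysis on this slide (comparing the metric that shows the symptom with the other metrics) could be sketched as below. Pearson correlation is assumed as the similarity measure; the surveyed methods may use others. The metric names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no similarity information
    return cov / math.sqrt(var_x * var_y)

def rank_candidates(symptom, candidates):
    """Return (metric name, |correlation|) pairs, most similar first."""
    scored = [(name, abs(pearson(symptom, series)))
              for name, series in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical symptom metric: request latency that degrades midway.
symptom_latency = [100, 102, 101, 180, 250, 240]
# Hypothetical candidate metrics from individual services.
candidates = {
    "orders_cpu": [30, 31, 30, 70, 95, 92],
    "carts_memory": [512, 510, 515, 511, 513, 512],
    "front_end_cpu": [40, 38, 41, 39, 40, 42],
}
ranking = rank_candidates(symptom_latency, candidates)
```

Correlation only ranks candidates by similarity; it does not establish causality, which is what the causal-graph approaches on the slide aim at.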


Slide 17

Slide 17 text

3. Meltria: a system for dynamically generating datasets

Slide 18

Slide 18 text

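A new motivation concerning datasets
・To discover new problems, we want to verify whether performance changes under conditions that differ from the experiments in existing studies
・Evaluating a data analysis method requires a dataset
・Public datasets exist, but because they are static, the anomaly patterns they contain are limited
・We want to generate datasets dynamically, on the data analyst's request, so that arbitrary anomaly patterns can be included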

Slide 19

Slide 19 text

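Meltria: a system for dynamically generating datasets
We propose Meltria, a system that dynamically generates datasets containing diverse anomaly patterns, for evaluating anomaly detection and root cause analysis for microservices.
First design criterion: scheduling of fault injection and data management
・This is Meltria's basic function: to include anomalies in the data, it injects faults to deliberately cause anomalies.
・Data is collected and managed in an appropriate basic unit.
Second design criterion: automated validation of datasets
・Generated datasets are labeled with whether the fault injection had an effect and whether unexpected anomalies are present.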

Slide 20

Slide 20 text

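Meltria's contribution to its users
・Supports the process by which data analysts discover problems in existing methods or in methods they are devising
Between the data analyst and Meltria:
① Request data generation under condition X and receive the result
② Check the analysis results on the generated data
Cycling through ① and ② quickly leads to the discovery of new problems.
When accuracy is lower than expected, the analyst can check which label's data the accuracy dropped on.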
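The last point on this slide, checking which label's data the accuracy dropped on, amounts to breaking accuracy down per dataset label. A minimal sketch, assuming each evaluation result is recorded as a (label, correct-or-not) pair; the label names here are hypothetical and are not Meltria's actual labels.

```python
from collections import defaultdict

def accuracy_by_label(records):
    """Compute accuracy per dataset label.

    records: iterable of (dataset_label, is_correct) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for label, is_correct in records:
        totals[label] += 1
        if is_correct:
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical evaluation results of an analysis method on generated data.
records = [
    ("cpu", True),
    ("cpu", True),
    ("memory", False),
    ("memory", True),
    ("cpu", True),
]
per_label = accuracy_by_label(records)
```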

Slide 21

Slide 21 text

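Meltria's current scope
・Data sources: only metrics, that is, time-series numerical data
・Fault injection types: only faults related to computing resources
・Faults related to source code or configuration changes are not yet supported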

Slide 22

Slide 22 text

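First design criterion: scheduling requirements for fault injection
・The slot is the basic unit
・One slot contains one fault
・Faults are injected once for each combination of component × fault type
・Normal-time data is also needed, so a wait is required
[Figure: Components A, B, and C along a time axis divided into slots, showing the injected spans of Fault α and Fault β and the recovery time]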
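The scheduling requirements on this slide can be sketched as a plan with one slot per component × fault type combination, each slot holding exactly one fault. The layout of a slot (an injected span followed by a recovery wait) and the durations are assumptions read off the figure labels, not confirmed details of Meltria.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Slot:
    component: str
    fault: str
    inject_start: int  # seconds from the start of the experiment
    inject_end: int
    slot_end: int      # includes the recovery wait after the injection

def build_schedule(components, faults, inject_seconds, recovery_seconds):
    """Create one slot for every (component, fault type) combination."""
    slots = []
    start = 0
    for component, fault in product(components, faults):
        inject_end = start + inject_seconds
        slot_end = inject_end + recovery_seconds
        slots.append(Slot(component, fault, start, inject_end, slot_end))
        start = slot_end  # the next slot begins only after recovery
    return slots

# Hypothetical durations: 5 minutes of injection, 10 minutes of recovery.
schedule = build_schedule(
    components=["A", "B", "C"],
    faults=["alpha", "beta"],
    inject_seconds=300,
    recovery_seconds=600,
)
```

Running slots strictly one after another keeps one fault per slot, at the cost of a total run time that grows with the number of combinations.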

Slide 23

Slide 23 text

First design criterion: system architecture
1. The Workflow Scheduler injects a fault into the Target Application
2. The slot's data is collected from the metrics storage
3. Wait until the next injection time
Preparation
a) The Load Generator applies load to the Target Application
b) Data is collected continuously from the Target Application
[Figure: Workflow Scheduler, Operational Data Storage, Load Generator, Target Application, and Datasets Repository, with arrows 1. Inject faults, 2. Pick latest data to datasets, 3. Wait until the application recovers]
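The three numbered steps on this slide form a loop, which might look roughly like the sketch below. The callables passed in are hypothetical stand-ins for the fault injection tool and the metrics storage; nothing here is the API of a real workflow or chaos engineering tool, and the injection call is assumed to block until the injected span is over.

```python
import time

def run_experiment(plan, inject_fault, collect_slot, wait_seconds, sleep=time.sleep):
    """Run one slot per (component, fault) pair and return the collected slots."""
    datasets = []
    for component, fault in plan:
        inject_fault(component, fault)                   # 1. inject the fault
        datasets.append(collect_slot(component, fault))  # 2. pick the slot's data
        sleep(wait_seconds)                              # 3. wait for recovery
    return datasets

# Stand-ins so the sketch runs without a cluster; a real setup would call the
# fault injection tool and query the metrics storage here.
injected = []

def fake_inject(component, fault):
    injected.append((component, fault))

def fake_collect(component, fault):
    return {"component": component, "fault": fault, "metrics": []}

waits = []
result = run_experiment(
    plan=[("carts", "cpu"), ("carts", "memory")],
    inject_fault=fake_inject,
    collect_slot=fake_collect,
    wait_seconds=600,
    sleep=waits.append,  # record the wait instead of actually sleeping
)
```

Passing the sleep function in as a parameter keeps the loop testable without waiting for real recovery times.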

Slide 24

Slide 24 text

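Second design criterion: dataset validation
(1) Select the series to validate
・Application-level metrics
・Metrics at the level of the microservice where the fault was injected
・The metric most related to the fault injection
We assume that only the propagation path from fault to failure needs to be inspected.
(2) Classify the state of the selected series
・Following the 68-95-99.7 rule of the normal distribution, values beyond 2 standard deviations are treated as anomalous
・NOT_FOUND
・FOUND_INSIDE_ANOMALY (the expected classification)
・FOUND_OUTSIDE_ANOMALY
[Figure: example series v1, v2, v3]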
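A minimal sketch of the state classification in step (2), under stated assumptions: the mean and standard deviation come from a separate normal-time baseline, "beyond 2 standard deviations" is read as absolute deviation from the baseline mean, "inside" means inside the fault-injected span, and any anomalous point outside that span makes the series FOUND_OUTSIDE_ANOMALY. The slide does not spell out these details, so the function and its parameters are hypothetical.

```python
import statistics

def classify_series(series, inject_start, inject_end, baseline):
    """Classify a series by where its 2-sigma outliers fall.

    inject_start and inject_end are indices into `series`
    (start inclusive, end exclusive) marking the injected span.
    """
    mean = statistics.fmean(baseline)
    threshold = 2 * statistics.pstdev(baseline)
    anomalous = [i for i, v in enumerate(series) if abs(v - mean) > threshold]
    if not anomalous:
        return "NOT_FOUND"
    if all(inject_start <= i < inject_end for i in anomalous):
        return "FOUND_INSIDE_ANOMALY"
    return "FOUND_OUTSIDE_ANOMALY"

# Hypothetical normal-time samples of the selected metric.
baseline = [10, 11, 9, 10, 12, 8, 10, 10]

# The fault is assumed to be injected at indices 3 and 4 of each series.
inside = classify_series([10, 11, 9, 30, 32, 10, 9], 3, 5, baseline)
none = classify_series([10, 11, 9, 10, 11, 10, 9], 3, 5, baseline)
outside = classify_series([10, 25, 9, 30, 32, 10, 9], 3, 5, baseline)
```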

Slide 25

Slide 25 text

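Meltria's implementation
・Sock Shop is deployed as the Target Application in a Kubernetes environment
  ・Sock Shop is a small microservice application often used in related research
・Argo Workflows is used as the Workflow Scheduler
  ・Argo Workflows is a tool for building workflows on Kubernetes
・Litmus is used for fault injection
  ・Litmus is one of the Chaos Engineering frameworks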

Slide 26

Slide 26 text

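Accuracy of the validation under the second design criterion
・CPU and memory faults were injected into 8 kinds of containers
・257 series were obtained and classified into the 3 states by visual inspection to serve as ground truth
・The validation showed 85% accuracy
・Examples of misclassification
  ・Fault injection failed (a), (b)
  ・Spike fluctuation during normal operation (c)
  ・Contamination by the effect of the previous injection (d)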

Slide 27

Slide 27 text

4. Summary

Slide 28

Slide 28 text

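Summary
・According to a survey of AIOps research trends, research on Failure Management is increasing.
・As far as the presenter has observed, within failure management, research on anomaly detection and root cause analysis for microservices in particular is increasing.
・To discover problems in anomaly detection and root cause analysis methods, we are researching a system for dynamically generating datasets.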

Slide 29

Slide 29 text

Related materials
https://speakerdeck.com/yuukit/cloud-ai
https://speakerdeck.com/yuukit/a-survey-for-cases-of-applying-machine-learning-to-sre