Meltria: A Dynamic Dataset Generation System for Anomaly Detection and Root Cause Analysis in Microservices / Meltria in IOTS2021

https://www.iot.ipsj.or.jp/symposium/iots2021-program/
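(9) Meltria: A Dynamic Dataset Generation System for Anomaly Detection and Root Cause Analysis in Microservices
◎ Yuuki Tsubouchi (SAKURA internet, Kyoto University), Masaya Aoyama (SAKURA internet)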

Yuuki Tsubouchi (yuuk1)

November 26, 2021

Transcript

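  1. Meltria: A Dynamic Dataset Generation System for
    Anomaly Detection and Root Cause Analysis in Microservices

    Yuuki Tsubouchi (SAKURA internet, Kyoto University)
    Masaya Aoyama (SAKURA internet)

    Information Processing Society of Japan, 14th Internet and Operation Technology Symposium (IOTS2021)
    November 26, 2021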

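  2. Growing complexity of cloud applications and AIOps
    Monolithic architecture
    ‣ Developers find it hard to change the code
    Microservice architecture
    ‣ Distributed by function
    ‣ Higher change frequency
    ‣ More complex dependencies
    ‣ Larger volume of monitoring data
    Increased cognitive load: incident response that relies on operators' experience and intuition becomes harder
    Solve this with statistical analysis and machine learning (AIOps) [Soldani 21]

    [Soldani 21]: Soldani, J. and Brogi, A., Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey, arXiv preprint, 2021.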

  3. Challenges with operational datasets for evaluation
    ・Training and evaluating AI models requires datasets that contain anomalies
    ・Companies are reluctant to publish operational data for privacy and security reasons

    Public datasets [Loghub 20] [Exathlon 20]
    ・Contain only a limited set of anomaly patterns
    ・Risk of overfitting to a specific dataset

    [LogAD 21]
    ・Avoiding overfitting requires training and evaluation on a rich variety of anomaly patterns
    ・Building a dataset in advance that covers every anomaly pattern is difficult

    [Loghub 20]: He, Shilin, et al. "Loghub: A large collection of system log datasets towards automated log analytics." arXiv preprint 2020.
    [Exathlon 20]: Jacob, Vincent, et al. "Exathlon: A Benchmark for Explainable Anomaly Detection over Time Series." arXiv preprint 2020.
    [LogAD 21]: Zhao, Nengwen, et al. "An empirical investigation of practical log anomaly detection for online service systems." ACM ESEC/FSE. 2021.

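  4. Proposal: a system that generates datasets dynamically
    Dynamic
    ・Anomaly patterns and data measurement conditions are variable
    Input: conditions for reproducing anomalies, data measurement conditions
    Generation system
    Output: dataset
    Toward detecting and avoiding overfitting
    ・To train and evaluate AI models, the data needs context (data labels) such as "whether an anomaly is present and where"
    ・We want to reduce the effort of labeling every time a dataset is generated => we propose automating it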

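  5. Contributions of this work
    1. The novelty of generating datasets dynamically
    ・Datasets are usually static and laborious to create
    ・Experiments with the prototype we developed revealed 2 kinds of cases that did not behave as expected
    ・How much dynamic generation prevents overfitting still needs to be evaluated
    2. Automated labeling of the presence and position of anomalies
    ・There is almost no discussion of automatically labeling datasets in the AIOps context
    ・In our experiments, labeling accuracy was 85%
    To secure the general applicability of AIOps approaches, we took a first step toward research that focuses on the data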

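  6. Scoping the operational dataset
    Surveyed 11 papers on anomaly detection and root cause analysis for microservices
    Types of operational data
    ・Metrics: time-series numerical data
    ・Logs
    ・Traces: execution paths of requests
    Target applications
    ・Sock Shop: small scale (8 services)
    ・Train Ticket: medium scale (41 services)
    ・Pymicro: simulator (16 services)
    Types of injected faults
    ・Faults related to computing resources
    ・Success/failure and latency of inter-service requests
    ・Faults specific to microservices
    Scope set to the most common case: metrics, faults related to computing resources, Sock Shop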

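  7. Comparison with related technology
    Chaos Engineering [Basiri 16]
    A discipline of verification for gaining confidence that a distributed system can withstand unexpected events
    Tools for practicing Chaos Engineering: LitmusChaos, Chaos Mesh
    ‣ Fault injection: supports faults related to computing resources
    ‣ Excessive use of CPU, memory, and disk, packet loss, and so on
    ‣ Scheduled execution of fault injection
    Features that Chaos Engineering itself does not need, such as managing and labeling operational data, are not included

    [Basiri 16]: Basiri, A., et al., Chaos Engineering, IEEE Software, 2016.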

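  8. Design of the dynamic dataset generation system
    Input: fault injection conditions, data measurement conditions
    Generation workflow
    Output: dataset, labeling

    First design criterion: fault injection scheduling that includes data management
    ‣ Extends Chaos Engineering
    ‣ No overlapping fault injections in the same time window
    ‣ The data corresponding to each injection can be linked to it
    ‣ Contains both normal and anomalous data

    Second design criterion: automated labeling
    Each series in the dataset is labeled with:
    ‣ Whether the fault injection had an effect
    ‣ Whether an unexpected anomaly is present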

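  9. First design criterion: scheduling requirements for fault injection
    Time Slot: the basic unit of data
    ・One fault per slot
    ・One injection for each component × fault combination
    ・The scheduler waits between injections because normal-time data is also needed
    Diagram labels: Component A, Component B, Component C; Fault α Injected Span; Fault β Injected Span; Recovery time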
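
The scheduling requirements on this slide (one fault per slot, every component × fault combination, a wait so the application returns to normal) can be sketched as a small plan builder. This is a minimal illustration, not Meltria's implementation: `Slot`, `build_schedule`, the example durations, and the assumption that a slot consists of the injected span followed by the recovery time are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import product


@dataclass(frozen=True)
class Slot:
    component: str
    fault: str
    start: datetime       # fault injection begins here (assumption)
    inject_end: datetime  # end of the injected span
    end: datetime         # end of the recovery wait; the next slot may start


def build_schedule(components, faults, first_start, injected_span, recovery_time):
    """Give every component x fault pair its own non-overlapping slot."""
    slots = []
    cursor = first_start
    for component, fault in product(components, faults):
        inject_end = cursor + injected_span
        end = inject_end + recovery_time
        slots.append(Slot(component, fault, cursor, inject_end, end))
        cursor = end  # one fault per slot: never start before the previous slot ends
    return slots


if __name__ == "__main__":
    # Durations below are illustrative values, not taken from the slides.
    plan = build_schedule(
        ["Component A", "Component B", "Component C"],
        ["Fault α", "Fault β"],
        datetime(2021, 11, 26, 9, 0),
        injected_span=timedelta(minutes=5),
        recovery_time=timedelta(minutes=25),
    )
    for slot in plan:
        print(slot.start.time(), slot.component, slot.fault)
```

Because each slot starts only when the previous one has ended, the data collected in a slot can be attributed to exactly one injection.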

  10. First design criterion: system architecture
    Components: Workflow Scheduler, Load Generator, Target Application, Operational Data Storage, Datasets Repository
    1. Inject faults
    2. Pick latest data to datasets (collect the slot's data)
    3. Wait until the application recovers (wait until the next injection time)
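
The three numbered steps on this slide form a loop that the workflow scheduler repeats for each planned injection. The sketch below shows that control flow only, under stated assumptions: `run_workflow` and its `inject`, `collect`, and `wait` callables are hypothetical stand-ins for the chaos tool, the operational data storage, and the recovery wait, and are not Meltria's API.

```python
import time


def run_workflow(plan, inject, collect, wait=time.sleep, recovery_seconds=0.0):
    """Run the inject -> collect -> wait loop once per (component, fault) pair.

    inject(component, fault)  -- starts the fault (hypothetical callable)
    collect(component, fault) -- returns the slot's latest data (hypothetical)
    wait(seconds)             -- blocks until the next injection time
    """
    dataset = []
    for component, fault in plan:
        inject(component, fault)              # 1. inject the fault
        records = collect(component, fault)   # 2. pick the slot's latest data
        dataset.append({"component": component, "fault": fault, "data": records})
        wait(recovery_seconds)                # 3. wait for the application to recover
    return dataset
```

Passing the three operations in as callables keeps the loop testable without a running cluster, which is a design choice of this sketch rather than something stated on the slides.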

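  11. Second design criterion: labeling the dataset
    (1) Select the series to label: only the anomaly propagation path is labeled
    ・s1: application-level metric
    ・s2: metric at the level of the microservice where the fault was injected
    ・s3: metric most closely related to the injected fault
    (2) Classify each series into 3 classes by the presence and position of the anomaly
    ・NOT_FOUND
    ・FOUND_INSIDE_ANOMALY (the expected state)
    ・FOUND_OUTSIDE_ANOMALY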

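  12. Second design criterion: method for classifying into the 3 labels
    68-95-99 rule of the normal distribution (3-sigma rule)
    ・68%, 95%, and 99.7% of values lie within 1, 2, and 3 standard deviations of the mean, respectively (figure quoted from Wikipedia, "68–95–99.7 rule")
    ・Points of a series that fall outside 2 standard deviations are treated as anomalous
    ・The position of the points judged anomalous determines whether the label is FOUND_INSIDE_ANOMALY or FOUND_OUTSIDE_ANOMALY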
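
The 2-standard-deviation rule and the three labels can be sketched as follows. The slide does not say over which window the mean and standard deviation are computed, or how a series with anomalous points both inside and outside the injected span is labeled, so this sketch assumes the statistics cover the whole slot and that any anomalous point outside the injected span yields FOUND_OUTSIDE_ANOMALY. The function name `classify_series` and its parameters are hypothetical.

```python
from statistics import mean, pstdev

NOT_FOUND = "NOT_FOUND"
FOUND_INSIDE_ANOMALY = "FOUND_INSIDE_ANOMALY"
FOUND_OUTSIDE_ANOMALY = "FOUND_OUTSIDE_ANOMALY"


def classify_series(values, inject_start, inject_end, n_sigma=2.0):
    """Label one series; the injected span is the index range [inject_start, inject_end)."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population std over the whole slot
    if sigma == 0:
        return NOT_FOUND  # a perfectly flat series has no anomalous points
    anomalous = [i for i, v in enumerate(values) if abs(v - mu) > n_sigma * sigma]
    if not anomalous:
        return NOT_FOUND
    if all(inject_start <= i < inject_end for i in anomalous):
        return FOUND_INSIDE_ANOMALY
    # assumption: a single anomalous point outside the span decides the label
    return FOUND_OUTSIDE_ANOMALY


if __name__ == "__main__":
    # 120 points, as a 30-minute slot sampled every 15 seconds would give
    series = [10.0] * 110 + [50.0] * 10
    print(classify_series(series, inject_start=110, inject_end=120))
```

A hard threshold like this is sensitive to values that sit close to 2 standard deviations, which matches the misclassification concern raised later in the deck.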

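  13. Evaluation experiment settings
    Data measurement conditions
    ・Data is sampled every 15 seconds
    ・Each slot lasts 30 minutes
    Fault injection: 90 injections in total
    ・8 kinds of containers in Sock Shop
    ・CPU overuse and memory leak
    ・5 injections each
    Series selection
    ・s1: average response time of the front-end container
    ・s2: average response time of the fault-injected service
    ・s3: fault-related metric (user-space CPU usage, memory usage)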

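  14. Evaluation of the first design criterion
    Expectation: s1, s2, and s3 are all FOUND_INSIDE_ANOMALY
    No quantitative evaluation has been carried out yet
    The "classification by visual inspection" did not match FOUND_INSIDE_ANOMALY in the following 2 kinds of cases:
    1. The fault injection failed and no fault occurred
    ・The cause of the injection failures is under investigation
    ・Injection failures need to be detected and labeled as failures
    2. The effect of the immediately preceding fault injection was mixed in
    ・A mechanism that guarantees the application has recovered is needed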

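  15. Evaluation of the second design criterion: labeling accuracy
    ・257 series classified by visual inspection were used as the ground truth
    ・The labeling showed an accuracy of 85%
    Examples of misclassification
    ・Fault injection failed (a), (b)
    ・Spike fluctuation during normal operation (c)
    ・Contamination by the effect of the previous injection (d)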
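
The accuracy reported on this slide is the share of series whose automatic label agrees with the label assigned by visual inspection. A minimal sketch of that calculation follows; the function name `labeling_accuracy` and the toy labels in the usage lines are illustrative only and do not reproduce the experiment's data.

```python
def labeling_accuracy(auto_labels, visual_labels):
    """Fraction of series where the automatic label equals the visual ground truth."""
    if len(auto_labels) != len(visual_labels):
        raise ValueError("both label lists must cover the same series")
    if not auto_labels:
        raise ValueError("at least one labeled series is required")
    matches = sum(a == v for a, v in zip(auto_labels, visual_labels))
    return matches / len(auto_labels)


if __name__ == "__main__":
    # Toy labels for illustration, not results from the deck.
    auto = ["FOUND_INSIDE_ANOMALY", "NOT_FOUND", "FOUND_INSIDE_ANOMALY", "FOUND_OUTSIDE_ANOMALY"]
    visual = ["FOUND_INSIDE_ANOMALY", "NOT_FOUND", "NOT_FOUND", "FOUND_OUTSIDE_ANOMALY"]
    print(labeling_accuracy(auto, visual))
```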

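  16. Labeling challenges
    Labeling accuracy should ideally be on par with manual labeling, human error included
    Human error is unlikely to be as high as 15%, so higher performance is needed
    Ideas for improving accuracy
    ・The 68-95-99 rule decides by a threshold, so series near the threshold are easily misclassified
    ・Some series are hard even for humans to judge as anomalous or not
    ・It would be useful to score the degree of "lack of confidence" when labeling
    ・Since training data is easy to create, supervised learning is also under consideration
    ・Even with similar fluctuation trends, part of the judgment depends on the type of metric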

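  17. Summary
    We proposed Meltria, a dynamic dataset generation system, to avoid overfitting through a rich variety of anomaly patterns
    ① Fault injection scheduling that "includes data management"
    ② Automated labeling of "the presence and position of anomalies" using the empirical rule of the normal distribution
    The implementation is available on GitHub: https://github.com/ai4sre/meltria
    Experiment with 90 fault injections
    ① There were 2 kinds of cases in which the expected anomaly did not occur; these can be handled by adding operational guarantees to Meltria
    ② Labeling accuracy was 85%; misclassification was caused by cases with small fluctuations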

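  18. Future work
    Feature expansion
    ・Dataset generation for data types other than metrics
    ・Support for larger-scale applications
    ・Support for more diverse kinds of faults
    ・Shorter dataset generation time
    Improving academic rigor
    ・Evaluating how much dynamic generation prevents overfitting
    ・Evaluating how useful the proposed labeling is
    Evolve into a system that automatically evaluates various AIOps methods against arbitrary real applications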
