
TSifter: A Dimensionality Reduction Method for Time-Series Data Suited to Rapid Diagnosis of Performance Anomalies in Microservices / TSifter in proceedings of IOTS2020

The 13th IPSJ Internet and Operation Technology Symposium (IOTS2020): https://www.iot.ipsj.or.jp/symposium/iots2020-program/

Yuuki Tsubouchi (yuuk1)

December 03, 2020

Transcript

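  1. TSifter: A Dimensionality Reduction Method for Time-Series Data Suited to Rapid Diagnosis of Performance Anomalies in Microservices
    Yuuki Tsubouchi (SAKURA internet, Kyoto University)
    Hirofumi Tsuruta (SAKURA internet)
    Masahiro Furukawa (Hatena)
    IPSJ 13th Internet and Operation Technology Symposium (IOTS2020)
    December 3, 2020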

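  2. Contents
    1. Research background and objective
    2. A dimensionality reduction method for metrics suited to root-cause diagnosis of anomalies
    3. Experiments and evaluation
    4. Summary and future work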

  3. 1. Research background and objective

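  4. The spread of microservice architectures
    Transition from a monolith to microservices: a distributed architecture split by function
    As the software behind Web services grows in scale, it is becoming difficult for developers to change that software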

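  5. Problems in operating microservices
    Growing volume of monitoring data
    Complexity of dependencies
    Higher frequency of software changes
    The cognitive load imposed by the system rises
    Diagnosing the cause of a performance anomaly comes to take a long time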

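  6. Existing approaches to diagnosing performance anomalies
    Metrics: each one carries little information, but they are easy to collect, store, and visualize.
    Text logs: rich in information, but some events are never written to a log.
    Execution traces: reveal throughput and execution time per edge of the processing path, but instrumenting the application takes effort.
    Considering applicability to real environments, this work focuses on metrics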

  7. Metric-based approaches
    Correlations can be found from each service's time-series graphs, but the location of the cause remains unclear
    Approaches that apply statistical causal discovery have been proposed in the last few years
    ① Build a causal propagation graph among the series  ② Infer the causal paths
    [Figure: response-time series of Services A to F, the causal propagation graph built from them, and the inferred candidates ranked Top-1 and Top-2]
    Ma, M., et al., AutoMAP: Diagnose Your Microservice-based Web Applications Automatically, WWW2020.
    Qiu, J., et al., A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications, Applied Sciences, 2020.
    Lin, J., et al., Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-Service Environments, ICSOC2018.

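  8. Challenges of metric-based approaches
    ・The combination of metric types used for diagnosis is fixed (roughly 1 to 7 metrics)
    ・Examples: response latency only, or {response latency, CPU utilization, memory usage, …}
    ・Metrics closer to the cause may be excluded from the result
    Example: TCP retransmission errors are occurring, yet the change in network bandwidth is small
    As many metric series as possible need to be explored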

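  9. Proposal: dimensionality reduction of metric series for performance anomalies
    Objective: a foundation for automatically inferring the propagation path of an anomaly in microservices
    As the number of series (= number of dimensions) grows, the causal propagation graph becomes huge
    Proposal: "TSifter" (Time series Sifter), a dimensionality reduction method that reacts to anomaly detection and quickly extracts, "temporarily", the series useful for diagnosis from all series
    Three requirements
    ・① Accuracy: series useful for diagnosis are not removed
    ・② Dimensionality reduction rate: remove as many useless series as possible
    ・③ Speed: find the cause quickly (ideally in about 1 minute)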

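  10. Overall picture of the root-cause diagnosis system to be realized eventually
    [Figure: pipeline. Series are collected into a time-series database and anomaly detection runs on them; on YES, all series are fetched, their dimensionality is reduced per service, and root-cause diagnosis (causal graph construction) outputs the causal paths among series such as Service A/req_errors, Service D/connections, Service E/. The figure also marks the scope of the proposed method.]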

  11. 2. A dimensionality reduction method for metrics suited to root-cause diagnosis of performance anomalies

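  12. Requirements for the proposed method TSifter and how they are met
    Requirements: ① accuracy, ② dimensionality reduction rate, ③ speed
    Insight 1: a series whose trend does not change before and after the anomaly is unnecessary for diagnosis
    → Exclude series that are stationary
    Insight 2: groups of series whose time-series graphs have similar shapes are redundant when diagnosing the anomaly
    → Cluster the time series per service
    For n series, clustering costs O(kn), O(n^2), ...
    → Run the exclusion step of Insight 1, which is O(n), first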
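
The ordering argument on this slide can be written as a rough cost model. This is only a sketch under assumptions that are not in the slides: $n'$ denotes the number of series left after the stationarity filter, $c_1$ and $c_2$ are hypothetical per-series constants, and clustering is taken to be quadratic.

```latex
T_{\text{cluster only}} \approx c_2\, n^{2},
\qquad
T_{\text{filter first}} \approx c_1\, n + c_2\, n'^{2},
\quad n' \le n
```

Under this model, filtering first is cheaper whenever $c_1 n < c_2 (n^{2} - n'^{2})$, which holds once the filter removes a meaningful share of a large set of series.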

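  13. TSifter: a two-stage dimensionality reduction method
    Input: raw series in a fixed-length window covering the n minutes before the anomaly occurred
    Step 1: remove series that are stationary (raw series → non-stationary series)
    Step 2: cluster series that take similar shapes (non-stationary series → clustered series)
    After clustering, select a representative series for each cluster
    [Figure: raw series, non-stationary series, clustered series, and representative series, with the anomaly period marked]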

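  14. Step 1: focus on the stationarity of each individual series
    ・Stationarity: the level and spread of the data and its autocorrelation structure stay constant regardless of the time point
    ・Uses the ADF (augmented Dickey-Fuller) test, which is widely used for testing the stationarity of time-series data
    ・Test every series one at a time and remove those that are stationary
    [Figure: examples of non-stationary series that remain]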
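
Step 1 can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it swaps the ADF test for the plain Dickey-Fuller test (no lagged difference terms), hard-codes an approximate large-sample 5% critical value of -2.86, and uses hypothetical metric names and synthetic data.

```python
import math

# Approximate 5% critical value of the Dickey-Fuller test with a constant
# term for large samples; hard-coded here as a simplifying assumption.
DF_CRITICAL_5PCT = -2.86


def df_t_statistic(series):
    """t-statistic of rho in the regression dy[t] = a + rho * y[t-1] + e[t]."""
    lagged = series[:-1]
    diffs = [b - a for a, b in zip(series[:-1], series[1:])]
    n = len(diffs)
    mean_lag = sum(lagged) / n
    mean_diff = sum(diffs) / n
    sxx = sum((x - mean_lag) ** 2 for x in lagged)
    if sxx == 0:
        # A constant series never changes; treat it as stationary.
        return -math.inf
    sxy = sum((x - mean_lag) * (d - mean_diff) for x, d in zip(lagged, diffs))
    rho = sxy / sxx
    intercept = mean_diff - rho * mean_lag
    rss = sum((d - intercept - rho * x) ** 2 for x, d in zip(lagged, diffs))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    if std_err == 0:
        return -math.inf if rho < 0 else math.inf
    return rho / std_err


def is_stationary(series):
    """Reject the unit-root hypothesis when the statistic is below the critical value."""
    return df_t_statistic(series) < DF_CRITICAL_5PCT


def drop_stationary(metrics):
    """Keep only the series that are NOT judged stationary."""
    return {name: s for name, s in metrics.items() if not is_stationary(s)}


# Hypothetical metrics; deterministic jitter stands in for measurement noise.
jitter = [((5 * t) % 13 - 6) / 6.0 for t in range(120)]
metrics = {
    # Fluctuates around a fixed level for the whole window.
    "front-end/cpu": [10.0 + j for j in jitter],
    # Flat at first, then ramps up from t = 80 (the assumed anomaly onset).
    "orders/latency": [j + 0.5 * max(0, t - 80) for t, j in enumerate(jitter)],
}
remaining = drop_stationary(metrics)
print(sorted(remaining))
```

In practice a library implementation of the ADF test, which also chooses the number of lagged terms, would replace `df_t_statistic`; the filtering loop around it stays the same.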

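  15. Step 2: focus on shape similarity between series
    Adopts the shape-based distance (SBD), a distance measure that expresses similarity of shape
    (series are regarded as similar even if they are shifted along the time axis or stretched along the vertical axis)
    Paparrizos, J. and Gravano, L., k-Shape: Efficient and Accurate Clustering of Time Series, SIGMOD2015.
    For speed, adopts hierarchical clustering, which can determine the number of clusters in a single run
    (it is O(n^2), but this is not a problem because the number of series per service is small)
    Selecting the representative series: the series whose sum of distances to the other series is smallest
    [Figure: the group of series within a service is clustered, yielding the service's group of representative series]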
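
The distance and the representative selection described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: SBD is computed as one minus the maximum normalized cross-correlation over all shifts, using a direct O(m^2) loop instead of the FFT used in the k-Shape paper; series are assumed to have equal length and are z-normalized first; the sample series are made up.

```python
import math


def z_normalize(series):
    """Rescale a series to zero mean and unit variance."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / len(series))
    if std == 0:
        return [0.0] * len(series)
    return [(v - mean) / std for v in series]


def sbd(x, y):
    """Shape-based distance: 1 - max over shifts of the normalized cross-correlation."""
    m = len(x)
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0:
        return 1.0  # assumption: a flat series is treated as uncorrelated
    best = max(
        sum(x[i + s] * y[i] for i in range(max(0, -s), min(m, m - s)))
        for s in range(-(m - 1), m)
    )
    return 1.0 - best / norm


def representative(series_list):
    """Index of the series whose summed SBD to all the others is smallest."""
    normalized = [z_normalize(s) for s in series_list]
    totals = [sum(sbd(a, b) for b in normalized) for a in normalized]
    return totals.index(min(totals))


# Made-up series: a bump, a rescaled copy, a time-shifted copy, and a zigzag.
bump = [0, 1, 2, 3, 2, 1, 0, 0]
scaled = [5 + 2 * v for v in bump]
shifted = [0, 0, 1, 2, 3, 2, 1, 0]
zigzag = [3, 0, 3, 0, 3, 0, 3, 0]

print(sbd(z_normalize(bump), z_normalize(scaled)))
print(sbd(z_normalize(bump), z_normalize(shifted)))
print(sbd(z_normalize(bump), z_normalize(zigzag)))
print(representative([bump, scaled, shifted, zigzag]))
```

The rescaled copy is at distance zero (up to rounding) and the shifted copy stays close, while the zigzag is far away, which is the behaviour the slide describes for shifts in time and stretching along the vertical axis.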

  16. 3. Experiments and evaluation

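  17. Experimental environment
    Control server: Locust (generation of external load)
    Fault injection: CPU load (stress-ng), network delay (tc)
    Microservice cluster (Kubernetes): Sock Shop (Front-End, Catalogue, Orders, Payment, Shipping, User, Carts)
    Analysis server: series retrieval module, analysis module
    Prometheus: collection and storage of series
    Series collection interval: 5 seconds
    Series window width: 30 minutes
    Intel Xeon 3.10GHz, 8 cores, 32GB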

  18. Baseline method: Sieve
    A dimensionality reduction method for time-series data
    ・Step 1: remove metrics with small variance
    ・Step 2: clustering with k-Shape
    Sieve aims to extract system characteristics that can be used on a permanent basis; its purpose differs from that of this work, but it may also be applicable to other purposes
    Thalheim, J., et al., Sieve: Actionable Insights from Monitored Metrics in Distributed Systems, Middleware 2017.
    Paparrizos, J. and Gravano, L., k-Shape: Efficient and Accurate Clustering of Time Series, SIGMOD2015.

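  19. ① Accuracy: whether the series that caused each anomaly was correctly retained
    TSifter correctly extracted the causal series in every case
    The baseline method failed only in the case of CPU overload on the shipping service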

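  20. ② Evaluation of the dimensionality reduction rate: 4 anomaly cases
    ・In every case the reduction rate is 91% or higher, narrowing the series down to 1/10 or fewer
    ・The baseline method's reduction rate is slightly higher
    ・TSifter removes more metrics in step 1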

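  21. ③ Evaluation of speed: execution time of each processing step
    ・Environment: 4 CPU cores, 100k metrics
    ・TSifter was 311 times faster than the baseline
    ・(In the additional experiments described later, at least 270 times faster)
    Method     Step 1: preliminary removal (sec)   Step 2: clustering (sec)   Total execution time (sec)
    TSifter    54.41                               8.68                       63.09
    Baseline   32.33                               19590.83                   19623.16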

  22. ③ Evaluation of speed: scalability
    ・Both methods scale linearly with the number of CPU cores and with the number of series
    [Figure: four panels of execution time (sec) for TSifter and the baseline, plotted against the number of metrics and the number of CPU cores; the data labels are tabulated here]
    TSifter, execution time (sec) by number of metrics
    Number of metrics   20000     40000     60000      80000      100000
    Filtering           10.24     20.28     31.05      42.14      54.41
    Clustering          1.21      2.43      3.81       5.72       8.68
    Total               11.45     22.71     34.86      47.86      63.09
    Baseline, execution time (sec) by number of metrics
    Number of metrics   20000     40000     60000      80000      100000
    Filtering           2.88      7.63      13.54      22.91      32.33
    Clustering          3908.10   7773.00   11710.26   15670.81   19590.83
    Total               3910.98   7780.63   11723.80   15693.72   19623.16
    Baseline, execution time (sec) by number of CPU cores
    Number of CPU cores   1         2        3        4
    Filtering             0.17      0.17     0.17     0.17
    Clustering            1224.87   613.31   416.55   317.65
    Total                 1225.04   613.48   416.72   317.82
    TSifter, execution time (sec) by number of CPU cores
    Number of CPU cores   1      2      3      4
    Filtering             3.57   1.81   1.26   0.99
    Clustering            0.37   0.21   0.20   0.15
    Total                 3.93   2.02   1.46   1.14
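
The speedup figures quoted in the talk can be recomputed from the totals on this slide. The sketch below assumes the panel-to-method assignment inferred from the totals (63.09 s for TSifter and 19623.16 s for the baseline at 100,000 metrics); the variable names are illustrative.

```python
# Total execution times in seconds, transcribed from the slide's data labels.
metric_counts = [20000, 40000, 60000, 80000, 100000]
tsifter_by_metrics = [11.45, 22.71, 34.86, 47.86, 63.09]
baseline_by_metrics = [3910.98, 7780.63, 11723.80, 15693.72, 19623.16]

core_counts = [1, 2, 3, 4]
tsifter_by_cores = [3.93, 2.02, 1.46, 1.14]
baseline_by_cores = [1225.04, 613.48, 416.72, 317.82]


def speedups(baseline, proposed):
    """Element-wise ratio of baseline time to proposed-method time."""
    return [b / p for b, p in zip(baseline, proposed)]


by_metrics = speedups(baseline_by_metrics, tsifter_by_metrics)
by_cores = speedups(baseline_by_cores, tsifter_by_cores)

for count, ratio in zip(metric_counts, by_metrics):
    print(f"{count} metrics: {ratio:.1f}x")
for cores, ratio in zip(core_counts, by_cores):
    print(f"{cores} cores: {ratio:.1f}x")
print(f"minimum speedup: {min(by_metrics + by_cores):.1f}x")
```

The ratio at 100,000 metrics rounds to the 311x stated in the talk, and the smallest ratio across both sweeps stays above the quoted lower bound of 270x.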

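  23. Summary of the evaluation against each requirement
    ① Accuracy: the experiments cover only a limited range of service types and failure cases, so further evaluation is needed
    ② Dimensionality reduction rate: the baseline method is slightly higher; how much reduction is ultimately required remains future work
    ③ Speed: ideally execution finishes within 1 minute; the baseline method takes 1225 seconds (about 20 minutes) and cannot meet the requirement in the field
    Even when the number of CPU cores and the number of series change, the ratio of the two methods' execution times stays similar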

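  24. Why is TSifter faster than the baseline method?
    Baseline: runs clustering repeatedly to determine the optimal number of clusters; clustering was executed 310 times
    TSifter: hierarchical clustering, which determines the number of clusters by setting a distance threshold; clustering was executed 7 times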
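
Determining the number of clusters from a distance threshold in a single run can be sketched as below. The slides do not state the linkage criterion or the threshold value, so complete linkage, the threshold of 5, and the toy one-dimensional distances are all assumptions; in TSifter the distance matrix would hold SBD values between the series of one service.

```python
def cluster_by_threshold(dist, threshold):
    """Agglomerative clustering that stops once the closest pair of clusters
    is farther apart than `threshold`; the cluster count falls out of the run."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Complete linkage (an assumption): distance of the farthest pair.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Toy example: absolute differences between points stand in for SBD values.
points = [0, 1, 2, 50, 51]
dist = [[abs(a - b) for b in points] for a in points]
clusters = cluster_by_threshold(dist, threshold=5)
print(clusters)  # [[0, 1, 2], [3, 4]]
```

Because no candidate cluster counts have to be tried, one such run per service suffices, which is consistent with the 7 runs on this slide against the 310 runs of the baseline.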

  25. 4. Summary and future work

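  26. Summary and future work
    ・Proposed a dimensionality reduction method that reacts to anomaly detection and quickly extracts, "temporarily", the metrics useful for diagnosis from a large number of metrics
    ・Within the scope of the experiments, achieved a speedup of at least 270 times over the baseline
    ・Roughly equivalent in accuracy, dimensionality reduction rate, and scalability
    ・Runs in about 1 minute for 100,000 metrics
    ・Future work
    ・Additional evaluation that shows the merits of the proposal more clearly (for example, choosing a more suitable baseline)
    ・Realizing a root-cause diagnosis system that incorporates TSifter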

  27. 0. Supplementary slides

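  28. Limitations of TSifter
    ・Because the analysis window is a fixed value, fluctuations outside the analysis window cannot be taken into account
    [Figure: time axis]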