TSifter: A dimensionality reduction method for time series data suited to rapid diagnosis of performance anomalies in microservices / TSifter in proceedings of IOTS2020

The 13th IPSJ Internet and Operation Technology Symposium (IOTS2020) https://www.iot.ipsj.or.jp/symposium/iots2020-program/

Yuuki Tsubouchi (yuuk1)

December 03, 2020

Transcript

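  1. TSifter: A dimensionality reduction method for time series data suited to rapid diagnosis of performance anomalies in microservices
    坪内 佑樹 (Yuuki Tsubouchi; SAKURA internet, Kyoto University), 鶴田 博文 (SAKURA internet), 古川 雅大 (Hatena)
    Information Processing Society of Japan, The 13th Internet and Operation Technology Symposium (IOTS2020), December 3, 2020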
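  2. Outline
    1. Background and purpose of the research
    2. A dimensionality reduction method for metrics suited to root-cause diagnosis of anomalies
    3. Experiments and evaluation
    4. Summary and future work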

  3. 1. Background and purpose of the research

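  4. The spread of microservice architectures
    Shift from monoliths to microservices: a distributed architecture split by function.
    As the software behind web services grows in scale, it is becoming harder for developers to change that software.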

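  5. Concerns when operating microservices
    - Growing volume of monitoring data
    - Complexity of dependencies
    - Higher frequency of software changes
    The cognitive load of the system rises, and diagnosing the cause of a performance anomaly takes more time.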

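  6. Existing approaches to diagnosing performance anomalies
    - Metrics: each one carries little information, but they are easy to collect, store, and visualize.
    - Text logs: rich in information, but some events are never written to a log.
    - Execution traces: capture throughput and execution time for each edge of a processing path, but instrumenting the application takes effort.
    With applicability to production environments in mind, this work focuses on metrics.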
  7. Metric-based approaches
    Correlations can be found in each service's time series graphs, but the location of the cause remains unclear.
    Approaches that apply statistical causal discovery have been proposed in the last few years:
    (1) build a causal propagation graph between series; (2) infer the causal path.
    [Figure: causal graph over the response-time series of Services A to F, with candidates labeled Top-1 and Top-2]
    Ma, M., et al., AutoMAP: Diagnose Your Microservice-based Web Applications Automatically, WWW2020.
    Qiu, J., et al., A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications, Applied Sciences, 2020.
    Lin, J., et al., Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-Service Environments, ICSOC2018.
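  8. Challenges of metric-based approaches
    - The combination of metric types used for diagnosis is fixed (roughly 1 to 7 metrics).
    - Examples: response latency only, or {response latency, CPU utilization, memory usage, …}.
    - Metrics closer to the cause may be excluded from the result, for example when TCP retransmission errors are occurring but the change in network bandwidth is small.
    As many metric series as possible need to be explored.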
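  9. Proposal: dimensionality reduction of metric series for performance anomalies
    Purpose: a foundation for automatically inferring the propagation path of an anomaly in microservices.
    Problem: as the number of series (= number of dimensions) grows, the causal propagation graph becomes huge.
    Proposal: "TSifter" (Time series Sifter), a dimensionality reduction method that, triggered by anomaly detection, quickly extracts from all series those that are "temporarily" useful for diagnosis.
    Three requirements:
    - (1) Accuracy: series useful for diagnosis are not removed.
    - (2) Dimensionality reduction rate: remove as many useless series as possible.
    - (3) Speed: find the cause quickly (ideally in about 1 minute).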
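  10. Overall picture of the root-cause diagnosis system to be realized eventually
    (1) Collect series into a time series database.
    (2) Detect anomalies.
    (3) On detection (YES), fetch all series.
    (4) Reduce the dimensionality of the series, per service (scope of the proposed method).
    (5) Diagnose the root cause (causal graph construction), yielding causal paths between series such as Service A/req_errors, Service D/connections, Service E/…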
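The per-service reduction in the flow above can be sketched as follows. This is only an illustration: the "<service>/<metric>" naming follows the labels in the figure, while `reduce_per_service`, `keep_first`, the `latency` metric, and all sample values are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def reduce_per_service(series_by_name, reduce_fn):
    """Group series by service and apply a reduction to each group independently.

    Assumption: every key has the form "<service>/<metric>".
    """
    groups = defaultdict(dict)
    for name, values in series_by_name.items():
        service, _, metric = name.partition("/")
        groups[service][metric] = values
    return {service: reduce_fn(metrics) for service, metrics in groups.items()}

def keep_first(metrics):
    # Placeholder reducer: keeps one metric name per service (alphabetically first).
    # A real reducer would apply the two TSifter steps instead.
    return sorted(metrics)[:1]

series = {
    "Service A/req_errors": [0, 0, 3, 9],
    "Service A/latency": [1, 1, 4, 8],
    "Service D/connections": [5, 5, 5, 20],
}
reduced = reduce_per_service(series, keep_first)
```

Because each service is reduced on its own, the groups could also be processed in parallel.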
  11. 2. A dimensionality reduction method for metrics suited to root-cause diagnosis of performance anomalies

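  12. Requirements of the proposed method TSifter and how they are addressed
    Requirements: (1) accuracy, (2) dimensionality reduction rate, (3) speed.
    - Insight 1: series whose trend does not change before and after the anomaly are unnecessary for diagnosis. → Exclude series that are stationary.
    - Insight 2: groups of series whose graphs have similar shapes are redundant for diagnosis. → Cluster the time series per service.
    - Speed: for n series, clustering costs O(kn), O(n^2), ..., so the Insight 1 exclusion, which is O(n), is run first.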
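  13. TSifter: a two-stage dimensionality reduction method
    Input: raw series over a fixed-length window of the n minutes before the anomaly occurred.
    - Step 1: remove series that are stationary, leaving the non-stationary series.
    - Step 2: cluster series with similar shapes, then select a representative series from each cluster.
    [Figure: raw series → non-stationary series → clustered series → representative series, with the anomaly period marked]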
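  14. Step 1: focus on the stationarity of each individual series
    - Stationarity: the level and variability of the data, and its autocorrelation structure, are constant regardless of the point in time.
    - Uses the ADF (augmented Dickey-Fuller) test, which is widely used to test time series data for stationarity.
    - Every series is tested one by one, and series that are stationary are removed.
    [Figure: examples of non-stationary series that remain]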
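A minimal sketch of the Step 1 filter, under stated assumptions. To stay dependency-free it uses the plain Dickey-Fuller regression (no lagged difference terms, so it is not the full ADF test named on the slide) and the asymptotic 5% critical value of -2.86 for a regression with a constant. The function names and sample data are hypothetical. In practice a library implementation such as `adfuller` in statsmodels would be the more usual choice.

```python
import math
import random

DF_CRITICAL_5PCT = -2.86  # asymptotic 5% critical value, regression with constant

def df_statistic(y):
    """t-statistic of b in the regression dy[t] = a + b * y[t-1] + e[t]."""
    x = y[:-1]
    dy = [y[t] - y[t - 1] for t in range(1, len(y))]
    n = len(x)
    mx, mdy = sum(x) / n, sum(dy) / n
    sxx = sum((v - mx) ** 2 for v in x)
    if sxx == 0:
        return float("-inf")  # constant series: treat as stationary
    b = sum((v - mx) * (d - mdy) for v, d in zip(x, dy)) / sxx
    a = mdy - b * mx
    ssr = sum((d - a - b * v) ** 2 for v, d in zip(x, dy))
    se_b = math.sqrt(ssr / (n - 2) / sxx)
    if se_b == 0:
        return float("-inf") if b < 0 else float("inf")
    return b / se_b

def is_stationary(y):
    # Reject the unit-root hypothesis when the statistic is below the critical value.
    return df_statistic(y) < DF_CRITICAL_5PCT

def drop_stationary(series_by_name):
    """Keep only the series that are NOT judged stationary."""
    return {k: v for k, v in series_by_name.items() if not is_stationary(v)}

rng = random.Random(0)
flat = [rng.gauss(0.0, 1.0) for _ in range(200)]  # noise around a fixed level
shifted = [rng.gauss(0.0, 1.0) + (50.0 if t >= 100 else 0.0) for t in range(200)]  # level shift
kept = drop_stationary({"flat": flat, "shifted": shifted})
```

Each series is tested independently, so the cost grows linearly with the number of series.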

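  15. Step 2: focus on shape similarity between series
    - Input: the group of series within a service. Output: the group of representative series for the service.
    - Adopts shape-based distance (SBD), a distance measure that expresses shape similarity (series are treated as similar even when shifted along the time axis or stretched along the vertical axis).
      Paparrizos, J. and Gravano, L., k-Shape: Efficient and Accurate Clustering of Time Series, SIGMOD 2015.
    - For speed, adopts hierarchical clustering, which can determine the number of clusters in a single pass (it is O(n^2), but this is not a problem because the number of series per service is small).
    - Representative selection: within each cluster, the series with the smallest total distance to the other series.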
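Step 2 can be sketched roughly as below: SBD computed as one minus the maximum normalized cross-correlation over all shifts of z-normalized series, agglomerative clustering stopped by a distance threshold, and a medoid as the representative. The slides do not state the linkage criterion or the threshold. Complete linkage and the value 0.2 are assumptions made for this example, as are all function names and sample series.

```python
import math

def znorm(x):
    mean = sum(x) / len(x)
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    return [(v - mean) / sd for v in x] if sd > 0 else [0.0] * len(x)

def sbd(x, y):
    """Shape-based distance: 1 - max normalized cross-correlation over all shifts."""
    x, y = znorm(x), znorm(y)
    n = len(x)  # assumes equal-length series
    denom = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    if denom == 0:
        return 1.0  # a constant series has no shape to compare
    best = max(
        sum(x[i] * y[i - s] for i in range(max(0, s), min(n, n + s)))
        for s in range(-(n - 1), n)
    )
    return 1.0 - best / denom

def cluster_by_shape(series, threshold):
    """Agglomerative clustering (complete linkage, assumed) cut at `threshold`."""
    n = len(series)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = sbd(series[i], series[j])
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        dist, a, b = min(
            (max(d[i][j] for i in clusters[a] for j in clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break  # the threshold, not a preset k, fixes the number of clusters
        clusters[a] += clusters.pop(b)
    return clusters, d

def representatives(clusters, d):
    # Medoid: the member with the smallest total distance to the other members.
    return [min(c, key=lambda i: sum(d[i][j] for j in c)) for c in clusters]

t = range(40)
sine = [math.sin(2 * math.pi * k / 20) for k in t]
sine_shifted = [10 + 2 * math.sin(2 * math.pi * (k - 3) / 20) for k in t]
step = [0.0 if k < 20 else 1.0 for k in t]
step_scaled = [1.0 if k < 20 else 6.0 for k in t]
clusters, dist = cluster_by_shape([sine, sine_shifted, step, step_scaled], threshold=0.2)
reps = representatives(clusters, dist)
```

The shifted and rescaled sine ends up in the same cluster as the original sine, and the two step series form the other cluster, which is the invariance to time shift and vertical scaling described above.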
  16. 3. Experiments and evaluation

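  17. Experimental environment
    - Control server: Locust (external load generation)
    - Fault injection: CPU load injection (stress-ng), network delay injection (tc)
    - Microservice cluster: Sock Shop on Kubernetes (Front-End, Catalogue, Orders, Payment, Shipping, User, Carts)
    - Analysis server: Prometheus (series collection and storage), series retrieval module, analysis module
    - Hardware: Intel Xeon 3.10GHz, 8 cores, 32GB
    - Series collection interval: 5 seconds
    - Series window width: 30 minutes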
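As a small worked example of these settings, a 30-minute window sampled every 5 seconds gives 360 points per series. The sketch below assumes the window ends at the moment the anomaly is detected. The timestamp and the names `SCRAPE_INTERVAL`, `WINDOW`, and `window_bounds` are illustrative only.

```python
from datetime import datetime, timedelta

SCRAPE_INTERVAL = timedelta(seconds=5)  # series collection interval
WINDOW = timedelta(minutes=30)          # series window width

def window_bounds(anomaly_detected_at):
    """Return (start, end) of the fixed-length window before the anomaly."""
    return anomaly_detected_at - WINDOW, anomaly_detected_at

samples_per_series = WINDOW // SCRAPE_INTERVAL
start, end = window_bounds(datetime(2020, 12, 3, 12, 0, 0))
```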
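  18. Baseline method: Sieve, a dimensionality reduction method for time series data
    - Step 1: remove metrics with small variance.
    - Step 2: cluster with k-Shape.
    Sieve aims to extract features of a system that are usable on a continuing basis, so its purpose differs from this research, but it may be applicable to other purposes as well.
    Thalheim, J., et al., Sieve: Actionable Insights from Monitored Metrics in Distributed Systems, Middleware 2017.
    Paparrizos, J. and Gravano, L., k-Shape: Efficient and Accurate Clustering of Time Series, SIGMOD 2015.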
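  19. (1) Accuracy: whether the series that caused each anomaly was retained
    - TSifter correctly extracted the causal series in all cases.
    - The baseline method failed only in the case of CPU overload on the shipping service.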

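  20. (2) Evaluation of dimensionality reduction rate: 4 anomaly cases
    - In every case the reduction rate is 91% or higher, narrowing the series down to 1/10 or less.
    - The baseline's reduction rate is slightly higher.
    - TSifter removes more metrics in Step 1.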

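  21. (3) Evaluation of speed: execution time of each processing step
    - Environment: 4 CPU cores, 100k metrics.
    - TSifter was 311 times faster than the baseline.
    - (In the additional experiments described later, at least 270 times faster.)

    Method     Step 1: pre-filtering (sec)   Step 2: clustering (sec)   Total (sec)
    TSifter    54.41                         8.68                       63.09
    Baseline   32.33                         19590.83                   19623.16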
  22. (3) Evaluation of speed: scalability
    - Both methods scale linearly with respect to the number of CPU cores or the number of series.

    TSifter: execution time (sec) by number of metrics
    Number of metrics   20000   40000   60000   80000   100000
    Clustering           1.21    2.43    3.81    5.72     8.68
    Filtering           10.24   20.28   31.05   42.14    54.41
    Total               11.45   22.71   34.86   47.86    63.09

    Baseline: execution time (sec) by number of metrics
    Number of metrics     20000     40000      60000      80000     100000
    Clustering          3908.10   7773.00   11710.26   15670.81   19590.83
    Filtering              2.88      7.63      13.54      22.91      32.33
    Total               3910.98   7780.63   11723.80   15693.72   19623.16

    Baseline: execution time (sec) by number of CPU cores
    Number of CPU cores         1        2        3        4
    Clustering            1224.87   613.31   416.55   317.65
    Filtering                0.17     0.17     0.17     0.17
    Total                 1225.04   613.48   416.72   317.82

    TSifter: execution time (sec) by number of CPU cores
    Number of CPU cores      1      2      3      4
    Clustering            0.37   0.21   0.20   0.15
    Filtering             3.57   1.81   1.26   0.99
    Total                 3.93   2.02   1.46   1.14
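The speedup figures quoted in the slides (311 times at 100k metrics, at least 270 times overall) can be checked against the total execution times on this slide. The snippet below only divides the reported totals; the variable names are illustrative.

```python
# Total execution time in seconds, keyed by number of metrics.
tsifter_by_metrics = {20000: 11.45, 40000: 22.71, 60000: 34.86, 80000: 47.86, 100000: 63.09}
baseline_by_metrics = {20000: 3910.98, 40000: 7780.63, 60000: 11723.80,
                       80000: 15693.72, 100000: 19623.16}

# Total execution time in seconds, keyed by number of CPU cores.
tsifter_by_cores = {1: 3.93, 2: 2.02, 3: 1.46, 4: 1.14}
baseline_by_cores = {1: 1225.04, 2: 613.48, 3: 416.72, 4: 317.82}

speedup_by_metrics = {n: baseline_by_metrics[n] / tsifter_by_metrics[n]
                      for n in tsifter_by_metrics}
speedup_by_cores = {c: baseline_by_cores[c] / tsifter_by_cores[c]
                    for c in tsifter_by_cores}
min_speedup = min(list(speedup_by_metrics.values()) + list(speedup_by_cores.values()))
```

The smallest ratio comes from the 4-core run (317.82 / 1.14, roughly 279), which is consistent with the "at least 270 times" claim.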
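  23. Summary of the evaluation against each requirement
    - (1) Accuracy: the service types and failure cases in the experiments are limited, so additional evaluation is needed.
    - (2) Dimensionality reduction rate: the baseline is slightly higher. How much reduction is ultimately required is left for future work.
    - (3) Speed: an execution time within 1 minute is ideal; the baseline's 1225 seconds (about 20 minutes) cannot meet requirements in the field. The ratio between the two methods' execution times stays about the same as the number of CPU cores and the number of series change.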
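  24. Why is TSifter faster than the baseline?
    - Baseline: runs clustering repeatedly to determine the optimal number of clusters; clustering ran 310 times.
    - TSifter: hierarchical clustering, with the number of clusters determined by setting a distance threshold; clustering ran 7 times.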
  25. 4. Summary and future work

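  26. Summary and future work
    - Proposed a dimensionality reduction method that, triggered by anomaly detection, quickly extracts from a large number of metrics those that are "temporarily" useful for diagnosis.
    - Within the scope of the experiments, achieved a speedup of at least 270 times over the baseline.
      - Accuracy, dimensionality reduction rate, and scalability are roughly equivalent.
      - Runs in about 1 minute on 100,000 metrics.
    - Future work
      - Additional evaluation that makes the merits of the proposal clearer (for example, choosing a more suitable baseline).
      - Realizing a root-cause diagnosis system that incorporates TSifter.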
  27. 0. Supplementary slides

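  28. Limitations of TSifter
    - Because the analysis period is a fixed value, fluctuations outside the analysis period cannot be taken into account.
    [Figure: time axis]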