
A Dimensionality Reduction Method for Time Series Data Suited to Rapid Diagnosis of Performance Anomalies in Microservices / Dimension Reduction of Time Series Data in Microservices

7th WebSystemArchitecture Research Meeting (WSA研)
https://wsa.connpass.com/event/187128/

Yuuki Tsubouchi (yuuk1)

November 13, 2020

Transcript

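  1. A Dimensionality Reduction Method for Time Series Data Suited to Rapid Diagnosis of Performance Anomalies in Microservices
     Yuuki Tsubouchi (@yuuk1t), Hirofumi Tsuruta (@tsurubee3), Masahiro Furukawa (@yoyogidesaiz)
     7th WebSystemArchitecture Research Meeting, 2020/11/13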

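  2. Outline
     1. Background and objectives
     2. A dimensionality reduction method for metrics, suited to diagnosing the cause of anomalies
     3. Experiments and evaluation
     4. Summary and future work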

  3. 1. Background and objectives

  4. The spread of microservice architectures
     [Figure: a monolithic architecture compared with a microservice architecture]

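  5. Concerns raised by distributing a system into microservices
     - Growing volume of observed data
     - Complexity of dependencies
     - Dynamic changes to software
     As a result, the cognitive load of the system rises, and diagnosing the cause of an anomaly takes longer.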

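  6. Existing approaches to diagnosing performance anomalies
     - Metrics: each one carries little information, but they are easy to collect, store, and visualize.
     - Text logs: rich in information, but some events are never written to a log.
     - Execution traces: reveal throughput and execution time for each edge of a processing path, but instrumenting the application takes effort.
     Considering applicability to real environments, we focus on metrics.
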
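  7. Metric-based approaches
     Inspecting a large number of metrics by eye takes time.
     Methods:
     - Estimate causal relationships between metrics across components.
     - Infer causal paths at the granularity of components.
     Challenges:
     - A fixed number of metrics to use for diagnosis must be specified in advance.
     - Metrics closer to the cause may be excluded from the result.
     As many related metrics as possible need to be analyzed quickly.
     Note: Hirofumi Tsuruta, "A Survey of Causal Discovery Methods Based on Graphical Models" (in Japanese), 2020. https://blog.tsurubee.tech/entry/2020/10/08/085158
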
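  8. Proposal: dimensionality reduction of metrics for performance anomalies
     Objective: provide a foundation for automatically inferring the propagation path of an anomaly in microservices.
     Proposal: "TSifter" (Time series Sifter), a method that quickly extracts the metrics useful for diagnosis from all collected metrics.
     Requirements:
     - (1) Accuracy: metrics useful for diagnosis are not removed.
     - (2) Dimensionality reduction rate: remove as many useless metrics as possible.
     - (3) Speed: recover from the failure as early as possible (ideally about 1 minute).
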
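  9. Overall picture of the cause diagnosis system
     [Figure: a pipeline with numbered stages (1) to (5). Labeled components: data collection, metric database, anomaly detection (branch labeled YES), metric retrieval, metric dimensionality reduction, cause diagnosis. A region is marked "scope of the proposed method". The diagnosis output is drawn as a causal path over nodes such as Service A/req_errors, Service D/connections, and Service E/.]
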
  10. 2. A dimensionality reduction method for metrics, suited to diagnosing the cause of performance anomalies

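  11. Requirements of the proposed method and how they are addressed
     - Balancing (1) accuracy and (2) dimensionality reduction rate
       - Insight 1: a metric whose time series trend does not change before and after the anomaly is unnecessary for diagnosis. → Stationarity test on the time series data
       - Insight 2: within the same service, a group of metrics whose time series change in a similar shape is redundant for diagnosis. → Time series clustering
     - (3) Speed
       - Metrics are filtered beforehand to speed up the clustering step.
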
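  12. TSifter: a two-stage dimensionality reduction method
     [Figure: collected metrics → Step 1 Filtering → non-stationary metrics → Step 2 Clustering → clustered metrics → representative metric (a representative metric is selected after clustering). Labels on the example series: anomaly period, short-term autocorrelation, periodic variation, white noise.]
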
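  13. Step 1: focus on the stationarity of each individual metric
     - Stationarity: the property that the mean and variance of the data are constant over time and the autocovariance depends only on the time lag.
     - We use the ADF (augmented Dickey-Fuller) test, which is widely used to test time series data for stationarity.
     - Every metric is tested one by one, and metrics found to be stationary are removed.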
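The filtering idea in Step 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it swaps in the plain (non-augmented) Dickey-Fuller regression with a constant term instead of the full ADF test, and compares the t-statistic with the asymptotic 5% critical value of -2.86. The metric names and the synthetic series are hypothetical. In practice a library ADF routine such as `adfuller` in statsmodels would be the more usual choice.

```python
import math
import random

# Asymptotic 5% critical value of the Dickey-Fuller t-statistic for a
# regression with a constant and no trend.
DF_CRITICAL_5PCT = -2.86


def dickey_fuller_t(series):
    """t-statistic of gamma in: y[t] - y[t-1] = alpha + gamma * y[t-1] + e[t]."""
    prev = series[:-1]
    diff = [b - a for a, b in zip(series[:-1], series[1:])]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_diff = sum(diff) / n
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    if sxx == 0.0:
        return float("-inf")  # constant series: treated as stationary here
    sxy = sum((p - mean_prev) * (d - mean_diff) for p, d in zip(prev, diff))
    gamma = sxy / sxx
    alpha = mean_diff - gamma * mean_prev
    rss = sum((d - alpha - gamma * p) ** 2 for p, d in zip(prev, diff))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    if std_err == 0.0:
        return float("-inf") if gamma < 0 else float("inf")
    return gamma / std_err


def keep_nonstationary(metrics):
    """Keep only metrics for which the unit-root null is NOT rejected."""
    return {name: s for name, s in metrics.items()
            if dickey_fuller_t(s) > DF_CRITICAL_5PCT}


# Hypothetical data: one metric that only fluctuates around a fixed level,
# and one whose level jumps halfway through (as it might during an anomaly).
rng = random.Random(0)
metrics = {
    "steady_noise": [rng.gauss(0.0, 1.0) for _ in range(200)],
    "level_shift": [(0.0 if t < 100 else 10.0) + rng.gauss(0.0, 0.1)
                    for t in range(200)],
}
kept = keep_nonstationary(metrics)
print(sorted(kept))
```

Because each metric is tested independently, this step parallelizes trivially across metrics.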

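  14. Step 2: focus on shape similarity between metrics
     - Within each microservice, metrics with similar variation trends are clustered, and one representative metric is then selected from each cluster.
     - Representative metric: the metric whose total distance to the other metrics in the cluster is smallest.
     - Shape-based distance (SBD) is adopted as a distance measure that accounts for shape similarity (the two series are slid against each other and their correlation is examined).
     - For speed, hierarchical clustering is adopted, because it can determine the number of clusters in a single clustering run.
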
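The distance measure and the representative selection described in Step 2 can be sketched as follows. This is an illustrative sketch under assumptions: SBD is computed as one minus the maximum normalized cross-correlation over all shifts, using a direct quadratic-time loop, and the inputs are used as given (SBD-based pipelines commonly z-normalize each series first, which is omitted here). The metric names and values are hypothetical.

```python
import math


def cross_correlation(x, y, shift):
    """Sum of x[i] * y[i - shift] over the indices where both series overlap."""
    start = max(0, shift)
    stop = min(len(x), len(y) + shift)
    return sum(x[i] * y[i - shift] for i in range(start, stop))


def sbd(x, y):
    """Shape-based distance: 1 - max over shifts of the normalized cross-correlation."""
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0.0:
        return 1.0  # assumption: an all-zero series has no shape to compare
    best = max(cross_correlation(x, y, s)
               for s in range(-(len(y) - 1), len(x)))
    return 1.0 - best / norm


def representative(cluster):
    """Name of the metric with the smallest total SBD to the rest of the cluster."""
    return min(cluster,
               key=lambda name: sum(sbd(cluster[name], other)
                                    for other in cluster.values()))


# The same pulse shifted in time has (near) zero distance.
pulse = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
print(sbd(pulse, shifted))

# Hypothetical cluster: "bump" lies between the other two shapes.
cluster = {
    "flat": [1.0, 1.0, 1.0, 1.0],
    "bump": [1.0, 2.0, 2.0, 1.0],
    "peak": [0.0, 3.0, 3.0, 0.0],
}
print(representative(cluster))
```

Sliding the series before correlating them is what makes two metrics that react to the same event with a small delay count as similar.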
  15. 3. Experiments and evaluation

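  16. Experimental setup
     - Application: Sock Shop
     - Testbed: built on GKE
     - Metric collection: Prometheus
     - Load generation: Locust
     - Fault injection
       - CPU load: stress-ng
       - Network delay: tc
     See the paper for the hardware configuration.
     [Figure: Sock Shop architecture diagram, https://microservices-demo.github.io/]
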
  17. Baseline method: Sieve [Thalheim 17]
     - Step 1: remove metrics with small variance.
     - Step 2: cluster the metrics with the time series clustering method k-Shape, then select a representative metric.
     Sieve aims to extract features of a system that can be used on a continuing basis, which differs from the aim of this work, but it may be applicable to other purposes as well.
     [Thalheim 17] Thalheim, J., Rodrigues, A., Akkus, I. E., Bhatotia, P., Chen, R., Viswanath, B., Jiao, L. and Fetzer, C., Sieve: Actionable Insights from Monitored Metrics in Distributed Systems, the ACM/IFIP/USENIX Middleware, pp. 14–27, 2017.

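For comparison with TSifter's stationarity test, the baseline's first step (dropping metrics with small variance) can be illustrated like this. The threshold value and the metric names are hypothetical; the slides do not state the threshold Sieve uses.

```python
from statistics import pvariance


def drop_low_variance(metrics, min_variance=1e-3):
    """Keep metrics whose population variance exceeds min_variance.

    min_variance is a hypothetical threshold chosen for illustration.
    """
    return {name: series for name, series in metrics.items()
            if pvariance(series) > min_variance}


# Hypothetical metrics: one constant gauge and one that oscillates.
metrics = {
    "open_files": [5.0] * 10,
    "requests": [0.0, 1.0] * 5,
}
print(sorted(drop_low_variance(metrics)))
```

A variance filter keeps any metric that fluctuates, including ones that fluctuate the same way before and after an anomaly, whereas a stationarity test targets a change in behavior over time.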
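  18. Accuracy: whether the causal metric was correctly extracted in each fault case
     - TSifter correctly extracted the causal metric in every case.
     - The baseline method was incorrect only in the CPU overload case of the shipping service.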

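  19. Evaluation of the dimensionality reduction rate
     - In every case the dimensionality reduction rate is 91% or higher, so the metrics are narrowed down to 1/10 or fewer.
     - The baseline method has a slightly higher dimensionality reduction rate.
     - TSifter removes more metrics in Step 1.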

  20. Speed evaluation: execution time versus number of CPU cores
     - TSifter ran at least 270 times faster than the baseline.
     - For both methods, execution time scaled with the number of CPU cores.

     Baseline, execution time (sec):
       CPU cores         1         2         3         4
       Clustering  1224.87    613.31    416.55    317.65
       Filtering      0.17      0.17      0.17      0.17
       Total       1225.04    613.48    416.72    317.82

     TSifter, execution time (sec):
       CPU cores         1         2         3         4
       Clustering     0.37      0.21      0.20      0.15
       Filtering      3.57      1.81      1.26      0.99
       Total          3.93      2.02      1.46      1.14

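The speedup claim can be checked directly from the total execution times reported on this slide. The short script below only divides the reported baseline totals by the reported TSifter totals; the variable names are illustrative.

```python
# Total execution time in seconds, keyed by number of CPU cores (from the slide).
baseline_total = {1: 1225.04, 2: 613.48, 3: 416.72, 4: 317.82}
tsifter_total = {1: 3.93, 2: 2.02, 3: 1.46, 4: 1.14}

speedup = {cores: baseline_total[cores] / tsifter_total[cores]
           for cores in baseline_total}

for cores, ratio in speedup.items():
    print(f"{cores} core(s): {ratio:.1f}x faster")

print(f"minimum speedup: {min(speedup.values()):.1f}x")
```

The ratios fall between roughly 279x and 312x, which is consistent with the stated lower bound of 270x.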
  21. Speed evaluation: execution time versus number of metrics
     - For both methods, execution time scaled linearly as the number of metrics increased.

     TSifter, execution time (sec):
       Metrics       20000     40000     60000     80000    100000
       Clustering     1.21      2.43      3.81      5.72      8.68
       Filtering     10.24     20.28     31.05     42.14     54.41
       Total         11.45     22.71     34.86     47.86     63.09

     Baseline, execution time (sec):
       Metrics       20000     40000     60000     80000    100000
       Clustering  3908.10   7773.00  11710.26  15670.81  19590.83
       Filtering      2.88      7.63     13.54     22.91     32.33
       Total       3910.98   7780.63  11723.80  15693.72  19623.16

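One simple way to read "linear scaling" from these numbers is to compute the time spent per 1,000 metrics at each size and see how little it varies. The script below does that with the totals reported on this slide; the spread measure (largest per-unit cost divided by smallest) is an illustrative choice, not a metric used by the authors.

```python
metric_counts = [20000, 40000, 60000, 80000, 100000]

# Total execution time in seconds at each metric count (from the slide).
total_sec = {
    "TSifter": [11.45, 22.71, 34.86, 47.86, 63.09],
    "baseline": [3910.98, 7780.63, 11723.80, 15693.72, 19623.16],
}

# Seconds per 1,000 metrics; a flat profile indicates linear scaling.
sec_per_1000 = {
    method: [t / (n / 1000) for n, t in zip(metric_counts, times)]
    for method, times in total_sec.items()
}

# Ratio of the largest to the smallest per-unit cost for each method.
spread = {method: max(costs) / min(costs)
          for method, costs in sec_per_1000.items()}

for method, costs in sec_per_1000.items():
    formatted = ", ".join(f"{c:.2f}" for c in costs)
    print(f"{method}: {formatted} sec per 1000 metrics (spread {spread[method]:.2f})")
```

Both methods stay within a narrow band, so the difference between them is a constant factor rather than a difference in growth rate.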
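  22. Summary of the evaluation against each requirement
     - The kinds of services and fault cases in this experiment are limited, so accuracy needs further evaluation.
     - The baseline method achieves a higher dimensionality reduction rate, but TSifter's advantage in speed is larger than that difference.
     - Execution within 1 minute is the ideal; the baseline method takes 1225 seconds (about 20 minutes) and cannot meet that operational requirement.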

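  23. Why TSifter is faster than the baseline
     - To determine the optimal number of clusters, the baseline method runs the k-Shape algorithm repeatedly while varying the number of clusters.
       - Clustering is executed multiple times.
     - TSifter's hierarchical clustering can determine the number of clusters after the clustering run by applying a distance threshold.
       - Clustering is executed once.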
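The point about a single clustering run can be illustrated with a small agglomerative clustering sketch in which the number of clusters falls out of a distance threshold rather than being searched for. This is a sketch under assumptions: complete linkage is used (the slides do not state TSifter's linkage), merging simply stops at the threshold, which for complete linkage gives the same clusters as cutting the finished dendrogram there, and the items are plain numbers with absolute difference standing in for SBD between series.

```python
def cluster_by_threshold(items, dist, threshold):
    """Agglomerative clustering that merges until the closest pair of
    clusters is farther apart than `threshold` (complete linkage)."""
    clusters = [[i] for i in range(len(items))]

    def linkage(a, b):
        # Complete linkage: distance between the farthest pair of members.
        return max(dist(items[i], items[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if d > threshold:
            break  # remaining clusters are all farther apart than the threshold
        clusters[p] += clusters[q]
        del clusters[q]
    return [[items[i] for i in c] for c in clusters]


# Hypothetical 1-D example; the number of clusters is not given in advance.
points = [0.0, 0.1, 0.2, 5.0, 5.1]
result = cluster_by_threshold(points, lambda a, b: abs(a - b), threshold=1.0)
print(result)
```

A partitioning method that needs the number of clusters as input has to be rerun for each candidate value, which is where the baseline spends most of its time.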

  24. 4. Summary and future work

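  25. Summary and future work
     - We proposed a dimensionality reduction method for extracting the metrics useful for diagnosis from a large number of metrics.
     - Within the scope of the experiments, it achieved a speedup of at least 270 times over the baseline method.
     - Accuracy, dimensionality reduction rate, and scalability are roughly equivalent to the baseline.
     - It can process 100,000 metrics in about 1 minute.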

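  26. Future work
     - Build a cause diagnosis system that incorporates TSifter.
     - Apply TSifter to the practice of Chaos Engineering.
       - When a phenomenon that differs from the prior hypothesis (known-unknowns) occurs after an "experiment", use TSifter to diagnose that phenomenon.
       - Because TSifter scans all metrics, it is well suited to handling unknown phenomena (unknown-unknowns).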