Upgrade to Pro — share decks privately, control downloads, hide ads and more …

TSifter: マイクロサービスにおける性能異常の 迅速な診断に向いた時系列データの次元削減手法

TSifter: マイクロサービスにおける性能異常の 迅速な診断に向いた時系列データの次元削減手法

A658ec7f1badf73819dfa501165016c1?s=128

Yuuki Tsubouchi (yuuk1)

November 24, 2020
Tweet

Transcript

  1. TSifter: ϚΠΫϩαʔϏεʹ͓͚Δੑೳҟৗͷ ਝ଎ͳ਍அʹ޲͍ͨ࣌ܥྻσʔλͷ࣍ݩ࡟ݮख๏ ௶಺ ༎थ(@yuuk1t), ௽ా തจ(@tsurubee3), ݹ઒ խେ(@yoyogidesaiz) @͸ͯͳ

    2020/11/24
  2. 2 SLIɾSLOΛத৺ʹਾ͑ͨγεςϜҟৗ΁ͷΞϓϩʔν ɾ༧ଌɹաڈʹSLI͕௿Լͨ͠௚લͷৼΔ෣͍ͱྨࣅ͢ΔৼΔ෣͍Λൃݟ ɾҟৗՕॴͷಛఆɹSLOҧ൓࣌ʹҟৗՕॴΛ୳ࡧ ɾݪҼڀ໌ɹʢ޻ࣄதʣ ɾճ෮ɹSLIΛλʔήοτɺใुͱ͢ΔϑΟʔυόοΫ੍ޚɺڧԽֶश ɾSLIͷ୅ସɹࠓ͋ΔϝτϦοΫΛ૊Έ߹ΘͤͯSLIͷ୅ସࢦඪΛ࡞੒ ༧ଌ ҟৗՕॴͷಛఆ ݪҼڀ໌

    ճ෮ ݕ஌ SLOʹجͮ͘Ξϥʔτ ௚ۙͷڵຯର৅
  3. 3 1. ݚڀͷഎܠͱ໨త 2. ҟৗͷݪҼ਍அʹ޲͍ͨϝτϦοΫͷ࣍ݩ࡟ݮख๏ 3. ࣮ݧͱධՁ 4. ·ͱΊͱࠓޙͷల๬ ໨࣍

  4. 1. ݚڀͷഎܠͱ໨త

  5. 5 ϚΠΫϩαʔϏεߏ੒ͷීٴ ϞϊϦγοΫ ϚΠΫϩαʔϏε

  6. 6 ؍ଌσʔλྔ ͷ૿େ ϚΠΫϩαʔϏεʹΑΔ෼ࢄԽʹ·ͭΘΔ໰୊ҙࣝ ґଘؔ܎ͷෳࡶੑ ιϑτ΢ΣΞͷ ಈతͳมߋ γεςϜͷೝ஌ෛՙ͕ߴ·Δ ҟৗͷݪҼΛ਍அ͢ΔͨΊͷ࣌ؒΛཁ͢ΔΑ͏ʹͳΔ

  7. 7 ੑೳҟৗΛ਍அ͢ΔͨΊͷطଘͷΞϓϩʔν ϝτϦοΫ ςΩετϩά ࣮ߦτϨʔε ๛෋ͳ৘ใΛ΋͕ͭϩάʹग़ྗ͞Εͳ͍΋ͷ΋ ͋Δ ॲཧܦ࿏ͷล୯Ґͷεϧʔϓοτ΍࣮ߦ࣌ؒΛ ೺ѲͰ͖Δ͕ɺΞϓϦέʔγϣϯʹܭଌॲཧΛ ઃఆ͢Δख͕ؒ͋Δ

    ݸʑͷ৘ใྔ͸গͳ͍͕ऩूɺอଘɺՄࢹԽ͠ ΍͍͢ɻ ࣮؀ڥ΁ͷద༻ੑΛ౿·͑ͯɺʮϝτϦοΫʯʹண໨
  8. 8 1. ϚΠΫϩαʔϏεؒΛԣஅ͢ΔܥྻؒͷҼՌͷ఻ൖϞσϧΛߏங※ 2. ϚΠΫϩαʔϏε୯ҐͷҼՌͷܦ࿏Λਪ࿦͢Δ ϝτϦοΫϕʔεΞϓϩʔν ҟৗͷݕ஌ޙʹɺϝτϦοΫͷܥྻάϥϑΛ໨ࢹ͢Δͷ͸࣌ؒΛཁ͢Δ ख๏ ※ ௽ాതจ,

    άϥϑΟΧϧϞσϧʹجͮ͘ҼՌ୳ࡧख๏ͷௐࠪ, 2020. https://blog.tsurubee.tech/entry/2020/10/08/085158 Service A/ req_errors Service D/ connections Service E/ disk IOPS ҼՌͷܦ࿏
  9. ɾ਍அʹར༻͢Δݻఆ਺ͷϝτϦοΫΛࢦఆ͠ͳ͚Ε͹ͳΒͳ͍ ɾྫʣԠ౴஗ԆͷΈɺ{Ԡ౴஗Ԇ,CPUར༻཰,ϝϞϦ࢖༻ྔ,…} ͳͲ ɾΑΓݪҼʹ͍ۙϝτϦοΫ͕݁Ռ͔Βআ֎͞ΕΔՄೳੑ͕͋Δ 9 ϝτϦοΫϕʔεΞϓϩʔνͷ՝୊ Ͱ͖ΔݶΓଟ͘ͷϝτϦοΫΛ෼ੳ͢Δඞཁ͕͋Δ

  10. 10 ੑೳҟৗʹର͢ΔϝτϦοΫͷܥྻͷ࣍ݩ࡟ݮͷఏҊ ໨త: ϚΠΫϩαʔϏεʹ͓͍ͯɺҟৗͷ఻೻ܦ࿏ΛࣗಈͰਪ࿦͢ΔͨΊ ͷج൫Λఏڙ͢Δ ఏҊ: ҟৗͷݕ஌ʹ൓Ԡͯ͠ɺʮҰ࣌తʹʯ਍அʹ༗༻ͳܥྻΛશܥྻ͔ Βߴ଎ʹநग़͢Δख๏ “TSifter” (Time

    series Sifter) ɾᶃਖ਼֬ੑ :਍அʹ༗༻ͳܥྻ͕࡟ݮ͞Ε͍ͯͳ͍ ɾᶄ࣍ݩ࡟ݮ཰: ແ༻ͳܥྻΛͳΔ΂͘ଟ͘࡟ݮ͍ͨ͠ ɾᶅߴ଎ੑ : ਝ଎ʹݪҼΛΈ͚͍ͭͨ (ཧ૝͸1෼ఔ౓)
  11. 11 ࠷ऴతʹ࣮ݱ͍ͨ͠ݪҼ਍அγεςϜͷશମ૾ શܥྻ औಘ ܥྻͷ ࣍ݩ࡟ݮ ݪҼ਍அ ࣌ܥྻ σʔλϕʔε σʔλऩू

    ҟৗݕ஌ ఏҊख๏ͷείʔϓ YES Service A/ req_errors Service D/ connections Service E/ ܥྻؒͷҼՌͷܦ࿏ ᶃ ᶄ ᶅ ᶆ ᶇ
  12. 2. ੑೳҟৗͷݪҼ਍அʹ޲͍ͨ ϝτϦοΫͷ࣍ݩ࡟ݮख๏

  13. 13 ఏҊख๏ TSifter ͷཁ݅ͱղܾ ᶃਖ਼֬ੑ ᶄ࣍ݩ࡟ݮ཰ ᶅߴ଎ੑ ಎ࡯1 ಎ࡯2 ҟৗൃੜલޙͰ࣌ܥྻͷ܏޲͕มԽ͠ͳ͍

    ܥྻ͸਍அ࣌ʹෆཁ → ࣌ܥྻσʔλͷఆৗੑΛ΋ͭܥྻΛআ֎ ࣌ܥྻมԽͷܗঢ়͕ࣅ͍ͯΔܥྻ܈͸ҟৗ ͷ਍அ࣌ʹ৑௕ → ࣌ܥྻͷΫϥελϦϯά ܥྻ਺͕େ͖͍΄ͲΫϥελϦϯάॲཧ͕஗͍ → ಎ࡯1ͷআ֎ॲཧΛઌʹ࣮ߦ͢Δ
  14. 14 TSifter: 2ஈ֊ͷ࣍ݩ࡟ݮख๏ ɾɾɾ ɾɾɾ ɾɾɾ εςοϓ1 ఆৗੑΛ΋ͭ ܥྻΛআڈ ੜͷܥྻ

    ඇఆৗͳܥྻ ΫϥελԽ͞Εͨܥྻ ୅දܥྻ ҟৗظؒ ΫϥελϦϯάޙʹΫϥελ ͷ୅දܥྻΛબ୒ εςοϓ2 ྨࣅͷܗঢ়Λ ͱΔܥྻΛ ΫϥελϦϯά ҟৗൃੜલn෼ͷ ݻఆ௕ͷ΢Οϯυ΢෯
  15. 15 ɾఆৗੑ: σʔλͷฏۉ͓Αͼ෼ࢄ͕࣌ؒʹΑΒͣҰఆɼ͔ͭࣗݾڞ ෼ࢄ͕࣌ؒࠩͷΈʹґଘ͢Δੑ࣭ ɾ࣌ܥྻσʔλͷఆৗੑݕఆʹ޿͘ར༻͞ΕΔADFݕఆΛར༻͢Δ ɾશͯͷܥྻΛ1ͭͣͭݕఆ͠ɺఆৗੑΛ΋ͭܥྻΛআڈ͢Δ εςοϓ1: ݸʑͷϝτϦοΫͷఆৗੑʹண໨ ఆৗੑΛ΋ͭܥྻͷྫ

  16. 16 ɾ֤αʔϏε಺Ͱɺάϥϑͷܗঢ়͕ࣅ͍ͯΔܥྻΛΫϥελϦϯά ɾܗঢ়ͷྨࣅੑΛߟྀͨ͠ڑ཭ई౓ͱͯ͠ shape-based distance (SBD) Λ࠾༻ (2ͭͷܥྻΛεϥΠυͤͯ͞૬ؔΛΈΔ) ɾߴ଎ԽͷͨΊʹɺ1ճͷΫϥελϦϯάॲཧͰΫϥελ਺ΛܾఆՄ ೳͳ֊૚తΫϥελϦϯάΛ࠾༻

    ɾ࠷ޙʹɺ֤Ϋϥελͷ୅දͱͳΔܥྻΛҰͭબ୒ ɾଞͷܥྻͱͷڑ཭ͷ૯࿨͕࠷খͷܥྻΛબ୒ εςοϓ2: ܥྻؒͷܗঢ়ྨࣅੑʹண໨ [Paparrizos 15]: Paparrizos, J. and Gravano, L., k-Shape: Efficient and Accurate Clustering of Time Series, The ACM Special Interest Group on Management of Data (SIGMOD), pp. 1855– 1870 2015. [Paparrizos 15]
  17. 17 TSifterͷ੍໿ ɾ෼ੳظ͕ؒݻఆ஋Ͱ͋ΔͨΊɺ෼ੳظؒ֎ͷมಈΛߟྀͰ͖ͳ͍ ɾྫ͑͹ɺ༧Ί෼ੳظؒΛ30෼ͱ͢Δͱɺҟৗͷݕ஌࣌ࠁͷ40෼લ ʹ܏޲͕มԽ͠͸͡Ίͯɺ30෼લ͔Β0෼લͷؒͰ͸ఆৗੑΛ΋ͭ ৔߹ɺ͜ͷܥྻ͸আڈ͞ΕΔ

  18. 3. ࣮ݧͱධՁ

  19. 19 ࣮ݧઃఆ ɾΞϓϦέʔγϣϯ: Sock Shop ɾςετϕου: GKE্ʹߏங ɾϝτϦοΫऩू: Prometheus ɾෛՙੜ੒:

    Locust ɾނো஫ೖ ɾCPUෛՙ: stress-ng ɾωοτϫʔΫ஗Ԇ: tc ϋʔυ΢ΣΞߏ੒͸༧ߘΛࢀর Sock Shopͷߏ੒ਤ https://microservices-demo.github.io/
  20. 20 ϕʔεϥΠϯख๏: Sieve [Thalheim 17] ɾεςοϓ1: ෼ࢄ஋ͷখ͍͞ϝτϦοΫΛऔΓআ͘ ɾεςοϓ2: ࣌ܥྻΫϥελϦϯάख๏k-ShapeʹΑΓΫϥελϦϯ άͨ͠ͷͪʹ୅දϝτϦοΫΛબग़͢Δ

    [Thalheim 17] Thalheim, J., Rodrigues, A., Akkus, I. E., Bhatotia, P., Chen, R., Viswanath, B., Jiao, L. and Fetzer, C., Sieve: Actionable Insights from Monitored Metrics in Distributed Systems, the ACM/IFIP/USENIX Middleware, pp. 14–27 2017. ߃ৗతʹར༻ՄೳͳγεςϜͷಛ௃Λநग़͢Δ͜ͱ͕໨తͰ͋Γɺຊ ݚڀͱ͸໨త͕ҟͳΔ͕ɺҟͳΔ໨తʹ΋Ԡ༻Ͱ͖ΔՄೳੑ͕͋Δ
  21. 21 ਖ਼֬ੑ: ނোέʔε͝ͱͷݪҼͱͳΔϝτϦοΫͷਖ਼ޡ TSifter͸શͯͷέʔεʹରͯ͠ਖ਼͘͠ݪҼͱͳΔϝτϦοΫΛநग़ͨ͠ ϕʔεϥΠϯख๏͸shippingαʔϏεͷCPUաෛՙͷέʔεͷΈෆਖ਼ղͱͳͬͨ

  22. 22 ࣍ݩ࡟ݮ཰ͷධՁ ɾ͍ͣΕͷέʔεʹ͓͍ͯ΋ɺ91%Ҏ্ͷ࣍ݩ࡟ݮ཰Ͱ͋Γɺ1/10Ҏ ԼʹߜΓࠐΊ͍ͯΔ ɾϕʔεϥΠϯख๏ͷ΄͏͕࣍ݩ࡟ݮ཰͸Θ͔ͣʹߴ͍ ɾTSifter͸εςοϓ1ͰΑΓଟ͘ͷϝτϦοΫΛ࡟ݮͰ͖͍ͯΔ

  23. 23 ɾTSifterͷ࣮ߦ࣌ؒ͸ϕʔεϥΠϯͷ270ഒҎ্Ͱ͋ͬͨ ɾ͍ͣΕͷख๏΋CPUίΞʹର࣮ͯ͠ߦ࣌ؒ͸εέʔϧͨ͠ ߴ଎ੑͷධՁ: CPUίΞ਺ʹର͢Δ࣮ߦ࣌ؒ 0 200 400 600 800

    1000 1200 1400 1 2 3 4 Execution time (sec) Number of CPU cores Clustering 1224.87 613.31 416.55 317.65 Filtering 0.17 0.17 0.17 0.17 Total 1225.04 613.48 416.72 317.82 0 1 2 3 4 1 2 3 4 Execution time (sec) Number of CPU cores Clustering 0.37 0.21 0.20 0.15 Filtering 3.57 1.81 1.26 0.99 Total 3.93 2.02 1.46 1.14 TSifter ϕʔεϥΠϯ
  24. 24 ɾ͍ͣΕͷख๏΋ϝτϦοΫ਺͕૿େʹରͯ͠ઢܗʹεέʔϧͨ͠ ߴ଎ੑͷධՁ: ϝτϦοΫ਺ʹର͢Δ࣮ߦ࣌ؒ TSifter ϕʔεϥΠϯ 0 20 40 60

    20000 40000 60000 80000 100000 Execution time (sec) Number of metrics Clustering 1.21 2.43 3.81 5.72 8.68 Filtering 10.24 20.28 31.05 42.14 54.41 Total 11.45 22.71 34.86 47.86 63.09 0 5000 10000 15000 20000 20000 40000 60000 80000 100000 Execution time (sec) Number of metrics Clustering 3908.10 7773.00 11710.26 15670.81 19590.83 Filtering 2.88 7.63 13.54 22.91 32.33 Total 3910.98 7780.63 11723.80 15693.72 19623.16
  25. 25 ࣮ߦ࣌ؒ͸1෼Ҏ಺͕ཧ૝Ͱ͋ΓɺϕʔεϥΠϯख๏ͷ࣮ߦ ࣌ؒ͸1225ඵʢ20෼ʣͰ͋Γɺݱ৔Ͱͷཁ݅Λຬͨͤͳ͍ ֤ཁ݅ʹର͢ΔධՁͷ·ͱΊ ᶃਖ਼֬ੑ ᶄ࣍ݩ ࡟ݮ཰ ᶅߴ଎ੑ ࣮ݧͰ͸ɺαʔϏεͷछྨ΍ނোέʔε͕ݶఆతͳͨΊɺ ௥ՃͷධՁ͕ඞཁ

    ࣍ݩ࡟ݮ཰͸ϕʔεϥΠϯख๏͕Θ͔ͣʹ্ճΔɻ ݪҼ਍அγεςϜͱͯ͠Ͳͷఔ౓ͷ࣍ݩ࡟ݮ཰͕ཁٻ͞ΕΔ ͔͸ࠓޙͷ՝୊
  26. 26 ɾϕʔεϥΠϯख๏͸ɺ࠷దͳΫϥελ਺Λܾఆ͢ΔͨΊʹɺ Ϋϥε λ਺ΛมԽͤ͞ͳ͕Βɺ܁Γฦ͠k-ShapeΞϧΰϦζϜΛ࣮ߦ͢Δ ɾΫϥελϦϯάͷ࣮ߦճ਺͕ෳ਺ճ ɾTSifterͷ֊૚తΫϥελϦϯά͸ɺΫϥελϦϯά࣮ߦޙʹڑ཭ͷ ᮢ஋Λ༻͍ͯΫϥελ਺ΛܾఆͰ͖Δ ɾΫϥελϦϯάͷ࣮ߦճ਺͸1 ϕʔεϥΠϯʹରͯ͠ߴ଎ͱͳΔཧ༝

  27. 4. ·ͱΊͱࠓޙͷల๬

  28. 28 ɾେྔͷϝτϦοΫ͔Β਍அʹ༗༻ͳϝτϦοΫΛநग़͢ΔͨΊͷ࣍ ݩ࡟ݮख๏ΛఏҊͨ͠ ɾ࣮ݧͷൣғ಺Ͱ͸ɺϕʔεϥΠϯख๏ʹରͯ͠ɺ࠷௿Ͱ΋270ഒͷ ߴ଎ͱͳͬͨ ɾਖ਼֬ੑɺ࣍ݩ࡟ݮ཰ɺεέʔϥϏϦςΟͰ͸ಉ౳ఔ౓Ͱ͋Δ ɾ10 ສϝτϦοΫʹରͯ͠1෼ఔ౓ͷ࣌ؒͰ࣮ߦՄೳͰ͋Δ ·ͱΊͱࠓޙͷల๬

  29. 29 ɾTSifterΛ૊ΈࠐΜͩݪҼ਍அγεςϜΛ࣮ݱ͢Δ ɾTSifterΛChaos EngineeringʹԠ༻͢Δ ɾʮ࣮ݧʯޙʹࣄલͷԾઆ(known-unknowns)ͱҟͳΔݱ৅͕ൃੜ͠ ͨͱ͖ʹɺͦͷݱ৅Λ਍அ͢ΔͨΊʹɺTSifterΛ࢖͏ ɾTSifter͸શͯͷϝτϦοΫΛ૸ࠪ͢ΔͨΊɺະ஌ͷݱ৅ ʢunknown-unknownsʣʹରॲ͠΍͍͢ ࠓޙͷల๬