Research on Applying Machine Learning to Cloud System Operations / CLOUD AI

6th Sakura Internet Research Meeting (第6回さくらインターネット研究会)

Yuuki Tsubouchi (yuuk1)

December 09, 2020

Transcript

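  1. Self-introduction

    Yuuki Tsubouchi / ゆううき (@yuuk1t) https://yuuk.io/
    Site Reliability Engineering (SRE) Researcher

    Career:
    • Hatena Co., Ltd.: web operations engineer and SRE; development and operation of web services (5 years)
    • SAKURA internet Research Center: researcher; research on SRE (present)
    • Kyoto University, Graduate School of Informatics: doctoral program
    • IPSJ Special Interest Group on Internet and Operation Technology: steering committee member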
  2. Ironies of Automation (1983)

    1. Automation can reduce the human workload.
    2. However, the more we automate, the higher the cognitive load on humans becomes.
    3. Advanced training is then required so that people can withstand that cognitive load.

    • L. Bainbridge, "Ironies of automation," Analysis, design and evaluation of man-machine systems, pp. 129-135, 1983.
    • G. Baxter, et al., "The ironies of automation: still going strong at 30?," the 30th European Conference on Cognitive Ergonomics, 2012.
    • B. Strauch, "Ironies of automation: Still unresolved after all these years," IEEE Transactions on Human-Machine Systems, vol. 48, no. 5, pp. 419-433, 2017.
    • J. Paul Reed, "When /bin/sh Attacks: Revisiting 'Automate All the Things'," USENIX LISA19, 2019.
    • Tanner Lund, "Ironies of Automation: A Comedy in Three Parts," USENIX SREcon19 Asia/Pacific, 2019.

    Trying to automate everything does not work.
    Even if we discard complexity and make things simple, what cannot be done still cannot be done.
    The question is how to accept complexity and still lower the cognitive load on humans.
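  3. The basic premise of the approach to the Ironies of Automation

    Approach to the Ironies of Automation:
    • How do we acquire cognition of an increasingly complex system?
    • With insufficient cognition, failure is frightening, so time is spent investigating before every system change => the more the complexity grows, the longer this takes => cognition cannot keep up.

    1. Design operations on the premise that failure is tolerated: Site Reliability Engineering (SRE)
    2. Offload human cognition itself to computers: AI technology

    * Yuuki Tsubouchi, report on attending IEEE CLOUD 2020, an international conference on cloud computing, https://blog.yuuk.io/entry/2020/ieeecloud2020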
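  4. What is reliability?

    Reliability is the probability that a system performs its required function, under stated conditions and for a stated period of time, without failure. *2

    • Service Level Indicator (SLI)
    • Service Level Objective (SLO)

    SRE does not aim for 100% reliability.

    Many service providers operating in the cloud measure reliability quantitatively with SLIs and SLOs and use the results for decision making. They define a reliability indicator and its target value, and as long as the indicator does not fall below the target during the measurement period, the provider can change the system aggressively.

    Note: an SLA (Service Level Agreement) is a business contract and includes items such as compensation for user dissatisfaction.

    *2 P. O'Connor, A. Kleyner, Practical Reliability Engineering, 5th edition, Wiley, 2012.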
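The SLI/SLO decision rule described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the `SloWindow` type, the request-count based SLI, and the "error budget" calculation (a common SRE term that the slide does not use) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one SLO measurement window (hypothetical type)."""
    total_requests: int
    good_requests: int  # requests that met the SLI criterion, e.g. succeeded


def sli(window: SloWindow) -> float:
    """Availability-style SLI: fraction of good requests."""
    if window.total_requests == 0:
        return 1.0  # no traffic, so nothing was violated
    return window.good_requests / window.total_requests


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Fraction of the allowed failures not yet used (negative when overspent)."""
    allowed_bad = (1.0 - slo_target) * window.total_requests
    actual_bad = window.total_requests - window.good_requests
    if allowed_bad == 0:
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


def may_change_system(window: SloWindow, slo_target: float) -> bool:
    """Changes stay allowed while the measured SLI has not fallen below the SLO."""
    return sli(window) >= slo_target


window = SloWindow(total_requests=1_000_000, good_requests=999_400)
print(sli(window))
print(error_budget_remaining(window, slo_target=0.999))
print(may_change_system(window, slo_target=0.999))
```

With a 99.9% target, 600 failed requests out of one million leave part of the budget unspent, so changes may continue; a target of 100% would leave no budget at all, which is one way to read "SRE does not aim for 100% reliability".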
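  5. Applying AI on the premise of SRE

    1. Tolerating failure => inject faults intentionally: Chaos Engineering
    • Definition of Chaos: "the facilitation of experiments to uncover systemic weaknesses"
    • Define a "steady state" and use a hypothesis-testing framework
    • Promotes awareness of the system's weaknesses

    2. Automating recovery from failures with AI, based on SLIs and SLOs
    • This is what I have been working on recently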
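  6. Further outlook: Chaos Engineering x AI

    • Machine learning learns from data.
    • However, data from actual anomalies is hard to obtain in practice.
    • Chaos Engineering can trigger anomalies intentionally and thereby generate data usable for training.
    • In control theory, there is a phase in which the system is run experimentally to determine the parameters of a dynamic system model (system identification).
    • The idea is to set up a control model for a cloud system and perform system identification with Chaos Engineering.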
  7. Metric-based approach

    Correlations can be found from the time series graphs of each service, but the location of the root cause remains unknown. Approaches applying "statistical causal discovery" have been proposed in the last few years.

    [Figure: (1) construction of a causal propagation graph among series, drawn over the response times of Services A to F; (2) inference of the causal path, with candidates ranked Top-1 and Top-2]

    • Ma, M., et al., AutoMAP: Diagnose Your Microservice-based Web Applications Automatically, WWW2020.
    • Qiu, J., et al., A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications, Applied Sciences, 2020.
    • Lin, J., et al., Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-Service Environments, ICSOC2018.
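  8. Proposal: dimensionality reduction of metric series for performance anomalies

    Problem: as the number of series (= number of dimensions) increases, the causal propagation graph becomes huge.

    Goal: a foundation for automatically inferring the propagation path of anomalies in microservices.

    Proposal: "TSifter" (Time series Sifter), a dimensionality reduction method that reacts to the detection of an anomaly and quickly extracts, "temporarily", the series useful for diagnosis from all series.

    Three requirements:
    • (1) Accuracy: series useful for diagnosis are not removed
    • (2) Dimensionality reduction rate: remove as many useless series as possible
    • (3) Speed: find the cause quickly (ideally in about 1 minute)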
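  9. Overall picture of the root-cause diagnosis system we ultimately want to build

    [Figure: pipeline with stages numbered (1) to (5): collection of series into a time series database, anomaly detection (continuing on YES), retrieval of all series, dimensionality reduction of the series (per service; this is the scope of the proposed method), and root-cause diagnosis (causal graph construction). The output is the causal path between series, e.g. Service A/req_errors, Service D/connections, Service E/]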
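  10. Requirements of the proposed method TSifter and how they are addressed

    Requirements: (1) accuracy, (2) dimensionality reduction rate, (3) speed.

    Insight 1: series whose trend does not change before and after the anomaly are unnecessary for diagnosis.
    → Exclude series that are stationary.

    Insight 2: groups of series whose time series graphs have similar shapes are redundant for diagnosing the anomaly.
    → Cluster the time series per service.

    For n series, clustering costs O(kn), O(n^2), ...
    → Run the O(n) exclusion step of Insight 1 first.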
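  11. TSifter: a two-step dimensionality reduction method

    Input: raw series over a fixed-length window of the n minutes before the anomaly occurred (the window includes the anomaly period).

    Step 1: remove series that are stationary (raw series → non-stationary series).

    Step 2: cluster series that take similar shapes, then select a representative series from each cluster (non-stationary series → clustered series → representative series).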
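Step 1 can be illustrated with a small Python sketch. The slides only say that series with stationarity are removed; they do not show how stationarity is tested. The check below is therefore a deliberately crude stand-in (a mean-shift comparison between the two halves of the window), not a proper statistical stationarity test, and the function names, threshold, and metric names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def looks_stationary(series: List[float], threshold: float = 3.0) -> bool:
    """Crude stand-in for a stationarity test (NOT a unit-root test).

    Splits the window in half and treats the series as stationary when the
    mean of the second half stays within `threshold` standard deviations of
    the first half.
    """
    if len(series) < 4:
        return True  # too short to judge; treat as uninformative
    half = len(series) // 2
    head, tail = series[:half], series[half:]
    spread = pstdev(head)
    shift = abs(mean(tail) - mean(head))
    if spread == 0:
        return shift == 0  # constant first half: any shift counts as a change
    return shift <= threshold * spread


def drop_stationary(metrics: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Step 1: keep only the series whose behaviour changed inside the window."""
    return {name: s for name, s in metrics.items() if not looks_stationary(s)}


metrics = {
    # level shift in the second half of the window
    "carts/cpu": [10, 11, 10, 11, 10, 11, 40, 42, 41, 43, 40, 42],
    # same pattern before and after
    "user/memory": [50, 51, 50, 51, 50, 51, 50, 51, 50, 51, 50, 51],
}
print(sorted(drop_stationary(metrics)))
```

Each series is examined once and independently of the others, which is why this step costs O(n) in the number of series and is worth running before the more expensive clustering.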
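  12. Step 2: focusing on shape similarity between series

    Input: the group of series within a service.

    • Shape-based distance (SBD) is adopted as the distance measure for shape similarity. Series are regarded as similar even if they are shifted along the time axis or scaled along the vertical axis.
    • For speed, hierarchical clustering is adopted, which can determine the number of clusters in a single run. It is O(n^2), but this is not a problem because the number of series per service is small.
    • Selecting the representative series of a cluster: the series whose sum of distances to the other series is smallest.

    Output: the group of representative series of the service.

    Paparrizos, J. and Gravano, L., k-Shape: Efficient and Accurate Clustering of Time Series, SIGMOD 2015.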
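The distance measure and the representative selection from Step 2 can be sketched as follows. SBD is computed here as one minus the maximum normalized cross-correlation over all shifts, following the definition in the k-Shape paper cited on the slide, with z-normalization applied first so that vertical scaling does not matter. This is a plain O(m^2) loop for readability, the hierarchical clustering itself is omitted, and the metric names are hypothetical.

```python
import math
from typing import Dict, List


def z_normalize(x: List[float]) -> List[float]:
    """Rescale a series to zero mean and unit variance."""
    m = sum(x) / len(x)
    sd = math.sqrt(sum((v - m) ** 2 for v in x) / len(x))
    if sd == 0:
        return [0.0] * len(x)
    return [(v - m) / sd for v in x]


def sbd(x: List[float], y: List[float]) -> float:
    """Shape-based distance between two equal-length series (range 0 to 2)."""
    x, y = z_normalize(x), z_normalize(y)
    norm = math.sqrt(sum(v * v for v in x)) * math.sqrt(sum(v * v for v in y))
    if norm == 0:
        return 1.0  # a constant series has no shape to compare
    n = len(x)
    best = -1.0
    for shift in range(-(n - 1), n):
        # cross-correlation of x with y slid by `shift` positions
        lo, hi = max(0, shift), min(n, n + shift)
        cc = sum(x[i] * y[i - shift] for i in range(lo, hi))
        best = max(best, cc / norm)
    return 1.0 - best


def representative(cluster: Dict[str, List[float]]) -> str:
    """Pick the member whose summed SBD to the other members is smallest."""
    return min(
        cluster,
        key=lambda a: sum(sbd(cluster[a], cluster[b]) for b in cluster if b != a),
    )


base = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0]
cluster = {
    "orders/latency": base,
    "orders/queue_len": [2 * v + 5 for v in base],  # same shape, different scale
    "orders/idle": [-v for v in base],              # mirrored shape
}
print(sbd(cluster["orders/latency"], cluster["orders/queue_len"]))
print(representative(cluster))
```

The two series with the same shape are at distance close to zero from each other, so one of them is chosen as the representative and the mirrored series is not.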
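  13. Experimental environment

    • Control server: Locust (generation of external load); CPU load injection (stress-ng); network delay injection (tc)
    • Microservice cluster on Kubernetes: Sock Shop (Front-End, Catalogue, Orders, Payment, Shipping, User, Carts)
    • Analysis server: Prometheus (collection and storage of series), series retrieval module, analysis module
    • Hardware: Intel Xeon 3.10GHz, 8 cores, 32GB
    • Series collection interval: 5 seconds
    • Series window width: 30 minutes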
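  14. Baseline method: Sieve

    A dimensionality reduction method for time series data.

    • Step 1: remove metrics with small variance
    • Step 2: clustering with k-Shape

    Sieve aims to extract the characteristics of a system that are permanently usable, so its goal differs from that of this research, but it may be applicable to other goals as well.

    • Thalheim, J., et al., Sieve: Actionable Insights from Monitored Metrics in Distributed Systems, Middleware 2017.
    • Paparrizos, J. and Gravano, L., k-Shape: Efficient and Accurate Clustering of Time Series, SIGMOD 2015.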
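For comparison with TSifter's stationarity-based filter, Step 1 of the baseline can be sketched as a plain variance filter. This is an illustration of the idea only, not Sieve's actual code: the function name, the `min_variance` threshold value, and the metric names are assumptions.

```python
from statistics import pvariance
from typing import Dict, List


def drop_low_variance(
    metrics: Dict[str, List[float]], min_variance: float = 0.002
) -> Dict[str, List[float]]:
    """Keep only metrics whose population variance exceeds the threshold.

    `min_variance` is a hypothetical default chosen for this example.
    """
    return {
        name: series
        for name, series in metrics.items()
        if pvariance(series) > min_variance
    }


metrics = {
    "payment/threads": [4.0, 4.0, 4.0, 4.0],       # never changes
    "payment/req_rate": [120.0, 180.0, 95.0, 240.0],
}
print(sorted(drop_low_variance(metrics)))
```

A variance filter keeps any series that fluctuates, including one that fluctuates in the same way before and after an anomaly, which is where it differs from a filter based on stationarity.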
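  15. Evaluation of (3) speed: scalability

    • Both methods scale linearly with the number of CPU cores or the number of series.

    [Figure: four bar charts of execution time, with panels labelled TSifter and Baseline. The values printed in the charts are listed below in the order they appear.]

    Execution time (sec) vs. number of metrics:

    | Number of metrics | 20000 | 40000 | 60000 | 80000 | 100000 |
    |---|---|---|---|---|---|
    | Clustering | 1.21 | 2.43 | 3.81 | 5.72 | 8.68 |
    | Filtering | 10.24 | 20.28 | 31.05 | 42.14 | 54.41 |
    | Total | 11.45 | 22.71 | 34.86 | 47.86 | 63.09 |

    Execution time (sec) vs. number of metrics:

    | Number of metrics | 20000 | 40000 | 60000 | 80000 | 100000 |
    |---|---|---|---|---|---|
    | Clustering | 3908.10 | 7773.00 | 11710.26 | 15670.81 | 19590.83 |
    | Filtering | 2.88 | 7.63 | 13.54 | 22.91 | 32.33 |
    | Total | 3910.98 | 7780.63 | 11723.80 | 15693.72 | 19623.16 |

    Execution time (sec) vs. number of CPU cores:

    | Number of CPU cores | 1 | 2 | 3 | 4 |
    |---|---|---|---|---|
    | Clustering | 1224.87 | 613.31 | 416.55 | 317.65 |
    | Filtering | 0.17 | 0.17 | 0.17 | 0.17 |
    | Total | 1225.04 | 613.48 | 416.72 | 317.82 |

    Execution time (sec) vs. number of CPU cores:

    | Number of CPU cores | 1 | 2 | 3 | 4 |
    |---|---|---|---|---|
    | Clustering | 0.37 | 0.21 | 0.20 | 0.15 |
    | Filtering | 3.57 | 1.81 | 1.26 | 0.99 |
    | Total | 3.93 | 2.02 | 1.46 | 1.14 |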
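The linear-scaling claim can be checked against the totals printed in the charts with a few lines of arithmetic. The numbers below are copied from the slide. The variable names are made up for this sketch and refer only to the order in which the panels appear, since the mapping of panels to methods is not stated here.

```python
from typing import List


def growth_factor(times: List[float]) -> float:
    """How many times longer the last run took than the first."""
    return times[-1] / times[0]


def speedup(times: List[float]) -> float:
    """How many times faster the last run was than the first."""
    return times[0] / times[-1]


# Total execution time (sec) for 20000 ... 100000 metrics (5x more series)
totals_by_metrics_first_panel = [11.45, 22.71, 34.86, 47.86, 63.09]
totals_by_metrics_second_panel = [3910.98, 7780.63, 11723.80, 15693.72, 19623.16]

# Total execution time (sec) for 1 ... 4 CPU cores (4x more cores)
totals_by_cores_third_panel = [1225.04, 613.48, 416.72, 317.82]
totals_by_cores_fourth_panel = [3.93, 2.02, 1.46, 1.14]

for label, factor in [
    ("5x metrics, first panel", growth_factor(totals_by_metrics_first_panel)),
    ("5x metrics, second panel", growth_factor(totals_by_metrics_second_panel)),
    ("4x cores, third panel", speedup(totals_by_cores_third_panel)),
    ("4x cores, fourth panel", speedup(totals_by_cores_fourth_panel)),
]:
    print(f"{label}: {factor:.2f}x")
```

Growth factors close to 5 for five times as many series, and speedups approaching 4 on four cores, are what roughly linear scaling looks like in these measurements.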
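  16. Summary

    • For increasingly complex systems, starting from the lessons of the Ironies of Automation, reducing cognitive load becomes important.
    • Tolerating errors => SRE => Chaos Engineering
    • Automating cognition => AI => using AI to handle everything from the occurrence of an anomaly to recovery
    • As a first effort, this talk introduced our latest research result, a dimensionality reduction method for time series data.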