A Survey for Cases of Applying Machine Learning to SRE

MACHINE LEARNING Meetup KANSAI #4 LT
https://mlm-kansai.connpass.com/event/119084/

Yuuki Tsubouchi (yuuk1)

March 27, 2019

Transcript

  1. A Survey on Applying Machine Learning to SRE
    SAKURA Internet Inc. (C) Copyright 1996-2019 SAKURA Internet Inc, SAKURA Internet Research Center
    2019/03/27, Researcher: Yuuki Tsubouchi
    Machine Learning Meetup KANSAI #4 LT
    @yuuk1t / id:y_uuki
  2. 1. Site Reliability Engineering (SRE)

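  3. What is Site Reliability Engineering?
    ・Reliability: the degree to which users can comfortably use a service
    ・An engineering discipline that aims to control the reliability of computer systems
    ・Rebuilds traditional system administration through software engineering
    ・Monitoring, incident response, change management, capacity planning, provisioning, efficiency and performance…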
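  4. Motivation for applying machine learning
    ・We want to automate manual work with software
    ・However, automating the advanced judgment calls made by skilled operators is hard
    ・So we turn to intelligent techniques such as machine learning and feedback control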

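  5. What are advanced judgment calls?
    ・Deciding when and by how much to scale compute resources up or down as service load changes
    ・Deciding how many resources to allocate to a VM
    ・Judging anomalies after checking metrics such as CPU utilization
    ・Judging anomalies based on knowledge of the dependencies between distributed nodes
    ・Deciding which middleware configuration values are good
    ・etc.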

  6. What we want to do intelligently
    ・Automatic control that guarantees reliability
    ・Efficient use of compute resources

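  7. Where to start
    ・Before starting to study machine learning, I want a concrete picture of how it is applied
    ・Few intelligent mechanisms have been adopted in SRE practice so far
    ・One example is the per-role anomaly detection feature of the server monitoring service Mackerel
    ・So I surveyed research papers to learn which techniques are being used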

  8. 2. Applying machine learning to SRE

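  9. Classification of representative applications
    ・Monitoring
    ・Autoscaling
    ・Dependencies in distributed systems
    ・Cloud resource control
    ・Automatic tuning of middleware configuration
    Diagram label: Today's scope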
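  10. Server autoscaling
    ・[1]: PerfEnforce: a dynamic scaling engine for analytics with performance guarantees
    ・An engine in which a controller scales the number of DB servers during a query session on an OLAP system such as Redshift, so that query time does not exceed the target
    ・Compares online learning (perceptron) as a predictive method against reinforcement learning (Q-learning) or feedback control (PI) as reactive methods
    ・In the evaluation, the perceptron gave good results
    ・Reactive methods are, as expected, slow to respond to sudden changes
    [1]: Figure 1. PerfEnforce deployment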
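
To make the reactive, feedback-control (PI) approach on this slide more concrete, here is a minimal sketch of a proportional-integral controller that picks a DB server count from the gap between observed and target query time. It is an illustration under assumptions, not the controller from [1]: the class name `PIScaler`, the gains `kp` and `ki`, and the server bounds are all hypothetical.

```python
class PIScaler:
    """Toy PI controller (hypothetical, not from the PerfEnforce paper).

    Maps the relative query-time error to a cluster size:
    size = base + kp * error + ki * (sum of past errors).
    """

    def __init__(self, base_servers, kp=1.0, ki=1.0,
                 min_servers=1, max_servers=16):
        self.base_servers = base_servers
        self.kp = kp                    # proportional gain (assumed value)
        self.ki = ki                    # integral gain (assumed value)
        self.min_servers = min_servers
        self.max_servers = max_servers
        self.integral = 0.0             # accumulated error over past queries

    def update(self, observed_sec, target_sec):
        # Positive error means the query ran slower than the target.
        error = (observed_sec - target_sec) / target_sec
        self.integral += error
        raw = self.base_servers + self.kp * error + self.ki * self.integral
        # Clamp to the allowed cluster size range.
        return max(self.min_servers, min(self.max_servers, round(raw)))
```

Because the controller only reacts after a slow query has been observed, it lags behind sudden load changes, which matches the weakness of reactive methods noted above. A real controller would also need anti-windup handling for the integral term, which this sketch omits.
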
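  11. Cloud resource control
    ・[2]: Self-Adaptive and Self-Configured CPU Resource Provisioning for Virtualized Servers Using Kalman Filters
    ・Adaptively allocates CPU resources to VMs using Kalman filters
    ・Fixing the number of CPUs statically leaves resources either insufficient or left over
    ・The controller tracks the VM's CPU utilization and, when it reaches a threshold, changes the number of CPUs according to the Kalman filter
    [2]: Figure 1. Virtualized prototype and control system.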
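
The idea of tracking CPU usage with a Kalman filter and sizing the allocation from the estimate can be sketched as follows. This is a minimal scalar filter with a random-walk model, written only for illustration and not the controller design from [2]; the names `ScalarKalman` and `cpus_needed`, the noise variances, and the `headroom` parameter are assumptions.

```python
import math


class ScalarKalman:
    """One-dimensional Kalman filter assuming usage follows a random walk."""

    def __init__(self, initial_estimate, initial_variance=1.0,
                 process_var=0.01, measurement_var=0.1):
        self.x = initial_estimate   # estimated CPU cores in use
        self.p = initial_variance   # uncertainty of the estimate
        self.q = process_var        # how fast true usage may drift (assumed)
        self.r = measurement_var    # noise of each usage sample (assumed)

    def update(self, measurement):
        # Predict: the estimate carries over, its uncertainty grows.
        self.p += self.q
        # Correct: blend prediction and measurement using the Kalman gain.
        gain = self.p / (self.p + self.r)
        self.x += gain * (measurement - self.x)
        self.p *= (1.0 - gain)
        return self.x


def cpus_needed(estimated_cores_used, headroom=0.8):
    """Smallest vCPU count that keeps estimated usage at or below `headroom`."""
    return max(1, math.ceil(estimated_cores_used / headroom))
```

A controller loop would feed each usage sample to `update` and call `cpus_needed` on the returned estimate, so a noisy spike moves the allocation less than a sustained shift does.
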
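  12. Automatic tuning of middleware configuration
    ・[3]: Automatic Database Management System Tuning Through Large-Scale Machine Learning.
    ・Automatically tunes MySQL/Postgres configuration, reaching performance close to expert-tuned settings
    ・Applies factor analysis to the metrics, then K-Means clustering, to extract the important ones
    ・Uses Lasso for feature selection, picking the configuration knobs most strongly correlated with overall system performance
    ・The tuner changes settings, measures the actual result, and decides on good values
    [3]: Figure 4.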
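
The Lasso feature-selection step can be illustrated with a small pure-Python sketch. It minimizes (1/2n)·||y − Xw||² + alpha·||w||₁ by coordinate descent, so knobs with little influence on the performance metric end up with a weight of exactly zero. This is not the implementation from [3]; the function names, the `alpha` value, and the assumption that every knob column is non-constant (non-zero) are illustrative.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; small values become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(rows, targets, alpha, n_iter=100):
    """Fit Lasso weights; `rows` are knob settings, `targets` the metric.

    Assumes each knob column contains at least one non-zero value.
    """
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    for _ in range(n_iter):
        for j in range(n_features):
            rho = 0.0    # correlation of knob j with the remaining residual
            norm = 0.0   # squared norm of knob j's column
            for row, y in zip(rows, targets):
                # Prediction using every knob except j.
                partial = sum(weights[k] * row[k]
                              for k in range(n_features) if k != j)
                rho += row[j] * (y - partial)
                norm += row[j] ** 2
            weights[j] = soft_threshold(rho / n, alpha) / (norm / n)
    return weights
```

The knobs whose weight stays non-zero are the selected features; raising `alpha` prunes more knobs, which is the trade-off a tuner would have to choose.
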
  13. Survey papers
    ・[4]: A Control Theoretical View of Cloud Elasticity: Taxonomy, Survey and Challenges (2018)
      ・A survey of research applying control-theoretic methods to cloud elasticity
      ・Centers on feedback control and fuzzy control rather than machine learning
    ・[5]: Adaptation in Cloud Resource Configuration: A Survey (2016)
      ・A survey of research applying adaptive methods to cloud resource configuration
      ・Classifies approaches into heuristics, control theory, machine learning, and queueing theory
    ・[6]: Resource Management in Clouds: Survey and Research Challenges (2015)
    ・[7]: What Does Control Theory Bring to Systems Research? (2009)
  14. 3. Summary

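  15. Summary
    ・SRE is an engineering discipline that controls the reliability of computer systems
    ・As the motivation for applying intelligent techniques, including machine learning, to SRE, I cited automating the manual work of skilled operators
    ・As existing techniques, I introduced server autoscaling, cloud resource control, and automatic tuning of middleware configuration
    ・These work under ideal conditions; the open challenge is how to apply them to production environments where many parameters, disturbances, and queues are intricately intertwined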

  16. https://www.slideshare.net/syou6162/mackerel-108429592

  17. https://speakerdeck.com/rrreeeyyy/a-survey-of-anomaly-detection-methodologies-for-web-system

  18. https://speakerdeck.com/tsurubee/ji-jie-xue-xi-tesahafalsefu-he-zhuang-tai-woba-wo-sitai

  19. SRE as an application domain for machine learning and control engineering

  20. References

  21. References
    ・[1]: ORTIZ, Jennifer, et al. PerfEnforce: a dynamic scaling engine for analytics with performance guarantees. arXiv preprint arXiv:1605.09753, 2016.
    ・[2]: KALYVIANAKI, Evangelia; CHARALAMBOUS, Themistoklis; HAND, Steven. Self-adaptive and self-configured CPU resource provisioning for virtualized servers using Kalman filters. In: Proceedings of the 6th international conference on Autonomic computing. ACM, 2009. p. 117-126.
    ・[3]: VAN AKEN, Dana, et al. Automatic database management system tuning through large-scale machine learning. In: Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 2017. p. 1009-1024.
  22. References
    ・[4]: ULLAH, Amjad, et al. A control theoretical view of cloud elasticity: taxonomy, survey and challenges. Cluster Computing, 2018, 21.4: 1735-1764.
    ・[5]: HUMMAIDA, Abdul R.; PATON, Norman W.; SAKELLARIOU, Rizos. Adaptation in cloud resource configuration: a survey. Journal of Cloud Computing, 2016, 5.1: 7.
    ・[6]: JENNINGS, Brendan; STADLER, Rolf. Resource management in clouds: Survey and research challenges. Journal of Network and Systems Management, 2015, 23.3: 567-619.
    ・[7]: ZHU, Xiaoyun, et al. What does control theory bring to systems research?. ACM SIGOPS Operating Systems Review, 2009, 43.1: 62-69.