SREへの機械学習適用に関するサーベイ / A Survey for Cases of Applying Machine Learning to SRE

MACHINE LEARNING Meetup KANSAI #4 LT
https://mlm-kansai.connpass.com/event/119084/

Yuuki Tsubouchi (yuuk1)

March 27, 2019

Transcript

  1. SAKURA Internet Inc.
    (C) Copyright 1996-2019 SAKURA Internet Inc
    SAKURA Internet Research Center
    A Survey for Cases of Applying Machine Learning to SRE
    2019/03/27 Researcher Yuuki Tsubouchi
    Machine Learning Meetup KANSAI #4 LT
    @yuuk1t / id:y_uuki

  2. 1. Site Reliability Engineering (SRE)

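  3. What is Site Reliability Engineering?
    ・Reliability: the degree to which users can comfortably use a service
    ・An engineering discipline that aims to control the reliability of computer systems
    ・Rebuilds traditional system administration through software engineering
    ・Monitoring, incident response, change management, capacity planning, provisioning, efficiency and performance, ...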

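  4. Motivation for applying machine learning
    ・We want to automate manual work with software
    ・However, it is hard to automate the advanced judgments that skilled practitioners make
    ・So we turn to intelligent techniques such as machine learning and feedback control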

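  5. What are advanced judgments?
    ・Deciding when, and by how much, to grow or shrink computing resources as service load rises and falls
    ・Deciding how many resources to allocate to a VM
    ・Judging whether something is abnormal after checking metrics such as CPU utilization
    ・Judging whether something is abnormal based on knowledge of the dependencies between distributed nodes
    ・Deciding which middleware configuration values are good
    ・etc.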

  6. What we want to do intelligently
    ・Automatic control that guarantees reliability
    ・Efficient use of computing resources

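  7. Where to start?
    ・Before starting to study machine learning, I want a picture of how it is applied
    ・There are not many cases of intelligent mechanisms being introduced in SRE practice
    ・One example is the anomaly detection within roles offered by the server monitoring service Mackerel
    ・So I surveyed research papers to learn which techniques are being used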

  8. 2. Applying Machine Learning to SRE

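  9. Classification of representative applications
    ・Monitoring
    ・Autoscaling
    ・Dependencies in distributed systems
    ・Cloud resource control
    ・Automatic tuning of middleware settings
    (Diagram label: "Today's scope")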

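  10. Server autoscaling
    ・[1]: PerfEnforce: a dynamic scaling engine for analytics with performance guarantees
    ・An engine whose controller scales the number of DB servers during a query session on an OLAP system such as Redshift, so that query times do not exceed their targets
    ・Compares online learning (perceptron) as a predictive method with reinforcement learning (Q-learning) and feedback control (PI) as reactive methods
    [1]: Figure 1. PerfEnforce deployment
    ・In the evaluation, the perceptron performed well
    ・As expected, the reactive methods are slow to respond to sudden changes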
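
The reactive PI option mentioned on slide 10 can be sketched roughly as follows. This is a minimal illustration, not PerfEnforce's implementation: the class name `PIScaler`, the gains, the node limits, and the error definition (observed query time divided by the target, minus one) are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class PIScaler:
    """Positional-form PI controller that proposes a cluster size (hypothetical)."""

    base_nodes: int        # cluster size when the error is zero (assumed)
    kp: float              # proportional gain (assumed value)
    ki: float              # integral gain (assumed value)
    min_nodes: int = 1
    max_nodes: int = 16
    integral: float = 0.0  # running sum of past errors

    def next_size(self, observed_s: float, target_s: float) -> int:
        # Positive error: the query ran slower than its target.
        error = observed_s / target_s - 1.0
        self.integral += error
        control = self.base_nodes + self.kp * error + self.ki * self.integral
        # Round up so the proposal does not under-provision, then clamp.
        return max(self.min_nodes, min(self.max_nodes, math.ceil(control)))


# Three consecutive queries against a 10-second target.
scaler = PIScaler(base_nodes=4, kp=2.0, ki=1.0)
sizes = [scaler.next_size(observed, target_s=10.0) for observed in (20.0, 10.0, 5.0)]
print(sizes)  # [7, 5, 4]
```

The controller only moves after a slow query has already been observed, which is one way to see why the slide describes reactive methods as slow to follow sudden changes. A production controller would probably also need anti-windup for the integral term, which this sketch omits.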

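  11. Cloud resource control
    ・[2]: Self-Adaptive and Self-Configured CPU Resource Provisioning for Virtualized Servers Using Kalman Filters
    ・Adaptively allocates CPU resources to VMs using Kalman filters
    [2]: Figure 1. Virtualized prototype and control system.
    ・Problem: when the number of CPUs is decided statically, resources end up either insufficient or left over
    ・The controller tracks each VM's CPU utilization and, when it reaches a threshold, changes the number of CPUs according to the Kalman filter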
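
To make the Kalman filter idea on slide 11 more concrete, here is a minimal sketch of a scalar filter that tracks noisy CPU demand and derives an allocation from the estimate. It is not the controller from [2]: the random-walk demand model, the noise variances, the 20% headroom, and the names `ScalarKalman` and `cores_to_allocate` are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    """One-dimensional Kalman filter with a random-walk state model (assumed)."""

    estimate: float             # current estimate of CPU demand, in cores
    variance: float = 1.0       # uncertainty of the estimate
    process_var: float = 0.01   # how fast true demand may drift (assumed)
    measure_var: float = 0.25   # noise of each usage sample (assumed)

    def update(self, measured: float) -> float:
        # Predict: the estimate carries over, only its uncertainty grows.
        self.variance += self.process_var
        # Correct: blend prediction and measurement using the Kalman gain.
        gain = self.variance / (self.variance + self.measure_var)
        self.estimate += gain * (measured - self.estimate)
        self.variance *= 1.0 - gain
        return self.estimate


def cores_to_allocate(demand: float, headroom: float = 0.2, max_cores: int = 8) -> int:
    """Turn estimated demand into a core count with some headroom (assumed policy)."""
    return max(1, min(max_cores, math.ceil(demand * (1.0 + headroom))))


# Demand jumps from about 1 core to about 3 cores; the filter follows it.
kf = ScalarKalman(estimate=1.0)
first = kf.update(3.0)
demand = first
for _ in range(49):
    demand = kf.update(3.0)
print(cores_to_allocate(demand))
```

The sketch recomputes the allocation on every sample. The threshold check described on the slide, which triggers a change only once utilization crosses a limit, is left out to keep the example short.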

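  12. Automatic tuning of middleware settings
    ・[3]: Automatic Database Management System Tuning Through Large-Scale Machine Learning.
    ・Automatically tunes MySQL/Postgres settings, reaching performance close to that of an expert's configuration
    ・Applies factor analysis to the metrics, then k-means clustering, to extract the important ones
    ・Uses Lasso for feature selection, picking the settings most strongly correlated with overall system performance
    ・The tuner changes settings, measures the actual results, and decides on good values
    [3]: Figure 4.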
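
The Lasso-based feature selection step on slide 12 can be illustrated with a small coordinate-descent sketch. This is a simplification under stated assumptions, not the procedure from [3]: the knob names, the synthetic performance score, the single fixed penalty `alpha`, and the helper names are all hypothetical.

```python
from itertools import product


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value toward zero by threshold; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(X, y, alpha, n_iter=100):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 by coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for row, target in zip(X, y):
                partial = sum(row[k] * w[k] for k in range(d) if k != j)
                rho += row[j] * (target - partial)
                norm += row[j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (norm / n) if norm > 0 else 0.0
    return w


# Hypothetical knobs, standardized to -1/+1, in a full factorial design.
knobs = ["buffer_pool_size", "checkpoint_interval", "log_verbosity"]
X = [list(levels) for levels in product([-1.0, 1.0], repeat=3)]
# Synthetic performance score: only the first two knobs matter.
y = [3.0 * b - 2.0 * c for b, c, _ in X]

weights = lasso(X, y, alpha=0.1)
selected = [name for name, weight in zip(knobs, weights) if weight != 0.0]
print(selected)  # ['buffer_pool_size', 'checkpoint_interval']
```

The L1 penalty drives the weight of the irrelevant knob to exactly zero, so only the knobs that influence the score remain for the tuner to explore.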

  13. Survey papers
    ・[4]: A Control Theoretical View of Cloud Elasticity: Taxonomy, Survey and Challenges (2018)
    ・A survey of research that applies control-theoretic techniques to cloud elasticity
    ・Centers on feedback control and fuzzy control rather than machine learning
    ・[5]: Adaptation in Cloud Resource Configuration: A Survey (2016)
    ・A survey of research that applies adaptive techniques to cloud resource configuration
    ・Classifies the techniques into heuristics, control theory, machine learning, and queueing theory
    ・[6]: Resource Management in Clouds: Survey and Research Challenges (2015)
    ・[7]: What Does Control Theory Bring to Systems Research? (2009)

  14. 3. Summary

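  15. Summary
    ・SRE is an engineering discipline that controls the reliability of computer systems
    ・As the motivation for applying intelligent techniques, including machine learning, to SRE, I cited automating the manual work of skilled practitioners
    ・As existing techniques, I introduced server autoscaling, cloud resource control, and automatic tuning of middleware settings
    ・These work under ideal conditions; the challenge ahead is how to apply them to production environments, where many parameters, disturbances, and queues interact in complex ways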

  16. https://www.slideshare.net/syou6162/mackerel-108429592


  17. https://speakerdeck.com/rrreeeyyy/a-survey-of-anomaly-detection-methodologies-for-web-system


  18. https://speakerdeck.com/tsurubee/ji-jie-xue-xi-tesahafalsefu-he-zhuang-tai-woba-wo-sitai


  19. SRE as an application area for machine learning and control engineering

  20. References

  21. References
    ・[1]: ORTIZ, Jennifer, et al. PerfEnforce: a dynamic scaling engine for analytics with performance guarantees. arXiv preprint arXiv:1605.09753, 2016.
    ・[2]: KALYVIANAKI, Evangelia; CHARALAMBOUS, Themistoklis; HAND, Steven. Self-adaptive and self-configured CPU resource provisioning for virtualized servers using Kalman filters. In: Proceedings of the 6th international conference on Autonomic computing. ACM, 2009. p. 117-126.
    ・[3]: VAN AKEN, Dana, et al. Automatic database management system tuning through large-scale machine learning. In: Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 2017. p. 1009-1024.

  22. References
    ・[4]: ULLAH, Amjad, et al. A control theoretical view of cloud elasticity: taxonomy, survey and challenges. Cluster Computing, 2018, 21.4: 1735-1764.
    ・[5]: HUMMAIDA, Abdul R.; PATON, Norman W.; SAKELLARIOU, Rizos. Adaptation in cloud resource configuration: a survey. Journal of Cloud Computing, 2016, 5.1: 7.
    ・[6]: JENNINGS, Brendan; STADLER, Rolf. Resource management in clouds: Survey and research challenges. Journal of Network and Systems Management, 2015, 23.3: 567-619.
    ・[7]: ZHU, Xiaoyun, et al. What does control theory bring to systems research? ACM SIGOPS Operating Systems Review, 2009, 43.1: 62-69.