Key Areas and Performance Indicators for Site Reliability Engineering: A Proposal / Performance Indicators for SRE

2021/06/04
8th WebSystemArchitecture Research Meeting (online)
https://wsa.connpass.com/event/207143/

Takeshi Kondo

June 04, 2021

Transcript

 1. Key Areas and Performance Indicators for Site Reliability Engineering: A Proposal
  Takeshi Kondo / @chaspy
  2021/06/04
  8th WebSystemArchitecture Research Meeting (online)

 2. Who am I
  chaspy chaspy_
  Lead Software Engineer

  Site Reliability at Quipper
  Takeshi Kondo


 3. Agenda
  1. Background and purpose
  2. Areas that SRE is involved in
  3. Proposed indicators and measurement methods
  4. Measurement results
  5. Summary and future outlook

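 4. Background and purpose
  • The SRE role has become widespread, and more companies are putting it into practice
  • The role differs with the business and with the size and nature of the organization
    → We want to classify the key areas that SRE is involved in
  • Setting business KPIs and improving them iteratively, as in product development, should also work for SRE
    → We want to define and measure performance indicators for each area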

 5. Areas that SRE is involved in
  • Assumes a company that provides web services on the cloud
  • 100+ developers
  • 30M+ accesses per day

 6. [Reference] AWS Well-Architected Framework
  https://aws.amazon.com/architecture/well-architected/

 7. Indicators and measurement methods
  • Reliability
  • Developer Productivity
  • Cost
  • Security
  • Platform

 8. Relationships between areas
  [Diagram: Platform, Reliability, Cost, Developer Productivity, and Security, linked by three relationships labeled "Empowerment" and three labeled "Trade-Off"]

 9. [Reference] Lean と DevOps の科学 (Japanese edition of "Accelerate")
  • Elite companies excel at all of the following:
    • Deployment frequency
    • Lead time for changes
    • MTTR
    • Change failure rate
  https://book.impress.co.jp/books/1118101029

 10. Indicators and measurement methods
  • Reliability
  • Developer Productivity
  • Cost
  • Security
  • Platform

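 11. Indicators and measurement methods
  Area | Indicator | Measurement method
  Reliability | MTTR | Measured manually, as a required field in the incident report flow, whenever an incident occurs
  Developer Productivity | Deployment count | CI service metrics
  Developer Productivity | Deployment time | CI service metrics
  Developer Productivity | CI stability | CI service metrics
  Developer Productivity | Change failure rate | Number of revert commits on the branch that corresponds to production deployments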

 12. Measurement results
  • MTTR
  • Deployment count
  • Deployment time
  • CI stability
  • Change failure rate

 13. MTTR Plot with Trendline


 14. MTTR per half year


 15. Histogram

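 16. Observations on MTTR
  • Measurability problem
    • A mechanism that aggregates the data automatically is needed
    • That requires a standardized Incident Response process and tooling to support it
    • For example: Severity (incident level definitions), and a rule that the occurrence, detection, and recovery times of every incident are always recorded
  • Data volume problem
    • There were only about 50 incidents at most in two years, which is not enough data
    • A large number of incidents runs counter to the purpose of SRE
  • Variance problem
    • When there is not enough data, the value is skewed by a few long-running incidents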
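
  The variance problem on this slide can be illustrated with a small sketch. The record fields, the choice to measure recovery from the occurrence time, and the sample durations are all assumptions made for illustration; they are not data from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median


@dataclass
class Incident:
    """One incident record. All field names are hypothetical."""
    severity: int          # assumed scale, e.g. 1 = most severe
    occurred_at: datetime
    detected_at: datetime
    recovered_at: datetime

    def minutes_to_recover(self) -> float:
        # Assumption: recovery time is counted from occurrence, not detection.
        return (self.recovered_at - self.occurred_at).total_seconds() / 60


def mttr_minutes(incidents: list) -> float:
    """Mean time to recover, in minutes."""
    return mean(i.minutes_to_recover() for i in incidents)


def median_ttr_minutes(incidents: list) -> float:
    """Median time to recover, less sensitive to a single long incident."""
    return median(i.minutes_to_recover() for i in incidents)


def make_incident(start: datetime, minutes: int) -> Incident:
    # Helper for the illustrative data below: detection 5 minutes after start.
    return Incident(2, start, start + timedelta(minutes=5),
                    start + timedelta(minutes=minutes))


start = datetime(2021, 1, 1, 9, 0)
# Illustrative durations only: four short incidents and one 10-hour outage.
sample = [make_incident(start, m) for m in (20, 30, 40, 30, 600)]

print(mttr_minutes(sample))        # mean is pulled up by the 600-minute incident
print(median_ttr_minutes(sample))  # median stays near the typical incident
```

  With so few records, one long incident dominates the mean, which is the skew described above; reporting the median or a histogram alongside the mean is one possible way to make that visible.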

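 17. Observations on MTTR
  • Whether it is useful as an indicator cannot be judged yet
  • It is useful in at least the following respects, so tracking continues
    • Standardizing Incident Response
    • Improving long-running incidents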

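 18. [Reference] Incident Metrics in SRE
  • Argues that MTTR should not be used as an indicator
  • The reasons are the small number of incidents and the large variance
  • It does not say which indicator should be adopted instead
  https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/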

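 19. [Supplement] What is measured in the Developer Productivity area
  • A monorepo is used
    • On the master branch, multiple applications are deployed at the same time
    • They form a distributed monolith that shares a database
    • These are basically released weekly
  • Other microservices are released individually, but they are out of scope for this measurement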

 20. Deployment count

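 21. Observations on deployment count
  • Because releases are basically weekly, there are always at least 26 deployments per half year
  • The rest are hotfixes
    • What are the hotfixes for?
    • The number probably means little unless it is read together with the change failure rate (discussed later)
    • In fact, in the first half of 2020 many hotfixes were for changes to the production Kubernetes manifests
  • The deployment count is trending downward because of the ongoing move to microservices
    • Measurement needs to include the microservices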
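
  The split between scheduled releases and hotfixes described on this slide is simple arithmetic. The sketch below assumes one scheduled release per week (26 per half year, as stated on the slide); the total of 40 deployments is a made-up figure, not a result from the talk.

```python
WEEKS_PER_HALF_YEAR = 26  # one scheduled weekly release per week


def hotfix_deploys(total_deploys: int,
                   scheduled: int = WEEKS_PER_HALF_YEAR) -> int:
    """Deployments beyond the scheduled weekly releases, treated as hotfixes."""
    if total_deploys < scheduled:
        raise ValueError("fewer deployments than scheduled releases")
    return total_deploys - scheduled


# Hypothetical half-year total, for illustration only.
print(hotfix_deploys(40))
```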

 22. Deployment time

 23. Observations on deployment time
  • Not yet analyzed (to be added)
  • This time equals the time needed to revert after a failed change, so shortening it should reduce MTTR

 24. CI stability

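 25. Observations on CI stability
  • The time window is 7 days
    • It would be better to measure over several time windows, such as 30 and 90 days
  • Development branches should be measured in the same way, not only production
  • Not suitable as an indicator
    • Basically, the closer to 100% the better
    • It is better treated as an SLO: fix the root cause when the target is violated
    • It is better to use other indicators that subsume this value
      • Time To Delivery (lead time for changes)
      • MTTR
  • Analyzability still matters, however: when CI is unstable, we need to know which job is unstable and to what degree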
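
  As a rough sketch of the two points above (several time windows, and knowing which job is unstable), CI stability could be computed from a list of job runs as follows. The `CIRun` record, its field names, and the sample runs are assumptions for illustration; a real CI service would expose its own metrics format.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class CIRun(NamedTuple):
    """One CI job execution. Field names are hypothetical."""
    job: str
    finished_at: datetime
    ok: bool


def success_rate(runs, now: datetime, window_days: int) -> Optional[float]:
    """Share of successful runs within the last `window_days` days."""
    cutoff = now - timedelta(days=window_days)
    recent = [r for r in runs if r.finished_at >= cutoff]
    if not recent:
        return None  # no data in this window
    return sum(r.ok for r in recent) / len(recent)


def failure_rate_by_job(runs) -> dict:
    """Failure rate per job name, to see which job is unstable."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in runs:
        totals[r.job] += 1
        if not r.ok:
            failures[r.job] += 1
    return {job: failures[job] / totals[job] for job in totals}


now = datetime(2021, 6, 4)
# Illustrative runs only.
runs = [
    CIRun("build", now - timedelta(days=1), True),
    CIRun("test", now - timedelta(days=2), False),
    CIRun("test", now - timedelta(days=3), True),
    CIRun("deploy", now - timedelta(days=5), True),
    CIRun("test", now - timedelta(days=20), False),
    CIRun("build", now - timedelta(days=60), True),
    CIRun("test", now - timedelta(days=80), True),
]

for days in (7, 30, 90):
    print(days, success_rate(runs, now, days))
print(failure_rate_by_job(runs))
```

  Treating the value as an SLO, as the slide suggests, would then amount to comparing `success_rate` for a chosen window against a target and investigating the jobs with the highest failure rate when it is missed.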

 26. Number of change failures

 27. Change failure rate

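 28. Observations on change failure rate
  • Problem of defining "change failure"
    • A definition of what counts as a "change failure" is needed
    • Measurement is difficult without operational rules such as attaching labels
  • Measurement method problem
    • Reverts on the production branch also happened for reasons other than a "change failure"
      • Argo Rollouts is used
      • The canary strategy is normally not enabled
      • It was enabled only when a canary release was wanted, for example for important features, and then reverted once the release reached 100%
  • Data volume problem
    • The same problem as with MTTR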
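
  The over-counting described on this slide can be sketched as follows. The sketch assumes that revert commits carry git's default subject (`Revert "<original subject>"`) and that an operational rule attaches a label to reverts caused by a failed change; the `change-failure` and `canary-cleanup` label names, the `Commit` record, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    """A commit on the production branch. Fields are hypothetical."""
    subject: str
    labels: frozenset = frozenset()


def is_revert(commit: Commit) -> bool:
    # `git revert` writes this subject prefix by default.
    return commit.subject.startswith('Revert "')


def naive_failure_rate(commits, deploys: int) -> float:
    """Counts every revert commit as a change failure."""
    return sum(is_revert(c) for c in commits) / deploys


def labeled_failure_rate(commits, deploys: int,
                         failure_label: str = "change-failure") -> float:
    """Counts only reverts explicitly labeled as change failures."""
    failed = [c for c in commits
              if is_revert(c) and failure_label in c.labels]
    return len(failed) / deploys


# Illustrative history: one real failure, two reverts of a canary setting.
commits = [
    Commit("Add payment retry"),
    Commit('Revert "Add payment retry"', frozenset({"change-failure"})),
    Commit("Enable canary strategy for search"),
    Commit('Revert "Enable canary strategy for search"',
           frozenset({"canary-cleanup"})),
    Commit('Revert "Enable canary strategy for checkout"',
           frozenset({"canary-cleanup"})),
]
deploys = 30  # hypothetical number of production deployments

print(naive_failure_rate(commits, deploys))
print(labeled_failure_rate(commits, deploys))
```

  Counting raw reverts triples the rate in this made-up history, which is why a definition of "change failure" and a labeling rule are prerequisites for the indicator.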

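 29. Summary and discussion
  • SRE work was classified into five areas; for two of them, candidate indicators taken from "Lean と DevOps の科学" were measured to see whether they can serve as indicators
  • Conditions for a useful indicator
    • There is a sufficient amount of data
      • For MTTR and change failure rate, obtaining enough data is difficult
      • A state in which these events occur frequently is contrary to the purpose of SRE
    • No other indicator subsumes it
      • CI stability can be covered by MTTR and Time To Delivery (lead time for changes)
      • Deployment time can also be covered by MTTR and Time To Delivery (lead time for changes)
  • Increasing the deployment count in a healthy way requires measuring the change failure rate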

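 30. Summary and discussion
  • MTTR 🚀
    • Tracking continues
    • Incident Response needs to be improved so that the data can be collected continuously
  • Deployment count 🚀
    • Measure including microservices
  • Deployment time 🤔
    • Development branches need to be measured
    • In the long term, covered by MTTR / lead time for changes
  • CI stability 🤔
    • Development branches need to be measured
    • In the long term, covered by MTTR / lead time for changes
  • Change failure rate 🚀
    • A definition of change failure and operational rules need to be established
  • Lead time for changes 🤔
    • Ideally measured from the first commit on a development branch until the code change reaches production, but there are many variables and the variance may be large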
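
  A minimal sketch of the lead time measurement mentioned in the last bullet, assuming that each change is represented by two timestamps: the first commit on its development branch and the moment it was deployed to production. The timestamps below are invented, and how they would be collected is not covered by the talk.

```python
from datetime import datetime
from statistics import median, pstdev


def lead_time_hours(first_commit_at: datetime, deployed_at: datetime) -> float:
    """Hours from the first commit of a change to its production deployment."""
    return (deployed_at - first_commit_at).total_seconds() / 3600


# Illustrative (first commit, production deploy) pairs.
changes = [
    (datetime(2021, 5, 3, 10), datetime(2021, 5, 4, 10)),
    (datetime(2021, 5, 3, 10), datetime(2021, 5, 6, 10)),
    (datetime(2021, 5, 3, 10), datetime(2021, 5, 10, 10)),
]
lead_times = [lead_time_hours(start, end) for start, end in changes]

print(median(lead_times))  # typical lead time
print(pstdev(lead_times))  # spread, which the slide expects to be large
```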

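 31. Future outlook
  • Continue what is currently measured and aim to automate it
  • Indicators about "failure" may be hard to put to use, but measurement of them continues
    • MTTR, change failure rate
    • Processes that involve people need definitions, rules, and conventions before they can be measured
  • Propose indicators for the other areas as well
  • Organize design patterns for measurement and visualization
    • Aim for a world in which anyone can measure and visualize anything and carry out continuous improvement autonomously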

 32. Thank you!
  chaspy chaspy_
  Lead Software Engineer

  Site Reliability at Quipper
  Takeshi Kondo
