
Proposal of Key Areas and Performance Indicators in Site Reliability Engineering / Performance Indicators for SRE

2021/06/04
The 8th WebSystemArchitecture Study Group meeting (online)
https://wsa.connpass.com/event/207143/

Takeshi Kondo

June 04, 2021

Transcript

  1. Proposal of Key Areas and Performance Indicators in Site Reliability Engineering
    Takeshi Kondo / @chaspy
    2021/06/04, the 8th WebSystemArchitecture Study Group meeting (online)
  2. Who am I
    Takeshi Kondo (chaspy / chaspy_)
    Lead Software Engineer, Site Reliability at Quipper
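  3. Agenda
    1. Background and purpose
    2. Areas that SRE is involved in
    3. Proposed indicators and measurement methods
    4. Measurement results
    5. Summary and future outlook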
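  4. Background and purpose
    • SRE as a Role has spread widely, and more companies now practice it
    • The role differs depending on the business and on the size and nature of the organization
      → We want to classify the key areas that SRE is involved in
    • As in product development, setting business KPIs and improving against them iteratively should also work for SRE
      → We want to define and measure performance indicators for each area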
  5. Areas that SRE is involved in
    • Assumes a company that provides a Web service on the cloud
    • 100+ Developer
    • 30M+ Access / Day
  6. [Reference] AWS Well-Architected Framework https://aws.amazon.com/architecture/well-architected/

  7. Measured indicators and measurement methods
    • Reliability
    • Developer Productivity
    • Cost
    • Security
    • Platform
  8. Relationships between the areas
    [Diagram: Platform, Reliability, Cost, Developer Productivity, and Security, connected by three links labeled "Empowerment" and three links labeled "Trade-Off"]
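  9. [Reference] "Lean と DevOps の科学" (the Japanese edition of Accelerate)
    • Elite companies excel at all of the following
      • Deployment frequency
      • Lead time for changes
      • MTTR
      • Change failure rate
    https://book.impress.co.jp/books/1118101029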
  10. Measured indicators and measurement methods
    • Reliability
    • Developer Productivity
    • Cost
    • Security
    • Platform
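  11. Measured indicators and measurement methods

    Area                   | Indicator           | Measurement method
    Reliability            | MTTR                | Measured manually, as a required field in the incident report flow, whenever an incident occurs
    Developer Productivity | Deploy count        | Metrics from the CI service
    Developer Productivity | Deploy time         | Metrics from the CI service
    Developer Productivity | CI stability        | Metrics from the CI service
    Developer Productivity | Change failure rate | Number of Revert commits on the branch that corresponds to production deploys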
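
The last row of the table can be sketched roughly as follows. This is only an illustration under stated assumptions: it relies on git's default revert subject (`Revert "..."`), and the denominator (number of production deploys), the function name `change_failure_rate`, and the sample commit subjects are all hypothetical, since the slides only say that Revert commits are counted.

```python
def change_failure_rate(commit_subjects, deploy_count):
    """Return revert commits divided by deploys (an assumed definition)."""
    if deploy_count <= 0:
        raise ValueError("deploy_count must be positive")
    # git's default subject for a revert commit is: Revert "<original subject>"
    reverts = sum(1 for subject in commit_subjects if subject.startswith('Revert "'))
    return reverts / deploy_count


# Hypothetical commit subjects on the production branch.
subjects = [
    "Add lesson search endpoint",
    'Revert "Add lesson search endpoint"',
    "Bump base image",
    "Fix typo in manifest",
]

rate = change_failure_rate(subjects, deploy_count=4)
print(f"change failure rate: {rate:.0%}")
```

Counting by subject alone treats every Revert as a failure, so it would need an extra rule (for example a label) to exclude reverts made for other reasons.
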
  12. Measurement results
    • MTTR
    • Deploy count
    • Deploy time
    • CI stability
    • Change failure rate
  13. MTTR Plot with Trendline

  14. MTTR per half year

  15. Histogram

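  16. Discussion of MTTR
    • Measurability problem
      • A mechanism that aggregates the data automatically is needed
      • That in turn requires a standardized Incident Response process and a Tool that supports it
      • For example, Severity (a definition of Incident Levels) and a rule that the times of occurrence, detection, and recovery are always recorded
    • Data volume problem
      • There were at most about 50 incidents in 2 years, which is not enough data
      • Having many incidents runs counter to the purpose of SRE
    • Variance problem
      • When the data volume is insufficient, the value is skewed by a few long-running incidents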
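
The variance problem above can be illustrated with a small sketch. The incident timestamps below are made up for illustration and are not the measured data; the median is shown only as a point of comparison, not as something the slides propose.

```python
from datetime import datetime
from statistics import mean, median

FMT = "%Y-%m-%d %H:%M"

# Hypothetical incident records: (occurred, recovered).
incidents = [
    ("2021-01-05 10:00", "2021-01-05 10:30"),
    ("2021-02-11 14:00", "2021-02-11 14:45"),
    ("2021-03-02 09:00", "2021-03-02 09:20"),
    ("2021-04-18 22:00", "2021-04-19 06:00"),  # one long-running incident
]


def minutes_to_recover(occurred, recovered):
    """Minutes between occurrence and recovery."""
    delta = datetime.strptime(recovered, FMT) - datetime.strptime(occurred, FMT)
    return delta.total_seconds() / 60


durations = [minutes_to_recover(start, end) for start, end in incidents]
mttr = mean(durations)               # pulled up by the single 8-hour incident
median_minutes = median(durations)   # far less sensitive to that outlier

print(f"MTTR (mean): {mttr} min, median: {median_minutes} min")
```

With only four records, one 8-hour incident moves the mean to more than three times the median, which is the effect the slide describes for small samples.
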
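  17. Discussion of MTTR
    • Whether it is useful as an indicator cannot be judged yet
    • Tracking continues because it is useful in at least the following respects
      • Standardizing Incident Response
      • Improving the handling of long-running incidents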
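  18. [Reference] Incident Metrics in SRE
    • Argues that MTTR should not be used as an indicator
    • The reasons are the small number of incidents and the large variance
    • It does not say which indicator should be adopted instead
    https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/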
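  19. [Supplement] What is measured in the Developer Productivity area
    • A monorepo is used
    • On the master branch, multiple applications are deployed at the same time
    • They form a Distributed monolith that shares a Database
    • These are basically released weekly
    • The other microservices are released individually, but they are out of scope for this measurement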
  20. Deploy count

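  21. Discussion of deploy count
    • Because releases are basically a Weekly Release, there are always 26 deploys per half year
    • The rest are HOTFIXes
      • What are the HOTFIXes for?
      • The number seems to mean little unless it is read together with the change failure rate (discussed later)
      • In fact, in the first half of 2020 many HOTFIXes were for changes to the Production Kubernetes manifests
    • The deploy count is trending down because of the move to Microservices
      • Microservices need to be included in the measurement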
  22. Deploy time

  23. Discussion of deploy time
    • Not yet analyzed (to be added later)
    • This time equals the time a Revert takes when a change fails, so shortening it should directly reduce MTTR

  24. CI stability

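  25. Discussion of CI stability
    • The Time Window is 7 Days
      • It would be better to measure with several Time Windows, such as 30 Days and 90 Days
      • Development branches should be measured as well, not only production
    • Unsuitable as an indicator
      • Basically, the closer to 100% the better
      • A better approach is to treat it as an SLO and make a root-cause fix when the target is violated
      • It is better to use another indicator that already includes this value
        • Time To Delivery (lead time for changes)
        • MTTR
      • Analyzability still matters, though: when CI is unstable, we need to know which Job is unstable and to what degree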
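
Measuring over several Time Windows could look like the sketch below. It assumes CI runs are available as `(date, passed)` pairs; the function name `ci_success_rate`, the sample runs, and the reference date are hypothetical and do not come from the CI service used in the talk.

```python
from datetime import date, timedelta


def ci_success_rate(runs, today, window_days):
    """Fraction of CI runs in the last `window_days` days that passed, or None if there were none."""
    cutoff = today - timedelta(days=window_days)
    recent = [passed for day, passed in runs if cutoff < day <= today]
    if not recent:
        return None
    return sum(recent) / len(recent)


# Hypothetical CI runs on the production branch: (date, passed).
runs = [
    (date(2021, 6, 3), True),
    (date(2021, 6, 1), False),
    (date(2021, 5, 20), True),
    (date(2021, 5, 10), True),
    (date(2021, 3, 15), False),
]

today = date(2021, 6, 4)
rates = {days: ci_success_rate(runs, today, days) for days in (7, 30, 90)}
for days, rate in rates.items():
    print(f"{days:>2} days: {rate:.0%}")
```

Comparing the windows side by side shows whether a dip is a short-lived blip (visible only in the 7-day value) or a longer trend, and each value can be checked against a target if the rate is treated as an SLO.
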
  26. Number of change failures

  27. Change failure rate

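  28. Discussion of change failure rate
    • Problem of defining "change failure"
      • A definition of what counts as a "change failure" is needed
      • Without an operational rule such as attaching a Label, it is hard to measure
    • Problem with the measurement method
      • Reverts on the production branch also happened for reasons other than a "change failure"
      • Argo Rollouts is used
      • Normally the Canary Strategy is not enabled
      • It was enabled only when a Canary was wanted, for example for important features, and then Reverted once the release reached 100%
    • Data volume problem
      • The same problem as with MTTR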
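  29. Summary and discussion
    • SRE work was classified into 5 areas, and for 2 of them candidate indicators were measured, with reference to "Lean と DevOps の科学", to see whether they can serve as indicators
    • Conditions for a useful indicator
      • There is enough data
        • For MTTR and change failure rate, enough data is hard to obtain
        • A state in which these events occur frequently runs counter to the purpose of SRE
      • No other indicator already includes it
        • CI stability can be covered by MTTR and Time To Delivery (lead time for changes)
        • Deploy time can also be covered by MTTR and Time To Delivery (lead time for changes)
    • Increasing the deploy count in a healthy way requires measuring the change failure rate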
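  30. Summary and discussion
    • MTTR 🚀
      • Tracking continues
      • Incident Response needs to be improved so that the data can be collected continuously
    • Deploy count 🚀
      • Measure it including microservices
    • Deploy time 🤔
      • Development branches need to be measured
      • In the long term, cover it with MTTR / lead time for changes
    • CI stability 🤔
      • Development branches need to be measured
      • In the long term, cover it with MTTR / lead time for changes
    • Change failure rate 🚀
      • A definition of change failure and operational rules need to be established
    • Lead time for changes 🤔
      • Ideally it would be measured from the First commit on the Develop branch to the Code change reaching Production, but there are many variables and the variance may be large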
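
As a rough sketch of the lead time measurement described in the last item, the following computes the time from first commit to production deploy and reports the median together with the range, since the slide expects large variance. The timestamps are hypothetical, and how they would be collected (for example from git history and deploy logs) is an assumption.

```python
from datetime import datetime
from statistics import median

# Hypothetical changes: (first commit on the development branch, deployed to production).
changes = [
    (datetime(2021, 5, 3, 10, 0), datetime(2021, 5, 10, 10, 0)),
    (datetime(2021, 5, 7, 15, 0), datetime(2021, 5, 10, 10, 0)),
    (datetime(2021, 5, 1, 9, 0), datetime(2021, 5, 17, 9, 0)),
]

# Lead time per change, in hours, sorted so the spread is easy to read.
lead_hours = sorted(
    (deployed - first_commit).total_seconds() / 3600
    for first_commit, deployed in changes
)

median_hours = median(lead_hours)
spread_hours = lead_hours[-1] - lead_hours[0]

print(f"median: {median_hours} h, range: {spread_hours} h")
```

With weekly releases, a change committed just before the release cut and one committed just after it differ by days, which is one source of the variance the slide anticipates.
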
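  31. Future outlook
    • Continue the current measurements and aim to automate them
    • Indicators about "failure" may be hard to make use of, but measurement will continue
      • MTTR, change failure rate
      • Processes that involve people need definitions, rules, and conventions before they can be measured
    • Propose indicators for the other areas as well
    • Organize design patterns for measurement and visualization
    • Aim for a world in which anyone can measure and visualize anything and improve continuously and autonomously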
  32. Thank you!
    Takeshi Kondo (chaspy / chaspy_)
    Lead Software Engineer, Site Reliability at Quipper