Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Site Reliability Engineering における 重要領域とパフォーマンス指標の提案 / Performance Indicators for SRE

Site Reliability Engineering における 重要領域とパフォーマンス指標の提案 / Performance Indicators for SRE

2021/06/04
第8回WebSystemArchitecture研究会(オンライン)
https://wsa.connpass.com/event/207143/

Takeshi Kondo

June 04, 2021
Tweet

More Decks by Takeshi Kondo

Other Decks in Technology

Transcript

  1. എܠͱ໨త • SRE ͱ͍͏ Role ͕޿͘ීٴ͠ɺ࣮ફ͢Δاۀ͕૿͖͑ͯͨ • Ϗδωεɺ૊৫ͷن໛΍ੑ࣭ʹΑΓͦͷ໾ׂ͸ҟͳΔ →SRE ͕ؔΘΔॏཁͳྖҬΛ෼ྨ͍ͨ͠

    • ϓϩμΫτ։ൃͷΑ͏ʹϏδωεKPIΛઃఆ͠ɺͦΕΛ܁Γฦ ͠վળ͢Δͱ͍͏Ξϓϩʔν͸ SRE ʹ΋༗ޮͳ͸ͣ →ྖҬ͝ͱͷύϑΥʔϚϯεࢦඪΛఆٛɾܭଌ͍ͨ͠
  2. ଌఆࢦඪͱଌఆํ๏ ྖҬ ࢦඪ ଌఆํ๏ 3FMJBCJMJUZ .553 ো֐ൃੜ࣌ʹɺো֐ใࠂϑϩʔ಺ͷ ඞਢ߲໨ͱͯ͠खಈͰܭଌ %FWFMPQFS1SPEVDUJWJUZ σϓϩΠճ਺

    $*αʔϏεͷNFUSJDT %FWFMPQFS1SPEVDUJWJUZ σϓϩΠ࣌ؒ $*αʔϏεͷNFUSJDT %FWFMPQFS1SPEVDUJWJUZ $*҆ఆੑ $*αʔϏεͷNFUSJDT %FWFMPQFS1SPEVDUJWJUZ มߋࣦഊ཰ ຊ൪؀ڥσϓϩΠʹରԠ͢Δϒϥϯ νͷ3FWFSUDPNNJUͷ਺
  3. MTTR ʹର͢Δߟ࡯ • ܭଌՄೳੑͷ໰୊ • ࣗಈͰूܭͰ͖Δ࢓૊Έ͕ඞཁ • ͦͷͨΊʹ͸ Incident Response

    ͷܕԽͱͦΕʹର͢Δ Tool ͕ඞཁ • SeverityʢIncident ͷ Level ఆٛʣ/ ো֐ൃੜɾݕ஌ɾ෮چͷ࣌ؒΛඞͣه࿥͢ΔϧʔϧͳͲ • σʔλྔͷ໰୊ • Πϯγσϯτ਺͸2೥Ͱ͔͕ͨͩ50ఔ౓ɺे෼ͳσʔλྔ͕ಘΒΕͳ͍ • Πϯγσϯτ਺͕ଟ͍͜ͱ͸ SRE ͷ໨తͱ૬൓͢Δ • ͹Β͖ͭͷ໰୊ • σʔλྔ͕े෼Ͱͳ͍ͱɺҰ෦ͷ௕࣌ؒো֐ʹҾ͖ͣΒΕͯ͠·͏
  4. ʲࢀߟʳIncident Metrics in SRE • MTTR ͸ࢦඪʹ͢΂͖Ͱ͸ͳ͍ͱओு • ݅਺ෆ଍ͱ͹Β͖ͭͷେ͖͕͞ཧ༝ •

    Ͱ͸ͲΕΛ࠾༻͢΂͖͔͸ݴٴ͕ͳ͍ https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/
  5. ʲิ଍ʳDeveloper Productivity ྖҬͷܭଌର৅ • monorepo Λ࠾༻ • master branch Ͱ͸ෳ਺ͷΞϓϦ͕ಉ࣌ʹσϓϩΠ͞ΕΔ

    • Database Λڞ༗͢Δ Distributed monolith ͱͳ͍ͬͯΔ • ͜ΕΒ͸جຊతʹि࣍ϦϦʔε͞ΕΔ • ͜ΕҎ֎ͷ microservices ͸ݸผͰϦϦʔε͞ΕΔ͕ɺࠓճ ͸ܭଌର৅֎
  6. σϓϩΠճ਺ʹର͢Δߟ࡯ • جຊతʹ Weekly Release Ͱ͋ΔͨΊɺ൒೥ʹ26ճ͸ඞͣσϓϩΠ ͞ΕΔ • ࢒Γ͸ HOTFIX

    • ԿͷͨΊͷ HOTFIX ͔ʁ • มߋࣦഊ཰ʢޙड़ʣͱ߹ΘͤͯΈͳ͍ͱҙຯ͕ബͦ͏ • ࣮ࡍ2020೥લ൒͸ Production ͷ Kubernetes manifest มߋͷͨΊͷ HOTFIX ͕ଟ͔ͬͨ • σϓϩΠ਺͕ݮগ܏޲ͳͷ͸ Microservices Խ͍ͯ͠Δ͔Β • Microservices ΛؚΊͯܭଌ͢Δඞཁ͕͋Δ
  7. CI ҆ఆੑʹؔ͢Δߟ࡯ • Time Window ͸ 7 Days • 30

    Days, 90 Days ͳͲෳ਺ͷ Time Window Ͱܭଌͨ͠΄͏͕ྑ͍ • ຊ൪͚ͩͰͳ͘ɺ։ൃϒϥϯν΋ಉ༷ʹܭଌ͢Δ΂͖ • ࢦඪͱͯ͠͸ෆద౰ • جຊతʹ 100% ʹ͚ۙΕ͹͍ۙ΄Ͳྑ͍ • SLO ͱͯ͠ଊ͑ͯɺ໨ඪ஋Λҧ൓ͨ͠Βࠜຊमਖ਼͢ΔΞϓϩʔν͕ྑ͍ • ͜ͷ஋ΛؚΉผͷࢦඪΛ༻͍ͨ΄͏͕ྑ͍ • Time To DeliveryʢมߋͷϦʔυλΠϜʣ • MTTR • ͨͩ͠ɺ෼ੳՄೳੑ͸ॏཁɻCI ͕ෆ҆ఆͳͱ͖ɺͲͷ Job ͕Ͳͷఔ౓ෆ҆ఆ͔Λ஌Δඞཁ͸͋Δ
  8. มߋࣦഊ཰ʹؔ͢Δߟ࡯ • "มߋࣦഊ"ͷఆٛͷ໰୊ • ԿΛ΋ͬͯ"มߋࣦഊ"ͱ͢Δ͔ͷఆ͕ٛඞཁ • Label ෇༩ͳͲͷӡ༻ϧʔϧ͕ͳ͍ͱܭଌ͕೉͍͠ • ܭଌํ๏ͷ໰୊

    • ຊ൪ϒϥϯν΁ͷ Revert ͸"มߋࣦഊ"Ҏ֎Ͱ΋ى͖͍ͯͨ • Argo Rollouts Λ࠾༻͍ͯ͠Δ • ௨ৗ͸ Canary Strategy Λ༗ޮʹ͍ͯ͠ͳ͍ • ॏཁػೳͳͲ Canary ͍ͨ͠ͱ͖͚ͩ༗ޮʹ͠ɺ100% ϦϦʔεͨ͠Β Revert ͍ͯͨ͠ • σʔλྔͷ໰୊ • MTTR ಉ༷ͷ໰୊
  9. ·ͱΊͱߟ࡯ • SRE ۀ຿Λ5ͭͷྖҬʹ෼ྨ͠ɺ͏ͪ2ͭͷྖҬ͔ΒɺʮLean ͱ DevOps ͷՊֶʯΛࢀߟʹɺࢦ ඪʹͳΓ͏Δ͔Λܭଌͨ͠ • ༗ޮͳࢦඪͷ৚݅

    • े෼ʹσʔλྔ͕͋Δ͜ͱ • MTTR, มߋࣦഊ཰͸σʔλྔΛಘΔ͜ͱ͕೉͍͠ • ͜ΕΒ͕සൃ͢Δঢ়ଶ͸ SRE ͷ໨తͱ൓͢Δ • ͦͷࢦඪΛؚΉଞͷࢦඪ͕ଘࡏ͠ͳ͍͜ͱ • CI ҆ఆੑ͸ MTTR, Time To DeliveryʢมߋͷϦʔυλΠϜʣͰิ͑Δ • σϓϩΠ࣌ؒ΋ MTTR, Time To DeliveryʢมߋͷϦʔυλΠϜʣͰิ͑Δ • σϓϩΠճ਺Λ݈શʹ૿΍͢ʹ͸มߋࣦഊ཰ͷܭଌ͕ඞཁ
  10. ·ͱΊͱߟ࡯ • MTTR 🚀 • τϥοΩϯά͸ܧଓ • ܧଓతʹऔಘ͢ΔͨΊʹ Incident Response

    ͷվળ͕ඞཁ • σϓϩΠճ਺🚀 • microservices ؚΊͯܭଌ • σϓϩΠ࣌ؒ🤔 • ։ൃϒϥϯνͷܭଌ͕ඞཁ • ௕ظతʹ͸ MTTR / มߋͷϦʔυλΠϜͰิ͏ • CI ҆ఆੑ🤔 • ։ൃϒϥϯνͷܭଌ͕ඞཁ • ௕ظతʹ͸ MTTR / มߋͷϦʔυλΠϜͰิ͏ • มߋࣦഊ཰🚀 • มߋࣦഊͷఆٛͱӡ༻ϧʔϧࡦఆ͕ඞཁ • มߋͷϦʔυλΠϜ🤔 • Develop branch Ͱͷ First commit ͔Β Production ΁ͷ Code มߋ·ͰΛܭଌͰ͖Δͱྑ͍͕ɺม਺͕ଟ͘ɺ͹Β͖͕ͭେ͖͍Մೳੑ͕͋Δ
  11. ࠓޙͷల๬ • ݱঢ়ଌఆ͍ͯ͠Δ΋ͷ͸ܧଓɺࣗಈԽΛ໨ࢦ͢ • "ࣦഊ"ʹؔ͢Δࢦඪ͸׆༻ͮ͠Β͍Մೳੑ͕͋Δ͕ɺܧଓͯ͠ܭ • MTTR, มߋࣦഊ཰ • ਓ͕ؒؔΘΔϓϩηεͰ͸ܭଌͷͨΊʹఆٛɺϧʔϧɺن໿͕ඞཁ

    • ଞͷྖҬʹؔͯ͠΋ࢦඪΛఏҊ͢Δ • ܭଌɾՄࢹԽͷσβΠϯύλʔϯͷ੔ཧΛ͍ͨ͠ • ୭΋͕ԿͰ΋ܭଌͯ͠ՄࢹԽͯࣗ͠཯తʹܧଓతվળ͕Ͱ͖ΔੈքΛ໨ࢦ͢