2021/06/04 8th WebSystemArchitecture Study Group (online) https://wsa.connpass.com/event/207143/
Proposal of Key Areas and Performance Indicators in Site Reliability Engineering
Takeshi Kondo / @chaspy
Who am I
Takeshi Kondo (chaspy / chaspy_)
Lead Software Engineer, Site Reliability at Quipper
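Agenda
1. Background and purpose
2. Areas that SRE is involved in
3. Proposed indicators and measurement methods
4. Measurement results
5. Summary and future outlook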
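Background and purpose
• The SRE role has become widespread, and more companies are practicing it
• The role differs depending on the size and nature of the business and organization
  → We want to classify the key areas that SRE is involved in
• Setting business KPIs and improving them iteratively, as in product development, should also work for SRE
  → We want to define and measure performance indicators for each area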
Areas that SRE is involved in
• Assumes a company that provides a Web service on the cloud
• 100+ Developer
• 30M+ Access / Day
[Reference] AWS Well-Architected Framework
https://aws.amazon.com/architecture/well-architected/
Measured indicators and measurement methods
• Reliability
• Developer Productivity
• Cost
• Security
• Platform
Relationships between areas
[Figure: diagram of Platform, Reliability, Cost, Developer Productivity, and Security, connected by three arrows labeled "Empowerment" and three labeled "Trade-Off"]
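[Reference] Lean と DevOps の科学 (the Japanese edition of "Accelerate")
• Elite companies excel at all of the following
  • Deployment frequency
  • Lead time for changes
  • MTTR
  • Change failure rate
https://book.impress.co.jp/books/1118101029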
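Measured indicators and measurement methods
Area | Indicator | Measurement method
Reliability | MTTR | Measured manually, as a required item in the incident report flow when an incident occurs
Developer Productivity | Deployment count | Metrics from the CI service
Developer Productivity | Deployment time | Metrics from the CI service
Developer Productivity | CI stability | Metrics from the CI service
Developer Productivity | Change failure rate | Revert commits on the branch that corresponds to production deployments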
Measurement results
• MTTR
• Deployment count
• Deployment time
• CI stability
• Change failure rate
MTTR Plot with Trendline
MTTR per half year
Histogram
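Observations on MTTR
• Measurability
  • A mechanism for automatic aggregation is needed
  • That requires a standardized Incident Response process and a Tool that supports it
  • For example, Severity (definition of Incident Levels) and a rule that the occurrence, detection, and recovery times of an incident are always recorded
• Data volume
  • There were at most about 50 incidents in 2 years, which is not enough data
  • Having many incidents contradicts the purpose of SRE
• Variance
  • With too little data, the value is skewed by a few long-running incidents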
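The variance point above can be illustrated with a small sketch. It assumes each incident record carries the occurrence, detection, and recovery times mentioned in the rule above; the class and field names and the sample durations are hypothetical, and time to recover is assumed to run from occurrence to recovery.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median


@dataclass
class Incident:
    """One incident record. Field names are hypothetical."""
    occurred_at: datetime
    detected_at: datetime
    recovered_at: datetime

    def minutes_to_recover(self) -> float:
        # Assumption: time to recover is measured from occurrence to recovery.
        return (self.recovered_at - self.occurred_at).total_seconds() / 60


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to recover, in minutes."""
    return mean(i.minutes_to_recover() for i in incidents)


def median_ttr_minutes(incidents: list[Incident]) -> float:
    """Median time to recover, shown only as a contrast to the mean."""
    return median(i.minutes_to_recover() for i in incidents)


def make_incident(start: datetime, recover_after_min: int) -> Incident:
    # Illustrative helper: detection is assumed to happen 5 minutes after occurrence.
    return Incident(
        occurred_at=start,
        detected_at=start + timedelta(minutes=5),
        recovered_at=start + timedelta(minutes=recover_after_min),
    )


# Made-up sample: three short incidents and one long-running one.
start = datetime(2021, 1, 1, 9, 0)
sample = [make_incident(start, m) for m in (30, 20, 40, 600)]

print(mttr_minutes(sample))        # 172.5
print(median_ttr_minutes(sample))  # 35.0
```

With only four records, the single 600-minute incident pulls the mean far above the typical recovery time, which is the skew described above.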
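Observations on MTTR
• It is still too early to judge whether it is useful as an indicator
• It is useful in at least the following respects, so tracking continues
  • Standardizing Incident Response
  • Improving the handling of long-running incidents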
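[Reference] Incident Metrics in SRE
• Argues that MTTR should not be used as an indicator
• The reasons are the small number of incidents and the large variance
• Does not say which indicator should be adopted instead
https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/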
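[Supplement] Measurement target for the Developer Productivity area
• A monorepo is used
• Multiple applications are deployed at the same time from the master branch
• They form a Distributed monolith that shares a Database
• These are basically released weekly
• Other microservices are released individually, but they are out of scope for this measurement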
Deployment count
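Observations on deployment count
• Releases are basically Weekly, so there are always at least 26 deployments per half year
• The rest are HOTFIXes
  • What was each HOTFIX for?
  • This has little meaning unless viewed together with the change failure rate (discussed later)
  • In the first half of 2020, many HOTFIXes were in fact for changes to the Production Kubernetes manifests
• Deployment count is trending down because of the move to Microservices
  • Microservices need to be included in the measurement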
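The arithmetic above (weekly releases give a floor of 26 deployments per half year, and the remainder are HOTFIXes) can be sketched as follows. The function names and the sample dates are hypothetical, and treating every deployment beyond 26 as a HOTFIX is a simplifying assumption.

```python
from collections import Counter
from datetime import date

# One release per week gives 26 releases per half year.
WEEKLY_RELEASES_PER_HALF = 26


def half_year_key(d: date) -> str:
    """Label a date with its half-year period, e.g. '2020-H1'."""
    return f"{d.year}-H{1 if d.month <= 6 else 2}"


def deploys_per_half(deploy_dates: list[date]) -> Counter:
    """Count deployments per half-year period."""
    return Counter(half_year_key(d) for d in deploy_dates)


def estimated_hotfixes(total_deploys: int) -> int:
    # Assumption: anything beyond the weekly baseline was a HOTFIX.
    return max(total_deploys - WEEKLY_RELEASES_PER_HALF, 0)


# Made-up deployment dates.
dates = [date(2020, 1, 6), date(2020, 6, 30), date(2020, 7, 1)]
print(deploys_per_half(dates))
print(estimated_hotfixes(40))  # 14
```

A count like this says nothing about why a HOTFIX was needed, which is why it has to be read together with the change failure rate.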
Deployment time
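Observations on deployment time
• Not yet analyzed (to be added later)
• This time equals the time needed to Revert when a change fails, so the shorter it is, the more it should reduce MTTR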
CI stability
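Observations on CI stability
• The Time Window is 7 Days
  • It would be better to measure with multiple Time Windows, such as 30 Days and 90 Days
• Development branches should be measured in the same way, not only production
• It is not suitable as an indicator
  • Basically, the closer to 100% the better
  • A better approach is to treat it as an SLO and make a root-cause fix when the target is violated
  • It is better to use another indicator that subsumes this value
    • Time To Delivery (lead time for changes)
    • MTTR
• However, analyzability matters. When CI is unstable, we still need to know which Job is unstable and to what degree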
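As a rough sketch of the points above, CI stability can be computed as the success rate of job runs over several Time Windows, compared against an SLO target, and broken down per Job. The record shape, the sample runs, and the 95% target are all hypothetical; real data would come from the CI service's metrics.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class JobRun:
    """One finished CI job run. Field names are hypothetical."""
    job: str
    finished_at: datetime
    succeeded: bool


def in_window(run: JobRun, now: datetime, window_days: int) -> bool:
    return now - timedelta(days=window_days) <= run.finished_at <= now


def success_rate(runs: list[JobRun], now: datetime, window_days: int) -> Optional[float]:
    """Share of successful runs in the window, or None if there were no runs."""
    recent = [r for r in runs if in_window(r, now, window_days)]
    if not recent:
        return None
    return sum(r.succeeded for r in recent) / len(recent)


def success_rate_by_job(runs: list[JobRun], now: datetime, window_days: int) -> dict[str, float]:
    """Per-job success rate, to see which Job is unstable and by how much."""
    grouped: dict[str, list[bool]] = {}
    for r in runs:
        if in_window(r, now, window_days):
            grouped.setdefault(r.job, []).append(r.succeeded)
    return {job: sum(results) / len(results) for job, results in grouped.items()}


def violates_slo(rate: Optional[float], target: float) -> bool:
    return rate is not None and rate < target


now = datetime(2021, 6, 4)
# Made-up runs: (job name, days before `now`, result).
runs = [
    JobRun("test", now - timedelta(days=2), True),
    JobRun("test", now - timedelta(days=3), False),
    JobRun("build", now - timedelta(days=10), True),
    JobRun("build", now - timedelta(days=20), True),
    JobRun("test", now - timedelta(days=60), False),
]

SLO_TARGET = 0.95  # hypothetical target
for days in (7, 30, 90):
    rate = success_rate(runs, now, days)
    print(days, rate, violates_slo(rate, SLO_TARGET))
print(success_rate_by_job(runs, now, 90))
```

The per-job breakdown is what keeps the value analyzable even if the headline indicator is replaced by lead time or MTTR.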
Change failure rate
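Observations on change failure rate
• Definition of "change failure"
  • We need to define what counts as a "change failure"
  • Without operational rules such as attaching a Label, it is hard to measure
• Measurement method
  • Reverts on the production branch also happened for reasons other than "change failure"
  • Argo Rollouts is used
    • Canary Strategy is normally not enabled
    • It was enabled only when a Canary was wanted, for example for important features, and the enabling change was Reverted after the 100% release
• Data volume
  • Same problem as MTTR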
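A minimal sketch of the measurement problem above: counting Revert commits on the production branch overcounts unless intentional Reverts are excluded by some operational rule. It relies on the default `git revert` commit subject, which starts with `Revert "`; the `[not-a-failure]` marker, the choice of deployment count as the denominator, and the sample subjects are all assumptions for illustration.

```python
REVERT_PREFIX = 'Revert "'            # default subject prefix written by `git revert`
NON_FAILURE_MARKER = "[not-a-failure]"  # hypothetical label for intentional Reverts


def is_revert(subject: str) -> bool:
    return subject.startswith(REVERT_PREFIX)


def is_change_failure(subject: str) -> bool:
    # A Revert counts as a change failure unless it carries the (hypothetical) marker.
    return is_revert(subject) and NON_FAILURE_MARKER not in subject


def change_failure_rate(subjects: list[str], deploy_count: int) -> float:
    """Assumption: rate = failure Reverts / number of production deployments."""
    if deploy_count <= 0:
        raise ValueError("deploy_count must be positive")
    failures = sum(1 for s in subjects if is_change_failure(s))
    return failures / deploy_count


# Made-up commit subjects, e.g. as printed by `git log --format=%s` on the production branch.
subjects = [
    "Add feature A",
    'Revert "Add feature A"',
    'Revert "Enable canary for feature B" [not-a-failure]',
    "Fix typo",
]

print(sum(is_revert(s) for s in subjects))  # 2 Reverts in total
print(change_failure_rate(subjects, 10))    # 0.1
```

Without the marker both Reverts would be counted, which is the overcounting caused by the Canary enable-then-Revert workflow.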
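Summary and discussion
• SRE work was classified into 5 areas, and for 2 of them we measured candidate metrics, with reference to "Lean と DevOps の科学", to see whether they can serve as indicators
• Conditions for an effective indicator
  • There is a sufficient amount of data
    • For MTTR and change failure rate, it is hard to obtain enough data
    • A state in which these occur frequently contradicts the purpose of SRE
  • No other indicator subsumes it
    • CI stability can be covered by MTTR and Time To Delivery (lead time for changes)
    • Deployment time can be covered by MTTR and Time To Delivery (lead time for changes)
• Increasing the deployment count in a healthy way requires measuring the change failure rate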
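Summary and discussion
• MTTR 🚀
  • Continue tracking
  • Incident Response needs to be improved so that the data can be collected continuously
• Deployment count 🚀
  • Measure with microservices included
• Deployment time 🤔
  • Development branches need to be measured
  • In the long term, cover it with MTTR / lead time for changes
• CI stability 🤔
  • Development branches need to be measured
  • In the long term, cover it with MTTR / lead time for changes
• Change failure rate 🚀
  • A definition of change failure and operational rules need to be established
• Lead time for changes 🤔
  • Ideally we would measure from the First commit on the Develop branch to the Code change in Production, but there are many variables and the variance may be large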
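Future outlook
• Continue the current measurements and aim to automate them
• Indicators about "failure" may be hard to put to use, but we will keep measuring them
  • MTTR, change failure rate
  • These are processes that involve people, so measuring them requires definitions, rules, and norms
• Propose indicators for the other areas as well
• Organize design patterns for measurement and visualization
• Aim for a world where anyone can measure and visualize anything and carry out continuous improvement autonomously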
Thank you!
Takeshi Kondo (chaspy / chaspy_)
Lead Software Engineer, Site Reliability at Quipper