Slide 1

Slide 1 text

Proposal of Key Areas and Performance Indicators in Site Reliability Engineering
Takeshi Kondo / @chaspy
2021/06/04, 8th WebSystemArchitecture Study Group (online)

Slide 2

Slide 2 text

Who am I
Takeshi Kondo (chaspy, chaspy_)
Lead Software Engineer, Site Reliability at Quipper

Slide 3

Slide 3 text

Agenda
1. Background and purpose
2. Areas SRE is involved in
3. Proposed metrics and measurement methods
4. Measurement results
5. Summary and future outlook

Slide 4

Slide 4 text

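Background and purpose
• The SRE role has spread widely, and more companies now practice it
• The role differs with the business and with the size and nature of the organization
  → We want to classify the key areas SRE is involved in
• Setting business KPIs and improving them iteratively, as in product development, should also work for SRE
  → We want to define and measure performance metrics for each area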

Slide 5

Slide 5 text

Areas SRE is involved in
• Assumes a company that provides web services on the cloud
• 100+ developers
• 30M+ accesses / day

Slide 6

Slide 6 text

[Reference] AWS Well-Architected Framework
https://aws.amazon.com/architecture/well-architected/

Slide 7

Slide 7 text

Metrics and measurement methods
• Reliability
• Developer Productivity
• Cost
• Security
• Platform

Slide 8

Slide 8 text

Relationships between areas
[Diagram: Platform, Reliability, Cost, Developer Productivity, and Security, connected by relations labeled "Empowerment" and "Trade-Off"]

Slide 9

Slide 9 text

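[Reference] "Lean と DevOps の科学" (Japanese edition of "Accelerate")
• Elite companies excel at all of the following
  • Deployment frequency
  • Lead time for changes
  • MTTR
  • Change failure rate
https://book.impress.co.jp/books/1118101029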

Slide 10

Slide 10 text

Metrics and measurement methods
• Reliability
• Developer Productivity
• Cost
• Security
• Platform

Slide 11

Slide 11 text

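Metrics and measurement methods

Area | Metric | Measurement method
Reliability | MTTR | Measured manually, as a required field in the incident report flow whenever an incident occurs
Developer Productivity | Deployment count | Metrics from the CI service
Developer Productivity | Deployment time | Metrics from the CI service
Developer Productivity | CI stability | Metrics from the CI service
Developer Productivity | Change failure rate | Number of revert commits on the branch that corresponds to production deployments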

Slide 12

Slide 12 text

Measurement results
• MTTR
• Deployment count
• Deployment time
• CI stability
• Change failure rate

Slide 13

Slide 13 text

MTTR Plot with Trendline

Slide 14

Slide 14 text

MTTR per half year

Slide 15

Slide 15 text

Histogram

Slide 16

Slide 16 text

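Discussion of MTTR
• Measurability problem
  • A mechanism for automatic aggregation is needed
  • That requires a standardized Incident Response process and tooling to support it
  • For example: Severity (a definition of incident levels), and a rule that the times of occurrence, detection, and recovery are always recorded
• Data volume problem
  • There were at most about 50 incidents in 2 years, which is not enough data
  • Having many incidents runs counter to the purpose of SRE
• Variance problem
  • With too little data, the value is dragged around by a few long-running incidents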
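The variance problem above can be illustrated with a small sketch. The incident records below are made up for illustration (they are not data from the talk), and `recovery_minutes` is a hypothetical helper. It assumes each incident has a recorded detection time and recovery time, and compares the mean with the median to show how one long incident dominates a small sample.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical incident records: (detected_at, recovered_at). Values are made up.
incidents = [
    (datetime(2020, 1, 10, 9, 0), datetime(2020, 1, 10, 9, 30)),
    (datetime(2020, 3, 2, 14, 0), datetime(2020, 3, 2, 14, 45)),
    (datetime(2020, 5, 20, 22, 0), datetime(2020, 5, 21, 6, 0)),  # one long outage
]


def recovery_minutes(records):
    """Return the time to recovery, in minutes, for each incident."""
    return [
        (recovered - detected).total_seconds() / 60
        for detected, recovered in records
    ]


durations = recovery_minutes(incidents)
mttr = mean(durations)       # pulled up by the single long outage
typical = median(durations)  # much less sensitive to that outlier

print(f"MTTR (mean): {mttr:.0f} min, median: {typical:.0f} min")
```

With only three incidents, the single 8-hour outage pushes the mean far above the median, which is the effect the slide describes.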

Slide 17

Slide 17 text

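Discussion of MTTR
• Whether it is useful as a metric cannot be judged yet
• It is useful at least in the following respects, so we will keep tracking it
  • Standardizing Incident Response
  • Improving how long-running incidents are handled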

Slide 18

Slide 18 text

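[Reference] Incident Metrics in SRE
• Argues that MTTR should not be used as a metric
• The reasons are too few incidents and large variance
• Does not say which metric should be adopted instead
https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/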

Slide 19

Slide 19 text

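[Supplement] What is measured in the Developer Productivity area
• We use a monorepo
• On the master branch, multiple applications are deployed at the same time
• They form a distributed monolith that shares a database
• These are basically released weekly
• Other microservices are released individually, but are out of scope for this measurement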

Slide 20

Slide 20 text

Deployment count

Slide 21

Slide 21 text

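Discussion of deployment count
• Releases are basically weekly, so there are always 26 deployments per half year
• The rest are HOTFIXes
  • What were the HOTFIXes for?
  • The number seems to mean little unless read together with the change failure rate (discussed later)
  • In fact, in the first half of 2020 many HOTFIXes were for changes to the production Kubernetes manifests
• The deployment count is trending down because we are moving to microservices
  • Microservices need to be included in the measurement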
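The arithmetic behind this slide can be sketched as follows. This is a minimal illustration, assuming every deployment beyond the 26 weekly releases in a half year is a HOTFIX. `split_deploys` is a hypothetical helper and the total of 40 is a made-up number, not a value from the chart.

```python
WEEKLY_RELEASES_PER_HALF_YEAR = 26  # from the slide: one scheduled release per week


def split_deploys(total_deploys, scheduled=WEEKLY_RELEASES_PER_HALF_YEAR):
    """Split a half-year deployment count into scheduled releases and hotfixes.

    Assumption: anything beyond the scheduled weekly releases is a hotfix.
    """
    hotfixes = max(total_deploys - scheduled, 0)
    return {"scheduled": min(total_deploys, scheduled), "hotfix": hotfixes}


# Hypothetical half-year total, for illustration only.
print(split_deploys(40))
```

The hotfix count by itself does not say why the hotfixes happened, which is why the slide suggests reading it together with the change failure rate.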

Slide 22

Slide 22 text

Deployment time

Slide 23

Slide 23 text

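Discussion of deployment time
• Not yet analyzed (to be added later)
• This time equals the time needed to revert a failed change, so the shorter it gets, the more it should reduce MTTR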

Slide 24

Slide 24 text

CI stability

Slide 25

Slide 25 text

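Discussion of CI stability
• The time window is 7 days
  • It would be better to measure over several time windows, such as 30 days and 90 days
• Development branches should be measured as well as production
• Not suitable as a metric
  • Basically, the closer to 100% the better
  • Better treated as an SLO: when the target is violated, fix the root cause
  • Better to use another metric that includes this value
    • Time To Delivery (lead time for changes)
    • MTTR
• Analyzability still matters: when CI is unstable, we need to know which job is unstable and by how much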
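The idea of measuring CI stability over several time windows and treating it as an SLO can be sketched as below. This is only an illustration: the run records, the `success_rate` and `slo_report` helpers, and the 95% target are all assumptions, not values or tooling from the talk.

```python
from datetime import datetime, timedelta


def success_rate(runs, now, window_days):
    """Share of successful CI runs that finished within the last `window_days` days."""
    cutoff = now - timedelta(days=window_days)
    recent = [ok for finished_at, ok in runs if finished_at >= cutoff]
    if not recent:
        return None  # no runs in this window
    return sum(recent) / len(recent)


def slo_report(runs, now, windows=(7, 30, 90), target=0.95):
    """Map each window to (rate, meets_target). `target` is a made-up SLO value."""
    report = {}
    for days in windows:
        rate = success_rate(runs, now, days)
        report[days] = (rate, rate is not None and rate >= target)
    return report


# Hypothetical CI runs: (finished_at, succeeded).
now = datetime(2021, 6, 4)
runs = [
    (now - timedelta(days=1), True),
    (now - timedelta(days=3), False),
    (now - timedelta(days=20), True),
    (now - timedelta(days=60), True),
]
report = slo_report(runs, now)
for days, (rate, ok) in report.items():
    print(f"{days:>2} days: rate={rate:.2f} meets_target={ok}")
```

Grouping the same records by job name instead of by window would give the per-job view that the last bullet asks for.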

Slide 26

Slide 26 text

Change failure count

Slide 27

Slide 27 text

Change failure rate

Slide 28

Slide 28 text

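Discussion of change failure rate
• Problem of defining "change failure"
  • We need a definition of what counts as a "change failure"
  • Without operational rules such as attaching labels, it is hard to measure
• Problem of measurement method
  • Reverts on the production branch also happened for reasons other than "change failure"
  • We use Argo Rollouts
    • The Canary Strategy is normally not enabled
    • We enabled it only when we wanted a canary, for example for important features, and reverted it after the 100% release
• Data volume problem
  • Same problem as MTTR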
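The measurement problem on this slide can be sketched in code. This is a hypothetical illustration: the label names (`change-failure`, `canary-cleanup`), the revert records, and the deployment count of 40 are all made up, and labeling is only one possible operational rule. It compares a naive rate that counts every revert with a rate that counts only reverts labeled as failures.

```python
def change_failure_rate(deploys, reverts, failure_label="change-failure"):
    """Failed changes divided by total deploys, counting only labeled reverts.

    Assumption: every revert commit carries a label assigned by an operational rule.
    """
    if deploys == 0:
        raise ValueError("no deploys in the period")
    failures = [r for r in reverts if r["label"] == failure_label]
    return len(failures) / deploys


# Hypothetical revert commits on the production branch.
reverts = [
    {"subject": 'Revert "Add new billing flow"', "label": "change-failure"},
    {"subject": 'Revert "Enable canary strategy"', "label": "canary-cleanup"},
    {"subject": 'Revert "Bump API timeout"', "label": "change-failure"},
]

naive = len(reverts) / 40               # counts every revert as a failure
rate = change_failure_rate(40, reverts)  # ignores the canary cleanup revert

print(f"naive: {naive:.3f}, labeled: {rate:.3f}")
```

The gap between the two numbers is the overcounting caused by reverts that were not failures, such as the canary cleanup case described above.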

Slide 29

Slide 29 text

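Summary and discussion
• We classified SRE work into 5 areas and, for 2 of them, measured whether candidate metrics drawn from "Lean と DevOps の科学" can serve as metrics
• Conditions for a useful metric
  • There is enough data
    • For MTTR and change failure rate, it is hard to obtain enough data
    • A state in which these occur frequently runs counter to the purpose of SRE
  • No other metric already includes it
    • CI stability can be covered by MTTR and Time To Delivery (lead time for changes)
    • Deployment time can also be covered by MTTR and Time To Delivery (lead time for changes)
• To increase the deployment count in a healthy way, the change failure rate must be measured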

Slide 30

Slide 30 text

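Summary and discussion
• MTTR 🚀
  • Continue tracking
  • Incident Response must be improved so the data can be collected continuously
• Deployment count 🚀
  • Measure with microservices included
• Deployment time 🤔
  • Development branches need to be measured
  • In the long run, cover it with MTTR / lead time for changes
• CI stability 🤔
  • Development branches need to be measured
  • In the long run, cover it with MTTR / lead time for changes
• Change failure rate 🚀
  • A definition of change failure and operational rules need to be established
• Lead time for changes 🤔
  • Ideally we would measure from the first commit on the develop branch to the code change reaching production, but there are many variables and the variance may be large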
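The lead time for changes mentioned in the last item could be computed along these lines. This is a sketch under assumptions: the timestamps are made up, `lead_time_hours` is a hypothetical helper, and it presumes that the first-commit time and the production deployment time are available for each change, which the talk does not confirm.

```python
from datetime import datetime
from statistics import median, pstdev

# Hypothetical changes: (first_commit_at, deployed_to_production_at).
changes = [
    (datetime(2021, 5, 3, 10, 0), datetime(2021, 5, 4, 10, 0)),
    (datetime(2021, 5, 5, 10, 0), datetime(2021, 5, 8, 10, 0)),
    (datetime(2021, 5, 10, 10, 0), datetime(2021, 5, 20, 10, 0)),
]


def lead_time_hours(records):
    """Hours from the first commit to the production deployment, per change."""
    return [
        (deployed - first_commit).total_seconds() / 3600
        for first_commit, deployed in records
    ]


hours = lead_time_hours(changes)
center = median(hours)  # typical lead time
spread = pstdev(hours)  # population standard deviation as a measure of variance

print(f"median: {center:.0f} h, stdev: {spread:.0f} h")
```

Reporting a spread alongside the median makes the concern about large variance visible instead of hiding it inside a single average.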

Slide 31

Slide 31 text

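Future outlook
• Continue the current measurements and aim to automate them
• Failure-related metrics may be hard to put to use, but we will keep measuring them
  • MTTR, change failure rate
  • Processes that involve people need definitions, rules, and conventions before they can be measured
• Propose metrics for the other areas as well
• We want to organize design patterns for measurement and visualization
• Aim for a world where anyone can measure and visualize anything and drive continuous improvement autonomously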

Slide 32

Slide 32 text

Thank you!
Takeshi Kondo (chaspy, chaspy_)
Lead Software Engineer, Site Reliability at Quipper