
サービス立ち上げ期におけるSREの取り組み / SRE efforts in the service launch phase

Takeshi Kondo
January 19, 2022

Transcript

  1. SRE efforts in the service launch phase
    Takeshi Kondo / @chaspy
    2022/01/19
    [iCARE Dev Meetup #29] The struggles and joys of launching a new service as engineers
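  2. What I'll talk about today / Audience
    • Topics
      • How the SRE mindset helps during the service launch phase
      • How to apply SRE ideas and put them into practice
      • This is mostly the same content as my blog post: https://blog.sisterwith.com/blog/sre-for-sister
    • Audience
      • People who are unsure how to think about reliability when launching a service
      • People who want to practice SRE but don't know where to start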
  3. What is SRE?
    • SRE = Site Reliability Engineering
    • It originated as "having software engineers carry out service operations" (*1)
    • Its core concept is SLI/SLO (*2): turn the service level users expect into measurable indicators, and use them as a guide for whether to invest in feature development or non-functional work
    *1 Site Reliability Engineering: https://sre.google/sre-book/introduction/ "our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins."
    *2 Service Level Indicator / Service Level Objective
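
As a rough illustration of the SLI/SLO idea on this slide, the sketch below computes an availability SLI from request counts and compares it with an SLO to decide where to invest. The function names, the 99.9% target, and the request counts are hypothetical and do not come from the talk.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that succeeded (the SLI)."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic is treated as fully available
    return good_requests / total_requests


def investment_hint(sli: float, slo: float) -> str:
    """Suggest where to invest, based on whether the SLO is being met."""
    if sli >= slo:
        return "SLO met: feature development can continue"
    return "SLO missed: prioritize reliability (non-functional) work"


SLO = 0.999  # hypothetical target, not taken from the talk
sli = availability_sli(good_requests=99_950, total_requests=100_000)
print(f"SLI = {sli:.4%}")
print(investment_hint(sli, SLO))
```

A real service would take the counts from its metrics system over a rolling window. The point here is only that the SLO turns "is reliability good enough?" into a comparison that can drive prioritization.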
  4. Aside: a 100% reliability target is the wrong target
    • 100% is the wrong reliability target (*3)
    • Each additional nine (99.9%, 99.99%, ...) costs significantly more to achieve
    • 100% is impossible, so work from the premise that failures will happen
    *3 Site Reliability Engineering: https://sre.google/sre-book/introduction/ "The error budget stems from the observation that 100% is the wrong reliability target for basically everything"
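
To make the cost of each extra nine concrete, the following sketch converts an availability target into the downtime it permits over a window. The 30-day window and the function name `allowed_downtime` are assumptions chosen for illustration.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Return the downtime an availability SLO permits within the window."""
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be between 0 and 1")
    return window * (1.0 - slo)


# Print the downtime allowance for a few targets, including 100%.
for slo in (0.99, 0.999, 0.9999, 1.0):
    print(f"{slo:.2%} -> {allowed_downtime(slo)}")
```

Over 30 days, 99.9% leaves roughly 43 minutes of downtime and 99.99% roughly 4 minutes. A 100% target leaves none, so it has no room for any failure at all.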
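  5. Aside: reliability is one of the most important features
    • Reliability Is the Most Important Feature (*4)
    • If a system isn't reliable, users won't trust it
    • If users don't trust a system, they won't use it
    • Systems spread through network effects, so a system with no users has no value
    • Choose what you measure carefully
    *4 The Site Reliability Workbook: https://sre.google/workbook/reaching-beyond/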
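  6. SRE for a personal project: where to start?
    • 1. Are we providing the service level users expect?
    • 2. Can we minimize the time during which we are not providing it?
    • In other words...
      • Catch faulty changes before they reach production
      • Notice right away when a faulty change does reach production
      • When a faulty change reaches production, be able to investigate its cause
      • When a faulty change reaches production, be able to release the fix quickly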
  7. A ladder for practicing SRE
    • Launch phase (sister is here)
      • Developer Productivity (Local Environment)
      • Release Engineering, Unit Test, CI/CD
      • Observability (Logging, Metrics, Tracing)
    Speed up the develop, apply, verify cycle
    Build mechanisms for noticing problems quickly
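
The talk does not say which observability tooling is used, so the following is only a minimal sketch of the "Logging" rung using Python's standard library. It emits one JSON object per log line so that problems can be searched and alerted on. The field names `path` and `duration_ms` and the helper names are hypothetical.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "path": getattr(record, "path", None),
            "duration_ms": getattr(record, "duration_ms", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


def handle_request(logger: logging.Logger, path: str) -> None:
    """Stand-in for a request handler that logs its own latency."""
    start = time.perf_counter()
    # ... the real request handling would go here ...
    duration_ms = (time.perf_counter() - start) * 1000
    logger.info(
        "request handled",
        extra={"path": path, "duration_ms": round(duration_ms, 3)},
    )


handle_request(build_logger(), "/health")
```

Structured lines like these are easy to aggregate into per-path latency and error metrics later, which is one way the logging rung can feed the metrics rung.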
  8. A ladder for practicing SRE
    • Up to production operation (team size: up to about 10 people)
      • Continuous Library Update (renovate/dependabot)
      • Data Protection
      • Availability (AutoScaling, Redundancy)
      • Performance Improvement
    Prepare for growth in the number of users and the volume of data
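  9. A ladder for practicing SRE
    • From the start of production operation through the growth phase (up to about 50 people)
      • E2E Test Automation
      • SLI/SLO/Error Budget Policy
      • Incident Response Management / Training
      • Load Test / Stress Test
    Ensure the reliability that the organization and team are aiming for
    Prepare and design with the next several years in mind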