
サービス立ち上げ期におけるSREの取り組み / SRE efforts in the service launch phase

Takeshi Kondo
January 19, 2022

Transcript

  1. SRE efforts in the service launch phase
    Takeshi Kondo / @chaspy
    2022/01/19
    [iCARE Dev Meetup #29] Engineers on the hardships and joys of launching a new service

  2. Who am I
    chaspy chaspy_
    SRE at sisterwith.com

    Takeshi Kondo


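  3. What I will talk about today / Audience
    • Topics
      • How the SRE way of thinking helps during the service launch phase
      • How to apply SRE thinking and put it into practice
      • This is mostly the content of the blog post: https://blog.sisterwith.com/blog/sre-for-sister
    • Audience
      • People who are unsure how to think about reliability when launching a service
      • People who want to practice SRE but do not know where to start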

  4. Tl;dr
    • SRE thinking applies even in the service launch phase
    • Picture what your users expect in terms of reliability
    • Build an SRE roadmap that matches the scale of your service and organization

  5. Agenda
    1. What is SRE
    2. Solo development and SRE
    3. Case study at sister
    4. A ladder for practicing SRE

  6. Agenda
    1. What is SRE
    2. Solo development and SRE
    3. Case study at sister
    4. A ladder for practicing SRE

  7. What is SRE
    • SRE = Site Reliability Engineering
    • Its origin is "having software engineers run service operations" (*1)
    • Its core concept is SLI/SLO (*2): turn the service level users expect into a measurable indicator, and use it to guide whether to invest in feature work or non-functional work
    *1 Site Reliability Engineering, https://sre.google/sre-book/introduction/: "our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins."
    *2 Service Level Indicator / Service Level Objective

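  8. A common story (citation needed)
    • "Isn't that only necessary for services as large as Google's?"
    • "In solo development or at a startup, building features that users will actually use obviously comes first, so SRE has nothing to do with us, right?"
    • (This is exaggerated)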

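  9. What is SRE (revisited)
    • -> Its core concept is SLI/SLO: turn the service level users expect into a measurable indicator, and use it to guide whether to invest in feature work or non-functional work
    • In other words...
      • Are we delivering the service level users expect?
      • Can we minimize the time during which we are not delivering it?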
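
The two questions on this slide can be made concrete with a small calculation. The following is a minimal sketch, not something shown in the talk: it assumes an availability-style SLI defined as the share of "good" requests, and the `Window` type, the request counts, and the 99.9% target are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one measurement window (hypothetical data)."""
    total: int
    good: int  # requests that met the user's expectation, e.g. non-5xx responses


def availability_sli(window: Window) -> float:
    """Return the fraction of good requests; an empty window counts as fully good."""
    if window.total == 0:
        return 1.0
    return window.good / window.total


def meets_slo(window: Window, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return availability_sli(window) >= slo_target


week = Window(total=20_000, good=19_990)
print(f"SLI = {availability_sli(week):.4%}")
print("SLO met:", meets_slo(week, slo_target=0.999))
```

Which requests count as "good" is the real design decision here; the code only does the arithmetic once that definition has been chosen.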

  10. Aside: 100% is the wrong reliability target
    • 100% is the wrong reliability target (*3)
    • Each extra nine (99.9%, 99.99%, ...) makes the cost of reaching the target rise sharply
    • 100% is impossible, so work from the premise that failures will happen
    *3 Site Reliability Engineering, https://sre.google/sre-book/introduction/: "The error budget stems from the observation that 100% is the wrong reliability target for basically everything"
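
To see why each additional nine is expensive, it helps to convert an availability target into the downtime it permits. This is a worked-arithmetic sketch under an assumption the slide does not state: a 30-day measurement period and a purely time-based availability definition. The function name is illustrative.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(slo_target: float,
                             period_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Downtime that still satisfies the availability target over the period."""
    return (1.0 - slo_target) * period_minutes


for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} -> {minutes:.1f} min per 30 days")
```

Under these assumptions, 99% allows 432 minutes per 30 days, 99.9% allows 43.2 minutes, and 99.99% allows roughly 4.3 minutes, which leaves little room for a human to respond to an incident at all.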

  11. Agenda
    1. What is SRE
    2. Solo development and SRE
    3. Case study at sister
    4. A ladder for practicing SRE

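  12. How should we think of the solo development phase?
    • There may be few users, but they exist
    • If users are satisfied, the user base grows
    • If users cannot use the service satisfactorily, or it does not work the way they expect
      • they will leave easily
    Whether in solo or large-scale development, reliability that meets user expectations matters just as much as feature development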

  13. Aside: reliability is one of the most important features
    • Reliability Is the Most Important Feature (*4)
      • If a system is not reliable, users will not trust it
      • If users do not trust a system, they will not use it
      • Systems spread through network effects, so a system with no users has no value
      • Choose what you measure carefully
    *4 The Site Reliability Workbook: https://sre.google/workbook/reaching-beyond/

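  14. SRE in solo development: where to start?
    • 1. Are we delivering the service level users expect?
    • 2. Can we minimize the time during which we are not delivering it?
    • In other words...
      • Make it possible to catch a faulty change before it reaches production
      • Be able to notice right away when a faulty change causes trouble in production
      • When a faulty change causes trouble in production, make its cause investigable
      • When a faulty change causes trouble in production, be able to release the fix quickly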
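
One inexpensive way to make a production problem noticeable and investigable is structured logging with a request identifier. The talk does not say how sister implements this; the sketch below is an assumption-laden illustration using only Python's standard `logging` module. `JsonFormatter`, `build_logger`, the `request_id` field, and the log messages are all hypothetical names.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so logs can be filtered by field."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # request_id is attached via the `extra` argument; None if absent
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr (handler added once)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger()
request_id = str(uuid.uuid4())  # in a web app this would be set once per request
logger.info("profile updated", extra={"request_id": request_id})
logger.error("payment failed", extra={"request_id": request_id})
```

Because every line carries the same `request_id`, all events belonging to one failing request can be pulled out together, and an alert on the `ERROR` level covers the "notice right away" point.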

  15. Agenda
    1. What is SRE
    2. Solo development and SRE
    3. Case study at sister
    4. A ladder for practicing SRE

  16. Practices at sister

  17. Practices at sister
    • Developer Productivity
    • Observability
    • Testing
    • Security

  18. Practices at sister

  19. Agenda
    1. What is SRE
    2. Solo development and SRE
    3. Case study at sister
    4. A ladder for practicing SRE

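  20. A ladder for practicing SRE
    • Three stages, depending on organization size and phase
      • Launch phase (sister is here)
      • Up to production operation (team of up to 10 people)
      • From the start of production operation through expansion (up to 50 people)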

  21. A ladder for practicing SRE
    • Launch phase (sister is here)
      • Developer Productivity (Local Environment)
      • Release Engineering, Unit Test, CI/CD
      • Observability (Logging, Metrics, Tracing)
    Speed up the cycle of developing, applying, and verifying changes
    Build mechanisms for noticing problems quickly

  22. A ladder for practicing SRE
    • Up to production operation (team of up to 10 people)
      • Continuous Library Update (renovate/dependabot)
      • Data Protection
      • Availability (AutoScaling, Redundancy)
      • Performance Improvement
    Prepare for the point when the number of users and the volume of data grow

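  23. A ladder for practicing SRE
    • From the start of production operation through expansion (up to 50 people)
      • E2E Test Automation
      • SLI/SLO/Error Budget Policy
      • Incident Response Management / Training
      • Load Test / Stress Test
    Guarantee the reliability that the organization and team are aiming for
    Prepare and design with the next several years in view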
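
The "SLI/SLO/Error Budget Policy" item on this slide can be illustrated with a short calculation. This is a sketch under stated assumptions, not the policy used at sister: the budget is counted in failed requests, and the thresholds and decision strings in `release_decision` are hypothetical.

```python
def error_budget_remaining(total: int, bad: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_bad = (1.0 - slo_target) * total
    if allowed_bad == 0:
        # A 100% target or an empty window leaves no budget to spend.
        return 1.0 if bad == 0 else float("-inf")
    return 1.0 - bad / allowed_bad


def release_decision(remaining: float) -> str:
    """Hypothetical policy: stop feature releases once the budget is spent."""
    if remaining <= 0:
        return "freeze features, prioritize reliability work"
    if remaining < 0.25:
        return "ship with caution"
    return "ship features"


remaining = error_budget_remaining(total=100_000, bad=50, slo_target=0.999)
print(f"budget remaining: {remaining:.0%} -> {release_decision(remaining)}")
```

The value of writing the policy down is that the trade-off between feature work and reliability work, described earlier in the deck as the purpose of SLI/SLO, is decided before an incident rather than argued during one.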

  24. Summary
    • SRE thinking applies even in the service launch phase
    • Picture what your users expect in terms of reliability
    • Build an SRE roadmap that matches the scale of your service and organization

  25. Closing
    • sister (sisterwith.com) is looking for "big sisters" (mentors) and "little sisters" (mentees)
    • If you have a topic related to SRE, feel free to send a Twitter DM!
      • https://twitter.com/_chaspy

  26. Thank you!
    chaspy chaspy_
    SRE at sisterwith.com

    Takeshi Kondo
