SRE efforts in the service launch phase

Takeshi Kondo
January 19, 2022

Transcript

  1. SRE efforts in the service launch phase
    Takeshi Kondo / @chaspy
    2022/01/19
    [iCARE Dev Meetup #29] The struggles and joys of launching a new service, as told by engineers

  2. Who am I
    chaspy chaspy_
    SRE at sisterwith.com

    Takeshi Kondo


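  3. What I'll talk about today / Target audience
    • What I'll talk about
      • How the SRE mindset helps during the service launch phase
      • How to apply SRE thinking and put it into practice
      • Mostly the same content as this blog post: https://blog.sisterwith.com/blog/sre-for-sister
    • Target audience
      • People who are unsure how to think about reliability when launching a service
      • People who want to practice SRE but don't know where to start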

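  4. Tl;dr
    • SRE thinking can be applied even in the service launch phase
    • Imagine what your users expect in terms of reliability
    • Build an SRE roadmap that matches the scale of your service and organization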

  5. Agenda
    1. What is SRE?
    2. Solo development and SRE
    3. Case study at sister
    4. A ladder for practicing SRE

  6. Agenda
    1. What is SRE?
    2. Solo development and SRE
    3. Case study at sister
    4. A ladder for practicing SRE

  7. What is SRE?
    • SRE = Site Reliability Engineering
    • Its origin is "having software engineers run service operations" (*1)
    • Its core concepts are SLIs and SLOs (*2): turn the service level users expect into a metric, and use that metric to guide whether to invest in feature development or in non-functional work
    *1 Site Reliability Engineering: https://sre.google/sre-book/introduction/ "our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins."
    *2 Service Level Indicator / Service Level Objective

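  8. A common refrain (citation needed)
    • "Isn't that only necessary for services as large as Google's?"
    • "In solo projects and startups, building features that users will actually use obviously comes first, so SRE has nothing to do with us, right?"
    • (I am exaggerating.)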

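  9. What is SRE? (revisited)
    • -> Its core concepts are SLIs and SLOs: turn the service level users expect into a metric, and use that metric to guide whether to invest in feature development or in non-functional work
    • In other words...
      • Are we delivering the service level users expect?
      • Can we minimize the time during which we fail to deliver it?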
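As a minimal sketch of the first question (are we delivering the service level users expect?), an availability SLI can be computed as the ratio of good requests to total requests and compared against an SLO target. The names `AvailabilitySLO`, `availability_sli`, and `meets_slo`, and the 99.9% target, are assumptions for illustration and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """Hypothetical SLO: the fraction of requests that must succeed."""

    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI = good events / total events; no traffic is treated as fully available."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo: AvailabilitySLO) -> bool:
    """True when the measured service level is at or above the objective."""
    return sli >= slo.target


# Example with made-up numbers: 50 failed requests out of 100,000.
slo = AvailabilitySLO(target=0.999)
sli = availability_sli(good_requests=99_950, total_requests=100_000)
print(f"SLI={sli:.4%}, meets SLO: {meets_slo(sli, slo)}")
```

Which events count as "good" (status code, latency threshold, and so on) is itself a design decision and depends on what users of the service actually expect.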

  10. Aside: 100% is the wrong reliability target
    • 100% is the wrong reliability target (*3)
    • Each additional nine (99.9%, 99.99%, ...) adds substantial cost
    • 100% is impossible, so we should work from the premise that failures will happen
    *3 Site Reliability Engineering: https://sre.google/sre-book/introduction/ "The error budget stems from the observation that 100% is the wrong reliability target for basically everything"
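To make the "nines" concrete, the arithmetic below converts an availability target into the downtime it permits. This is a sketch under one assumption not stated in the talk: a 30-day measurement window. The function name `allowed_downtime` is hypothetical.

```python
from datetime import timedelta

WINDOW = timedelta(days=30)  # assumed measurement window


def allowed_downtime(availability_target: float, window: timedelta = WINDOW) -> timedelta:
    """Time the service may be unavailable while still meeting the target."""
    return window * (1.0 - availability_target)


# Each extra nine divides the permitted downtime by ten.
for target in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime(target).total_seconds() / 60
    print(f"{target:.3%} -> {minutes:8.2f} minutes per 30 days")
```

Over 30 days, 99.9% allows roughly 43 minutes of downtime and 99.99% roughly 4.3 minutes. The code only shows how quickly the allowance shrinks; the cost of staying within it is the point the slide makes.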

  11. Agenda
    1. What is SRE?
    2. Solo development and SRE
    3. Case study at sister
    4. A ladder for practicing SRE

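  12. How should we think about the solo-development phase?
    • Users may be few, but they exist
    • If users are satisfied, the user base grows
    • If users cannot use the service satisfactorily, or it does not work the way they expect,
      • they will readily leave
    In solo development and large-scale development alike, reliability that meets user expectations is as important as feature development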

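  13. Aside: reliability is one of the most important features
    • Reliability Is the Most Important Feature (*4)
    • If a system is not reliable, users will not trust it
    • If users do not trust a system, they will not use it
    • Systems spread through network effects, so a system with no users has no value
    • Choose what you measure carefully
    *4 The Site Reliability Workbook: https://sre.google/workbook/reaching-beyond/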

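  14. SRE in solo development: where to start?
    • 1. Are we delivering the service level users expect?
    • 2. Can we minimize the time during which we fail to deliver it?
    • In other words...
      • Catch faulty changes before they reach production
      • Notice right away when a faulty change does reach production
      • When a faulty change reaches production, be able to investigate its cause
      • When a faulty change reaches production, be able to release the fix quickly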
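One way to "notice right away" when a faulty change reaches production is to watch the error rate over the most recent requests. The sketch below is an illustration only: the class `ErrorRateMonitor`, its window size, and its threshold are hypothetical and are not described in the talk.

```python
from collections import deque
from typing import Deque


class ErrorRateMonitor:
    """Tracks the outcome of the most recent `window_size` requests."""

    def __init__(self, window_size: int, threshold: float) -> None:
        # A bounded deque drops the oldest outcome once the window is full.
        self._outcomes: Deque[bool] = deque(maxlen=window_size)
        self._threshold = threshold

    def record(self, is_error: bool) -> None:
        """Store whether a single request failed."""
        self._outcomes.append(is_error)

    def error_rate(self) -> float:
        """Fraction of failed requests in the window (0.0 when empty)."""
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_alert(self) -> bool:
        """True when the recent error rate exceeds the threshold."""
        return self.error_rate() > self._threshold


# Example: a release starts failing 1 request in 10.
monitor = ErrorRateMonitor(window_size=100, threshold=0.05)
for i in range(100):
    monitor.record(is_error=(i % 10 == 0))
print(monitor.error_rate(), monitor.should_alert())
```

In practice this check would usually live in a monitoring service rather than in application code; the sketch only shows the underlying calculation.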

  15. Agenda
    1. What is SRE?
    2. Solo development and SRE
    3. Case study at sister
    4. A ladder for practicing SRE

  16. Practices at sister

  17. Practices at sister
    • Developer Productivity
    • Observability
    • Testing
    • Security

  18. Practices at sister

  19. Agenda
    1. What is SRE?
    2. Solo development and SRE
    3. Case study at sister
    4. A ladder for practicing SRE

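  20. A ladder for practicing SRE
    • Three stages, depending on organization size and phase
      • Launch phase (sister is here)
      • Up to production operation (team size up to 10)
      • From the start of production operation through the growth phase (up to 50)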

  21. A ladder for practicing SRE
    • Launch phase (sister is here)
      • Developer Productivity (Local Environment)
      • Release Engineering, Unit Test, CI/CD
      • Observability (Logging, Metrics, Tracing)
    Speed up the develop, deploy, verify cycle
    Build mechanisms for noticing problems quickly
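For the logging part of observability, one common low-cost starting point is to emit logs as JSON so they can be searched by field when investigating a problem. The sketch below uses only Python's standard `logging` and `json` modules; the `JsonFormatter` class and the `request_id` field are assumptions for illustration, not something the talk prescribes.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Formats each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # `request_id` is a hypothetical field passed via `extra=`.
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            payload["request_id"] = request_id
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("payment failed", extra={"request_id": "req-123"})
```

Attaching an identifier such as a request ID to every log line makes it possible to follow one request across log entries, which supports the "investigate the cause" goal.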

  22. A ladder for practicing SRE
    • Up to production operation (team size up to 10)
      • Continuous Library Update (renovate/dependabot)
      • Data Protection
      • Availability (AutoScaling, Redundancy)
      • Performance Improvement
    Prepare for growth in the number of users and the volume of data

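  23. A ladder for practicing SRE
    • From the start of production operation through the growth phase (up to 50)
      • E2E Test Automation
      • SLI/SLO/Error Budget Policy
      • Incident Response Management / Training
      • Load Test / Stress Test
    Ensure the level of reliability the organization and team are aiming for
    Prepare and design with the next several years in mind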
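An error budget policy ties the SLO back to the question of whether to invest in features or in reliability. The following is a minimal sketch, not the policy used at sister: the function names, the 25% threshold, and the wording of each decision are all hypothetical.

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget left (1.0 = untouched, 0.0 or below = exhausted)."""
    if total == 0:
        return 1.0
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - good
    if allowed_failures <= 0:
        # A 100% target leaves no budget: any failure exhausts it.
        return 1.0 if actual_failures == 0 else 0.0
    return 1.0 - actual_failures / allowed_failures


def release_decision(budget_remaining: float) -> str:
    """Hypothetical policy mapping the remaining budget to an action."""
    if budget_remaining <= 0.0:
        return "freeze feature releases; work on reliability"
    if budget_remaining < 0.25:
        return "slow down; review risky changes"
    return "keep shipping features"


# Example with made-up numbers: 600 failures out of 1,000,000 requests at a 99.9% SLO.
remaining = error_budget_remaining(slo_target=0.999, good=999_400, total=1_000_000)
print(f"{remaining:.0%} of the budget left -> {release_decision(remaining)}")
```

The value of writing the policy down in advance is that the team agrees on the response before the budget runs out, rather than negotiating it during an incident.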

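  24. Summary
    • SRE thinking can be applied even in the service launch phase
    • Imagine what your users expect in terms of reliability
    • Build an SRE roadmap that matches the scale of your service and organization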

  25. Closing
    • sister (sisterwith.com) is looking for "big sisters" (mentors) and "little sisters" (mentees)
    • If you have anything SRE-related to discuss, feel free to send me a Twitter DM!
    • https://twitter.com/_chaspy

  26. Thank you!
    chaspy chaspy_
    SRE at sisterwith.com

    Takeshi Kondo
