
Proposal of Key Areas and Performance Indicators in Site Reliability Engineering / Performance Indicators for SRE


2021/06/04
8th WebSystemArchitecture Study Group (online)
https://wsa.connpass.com/event/207143/

Takeshi Kondo

June 04, 2021

Transcript

  1. Proposal of Key Areas and Performance Indicators in Site Reliability Engineering
    Takeshi Kondo / @chaspy
    2021/06/04
    8th WebSystemArchitecture Study Group (online)

  2. Who am I
    chaspy chaspy_
    Lead Software Engineer

    Site Reliability at Quipper
    Takeshi Kondo


  3. Agenda
    1. Background and purpose
    2. Areas SRE is involved in
    3. Proposed indicators and measurement methods
    4. Measurement results
    5. Summary and future outlook

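  4. Background and purpose
    • The SRE role has become widespread, and more companies are putting it into practice
    • What the role covers differs with the business and with the size and nature of the organization
    →We want to classify the key areas SRE is involved in
    • Setting business KPIs and improving them iteratively, as in product development, should also work for SRE
    →We want to define and measure performance indicators for each area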

  5. Areas SRE is involved in
    • Assumes a company that provides Web services on the cloud
    • 100+ Developer
    • 30M+ Access / Day

  6. [Reference] AWS Well-Architected Framework
    https://aws.amazon.com/architecture/well-architected/

  7. Measured indicators and measurement methods
    • Reliability
    • Developer Productivity
    • Cost
    • Security
    • Platform

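  8. Relationships between areas
    [Diagram: the five areas (Platform, Reliability, Cost, Developer Productivity, Security) connected by three links labeled "Empowerment" and three labeled "Trade-Off"; which areas each link joins is not recoverable from the transcript.]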

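  9. [Reference] Lean と DevOps の科学 (the Japanese edition of Accelerate)
    • Elite companies excel at all of the following:
        • Deployment frequency
        • Lead time for changes
        • MTTR
        • Change failure rate
    https://book.impress.co.jp/books/1118101029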

  10. Measured indicators and measurement methods
    • Reliability
    • Developer Productivity
    • Cost
    • Security
    • Platform

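  11. Measured indicators and measurement methods
    Area | Indicator | Measurement method
    Reliability | MTTR | Measured manually, as a required field in the incident report flow, whenever an incident occurs
    Developer Productivity | Deployment count | CI service metrics
    Developer Productivity | Deployment time | CI service metrics
    Developer Productivity | CI stability | CI service metrics
    Developer Productivity | Change failure rate | Number of Revert commits on the branch that corresponds to production deployments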
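As a rough illustration of the MTTR row in the table above, the following sketch averages recovery durations from manually recorded incident timestamps. The record layout and the sample incidents are hypothetical, and measuring from occurrence to recovery (rather than from detection) is an assumption the slides do not state.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (occurred_at, recovered_at).
# These values are made up for illustration, not taken from the talk.
incidents = [
    (datetime(2021, 1, 5, 10, 0), datetime(2021, 1, 5, 10, 45)),
    (datetime(2021, 2, 11, 22, 30), datetime(2021, 2, 12, 0, 0)),
    (datetime(2021, 3, 2, 9, 15), datetime(2021, 3, 2, 9, 30)),
]


def mean_time_to_recovery(records):
    """Return the average of (recovered_at - occurred_at) over all incidents."""
    if not records:
        raise ValueError("no incidents recorded")
    total = sum((end - start for start, end in records), timedelta())
    return total / len(records)


mttr = mean_time_to_recovery(incidents)
print(mttr)  # 0:50:00
```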

  12. Measurement results
    • MTTR
    • Deployment count
    • Deployment time
    • CI stability
    • Change failure rate

  13. MTTR Plot with Trendline


  14. MTTR per half year


  15. Histogram

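  16. Observations on MTTR
    • Measurability
        • A mechanism that aggregates the data automatically is needed
        • That requires a standardized Incident Response process and a Tool to support it
        • For example, Severity (a definition of Incident Levels) and a rule that the occurrence, detection, and recovery times of every incident must be recorded
    • Data volume
        • There were at most about 50 incidents in 2 years, which is not enough data
        • Having many incidents runs counter to the purpose of SRE
    • Variance
        • Without enough data, the value is skewed by a few long-running incidents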
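The variance point can be illustrated with a small sketch: with only a handful of incidents, one long outage dominates the mean. The recovery times below are made-up values, and the median is shown only for comparison; the slides do not propose it as an alternative.

```python
import statistics

# Hypothetical recovery times in minutes: five short incidents and one long outage.
recovery_minutes = [12, 20, 25, 30, 35, 600]

mean_minutes = statistics.mean(recovery_minutes)
median_minutes = statistics.median(recovery_minutes)

# The single 600-minute incident pulls the mean far above the typical case.
print(f"mean={mean_minutes:.1f} median={median_minutes:.1f}")
```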

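  17. Observations on MTTR
    • Whether it is useful as an indicator cannot be judged yet
    • Tracking will continue because it is useful at least for the following:
        • Standardizing Incident Response
        • Improving how long-running incidents are handled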

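  18. [Reference] Incident Metrics in SRE
    • Argues that MTTR should not be used as an indicator
    • The reasons are the small number of incidents and the large variance
    • Does not say which indicator should be adopted instead
    https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/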

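  19. [Supplement] Measurement scope for the Developer Productivity area
    • A monorepo is used
    • On the master branch, multiple applications are deployed at the same time
    • They form a Distributed monolith that shares a Database
    • These are generally released weekly
    • The other microservices are released individually, but they are out of scope for this measurement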

  20. Deployment count

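  21. Observations on deployment count
    • Because releases are generally Weekly, there are always at least 26 deployments per half year
    • The rest are HOTFIXes
    • What were the HOTFIXes for?
    • The count seems to mean little unless it is read together with the change failure rate (discussed later)
    • In fact, in the first half of 2020 many HOTFIXes were for changes to the Production Kubernetes manifest
    • The deployment count is trending down because of the move to Microservices
        • Microservices need to be included in the measurement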
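The split described above, 26 scheduled weekly releases per half year with the remainder being HOTFIXes, amounts to a simple subtraction. The sketch below uses hypothetical half-year totals (the real values are in the chart and are not reproduced in the transcript).

```python
# From the slide: weekly releases give 26 scheduled deployments per half year.
WEEKLY_RELEASES_PER_HALF_YEAR = 26


def hotfix_count(total_deploys: int, scheduled: int = WEEKLY_RELEASES_PER_HALF_YEAR) -> int:
    """Treat every deployment beyond the scheduled weekly releases as a HOTFIX."""
    return max(total_deploys - scheduled, 0)


# Hypothetical totals, made up for illustration only.
totals = {"2020-H1": 48, "2020-H2": 35}
hotfixes = {period: hotfix_count(n) for period, n in totals.items()}
print(hotfixes)  # {'2020-H1': 22, '2020-H2': 9}
```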

  22. Deployment time

  23. Observations on deployment time
    • Not yet analyzed (to be added)
    • This time equals the time needed to Revert a failed change, so the shorter it gets, the more it should reduce MTTR

  24. CI stability

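  25. Observations on CI stability
    • The Time Window is 7 Days
    • It would be better to measure over multiple Time Windows, such as 30 Days and 90 Days
    • Development branches should be measured in the same way, not only production
    • Unsuitable as an indicator
    • Basically, the closer to 100% the better
    • A better approach is to treat it as an SLO and make a root-cause fix when the target is violated
    • It is better to use another indicator that includes this value:
        • Time To Delivery (lead time for changes)
        • MTTR
    • That said, analyzability matters. When CI is unstable, we need to know which Job is unstable and by how much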
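A minimal sketch of the two ideas above: computing the CI success rate over several Time Windows, and breaking failures down by Job. The run records, job names, and reference date are hypothetical; a real setup would pull them from the CI service's metrics.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical CI runs: (run date, job name, succeeded). Made up for illustration.
runs = [
    (date(2021, 6, 3), "test", True),
    (date(2021, 6, 2), "build", False),
    (date(2021, 6, 1), "test", True),
    (date(2021, 5, 30), "deploy", True),
    (date(2021, 5, 20), "test", False),
    (date(2021, 5, 10), "build", True),
    (date(2021, 4, 1), "test", False),
    (date(2021, 3, 20), "deploy", True),
]


def success_rate(runs, today, window_days):
    """Share of successful runs within the last `window_days` days, or None if no runs."""
    since = today - timedelta(days=window_days)
    recent = [ok for day, _job, ok in runs if since < day <= today]
    return sum(recent) / len(recent) if recent else None


def failures_by_job(runs):
    """Count failed runs per job, to see which Job is unstable."""
    return Counter(job for _day, job, ok in runs if not ok)


today = date(2021, 6, 4)
for window in (7, 30, 90):
    print(window, success_rate(runs, today, window))
print(failures_by_job(runs))
```

If the rate is treated as an SLO, the per-window values could be compared against a target, with the per-job counts pointing at where to start a root-cause fix.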

  26. Number of change failures

  27. Change failure rate

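  28. Observations on change failure rate
    • Defining "change failure"
        • A definition of what counts as a "change failure" is needed
        • Without operational rules such as applying a Label, measurement is difficult
    • Measurement method
        • Reverts on the production branch also happened for reasons other than a "change failure"
        • Argo Rollouts is used
        • The Canary Strategy is normally not enabled
        • It was enabled only when a Canary was wanted, for example for an important feature, and that change was Reverted once the release reached 100%
    • Data volume
        • The same problem as with MTTR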
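One way to picture the measurement problem: counting every Revert on the production branch overstates failures when some Reverts only undo a temporary Canary setting. The sketch below assumes a hypothetical labeling convention (a `[canary]` tag in the commit subject) and a hypothetical denominator (all non-Revert commits count as deployments); neither comes from the slides.

```python
# Hypothetical commit subjects on the production branch, made up for illustration.
commits = [
    "Release 2021-05-10",
    'Revert "Enable canary strategy for payment" [canary]',
    "Release 2021-05-17",
    'Revert "Change session timeout"',
    "Release 2021-05-24",
    "HOTFIX: fix manifest typo",
]


def is_change_failure(subject: str) -> bool:
    """A Revert counts as a change failure unless it is tagged as a Canary cleanup."""
    return subject.startswith("Revert") and "[canary]" not in subject


def change_failure_rate(subjects) -> float:
    """Failures divided by deployments (assumed here to be all non-Revert commits)."""
    deploys = [s for s in subjects if not s.startswith("Revert")]
    failures = [s for s in subjects if is_change_failure(s)]
    return len(failures) / len(deploys) if deploys else 0.0


rate = change_failure_rate(commits)
print(rate)  # 0.25
```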

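  29. Summary and discussion
    • SRE work was classified into 5 areas, and for 2 of them candidate indicators, chosen with reference to "Lean と DevOps の科学", were measured to see whether they can serve as indicators
    • Conditions for a useful indicator
        • There is enough data
            • For MTTR and change failure rate, it is hard to obtain enough data
            • A state in which these occur frequently runs counter to the purpose of SRE
        • No other indicator already includes it
            • CI stability can be covered by MTTR and Time To Delivery (lead time for changes)
            • Deployment time can also be covered by MTTR and Time To Delivery (lead time for changes)
    • Increasing the deployment count in a healthy way requires measuring the change failure rate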

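  30. Summary and discussion
    • MTTR 🚀
        • Tracking continues
        • Incident Response needs to be improved so the data can be collected continuously
    • Deployment count 🚀
        • Measure with microservices included
    • Deployment time 🤔
        • Development branches need to be measured
        • In the long term, cover it with MTTR / lead time for changes
    • CI stability 🤔
        • Development branches need to be measured
        • In the long term, cover it with MTTR / lead time for changes
    • Change failure rate 🚀
        • A definition of change failure and operational rules need to be established
    • Lead time for changes 🤔
        • Ideally it would be measured from the First commit on the Develop branch to the Code change reaching Production, but there are many variables and the variance may be large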

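  31. Future outlook
    • Continue what is currently measured, and aim to automate it
    • Indicators about "failure" may be hard to put to use, but measurement will continue
        • MTTR, change failure rate
    • Processes that involve people need definitions, rules, and conventions in order to be measured
    • Propose indicators for the other areas as well
    • Organize design patterns for measurement and visualization
    • Aim for a world in which anyone can measure and visualize anything and improve continuously and autonomously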

  32. Thank you!
    chaspy chaspy_
    Lead Software Engineer

    Site Reliability at Quipper
    Takeshi Kondo
