『スタディサプリ』における SLI/SLO の継続的改善 / Continuous improvement of SLI/SLO at StudySapuri

Takeshi Kondo

May 16, 2023

Transcript

  1. Continuous improvement of SLI/SLO at StudySapuri
    Takeshi Kondo / @chaspy
    2023/05/13
    SLOconf Tokyo 2023

  2. Who am I
    chaspy chaspy_
    Engineering Manager

    Site Reliability and Web Application Development

    at Recruit Co., Ltd.
    Takeshi Kondo
    https://chaspy.me

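  3. SRE NEXT 2020 & 2022
    • 2020
      • A case study of trying to introduce SLI/SLO into an organization that did not yet have the vocabulary
    • 2022
      • What happened after SLI/SLO were introduced
      • What I concluded is needed to advance Site Reliability Engineering across the whole organization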

  4. SRE & Web Application Development
    2018 2019 2020 2021 2022 2023
    Joined Quipper
    SRE NEXT
    Worked hard to introduce SLOs into the organization
    Also joined a web development team as Engineering Manager
    SRE NEXT
    Became Engineering Manager
    SLOconf Tokyo ✨

  5. SRE & Web Application Development
    2018 2019 2020 2021 2022 2023
    Joined Quipper
    SRE NEXT
    Worked hard to introduce SLOs into the organization
    SRE NEXT
    Became Engineering Manager
    Also joined a web development team as Engineering Manager
    Today I will speak from a developer's point of view!
    SLOconf Tokyo ✨

  6. What I want to convey today
    Once you have decided on your SLI/SLO, keep reviewing them continuously

  7. I will talk about what happened when we set them once and never reviewed them 😅

  8. Continuous improvement of SLI/SLO at StudySapuri:
    a talk about how we are going to start doing it from now on
    Takeshi Kondo / @chaspy
    2023/05/13
    SLOconf Tokyo 2023

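  9. Outline
    • Self-introduction
    • About the StudySapuri Junior High School Course
    • What are SLI/SLO for?
    • Current state and challenges of service operations
    • What we actually did to address the challenges
    • Summary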

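  10. Outline
    • Self-introduction
    • About the StudySapuri Junior High School Course
    • What are SLI/SLO for?
    • Current state and challenges of service operations
    • What we actually did to address the challenges
    • Summary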

  11. Background: product introduction - StudySapuri

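  13. Relaunched in February 2022
    • Everything other than the user platform was developed as new microservices over two years
    • One year has passed since the release, and we are still enhancing it continuously
    https://www.recruit.co.jp/newsroom/pressrelease/2022/0131_9881.html
    Highlights of the relaunch!
    Personalized learning support through "This Week's Missions" and a repeated-practice feature
    Greatly expanded volume and difficulty range of exercises
    New courses keep arriving, including the "Regular Test Preparation Course"
    Completely redesigned learning screens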

  14. Note: tara is the code name of this relaunch project; it recently became public in an interview: https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/
    Pre-existing services, including the user platform

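  15. Outline
    • Self-introduction
    • About the StudySapuri Junior High School Course
    • What are SLI/SLO for?
    • Current state and challenges of service operations
    • What we actually did to address the challenges
    • Summary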

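  16. Why SLI/SLO?
    • To decide, based on facts, whether to spend time on feature development or on non-functional work
    • While error budget remains, do not respond to every individual error
    • This avoids burnout
    • While budget remains, the team can take risks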
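
The error budget idea on this slide can be sketched in a few lines. This is a minimal illustration, not the team's tooling: the `ErrorBudget` class, the `next_focus` policy, and the request counts are all hypothetical, and the only figure taken from the talk is the 99.9% availability target.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float      # e.g. 0.999 for a 99.9% availability SLO
    total_requests: int    # requests observed in the SLO window
    failed_requests: int   # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_ratio(self) -> float:
        # 1.0 means an untouched budget; 0.0 or below means it is used up.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def next_focus(budget: ErrorBudget) -> str:
    # Hypothetical policy: keep shipping features while budget remains.
    return "feature work" if budget.remaining_ratio > 0 else "reliability work"


# Made-up numbers for illustration only.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(round(budget.allowed_failures), round(budget.remaining_ratio, 2), next_focus(budget))
```

The point of the sketch is that the decision depends on the aggregate budget, not on any single error.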

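  17. Outline
    • Self-introduction
    • About the StudySapuri Junior High School Course
    • What are SLI/SLO for?
    • Current state and challenges of service operations
    • What we actually did to address the challenges
    • Summary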

  18. Note: tara is the code name of this relaunch project; it recently became public in an interview: https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/
    [Architecture diagram: Reverse Proxy (Nginx); pre-existing services, including the user platform; measurement points ② and ③ marked]
    • 8 SLI/SLO in total
      • (a) Availability and (b) Latency
      • Based on http metrics
      • Both types (a/b) at each of the following 4 points
        • ① api-gateway
        • ② api-gateway -> main
        • ③ api-gateway -> content
        • ④ main -> content
    • SLO
      • Availability: 99.9%
      • Latency: 95 percentile < 1000msec
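
As a rough illustration of how the two SLI types and their targets could be evaluated, here is a sketch over a list of request records. The `Request` type, the nearest-rank percentile method, and the sample data are assumptions; the talk only states that http metrics are used, with targets of 99.9% availability and a 95th percentile under 1000 msec.

```python
import math
from typing import NamedTuple


class Request(NamedTuple):
    status: int         # HTTP status code
    latency_ms: float   # latency in milliseconds


def availability(requests: list[Request]) -> float:
    # Share of requests that did not end in a 5xx response.
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def percentile(values: list[float], pct: float) -> float:
    # Nearest-rank percentile: the smallest value with at least
    # pct percent of the samples at or below it.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_slo(requests: list[Request]) -> bool:
    p95 = percentile([r.latency_ms for r in requests], 95)
    return availability(requests) >= 0.999 and p95 < 1000


# Made-up sample: one 5xx and one slow request out of 2000.
sample = [Request(200, 120.0)] * 1998 + [Request(500, 30.0), Request(200, 1500.0)]
print(availability(sample), percentile([r.latency_ms for r in sample], 95), meets_slo(sample))
```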

  19. Why Envoy?
    • At the time there was no way to obtain metrics for traffic between microservices
    • Not a Service Mesh with a Control Plane; plain Envoy is simply run as a sidecar container

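  20. DevSupport: routine operations handled by a daily rotating on-duty person
    • Check Slack notifications and investigate causes
    • Sentry Exception, SLO Alert, GCP Pub/Sub Dead Letter
    • Escalate anything that needs manual handling to the relevant team
    • First-line handling of CS (Customer Support) inquiries
    • First-line handling of mentions addressed to the whole group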

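  21. The problem we had: No SLO Alert
    • Since the release, an SLO Alert has never fired, not even once
    • The volume of Sentry exceptions does not seem to be reflected in the SLIs
    • What is going on?
    • At the very least, as long as we look at Sentry exceptions one by one, we are not making use of the Error Budget concept
    • Is the SLO looser than our own expectations?
    • Are the SLIs configured incorrectly?
    • We investigated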

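  22. Outline
    • Self-introduction
    • About the StudySapuri Junior High School Course
    • What are SLI/SLO for?
    • Current state and challenges of service operations
    • What we actually did to address the challenges
    • Summary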

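  23. Hypothesis: are the Envoy metrics (SLIs ②③④) wrong?
    • Yes
    • Some of the exceptions were DNS name resolution failures
    • In other words, they never got as far as an http request
    • No wonder they are not counted in envoy.cluster.upstream_rq_2xx
    • Pattern: name resolution fails during communication ②
    • It would be fine if this were measured by SLI ①, but...?
    Normal communication
    The tara api-gateway container resolves the name of tara main (http) ← this is what failed
    It then communicates with tara main over http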
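
The blind spot described on this slide can be reproduced with a toy model. This is a simplified sketch, not Envoy's behavior in detail: `call_upstream`, `NameResolutionError`, and the counts are hypothetical, and the only assumption carried over from the talk is that a request which fails at name resolution never becomes an http request and so never reaches a request-based counter.

```python
from collections import Counter


class NameResolutionError(Exception):
    """Raised when the destination host cannot be resolved."""


def call_upstream(resolvable: bool, upstream_status: int, proxy_stats: Counter) -> int:
    # Hypothetical model: the proxy only counts requests that are actually sent.
    if not resolvable:
        raise NameResolutionError("lookup failed before any HTTP request was sent")
    proxy_stats[f"{upstream_status // 100}xx"] += 1
    return upstream_status


proxy_stats: Counter = Counter()
app_exceptions = 0
# Made-up traffic: 97 successful calls and 3 failed lookups.
attempts = [(True, 200)] * 97 + [(False, 0)] * 3

for resolvable, status in attempts:
    try:
        call_upstream(resolvable, status, proxy_stats)
    except NameResolutionError:
        app_exceptions += 1  # visible in the error tracker, invisible to the proxy

sli = proxy_stats["2xx"] / sum(proxy_stats.values())
print(sli, app_exceptions)
```

In this model the request-based SLI stays perfect while the application still records exceptions, which matches the gap between Sentry and the SLIs described earlier.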

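  24. Hypothesis 2: are the Reverse Proxy metrics (SLI ①) wrong?
    • Yes
    • When a GraphQL request failed partway through, http still returned 200 😱
    • At release we decided that partial failures should return 500, but that was not what was implemented
    Normal communication
    A client accesses the StudySapuri junior high learning site over https and reaches the Reverse Proxy
    The Reverse Proxy proxies to the tara api-gateway ②
    The tara api-gateway communicates with tara main ② ← the error occurs here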

  25. Fix 1: return http 500 on a GraphQL error
    • GraphQL itself does not concern itself with http
    • The behavior depends on the GraphQL server library
    • Some practices standardize the response status on 200
    • Clients read errors from the errors field of the response anyway, so this causes no problem
    A colleague fixed it in no time 🙏
    Special Thanks @Quramy
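
A minimal sketch of the rule in this fix, assuming the server can inspect the GraphQL response body before choosing the status code. The helper names `http_status_for` and `build_response` and the sample payloads are hypothetical; the actual change depends on the GraphQL server library in use, which the talk does not name.

```python
import json


def http_status_for(graphql_body: dict) -> int:
    # A non-empty "errors" list marks a failed or partially failed execution.
    return 500 if graphql_body.get("errors") else 200


def build_response(graphql_body: dict) -> tuple[int, str]:
    # The body is unchanged, so clients can still read "errors" as before.
    return http_status_for(graphql_body), json.dumps(graphql_body)


# Made-up payloads for illustration only.
ok = {"data": {"lesson": {"id": "1"}}}
partial = {"data": {"lesson": None}, "errors": [{"message": "upstream failed"}]}
print(build_response(ok)[0], build_response(partial)[0])
```

Because only the status code changes, an http-based availability SLI starts to see these failures without any client change.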

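  26. Fix 2: drop Envoy and use Datadog APM metrics
    • To reduce the difficulty of troubleshooting caused by complexity
    • It was not that Envoy's metrics themselves were at fault
    • There were many operational issues, and we were getting no benefit other than metrics collection
    • A Circuit Breaker was configured, but it almost never tripped
    • Keeping up with Envoy version upgrades (not being done)
    • Controlling the start-up and shutdown order of sidecar containers in a Pod (errors occur unless the application waits for envoy)
    • How to patch when using Rollouts (Resources once got reversed and caused an incident)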

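  27. Side note: Datadog APM has quite a few quirks (1)
    • By default, the resource tag of the http client APM plugin is the http method
    • To use it as a per-destination SLI, the hostname is needed
    • Handled separately for Node and Ruby
    • Agreed within the organization on a naming convention for the http-client resource tag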

  28. Side note: Datadog APM has quite a few quirks (2)
    • trace.http.request.errors does not include http 5xx
    • Conversely, 4xx is included
    • trace.http.request.hits.by_http_status has to be used instead
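
To show why the choice of metric matters, the sketch below compares an availability figure computed from per-status-class hit counts with one computed from an error flag that behaves as described on this slide. It does not call the Datadog API: the dictionary keys, function names, and counts are assumptions standing in for values that would come from the metrics named above.

```python
def availability_by_status(hits_by_class: dict[str, int]) -> float:
    # Count only 5xx responses as bad, whatever the tracer flags as an error.
    total = sum(hits_by_class.values())
    if total == 0:
        return 1.0  # no traffic: treated as no violation (a policy choice)
    return 1.0 - hits_by_class.get("5xx", 0) / total


def availability_by_error_flag(hits_by_class: dict[str, int]) -> float:
    # Models the quirk described above: 4xx is flagged as an error, 5xx is not.
    total = sum(hits_by_class.values())
    if total == 0:
        return 1.0
    return 1.0 - hits_by_class.get("4xx", 0) / total


# Made-up counts for illustration only.
counts = {"2xx": 9900, "4xx": 80, "5xx": 20}
print(availability_by_status(counts), availability_by_error_flag(counts))
```

With these made-up counts the two figures diverge, and only the status-based one reflects server-side failures.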

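  29. Note: tara is the code name of this relaunch project; it recently became public in an interview: https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/
    [Architecture diagram: Reverse Proxy (Nginx); pre-existing services, including the user platform; measurement points ② and ③ marked]
    • Revised the SLOs
      • (a) Availability and (b) Latency
      • Based on http metrics
      • Both types (a/b) at each of the following 4 points
        • ① api-gateway
        • ② api-gateway -> main
        • ③ api-gateway -> content
        • ④ main -> content
        • 🆕 ⑤ api-gateway -> requests to the user platform
    • SLO
      • Availability: 99.9%
      • Latency: 95 percentile < 1000msec
      • -> 100 to 500 msec, taking each service's current state into account
    Abolished the per-microservice SLI/SLO: the benefit of splitting the SLIs did not justify the cost of managing multiple SLI/SLO
    Added an SLI/SLO for the user platform: the common SLI for the user platform had been using envoy metrics; since envoy was removed, an SLI/SLO based on Datadog APM metrics was added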

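  30. Revising DevSupport
    • In Sentry, ignore all exceptions other than those caused by application code
    • Documented the basic response policy for when an SLO Alert arrives
    • Items that could not be handled on the day are handled by the team once every two weeks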

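  31. Outline
    • Self-introduction
    • About the StudySapuri Junior High School Course
    • What are SLI/SLO for?
    • Current state and challenges of service operations
    • What we actually did to address the challenges
    • Summary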

  32. What had been happening
    • The SLI/SLO set once at release had never been reviewed
    • The SLOs were delivering no value at all
    • We revised both the SLIs and the SLOs, and decided to keep reviewing them continuously

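  33. What we should have done
    • Developer perspective
      • Periodically check whether the SLOs are really delivering value
      • A rotation is fine, but it is also important to occasionally take time for everyone to look together
      • With a daily rotation, there is no incentive to investigate deeply
    • SRE perspective
      • Build a mechanism for development teams to review SLI/SLO periodically
      • Build a mechanism that generates SLI/SLO automatically for each workload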
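
One way to picture the last point, generating SLI/SLO per workload, is a small generator that emits the same pair of definitions for every workload name. This is a hypothetical sketch: the `SloDefinition` shape, the naming scheme, and the workload names are assumptions, and only the default targets (99.9% availability, 95th percentile under 1000 msec) come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloDefinition:
    name: str
    sli: str
    target: str


def generate_slos(workload: str, latency_ms: int = 1000) -> list[SloDefinition]:
    # Every workload gets the same two SLO types, with a tunable latency target.
    return [
        SloDefinition(f"{workload}-availability", "availability", "99.9%"),
        SloDefinition(f"{workload}-latency", "latency", f"p95 < {latency_ms} msec"),
    ]


# Hypothetical workload names and per-workload latency overrides.
workloads = {"api-gateway": 500, "main": 1000}
definitions = [slo for name, ms in workloads.items() for slo in generate_slos(name, ms)]
for slo in definitions:
    print(slo.name, slo.sli, slo.target)
```

Keeping the definitions generated rather than hand-written would make a periodic review a matter of changing inputs, not editing each SLO.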

  34. What I want to convey today
    Once you have decided on your SLI/SLO, keep reviewing them continuously

  35. Thank you!
    chaspy chaspy_
    Engineering Manager

    Site Reliability and Web Application Development

    at Recruit Co., Ltd.
    Takeshi Kondo
    https://chaspy.me