『スタディサプリ』における SLI/SLO の継続的改善 / Continuous improvement of SLI/SLO at StudySapuri
by
Takeshi Kondo
Slide 1
Slide 1 text
Continuous improvement of SLI/SLO at StudySapuri
Takeshi Kondo / @chaspy
2023/05/13 SLOconf Tokyo 2023
Slide 2
Slide 2 text
Who am I chaspy chaspy_ Engineering Manager Site Reliability and Web Application Development at Recruit Co., Ltd. Takeshi Kondo https://chaspy.me
Slide 3
Slide 3 text
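SRE NEXT 2020 & 2022
• 2020
  • A case study of trying to introduce SLI/SLO into an organization that had no words for them
• 2022
  • The story of what came after introducing SLI/SLO
  • Thoughts on what it takes to advance Site Reliability Engineering across the whole organization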
Slide 4
Slide 4 text
SRE & Web Application Development
Timeline, 2018 to 2023:
• Joined Quipper
• SRE NEXT
• Worked hard to introduce SLOs into the organization
• SRE NEXT
• Became Engineering Manager
• Joined the web development team as Engineering Manager
• SLOconf Tokyo ✨
Slide 5
Slide 5 text
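SRE & Web Application Development
Timeline, 2018 to 2023:
• Joined Quipper
• SRE NEXT
• Worked hard to introduce SLOs into the organization
• SRE NEXT
• Became Engineering Manager
• Joined the web development team as Engineering Manager
• SLOconf Tokyo ✨
Today I will be speaking from a developer's point of view!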
Slide 6
Slide 6 text
What I want to convey today
Once you have decided on your SLIs/SLOs, keep reviewing them continuously
Slide 7
Slide 7 text
I will talk about what happened when we set them once and never reviewed them 😅
Slide 8
Slide 8 text
Continuous improvement of SLI/SLO at StudySapuri: a story about how we are going to start doing it from now on
Takeshi Kondo / @chaspy
2023/05/13 SLOconf Tokyo 2023
Slide 9
Slide 9 text
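Outline
• Self-introduction
• About the StudySapuri Junior High School Course
• What are SLI/SLO for?
• Current state and challenges of service operations
• What we actually did to address the challenges
• Summary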
Slide 10
Slide 10 text
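Outline
• Self-introduction
• About the StudySapuri Junior High School Course
• What are SLI/SLO for?
• Current state and challenges of service operations
• What we actually did to address the challenges
• Summary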
Slide 11
Slide 11 text
Background: product introduction - StudySapuri
Slide 12
Slide 12 text
No content
Slide 13
Slide 13 text
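Renewed in February 2022
• Everything other than the user platform was developed over two years as new microservices
• One year has passed since release; we are still enhancing it continuously
https://www.recruit.co.jp/newsroom/pressrelease/2022/0131_9881.html
Highlights of the renewal!
• Personalized learning support through "This Week's Missions" and review exercises
• Greatly expanded exercise volume and range of difficulty
• New courses arriving one after another, including the "Regular Exam Preparation Course"
• A completely redesigned learning screen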
Slide 14
Slide 14 text
Note: "tara" is the code name of this renewal project; it recently became public in an interview.
https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/
Diagram label: pre-existing services, including the user platform
Slide 15
Slide 15 text
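Outline
• Self-introduction
• About the StudySapuri Junior High School Course
• What are SLI/SLO for?
• Current state and challenges of service operations
• What we actually did to address the challenges
• Summary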
Slide 16
Slide 16 text
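Why SLI/SLO?
• To decide, on a fact basis, whether to spend time on feature development or on non-functional work
• While error budget remains, do not deal with every single error
  • This avoids burnout
• While budget remains, you can take risks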
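As a rough illustration of the error budget idea above, the following Python sketch computes how many failures a 99.9% target allows in one window and how much of that allowance is left. The function name, the request counts, and the `can_take_risk` flag are made up for illustration; none of them come from the talk.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarise how much error budget is left for one SLO window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    # Allowed failures = (1 - SLO target) * total traffic in the window.
    allowed = (1.0 - slo_target) * total_requests
    remaining = allowed - failed_requests
    return {
        "allowed_failures": allowed,
        "remaining_failures": remaining,
        "budget_left_ratio": remaining / allowed,
        # Hypothetical decision rule: risky changes are fine while budget remains.
        "can_take_risk": remaining > 0,
    }


# Hypothetical window: one million requests, 400 of them failed.
report = error_budget(0.999, total_requests=1_000_000, failed_requests=400)
print(report)
```

With these made-up numbers roughly 1,000 failures are allowed and about 600 remain, so under this rule the team could keep shipping features instead of chasing each individual error.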
Slide 17
Slide 17 text
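Outline
• Self-introduction
• About the StudySapuri Junior High School Course
• What are SLI/SLO for?
• Current state and challenges of service operations
• What we actually did to address the challenges
• Summary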
Slide 18
Slide 18 text
Note: "tara" is the code name of this renewal project; it recently became public in an interview.
https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/
Diagram labels: pre-existing services, including the user platform; Reverse Proxy (Nginx); measurement points ① ② ③ ④
• 8 SLIs/SLOs in total
  • (a) Availability and (b) Latency
  • Based on http metrics
  • Both kinds (a/b) at each of the following 4 points
    • ① api-gateway
    • ② api-gateway -> main
    • ③ api-gateway -> content
    • ④ main -> content
• SLO
  • Availability: 99.9%
  • Latency: 95th percentile < 1000 msec
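To make the two SLI kinds above concrete, here is a minimal Python sketch that evaluates a batch of requests against the stated targets (99.9% availability, 95th percentile latency under 1000 msec). It assumes that a "good" request is any response below http 500 and uses the nearest-rank percentile method; the slides do not state either detail, and the `Request` type is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # http status code
    latency_ms: float  # response time in milliseconds


def availability(requests):
    """Share of requests that did not end in a 5xx response (assumed definition)."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def p95_latency_ms(requests):
    """95th percentile latency using the nearest-rank method."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_slo(requests, availability_target=0.999, latency_target_ms=1000):
    """Check both SLOs for one measurement point."""
    return {
        "availability_ok": availability(requests) >= availability_target,
        "latency_ok": p95_latency_ms(requests) < latency_target_ms,
    }


# Hypothetical sample: 999 fast successes and one slow server error.
sample = [Request(200, 120.0)] * 999 + [Request(500, 3000.0)]
print(meets_slo(sample))
```

Running the same check once per measurement point (① to ④) gives the eight SLI/SLO pairs described on the slide.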
Slide 19
Slide 19 text
Why Envoy?
• At the time there was no other way to obtain metrics for traffic between microservices
• It was not a service mesh with a control plane; we simply attached plain Envoy as a sidecar container
Slide 20
Slide 20 text
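DevSupport: routine operational work done in rotation
• Check Slack notifications and investigate the cause
  • Sentry Exception, SLO Alert, GCP Pub/Sub Dead Letter
  • Escalate anything that needs manual handling to the relevant team
• First-line handling of CS (Customer Support) inquiries
• First-line handling of mentions addressed to everyone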
Slide 21
Slide 21 text
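The problem we had: No SLO Alert
• Not a single SLO alert had fired between release and now
• The volume of Sentry exceptions did not seem to be reflected in the SLIs
• What was going on?
  • At the very least, looking at Sentry exceptions one by one meant we were not making use of the error budget concept
  • Was the SLO looser than our expectations?
  • Were the SLIs configured incorrectly?
• We investigated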
Slide 22
Slide 22 text
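Outline
• Self-introduction
• About the StudySapuri Junior High School Course
• What are SLI/SLO for?
• Current state and challenges of service operations
• What we actually did to address the challenges
• Summary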
Slide 23
Slide 23 text
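Hypothesis: Are the Envoy metrics (SLIs ②③④) wrong?
• Yes
  • Some of the exceptions were failures in DNS name resolution
  • In other words, they never got as far as an http request
  • So it is no surprise that they were not counted in envoy.cluster.upstream_rq_2xx
• A pattern where name resolution fails during communication ②
• It would be fine if SLI ① had measured them, but...?
Diagram (normal communication): the tara api-gateway container resolves the name of tara main and then communicates with it over http; the name resolution step is what failed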
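The gap described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the outcome labels, the `NameResolutionError` class, and the counters are hypothetical stand-ins and do not use any Envoy API. It shows that a counter incremented only when an http request reaches the proxy stays at 100% while the caller sees failures.

```python
from collections import Counter


class NameResolutionError(Exception):
    """Hypothetical error raised when the upstream hostname cannot be resolved."""


def call_upstream(outcome: str, proxy_counters: Counter) -> int:
    """Simulate one call; outcome is "ok", "http_5xx" or "dns_failure" (made-up labels)."""
    if outcome == "dns_failure":
        # The call dies before any http request is sent, so the proxy never sees it.
        raise NameResolutionError("could not resolve upstream host")
    status = 200 if outcome == "ok" else 503
    proxy_counters[f"{status // 100}xx"] += 1
    return status


def run(outcomes):
    proxy = Counter()   # what a proxy-side request counter would record
    caller = Counter()  # what the calling application actually experiences
    for outcome in outcomes:
        try:
            status = call_upstream(outcome, proxy)
            caller["good" if status < 500 else "bad"] += 1
        except NameResolutionError:
            caller["bad"] += 1
    seen = sum(proxy.values())
    proxy_sli = proxy["2xx"] / seen if seen else None
    caller_sli = caller["good"] / sum(caller.values())
    return proxy_sli, caller_sli


proxy_sli, caller_sli = run(["ok"] * 90 + ["dns_failure"] * 10)
print(proxy_sli, caller_sli)  # prints: 1.0 0.9
```

In this toy run the proxy-side SLI reports 100% even though one call in ten failed, which matches the symptom of exceptions piling up without any SLO alert.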
Slide 24
Slide 24 text
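Hypothesis 2: Are the Reverse Proxy metrics (SLI ①) wrong?
• Yes
  • When a GraphQL request failed partway through, the http response was still 200 😱
  • At release time we had decided that partial failures would be returned as 500, but that is not how it had been implemented
Diagram (normal communication): the client accesses the service over https and reaches the Reverse Proxy; the Reverse Proxy proxies to tara api-gateway; tara api-gateway communicates with tara main (②), which is where the error occurred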
Slide 25
Slide 25 text
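Fix 1: return http 500 when there is a GraphQL error
• GraphQL itself is not concerned with http
• The behavior depends on the GraphQL server library
• There is also a practice of always returning response status 200
• Clients look at errors in the error response, so this causes no problem
A colleague implemented this quickly 🙏
Special Thanks @Quramy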
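The rule in Fix 1 can be sketched in a few lines of Python. This is an assumption-laden illustration, not the team's implementation: it relies only on the standard GraphQL response shape (top-level `data` and `errors` keys) and treats any non-empty `errors` list, including a partial failure where `data` is also present, as http 500. The function name is hypothetical.

```python
def http_status_for(graphql_response: dict) -> int:
    """Pick the http status to send for a GraphQL response body.

    Assumption: any non-empty "errors" list is reported as 500, even when
    "data" is partially filled in.
    """
    errors = graphql_response.get("errors")
    return 500 if errors else 200


ok = {"data": {"lesson": {"id": "1"}}}
partial = {"data": {"lesson": None}, "errors": [{"message": "upstream call failed"}]}

print(http_status_for(ok), http_status_for(partial))  # prints: 200 500
```

Because the status code now reflects the failure, an availability SLI based on http metrics at the reverse proxy can see it, while clients that read `errors` from the body keep working as before.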
Slide 26
Slide 26 text
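Fix 2: stop using Envoy and use Datadog APM metrics
• To reduce the troubleshooting difficulty that comes with complexity
  • It is not that the Envoy metrics themselves had a problem
• Operating it had many challenges, and we were getting no benefit beyond metrics collection
  • Circuit Breaker was configured, but it had almost never tripped
  • Keeping up with Envoy version upgrades (we had not managed to)
  • Controlling the startup and shutdown order of the Pod sidecar container (errors occur unless the application waits for envoy)
  • How to patch when using Rollouts (resources were once inverted, which led to an incident)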
Slide 27
Slide 27 text
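Side note: Datadog APM has quite a few quirks (1)
• By default, the resource tag of the http client APM plugin is the http method
• To use it as a per-destination SLI, the hostname is needed
  • Handled separately for Node and Ruby
• Agreed across the organization on a naming convention for the http-client resource tag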
Slide 28
Slide 28 text
Side note: Datadog APM has quite a few quirks (2)
• trace.http.request.errors does not include http 5xx
  • Conversely, it does include 4xx
• trace.http.request.hits.by_http_status has to be used instead
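The consequence of this quirk can be shown with plain arithmetic. The sketch below assumes the request counts have already been grouped by http status (the kind of breakdown a by-status hit metric provides) and are passed in as a dictionary; it makes no Datadog API calls, and the function name and numbers are hypothetical.

```python
def availability_from_hits(hits_by_status: dict) -> float:
    """Availability where only 5xx responses count against the SLI."""
    total = sum(hits_by_status.values())
    if total == 0:
        raise ValueError("no requests in the window")
    server_errors = sum(
        count for status, count in hits_by_status.items() if 500 <= status <= 599
    )
    return (total - server_errors) / total


# Hypothetical window of outgoing http client calls.
hits = {200: 9_900, 404: 60, 500: 30, 503: 10}

# An error counter that includes 4xx but not 5xx would report 60 errors here,
# while the SLI should be charged for the 40 server errors instead.
print(availability_from_hits(hits))
```

With these numbers the availability is about 99.6%, whereas a calculation driven by the 4xx-only error count would give a different and misleading figure.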
Slide 29
Slide 29 text
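Note: "tara" is the code name of this renewal project; it recently became public in an interview.
https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/
Diagram labels: pre-existing services, including the user platform; Reverse Proxy (Nginx); measurement points ① ② ③ ④ ⑤
• We reviewed the SLOs
  • (a) Availability and (b) Latency
  • Based on http metrics
  • Both kinds (a/b) at each of the following 4 points
    • ① api-gateway
    • ② api-gateway -> main
    • ③ api-gateway -> content
    • ④ main -> content
    • 🆕 ⑤ requests from api-gateway to the user platform
• SLO
  • Availability: 99.9%
  • Latency: 95th percentile < 1000 msec
    • -> now 100 to 500 msec, taking the current state of each service into account
Callout: abolished the per-microservice SLIs/SLOs, because the benefit of splitting the SLIs did not justify the cost of managing multiple SLIs/SLOs
Callout: added SLIs/SLOs for the user platform. The common SLI for the user platform had been based on envoy metrics; since envoy was removed, we added SLIs/SLOs based on Datadog APM metrics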
Slide 30
Slide 30 text
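Reviewing DevSupport
• Ignore all Sentry exceptions except those caused by application code
• Documented the basic way to respond when an SLO alert arrives
• Whatever could not be handled is dealt with by the team once every two weeks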
Slide 31
Slide 31 text
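Outline
• Self-introduction
• About the StudySapuri Junior High School Course
• What are SLI/SLO for?
• Current state and challenges of service operations
• What we actually did to address the challenges
• Summary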
Slide 32
Slide 32 text
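What had been happening
• The SLIs/SLOs that were set once at release had not been reviewed since
• The SLOs were not delivering any value
• We reviewed both the SLIs and the SLOs, and decided to keep reviewing them continuously from now on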
Slide 33
Slide 33 text
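What should we have done?
• Developer perspective
  • Periodically inspect whether the SLOs are really delivering value
  • A duty rotation is fine, but it is also important to occasionally set aside time for everyone to look together
    • With a daily handover, there is no incentive to investigate deeply
• SRE perspective
  • Build a mechanism for development teams to review SLI/SLO regularly
  • Build a mechanism that generates SLI/SLO automatically for each workload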
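One possible shape for the "generate SLI/SLO per workload" idea is sketched below. Everything here is hypothetical: the workload names, the dictionary layout, and the override mechanism are illustrative assumptions, with the default targets borrowed from the 99.9% availability and 1000 msec latency figures mentioned earlier in the talk.

```python
# Default targets taken from the slides; the structure itself is an assumption.
DEFAULTS = {"availability": 0.999, "latency_p95_ms": 1000}


def generate_slos(workloads, overrides=None):
    """Return one availability and one latency SLO definition per workload."""
    overrides = overrides or {}
    slos = []
    for name in workloads:
        # Per-workload overrides win over the shared defaults.
        targets = {**DEFAULTS, **overrides.get(name, {})}
        slos.append({
            "name": f"{name}-availability",
            "workload": name,
            "sli": "availability",
            "target": targets["availability"],
        })
        slos.append({
            "name": f"{name}-latency",
            "workload": name,
            "sli": "latency_p95_ms",
            "target": targets["latency_p95_ms"],
        })
    return slos


# Hypothetical usage: a tighter latency target for one workload.
for slo in generate_slos(
    ["api-gateway", "user-platform"],
    overrides={"user-platform": {"latency_p95_ms": 500}},
):
    print(slo["name"], slo["target"])
```

Keeping the definitions in generated data like this makes a periodic review cheaper, since the team only has to look at the defaults and the overrides rather than each hand-written SLO.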
Slide 34
Slide 34 text
What I want to convey today
Once you have decided on your SLIs/SLOs, keep reviewing them continuously
Slide 35
Slide 35 text
Thank you! chaspy chaspy_ Engineering Manager Site Reliability and Web Application Development at Recruit Co., Ltd. Takeshi Kondo https://chaspy.me