Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
『スタディサプリ』における SLI/SLO の継続的改善 / Continuous impro...
Search
Takeshi Kondo
May 16, 2023
Technology
1
3.8k
『スタディサプリ』における SLI/SLO の継続的改善 / Continuous improvement of SLI/SLO at StudySapuri
https://connpass.com/event/282120/
Takeshi Kondo
May 16, 2023
Tweet
Share
More Decks by Takeshi Kondo
See All by Takeshi Kondo
SREの知識地図 - 第2章の紹介 - / Knowledge Map of SRE – Introduction to Chapter 2 –
chaspy
0
53
SRE NEXT CfP チームが語る 聞きたくなるプロポーザルとは / Proposals by the SRE NEXT CfP Team that are sure to be accepted
chaspy
2
1.5k
Slack Platform(Deno) での RAG 実装 - LangChain(js) を使ってみた / rag-implementation-on-slack-platform-deno-experimenting-with-langchain-js
chaspy
0
270
SRE の考えをマネジメントに活かす / applying SRE ideas to management
chaspy
7
8k
RAGの簡易評価によるフィードバックサイクル実践 / Feedback cycle practice through simplified assessment of RAGs
chaspy
2
5.8k
定量データと定性評価を用いた技術戦略の組織的実践 / Systematic implementation of technology strategies using quantitative data and qualitative evaluation
chaspy
9
2.1k
エンジニアブランディングチームの KPI / KPI's of engineer branding team
chaspy
2
2.4k
「SLO Review」今やるならこうする / If I had to do the "SLO Review" again
chaspy
3
2.2k
開発者とともに作る Site Reliability Engineering / SREing with Developers
chaspy
10
8.7k
Other Decks in Technology
See All in Technology
Introduction to Bill One Development Engineer
sansan33
PRO
0
360
量子クラウドサービスの裏側 〜Deep Dive into OQTOPUS〜
oqtopus
0
120
ブロックテーマ、WordPress でウェブサイトをつくるということ / 2026.02.07 Gifu WordPress Meetup
torounit
0
180
予期せぬコストの急増を障害のように扱う――「コスト版ポストモーテム」の導入とその後の改善
muziyoshiz
1
1.9k
モダンUIでフルサーバーレスなAIエージェントをAmplifyとCDKでサクッとデプロイしよう
minorun365
4
200
FinTech SREのAWSサービス活用/Leveraging AWS Services in FinTech SRE
maaaato
0
130
30万人の同時アクセスに耐えたい!新サービスの盤石なリリースを支える負荷試験 / SRE Kaigi 2026
genda
4
1.3k
外部キー制約の知っておいて欲しいこと - RDBMSを正しく使うために必要なこと / FOREIGN KEY Night
soudai
PRO
12
5.5k
2026年、サーバーレスの現在地 -「制約と戦う技術」から「当たり前の実行基盤」へ- /serverless2026
slsops
2
250
SREチームをどう作り、どう育てるか ― Findy横断SREのマネジメント
rvirus0817
0
260
学生・新卒・ジュニアから目指すSRE
hiroyaonoe
2
610
What happened to RubyGems and what can we learn?
mikemcquaid
0
300
Featured
See All Featured
GraphQLの誤解/rethinking-graphql
sonatard
74
11k
Why Our Code Smells
bkeepers
PRO
340
58k
Docker and Python
trallard
47
3.7k
Groundhog Day: Seeking Process in Gaming for Health
codingconduct
0
93
4 Signs Your Business is Dying
shpigford
187
22k
How STYLIGHT went responsive
nonsquared
100
6k
Kristin Tynski - Automating Marketing Tasks With AI
techseoconnect
PRO
0
140
GitHub's CSS Performance
jonrohan
1032
470k
What does AI have to do with Human Rights?
axbom
PRO
0
2k
Ecommerce SEO: The Keys for Success Now & Beyond - #SERPConf2024
aleyda
1
1.8k
Conquering PDFs: document understanding beyond plain text
inesmontani
PRO
4
2.3k
RailsConf 2023
tenderlove
30
1.3k
Transcript
ʰελσΟαϓϦʱʹ͓͚Δ SLI/SLO ͷܧଓతվળ Takeshi Kondo / @chaspy 2023/05/13 SLOconf Tokyo
2023
Who am I chaspy chaspy_ Engineering Manager Site Reliability and
Web Application Development at Recruit Co., Ltd. Takeshi Kondo https://chaspy.me
SRE NEXT 2020 & 2022 • 2020 • SLI/SLO ͱ͍͏ݴ༿͕ͳ͍ঢ়ଶͰ৫
ಋೖΛࢼΈͨࣄྫ • 2022 • SLI/SLO Λಋೖͨ͠ޙͷ • ৫શମͰ Site Reliability Engineering ΛਐΊΔͨΊʹඞཁͳ͜ͱΛߟ͑ͨ
SRE & Web Application Development 2018 2020 2021 2023 2019
2022 2VJQQFS ೖࣾ 43&/&95 4-0Λ৫ʹಋೖ ͠Α͏ͱؤுΔ &OHJOFFSJOH.BOBHFSͱͯ͠ 8FC։ൃνʔϜʹࢀՃ 43&/&95 &OHJOFFSJOH .BOBHFSʹͳΔ 4-0DPOG 5PLZP✨
SRE & Web Application Development 2018 2020 2021 2023 2019
2022 2VJQQFS ೖࣾ 43&/&95 4-0Λ৫ʹಋೖ ͠Α͏ͱؤுΔ 43&/&95 &OHJOFFSJOH .BOBHFSʹͳΔ &OHJOFFSJOH.BOBHFSͱͯ͠ 8FC։ൃνʔϜʹࢀՃ ࠓ։ൃऀઢͰ͠·͢ʂ 4-0DPOG 5PLZP✨
ࠓ͍͑ͨ͜ͱ ҰܾΊͨ SLI/SLO ܧଓతʹݟ͠·͠ΐ͏
Ұઃఆͯ͠ݟ͞ͳ͔ͬͨΒͲ͏ͳ͔ͬͨͷΛ͠·͢😅
ʰελσΟαϓϦʱʹ͓͚Δ SLI/SLO ͷܧଓతվળ Λ͜Ε͔Β͍ͬͯͧ͘ͱ͍͏ Takeshi Kondo / @chaspy 2023/05/13 SLOconf
Tokyo 2023
Outline • ࣗݾհ • ʰελσΟαϓϦ தֶߨ࠲ʱʹ͍ͭͯ • SLI/SLO ͳΜͷͨΊʹ͋Δͷ͔ •
αʔϏεӡ༻ͷݱঢ়ͱ՝ • ՝ʹ࣮͋ͨͬͯࡍʹऔΓΜͩ͜ͱ • ·ͱΊ
Outline • ࣗݾհ • ʰελσΟαϓϦ தֶߨ࠲ʱʹ͍ͭͯ • SLI/SLO ͳΜͷͨΊʹ͋Δͷ͔ •
αʔϏεӡ༻ͷݱঢ়ͱ՝ • ՝ʹ࣮͋ͨͬͯࡍʹऔΓΜͩ͜ͱ • ·ͱΊ
લఏɿϓϩμΫτհ - ελσΟαϓϦ
None
20222݄ʹϦχϡʔΞϧ • Ϣʔβج൫Ҏ֎ͷ෦Λ৽نϚΠΫϩ αʔϏεͱͯ͠2ʹΓ։ൃ • ϦϦʔε͔Β1ܦաɻݱࡏܧଓత ʹΤϯϋϯε͍ͯ͠·͢ https://www.recruit.co.jp/newsroom/pressrelease/2022/0131_9881.html ϦχϡʔΞϧͷϙΠϯτʂ
ࠓिͷϛογϣϯͱ෮ԋशػೳʹΑΔݸผֶशࢧԉ ԋशྔɾқΛେ෯֦ॆ ʮఆظςετରࡦߨ࠲ʯΛؚΉ৽ߨ࠲͕ଓʑొ ֶशը໘ͷσβΠϯΛҰ৽
උߟ: tara ͱ͍͏ͷ͜ͷϦχϡʔΞϧϓϩδΣΫτͷίʔυωʔϜͰɺ࠷ۙΠϯλϏϡʔͰύϒϦοΫʹͳͬͨ https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/ ݩʑ͋ͬͨ Ϣʔβج൫Λ ؚΉαʔϏε
Outline • ࣗݾհ • ʰελσΟαϓϦ தֶߨ࠲ʱʹ͍ͭͯ • SLI/SLO ͳΜͷͨΊʹ͋Δͷ͔ •
αʔϏεӡ༻ͷݱঢ়ͱ՝ • ՝ʹ࣮͋ͨͬͯࡍʹऔΓΜͩ͜ͱ • ·ͱΊ
Why SLI/SLO? • ػೳ։ൃorඇػೳ։ൃɺͲͪΒʹ࣌ؒΛ͏ͷ͔Λ Fact-BasedͰܾఆ͢ΔͨΊ • Error Budget ͕͋ Δ͏ͪ1ͭ1ͭͷ
Τϥʔʹରॲ͠ͳ͍ • Burn Out Λආ͚ΒΕΔ • Budget ͕͋Δ͏ͪϦε Ϋ͕ͱΕΔ
Outline • ࣗݾհ • ʰελσΟαϓϦ தֶߨ࠲ʱʹ͍ͭͯ • SLI/SLO ͳΜͷͨΊʹ͋Δͷ͔ •
αʔϏεӡ༻ͷݱঢ়ͱ՝ • ՝ʹ࣮͋ͨͬͯࡍʹऔΓΜͩ͜ͱ • ·ͱΊ
උߟ: tara ͱ͍͏ͷ͜ͷϦχϡʔΞϧϓϩδΣΫτͷίʔυωʔϜͰɺ࠷ۙΠϯλϏϡʔͰύϒϦοΫʹͳͬͨ https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/ ݩʑ͋ͬͨ Ϣʔβج൫Λ ؚΉαʔϏε 3FWFSTF1SPYZ /HJOY •
SLI/SLO શ෦Ͱ8ͭ • (a)Availability ͱ (b)Latency • http ͷ metrics Λ͏ • ҎԼͷ4Օॴʹ(a/b)2छྨͣͭ • ᶃ api-gateway • ᶄ api-gateway -> main • ᶅ api-gawatey -> content • ᶆ main -> content • SLO • Availability: 99.9% • Latency: 95 percentile < 1000msec ᶃ ᶄ ᶅ ᶆ
Why Envoy? • ࣌ϚΠΫϩαʔϏεؒͷ metrics Λऔಘ͢Δํ๏͕ ͳ͔ͬͨ • Control Plane
ΛؚΜͩ Service Mesh Ͱͳ͘ɺSide- car container ͱͯ͠୯ʹૉͷ Envoy ΛࡌͤΔͷΈ
DevSupport: ସΘΓ൪Ͱఆৗӡ༻ۀΛߦ͏ • Slack ͷ௨Λ֬ೝͯ͠ݪҼௐࠪ • Sentry Exception, SLO Alert,
GCP Pub/Sub Dead Letter • खಈରԠ͕ඞཁͳͷ֤νʔϜʹΤεΧϨʔγϣϯ • CS(Customer Support)͍߹ΘͤͷҰ࣍ड͚ • શମ͚ϝϯγϣϯͷ1࣍ड͚
ى͖͍ͯͨ՝: No SLO Alert • ϦϦʔε͔Βࠓ·ͰҰ SLO Alert ͕໐ͬͨ͜ͱͳ͍ •
Sentry ͷ Exception ྔ͕ SLI ʹө͞Ε͍ͯͳ͍ؾ͕͢Δ • Կ͕ى͖͍ͯΔͷͩΖ͏͔ʁ • গͳ͘ͱ Sentry Exception Λ1݅ͣͭݟ͍ͯΔ࣌Ͱ Error Budget ͱ͍͏֓೦ ར༻Ͱ͖ͯͳ͍ • SLO ͕ࣗͨͪͷظΑΓ؇͗͢Δʁ • SLI ͷઃఆ͕ޡ͍ͬͯΔʁ • ௐࠪͨ͠
Outline • ࣗݾհ • ʰελσΟαϓϦ தֶߨ࠲ʱʹ͍ͭͯ • SLI/SLO ͳΜͷͨΊʹ͋Δͷ͔ •
αʔϏεӡ༻ͷݱঢ়ͱ՝ • ՝ʹ࣮͋ͨͬͯࡍʹऔΓΜͩ͜ͱ • ·ͱΊ
Ծઆ: Envoy ͷ metrics (SLIᶄᶅᶆ) ͕͓͔͍͠ͷͰʁ • Yes • Exception
ͷҰ෦ DNS ໊લղܾͰࣦഊ͍ͯͨ͠ • ͭ·Γɺhttp request ʹࢸ͍ͬͯͳ͍ • envoy.cluster.upstream_rq_2xx ʹܭ্͞Εͳ͍ͷͦΕͦ͏ • ᶄͷ௨৴࣌ɺ໊લղܾʹࣦഊύλʔϯ • ᶃ ͷ SLI Ͱܭଌ͞Ε͍ͯΕྑ͍͕…? ௨ৗͷ௨৴ UBSBBQJHBUFXBZDPOUBJOFS͕IUUQ UBSBNBJOΛ໊લղܾ͢Δ͕͜͜ ࣦഊͨ͠ IUUQUBSBNBJOͰ௨৴͢Δ
Ծઆ2: Reverse Proxy ͷ metrics (SLIᶃ) ͕͓͔͍͠ͷͰʁ • Yes •
GraphQL ϦΫΤετ్͕தͰࣦഊͨ͠߹ɺhttp Ͱ 200 Λฦ͍ͯͨ͠😱 • ϦϦʔε࣌ɺ෦ࣦഊ 500 Ͱฦ͢͜ͱΛܾΊ͕ͨɺͦ͏͞Ε͍ͯͳ͔ͬͨ ௨ৗͷ௨৴ $MJFOU͔ΒIUUQTKVOJPSMFBSOTUVEZTBQVSJKQʹΞΫηε͢Δͱ 3FWFSTF1SPYZʹ౸ୡ 3FWFSTF1SPYZ͔ΒțBSBBQJHBUFXBZQSPYZᶄ UBSBBQJHBUFXBZ͔ΒțBSBNBJO௨৴ᶄ͜͜ͰΤϥʔ͕ൃੜ
ରॲ1ɿGraphQL Error ͷ߹ http 500 Λฦ͢ • ݩʑ GraphQL
http ͷ͜ͱΛؾʹ͍ͯ͠ͳ͍ • ڍಈ GraphQL server library ͷڍಈʹґଘ͢Δ • Response status 200 ʹ౷Ұ͢ΔϓϥΫςΟε͋Δ • Client Error Response ͷ errors ΛݟΔͷͰͳ͍ ಉ྅͕γϡοͱͯ͘͠Ε·ͨ͠🙏 4QFDJBM5IBOLT!2VSBNZ
ରॲ2ɿ Envoy ΛΊͯ Datadog APM metrics Λར༻ • ෳࡶੑʹΑΔτϥϒϧγϡʔτͷ͠͞ΛݮΒͨ͢Ί •
Envoy ͷ metrics ʹ͕͋ͬͨΘ͚Ͱͳ͍ • ӡ༻ͷ՝ଟ͘ metrics औಘҎ֎ͷϝϦοτಘΒΕ͍ͯͳ͔ͬͨ • Curcuit Breaker ೖΕ͍ͯͨͷͷൃಈͨ͠έʔε΄ͱΜͲͳ͍ • Envoy ͷ version up ରԠʢग़དྷ͍ͯͳ͍ʣ • Pod side-car container ͷىಈɾऴྃॱ੍ޚʢenvoy Λͨͳ͍ͱΤϥʔʹͳΔʣ • Rollouts Λ͍ͬͯΔ߹ͷ Patch ํ๏ʢResource ٯసͯ͠োʹͳͬͨ͜ͱʣ
খωλ: Datadog APM ݁ߏบ͕͋Δ(1) • http client ͷ APM Plugin
ͷ resource tag default Ͱ http method Ͱ͋Δ • Ѽઌ͝ͱͷ SLI ͱͯ͠࠾༻͢Δʹ hostname ͕ඞཁ • Node, Ruby ͰͦΕͧΕରԠ • ৫Ͱ http-client ͷ resource tag ͷ໋໊نΛ߹ҙ
খωλ: Datadog APM ݁ߏบ͕͋Δ(2) • trace.http.request.errors Ͱ http 5xx ֘͠ͳ͍
• ٯʹ 4xx ֘͢Δ • trace.http.request.hits.by_http_status Λར༻͢Δඞཁ͕͋Δ
උߟ: tara ͱ͍͏ͷ͜ͷϦχϡʔΞϧϓϩδΣΫτͷίʔυωʔϜͰɺ࠷ۙΠϯλϏϡʔͰύϒϦοΫʹͳͬͨ https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/ ݩʑ͋ͬͨ Ϣʔβج൫Λ ؚΉαʔϏε 3FWFSTF1SPYZ /HJOY •
SLO Λݟͨ͠ • (a)Availability ͱ (b)Latency • http ͷ metrics Λ͏ • ҎԼͷ4Օॴʹ(a/b)2छྨͣͭ • ᶃ api-gateway • ᶄ api-gateway -> main • ᶅ api-gawatey -> content • ᶆ main -> content • 🆕ᶇ api-gateway -> Ϣʔβج൫ͷ request • SLO • Availability: 99.9% • Latency: 95 percentile < 1000msec • -> αʔϏε͝ͱʹݱঢ়ΛՃຯ͠ɺ 100~500msec ᶃ ᶄ ᶅ ᶆ ᶇ ϚΠΫϩαʔϏε͝ͱͷ 4-*4-0Λഇࢭ 4-*Λ͚ΔϝϦοτ͕ෳ 4-*4-0Λཧ͢Δίετʹ ݟ߹͍ͬͯͳ͍ͨΊ Ϣʔβج൫͚4-*4-0Ճ Ϣʔβج൫͚ͷڞ௨4-*͜Ε·ͰFOWPZNFUSJDT Λར༻͍ͯͨ͠ɻFOWPZΛ֎ͨͨ͠Ί%BUBEPH "1.NFUSJDTΛར༻ͨ͠4-*4-0ΛՃ
DevSupport ݟ͠ • Sentry Exception ͰΞϓϦέʔγϣϯίʔυىҼͷͷҎ ֎શͯ Ignore ͢Δ •
SLO Alert ͕དྷͨ࣌ͷجຊతͳରॲํΛυΩϡϝϯτԽ • ରԠͰ͖ͳ͔ͬͨͷΛ2िؒʹ1ճνʔϜͰରԠ
Outline • ࣗݾհ • ʰελσΟαϓϦ தֶߨ࠲ʱʹ͍ͭͯ • SLI/SLO ͳΜͷͨΊʹ͋Δͷ͔ •
αʔϏεӡ༻ͷݱঢ়ͱ՝ • ՝ʹ࣮͋ͨͬͯࡍʹऔΓΜͩ͜ͱ • ·ͱΊ
Կ͕ى͖͍ͯͨͷ͔ • ϦϦʔε࣌ʹҰઃఆ͞Εͨ SLI/SLO 1ݟ͞Εͯͳ ͔ͬͨ • SLO ͕ԿͷՁൃش͍ͯ͠ͳ͔ͬͨ •
SLI/SLO ྆ํΛݟ͠ɺࠓޙܧଓతʹݟ͢͜ͱʹͨ͠
Ͳ͏͖ͩͬͨ͢ͷ͔ • ։ൃऀࢹ • SLO ͕ຊʹՁΛͨΒ͍ͯ͠Δͷ͔Λఆظతʹݕࠪ͢Δ • ൪ྑ͍͕ɺͨ·ʹશһͰݟΔ࣌ؒΛऔΔ͜ͱॏཁ • 1ަͩͱਂ͘ௐΔΠϯηϯςΟϒ͕ಇ͔ͳ͍
• SRE ࢹ • ։ൃνʔϜ͕ SLI/SLO Λఆظతʹݟ͢ΈΛ࡞Δ • ϫʔΫϩʔυ͝ͱʹ SLI/SLO Λࣗಈੜ͢ΔΈΛ࡞Δ
ࠓ͍͑ͨ͜ͱ ҰܾΊͨ SLI/SLO ܧଓతʹݟ͠·͠ΐ͏
Thank you! chaspy chaspy_ Engineering Manager Site Reliability and Web
Application Development at Recruit Co., Ltd. Takeshi Kondo https://chaspy.me