『スタディサプリ』における SLI/SLO の継続的改善 / Continuous improvement of SLI/SLO at StudySapuri
Takeshi Kondo
May 16, 2023
Technology
https://connpass.com/event/282120/
Transcript
Continuous improvement of SLI/SLO at StudySapuri
Takeshi Kondo / @chaspy
2023/05/13 SLOconf Tokyo 2023
Who am I chaspy chaspy_ Engineering Manager Site Reliability and
Web Application Development at Recruit Co., Ltd. Takeshi Kondo https://chaspy.me
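SRE NEXT 2020 & 2022
• 2020
  • A case study of trying to introduce SLI/SLO into an organization that did not yet have the vocabulary for it
• 2022
  • What came after introducing SLI/SLO
  • Thoughts on what it takes to advance Site Reliability Engineering across the whole organization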
SRE & Web Application Development (timeline, 2018 to 2023)
• Joined Quipper
• SRE NEXT
• Worked hard to introduce SLOs into the organization
• SRE NEXT
• Became an Engineering Manager
• Joined the web development team as Engineering Manager
• SLOconf Tokyo ✨
Today I will speak from a developer's perspective!
What I want to convey today: once you have decided on your SLIs/SLOs, keep reviewing them continuously.
I will talk about what happened when we set them once and never reviewed them 😅
A talk about how we are now going to start the continuous improvement of SLI/SLO at StudySapuri
Takeshi Kondo / @chaspy
2023/05/13 SLOconf Tokyo 2023
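Outline
• Self-introduction
• About the StudySapuri junior high school course
• What are SLI/SLO for?
• Current state and challenges of service operations
• What we actually did to address the challenges
• Summary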
Background: product introduction - StudySapuri
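Relaunched in February 2022
• Everything other than the user platform was developed as new microservices over two years
• One year has passed since the release; we are now enhancing it continuously
https://www.recruit.co.jp/newsroom/pressrelease/2022/0131_9881.html
Key points of the relaunch!
• Individualized learning support through "This Week's Missions" and review exercises
• Greatly expanded exercise volume and range of difficulty
• New courses arriving one after another, including the "Regular Exam Preparation Course"
• Completely redesigned learning screens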
Note: "tara" is the code name of this relaunch project; it recently became public in an interview.
https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/
[Diagram label] Pre-existing services, including the user platform
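Why SLI/SLO?
• To decide, based on facts, whether time should go to feature development or to non-functional work
• While error budget remains, do not respond to every single error
  • This avoids burnout
• While budget remains, you can take risks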
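The error budget reasoning on this slide can be made concrete with a little arithmetic. The following is a minimal sketch, not the team's actual tooling: the function names, the request counts, and the rule "take risks while less than 100% of the budget is consumed" are all illustrative assumptions.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize the error budget for one SLO window.

    slo_target is a ratio such as 0.999 (99.9%). All inputs are hypothetical
    numbers supplied by the caller, not values taken from the talk.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # Number of failures the SLO tolerates in this window.
    allowed = (1.0 - slo_target) * total_requests
    remaining = allowed - failed_requests
    consumed = failed_requests / allowed if allowed > 0 else float("inf")
    return {
        "allowed_failures": allowed,
        "remaining_failures": remaining,
        "consumed_ratio": consumed,
    }


def can_take_risk(budget: dict, threshold: float = 1.0) -> bool:
    """Assumed policy: risk is acceptable while consumed budget stays below threshold."""
    return budget["consumed_ratio"] < threshold


if __name__ == "__main__":
    budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(budget, can_take_risk(budget))
```

With a 99.9% target and one million requests, roughly 1,000 failures are tolerated, so 400 failures leave room to keep shipping without chasing each error individually.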
[Diagram labels] Pre-existing services, including the user platform; Reverse Proxy (Nginx)
Eight SLIs/SLOs in total
• (a) Availability and (b) Latency
• Based on HTTP metrics
• Two kinds (a/b) at each of the following four points
  • ① api-gateway
  • ② api-gateway -> main
  • ③ api-gateway -> content
  • ④ main -> content
• SLO
  • Availability: 99.9%
  • Latency: 95th percentile < 1000 msec
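As an illustration of what the two SLI types measure, here is a small sketch that evaluates a batch of request records against the targets on this slide (99.9% availability, 95th percentile latency under 1000 msec). The `Request` record, the rule that any status below 500 counts as "good", and the nearest-rank percentile method are assumptions made for the example; the slides do not specify them.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # observed latency in milliseconds


def availability(requests: list) -> float:
    """Ratio of good requests; a status below 500 is assumed to be 'good'."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def percentile(values: list, p: int) -> float:
    """Nearest-rank percentile of a non-empty list (p between 1 and 100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(requests: list, availability_target: float = 0.999,
              p95_limit_ms: float = 1000.0) -> dict:
    """Check both SLO types for one window of requests."""
    p95 = percentile([r.latency_ms for r in requests], 95)
    return {
        "availability_ok": availability(requests) >= availability_target,
        "latency_ok": p95 < p95_limit_ms,
    }


if __name__ == "__main__":
    sample = [Request(200, 120.0)] * 998 + [Request(500, 2000.0)] * 2
    print(meets_slo(sample))
```

Running the same check separately at each of the four measurement points gives the eight SLI/SLO pairs listed above.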
Why Envoy?
• At the time, there was no way to obtain metrics for traffic between microservices
• This is not a service mesh with a control plane; plain Envoy simply runs as a sidecar container
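DevSupport: routine operational work, handled on rotation
• Check Slack notifications and investigate the cause
  • Sentry Exception, SLO Alert, GCP Pub/Sub Dead Letter
• Escalate anything that needs manual handling to the relevant team
• First-line handling of CS (Customer Support) inquiries
• First-line handling of mentions addressed to the whole group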
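The problem we had: No SLO Alert
• Not a single SLO Alert had fired between the release and now
• The volume of Sentry exceptions did not seem to be reflected in the SLIs
• What was going on?
  • At the very least, as long as we were looking at Sentry exceptions one by one, we were not making use of the error budget concept
  • Were the SLOs looser than our expectations?
  • Were the SLIs configured incorrectly?
• We investigated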
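Hypothesis 1: Are the Envoy metrics (SLIs ②③④) wrong?
• Yes
• Some of the exceptions were failures in DNS name resolution
  • In other words, they never became HTTP requests
  • So it is only natural that they were not counted in envoy.cluster.upstream_rq_2xx
• This is the pattern where name resolution fails during communication ②
  • It would be fine if SLI ① captured it, but...?
[Diagram labels] Normal communication; the tara api-gateway container resolves the name of tara main over HTTP, and this is where it failed; it then communicates with tara main over HTTP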
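The blind spot described here, where a failure occurs before any HTTP request exists and so never reaches the request counters, can be reproduced with a toy model. This sketch is not Envoy and does not use Envoy's API: `CountingClient`, the resolver, and the sender are hypothetical stand-ins, and only the `socket.gaierror` exception type comes from the Python standard library.

```python
import socket


class CountingClient:
    """Toy client with two views: request counters (proxy-style) and the caller's view."""

    def __init__(self, resolver, sender):
        self.resolver = resolver  # callable(host) -> address, may raise socket.gaierror
        self.sender = sender      # callable(address) -> HTTP status code
        self.upstream_total = 0   # requests that were actually sent
        self.upstream_2xx = 0     # sent requests that returned 2xx
        self.caller_total = 0     # every attempt the application made
        self.caller_ok = 0        # attempts the application saw succeed

    def get(self, host):
        self.caller_total += 1
        try:
            address = self.resolver(host)
        except socket.gaierror:
            # Name resolution failed: no HTTP request exists, so the
            # upstream counters are never touched.
            return None
        self.upstream_total += 1
        status = self.sender(address)
        if 200 <= status < 300:
            self.upstream_2xx += 1
            self.caller_ok += 1
        return status

    def proxy_view(self) -> float:
        """Availability as computed from request counters only."""
        return self.upstream_2xx / self.upstream_total if self.upstream_total else 1.0

    def caller_view(self) -> float:
        """Availability as experienced by the calling application."""
        return self.caller_ok / self.caller_total if self.caller_total else 1.0


def simulate(attempts: int = 100, fail_every: int = 10) -> CountingClient:
    """Run a deterministic scenario in which every Nth name resolution fails."""
    state = {"calls": 0}

    def flaky_resolver(host):
        state["calls"] += 1
        if state["calls"] % fail_every == 0:
            raise socket.gaierror("simulated name resolution failure")
        return "192.0.2.10"  # documentation-only address, hypothetical

    client = CountingClient(flaky_resolver, lambda address: 200)
    for _ in range(attempts):
        client.get("upstream.example")  # hypothetical hostname
    return client


if __name__ == "__main__":
    c = simulate()
    print("counter-based availability:", c.proxy_view())
    print("caller-observed availability:", c.caller_view())
```

In this scenario the counter-based SLI reports perfect availability while the caller sees one failure in ten, which matches the symptom of exceptions in Sentry with no SLO alert.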
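Hypothesis 2: Are the Reverse Proxy metrics (SLI ①) wrong?
• Yes
• When a GraphQL request failed partway through, we were returning HTTP 200 😱
• At release time we had decided that partial failures would be returned as 500, but it was not implemented that way
[Diagram labels] Normal communication; when the client accesses the service over HTTPS, the request reaches the Reverse Proxy; the Reverse Proxy proxies to tara api-gateway; tara api-gateway communicates with tara main, and the error occurs here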
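Fix 1: Return HTTP 500 for GraphQL errors
• GraphQL itself was never concerned with HTTP
  • The behavior depends on the GraphQL server library
  • Some practices standardize on response status 200
• Clients look at `errors` in the error response, so this causes no problem for them
A colleague fixed this for us in no time 🙏
Special Thanks @Quramy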
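The rule adopted in this fix can be sketched as a small mapping from a GraphQL result to an HTTP status. This is an illustrative sketch only: the function names are hypothetical, the actual change was made in the team's own GraphQL server, and treating any non-empty `errors` list (including partial failures that still carry `data`) as 500 is this example's reading of the slide.

```python
import json


def http_status_for_graphql(result: dict) -> int:
    """Return 500 when the GraphQL result has a non-empty `errors` list, else 200."""
    return 500 if result.get("errors") else 200


def render_response(result: dict) -> tuple:
    """Build an (HTTP status, JSON body) pair; the body is passed through unchanged."""
    return http_status_for_graphql(result), json.dumps(result)


if __name__ == "__main__":
    ok = {"data": {"lesson": {"id": "1"}}}
    partial = {"data": {"lesson": None}, "errors": [{"message": "upstream failed"}]}
    print(render_response(ok))
    print(render_response(partial))
```

Because the body still contains `errors`, clients that inspect that field keep working, while an HTTP-status-based availability SLI now sees the failure.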
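Fix 2: Drop Envoy and use Datadog APM metrics
• To reduce how hard troubleshooting had become because of the added complexity
  • It is not that the Envoy metrics themselves were at fault
• There were many operational burdens, and we were getting no benefit beyond collecting metrics
  • We had configured a Circuit Breaker, but it almost never tripped
  • Keeping up with Envoy version upgrades (we could not)
  • Controlling the start and stop order of the Pod sidecar container (errors occur unless the application waits for envoy)
  • How to patch when using Rollouts (resources once ended up reversed and caused an incident)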
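Side note: Datadog APM has its quirks (1)
• By default, the resource tag of the http client APM plugin is the HTTP method
• To use it as a per-destination SLI, the hostname is required
• We handled this separately for Node and Ruby
• We agreed on an organization-wide naming convention for the http-client resource tag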
Side note: Datadog APM has its quirks (2)
• In trace.http.request.errors, HTTP 5xx responses are not counted
• Conversely, 4xx responses are counted
• You need to use trace.http.request.hits.by_http_status instead
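To see why the choice of metric matters, the sketch below computes availability from per-status hit counts and contrasts it with a ratio built from a counter that, as described above, includes 4xx but not 5xx. The dictionary stands in for values one might read from trace.http.request.hits.by_http_status; how those values are queried from Datadog is out of scope, and the function names and numbers are hypothetical.

```python
def availability_from_status_hits(hits_by_status: dict) -> float:
    """Availability where only 5xx responses count as bad."""
    total = sum(hits_by_status.values())
    if total == 0:
        return 1.0
    bad = sum(count for status, count in hits_by_status.items() if 500 <= status <= 599)
    return (total - bad) / total


def ratio_from_4xx_only_errors(hits_by_status: dict) -> float:
    """What an SLI would look like if 'errors' meant 4xx and excluded 5xx.

    This models the quirk described on the slide; it is not a Datadog API.
    """
    total = sum(hits_by_status.values())
    if total == 0:
        return 1.0
    errors = sum(count for status, count in hits_by_status.items() if 400 <= status <= 499)
    return (total - errors) / total


if __name__ == "__main__":
    hits = {200: 9900, 404: 90, 500: 10}  # hypothetical counts for one window
    print("5xx-based availability:", availability_from_status_hits(hits))
    print("errors-counter ratio:  ", ratio_from_4xx_only_errors(hits))
```

With these sample counts the 5xx-based figure is 99.9%, while the errors-counter ratio is 99.1% and moves with client mistakes rather than server failures.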
[Diagram labels] Pre-existing services, including the user platform; Reverse Proxy (Nginx)
We revised the SLOs
• (a) Availability and (b) Latency
• Based on HTTP metrics
• Two kinds (a/b) at each of the following points
  • ① api-gateway
  • ② api-gateway -> main
  • ③ api-gateway -> content
  • ④ main -> content
  • 🆕 ⑤ requests from api-gateway to the user platform
• SLO
  • Availability: 99.9%
  • Latency: 95th percentile < 1000 msec
    • -> 100 to 500 msec, set per service based on its current state
[Callout] Abolished the per-microservice SLIs/SLOs: the benefit of splitting the SLIs did not justify the cost of managing multiple SLIs/SLOs
[Callout] Added an SLI/SLO for the user platform: the common SLI for the user platform had relied on envoy metrics, and since envoy was removed, we added an SLI/SLO based on Datadog APM metrics
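DevSupport revised
• Ignore every Sentry exception that is not caused by application code
• Documented the basic response procedure for when an SLO Alert arrives
• Items that could not be handled are addressed by the team together once every two weeks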
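What had been happening
• The SLIs/SLOs set once at release time had not been reviewed for a year
• The SLOs were delivering no value at all
• We revised both the SLIs and the SLOs, and decided to keep reviewing them continuously from now on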
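What we should have done
• Developer perspective
  • Regularly inspect whether the SLOs are really delivering value
  • A duty rotation is fine, but it is also important to occasionally set aside time for everyone to look together
    • When each person is on duty for only one short turn, there is no incentive to investigate deeply
• SRE perspective
  • Build a mechanism by which development teams review their SLIs/SLOs regularly
  • Build a mechanism that automatically generates SLIs/SLOs for each workload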
What I want to convey today: once you have decided on your SLIs/SLOs, keep reviewing them continuously.
Thank you! chaspy chaspy_ Engineering Manager Site Reliability and Web
Application Development at Recruit Co., Ltd. Takeshi Kondo https://chaspy.me