Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Site Reliability Engineering における 重要領域とパフォーマンス指...
Search
Takeshi Kondo
June 04, 2021
Technology
1
3.3k
Site Reliability Engineering における 重要領域とパフォーマンス指標の提案 / Performance Indicators for SRE
2021/06/04
第8回WebSystemArchitecture研究会(オンライン)
https://wsa.connpass.com/event/207143/
Takeshi Kondo
June 04, 2021
Tweet
Share
More Decks by Takeshi Kondo
See All by Takeshi Kondo
SRE NEXT CfP チームが語る 聞きたくなるプロポーザルとは / Proposals by the SRE NEXT CfP Team that are sure to be accepted
chaspy
2
1.4k
Slack Platform(Deno) での RAG 実装 - LangChain(js) を使ってみた / rag-implementation-on-slack-platform-deno-experimenting-with-langchain-js
chaspy
0
240
SRE の考えをマネジメントに活かす / applying SRE ideas to management
chaspy
7
7.7k
RAGの簡易評価によるフィードバックサイクル実践 / Feedback cycle practice through simplified assessment of RAGs
chaspy
2
5.5k
定量データと定性評価を用いた技術戦略の組織的実践 / Systematic implementation of technology strategies using quantitative data and qualitative evaluation
chaspy
9
1.9k
エンジニアブランディングチームの KPI / KPI's of engineer branding team
chaspy
2
2.2k
「SLO Review」今やるならこうする / If I had to do the "SLO Review" again
chaspy
3
2k
開発者とともに作る Site Reliability Engineering / SREing with Developers
chaspy
10
8.4k
自己診断能力の獲得を目指して / Toward the acquisition of self-diagnostic skills
chaspy
1
5.3k
Other Decks in Technology
See All in Technology
職種別ミートアップで社内から盛り上げる アウトプット文化の醸成と関係強化/ #DevRelKaigi
nishiuma
2
130
E2Eテスト設計_自動化のリアル___Playwrightでの実践とMCPの試み__AIによるテスト観点作成_.pdf
findy_eventslides
0
110
許しとアジャイル
jnuank
1
120
Goに育てられ開発者向けセキュリティ事業を立ち上げた僕が今向き合う、AI × セキュリティの最前線 / Go Conference 2025
flatt_security
0
350
成長自己責任時代のあるきかた/How to navigate the era of personal responsibility for growth
kwappa
3
270
定期的な価値提供だけじゃない、スクラムが導くチームの共創化 / 20251004 Naoki Takahashi
shift_evolve
PRO
3
300
BtoBプロダクト開発の深層
16bitidol
0
270
AIが書いたコードをAIが検証する!自律的なモバイルアプリ開発の実現
henteko
1
340
Where will it converge?
ibknadedeji
0
180
SREとソフトウェア開発者の合同チームはどのようにS3のコストを削減したか?
muziyoshiz
1
100
多様な事業ドメインのクリエイターへ 価値を届けるための営みについて
massyuu
0
110
「Verify with Wallet API」を アプリに導入するために
hinakko
1
230
Featured
See All Featured
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
19
1.2k
Gamification - CAS2011
davidbonilla
81
5.5k
KATA
mclloyd
32
15k
Measuring & Analyzing Core Web Vitals
bluesmoon
9
610
The Illustrated Children's Guide to Kubernetes
chrisshort
48
51k
Producing Creativity
orderedlist
PRO
347
40k
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
31
2.5k
The Cost Of JavaScript in 2023
addyosmani
53
9k
Fireside Chat
paigeccino
40
3.7k
StorybookのUI Testing Handbookを読んだ
zakiyama
31
6.2k
The MySQL Ecosystem @ GitHub 2015
samlambert
251
13k
Context Engineering - Making Every Token Count
addyosmani
5
180
Transcript
Site Reliability Engineering ʹ͓͚Δ ॏཁྖҬͱύϑΥʔϚϯεࢦඪͷఏҊ Takeshi Kondo / @chaspy 2021/06/04
ୈ8ճWebSystemArchitectureݚڀձʢΦϯϥΠϯʣ
Who am I chaspy chaspy_ Lead Software Engineer Site Reliability
at Quipper Takeshi Kondo
Agenda 1. എܠͱత 2. SRE ͕ؔΘΔྖҬ 3. ఏҊࢦඪͱଌఆํ๏ 4. ଌఆ݁Ռ
5. ·ͱΊͱࠓޙͷల
എܠͱత • SRE ͱ͍͏ Role ͕͘ීٴ͠ɺ࣮ફ͢Δاۀ͕૿͖͑ͯͨ • Ϗδωεɺ৫ͷنੑ࣭ʹΑΓͦͷׂҟͳΔ →SRE ͕ؔΘΔॏཁͳྖҬΛྨ͍ͨ͠
• ϓϩμΫτ։ൃͷΑ͏ʹϏδωεKPIΛઃఆ͠ɺͦΕΛ܁Γฦ ͠վળ͢Δͱ͍͏Ξϓϩʔν SRE ʹ༗ޮͳͣ →ྖҬ͝ͱͷύϑΥʔϚϯεࢦඪΛఆٛɾܭଌ͍ͨ͠
SRE͕ؔΘΔྖҬ • Ϋϥυ্Ͱ Web αʔϏεΛఏڙ͢ΔاۀΛఆ • 100+ Developer • 30M+
Access / Day
ʲࢀߟʳAWS Well-Architected Framework https://aws.amazon.com/architecture/well-architected/
ଌఆࢦඪͱଌఆํ๏ • Reliability • Developer Productivity • Cost • Security
• Platform
ྖҬ͝ͱͷؔੑ 1MBUGPSN 3FMJBCJMJUZ $PTU %FWFMPQFS 1SPEVDUJWJUZ 4FDVSJUZ Empowerment Empowerment Empowerment
Trade-Off Trade-Off Trade-Off
ʲࢀߟʳLean ͱ DevOps ͷՊֶ • ΤϦʔτاۀҎԼ͕ͯ͢༏Ε͍ͯΔ • σϓϩΠͷස • มߋͷϦʔυλΠϜ
• MTTR • มߋࣦഊ https://book.impress.co.jp/books/1118101029
ଌఆࢦඪͱଌఆํ๏ • Reliability • Developer Productivity • Cost • Security
• Platform
ଌఆࢦඪͱଌఆํ๏ ྖҬ ࢦඪ ଌఆํ๏ 3FMJBCJMJUZ .553 োൃੜ࣌ʹɺোใࠂϑϩʔͷ ඞਢ߲ͱͯ͠खಈͰܭଌ %FWFMPQFS1SPEVDUJWJUZ σϓϩΠճ
$*αʔϏεͷNFUSJDT %FWFMPQFS1SPEVDUJWJUZ σϓϩΠ࣌ؒ $*αʔϏεͷNFUSJDT %FWFMPQFS1SPEVDUJWJUZ $*҆ఆੑ $*αʔϏεͷNFUSJDT %FWFMPQFS1SPEVDUJWJUZ มߋࣦഊ ຊ൪ڥσϓϩΠʹରԠ͢Δϒϥϯ νͷ3FWFSUDPNNJUͷ
ଌఆ݁Ռ • MTTR • σϓϩΠճ • σϓϩΠ࣌ؒ • CI ҆ఆੑ
• มߋࣦഊ
MTTR Plot with Trendline
MTTR per half year
Histgram
MTTR ʹର͢Δߟ • ܭଌՄೳੑͷ • ࣗಈͰूܭͰ͖ΔΈ͕ඞཁ • ͦͷͨΊʹ Incident Response
ͷܕԽͱͦΕʹର͢Δ Tool ͕ඞཁ • SeverityʢIncident ͷ Level ఆٛʣ/ োൃੜɾݕɾ෮چͷ࣌ؒΛඞͣه͢ΔϧʔϧͳͲ • σʔλྔͷ • Πϯγσϯτ2Ͱ͔͕ͨͩ50ఔɺेͳσʔλྔ͕ಘΒΕͳ͍ • Πϯγσϯτ͕ଟ͍͜ͱ SRE ͷతͱ૬͢Δ • Β͖ͭͷ • σʔλྔ͕ेͰͳ͍ͱɺҰ෦ͷ࣌ؒোʹҾ͖ͣΒΕͯ͠·͏
MTTR ʹର͢Δߟ • ࢦඪͱͯ͠༗ӹ͔Ͳ͏͔·ͩஅͰ͖ͳ͍ • গͳ͘ͱҎԼͷͰ༗ӹͳͷͰτϥοΩϯάΛଓ͚Δ • Incident Response ͷܕԽ
• ࣌ؒΠϯγσϯτʹର͢Δվળ
ʲࢀߟʳIncident Metrics in SRE • MTTR ࢦඪʹ͖͢Ͱͳ͍ͱओு • ݅ෆͱΒ͖ͭͷେ͖͕͞ཧ༝ •
ͰͲΕΛ࠾༻͖͔͢ݴٴ͕ͳ͍ https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/
ʲิʳDeveloper Productivity ྖҬͷܭଌର • monorepo Λ࠾༻ • master branch ͰෳͷΞϓϦ͕ಉ࣌ʹσϓϩΠ͞ΕΔ
• Database Λڞ༗͢Δ Distributed monolith ͱͳ͍ͬͯΔ • ͜ΕΒجຊతʹि࣍ϦϦʔε͞ΕΔ • ͜ΕҎ֎ͷ microservices ݸผͰϦϦʔε͞ΕΔ͕ɺࠓճ ܭଌର֎
σϓϩΠճ
σϓϩΠճʹର͢Δߟ • جຊతʹ Weekly Release Ͱ͋ΔͨΊɺʹ26ճඞͣσϓϩΠ ͞ΕΔ • Γ HOTFIX
• ԿͷͨΊͷ HOTFIX ͔ʁ • มߋࣦഊʢޙड़ʣͱ߹ΘͤͯΈͳ͍ͱҙຯ͕ബͦ͏ • ࣮ࡍ2020લ Production ͷ Kubernetes manifest มߋͷͨΊͷ HOTFIX ͕ଟ͔ͬͨ • σϓϩΠ͕ݮগͳͷ Microservices Խ͍ͯ͠Δ͔Β • Microservices ΛؚΊͯܭଌ͢Δඞཁ͕͋Δ
σϓϩΠ࣌ؒ
σϓϩΠ࣌ؒʹର͢Δߟ • ະੳʢه༧ఆʣ • ͜ͷ࣌ؒมߋࣦഊ࣌ͷ Revert ͷ࣌ؒͱҰக͢ΔͷͰɺ ͘͢Ε͢Δ΄Ͳ MTTR ݮʹͭͳ͕Δͣ
CI ҆ఆੑ
CI ҆ఆੑʹؔ͢Δߟ • Time Window 7 Days • 30
Days, 90 Days ͳͲෳͷ Time Window Ͱܭଌͨ͠΄͏͕ྑ͍ • ຊ൪͚ͩͰͳ͘ɺ։ൃϒϥϯνಉ༷ʹܭଌ͢Δ͖ • ࢦඪͱͯ͠ෆద • جຊతʹ 100% ʹ͚ۙΕ͍ۙ΄Ͳྑ͍ • SLO ͱͯ͠ଊ͑ͯɺඪΛҧͨ͠Βࠜຊमਖ਼͢ΔΞϓϩʔν͕ྑ͍ • ͜ͷΛؚΉผͷࢦඪΛ༻͍ͨ΄͏͕ྑ͍ • Time To DeliveryʢมߋͷϦʔυλΠϜʣ • MTTR • ͨͩ͠ɺੳՄೳੑॏཁɻCI ͕ෆ҆ఆͳͱ͖ɺͲͷ Job ͕Ͳͷఔෆ҆ఆ͔ΛΔඞཁ͋Δ
มߋࣦഊ
มߋࣦഊ
มߋࣦഊʹؔ͢Δߟ • "มߋࣦഊ"ͷఆٛͷ • ԿΛͬͯ"มߋࣦഊ"ͱ͢Δ͔ͷఆ͕ٛඞཁ • Label ༩ͳͲͷӡ༻ϧʔϧ͕ͳ͍ͱܭଌ͕͍͠ • ܭଌํ๏ͷ
• ຊ൪ϒϥϯνͷ Revert "มߋࣦഊ"Ҏ֎Ͱى͖͍ͯͨ • Argo Rollouts Λ࠾༻͍ͯ͠Δ • ௨ৗ Canary Strategy Λ༗ޮʹ͍ͯ͠ͳ͍ • ॏཁػೳͳͲ Canary ͍ͨ͠ͱ͖͚ͩ༗ޮʹ͠ɺ100% ϦϦʔεͨ͠Β Revert ͍ͯͨ͠ • σʔλྔͷ • MTTR ಉ༷ͷ
·ͱΊͱߟ • SRE ۀΛ5ͭͷྖҬʹྨ͠ɺ͏ͪ2ͭͷྖҬ͔ΒɺʮLean ͱ DevOps ͷՊֶʯΛࢀߟʹɺࢦ ඪʹͳΓ͏Δ͔Λܭଌͨ͠ • ༗ޮͳࢦඪͷ݅
• ेʹσʔλྔ͕͋Δ͜ͱ • MTTR, มߋࣦഊσʔλྔΛಘΔ͜ͱ͕͍͠ • ͜ΕΒ͕සൃ͢Δঢ়ଶ SRE ͷతͱ͢Δ • ͦͷࢦඪΛؚΉଞͷࢦඪ͕ଘࡏ͠ͳ͍͜ͱ • CI ҆ఆੑ MTTR, Time To DeliveryʢมߋͷϦʔυλΠϜʣͰิ͑Δ • σϓϩΠ࣌ؒ MTTR, Time To DeliveryʢมߋͷϦʔυλΠϜʣͰิ͑Δ • σϓϩΠճΛ݈શʹ૿͢ʹมߋࣦഊͷܭଌ͕ඞཁ
·ͱΊͱߟ • MTTR 🚀 • τϥοΩϯάܧଓ • ܧଓతʹऔಘ͢ΔͨΊʹ Incident Response
ͷվળ͕ඞཁ • σϓϩΠճ🚀 • microservices ؚΊͯܭଌ • σϓϩΠ࣌ؒ🤔 • ։ൃϒϥϯνͷܭଌ͕ඞཁ • ظతʹ MTTR / มߋͷϦʔυλΠϜͰิ͏ • CI ҆ఆੑ🤔 • ։ൃϒϥϯνͷܭଌ͕ඞཁ • ظతʹ MTTR / มߋͷϦʔυλΠϜͰิ͏ • มߋࣦഊ🚀 • มߋࣦഊͷఆٛͱӡ༻ϧʔϧࡦఆ͕ඞཁ • มߋͷϦʔυλΠϜ🤔 • Develop branch Ͱͷ First commit ͔Β Production ͷ Code มߋ·ͰΛܭଌͰ͖Δͱྑ͍͕ɺม͕ଟ͘ɺΒ͖͕ͭେ͖͍Մೳੑ͕͋Δ
ࠓޙͷల • ݱঢ়ଌఆ͍ͯ͠ΔͷܧଓɺࣗಈԽΛࢦ͢ • "ࣦഊ"ʹؔ͢Δࢦඪ׆༻ͮ͠Β͍Մೳੑ͕͋Δ͕ɺܧଓͯ͠ܭ • MTTR, มߋࣦഊ • ਓ͕ؒؔΘΔϓϩηεͰܭଌͷͨΊʹఆٛɺϧʔϧɺن͕ඞཁ
• ଞͷྖҬʹؔͯ͠ࢦඪΛఏҊ͢Δ • ܭଌɾՄࢹԽͷσβΠϯύλʔϯͷཧΛ͍ͨ͠ • ୭͕ԿͰܭଌͯ͠ՄࢹԽͯࣗ͠తʹܧଓతվળ͕Ͱ͖ΔੈքΛࢦ͢
Thank you! chaspy chaspy_ Lead Software Engineer Site Reliability at
Quipper Takeshi Kondo