Takeshi Kondo
June 04, 2021
Proposing Key Areas and Performance Indicators in Site Reliability Engineering / Performance Indicators for SRE
2021/06/04
8th WebSystemArchitecture Study Group (online)
https://wsa.connpass.com/event/207143/
Transcript
Proposing Key Areas and Performance Indicators in Site Reliability Engineering
Takeshi Kondo / @chaspy
2021/06/04, 8th WebSystemArchitecture Study Group (online)
Who am I
Takeshi Kondo (chaspy, chaspy_)
Lead Software Engineer, Site Reliability at Quipper
Agenda
1. Background and purpose
2. Areas SRE is involved in
3. Proposed indicators and measurement methods
4. Measurement results
5. Summary and future outlook
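Background and purpose
• The SRE role has become widespread, and more companies are putting it into practice
• What the role covers differs with the business and with the size and nature of the organization
  → I want to classify the key areas that SRE is involved in
• The approach used in product development, setting business KPIs and improving them iteratively, should work for SRE too
  → I want to define and measure performance indicators for each area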
Areas SRE is involved in
• Assumes a company that provides web services on the cloud
• 100+ developers
• 30M+ accesses per day
[Reference] AWS Well-Architected Framework https://aws.amazon.com/architecture/well-architected/
Indicators and measurement methods
• Reliability
• Developer Productivity
• Cost
• Security
• Platform
Relationships between the areas
[Diagram: Platform, Reliability, Cost, Developer Productivity, Security, connected by relations labeled "Empowerment" and "Trade-Off"]
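[Reference] "Lean と DevOps の科学" (the Japanese edition of Accelerate)
• Elite companies excel at all of the following
  • Deployment frequency
  • Lead time for changes
  • MTTR
  • Change failure rate
https://book.impress.co.jp/books/1118101029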
Indicators and measurement methods
• Reliability
• Developer Productivity
• Cost
• Security
• Platform
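Indicators and measurement methods
Area | Indicator | Measurement method
Reliability | MTTR | Measured manually, as a required item in the incident report flow, whenever an incident occurs
Developer Productivity | Deployment count | Metrics from the CI service
Developer Productivity | Deployment time | Metrics from the CI service
Developer Productivity | CI stability | Metrics from the CI service
Developer Productivity | Change failure rate | Number of revert commits on the branch that corresponds to production deployments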
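The revert-commit measurement in the table above can be sketched roughly as follows. This is a minimal illustration, not the speaker's tooling: the commit subjects are invented, and dividing reverts by deployments to obtain a rate is an assumption. The only fixed point is that `git revert` produces subjects of the form `Revert "<original subject>"` by default.

```python
import re

# Hypothetical commit subjects on the production branch (not data from the talk).
SUBJECTS = [
    "Release 2021-05-10",
    'Revert "Add new billing flow"',
    "Release 2021-05-17",
    "HOTFIX: bump memory limit",
]

# Default subject format written by `git revert`.
REVERT_RE = re.compile(r'^Revert "')


def count_reverts(subjects):
    """Count commits whose subject looks like a default `git revert` message."""
    return sum(1 for s in subjects if REVERT_RE.match(s))


def change_failure_rate(revert_count, deploy_count):
    """Reverts divided by deployments (an assumed definition); 0.0 if nothing was deployed."""
    if deploy_count == 0:
        return 0.0
    return revert_count / deploy_count


reverts = count_reverts(SUBJECTS)
rate = change_failure_rate(reverts, deploy_count=3)
print(f"reverts={reverts}, change failure rate={rate:.2f}")
```

Matching on the commit subject is cheap but depends on nobody editing the default revert message.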
Measurement results
• MTTR
• Deployment count
• Deployment time
• CI stability
• Change failure rate
MTTR Plot with Trendline
MTTR per half year
Histogram
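Observations on MTTR
• Measurability problem
  • A mechanism that aggregates the data automatically is needed
  • That in turn requires a standardized Incident Response process and a tool that supports it
  • For example: Severity (a definition of incident levels), and a rule that the times of occurrence, detection, and recovery are always recorded
• Data volume problem
  • There were only about 50 incidents in two years, which is not enough data
  • Having many incidents would contradict the purpose of SRE
• Variance problem
  • When the data volume is insufficient, the value is dragged up by a handful of long-running incidents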
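The variance concern can be made concrete with a small sketch. The incident records below are invented for illustration, and comparing the mean with the median is one common way to show the skew rather than something the slides prescribe.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical incident records: (occurred_at, recovered_at). Not data from the talk.
INCIDENTS = [
    (datetime(2020, 1, 10, 9, 0), datetime(2020, 1, 10, 9, 30)),
    (datetime(2020, 2, 3, 14, 0), datetime(2020, 2, 3, 14, 20)),
    (datetime(2020, 3, 21, 1, 0), datetime(2020, 3, 21, 1, 40)),
    (datetime(2020, 5, 5, 8, 0), datetime(2020, 5, 5, 20, 0)),  # one 12-hour outage
]


def recovery_minutes(incidents):
    """Time to recover for each incident, in minutes."""
    return [(recovered - occurred).total_seconds() / 60
            for occurred, recovered in incidents]


minutes = recovery_minutes(INCIDENTS)
mttr_mean = mean(minutes)      # pulled up by the single long outage
mttr_median = median(minutes)  # stays close to the typical incident
print(f"mean={mttr_mean:.1f} min, median={mttr_median:.1f} min")
```

With only four records, one long outage moves the mean to several times the median, which is the effect the slide describes for small samples.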
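Observations on MTTR
• Whether it is useful as an indicator cannot be judged yet
• It is useful at least in the following respects, so tracking will continue
  • Standardizing Incident Response
  • Improving how long-running incidents are handled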
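[Reference] Incident Metrics in SRE
• Argues that MTTR should not be used as an indicator
• The reasons are too few incidents and large variance
• However, it does not say which metric should be adopted instead
https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/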
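[Supplement] What was measured for the Developer Productivity area
• A monorepo is used
• Multiple applications are deployed at the same time from the master branch
• The system is a distributed monolith that shares a database
• These applications are basically released weekly
• Other microservices are released individually, but they are out of scope for this measurement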
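Deployment count
Observations on deployment count
• Releases are basically weekly, so there are always at least 26 deployments per half year
• The rest are HOTFIXes
  • What are the HOTFIXes for?
  • This probably means little unless viewed together with the change failure rate (discussed later)
  • In fact, in the first half of 2020 many HOTFIXes were for changes to the production Kubernetes manifests
• The deployment count is decreasing because of the move to microservices
  • Microservices need to be included in the measurement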
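The arithmetic behind "weekly releases plus HOTFIXes" can be sketched as below. The total of 40 deployments is a made-up figure, and treating every deployment beyond the weekly schedule as a HOTFIX is a simplifying assumption.

```python
WEEKS_PER_HALF_YEAR = 26  # one scheduled release per week


def hotfix_count(total_deploys, scheduled=WEEKS_PER_HALF_YEAR):
    """Deployments beyond the weekly schedule, assumed here to be HOTFIXes."""
    return max(total_deploys - scheduled, 0)


# Hypothetical half-year total (not a figure from the talk).
total = 40
hotfixes = hotfix_count(total)
print(f"{total} deployments = {WEEKS_PER_HALF_YEAR} weekly releases + {hotfixes} hotfixes")
```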
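Deployment time
Observations on deployment time
• Not analyzed yet (to be added later)
• This time equals the time needed to revert a failed change, so the shorter it gets, the more it should reduce MTTR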
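CI stability
Observations on CI stability
• The Time Window was 7 Days
  • It would be better to measure with several Time Windows, such as 30 Days and 90 Days
  • Development branches should be measured in the same way, not only production
• Not suitable as an indicator
  • Basically, the closer to 100% the better
  • It is better treated as an SLO, with a root-cause fix whenever the target is violated
  • It is better to use other indicators that already include this value
    • Time To Delivery (lead time for changes)
    • MTTR
• Even so, being able to analyze it matters: when CI is unstable, we need to know which Job is unstable and by how much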
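Measuring CI stability over several time windows and treating it as an SLO might look like the sketch below. The run records, the reference date, and the 95% target are all assumptions made for illustration; real data would come from the CI service's metrics.

```python
from datetime import date, timedelta

# Hypothetical CI runs on one branch: (run_date, succeeded). Not data from the talk.
RUNS = [
    (date(2021, 6, 3), True),
    (date(2021, 6, 2), False),
    (date(2021, 6, 1), True),
    (date(2021, 5, 30), True),
    (date(2021, 5, 20), False),
    (date(2021, 5, 10), True),
    (date(2021, 4, 1), True),
    (date(2021, 3, 20), True),
]

SLO_TARGET = 0.95  # assumed target, not stated in the slides


def success_rate(runs, today, window_days):
    """Share of successful runs in the last `window_days` days, or None if no runs."""
    start = today - timedelta(days=window_days)
    in_window = [ok for run_date, ok in runs if start < run_date <= today]
    if not in_window:
        return None
    return sum(in_window) / len(in_window)


today = date(2021, 6, 4)
rates = {days: success_rate(RUNS, today, days) for days in (7, 30, 90)}
violations = [days for days, r in rates.items() if r is not None and r < SLO_TARGET]
print(rates, "windows below target:", violations)
```

To find which Job is unstable, the same function could be applied per job name, assuming the CI service exposes results at that granularity.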
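Change failure rate
Change failure rate
Observations on change failure rate
• Problem with the definition of "change failure"
  • We need to define what counts as a "change failure"
  • Without operational rules such as attaching a Label, it is hard to measure
• Problem with the measurement method
  • Reverts on the production branch also happened for reasons other than "change failure"
  • Argo Rollouts is used
    • The Canary Strategy is normally not enabled
    • It was enabled only when a canary was wanted, for example for important features, and that change was reverted once the release reached 100%
• Data volume problem
  • The same problem as with MTTR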
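One way to keep canary-related reverts out of the count is the labeling rule the slide calls for. The sketch below assumes a hypothetical `change-failure` label attached to reverts by an operational rule; neither the label name nor the record shape comes from the talk.

```python
# Hypothetical revert records with labels attached by an operational rule.
REVERTS = [
    {"subject": 'Revert "Enable canary strategy for checkout"', "labels": []},
    {"subject": 'Revert "Change session TTL"', "labels": ["change-failure"]},
    {"subject": 'Revert "Enable canary strategy for search"', "labels": []},
]


def failed_changes(reverts, label="change-failure"):
    """Keep only reverts explicitly labeled as change failures."""
    return [r for r in reverts if label in r["labels"]]


failures = failed_changes(REVERTS)
print(f"{len(failures)} of {len(REVERTS)} reverts were change failures")
```

An explicit label avoids guessing intent from the commit subject, at the cost of relying on people to apply it consistently.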
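Summary and observations
• SRE work was classified into five areas, and for two of them indicators were measured, with "Lean と DevOps の科学" as a reference, to see whether they can serve as indicators
• Conditions for a useful indicator
  • There is enough data
    • For MTTR and change failure rate it is hard to get enough data
    • A state in which these occur frequently contradicts the purpose of SRE
  • No other indicator already includes it
    • CI stability can be covered by MTTR and Time To Delivery (lead time for changes)
    • Deployment time can be covered by MTTR and Time To Delivery (lead time for changes)
• Increasing the deployment count in a healthy way requires measuring the change failure rate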
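Summary and observations
• MTTR 🚀
  • Continue tracking
  • Collecting it continuously requires improving Incident Response
• Deployment count 🚀
  • Measure microservices as well
• Deployment time 🤔
  • Development branches need to be measured
  • In the long run, cover it with MTTR / lead time for changes
• CI stability 🤔
  • Development branches need to be measured
  • In the long run, cover it with MTTR / lead time for changes
• Change failure rate 🚀
  • A definition of change failure and operational rules need to be established
• Lead time for changes 🤔
  • Ideally it would be measured from the first commit on the develop branch to the code change in production, but there are many variables and the variance may be large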
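Future outlook
• Continue what is being measured now, and aim to automate it
• Indicators about "failure" may be hard to make use of, but keep measuring them
  • MTTR, change failure rate
  • Processes that involve people need definitions, rules, and conventions before they can be measured
• Propose indicators for the other areas as well
• I want to organize design patterns for measurement and visualization
  • The aim is a world where anyone can measure and visualize anything and improve continuously and autonomously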
Thank you!
Takeshi Kondo (chaspy, chaspy_)
Lead Software Engineer, Site Reliability at Quipper