Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Site Reliability Engineering における 重要領域とパフォーマンス指...
Search
Takeshi Kondo
June 04, 2021
Technology
1
3.3k
Site Reliability Engineering における 重要領域とパフォーマンス指標の提案 / Performance Indicators for SRE
2021/06/04
第8回WebSystemArchitecture研究会(オンライン)
https://wsa.connpass.com/event/207143/
Takeshi Kondo
June 04, 2021
Tweet
Share
More Decks by Takeshi Kondo
See All by Takeshi Kondo
SRE NEXT CfP チームが語る 聞きたくなるプロポーザルとは / Proposals by the SRE NEXT CfP Team that are sure to be accepted
chaspy
2
1.4k
Slack Platform(Deno) での RAG 実装 - LangChain(js) を使ってみた / rag-implementation-on-slack-platform-deno-experimenting-with-langchain-js
chaspy
0
220
SRE の考えをマネジメントに活かす / applying SRE ideas to management
chaspy
7
7.6k
RAGの簡易評価によるフィードバックサイクル実践 / Feedback cycle practice through simplified assessment of RAGs
chaspy
2
5.5k
定量データと定性評価を用いた技術戦略の組織的実践 / Systematic implementation of technology strategies using quantitative data and qualitative evaluation
chaspy
9
1.9k
エンジニアブランディングチームの KPI / KPI's of engineer branding team
chaspy
2
2.2k
「SLO Review」今やるならこうする / If I had to do the "SLO Review" again
chaspy
3
2k
開発者とともに作る Site Reliability Engineering / SREing with Developers
chaspy
10
8.4k
自己診断能力の獲得を目指して / Toward the acquisition of self-diagnostic skills
chaspy
1
5.2k
Other Decks in Technology
See All in Technology
Evolution on AI Agent and Beyond - AGI への道のりと、シンギュラリティの3つのシナリオ
masayamoriofficial
0
150
datadog-distribution-of-opentelemetry-collector-intro
tetsuya28
0
240
生成AI利用プログラミング:誰でもプログラムが書けると 世の中どうなる?/opencampus202508
okana2ki
0
190
RAID6 を楔形文字で組んで現代人を怖がらせましょう(実装編)
mimifuwa
0
300
フルカイテン株式会社 エンジニア向け採用資料
fullkaiten
0
8.6k
広島発!スタートアップ開発の裏側
tsankyo
0
230
Exadata Database Service on Dedicated Infrastructure セキュリティ、ネットワーク、および管理について
oracle4engineer
PRO
1
360
AIとTDDによるNext.js「隙間ツール」開発の実践
makotot
5
590
ウォンテッドリーのアラート設計と Datadog 移行での知見
donkomura
0
300
Android Studio の 新しいAI機能を試してみよう / Try out the new AI features in Android Studio
yanzm
0
260
DeNA での思い出 / Memories at DeNA
orgachem
PRO
3
1.1k
モバイルアプリ研修
recruitengineers
PRO
2
130
Featured
See All Featured
Producing Creativity
orderedlist
PRO
347
40k
Designing Dashboards & Data Visualisations in Web Apps
destraynor
231
53k
Large-scale JavaScript Application Architecture
addyosmani
512
110k
Code Review Best Practice
trishagee
70
19k
How STYLIGHT went responsive
nonsquared
100
5.7k
Designing for Performance
lara
610
69k
The Power of CSS Pseudo Elements
geoffreycrofte
77
5.9k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
36
2.5k
Music & Morning Musume
bryan
46
6.7k
Mobile First: as difficult as doing things right
swwweet
223
9.9k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
49
3k
Building Flexible Design Systems
yeseniaperezcruz
328
39k
Transcript
Site Reliability Engineering ʹ͓͚Δ ॏཁྖҬͱύϑΥʔϚϯεࢦඪͷఏҊ Takeshi Kondo / @chaspy 2021/06/04
ୈ8ճWebSystemArchitectureݚڀձʢΦϯϥΠϯʣ
Who am I chaspy chaspy_ Lead Software Engineer Site Reliability
at Quipper Takeshi Kondo
Agenda 1. എܠͱత 2. SRE ͕ؔΘΔྖҬ 3. ఏҊࢦඪͱଌఆํ๏ 4. ଌఆ݁Ռ
5. ·ͱΊͱࠓޙͷల
എܠͱత • SRE ͱ͍͏ Role ͕͘ීٴ͠ɺ࣮ફ͢Δاۀ͕૿͖͑ͯͨ • Ϗδωεɺ৫ͷنੑ࣭ʹΑΓͦͷׂҟͳΔ →SRE ͕ؔΘΔॏཁͳྖҬΛྨ͍ͨ͠
• ϓϩμΫτ։ൃͷΑ͏ʹϏδωεKPIΛઃఆ͠ɺͦΕΛ܁Γฦ ͠վળ͢Δͱ͍͏Ξϓϩʔν SRE ʹ༗ޮͳͣ →ྖҬ͝ͱͷύϑΥʔϚϯεࢦඪΛఆٛɾܭଌ͍ͨ͠
SRE͕ؔΘΔྖҬ • Ϋϥυ্Ͱ Web αʔϏεΛఏڙ͢ΔاۀΛఆ • 100+ Developer • 30M+
Access / Day
ʲࢀߟʳAWS Well-Architected Framework https://aws.amazon.com/architecture/well-architected/
ଌఆࢦඪͱଌఆํ๏ • Reliability • Developer Productivity • Cost • Security
• Platform
ྖҬ͝ͱͷؔੑ 1MBUGPSN 3FMJBCJMJUZ $PTU %FWFMPQFS 1SPEVDUJWJUZ 4FDVSJUZ Empowerment Empowerment Empowerment
Trade-Off Trade-Off Trade-Off
ʲࢀߟʳLean ͱ DevOps ͷՊֶ • ΤϦʔτاۀҎԼ͕ͯ͢༏Ε͍ͯΔ • σϓϩΠͷස • มߋͷϦʔυλΠϜ
• MTTR • มߋࣦഊ https://book.impress.co.jp/books/1118101029
ଌఆࢦඪͱଌఆํ๏ • Reliability • Developer Productivity • Cost • Security
• Platform
ଌఆࢦඪͱଌఆํ๏ ྖҬ ࢦඪ ଌఆํ๏ 3FMJBCJMJUZ .553 োൃੜ࣌ʹɺোใࠂϑϩʔͷ ඞਢ߲ͱͯ͠खಈͰܭଌ %FWFMPQFS1SPEVDUJWJUZ σϓϩΠճ
$*αʔϏεͷNFUSJDT %FWFMPQFS1SPEVDUJWJUZ σϓϩΠ࣌ؒ $*αʔϏεͷNFUSJDT %FWFMPQFS1SPEVDUJWJUZ $*҆ఆੑ $*αʔϏεͷNFUSJDT %FWFMPQFS1SPEVDUJWJUZ มߋࣦഊ ຊ൪ڥσϓϩΠʹରԠ͢Δϒϥϯ νͷ3FWFSUDPNNJUͷ
ଌఆ݁Ռ • MTTR • σϓϩΠճ • σϓϩΠ࣌ؒ • CI ҆ఆੑ
• มߋࣦഊ
MTTR Plot with Trendline
MTTR per half year
Histgram
MTTR ʹର͢Δߟ • ܭଌՄೳੑͷ • ࣗಈͰूܭͰ͖ΔΈ͕ඞཁ • ͦͷͨΊʹ Incident Response
ͷܕԽͱͦΕʹର͢Δ Tool ͕ඞཁ • SeverityʢIncident ͷ Level ఆٛʣ/ োൃੜɾݕɾ෮چͷ࣌ؒΛඞͣه͢ΔϧʔϧͳͲ • σʔλྔͷ • Πϯγσϯτ2Ͱ͔͕ͨͩ50ఔɺेͳσʔλྔ͕ಘΒΕͳ͍ • Πϯγσϯτ͕ଟ͍͜ͱ SRE ͷతͱ૬͢Δ • Β͖ͭͷ • σʔλྔ͕ेͰͳ͍ͱɺҰ෦ͷ࣌ؒোʹҾ͖ͣΒΕͯ͠·͏
MTTR ʹର͢Δߟ • ࢦඪͱͯ͠༗ӹ͔Ͳ͏͔·ͩஅͰ͖ͳ͍ • গͳ͘ͱҎԼͷͰ༗ӹͳͷͰτϥοΩϯάΛଓ͚Δ • Incident Response ͷܕԽ
• ࣌ؒΠϯγσϯτʹର͢Δվળ
ʲࢀߟʳIncident Metrics in SRE • MTTR ࢦඪʹ͖͢Ͱͳ͍ͱओு • ݅ෆͱΒ͖ͭͷେ͖͕͞ཧ༝ •
ͰͲΕΛ࠾༻͖͔͢ݴٴ͕ͳ͍ https://sre.google/resources/practices-and-processes/incident-metrics-in-sre/
ʲิʳDeveloper Productivity ྖҬͷܭଌର • monorepo Λ࠾༻ • master branch ͰෳͷΞϓϦ͕ಉ࣌ʹσϓϩΠ͞ΕΔ
• Database Λڞ༗͢Δ Distributed monolith ͱͳ͍ͬͯΔ • ͜ΕΒجຊతʹि࣍ϦϦʔε͞ΕΔ • ͜ΕҎ֎ͷ microservices ݸผͰϦϦʔε͞ΕΔ͕ɺࠓճ ܭଌର֎
σϓϩΠճ
σϓϩΠճʹର͢Δߟ • جຊతʹ Weekly Release Ͱ͋ΔͨΊɺʹ26ճඞͣσϓϩΠ ͞ΕΔ • Γ HOTFIX
• ԿͷͨΊͷ HOTFIX ͔ʁ • มߋࣦഊʢޙड़ʣͱ߹ΘͤͯΈͳ͍ͱҙຯ͕ബͦ͏ • ࣮ࡍ2020લ Production ͷ Kubernetes manifest มߋͷͨΊͷ HOTFIX ͕ଟ͔ͬͨ • σϓϩΠ͕ݮগͳͷ Microservices Խ͍ͯ͠Δ͔Β • Microservices ΛؚΊͯܭଌ͢Δඞཁ͕͋Δ
σϓϩΠ࣌ؒ
σϓϩΠ࣌ؒʹର͢Δߟ • ະੳʢه༧ఆʣ • ͜ͷ࣌ؒมߋࣦഊ࣌ͷ Revert ͷ࣌ؒͱҰக͢ΔͷͰɺ ͘͢Ε͢Δ΄Ͳ MTTR ݮʹͭͳ͕Δͣ
CI ҆ఆੑ
CI ҆ఆੑʹؔ͢Δߟ • Time Window 7 Days • 30
Days, 90 Days ͳͲෳͷ Time Window Ͱܭଌͨ͠΄͏͕ྑ͍ • ຊ൪͚ͩͰͳ͘ɺ։ൃϒϥϯνಉ༷ʹܭଌ͢Δ͖ • ࢦඪͱͯ͠ෆద • جຊతʹ 100% ʹ͚ۙΕ͍ۙ΄Ͳྑ͍ • SLO ͱͯ͠ଊ͑ͯɺඪΛҧͨ͠Βࠜຊमਖ਼͢ΔΞϓϩʔν͕ྑ͍ • ͜ͷΛؚΉผͷࢦඪΛ༻͍ͨ΄͏͕ྑ͍ • Time To DeliveryʢมߋͷϦʔυλΠϜʣ • MTTR • ͨͩ͠ɺੳՄೳੑॏཁɻCI ͕ෆ҆ఆͳͱ͖ɺͲͷ Job ͕Ͳͷఔෆ҆ఆ͔ΛΔඞཁ͋Δ
มߋࣦഊ
มߋࣦഊ
มߋࣦഊʹؔ͢Δߟ • "มߋࣦഊ"ͷఆٛͷ • ԿΛͬͯ"มߋࣦഊ"ͱ͢Δ͔ͷఆ͕ٛඞཁ • Label ༩ͳͲͷӡ༻ϧʔϧ͕ͳ͍ͱܭଌ͕͍͠ • ܭଌํ๏ͷ
• ຊ൪ϒϥϯνͷ Revert "มߋࣦഊ"Ҏ֎Ͱى͖͍ͯͨ • Argo Rollouts Λ࠾༻͍ͯ͠Δ • ௨ৗ Canary Strategy Λ༗ޮʹ͍ͯ͠ͳ͍ • ॏཁػೳͳͲ Canary ͍ͨ͠ͱ͖͚ͩ༗ޮʹ͠ɺ100% ϦϦʔεͨ͠Β Revert ͍ͯͨ͠ • σʔλྔͷ • MTTR ಉ༷ͷ
·ͱΊͱߟ • SRE ۀΛ5ͭͷྖҬʹྨ͠ɺ͏ͪ2ͭͷྖҬ͔ΒɺʮLean ͱ DevOps ͷՊֶʯΛࢀߟʹɺࢦ ඪʹͳΓ͏Δ͔Λܭଌͨ͠ • ༗ޮͳࢦඪͷ݅
• ेʹσʔλྔ͕͋Δ͜ͱ • MTTR, มߋࣦഊσʔλྔΛಘΔ͜ͱ͕͍͠ • ͜ΕΒ͕සൃ͢Δঢ়ଶ SRE ͷతͱ͢Δ • ͦͷࢦඪΛؚΉଞͷࢦඪ͕ଘࡏ͠ͳ͍͜ͱ • CI ҆ఆੑ MTTR, Time To DeliveryʢมߋͷϦʔυλΠϜʣͰิ͑Δ • σϓϩΠ࣌ؒ MTTR, Time To DeliveryʢมߋͷϦʔυλΠϜʣͰิ͑Δ • σϓϩΠճΛ݈શʹ૿͢ʹมߋࣦഊͷܭଌ͕ඞཁ
·ͱΊͱߟ • MTTR 🚀 • τϥοΩϯάܧଓ • ܧଓతʹऔಘ͢ΔͨΊʹ Incident Response
ͷվળ͕ඞཁ • σϓϩΠճ🚀 • microservices ؚΊͯܭଌ • σϓϩΠ࣌ؒ🤔 • ։ൃϒϥϯνͷܭଌ͕ඞཁ • ظతʹ MTTR / มߋͷϦʔυλΠϜͰิ͏ • CI ҆ఆੑ🤔 • ։ൃϒϥϯνͷܭଌ͕ඞཁ • ظతʹ MTTR / มߋͷϦʔυλΠϜͰิ͏ • มߋࣦഊ🚀 • มߋࣦഊͷఆٛͱӡ༻ϧʔϧࡦఆ͕ඞཁ • มߋͷϦʔυλΠϜ🤔 • Develop branch Ͱͷ First commit ͔Β Production ͷ Code มߋ·ͰΛܭଌͰ͖Δͱྑ͍͕ɺม͕ଟ͘ɺΒ͖͕ͭେ͖͍Մೳੑ͕͋Δ
ࠓޙͷల • ݱঢ়ଌఆ͍ͯ͠ΔͷܧଓɺࣗಈԽΛࢦ͢ • "ࣦഊ"ʹؔ͢Δࢦඪ׆༻ͮ͠Β͍Մೳੑ͕͋Δ͕ɺܧଓͯ͠ܭ • MTTR, มߋࣦഊ • ਓ͕ؒؔΘΔϓϩηεͰܭଌͷͨΊʹఆٛɺϧʔϧɺن͕ඞཁ
• ଞͷྖҬʹؔͯ͠ࢦඪΛఏҊ͢Δ • ܭଌɾՄࢹԽͷσβΠϯύλʔϯͷཧΛ͍ͨ͠ • ୭͕ԿͰܭଌͯ͠ՄࢹԽͯࣗ͠తʹܧଓతվળ͕Ͱ͖ΔੈքΛࢦ͢
Thank you! chaspy chaspy_ Lead Software Engineer Site Reliability at
Quipper Takeshi Kondo