Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Servers are doomed to fail
Search
JBD
May 17, 2019
Technology
3
1.6k
Servers are doomed to fail
JBD
May 17, 2019
Tweet
Share
More Decks by JBD
See All by JBD
eBPF in Microservices Observability at eBPF Day
rakyll
1
2.2k
eBPF in Microservices Observability
rakyll
1
1.7k
OpenTelemetry at AWS
rakyll
1
1.9k
Debugging Code Generation in Go
rakyll
5
1.6k
Are you ready for production?
rakyll
8
2.9k
Serverless Containers
rakyll
1
270
Critical Path Analysis
rakyll
0
680
Monitoring and Debugging Containers
rakyll
2
1.1k
CPDD
rakyll
0
4.2k
Other Decks in Technology
See All in Technology
セキュリティ はじめの一歩
nikinusu
0
1.5k
IaaS/SaaS管理における SREの実践 - SRE Kaigi 2026
bbqallstars
4
1.6k
ZOZOにおけるAI活用の現在 ~開発組織全体での取り組みと試行錯誤~
zozotech
PRO
4
4.7k
小さく始めるBCP ― 多プロダクト環境で始める最初の一歩
kekke_n
1
310
データの整合性を保ちたいだけなんだ
shoheimitani
6
2.5k
マーケットプレイス版Oracle WebCenter Content For OCI
oracle4engineer
PRO
5
1.5k
Bill One 開発エンジニア 紹介資料
sansan33
PRO
4
17k
システムのアラート調査をサポートするAI Agentの紹介/Introduction to an AI Agent for System Alert Investigation
taddy_919
2
1.6k
ファインディの横断SREがTakumi byGMOと取り組む、セキュリティと開発スピードの両立
rvirus0817
1
1k
SREのプラクティスを用いた3領域同時 マネジメントへの挑戦 〜SRE・情シス・セキュリティを統合した チーム運営術〜
coconala_engineer
2
550
しろおびセキュリティへ ようこそ
log0417
0
290
SREが向き合う大規模リアーキテクチャ 〜信頼性とアジリティの両立〜
zepprix
0
380
Featured
See All Featured
We Are The Robots
honzajavorek
0
160
The Limits of Empathy - UXLibs8
cassininazir
1
210
How to Get Subject Matter Experts Bought In and Actively Contributing to SEO & PR Initiatives.
livdayseo
0
54
Digital Projects Gone Horribly Wrong (And the UX Pros Who Still Save the Day) - Dean Schuster
uxyall
0
280
<Decoding/> the Language of Devs - We Love SEO 2024
nikkihalliwell
1
120
Automating Front-end Workflow
addyosmani
1371
200k
Building an army of robots
kneath
306
46k
Code Reviewing Like a Champion
maltzj
527
40k
How STYLIGHT went responsive
nonsquared
100
6k
DBのスキルで生き残る技術 - AI時代におけるテーブル設計の勘所
soudai
PRO
61
49k
Writing Fast Ruby
sferik
630
62k
Taking LLMs out of the black box: A practical guide to human-in-the-loop distillation
inesmontani
PRO
3
2k
Transcript
Servers are doomed to fail Jaana B. Dogan
[email protected]
@rakyll
Serverless is also doomed to fail Jaana B. Dogan
[email protected]
@rakyll
Systems are doomed to fail Jaana B. Dogan
[email protected]
@rakyll
Is failure OK? Is failure an unexpected case?
Failure is not an exception. Systems change all the time.
“I haven’t touched the code for a century, it should
just work.” Said no one ever.
Failure is expected. Yes, it is.
None
@rakyll monitoring debugging postmortem
Monitoring is about saying if something is broken.
“99.99% of the requests should return in 100ms.”
@rakyll
@rakyll
Debugging
Debugging is collaborative.
Debugging comes in flavors. Logs Traces Metrics ...
Postmortems
Postmortems
Postmortems
Blameless? Focus on identifying problems.
Collaboration Design for collaboration.
Design for failure Set SLOs, plan for instrumentation, plan for
debugging.
Cross-stack debugging Accountability across stack with high cardinality data. speakerdeck.com/rakyll/rpc-metrics-at-google
Correlation Jump from monitoring/debugging data to data.
On-call debugging Jump from distributed tracing data to on-call information.
who to page?
Dynamic collection Capability to enable more collection in production when
needed.
Continuous collection Continuously collect signals, generate fleet-wide analysis reports.
Introspection Introspection pages provided from the services.
@rakyll monitoring debugging postmortem
Thank you Jaana B. Dogan Google
[email protected]