Charity Majors
April 30, 2018
Engineering Large Systems When You're Not Google Or Facebook (test in prod)
lightning talk at Clever, 4/30/18
Transcript
Engineering Large Systems When You’re Not Google Or Facebook
Some Advice By Charity Majors
I blame this guy: Testing in production has gotten a bad rap.
how they think we are
how we really are
but *why*?
monitoring => observability
known unknowns => unknown unknowns
LAMP stack => distributed systems
“Complexity is increasing” - Science
Many catastrophic states exist at any given time.
Your system is never entirely ‘up’
We are all distributed systems engineers now
the unknowns outstrip the knowns
why does this matter more and more?
Distributed systems are particularly hostile to being cloned or imitated (or monitored).
(clients, concurrency, chaotic traffic patterns, edge cases …)
Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments particularly worthless.
this is a black hole for engineering time
test before prod: …
• unit tests
• integration tests
• functional tests
• basic failover
the basics. the simple stuff. known-unknowns
test in prod:
• behavioral tests
• experiments
• load tests (!!)
• edge cases
• canaries
• rolling deploys
• multi-region
unknown-unknowns
test in staging? meh
test before prod: “What happens when …” (you know the answer)
• unit tests
• integration tests
• functional tests
test in prod: “What happens when …” (you don’t)
• behavioral tests
• experiments
• load tests (!!)
• edge cases
• canaries
• rolling deploys
• multi-region
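The "canaries" item on the test-in-prod list can be illustrated with a small promote-or-rollback gate. This is a minimal sketch and not something shown in the talk: the names `Stats` and `canary_verdict`, the `max_ratio` and `min_requests` thresholds, and the 0.1% error-rate floor are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Request and error counts observed over the same time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero when a host has served no traffic yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: Stats, canary: Stats,
                   max_ratio: float = 2.0, min_requests: int = 500) -> str:
    """Return "wait", "promote", or "rollback" for a canary deploy.

    All thresholds are hypothetical defaults, not recommendations.
    """
    if canary.requests < min_requests:
        # Too little production traffic to judge the new build.
        return "wait"
    # Allow up to max_ratio times the baseline error rate, with a small
    # floor so a zero-error baseline does not make one canary error fatal.
    allowed = max(baseline.error_rate * max_ratio, 0.001)
    return "promote" if canary.error_rate <= allowed else "rollback"
```

A deploy script might call `canary_verdict(Stats(100_000, 100), Stats(1_000, 50))` after sending a slice of real traffic to the new build, and roll back when the answer is `"rollback"`. The point of the sketch is that the comparison uses live production traffic, which is the one thing a staging environment cannot supply.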
Only production is production.
You can ONLY verify the deploy for any env by deploying to that env
1. Every deploy is a *unique* exercise of your process+code+system
2. Deploy scripts are production code. If you’re using fabric or capistrano, this means you have fab/cap in production.
Staging is not production.
Why do people sink so much time into staging, when they can’t even tell if their own production environment is healthy or not?
That energy is better used elsewhere: Production.
You can catch 80% of the bugs with 20% of the effort. And you should.
@caitie’s PWL talk: https://youtu.be/-3tw2MYYT0Q
also build or use:
• feature flags (launch darkly)
• high cardinality tooling (honeycomb)
• canary canary canaries, shadow systems (goturbine, linkerd)
• capture/replay for databases (apiary, percona)
plz dont build your own ffs
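To make the feature-flag idea concrete, here is a sketch of a percentage rollout: a change ships dark and is switched on for a growing fraction of users. It only illustrates the concept. The function name `flag_enabled`, the hashing scheme, and the 10,000-bucket granularity are assumptions, not the API of any tool named above, and the slide's own advice is to use an existing tool rather than build one.

```python
import hashlib


def flag_enabled(flag: str, user_id: str, percent: float) -> bool:
    """Deterministically enable `flag` for roughly `percent`% of users.

    Hypothetical helper for illustration only.
    """
    # Hash flag and user together so each flag picks its own user subset.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    # Map the first 8 bytes to a stable bucket in the range 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # percent=12.5 enables buckets 0..1249, i.e. about 12.5% of users.
    return bucket < percent * 100
```

Because the bucket depends only on the flag name and the user id, a given user sees the same behavior on every request, and raising `percent` only adds users without removing any who were already enabled.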
Failure is not rare
Practice shipping and fixing lots of small problems
And practice on your users!!
Failure: it’s “when”, not “if” (lots and lots and lots of “when’s”)
Does everyone …
• know what normal looks like?
• know how to deploy?
• know how to roll back?
• know how to canary?
• know how to debug in production?
Practice!!~
• Charity Majors @mipsytipsy