Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Engineering Large Systems When You're Not Googl...
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Charity Majors
April 30, 2018
Technology
5.6k
20
Share
Engineering Large Systems When You're Not Google Or Facebook (test in prod)
lightning talk at Clever, 4/30/18
Charity Majors
April 30, 2018
More Decks by Charity Majors
See All by Charity Majors
The Twin Mandate of Observability
charity
4
2.2k
In Praise of "Normal" Engineers (LDX3)
charity
4
2.8k
In Praise of "Normal" Engineers (with full speaker notes)
charity
1
260
AIOps: Prove It! (An Open Letter to Vendors Selling AI for SREs)
charity
1
70
SRECon 2024 Keynote: Is It Already Time To Version Observability? (Signs Point To Yes)
charity
3
520
CTO Craft Con Keynote: Observability is due for a version change: are you ready for it?
charity
4
1.4k
Case Studies: Modern Development Practices In Highly Regulated Environments
charity
6
4.4k
Compliance & Regulatory Standards Are NOT Incompatible With Modern Development Best Practices
charity
7
6.2k
Perils, Pitfalls and Pratfalls of Platform Engineering (QCon NYC, 2023)
charity
1
450
Other Decks in Technology
See All in Technology
20260515 ログイン機能だけではないアカウント管理を全体で考える~サービス設計者向け~
oidfj
1
840
Claude Code / Codex / Kiro に AWS 権限を 渡すとき、何を設計すべきか
k_adachi_01
6
1.9k
「強制アップデート」か「チームの自律」か?エンタープライズが辿り着いたプラットフォームのハイブリッド運用/cloudnative-kaigi-hybrid-platform-operations
mhrtech
0
210
R&D 祭 2024 アニメエフェクト作成の効率化
olmdrd
PRO
0
100
Directions Asia 2026 | Beyond Buildable AI Agents: Let’s Visualize Partner Value in the AI Era
ryoheig0405
0
120
パーソルキャリア IT/テクノロジー職向け 会社紹介資料|Company Introduction Deck
techtekt
PRO
0
230
Purview 勉強会報告 Microsoft Purview 入門しようとしてみた
masakichixo
1
460
エンタープライズの厳格な制約を開発者に意識させない:クラウドネイティブ開発基盤設計/cloudnative-kaigi-golden-path
mhrtech
0
460
続 運用改善、不都合な真実 〜 物理制約のない運用改善はほとんど無価値 / 20260518-ssmjp-kaizen-no-value-without-physical-constraints
opelab
2
270
O'Reilly Infrastructure & Ops Superstream: Platform Engineering for Developers, Architects & the Rest of Us
syntasso
0
310
Swift Sequence の便利 API 再発見
treastrain
1
290
【関西製造業祭り2026春】現場を変える技術はここまで来た〜世界最大の製造業見本市から持って帰ってきたもの〜
tanakaseiya
0
190
Featured
See All Featured
The Spectacular Lies of Maps
axbom
PRO
1
750
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
12
1.1k
SEO Brein meetup: CTRL+C is not how to scale international SEO
lindahogenes
1
2.6k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
194
17k
Jamie Indigo - Trashchat’s Guide to Black Boxes: Technical SEO Tactics for LLMs
techseoconnect
PRO
0
140
Jess Joyce - The Pitfalls of Following Frameworks
techseoconnect
PRO
1
150
Accessibility Awareness
sabderemane
1
120
Facilitating Awesome Meetings
lara
57
6.9k
Documentation Writing (for coders)
carmenintech
77
5.3k
The Language of Interfaces
destraynor
162
26k
Introduction to Domain-Driven Design and Collaborative software design
baasie
1
790
Abbi's Birthday
coloredviolet
2
7.6k
Transcript
Engineering Large Systems When You’re Not Google Or Facebook Some
Advice By Charity Majors
None
I blame this guy: Testing in production has gotten a
bad rap.
None
how they think we are how we really are
but *why*?
monitoring => observability known unknowns => unknown unknowns LAMP stack
=> distributed systems
“Complexity is increasing” - Science
Many catastrophic states exist at any given time. Your system
is never entirely ‘up’
We are all distributed systems engineers now the unknowns outstrip
the knowns why does this matter more and more?
Distributed systems are particularly hostile to being cloned or imitated
(or monitored). (clients, concurrency, chaotic traffic patterns, edge cases …)
Distributed systems have an infinitely long list of almost-impossible failure
scenarios that make staging environments particularly worthless. this is a black hole for engineering time
unit tests integration tests functional tests basic failover test before
prod: … the basics. the simple stuff. known-unknowns
behavioral tests experiments load tests (!!) edge cases canaries rolling
deploys multi-region test in prod: unknown-unknowns
test in staging? meh
unit tests integration tests functional tests “What happens when …”
(you know the answer) “What happens when …” (you don’t) behavioral tests experiments load tests (!!) edge cases canaries rolling deploys multi-region test before prod: test in prod:
Only production is production. You can ONLY verify the deploy
for any env by deploying to that env
1. Every deploy is a *unique* exercise of your process+
code+system 2. Deploy scripts are production code. If you’re using fabric or capistrano, this means you have fab/cap in production.
Staging is not production.
Why do people sink so much time into staging, when
they can’t even tell if their own production environment is healthy or not?
That energy is better used elsewhere: Production. You can catch
80% of the bugs with 20% of the effort. And you should. @caitie’s PWL talk: https://youtu.be/-3tw2MYYT0Q
feature flags (launch darkly) high cardinality tooling (honeycomb) canary canary
canaries, shadow systems (goturbine, linkerd) capture/replay for databases (apiary, percona) also build or use: plz dont build your own ffs
Failure is not rare Practice shipping and fixing lots of
small problems And practice on your users!!
Failure: it’s “when”, not “if” (lots and lots and lots
of “when’s”)
Does everyone … know what normal looks like? know how
to deploy? know how to roll back? know how to canary? know how to debug in production? Practice!!~
None
None
None
• Charity Majors @mipsytipsy