Engineering Large Systems When You're Not Google Or Facebook (test in prod)
Charity Majors
April 30, 2018
lightning talk at Clever, 4/30/18
Transcript
Engineering Large Systems When You’re Not Google Or Facebook: Some Advice, by Charity Majors
I blame this guy: testing in production has gotten a bad rap.
how they think we are
how we really are
but *why*?
monitoring => observability
known unknowns => unknown unknowns
LAMP stack => distributed systems
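To make the monitoring => observability shift concrete: a pre-aggregated metric can only answer the questions you decided on in advance, while keeping raw, wide events lets you choose the grouping field at query time, including high-cardinality fields such as a user ID. The toy, in-memory sketch below assumes that model; the event fields (`user_id`, `endpoint`, `build_id`, `status`) and the helper `errors_by` are hypothetical, and this is not the API of any particular observability tool.

```python
from collections import Counter
from typing import Any, Dict, List

# A hypothetical "wide event": one record per request, with as many fields as you like.
Event = Dict[str, Any]


def errors_by(events: List[Event], field: str) -> Counter:
    """Count 5xx responses grouped by any field, chosen at query time."""
    return Counter(e.get(field) for e in events if e.get("status", 200) >= 500)


events: List[Event] = [
    {"user_id": "u1", "endpoint": "/pay", "build_id": "b41", "status": 200},
    {"user_id": "u2", "endpoint": "/pay", "build_id": "b42", "status": 502},
    {"user_id": "u2", "endpoint": "/cart", "build_id": "b42", "status": 500},
    {"user_id": "u3", "endpoint": "/pay", "build_id": "b41", "status": 200},
]

# Slice the same raw events three different ways after the fact.
by_endpoint = errors_by(events, "endpoint")
by_user = errors_by(events, "user_id")
by_build = errors_by(events, "build_id")
```

A single error-count metric would report "2 errors". Grouping the raw events shows that every error came from one user and one build, which is the kind of unknown-unknown you only find if you kept the detail.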
“Complexity is increasing” - Science
Many catastrophic states exist at any given time. Your system is never entirely ‘up’.
We are all distributed systems engineers now. The unknowns outstrip the knowns. Why does this matter more and more?
Distributed systems are particularly hostile to being cloned or imitated (or monitored): clients, concurrency, chaotic traffic patterns, edge cases …
Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments particularly worthless. This is a black hole for engineering time.
test before prod (known-unknowns): unit tests, integration tests, functional tests, basic failover … the basics, the simple stuff.
test in prod (unknown-unknowns): behavioral tests, experiments, load tests (!!), edge cases, canaries, rolling deploys, multi-region.
test in staging? meh
test before prod: unit tests, integration tests, functional tests. “What happens when …” (you know the answer)
test in prod: behavioral tests, experiments, load tests (!!), edge cases, canaries, rolling deploys, multi-region. “What happens when …” (you don’t)
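Rolling deploys are one of the test-in-prod techniques on the slide above. A minimal sketch of the idea in Python, assuming hypothetical `deploy` and `is_healthy` callbacks supplied by your own tooling (they do not come from any specific deploy system): ship to a small batch, check that batch in production, and only then widen the blast radius.

```python
from typing import Callable, List, Tuple


def rolling_deploy(
    hosts: List[str],
    batch_size: int,
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    """Deploy to `hosts` a batch at a time, halting at the first unhealthy batch.

    Returns (hosts that received the new version, whether the rollout completed).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    deployed: List[str] = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            deploy(host)
            deployed.append(host)
        # Verify this batch in production before touching the next one.
        if not all(is_healthy(host) for host in batch):
            return deployed, False  # caller decides whether to roll these back
    return deployed, True


# Usage: web-3 comes up unhealthy, so the rollout stops after the second batch.
log: List[str] = []
hosts = [f"web-{i}" for i in range(1, 7)]
done, ok = rolling_deploy(hosts, 2, log.append, lambda h: h != "web-3")
# done == ["web-1", "web-2", "web-3", "web-4"], ok is False; web-5 and web-6 keep the old version
```

Returning the partially deployed hosts rather than rolling back automatically keeps the sketch small; real tooling would usually wire the rollback in, which is exactly the "know how to roll back" muscle the talk asks you to practice.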
Only production is production. You can ONLY verify the deploy for any env by deploying to that env.
1. Every deploy is a *unique* exercise of your process + code + system.
2. Deploy scripts are production code. If you’re using Fabric or Capistrano, this means you have fab/cap in production.
Staging is not production.
Why do people sink so much time into staging when they can’t even tell whether their own production environment is healthy?
That energy is better spent elsewhere: production. You can catch 80% of the bugs with 20% of the effort, and you should.
@caitie’s PWL talk: https://youtu.be/-3tw2MYYT0Q
Also, build or use:
feature flags (LaunchDarkly)
high-cardinality tooling (Honeycomb)
canary canary canaries, shadow systems (goturbine, linkerd)
capture/replay for databases (apiary, percona)
(plz don’t build your own ffs)
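Part of why feature flags make testing in production safer is that a flag can turn a new code path on for a small, stable slice of users. The sketch below shows one common way to do a deterministic percentage rollout by hashing the flag name and user ID into a bucket; the function names and the `new-checkout` flag are hypothetical, and this is not LaunchDarkly's API (which, per the slide, you should use rather than building your own).

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair deterministically to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce to 0..99.
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """True if this user falls inside the flag's rollout percentage."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    return bucket(flag_name, user_id) < rollout_percent


# Usage: turn the hypothetical flag on for roughly 10% of users.
users = [f"user-{i}" for i in range(10_000)]
share = sum(is_enabled("new-checkout", u, 10) for u in users) / len(users)
```

Hashing instead of picking users at random means a given user sees the same behavior on every request, and raising the percentage only adds users to the enabled set, so anyone already in the experiment stays in it.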
Failure is not rare. Practice shipping and fixing lots of small problems. And practice on your users!!
Failure: it’s “when”, not “if” (lots and lots and lots of “whens”).
Does everyone …
know what normal looks like?
know how to deploy?
know how to roll back?
know how to canary?
know how to debug in production?
Practice!!
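“Know how to canary” and “know how to roll back” come down to one decision made over and over: does the canary look worse than the baseline? A minimal sketch in Python, assuming you can already count requests and errors for the baseline fleet and for the canary. The thresholds (`max_ratio=1.5`, `min_requests=500`, and the 0.1% `floor`) are illustrative assumptions, not recommendations from the talk.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: Stats,
    canary: Stats,
    max_ratio: float = 1.5,
    min_requests: int = 500,
    floor: float = 0.001,
) -> str:
    """Return "wait", "promote", or "roll back" for a canary deploy."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic yet to judge
    # The floor keeps a zero-error baseline from making any single canary error fatal.
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "roll back" if canary.error_rate > allowed else "promote"


baseline = Stats(requests=10_000, errors=50)  # 0.5% errors
print(canary_verdict(baseline, Stats(requests=100, errors=0)))     # wait
print(canary_verdict(baseline, Stats(requests=1_000, errors=5)))   # promote
print(canary_verdict(baseline, Stats(requests=1_000, errors=20)))  # roll back
```

Comparing against the live baseline, rather than a fixed error budget, is what makes this a canary check: it asks whether the new build is worse than what is running right now, under the same production traffic.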
Charity Majors @mipsytipsy