Charity Majors
April 30, 2018
Engineering Large Systems When You're Not Google Or Facebook (test in prod)
lightning talk at Clever, 4/30/18
Transcript
Engineering Large Systems When You’re Not Google Or Facebook
Some Advice
By Charity Majors
Testing in production has gotten a bad rap. I blame this guy:
how they think we are
how we really are
but *why*?
monitoring => observability
known unknowns => unknown unknowns
LAMP stack => distributed systems
“Complexity is increasing” - Science
Many catastrophic states exist at any given time. Your system is never entirely ‘up’.
We are all distributed systems engineers now
the unknowns outstrip the knowns
why does this matter more and more?
Distributed systems are particularly hostile to being cloned or imitated (or monitored).
(clients, concurrency, chaotic traffic patterns, edge cases …)
Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments particularly worthless.
this is a black hole for engineering time
test before prod:
• unit tests
• integration tests
• functional tests
• basic failover
… the basics. the simple stuff. known-unknowns
test in prod:
• behavioral tests
• experiments
• load tests (!!)
• edge cases
• canaries
• rolling deploys
• multi-region
unknown-unknowns
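A canary, one of the test-in-prod items listed above, boils down to comparing a small slice of production running the new build against the rest. The sketch below is illustrative only and is not from the talk: `Stats`, `canary_verdict`, and the thresholds (`min_requests`, `max_ratio`, `floor`) are hypothetical names and values, and a real setup would compare more signals than error rate.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Request and error counts for one group of hosts (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero when a group has seen no traffic yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: Stats,
    canary: Stats,
    min_requests: int = 500,   # assumed: minimum canary traffic before judging
    max_ratio: float = 2.0,    # assumed: canary may err at most 2x the baseline
    floor: float = 0.001,      # assumed: tolerated rate when baseline is near 0
) -> str:
    """Return "wait", "promote", or "rollback" for the canary group."""
    if canary.requests < min_requests:
        return "wait"
    allowed = max(baseline.error_rate, floor) * max_ratio
    return "promote" if canary.error_rate <= allowed else "rollback"


if __name__ == "__main__":
    old = Stats(requests=10_000, errors=50)
    new = Stats(requests=1_000, errors=30)
    print(canary_verdict(old, new))
```

The `floor` parameter exists so that a baseline with zero errors does not force a rollback on the first canary error.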
test in staging? meh
test before prod:
• unit tests
• integration tests
• functional tests
“What happens when …” (you know the answer)
test in prod:
• behavioral tests
• experiments
• load tests (!!)
• edge cases
• canaries
• rolling deploys
• multi-region
“What happens when …” (you don’t)
Only production is production. You can ONLY verify the deploy for any env by deploying to that env.
1. Every deploy is a *unique* exercise of your process+code+system
2. Deploy scripts are production code. If you’re using fabric or capistrano, this means you have fab/cap in production.
Staging is not production.
Why do people sink so much time into staging, when they can’t even tell if their own production environment is healthy or not?
That energy is better used elsewhere: Production.
You can catch 80% of the bugs with 20% of the effort. And you should.
@caitie’s PWL talk: https://youtu.be/-3tw2MYYT0Q
also build or use:
• feature flags (launch darkly)
• high cardinality tooling (honeycomb)
• canary canary canaries, shadow systems (goturbine, linkerd)
• capture/replay for databases (apiary, percona)
plz dont build your own ffs
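To make the feature-flag idea concrete, here is a minimal sketch of a percentage rollout: each user is hashed into a stable bucket, and the flag is on for buckets below the rollout percentage. It only illustrates the concept and is not how any of the tools named above are implemented; `bucket`, `is_enabled`, and the flag name are hypothetical. In keeping with the slide's advice, prefer an existing tool over growing something like this yourself.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return bucket(flag, user_id) < rollout_percent


if __name__ == "__main__":
    # Hypothetical flag name; ramp from 5% to 50% without flipping anyone off.
    for pct in (5, 50):
        on = sum(is_enabled("new-checkout", f"user-{i}", pct) for i in range(10_000))
        print(f"{pct}% rollout -> {on} of 10000 users enabled")
```

Because the bucket depends only on the flag name and user id, a user who sees the feature at 5% keeps seeing it as the percentage is raised, and setting the percentage to 0 acts as a kill switch.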
Failure is not rare
Practice shipping and fixing lots of small problems
And practice on your users!!
Failure: it’s “when”, not “if” (lots and lots and lots of “when’s”)
Does everyone …
• know what normal looks like?
• know how to deploy?
• know how to roll back?
• know how to canary?
• know how to debug in production?
Practice!!~
• Charity Majors @mipsytipsy