Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Engineering Large Systems When You're Not Googl...
Search
Charity Majors
April 30, 2018
Technology
20
5.6k
Engineering Large Systems When You're Not Google Or Facebook (test in prod)
lightning talk at Clever, 4/30/18
Charity Majors
April 30, 2018
Tweet
Share
More Decks by Charity Majors
See All by Charity Majors
In Praise of "Normal" Engineers (LDX3)
charity
4
2.5k
In Praise of "Normal" Engineers (with full speaker notes)
charity
1
180
AIOps: Prove It! (An Open Letter to Vendors Selling AI for SREs)
charity
1
24
SRECon 2024 Keynote: Is It Already Time To Version Observability? (Signs Point To Yes)
charity
3
470
CTO Craft Con Keynote: Observability is due for a version change: are you ready for it?
charity
4
1.3k
Case Studies: Modern Development Practices In Highly Regulated Environments
charity
6
4.1k
Compliance & Regulatory Standards Are NOT Incompatible With Modern Development Best Practices
charity
7
6.1k
Perils, Pitfalls and Pratfalls of Platform Engineering (QCon NYC, 2023)
charity
1
390
The Death of DevOps Has Been Greatly Exaggerated, but Platform Engineering Is Here To Stay
charity
2
570
Other Decks in Technology
See All in Technology
信頼できる開発プラットフォームをどう作るか?-Governance as Codeと継続的監視/フィードバックが導くPlatform Engineeringの進め方
yuriemori
1
200
AIドリブンのソフトウェア開発 - うまいやり方とまずいやり方
okdt
PRO
7
120
[kickflow]20250319_少人数チームでのAutify活用
otouhujej
0
170
歴代のWeb Speed Hackathonの出題から考えるデグレしないパフォーマンス改善
shuta13
6
530
プロジェクトマネジメントは不確実性との対話だ
hisashiwatanabe
0
160
生成AI利用プログラミング:誰でもプログラムが書けると 世の中どうなる?/opencampus202508
okana2ki
0
160
「Roblox」の開発環境とその効率化 ~DAU9700万人超の巨大プラットフォームの開発 事始め~
keitatanji
0
150
モノレポにおけるエラー管理 ~Runbook自動生成とチームメンションの最適化
biwashi
0
380
AI時代の大規模データ活用とセキュリティ戦略
ken5scal
1
260
オブザーバビリティ文化を組織に浸透させるには / install observability culture
mackerelio
0
340
o11yツールを乗り換えた話
tak0x00
2
1.7k
会社にデータエンジニアがいることでできるようになること
10xinc
7
970
Featured
See All Featured
Rails Girls Zürich Keynote
gr2m
95
14k
VelocityConf: Rendering Performance Case Studies
addyosmani
332
24k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
139
34k
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
10
1k
GraphQLとの向き合い方2022年版
quramy
49
14k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
251
21k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
49
3k
Fashionably flexible responsive web design (full day workshop)
malarkey
407
66k
The Cult of Friendly URLs
andyhume
79
6.5k
Building Adaptive Systems
keathley
43
2.7k
How GitHub (no longer) Works
holman
314
140k
Bootstrapping a Software Product
garrettdimon
PRO
307
110k
Transcript
Engineering Large Systems When You’re Not Google Or Facebook Some
Advice By Charity Majors
None
I blame this guy: Testing in production has gotten a
bad rap.
None
how they think we are how we really are
but *why*?
monitoring => observability known unknowns => unknown unknowns LAMP stack
=> distributed systems
“Complexity is increasing” - Science
Many catastrophic states exist at any given time. Your system
is never entirely ‘up’
We are all distributed systems engineers now the unknowns outstrip
the knowns why does this matter more and more?
Distributed systems are particularly hostile to being cloned or imitated
(or monitored). (clients, concurrency, chaotic traffic patterns, edge cases …)
Distributed systems have an infinitely long list of almost-impossible failure
scenarios that make staging environments particularly worthless. this is a black hole for engineering time
unit tests integration tests functional tests basic failover test before
prod: … the basics. the simple stuff. known-unknowns
behavioral tests experiments load tests (!!) edge cases canaries rolling
deploys multi-region test in prod: unknown-unknowns
test in staging? meh
unit tests integration tests functional tests “What happens when …”
(you know the answer) “What happens when …” (you don’t) behavioral tests experiments load tests (!!) edge cases canaries rolling deploys multi-region test before prod: test in prod:
Only production is production. You can ONLY verify the deploy
for any env by deploying to that env
1. Every deploy is a *unique* exercise of your process+
code+system 2. Deploy scripts are production code. If you’re using fabric or capistrano, this means you have fab/cap in production.
Staging is not production.
Why do people sink so much time into staging, when
they can’t even tell if their own production environment is healthy or not?
That energy is better used elsewhere: Production. You can catch
80% of the bugs with 20% of the effort. And you should. @caitie’s PWL talk: https://youtu.be/-3tw2MYYT0Q
feature flags (launch darkly) high cardinality tooling (honeycomb) canary canary
canaries, shadow systems (goturbine, linkerd) capture/replay for databases (apiary, percona) also build or use: plz dont build your own ffs
Failure is not rare Practice shipping and fixing lots of
small problems And practice on your users!!
Failure: it’s “when”, not “if” (lots and lots and lots
of “when’s”)
Does everyone … know what normal looks like? know how
to deploy? know how to roll back? know how to canary? know how to debug in production? Practice!!~
None
None
None
• Charity Majors @mipsytipsy