Charity Majors
April 30, 2018
Engineering Large Systems When You're Not Google Or Facebook (test in prod)
lightning talk at Clever, 4/30/18
Transcript
Engineering Large Systems When You’re Not Google Or Facebook
Some Advice By Charity Majors
I blame this guy:
Testing in production has gotten a bad rap.
how they think we are
how we really are
but *why*?
monitoring => observability
known unknowns => unknown unknowns
LAMP stack => distributed systems
“Complexity is increasing” - Science
Many catastrophic states exist at any given time.
Your system is never entirely ‘up’
We are all distributed systems engineers now
the unknowns outstrip the knowns
why does this matter more and more?
Distributed systems are particularly hostile to being cloned or imitated (or monitored).
(clients, concurrency, chaotic traffic patterns, edge cases …)
Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments particularly worthless.
this is a black hole for engineering time
test before prod:
unit tests, integration tests, functional tests, basic failover
… the basics. the simple stuff.
known-unknowns
test in prod:
behavioral tests, experiments, load tests (!!), edge cases, canaries, rolling deploys, multi-region
unknown-unknowns
test in staging? meh
test before prod:
unit tests, integration tests, functional tests
“What happens when …” (you know the answer)
test in prod:
behavioral tests, experiments, load tests (!!), edge cases, canaries, rolling deploys, multi-region
“What happens when …” (you don’t)
Only production is production.
You can ONLY verify the deploy for any env by deploying to that env
1. Every deploy is a *unique* exercise of your process+code+system
2. Deploy scripts are production code. If you’re using fabric or capistrano, this means you have fab/cap in production.
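If deploy scripts are production code, they can be written and tested like any other code. A minimal sketch of a rolling deploy that stops at the first unhealthy batch. This is not fabric or capistrano; `rolling_deploy`, `deploy`, and `healthy` are hypothetical names, and the two callables stand in for whatever your real tooling does:

```python
from typing import Callable, List, Sequence, Tuple


def rolling_deploy(
    hosts: Sequence[str],
    batch_size: int,
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Deploy batch by batch and stop at the first batch with an unhealthy host.

    Returns (hosts that received the new version, hosts that failed the check).
    `deploy` and `healthy` are assumptions: plug in your own deploy step and
    health check.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    deployed: List[str] = []
    for start in range(0, len(hosts), batch_size):
        batch = list(hosts[start:start + batch_size])
        for host in batch:
            deploy(host)
        deployed.extend(batch)

        # Check the batch only after it has taken the new version.
        failed = [host for host in batch if not healthy(host)]
        if failed:
            # Halt the rollout; the remaining hosts keep the old version.
            return deployed, failed
    return deployed, []
```

With five hosts, a batch size of 2, and a health check that fails for the third host, the rollout stops after the second batch and the fifth host is never touched.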
Staging is not production.
Why do people sink so much time into staging, when they can’t even tell if their own production environment is healthy or not?
That energy is better used elsewhere: Production.
You can catch 80% of the bugs with 20% of the effort. And you should.
@caitie’s PWL talk: https://youtu.be/-3tw2MYYT0Q
also build or use:
feature flags (launch darkly)
high cardinality tooling (honeycomb)
canary canary canaries, shadow systems (goturbine, linkerd)
capture/replay for databases (apiary, percona)
plz dont build your own ffs
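To make the feature-flag idea concrete, here is a rough sketch of a percentage rollout with stable per-user bucketing. It only illustrates the mechanism and is not LaunchDarkly's API; per the slide, use an existing tool rather than building your own. `bucket` and `is_enabled` are hypothetical names:

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    key = f"{flag_name}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the rollout percentage.

    Hashing keeps the answer stable for a given user, so raising the
    percentage only ever adds users; nobody flips back and forth.
    """
    return bucket(flag_name, user_id) < rollout_percent
```

Shipping code dark at 0 percent and then dialing the percentage up is one way to test in production on a small slice of real traffic.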
Failure is not rare
Practice shipping and fixing lots of small problems
And practice on your users!!
Failure: it’s “when”, not “if”
(lots and lots and lots of “when’s”)
Does everyone …
know what normal looks like?
know how to deploy?
know how to roll back?
know how to canary?
know how to debug in production?
Practice!!
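One way to practice the canary and rollback questions is to write the decision down as code. A minimal sketch that compares a canary's error rate with the baseline ("what normal looks like"). The names and thresholds (`Stats`, `canary_verdict`, `min_requests`, `max_ratio`, `floor`) are assumptions for illustration, not anything from the talk:

```python
from dataclasses import dataclass


@dataclass
class Stats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat "no traffic yet" as a zero error rate.
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: Stats,
    canary: Stats,
    min_requests: int = 100,
    max_ratio: float = 2.0,
    floor: float = 0.001,
) -> str:
    """Return "wait", "promote", or "rollback" for a canary.

    All thresholds are illustrative assumptions; tune them to your traffic.
    """
    if canary.requests < min_requests:
        return "wait"  # not enough canary traffic to judge
    # Allow the canary up to max_ratio times the baseline error rate, with
    # a small floor so a zero-error baseline does not fail every canary.
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "rollback" if canary.error_rate > allowed else "promote"
```

For example, against a baseline of 50 errors in 10,000 requests, a canary with 20 errors in 500 requests gets "rollback", while one with 3 errors in 500 gets "promote".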
• Charity Majors @mipsytipsy