Engineering Large Systems When You're Not Google Or Facebook (test in prod)
Charity Majors
April 30, 2018
lightning talk at Clever, 4/30/18
Transcript
Engineering Large Systems When You’re Not Google Or Facebook
Some Advice By Charity Majors
Testing in production has gotten a bad rap. I blame this guy.
How they think we are vs. how we really are.
But *why*?
monitoring => observability
known unknowns => unknown unknowns
LAMP stack => distributed systems
“Complexity is increasing” - Science
Many catastrophic states exist at any given time. Your system is never entirely ‘up’.
Why does this matter more and more?
We are all distributed systems engineers now. The unknowns outstrip the knowns.
Distributed systems are particularly hostile to being cloned or imitated (or monitored).
(clients, concurrency, chaotic traffic patterns, edge cases …)
Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments particularly worthless.
This is a black hole for engineering time.
test before prod:
unit tests, integration tests, functional tests, basic failover
… the basics. the simple stuff. known-unknowns
test in prod:
behavioral tests, experiments, load tests (!!), edge cases, canaries, rolling deploys, multi-region
unknown-unknowns
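One way a canary from the list above might be judged, as a minimal sketch: send a slice of real traffic to the new build, compare its error rate with the current fleet, then promote, roll back, or keep waiting. The `Stats` type, the `canary_verdict` function, and every threshold here are illustrative assumptions, not something from the talk or from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Request and error counts observed over the same time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero when no traffic has arrived yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: Stats, canary: Stats,
                   min_requests: int = 500,
                   max_ratio: float = 1.5,
                   floor: float = 0.001) -> str:
    """Return "wait", "promote", or "rollback" (all thresholds are made up)."""
    # Too little canary traffic to say anything meaningful.
    if canary.requests < min_requests:
        return "wait"
    # Allow the canary some slack relative to the baseline; the floor keeps
    # a near-zero baseline from making any single error a failure.
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "promote" if canary.error_rate <= allowed else "rollback"


if __name__ == "__main__":
    print(canary_verdict(Stats(10_000, 50), Stats(600, 3)))
    print(canary_verdict(Stats(10_000, 50), Stats(600, 30)))
```

A real check would usually look at more than error rate (latency, saturation, per-customer breakdowns), but the shape stays the same: small blast radius, explicit comparison, automatic decision.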
test in staging? meh
test before prod: “What happens when …” (you know the answer)
unit tests, integration tests, functional tests
test in prod: “What happens when …” (you don’t)
behavioral tests, experiments, load tests (!!), edge cases, canaries, rolling deploys, multi-region
Only production is production. You can ONLY verify the deploy for any env by deploying to that env.
1. Every deploy is a *unique* exercise of your process+code+system
2. Deploy scripts are production code. If you’re using fabric or capistrano, this means you have fab/cap in production.
Staging is not production.
Why do people sink so much time into staging, when they can’t even tell if their own production environment is healthy or not?
That energy is better used elsewhere: Production.
You can catch 80% of the bugs with 20% of the effort. And you should.
@caitie’s PWL talk: https://youtu.be/-3tw2MYYT0Q
also build or use:
feature flags (launch darkly)
high cardinality tooling (honeycomb)
canary canary canaries, shadow systems (goturbine, linkerd)
capture/replay for databases (apiary, percona)
plz dont build your own ffs
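To show what a feature flag buys you, here is a minimal sketch of a percentage rollout: each user is hashed into a stable bucket from 0 to 99, and the flag is on for buckets below the rollout percentage. The names `bucket` and `is_enabled` and the flag name are hypothetical; this is not the API of launch darkly or any other product, and per the slide above a hosted tool is the better choice in practice.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0..99."""
    # Hashing the flag name together with the user id means a user lands in
    # different buckets for different flags, but always the same bucket for
    # the same flag.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return bucket(flag, user_id) < rollout_percent


if __name__ == "__main__":
    # "new-checkout" is a made-up flag name for illustration.
    for user in ("alice", "bob", "carol"):
        print(user, is_enabled("new-checkout", user, 10))
```

Because the bucket is deterministic, raising the percentage only ever adds users and never flips someone who already had the feature back off, which is what makes a gradual ramp in production safe to reason about.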
Failure is not rare.
Practice shipping and fixing lots of small problems.
And practice on your users!!
Failure: it’s “when”, not “if” (lots and lots and lots of “when’s”)
Does everyone …
know what normal looks like?
know how to deploy?
know how to roll back?
know how to canary?
know how to debug in production?
Practice!!
• Charity Majors @mipsytipsy