Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Engineering Large Systems When You're Not Googl...
Search
Charity Majors
April 30, 2018
Technology
20
5.6k
Engineering Large Systems When You're Not Google Or Facebook (test in prod)
lightning talk at Clever, 4/30/18
Charity Majors
April 30, 2018
Tweet
Share
More Decks by Charity Majors
See All by Charity Majors
The Twin Mandate of Observability
charity
1
430
In Praise of "Normal" Engineers (LDX3)
charity
4
2.7k
In Praise of "Normal" Engineers (with full speaker notes)
charity
1
200
AIOps: Prove It! (An Open Letter to Vendors Selling AI for SREs)
charity
1
33
SRECon 2024 Keynote: Is It Already Time To Version Observability? (Signs Point To Yes)
charity
3
480
CTO Craft Con Keynote: Observability is due for a version change: are you ready for it?
charity
4
1.4k
Case Studies: Modern Development Practices In Highly Regulated Environments
charity
6
4.2k
Compliance & Regulatory Standards Are NOT Incompatible With Modern Development Best Practices
charity
7
6.1k
Perils, Pitfalls and Pratfalls of Platform Engineering (QCon NYC, 2023)
charity
1
400
Other Decks in Technology
See All in Technology
AI-ready"のための"データ基盤 〜 LLMOpsで事業貢献するための基盤づくり
ismk
0
120
DMMの検索システムをSolrからElasticCloudに移行した話
hmaa_ryo
0
370
累計5000万DLサービスの裏側 – LINEマンガのKotlinで挑む大規模 Server-side ETLの最適化
ldf_tech
0
190
20251106 Offers DeepDive 知識を民主化!あらゆる業務のスピードと品質を 改善するためのドキュメント自動更新・活用術
masashiyokota
1
220
AIの個性を理解し、指揮する
shoota
3
640
初海外がre:Inventだった人間の感じたこと
tommy0124
1
200
DSPy入門
tomehirata
6
900
龍昌餃子で理解するWebサーバーの並行処理モデル - 東葛.dev #9
kozy4324
1
110
Gov-JAWS4回_某団体でのAmazon Bedrock活用検証で見えた“使う側”の課題精度よりもリテラシー
takuma818t
0
110
LLM APIを2年間本番運用して苦労した話
ivry_presentationmaterials
10
8.5k
今のコンピュータ、AI にも Web にも 向いていないので 作り直そう!!
piacerex
0
660
AI時代に必要なデータプラットフォームの要件とは by @Kazaneya_PR / 20251107
kazaneya
PRO
4
730
Featured
See All Featured
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
32
1.7k
Designing for Performance
lara
610
69k
We Have a Design System, Now What?
morganepeng
54
7.9k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
140
34k
Learning to Love Humans: Emotional Interface Design
aarron
274
41k
How To Stay Up To Date on Web Technology
chriscoyier
791
250k
Building a Scalable Design System with Sketch
lauravandoore
463
33k
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
46
2.5k
Balancing Empowerment & Direction
lara
5
710
Become a Pro
speakerdeck
PRO
29
5.6k
The Art of Programming - Codeland 2020
erikaheidi
56
14k
Into the Great Unknown - MozCon
thekraken
40
2.1k
Transcript
Engineering Large Systems When You’re Not Google Or Facebook Some
Advice By Charity Majors
None
I blame this guy: Testing in production has gotten a
bad rap.
None
how they think we are how we really are
but *why*?
monitoring => observability known unknowns => unknown unknowns LAMP stack
=> distributed systems
“Complexity is increasing” - Science
Many catastrophic states exist at any given time. Your system
is never entirely ‘up’
We are all distributed systems engineers now the unknowns outstrip
the knowns why does this matter more and more?
Distributed systems are particularly hostile to being cloned or imitated
(or monitored). (clients, concurrency, chaotic traffic patterns, edge cases …)
Distributed systems have an infinitely long list of almost-impossible failure
scenarios that make staging environments particularly worthless. this is a black hole for engineering time
unit tests integration tests functional tests basic failover test before
prod: … the basics. the simple stuff. known-unknowns
behavioral tests experiments load tests (!!) edge cases canaries rolling
deploys multi-region test in prod: unknown-unknowns
test in staging? meh
unit tests integration tests functional tests “What happens when …”
(you know the answer) “What happens when …” (you don’t) behavioral tests experiments load tests (!!) edge cases canaries rolling deploys multi-region test before prod: test in prod:
Only production is production. You can ONLY verify the deploy
for any env by deploying to that env
1. Every deploy is a *unique* exercise of your process+
code+system 2. Deploy scripts are production code. If you’re using fabric or capistrano, this means you have fab/cap in production.
Staging is not production.
Why do people sink so much time into staging, when
they can’t even tell if their own production environment is healthy or not?
That energy is better used elsewhere: Production. You can catch
80% of the bugs with 20% of the effort. And you should. @caitie’s PWL talk: https://youtu.be/-3tw2MYYT0Q
feature flags (launch darkly) high cardinality tooling (honeycomb) canary canary
canaries, shadow systems (goturbine, linkerd) capture/replay for databases (apiary, percona) also build or use: plz dont build your own ffs
Failure is not rare Practice shipping and fixing lots of
small problems And practice on your users!!
Failure: it’s “when”, not “if” (lots and lots and lots
of “when’s”)
Does everyone … know what normal looks like? know how
to deploy? know how to roll back? know how to canary? know how to debug in production? Practice!!~
None
None
None
• Charity Majors @mipsytipsy