Engineering Large Systems When You’re Not Google Or Facebook
Some Advice By Charity Majors
Slide 3
Slide 3 text
I blame this guy:
Testing in production has gotten a bad rap.
Slide 5
Slide 5 text
how they think we are
how we really are
Slide 6
Slide 6 text
but *why*?
Slide 7
Slide 7 text
monitoring => observability
known unknowns => unknown unknowns
LAMP stack => distributed systems
Slide 8
Slide 8 text
“Complexity is increasing” - Science
Slide 9
Slide 9 text
Many catastrophic states exist at any given time.
Your system is never entirely ‘up’
Slide 10
Slide 10 text
We are all distributed systems engineers now
the unknowns outstrip the knowns
why does this matter more and more?
Slide 11
Slide 11 text
Distributed systems are particularly hostile to being
cloned or imitated (or monitored).
(clients, concurrency, chaotic traffic patterns, edge cases …)
Slide 12
Slide 12 text
Distributed systems have an infinitely long list of
almost-impossible failure scenarios that make staging
environments particularly worthless.
this is a black hole for engineering time
Slide 13
Slide 13 text
test before prod:
unit tests
integration tests
functional tests
basic failover
… the basics.
the simple stuff.
known-unknowns
Slide 14
Slide 14 text
test in prod:
behavioral tests
experiments
load tests (!!)
edge cases
canaries
rolling deploys
multi-region
unknown-unknowns
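A canary only helps if something decides whether it looks healthy. Here is a minimal sketch of that decision, comparing the canary's error rate against the rest of the fleet. The names (`Stats`, `canary_verdict`) and every threshold are illustrative assumptions, not values from the talk.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Request and error counts observed over the same time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: Stats, canary: Stats,
                   min_requests: int = 500,   # assumed: minimum traffic before judging
                   max_ratio: float = 2.0,    # assumed: tolerated multiple of baseline
                   floor: float = 0.001) -> str:  # assumed: always tolerate 0.1% errors
    """Return "wait", "promote", or "rollback" for a canary host."""
    if canary.requests < min_requests:
        return "wait"  # not enough production traffic to say anything yet
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "promote" if canary.error_rate <= allowed else "rollback"
```

Usage: `canary_verdict(Stats(10000, 20), Stats(1000, 3))` compares a 0.3% canary error rate against an allowed 0.4% and returns `"promote"`. A real check would look at more than one signal (latency, saturation, per-customer breakdowns).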
Slide 15
Slide 15 text
test in staging?
meh
Slide 16
Slide 16 text
test before prod:
unit tests
integration tests
functional tests
“What happens when …”
(you know the answer)
test in prod:
behavioral tests
experiments
load tests (!!)
edge cases
canaries
rolling deploys
multi-region
“What happens when …”
(you don’t)
Slide 17
Slide 17 text
Only production is production.
You can ONLY verify the deploy for any env by deploying to that env
Slide 18
Slide 18 text
1. Every deploy is a *unique* exercise of your process+code+system
2. Deploy scripts are production code. If you’re using fabric or capistrano, this means you have fab/cap in production.
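If deploy scripts are production code, they deserve the same care as any other code path. A rough sketch of a rolling deploy that stops and rolls back when a batch fails its health check might look like this. The `deploy`, `healthy`, and `rollback` callables are hypothetical hooks you would supply; this is not Fabric's or Capistrano's API.

```python
from typing import Callable, Iterable, List


def rolling_deploy(hosts: Iterable[str],
                   deploy: Callable[[str], None],
                   healthy: Callable[[str], bool],
                   rollback: Callable[[str], None],
                   batch_size: int = 1) -> bool:
    """Deploy batch by batch; on the first unhealthy batch, undo everything."""
    pending: List[str] = list(hosts)
    done: List[str] = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for host in batch:
            deploy(host)
            done.append(host)
        if not all(healthy(host) for host in batch):
            # Roll back in reverse order, newest deploy first.
            for host in reversed(done):
                rollback(host)
            return False
    return True
```

Because the hooks are passed in, the script itself can be exercised with fakes before it ever touches a real host, which is one cheap way to treat it like production code.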
Slide 19
Slide 19 text
Staging is not production.
Slide 20
Slide 20 text
Why do people sink so much time into staging,
when they can’t even tell if their own
production environment is healthy or not?
Slide 21
Slide 21 text
That energy is better used elsewhere:
Production.
You can catch 80% of the bugs with 20% of the effort. And you should.
@caitie’s PWL talk: https://youtu.be/-3tw2MYYT0Q
Slide 22
Slide 22 text
also build or use:
feature flags (LaunchDarkly)
high cardinality tooling (Honeycomb)
canary canary canaries
shadow systems (goturbine, linkerd)
capture/replay for databases (apiary, Percona)
plz don’t build your own ffs
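To show what a feature flag does under the hood (not as an invitation to build your own), here is a minimal sketch of a percentage rollout. The function names and the hashing scheme are assumptions for illustration; this is not LaunchDarkly's API.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """True if this user falls inside the first rollout_percent buckets."""
    return bucket(flag, user_id) < rollout_percent
```

Hashing the flag name together with the user ID keeps each user's answer stable across requests, so raising the percentage only ever adds users, and different flags do not all land on the same unlucky subset.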
Slide 23
Slide 23 text
Failure is not rare
Practice shipping and fixing lots of small problems
And practice on your users!!
Slide 24
Slide 24 text
Failure: it’s “when”, not “if”
(lots and lots and lots of “when’s”)
Slide 25
Slide 25 text
Does everyone …
know what normal looks like?
know how to deploy?
know how to roll back?
know how to canary?
know how to debug in production?
Practice!!