Engineering Large Systems When You're Not Google Or Facebook (test in prod)

Slide 1

Slide 1 text

Engineering Large Systems When You’re Not Google Or Facebook
Some Advice By Charity Majors

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

Testing in production has gotten a bad rap.
I blame this guy:

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

how they think we are
how we really are

Slide 6

Slide 6 text

but *why*?

Slide 7

Slide 7 text

monitoring => observability
known unknowns => unknown unknowns
LAMP stack => distributed systems
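One way to picture the shift from known unknowns to unknown unknowns: instead of a fixed dashboard counter decided in advance, keep the raw events and slice them by whatever field turns out to matter. The sketch below is illustrative only; the event fields (`status`, `user_id`, `build`) and the helper `error_rate_by` are hypothetical names, not from the talk or any particular tool.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable

# Hypothetical raw request events; a real system would have many more fields.
events = [
    {"status": 200, "user_id": "u1", "build": "a1"},
    {"status": 500, "user_id": "u2", "build": "b2"},
    {"status": 200, "user_id": "u3", "build": "a1"},
    {"status": 503, "user_id": "u2", "build": "b2"},
]


def error_rate_by(events: Iterable[Dict[str, Any]], field: str) -> Dict[Any, float]:
    """Group events by an arbitrary field and return the 5xx rate per group."""
    totals: Dict[Any, int] = defaultdict(int)
    errors: Dict[Any, int] = defaultdict(int)
    for event in events:
        key = event.get(field)
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# The question is chosen after the fact, not when the dashboard was built.
by_build = error_rate_by(events, "build")
by_user = error_rate_by(events, "user_id")
```

The point of the sketch is that `field` is a parameter: the same data answers questions nobody thought to ask ahead of time.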

Slide 8

Slide 8 text

“Complexity is increasing” - Science

Slide 9

Slide 9 text

Many catastrophic states exist at any given time. Your system is never entirely ‘up’

Slide 10

Slide 10 text

why does this matter more and more?
We are all distributed systems engineers now
the unknowns outstrip the knowns

Slide 11

Slide 11 text

Distributed systems are particularly hostile to being cloned or imitated (or monitored). (clients, concurrency, chaotic traffic patterns, edge cases …)

Slide 12

Slide 12 text

Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments particularly worthless.
this is a black hole for engineering time

Slide 13

Slide 13 text

test before prod:
unit tests
integration tests
functional tests
basic failover
… the basics. the simple stuff. known-unknowns

Slide 14

Slide 14 text

test in prod:
behavioral tests
experiments
load tests (!!)
edge cases
canaries
rolling deploys
multi-region
unknown-unknowns
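To make "canaries" concrete, here is a minimal sketch of a canary check: send a small slice of traffic to the new build, compare its error rate with the baseline fleet, and decide whether to keep going. The thresholds (`min_requests`, `max_ratio`, `floor`) and the three-way verdict are assumptions chosen for illustration, not a policy from the talk.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero when a group has seen no traffic yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: Stats, canary: Stats,
                   min_requests: int = 500,
                   max_ratio: float = 2.0,
                   floor: float = 0.001) -> str:
    """Return 'wait', 'promote', or 'rollback' (hypothetical policy)."""
    # Too little canary traffic to say anything yet.
    if canary.requests < min_requests:
        return "wait"
    # Allow the canary up to max_ratio times the baseline error rate,
    # with a small floor so a perfectly clean baseline is not an impossible bar.
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "promote" if canary.error_rate <= allowed else "rollback"
```

For example, with a baseline of 20 errors in 10,000 requests, a canary with 3 errors in 1,000 requests stays under the allowed rate and is promoted, while one with 10 errors in 1,000 is rolled back.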

Slide 15

Slide 15 text

test in staging? meh

Slide 16

Slide 16 text

test before prod:
unit tests
integration tests
functional tests
“What happens when …” (you know the answer)

test in prod:
behavioral tests
experiments
load tests (!!)
edge cases
canaries
rolling deploys
multi-region
“What happens when …” (you don’t)

Slide 17

Slide 17 text

Only production is production. You can ONLY verify the deploy for any env by deploying to that env

Slide 18

Slide 18 text

1. Every deploy is a *unique* exercise of your process + code + system
2. Deploy scripts are production code. If you’re using fabric or capistrano, this means you have fab/cap in production.

Slide 19

Slide 19 text

Staging is not production.

Slide 20

Slide 20 text

Why do people sink so much time into staging, when they can’t even tell if their own production environment is healthy or not?

Slide 21

Slide 21 text

That energy is better used elsewhere: Production. You can catch 80% of the bugs with 20% of the effort. And you should. @caitie’s PWL talk: https://youtu.be/-3tw2MYYT0Q

Slide 22

Slide 22 text

also build or use:
feature flags (launch darkly)
high cardinality tooling (honeycomb)
canary canary canaries, shadow systems (goturbine, linkerd)
capture/replay for databases (apiary, percona)
plz dont build your own ffs
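For readers who have not used feature flags, the core idea of a percentage rollout fits in a few lines. This is only a sketch of the concept, not how LaunchDarkly or any other product works internally, and (per the slide) not a suggestion to build your own. The names `bucket` and `is_enabled` and the flag name in the usage note are hypothetical.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """True if this user falls inside the first rollout_percent buckets."""
    # The same user always lands in the same bucket for a given flag, so
    # raising rollout_percent only ever adds users, it never flips anyone off.
    return bucket(flag, user_id) < rollout_percent
```

A call such as `is_enabled("new-checkout", "user-42", 10)` would expose the new code path to roughly a tenth of users in production, and setting the percentage back to 0 turns it off without a deploy.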

Slide 23

Slide 23 text

Failure is not rare
Practice shipping and fixing lots of small problems
And practice on your users!!

Slide 24

Slide 24 text

Failure: it’s “when”, not “if” (lots and lots and lots of “when’s”)

Slide 25

Slide 25 text

Does everyone …
know what normal looks like?
know how to deploy?
know how to roll back?
know how to canary?
know how to debug in production?
Practice!!~
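As a way to practice the "deploy" and "roll back" questions above, here is a toy rolling deploy: upgrade hosts one batch at a time, check health after each batch, and restore the previous versions on the first failure. Everything here is a simplified assumption for illustration: the `versions` dict stands in for real hosts, and `is_healthy` is a hypothetical callback you would replace with a real check.

```python
from typing import Callable, Dict, List


def rolling_deploy(hosts: List[str],
                   versions: Dict[str, str],
                   new_version: str,
                   is_healthy: Callable[[str], bool],
                   batch_size: int = 1) -> bool:
    """Upgrade hosts batch by batch; roll back everything on the first unhealthy host."""
    previous = dict(versions)  # snapshot of what was running, used for rollback
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            versions[host] = new_version
        # Stop at the first bad batch so the blast radius stays small.
        if not all(is_healthy(host) for host in batch):
            versions.clear()
            versions.update(previous)
            return False
    return True
```

With three hosts and a health check that fails on the second one, the function returns `False`, the third host is never touched, and `versions` ends up exactly as it started.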