Engineering Large Systems When You're Not Google Or Facebook (test in prod)

lightning talk at Clever, 4/30/18

Charity Majors

April 30, 2018

Transcript

  1. Engineering Large Systems When You’re Not Google Or Facebook: Some Advice By Charity Majors
  2. None
  3. I blame this guy: Testing in production has gotten a bad rap.
  4. None
  5. how they think we are / how we really are

  6. but *why*?

  7. monitoring => observability; known unknowns => unknown unknowns; LAMP stack => distributed systems
  8. “Complexity is increasing” - Science

  9. Many catastrophic states exist at any given time. Your system is never entirely ‘up’.
  10. We are all distributed systems engineers now. The unknowns outstrip the knowns. Why does this matter more and more?
  11. Distributed systems are particularly hostile to being cloned or imitated (or monitored). (clients, concurrency, chaotic traffic patterns, edge cases …)
  12. Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments particularly worthless. This is a black hole for engineering time.
  13. test before prod: unit tests, integration tests, functional tests, basic failover … the basics. the simple stuff. known-unknowns
  14. test in prod: behavioral tests, experiments, load tests (!!), edge cases, canaries, rolling deploys, multi-region. unknown-unknowns
  15. test in staging? meh

  16. test before prod: unit tests, integration tests, functional tests. “What happens when …” (you know the answer). test in prod: behavioral tests, experiments, load tests (!!), edge cases, canaries, rolling deploys, multi-region. “What happens when …” (you don’t)
  17. Only production is production. You can ONLY verify the deploy for any env by deploying to that env.
  18. (1) Every deploy is a *unique* exercise of your process+code+system. (2) Deploy scripts are production code. If you’re using fabric or capistrano, this means you have fab/cap in production.
  19. Staging is not production.

  20. Why do people sink so much time into staging, when they can’t even tell if their own production environment is healthy or not?
  21. That energy is better used elsewhere: Production. You can catch 80% of the bugs with 20% of the effort. And you should. @caitie’s PWL talk: https://youtu.be/-3tw2MYYT0Q
  22. also build or use: feature flags (LaunchDarkly), high cardinality tooling (Honeycomb), canary canary canaries, shadow systems (goturbine, linkerd), capture/replay for databases (apiary, percona). plz dont build your own ffs
  23. Failure is not rare. Practice shipping and fixing lots of small problems. And practice on your users!!
  24. Failure: it’s “when”, not “if” (lots and lots and lots of “when’s”)
  25. Does everyone … know what normal looks like? know how to deploy? know how to roll back? know how to canary? know how to debug in production? Practice!!~
  26. None
  27. None
  28. None
  29. • Charity Majors @mipsytipsy
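
Slides 14 and 22 name feature flags and canaries as ways to test in prod. A minimal sketch of both ideas follows. It is not from the talk and is not how LaunchDarkly or any other listed tool works internally; the function names (`in_rollout`, `canary_verdict`), the thresholds, and the hashing scheme are all assumptions made for illustration. Per slide 22, a real system would use an existing tool rather than this.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: float) -> bool:
    """Hypothetical percentage rollout: True if this user is in the first
    `percent` (0-100) of buckets for `flag`. Hashing makes the decision
    stable per user, so the same user sees the same behavior on each request."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100


def canary_verdict(baseline_errors: int, baseline_total: int,
                   canary_errors: int, canary_total: int,
                   max_ratio: float = 2.0, min_requests: int = 100) -> str:
    """Hypothetical canary check: compare error rates of the canary and the
    baseline fleet and return 'wait', 'promote', or 'rollback'.
    `max_ratio` and `min_requests` are illustrative defaults, not recommendations."""
    if canary_total < min_requests or baseline_total < min_requests:
        return "wait"  # not enough traffic to judge either side
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    # Small absolute floor so a near-zero baseline does not trip on a single error.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return "rollback" if canary_rate > allowed else "promote"
```

Raising `percent` only ever adds users to the rollout, because each user's bucket is fixed; that is what lets a flag ramp from a small canary slice to everyone without flapping.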