Whether you run a web app, search for gravitational waves, or maintain a backup script: the reliability of your systems makes the difference between sweet dreams and production nightmares at 4am.
SOLID SNAKES / HYNEK SCHLAWACK
ATTITUDE
INCENTIVES
IMPORTANT VS URGENT
SIMPLICITY: "THE PRICE OF RELIABILITY IS THE PURSUIT OF THE UTMOST SIMPLICITY." (Sir C.A.R. Hoare)
NORMAL ACCIDENTS
ESSENTIAL
ESSENTIAL VS ACCIDENTAL
OPERATIONAL COMPLEXITY
[Diagram labels: your DC, Client, App, DB, Redis, Cache, CDN, Work Queue]
MICROSERVICES
[Diagram labels: Service 1, Service 2, Service 3, Service 4, Service 5, Service 6, Service 7, Service 8]
COMPLEXITY IS REALITY
PLAN FOR STUPIDITY
HUMAN ERRORS: "I DON’T BELIEVE IN HUMAN ERROR" (John Allspaw, CTO at Etsy)
ILLEGAL STATE
1. VALID AFTER INITIALIZATION
2. PREVENT MUTATION TO ILLEGAL
NO PARTIAL INITIALIZATION
    conn = Connection()
    conn.tls = True
    conn.connect("host.name")
NO PARTIAL INITIALIZATION: CLASSMETHOD FACTORIES
    conn = Connection.connect("host.name", tls=True)
NO PARTIAL INITIALIZATION: BUILDER PATTERN
    conn = ConnectionBuilder() \
        .for_hostname("host.name") \
        .with_tls(True) \
        .connect()
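The classmethod-factory slide can be sketched as a small runnable class. This is only an illustration: the slides show the call `Connection.connect("host.name", tls=True)` but not the class itself, so the constructor, the `transport` attribute, and the empty-host check are all assumptions.

```python
class Connection:
    """Hypothetical connection that is usable as soon as it exists."""

    def __init__(self, host: str, tls: bool, transport: object):
        # Callers are expected to go through Connection.connect() instead.
        self.host = host
        self.tls = tls
        self.transport = transport

    @classmethod
    def connect(cls, host: str, tls: bool = True) -> "Connection":
        # Validate before building anything, so no half-configured
        # object is ever handed to the caller.
        if not host:
            raise ValueError("host must not be empty")
        # Stand-in for opening a real socket or TLS session (assumption).
        transport = ("tls" if tls else "tcp", host)
        return cls(host, tls, transport)


conn = Connection.connect("host.name", tls=True)
```

With this shape there is no moment at which a caller holds a `Connection` that has a host but no transport, or a TLS flag that was set after connecting.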
PREVENT MUTATION TO ILLEGAL
PREVENT MUTATION
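One way to prevent mutation in Python is a frozen dataclass. The slides do not name a technique, so this is one possible reading rather than the talk's own example; `ConnectionSettings` is a hypothetical class.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ConnectionSettings:
    """Hypothetical settings object that cannot change after creation."""

    host: str
    tls: bool = True


settings = ConnectionSettings("host.name")

try:
    settings.tls = False  # rejected: frozen instances are read-only
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False

# Changes go through replace(), which builds a new, complete object
# and leaves the original untouched.
plain = replace(settings, tls=False)
```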
DATA VALIDATION
DATA VALIDATION AT EDGES
DATA VALIDATION / NORMALIZATION AT EDGES
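A minimal sketch of validating and normalizing at the edge, assuming a hypothetical signup payload (`SignupRequest`, `parse_signup`, and the field rules are invented for illustration). Untrusted input is checked once where it enters the system, so the code behind the edge can rely on well-formed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignupRequest:
    email: str
    age: int


def parse_signup(raw: dict) -> SignupRequest:
    """Turn untrusted input into a validated, normalized object."""
    # Normalization: trim whitespace and lowercase the address.
    email = str(raw.get("email", "")).strip().lower()
    # Validation: deliberately crude check, enough for a sketch.
    if "@" not in email:
        raise ValueError(f"invalid email: {email!r}")
    try:
        age = int(raw.get("age"))
    except (TypeError, ValueError) as e:
        raise ValueError("age must be an integer") from e
    if not 0 <= age <= 150:
        raise ValueError(f"age out of range: {age}")
    return SignupRequest(email=email, age=age)


request = parse_signup({"email": "  Jane@Example.COM ", "age": "42"})
```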
PLOT TWIST!
FAILURE IS INEVITABLE
RELIABILITY
RELIABILITY: Twitter 2007
RELIABILITY: Twitter 2007, NASA 1969
FAILURE IS INEVITABLE (⌐■_■)
EXPECT
TIMEOUTS
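A small sketch of putting a timeout around a remote call, here with `asyncio.wait_for` from the standard library. The slide names no specific API, so the choice of asyncio, the `slow_remote_call` stand-in, and the `RemoteTimeout` exception are assumptions.

```python
import asyncio


class RemoteTimeout(Exception):
    """Hypothetical application-level error for a call that took too long."""


async def slow_remote_call() -> str:
    # Stand-in for a remote call that needs 0.2 seconds to answer.
    await asyncio.sleep(0.2)
    return "result"


async def call_with_timeout(timeout: float) -> str:
    try:
        # wait_for cancels the call if it does not finish in time.
        return await asyncio.wait_for(slow_remote_call(), timeout=timeout)
    except asyncio.TimeoutError as e:
        raise RemoteTimeout(f"no answer within {timeout}s") from e
```

`asyncio.run(call_with_timeout(1.0))` returns the result, while a timeout shorter than the call's 0.2 seconds raises `RemoteTimeout` instead of blocking the caller indefinitely.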
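[Diagram, CLOSED: Local Client, Circuit Breaker, Remote API; call() passes through the breaker and the result is returned]
[Diagram, CLOSED → OPEN: Local Client, Circuit Breaker, Remote API; call() passes through, the Remote API times out, and the timeout reaches the client]
[Diagram, OPEN: Local Client, Circuit Breaker, Remote API; call() is answered by the breaker with "circuit open!"]
[Diagram, OPEN → HALF-CLOSED: Local Client, Circuit Breaker, Remote API; call() passes through again and the result is returned]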
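The breaker states on the slides above can be sketched in a few lines. This is a simplified, single-threaded illustration and not a specific library's API: the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the remote side."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail immediately instead of waiting for a timeout.
                raise CircuitOpenError("circuit open!")
            # Reset period is over: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Too many failures, or the trial call failed: (re)open.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` and the struggling remote API gets no additional load.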
REDUNDANCY
DOCS
DEAL WITH IT (¬∎_∎)
DON’T MAKE IT WORSE
RETRIES
BACKOFF
BACKOFF: EXPONENTIAL
BACKOFF: EXPONENTIAL WITH JITTER
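Retries with exponential backoff and jitter might look like the sketch below. It uses the "full jitter" variant (a random delay between zero and the exponential ceiling), which is one common choice and not necessarily the one the talk had in mind; the function names and default values are assumptions.

```python
import random
import time


def backoff_delays(attempts, base=0.1, cap=10.0, rng=random.random):
    """Yield one delay per retry: exponential growth, capped, with full jitter."""
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter: pick a random point between 0 and the ceiling so
        # that many clients do not retry in lockstep.
        yield rng() * ceiling


def retry(func, attempts=5, sleep=time.sleep, **backoff):
    """Call func until it succeeds or the attempts are used up."""
    delays = backoff_delays(attempts - 1, **backoff)
    for delay in [*delays, None]:
        try:
            return func()
        except Exception:
            if delay is None:
                raise  # out of attempts: let the last error propagate
            sleep(delay)
```

The `sleep` and `rng` parameters exist only so the behaviour can be tested without waiting or randomness.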
[Diagram: Frontend → Backend (3x)]
[Diagram: Frontend → Backend (3x) → Internal Backend A (9x), Internal Backend B (9x)]
[Diagram: Frontend → Backend (3x) → Internal Backend A (9x), Internal Backend B (9x) → Internal Backend C (27x)]
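The 3x, 9x, and 27x figures follow from multiplying attempts across layers. The helper below is a back-of-the-envelope sketch, assuming every layer makes up to three attempts and the failure sits at the deepest layer; `worst_case_calls` is a hypothetical name.

```python
def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
    """Requests reaching the deepest layer if every layer above it retries."""
    return attempts_per_layer ** layers


# One, two, and three layers of callers, each making up to 3 attempts.
amplification = [worst_case_calls(3, depth) for depth in (1, 2, 3)]
```

A single user request can therefore turn into 27 requests against a service that is already failing.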
DON’T SWALLOW ERRORS
    try:
        do_something()
        return True
    except Exception:
        return False

    try:
        do_something()
    except Exception:
        raise AppException()

    try:
        do_something()
        return True
    except Exception as e:
        raise AppException() from e

    try:
        do_something()
        return True
    except Exception as e:
        raise AppException() from e
    # AppException().__cause__ == e
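A runnable version of the `raise ... from e` slide shows where the original error ends up. `AppException` and `do_something` are the names used on the slides; their bodies here, the `run` wrapper, and the `KeyError` are stand-ins for illustration.

```python
class AppException(Exception):
    """Application-level error; the name follows the slides."""


def do_something():
    # Hypothetical failing operation, used only for illustration.
    raise KeyError("missing-key")


def run():
    try:
        do_something()
        return True
    except Exception as e:
        # Chain the original error instead of discarding it.
        raise AppException("do_something failed") from e


try:
    run()
except AppException as exc:
    # The original KeyError is preserved on the new exception.
    cause = exc.__cause__
```

Tracebacks for the chained exception print both errors, so the root cause stays visible in logs.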
DON’T TRY TOO HARD
sys.exit(1)
CRASH-ONLY
FAIL FAST, FAIL LOUDLY
FOCUS ON RECOVERY
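One way to read "fail fast, fail loudly" together with `sys.exit(1)` is a startup check that refuses to run with a broken configuration. This is a sketch under that reading; the setting name `DATABASE_URL` and both functions are hypothetical.

```python
import os
import sys

REQUIRED_SETTINGS = ("DATABASE_URL",)  # hypothetical setting name


def load_config(env):
    """Return the required settings or raise if any are missing."""
    missing = [name for name in REQUIRED_SETTINGS if name not in env]
    if missing:
        raise RuntimeError(f"missing settings: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_SETTINGS}


def main(env=None):
    env = os.environ if env is None else env
    try:
        config = load_config(env)
    except RuntimeError as e:
        # Fail loudly: say what is wrong, then exit with a non-zero status
        # so a supervisor can notice and restart or alert.
        print(f"fatal: {e}", file=sys.stderr)
        sys.exit(1)
    return config
```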
MTTR
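Reading MTTR as mean time to recovery (an assumption; the slide gives only the acronym), the standard steady-state availability formula shows why recovery time matters. MTBF, mean time between failures, does not appear on the slides and is introduced here only for the arithmetic.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime divided by uptime plus repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Same failure rate, different recovery times.
slow_recovery = availability(mtbf_hours=100.0, mttr_hours=4.0)
fast_recovery = availability(mtbf_hours=100.0, mttr_hours=0.25)
```

With failures equally frequent in both cases, cutting recovery from four hours to fifteen minutes is what raises availability.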
ZERO EXPECTATIONS
FAULT TOLERANCE
FAULT TOLERANCE / RECOVERY
OX.CX/SS / @HYNEK / VRMD.DE