Solid Snakes or: How to Take 5 Weeks of Vacation

Whether you run a web app, search for gravitational waves, or maintain a backup script, the reliability of your systems makes the difference between sweet dreams and production nightmares at 4 a.m.

Hynek Schlawack

May 19, 2017

Transcript

  1. SOLID SNAKES HYNEK SCHLAWACK

  2. None
  3. None
  4. ATTITUDE

  5. INCENTIVES

  6. IMPORTANT VS URGENT

  7. SIMPLICITY: THE PRICE OF RELIABILITY IS THE PURSUIT OF THE UTMOST SIMPLICITY. (Sir C.A.R. Hoare)

  8. None
  9. NORMAL ACCIDENTS

  10. None
  11. None
  12. None
  13. ESSENTIAL

  14. ESSENTIAL VS ACCIDENTAL

  15. None
  16. None
  17. None
  18. None
  19. OPERATIONAL COMPLEXITY

  20. your DC Client App DB Redis Cache CDN Work Queue

  21. your DC Client App DB Redis Cache CDN Work Queue

  22. your DC Client App DB Redis Cache CDN Work Queue

  23. MICROSERVICES

  24. Service 1, Service 2, Service 3, Service 4, Service 5, Service 6, Service 7, Service 8

  25. None
  26. COMPLEXITY IS REALITY

  27. None
  28. PLAN FOR STUPIDITY

  29. HUMAN ERRORS: I DON’T BELIEVE IN HUMAN ERROR (John Allspaw, CTO at Etsy)

  30. None
  31. ILLEGAL STATE

  32. None
  33. 1. VALID AFTER INITIALIZATION

  34. 1. VALID AFTER INITIALIZATION 2. PREVENT MUTATION TO ILLEGAL

  35. NO PARTIAL INITIALIZATION
    conn = Connection()
    conn.tls = True
    conn.connect("host.name")

  36. NO PARTIAL INITIALIZATION: CLASSMETHOD FACTORIES
    conn = Connection.connect(
        "host.name", tls=True
    )

  37. NO PARTIAL INITIALIZATION: BUILDER PATTERN
    conn = ConnectionBuilder() \
        .for_hostname("host.name") \
        .with_tls(True) \
        .connect()

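
The slides name `Connection.connect(...)` as a classmethod factory but do not show the class itself. The following is a minimal sketch of one way such a factory could look; the class body, the empty-hostname check, and the absence of any real networking are assumptions made for illustration.

```python
class Connection:
    """Hypothetical connection type; no real networking happens here."""

    def __init__(self, hostname: str, tls: bool) -> None:
        # Intended to be called only by the factory below, so every
        # instance that exists is already fully configured.
        self.hostname = hostname
        self.tls = tls

    @classmethod
    def connect(cls, hostname: str, tls: bool = True) -> "Connection":
        if not hostname:
            raise ValueError("hostname must not be empty")
        # A real implementation would open the socket here and return
        # only once the connection is usable.
        return cls(hostname, tls)


conn = Connection.connect("host.name", tls=True)
```

Because construction and connecting happen in one call, there is no window in which a caller holds a `Connection` that is configured but not yet connected.
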
  38. PREVENT MUTATION TO ILLEGAL

  39. PREVENT MUTATION
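
One way to prevent mutation into an illegal state in Python is to make the object immutable. The sketch below uses a frozen dataclass from the standard library; the `Endpoint` class and its fields are hypothetical and not taken from the talk.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class Endpoint:
    """Hypothetical value object that cannot change after creation."""
    hostname: str
    tls: bool = True


endpoint = Endpoint("host.name")

try:
    endpoint.tls = False  # assignment on a frozen dataclass raises
except dataclasses.FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False

# Changes are expressed as a new object; the original stays valid.
other = dataclasses.replace(endpoint, hostname="other.name")
```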

  40. None
  41. DATA VALIDATION

  42. DATA VALIDATION AT EDGES

  43. DATA VALIDATION NORMALIZATION AT EDGES
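
Validating and normalizing at the edge could look roughly like the sketch below: raw input is checked and cleaned once, and the rest of the program only ever sees the resulting typed object. `SignupRequest`, `parse_signup`, and the specific rules (lowercased email, age between 0 and 150) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignupRequest:
    """Hypothetical internal representation: already valid and normalized."""
    email: str
    age: int


def parse_signup(raw: dict) -> SignupRequest:
    # Normalize: strip whitespace and lowercase before validating.
    email = str(raw.get("email", "")).strip().lower()
    if "@" not in email:
        raise ValueError(f"invalid email: {email!r}")
    try:
        age = int(raw.get("age"))
    except (TypeError, ValueError):
        raise ValueError("age must be an integer") from None
    if not 0 <= age <= 150:
        raise ValueError(f"age out of range: {age}")
    return SignupRequest(email=email, age=age)


request = parse_signup({"email": "  Ada@Example.COM ", "age": "36"})
```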

  44. None
  45. PLOT TWIST!

  46. FAILURE IS INEVITABLE

  47. RELIABILITY

  48. RELIABILITY Twitter 2007

  49. RELIABILITY Twitter 2007 NASA 1969

  50. None
  51. FAILURE IS INEVITABLE

  52. FAILURE IS INEVITABLE (⌐▪_▪)

  53. EXPECT

  54. None
  55. None
  56. TIMEOUTS
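
As a small illustration of why timeouts matter, the sketch below reads from a peer that never sends anything. Without `settimeout`, the `recv` call would block indefinitely. A local `socket.socketpair()` stands in for a remote service here, which is an assumption made to keep the example self-contained; for TCP clients, `socket.create_connection((host, port), timeout=...)` sets the same kind of timeout.

```python
import socket


def recv_or_timeout(sock: socket.socket, timeout_s: float, bufsize: int = 1024) -> bytes:
    # Bound how long any single blocking operation on this socket may take.
    sock.settimeout(timeout_s)
    return sock.recv(bufsize)


# The peer stays silent on purpose, so the read can only end by timing out.
local, silent_peer = socket.socketpair()
try:
    recv_or_timeout(local, timeout_s=0.2)
except socket.timeout:
    timed_out = True
else:
    timed_out = False
finally:
    local.close()
    silent_peer.close()
```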

  57. None
  58. CLOSED (Local Client → Circuit Breaker → Remote API): call(), call(), result, result

  59. CLOSED → OPEN (Local Client → Circuit Breaker → Remote API): call(), call(), timeout!, timeout!

  60. OPEN (Local Client → Circuit Breaker → Remote API): call(), circuit open!

  61. OPEN → HALF-CLOSED (Local Client → Circuit Breaker → Remote API): call(), call(), result, result

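
The state transitions on the circuit breaker slides can be sketched in a few lines. This is a deliberately minimal, single-threaded illustration and not the implementation used in the talk; the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the remote side while the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=2, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open!")
            # Reset period elapsed: let a single trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting for another timeout, which also gives the struggling remote service room to recover.
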
  62. REDUNDANCY

  63. None
  64. DOCS

  65. DEAL WITH IT (¬∎_∎)

  66. DON’T MAKE IT WORSE

  67. RETRIES

  68. BACKOFF

  69. BACKOFF EXPONENTIAL

  70. BACKOFF EXPONENTIAL WITH JITTER
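
Exponential backoff with jitter can be sketched as follows. The variant shown picks a random delay between zero and an exponentially growing ceiling (often called "full jitter"); whether the talk used this exact variant is not stated in the slides, and the function name and default values are assumptions.

```python
import random
import time


def retry(func, attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call func until it succeeds or the attempts are used up."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the last error surface
            # Ceiling doubles each attempt (0.1, 0.2, 0.4, ...) up to cap.
            ceiling = min(cap, base * 2 ** attempt)
            # Jitter spreads clients out so they do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; production callers would leave it at its default.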

  71. Frontend → Backend: 3x

  72. Frontend → Backend: 3x; Internal Backend A: 9x; Internal Backend B: 9x

  73. Frontend → Backend: 3x; Internal Backend A: 9x; Internal Backend B: 9x; Internal Backend C: 27x

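
The 3x, 9x, and 27x figures on these slides are consistent with every layer making up to three attempts against the layer below it, so the worst case multiplies at each hop. The arithmetic, assuming three attempts per layer, is:

```python
def worst_case_calls(attempts_per_layer: int, depth: int) -> int:
    """Worst-case calls reaching a service `depth` hops below the original caller."""
    return attempts_per_layer ** depth


amplification = {depth: worst_case_calls(3, depth) for depth in (1, 2, 3)}
# {1: 3, 2: 9, 3: 27}
```
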
  74. DON’T SWALLOW ERRORS

  75.
    try:
        do_something()
        return True
    except Exception:
        return False

  76.
    try:
        do_something()
    except Exception:
        raise AppException()

  77.
    try:
        do_something()
        return True
    except Exception as e:
        raise AppException() from e

  78.
    try:
        do_something()
        return True
    except Exception as e:
        raise AppException() from e

    AppException().__cause__ == e

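
A runnable version of the last slide shows that `raise ... from e` keeps the original error reachable through `__cause__`. `AppException` and `do_something` appear on the slides without definitions, so the bodies below are placeholders assumed for illustration.

```python
class AppException(Exception):
    """Application-level error; placeholder definition."""


def do_something():
    # Stand-in for real work that fails with a low-level error.
    raise KeyError("missing")


def run():
    try:
        do_something()
        return True
    except Exception as e:
        # Translate the error but keep the original attached as the cause.
        raise AppException("do_something failed") from e


try:
    run()
except AppException as exc:
    caught = exc

original = caught.__cause__  # the KeyError raised inside do_something()
```
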
  79. DON’T TRY TOO HARD

  80. sys.exit(1)

  81. CRASH-ONLY

  82. FAIL FAST FAIL LOUDLY
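
Failing fast and loudly might look like the sketch below: a required setting is checked at startup, and the process exits with a non-zero status instead of limping along with a half-working configuration. The `load_config` function and the `DATABASE_URL` setting name are hypothetical.

```python
import sys


def load_config(environ) -> dict:
    """Read required settings or terminate the process immediately."""
    try:
        return {"database_url": environ["DATABASE_URL"]}
    except KeyError as e:
        # Loud: say what is wrong on stderr. Fast: exit non-zero right away.
        print(f"fatal: missing required setting {e}", file=sys.stderr)
        sys.exit(1)
```

In a real program this would typically be called once at startup with `os.environ`, so a supervisor can notice the non-zero exit and restart or alert.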

  83. FOCUS ON RECOVERY

  84. MTTR

  85. None
  86. ZERO EXPECTATIONS

  87. None
  88. FAULT TOLERANCE

  89. FAULT TOLERANCE RECOVERY

  90. OX.CX/SS @HYNEK VRMD.DE