Solid Snakes or: How to Take 5 Weeks of Vacation by Hynek Schlawack

Pycon ZA
October 05, 2017

No matter whether you run a web app, search for gravitational waves, or maintain a backup script: being responsible for a piece of software or infrastructure means that you either get a pager right away, or that you get angry calls from people affected by outages. Being paged at 4am in everyday life is bad enough. Having to fix problems from hotel rooms while your travel buddies go for brunch is even worse.

And while incidents can’t be prevented completely, there are ways to make your systems more reliable and minimize the need for (your!) manual intervention. This talk will help you to get calm nights and relaxing vacations by teaching you some of them.

Transcript

  1. SOLID SNAKES HYNEK SCHLAWACK

  2. None
  3. None
  4. ATTITUDE

  5. INCENTIVES

  6. IMPORTANT VS URGENT

  7. SIMPLICITY: THE PRICE OF RELIABILITY IS THE PURSUIT OF THE UTMOST SIMPLICITY. Sir C.A.R. Hoare
  8. None
  9. NORMAL ACCIDENTS

  10. None
  11. None
  12. None
  13. ESSENTIAL

  14. ESSENTIAL VS ACCIDENTAL

  15. None
  16. None
  17. None
  18. None
  19. None
  20. OPERATIONAL COMPLEXITY

  21. your DC Client App DB Redis Cache CDN Work Queue

  22. your DC Client App DB Redis Cache CDN Work Queue

  23. your DC Client App DB Redis Cache CDN Work Queue

  24. MICROSERVICES

  25. Service 2 Service 3 Service 1 Service 4 Service 5 Service 6 Service 7 Service 8
  26. None
  27. COMPLEXITY IS REALITY

  28. None
  29. PLAN FOR STUPIDITY

  30. HUMAN ERRORS: I DON’T BELIEVE IN HUMAN ERROR. John Allspaw, CTO at Etsy
  31. None
  32. None
  33. DATA VALIDATION

  34. DATA VALIDATION AT EDGES

  35. DATA VALIDATION NORMALIZATION AT EDGES

  36. None
  37. PLOT TWIST!

  38. FAILURE IS INEVITABLE

  39. RELIABILITY

  40. RELIABILITY Twitter 2007

  41. RELIABILITY Twitter 2007 NASA 1969

  42. None
  43. FAILURE IS INEVITABLE

  44. FAILURE IS INEVITABLE (⌐▪_▪)

  45. None
  46. EXPECT

  47. None
  48. None
  49. TIMEOUTS

  50. None
  51. CLOSED Local Client Remote API Circuit Breaker call() call() result result

  52. CLOSED → OPEN Local Client Remote API Circuit Breaker call() call() timeout! timeout!

  53. OPEN Local Client Remote API Circuit Breaker call() circuit open!

  54. OPEN → HALF-CLOSED Local Client Remote API Circuit Breaker call() call() result result

  55. REDUNDANCY

  56. None
  57. DOCS

  58. DEAL WITH IT (¬∎_∎)

  59. DON’T MAKE IT WORSE

  60. RETRIES

  61. BACKOFF

  62. BACKOFF EXPONENTIAL

  63. BACKOFF EXPONENTIAL WITH JITTER

  64. Frontend Backend 3x

  65. Internal Backend A Internal Backend B 9x 9x Frontend Backend 3x

  66. Internal Backend C 27x Internal Backend A Internal Backend B 9x 9x Frontend Backend 3x

  67. DON’T SWALLOW ERRORS

  68. try:
          do_something()
          return True
      except Exception:
          return False

  69. try:
          do_something()
      except Exception:
          raise AppException()

  70. try:
          do_something()
          return True
      except Exception as e:
          raise AppException() from e

  71. try:
          do_something()
          return True
      except Exception as e:
          raise AppException() from e

      AppException().__cause__ == e
  72. DON’T TRY TOO HARD

  73. sys.exit(1)

  74. CRASH-ONLY

  75. FAIL FAST FAIL LOUDLY

  76. FOCUS ON RECOVERY

  77. MTTR

  78. None
  79. ZERO EXPECTATIONS

  80. None
  81. FAULT TOLERANCE

  82. FAULT TOLERANCE RECOVERY

  83. OX.CX/SS @HYNEK VRMD.DE
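
Slides 51 to 54 walk through the states of a circuit breaker sitting between a local client and a remote API. The following is a minimal sketch of that state machine, not the speaker's implementation: the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all assumptions made for illustration, and only `TimeoutError` counts as a failure here.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without contacting the remote side."""


class CircuitBreaker:
    """Hypothetical breaker: closed -> open after repeated timeouts -> half-closed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive timeouts before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after:
            return "half-closed"
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            # Fail fast instead of waiting for yet another timeout.
            raise CircuitOpenError("circuit open!")
        try:
            result = func(*args, **kwargs)
        except TimeoutError:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if state == "half-closed" or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of piling more load on a struggling remote service, which fits the "DON’T MAKE IT WORSE" theme of slide 59.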
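
Slides 60 to 63 build up from plain retries to exponential backoff with jitter. One common way to do this is "full jitter", where each wait is drawn uniformly between zero and an exponentially growing ceiling. The sketch below assumes that variant; the function names, the `base` and `cap` defaults, and the choice of retryable exceptions are hypothetical and not taken from the talk.

```python
import random
import time


def backoff_delays(retries, base=0.1, cap=10.0, rng=random.random):
    """Yield one delay per retry: rng() * min(cap, base * 2**n) for n = 0, 1, ..."""
    for n in range(retries):
        yield rng() * min(cap, base * 2 ** n)


def retry(func, retries=3, retry_on=(TimeoutError, ConnectionError),
          sleep=time.sleep, **backoff):
    """Call func; on a retryable error, wait and try again, up to `retries` times."""
    for delay in backoff_delays(retries, **backoff):
        try:
            return func()
        except retry_on:
            # Randomized waits keep many clients from retrying in lockstep.
            sleep(delay)
    # Final attempt: any error propagates to the caller instead of being swallowed.
    return func()
```

Keeping `retries` small matters because retries multiply across layers: slides 64 to 66 show three attempts per layer turning into 3x, 9x and 27x load further down the call chain.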
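
Slides 67 to 71 contrast swallowing an error with re-raising it as an application exception chained to the original via `raise ... from e`. The following runnable sketch shows what the chaining preserves. `AppException` and `do_something` are the names used on the slides; the body of `do_something`, the `run` wrapper, and the error messages are stand-ins assumed for illustration.

```python
class AppException(Exception):
    """Application-level error, named as on the slides."""


def do_something():
    # Hypothetical stand-in for the real work; it always fails in this sketch.
    raise ValueError("low-level failure")


def run():
    try:
        do_something()
        return True
    except Exception as e:
        # "from e" stores the original exception on the new one as __cause__,
        # so the traceback shows both errors instead of hiding the root cause.
        raise AppException("do_something failed") from e


try:
    run()
except AppException as exc:
    cause = exc.__cause__  # the original ValueError raised by do_something()
```

The point of the final slide in that sequence is that the raised `AppException` instance carries the original exception in its `__cause__` attribute, so callers and log output can still reach the underlying error.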