Solid Snakes or: How to Take 5 Weeks of Vacation

No matter whether you run a web app, search for gravitational waves, or maintain a backup script: the reliability of your systems makes the difference between sweet dreams and production nightmares at 4 a.m.

Hynek Schlawack

May 19, 2017

Transcript

  1. SOLID SNAKES
    HYNEK SCHLAWACK

  4. ATTITUDE

  5. INCENTIVES

  6. IMPORTANT
    VS
    URGENT

  7. THE PRICE OF RELIABILITY
    IS THE PURSUIT OF THE
    UTMOST SIMPLICITY.
    Sir C.A.R. Hoare
    SIMPLICITY

  9. NORMAL ACCIDENTS

  13. ESSENTIAL

  14. ESSENTIAL
    VS
    ACCIDENTAL

  19. OPERATIONAL
    COMPLEXITY

  20. your DC
    Client
    App
    DB
    Redis
    Cache
    CDN
    Work Queue

  23. MICROSERVICES

  24. Service 2
    Service 3
    Service 1
    Service 4
    Service 5
    Service 6
    Service 7
    Service 8

  26. COMPLEXITY
    IS
    REALITY

  28. PLAN
    FOR
    STUPIDITY

  29. I DON’T BELIEVE
    IN HUMAN ERROR
    John Allspaw, CTO at Etsy
    HUMAN ERRORS

  31. ILLEGAL STATE

  33. 1. VALID AFTER INITIALIZATION

  34. 1. VALID AFTER INITIALIZATION
    2. PREVENT MUTATION TO ILLEGAL

  35. NO PARTIAL INITIALIZATION
    conn = Connection()
    conn.tls = True
    conn.connect("host.name")

  36. NO PARTIAL INITIALIZATION: CLASSMETHOD FACTORIES
    conn = Connection.connect(
        "host.name",
        tls=True
    )
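
A minimal sketch of how such a classmethod factory could look. The slides show only the call site, so the class body, the attribute names, and the empty-hostname check below are assumptions made for illustration.

```python
class Connection:
    """Hypothetical connection that only ever exists fully initialized."""

    def __init__(self, hostname: str, tls: bool):
        # Every required value is demanded up front, so there is no window
        # in which the object exists without a hostname or a TLS decision.
        if not hostname:
            raise ValueError("hostname must not be empty")
        self.hostname = hostname
        self.tls = tls

    @classmethod
    def connect(cls, hostname: str, tls: bool = False) -> "Connection":
        # A real implementation would open the socket here and raise on
        # failure, so callers never receive a half-configured object.
        return cls(hostname, tls)


conn = Connection.connect("host.name", tls=True)
```

Because construction and connection happen in one call, there is no intermediate state for other code to observe.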

  37. NO PARTIAL INITIALIZATION: BUILDER PATTERN
    conn = ConnectionBuilder() \
        .for_hostname("host.name") \
        .with_tls(True) \
        .connect()

  38. PREVENT MUTATION
    TO ILLEGAL

  39. PREVENT MUTATION
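
One way to prevent mutation into an illegal state, sketched with the standard library's frozen dataclasses. The slides do not name a technique, so the `Endpoint` class, its fields, and the port check are hypothetical.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical value object: validated once, then read-only."""

    hostname: str
    port: int

    def __post_init__(self):
        # Validation runs on every construction, including via replace().
        if not 0 < self.port < 65536:
            raise ValueError(f"invalid port: {self.port}")


ep = Endpoint("host.name", 443)

try:
    ep.port = -1  # assignment on a frozen instance is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False

# "Changing" a value means building a new, re-validated instance.
moved = replace(ep, port=8443)
```

Since `replace()` goes through `__init__`, the validity rule from slide 33 is enforced again for every derived object.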

  41. DATA VALIDATION

  42. DATA VALIDATION
    AT EDGES

  43. DATA VALIDATION
    NORMALIZATION
    AT EDGES
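
A small sketch of validating and normalizing untrusted input once, where it enters the system. The `SignupRequest` type, the `parse_signup` function, and the specific rules are invented for illustration and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignupRequest:
    """Hypothetical internal type; code past the edge can trust its fields."""

    email: str
    age: int


def parse_signup(raw: dict) -> SignupRequest:
    """Turn an untrusted mapping into a validated, normalized object."""
    # Normalization: strip whitespace and lowercase before validating.
    email = str(raw.get("email", "")).strip().lower()
    if "@" not in email:
        raise ValueError("email must contain '@'")
    try:
        age = int(raw["age"])
    except (KeyError, TypeError, ValueError) as e:
        raise ValueError("age must be an integer") from e
    if age < 0:
        raise ValueError("age must not be negative")
    return SignupRequest(email=email, age=age)


request = parse_signup({"email": "  Jane@Example.COM ", "age": "42"})
```

Everything behind `parse_signup` can then work with `SignupRequest` and skip defensive re-checks.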

  45. PLOT TWIST!

  46. FAILURE IS
    INEVITABLE

  47. RELIABILITY

  48. RELIABILITY
    Twitter 2007

  49. RELIABILITY
    Twitter 2007 NASA 1969

  51. FAILURE IS
    INEVITABLE

  52. FAILURE IS
    INEVITABLE
    (⌐■_■)

  53. EXPECT

  56. TIMEOUTS
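
As an illustration of bounding every wait, the sketch below sets a timeout on a socket so that a silent peer cannot block the caller forever. It uses a local `socket.socketpair()` in place of a real remote service, and the 50 ms value is arbitrary.

```python
import socket

# Two connected local sockets stand in for a client and a remote service.
local, remote = socket.socketpair()
local.settimeout(0.05)  # without this, recv() could block indefinitely

try:
    local.recv(1024)  # the "remote" side has sent nothing yet
    timed_out = False
except socket.timeout:
    timed_out = True  # the caller gets control back and can react

# Once the peer answers, the same call returns normally.
remote.sendall(b"pong")
reply = local.recv(1024)

local.close()
remote.close()
```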

  58. CLOSED
    Local Client
    Remote API
    Circuit Breaker
    call() call()
    result result

  59. CLOSED → OPEN
    Local Client
    Remote API
    Circuit Breaker
    call() call()
    timeout! timeout!

  60. OPEN
    Local Client
    Remote API
    Circuit Breaker
    call()
    circuit open!

  61. OPEN → HALF-CLOSED
    Local Client
    Remote API
    Circuit Breaker
    call() call()
    result result
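
The four circuit breaker states in slides 58 to 61 can be sketched as a small wrapper around the remote call. This is a single-threaded illustration under assumed names and thresholds (`CircuitBreaker`, `max_failures`, `reset_after`), not the implementation used in the talk.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the remote API while the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # OPEN: fail immediately, the remote is left alone.
                raise CircuitOpenError("circuit open")
            # Cool-down elapsed: this call goes through as a trial.
        try:
            result = func(*args, **kwargs)
        except TimeoutError:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit again
        return result


def remote_down():
    raise TimeoutError("remote API did not answer")


now = [0.0]  # fake clock so the demo needs no real waiting
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])
events = []
for _ in range(3):
    try:
        breaker.call(remote_down)
    except TimeoutError:
        events.append("timeout")
    except CircuitOpenError:
        events.append("open")

now[0] = 11.0  # cool-down has passed
events.append(breaker.call(lambda: "result"))
```

After two timeouts the third call never reaches the remote; once the cool-down has passed, a successful trial call closes the circuit.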

  62. REDUNDANCY

  64. DOCS

  65. DEAL WITH IT
    (¬∎_∎)

  66. DON’T
    MAKE IT
    WORSE

  67. RETRIES

  68. BACKOFF

  69. BACKOFF
    EXPONENTIAL

  70. BACKOFF
    EXPONENTIAL
    WITH JITTER
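
A minimal sketch of retries with exponential backoff and jitter. The function names, the default `base` and `cap` values, and the choice to randomize over the whole interval below the ceiling are assumptions, one common variant among several.

```python
import random
import time


def backoff_delays(attempts, base=0.1, cap=10.0, rng=random.random):
    """Yield one randomized delay per retry, growing exponentially up to cap."""
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        # Jitter: pick a random point below the ceiling so that many clients
        # failing at the same moment do not retry in lockstep.
        yield rng() * ceiling


def retry(func, attempts=5, sleep=time.sleep, **backoff):
    """Call func up to `attempts` times, sleeping between failed tries."""
    delays = list(backoff_delays(attempts - 1, **backoff))
    for delay in delays + [None]:  # None marks the final attempt
        try:
            return func()
        except Exception as e:  # real code should catch only retryable errors
            if delay is None:
                raise RuntimeError(f"gave up after {attempts} attempts") from e
            sleep(delay)


# Deterministic demo: with rng fixed at 1.0 the delays equal the ceilings.
ceilings = list(backoff_delays(4, base=1.0, cap=4.0, rng=lambda: 1.0))
```

With the random factor pinned to 1.0, `ceilings` shows the doubling and the cap; with the default `random.random`, each delay falls somewhere below its ceiling.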

  71. Frontend
    Backend
    3x

  72. Frontend
    Backend: 3x
    Internal Backend A: 9x
    Internal Backend B: 9x

  73. Frontend
    Backend: 3x
    Internal Backend A: 9x
    Internal Backend B: 9x
    Internal Backend C: 27x
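
Reading the 3x, 9x, and 27x labels as worst-case call counts when every layer makes up to three attempts (an interpretation of the diagram, which the slide text does not spell out), the amplification is a plain power of the per-layer attempt count. The function name is hypothetical.

```python
def worst_case_calls(attempts_per_layer: int, depth: int) -> int:
    """Calls reaching a service `depth` hops down if every hop retries."""
    # Each layer multiplies the traffic it receives by its own attempt count.
    return attempts_per_layer ** depth


# One original request, three attempts per layer, one to three hops deep.
amplification = [worst_case_calls(3, depth) for depth in (1, 2, 3)]
```

This multiplication is why retrying at every layer can bury a service that is already struggling.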

  74. DON’T
    SWALLOW
    ERRORS

  75. try:
          do_something()
          return True
      except Exception:
          return False

  76. try:
          do_something()
      except Exception:
          raise AppException()

  77. try:
          do_something()
          return True
      except Exception as e:
          raise AppException() from e

  78. try:
          do_something()
          return True
      except Exception as e:
          raise AppException() from e
      AppException().__cause__ == e
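
A runnable version of the chaining shown in slides 77 and 78. `AppException` and `do_something` follow the slides' names; their bodies and the `run` wrapper are stand-ins added so the snippet executes on its own.

```python
class AppException(Exception):
    """Application-level error, named as on the slides."""


def do_something():
    raise KeyError("missing")  # stand-in for whatever really failed


def run():
    try:
        do_something()
        return True
    except Exception as e:
        # `from e` keeps the original error attached as __cause__,
        # so the traceback still shows what actually went wrong.
        raise AppException("do_something failed") from e


try:
    run()
except AppException as err:
    cause = err.__cause__
```

Here `cause` is the original `KeyError`, which is the detail a bare `return False` would have thrown away.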

  79. DON’T TRY
    TOO HARD

  80. sys.exit(1)

  81. CRASH-ONLY

  82. FAIL FAST
    FAIL LOUDLY

  83. FOCUS
    ON
    RECOVERY

  84. MTTR

  86. ZERO
    EXPECTATIONS

  88. FAULT TOLERANCE

  89. FAULT TOLERANCE
    RECOVERY

  90. OX.CX/SS
    @HYNEK
    VRMD.DE
