The Walking Dead - A Survival Guide to Resilient Applications

This talk is an introduction to resilient, fault tolerant systems with a focus on design patterns and practical examples thereof.

It was given at Voxxed Days Vienna 2015; once uploaded, the recording should be available on parleys.com.

Michael Nitschinger

February 06, 2015

Transcript

  1. @daschl
    #Voxxed
    The Walking Dead
    A Survival Guide to Resilient Applications
    Michael Nitschinger

  2. The Right Mindset

  3. “The more you sweat in peace, the less you bleed in war.”
    – U.S. Marine Corps

  6. Not so fast, mister fancy tests!

  7. Always ask yourself:
    What can go wrong?

  8. Fault Tolerance 101

  9. Fault Error Failure
    A fault is a latent defect that can cause an
    error when activated.

  10. Fault Error Failure
    Errors are the manifestations of faults.

  11. Fault Error Failure
    Failure occurs when the service no longer
    complies with its specifications.

  12. Fault Error Failure
    Errors are inevitable. We need to
    detect, recover and mitigate
    them before they become failures.

  13. Reliability
    is the probability that a system will perform
    failure-free for a given amount of time.
    MTTF: Mean Time To Failure
    MTTR: Mean Time To Repair

  14. Availability
    is the percentage of time the system is able to
    perform its function.
    availability = MTTF / (MTTF + MTTR)
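As a rough illustration of the availability formula above, the following Python sketch computes it from MTTF and MTTR. The function name and the sample figures (1,000 hours between failures, 1 hour to repair) are assumptions made for this example, not values from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is able to perform its function."""
    if mttf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTTF and MTTR must be non-negative")
    total = mttf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTTF + MTTR must be positive")
    return mttf_hours / total


# Hypothetical numbers: one failure every 1,000 hours, one hour to repair.
example = availability(1000.0, 1.0)
```

With these assumed inputs the result is 1000/1001, or roughly 99.9%. Halving the repair time improves availability about as much as doubling the time to failure, which is why fast recovery matters as much as avoiding faults.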

  15. Expression        Availability  Downtime/Year
    Three 9s            99.9%         525.6 min
    Four 9s             99.99%        52.56 min
    Four 9s and a 5     99.995%       26.28 min
    Five 9s             99.999%       5.256 min
    Six 9s              99.9999%      0.5256 min
                        100%          0
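The downtime figures in the table follow from multiplying the unavailable fraction by the number of minutes in a year. A minimal sketch, assuming a 365-day year (525,600 minutes); the function name is hypothetical:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected downtime per year for a given availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Recompute the rows of the table for comparison.
for percent in (99.9, 99.99, 99.995, 99.999, 99.9999, 100.0):
    print(f"{percent}% -> {downtime_minutes_per_year(percent):.4f} min/year")
```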

  16. Pop Quiz!
    Edge Service (wanted: 99.99% availability)
    User Service: ???
    Session Store: ???
    Data Warehouse: ???

  17. Pop Quiz!
    Edge Service (wanted: 99.99% availability)
    User Service: 99.999%
    Session Store: 99.999%
    Data Warehouse: 99.999%
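One way to read the quiz answer, assuming the edge service needs all three backends for every request and that they fail independently: availabilities in series multiply, so each dependency has to be more available than the target for the whole. The helper below is a hypothetical illustration, not something shown in the talk.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


TARGET = 0.9999  # 99.99% wanted for the edge service

# Three dependencies at 99.99% each fall short of the target ...
same_as_target = serial_availability([0.9999] * 3)
# ... while three at 99.999% each stay above it.
one_more_nine = serial_availability([0.99999] * 3)

meets_target_with_four_nines = same_as_target >= TARGET
meets_target_with_five_nines = one_more_nine >= TARGET
```

Under these assumptions, 99.99% per dependency gives about 99.97% overall, whereas 99.999% per dependency gives about 99.997%.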

  18. Fault Tolerant Architecture

  19. Units of Mitigation
    are the basic units of
    error containment and recovery.

  21. Redundancy
    Active/Active, Active/Standby, N+M
    [Figure: cost versus time to recover]

  22. Escalation
    is used when recovery or mitigation
    is not possible inside the unit.

  23. Escalation
    taken from http://letitcrash.com/post/30165507578/shutdown-patterns-in-akka-2

  24. The Fault Observer
    receives system and error events and
    can guide and orchestrate detection and recovery.
    [Diagram: Units, Observer, Listeners]
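The fault observer idea can be sketched as a small publish/subscribe hub: units report events to it, and registered listeners react. The class, method and event names here are assumptions chosen for illustration; the talk does not prescribe an API.

```python
from collections import defaultdict


class FaultObserver:
    """Collects events from units and forwards them to interested listeners."""

    def __init__(self):
        # Maps an event type (e.g. "error") to the listeners subscribed to it.
        self._listeners = defaultdict(list)

    def subscribe(self, event_type, listener):
        """Register `listener(unit, detail)` for one event type."""
        self._listeners[event_type].append(listener)

    def report(self, unit, event_type, detail=""):
        """Called by a unit; notifies every listener for that event type."""
        for listener in self._listeners.get(event_type, []):
            listener(unit, detail)


# Usage: a listener that records which units reported errors.
failed_units = []
observer = FaultObserver()
observer.subscribe("error", lambda unit, detail: failed_units.append(unit))
observer.report("session-store", "error", "connection refused")
```

Keeping detection in one place means the units themselves stay simple, and recovery policy can change without touching them.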

  27. Detecting Errors

  28. A silent system
    is a dead system.

  29. A System Monitor
    helps to study behaviour and to
    make sure it is operating as specified.
    http://upload.wikimedia.org/wikipedia/commons/3/3b/Mission_control_center.jpg

  30. https://github.com/Netflix/Turbine

  31. Periodic Checking
    • Heartbeats monitor tasks or remote services
      and initiate recovery.
    • Routine Exercises prevent idle
      unit starvation and surface malfunctions.
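A heartbeat check like the one described above might look as follows. This is a minimal sketch with hypothetical names: units call `beat()` periodically, and a supervisor polls `overdue()` to decide where recovery should start. The injectable clock is an assumption added to make the logic easy to test.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per unit and reports units that went silent."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = {}

    def beat(self, unit):
        """Record a heartbeat from `unit`."""
        self._last_seen[unit] = self._clock()

    def overdue(self):
        """Return the units whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            unit
            for unit, seen in self._last_seen.items()
            if now - seen > self._timeout
        )
```

A caller would typically run `overdue()` on a timer and trigger failover or a restart for each unit it returns.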

  32. Utilizing Netty’s IdleStateHandler

  33. Riding over Transients
    is used to defer error recovery
    if the error is temporary.
    “‘Patience is a virtue’ to allow the true signature of
    an error to show itself.”
    - Robert S. Hanmer
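One simple way to ride over transients is to require several consecutive failed checks before starting recovery. The sketch below assumes that approach; the class name and the threshold are illustrative and not taken from the talk or from Hanmer's book.

```python
class TransientFilter:
    """Signals recovery only after `threshold` consecutive failed checks."""

    def __init__(self, threshold):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self._threshold = threshold
        self._consecutive_failures = 0

    def observe(self, ok):
        """Feed one check result; return True when recovery should start."""
        if ok:
            # A single success means the earlier errors were transient.
            self._consecutive_failures = 0
            return False
        self._consecutive_failures += 1
        return self._consecutive_failures >= self._threshold


# Usage: one isolated failure is ignored, three in a row are not.
checks = [False, True, False, False, False]
flt = TransientFilter(threshold=3)
decisions = [flt.observe(ok) for ok in checks]
```

The trade-off is detection latency: a higher threshold avoids needless recovery but delays the reaction to a real error.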

  35. And more!
    • Complete Parameter Checking
    • Watchdogs
    • Voting
    • Checksums
    • Routine Audits
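To make one item from the list above concrete, here is a hedged sketch of voting: the same request goes to several redundant units and the answer returned by a strict majority wins. The function name is hypothetical, and the sketch assumes the results are hashable values.

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of redundant units."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No value was reported by more than half of the units.
        raise ValueError("no strict majority among the results")
    return value


# Usage: two of three units agree, so the outlier is outvoted.
agreed = majority_vote([42, 42, 41])
```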

  36. Recovery and Mitigation of Errors

  37. Failover
    to a redundant unit when the error has been
    detected and isolated.
    Reminder: Redundancy (Active/Active, Active/Standby, N+M)
    [Figure: cost versus time to recover]

  38. Intelligent Retries
    Fixed, Linear, Exponential
    [Chart: time between retries versus number of attempts]
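The three retry strategies named on the slide differ only in how the delay grows with the attempt number. The following sketch assumes a generic `retry` helper with a pluggable delay function; all names and default values are illustrative, not an API from the talk.

```python
import time


def fixed(base):
    """Always wait `base` seconds."""
    return lambda attempt: base


def linear(base):
    """Wait `base * attempt` seconds."""
    return lambda attempt: base * attempt


def exponential(base, factor=2.0):
    """Wait `base * factor ** (attempt - 1)` seconds."""
    return lambda attempt: base * factor ** (attempt - 1)


def retry(operation, max_attempts, delay, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(delay(attempt))
```

A call such as `retry(fetch, max_attempts=5, delay=exponential(0.1))` would wait 0.1, 0.2, 0.4 and 0.8 seconds between attempts, where `fetch` stands for any callable that may fail transiently. Production code often adds random jitter and retries only on errors known to be temporary.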

  39. Restart
    can be used as a last resort, at the cost of
    losing state and time.

  40. Fail Fast
    to shed load and provide a great partial service
    rather than a bad complete one.
    [Diagram label: Boundary]
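Failing fast at a boundary can be sketched as a concurrency limit that rejects excess calls immediately instead of letting them queue up. The class and exception names are assumptions for this example.

```python
import threading


class RejectedError(Exception):
    """Raised when the boundary is at capacity."""


class FailFastBoundary:
    """Allows at most `max_concurrent` calls in flight; rejects the rest."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Non-blocking acquire: never wait for a free slot.
        if not self._slots.acquire(blocking=False):
            raise RejectedError("at capacity, rejecting instead of queuing")
        try:
            return operation()
        finally:
            self._slots.release()
```

Callers that receive `RejectedError` can serve a fallback or an error page right away, so the requests that were admitted still get a fast answer.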

  41. Backpressure & Batching!
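A bounded queue is one common way to get both effects: producers learn immediately when the consumer cannot keep up, and the consumer drains several items per round trip. This sketch uses Python's standard `queue` module; the helper names and sizes are assumptions.

```python
import queue


def offer(q, item):
    """Try to enqueue without blocking; False tells the producer to back off."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


def drain_batch(q, max_batch):
    """Take up to `max_batch` items that are available right now."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


# Usage: capacity 3, so the fourth and fifth offers are refused.
work = queue.Queue(maxsize=3)
accepted = [offer(work, n) for n in range(5)]
first_batch = drain_batch(work, max_batch=2)
```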

  42. Case Study: Hystrix
    https://raw.githubusercontent.com/wiki/Netflix/Hystrix/images/hystrix-flow-chart-original.png
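The circuit breaker at the heart of that flow chart can be approximated in a few lines. The sketch below is not Hystrix's API: it is a simplified Python stand-in that counts consecutive failures, whereas Hystrix itself works from rolling request metrics. All names and thresholds are assumptions.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures, allows a trial call after a timeout."""

    def __init__(self, failure_threshold, reset_timeout, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                return fallback()  # open: skip the dependency entirely
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a trial
            return fallback()
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

While the circuit is open, callers get the fallback without touching the failing dependency, which gives it time to recover and keeps the caller's own threads free.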

  43. And more!
    Recovery
    • Rollback
    • Roll-Forward
    • Checkpoints
    • Data Reset
    Mitigation
    • Bounded Queuing
    • Expansive Controls
    • Marking Data
    • Error Correcting Codes

  45. Watch it in Action

  46. Recommended Reading

  47. Patterns for Fault-Tolerant Software
    by Robert S. Hanmer

  48. Release It!
    by Michael T. Nygard

  49. Any Questions?

  50. twitter
    @daschl
    email
    [email protected]
    Thank you!