The Walking Dead - A Survival Guide to Resilient Reactive Applications

This talk was given at Lambda Days Krakow 2015, a great event with a focus on functional and reactive programming.

The slides are very similar to the ones given at Voxxed, but with a greater focus on functional and event-driven paradigms.

Michael Nitschinger

February 27, 2015

Transcript

  1. The Walking Dead
    A Survival Guide to Resilient Reactive Applications
    Michael Nitschinger
    @daschl

  2. the right
    Mindset

  3. “The more you sweat in peace, the less
    you bleed in war.”
    - U.S. Marine Corps

  4. Not so fast, mister fancy tests!

  5. Always ask yourself:
    What can go wrong?

  6. Fault Tolerance
    101

  7. Fault Error Failure
    A fault is a latent defect that can cause an
    error when activated.

  8. Fault Error Failure
    Errors are the manifestations of faults.

  9. Fault Error Failure
    Failure occurs when the service no longer
    complies with its specifications.

  10. Fault Error Failure
    Errors are inevitable. We need to
    detect, recover and mitigate
    them before they become failures.

  11. Reliability
    is the probability that a system will perform
    failure-free for a given amount of time.
    MTTF: Mean Time To Failure
    MTTR: Mean Time To Repair

  12. Availability
    is the percentage of time the system is able to
    perform its function.
    availability = MTTF / (MTTF + MTTR)
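
As a worked example, the availability formula can be evaluated directly. This is a minimal sketch; the function name and the MTTF/MTTR figures are illustrative assumptions, not numbers from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is able to perform its function."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service: fails on average every 1000 hours, takes 1 hour to repair.
example = availability(mttf_hours=1000.0, mttr_hours=1.0)
print(f"availability: {example:.4%}")

# Halving the repair time improves availability without touching MTTF.
faster_repair = availability(mttf_hours=1000.0, mttr_hours=0.5)
print(f"with faster repair: {faster_repair:.4%}")
```

The second call illustrates that shortening MTTR raises availability even when failures stay just as frequent.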

  13. Expression        Availability   Downtime/Year
    Three 9s            99.9%          525.6 min
    Four 9s             99.99%         52.56 min
    Four 9s and a 5     99.995%        26.28 min
    Five 9s             99.999%        5.256 min
    Six 9s              99.9999%       0.5256 min
                        100%           0 min
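
The downtime figures on this slide follow from the number of minutes in a year. A small sketch, assuming a 365-day year (the function name is made up for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


for pct in (99.9, 99.99, 99.995, 99.999, 99.9999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.4f} min/year")
```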

  14. Pop Quiz!
    Edge Service
    User Service Session Store Data Warehouse
    Wanted: 99.99% Availability
    ??? ??? ???

  15. Pop Quiz!
    Edge Service
    User Service Session Store Data Warehouse
    Wanted: 99.99% Availability
    99.99% 99.99% 99.99%

  16. Pop Quiz!
    Edge Service
    User Service Session Store Data Warehouse
    Wanted: 99.99% Availability
    ~99.999% ~99.999% ~99.999%
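
The arithmetic behind the quiz can be sketched as follows, assuming the Edge Service needs all three dependencies at once and that they fail independently (both are assumptions; the function names are illustrative):

```python
import math


def combined_availability(dependencies):
    """Availability of a service that needs ALL dependencies to be up.

    Assumes independent failures, so the availabilities multiply.
    """
    return math.prod(dependencies)


def required_per_dependency(target: float, count: int) -> float:
    """Availability each of `count` equal dependencies needs to reach `target`."""
    return target ** (1.0 / count)


combined = combined_availability([0.9999, 0.9999, 0.9999])
required = required_per_dependency(0.9999, 3)
print(f"three dependencies at 99.99%: {combined:.4%}")
print(f"needed per dependency for 99.99% overall: {required:.5%}")
```

Under these assumptions, three dependencies at 99.99% each combine to roughly 99.97%, which misses the target; each one needs about 99.997%, and the slide's ~99.999% leaves some headroom above that.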

  17. Fault Tolerant
    Architecture

  18. Units of Mitigation
    are the basic units of
    error containment and recovery.

  19. Escalation
    is used when recovery or mitigation
    is not possible inside the unit.

  20. Escalation
    Cluster
    Node Node
    Service Service Service Service Service
    Endpoint Endpoint Endpoint Endpoint Endpoint

  24. Redundancy
    [Diagram: Active/Active, Active/Standby, Active/Passive and N+M,
    compared by Cost and Time To Recover]

  25. The Fault Observer
    receives system and error events and
    can guide and orchestrate detection and recovery
    [Diagram: four Units, one Observer and two Listeners]
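
One way to picture a fault observer is a small in-process event hub. The sketch below is an assumption about the shape of such a component, not the design shown in the talk; the class, method and event names are all hypothetical.

```python
from collections import defaultdict


class FaultObserver:
    """Receives events from units and forwards them to interested listeners."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, event_type, listener):
        """Register a callable taking (unit, detail) for one event type."""
        self._listeners[event_type].append(listener)

    def report(self, unit, event_type, detail):
        """Called by a unit; returns how many listeners were notified."""
        listeners = self._listeners.get(event_type, [])
        for listener in listeners:
            listener(unit, detail)
        return len(listeners)


observer = FaultObserver()
log = []
observer.subscribe("error", lambda unit, detail: log.append((unit, detail)))
observer.report("endpoint-1", "error", "connection reset")
observer.report("endpoint-2", "info", "reconnected")  # no listener for "info"
print(log)
```

Keeping detection and recovery decisions in listeners means a unit only has to report what happened, not decide what to do about it.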

  26. Detecting
    Errors

  27. A silent system
    is a dead system.

  28. A System Monitor
    helps to study behaviour and to
    make sure it is operating as specified.
    http://upload.wikimedia.org/wikipedia/commons/3/3b/Mission_control_center.jpg

  29. https://github.com/Netflix/Turbine

  30. Periodic Checking
    • Heartbeats monitor tasks or remote services
      and initiate recovery
    • Routine Exercises prevent idle
      unit starvation and surface malfunctions
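
A heartbeat check can be sketched as a table of last-seen timestamps. This is a minimal illustration under assumed names (`HeartbeatMonitor`, `beat`, `missing`); the clock is injectable so the behaviour can be exercised without waiting.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per unit and reports units that went quiet."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = {}

    def beat(self, unit):
        """Record a heartbeat from a unit."""
        self._last_seen[unit] = self._clock()

    def missing(self):
        """Units whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            unit for unit, seen in self._last_seen.items()
            if now - seen > self._timeout
        )


# Usage with a fake clock (a one-element list acting as mutable time).
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=5.0, clock=lambda: fake_now[0])
monitor.beat("service-a")
fake_now[0] = 3.0
monitor.beat("service-b")
fake_now[0] = 6.0
print(monitor.missing())  # service-a has been silent for 6 seconds
```

Whatever calls `missing()` periodically would then initiate recovery for the listed units.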

  31. [Diagram: Endpoint with Encoder, Encoder, Netty Writes, Netty Reads, Decoder, Decoder; labels: No Traffic, Event on Idle]

  32. Riding over Transients
    is used to defer error recovery
    if the error is temporary.
    “‘Patience is a virtue’ to allow the true signature of
    an error to show itself.”
    - Robert S. Hanmer
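
One simple way to ride over transients is to wait for several consecutive errors before starting recovery. The sketch below assumes that policy; the class name and the threshold are illustrative, and other policies (for example time windows) would work as well.

```python
class TransientFilter:
    """Signals recovery only after `threshold` consecutive errors."""

    def __init__(self, threshold: int):
        self._threshold = threshold
        self._consecutive_errors = 0

    def record(self, ok: bool) -> bool:
        """Record one observation; return True when recovery should start."""
        if ok:
            # A success means the earlier errors were transient.
            self._consecutive_errors = 0
            return False
        self._consecutive_errors += 1
        return self._consecutive_errors >= self._threshold


checker = TransientFilter(threshold=3)
observations = [False, True, False, False, False]
decisions = [checker.record(ok) for ok in observations]
print(decisions)
```

The single error followed by a success is ignored; only the run of three errors triggers recovery.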

  33. And more!
    • Complete Parameter Checking
    • Watchdogs
    • Voting
    • Checksums
    • Routine Audits

  34. Recovery and Mitigation
    of Errors

  35. Timeout
    to avoid waiting forever while
    holding up the resource.
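
A timeout around a blocking call might look like the sketch below, which uses Python's `concurrent.futures`. The helper name and the fallback parameter are assumptions made for illustration. Note that the worker thread is not cancelled; only the caller stops waiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(func, timeout_seconds, fallback):
    """Run func in a worker thread; return fallback if it takes too long."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # The caller is released; the worker keeps running in the background.
        return fallback
    finally:
        pool.shutdown(wait=False)


def slow_lookup():
    time.sleep(0.5)
    return "late answer"


print(call_with_timeout(lambda: "fast answer", 1.0, "fallback"))
print(call_with_timeout(slow_lookup, 0.05, "fallback"))
```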

  36. Failover
    to a redundant unit when the error has been
    detected and isolated.
    Reminder: Redundancy
    [Diagram: Active/Active, Active/Standby and N+M, compared by Cost and Time To Recover]

  37. Intelligent Retries
    [Chart: Time between Retries against Number of Attempts, for Fixed, Linear and Exponential strategies]
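
The three delay strategies named on this slide can be sketched as plain functions plus a small retry loop. The function names, the base delay and the doubling factor are assumptions for illustration; the `sleep` parameter is injectable so the loop can be tried without real waiting.

```python
import time


def fixed(base, attempt):
    """Same delay before every retry."""
    return base


def linear(base, attempt):
    """Delay grows by `base` with each attempt."""
    return base * attempt


def exponential(base, attempt):
    """Delay doubles with each attempt."""
    return base * 2 ** (attempt - 1)


def retry(operation, delay_for, max_attempts, base_delay=0.1, sleep=time.sleep):
    """Call operation until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # attempts exhausted: let the caller escalate
            sleep(delay_for(base_delay, attempt))


for strategy in (fixed, linear, exponential):
    print(strategy.__name__, [strategy(1, n) for n in range(1, 5)])
```

Bounding the number of attempts matters as much as the delay: an unbounded retry loop just turns one error into sustained extra load.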

  38. Restart
    can be used as a last resort, at the
    cost of losing state and time.

  39. Fail Fast
    to shed load: better a partial but great service
    than a complete but bad one.
    [Diagram: Boundary]
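
Failing fast at a boundary can be sketched with a bounded queue that rejects work immediately when it is full, instead of blocking the caller. The class name and capacity below are hypothetical.

```python
import queue


class LoadSheddingQueue:
    """Bounded work queue that rejects new requests instead of blocking."""

    def __init__(self, capacity: int):
        self._queue = queue.Queue(maxsize=capacity)

    def offer(self, request) -> bool:
        """Return True if accepted, False if the request was shed."""
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            # Fail fast: tell the caller right away rather than queueing forever.
            return False

    def take(self):
        """Remove and return the oldest accepted request."""
        return self._queue.get_nowait()


inbox = LoadSheddingQueue(capacity=2)
accepted = [inbox.offer(f"request-{n}") for n in range(4)]
print(accepted)
```

Requests that fit are served normally, while the excess is rejected at once, so accepted work is not slowed down by a growing backlog.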

  40. Backpressure
    & Batching!

  41. Case Study: Hystrix
    https://raw.githubusercontent.com/wiki/Netflix/Hystrix/images/hystrix-flow-chart-original.png
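
The linked flow chart includes a circuit breaker and a fallback path. The sketch below illustrates that general pattern in a heavily simplified form; it is not Hystrix's API or implementation, and the class name, thresholds and clock parameter are assumptions.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures and fails fast until a reset period passes."""

    def __init__(self, failure_threshold, reset_after_seconds, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return fallback()  # open: do not touch the failing dependency
            # Reset period elapsed: allow one trial call (half-open).
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit again.
        self._failures = 0
        self._opened_at = None
        return result


fake_now = [0.0]
attempts = []


def failing_call():
    attempts.append(fake_now[0])
    raise RuntimeError("dependency down")


breaker = CircuitBreaker(2, 10.0, clock=lambda: fake_now[0])
for _ in range(3):
    print(breaker.call(failing_call, lambda: "fallback"))
print(len(attempts))  # the third call never reached the dependency
```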

  42. And more!
    Recovery
    • Rollback
    • Roll-Forward
    • Checkpoints
    • Data Reset
    Mitigation
    • Bounded Queuing
    • Expansive Controls
    • Marking Data
    • Error Correcting Codes

  44. Recommended
    Reading

  45. Patterns for
    Fault-Tolerant Software
    by Robert S. Hanmer

  46. Release It!
    by Michael T. Nygard

  47. Any
    Questions?

  48. Thank you!
    twitter: @daschl
    email: [email protected]