The Walking Dead - A Survival Guide to Resilient Reactive Applications

This talk was given at Lambda Days Krakow 2015, a great event with a focus on functional and reactive programming.

The slides are very much like the ones given at Voxxed, but with a greater focus on functional and event-driven paradigms.

Michael Nitschinger

February 27, 2015

Transcript

  1. The Walking Dead
    A Survival Guide to Resilient Reactive Applications
    Michael Nitschinger
    @daschl

  2. The right Mindset

  3. “The more you sweat in peace, the less you bleed in war.”
    - U.S. Marine Corps

  6. Not so fast, mister fancy tests!

  7. Always ask yourself: What can go wrong?

  8. Fault Tolerance 101

  9. Fault Error Failure
    A fault is a latent defect that can cause an
    error when activated.

  10. Fault Error Failure
    Errors are the manifestations of faults.

  11. Fault Error Failure
    Failure occurs when the service no longer
    complies with its specifications.

  12. Fault Error Failure
    Errors are inevitable. We need to
    detect, recover and mitigate
    them before they become failures.

  13. Reliability
    is the probability that a system will perform
    failure-free for a given amount of time.
    MTTF: Mean Time To Failure
    MTTR: Mean Time To Repair

  14. Availability
    is the percentage of time the system is able to
    perform its function.
    availability = MTTF / (MTTF + MTTR)
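
The availability formula can be tried out with a few lines of Python. This is only a minimal sketch: the function name and the sample figures (999 hours between failures, 1 hour to repair) are assumptions for illustration and do not come from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical numbers: one failure every 999 hours, one hour to repair.
example = availability(999.0, 1.0)
print(f"{example:.3%}")
```

Halving the repair time in this sketch raises availability without touching MTTF, which is why fast recovery matters as much as rare failure.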

  15. Expression        Availability  Downtime/Year
    Three 9s            99.9%         525.6 min
    Four 9s             99.99%        52.56 min
    Four 9s and a 5     99.995%       26.28 min
    Five 9s             99.999%       5.256 min
    Six 9s              99.9999%      0.5256 min
                        100%          0
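
The downtime figures follow directly from the availability percentage. The sketch below recomputes them, assuming a 365-day year (525,600 minutes); the function and constant names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected downtime per year for a given availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for percent in (99.9, 99.99, 99.995, 99.999, 99.9999, 100.0):
    print(f"{percent}% -> {downtime_minutes_per_year(percent):.4f} min")
```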

  16. Pop Quiz!
    Edge Service
    User Service Session Store Data Warehouse
    Wanted: 99.99% Availability
    ??? ??? ???

  17. Pop Quiz!
    Edge Service
    User Service Session Store Data Warehouse
    Wanted: 99.99% Availability
    99.99% 99.99% 99.99%

  18. Pop Quiz!
    Edge Service
    User Service Session Store Data Warehouse
    Wanted: 99.99% Availability
    ~99.999% ~99.999% ~99.999%
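
The quiz can be checked numerically. The sketch below assumes the Edge Service needs all three dependencies to be up and that they fail independently, so their availabilities multiply; both assumptions and all function names are illustrative, not taken from the talk.

```python
import math


def combined_availability(dependencies: list[float]) -> float:
    """Availability of a service that needs every dependency to be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(dependencies)


def required_per_dependency(target: float, count: int) -> float:
    """Availability each of `count` equal dependencies needs to hit `target`."""
    return target ** (1 / count)


print(f"{combined_availability([0.9999] * 3):.6f}")   # three services at 99.99%
print(f"{combined_availability([0.99999] * 3):.6f}")  # three services at 99.999%
print(f"{required_per_dependency(0.9999, 3):.6f}")
```

Under these assumptions, three dependencies at 99.99% each fall short of the 99.99% target, while three at about 99.999% each meet it.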

  19. Fault Tolerant Architecture

  20. Units of Mitigation
    are the basic units of
    error containment and recovery.

  21. Escalation
    is used when recovery or mitigation
    is not possible inside the unit.

  22. Escalation
    Cluster
    Node Node
    Service Service Service Service Service
    Endpoint Endpoint Endpoint Endpoint Endpoint
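
One way to picture escalation through the Endpoint, Service, Node and Cluster levels is a chain of units, each of which either recovers locally or hands the error to its parent. The sketch below is an illustration only: the `Unit` class, the error strings and the recovery rules are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Unit:
    """A unit of mitigation that may recover locally or escalate to its parent."""

    name: str
    can_recover: Callable[[str], bool]
    parent: Optional["Unit"] = None

    def handle(self, error: str) -> str:
        """Return the name of the unit that finally handled the error."""
        if self.can_recover(error):
            return self.name
        if self.parent is None:
            raise RuntimeError(f"unhandled error at top level: {error}")
        return self.parent.handle(error)  # escalate one level up


# Hypothetical recovery rules for each level of the hierarchy on the slide.
cluster = Unit("cluster", lambda error: True)
node = Unit("node", lambda error: error in {"service crashed"}, parent=cluster)
service = Unit("service", lambda error: error in {"endpoint down"}, parent=node)
endpoint = Unit("endpoint", lambda error: error in {"socket closed"}, parent=service)

print(endpoint.handle("socket closed"))    # handled locally
print(endpoint.handle("service crashed"))  # escalated twice
```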

  26. Redundancy
    Options: Active/Active, Active/Standby, Active/Passive, N+M
    Trade-off axes: Cost, Time To Recover

  27. The Fault Observer
    receives system and error events and
    can guide and orchestrate detection and recovery
    [Diagram labels: Unit (x4), Observer, Listener (x2)]

  30. Detecting Errors

  31. A silent system
    is a dead system.

  32. A System Monitor
    helps to study behaviour and to
    make sure it is operating as specified.
    http://upload.wikimedia.org/wikipedia/commons/3/3b/Mission_control_center.jpg

  33. https://github.com/Netflix/Turbine

  34. Periodic Checking
    • Heartbeats monitor tasks or remote services and initiate recovery.
    • Routine Exercises prevent idle unit starvation and surface malfunctions.
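
A heartbeat check can be sketched as a small monitor that remembers when each unit was last heard from and triggers recovery for units that have gone silent. Everything here is an assumption for illustration: the class, the unit names, the 5-second timeout and the injectable clock (used so the example runs without waiting).

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per unit and reports units that went silent."""

    def __init__(self, timeout_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def beat(self, unit: str) -> None:
        """Record a heartbeat from `unit`."""
        self.last_seen[unit] = self.clock()

    def silent_units(self) -> List[str]:
        """Units whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(u for u, seen in self.last_seen.items() if now - seen > self.timeout_seconds)

    def check(self, recover: Callable[[str], None]) -> None:
        """Initiate recovery for every silent unit."""
        for unit in self.silent_units():
            recover(unit)


# Simulated time: only "user-service" keeps sending heartbeats.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=5.0, clock=lambda: fake_now[0])
monitor.beat("user-service")
monitor.beat("session-store")
fake_now[0] = 4.0
monitor.beat("user-service")
fake_now[0] = 8.0
recovered: List[str] = []
monitor.check(recovered.append)
print(recovered)
```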

  35. [Diagram: Endpoint with Encoder, Encoder, Netty Writes, Netty Reads, Decoder, Decoder; labels: Event on Idle, No Traffic]

  36. Riding over Transients
    is used to defer error recovery
    if the error is temporary.
    “‘Patience is a virtue’ to allow the true signature of
    an error to show itself.”
    - Robert S. Hanmer
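
One simple way to ride over transients is to start recovery only after an error has persisted for several consecutive checks. The sketch below assumes that interpretation; the class name and the threshold of three are hypothetical choices, not something the talk prescribes.

```python
class TransientFilter:
    """Defers recovery until an error has persisted for several checks in a row."""

    def __init__(self, threshold: int):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0

    def observe(self, healthy: bool) -> bool:
        """Record one check result; return True when recovery should start."""
        if healthy:
            self.consecutive_failures = 0  # the error was transient
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


checks = [True, False, True, False, False, False]
transient_filter = TransientFilter(threshold=3)
decisions = [transient_filter.observe(result) for result in checks]
print(decisions)
```

The single failure in the second check is ignored because the next check succeeds; only the run of three failures at the end triggers recovery.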

  38. And more!
    • Complete Parameter Checking
    • Watchdogs
    • Voting
    • Checksums
    • Routine Audits

  39. Recovery and Mitigation of Errors

  40. Timeout
    to avoid waiting forever and
    holding up the resource.

  41. Failover
    to a redundant unit when the error has been
    detected and isolated.
    Reminder (Redundancy): Active/Active, Active/Standby, N+M
    Trade-off axes: Cost, Time To Recover

  42. Intelligent Retries
    [Plot: Time between Retries against Number of Attempts for three strategies: Fixed, Linear, Exponential]
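
The three retry strategies differ only in how the delay grows with the attempt number. The sketch below expresses each as a small function and plugs it into a generic retry loop; the names, the base delay of 0.1 seconds and the doubling factor are assumptions for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fixed(delay: float) -> Callable[[int], float]:
    """Same delay before every retry."""
    return lambda attempt: delay


def linear(delay: float) -> Callable[[int], float]:
    """Delay grows proportionally to the attempt number."""
    return lambda attempt: delay * attempt


def exponential(delay: float, factor: float = 2.0) -> Callable[[int], float]:
    """Delay is multiplied by `factor` after every failed attempt."""
    return lambda attempt: delay * factor ** (attempt - 1)


def retry(operation: Callable[[], T], max_attempts: int,
          backoff: Callable[[int], float],
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(backoff(attempt))
    raise ValueError("max_attempts must be at least 1")


for name, strategy in (("fixed", fixed(0.1)), ("linear", linear(0.1)),
                       ("exponential", exponential(0.1))):
    print(name, [round(strategy(attempt), 2) for attempt in range(1, 6)])
```

Passing `sleep` as a parameter keeps the loop testable; a production version would usually also add jitter and restrict which exceptions count as retryable.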

  43. Restart
    can be used as a last resort, at the
    cost of losing state and time.

  44. Fail Fast
    to shed load and give a great partial service
    rather than a bad complete one.
    [Diagram label: Boundary]
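
Failing fast at a boundary can be sketched as a gate that admits a limited number of concurrent calls and rejects the rest immediately instead of queueing them. The `FailFastGate` and `Rejected` names and the limit of one are hypothetical; this illustrates the idea rather than any particular library.

```python
import threading


class Rejected(Exception):
    """Raised when a call is refused at the boundary instead of queued."""


class FailFastGate:
    """Admits at most `limit` concurrent calls and rejects the rest immediately."""

    def __init__(self, limit: int):
        self._slots = threading.BoundedSemaphore(limit)

    def call(self, operation):
        if not self._slots.acquire(blocking=False):
            raise Rejected("over capacity, shedding load")
        try:
            return operation()
        finally:
            self._slots.release()


gate = FailFastGate(limit=1)


def nested():
    """Tries to enter the gate while the only slot is already taken."""
    try:
        return gate.call(lambda: "inner served")
    except Rejected:
        return "inner rejected"


print(gate.call(lambda: "served"))
print(gate.call(nested))
```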

  45. Backpressure & Batching!

  46. Case Study: Hystrix
    https://raw.githubusercontent.com/wiki/Netflix/Hystrix/images/hystrix-flow-chart-original.png

  47. And more!
    Recovery
    • Rollback
    • Roll-Forward
    • Checkpoints
    • Data Reset
    Mitigation
    • Bounded Queuing
    • Expansive Controls
    • Marking Data
    • Error Correcting Codes

  49. Recommended Reading

  50. Patterns for Fault-Tolerant Software
    by Robert S. Hanmer

  51. Release It!
    by Michael T. Nygard

  52. Any Questions?

  53. Thank you!
    twitter: @daschl
    email: [email protected]