The Walking Dead - A Survival Guide to Resilient Reactive Applications

I gave this talk at GeeCon 2015 in Krakow. The recording will be available through the GeeCon channels.

Michael Nitschinger

May 12, 2015

Transcript

  1. The Walking Dead
    A Survival Guide to Resilient Reactive Applications
    Michael Nitschinger @daschl

  2. The Right Mindset

  3. “The more you sweat in peace, the less you bleed in war.”
    – U.S. Marine Corps

  4. Not so fast, mister fancy tests!

  5. Always ask yourself: What can go wrong?

  6. Fault Tolerance 101

  7. Fault, Error, Failure
    A fault is a latent defect that can cause an error when activated.

  8. Fault, Error, Failure
    Errors are the manifestations of faults.

  9. Fault, Error, Failure
    Failure occurs when the service no longer complies with its specifications.

  10. Fault, Error, Failure
    Errors are inevitable. We need to detect, recover from and mitigate
    them before they become failures.

  11. Reliability
    is the probability that a system will perform failure-free for a given amount of time.
    MTTF: Mean Time To Failure
    MTTR: Mean Time To Repair

  12. Availability
    is the percentage of time the system is able to perform its function.
    availability = MTTF / (MTTF + MTTR)
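
The availability formula on this slide can be tried out with a few lines of Python. This is a minimal sketch; the function name and the example figures (999 hours between failures, 1 hour to repair) are illustrative assumptions, not numbers from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails every 999 hours on average, takes 1 hour to repair.
example = availability(999.0, 1.0)
print(f"{example:.3%}")  # prints 99.900%
```

Note that shrinking MTTR improves availability just as much as stretching MTTF, which is why fast detection and recovery matter.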

  13. Expression         Availability   Downtime/Year
      Three 9s           99.9%          525.6 min
      Four 9s            99.99%         52.56 min
      Four 9s and a 5    99.995%        26.28 min
      Five 9s            99.999%        5.256 min
      Six 9s             99.9999%       0.5256 min
                         100%           0
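
The downtime figures in this table follow directly from the availability percentage. The sketch below recomputes them, assuming a 365-day year (525,600 minutes); the function and constant names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year the system may be down at the given availability."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


for percent in (99.9, 99.99, 99.995, 99.999, 99.9999, 100.0):
    minutes = downtime_minutes_per_year(percent)
    print(f"{percent}% -> {minutes:.4f} min/year")
```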

  14. Pop Quiz!
    Edge Service
    User Service Session Store Data Warehouse
    Wanted: 99.99% Availability
    ??? ??? ???

  15. Pop Quiz!
    Edge Service
    User Service Session Store Data Warehouse
    Wanted: 99.99% Availability
    99.99% 99.99% 99.99%

  16. Pop Quiz!
    Edge Service
    User Service Session Store Data Warehouse
    Wanted: 99.99% Availability
    ~99.999% ~99.999% ~99.999%
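
The arithmetic behind the quiz can be sketched as follows, under the assumption that the edge service needs all three dependencies at once and that their failures are independent (the slides do not state this explicitly). Availabilities in series multiply, so three dependencies at 99.99% each fall short of the 99.99% target, and each one has to be noticeably better than the target itself.

```python
import math


def serial_availability(parts):
    """Availability of a chain in which every part must be up (independence assumed)."""
    return math.prod(parts)


def required_per_dependency(target: float, count: int) -> float:
    """Availability each of `count` equal dependencies needs to reach `target` overall."""
    return target ** (1.0 / count)


combined = serial_availability([0.9999, 0.9999, 0.9999])
needed = required_per_dependency(0.9999, 3)
print(f"three dependencies at 99.99% each -> {combined:.4%} overall")
print(f"needed per dependency for 99.99% overall -> {needed:.4%}")
```

Under these assumptions the combined figure lands near 99.97%, and each dependency needs roughly 99.997%, which is in line with the slide asking for well above four nines per dependency.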

  17. Fault Tolerant
    Architecture
    19

    View full-size slide

  18. Units of Mitigation
    are the basic units of
    error containment and recovery.

  19. Escalation
    is used when recovery or mitigation
    is not possible inside the unit.

  20. Escalation
    Cluster
    Node Node
    Service Service Service Service Service
    Endpoint Endpoint Endpoint Endpoint Endpoint
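
Escalation through the Endpoint, Service, Node and Cluster levels can be sketched as a chain of units, each of which either handles an error or passes it to its parent. The `Unit` class and the error names below are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Unit:
    """One unit of mitigation in the hierarchy (illustrative only)."""
    name: str
    can_handle: Callable[[str], bool]
    parent: Optional["Unit"] = None

    def handle(self, error: str) -> str:
        """Return the name of the first unit, walking upward, that can handle the error."""
        unit = self
        while unit is not None:
            if unit.can_handle(error):
                return unit.name
            unit = unit.parent  # escalate to the enclosing unit
        raise RuntimeError(f"no unit could handle: {error}")


# Hypothetical error names; the cluster is the last resort and accepts anything.
cluster = Unit("cluster", lambda error: True)
node = Unit("node", lambda error: error == "service-crash", parent=cluster)
service = Unit("service", lambda error: error == "endpoint-disconnect", parent=node)
endpoint = Unit("endpoint", lambda error: error == "decode-error", parent=service)

print(endpoint.handle("decode-error"))         # handled locally
print(endpoint.handle("endpoint-disconnect"))  # escalated one level
```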

  24. Redundancy
    [Chart: Cost and Time To Recover for Active/Active, Active/Standby, Active/Passive and N+M]

  25. The Fault Observer
    receives system and error events and
    can guide and orchestrate detection and recovery
    [Diagram: Units, Observer, Listeners]
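
A fault observer of this kind can be sketched as a small publish/subscribe hub: units report events, and listeners are notified so they can trigger detection or recovery. The class and method names are assumptions made for this example and do not come from any particular library.

```python
from typing import Callable, List, Tuple

Listener = Callable[[str, str], None]


class FaultObserver:
    """Receives events from units and forwards them to registered listeners."""

    def __init__(self) -> None:
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def report(self, unit: str, event: str) -> None:
        # Copy the list so listeners may subscribe others while being notified.
        for listener in list(self._listeners):
            listener(unit, event)


observer = FaultObserver()
seen: List[Tuple[str, str]] = []
restarts: List[str] = []

observer.subscribe(lambda unit, event: seen.append((unit, event)))
# A second, hypothetical listener that reacts only to crashes.
observer.subscribe(lambda unit, event: restarts.append(unit) if event == "crashed" else None)

observer.report("service-a", "slow-response")
observer.report("service-b", "crashed")
```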

  26. Detecting Errors

  27. A silent system
    is a dead system.

  28. A System Monitor
    helps to study behaviour and to
    make sure it is operating as specified.
    http://upload.wikimedia.org/wikipedia/commons/3/3b/Mission_control_center.jpg

  29. https://github.com/Netflix/Turbine

  30. Periodic Checking
    • Heartbeats monitor tasks or remote services and initiate recovery
    • Routine Exercises prevent idle unit starvation and surface malfunctions
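
A heartbeat check can be sketched as a table of last-seen timestamps that is scanned periodically. To keep the example deterministic, the current time is passed in explicitly; the class name, the timeout value and the recovery callback are all illustrative assumptions.

```python
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per unit and flags units that went silent."""

    def __init__(self, timeout_seconds: float, on_missed: Callable[[str], None]) -> None:
        self.timeout_seconds = timeout_seconds
        self.on_missed = on_missed
        self.last_seen: Dict[str, float] = {}

    def beat(self, unit: str, now: float) -> None:
        self.last_seen[unit] = now

    def check(self, now: float) -> List[str]:
        """Call periodically; triggers recovery for every unit past its timeout."""
        missed = [unit for unit, seen in self.last_seen.items()
                  if now - seen > self.timeout_seconds]
        for unit in missed:
            self.on_missed(unit)
        return missed


recovered: List[str] = []
monitor = HeartbeatMonitor(timeout_seconds=5.0, on_missed=recovered.append)
monitor.beat("service-a", now=0.0)
monitor.beat("service-b", now=0.0)
monitor.beat("service-a", now=4.0)   # service-b stays silent
silent = monitor.check(now=6.0)
```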

  31. [Diagram: an Endpoint with a write path (Encoder, Encoder, Netty Writes) and a read path (Netty Reads, Decoder, Decoder); with no traffic, an event is fired on idle]

  32. Riding over Transients
    is used to defer error recovery if the error is temporary.
    “‘Patience is a virtue’ to allow the true signature of an error to show itself.”
    - Robert S. Hanmer
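
One simple way to ride over transients is to start recovery only after several consecutive errors, so that a single blip is ignored. This is a sketch of that one strategy; the class name and the threshold of three are assumptions, and real systems may prefer time windows or error rates.

```python
class TransientFilter:
    """Signals recovery only after `threshold` consecutive errors."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.consecutive_errors = 0

    def record(self, ok: bool) -> bool:
        """Record one observation; return True when recovery should start."""
        if ok:
            self.consecutive_errors = 0  # the error was transient, forget it
            return False
        self.consecutive_errors += 1
        return self.consecutive_errors >= self.threshold


error_filter = TransientFilter(threshold=3)
observations = [False, True, False, False, False]  # False means "error observed"
decisions = [error_filter.record(ok) for ok in observations]
```

In this run only the last observation triggers recovery, because the first error was followed by a success.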

  33. And more!
    • Complete Parameter Checking
    • Watchdogs
    • Voting
    • Checksums
    • Routine Audits

  34. Recovery and Mitigation of Errors

  35. Timeout
    to avoid waiting forever and holding up the resource.
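
A caller-side timeout can be sketched with Python's standard `concurrent.futures` module. The helper name and the fallback value are illustrative assumptions. Note that this only frees the caller: the slow call keeps running in its worker thread until it finishes on its own.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(fn, timeout_seconds: float, fallback):
    """Run fn in a worker thread; return fallback if it takes too long."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block the caller waiting for a slow worker to finish.
        pool.shutdown(wait=False)


def slow_call():
    time.sleep(0.5)  # stands in for a hanging remote call
    return "late"


fast_result = call_with_timeout(lambda: 42, 1.0, fallback=None)
slow_result = call_with_timeout(slow_call, 0.05, fallback="fallback")
```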

  36. Failover
    to a redundant unit when the error has been detected and isolated.
    [Reminder: Redundancy chart of Cost and Time To Recover for Active/Active, Active/Standby and N+M]
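
Failover can be sketched as trying redundant units in order until one succeeds. The replicas here are plain functions and the names are hypothetical; a real system would also mark the failed unit so it is not tried again immediately.

```python
def call_with_failover(replicas, request):
    """Try each replica in order; return the first successful response."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except Exception as error:
            last_error = error  # remember the failure and move to the next replica
    raise RuntimeError("all replicas failed") from last_error


def broken_primary(request):
    raise ConnectionError("primary is down")


def standby(request):
    return f"standby handled {request}"


response = call_with_failover([broken_primary, standby], "GET /user/1")
```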

  37. Intelligent Retries
    [Chart: Time between Retries against Number of Attempts for Fixed, Linear and Exponential strategies]
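
The three delay strategies on this slide can be sketched as a function of the attempt number, plus a small retry loop around them. The function names, the base delay and the injectable `sleep` parameter are assumptions made so the example stays short and testable.

```python
import time


def delay(strategy: str, attempt: int, base: float = 0.1) -> float:
    """Seconds to wait after the given (1-based) failed attempt."""
    if strategy == "fixed":
        return base
    if strategy == "linear":
        return base * attempt
    if strategy == "exponential":
        return base * 2 ** (attempt - 1)
    raise ValueError(f"unknown strategy: {strategy}")


def retry(operation, max_attempts: int, strategy: str, base: float = 0.1, sleep=time.sleep):
    """Call operation until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(delay(strategy, attempt, base))


# Hypothetical operation that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []
result = retry(flaky, max_attempts=5, strategy="exponential", base=1.0, sleep=waits.append)
```

Passing `waits.append` as `sleep` records the delays instead of actually waiting; with the default `time.sleep` the loop would pause for real.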

  38. Restart
    can be used as a last resort, at the cost of losing state and time.

  39. Fail Fast
    to shed load and give a great partial service rather than
    a bad complete one.
    [Diagram: Boundary]
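
Failing fast at a boundary can be sketched with a bounded queue that rejects new work immediately when it is full instead of letting requests pile up. The class names and the capacity of two are illustrative assumptions.

```python
import queue


class Overloaded(Exception):
    """Raised immediately when the service cannot accept more work."""


class BoundedService:
    def __init__(self, capacity: int) -> None:
        self._pending = queue.Queue(maxsize=capacity)

    def submit(self, request: str) -> None:
        try:
            self._pending.put_nowait(request)  # never blocks the caller
        except queue.Full:
            raise Overloaded(f"rejected {request!r}: queue is full") from None

    def pending(self) -> int:
        return self._pending.qsize()


service = BoundedService(capacity=2)
service.submit("a")
service.submit("b")
try:
    service.submit("c")
    rejected = False
except Overloaded:
    rejected = True
```

The two accepted requests still get full service, while the third caller learns right away that it should back off or try elsewhere.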

  40. Backpressure & Batching!

  41. Case Study: Hystrix
    https://raw.githubusercontent.com/wiki/Netflix/Hystrix/images/hystrix-flow-chart-original.png
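
Hystrix combines several of the patterns above, including timeouts, fallbacks and a circuit breaker. The sketch below is not Hystrix's API; it is a minimal, hypothetical circuit breaker in Python that only illustrates the idea of short-circuiting calls after repeated failures and allowing a trial call once a cool-down has passed.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int, reset_after: float, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        now = self.clock()
        if self.opened_at is not None and now - self.opened_at < self.reset_after:
            return fallback()  # open: fail fast without touching the dependency
        try:
            result = fn()  # closed, or half-open trial call
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = now  # open, or re-open after a failed trial
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the circuit again
        return result


def failing():
    raise ConnectionError("dependency down")


fake_time = [0.0]  # controllable clock for the example
breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0, clock=lambda: fake_time[0])

first = breaker.call(failing, lambda: "fallback")
second = breaker.call(failing, lambda: "fallback")      # threshold reached, circuit opens
fake_time[0] = 1.0
while_open = breaker.call(lambda: "ok", lambda: "fallback")  # short-circuited
fake_time[0] = 11.0
after_reset = breaker.call(lambda: "ok", lambda: "fallback")  # trial call succeeds
```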

  42. And more!
    Recovery
    • Rollback
    • Roll-Forward
    • Checkpoints
    • Data Reset
    Mitigation
    • Bounded Queuing
    • Expansive Controls
    • Marking Data
    • Error Correcting Codes

  44. Recommended Reading

  45. Patterns for Fault-Tolerant Software
    by Robert S. Hanmer

  46. Release It!
    by Michael T. Nygard

  47. Any Questions?

  48. twitter: @daschl
    email: [email protected]
    Thank you!