The Walking Dead - A Survival Guide to Resilient Reactive Applications

This talk was given at JAX 2015 in Mainz.

Michael Nitschinger

April 23, 2015

Transcript

  3. Fault, Error, Failure: A fault is a latent defect that can cause an error when activated.
  4. Fault, Error, Failure: Errors are inevitable. We need to detect, recover and mitigate them before they become failures.
  5. Reliability is the probability that a system will perform failure-free for a given amount of time. MTTF: Mean Time To Failure. MTTR: Mean Time To Repair.
  6. Availability is the percentage of time the system is able to perform its function. availability = MTTF / (MTTF + MTTR)
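
The formula on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name `availability` and the choice of hours as the unit are assumptions made for illustration, not part of the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction: MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTTF and MTTR must be non-negative")
    total = mttf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTTF + MTTR must be greater than zero")
    return mttf_hours / total


# Illustrative numbers only: a unit that fails on average every 999 hours
# and takes 1 hour to repair is available 99.9% of the time.
print(f"{availability(999, 1):.3%}")
```

Both values must use the same unit; only their ratio matters.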
  7. Expression         Availability   Downtime/Year
     Three 9s           99.9%          525.6 min
     Four 9s            99.99%         52.56 min
     Four 9s and a 5    99.995%        26.28 min
     Five 9s            99.999%        5.256 min
     Six 9s             99.9999%       0.5256 min
     100%               100%           0
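
The downtime figures on this slide follow from the number of minutes in a year. The sketch below reproduces them, assuming a 365-day year; the names `MINUTES_PER_YEAR` and `downtime_minutes_per_year` are chosen here for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime in minutes allowed by an availability level."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for level in (99.9, 99.99, 99.995, 99.999, 99.9999):
    print(f"{level}% -> {downtime_minutes_per_year(level):.4f} min/year")
```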
  8. Pop Quiz! Edge Service, User Service, Session Store, Data Warehouse. Wanted: 99.99% Availability. ???, ???, ???
  9. Pop Quiz! Edge Service, User Service, Session Store, Data Warehouse. Wanted: 99.99% Availability. 99.99%, 99.99%, 99.99%
  10. Pop Quiz! Edge Service, User Service, Session Store, Data Warehouse. Wanted: 99.99% Availability. ~99.999%, ~99.999%, ~99.999%
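
One way to read the quiz answer: if every request has to pass through three dependencies in series, and those dependencies fail independently, the individual availabilities multiply. The sketch below works through that arithmetic. The serial and independence assumptions, and the function names, are mine and are not stated on the slides.

```python
import math


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return math.prod(availabilities)


def required_per_component(target: float, count: int) -> float:
    """Availability each of `count` equal serial components needs to meet `target`."""
    return target ** (1 / count)


target = 0.9999
print(serial_availability([0.9999] * 3) >= target)   # three times 99.99% misses the target
print(serial_availability([0.99999] * 3) >= target)  # three times 99.999% meets it
print(f"{required_per_component(target, 3):.6%}")    # exact per-component requirement
```

Under these assumptions each dependency has to be more available than the target for the whole chain, which is consistent with the step from 99.99% to roughly 99.999% on the slides.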
  11. The Fault Observer receives system and error events and can guide and orchestrate detection and recovery. [Diagram labels: Unit, Observer, Listener]
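
A minimal sketch of the Fault Observer idea: units report events to a central observer, which forwards them to registered listeners. The class name `FaultObserver`, the `subscribe` and `report` methods, and the event shape are hypothetical and only illustrate the pattern; the talk does not show an implementation.

```python
class FaultObserver:
    """Collects events from units and fans them out to listeners."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, listener):
        """Register a callable that accepts (unit, event)."""
        self._listeners.append(listener)

    def report(self, unit: str, event: str) -> None:
        """Called by a unit when it observes a system or error event."""
        for listener in self._listeners:
            listener(unit, event)


observer = FaultObserver()
log = []
observer.subscribe(lambda unit, event: log.append((unit, event)))
observer.report("user-service", "connection timeout")
print(log)
```

A listener could be a logger, a metrics collector, or a component that starts recovery for the reporting unit.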
  14. A System Monitor helps to study behaviour and to make sure it is operating as specified. Image: http://cdn-www.airliners.net/aviation-photos/photos/9/2/1/0982129.jpg
  15. Periodic Checking: Heartbeats monitor tasks or remote services and initiate recovery. Routine Exercises prevent idle unit starvation and surface malfunctions.
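
A heartbeat check can be sketched as a table of "last seen" timestamps plus a timeout. The `HeartbeatMonitor` class and its method names below are assumptions for illustration; the clock is injectable so the behaviour can be exercised without waiting.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per unit and reports units that went silent."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = {}

    def beat(self, unit: str) -> None:
        """Record a heartbeat from a unit."""
        self._last_seen[unit] = self._clock()

    def overdue(self):
        """Return the units whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            unit for unit, seen in self._last_seen.items()
            if now - seen > self._timeout
        )


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.beat("session-store")
print(monitor.overdue())  # nothing is overdue right after a heartbeat
```

A periodic task would call `overdue()` and start recovery for each unit it returns.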
  16. Riding over Transients is used to defer error recovery if the error is temporary. “‘Patience is a virtue’ to allow the true signature of an error to show itself.” - Robert S. Hanmer
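
One simple way to ride over transients is to start recovery only after several consecutive errors, so a single blip is ignored. The sketch below uses a consecutive-error counter; this is a simplification chosen for illustration (the class name `TransientFilter` and the threshold are assumptions), not necessarily the mechanism the speaker had in mind.

```python
class TransientFilter:
    """Signals recovery only after `threshold` consecutive errors."""

    def __init__(self, threshold: int):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self._threshold = threshold
        self._consecutive_errors = 0

    def record(self, ok: bool) -> bool:
        """Record one observation; return True when recovery should start."""
        if ok:
            # A success shows the earlier errors were transient.
            self._consecutive_errors = 0
            return False
        self._consecutive_errors += 1
        return self._consecutive_errors >= self._threshold


flt = TransientFilter(threshold=3)
observations = [False, True, False, False, False]  # False means an error was seen
print([flt.record(ok) for ok in observations])
```

The isolated error in the first observation is ridden over; only the run of three errors at the end triggers recovery.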
  17. Failover to a redundant unit when the error has been detected and isolated. [Figure: Redundancy Reminder. Axes: Cost, Time To Recover. Options shown: Active/Active, Active/Standby, N+M]
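
Failover can be sketched as trying redundant units in order until one succeeds. The function `call_with_failover` and the stand-in units `primary` and `standby` are hypothetical; a real system would also need to isolate the failed unit and decide when to return to it.

```python
def call_with_failover(units, request):
    """Try each redundant unit in order and return the first successful result."""
    last_error = None
    for unit in units:
        try:
            return unit(request)
        except Exception as exc:  # broad on purpose: any error triggers failover
            last_error = exc
    raise RuntimeError("all redundant units failed") from last_error


def primary(request):
    raise ConnectionError("primary is down")


def standby(request):
    return f"standby handled {request}"


print(call_with_failover([primary, standby], "GET /user/42"))
```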
  18. Restart can be used as a last resort, at the cost of losing state and time.
  19. Fail Fast to shed load: better to give great service to some requests than bad service to all of them. [Diagram label: Boundary]
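
Failing fast at a boundary can be sketched with a bounded queue that rejects new work immediately when it is full instead of letting callers wait. The `RejectedError` exception and `submit` function are names assumed for this example; `queue.Queue` and `put_nowait` come from the Python standard library.

```python
import queue


class RejectedError(Exception):
    """Raised when work is turned away at the boundary."""


def submit(work_queue: queue.Queue, task) -> None:
    """Enqueue a task, or fail fast if the queue is already full."""
    try:
        work_queue.put_nowait(task)
    except queue.Full:
        raise RejectedError("queue full, request rejected") from None


work = queue.Queue(maxsize=2)
submit(work, "request-1")
submit(work, "request-2")
try:
    submit(work, "request-3")
except RejectedError as err:
    print(err)
```

Accepted requests keep a short, predictable wait, and rejected callers learn right away that they should back off or retry elsewhere.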
  20. And more!
      Recovery: Rollback, Roll-Forward, Checkpoints, Data Reset
      Mitigation: Bounded Queuing, Expansive Controls, Marking Data, Error Correcting Codes