The Walking Dead - A Survival Guide to Resilient Applications

This talk is an introduction to resilient, fault tolerant systems with a focus on design patterns and practical examples thereof.

It was given at Voxxed Days Vienna 2015 and, once uploaded, should be available on parleys.com.

Michael Nitschinger

February 06, 2015

Transcript

  1. @daschl #Voxxed The Walking Dead: A Survival Guide to Resilient Applications, Michael Nitschinger

  2. The right mindset

  3. “The more you sweat in peace, the less you bleed in war.” – U.S. Marine Corps

  4.

  5.

  6. Not so fast, mister fancy tests!

  7. Always ask yourself: What can go wrong?

  8. Fault Tolerance 101

  9. Fault → Error → Failure: A fault is a latent defect that can cause an error when activated.

  10. Fault → Error → Failure: Errors are the manifestations of faults.

  11. Fault → Error → Failure: Failure occurs when the service no longer complies with its specifications.

  12. Fault → Error → Failure: Errors are inevitable. We need to detect, recover from and mitigate them before they become failures.
  13. Reliability is the probability that a system will perform failure-free for a given amount of time. MTTF: Mean Time To Failure. MTTR: Mean Time To Repair.
  14. Availability is the percentage of time the system is able to perform its function. availability = MTTF / (MTTF + MTTR)
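The formula on slide 14 can be tried out with a few lines of Python. This is only an illustrative sketch: the function name `availability` and the example figures (1000 hours between failures, 1 hour to repair) are assumptions, not values from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return the fraction of time a system is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service: fails every 1000 hours on average, repaired in 1 hour.
print(f"{availability(1000, 1):.4%}")
```

Note how the repair time matters as much as the failure rate: halving MTTR improves availability about as much as doubling MTTF.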
  15. Availability levels and the downtime they allow:

      Expression                    Downtime/Year
      Three 9s           99.9%      525.6 min
      Four 9s            99.99%     52.56 min
      Four 9s and a 5    99.995%    26.28 min
      Five 9s            99.999%    5.256 min
      Six 9s             99.9999%   0.5256 min
      100%                          0
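The downtime figures on slide 15 follow directly from the availability percentage. The sketch below reproduces them, assuming a 365-day year (525,600 minutes); the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year a system may be down at the given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for label, percent in [
    ("Three 9s", 99.9),
    ("Four 9s", 99.99),
    ("Four 9s and a 5", 99.995),
    ("Five 9s", 99.999),
    ("Six 9s", 99.9999),
]:
    print(f"{label:<16} {percent}%  {downtime_minutes_per_year(percent):.4g} min")
```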
  16. Pop Quiz! Edge Service, User Service, Session Store, Data Warehouse. Wanted: 99.99% availability. ???, ???, ???

  17. Pop Quiz! Edge Service, User Service, Session Store, Data Warehouse. Wanted: 99.99% availability. 99.999%, 99.999%, 99.999%
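One way to read the quiz answer: if the Edge Service needs all three downstream services for every request, and their failures are independent, the combined availability is the product of the individual ones. Both conditions are assumptions made for this sketch; the slide itself only lists the service names and the numbers.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the probabilities multiply.
    """
    return prod(availabilities)


target = 0.9999  # the wanted 99.99%

# Three dependencies at five 9s each stay above the target...
print(serial_availability([0.99999] * 3) >= target)
# ...while three dependencies at four 9s each fall below it.
print(serial_availability([0.9999] * 3) >= target)
```

Under these assumptions each dependency has to be noticeably more available than the target for the whole chain.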
  18. Fault Tolerant Architecture

  19. Units of Mitigation are the basic units of error containment and recovery.

  20.

  21. Redundancy: Active/Active, Active/Standby, N+M (chart axes: Cost, Time To Recover)

  22. Escalation is used when recovery or mitigation is not possible inside the unit.

  23. Escalation (taken from http://letitcrash.com/post/30165507578/shutdown-patterns-in-akka-2)

  24. The Fault Observer receives system and error events and can guide and orchestrate detection and recovery. (Diagram labels: Unit, Observer, Listener)
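A fault observer as described on slide 24 can be sketched as a small publish/subscribe hub: units report events to it, and listeners registered with it are notified. The class and method names below (`FaultObserver`, `subscribe`, `report`) are hypothetical and not taken from the talk or from any library.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Listener = Callable[[str, str], None]


class FaultObserver:
    """Collects error events from units and forwards them to listeners."""

    def __init__(self) -> None:
        self._listeners: List[Listener] = []
        self.error_counts: Dict[str, int] = defaultdict(int)

    def subscribe(self, listener: Listener) -> None:
        """Register a callback that is invoked with (unit, event)."""
        self._listeners.append(listener)

    def report(self, unit: str, event: str) -> None:
        """Called by a unit when it detects an error."""
        self.error_counts[unit] += 1
        for listener in self._listeners:
            listener(unit, event)


observer = FaultObserver()
observer.subscribe(lambda unit, event: print(f"[{unit}] {event}"))
observer.report("session-store", "connection reset")
```

Keeping the per-unit counts in one place is what lets the observer guide recovery decisions instead of each unit deciding in isolation.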
  25.

  26.

  27. Detecting Errors

  28. A silent system is a dead system.

  29. A System Monitor helps to study the system's behaviour and to make sure it is operating as specified. (Image: http://upload.wikimedia.org/wikipedia/commons/3/3b/Mission_control_center.jpg)

  30. https://github.com/Netflix/Turbine

  31. Periodic Checking: Heartbeats monitor tasks or remote services and initiate recovery. Routine Exercises prevent idle unit starvation and surface malfunctions.
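The heartbeat idea from slide 31 can be illustrated with a monitor that remembers when each unit last reported in and flags those that have been silent for too long. This is a minimal sketch with invented names (`HeartbeatMonitor`, `beat`, `silent_units`); the clock is injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per unit and reports units that went silent."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def beat(self, unit: str) -> None:
        """Record a heartbeat from the given unit."""
        self._last_seen[unit] = self._clock()

    def silent_units(self) -> List[str]:
        """Units whose last heartbeat is older than the timeout."""
        now = self._clock()
        return [unit for unit, seen in self._last_seen.items()
                if now - seen > self._timeout]


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.beat("user-service")
print(monitor.silent_units())  # nothing has timed out yet
```

A real system would call `silent_units()` periodically and start recovery for whatever it returns.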
  32. Utilizing Netty’s IdleStateHandler

  33. Riding over Transients is used to defer error recovery if the error is temporary. “‘Patience is a virtue’ to allow the true signature of an error to show itself.” - Robert S. Hanmer
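One possible way to ride over transients is to count errors inside a sliding time window and start recovery only once they pile up. The sketch below is an interpretation, not the pattern as presented in the talk; the class name, the threshold and the window length are all assumptions.

```python
import time
from collections import deque
from typing import Callable, Deque


class TransientFilter:
    """Defers recovery until errors repeat often enough within a window."""

    def __init__(self, threshold: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = threshold
        self._window = window_seconds
        self._clock = clock
        self._events: Deque[float] = deque()

    def record_error(self) -> bool:
        """Record one error; return True only when recovery should start."""
        now = self._clock()
        self._events.append(now)
        # Forget errors that have fallen out of the observation window.
        while self._events and now - self._events[0] > self._window:
            self._events.popleft()
        return len(self._events) >= self._threshold


errors = TransientFilter(threshold=3, window_seconds=10.0)
print(errors.record_error())  # a single error is treated as transient
```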
  34.

  35. And more! • Complete Parameter Checking • Watchdogs • Voting • Checksums • Routine Audits
  36. Recovery and Mitigation of Errors

  37. Failover to a redundant unit when the error has been detected and isolated. Redundancy reminder: Active/Active, Active/Standby, N+M (chart axes: Cost, Time To Recover)
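In its simplest form, failover means trying the redundant units in order until one of them answers. The helper below is a sketch under that assumption; `call_with_failover` and the stand-in units are invented for illustration and leave out the detection and isolation steps the slide mentions.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def call_with_failover(units: Iterable[Callable[[str], T]], request: str) -> T:
    """Send the request to each unit in turn and return the first success."""
    last_error = None
    for unit in units:
        try:
            return unit(request)
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
    raise RuntimeError("all redundant units failed") from last_error


def broken_primary(request: str) -> str:
    raise ConnectionError("primary is down")


def standby(request: str) -> str:
    return f"standby handled {request}"


print(call_with_failover([broken_primary, standby], "GET /user/42"))
```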
  38. Intelligent Retries (chart axes: Time between Retries, Number of Attempts; curves: Fixed, Linear, Exponential)
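The three curves on slide 38 correspond to three ways of choosing the delay before the next attempt. The following sketch shows them next to a small retry loop; the function names, the one-second base delay and the attempt limit are illustrative assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fixed_delay(attempt: int, base: float = 1.0) -> float:
    """Same pause before every retry."""
    return base


def linear_delay(attempt: int, base: float = 1.0) -> float:
    """Pause grows by a constant step per attempt."""
    return base * attempt


def exponential_delay(attempt: int, base: float = 1.0) -> float:
    """Pause doubles with every attempt."""
    return base * 2 ** (attempt - 1)


def retry(operation: Callable[[], T], delay_for: Callable[[int], float],
          max_attempts: int = 5, sleep: Callable[[float], None] = time.sleep) -> T:
    """Run the operation, pausing between failed attempts; re-raise at the end."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(delay_for(attempt))
    raise ValueError("max_attempts must be at least 1")


for strategy in (fixed_delay, linear_delay, exponential_delay):
    print(strategy.__name__, [strategy(n) for n in range(1, 5)])
```

Production code usually adds an upper bound and some random jitter to the delay so that many clients do not retry in lockstep.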
  39. Restart can be used as a last resort, with the trade-off of losing state and time.

  40. Fail Fast to shed load and give a partial great service rather than a complete bad one. (Diagram label: Boundary)
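Failing fast at a boundary can be approximated by capping the number of requests in flight and rejecting the rest immediately instead of queueing them. The sketch below uses a semaphore for that; `FailFastGate` and `Overloaded` are hypothetical names, and the capacity is an arbitrary example value.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class Overloaded(Exception):
    """Raised when a request is shed because the unit is at capacity."""


class FailFastGate:
    """Admits at most max_in_flight concurrent calls and rejects the rest."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, operation: Callable[[], T]) -> T:
        # Do not wait for a free slot: reject right away to shed load.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("rejecting request: unit is at capacity")
        try:
            return operation()
        finally:
            self._slots.release()


gate = FailFastGate(max_in_flight=2)
print(gate.call(lambda: "served"))
```

Callers that receive `Overloaded` get an answer quickly and can degrade gracefully, which is the point of the pattern.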
  41. Backpressure & Batching!

  42. Case Study: Hystrix (https://raw.githubusercontent.com/wiki/Netflix/Hystrix/images/hystrix-flow-chart-original.png)

  43. And more! Recovery: • Rollback • Roll-Forward • Checkpoints • Data Reset. Mitigation: • Bounded Queuing • Expansive Controls • Marking Data • Error Correcting Codes

  44. And more! Recovery: • Rollback • Roll-Forward • Checkpoints • Data Reset. Mitigation: • Bounded Queuing • Expansive Controls • Marking Data • Error Correcting Codes
  45. Watch it in Action

  46. Recommended Reading

  47. Patterns for Fault-Tolerant Software by Robert S. Hanmer

  48. Release It! by Michael T. Nygard

  49. Any Questions?

  50. Thank you! twitter: @daschl, email: michael.nitschinger@couchbase.com