The Walking Dead - A Survival Guide to Resilient Reactive Applications

This talk was given at JAX 2015 in Mainz.

Michael Nitschinger

April 23, 2015

Transcript

  1. Michael Nitschinger | Couchbase, Inc. The Walking Dead: A Survival Guide to Reactive Resilient Applications

  2. The Right Mindset

  3. “The more you sweat in peace, the less you bleed in war.” - U.S. Marine Corps
  6. Not so fast, mister fancy tests!

  7. Always ask yourself: What can go wrong?

  8. Fault Tolerance 101

  9. Fault, Error, Failure: A fault is a latent defect that can cause an error when activated.

  10. Fault, Error, Failure: Errors are the manifestations of faults.

  11. Fault, Error, Failure: Failure occurs when the service no longer complies with its specifications.

  12. Fault, Error, Failure: Errors are inevitable. We need to detect, recover and mitigate them before they become failures.
  13. Reliability is the probability that a system will perform failure free for a given amount of time. MTTF: Mean Time To Failure. MTTR: Mean Time To Repair.
  14. Availability is the percentage of time the system is able to perform its function. availability = MTTF / (MTTF + MTTR)
  15. Downtime per year by availability level:

    Expression         Availability   Downtime/Year
    Three 9s           99.9%          525.6 min
    Four 9s            99.99%         52.56 min
    Four 9s and a 5    99.995%        26.28 min
    Five 9s            99.999%        5.256 min
    Six 9s             99.9999%       0.5256 min
    100%               100%           0
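The availability formula and the downtime table above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function names and the MTTF/MTTR values in the example are illustrative assumptions, and a year is taken as 365 days (525,600 minutes), which matches the table.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def availability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per year in minutes, for an availability in [0, 1]."""
    return (1.0 - avail) * MINUTES_PER_YEAR


# Hypothetical unit: fails on average every 999 hours, takes 1 hour to repair.
a = availability(mttf=999.0, mttr=1.0)
print(f"availability: {a:.3%}")
print(f"downtime/year: {downtime_minutes_per_year(a):.1f} min")
```

Both inputs must use the same time unit; the result is a ratio, so the unit itself cancels out.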
  16. Pop Quiz! Edge Service, User Service, Session Store, Data Warehouse. Wanted: 99.99% Availability. ???, ???, ???

  17. Pop Quiz! Edge Service, User Service, Session Store, Data Warehouse. Wanted: 99.99% Availability. 99.99%, 99.99%, 99.99%

  18. Pop Quiz! Edge Service, User Service, Session Store, Data Warehouse. Wanted: 99.99% Availability. ~99.999%, ~99.999%, ~99.999%
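One way to read the quiz is that the Edge Service needs all three downstream services for every request. Under that assumption, and assuming the services fail independently (neither assumption is stated on the slides), the availabilities multiply. The sketch below uses hypothetical helper names to show why three dependencies at 99.99% each miss the target, while roughly five 9s each meets it.

```python
from math import prod


def serial_availability(parts):
    """Availability of a chain in which every part must be up.

    Assumes the parts fail independently of each other.
    """
    return prod(parts)


def required_per_part(target, n):
    """Availability each of n equal parts needs for the chain to reach target."""
    return target ** (1.0 / n)


print(f"3 x 99.99%:  {serial_availability([0.9999] * 3):.4%}")
print(f"3 x 99.999%: {serial_availability([0.99999] * 3):.4%}")
print(f"needed per dependency for 99.99%: {required_per_part(0.9999, 3):.4%}")
```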
  19. Fault Tolerant Architecture

  20. Units of Mitigation are the basic units of error containment and recovery.

  21. Escalation is used when recovery or mitigation is not possible inside the unit.
  22. Escalation. [Diagram: hierarchy of Cluster, Nodes, Services and Endpoints]
  26. Redundancy: Active/Active, Active/Standby, N+M, Active/Passive. [Chart axes: Cost, Time To Recover]
  27. The Fault Observer receives system and error events and can guide and orchestrate detection and recovery. [Diagram: Units report to the Observer, which notifies Listeners]
  30. Detecting Errors

  31. A silent system is a dead system.

  32. A System Monitor helps to study behaviour and to make sure it is operating as specified. Image: http://cdn-www.airliners.net/aviation-photos/photos/9/2/1/0982129.jpg
  33. https://github.com/Netflix/Turbine

  34. Periodic Checking. Heartbeats monitor tasks or remote services and initiate recovery. Routine Exercises prevent idle unit starvation and surface malfunctions.
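A heartbeat check can be sketched as a table of last-seen timestamps plus a timeout. The class and method names below are hypothetical, and the clock is injectable only so that the example runs without sleeping; a real monitor would also trigger recovery for the silent units, which is left out here.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per unit and reports units that went silent."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be exercised without sleeping
        self.last_seen = {}

    def beat(self, unit):
        """Record a heartbeat from the given unit."""
        self.last_seen[unit] = self.clock()

    def silent_units(self):
        """Return (sorted) units whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(u for u, t in self.last_seen.items() if now - t > self.timeout_s)


# Usage with a fake clock: node-a stops beating, node-b keeps going.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=5.0, clock=lambda: fake_now[0])
monitor.beat("node-a")
monitor.beat("node-b")
fake_now[0] = 4.0
monitor.beat("node-b")
fake_now[0] = 8.0
print(monitor.silent_units())  # ['node-a']
```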
  35. [Diagram: Endpoint pipeline with Encoders, Netty Writes, Netty Reads and Decoders; an event is fired on idle when there is no traffic]
  36. Riding over Transients is used to defer error recovery if the error is temporary. “‘Patience is a virtue’ to allow the true signature of an error to show itself.” - Robert S. Hanmer
  37. The Leaky Bucket
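A leaky bucket counter is one way to ride over transients: each error adds to the bucket, the bucket drains at a steady rate, and only an overflow is treated as a real problem. The sketch below is an illustration under assumed names and parameters (capacity, leak rate, explicit timestamps), not the implementation shown in the talk.

```python
class LeakyBucket:
    """Error counter that drains over time; overflow signals a non-transient error."""

    def __init__(self, capacity, leak_per_second):
        self.capacity = capacity
        self.leak_per_second = leak_per_second
        self.level = 0.0
        self.last_update = None

    def record_error(self, now):
        """Add one error at time `now` (seconds); return True if the bucket overflows."""
        if self.last_update is not None:
            leaked = (now - self.last_update) * self.leak_per_second
            self.level = max(0.0, self.level - leaked)
        self.last_update = now
        self.level += 1.0
        return self.level > self.capacity


bucket = LeakyBucket(capacity=3, leak_per_second=1.0)

# Occasional errors drain away before the next one arrives.
sparse = [bucket.record_error(t) for t in (0.0, 2.0, 4.0)]
print(sparse)  # [False, False, False]

# A burst fills the bucket faster than it leaks.
burst = [bucket.record_error(t) for t in (10.0, 10.1, 10.2, 10.3)]
print(burst)  # [False, False, False, True]
```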

  38. And more! • Complete Parameter Checking • Watchdogs • Voting • Checksums • Routine Audits
  39. Recovery and Mitigation of Errors

  40. Timeout: do not wait forever while holding on to the resource.
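A timeout can be sketched with the standard library by running the call in a worker thread and bounding how long the caller waits for the result. The helper name and the fallback parameter are assumptions made for illustration. Note that this only frees the caller: the worker thread is not interrupted and keeps running until the slow call returns.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread; return fallback if no result arrives in time."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block on the worker; it finishes (or not) in the background.
        pool.shutdown(wait=False)


def slow_service():
    time.sleep(0.3)  # stands in for a hanging remote call
    return "late answer"


print(call_with_timeout(lambda: "quick answer", timeout_s=1.0, fallback="fallback"))
print(call_with_timeout(slow_service, timeout_s=0.05, fallback="fallback"))
```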
  41. Failover to a redundant unit when the error has been detected and isolated. [Redundancy reminder chart: Active/Active, Active/Standby, N+M; axes: Cost, Time To Recover]
  42. Intelligent Retries. [Chart: Time between Retries against Number of Attempts for Fixed, Linear and Exponential strategies]
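The three retry strategies differ only in how the wait grows with the attempt number. The sketch below computes the delay schedule for each and applies it in a small retry loop; the function names, the base delay, and the injectable `sleep` are illustrative assumptions.

```python
import time


def retry_delays(strategy, base_s, attempts):
    """Delay before each retry: fixed, linear, or exponential growth."""
    if strategy == "fixed":
        return [base_s for _ in range(attempts)]
    if strategy == "linear":
        return [base_s * n for n in range(1, attempts + 1)]
    if strategy == "exponential":
        return [base_s * 2 ** (n - 1) for n in range(1, attempts + 1)]
    raise ValueError(f"unknown strategy: {strategy}")


def retry(operation, delays, sleep=time.sleep):
    """Call operation; after each failure wait for the next delay and try again."""
    for delay in delays:
        try:
            return operation()
        except Exception:
            sleep(delay)
    return operation()  # final attempt; its exception propagates to the caller


print(retry_delays("fixed", 1.0, 4))        # [1.0, 1.0, 1.0, 1.0]
print(retry_delays("linear", 1.0, 4))       # [1.0, 2.0, 3.0, 4.0]
print(retry_delays("exponential", 1.0, 4))  # [1.0, 2.0, 4.0, 8.0]
```

In practice the retry loop would catch only errors known to be retryable and often adds random jitter to the delays so that many clients do not retry in lockstep.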
  43. Restart can be used as a last resort, at the cost of losing state and time.
  44. Fail Fast at the boundary to shed load and give a partially great service rather than a completely bad one.
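One common way to fail fast at a boundary is a bounded inbox that rejects new work immediately when it is full, instead of letting callers queue up and wait. The class and exception names below are hypothetical; the sketch only shows the shedding decision, not the processing of accepted requests.

```python
import queue


class Overloaded(Exception):
    """Raised when a request is rejected to shed load."""


class BoundedInbox:
    """Accepts requests up to a fixed limit and rejects the rest immediately."""

    def __init__(self, max_pending):
        self._queue = queue.Queue(maxsize=max_pending)

    def submit(self, request):
        try:
            self._queue.put_nowait(request)  # never blocks the caller
        except queue.Full:
            raise Overloaded("inbox full, request rejected") from None

    def take(self):
        """Remove and return the oldest pending request."""
        return self._queue.get_nowait()


inbox = BoundedInbox(max_pending=2)
inbox.submit("request-1")
inbox.submit("request-2")
try:
    inbox.submit("request-3")
except Overloaded as exc:
    print(f"shed: {exc}")
```

Requests that were accepted still get full service; only the excess is turned away quickly, which keeps latency bounded for everyone else.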
  45. Backpressure & Batching!

  46. Case Study: Hystrix https://raw.githubusercontent.com/wiki/Netflix/Hystrix/images/hystrix-flow-chart-original.png
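The circuit breaker is one of the building blocks in the Hystrix flow. The sketch below is a deliberately simplified illustration of the idea and not Hystrix's API or algorithm: it trips on a count of consecutive failures (an assumption made for brevity), serves a fallback while open, and lets a single trial call through after a cool-down. All names and parameters are hypothetical.

```python
import time


class CircuitBreaker:
    """Simplified circuit breaker: closed, open, and half-open after a cool-down."""

    def __init__(self, failure_threshold, reset_after_s, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so the example runs without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: fail fast, do not touch the dependency
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-open after a failed trial
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the circuit again
        return result


now = [0.0]
attempts = []


def failing():
    attempts.append(1)
    raise RuntimeError("dependency down")


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0, clock=lambda: now[0])
for _ in range(3):
    breaker.call(failing, fallback=lambda: "fallback")
print(len(attempts))  # 2: the third call was short-circuited

now[0] = 11.0
print(breaker.call(lambda: "ok", fallback=lambda: "fallback"))  # ok
```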

  47. And more! Recovery: • Rollback • Roll-Forward • Checkpoints • Data Reset. Mitigation: • Bounded Queuing • Expansive Controls • Marking Data • Error Correcting Codes
  49. Recommended Reading

  50. Patterns for Fault-Tolerant Software by Robert S. Hanmer

  51. Release It! by Michael T. Nygard

  52. Announcement: Couchbase Server 4.0 Developer Preview! http://blog.couchbase.com/introducing-developer-preview-for-couchbase-server-4.0

  53. Any Questions?

  54. Thank you! Twitter: @daschl, email: michael.nitschinger@couchbase.com