Kiran Bhattaram on Failure Detectors

The problem of consensus is central to many distributed systems algorithms, and failure detectors are central to the way we think about consensus algorithms. In a fully asynchronous system, the FLP impossibility result (https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf) shows that no deterministic consensus algorithm can tolerate even a single crash failure! This simple, stunning result imposed a hard constraint on what could be solved in an asynchronous model.

The FLP (http://the-paper-trail.org/blog/a-brief-tour-of-flp-impossibility/) result kicked off a flurry of research into ways to circumvent the impossibility result. Failure detectors were the most compelling abstraction proposed. These augmented the asynchronous model just enough to allow consensus, while retaining most of the neat abstractions that make asynchronous systems simple to reason about.

In this talk, I'll discuss some of the history and background of Chandra and Toueg's failure detector proposal (http://courses.csail.mit.edu/6.852/08/papers/CT96-JACM.pdf), and cover some of the failure detector mechanisms that followed the paper.

Papers_We_Love

March 27, 2017

Transcript

  1. Failure Detectors. Papers We Love NYC

  2. Kiran Bhattaram, @kiranb

  3. Why? Failure detectors are pervasive. Failure detectors abstract complexity.

  4. Timeline: Background, The Paper, Expanding Scope, Examples
  9. Background (part 1): history, system models, consensus, impossibility

  10. System Models: a set of assumptions about the system. - how long do operations take? - is message delivery reliable? - what kind of crashes happen?

  11. The Synchronous System Model: upper bound on processing time, upper bound on message delivery delay, reliable delivery, fail-stop crashes

  12. The Asynchronous System Model: unbounded processing time, unbounded message delivery delay, reliable delivery, fail-stop crashes

  13. Problems: Consensus [diagram: nodes A, B and C, each showing the value 8]

  14. Consensus. Termination: the processing will eventually conclude. Agreement: everyone will agree on the same thing. Validity: some node will have proposed the agreed-upon value.
  21. Consensus in Synchronous Systems: use timeouts to determine whether a process has crashed: t > (processing time bound + message delay time bound) => perfect failure detectors
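The timeout rule on this slide can be sketched in a few lines. This is only an illustration under the synchronous model's assumptions; the constants and the name `has_crashed` are made up for the example and do not come from the talk.

```python
# Hypothetical bounds for a synchronous system, in seconds.
PROCESSING_BOUND = 0.5      # assumed upper bound on processing time
MESSAGE_DELAY_BOUND = 1.5   # assumed upper bound on message delivery delay


def has_crashed(waited: float) -> bool:
    """Apply the slide's rule: t > processing bound + message delay bound.

    `waited` is how long we have been waiting for a message that a live
    process would have sent. With hard bounds, a longer silence can only
    mean a crash, which is why this detector never makes a mistake.
    """
    return waited > PROCESSING_BOUND + MESSAGE_DELAY_BOUND
```

In an asynchronous system neither bound exists, so no value of the timeout makes this rule safe.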
  23. Consensus in Asynchronous Systems: FLP! Even if only one process can crash. Even with reliable delivery.

  24. Wait, what? But I use consensus systems all the time! Any fault-tolerant algorithm solving consensus has runs that never terminate, but these runs may have very small probabilities [Ben-Or] (weakens termination!).

  27. “consensus is impossible” => “consensus is not always possible”

  28. What Now? or, Keep Calm and Consensus On; or, Keep Augmenting the System Model
  30. The Paper (part 2): oracles, classification, solving consensus

  31. When do you stop waiting?

  32. The Failure Detector Model: an oracle that guesses at which processes are still alive. - might be incorrect! - might be different for different processes! - might be flappy!
  33. Evaluating Failure Detectors. Completeness: no false negatives (crashed processes get suspected). Accuracy: no false positives (live processes do not get suspected). [diagram: nodes A, B, C, D]
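  34. Failure detector classes, by completeness and accuracy:
    Strong completeness: Perfect P (strong accuracy), Strong S (weak accuracy), Eventually Perfect ◇P (eventually strong accuracy), Eventually Strong ◇S (eventually weak accuracy)
    Weak completeness: Q (strong accuracy), Weak W (weak accuracy), ◇Q (eventually strong accuracy), Eventually Weak ◇W (eventually weak accuracy)

  35. Completeness: Strong, Weak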

  36. Weak Completeness: every node that has crashed is permanently suspected by at least one alive node. [diagram: nodes A, B, C, D; speech bubbles: "D has died!", "A has died!"]

  40. Strong Completeness: eventually every process that crashes is permanently suspected by every correct process. [diagram: nodes A, B, C, D; speech bubbles: "A & D!", "A & D!"]
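The two completeness properties can be checked mechanically on a snapshot of suspect lists. The sketch below assumes the lists have already stabilized (the "permanently" part of the definitions); the function names and the dictionary layout are illustrative choices, not something from the talk or the paper.

```python
from typing import Dict, Set


def weak_completeness(suspects: Dict[str, Set[str]], crashed: Set[str]) -> bool:
    """Every crashed process is suspected by at least one correct process."""
    correct = set(suspects) - crashed
    return all(any(c in suspects[p] for p in correct) for c in crashed)


def strong_completeness(suspects: Dict[str, Set[str]], crashed: Set[str]) -> bool:
    """Every crashed process is suspected by every correct process."""
    correct = set(suspects) - crashed
    return all(c in suspects[p] for p in correct for c in crashed)


# Mirrors the slides: A and D have crashed, B and C are still alive.
crashed = {"A", "D"}
weak_only = {"A": set(), "B": {"D"}, "C": {"A"}, "D": set()}
everyone_knows = {"A": set(), "B": {"A", "D"}, "C": {"A", "D"}, "D": set()}
```

With `weak_only`, each crash is noticed by someone but not by everyone, so only the weak property holds; with `everyone_knows`, both hold.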
  45. With strong completeness: Perfect P, Strong S, Eventually Perfect ◇P, Eventually Strong ◇S

  46. Accuracy: Strong, Weak, Eventually Strong, Eventually Weak

  47. Perfect Accuracy: no process is suspected before it crashes. [diagram: nodes A, B, C, D; speech bubble: "C has died!"]

  50. Weak Accuracy: at least one correct process is never suspected. [diagram: nodes A, B, C, D; speech bubbles: "C & D have died!", "B has died!", "B & C have died!"]
  54. Accuracy: Strong, Weak, Eventually Strong, Eventually Weak

  55. Eventually Strong Accuracy: eventually NO correct process is suspected by any correct process. [diagram: nodes A, B, C, D; speech bubbles: "C has died!" and "B & C have died!", the latter later corrected to "C has died!"]

  60. Eventually Weak Accuracy: eventually SOME correct process is not suspected by any correct process. [diagram: nodes A, B, C, D; the suspicion lists start as "A, C & D", "A & B", "B & C", "B, C & D" and settle at "B & C", "B", "C", "B, C & D"]
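The difference between weak accuracy and its "eventually" variant shows up once suspect lists are viewed over time. The sketch below is an illustration on a finite trace, where "eventually" is approximated as "from some snapshot until the end of the trace"; the trace format and all names are assumptions made for the example.

```python
from typing import Dict, List, Optional, Set

Snapshot = Dict[str, Set[str]]  # correct process -> processes it suspects


def unsuspected_since(trace: List[Snapshot], correct: Set[str], q: str) -> Optional[int]:
    """Index from which no correct process suspects q again, or None."""
    start = None
    for i, snapshot in enumerate(trace):
        if any(q in snapshot[p] for p in correct):
            start = None          # q was suspected, so reset
        elif start is None:
            start = i             # beginning of an unsuspected run
    return start


def weak_accuracy(trace: List[Snapshot], correct: Set[str]) -> bool:
    """Some correct process is never suspected at any point in the trace."""
    return any(unsuspected_since(trace, correct, q) == 0 for q in correct)


def eventually_weak_accuracy(trace: List[Snapshot], correct: Set[str]) -> bool:
    """Some correct process stops being suspected at some point."""
    return any(unsuspected_since(trace, correct, q) is not None for q in correct)


# Hypothetical run: A and B are correct, C and D have crashed.
correct = {"A", "B"}
trace = [
    {"A": {"B", "C"}, "B": {"A", "D"}},
    {"A": {"C", "D"}, "B": {"A", "C", "D"}},
    {"A": {"C", "D"}, "B": {"C", "D"}},
]
```

In this run both correct processes are wrongly suspected early on, so weak accuracy fails, but the mistakes are later retracted, so eventually weak accuracy holds.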
  67. With strong completeness: Perfect P, Strong S, Eventually Perfect ◇P, Eventually Strong ◇S

  68. Consensus: ◇S. Initial arbitrary information, BUT: eventually every process that crashes is permanently suspected by every correct process, and eventually SOME correct process is not suspected by any correct process. Solvable as long as fewer than n/2 processes fail!

  69. Consensus: ◇S. Mickens, The Saddest Moment.

  70. Consensus: ◇S. Choose leader by c = (r mod n) + 1. Phase 1: gather proposals. Move on when majority proposes. [diagram: nodes A to G; speech bubble: "a proposal!"]
  73. Consensus: ◇S. Phase 2: send proposal. No waiting! [diagram: nodes A to G; speech bubble: "a proposal!"]

  75. Consensus: ◇S. Phase 3: gather votes. Move on when majority votes, OR cancel if all nodes realize B is down. [diagram: nodes A to G; speech bubbles: "sgtm!", "I think you may have died!"]

  81. Consensus: ◇S. Phase 4: decision. No waiting, all done! [diagram: nodes A to G; speech bubble: "that’s okay, commit anyway!"]
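The arithmetic behind the rotating leader and the majority rule in these phases can be sketched as follows. This is a simplification, not the full algorithm from the paper: it only shows the slide's formula c = (r mod n) + 1, the size of a majority, and a decision rule that assumes the leader commits once a majority has acknowledged its proposal. The helper names are invented for the example.

```python
def coordinator(r: int, n: int) -> int:
    """Leader for round r among processes numbered 1..n (slide formula)."""
    return (r % n) + 1


def majority(n: int) -> int:
    """Smallest number of processes that forms a majority of n."""
    return n // 2 + 1


def max_failures(n: int) -> int:
    """Largest number of crashes that still leaves a majority alive."""
    return (n - 1) // 2


def coordinator_decides(acks: int, n: int) -> bool:
    """Simplified decision rule: commit once a majority has acknowledged."""
    return acks >= majority(n)


# Seven processes, as in the diagrams (A to G mapped to 1..7).
leaders = [coordinator(r, 7) for r in range(8)]
```

Because the leader rotates every round, a crashed or wrongly suspected leader only costs one round: once the detector stops suspecting some correct process, a round led by that process can gather a majority.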
  83. A very simple example of ◇S! (SWIM) [diagram: nodes A and B; speech bubbles: "I’M ALIVE", "A looks fishy.", then "oh hey, it’s back!"]
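The "looks fishy ... oh hey, it's back!" exchange can be sketched as a heartbeat detector that is allowed to be wrong and then corrects itself. This is not SWIM itself; it is a minimal timeout-based sketch in which a false suspicion is retracted and that peer's timeout is doubled. The class name, the method names and the doubling policy are all assumptions made for illustration.

```python
from typing import Dict, Set


class HeartbeatDetector:
    """Suspect silent peers; retract and back off when they reappear."""

    def __init__(self, initial_timeout: float = 1.0) -> None:
        self.initial_timeout = initial_timeout
        self.timeouts: Dict[str, float] = {}   # per-peer timeout, grows on mistakes
        self.last_seen: Dict[str, float] = {}  # peer -> time of last heartbeat
        self.suspected: Set[str] = set()

    def timeout_for(self, peer: str) -> float:
        return self.timeouts.get(peer, self.initial_timeout)

    def on_heartbeat(self, peer: str, now: float) -> None:
        self.last_seen[peer] = now
        if peer in self.suspected:
            # The suspicion was a mistake: retract it and wait longer next time.
            self.suspected.discard(peer)
            self.timeouts[peer] = self.timeout_for(peer) * 2

    def check(self, now: float) -> Set[str]:
        """Suspect every peer that has been silent longer than its timeout."""
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout_for(peer):
                self.suspected.add(peer)
        return set(self.suspected)
```

A crashed peer never sends another heartbeat, so it stays suspected; a slow but live peer gets a longer timeout after each mistake, so mistakes about it become rarer over time.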
  86. Expanding the Scope (part 4): New models, New problems

  87. As of 1996. Model: - asynchronous systems - fail-stop processes - no recovery - no message losses. Problems: - consensus - atomic broadcast

  88. Failure detector classes, by completeness and accuracy:
    Strong completeness: Perfect P (strong accuracy), Strong S (weak accuracy), Eventually Perfect ◇P (eventually strong accuracy), Eventually Strong ◇S (eventually weak accuracy)
    Weak completeness: Q (strong accuracy), Weak W (weak accuracy), ◇Q (eventually strong accuracy), Eventually Weak ◇W (eventually weak accuracy)
    Beyond the table: timeless, heartbeat, leader Ω

  89. And more! Non-blocking Atomic Commit: P? + ◇S. Quiescent Communication (FLL): heartbeats. HB-completeness: if p[j] crashes, HB_i[j] stops increasing. HB-accuracy: if p[j] is correct, HB_i[j] keeps increasing. Anonymously perfect: if a crash happens, the FD is informed.

  90. And even more! Other models: - Crashes & link failures (FLL) - Network partitioning - Crash/recovery. Other problems: - non-blocking atomic commit - group membership - leader election - k-set agreement - reliable communication

  91. Rephrasing problems: encapsulating complexity/hairy bits. [speech bubble: "A, B & C look sus"]
  92. Examples (part 3): Productionization, SWIM, Phi Accrual

  93. In production: completeness & accuracy, PLUS • network efficiency & message load • speed of first detection • speed of knowledge propagation • minimizing flappy alerts

  94. SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol

  95. SWIM, additional features: • network: constant message load/group member; propagate membership updates with gossip • time to detection: deterministic bound on failure detection latency • prevent flappy alerts: “suspect” nodes before declaring them dead

  96. SWIM, randomized pings: k random nodes
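A single SWIM-style probe round might look like the sketch below: ping the target directly, and if that fails, ask up to k randomly chosen members to ping it on our behalf before marking it as a suspect. The transport is abstracted into two caller-supplied functions, `ping` and `ping_req`, which are hypothetical stand-ins; real implementations send the indirect requests in parallel and wait on timeouts.

```python
import random
from typing import Callable, List, Optional


def probe(
    target: str,
    members: List[str],
    ping: Callable[[str], bool],
    ping_req: Callable[[str, str], bool],
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Return "alive" or "suspect" for `target` after one probe round.

    `members` lists the other group members (excluding the prober).
    `ping(m)` reports whether m answered a direct ping.
    `ping_req(helper, m)` reports whether helper got an answer from m.
    """
    rng = rng or random.Random()
    if ping(target):
        return "alive"
    helpers = [m for m in members if m != target]
    for helper in rng.sample(helpers, min(k, len(helpers))):
        if ping_req(helper, target):
            return "alive"
    # Nobody could reach the target: suspect it rather than declare it dead.
    return "suspect"
```

The indirect step is what keeps one congested link between two members from turning into a false alarm, and the "suspect" result leaves room for the target to refute it before being declared dead.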

  100. SWIM, gossip: "Hey, I suspect B is dead!" [diagram: group members, with node B labeled]
  106. Phi Accrual (φ)

  107. Model (φ): C is 25% likely to be down. [diagram: nodes A to G]

  108. Use cases: a job scheduler (φ). - at 25%, stop sending it new jobs. - at 50%, reschedule outstanding jobs on another node, and wait for recovery. - at 75%, ...
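An accrual detector outputs a suspicion level instead of a yes/no answer, and the application picks its own thresholds. The sketch below computes φ as -log10 of the probability that a heartbeat would still arrive after the current silence, assuming heartbeat gaps are roughly normally distributed. The scheduler function reuses the 25% and 50% thresholds from the slide; the "schedule normally" default, the function names, and treating the likelihood as a plain number in [0, 1] are assumptions for illustration.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import List


def phi(intervals: List[float], elapsed: float) -> float:
    """Suspicion level after `elapsed` seconds without a heartbeat.

    `intervals` holds recent gaps between heartbeats (at least two values).
    """
    dist = NormalDist(mean(intervals), max(stdev(intervals), 1e-6))
    p_later = 1.0 - dist.cdf(elapsed)          # chance a heartbeat is still coming
    return -math.log10(max(p_later, 1e-12))    # clamp to avoid log10(0)


def scheduler_action(likelihood_down: float) -> str:
    """Map a suspicion likelihood to the actions listed on the slide."""
    if likelihood_down >= 0.50:
        return "reschedule outstanding jobs on another node and wait for recovery"
    if likelihood_down >= 0.25:
        return "stop sending new jobs"
    return "schedule normally"  # assumed default, not from the slide


recent_gaps = [1.0, 1.1, 0.9, 1.0, 1.05]
```

φ grows as the silence gets longer than the usual gap, so different consumers of the same detector can react at different levels of suspicion.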
  109. Where to Go from Here or, a bibliography • FLP result • Chandra/Toueg • Reynal survey • SWIM • Phi Accrual Failure Detectors • Guerraoui et al. survey • Freiling et al. survey

  110. Conclusion! Background: history, system models, consensus, impossibility. The Paper: oracles, classification, solving consensus. Expanding Scope: New models, New problems. Examples: Productionization, SWIM, Phi Accrual.
  115. Thanks! @kiranb, kiranbot.com

  116. Appendix!

  117. Gossip: van Renesse et al.

  118. Gossip: ping one node
