ITT 2019 - Nic Jackson - From the first photocopy to modern failure detection in distributed systems

Distributed systems are not a new problem: for as long as there have been n+1 computers on a network, the problems of managing group membership and detecting failure have existed. Many of the algorithms we use to solve them in today's systems are over 30 years old. In this talk we will look at how an algorithm for email replication, designed by a photocopier company, has morphed into SWIM, which is used to manage group membership and failure detection in modern distributed systems.

Takeaways:

- Introduction to Gossip and epidemic rumor spreading

- Deep dive into SWIM which uses Gossip for failure detection in distributed systems

- Investigation of Lifeguard which builds on SWIM adding many improvements

Istanbul Tech Talks

April 02, 2019

Transcript

  1. © 2019 HashiCorp 1
    From the first
    photocopy, to modern
    failure detection in
    distributed systems
    Nic Jackson

  2. © 2019 HashiCorp 2
    What do photocopies have to do with
    anything?

  3. © 2019 HashiCorp
    Epidemic Algorithms for Replicated Database Maintenance 3

  4. © 2019 HashiCorp 4
    Methods analysed
    ● Direct mail
    ● Anti-entropy
    ● Rumor mongering (Gossip)

  5. © 2019 HashiCorp
    SWIM - Scalable Weakly
    Consistent
    Infection-style Process
    Group Membership
    5

  6. © 2019 HashiCorp 6
    Gossip

  7. © 2019 HashiCorp 7
    What is Gossip?

  8. © 2019 HashiCorp 8
    Epidemic propagation (Gossip)
    ● Worked well for 3 machines in 1989
    ● Still working for 55 machines in 2002
    ● Not beaten yet at 5000 machines in 2019

  9. © 2019 HashiCorp 9
    How Gossip works
    ● All nodes in a cluster operate in synchronous rounds
    ● It is assumed that each node knows the other nodes
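The round structure is easy to see in a small simulation. Below is a minimal Go sketch (not from the talk; node counts and names are illustrative): in each synchronous round, every node that already has the rumor tells one peer chosen uniformly at random, and we count rounds until everyone has heard it.

```go
// Minimal sketch of synchronous push-gossip rounds. Each round, every
// informed node tells one peer picked uniformly at random (it may pick
// itself or an already-informed peer; that is fine for a sketch).
package main

import (
	"fmt"
	"math/rand"
)

func roundsToInformAll(n int) int {
	informed := make([]bool, n)
	informed[0] = true // node 0 starts the rumor
	count, rounds := 1, 0

	for count < n {
		rounds++
		// count who is informed at the start of the round; only they speak this round
		tellers := 0
		for _, ok := range informed {
			if ok {
				tellers++
			}
		}
		for i := 0; i < tellers; i++ {
			target := rand.Intn(n)
			if !informed[target] {
				informed[target] = true
				count++
			}
		}
	}
	return rounds
}

func main() {
	fmt.Println("8 nodes:", roundsToInformAll(8), "rounds")
	fmt.Println("5000 nodes:", roundsToInformAll(5000), "rounds")
}
```

With random targets the spread is a little slower than perfect doubling, but it still completes in O(log n) rounds, which is the point of the next slides.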

  10. © 2019 HashiCorp 10
    How Gossip works - Round 1

  11. © 2019 HashiCorp 11
    How Gossip works - Round 2

  12. © 2019 HashiCorp 12
    How Gossip works - Round 3

  13. © 2019 HashiCorp 13
    How Gossip works - Round 3

  14. © 2019 HashiCorp 14
    How Gossip works
    Number of rounds required to spread a rumor =
    O(log n)

  15. © 2019 HashiCorp 15
    How Gossip works
    ● log(8) / log(2) = 3
    ● 0.903089986992 / 0.301029995664 = 3
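The same arithmetic as a small helper, assuming the informed set roughly doubles each round (each informed node reaches one new peer), hence the base-2 logarithm:

```go
// Expected rounds to reach n nodes when the informed set roughly
// doubles each round: log2(n), rounded up.
package main

import (
	"fmt"
	"math"
)

func expectedRounds(n float64) float64 {
	return math.Ceil(math.Log(n) / math.Log(2)) // log(8) / log(2) = 3
}

func main() {
	fmt.Println(expectedRounds(8))    // 3
	fmt.Println(expectedRounds(5000)) // 13
}
```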

  16. © 2019 HashiCorp 16
    Demo

  17. © 2019 HashiCorp
    Demo - Rules
    17
     1. When you receive the rumor, select a random person nearby (behind, in front, left, right) and tell them the rumor
     2. Once you have passed on the rumor, repeat step 1
     3. If the person you tell the rumor to has already heard it, stop and raise your hand
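The demo is rumor mongering with feedback: carriers keep forwarding until they hit someone who has already heard the rumor. A rough Go sketch of the rules (simplified in one respect: "a random person nearby" becomes "any random person in the room"):

```go
// Sketch of the demo rules. Each carrier forwards the rumor once per
// round; the first time a carrier picks someone who already heard it,
// that carrier stops and "raises a hand" (rule 3).
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const n = 30
	heard := make([]bool, n)
	heard[0] = true
	active := []int{0} // people currently spreading the rumor
	handsRaised := 0

	for len(active) > 0 {
		var next []int
		for _, p := range active {
			target := rand.Intn(n)
			if heard[target] {
				handsRaised++ // rule 3: they already heard it, stop
				continue
			}
			heard[target] = true           // rule 1: tell a random person
			next = append(next, p, target) // rule 2: both keep spreading
		}
		active = next
	}

	informed := 0
	for _, h := range heard {
		if h {
			informed++
		}
	}
	fmt.Printf("%d of %d people heard the rumor, %d hands raised\n", informed, n, handsRaised)
}
```

As in the Epidemic Algorithms paper, this feedback variant can die out before everyone has heard the rumor, which is one reason the paper pairs rumor mongering with anti-entropy as a backstop.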

  18. © 2019 HashiCorp 18
    SWIM - Scalable Weakly
    Consistent Infection-style
    Process Group Membership

  19. © 2019 HashiCorp 19
    SWIM - Scalable Weakly Consistent
    Infection-style Process Group
    Membership

  20. © 2019 HashiCorp 20
    Why was SWIM needed?
    ● Node failure needs to be understood to maintain a healthy group
    ● Traditional heartbeating does not work because:
     ○ network load that grows quadratically with group size
     ○ compromised response times
     ○ frequent false positives when detecting process crashes

  21. © 2019 HashiCorp 21
    Why was SWIM needed?
     ● network load that grows quadratically with group size
     ● compromised response times
     ● frequent false positives when detecting process crashes

  22. © 2019 HashiCorp 22
     Problems with traditional heartbeating
     ● Sending all heartbeats to a central server can cause overload
     ● Sending heartbeats to all members, whether through Gossip or multicast, leads
     to high network load that grows quadratically with group size, O(n²),
     e.g. 4, 9, 16, 25, 36
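A back-of-the-envelope illustration of that growth (numbers only, nothing protocol-specific):

```go
// All-to-all heartbeating costs roughly n*(n-1) ≈ n² messages per
// round, which is where the 4, 9, 16, 25, 36 sequence on the slide
// comes from.
package main

import "fmt"

func main() {
	for n := 2; n <= 6; n++ {
		fmt.Printf("n=%d: ~%d messages per round\n", n, n*n)
	}
}
```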

  23. © 2019 HashiCorp 23
    Properties of SWIM
     ● Constant message load per group member
    ● Process failure detected in constant time
    ● Infection-style (Gossip) process for membership updates

  24. © 2019 HashiCorp 24
    Failure detection with SWIM
    ● Distributed failure detection
    ● No single point of failure
    ● Every node probes another node at random
    ● With constant probability, every node is probed
     ● Once a failure is confirmed, nodes gossip this failure
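Putting those bullets together, one SWIM protocol period looks roughly like the Go sketch below. The types and helpers (ping, pingReq, gossipSuspect) are hypothetical stand-ins for the real network calls, shown only to outline the control flow.

```go
// Condensed sketch of one SWIM protocol period for a single member.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

type Node struct{ Name string }

// ping and pingReq stand in for real network calls; here they just
// pretend the target answers most of the time.
func ping(target Node, timeout time.Duration) bool         { return rand.Intn(10) > 0 }
func pingReq(via, target Node, timeout time.Duration) bool { return rand.Intn(10) > 0 }
func gossipSuspect(target Node)                            { fmt.Println("suspect:", target.Name) }

// probeOnce runs one protocol period: a direct probe, then k indirect
// probes through random members, and finally a suspicion broadcast.
func probeOnce(members []Node, k int, timeout time.Duration) {
	target := members[rand.Intn(len(members))] // pick a member at random

	if ping(target, timeout) {
		return // got an ack, the target is alive
	}

	// ask k other random members to probe the target on our behalf
	for i := 0; i < k; i++ {
		via := members[rand.Intn(len(members))]
		if pingReq(via, target, timeout) {
			return // an indirect ack still counts as alive
		}
	}

	// no direct or indirect ack: suspect the target and gossip that fact
	gossipSuspect(target)
}

func main() {
	members := []Node{{"a"}, {"b"}, {"c"}, {"d"}}
	probeOnce(members, 3, 500*time.Millisecond)
}
```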

  25. © 2019 HashiCorp 25
    Failure detection with SWIM

  26. © 2019 HashiCorp 26
    Failure detection with SWIM

  27. © 2019 HashiCorp 27
    Failure detection with SWIM

  28. © 2019 HashiCorp 28
    Failure detection with SWIM

  29. © 2019 HashiCorp 29
    Failure detection with SWIM

  30. © 2019 HashiCorp 30
    Why was SWIM needed?
     ● network load that grows quadratically with group size
     ● compromised response times
     ● frequent false positives when detecting process crashes

  31. © 2019 HashiCorp 31
    Gray failures

  32. © 2019 HashiCorp 32
    Problems with failure detection in SWIM
     ● Much of the protocol assumes a fail-stop failure model rather than Byzantine
     failure
     ● This means that the process under suspicion might just be running slow, or it
     might be suffering a temporary failure
     ● It also means that the probing node could be spreading false rumors, like the
     traitor in the Byzantine generals problem; it might itself be the underlying
     problem

  33. © 2019 HashiCorp 33
    Byzantine failure
     ● Could be a flaky node
    ● Temporary network problem

  34. © 2019 HashiCorp 34
    Byzantine failure
    ● Two armies A and B
    ● Need to decide to attack or retreat
    ● If they both agree on an approach they win
    ● If they disagree then they lose

  35. © 2019 HashiCorp 35
    Byzantine failure

  36. © 2019 HashiCorp 36
    Gray Failure: The Achilles' Heel of
    Cloud-Scale Systems
     ● Paper from Microsoft Research
    ● "the major availability breakdowns and performance anomalies we see in
    cloud environments tend to be caused by subtle underlying faults, i.e. gray
    failures rather than fail-stop failure"

  37. © 2019 HashiCorp 37
    Gray Failure: The Achilles' Heel of
    Cloud-Scale Systems
    ● Performance degradation
    ● Random packet loss
    ● Flaky I/O
    ● Memory pressure
    ● Non-fatal exceptions

  38. © 2019 HashiCorp 38
    Gray Failure: Byzantine fault tolerance

  39. © 2019 HashiCorp 39
    Gray Failure: Byzantine fault tolerance
    ● Complex to implement
    ● High network overhead
    ● Not proven in production

  40. © 2019 HashiCorp 40
    Handling gray failures with
    SWIM

  41. © 2019 HashiCorp 41
    Gray Failure: SWIM, suspicion

  42. © 2019 HashiCorp 42
    Gray Failure: SWIM, suspicion

  43. © 2019 HashiCorp 43
    SWIM at scale
     ● SWIM was only tested on 55 nodes
     ● When the probing node raises a suspicion message but is itself running slow, it
     might not receive the refutation message

  44. © 2019 HashiCorp 44
    Lifeguard

  45. © 2019 HashiCorp 45
    Lifeguard - beyond SWIM
    ● Dynamic fault detector timeouts - "Self Awareness"
    ● Dynamic suspicion timeouts - "Dogpile"
    ● Refutation timeouts - "Buddy System"

  46. © 2019 HashiCorp 46
    Lifeguard - Dynamic fault detection
     ● When no acks are received from probes, assume the fault may be with the probing node
     ● Timeouts are relaxed:
     ○ Probe interval modified
     ○ Probe timeout modified
     ○ Node Self-Awareness counter (NSA)
     ● Implement NACKs when asking other nodes to confirm a failure
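The NACK part matters because it lets the prober distinguish "the target is unresponsive" from "my own messages are going nowhere". A small hedged sketch of how the three possible outcomes of an indirect probe could be interpreted (the names and structure here are illustrative, not memberlist's code):

```go
// Sketch: an intermediary replies with a nack when it is alive but
// cannot reach the target; total silence suggests the prober itself
// may be the unhealthy one.
package main

import "fmt"

type result int

const (
	ack     result = iota // intermediary reached the target
	nack                  // intermediary is alive, but the target did not answer
	silence               // nothing came back at all
)

// interpret follows the slide's scoring: silence counts against the
// prober's own health (NSA +1), a nack only counts against the target.
func interpret(r result) string {
	switch r {
	case ack:
		return "target alive"
	case nack:
		return "target still unresponsive; my own connectivity looks fine"
	default:
		return "no reply at all; suspect my own node is unhealthy (NSA +1)"
	}
}

func main() {
	fmt.Println(interpret(nack))
}
```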

  47. © 2019 HashiCorp 47
    Lifeguard - Dynamic fault detection

  48. © 2019 HashiCorp 48
    Lifeguard - Dynamic fault detection

  49. © 2019 HashiCorp 49
    Lifeguard - Dynamic fault detection
     ● Failed probe (no ack): +1
     ● Probe with missed NACK: +1
     ● Refuting a suspicion about self: +1
     ● Successful probe (ack received): -1
     ● Max NSA = 8
     ProbeTimeout = BaseTimeout * (NSA + 1)
     ProbeInterval = BaseInterval * (NSA + 1)
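A hedged Go sketch of that counter (the type and method names are made up; the Lifeguard paper calls this score the Local Health Multiplier, and the scoring below just follows the slide):

```go
// Node Self-Awareness (NSA): a saturating 0..8 score that stretches the
// probe interval and probe timeout when the local node looks unhealthy.
package main

import (
	"fmt"
	"time"
)

type SelfAwareness struct{ score int }

func (s *SelfAwareness) bump(delta int) {
	s.score += delta
	if s.score < 0 {
		s.score = 0
	}
	if s.score > 8 { // Max NSA = 8
		s.score = 8
	}
}

// Events from the slide and their adjustments.
func (s *SelfAwareness) FailedProbe()      { s.bump(+1) } // no ack received
func (s *SelfAwareness) MissedNack()       { s.bump(+1) } // helper never replied at all
func (s *SelfAwareness) RefutedSuspicion() { s.bump(+1) } // had to refute a rumor about ourselves
func (s *SelfAwareness) SuccessfulProbe()  { s.bump(-1) } // got an ack

func (s *SelfAwareness) ProbeTimeout(base time.Duration) time.Duration {
	return base * time.Duration(s.score+1) // BaseTimeout * (NSA + 1)
}

func (s *SelfAwareness) ProbeInterval(base time.Duration) time.Duration {
	return base * time.Duration(s.score+1) // BaseInterval * (NSA + 1)
}

func main() {
	var nsa SelfAwareness
	nsa.FailedProbe()
	nsa.FailedProbe()
	fmt.Println(nsa.ProbeInterval(1 * time.Second)) // 3s when NSA = 2
}
```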

  50. © 2019 HashiCorp 50
    Lifeguard - Dynamic suspicion timeouts
    ● Start with a high suspicion timeout
    ● When suspicion messages are received from other nodes regarding a
    node we already suspect, reduce suspicion timeout
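Lifeguard shortens the suspicion timeout as independent confirmations arrive, on a roughly logarithmic curve. A sketch of that idea follows; the min/max values and the exact interpolation are illustrative assumptions, not the paper's constants.

```go
// Sketch: start with a long suspicion timeout and shorten it as more
// independent suspicion messages about the same node arrive. The
// logarithmic decay mirrors Lifeguard's approach; the constants are
// made up for illustration.
package main

import (
	"fmt"
	"math"
	"time"
)

// suspicionTimeout interpolates between maxTimeout (no confirmations)
// and minTimeout (k or more independent confirmations).
func suspicionTimeout(minTimeout, maxTimeout time.Duration, confirmations, k int) time.Duration {
	if confirmations >= k {
		return minTimeout
	}
	frac := math.Log(float64(confirmations)+1) / math.Log(float64(k)+1)
	return maxTimeout - time.Duration(frac*float64(maxTimeout-minTimeout))
}

func main() {
	for c := 0; c <= 3; c++ {
		fmt.Printf("%d confirmations -> %v\n", c, suspicionTimeout(3*time.Second, 30*time.Second, c, 3))
	}
}
```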

  51. © 2019 HashiCorp 51
    Lifeguard - Dynamic suspicion timeouts

  52. © 2019 HashiCorp 52
    Lifeguard - More timely refutation
     ● If you are probing a node that you suspect
     ● Let that node know
     ● Suspicion messages are piggybacked on probes

  53. © 2019 HashiCorp 53
    Lifeguard - Results
    ● 7% increase in latency under normal conditions
    ● 12% increase in messages
     ● 8% total reduction in network traffic due to piggybacked messages
    ● 98% reduction in false positives

  54. © 2019 HashiCorp
    Memberlist
    54
    https://github.com/hashicorp/memberlist
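memberlist is HashiCorp's Go library implementing SWIM with the Lifeguard extensions. A minimal join-and-list example, loosely following the project README (the peer address is a placeholder):

```go
// Minimal sketch of using hashicorp/memberlist; "10.0.0.1" stands in
// for an existing cluster member.
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/memberlist"
)

func main() {
	// Create a member with sensible defaults for a local network.
	list, err := memberlist.Create(memberlist.DefaultLANConfig())
	if err != nil {
		log.Fatalf("failed to create memberlist: %v", err)
	}

	// Join an existing cluster by contacting at least one known member.
	if _, err := list.Join([]string{"10.0.0.1"}); err != nil {
		log.Fatalf("failed to join cluster: %v", err)
	}

	// Membership and failure detection now run in the background (SWIM +
	// Lifeguard); ask for the current view of the cluster at any time.
	for _, member := range list.Members() {
		fmt.Printf("member: %s %s\n", member.Name, member.Addr)
	}
}
```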

  55. © 2019 HashiCorp 55
    Lifeguard - References
    Epidemic Algorithms
     http://bitsavers.trailing-edge.com/pdf/xerox/parc/techReports/CSL-89-1_Epidemic_Algorithms_for_Replicated_Database_Maintenance.pdf
    SWIM
    https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf
    Gray Failures
    https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf
    Byzantine Generals Problem
    https://people.eecs.berkeley.edu/~luca/cs174/byzantine.pdf
    Lifeguard
    https://arxiv.org/pdf/1707.00788.pdf

  56. © 2019 HashiCorp
    Lifeguard - Summary
    56
    “a simple protocol designed by a photocopier
    company for email replication still powers many
    of our distributed systems 30 years later”

  57. Thank you
    @sheriffjackson
    [email protected]
    57
