ITT 2019 - Nic Jackson - From the first photocopy to modern failure detection in distributed systems

Distributed systems are not a new problem; for as long as there have been n+1 computers in a network, the problem of managing membership in a group and detecting failure has existed. Many of the algorithms we use in today's systems to solve this problem are over 30 years old, and in this talk we will look at how an algorithm for email replication, designed by a photocopier company, has morphed into SWIM, used to manage group membership and failure detection in modern distributed systems.

Takeaways:

- Introduction to Gossip and epidemic rumor spreading

- Deep dive into SWIM, which uses Gossip for failure detection in distributed systems

- Investigation of Lifeguard, which builds on SWIM and adds many improvements

Istanbul Tech Talks

April 02, 2019

Transcript

  1. © 2019 HashiCorp 1 From the first photocopy, to modern failure detection in distributed systems. Nic Jackson
  2. © 2019 HashiCorp 2 What do photocopies have to do with anything?
  3. © 2019 HashiCorp Epidemic Algorithms for Replicated Database Maintenance 3

  4. © 2019 HashiCorp 4 Methods analysed • Direct mail • Anti-entropy • Rumor mongering (Gossip)
  5. © 2019 HashiCorp SWIM - Scalable Weakly Consistent Infection-style Process Group Membership 5
  6. © 2019 HashiCorp 6 Gossip

  7. © 2019 HashiCorp 7 What is Gossip?

  8. © 2019 HashiCorp 8 Epidemic propagation (Gossip) • Worked well for 3 machines in 1989 • Still working for 55 machines in 2002 • Not beaten yet at 5000 machines in 2019
  9. © 2019 HashiCorp 9 How Gossip works • All nodes in a cluster operate in synchronous rounds • It is assumed that each node knows the other nodes
  10. © 2019 HashiCorp 10 How Gossip works - Round 1

  11. © 2019 HashiCorp 11 How Gossip works - Round 2

  12. © 2019 HashiCorp 12 How Gossip works - Round 3

  13. © 2019 HashiCorp 13 How Gossip works - Round 3

  14. © 2019 HashiCorp 14 How Gossip works • Number of rounds required to spread a rumor = O(log n)
  15. © 2019 HashiCorp 15 How Gossip works • log(8) / log(2) = 3 • 0.903089986992 / 0.301029995664 = 3
  16. © 2019 HashiCorp 16 Demo

  17. © 2019 HashiCorp Demo - Rules 17 1. When you receive the rumor, select a random person nearby (behind, in front, left, right) and tell them the rumor 2. Once you have passed the rumor on, repeat step 1 3. If the person you tell the rumor to has already heard it, stop and raise your hand
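As a minimal illustration of those rules (my own sketch, not from the talk; names like spreadRumor are made up), the Go program below simulates synchronous gossip rounds in which every informed node tells one random peer per round, and counts how many rounds it takes until everyone has heard the rumor:

```go
package main

import (
	"fmt"
	"math/rand"
)

// spreadRumor simulates synchronous gossip rounds: every node that has
// already heard the rumor tells one randomly chosen peer per round.
// It returns the number of rounds until all n nodes are informed.
func spreadRumor(n int) int {
	informed := make([]bool, n)
	informed[0] = true // node 0 starts the rumor
	count := 1
	rounds := 0

	for count < n {
		rounds++
		// Snapshot so nodes informed this round only gossip from the next round.
		snapshot := make([]bool, n)
		copy(snapshot, informed)

		for i := 0; i < n; i++ {
			if !snapshot[i] {
				continue
			}
			peer := rand.Intn(n) // may pick someone who already knows
			if !informed[peer] {
				informed[peer] = true
				count++
			}
		}
	}
	return rounds
}

func main() {
	for _, n := range []int{8, 64, 512} {
		fmt.Printf("n=%d nodes informed after %d rounds\n", n, spreadRumor(n))
	}
}
```

For 8 nodes this usually lands around the log(8) / log(2) = 3 rounds from the earlier slide, though random collisions (telling someone who has already heard) can add a round or two.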
  18. © 2019 HashiCorp 18 SWIM - Scalable Weakly Consistent Infection-style Process Group Membership
  19. © 2019 HashiCorp 19 SWIM - Scalable Weakly Consistent Infection-style Process Group Membership
  20. © 2019 HashiCorp 20 Why was SWIM needed? • Node failure needs to be detected to maintain a healthy group • Traditional heartbeating does not work because of: ◦ network load that grows quadratically with group size ◦ compromised response times ◦ a high false-positive frequency when detecting process crashes
  21. © 2019 HashiCorp 21 Why was SWIM needed? • network load that grows quadratically with group size • compromised response times • a high false-positive frequency when detecting process crashes
  22. © 2019 HashiCorp 22 Problems with traditional heartbeating • Sending all heartbeats to a central server can cause overload • Sending heartbeats to all members, whether through Gossip or multicast, leads to high network load that grows quadratically with group size, O(n²), e.g. 4, 9, 16, 25, 36
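To make the quadratic growth concrete (my own illustration, not a slide): with all-to-all heartbeating each of the n members pings the other n-1 every interval, so the per-interval message count is n*(n-1), which is on the order of the n² figures (4, 9, 16, 25, 36) quoted above:

```go
package main

import "fmt"

func main() {
	// All-to-all heartbeating: every member heartbeats every other member
	// each interval, so traffic grows quadratically with group size.
	for _, n := range []int{2, 3, 4, 5, 6, 100} {
		fmt.Printf("members=%3d heartbeats per interval=%d\n", n, n*(n-1))
	}
}
```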
  23. © 2019 HashiCorp 23 Properties of SWIM • Constant message load per group member • Process failure detected in constant time • Infection-style (Gossip) process for membership updates
  24. © 2019 HashiCorp 24 Failure detection with SWIM • Distributed failure detection • No single point of failure • Every node probes another node at random • With constant probability, every node is probed • Once a failure is confirmed, nodes gossip this failure
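A rough sketch of one SWIM protocol period, as I read it from the paper (the Transport interface and function names here are illustrative, not memberlist's real API): probe a random member directly, fall back to indirect probes through k other members, and only suspect the target and gossip that suspicion if no ack arrives by either path:

```go
package swim

import (
	"math/rand"
	"time"
)

// Transport abstracts the network; a real implementation would use UDP.
type Transport interface {
	Ping(target string, timeout time.Duration) bool         // direct probe, true if an ack arrived
	PingReq(via, target string, timeout time.Duration) bool // indirect probe routed through another member
}

// probeOnce runs a single protocol period. members holds the other known
// members (excluding self), and k is the number of indirect probers.
func probeOnce(self string, members []string, t Transport, k int, timeout time.Duration, gossipSuspect func(string)) {
	if len(members) == 0 {
		return
	}
	// 1. Pick a member at random and probe it directly.
	target := members[rand.Intn(len(members))]
	if t.Ping(target, timeout) {
		return // got an ack: the member is alive
	}

	// 2. Ask up to k other members to probe the target on our behalf.
	for i := 0; i < k && i < len(members); i++ {
		via := members[rand.Intn(len(members))]
		if via == target || via == self {
			continue
		}
		if t.PingReq(via, target, timeout) {
			return // someone else reached it: only our path was bad
		}
	}

	// 3. No ack by any path: mark the target as suspected and gossip it.
	gossipSuspect(target)
}
```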
  25. © 2019 HashiCorp 25 Failure detection with SWIM

  26. © 2019 HashiCorp 26 Failure detection with SWIM

  27. © 2019 HashiCorp 27 Failure detection with SWIM

  28. © 2019 HashiCorp 28 Failure detection with SWIM

  29. © 2019 HashiCorp 29 Failure detection with SWIM

  30. © 2019 HashiCorp 30 Why was SWIM needed? • network load that grows quadratically with group size • compromised response times • a high false-positive frequency when detecting process crashes
  31. © 2019 HashiCorp 31 Gray failures

  32. © 2019 HashiCorp 32 Problems with failure detection in SWIM • Much of the protocol assumes fail-stop failures rather than Byzantine failures • This means that the process under suspicion might just be running slow, or might be suffering a temporary failure • It also means that the probing node could be spreading false rumors, like the traitor in the Byzantine Generals Problem: it might itself be the node with the underlying problem
  33. © 2019 HashiCorp 33 Byzantine failure • Could be a flaky node • Temporary network problem
  34. © 2019 HashiCorp 34 Byzantine failure • Two armies A and B • Need to decide to attack or retreat • If they both agree on an approach they win • If they disagree then they lose
  35. © 2019 HashiCorp 35 Byzantine failure

  36. © 2019 HashiCorp 36 Gray Failure: The Achilles' Heel of Cloud-Scale Systems • Paper from Microsoft Research • "the major availability breakdowns and performance anomalies we see in cloud environments tend to be caused by subtle underlying faults, i.e. gray failures rather than fail-stop failure"
  37. © 2019 HashiCorp 37 Gray Failure: The Achilles' Heel of Cloud-Scale Systems • Performance degradation • Random packet loss • Flaky I/O • Memory pressure • Non-fatal exceptions
  38. © 2019 HashiCorp 38 Gray Failure: Byzantine fault tolerance

  39. © 2019 HashiCorp 39 Gray Failure: Byzantine fault tolerance • Complex to implement • High network overhead • Not proven in production
  40. © 2019 HashiCorp 40 Handling gray failures with SWIM

  41. © 2019 HashiCorp 41 Gray Failure: SWIM, suspicion

  42. © 2019 HashiCorp 42 Gray Failure: SWIM, suspicion

  43. © 2019 HashiCorp 43 SWIM at scale • SWIM was only tested on 55 nodes • When the probing node raises a suspicion message and is itself running slow, it might not receive the refutation message
  44. © 2019 HashiCorp 44 Lifeguard

  45. © 2019 HashiCorp 45 Lifeguard - beyond SWIM • Dynamic fault detector timeouts - "Self Awareness" • Dynamic suspicion timeouts - "Dogpile" • Refutation timeouts - "Buddy System"
  46. © 2019 HashiCorp 46 Lifeguard - Dynamic fault detection • When no Acks are received from probes, assume the fault may lie with the probing node itself • Timeouts relaxed ◦ Probe interval modified ◦ Probe timeout modified ◦ Node Self-Awareness counter (NSA) • Implement NACKs when asking other nodes to confirm failure
  47. © 2019 HashiCorp 47 Lifeguard - Dynamic fault detection

  48. © 2019 HashiCorp 48 Lifeguard - Dynamic fault detection

  49. © 2019 HashiCorp 49 Lifeguard - Dynamic fault detection • Failed probe (no Ack) +1 • Probe with missed Nack +1 • Refute suspicion about self +1 • Successful probe (get Ack) -1 • Max NSA = 8 • ProbeTimeout = BaseInterval * (NSA + 1) • ProbeInterval = BaseInterval * (NSA + 1)
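A small sketch of how the NSA counter above could be tracked (my own illustration of the idea; in hashicorp/memberlist the equivalent counter is the local "awareness" score): the listed events move the score up or down, it is clamped to 0..8, and the probe interval and timeout are both scaled by (NSA + 1):

```go
package lifeguard

import "time"

const maxNSA = 8

// NSA is the node self-awareness counter: the less confident a node is in
// its own health, the more it relaxes its failure-detector timing.
type NSA struct {
	score int
}

func (n *NSA) adjust(delta int) {
	n.score += delta
	if n.score < 0 {
		n.score = 0
	}
	if n.score > maxNSA {
		n.score = maxNSA
	}
}

// Events from the slide and their score adjustments.
func (n *NSA) FailedProbe()      { n.adjust(+1) } // probe received no ack
func (n *NSA) MissedNack()       { n.adjust(+1) } // indirect prober never returned a nack
func (n *NSA) RefutedSuspicion() { n.adjust(+1) } // we had to refute a suspicion about ourselves
func (n *NSA) SuccessfulProbe()  { n.adjust(-1) } // probe acked normally

// Timings scaled as on the slide: BaseInterval * (NSA + 1).
func (n *NSA) ProbeInterval(base time.Duration) time.Duration {
	return base * time.Duration(n.score+1)
}

func (n *NSA) ProbeTimeout(base time.Duration) time.Duration {
	return base * time.Duration(n.score+1)
}
```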
  50. © 2019 HashiCorp 50 Lifeguard - Dynamic suspicion timeouts • Start with a high suspicion timeout • When suspicion messages are received from other nodes regarding a node we already suspect, reduce the suspicion timeout
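One way to express the "Dogpile" idea in code (a sketch based on my reading of the Lifeguard paper; the logarithmic decay and the parameter names are assumptions, not quoted from the talk): the timeout starts at a maximum and shrinks toward a minimum as independent suspicion confirmations arrive:

```go
package lifeguard

import (
	"math"
	"time"
)

// suspicionTimeout shrinks the suspicion timeout as independent suspicion
// messages about the same node arrive. confirmations is how many other
// members have also reported the node as suspect; expected is how many
// confirmations should drive the timeout all the way down to min.
func suspicionTimeout(min, max time.Duration, confirmations, expected int) time.Duration {
	// No confirmations: stay at max. Full set of confirmations: drop to min.
	frac := math.Log(float64(confirmations)+1) / math.Log(float64(expected)+1)
	timeout := max - time.Duration(frac*float64(max-min))
	if timeout < min {
		timeout = min
	}
	return timeout
}
```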
  51. © 2019 HashiCorp 51 Lifeguard - Dynamic suspicion timeouts

  52. © 2019 HashiCorp 52 Lifeguard - More timely refutation • If you are probing a node you have a suspicion about, let the node know • Suspicion messages are piggybacked on probes
  53. © 2019 HashiCorp 53 Lifeguard - Results • 7% increase in latency under normal conditions • 12% increase in messages • 8% total reduction in network traffic due to piggybacked messages • 98% reduction in false positives
  54. © 2019 HashiCorp Memberlist 54 https://github.com/hashicorp/memberlist
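A minimal usage sketch for the library linked above (based on the README-style example I'm aware of; the placeholder address "10.0.0.1" and the choice of config are assumptions, so check the repository for the current API):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/memberlist"
)

func main() {
	// Start a member with the defaults for a local (loopback) cluster.
	config := memberlist.DefaultLocalConfig()
	list, err := memberlist.Create(config)
	if err != nil {
		log.Fatalf("failed to create memberlist: %v", err)
	}

	// Join an existing cluster by pointing at any known member.
	// "10.0.0.1" is a placeholder address for illustration.
	if _, err := list.Join([]string{"10.0.0.1"}); err != nil {
		log.Fatalf("failed to join cluster: %v", err)
	}

	// Membership is kept up to date by the gossip-based protocol described
	// above; Members() returns the nodes currently believed to be alive.
	for _, member := range list.Members() {
		fmt.Printf("member: %s %s\n", member.Name, member.Addr)
	}
}
```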

  55. © 2019 HashiCorp 55 Lifeguard - References • Epidemic Algorithms: http://bitsavers.trailing-edge.com/pdf/xerox/parc/techReports/CSL-89-1_Epidemic_Algorithms_for_Replicated_Database_Maintenance.pdf • SWIM: https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf • Gray Failures: https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf • Byzantine Generals Problem: https://people.eecs.berkeley.edu/~luca/cs174/byzantine.pdf • Lifeguard: https://arxiv.org/pdf/1707.00788.pdf
  56. © 2019 HashiCorp Lifeguard - Summary 56 "a simple protocol designed by a photocopier company for email replication still powers many of our distributed systems 30 years later"
  57. Thank you @sheriffjackson nic@hashicorp.com 57