
ITT 2019 - Nic Jackson - From the first photocopy to modern failure detection in distributed systems

Distributed systems are not a new problem: for as long as there have been n+1 computers in a network, the problem of managing group membership and detecting failure has existed. Many of the algorithms we use in today's systems to solve this problem are over 30 years old. In this talk, we will look at how an algorithm for email replication, designed by a photocopier company, has morphed into SWIM, which is used to manage group membership and failure detection in modern distributed systems.

Takeaways:

- Introduction to Gossip and epidemic rumor spreading

- Deep dive into SWIM, which uses Gossip for failure detection in distributed systems

- Investigation of Lifeguard, which builds on SWIM and adds many improvements

Istanbul Tech Talks

April 02, 2019

Transcript

  1. © 2019 HashiCorp 1 From the first photocopy, to modern failure detection in distributed systems. Nic Jackson
  2. © 2019 HashiCorp 4 Methods analysed
     • Direct mail
     • Anti-entropy
     • Rumor mongering (Gossip)
  3. © 2019 HashiCorp 8 Epidemic propagation (Gossip)
     • Worked well for 3 machines in 1989
     • Still working for 55 machines in 2002
     • Not beaten yet at 5000 machines in 2019
  4. © 2019 HashiCorp 9 How Gossip works
     • All nodes in a cluster operate in synchronous rounds
     • It is assumed that each node knows the other nodes
  5. © 2019 HashiCorp 14 How Gossip works
     • Number of rounds required to spread a rumor = O(log n)
  6. © 2019 HashiCorp 15 How Gossip works
     • log(8) / log(2) = 3
     • 0.903089986992 / 0.301029995664 = 3
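The O(log n) claim on the last few slides is easy to check with a small simulation. The sketch below is mine, not from the talk: push-style gossip in which, every synchronous round, each node that knows the rumor tells one randomly chosen peer. The number of rounds until everyone has heard it grows roughly logarithmically with cluster size, which is the log(8) / log(2) = 3 intuition on the slide.

```go
// gossip_rounds.go: a minimal simulation (not from the talk) of push-style
// rumor spreading in synchronous rounds, to illustrate the O(log n) claim.
package main

import (
	"fmt"
	"math/rand"
)

// spreadRumor returns the number of synchronous rounds needed until every
// one of n nodes has heard a rumor that starts at node 0. In each round,
// every informed node picks one random peer and tells it the rumor.
func spreadRumor(n int) int {
	informed := make([]bool, n)
	informed[0] = true
	count := 1

	rounds := 0
	for count < n {
		rounds++
		// Snapshot who was informed at the start of the round so that
		// nodes informed mid-round only start gossiping next round.
		snapshot := make([]bool, n)
		copy(snapshot, informed)
		for i := 0; i < n; i++ {
			if !snapshot[i] {
				continue
			}
			target := rand.Intn(n)
			if !informed[target] {
				informed[target] = true
				count++
			}
		}
	}
	return rounds
}

func main() {
	for _, n := range []int{8, 64, 512, 4096} {
		fmt.Printf("n=%4d  rounds=%d\n", n, spreadRumor(n))
	}
}
```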
  7. © 2019 HashiCorp Demo - Rules 17
     1. When you receive the rumor you are going to select a random person nearby (behind, in front, left, right), and tell them the rumor
     2. Once you have passed the rumor, repeat step 1
     3. If the person you tell the rumor to has already heard it, stop and raise your hand
  8. © 2019 HashiCorp 20 Why was SWIM needed?
     • Node failure needs to be understood to maintain a healthy group
     • Traditional heartbeating does not work because:
       ◦ network load which grows quadratically with group size
       ◦ compromised response times
       ◦ false positive frequency with relation to detecting process crashes
  9. © 2019 HashiCorp 21 Why was SWIM needed?
     • network load which grows quadratically with group size
     • compromised response times
     • false positive frequency with relation to detecting process crashes
  10. © 2019 HashiCorp 22 Problems with traditional heart-beating
     • Sending all heartbeats to a central server can cause overload
     • Sending heartbeats to all members, either through Gossip or Multicast, leads to high network load; this grows quadratically with group size, O(n²), e.g. 4, 9, 16, 25, 36
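To make the quadratic growth concrete, here is a rough illustration of my own (not from the deck) comparing the per-period message count of all-to-all heartbeating with SWIM-style random probing; the exact constants are illustrative.

```go
// heartbeat_load.go: a rough comparison (my own, not from the talk) of
// all-to-all heartbeating versus SWIM-style probing per protocol period.
package main

import "fmt"

func main() {
	for _, n := range []int{2, 3, 4, 5, 6, 100} {
		allToAll := n * n   // every member heartbeats every other member: ~n² (4, 9, 16, 25, 36, ...)
		swimProbes := 2 * n // SWIM: roughly one ping and one ack per member, ignoring indirect ping-reqs
		fmt.Printf("n=%3d  heartbeat msgs≈%5d  SWIM probe msgs≈%4d\n", n, allToAll, swimProbes)
	}
}
```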
  11. © 2019 HashiCorp 23 Properties of SWIM
     • Constant message load per group member
     • Process failure detected in constant time
     • Infection-style (Gossip) process for membership updates
  12. © 2019 HashiCorp 24 Failure detection with SWIM
     • Distributed failure detection
     • No single point of failure
     • Every node probes another node at random
     • With constant probability, every node is probed
     • Once failure is confirmed, nodes gossip this failure
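The probing described on this slide can be sketched in a few lines of Go. This is a highly simplified, in-memory sketch of one SWIM protocol period, with my own names and structure; it is not HashiCorp's memberlist implementation, and the real protocol runs over UDP with timeouts and excludes the target and self from the helper set.

```go
// swim_probe.go: a simplified sketch of one SWIM protocol period
// (direct ping, indirect ping-req via k helpers, then suspicion).
package main

import (
	"fmt"
	"math/rand"
)

type Node struct {
	Name  string
	Alive bool // stands in for "responds to pings within the timeout"
}

// ping simulates a direct probe; it succeeds only if the target is alive.
func ping(target *Node) bool { return target.Alive }

// pingReq simulates asking k other members to probe the target on our behalf.
func pingReq(helpers []*Node, target *Node) bool {
	for _, h := range helpers {
		if h.Alive && ping(target) {
			return true // any helper that reaches the target relays an ack
		}
	}
	return false
}

// protocolPeriod runs one SWIM round from the point of view of `self`:
// probe one random member directly, fall back to k indirect probes, and
// mark the member as suspect if neither path produces an ack.
func protocolPeriod(self *Node, members []*Node, k int) {
	target := members[rand.Intn(len(members))]

	if ping(target) {
		fmt.Printf("%s: direct ack from %s, member is alive\n", self.Name, target.Name)
		return
	}

	// Pick k random helpers for indirect probes (ping-req).
	helpers := make([]*Node, 0, k)
	for _, i := range rand.Perm(len(members))[:k] {
		helpers = append(helpers, members[i])
	}

	if pingReq(helpers, target) {
		fmt.Printf("%s: indirect ack for %s, member is alive\n", self.Name, target.Name)
		return
	}

	// No ack at all: mark suspect and gossip the suspicion to the group.
	fmt.Printf("%s: no ack from %s, marking SUSPECT and gossiping\n", self.Name, target.Name)
}

func main() {
	self := &Node{Name: "node-0", Alive: true}
	members := []*Node{
		{Name: "node-1", Alive: true},
		{Name: "node-2", Alive: false}, // a failed member
		{Name: "node-3", Alive: true},
	}
	for i := 0; i < 3; i++ {
		protocolPeriod(self, members, 2)
	}
}
```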
  13. © 2019 HashiCorp 30 Why was SWIM needed?
     • network load which grows quadratically with group size
     • compromised response times
     • false positive frequency with relation to detecting process crashes
  14. © 2019 HashiCorp 32 Problems with failure detection in SWIM
     • Much of the process is based on a fail-stop model rather than Byzantine failure
     • This means that the process under suspicion might just be running slow, or it might be suffering a transient failure
     • It also means that the probing node could be spreading false rumors, like the traitor in the Byzantine Generals Problem; the prober itself might be the underlying problem
  15. © 2019 HashiCorp 33 Byzantine failure
     • Could be a flaky node
     • Temporary network problem
  16. © 2019 HashiCorp 34 Byzantine failure
     • Two armies A and B
     • Need to decide to attack or retreat
     • If they both agree on an approach they win
     • If they disagree then they lose
  17. © 2019 HashiCorp 36 Gray Failure: The Achilles' Heel of Cloud-Scale Systems
     • Paper from Microsoft Research
     • "the major availability breakdowns and performance anomalies we see in cloud environments tend to be caused by subtle underlying faults, i.e. gray failures rather than fail-stop failure"
  18. © 2019 HashiCorp 37 Gray Failure: The Achilles' Heel of Cloud-Scale Systems
     • Performance degradation
     • Random packet loss
     • Flaky I/O
     • Memory pressure
     • Non-fatal exceptions
  19. © 2019 HashiCorp 39 Gray Failure: Byzantine fault tolerance
     • Complex to implement
     • High network overhead
     • Not proven in production
  20. © 2019 HashiCorp 43 SWIM at scale
     • SWIM was only tested on 55 nodes
     • When the probing node raises a suspicion message and is itself running slow, it might not get the refutation message
  21. © 2019 HashiCorp 45 Lifeguard - beyond SWIM
     • Dynamic fault detector timeouts - "Self Awareness"
     • Dynamic suspicion timeouts - "Dogpile"
     • Refutation timeouts - "Buddy System"
  22. © 2019 HashiCorp 46 Lifeguard - Dynamic fault detection
     • When no Acks are received from probes, assume the fault is with the probing node
     • Timeouts relaxed:
       ◦ Probe interval modified
       ◦ Probe timeout modified
       ◦ Node Self-Awareness counter (NSA)
     • Implement NACKs when asking other nodes to confirm failure
  23. © 2019 HashiCorp 49 Lifeguard - Dynamic fault detection
     • Failed probe (no Ack): +1
     • Probe with missed Nack: +1
     • Refute suspicion about self: +1
     • Successful probe (got Ack): -1
     • Max NSA = 8
     ProbeTimeout = BaseTimeout * (NSA + 1)
     ProbeInterval = BaseInterval * (NSA + 1)
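A minimal sketch of the counter described on this slide, with my own naming and illustrative base values (the Lifeguard paper calls this the Local Health Multiplier); it is not the memberlist implementation:

```go
// lifeguard_nsa.go: a sketch of Lifeguard's Node Self-Awareness (NSA)
// counter. The counter inflates the local node's probe timeout and probe
// interval when the node suspects it is itself the unhealthy party.
package main

import (
	"fmt"
	"time"
)

const maxNSA = 8

type selfAwareness struct {
	nsa int // 0 = healthy; higher = less confident in our own timeliness
}

// adjust clamps the counter into [0, maxNSA].
func (s *selfAwareness) adjust(delta int) {
	s.nsa += delta
	if s.nsa < 0 {
		s.nsa = 0
	}
	if s.nsa > maxNSA {
		s.nsa = maxNSA
	}
}

// Events from the slide and their effect on the counter.
func (s *selfAwareness) failedProbe()     { s.adjust(+1) } // probe got no Ack
func (s *selfAwareness) missedNack()      { s.adjust(+1) } // indirect probe helper sent no Nack
func (s *selfAwareness) refutedSelf()     { s.adjust(+1) } // we had to refute a suspicion about ourselves
func (s *selfAwareness) successfulProbe() { s.adjust(-1) } // probe got an Ack

// Timeouts scale linearly with the counter, as on the slide.
func (s *selfAwareness) probeTimeout(base time.Duration) time.Duration {
	return base * time.Duration(s.nsa+1)
}
func (s *selfAwareness) probeInterval(base time.Duration) time.Duration {
	return base * time.Duration(s.nsa+1)
}

func main() {
	s := &selfAwareness{}
	s.failedProbe()
	s.failedProbe()
	s.refutedSelf()
	s.successfulProbe()
	fmt.Println("NSA:", s.nsa)                                         // 2
	fmt.Println("ProbeTimeout:", s.probeTimeout(500*time.Millisecond)) // 1.5s
	fmt.Println("ProbeInterval:", s.probeInterval(1*time.Second))      // 3s
}
```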
  24. © 2019 HashiCorp 50 Lifeguard - Dynamic suspicion timeouts
     • Start with a high suspicion timeout
     • When suspicion messages are received from other nodes regarding a node we already suspect, reduce suspicion timeout
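One way to implement "start high, reduce as confirmations arrive" is the logarithmic interpolation used by the Lifeguard paper and HashiCorp's memberlist. The sketch below is my own code with illustrative constants, not the library's:

```go
// lifeguard_dogpile.go: a sketch of Lifeguard's dynamic suspicion timeout.
// The timeout starts at max and shrinks toward min as independent
// confirmations of the suspicion arrive from other members.
package main

import (
	"fmt"
	"math"
	"time"
)

// suspicionTimeout interpolates between max (no confirmations) and min
// (k independent confirmations) on a logarithmic scale.
func suspicionTimeout(min, max time.Duration, confirmations, k int) time.Duration {
	if confirmations >= k {
		return min
	}
	frac := math.Log(float64(confirmations)+1) / math.Log(float64(k)+1)
	timeout := max - time.Duration(frac*float64(max-min))
	if timeout < min {
		timeout = min
	}
	return timeout
}

func main() {
	min, max, k := 3*time.Second, 30*time.Second, 3
	for c := 0; c <= 3; c++ {
		fmt.Printf("confirmations=%d  timeout=%v\n", c, suspicionTimeout(min, max, c, k))
	}
}
```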
  25. © 2019 HashiCorp 52 Lifeguard - More timely refutation
     • If you are probing a node which you have suspicion about
     • Let the node know
     • Suspicion messages are piggybacked on probes
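A toy sketch of the piggybacking idea, using my own types rather than memberlist's wire format: when we probe a member we currently suspect, the suspicion rides along on the ping so the member can refute it immediately instead of waiting to hear the rumor via gossip.

```go
// lifeguard_buddy.go: a toy sketch of the "Buddy System" idea.
package main

import "fmt"

type Ping struct {
	From string
	To   string
	// PiggybackedSuspicions carries any suspicion messages we hold about
	// the destination, giving it the earliest possible chance to refute.
	PiggybackedSuspicions []string
}

// buildPing attaches our suspicion of the destination (if any) to the probe.
func buildPing(from, to string, suspected map[string]bool) Ping {
	p := Ping{From: from, To: to}
	if suspected[to] {
		p.PiggybackedSuspicions = append(p.PiggybackedSuspicions, to)
	}
	return p
}

func main() {
	suspected := map[string]bool{"node-2": true}
	p := buildPing("node-0", "node-2", suspected)
	fmt.Printf("ping %s -> %s, piggybacked suspicions: %v\n", p.From, p.To, p.PiggybackedSuspicions)
}
```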
  26. © 2019 HashiCorp 53 Lifeguard - Results
     • 7% increase in latency under normal conditions
     • 12% increase in messages
     • 8% total reduction in network traffic due to piggybacked messages
     • 98% reduction in false positives
  27. © 2019 HashiCorp 55 Lifeguard - References
     Epidemic Algorithms: http://bitsavers.trailing-edge.com/pdf/xerox/parc/techReports/CSL-89-1_Epidemic_Algorithms_for_Replicated_Database_Maintenance.pdf
     SWIM: https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf
     Gray Failures: https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf
     Byzantine Generals Problem: https://people.eecs.berkeley.edu/~luca/cs174/byzantine.pdf
     Lifeguard: https://arxiv.org/pdf/1707.00788.pdf
  28. © 2019 HashiCorp Lifeguard - Summary 56 “a simple protocol designed by a photocopier company for email replication still powers many of our distributed systems 30 years later”