
ITT 2019 - Nic Jackson - From the first photocopy to modern failure detection in distributed systems

Distributed systems are not a new problem: for as long as there have been n+1 computers in a network, the problem of managing group membership and detecting failure has existed. Many of the algorithms we use in today's systems to solve this problem are over 30 years old. In this talk, we will look at how an algorithm for email replication, designed by a photocopier company, has morphed into SWIM, which is used to manage group membership and failure detection in modern distributed systems.

Takeaways:

- Introduction to Gossip and epidemic rumor spreading

- Deep dive into SWIM, which uses Gossip for failure detection in distributed systems

- Investigation of Lifeguard, which builds on SWIM and adds many improvements

Istanbul Tech Talks

April 02, 2019

Transcript

  1. © 2019 HashiCorp 1 From the first photocopy, to modern failure detection in distributed systems. Nic Jackson
  2. © 2019 HashiCorp 4 Methods analysed
     • Direct mail
     • Anti-entropy
     • Rumor mongering (Gossip)
  3. © 2019 HashiCorp 8 Epidemic propagation (Gossip)
     • Worked well for 3 machines in 1989
     • Still working for 55 machines in 2002
     • Not beaten yet at 5000 machines in 2019
  4. © 2019 HashiCorp 9 How Gossip works
     • All nodes in a cluster operate in synchronous rounds
     • It is assumed that each node knows the other nodes
  5. © 2019 HashiCorp 14 How Gossip works
     • Number of rounds required to spread a rumor = O(log n)
  6. © 2019 HashiCorp 15 How Gossip works
     • log(8) / log(2) = 3
     • 0.903089986992 / 0.301029995664 = 3
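The O(log n) claim on the last few slides is easy to check with a small simulation. The sketch below is mine, not from the talk: push-style gossip in which, every synchronous round, each node that knows the rumor tells one randomly chosen peer. The number of rounds until everyone has heard it grows roughly logarithmically with cluster size, which is the log(8) / log(2) = 3 intuition on the slide.

```go
// gossip_rounds.go: a minimal simulation (not from the talk) of push-style
// rumor spreading in synchronous rounds, to illustrate the O(log n) claim.
package main

import (
	"fmt"
	"math/rand"
)

// spreadRumor returns the number of synchronous rounds needed until every
// one of n nodes has heard a rumor that starts at node 0. In each round,
// every informed node picks one random peer and tells it the rumor.
func spreadRumor(n int) int {
	informed := make([]bool, n)
	informed[0] = true
	count := 1

	rounds := 0
	for count < n {
		rounds++
		// Snapshot who was informed at the start of the round so that
		// nodes informed mid-round only start gossiping next round.
		snapshot := make([]bool, n)
		copy(snapshot, informed)
		for i := 0; i < n; i++ {
			if !snapshot[i] {
				continue
			}
			target := rand.Intn(n)
			if !informed[target] {
				informed[target] = true
				count++
			}
		}
	}
	return rounds
}

func main() {
	for _, n := range []int{8, 64, 512, 4096} {
		fmt.Printf("n=%4d  rounds=%d\n", n, spreadRumor(n))
	}
}
```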
  7. © 2019 HashiCorp Demo - Rules 17
     1. When you receive the rumor you are going to select a random person nearby (behind, in front, left, right), and tell them the rumor
     2. Once you have passed the rumor, repeat step 1
     3. If the person you tell the rumor to has already heard it, stop and raise your hand
  8. © 2019 HashiCorp 20 Why was SWIM needed?
     • Node failure needs to be understood to maintain a healthy group
     • Traditional heartbeating does not work because:
       ◦ network load which grows quadratically with group size
       ◦ compromised response times
       ◦ false positive frequency with relation to detecting process crashes
  9. © 2019 HashiCorp 21 Why was SWIM needed?
     • network load which grows quadratically with group size
     • compromised response times
     • false positive frequency with relation to detecting process crashes
  10. © 2019 HashiCorp 22 Problems with traditional heart-beating
     • Sending all heartbeats to a central server can cause overload
     • Sending heartbeats to all members, either through Gossip or Multicast, leads to high network load; this grows quadratically with group size, O(n²), e.g. 4, 9, 16, 25, 36
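To make the quadratic growth concrete, here is a rough illustration of my own (not from the deck) comparing the per-period message count of all-to-all heartbeating with SWIM-style random probing; the exact constants are illustrative.

```go
// heartbeat_load.go: a rough comparison (my own, not from the talk) of
// all-to-all heartbeating versus SWIM-style probing per protocol period.
package main

import "fmt"

func main() {
	for _, n := range []int{2, 3, 4, 5, 6, 100} {
		allToAll := n * n   // every member heartbeats every other member: ~n² (4, 9, 16, 25, 36, ...)
		swimProbes := 2 * n // SWIM: roughly one ping and one ack per member, ignoring indirect ping-reqs
		fmt.Printf("n=%3d  heartbeat msgs≈%5d  SWIM probe msgs≈%4d\n", n, allToAll, swimProbes)
	}
}
```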
  11. © 2019 HashiCorp 23 Properties of SWIM
     • Constant message load per group member
     • Process failure detected in constant time
     • Infection-style (Gossip) process for membership updates
  12. © 2019 HashiCorp 24 Failure detection with SWIM
     • Distributed failure detection
     • No single point of failure
     • Every node probes another node at random
     • With constant probability, every node is probed
     • Once failure is confirmed, nodes gossip this failure
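The probing described on this slide can be sketched in a few lines of Go. This is a highly simplified, in-memory sketch of one SWIM protocol period, with my own names and structure; it is not HashiCorp's memberlist implementation, and the real protocol runs over UDP with timeouts and excludes the target and self from the helper set.

```go
// swim_probe.go: a simplified sketch of one SWIM protocol period
// (direct ping, indirect ping-req via k helpers, then suspicion).
package main

import (
	"fmt"
	"math/rand"
)

type Node struct {
	Name  string
	Alive bool // stands in for "responds to pings within the timeout"
}

// ping simulates a direct probe; it succeeds only if the target is alive.
func ping(target *Node) bool { return target.Alive }

// pingReq simulates asking k other members to probe the target on our behalf.
func pingReq(helpers []*Node, target *Node) bool {
	for _, h := range helpers {
		if h.Alive && ping(target) {
			return true // any helper that reaches the target relays an ack
		}
	}
	return false
}

// protocolPeriod runs one SWIM round from the point of view of `self`:
// probe one random member directly, fall back to k indirect probes, and
// mark the member as suspect if neither path produces an ack.
func protocolPeriod(self *Node, members []*Node, k int) {
	target := members[rand.Intn(len(members))]

	if ping(target) {
		fmt.Printf("%s: direct ack from %s, member is alive\n", self.Name, target.Name)
		return
	}

	// Pick k random helpers for indirect probes (ping-req).
	helpers := make([]*Node, 0, k)
	for _, i := range rand.Perm(len(members))[:k] {
		helpers = append(helpers, members[i])
	}

	if pingReq(helpers, target) {
		fmt.Printf("%s: indirect ack for %s, member is alive\n", self.Name, target.Name)
		return
	}

	// No ack at all: mark suspect and gossip the suspicion to the group.
	fmt.Printf("%s: no ack from %s, marking SUSPECT and gossiping\n", self.Name, target.Name)
}

func main() {
	self := &Node{Name: "node-0", Alive: true}
	members := []*Node{
		{Name: "node-1", Alive: true},
		{Name: "node-2", Alive: false}, // a failed member
		{Name: "node-3", Alive: true},
	}
	for i := 0; i < 3; i++ {
		protocolPeriod(self, members, 2)
	}
}
```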
  13. © 2019 HashiCorp 30 Why was SWIM needed?
     • network load which grows quadratically with group size
     • compromised response times
     • false positive frequency with relation to detecting process crashes
  14. © 2019 HashiCorp 32 Problems with failure detection in SWIM
     • Much of the process is based on a fail-stop model rather than Byzantine failure
     • This means that the process under suspicion might just be running slow, or it might be suffering a transient failure
     • It also means that the probing node could be spreading false rumors, like the traitor in the Byzantine Generals Problem; the prober itself might be the underlying problem
  15. © 2019 HashiCorp 33 Byzantine failure
     • Could be a flaky node
     • Temporary network problem
  16. © 2019 HashiCorp 34 Byzantine failure
     • Two armies A and B
     • Need to decide to attack or retreat
     • If they both agree on an approach they win
     • If they disagree then they lose
  17. © 2019 HashiCorp 36 Gray Failure: The Achilles' Heel of Cloud-Scale Systems
     • Paper from Microsoft Research
     • "the major availability breakdowns and performance anomalies we see in cloud environments tend to be caused by subtle underlying faults, i.e. gray failures rather than fail-stop failure"
  18. © 2019 HashiCorp 37 Gray Failure: The Achilles' Heel of Cloud-Scale Systems
     • Performance degradation
     • Random packet loss
     • Flaky I/O
     • Memory pressure
     • Non-fatal exceptions
  19. © 2019 HashiCorp 39 Gray Failure: Byzantine fault tolerance
     • Complex to implement
     • High network overhead
     • Not proven in production
  20. © 2019 HashiCorp 43 SWIM at scale
     • SWIM was only tested on 55 nodes
     • When the probing node raises a suspicion message and is itself running slow, it might not get the refutation message
  21. © 2019 HashiCorp 45 Lifeguard - beyond SWIM
     • Dynamic fault detector timeouts - "Self Awareness"
     • Dynamic suspicion timeouts - "Dogpile"
     • Refutation timeouts - "Buddy System"
  22. © 2019 HashiCorp 46 Lifeguard - Dynamic fault detection
     • When no Acks are received from probes, assume the fault is with the probing node
     • Timeouts relaxed:
       ◦ Probe interval modified
       ◦ Probe timeout modified
       ◦ Node Self-Awareness counter (NSA)
     • Implement NACKs when asking other nodes to confirm failure
  23. © 2019 HashiCorp 49 Lifeguard - Dynamic fault detection
     • Failed probe (no Ack): +1
     • Probe with missed Nack: +1
     • Refute suspicion about self: +1
     • Successful probe (got Ack): -1
     • Max NSA = 8
     ProbeTimeout = BaseTimeout * (NSA + 1)
     ProbeInterval = BaseInterval * (NSA + 1)
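A minimal sketch of the counter described on this slide, with my own naming and illustrative base values (the Lifeguard paper calls this the Local Health Multiplier); it is not the memberlist implementation:

```go
// lifeguard_nsa.go: a sketch of Lifeguard's Node Self-Awareness (NSA)
// counter. The counter inflates the local node's probe timeout and probe
// interval when the node suspects it is itself the unhealthy party.
package main

import (
	"fmt"
	"time"
)

const maxNSA = 8

type selfAwareness struct {
	nsa int // 0 = healthy; higher = less confident in our own timeliness
}

// adjust clamps the counter into [0, maxNSA].
func (s *selfAwareness) adjust(delta int) {
	s.nsa += delta
	if s.nsa < 0 {
		s.nsa = 0
	}
	if s.nsa > maxNSA {
		s.nsa = maxNSA
	}
}

// Events from the slide and their effect on the counter.
func (s *selfAwareness) failedProbe()     { s.adjust(+1) } // probe got no Ack
func (s *selfAwareness) missedNack()      { s.adjust(+1) } // indirect probe helper sent no Nack
func (s *selfAwareness) refutedSelf()     { s.adjust(+1) } // we had to refute a suspicion about ourselves
func (s *selfAwareness) successfulProbe() { s.adjust(-1) } // probe got an Ack

// Timeouts scale linearly with the counter, as on the slide.
func (s *selfAwareness) probeTimeout(base time.Duration) time.Duration {
	return base * time.Duration(s.nsa+1)
}
func (s *selfAwareness) probeInterval(base time.Duration) time.Duration {
	return base * time.Duration(s.nsa+1)
}

func main() {
	s := &selfAwareness{}
	s.failedProbe()
	s.failedProbe()
	s.refutedSelf()
	s.successfulProbe()
	fmt.Println("NSA:", s.nsa)                                         // 2
	fmt.Println("ProbeTimeout:", s.probeTimeout(500*time.Millisecond)) // 1.5s
	fmt.Println("ProbeInterval:", s.probeInterval(1*time.Second))      // 3s
}
```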
  24. © 2019 HashiCorp 50 Lifeguard - Dynamic suspicion timeouts
     • Start with a high suspicion timeout
     • When suspicion messages are received from other nodes regarding a node we already suspect, reduce suspicion timeout
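One way to implement "start high, reduce as confirmations arrive" is the logarithmic interpolation used by the Lifeguard paper and HashiCorp's memberlist. The sketch below is my own code with illustrative constants, not the library's:

```go
// lifeguard_dogpile.go: a sketch of Lifeguard's dynamic suspicion timeout.
// The timeout starts at max and shrinks toward min as independent
// confirmations of the suspicion arrive from other members.
package main

import (
	"fmt"
	"math"
	"time"
)

// suspicionTimeout interpolates between max (no confirmations) and min
// (k independent confirmations) on a logarithmic scale.
func suspicionTimeout(min, max time.Duration, confirmations, k int) time.Duration {
	if confirmations >= k {
		return min
	}
	frac := math.Log(float64(confirmations)+1) / math.Log(float64(k)+1)
	timeout := max - time.Duration(frac*float64(max-min))
	if timeout < min {
		timeout = min
	}
	return timeout
}

func main() {
	min, max, k := 3*time.Second, 30*time.Second, 3
	for c := 0; c <= 3; c++ {
		fmt.Printf("confirmations=%d  timeout=%v\n", c, suspicionTimeout(min, max, c, k))
	}
}
```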
  25. © 2019 HashiCorp 52 Lifeguard - More timely refutation
     • If you are probing a node which you have suspicion about
     • Let the node know
     • Suspicion messages are piggybacked on probes
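A toy sketch of the piggybacking idea, using my own types rather than memberlist's wire format: when we probe a member we currently suspect, the suspicion rides along on the ping so the member can refute it immediately instead of waiting to hear the rumor via gossip.

```go
// lifeguard_buddy.go: a toy sketch of the "Buddy System" idea.
package main

import "fmt"

type Ping struct {
	From string
	To   string
	// PiggybackedSuspicions carries any suspicion messages we hold about
	// the destination, giving it the earliest possible chance to refute.
	PiggybackedSuspicions []string
}

// buildPing attaches our suspicion of the destination (if any) to the probe.
func buildPing(from, to string, suspected map[string]bool) Ping {
	p := Ping{From: from, To: to}
	if suspected[to] {
		p.PiggybackedSuspicions = append(p.PiggybackedSuspicions, to)
	}
	return p
}

func main() {
	suspected := map[string]bool{"node-2": true}
	p := buildPing("node-0", "node-2", suspected)
	fmt.Printf("ping %s -> %s, piggybacked suspicions: %v\n", p.From, p.To, p.PiggybackedSuspicions)
}
```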
  26. © 2019 HashiCorp 53 Lifeguard - Results
     • 7% increase in latency under normal conditions
     • 12% increase in messages
     • 8% total reduction in network traffic due to piggybacked messages
     • 98% reduction in false positives
  27. © 2019 HashiCorp 55 Lifeguard - References
     Epidemic Algorithms: http://bitsavers.trailing-edge.com/pdf/xerox/parc/techReports/CSL-89-1_Epidemic_Algorithms_for_Replicated_Database_Maintenance.pdf
     SWIM: https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf
     Gray Failures: https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf
     Byzantine Generals Problem: https://people.eecs.berkeley.edu/~luca/cs174/byzantine.pdf
     Lifeguard: https://arxiv.org/pdf/1707.00788.pdf
  28. © 2019 HashiCorp Lifeguard - Summary 56 “a simple protocol designed by a photocopier company for email replication still powers many of our distributed systems 30 years later”