
EverybodyTalks.pdf


Sarah Christoff

August 27, 2019


Transcript

  1. Who am I?
     - Hashidork at HashiCorp
     - My mom says I’m cool
     - Member of the Consulate
  2. What was the problem? “I’m sick of hardcoding IP addresses everywhere” - me, like all the time (and Armon, probably, once, maybe)
  3. DECENTRALIZED: Everyone is the client. Everyone is the server. We are all one. We are all no one.
  4. “We focus on a weaker variant of group membership, where membership lists at different members need not be consistent across the group at the same (causal) point in time. Stronger guarantees could be provided by augmenting the membership sub-system, e.g. a virtually-synchronous style membership can be provided through a sequencer process that checkpoints the membership list periodically. However, unlike the weakly consistent problem, strongly consistent specifications might have fundamental scalability limitations.” - SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol, Abhinandan Das, Indranil Gupta, Ashish Motivala
  5. Properties of SWIM
     - Constant load per member regardless of group size
     - Failure detection latency is independent of cluster size
     - Infection-style (gossip) membership updates: Alive, Suspect, Dead
  6. What does SWIM give us?
     - Propagating membership updates: joining, failing, leaving
     - Failure detection
     - Not heartbeat driven
  7. Properties of SWIM
     - Incarnation numbers are used to order messages
     - They start at 0; a node increments its own number when it receives a suspicion about itself while it is still alive, refuting the suspicion (see the sketch below)
     - Incarnation numbers are local: only the node itself can increment its own incarnation number
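A minimal sketch of that refutation rule in Go, using hypothetical names (node, onSuspected) rather than memberlist's actual types:

```go
package main

import "fmt"

// node holds the one counter only this node may ever increment.
type node struct {
	incarnation uint32 // starts at 0
}

// onSuspected handles a Suspect message that names us. If the
// suspicion is at least as fresh as our current incarnation, we bump
// past it so our Alive refutation outranks the stale Suspect.
func (n *node) onSuspected(msgIncarnation uint32) (alive uint32, refute bool) {
	if msgIncarnation < n.incarnation {
		return 0, false // we already refuted a fresher suspicion
	}
	n.incarnation = msgIncarnation + 1
	return n.incarnation, true // broadcast Alive with the new number
}

func main() {
	n := &node{}
	if inc, ok := n.onSuspected(0); ok {
		fmt.Printf("refute with Alive{incarnation: %d}\n", inc) // prints 1
	}
}
```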
  8. “We built that as the Memberlist library, and tried to have that be as pure of an implementation as possible”
  9. Memberlist Basics
     - Member states: Alive, Suspect, Dead
     - Direct and indirect probes for liveness
     - Membership updates piggyback on probes (see the example below)
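Those basics surface through a small API. A sketch using hashicorp/memberlist's real entry points (DefaultLANConfig, Create, Join, Members); the node name and peer address are placeholders:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/memberlist"
)

func main() {
	// LAN defaults: probe intervals and timeouts tuned for a local network.
	conf := memberlist.DefaultLANConfig()
	conf.Name = "node-a" // placeholder node name

	list, err := memberlist.Create(conf)
	if err != nil {
		log.Fatal(err)
	}

	// Join by contacting any live member; membership updates then
	// spread by piggybacking on the regular probe traffic.
	if _, err := list.Join([]string{"10.0.0.1"}); err != nil {
		log.Printf("join failed: %v", err)
	}

	for _, m := range list.Members() {
		fmt.Printf("%s %s\n", m.Name, m.Addr)
	}
}
```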
  10. Memberlist Additions
     - Uses both TCP and UDP for direct probes
     - Anti-entropy mechanism: periodic full state syncs over TCP with random members
     - A separate messaging layer for membership updates: nodes also send out membership updates on their own periodically
  11. Lifeguard: SWIM-ing with situational awareness
     - Dynamic fault-detector timeouts (Self-Awareness)
     - Dynamic suspicion timeouts (Dogpile)
     - More timely refutation (Buddy System)
  12. Lifeguard: Self-Awareness
     - Keeps unhealthy nodes from sending inaccurate suspect messages until their local conditions improve, reducing network traffic
     - Useful during high local resource utilization or network partitions
     - A node’s awareness score increases when it suspects it is resource constrained
  13. Lifeguard: Dogpile
     - A new suspicion starts with the longest timeout to respond; each subsequent independent suspicion shortens it (see the sketch below)
     - Shortening the suspicion timeout based on confirmations reduces the time a failed node spends in the Suspect state
     “We wait for the truth before we start spreading falsehoods” - Solomon Christoff, Golden Retriever
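A sketch of the logarithmic decay behind Dogpile: each independent confirmation of a suspicion shrinks the remaining timeout. The shape follows the Lifeguard paper; the constants and exact code in hashicorp/memberlist may differ:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// suspicionTimeout shrinks from max toward min as c of an expected k
// independent confirmations of the suspicion arrive.
func suspicionTimeout(min, max time.Duration, c, k int) time.Duration {
	frac := math.Log(float64(c)+1) / math.Log(float64(k)+1)
	timeout := max - time.Duration(frac*float64(max-min))
	if timeout < min {
		timeout = min
	}
	return timeout
}

func main() {
	min, max := 3*time.Second, 30*time.Second
	for c := 0; c <= 3; c++ {
		// 0 confirmations -> the full 30s; 3 of 3 -> the 3s floor.
		fmt.Printf("confirmations=%d timeout=%s\n", c, suspicionTimeout(min, max, c, 3))
	}
}
```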
  14. Lifeguard: Buddy System
     - Prioritizes notifying a member that it is suspected
     - Any time we probe a node, directly or indirectly, we let it know we think it is having problems
     - Expedites the refutation of false failures
  15. Encryption
     - A keyring is used to encrypt communication between nodes (see the example below)
     - Many keys can be stored and each is tried when decrypting a message; likewise, any stored key can be used to encrypt a message
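A hedged example using memberlist's real NewKeyring: the primary key encrypts outgoing messages, and every installed key is tried on decrypt, which is what makes rolling key rotation possible. The key bytes are placeholders (AES keys must be 16, 24, or 32 bytes):

```go
package main

import (
	"log"

	"github.com/hashicorp/memberlist"
)

func main() {
	// Placeholder 16-byte AES-128 keys; use real random keys in practice.
	primary := []byte("0123456789abcdef")
	old := []byte("fedcba9876543210")

	// The keyring holds every key we can decrypt with; the primary key
	// is the one used to encrypt outgoing messages.
	ring, err := memberlist.NewKeyring([][]byte{old}, primary)
	if err != nil {
		log.Fatal(err)
	}

	conf := memberlist.DefaultLANConfig()
	conf.Keyring = ring

	if _, err := memberlist.Create(conf); err != nil {
		log.Fatal(err)
	}
}
```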
  16. Serf Salesman: *slaps roof of Serf* “This bad boy has so many new functions”
     - Graceful Leave: gives nodes the option to leave the cluster on their own behalf
     - Snapshotting: saves the state of a node
     - Network Coordinates: get network coordinates locally
     - KeyManager: install and uninstall keys
     - And much more!
  17. Custom Event Propagation
     - Run a shell script, exec a command, choose your own adventure
     - Can trigger off member events or be localized to a specific member (see the sketch below)
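In Go, the same propagation is available through Serf's real UserEvent API (the serf agent's shell-script handlers sit on top of this). The event name and payload here are placeholders:

```go
package main

import (
	"log"

	"github.com/hashicorp/serf/serf"
)

func main() {
	// Single-node setup for illustration.
	conf := serf.DefaultConfig()
	events := make(chan serf.Event, 16)
	conf.EventCh = events

	s, err := serf.Create(conf)
	if err != nil {
		log.Fatal(err)
	}
	defer s.Shutdown()

	// Broadcast a custom event through the gossip layer; handlers on
	// every member (or a filtered subset) can react to it.
	if err := s.UserEvent("deploy", []byte("v1.2.3"), false); err != nil {
		log.Fatal(err)
	}

	// Member joins, user events, and queries all arrive on the channel.
	e := <-events
	log.Printf("got event: %s", e)
}
```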
  18. Vivaldi Network Tomography System
     - Network tomography: the “study of a network’s internal characteristics using information from end-point data”
     - Uses round-trip time to calculate the distance between peers in a cluster (toy sketch below)
     ^ Vivaldi, irl.
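A toy sketch of how Vivaldi-style coordinates turn into an RTT estimate: Euclidean distance in the virtual space plus each node's "height" (its access-link latency). Types and fields here are illustrative, not serf/coordinate's actual API:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Coord is an illustrative Vivaldi coordinate.
type Coord struct {
	Vec    []float64 // position in the virtual space (seconds)
	Height float64   // non-Euclidean height term (seconds)
}

// DistanceTo estimates the RTT to another peer: straight-line distance
// in the virtual space plus both endpoints' heights.
func (c Coord) DistanceTo(o Coord) time.Duration {
	var sum float64
	for i := range c.Vec {
		d := c.Vec[i] - o.Vec[i]
		sum += d * d
	}
	secs := math.Sqrt(sum) + c.Height + o.Height
	return time.Duration(secs * float64(time.Second))
}

func main() {
	a := Coord{Vec: []float64{0.010, 0.020}, Height: 0.001}
	b := Coord{Vec: []float64{0.013, 0.024}, Height: 0.002}
	fmt.Println(a.DistanceTo(b)) // estimated RTT between the two peers: 8ms
}
```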
  19. Lamport Clocks
     - Leslie Lamport is back at it, in 1978
     - A logical clock that is event based
     - Replaces incarnation numbers to keep messages ordered (sketch below)
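A minimal Lamport clock in Go, mirroring the shape of Serf's (Time, Increment, Witness): local events increment the counter, and witnessing a remote timestamp jumps just past it so later local events sort after it:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// LamportClock is a monotonically increasing logical clock.
type LamportClock struct{ counter uint64 }

// Time returns the current logical time.
func (l *LamportClock) Time() uint64 { return atomic.LoadUint64(&l.counter) }

// Increment advances the clock for a local event.
func (l *LamportClock) Increment() uint64 { return atomic.AddUint64(&l.counter, 1) }

// Witness updates the clock after observing a remote timestamp.
func (l *LamportClock) Witness(v uint64) {
	for {
		cur := atomic.LoadUint64(&l.counter)
		if v < cur {
			return
		}
		// Jump just past the witnessed value so our next event orders after it.
		if atomic.CompareAndSwapUint64(&l.counter, cur, v+1) {
			return
		}
	}
}

func main() {
	var c LamportClock
	c.Increment()         // local event -> 1
	c.Witness(41)         // saw remote time 41 -> jump to 42
	fmt.Println(c.Time()) // 42
}
```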
  20. Breakdown: In the beginning, Raft has to elect a leader. Each node is given a randomized timeout (e.g. 150ms, 157ms, 190ms, 201ms, 300ms).
  21. Breakdown: The first node to reach the end of its timeout will request to be leader (“Vote for me please!”). A node will typically reach the end of its timeout when it stops receiving messages from the leader.
  22. Breakdown: The elected leader will send out heartbeats (“New phone, who dis??”), which restart the other nodes’ timeouts. (See the sketch of this loop below.)
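A sketch of that follower loop, assuming the classic 150-300ms randomized window from the Raft paper; hashicorp/raft's real timer handling is more involved:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout picks a fresh random timeout in [150ms, 300ms).
func electionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Int63n(150))*time.Millisecond
}

// follower waits for heartbeats; each one restarts the countdown, and
// only on expiry does the node stand for election.
func follower(heartbeats <-chan struct{}) {
	timer := time.NewTimer(electionTimeout())
	for {
		select {
		case <-heartbeats:
			// Leader is alive; restart the countdown.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(electionTimeout())
		case <-timer.C:
			fmt.Println("timeout: become candidate, request votes")
			return
		}
	}
}

func main() {
	hb := make(chan struct{})
	go func() {
		// Three heartbeats, then the leader goes silent.
		for i := 0; i < 3; i++ {
			time.Sleep(100 * time.Millisecond)
			hb <- struct{}{}
		}
	}()
	follower(hb)
}
```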
  23. A server is in exactly one of three states at any given time:
     - Follower: listening for heartbeats
     - Candidate: polling for votes
     - Leader: listening for incoming commands, sending out heartbeats to keep the term alive
  24. Breakdown: Terms
     - Raft time is divided into terms; each term has at most one leader
     - Some terms can have no leader at all
     - “Terms identify obsolete information” - John Ousterhout
     - The leader’s log is seen as the truth and is the most up-to-date log
  25. Breakdown: Leader Election
     - A timeout occurs after not receiving a heartbeat from the leader
     - Request that others vote for you
     - Become leader and send out heartbeats, or
     - Somebody else becomes leader: become a follower, or
     - The vote splits and nobody wins: start a new term
  26. Breakdown: Leader Election (“Vote for me please!”)
     - A server will deny its vote to a would-be leader whose log is behind its own: a lower last term, or the same term with a lower last index (see the sketch below)
     - (Diagram: five logs of index/value entries such as 1: x=3, 2: y=8, 3: n=9, with a different color marking each new term)
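That voting rule from §5.4.1 of the Raft paper fits in a few lines. A sketch, not hashicorp/raft's actual code:

```go
package main

import "fmt"

// candidateUpToDate grants a vote only if the candidate's log is at
// least as up to date as ours: compare last terms first, and use the
// last index as the tie-breaker.
func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex uint64) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}

func main() {
	// Our log ends at term 3, index 5; the candidate's ends at term 2, index 9.
	fmt.Println(candidateUpToDate(2, 9, 3, 5)) // false: we deny the vote
}
```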
  27. Breakdown: Log Replication
     - “Keeping the replicated log consistent is the job of the consensus algorithm.”
     - Raft is designed around the log: servers with inconsistent logs will never get elected as leader
     - Normal operation of Raft will repair inconsistencies
  28. Breakdown: Committed Entries
     - Logs must persist through crashes
     - A committed entry is replicated on the majority of servers (see the sketch below)
     - Any committed entry is safe to execute in state machines
     - (Diagram: five logs sharing entries 1: x=3 through 7: z=6; only the prefix replicated on a majority is committed)
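The leader can locate that committed prefix by sorting the replication progress (matchIndex) of all servers: the median-of-majority value is replicated on a majority by construction. A sketch of the rule, not hashicorp/raft's code:

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index replicated on a majority
// of servers. Raft additionally requires that the entry at this index
// carry the leader's current term before committing; omitted here.
func commitIndex(matchIndex []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// The value at position (n-1)/2 is <= the matchIndex of at least
	// a majority of the servers.
	return sorted[(len(sorted)-1)/2]
}

func main() {
	// Five servers whose logs reach these indexes: index 7 is on three
	// of five servers, so entries up through 7 are committed.
	fmt.Println(commitIndex([]uint64{7, 5, 7, 4, 7})) // 7
}
```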
  29. Eventually consistent: cluster membership, failure detection. Strongly consistent: service discovery, service monitoring, K/V store.
  30. Consul Basics
     - Strongly consistent (via Raft)
     - Multiple gossip pools to span datacenters
     - Key/Value store
     - Service discovery & service-level health checks
     - Centralized API and UI
  31. Consul Vocabulary
     - Agent: a process that runs on every machine within a Consul cluster; it can be either a server or a client.
     - Server: typically a standalone instance that is involved in the Raft quorum and maintains state. Can communicate across datacenters.
     - Client: an agent that monitors an application, is not a part of the Raft quorum, and does not hold state. Cannot communicate across datacenters.
  32. Service Configuration: Key/Value Store
     - The K/V store is strongly consistent
     - Implemented by a “simple in-memory database” based on radix trees - hashicorp/go-memdb
     - Stored on Consul servers, but accessible through any agent, client or server (see the example below)
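Reading and writing the store goes through the real hashicorp/consul/api client; the key and value here are placeholders. Writes are forwarded to the Raft leader, which is what keeps the store strongly consistent:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connects to the local agent (default http://127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	kv := client.KV()

	// Writes go through the Raft leader before being acknowledged.
	pair := &api.KVPair{Key: "service/web/max_conns", Value: []byte("512")}
	if _, err := kv.Put(pair, nil); err != nil {
		log.Fatal(err)
	}

	got, _, err := kv.Get("service/web/max_conns", nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s = %s\n", got.Key, got.Value)
}
```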
  33. Network Coordinates: Implemented
     - Prepared queries are rules or guidelines for Consul to follow
     - Using the network coordinates, we can provide failover for services based on geo-location
     - https://learn.hashicorp.com/consul/developer-discovery/geo-failover
  34. Service Discovery & Service Monitoring
     - Service information is stored in the same in-memory database as the K/Vs, but in different tables!
     - Health checks are configurable per service definition: HTTP, TCP, TTL, Docker, Script (build your own!) - see the example below
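Registering a service with an HTTP check through the real hashicorp/consul/api client; the service name, port, and endpoint are placeholders. The local agent runs the check on the given interval and reports status changes:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register a service plus a per-service HTTP health check.
	reg := &api.AgentServiceRegistration{
		Name: "web",
		Port: 8080,
		Check: &api.AgentServiceCheck{
			HTTP:     "http://127.0.0.1:8080/health",
			Interval: "10s",
			Timeout:  "1s",
		},
	}
	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
}
```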
  35. (Closing diagram) Strongly consistent, centralized: service discovery, service monitoring, service configuration. Decentralized: group membership, cluster membership, failure detection, network coordinates.