
Building a Distributed Message Log from Scratch


Apache Kafka has shown that the log is a powerful abstraction for data-intensive applications. It can play a key role in managing data and distributing it across the enterprise efficiently. Vital to any data plane is not just performance, but availability and scalability. In this session, we examine what a distributed log is, how it works, and how it can achieve these goals. Specifically, we'll discuss lessons learned while building NATS Streaming, a reliable messaging layer built on NATS that provides similar semantics. We'll cover core components like leader election, data replication, log persistence, and message delivery. Come learn about distributed systems!

Tyler Treat

November 04, 2017


Transcript

  1. Building a Distributed Message Log from Scratch
    Tyler Treat · Iowa Code Camp · 11/04/17

  2. - Messaging Nerd @ Apcera

    - Working on nats.io

    - Distributed systems

    - bravenewgeek.com
    Tyler Treat


  3. Outline
    - The Log
    -> What?
    -> Why?
    - Implementation
    -> Storage mechanics
    -> Data-replication techniques
    -> Scaling message delivery
    -> Trade-offs and lessons learned

  4. The Log
    A totally-ordered, append-only data structure.
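
A minimal in-memory sketch of that abstraction in Go (illustrative only; the `Log` type and its methods are made up for this note, not taken from the talk): every append gets the next offset, nothing is ever modified in place, and reads can start from any position.

```go
package main

import "fmt"

// Log is a totally-ordered, append-only sequence of records.
// Existing entries are never modified or reordered.
type Log struct {
	records [][]byte
}

// Append adds a record to the end of the log and returns its offset.
func (l *Log) Append(record []byte) uint64 {
	l.records = append(l.records, record)
	return uint64(len(l.records) - 1)
}

// ReadFrom returns all records at or after the given offset, allowing
// playback from any arbitrary position.
func (l *Log) ReadFrom(offset uint64) [][]byte {
	if offset >= uint64(len(l.records)) {
		return nil
	}
	return l.records[offset:]
}

func main() {
	var l Log
	l.Append([]byte("record 0"))
	l.Append([]byte("record 1"))
	off := l.Append([]byte("record 2"))
	fmt.Println("newest offset:", off)                // 2
	fmt.Println("records from offset 1:", len(l.ReadFrom(1))) // 2
}
```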

  5-10. The Log (records appended one at a time)
    0 1 2 3 4 5
    oldest record … newest record

  11. Logs record what happened and when.

  12. caches
    databases
    indexes
    writes


  13. Examples in the wild:
    -> Apache Kafka

    -> Amazon Kinesis
    -> NATS Streaming

    -> Tank


  14. Key Goals:
    -> Performance
    -> High Availability
    -> Scalability


  15. The purpose of this talk is to learn…

    -> a bit about the internals of a log abstraction.
    -> how it can achieve these goals.
    -> some applied distributed systems theory.


  16. You will probably never need to
    build something like this yourself,
    but it helps to know how it works.


  17. Implementation

  18. Implementation
    Don’t try this at home.

  19. Some first principles…
    Storage Mechanics
    • The log is an ordered, immutable sequence of messages
    • Messages are atomic (meaning they can’t be broken up)
    • The log has a notion of message retention based on some policies
    (time, number of messages, bytes, etc.)
    • The log can be played back from any arbitrary position
    • The log is stored on disk
    • Sequential disk access is fast*
    • OS page cache means sequential access often avoids disk


  20. http://queue.acm.org/detail.cfm?id=1563874


  21. iostat
    avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
              13.53   0.00    11.28     0.00    0.00  75.19
    Device:   tps    Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
    xvda      0.00   0.00        0.00        0         0

  22-28. Storage Mechanics
    log file: 0 1 2 3 4 5

  29. Storage Mechanics
    log segment 0 file: 0 1 2
    log segment 3 file: 3 4 5

  30. Storage Mechanics
    log segment 0 file: 0 1 2    index segment 0 file: 0 1 2
    log segment 3 file: 3 4 5    index segment 3 file: 0 1 2
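
A rough in-memory sketch of that segment/index layout (made-up types; not the Kafka or NATS Streaming storage code): the log rolls to a new segment once the active one passes a size limit, and each segment carries an index mapping relative offsets to byte positions, so retention can drop whole old segments and reads can seek without scanning.

```go
package main

import "fmt"

// indexEntry maps a relative offset within a segment to the byte position
// of that record in the segment's log file.
type indexEntry struct {
	relOffset uint32
	position  uint32
}

// segment models one log-segment file plus its index file.
type segment struct {
	baseOffset uint64       // offset of the first record in this segment
	data       []byte       // contents of the segment file
	index      []indexEntry // contents of the index file
}

// log is a sequence of segments; a new segment is started when the active
// one would exceed maxSegmentBytes.
type log struct {
	segments        []*segment
	nextOffset      uint64
	maxSegmentBytes int
}

func (l *log) active() *segment { return l.segments[len(l.segments)-1] }

func (l *log) append(record []byte) uint64 {
	if len(l.segments) == 0 || len(l.active().data)+len(record) > l.maxSegmentBytes {
		l.segments = append(l.segments, &segment{baseOffset: l.nextOffset})
	}
	s := l.active()
	s.index = append(s.index, indexEntry{
		relOffset: uint32(l.nextOffset - s.baseOffset),
		position:  uint32(len(s.data)),
	})
	s.data = append(s.data, record...)
	off := l.nextOffset
	l.nextOffset++
	return off
}

func main() {
	l := &log{maxSegmentBytes: 32}
	for i := 0; i < 6; i++ {
		l.append([]byte(fmt.Sprintf("record %d....", i)))
	}
	for _, s := range l.segments {
		fmt.Printf("segment %d: %d records, %d bytes\n",
			s.baseOffset, len(s.index), len(s.data))
	}
}
```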

  31. Zero-copy Reads
    Traditional read/send path (crossing user space and kernel space):
    disk → page cache → application (read) → socket (send) → NIC

  32. Zero-copy Reads
    sendfile path (stays in kernel space):
    disk → page cache → NIC
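
Go makes this fast path easy to hit: when the source is an *os.File and the destination is a *net.TCPConn, io.Copy uses sendfile(2) on platforms that support it, so segment bytes move from the page cache to the NIC without a trip through user space. A small sketch (the address and segment file name are made up):

```go
package main

import (
	"io"
	"log"
	"net"
	"os"
)

// serveSegment streams a log segment file to a consumer's TCP connection.
// With an *os.File source and a *net.TCPConn destination, io.Copy takes the
// sendfile(2) fast path on Linux, avoiding copies through user space.
func serveSegment(conn net.Conn, path string) (int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	return io.Copy(conn, f)
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:9000")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			defer c.Close()
			if _, err := serveSegment(c, "00000000.log"); err != nil {
				log.Println("send:", err)
			}
		}(conn)
	}
}
```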

  33. Left as an exercise for the listener…

    -> Batching

    -> Compression


  34-36. caches
    databases
    indexes
    writes

  37. How do we achieve high availability
    and fault tolerance?


  38. Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  39. Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  40. caches
    databases
    indexes
    writes


  41. Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  42. Data-Replication Techniques
    1. Gossip/multicast protocols
    Epidemic broadcast trees, bimodal multicast, SWIM, HyParView, NeEM

    2. Consensus protocols
    2PC/3PC, Paxos, Raft, Zab, chain replication


  43. Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  44. Data-Replication Techniques
    1. Gossip/multicast protocols
    Epidemic broadcast trees, bimodal multicast, SWIM, HyParView, NeEM

    2. Consensus protocols
    2PC/3PC, Paxos, Raft, Zab, chain replication


  45. Consensus-Based Replication
    1. Designate a leader
    2. Replicate by either:

    a) waiting for all replicas

    —or—
    b) waiting for a quorum of replicas


  46. Consensus-Based Replication
                    Pros                                       Cons
    All replicas    Tolerates f failures with f+1 replicas     Latency pegged to slowest replica
    Quorum          Hides delay from a slow replica            Tolerates f failures with 2f+1 replicas

  47. Replication in Kafka
    1. Select a leader
    2. Maintain in-sync replica set (ISR) (initially every replica)
    3. Leader writes messages to write-ahead log (WAL)
    4. Leader commits messages when all replicas in ISR ack
    5. Leader maintains high-water mark (HW) of last committed message
    6. Piggyback HW on replica fetch responses which replicas periodically checkpoint to disk
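
A toy model of steps 4-5 in Go (the types are mine, not Kafka's code): the leader tracks the highest offset each in-sync replica has acknowledged, and the high-water mark is the minimum of those, i.e. the last offset known to be on every ISR member.

```go
package main

import "fmt"

// leader tracks, per in-sync replica, the highest log offset that replica has
// acknowledged. The high-water mark (HW) is the highest offset present on
// every ISR member; messages at or below the HW are committed.
type leader struct {
	isr    map[string]uint64 // replica ID -> highest acked offset
	logEnd uint64            // leader's own log end offset
	hw     uint64            // high-water mark
}

// append adds a message to the leader's log and returns its offset.
func (l *leader) append() uint64 {
	l.logEnd++
	l.isr["leader"] = l.logEnd
	return l.logEnd
}

// ack records a follower's fetch position and recomputes the HW as the
// minimum acknowledged offset across the ISR.
func (l *leader) ack(replica string, offset uint64) {
	l.isr[replica] = offset
	min := l.logEnd
	for _, o := range l.isr {
		if o < min {
			min = o
		}
	}
	l.hw = min
}

func main() {
	l := &leader{isr: map[string]uint64{"leader": 0, "b2": 0, "b3": 0}}
	l.append() // offset 1
	l.append() // offset 2
	l.ack("b2", 2)
	l.ack("b3", 1)
	fmt.Println("HW:", l.hw) // 1: offset 2 is not yet on b3, so not committed
}
```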

  48. Replication in Kafka
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b2 (follower): 0 1 2 3 4     HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2, b3}

  49. Failure Modes
    1. Leader fails


  50-52. Leader fails
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b2 (follower): 0 1 2 3 4     HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2, b3}

  53. Leader fails
    writes → b2 (leader):   0 1 2 3   HW: 3
             b3 (follower): 0 1 2 3   HW: 3
    ISR: {b2, b3}

  54. Failure Modes
    1. Leader fails

    2. Follower fails


  55-56. Follower fails
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b2 (follower): 0 1 2 3 4     HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2, b3}

  57. Follower fails (replica.lag.time.max.ms)
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b2 (follower): 0 1 2 3 4     HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2, b3}

  58. Follower fails (replica.lag.time.max.ms)
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b3}

  59. Failure Modes
    1. Leader fails

    2. Follower fails

    3. Follower temporarily partitioned


  60-61. Follower temporarily partitioned
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b2 (follower): 0 1 2 3 4     HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2, b3}

  62. Follower temporarily partitioned (replica.lag.time.max.ms)
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b2 (follower): 0 1 2 3 4     HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2, b3}

  63. Follower temporarily partitioned (replica.lag.time.max.ms)
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b2 (follower): 0 1 2 3 4     HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2}

  64-65. Follower temporarily partitioned
    writes → b1 (leader):   0 1 2 3 4 5   HW: 5
             b2 (follower): 0 1 2 3 4 5   HW: 5
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2}

  66. Follower temporarily partitioned
    writes → b1 (leader):   0 1 2 3 4 5   HW: 5
             b2 (follower): 0 1 2 3 4 5   HW: 5
             b3 (follower): 0 1 2 3 4     HW: 4
    ISR: {b1, b2}

  67. Follower temporarily partitioned
    writes → b1 (leader):   0 1 2 3 4 5   HW: 5
             b2 (follower): 0 1 2 3 4 5   HW: 5
             b3 (follower): 0 1 2 3 4 5   HW: 5
    ISR: {b1, b2}

  68. Follower temporarily partitioned
    writes → b1 (leader):   0 1 2 3 4 5   HW: 5
             b2 (follower): 0 1 2 3 4 5   HW: 5
             b3 (follower): 0 1 2 3 4 5   HW: 5
    ISR: {b1, b2, b3}

  69. Replication in NATS Streaming
    1. Metadata Raft group replicates client state
    2. Separate Raft group per topic replicates messages and subscriptions
    3. Conceptually, two logs: Raft log and message log

  70. http://thesecretlivesofdata.com/raft


  71. Challenges
    1. Scaling Raft


  72. Scaling Raft
    With a single topic, one node is elected leader and it heartbeats messages to followers.

  73. Scaling Raft
    As the number of topics increases unbounded, so does the number of Raft groups.

  74. Scaling Raft
    Technique 1: run a fixed number of Raft groups and use a consistent hash to map a topic to a group.
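
A sketch of technique 1 (a toy consistent-hash ring of my own, not the NATS Streaming implementation): each Raft group gets several virtual points on a ring, and a topic belongs to the first group point at or after the topic's hash, so topics spread evenly over a fixed set of groups.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring maps topics onto a fixed set of Raft groups with a consistent hash.
type ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> group name
}

func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newRing places vnodes virtual points per group on the ring.
func newRing(groups []string, vnodes int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, g := range groups {
		for v := 0; v < vnodes; v++ {
			p := hash(fmt.Sprintf("%s-%d", g, v))
			r.points = append(r.points, p)
			r.owner[p] = g
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// groupFor returns the Raft group responsible for a topic.
func (r *ring) groupFor(topic string) string {
	h := hash(topic)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	r := newRing([]string{"raft-group-0", "raft-group-1", "raft-group-2"}, 16)
	for _, topic := range []string{"purchases", "inventory", "clicks"} {
		fmt.Println(topic, "->", r.groupFor(topic))
	}
}
```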

  75. Scaling Raft
    Technique 2: run an entire node’s worth of topics as a single group using a layer on top of Raft.
    https://www.cockroachlabs.com/blog/scaling-raft

  76. Challenges
    1. Scaling Raft
    2. Dual writes


  77-85. Dual Writes
    Raft:  msg 1 | msg 2 | sub | msg 3 | add peer | msg 4
    Store: msg 1 | msg 2 | msg 3 | msg 4
    committed

  86. Dual Writes
    Raft:             msg 1  msg 2  sub  msg 3  add peer  msg 4
    physical offset:    0      1     2     3       4        5
    Store:            msg 1  msg 2  msg 3  msg 4
    logical offset:     0      1      2      3

  87. Dual Writes
    Raft:             msg 1  msg 2  sub  msg 3  add peer  msg 4
    physical offset:    0      1     2     3       4        5
    Index:            msg 1  msg 2  msg 3  msg 4
    logical offset:     0      1      2      3

  88. Treat the Raft log as our message write-ahead log.
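
One way to picture that idea (hypothetical types, not the actual NATS Streaming code): the state machine applies committed Raft entries, assigns logical offsets only to message entries, and records where each message lives in the Raft log, so no second message store needs to be written.

```go
package main

import "fmt"

// entryKind distinguishes message entries from other replicated operations
// (subscriptions, membership changes, ...) in the Raft log.
type entryKind int

const (
	kindMessage entryKind = iota
	kindSub
	kindAddPeer
)

type raftEntry struct {
	index uint64 // physical position in the Raft log
	kind  entryKind
	data  []byte
}

// messageIndex maps logical message offsets to physical Raft log indexes,
// so the Raft log itself serves as the message write-ahead log.
type messageIndex struct {
	byOffset []uint64 // logical offset -> Raft log index
}

// apply is invoked for each committed Raft entry; only message entries are
// assigned the next logical offset.
func (m *messageIndex) apply(e raftEntry) {
	if e.kind != kindMessage {
		return
	}
	m.byOffset = append(m.byOffset, e.index)
}

func main() {
	m := &messageIndex{}
	entries := []raftEntry{
		{0, kindMessage, []byte("msg 1")},
		{1, kindMessage, []byte("msg 2")},
		{2, kindSub, nil},
		{3, kindMessage, []byte("msg 3")},
		{4, kindAddPeer, nil},
		{5, kindMessage, []byte("msg 4")},
	}
	for _, e := range entries {
		m.apply(e)
	}
	// Logical offset 3 (msg 4) lives at physical Raft index 5.
	fmt.Println("logical 3 -> physical", m.byOffset[3])
}
```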

  89. Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  90. Performance
    1. Publisher acks
    -> broker acks on commit (slow but safe)
    -> broker acks on local log append (fast but unsafe)
    -> publisher doesn’t wait for ack (fast but unsafe)
    2. Don’t fsync, rely on replication for durability
    3. Keep disk access sequential and maximize zero-copy reads
    4. Batch aggressively
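
A sketch of the three publisher-ack options in Go (the `AckPolicy` type and the `broker` methods here are hypothetical, not a real client API): the trade-off is whether the publisher waits for the commit, for the local append, or not at all.

```go
package main

import (
	"errors"
	"fmt"
)

// AckPolicy controls when a publish is acknowledged.
type AckPolicy int

const (
	AckNone        AckPolicy = iota // don't wait for an ack: fast but unsafe
	AckLocalAppend                  // ack once appended to the leader's log: fast but unsafe
	AckCommit                       // ack once replicated/committed: slow but safe
)

// broker is a stand-in for the server side of a publish.
type broker struct{}

func (b *broker) appendLocal(msg []byte) error   { return nil } // append to the leader's WAL
func (b *broker) waitCommitted(msg []byte) error { return nil } // wait for the ISR/quorum

// publish sends a message and returns according to the chosen ack policy.
func publish(b *broker, msg []byte, policy AckPolicy) error {
	switch policy {
	case AckNone:
		go func() { _ = b.appendLocal(msg) }() // fire and forget
		return nil
	case AckLocalAppend:
		return b.appendLocal(msg)
	case AckCommit:
		if err := b.appendLocal(msg); err != nil {
			return err
		}
		return b.waitCommitted(msg)
	default:
		return errors.New("unknown ack policy")
	}
}

func main() {
	b := &broker{}
	fmt.Println(publish(b, []byte("order-42"), AckCommit)) // <nil>
}
```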

  91. Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  92. Durability
    1. Quorum guarantees durability
    -> Comes for free with Raft
    -> In Kafka, need to configure min.insync.replicas and acks, e.g. topic with
       replication factor 3, min.insync.replicas=2, and acks=all
    2. Disable unclean leader elections
    3. At odds with availability, i.e. no quorum == no reads/writes

  93. Scaling Message Delivery
    1. Partitioning


  94. Partitioning is how we scale linearly.


  95. caches
    databases
    indexes
    writes


  96. HELLA WRITES
    caches
    databases
    indexes


  97. caches
    databases
    indexes
    HELLA WRITES


  98. writes → Topic: purchases → caches, databases, indexes
      writes → Topic: inventory → caches, databases, indexes

  99. writes → Topic: purchases (partitions: Accounts A-M, Accounts N-Z) → caches, databases, indexes
      writes → Topic: inventory (partitions: SKUs A-M, SKUs N-Z) → caches, databases, indexes
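
A sketch of the key-based partitioning shown above (illustrative only): a stable hash of the account or SKU chooses the partition, so all writes for a given key land on the same partition and keep their relative order, while total throughput scales with the number of partitions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a message key (e.g. an account ID or SKU) to one of
// numPartitions partitions for a topic. All writes for the same key go to
// the same partition, preserving per-key ordering.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numPartitions))
}

func main() {
	for _, account := range []string{"alice", "bob", "nancy", "zoe"} {
		fmt.Printf("purchases/%s -> partition %d\n",
			account, partitionFor(account, 2))
	}
}
```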

  100. Scaling Message Delivery
    1. Partitioning
    2. High fan-out


  101. High Fan-out
    1. Observation: with an immutable log, there are no stale/phantom reads
    2. This should make it “easy” (in theory) to scale to a large number of
       consumers (e.g. hundreds of thousands of IoT/edge devices)
    3. With Raft, we can use “non-voters” to act as read replicas and load
       balance consumers

  102. Scaling Message Delivery
    1. Partitioning
    2. High fan-out
    3. Push vs. pull


  103. Push vs. Pull
    • In Kafka, consumers pull data from brokers
    • In NATS Streaming, brokers push data to consumers
    • Pros/cons to both:
    -> With push we need flow control; implicit in pull
    -> Need to make decisions about optimizing for latency vs. throughput
    -> Thick vs. thin client and API ergonomics
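
A sketch of one way to do the flow control that push requires (a hypothetical credit scheme, not the NATS Streaming protocol): the consumer grants the broker a window of credits, and the broker buffers messages once the credits run out until more are granted.

```go
package main

import "fmt"

// pusher implements simple credit-based flow control for a push consumer:
// the broker may only send while it holds credits granted by the consumer.
type pusher struct {
	credits int
	pending [][]byte // messages waiting for credit
}

// grant is called when the consumer extends its receive window; it returns
// any buffered messages that can now be sent.
func (p *pusher) grant(n int) [][]byte {
	p.credits += n
	return p.flush()
}

// push queues a message and returns whatever the current window allows.
func (p *pusher) push(msg []byte) [][]byte {
	p.pending = append(p.pending, msg)
	return p.flush()
}

// flush returns the messages that may be sent right now.
func (p *pusher) flush() [][]byte {
	n := p.credits
	if n > len(p.pending) {
		n = len(p.pending)
	}
	out := p.pending[:n]
	p.pending = p.pending[n:]
	p.credits -= n
	return out
}

func main() {
	p := &pusher{credits: 1}
	fmt.Println(len(p.push([]byte("m1")))) // 1: sent immediately
	fmt.Println(len(p.push([]byte("m2")))) // 0: out of credits, buffered
	fmt.Println(len(p.grant(1)))           // 1: m2 released
}
```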

  104. Scaling Message Delivery
    1. Partitioning
    2. High fan-out
    3. Push vs. pull
    4. Bookkeeping


  105. Bookkeeping
    • Two ways to track position in the log:
    -> Have the server track it for consumers
    -> Have consumers track it
    • Trade-off between API simplicity and performance/server complexity
    • Also, consumers might not have stable storage (e.g. IoT device, ephemeral container, etc.)
    • Can we split the difference?

  106. Offset Storage
    • Can store offsets themselves in the log (in Kafka, originally had to store
      them in ZooKeeper)
    • Clients periodically checkpoint offset to log
    • Use log compaction to retain only latest offsets
    • On recovery, fetch latest offset from log

  107-108. Offset Storage
    Offsets log:
      0: bob-foo-0   → 11
      1: alice-foo-0 → 15
      2: bob-foo-1   → 20
      3: bob-foo-0   → 18
      4: bob-foo-0   → 21

  109. Offset Storage (after compaction)
      1: alice-foo-0 → 15
      2: bob-foo-1   → 20
      4: bob-foo-0   → 21
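
A sketch of that compaction step using the example data from the slides above (simplified, not Kafka's __consumer_offsets implementation): only the newest checkpoint per consumer/topic/partition key survives, which is all recovery needs.

```go
package main

import "fmt"

// offsetEntry is one checkpoint record written to the offsets log.
type offsetEntry struct {
	key    string // e.g. "bob-foo-0" = consumer bob, topic foo, partition 0
	offset uint64 // last processed offset for that key
}

// compact keeps only the latest entry per key, preserving the log order of
// the survivors. On recovery, a consumer reads the compacted log and takes
// the value stored under its key.
func compact(entries []offsetEntry) []offsetEntry {
	last := make(map[string]int) // key -> index of the newest entry
	for i, e := range entries {
		last[e.key] = i
	}
	var out []offsetEntry
	for i, e := range entries {
		if last[e.key] == i {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	offsets := []offsetEntry{
		{"bob-foo-0", 11},
		{"alice-foo-0", 15},
		{"bob-foo-1", 20},
		{"bob-foo-0", 18},
		{"bob-foo-0", 21},
	}
	for _, e := range compact(offsets) {
		fmt.Printf("%s -> %d\n", e.key, e.offset)
	}
	// Output: alice-foo-0 -> 15, bob-foo-1 -> 20, bob-foo-0 -> 21
}
```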

  110. Offset Storage
    Advantages:
    -> Fault-tolerant
    -> Consistent reads
    -> High write throughput (unlike ZooKeeper)
    -> Reuses existing structures, so less server complexity

  111. Trade-offs and Lessons Learned
    1. Competing goals


  112. Competing Goals
    1. Performance
    -> Easy to make something fast that’s not fault-tolerant or scalable
    -> Simplicity of mechanism makes this easier
    -> Simplicity of “UX” makes this harder
    2. Scalability (and fault-tolerance)
    -> Scalability and FT are at odds with simplicity
    -> Cannot be an afterthought; needs to be designed from day 1
    3. Simplicity (“UX”)
    -> Simplicity of mechanism shifts complexity elsewhere (e.g. client)
    -> Easy to let server handle complexity; hard when that needs to be
       distributed and consistent while still being fast

  113. Trade-offs and Lessons Learned
    1. Competing goals
    2. Availability vs. Consistency


  114. Availability vs. Consistency
    • CAP theorem
    • Consistency requires quorum, which hinders availability and performance
    • Minimize what you need to replicate

  115. Trade-offs and Lessons Learned
    1. Competing goals
    2. Availability vs. Consistency
    3. Aim for simplicity


  116. Distributed systems are complex enough.

    Simple is usually better (and faster).


  117. Trade-offs and Lessons Learned
    1. Competing goals
    2. Availability vs. Consistency
    3. Aim for simplicity
    4. Lean on existing work


  118. Don’t roll your own coordination protocol; use Raft, ZooKeeper, etc.

  119. Trade-offs and Lessons Learned
    1. Competing goals
    2. Availability vs. Consistency
    3. Aim for simplicity
    4. Lean on existing work
    5. There are probably edge cases for which you
    haven’t written tests


  120. There are many failure modes, and you can only write so many tests.
    Formal methods and property-based/generative testing can help.

  121. Trade-offs and Lessons Learned
    1. Competing goals
    2. Availability vs. Consistency
    3. Aim for simplicity
    4. Lean on existing work
    5. There are probably edge cases for which you
    haven’t written tests
    6. Be honest with your users


  122. Don’t try to be everything to everyone. Be explicit about design decisions,
    trade-offs, guarantees, defaults, etc.

  123. Thanks!
    @tyler_treat

    bravenewgeek.com
