The Log
A totally-ordered, append-only data structure.
Slides 7-14
[Diagram: the log shown growing by appending records 0 through 5; the oldest record sits at the head of the log and the newest record at the tail.]
Slide 15
Logs record what happened and when.
Slide 16
[Diagram: writes flow into the log, which then feeds downstream databases, indexes, and caches.]
Slide 18
Examples in the wild:
-> Apache Kafka
-> Amazon Kinesis
-> NATS Streaming
-> Tank
Slide 19
Key Goals:
-> Performance
-> High Availability
-> Scalability
Slide 20
The purpose of this talk is to learn…
-> a bit about the internals of a log abstraction.
-> how it can achieve these goals.
-> some applied distributed systems theory.
Slide 21
You will probably never need to
build something like this yourself,
but it helps to know how it works.
Slide 22
Implementation
Slide 23
Implementation
Don’t try this at home.
Slide 24
Some first principles…
Storage Mechanics
• The log is an ordered, immutable sequence of messages
• Messages are atomic (meaning they can’t be broken up)
• The log has a notion of message retention based on some policies
(time, number of messages, bytes, etc.)
• The log can be played back from any arbitrary position
• The log is stored on disk
• Sequential disk access is fast*
• OS page cache means sequential access often avoids disk
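To make these storage mechanics concrete, here is a minimal Go sketch (not any real broker's code) of an append-only log with offset-based playback and a count-based retention policy. All type and field names are illustrative, and a real implementation would write segment files to disk rather than hold records in memory.

```go
package main

import (
	"errors"
	"fmt"
)

// Record is one immutable, atomic entry in the log.
type Record struct {
	Offset  uint64
	Payload []byte
}

// Log is a totally-ordered, append-only sequence of records.
// A real implementation would back this with segment files on disk;
// this sketch keeps everything in memory for clarity.
type Log struct {
	records    []Record
	baseOffset uint64 // offset of records[0]; advances as old records are retired
	nextOffset uint64
	maxRecords int // retention policy: keep at most this many records
}

var ErrOutOfRange = errors.New("offset out of range")

// Append adds a record at the tail and returns its offset.
func (l *Log) Append(payload []byte) uint64 {
	off := l.nextOffset
	l.records = append(l.records, Record{Offset: off, Payload: payload})
	l.nextOffset++
	// Enforce retention: drop the oldest records once the limit is exceeded.
	if l.maxRecords > 0 && len(l.records) > l.maxRecords {
		drop := len(l.records) - l.maxRecords
		l.records = l.records[drop:]
		l.baseOffset += uint64(drop)
	}
	return off
}

// Read plays the log back from any retained position.
func (l *Log) Read(offset uint64, max int) ([]Record, error) {
	if offset < l.baseOffset || offset >= l.nextOffset {
		return nil, ErrOutOfRange
	}
	start := int(offset - l.baseOffset)
	end := start + max
	if end > len(l.records) {
		end = len(l.records)
	}
	return l.records[start:end], nil
}

func main() {
	log := &Log{maxRecords: 3}
	for _, msg := range []string{"a", "b", "c", "d"} {
		log.Append([]byte(msg))
	}
	// "a" has been retired by retention; playback starts at offset 1.
	recs, _ := log.Read(1, 10)
	for _, r := range recs {
		fmt.Println(r.Offset, string(r.Payload))
	}
}
```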
Zero-copy Reads
[Diagram: the traditional path copies data from disk into the kernel page cache, up into the application’s user-space buffer via read, then back into a kernel socket buffer via send before it reaches the NIC.]
Slide 37
Zero-copy Reads
[Diagram: with sendfile, data moves from the page cache to the NIC entirely within kernel space, skipping the copies through user space.]
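As a rough illustration of the zero-copy path (not Kafka's or NATS Streaming's actual code): in Go, streaming a segment file to a TCP connection with io.Copy lets the runtime use sendfile on platforms that support it, so bytes go from the page cache to the NIC without detouring through user space. The listen address and file name below are placeholders.

```go
package main

import (
	"io"
	"log"
	"net"
	"os"
)

// serveSegment streams a log segment file to a consumer connection.
// io.Copy recognizes the *os.File -> *net.TCPConn pairing and, on
// platforms that support it (e.g. Linux), issues sendfile so the bytes
// flow page cache -> NIC without being copied into user space.
func serveSegment(conn net.Conn, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(conn, f) // zero-copy fast path when available
	return err
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:9000") // illustrative address
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			defer c.Close()
			// "segment.log" is a placeholder path for a segment on disk.
			if err := serveSegment(c, "segment.log"); err != nil {
				log.Println(err)
			}
		}(conn)
	}
}
```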
Slide 38
Left as an exercise for the listener…
-> Batching
-> Compression
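A hedged sketch of the batching idea, since the deck leaves it as an exercise: buffer incoming messages and flush when the batch is full or a timer fires, trading a little latency for throughput; compression would then apply per batch rather than per message. The batch size, delay, and function names are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// batcher accumulates messages and flushes them as a single batch
// when either maxBatch messages are buffered or maxDelay has elapsed.
func batcher(in <-chan []byte, flush func(batch [][]byte), maxBatch int, maxDelay time.Duration) {
	var batch [][]byte
	timer := time.NewTimer(maxDelay)
	defer timer.Stop()

	flushNow := func() {
		if len(batch) > 0 {
			flush(batch)
			batch = nil
		}
		if !timer.Stop() {
			select { // drain a pending fire so Reset starts a clean timer
			case <-timer.C:
			default:
			}
		}
		timer.Reset(maxDelay)
	}

	for {
		select {
		case msg, ok := <-in:
			if !ok {
				flushNow() // drain remaining messages on shutdown
				return
			}
			batch = append(batch, msg)
			if len(batch) >= maxBatch {
				flushNow()
			}
		case <-timer.C:
			flushNow()
		}
	}
}

func main() {
	in := make(chan []byte)
	done := make(chan struct{})
	go func() {
		batcher(in, func(b [][]byte) {
			fmt.Printf("flushing batch of %d messages\n", len(b))
		}, 3, 50*time.Millisecond)
		close(done)
	}()
	for i := 0; i < 7; i++ {
		in <- []byte(fmt.Sprintf("msg-%d", i))
	}
	close(in)
	<-done
}
```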
Slides 39-41
[Diagram, repeated from earlier: writes flow into the log, which feeds downstream databases, indexes, and caches.]
Slide 42
How do we achieve high availability
and fault tolerance?
Slide 43
Questions:
-> How do we ensure continuity of reads/writes?
-> How do we replicate data?
-> How do we ensure replicas are consistent?
-> How do we keep things fast?
-> How do we ensure data is durable?
Consensus-Based Replication
1. Designate a leader
2. Replicate by either:
a) waiting for all replicas
—or—
b) waiting for a quorum of replicas
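A minimal sketch of the two commit rules, assuming ackCount includes the leader's own write and n is the total number of replicas (names are illustrative):

```go
package main

import "fmt"

// committed reports whether a write can be acknowledged to the publisher.
// ackCount counts replicas (including the leader) that have persisted the
// write; n is the total number of replicas in the group.
func committed(ackCount, n int, waitForAll bool) bool {
	if waitForAll {
		return ackCount == n // option (a): wait for all replicas
	}
	return ackCount >= n/2+1 // option (b): wait for a majority quorum
}

func main() {
	// With 3 replicas, a quorum commit needs 2 acks; waiting for all needs 3.
	fmt.Println(committed(2, 3, false)) // true
	fmt.Println(committed(2, 3, true))  // false
}
```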
Slide 51
Consensus-Based Replication
All Replicas:
-> Pro: tolerates f failures with f+1 replicas
-> Con: latency pegged to the slowest replica
Quorum:
-> Pro: hides delay from a slow replica
-> Con: tolerates f failures with 2f+1 replicas
Slide 52
Replication in Kafka
1. Select a leader
2. Maintain in-sync replica set (ISR) (initially every replica)
3. Leader writes messages to write-ahead log (WAL)
4. Leader commits messages when all replicas in ISR ack
5. Leader maintains high-water mark (HW) of last committed message (see the sketch below)
6. Piggyback HW on replica fetch responses, which replicas periodically checkpoint to disk
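A toy model of steps 4 and 5 (a simplification, not Kafka's actual implementation): the leader advances the high-water mark to the lowest offset acknowledged by every replica in the ISR, and only offsets below the HW count as committed. Type and field names are illustrative.

```go
package main

import "fmt"

// leaderState is a toy model of a partition leader tracking its ISR.
type leaderState struct {
	logEndOffset int64            // next offset the leader will write
	highWater    int64            // everything below this is committed
	isrAcked     map[string]int64 // per ISR replica: highest offset it has fetched/acked
}

// onReplicaFetch records how far a follower in the ISR has caught up and
// then recomputes the high-water mark as the minimum acked offset across
// the ISR (the leader counts as having everything up to logEndOffset).
func (s *leaderState) onReplicaFetch(replica string, ackedUpTo int64) {
	s.isrAcked[replica] = ackedUpTo
	hw := s.logEndOffset
	for _, acked := range s.isrAcked {
		if acked < hw {
			hw = acked
		}
	}
	s.highWater = hw // piggybacked back to replicas on their next fetch
}

func main() {
	s := &leaderState{
		logEndOffset: 10,
		isrAcked:     map[string]int64{"r2": 0, "r3": 0},
	}
	s.onReplicaFetch("r2", 10)
	s.onReplicaFetch("r3", 7)
	fmt.Println("high-water mark:", s.highWater) // 7: offsets 0-6 are committed
}
```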
Replication in NATS Streaming
1. Metadata Raft group replicates client state
2. Separate Raft group per topic replicates messages
and subscriptions
3. Conceptually, two logs: Raft log and message log
Slide 75
http://thesecretlivesofdata.com/raft
Slide 76
Challenges
1. Scaling Raft
Slide 77
Scaling Raft
With a single topic, one node is elected leader and it
heartbeats messages to followers
Slide 78
Scaling Raft
As the number of topics grows unbounded, so does the number of Raft groups.
Slide 79
Scaling Raft
Technique 1: run a fixed number of Raft groups and use
a consistent hash to map a topic to a group.
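A sketch of technique 1 under illustrative names: hash topics onto a small, fixed ring of Raft groups so the number of groups stays bounded no matter how many topics exist. A production ring would add virtual nodes to smooth the distribution; this keeps only the core idea.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a minimal consistent-hash ring mapping topics onto a fixed,
// bounded set of Raft groups (illustrative, not any system's real code).
type ring struct {
	hashes []uint32          // sorted hash points on the ring
	groups map[uint32]string // hash point -> Raft group ID
}

func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(groupIDs []string) *ring {
	r := &ring{groups: make(map[uint32]string)}
	for _, g := range groupIDs {
		h := hash(g)
		r.hashes = append(r.hashes, h)
		r.groups[h] = g
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// groupFor returns the Raft group responsible for a topic: the first
// group hash point at or after the topic's hash, wrapping around.
func (r *ring) groupFor(topic string) string {
	h := hash(topic)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0
	}
	return r.groups[r.hashes[i]]
}

func main() {
	r := newRing([]string{"raft-group-0", "raft-group-1", "raft-group-2"})
	for _, topic := range []string{"orders", "payments", "telemetry"} {
		fmt.Println(topic, "->", r.groupFor(topic))
	}
}
```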
Slide 80
Scaling Raft
Technique 2: run an entire node’s worth of topics as a
single group using a layer on top of Raft.
https://www.cockroachlabs.com/blog/scaling-raft
Treat the Raft log as our message
write-ahead log.
Slide 94
Questions:
-> How do we ensure continuity of reads/writes?
-> How do we replicate data?
-> How do we ensure replicas are consistent?
-> How do we keep things fast?
-> How do we ensure data is durable?
Slide 95
Performance
1. Publisher acks
-> broker acks on commit (slow but safe)
-> broker acks on local log append (fast but unsafe)
-> publisher doesn’t wait for ack (fast but unsafe)
2. Don’t fsync, rely on replication for durability
3. Keep disk access sequential and maximize zero-copy reads
4. Batch aggressively
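A sketch of the three publisher-ack options as a broker-side decision; the broker type and its methods here are stubs, not any system's real API.

```go
package main

import "fmt"

// AckPolicy captures when (if ever) the broker replies to a publish.
type AckPolicy int

const (
	AckNone   AckPolicy = iota // publisher doesn't wait (fast but unsafe)
	AckLeader                  // ack on local log append (fast but unsafe)
	AckCommit                  // ack on commit (slow but safe)
)

// broker is a toy stand-in for a partition leader; the methods below are
// stubs that just print what a real broker would do.
type broker struct{ nextOffset int64 }

func (b *broker) appendLocal(msg []byte) int64 { // sequential write, no fsync
	off := b.nextOffset
	b.nextOffset++
	return off
}
func (b *broker) waitForCommit(off int64) { fmt.Println("  waiting for ISR/quorum commit of", off) }
func (b *broker) sendAck(off int64)       { fmt.Println("  ack for offset", off) }

func (b *broker) handlePublish(msg []byte, p AckPolicy) {
	off := b.appendLocal(msg)
	switch p {
	case AckNone:
		// publisher isn't waiting; nothing to send
	case AckLeader:
		b.sendAck(off) // durable only on this node so far
	case AckCommit:
		b.waitForCommit(off)
		b.sendAck(off)
	}
}

func main() {
	b := &broker{}
	for _, p := range []AckPolicy{AckNone, AckLeader, AckCommit} {
		fmt.Println("policy", p)
		b.handlePublish([]byte("hello"), p)
	}
}
```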
Slide 96
Questions:
-> How do we ensure continuity of reads/writes?
-> How do we replicate data?
-> How do we ensure replicas are consistent?
-> How do we keep things fast?
-> How do we ensure data is durable?
Slide 97
Durability
1. Quorum guarantees durability
-> Comes for free with Raft
-> In Kafka, need to configure min.insync.replicas and acks, e.g.
topic with replication factor 3, min.insync.replicas=2, and
acks=all
2. Disable unclean leader elections
3. At odds with availability,
i.e. no quorum == no reads/writes
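A toy check (not Kafka's code) of how the min.insync.replicas setting above trades availability for durability with acks=all: if the in-sync replica set shrinks below the configured minimum, the leader rejects writes rather than accepting under-replicated data.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotEnoughReplicas = errors.New("not enough in-sync replicas")

// acceptDurableWrite models a leader handling a publish with acks=all:
// the write is only accepted if the in-sync replica set is at least
// minInsyncReplicas; otherwise availability is sacrificed for durability.
func acceptDurableWrite(isrSize, minInsyncReplicas int) error {
	if isrSize < minInsyncReplicas {
		return errNotEnoughReplicas
	}
	return nil
}

func main() {
	// Replication factor 3 with min.insync.replicas=2, as in the slide:
	fmt.Println(acceptDurableWrite(3, 2)) // <nil>: all replicas in sync
	fmt.Println(acceptDurableWrite(2, 2)) // <nil>: tolerates one failure
	fmt.Println(acceptDurableWrite(1, 2)) // error: refuse writes rather than risk data loss
}
```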
Scaling Message Delivery
1. Partitioning
2. High fan-out
Slide 106
High Fan-out
1. Observation: with an immutable log, there are no
stale/phantom reads
2. This should make it “easy” (in theory) to scale to a
large number of consumers (e.g. hundreds of
thousands of IoT/edge devices)
3. With Raft, we can use “non-voters” to act as read
replicas and load balance consumers
Slide 107
Scaling Message Delivery
1. Partitioning
2. High fan-out
3. Push vs. pull
Slide 108
Push vs. Pull
• In Kafka, consumers pull data from brokers
• In NATS Streaming, brokers push data to consumers
• Pros/cons to both:
-> With push we need flow control; implicit in pull
-> Need to make decisions about optimizing for
latency vs. throughput
-> Thick vs. thin client and API ergonomics
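A sketch of why flow control is implicit in the pull model: the consumer only fetches the next batch after it finishes processing the previous one, so a slow consumer slows its own fetch rate instead of being flooded by pushes. The fetch and process functions are stand-ins, not a real client API.

```go
package main

import (
	"fmt"
	"time"
)

// fetch stands in for a broker RPC: return up to max records starting at offset.
// Here it just fabricates records so the example runs standalone.
func fetch(offset int64, max int) [][]byte {
	batch := make([][]byte, 0, max)
	for i := 0; i < max && offset+int64(i) < 10; i++ {
		batch = append(batch, []byte(fmt.Sprintf("record-%d", offset+int64(i))))
	}
	return batch
}

func process(rec []byte) {
	time.Sleep(10 * time.Millisecond) // simulate work; slower work = slower pulls
	fmt.Println("processed", string(rec))
}

func main() {
	var offset int64
	for {
		batch := fetch(offset, 3) // pull: the consumer sets the pace
		if len(batch) == 0 {
			return // caught up; a real client would long-poll or back off here
		}
		for _, rec := range batch {
			process(rec)
		}
		offset += int64(len(batch)) // advance position only after processing
	}
}
```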
Slide 109
Scaling Message Delivery
1. Partitioning
2. High fan-out
3. Push vs. pull
4. Bookkeeping
Slide 110
Bookkeeping
• Two ways to track position in the log:
-> Have the server track it for consumers
-> Have consumers track it
• Trade-off between API simplicity and performance/server
complexity
• Also, consumers might not have stable storage (e.g. IoT device,
ephemeral container, etc.)
• Can we split the difference?
Slide 111
Offset Storage
• Can store offsets themselves in the log (in Kafka,
originally had to store them in ZooKeeper)
• Clients periodically checkpoint offset to log
• Use log compaction to retain only latest offsets
• On recovery, fetch latest offset from log
Offset Storage
Advantages:
-> Fault-tolerant
-> Consistent reads
-> High write throughput (unlike ZooKeeper)
-> Reuses existing structures, so less server
complexity
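A sketch of the offset-storage scheme with illustrative names: consumers periodically append a (consumer ID, offset) checkpoint to an internal log, compaction keeps only the newest record per key, and recovery is a lookup of that latest record.

```go
package main

import "fmt"

// offsetCommit is one checkpoint record in the internal offsets log.
type offsetCommit struct {
	ConsumerID string // compaction key
	Offset     int64
}

// compact keeps only the newest commit per consumer, mirroring what log
// compaction would do to the offsets log over time.
func compact(log []offsetCommit) map[string]int64 {
	latest := make(map[string]int64)
	for _, c := range log { // later records overwrite earlier ones
		latest[c.ConsumerID] = c.Offset
	}
	return latest
}

func main() {
	// Consumers periodically checkpoint their position by appending here.
	offsetsLog := []offsetCommit{
		{"consumer-a", 10},
		{"consumer-b", 4},
		{"consumer-a", 25}, // newer checkpoint supersedes offset 10
	}

	// On recovery, a consumer fetches its latest committed offset.
	latest := compact(offsetsLog)
	fmt.Println("consumer-a resumes at offset", latest["consumer-a"]) // 25
	fmt.Println("consumer-b resumes at offset", latest["consumer-b"]) // 4
}
```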
Slide 116
Trade-offs and Lessons Learned
1. Competing goals
Slide 117
Competing Goals
1. Performance
-> Easy to make something fast that’s not fault-tolerant or scalable
-> Simplicity of mechanism makes this easier
-> Simplicity of “UX” makes this harder
2. Scalability (and fault-tolerance)
-> Scalability and FT are at odds with simplicity
-> Cannot be an afterthought—needs to be designed from day 1
3. Simplicity (“UX”)
-> Simplicity of mechanism shifts complexity elsewhere (e.g. client)
-> Easy to let server handle complexity; hard when that needs to be
distributed and consistent while still being fast
Slide 118
Trade-offs and Lessons Learned
1. Competing goals
2. Availability vs. Consistency
Slide 119
Availability vs. Consistency
• CAP theorem
• Consistency requires quorum which hinders
availability and performance
• Minimize what you need to replicate
Slide 120
Trade-offs and Lessons Learned
1. Competing goals
2. Availability vs. Consistency
3. Aim for simplicity
Slide 121
Distributed systems are complex enough.
Simple is usually better (and faster).
Slide 122
Trade-offs and Lessons Learned
1. Competing goals
2. Availability vs. Consistency
3. Aim for simplicity
4. Lean on existing work
Slide 123
Don't roll your own coordination protocol; use Raft, ZooKeeper, etc.
Slide 124
Trade-offs and Lessons Learned
1. Competing goals
2. Availability vs. Consistency
3. Aim for simplicity
4. Lean on existing work
5. There are probably edge cases for which you
haven’t written tests
Slide 125
There are many failure modes, and you can
only write so many tests.
Formal methods and property-based/
generative testing can help.
Slide 127
Trade-offs and Lessons Learned
1. Competing goals
2. Availability vs. Consistency
3. Aim for simplicity
4. Lean on existing work
5. There are probably edge cases for which you
haven’t written tests
6. Be honest with your users
Slide 128
Don't try to be everything to everyone. Be explicit about design decisions, trade-offs, guarantees, defaults, etc.