Examples in the wild:
-> Apache Kafka
-> Amazon Kinesis
-> NATS Streaming
-> Apache Pulsar
Key Goals:
-> Performance
-> High Availability
-> Scalability
The purpose of this talk is to learn…
-> a bit about the internals of a log abstraction.
-> how it can achieve these goals.
-> some applied distributed systems theory.
You will probably never need to
build something like this yourself,
but it helps to know how it works.
Implementation
Don’t try this at home.
Storage Mechanics
Some first principles…
• The log is an ordered, immutable sequence of messages
• Messages are atomic (meaning they can’t be broken up)
• The log has a notion of message retention based on some policies (time, number of messages, bytes, etc.)
• The log can be played back from any arbitrary position
• The log is stored on disk
• Sequential disk access is fast*
• OS page cache means sequential access often avoids disk
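To make the storage mechanics concrete, here is a minimal sketch in Go (mine, not from the talk; names like Log and Append are illustrative): records are length-prefixed and written sequentially, and any position returned by Append can later be used to play the log back from that point.

package logstore

import (
	"encoding/binary"
	"io"
	"os"
)

// Log is a minimal append-only segment file. Each record is a
// 4-byte big-endian length prefix followed by the message bytes,
// so messages stay atomic units on disk.
type Log struct{ f *os.File }

func Open(path string) (*Log, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE|os.O_APPEND, 0644)
	if err != nil {
		return nil, err
	}
	return &Log{f: f}, nil
}

// Append writes msg at the end of the log (sequential disk access)
// and returns its position, usable later as a playback offset.
func (l *Log) Append(msg []byte) (int64, error) {
	pos, err := l.f.Seek(0, io.SeekEnd)
	if err != nil {
		return 0, err
	}
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(msg)))
	if _, err := l.f.Write(hdr[:]); err != nil {
		return 0, err
	}
	if _, err := l.f.Write(msg); err != nil {
		return 0, err
	}
	return pos, nil
}

// ReadAt plays back the single message stored at position pos;
// a reader can walk forward from any arbitrary position this way.
func (l *Log) ReadAt(pos int64) ([]byte, error) {
	var hdr [4]byte
	if _, err := l.f.ReadAt(hdr[:], pos); err != nil {
		return nil, err
	}
	msg := make([]byte, binary.BigEndian.Uint32(hdr[:]))
	_, err := l.f.ReadAt(msg, pos+4)
	return msg, err
}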
Zero-Copy Reads
[Diagram: the traditional read path. read() copies data from disk via the kernel page cache into the application in user space; send() then copies it back into a kernel socket buffer before it reaches the NIC.]
Zero-Copy Reads
[Diagram: the zero-copy path. sendfile() moves data from the page cache straight to the NIC, bypassing user space entirely.]
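In Go this comes nearly for free, which makes for a compact illustration (my sketch, not from the talk; the file name is made up): on Linux, io.Copy recognizes an *os.File source and a *net.TCPConn destination and uses sendfile(2) under the hood, so segment bytes go from page cache to socket without entering user space.

package main

import (
	"io"
	"log"
	"net"
	"os"
)

func main() {
	ln, err := net.Listen("tcp", ":9000")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			defer c.Close()
			f, err := os.Open("segment.log") // hypothetical log segment
			if err != nil {
				return
			}
			defer f.Close()
			// On Linux, io.Copy detects *os.File -> *net.TCPConn and
			// calls sendfile(2): bytes move from the page cache to the
			// socket without a round trip through user space.
			io.Copy(c, f)
		}(conn)
	}
}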
Left as an exercise for the listener…
-> Batching
-> Compression
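Batching, for instance, might look like the following rough sketch (illustrative names and thresholds; a real batcher would also flush on a linger timeout, and compression would apply to each flushed batch):

package batch

// Batcher accumulates messages and flushes once the buffered bytes
// reach a threshold, amortizing per-request overhead.
type Batcher struct {
	buf      [][]byte
	size     int
	maxBytes int
	flush    func(batch [][]byte) // e.g. one write/RPC to the broker
}

func New(maxBytes int, flush func([][]byte)) *Batcher {
	return &Batcher{maxBytes: maxBytes, flush: flush}
}

// Add buffers a message, flushing when the size threshold is hit.
func (b *Batcher) Add(msg []byte) {
	b.buf = append(b.buf, msg)
	b.size += len(msg)
	if b.size >= b.maxBytes {
		b.Flush()
	}
}

// Flush hands the accumulated batch to the flush callback.
func (b *Batcher) Flush() {
	if len(b.buf) == 0 {
		return
	}
	b.flush(b.buf)
	b.buf, b.size = nil, 0
}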
Data-Replication Techniques
[Diagram, built up over several slides: writes flow into the log, and caches, databases, and indexes are derived from it.]
How do we achieve high availability and fault tolerance?
Questions:
-> How do we ensure continuity of reads/writes?
-> How do we replicate data?
-> How do we ensure replicas are consistent?
-> How do we keep things fast?
-> How do we ensure data is durable?
Replication in Kafka
1. Select a leader
2. Maintain in-sync replica set (ISR) (initially every replica)
3. Leader writes messages to write-ahead log (WAL)
4. Leader commits messages when all replicas in ISR ack
5. Leader maintains high-water mark (HW) of last committed message
6. Piggyback HW on replica fetch responses, which replicas periodically checkpoint to disk
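As a toy model of steps 4 and 5 above (not Kafka’s actual code; names are made up), the leader can track the highest offset each ISR member has acked and advance the HW to the minimum across the set:

package replication

import "math"

// leader tracks, per in-sync replica, the highest log offset that
// replica has acked, and advances the high-water mark (HW) to the
// minimum offset replicated by every ISR member.
type leader struct {
	isr map[string]int64 // replica ID -> highest acked offset
	hw  int64            // last committed offset
}

// onReplicaAck records a follower's progress and recomputes the HW.
// Messages at or below hw are committed and visible to consumers.
func (l *leader) onReplicaAck(replica string, offset int64) {
	l.isr[replica] = offset
	min := int64(math.MaxInt64)
	for _, o := range l.isr {
		if o < min {
			min = o
		}
	}
	if min > l.hw {
		l.hw = min
	}
}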
Replication in NATS Streaming
1. Raft replicates client state, messages, and subscriptions
2. Conceptually, two logs: Raft log and message log
3. Parallels work implementing Raft in RabbitMQ
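NATS Streaming’s clustering is built on hashicorp/raft, where committed entries are handed to a finite state machine. A rough sketch of the idea (illustrative types, not the real implementation): an FSM whose Apply appends each committed entry to the message log, so the Raft log doubles as the message write-ahead log.

package store

import (
	"io"

	"github.com/hashicorp/raft"
)

// messageLog is a hypothetical append-only message store.
type messageLog struct{}

func (l *messageLog) Append(data []byte) error { return nil }

// fsm applies committed Raft entries to the message log.
type fsm struct{ log *messageLog }

// Apply is invoked by hashicorp/raft once an entry is committed.
func (f *fsm) Apply(entry *raft.Log) interface{} {
	return f.log.Append(entry.Data)
}

// Snapshot and Restore are elided in this sketch.
func (f *fsm) Snapshot() (raft.FSMSnapshot, error) { return nil, nil }
func (f *fsm) Restore(rc io.ReadCloser) error      { return rc.Close() }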
http://thesecretlivesofdata.com/raft
Replication in NATS Streaming
• Initially used a Raft group per topic and a separate metadata group
• A couple of issues with this:
-> Topic scalability
-> Increased complexity due to lack of ordering between Raft groups
Challenges
1. Scaling topics
Scaling Raft
With a single topic, one node is elected leader and it heartbeats messages to followers.
Scaling Raft
As the number of topics increases, so does the number of Raft groups.
Scaling Raft
Technique 1: run a fixed number of Raft groups and use a consistent hash to map a topic to a group.
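A simplified sketch of the mapping (this uses a plain stable hash; true consistent hashing would also minimize remapping if the group count ever changed, and the topic names and group count are made up):

package main

import (
	"fmt"
	"hash/fnv"
)

// numGroups is a fixed pool of Raft groups; topics map onto groups
// by a stable hash, so group count stays bounded as topics grow.
const numGroups = 16

func groupFor(topic string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(topic))
	return h.Sum32() % numGroups
}

func main() {
	fmt.Println(groupFor("orders"))   // e.g. group 7
	fmt.Println(groupFor("payments")) // e.g. group 12
}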
Scaling Raft
Technique 2: run an entire node’s worth of topics as a single group using a layer on top of Raft.
https://www.cockroachlabs.com/blog/scaling-raft
Scaling Raft
Technique 3: use a single Raft group for all topics and metadata.
Treat the Raft log as our message write-ahead log.
Performance
1. Publisher acks
-> broker acks on commit (slow but safe)
-> broker acks on local log append (fast but unsafe)
-> publisher doesn’t wait for ack (fast but unsafe)
2. Don’t fsync; rely on replication for durability
3. Keep disk access sequential and maximize zero-copy reads
4. Batch aggressively
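For reference, the three publisher-ack options above map directly onto Kafka’s producer acks setting:

acks=all   # broker acks once the message is committed by the ISR (slow but safe)
acks=1     # broker acks after the leader appends to its local log (fast but unsafe)
acks=0     # publisher doesn’t wait for any ack (fast but unsafe)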
Durability
1. Quorum guarantees durability
-> Comes for free with Raft
-> In Kafka, need to configure min.insync.replicas and acks, e.g. a topic with replication factor 3, min.insync.replicas=2, and acks=all
2. Disable unclean leader elections
3. At odds with availability, i.e. no quorum == no reads/writes
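Concretely, that Kafka configuration might look like this (illustrative topic name; the settings themselves are real Kafka configs):

kafka-topics.sh --create --topic events \
  --bootstrap-server localhost:9092 \
  --partitions 3 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config unclean.leader.election.enable=false

# producer side
acks=all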
Replication in Kafka and NATS Streaming is purely a means of HA.
High Fan-Out
1. Observation: with an immutable log, there are no stale/phantom reads
2. This should make it “easy” (in theory) to scale to a large number of consumers
3. With Raft, we can use “non-voters” to act as read replicas and load balance consumers
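With hashicorp/raft, adding such a read replica is a single call (sketch; the ID and address are made up):

package cluster

import (
	"time"

	"github.com/hashicorp/raft"
)

// AddReadReplica joins a node as a Raft non-voter: it receives the
// replicated log (so it can serve consumers) but never votes or
// counts toward quorum.
func AddReadReplica(r *raft.Raft, id, addr string) error {
	future := r.AddNonvoter(raft.ServerID(id), raft.ServerAddress(addr), 0, 10*time.Second)
	return future.Error()
}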
Scaling Message Delivery
1. Partitioning
2. High fan-out
3. Push vs. pull
Push vs. Pull
• In Kafka, consumers pull data from brokers
• In NATS Streaming, brokers push data to consumers
• Design implications:
• Fan-out
• Flow control
• Optimizing for latency vs. throughput
• Client complexity
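As a concrete picture of the pull side (a toy sketch, not Kafka’s client; fetch and process are stand-ins for a broker RPC and application logic), the consumer owns its offset, which is what makes replay and consumer-driven flow control natural:

package consumer

import "time"

type Message struct {
	Offset int64
	Data   []byte
}

// fetch stands in for a broker RPC returning up to max messages at
// or after offset (empty when the consumer is caught up).
func fetch(topic string, offset int64, max int) []Message { return nil }

// process stands in for application logic.
func process(m Message) {}

// Consume pulls batches and tracks its own offset: the client
// controls its own flow and can replay from any position.
func Consume(topic string, start int64) {
	offset := start
	for {
		msgs := fetch(topic, offset, 100)
		if len(msgs) == 0 {
			time.Sleep(50 * time.Millisecond) // back off when caught up
			continue
		}
		for _, m := range msgs {
			process(m)
			offset = m.Offset + 1 // advance past the consumed message
		}
	}
}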
Trade-Offs and Lessons Learned
Trade-Offs and Lessons Learned
1. Competing goals
Competing Goals
1. Performance
-> Easy to make something fast that’s not fault-tolerant or scalable
-> Simplicity of mechanism makes this easier
-> Simplicity of “UX” makes this harder
2. Scalability and fault-tolerance
-> At odds with simplicity
-> Cannot be an afterthought
3. Simplicity
-> Simplicity of mechanism shifts complexity elsewhere (e.g. client)
-> Easy to let the server handle complexity; hard when that needs to be distributed, consistent, and fast
Trade-Offs and Lessons Learned
1. Competing goals
2. Aim for simplicity
Distributed systems are complex enough.
Simple is usually better (and faster).
“A complex system that works is invariably found to have evolved from a simple system that works.” (John Gall)
Trade-Offs and Lessons Learned
1. Competing goals
2. Aim for simplicity
3. You can’t effectively bolt on fault-tolerance
“A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system.” (John Gall)
Trade-Offs and Lessons Learned
1. Competing goals
2. Aim for simplicity
3. You can’t effectively bolt on fault-tolerance
4. Lean on existing work
Don’t roll your own coordination protocol; use Raft, ZooKeeper, etc.
Trade-Offs and Lessons Learned
1. Competing goals
2. Aim for simplicity
3. You can’t effectively bolt on fault-tolerance
4. Lean on existing work
5. There are probably edge cases for which you haven’t written tests
There are many failure modes, and you can only write so many tests.
Formal methods and property-based/generative testing can help.
Trade-Offs and Lessons Learned
1. Competing goals
2. Aim for simplicity
3. You can’t effectively bolt on fault-tolerance
4. Lean on existing work
5. There are probably edge cases for which you haven’t written tests
6. Be honest with your users
Don’t try to be everything to everyone.
Be explicit about design decisions, trade-offs, guarantees, defaults, etc.