Slide 1

Building a Distributed Message Log from Scratch
Tyler Treat · Iowa Code Camp · 11/04/17

Slide 2

Tyler Treat
- Messaging Nerd @ Apcera
- Working on nats.io
- Distributed systems
- bravenewgeek.com

Slide 3

No content

Slide 4

Outline

- The Log
  -> What?
  -> Why?
- Implementation
  -> Storage mechanics
  -> Data-replication techniques
  -> Scaling message delivery
  -> Trade-offs and lessons learned

Slide 5

The Log

Slide 6

The Log: a totally-ordered, append-only data structure.

Slides 7-14

The Log (diagram): records 0 through 5 are appended one at a time; the oldest record sits at the head of the log and the newest record at the tail.

Slide 15

Logs record what happened and when.

Slide 16

(diagram: writes, caches, databases, indexes)

Slide 17

No content

Slide 18

Examples in the wild:
-> Apache Kafka
-> Amazon Kinesis
-> NATS Streaming
-> Tank

Slide 19

Key Goals:
-> Performance
-> High Availability
-> Scalability

Slide 20

The purpose of this talk is to learn…
-> a bit about the internals of a log abstraction.
-> how it can achieve these goals.
-> some applied distributed systems theory.

Slide 21

You will probably never need to build something like this yourself, but it helps to know how it works.

Slide 22

Implementation

Slide 23

Implementation
Don't try this at home.

Slide 24

Storage Mechanics
Some first principles…
• The log is an ordered, immutable sequence of messages
• Messages are atomic (meaning they can't be broken up)
• The log has a notion of message retention based on some policies (time, number of messages, bytes, etc.)
• The log can be played back from any arbitrary position
• The log is stored on disk
• Sequential disk access is fast*
• OS page cache means sequential access often avoids disk
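
To make these first principles concrete, here is a minimal sketch in Go (illustrative names, not the code of any system mentioned in this talk) of an append-only log file: records are length-prefixed and written at the tail, and playback from an arbitrary position is just a sequential read starting at that byte position.

// Minimal append-only log sketch: length-prefixed records appended to a single
// file, read back sequentially from any starting byte position.
package main

import (
    "encoding/binary"
    "fmt"
    "io"
    "os"
)

type log struct{ f *os.File }

// append writes one record at the tail and returns the byte position it starts at.
func (l *log) append(record []byte) (int64, error) {
    pos, err := l.f.Seek(0, io.SeekEnd)
    if err != nil {
        return 0, err
    }
    var hdr [4]byte
    binary.BigEndian.PutUint32(hdr[:], uint32(len(record)))
    if _, err := l.f.Write(hdr[:]); err != nil {
        return 0, err
    }
    _, err = l.f.Write(record)
    return pos, err
}

// readFrom plays the log back sequentially starting at the given byte position.
func (l *log) readFrom(pos int64, fn func([]byte)) error {
    if _, err := l.f.Seek(pos, io.SeekStart); err != nil {
        return err
    }
    for {
        var hdr [4]byte
        if _, err := io.ReadFull(l.f, hdr[:]); err == io.EOF {
            return nil
        } else if err != nil {
            return err
        }
        record := make([]byte, binary.BigEndian.Uint32(hdr[:]))
        if _, err := io.ReadFull(l.f, record); err != nil {
            return err
        }
        fn(record)
    }
}

func main() {
    f, _ := os.CreateTemp("", "log")
    defer os.Remove(f.Name())
    l := &log{f: f}
    l.append([]byte("record 0"))
    pos1, _ := l.append([]byte("record 1"))
    l.append([]byte("record 2"))
    // Play back from an arbitrary position (here, starting at record 1).
    l.readFrom(pos1, func(r []byte) { fmt.Println(string(r)) })
}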

Slide 25

http://queue.acm.org/detail.cfm?id=1563874

Slide 26

iostat

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          13.53   0.00    11.28     0.00    0.00  75.19

Device:   tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvda     0.00         0.00         0.00          0          0

Slides 27-35

Storage Mechanics (diagram): records 0 through 5 are appended to a log file; the log is then broken into segment files (log segment 0 file holding records 0-2, log segment 3 file holding records 3-5), each paired with an index segment file (index segment 0 file, index segment 3 file) containing entries 0, 1, 2 for the records in that segment.
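
A minimal sketch of the segment/index layout pictured above (names and layout are illustrative, not Kafka's or NATS Streaming's actual on-disk format): each segment carries a base offset, and its index maps a record's relative offset to a byte position in the segment file, so a read can seek directly to the record instead of scanning.

// Sketch: locate a record by logical offset using segments and per-segment indexes.
package main

import "fmt"

type segment struct {
    baseOffset int64   // first logical offset stored in this segment (e.g. 0 or 3)
    positions  []int64 // relative offset -> byte position in the segment file
}

// locate returns the segment and byte position holding the given logical offset.
func locate(segments []segment, offset int64) (seg segment, pos int64, ok bool) {
    for i := len(segments) - 1; i >= 0; i-- { // check newest segment first
        s := segments[i]
        rel := offset - s.baseOffset
        if rel >= 0 && rel < int64(len(s.positions)) {
            return s, s.positions[rel], true
        }
    }
    return segment{}, 0, false
}

func main() {
    segments := []segment{
        {baseOffset: 0, positions: []int64{0, 120, 250}}, // records 0-2
        {baseOffset: 3, positions: []int64{0, 90, 200}},  // records 3-5
    }
    if s, pos, ok := locate(segments, 4); ok {
        fmt.Printf("offset 4 -> segment starting at %d, byte position %d\n", s.baseOffset, pos)
    }
}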

Slide 36

Zero-copy Reads (diagram): without zero-copy, the application read()s data from disk through the kernel page cache into user space, then send()s it back down through kernel space to the socket and NIC.

Slide 37

Zero-copy Reads (diagram): with sendfile, data moves from the page cache to the NIC entirely within kernel space, never crossing into user space.
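
A minimal sketch of serving a segment over TCP with the zero-copy path described above. On Linux, io.Copy from an *os.File to a TCP connection is typically implemented with sendfile(2), so the bytes flow from the page cache to the socket without being copied into user space (the file name is hypothetical; any readable file works).

// Sketch: serve a log segment file to TCP clients using the kernel's zero-copy path.
package main

import (
    "io"
    "log"
    "net"
    "os"
)

func main() {
    ln, err := net.Listen("tcp", ":9000")
    if err != nil {
        log.Fatal(err)
    }
    for {
        conn, err := ln.Accept()
        if err != nil {
            log.Fatal(err)
        }
        go func(c net.Conn) {
            defer c.Close()
            f, err := os.Open("segment0.log") // hypothetical segment file
            if err != nil {
                log.Print(err)
                return
            }
            defer f.Close()
            // On Linux this copy is done with sendfile(2) under the hood.
            if _, err := io.Copy(c, f); err != nil {
                log.Print(err)
            }
        }(conn)
    }
}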

Slide 38

Left as an exercise for the listener…
-> Batching
-> Compression

Slides 39-41

(diagram: writes, caches, databases, indexes)

Slide 42

How do we achieve high availability and fault tolerance?

Slides 43-44

Questions:
-> How do we ensure continuity of reads/writes?
-> How do we replicate data?
-> How do we ensure replicas are consistent?
-> How do we keep things fast?
-> How do we ensure data is durable?

Slide 45

(diagram: writes, caches, databases, indexes)

Slide 46

Questions:
-> How do we ensure continuity of reads/writes?
-> How do we replicate data?
-> How do we ensure replicas are consistent?
-> How do we keep things fast?
-> How do we ensure data is durable?

Slide 47

Data-Replication Techniques
1. Gossip/multicast protocols: epidemic broadcast trees, bimodal multicast, SWIM, HyParView, NeEM
2. Consensus protocols: 2PC/3PC, Paxos, Raft, Zab, chain replication

Slide 48

Questions:
-> How do we ensure continuity of reads/writes?
-> How do we replicate data?
-> How do we ensure replicas are consistent?
-> How do we keep things fast?
-> How do we ensure data is durable?

Slide 49

Data-Replication Techniques
1. Gossip/multicast protocols: epidemic broadcast trees, bimodal multicast, SWIM, HyParView, NeEM
2. Consensus protocols: 2PC/3PC, Paxos, Raft, Zab, chain replication

Slide 50

Consensus-Based Replication
1. Designate a leader
2. Replicate by either:
   a) waiting for all replicas, or
   b) waiting for a quorum of replicas
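
A minimal sketch of option (b), quorum-based replication (channels stand in for RPCs; this is not any particular system's API): the leader ships the entry to its followers and considers it committed once a quorum of acks, counting its own local append, has arrived.

// Sketch: commit an entry once a quorum of replicas has acknowledged it.
package main

import (
    "errors"
    "fmt"
    "time"
)

func replicate(entry []byte, followers []chan []byte, acks chan int, quorum int, timeout time.Duration) error {
    for _, f := range followers {
        f <- entry // ship the entry to each follower (an RPC in a real system)
    }
    got := 1 // the leader's own local append counts toward the quorum
    deadline := time.After(timeout)
    for got < quorum {
        select {
        case <-acks:
            got++
        case <-deadline:
            return errors.New("timed out waiting for quorum")
        }
    }
    return nil // committed: a quorum now has the entry
}

func main() {
    followers := []chan []byte{make(chan []byte, 1), make(chan []byte, 1)}
    acks := make(chan int, len(followers))
    // Simulated followers ack as soon as they receive the entry.
    for i, f := range followers {
        go func(id int, in chan []byte) {
            <-in
            acks <- id
        }(i, f)
    }
    // Three replicas total (leader + 2 followers), so a quorum is 2.
    err := replicate([]byte("msg"), followers, acks, 2, time.Second)
    fmt.Println("commit error:", err)
}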

Slide 51

Consensus-Based Replication

               Pros                                      Cons
All Replicas   Tolerates f failures with f+1 replicas    Latency pegged to slowest replica
Quorum         Hides delay from a slow replica           Tolerates f failures with 2f+1 replicas

Slide 52

Replication in Kafka
1. Select a leader
2. Maintain in-sync replica set (ISR) (initially every replica)
3. Leader writes messages to write-ahead log (WAL)
4. Leader commits messages when all replicas in ISR ack
5. Leader maintains high-water mark (HW) of last committed message
6. Piggyback HW on replica fetch responses, which replicas periodically checkpoint to disk

Slide 53

Replication in Kafka (diagram): writes go to b1 (leader), which holds records 0-5; b2 (follower) holds 0-4 and b3 (follower) holds 0-3; HW: 3 on all three; ISR: {b1, b2, b3}.
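
A minimal sketch of the high-water-mark bookkeeping in the diagram above (field and function names are illustrative, not Kafka's actual code): the leader tracks the last offset each ISR member has appended, and the HW is the minimum of those, i.e. the newest record every in-sync replica is known to have.

// Sketch: the high-water mark is the minimum log end offset across the ISR.
package main

import "fmt"

type partition struct {
    isr       map[string]bool  // replicas currently in sync
    endOffset map[string]int64 // last offset each replica has appended
}

// highWatermark returns the highest offset replicated to every ISR member.
func (p *partition) highWatermark() int64 {
    hw := int64(-1)
    first := true
    for r := range p.isr {
        off := p.endOffset[r]
        if first || off < hw {
            hw = off
            first = false
        }
    }
    return hw
}

func main() {
    p := &partition{
        isr:       map[string]bool{"b1": true, "b2": true, "b3": true},
        endOffset: map[string]int64{"b1": 5, "b2": 4, "b3": 3},
    }
    fmt.Println("HW:", p.highWatermark()) // HW: 3, matching the diagram above
}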

Slide 54

Failure Modes
1. Leader fails

Slides 55-58

Leader fails (diagram): b1 (leader, records 0-5) fails; b2 takes over as leader and b3 remains a follower, each holding records 0-3 with HW: 3, and the ISR shrinks to {b2, b3}. Records 4 and 5, which were never committed (above the HW), are not present on the new leader.

Slide 59

Failure Modes
1. Leader fails
2. Follower fails

Slides 60-63

Follower fails (diagram): b2 stops fetching from b1 (leader); after replica.lag.time.max.ms it is dropped from the ISR, leaving ISR: {b1, b3} with b1 holding records 0-5, b3 holding 0-3, and HW: 3.

Slide 64

Failure Modes
1. Leader fails
2. Follower fails
3. Follower temporarily partitioned

Slides 65-73

Follower temporarily partitioned (diagram): b3 is cut off from b1 (leader); after replica.lag.time.max.ms it is removed from the ISR, leaving ISR: {b1, b2}. The leader keeps committing with the smaller ISR, and the HW advances to 5 on b1 and b2 while b3 is stuck at HW: 3. When the partition heals, b3 catches up (fetching records 4 and 5, its HW advancing 3 to 4 to 5) and rejoins, restoring ISR: {b1, b2, b3}.

Slide 74

Replication in NATS Streaming
1. Metadata Raft group replicates client state
2. Separate Raft group per topic replicates messages and subscriptions
3. Conceptually, two logs: Raft log and message log

Slide 75

http://thesecretlivesofdata.com/raft

Slide 76

Challenges
1. Scaling Raft

Slide 77

Scaling Raft
With a single topic, one node is elected leader and it heartbeats messages to followers.

Slide 78

Scaling Raft
As the number of topics increases unbounded, so does the number of Raft groups.

Slide 79

Scaling Raft
Technique 1: run a fixed number of Raft groups and use a consistent hash to map a topic to a group.
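
A minimal sketch of Technique 1 (illustrative, not NATS Streaming's implementation): a small consistent-hash ring maps each topic to one of a fixed number of Raft groups, so adding or removing a group only remaps a fraction of the topics.

// Sketch: consistent hashing from topic names onto a fixed set of Raft groups.
package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

type ring struct {
    points []uint32          // sorted hash points on the ring
    groups map[uint32]string // hash point -> Raft group name
}

func newRing(groups []string, replicas int) *ring {
    r := &ring{groups: make(map[uint32]string)}
    for _, g := range groups {
        for i := 0; i < replicas; i++ { // virtual nodes smooth the distribution
            p := hash(fmt.Sprintf("%s-%d", g, i))
            r.points = append(r.points, p)
            r.groups[p] = g
        }
    }
    sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
    return r
}

// group returns the Raft group responsible for a topic: the first ring point
// clockwise from the topic's hash.
func (r *ring) group(topic string) string {
    h := hash(topic)
    i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
    if i == len(r.points) {
        i = 0
    }
    return r.groups[r.points[i]]
}

func hash(s string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(s))
    return h.Sum32()
}

func main() {
    r := newRing([]string{"raft-group-0", "raft-group-1", "raft-group-2"}, 16)
    for _, topic := range []string{"purchases", "inventory", "clicks"} {
        fmt.Println(topic, "->", r.group(topic))
    }
}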

Slide 80

Scaling Raft
Technique 2: run an entire node's worth of topics as a single group using a layer on top of Raft.
https://www.cockroachlabs.com/blog/scaling-raft

Slide 81

Challenges
1. Scaling Raft
2. Dual writes

Slides 82-92

Dual Writes (diagram): the Raft log accumulates entries msg 1, msg 2, sub, msg 3, add peer, msg 4, while the separate message Store only receives the committed messages msg 1 through msg 4. Because the Raft log also contains non-message entries (subscriptions, membership changes), its physical offsets (0-5) diverge from the logical message offsets (0-3). The final diagram replaces the Store with an Index mapping logical message offsets to physical positions in the Raft log.

Slide 93

Treat the Raft log as our message write-ahead log.
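
A minimal sketch of that idea (names are illustrative): rather than copying committed messages into a second store, keep an index from logical message offset to the physical position of the entry in the Raft log, skipping non-message entries such as subscriptions and membership changes.

// Sketch: index logical message offsets against physical Raft log offsets.
package main

import "fmt"

type entryKind int

const (
    kindMessage entryKind = iota
    kindSub
    kindAddPeer
)

type raftEntry struct {
    kind entryKind
    data string
}

type index struct {
    physical []uint64 // logical message offset -> physical Raft log offset
}

// onCommit is invoked for each committed Raft entry, in order.
func (ix *index) onCommit(physicalOffset uint64, e raftEntry) {
    if e.kind == kindMessage {
        ix.physical = append(ix.physical, physicalOffset)
    }
}

func main() {
    raftLog := []raftEntry{
        {kindMessage, "msg 1"}, {kindMessage, "msg 2"}, {kindSub, "sub"},
        {kindMessage, "msg 3"}, {kindAddPeer, "add peer"}, {kindMessage, "msg 4"},
    }
    ix := &index{}
    for off, e := range raftLog {
        ix.onCommit(uint64(off), e)
    }
    // Logical offset 3 (msg 4) lives at physical Raft offset 5, as in the diagram.
    fmt.Println("logical 3 -> physical", ix.physical[3])
}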

Slide 94

Questions:
-> How do we ensure continuity of reads/writes?
-> How do we replicate data?
-> How do we ensure replicas are consistent?
-> How do we keep things fast?
-> How do we ensure data is durable?

Slide 95

Performance
1. Publisher acks
   -> broker acks on commit (slow but safe)
   -> broker acks on local log append (fast but unsafe)
   -> publisher doesn't wait for ack (fast but unsafe)
2. Don't fsync, rely on replication for durability
3. Keep disk access sequential and maximize zero-copy reads
4. Batch aggressively
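
A minimal sketch of the three publisher ack modes listed above (types and channels are illustrative, not any broker's real API); it only shows where the publisher blocks in each mode.

// Sketch: how long a publisher waits under each ack mode.
package main

import (
    "fmt"
    "time"
)

type ackMode int

const (
    ackNone   ackMode = iota // publisher doesn't wait for an ack (fast but unsafe)
    ackLeader                // broker acks once appended to its local log (fast but unsafe)
    ackCommit                // broker acks only after the message is committed (slow but safe)
)

func publish(msg string, mode ackMode, appended, committed <-chan struct{}) {
    fmt.Println("sent:", msg)
    switch mode {
    case ackLeader:
        <-appended // wait for the local log append only
    case ackCommit:
        <-committed // wait until the message is replicated and committed
    }
    // ackNone: fire and forget
}

func main() {
    appended := make(chan struct{})
    committed := make(chan struct{})
    go func() { // simulated broker: local append, then replication commit
        close(appended)
        time.Sleep(10 * time.Millisecond)
        close(committed)
    }()
    start := time.Now()
    publish("hello", ackCommit, appended, committed)
    fmt.Println("acked after", time.Since(start))
}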

Slide 96

Questions:
-> How do we ensure continuity of reads/writes?
-> How do we replicate data?
-> How do we ensure replicas are consistent?
-> How do we keep things fast?
-> How do we ensure data is durable?

Slide 97

Durability
1. Quorum guarantees durability
   -> Comes for free with Raft
   -> In Kafka, need to configure min.insync.replicas and acks, e.g. topic with replication factor 3, min.insync.replicas=2, and acks=all
2. Disable unclean leader elections
3. At odds with availability, i.e. no quorum == no reads/writes

Slide 98

Scaling Message Delivery
1. Partitioning

Slide 99

Partitioning is how we scale linearly.

Slides 100-102

(diagram: the write path again, now with HELLA WRITES; caches, databases, indexes)

Slides 103-104

(diagram: the writes are split across Topic: purchases and Topic: inventory, and each topic is further partitioned by key ranges: Accounts A-M / Accounts N-Z and SKUs A-M / SKUs N-Z; caches, databases, indexes)
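
A minimal sketch of key-based partitioning like the split in the diagram above (topic names and ranges are illustrative): the record key determines which partition of a topic it lands on, and each partition is an independent, ordered log.

// Sketch: map a record key to one of a topic's partitions by hashing.
package main

import (
    "fmt"
    "hash/fnv"
)

// partitionFor maps a record key to one of n partitions of a topic.
func partitionFor(key string, n int) int {
    h := fnv.New32a()
    h.Write([]byte(key))
    return int(h.Sum32() % uint32(n))
}

func main() {
    // Two partitions per topic, in the spirit of the Accounts A-M / N-Z split.
    for _, account := range []string{"alice", "bob", "nancy", "zoe"} {
        fmt.Printf("purchases: account %q -> partition %d\n", account, partitionFor(account, 2))
    }
}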

Slide 105

Scaling Message Delivery
1. Partitioning
2. High fan-out

Slide 106

High Fan-out
1. Observation: with an immutable log, there are no stale/phantom reads
2. This should make it "easy" (in theory) to scale to a large number of consumers (e.g. hundreds of thousands of IoT/edge devices)
3. With Raft, we can use "non-voters" to act as read replicas and load balance consumers

Slide 107

Scaling Message Delivery
1. Partitioning
2. High fan-out
3. Push vs. pull

Slide 108

Push vs. Pull
• In Kafka, consumers pull data from brokers
• In NATS Streaming, brokers push data to consumers
• Pros/cons to both:
  -> With push we need flow control; implicit in pull
  -> Need to make decisions about optimizing for latency vs. throughput
  -> Thick vs. thin client and API ergonomics
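
A minimal sketch of the pull model (an illustrative API, not Kafka's or NATS Streaming's client): the consumer drives the pace by repeatedly asking for the next batch starting at its current offset, which is why flow control is implicit with pull.

// Sketch: a pull consumer loop that advances its own offset after each batch.
package main

import "fmt"

type record struct {
    offset int64
    value  string
}

// fetch stands in for a broker RPC: return up to max records starting at offset.
func fetch(log []record, offset int64, max int) []record {
    var out []record
    for _, r := range log {
        if r.offset >= offset && len(out) < max {
            out = append(out, r)
        }
    }
    return out
}

func main() {
    log := []record{{0, "a"}, {1, "b"}, {2, "c"}, {3, "d"}, {4, "e"}}
    offset := int64(0)
    for {
        batch := fetch(log, offset, 2) // consumer decides batch size and pace
        if len(batch) == 0 {
            break // a real client would poll or long-poll for new data here
        }
        for _, r := range batch {
            fmt.Println("consumed", r.offset, r.value)
        }
        offset = batch[len(batch)-1].offset + 1 // advance position
    }
}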

Slide 109

Scaling Message Delivery
1. Partitioning
2. High fan-out
3. Push vs. pull
4. Bookkeeping

Slide 110

Bookkeeping
• Two ways to track position in the log:
  -> Have the server track it for consumers
  -> Have consumers track it
• Trade-off between API simplicity and performance/server complexity
• Also, consumers might not have stable storage (e.g. IoT device, ephemeral container, etc.)
• Can we split the difference?

Slide 111

Offset Storage
• Can store offsets themselves in the log (in Kafka, originally had to store them in ZooKeeper)
• Clients periodically checkpoint offset to log
• Use log compaction to retain only latest offsets
• On recovery, fetch latest offset from log
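
A minimal sketch of offset storage in a compacted log (the key format follows the consumer-topic-partition pattern shown on the next slides; this is illustrative, not Kafka's implementation): compaction keeps only the most recent entry per key, so recovery just reads the surviving offsets.

// Sketch: compact an offsets log so only the latest commit per key remains.
package main

import "fmt"

type offsetCommit struct {
    key    string // e.g. "bob-foo-0" = consumer bob, topic foo, partition 0
    offset int64
}

// compact retains only the latest commit per key, preserving log order.
func compact(log []offsetCommit) []offsetCommit {
    latest := make(map[string]int) // key -> index of its last appearance
    for i, c := range log {
        latest[c.key] = i
    }
    var out []offsetCommit
    for i, c := range log {
        if latest[c.key] == i {
            out = append(out, c)
        }
    }
    return out
}

func main() {
    log := []offsetCommit{
        {"bob-foo-0", 11}, {"alice-foo-0", 15}, {"bob-foo-1", 20},
        {"bob-foo-0", 18}, {"bob-foo-0", 21},
    }
    for _, c := range compact(log) {
        fmt.Println(c.key, c.offset) // alice-foo-0 15, bob-foo-1 20, bob-foo-0 21
    }
}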

Slides 112-114

Offset Storage (diagram): the Offsets log holds bob-foo-0: 11, alice-foo-0: 15, bob-foo-1: 20, bob-foo-0: 18, bob-foo-0: 21 at positions 0-4; after compaction only the latest entry per key survives: alice-foo-0: 15, bob-foo-1: 20, bob-foo-0: 21 (positions 1, 2, 4).

Slide 115

Offset Storage
Advantages:
-> Fault-tolerant
-> Consistent reads
-> High write throughput (unlike ZooKeeper)
-> Reuses existing structures, so less server complexity

Slide 116

Trade-offs and Lessons Learned
1. Competing goals

Slide 117

Competing Goals
1. Performance
   -> Easy to make something fast that's not fault-tolerant or scalable
   -> Simplicity of mechanism makes this easier
   -> Simplicity of "UX" makes this harder
2. Scalability (and fault-tolerance)
   -> Scalability and FT are at odds with simplicity
   -> Cannot be an afterthought; needs to be designed from day 1
3. Simplicity ("UX")
   -> Simplicity of mechanism shifts complexity elsewhere (e.g. client)
   -> Easy to let server handle complexity; hard when that needs to be distributed and consistent while still being fast

Slide 118

Trade-offs and Lessons Learned
1. Competing goals
2. Availability vs. Consistency

Slide 119

Availability vs. Consistency
• CAP theorem
• Consistency requires quorum which hinders availability and performance
• Minimize what you need to replicate

Slide 120

Trade-offs and Lessons Learned
1. Competing goals
2. Availability vs. Consistency
3. Aim for simplicity

Slide 121

Distributed systems are complex enough.
 Simple is usually better (and faster).

Slide 122

Trade-offs and Lessons Learned
1. Competing goals
2. Availability vs. Consistency
3. Aim for simplicity
4. Lean on existing work

Slide 123

Don't roll your own coordination protocol; use Raft, ZooKeeper, etc.

Slide 124

Trade-offs and Lessons Learned
1. Competing goals
2. Availability vs. Consistency
3. Aim for simplicity
4. Lean on existing work
5. There are probably edge cases for which you haven't written tests

Slide 125

There are many failure modes, and you can only write so many tests.

Formal methods and property-based/generative testing can help.

Slide 126

No content

Slide 127

Trade-offs and Lessons Learned
1. Competing goals
2. Availability vs. Consistency
3. Aim for simplicity
4. Lean on existing work
5. There are probably edge cases for which you haven't written tests
6. Be honest with your users

Slide 128

Don't try to be everything to everyone. Be explicit about design decisions, trade-offs, guarantees, defaults, etc.

Slide 129

Thanks!
@tyler_treat
bravenewgeek.com