Slide 1

Slide 1 text

@tyler_treat Building a Distributed Message Log from Scratch Tyler Treat · SCALE 16x · 3/11/18

Slide 2

Slide 2 text

@tyler_treat - Managing Partner @ Real Kinetic - Messaging & distributed systems - Former nats.io core contributor - bravenewgeek.com Tyler Treat

Slide 3

Slide 3 text

@tyler_treat @tyler_treat

Slide 4

Slide 4 text

@tyler_treat Outline
- The Log
 -> What?
 -> Why?
- Implementation
 -> Storage mechanics
 -> Data-replication techniques
 -> Scaling message delivery
 -> Trade-offs and lessons learned

Slide 5

Slide 5 text

@tyler_treat The Log

Slide 6

Slide 6 text

@tyler_treat The Log A totally-ordered, append-only data structure.
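A minimal sketch of that abstraction in Go (illustrative only, not code from the talk): records are only ever appended, each append returns a monotonically increasing offset, and reads can start from any offset.

    package main

    import "fmt"

    // Record is an opaque message payload.
    type Record []byte

    // Log is a totally-ordered, append-only sequence of records.
    type Log struct {
        records []Record
    }

    // Append adds a record at the end of the log and returns its offset.
    func (l *Log) Append(r Record) int64 {
        l.records = append(l.records, r)
        return int64(len(l.records) - 1)
    }

    // Read returns the record at the given offset, supporting playback
    // from any arbitrary position.
    func (l *Log) Read(offset int64) (Record, error) {
        if offset < 0 || offset >= int64(len(l.records)) {
            return nil, fmt.Errorf("offset %d out of range", offset)
        }
        return l.records[offset], nil
    }

    func main() {
        var l Log
        l.Append(Record("first"))  // offset 0
        l.Append(Record("second")) // offset 1
        r, _ := l.Read(0)
        fmt.Println(string(r)) // "first"
    }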

Slide 7

Slide 7 text

@tyler_treat The Log 0

Slide 8

Slide 8 text

@tyler_treat 0 1 The Log

Slide 9

Slide 9 text

@tyler_treat 0 1 2 The Log

Slide 10

Slide 10 text

@tyler_treat 0 1 2 3 The Log

Slide 11

Slide 11 text

@tyler_treat 0 1 2 3 4 The Log

Slide 12

Slide 12 text

@tyler_treat 0 1 2 3 4 5 The Log

Slide 13

Slide 13 text

@tyler_treat 0 1 2 3 4 5 newest record oldest record The Log

Slide 14

Slide 14 text

@tyler_treat newest record oldest record The Log

Slide 15

Slide 15 text

@tyler_treat Logs record what happened and when.

Slide 16

Slide 16 text

@tyler_treat caches databases indexes writes

Slide 17

Slide 17 text

@tyler_treat https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

Slide 18

Slide 18 text

@tyler_treat Examples in the wild:
 -> Apache Kafka
 -> Amazon Kinesis
 -> NATS Streaming
 -> Apache Pulsar

Slide 19

Slide 19 text

@tyler_treat Key Goals:
 -> Performance
 -> High Availability
 -> Scalability

Slide 20

Slide 20 text

@tyler_treat The purpose of this talk is to learn…
 -> a bit about the internals of a log abstraction.
 -> how it can achieve these goals.
 -> some applied distributed systems theory.

Slide 21

Slide 21 text

@tyler_treat You will probably never need to build something like this yourself, but it helps to know how it works.

Slide 22

Slide 22 text

@tyler_treat Implementation

Slide 23

Slide 23 text

@tyler_treat Implementation Don’t try this at home.

Slide 24

Slide 24 text

@tyler_treat Storage Mechanics

Slide 25

Slide 25 text

@tyler_treat Some first principles…
 • The log is an ordered, immutable sequence of messages
 • Messages are atomic (meaning they can’t be broken up)
 • The log has a notion of message retention based on some policies (time, number of messages, bytes, etc.)
 • The log can be played back from any arbitrary position
 • The log is stored on disk
 • Sequential disk access is fast*
 • OS page cache means sequential access often avoids disk
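A rough sketch of how these principles translate to disk (again illustrative, not the real implementation): appends become sequential, length-prefixed writes to a file opened in append mode, and the next logical offset is tracked in memory.

    package main

    import (
        "encoding/binary"
        "os"
    )

    // diskLog appends length-prefixed records to a single file sequentially.
    type diskLog struct {
        file       *os.File
        nextOffset uint64
    }

    func openDiskLog(path string) (*diskLog, error) {
        // O_APPEND keeps every write sequential, at the end of the file.
        f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
        if err != nil {
            return nil, err
        }
        return &diskLog{file: f}, nil
    }

    // Append writes a 4-byte length header followed by the record payload
    // and returns the record's logical offset.
    func (l *diskLog) Append(record []byte) (uint64, error) {
        var hdr [4]byte
        binary.BigEndian.PutUint32(hdr[:], uint32(len(record)))
        if _, err := l.file.Write(hdr[:]); err != nil {
            return 0, err
        }
        if _, err := l.file.Write(record); err != nil {
            return 0, err
        }
        offset := l.nextOffset
        l.nextOffset++
        return offset, nil
    }

    func main() {
        l, err := openDiskLog("example.log")
        if err != nil {
            panic(err)
        }
        defer l.file.Close()
        l.Append([]byte("hello")) // offset 0
        l.Append([]byte("world")) // offset 1
    }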

Slide 26

Slide 26 text

@tyler_treat http://queue.acm.org/detail.cfm?id=1563874

Slide 27

Slide 27 text

@tyler_treat iostat

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          13.53   0.00    11.28     0.00    0.00  75.19

Device:   tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvda     0.00         0.00         0.00          0          0

Slide 28

Slide 28 text

@tyler_treat Storage Mechanics log file 0

Slide 29

Slide 29 text

@tyler_treat Storage Mechanics log file 0 1

Slide 30

Slide 30 text

@tyler_treat Storage Mechanics log file 0 1 2

Slide 31

Slide 31 text

@tyler_treat Storage Mechanics log file 0 1 2 3

Slide 32

Slide 32 text

@tyler_treat Storage Mechanics log file 0 1 2 3 4

Slide 33

Slide 33 text

@tyler_treat Storage Mechanics log file 0 1 2 3 4 5

Slide 34

Slide 34 text

@tyler_treat Storage Mechanics log file … 0 1 2 3 4 5

Slide 35

Slide 35 text

@tyler_treat Storage Mechanics log segment 3 file log segment 0 file 0 1 2 3 4 5

Slide 36

Slide 36 text

@tyler_treat Storage Mechanics log segment 3 file log segment 0 file 0 1 2 3 4 5 0 1 2 0 1 2 index segment 0 file index segment 3 file
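One way to sketch the lookup path implied by this layout (hypothetical structures, for illustration): find the segment whose base offset covers the requested offset, then use that segment's index to translate the relative offset into a byte position in the log file.

    package main

    import (
        "fmt"
        "sort"
    )

    // indexEntry maps a relative offset within a segment to a byte
    // position in that segment's log file.
    type indexEntry struct {
        relOffset uint32
        position  uint32
    }

    // segment is one chunk of the log, identified by its base offset
    // ("log segment 0", "log segment 3" in the slide).
    type segment struct {
        baseOffset uint64
        index      []indexEntry // sorted by relOffset
    }

    // findSegment returns the segment containing the target offset,
    // assuming segments are sorted by baseOffset.
    func findSegment(segments []*segment, offset uint64) *segment {
        i := sort.Search(len(segments), func(i int) bool {
            return segments[i].baseOffset > offset
        })
        if i == 0 {
            return nil
        }
        return segments[i-1]
    }

    // position translates a logical offset into a byte position using
    // the segment's index file.
    func (s *segment) position(offset uint64) (uint32, bool) {
        rel := uint32(offset - s.baseOffset)
        for _, e := range s.index {
            if e.relOffset == rel {
                return e.position, true
            }
        }
        return 0, false
    }

    func main() {
        segments := []*segment{
            {baseOffset: 0, index: []indexEntry{{0, 0}, {1, 64}, {2, 130}}},
            {baseOffset: 3, index: []indexEntry{{0, 0}, {1, 80}, {2, 155}}},
        }
        s := findSegment(segments, 4) // offset 4 lives in the segment based at 3
        pos, _ := s.position(4)       // relative offset 1 -> byte position 80
        fmt.Println(pos)
    }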

Slide 37

Slide 37 text

@tyler_treat Zero-Copy Reads user space kernel space page cache disk socket NIC application read send

Slide 38

Slide 38 text

@tyler_treat Zero-Copy Reads user space kernel space page cache disk NIC sendfile
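A Linux-specific sketch of the sendfile path in Go (not the talk's code): the broker hands the kernel a descriptor for the log segment and one for the socket, and the data moves from the page cache to the NIC without being copied into user space.

    package main

    import (
        "net"
        "os"
        "syscall"
    )

    // serveChunk sends `length` bytes of the log file starting at `offset`
    // to the client using sendfile(2): a kernel-to-kernel copy, with no
    // read()/write() round trip through the application's buffers.
    func serveChunk(conn *net.TCPConn, logFile *os.File, offset int64, length int) error {
        sockFile, err := conn.File()
        if err != nil {
            return err
        }
        defer sockFile.Close()

        _, err = syscall.Sendfile(int(sockFile.Fd()), int(logFile.Fd()), &offset, length)
        return err
    }

    func main() {
        // In a real broker, an accepted *net.TCPConn and the segment file
        // would be handed to serveChunk with the requested offset and length.
    }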

Slide 39

Slide 39 text

@tyler_treat Left as an exercise for the listener…
 -> Batching
 -> Compression

Slide 40

Slide 40 text

@tyler_treat Data-Replication Techniques

Slide 41

Slide 41 text

@tyler_treat caches databases indexes writes

Slide 42

Slide 42 text

@tyler_treat caches databases indexes writes

Slide 43

Slide 43 text

@tyler_treat caches databases indexes writes

Slide 44

Slide 44 text

@tyler_treat How do we achieve high availability and fault tolerance?

Slide 45

Slide 45 text

@tyler_treat Questions:
 -> How do we ensure continuity of reads/writes?
 -> How do we replicate data?
 -> How do we ensure replicas are consistent?
 -> How do we keep things fast?
 -> How do we ensure data is durable?

Slide 46

Slide 46 text

@tyler_treat Questions:
 -> How do we ensure continuity of reads/writes?
 -> How do we replicate data?
 -> How do we ensure replicas are consistent?
 -> How do we keep things fast?
 -> How do we ensure data is durable?

Slide 47

Slide 47 text

@tyler_treat caches databases indexes writes

Slide 48

Slide 48 text

@tyler_treat Questions:
 -> How do we ensure continuity of reads/writes?
 -> How do we replicate data?
 -> How do we ensure replicas are consistent?
 -> How do we keep things fast?
 -> How do we ensure data is durable?

Slide 49

Slide 49 text

@tyler_treat Data-Replication Techniques
 1. Gossip/multicast protocols
 Epidemic broadcast trees, bimodal multicast, SWIM, HyParView
 2. Consensus protocols
 2PC/3PC, Paxos, Raft, Zab, chain replication

Slide 50

Slide 50 text

@tyler_treat Questions:
 -> How do we ensure continuity of reads/writes?
 -> How do we replicate data?
 -> How do we ensure replicas are consistent?
 -> How do we keep things fast?
 -> How do we ensure data is durable?

Slide 51

Slide 51 text

@tyler_treat Data-Replication Techniques
 1. Gossip/multicast protocols
 Epidemic broadcast trees, bimodal multicast, SWIM, HyParView, NeEM
 2. Consensus protocols
 2PC/3PC, Paxos, Raft, Zab, chain replication

Slide 52

Slide 52 text

@tyler_treat Replication in Kafka
 1. Select a leader
 2. Maintain in-sync replica set (ISR) (initially every replica)
 3. Leader writes messages to write-ahead log (WAL)
 4. Leader commits messages when all replicas in ISR ack
 5. Leader maintains high-water mark (HW) of last committed message
 6. Piggyback HW on replica fetch responses, which replicas periodically checkpoint to disk
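The commit rule above can be sketched as follows (illustrative Go, not Kafka's code): the high-water mark is the highest offset that every replica in the ISR has acknowledged, i.e. the minimum log end offset across the ISR, minus one.

    package main

    import "fmt"

    // replica tracks how far a broker has replicated the partition's log.
    // logEndOffset is one past the last message in this replica's log.
    type replica struct {
        id           string
        logEndOffset int64
    }

    // lastCommitted derives the high-water mark as the highest offset
    // that every replica in the ISR has.
    func lastCommitted(isr []replica) int64 {
        if len(isr) == 0 {
            return -1
        }
        lowest := isr[0].logEndOffset
        for _, r := range isr[1:] {
            if r.logEndOffset < lowest {
                lowest = r.logEndOffset
            }
        }
        return lowest - 1
    }

    func main() {
        // Mirrors the next slide: b1 has offsets 0-5, b2 has 0-4, b3 has 0-3.
        isr := []replica{{"b1", 6}, {"b2", 5}, {"b3", 4}}
        fmt.Println(lastCommitted(isr)) // 3, i.e. HW: 3
    }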

Slide 53

Slide 53 text

@tyler_treat 0 1 2 3 4 5 b1 (leader) 0 1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes Replication in Kafka

Slide 54

Slide 54 text

@tyler_treat Failure Modes 1. Leader fails

Slide 55

Slide 55 text

@tyler_treat 0 1 2 3 4 5 b1 (leader) 0 1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes Leader fails

Slide 56

Slide 56 text

@tyler_treat 0 1 2 3 4 5 b1 (leader) 0 1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes Leader fails

Slide 57

Slide 57 text

@tyler_treat 0 1 2 3 4 5 b1 (leader) 0 1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes Leader fails

Slide 58

Slide 58 text

@tyler_treat 0 1 2 3 HW: 3 0 1 2 3 HW: 3 b2 (leader) b3 (follower) ISR: {b2, b3} writes Leader fails

Slide 59

Slide 59 text

@tyler_treat Failure Modes 1. Leader fails
 2. Follower fails

Slide 60

Slide 60 text

@tyler_treat 0 1 2 3 4 5 b1 (leader) 0 1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes Follower fails

Slide 61

Slide 61 text

@tyler_treat Follower fails 0 1 2 3 4 5 b1 (leader) 0 1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes

Slide 62

Slide 62 text

@tyler_treat Follower fails 0 1 2 3 4 5 b1 (leader) 0 1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes replica.lag.time.max.ms

Slide 63

Slide 63 text

@tyler_treat Follower fails 0 1 2 3 4 5 b1 (leader) 0 1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2} writes replica.lag.time.max.ms

Slide 64

Slide 64 text

@tyler_treat Follower fails 0 1 2 3 4 5 b1 (leader) 0 1 2 3 4 HW: 5 0 1 2 3 HW: 5 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2} writes 5

Slide 65

Slide 65 text

@tyler_treat Follower fails 0 1 2 3 4 5 b1 (leader) 0 1 2 3 4 HW: 5 0 1 2 3 HW: 5 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2} writes 5

Slide 66

Slide 66 text

@tyler_treat Follower fails 0 1 2 3 4 5 b1 (leader) 0 1 2 3 4 HW: 5 0 1 2 3 HW: 5 HW: 4 b2 (follower) b3 (follower) ISR: {b1, b2} writes 5 4

Slide 67

Slide 67 text

@tyler_treat Follower fails 0 1 2 3 4 5 b1 (leader) 0 1 2 3 4 HW: 5 0 1 2 3 HW: 5 HW: 5 b2 (follower) b3 (follower) ISR: {b1, b2} writes 5 4 5

Slide 68

Slide 68 text

@tyler_treat Follower fails 0 1 2 3 4 5 b1 (leader) 0 1 2 3 4 HW: 5 0 1 2 3 HW: 5 HW: 5 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes 5 4 5

Slide 69

Slide 69 text

@tyler_treat Replication in NATS Streaming 1. Raft replicates client state, messages, and subscriptions
 2. Conceptually, two logs: Raft log and message log
 3. Parallels work implementing Raft in RabbitMQ
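A sketch of how the two logs relate, assuming a hashicorp/raft-style FSM (the operation encoding here is made up for illustration): Raft replicates opaque commands, and once an entry is committed, Apply runs on every replica in the same order to update the message store and client state.

    package main

    import (
        "encoding/json"

        "github.com/hashicorp/raft"
    )

    // operation is the command replicated through Raft: a published
    // message, a new subscription, etc. (names are illustrative).
    type operation struct {
        Kind    string `json:"kind"` // "publish", "subscribe", ...
        Subject string `json:"subject"`
        Data    []byte `json:"data"`
    }

    type messageStore struct {
        messages map[string][][]byte // subject -> ordered messages
    }

    // streamFSM is the replicated state machine: Raft calls Apply for
    // every committed log entry, in order, on every replica.
    // (Snapshot and Restore are omitted from this sketch.)
    type streamFSM struct {
        store *messageStore
    }

    func (f *streamFSM) Apply(entry *raft.Log) interface{} {
        var op operation
        if err := json.Unmarshal(entry.Data, &op); err != nil {
            return err
        }
        switch op.Kind {
        case "publish":
            f.store.messages[op.Subject] = append(f.store.messages[op.Subject], op.Data)
        case "subscribe":
            // update client/subscription state here
        }
        return nil
    }

    func main() {
        fsm := &streamFSM{store: &messageStore{messages: map[string][][]byte{}}}
        // In the real system the Raft node drives fsm.Apply; here we just
        // exercise it directly with a fake committed entry.
        data, _ := json.Marshal(operation{Kind: "publish", Subject: "foo", Data: []byte("hello")})
        fsm.Apply(&raft.Log{Data: data})
    }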

Slide 70

Slide 70 text

@tyler_treat http://thesecretlivesofdata.com/raft

Slide 71

Slide 71 text

@tyler_treat Replication in NATS Streaming
 • Initially used a Raft group per topic and a separate metadata group
 • A couple of issues with this:
 -> Topic scalability
 -> Increased complexity due to lack of ordering between Raft groups

Slide 72

Slide 72 text

@tyler_treat Challenges 1. Scaling topics

Slide 73

Slide 73 text

@tyler_treat Scaling Raft With a single topic, one node is elected leader and it heartbeats messages to followers

Slide 74

Slide 74 text

@tyler_treat Scaling Raft As the number of topics increases, so does the number of Raft groups.

Slide 75

Slide 75 text

@tyler_treat Scaling Raft Technique 1: run a fixed number of Raft groups and use a consistent hash to map a topic to a group.
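A sketch of technique 1 (illustrative; the group count is arbitrary): hash the topic name onto a fixed pool of Raft groups. With a fixed pool a simple hash-mod works; a proper consistent-hash ring would additionally minimize remapping if the pool ever changed size.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    const numRaftGroups = 16 // fixed pool of Raft groups (illustrative value)

    // groupForTopic maps a topic name onto one of a fixed number of
    // Raft groups, so the number of groups stays constant as topics grow.
    func groupForTopic(topic string) uint32 {
        h := fnv.New32a()
        h.Write([]byte(topic))
        return h.Sum32() % numRaftGroups
    }

    func main() {
        for _, t := range []string{"purchases", "inventory", "clicks"} {
            fmt.Printf("topic %q -> raft group %d\n", t, groupForTopic(t))
        }
    }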

Slide 76

Slide 76 text

@tyler_treat Scaling Raft Technique 2: run an entire node’s worth of topics as a single group using a layer on top of Raft. https://www.cockroachlabs.com/blog/scaling-raft

Slide 77

Slide 77 text

@tyler_treat Scaling Raft Technique 3: use a single Raft group for all topics and metadata.

Slide 78

Slide 78 text

@tyler_treat Challenges 1. Scaling topics 2. Dual writes

Slide 79

Slide 79 text

@tyler_treat Dual Writes Raft Store committed

Slide 80

Slide 80 text

@tyler_treat Dual Writes msg 1 Raft Store committed

Slide 81

Slide 81 text

@tyler_treat Dual Writes msg 1 msg 2 Raft Store committed

Slide 82

Slide 82 text

@tyler_treat Dual Writes msg 1 msg 2 Raft msg 1 msg 2 Store committed

Slide 83

Slide 83 text

@tyler_treat Dual Writes msg 1 msg 2 sub Raft msg 1 msg 2 Store committed

Slide 84

Slide 84 text

@tyler_treat Dual Writes msg 1 msg 2 sub msg 3 Raft msg 1 msg 2 Store committed

Slide 85

Slide 85 text

@tyler_treat Dual Writes msg 1 msg 2 sub msg 3 add peer msg 4 Raft msg 1 msg 2 msg 3 Store committed

Slide 86

Slide 86 text

@tyler_treat Dual Writes msg 1 msg 2 sub msg 3 add peer msg 4 Raft msg 1 msg 2 msg 3 Store committed

Slide 87

Slide 87 text

@tyler_treat Dual Writes msg 1 msg 2 sub msg 3 add peer msg 4 Raft msg 1 msg 2 msg 3 msg 4 Store commit

Slide 88

Slide 88 text

@tyler_treat Dual Writes msg 1 msg 2 sub msg 3 add peer msg 4 Raft msg 1 msg 2 msg 3 msg 4 Store 0 1 2 3 4 5 0 1 2 3 physical offset logical offset

Slide 89

Slide 89 text

@tyler_treat Dual Writes msg 1 msg 2 sub msg 3 add peer msg 4 Raft msg 1 msg 2 Index 0 1 2 3 4 5 0 1 2 3 physical offset logical offset msg 3 msg 4

Slide 90

Slide 90 text

@tyler_treat Treat the Raft log as our message write-ahead log.
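A sketch of what that looks like (illustrative): every committed Raft entry has a physical position, but only publish entries receive a logical message offset, and a small index translates the contiguous logical offsets consumers use into physical positions in the Raft log.

    package main

    import "fmt"

    // offsetIndex maps contiguous logical message offsets to the physical
    // positions of publish entries in the Raft log, which doubles as the
    // message write-ahead log. Non-message entries (subscriptions, config
    // changes) occupy physical slots but get no logical offset.
    type offsetIndex struct {
        physical []uint64 // logical offset -> physical Raft log index
    }

    // onCommit is called for every committed Raft entry, in order.
    func (idx *offsetIndex) onCommit(raftIndex uint64, isMessage bool) {
        if isMessage {
            idx.physical = append(idx.physical, raftIndex)
        }
    }

    // lookup translates the logical offset a consumer asks for into the
    // Raft log position to read the message from.
    func (idx *offsetIndex) lookup(logical uint64) (uint64, bool) {
        if logical >= uint64(len(idx.physical)) {
            return 0, false
        }
        return idx.physical[logical], true
    }

    func main() {
        var idx offsetIndex
        // Mirrors the slides: msg 1, msg 2, sub, msg 3, add peer, msg 4.
        entries := []bool{true, true, false, true, false, true}
        for i, isMsg := range entries {
            idx.onCommit(uint64(i), isMsg)
        }
        phys, _ := idx.lookup(3) // logical offset 3 (msg 4)
        fmt.Println(phys)        // physical offset 5
    }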

Slide 91

Slide 91 text

@tyler_treat Questions:
 -> How do we ensure continuity of reads/writes?
 -> How do we replicate data?
 -> How do we ensure replicas are consistent?
 -> How do we keep things fast?
 -> How do we ensure data is durable?

Slide 92

Slide 92 text

@tyler_treat Performance
 1. Publisher acks
 -> broker acks on commit (slow but safe)
 -> broker acks on local log append (fast but unsafe)
 -> publisher doesn’t wait for ack (fast but unsafe) 
 2. Don’t fsync, rely on replication for durability
 3. Keep disk access sequential and maximize zero-copy reads
 4. Batch aggressively
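The publisher-ack options in point 1 can be sketched as a client-side policy (the broker calls here are hypothetical stand-ins):

    package main

    import "fmt"

    // AckPolicy captures the publisher-ack trade-off from the slide.
    type AckPolicy int

    const (
        AckOnCommit      AckPolicy = iota // broker acks once the quorum/ISR has the message: slow but safe
        AckOnLocalAppend                  // broker acks after its local log append: fast but unsafe
        AckNone                           // publisher doesn't wait at all: fastest, least safe
    )

    // publish sketches how a client library might honor the policy.
    func publish(policy AckPolicy, msg []byte) error {
        sendToBroker(msg)
        switch policy {
        case AckNone:
            return nil // fire and forget
        case AckOnLocalAppend:
            return waitForAck("appended")
        case AckOnCommit:
            return waitForAck("committed")
        }
        return nil
    }

    func sendToBroker(msg []byte)      { /* write to the broker connection */ }
    func waitForAck(kind string) error { fmt.Println("waiting for", kind, "ack"); return nil }

    func main() {
        publish(AckOnCommit, []byte("order-123"))
    }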

Slide 93

Slide 93 text

@tyler_treat Questions:
 -> How do we ensure continuity of reads/writes?
 -> How do we replicate data?
 -> How do we ensure replicas are consistent?
 -> How do we keep things fast?
 -> How do we ensure data is durable?

Slide 94

Slide 94 text

@tyler_treat Durability
 1. Quorum guarantees durability
 -> Comes for free with Raft
 -> In Kafka, need to configure min.insync.replicas and acks, e.g. topic with replication factor 3, min.insync.replicas=2, and acks=all
 2. Disable unclean leader elections
 3. At odds with availability, i.e. no quorum == no reads/writes
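Concretely, the Kafka settings points 1 and 2 refer to might look like this (the keys are standard Kafka configuration; the values are just the example from the slide):

    # broker/topic configuration
    default.replication.factor=3           # or --replication-factor 3 when creating the topic
    min.insync.replicas=2
    unclean.leader.election.enable=false

    # producer configuration
    acks=all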

Slide 95

Slide 95 text

@tyler_treat Scaling Message Delivery

Slide 96

Slide 96 text

@tyler_treat Scaling Message Delivery 1. Partitioning

Slide 97

Slide 97 text

@tyler_treat Partitioning is how we scale linearly.

Slide 98

Slide 98 text

@tyler_treat caches databases indexes writes

Slide 99

Slide 99 text

@tyler_treat HELLA WRITES caches databases indexes

Slide 100

Slide 100 text

@tyler_treat caches databases indexes HELLA WRITES

Slide 101

Slide 101 text

@tyler_treat caches databases indexes writes writes writes writes Topic: purchases Topic: inventory

Slide 102

Slide 102 text

@tyler_treat caches databases indexes writes writes writes writes Topic: purchases Topic: inventory Accounts A-M Accounts N-Z SKUs A-M SKUs N-Z
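A sketch of the key-range partitioning shown here (illustrative): route each account or SKU to a partition by its leading character, so writes spread across partitions while everything for a given key stays ordered on one partition.

    package main

    import "fmt"

    // partitionFor implements the range split from the slide: keys
    // (accounts or SKUs) beginning with A-M go to partition 0, N-Z to
    // partition 1, so each partition takes a share of the writes while
    // all records for a given key stay together and ordered.
    func partitionFor(key string) int {
        if key == "" {
            return 0
        }
        c := key[0]
        if c >= 'a' {
            c -= 'a' - 'A' // normalize to upper case
        }
        if c >= 'A' && c <= 'M' {
            return 0
        }
        return 1
    }

    func main() {
        for _, account := range []string{"alice", "nina", "bob", "zoe"} {
            fmt.Printf("account %q -> purchases partition %d\n", account, partitionFor(account))
        }
    }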

Slide 103

Slide 103 text

@tyler_treat Scaling Message Delivery 1. Partitioning 2. High fan-out

Slide 104

Slide 104 text

@tyler_treat Kinesis Fan-Out consumers shard-1 consumers shard-2 consumers shard-3 writes

Slide 105

Slide 105 text

@tyler_treat Replication in Kafka and NATS Streaming is purely a means of HA.

Slide 106

Slide 106 text

@tyler_treat High Fan-Out 1. Observation: with an immutable log, there are no stale/phantom reads
 2. This should make it “easy” (in theory) to scale to a large number of consumers
 3. With Raft, we can use “non-voters” to act as read replicas and load balance consumers

Slide 107

Slide 107 text

@tyler_treat Scaling Message Delivery 1. Partitioning 2. High fan-out 3. Push vs. pull

Slide 108

Slide 108 text

@tyler_treat Push vs. Pull
 • In Kafka, consumers pull data from brokers
 • In NATS Streaming, brokers push data to consumers
 • Design implications:
   • Fan-out
   • Flow control
   • Optimizing for latency vs. throughput
   • Client complexity
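A sketch of the pull side of that trade-off (the fetch call is a hypothetical stand-in for a broker request): the consumer owns its offset and batch size, which gives it flow control and replay for free at the cost of more client logic.

    package main

    import (
        "fmt"
        "time"
    )

    // fetch stands in for a broker fetch request: return up to maxBatch
    // messages starting at offset (hypothetical API, for illustration).
    func fetch(topic string, offset int64, maxBatch int) [][]byte {
        return nil // no data in this sketch
    }

    // pullLoop is the consumer-driven model: the client decides when to
    // fetch, how much to fetch, and where its offset is.
    func pullLoop(topic string, start int64) {
        offset := start
        for i := 0; i < 3; i++ { // bounded for the example; normally an endless loop
            batch := fetch(topic, offset, 100)
            for _, msg := range batch {
                fmt.Printf("offset %d: %s\n", offset, msg)
                offset++
            }
            if len(batch) == 0 {
                time.Sleep(100 * time.Millisecond) // back off when caught up
            }
        }
    }

    func main() {
        pullLoop("purchases", 0)
    }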

Slide 109

Slide 109 text

@tyler_treat Trade-Offs and Lessons Learned

Slide 110

Slide 110 text

@tyler_treat Trade-Offs and Lessons Learned 1. Competing goals

Slide 111

Slide 111 text

@tyler_treat Competing Goals
 1. Performance
 -> Easy to make something fast that’s not fault-tolerant or scalable
 -> Simplicity of mechanism makes this easier
 -> Simplicity of “UX” makes this harder
 2. Scalability and fault-tolerance
 -> At odds with simplicity
 -> Cannot be an afterthought
 3. Simplicity
 -> Simplicity of mechanism shifts complexity elsewhere (e.g. client)
 -> Easy to let server handle complexity; hard when that needs to be distributed, consistent, and fast

Slide 112

Slide 112 text

@tyler_treat Trade-Offs and Lessons Learned 1. Competing goals 2. Aim for simplicity

Slide 113

Slide 113 text

@tyler_treat Distributed systems are complex enough.
 Simple is usually better (and faster).

Slide 114

Slide 114 text

@tyler_treat “A complex system that works is invariably found to have evolved from a simple system that works.”

Slide 115

Slide 115 text

@tyler_treat Trade-Offs and Lessons Learned
 1. Competing goals
 2. Aim for simplicity
 3. You can’t effectively bolt on fault-tolerance

Slide 116

Slide 116 text

@tyler_treat “A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system.”

Slide 117

Slide 117 text

@tyler_treat Trade-Offs and Lessons Learned
 1. Competing goals
 2. Aim for simplicity
 3. You can’t effectively bolt on fault-tolerance
 4. Lean on existing work

Slide 118

Slide 118 text

@tyler_treat Don’t roll your own coordination protocol,
 use Raft, ZooKeeper, etc.

Slide 119

Slide 119 text

@tyler_treat Trade-Offs and Lessons Learned
 1. Competing goals
 2. Aim for simplicity
 3. You can’t effectively bolt on fault-tolerance
 4. Lean on existing work
 5. There are probably edge cases for which you haven’t written tests

Slide 120

Slide 120 text

@tyler_treat There are many failure modes, and you can only write so many tests.
 
 Formal methods and property-based/generative testing can help.

Slide 121

Slide 121 text

@tyler_treat @tyler_treat

Slide 122

Slide 122 text

@tyler_treat Trade-Offs and Lessons Learned
 1. Competing goals
 2. Aim for simplicity
 3. You can’t effectively bolt on fault-tolerance
 4. Lean on existing work
 5. There are probably edge cases for which you haven’t written tests
 6. Be honest with your users

Slide 123

Slide 123 text

@tyler_treat Don’t try to be everything to everyone.
 Be explicit about design decisions, trade-offs, guarantees, defaults, etc.

Slide 124

Slide 124 text

@tyler_treat https://bravenewgeek.com/tag/building-a-distributed-log-from-scratch/

Slide 125

Slide 125 text

@tyler_treat Thanks! bravenewgeek.com realkinetic.com