
Building a Distributed Message Log from Scratch


Apache Kafka has shown that the log is a powerful abstraction for data-intensive applications. It can play a key role in managing data and distributing it across the enterprise efficiently. Vital to any data plane is not just performance, but availability and scalability. In this session, we examine what a distributed log is, how it works, and how it can achieve these goals. Specifically, we'll discuss lessons learned while building NATS Streaming, a reliable messaging layer built on NATS that provides similar semantics. We'll cover core components like leader election, data replication, log persistence, and message delivery. Come learn about distributed systems!

Tyler Treat

November 04, 2017


Transcript

  1. Building a Distributed Message Log from Scratch
    Tyler Treat · Iowa Code Camp · 11/04/17

  2. - Messaging Nerd @ Apcera

    - Working on nats.io

    - Distributed systems

    - bravenewgeek.com
    Tyler Treat


  3. Outline
    - The Log
    -> What?
    -> Why?
    - Implementation
    -> Storage mechanics
    -> Data-replication techniques
    -> Scaling message delivery
    -> Trade-offs and lessons learned

  4. The Log
    A totally-ordered, append-only data structure.
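
A minimal in-memory sketch of that abstraction in Go (illustrative only; the `Log` type and its methods are made up for this note, not taken from the talk): every append gets the next offset, nothing is ever modified in place, and reads can start from any position.

```go
package main

import "fmt"

// Log is a totally-ordered, append-only sequence of records.
// Existing entries are never modified or reordered.
type Log struct {
	records [][]byte
}

// Append adds a record to the end of the log and returns its offset.
func (l *Log) Append(record []byte) uint64 {
	l.records = append(l.records, record)
	return uint64(len(l.records) - 1)
}

// ReadFrom returns all records at or after the given offset, allowing
// playback from any arbitrary position.
func (l *Log) ReadFrom(offset uint64) [][]byte {
	if offset >= uint64(len(l.records)) {
		return nil
	}
	return l.records[offset:]
}

func main() {
	var l Log
	l.Append([]byte("record 0"))
	l.Append([]byte("record 1"))
	off := l.Append([]byte("record 2"))
	fmt.Println("newest offset:", off)                // 2
	fmt.Println("records from offset 1:", len(l.ReadFrom(1))) // 2
}
```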

  5-10. The Log (records appended one at a time)
    0 1 2 3 4 5
    oldest record … newest record

  11. Logs record what happened and when.

  12. caches
    databases
    indexes
    writes


  13. Examples in the wild:
    -> Apache Kafka

    -> Amazon Kinesis
    -> NATS Streaming

    -> Tank


  14. Key Goals:
    -> Performance
    -> High Availability
    -> Scalability


  15. The purpose of this talk is to learn…

    -> a bit about the internals of a log abstraction.
    -> how it can achieve these goals.
    -> some applied distributed systems theory.


  16. You will probably never need to
    build something like this yourself,
    but it helps to know how it works.


  17. Implementation

  18. Implementation
    Don’t try this at home.

  19. Some first principles…
    Storage Mechanics
    • The log is an ordered, immutable sequence of messages
    • Messages are atomic (meaning they can’t be broken up)
    • The log has a notion of message retention based on some policies
    (time, number of messages, bytes, etc.)
    • The log can be played back from any arbitrary position
    • The log is stored on disk
    • Sequential disk access is fast*
    • OS page cache means sequential access often avoids disk


  20. http://queue.acm.org/detail.cfm?id=1563874


  21. iostat
    avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
              13.53   0.00    11.28     0.00    0.00  75.19
    Device:   tps    Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
    xvda      0.00   0.00        0.00        0         0

  22-28. Storage Mechanics
    log file: 0 1 2 3 4 5

  29. Storage Mechanics
    log segment 0 file: 0 1 2
    log segment 3 file: 3 4 5

  30. Storage Mechanics
    log segment 0 file: 0 1 2    index segment 0 file: 0 1 2
    log segment 3 file: 3 4 5    index segment 3 file: 0 1 2
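
A rough in-memory sketch of that segment/index layout (made-up types; not the Kafka or NATS Streaming storage code): the log rolls to a new segment once the active one passes a size limit, and each segment carries an index mapping relative offsets to byte positions, so retention can drop whole old segments and reads can seek without scanning.

```go
package main

import "fmt"

// indexEntry maps a relative offset within a segment to the byte position
// of that record in the segment's log file.
type indexEntry struct {
	relOffset uint32
	position  uint32
}

// segment models one log-segment file plus its index file.
type segment struct {
	baseOffset uint64       // offset of the first record in this segment
	data       []byte       // contents of the segment file
	index      []indexEntry // contents of the index file
}

// log is a sequence of segments; a new segment is started when the active
// one would exceed maxSegmentBytes.
type log struct {
	segments        []*segment
	nextOffset      uint64
	maxSegmentBytes int
}

func (l *log) active() *segment { return l.segments[len(l.segments)-1] }

func (l *log) append(record []byte) uint64 {
	if len(l.segments) == 0 || len(l.active().data)+len(record) > l.maxSegmentBytes {
		l.segments = append(l.segments, &segment{baseOffset: l.nextOffset})
	}
	s := l.active()
	s.index = append(s.index, indexEntry{
		relOffset: uint32(l.nextOffset - s.baseOffset),
		position:  uint32(len(s.data)),
	})
	s.data = append(s.data, record...)
	off := l.nextOffset
	l.nextOffset++
	return off
}

func main() {
	l := &log{maxSegmentBytes: 32}
	for i := 0; i < 6; i++ {
		l.append([]byte(fmt.Sprintf("record %d....", i)))
	}
	for _, s := range l.segments {
		fmt.Printf("segment %d: %d records, %d bytes\n",
			s.baseOffset, len(s.index), len(s.data))
	}
}
```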

  31. Zero-copy Reads
    Traditional read/send path (crossing user space and kernel space):
    disk → page cache → application (read) → socket (send) → NIC

  32. Zero-copy Reads
    sendfile path (stays in kernel space):
    disk → page cache → NIC
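
Go makes this fast path easy to hit: when the source is an *os.File and the destination is a *net.TCPConn, io.Copy uses sendfile(2) on platforms that support it, so segment bytes move from the page cache to the NIC without a trip through user space. A small sketch (the address and segment file name are made up):

```go
package main

import (
	"io"
	"log"
	"net"
	"os"
)

// serveSegment streams a log segment file to a consumer's TCP connection.
// With an *os.File source and a *net.TCPConn destination, io.Copy takes the
// sendfile(2) fast path on Linux, avoiding copies through user space.
func serveSegment(conn net.Conn, path string) (int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	return io.Copy(conn, f)
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:9000")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) {
			defer c.Close()
			if _, err := serveSegment(c, "00000000.log"); err != nil {
				log.Println("send:", err)
			}
		}(conn)
	}
}
```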

  33. Left as an exercise for the listener…

    -> Batching

    -> Compression


  34-36. caches
    databases
    indexes
    writes

  37. How do we achieve high availability
    and fault tolerance?


  38. Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  39. Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  40. caches
    databases
    indexes
    writes


  41. Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  42. Data-Replication Techniques
    1. Gossip/multicast protocols
    Epidemic broadcast trees, bimodal multicast, SWIM, HyParView, NeEM

    2. Consensus protocols
    2PC/3PC, Paxos, Raft, Zab, chain replication


  43. Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  44. Data-Replication Techniques
    1. Gossip/multicast protocols
    Epidemic broadcast trees, bimodal multicast, SWIM, HyParView, NeEM

    2. Consensus protocols
    2PC/3PC, Paxos, Raft, Zab, chain replication


  45. Consensus-Based Replication
    1. Designate a leader
    2. Replicate by either:

    a) waiting for all replicas

    —or—
    b) waiting for a quorum of replicas


  46. Consensus-Based Replication
                    Pros                                       Cons
    All replicas    Tolerates f failures with f+1 replicas     Latency pegged to slowest replica
    Quorum          Hides delay from a slow replica            Tolerates f failures with 2f+1 replicas

  47. Replication in Kafka
    1. Select a leader
    2. Maintain in-sync replica set (ISR) (initially every replica)
    3. Leader writes messages to write-ahead log (WAL)
    4. Leader commits messages when all replicas in ISR ack
    5. Leader maintains high-water mark (HW) of last committed message
    6. Piggyback HW on replica fetch responses which replicas periodically checkpoint to disk
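
A toy model of steps 4-5 in Go (the types are mine, not Kafka's code): the leader tracks the highest offset each in-sync replica has acknowledged, and the high-water mark is the minimum of those, i.e. the last offset known to be on every ISR member.

```go
package main

import "fmt"

// leader tracks, per in-sync replica, the highest log offset that replica has
// acknowledged. The high-water mark (HW) is the highest offset present on
// every ISR member; messages at or below the HW are committed.
type leader struct {
	isr    map[string]uint64 // replica ID -> highest acked offset
	logEnd uint64            // leader's own log end offset
	hw     uint64            // high-water mark
}

// append adds a message to the leader's log and returns its offset.
func (l *leader) append() uint64 {
	l.logEnd++
	l.isr["leader"] = l.logEnd
	return l.logEnd
}

// ack records a follower's fetch position and recomputes the HW as the
// minimum acknowledged offset across the ISR.
func (l *leader) ack(replica string, offset uint64) {
	l.isr[replica] = offset
	min := l.logEnd
	for _, o := range l.isr {
		if o < min {
			min = o
		}
	}
	l.hw = min
}

func main() {
	l := &leader{isr: map[string]uint64{"leader": 0, "b2": 0, "b3": 0}}
	l.append() // offset 1
	l.append() // offset 2
	l.ack("b2", 2)
	l.ack("b3", 1)
	fmt.Println("HW:", l.hw) // 1: offset 2 is not yet on b3, so not committed
}
```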

  48. Replication in Kafka
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b2 (follower): 0 1 2 3 4     HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2, b3}

  49. Failure Modes
    1. Leader fails


  50-52. Leader fails
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b2 (follower): 0 1 2 3 4     HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2, b3}

  53. Leader fails
    writes → b2 (leader):   0 1 2 3   HW: 3
             b3 (follower): 0 1 2 3   HW: 3
    ISR: {b2, b3}

  54. Failure Modes
    1. Leader fails

    2. Follower fails


  55-56. Follower fails
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b2 (follower): 0 1 2 3 4     HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2, b3}

  57. Follower fails (replica.lag.time.max.ms)
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b2 (follower): 0 1 2 3 4     HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2, b3}

  58. Follower fails (replica.lag.time.max.ms)
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b3}

  59. Failure Modes
    1. Leader fails

    2. Follower fails

    3. Follower temporarily partitioned


  60-61. Follower temporarily partitioned
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b2 (follower): 0 1 2 3 4     HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2, b3}

  62. Follower temporarily partitioned (replica.lag.time.max.ms)
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b2 (follower): 0 1 2 3 4     HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2, b3}

  63. Follower temporarily partitioned (replica.lag.time.max.ms)
    writes → b1 (leader):   0 1 2 3 4 5   HW: 3
             b2 (follower): 0 1 2 3 4     HW: 3
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2}

  64-65. Follower temporarily partitioned
    writes → b1 (leader):   0 1 2 3 4 5   HW: 5
             b2 (follower): 0 1 2 3 4 5   HW: 5
             b3 (follower): 0 1 2 3       HW: 3
    ISR: {b1, b2}

  66. Follower temporarily partitioned
    writes → b1 (leader):   0 1 2 3 4 5   HW: 5
             b2 (follower): 0 1 2 3 4 5   HW: 5
             b3 (follower): 0 1 2 3 4     HW: 4
    ISR: {b1, b2}

  67. Follower temporarily partitioned
    writes → b1 (leader):   0 1 2 3 4 5   HW: 5
             b2 (follower): 0 1 2 3 4 5   HW: 5
             b3 (follower): 0 1 2 3 4 5   HW: 5
    ISR: {b1, b2}

  68. Follower temporarily partitioned
    writes → b1 (leader):   0 1 2 3 4 5   HW: 5
             b2 (follower): 0 1 2 3 4 5   HW: 5
             b3 (follower): 0 1 2 3 4 5   HW: 5
    ISR: {b1, b2, b3}

  69. Replication in NATS Streaming
    1. Metadata Raft group replicates client state
    2. Separate Raft group per topic replicates messages and subscriptions
    3. Conceptually, two logs: Raft log and message log

  70. http://thesecretlivesofdata.com/raft


  71. Challenges
    1. Scaling Raft


  72. Scaling Raft
    With a single topic, one node is elected leader and it heartbeats messages to followers.

  73. Scaling Raft
    As the number of topics increases unbounded, so does the number of Raft groups.

  74. Scaling Raft
    Technique 1: run a fixed number of Raft groups and use a consistent hash to map a topic to a group.
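
A sketch of technique 1 (a toy consistent-hash ring of my own, not the NATS Streaming implementation): each Raft group gets several virtual points on a ring, and a topic belongs to the first group point at or after the topic's hash, so topics spread evenly over a fixed set of groups.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring maps topics onto a fixed set of Raft groups with a consistent hash.
type ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> group name
}

func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newRing places vnodes virtual points per group on the ring.
func newRing(groups []string, vnodes int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, g := range groups {
		for v := 0; v < vnodes; v++ {
			p := hash(fmt.Sprintf("%s-%d", g, v))
			r.points = append(r.points, p)
			r.owner[p] = g
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// groupFor returns the Raft group responsible for a topic.
func (r *ring) groupFor(topic string) string {
	h := hash(topic)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	r := newRing([]string{"raft-group-0", "raft-group-1", "raft-group-2"}, 16)
	for _, topic := range []string{"purchases", "inventory", "clicks"} {
		fmt.Println(topic, "->", r.groupFor(topic))
	}
}
```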

  75. Scaling Raft
    Technique 2: run an entire node’s worth of topics as a single group using a layer on top of Raft.
    https://www.cockroachlabs.com/blog/scaling-raft

  76. Challenges
    1. Scaling Raft
    2. Dual writes


  77-85. Dual Writes
    Raft:  msg 1 | msg 2 | sub | msg 3 | add peer | msg 4
    Store: msg 1 | msg 2 | msg 3 | msg 4
    committed

  86. Dual Writes
    Raft:             msg 1  msg 2  sub  msg 3  add peer  msg 4
    physical offset:    0      1     2     3       4        5
    Store:            msg 1  msg 2  msg 3  msg 4
    logical offset:     0      1      2      3

  87. Dual Writes
    Raft:             msg 1  msg 2  sub  msg 3  add peer  msg 4
    physical offset:    0      1     2     3       4        5
    Index:            msg 1  msg 2  msg 3  msg 4
    logical offset:     0      1      2      3

  88. Treat the Raft log as our message write-ahead log.
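
One way to picture that idea (hypothetical types, not the actual NATS Streaming code): the state machine applies committed Raft entries, assigns logical offsets only to message entries, and records where each message lives in the Raft log, so no second message store needs to be written.

```go
package main

import "fmt"

// entryKind distinguishes message entries from other replicated operations
// (subscriptions, membership changes, ...) in the Raft log.
type entryKind int

const (
	kindMessage entryKind = iota
	kindSub
	kindAddPeer
)

type raftEntry struct {
	index uint64 // physical position in the Raft log
	kind  entryKind
	data  []byte
}

// messageIndex maps logical message offsets to physical Raft log indexes,
// so the Raft log itself serves as the message write-ahead log.
type messageIndex struct {
	byOffset []uint64 // logical offset -> Raft log index
}

// apply is invoked for each committed Raft entry; only message entries are
// assigned the next logical offset.
func (m *messageIndex) apply(e raftEntry) {
	if e.kind != kindMessage {
		return
	}
	m.byOffset = append(m.byOffset, e.index)
}

func main() {
	m := &messageIndex{}
	entries := []raftEntry{
		{0, kindMessage, []byte("msg 1")},
		{1, kindMessage, []byte("msg 2")},
		{2, kindSub, nil},
		{3, kindMessage, []byte("msg 3")},
		{4, kindAddPeer, nil},
		{5, kindMessage, []byte("msg 4")},
	}
	for _, e := range entries {
		m.apply(e)
	}
	// Logical offset 3 (msg 4) lives at physical Raft index 5.
	fmt.Println("logical 3 -> physical", m.byOffset[3])
}
```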

  89. Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  90. Performance
    1. Publisher acks
    -> broker acks on commit (slow but safe)
    -> broker acks on local log append (fast but unsafe)
    -> publisher doesn’t wait for ack (fast but unsafe)
    2. Don’t fsync, rely on replication for durability
    3. Keep disk access sequential and maximize zero-copy reads
    4. Batch aggressively
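
A sketch of the three publisher-ack options in Go (the `AckPolicy` type and the `broker` methods here are hypothetical, not a real client API): the trade-off is whether the publisher waits for the commit, for the local append, or not at all.

```go
package main

import (
	"errors"
	"fmt"
)

// AckPolicy controls when a publish is acknowledged.
type AckPolicy int

const (
	AckNone        AckPolicy = iota // don't wait for an ack: fast but unsafe
	AckLocalAppend                  // ack once appended to the leader's log: fast but unsafe
	AckCommit                       // ack once replicated/committed: slow but safe
)

// broker is a stand-in for the server side of a publish.
type broker struct{}

func (b *broker) appendLocal(msg []byte) error   { return nil } // append to the leader's WAL
func (b *broker) waitCommitted(msg []byte) error { return nil } // wait for the ISR/quorum

// publish sends a message and returns according to the chosen ack policy.
func publish(b *broker, msg []byte, policy AckPolicy) error {
	switch policy {
	case AckNone:
		go func() { _ = b.appendLocal(msg) }() // fire and forget
		return nil
	case AckLocalAppend:
		return b.appendLocal(msg)
	case AckCommit:
		if err := b.appendLocal(msg); err != nil {
			return err
		}
		return b.waitCommitted(msg)
	default:
		return errors.New("unknown ack policy")
	}
}

func main() {
	b := &broker{}
	fmt.Println(publish(b, []byte("order-42"), AckCommit)) // <nil>
}
```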

  91. Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  92. Durability
    1. Quorum guarantees durability
    -> Comes for free with Raft
    -> In Kafka, need to configure min.insync.replicas and acks, e.g. topic with
       replication factor 3, min.insync.replicas=2, and acks=all
    2. Disable unclean leader elections
    3. At odds with availability, i.e. no quorum == no reads/writes

  93. Scaling Message Delivery
    1. Partitioning


  94. Partitioning is how we scale linearly.


  95. caches
    databases
    indexes
    writes


  96. HELLA WRITES
    caches
    databases
    indexes


  97. caches
    databases
    indexes
    HELLA WRITES


  98. writes → Topic: purchases → caches, databases, indexes
      writes → Topic: inventory → caches, databases, indexes

  99. writes → Topic: purchases (partitions: Accounts A-M, Accounts N-Z) → caches, databases, indexes
      writes → Topic: inventory (partitions: SKUs A-M, SKUs N-Z) → caches, databases, indexes
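
A sketch of the key-based partitioning shown above (illustrative only): a stable hash of the account or SKU chooses the partition, so all writes for a given key land on the same partition and keep their relative order, while total throughput scales with the number of partitions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a message key (e.g. an account ID or SKU) to one of
// numPartitions partitions for a topic. All writes for the same key go to
// the same partition, preserving per-key ordering.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numPartitions))
}

func main() {
	for _, account := range []string{"alice", "bob", "nancy", "zoe"} {
		fmt.Printf("purchases/%s -> partition %d\n",
			account, partitionFor(account, 2))
	}
}
```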

  100. Scaling Message Delivery
    1. Partitioning
    2. High fan-out


  101. High Fan-out
    1. Observation: with an immutable log, there are no stale/phantom reads
    2. This should make it “easy” (in theory) to scale to a large number of
       consumers (e.g. hundreds of thousands of IoT/edge devices)
    3. With Raft, we can use “non-voters” to act as read replicas and load
       balance consumers

  102. Scaling Message Delivery
    1. Partitioning
    2. High fan-out
    3. Push vs. pull


  103. Push vs. Pull
    • In Kafka, consumers pull data from brokers
    • In NATS Streaming, brokers push data to consumers
    • Pros/cons to both:
    -> With push we need flow control; implicit in pull
    -> Need to make decisions about optimizing for latency vs. throughput
    -> Thick vs. thin client and API ergonomics
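
A sketch of one way to do the flow control that push requires (a hypothetical credit scheme, not the NATS Streaming protocol): the consumer grants the broker a window of credits, and the broker buffers messages once the credits run out until more are granted.

```go
package main

import "fmt"

// pusher implements simple credit-based flow control for a push consumer:
// the broker may only send while it holds credits granted by the consumer.
type pusher struct {
	credits int
	pending [][]byte // messages waiting for credit
}

// grant is called when the consumer extends its receive window; it returns
// any buffered messages that can now be sent.
func (p *pusher) grant(n int) [][]byte {
	p.credits += n
	return p.flush()
}

// push queues a message and returns whatever the current window allows.
func (p *pusher) push(msg []byte) [][]byte {
	p.pending = append(p.pending, msg)
	return p.flush()
}

// flush returns the messages that may be sent right now.
func (p *pusher) flush() [][]byte {
	n := p.credits
	if n > len(p.pending) {
		n = len(p.pending)
	}
	out := p.pending[:n]
	p.pending = p.pending[n:]
	p.credits -= n
	return out
}

func main() {
	p := &pusher{credits: 1}
	fmt.Println(len(p.push([]byte("m1")))) // 1: sent immediately
	fmt.Println(len(p.push([]byte("m2")))) // 0: out of credits, buffered
	fmt.Println(len(p.grant(1)))           // 1: m2 released
}
```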

  104. Scaling Message Delivery
    1. Partitioning
    2. High fan-out
    3. Push vs. pull
    4. Bookkeeping


  105. Bookkeeping
    • Two ways to track position in the log:
    -> Have the server track it for consumers
    -> Have consumers track it
    • Trade-off between API simplicity and performance/server complexity
    • Also, consumers might not have stable storage (e.g. IoT device, ephemeral container, etc.)
    • Can we split the difference?

  106. Offset Storage
    • Can store offsets themselves in the log (in Kafka, originally had to store
      them in ZooKeeper)
    • Clients periodically checkpoint offset to log
    • Use log compaction to retain only latest offsets
    • On recovery, fetch latest offset from log

  107-108. Offset Storage
    Offsets log:
      0: bob-foo-0   → 11
      1: alice-foo-0 → 15
      2: bob-foo-1   → 20
      3: bob-foo-0   → 18
      4: bob-foo-0   → 21

  109. Offset Storage (after compaction)
      1: alice-foo-0 → 15
      2: bob-foo-1   → 20
      4: bob-foo-0   → 21
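
A sketch of that compaction step using the example data from the slides above (simplified, not Kafka's __consumer_offsets implementation): only the newest checkpoint per consumer/topic/partition key survives, which is all recovery needs.

```go
package main

import "fmt"

// offsetEntry is one checkpoint record written to the offsets log.
type offsetEntry struct {
	key    string // e.g. "bob-foo-0" = consumer bob, topic foo, partition 0
	offset uint64 // last processed offset for that key
}

// compact keeps only the latest entry per key, preserving the log order of
// the survivors. On recovery, a consumer reads the compacted log and takes
// the value stored under its key.
func compact(entries []offsetEntry) []offsetEntry {
	last := make(map[string]int) // key -> index of the newest entry
	for i, e := range entries {
		last[e.key] = i
	}
	var out []offsetEntry
	for i, e := range entries {
		if last[e.key] == i {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	offsets := []offsetEntry{
		{"bob-foo-0", 11},
		{"alice-foo-0", 15},
		{"bob-foo-1", 20},
		{"bob-foo-0", 18},
		{"bob-foo-0", 21},
	}
	for _, e := range compact(offsets) {
		fmt.Printf("%s -> %d\n", e.key, e.offset)
	}
	// Output: alice-foo-0 -> 15, bob-foo-1 -> 20, bob-foo-0 -> 21
}
```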

  110. Offset Storage
    Advantages:
    -> Fault-tolerant
    -> Consistent reads
    -> High write throughput (unlike ZooKeeper)
    -> Reuses existing structures, so less server complexity

  111. Trade-offs and Lessons Learned
    1. Competing goals


  112. Competing Goals
    1. Performance
    -> Easy to make something fast that’s not fault-tolerant or scalable
    -> Simplicity of mechanism makes this easier
    -> Simplicity of “UX” makes this harder
    2. Scalability (and fault-tolerance)
    -> Scalability and FT are at odds with simplicity
    -> Cannot be an afterthought; needs to be designed from day 1
    3. Simplicity (“UX”)
    -> Simplicity of mechanism shifts complexity elsewhere (e.g. client)
    -> Easy to let server handle complexity; hard when that needs to be
       distributed and consistent while still being fast

  113. Trade-offs and Lessons Learned
    1. Competing goals
    2. Availability vs. Consistency


  114. Availability vs. Consistency
    • CAP theorem
    • Consistency requires quorum, which hinders availability and performance
    • Minimize what you need to replicate

  115. Trade-offs and Lessons Learned
    1. Competing goals
    2. Availability vs. Consistency
    3. Aim for simplicity


  116. Distributed systems are complex enough.

    Simple is usually better (and faster).


  117. Trade-offs and Lessons Learned
    1. Competing goals
    2. Availability vs. Consistency
    3. Aim for simplicity
    4. Lean on existing work


  118. Don’t roll your own coordination protocol; use Raft, ZooKeeper, etc.

  119. Trade-offs and Lessons Learned
    1. Competing goals
    2. Availability vs. Consistency
    3. Aim for simplicity
    4. Lean on existing work
    5. There are probably edge cases for which you
    haven’t written tests


  120. There are many failure modes, and you can only write so many tests.
    Formal methods and property-based/generative testing can help.

  121. Trade-offs and Lessons Learned
    1. Competing goals
    2. Availability vs. Consistency
    3. Aim for simplicity
    4. Lean on existing work
    5. There are probably edge cases for which you
    haven’t written tests
    6. Be honest with your users


  122. Don’t try to be everything to everyone. Be explicit about design decisions,
    trade-offs, guarantees, defaults, etc.

  123. Thanks!
    @tyler_treat

    bravenewgeek.com
