Building a Distributed Message Log from Scratch - SCaLE 16x

Apache Kafka has shown that the log is a powerful abstraction for data-intensive applications. It can play a key role in managing data and distributing it across the enterprise efficiently. Vital to any data plane is not just performance, but availability and scalability. In this session, we examine what a distributed log is, how it works, and how it can achieve these goals. Specifically, we'll discuss lessons learned while building NATS Streaming, a reliable messaging layer built on NATS that provides similar semantics. We'll cover core components like leader election, data replication, log persistence, and message delivery. Come learn about distributed systems!

Tyler Treat

March 11, 2018

Transcript

  1. @tyler_treat
    Building a Distributed
    Message Log from
    Scratch
    Tyler Treat · SCALE 16x · 3/11/18


  2. @tyler_treat
    Tyler Treat
    - Managing Partner @ Real Kinetic
    - Messaging & distributed systems
    - Former nats.io core contributor
    - bravenewgeek.com


  3. @tyler_treat
    @tyler_treat


  4. @tyler_treat
    Outline
    - The Log
    -> What?
    -> Why?
    - Implementation
    -> Storage mechanics
    -> Data-replication techniques
    -> Scaling message delivery
    -> Trade-offs and lessons learned


  5. @tyler_treat
    The Log


  6. @tyler_treat
    The Log
    A totally-ordered,
    append-only data
    structure.


  7. @tyler_treat
    The Log
    0


  8. @tyler_treat
    0 1
    The Log


  9. @tyler_treat
    0 1 2
    The Log


  10. @tyler_treat
    0 1 2 3
    The Log


  11. @tyler_treat
    0 1 2 3 4
    The Log


  12. @tyler_treat
    0 1 2 3 4 5
    The Log


  13. @tyler_treat
    The Log
    0 1 2 3 4 5
    (0 = oldest record, 5 = newest record)


  14. @tyler_treat
    The Log
    (records ordered from oldest to newest)


  15. @tyler_treat
    Logs record what
    happened and when.


  16. @tyler_treat
    caches
    databases
    indexes
    writes


  17. @tyler_treat
    https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying


  18. @tyler_treat
    Examples in the wild:
    -> Apache Kafka

    -> Amazon Kinesis
    -> NATS Streaming

    -> Apache Pulsar


  19. @tyler_treat
    Key Goals:
    -> Performance
    -> High Availability
    -> Scalability


  20. @tyler_treat
    The purpose of this talk is to learn…

    -> a bit about the internals of a log abstraction.
    -> how it can achieve these goals.
    -> some applied distributed systems theory.


  21. @tyler_treat
    You will probably never need to
    build something like this yourself,
    but it helps to know how it works.


  22. @tyler_treat
    Implementation


  23. @tyler_treat
    Implementation
    Don’t try this at home.


  24. @tyler_treat
    Storage Mechanics


  25. @tyler_treat
    Some first principles…
    • The log is an ordered, immutable sequence of messages
    • Messages are atomic (meaning they can’t be broken up)
    • The log has a notion of message retention based on some policies
    (time, number of messages, bytes, etc.)
    • The log can be played back from any arbitrary position
    • The log is stored on disk
    • Sequential disk access is fast*
    • OS page cache means sequential access often avoids disk
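
A minimal sketch in Go of the structure these principles describe: an append-only, length-prefixed log file that is written strictly sequentially and can be replayed from any recorded position. The `Log` type and its methods are illustrative only (not the Kafka or NATS Streaming implementation); retention, segment rolling, and crash recovery are omitted.

```go
// Append-only log sketch: records are length-prefixed so the file can be
// replayed from any byte position noted in an index.
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

type Log struct {
	f      *os.File
	offset uint64 // next logical offset (recovery on restart omitted)
	pos    int64  // next physical byte position in the file
}

func OpenLog(path string) (*Log, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR|os.O_APPEND, 0644)
	if err != nil {
		return nil, err
	}
	info, err := f.Stat()
	if err != nil {
		return nil, err
	}
	return &Log{f: f, pos: info.Size()}, nil
}

// Append writes a record sequentially and returns its logical offset.
func (l *Log) Append(msg []byte) (uint64, error) {
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(msg)))
	if _, err := l.f.Write(append(hdr[:], msg...)); err != nil {
		return 0, err
	}
	off := l.offset
	l.offset++
	l.pos += int64(4 + len(msg))
	return off, nil
}

func main() {
	l, _ := OpenLog("segment.log")
	off, _ := l.Append([]byte("hello"))
	fmt.Println("appended at offset", off)
}
```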


  26. @tyler_treat
    http://queue.acm.org/detail.cfm?id=1563874


  27. @tyler_treat
    iostat
    avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
              13.53   0.00    11.28     0.00    0.00  75.19
    Device:   tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
    xvda      0.00        0.00         0.00          0          0


  28. @tyler_treat
    Storage Mechanics
    log file
    0


  29. @tyler_treat
    Storage Mechanics
    log file
    0 1


  30. @tyler_treat
    Storage Mechanics
    log file
    0 1 2


  31. @tyler_treat
    Storage Mechanics
    log file
    0 1 2 3


  32. @tyler_treat
    Storage Mechanics
    log file
    0 1 2 3 4


  33. @tyler_treat
    Storage Mechanics
    log file
    0 1 2 3 4 5


  34. @tyler_treat
    Storage Mechanics
    log file

    0 1 2 3 4 5


  35. @tyler_treat
    Storage Mechanics
    log segment 3 file
    log segment 0 file
    0 1 2 3 4 5


  36. @tyler_treat
    Storage Mechanics
    log segment 3 file
    log segment 0 file
    0 1 2 3 4 5
    0 1 2 0 1 2
    index segment 0 file index segment 3 file
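
A sketch of what an index segment can look like, assuming fixed-size entries that map an offset (relative to the segment's base offset) to a byte position in the corresponding log segment file; the `indexSegment` type and layout are hypothetical, not any particular implementation.

```go
// Index segment lookup sketch: each entry maps a relative offset within the
// segment to the byte position of that record in the log segment file.
package main

import (
	"fmt"
	"sort"
)

type indexEntry struct {
	relOffset uint32 // offset relative to the segment's base offset
	position  uint32 // byte position of that record in the log segment file
}

type indexSegment struct {
	baseOffset uint64
	entries    []indexEntry // sorted by relOffset
}

// lookup returns the byte position at which to start scanning for `offset`.
func (s *indexSegment) lookup(offset uint64) uint32 {
	rel := uint32(offset - s.baseOffset)
	// Find the last entry with relOffset <= rel.
	i := sort.Search(len(s.entries), func(i int) bool {
		return s.entries[i].relOffset > rel
	})
	if i == 0 {
		return 0
	}
	return s.entries[i-1].position
}

func main() {
	seg := &indexSegment{
		baseOffset: 3,
		entries:    []indexEntry{{0, 0}, {1, 64}, {2, 128}},
	}
	fmt.Println(seg.lookup(4)) // 64: start scanning the log segment at byte 64
}
```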


  37. @tyler_treat
    Zero-Copy Reads
    [diagram: traditional read path — the application read()s data, copying it
    from the page cache (backed by disk) into user space, then send()s it,
    copying it back into a kernel socket buffer before it reaches the NIC]


  38. @tyler_treat
    Zero-Copy Reads
    [diagram: sendfile path — data moves from the page cache to the NIC
    entirely within kernel space, with no copy into user space]
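
Go exposes this path through the standard library: on Linux, io.Copy from an *os.File to a *net.TCPConn is implemented with sendfile(2), so a broker can stream a log segment to a consumer's socket without copying it through user space. A minimal sketch, with a placeholder port and file name:

```go
// Zero-copy serving sketch: io.Copy detects the (TCPConn, File) pair and
// issues sendfile(2), keeping the bytes in kernel space.
package main

import (
	"io"
	"log"
	"net"
	"os"
)

func serveSegment(conn *net.TCPConn, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	defer conn.Close()
	_, err = io.Copy(conn, f) // sendfile under the hood on Linux
	return err
}

func main() {
	ln, err := net.Listen("tcp", ":9000")
	if err != nil {
		log.Fatal(err)
	}
	for {
		c, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go serveSegment(c.(*net.TCPConn), "segment.log")
	}
}
```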


  39. @tyler_treat
    Left as an exercise for the listener…

    -> Batching

    -> Compression


  40. @tyler_treat
    Data-Replication Techniques


  41. @tyler_treat
    caches
    databases
    indexes
    writes


  42. @tyler_treat
    caches
    databases
    indexes
    writes


  43. @tyler_treat
    caches
    databases
    indexes
    writes


  44. @tyler_treat
    How do we achieve high availability
    and fault tolerance?


  45. @tyler_treat
    Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  46. @tyler_treat
    Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  47. @tyler_treat
    caches
    databases
    indexes
    writes


  48. @tyler_treat
    Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  49. @tyler_treat
    Data-Replication Techniques
    1. Gossip/multicast protocols
    Epidemic broadcast trees, bimodal multicast, SWIM, HyParView

    2. Consensus protocols
    2PC/3PC, Paxos, Raft, Zab, chain replication


  50. @tyler_treat
    Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  51. @tyler_treat
    Data-Replication Techniques
    1. Gossip/multicast protocols
    Epidemic broadcast trees, bimodal multicast, SWIM, HyParView, NeEM

    2. Consensus protocols
    2PC/3PC, Paxos, Raft, Zab, chain replication


  52. @tyler_treat
    Replication in Kafka
    1. Select a leader
    2. Maintain in-sync replica set (ISR) (initially every replica)
    3. Leader writes messages to write-ahead log (WAL)
    4. Leader commits messages when all replicas in ISR ack
    5. Leader maintains high-water mark (HW) of last committed message
    6. Piggyback HW on replica fetch responses, which replicas periodically
    checkpoint to disk
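
A toy model of the commit rule above, not Kafka's actual code: the high-water mark is the highest offset that every replica in the ISR has written, and everything at or below it is committed. The numbers mirror the diagram on the following slides.

```go
// Toy model of the ISR commit rule: the high-water mark (HW) is the highest
// offset written by every replica in the in-sync replica set.
package main

import "fmt"

type replica struct {
	id           string
	logEndOffset uint64 // next offset the replica would write
}

func highWaterMark(isr []replica) uint64 {
	lowest := isr[0].logEndOffset
	for _, r := range isr[1:] {
		if r.logEndOffset < lowest {
			lowest = r.logEndOffset
		}
	}
	return lowest - 1 // offset of the last fully replicated (committed) message
}

func main() {
	// b1 holds offsets 0-5, b2 holds 0-4, b3 holds 0-3.
	isr := []replica{{"b1", 6}, {"b2", 5}, {"b3", 4}}
	fmt.Println("HW:", highWaterMark(isr)) // HW: 3
}
```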


  53. @tyler_treat
    0 1 2 3 4 5
    b1 (leader)
    0 1 2 3 4
    HW: 3
    0 1 2 3
    HW: 3
    HW: 3
    b2 (follower)
    b3 (follower)
    ISR: {b1, b2, b3}
    writes
    Replication in Kafka


  54. @tyler_treat
    Failure Modes
    1. Leader fails


  55. @tyler_treat
    0 1 2 3 4 5
    b1 (leader)
    0 1 2 3 4
    HW: 3
    0 1 2 3
    HW: 3
    HW: 3
    b2 (follower)
    b3 (follower)
    ISR: {b1, b2, b3}
    writes
    Leader fails


  56. @tyler_treat
    0 1 2 3 4 5
    b1 (leader)
    0 1 2 3 4
    HW: 3
    0 1 2 3
    HW: 3
    HW: 3
    b2 (follower)
    b3 (follower)
    ISR: {b1, b2, b3}
    writes
    Leader fails


  57. @tyler_treat
    0 1 2 3 4 5
    b1 (leader)
    0 1 2 3 4
    HW: 3
    0 1 2 3
    HW: 3
    HW: 3
    b2 (follower)
    b3 (follower)
    ISR: {b1, b2, b3}
    writes
    Leader fails


  58. @tyler_treat
    0 1 2 3
    HW: 3
    0 1 2 3
    HW: 3
    b2 (leader)
    b3 (follower)
    ISR: {b2, b3}
    writes
    Leader fails


  59. @tyler_treat
    Failure Modes
    1. Leader fails

    2. Follower fails


  60. @tyler_treat
    0 1 2 3 4 5
    b1 (leader)
    0 1 2 3 4
    HW: 3
    0 1 2 3
    HW: 3
    HW: 3
    b2 (follower)
    b3 (follower)
    ISR: {b1, b2, b3}
    writes
    Follower fails


  61. @tyler_treat
    Follower fails
    0 1 2 3 4 5
    b1 (leader)
    0 1 2 3 4
    HW: 3
    0 1 2 3
    HW: 3
    HW: 3
    b2 (follower)
    b3 (follower)
    ISR: {b1, b2, b3}
    writes


  62. @tyler_treat
    Follower fails
    0 1 2 3 4 5
    b1 (leader)
    0 1 2 3 4
    HW: 3
    0 1 2 3
    HW: 3
    HW: 3
    b2 (follower)
    b3 (follower)
    ISR: {b1, b2, b3}
    writes
    replica.lag.time.max.ms


  63. @tyler_treat
    Follower fails
    0 1 2 3 4 5
    b1 (leader)
    0 1 2 3 4
    HW: 3
    0 1 2 3
    HW: 3
    HW: 3
    b2 (follower)
    b3 (follower)
    ISR: {b1, b2}
    writes
    replica.lag.time.max.ms


  64. @tyler_treat
    Follower fails
    0 1 2 3 4 5
    b1 (leader)
    0 1 2 3 4
    HW: 5
    0 1 2 3
    HW: 5
    HW: 3
    b2 (follower)
    b3 (follower)
    ISR: {b1, b2}
    writes
    5


  65. @tyler_treat
    Follower fails
    0 1 2 3 4 5
    b1 (leader)
    0 1 2 3 4
    HW: 5
    0 1 2 3
    HW: 5
    HW: 3
    b2 (follower)
    b3 (follower)
    ISR: {b1, b2}
    writes
    5


  66. @tyler_treat
    Follower fails
    0 1 2 3 4 5
    b1 (leader)
    0 1 2 3 4
    HW: 5
    0 1 2 3
    HW: 5
    HW: 4
    b2 (follower)
    b3 (follower)
    ISR: {b1, b2}
    writes
    5
    4


  67. @tyler_treat
    Follower fails
    0 1 2 3 4 5
    b1 (leader)
    0 1 2 3 4
    HW: 5
    0 1 2 3
    HW: 5
    HW: 5
    b2 (follower)
    b3 (follower)
    ISR: {b1, b2}
    writes
    5
    4 5


  68. @tyler_treat
    Follower fails
    0 1 2 3 4 5
    b1 (leader)
    0 1 2 3 4
    HW: 5
    0 1 2 3
    HW: 5
    HW: 5
    b2 (follower)
    b3 (follower)
    ISR: {b1, b2, b3}
    writes
    5
    4 5


  69. @tyler_treat
    Replication in NATS Streaming
    1. Raft replicates client state, messages, and
    subscriptions

    2. Conceptually, two logs: Raft log and message log

    3. Parallels work implementing Raft in RabbitMQ
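
NATS Streaming's clustering is built on the hashicorp/raft library. The sketch below shows the general shape of the finite state machine one implements so that committed Raft entries (publishes, subscriptions, and so on) get applied to local state; it is an illustrative skeleton with made-up types, not the actual NATS Streaming FSM.

```go
// Skeleton of a hashicorp/raft FSM: Raft replicates opaque operations, and
// the FSM applies each committed entry to local state.
package stream

import (
	"io"

	"github.com/hashicorp/raft"
)

type messageStore interface {
	Append(data []byte) (uint64, error) // returns the logical offset
}

type streamFSM struct {
	store messageStore
}

// Apply is invoked once an entry has been committed by a quorum.
func (f *streamFSM) Apply(l *raft.Log) interface{} {
	// A real system would decode l.Data into an operation type
	// (publish, subscribe, ack, ...). Here every entry is treated as a publish.
	offset, err := f.store.Append(l.Data)
	if err != nil {
		return err
	}
	return offset
}

func (f *streamFSM) Snapshot() (raft.FSMSnapshot, error) {
	// Snapshotting compacts the Raft log; omitted in this sketch.
	return nil, nil
}

func (f *streamFSM) Restore(rc io.ReadCloser) error {
	defer rc.Close()
	return nil
}
```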


  70. @tyler_treat
    http://thesecretlivesofdata.com/raft


  71. @tyler_treat
    Replication in NATS Streaming
    • Initially used a Raft group per topic plus a separate metadata group
    • A couple of issues with this:
    -> Topic scalability
    -> Increased complexity due to lack of ordering between Raft groups


  72. @tyler_treat
    Challenges
    1. Scaling topics


  73. @tyler_treat
    Scaling Raft
    With a single topic, one node is elected leader and it heartbeats
    messages to followers.


  74. @tyler_treat
    Scaling Raft
    As the number of topics increases, so does the number
    of Raft groups.


  75. @tyler_treat
    Scaling Raft
    Technique 1: run a fixed number of Raft groups and use
    a consistent hash to map a topic to a group.
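
A sketch of Technique 1, assuming a fixed number of groups and a plain FNV hash; a production system might prefer a true consistent-hash ring so groups can be added with minimal remapping.

```go
// Map each topic onto one of a fixed number of Raft groups with a hash.
package main

import (
	"fmt"
	"hash/fnv"
)

const numRaftGroups = 8

func groupForTopic(topic string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(topic))
	return h.Sum32() % numRaftGroups
}

func main() {
	for _, t := range []string{"purchases", "inventory", "clicks"} {
		fmt.Printf("topic %q -> raft group %d\n", t, groupForTopic(t))
	}
}
```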


  76. @tyler_treat
    Scaling Raft
    Technique 2: run an entire node’s worth of topics as a
    single group using a layer on top of Raft.
    https://www.cockroachlabs.com/blog/scaling-raft


  77. @tyler_treat
    Scaling Raft
    Technique 3: use a single Raft group for all topics and
    metadata.


  78. @tyler_treat
    Challenges
    1. Scaling topics
    2. Dual writes


  79. @tyler_treat
    Dual Writes
    Raft
    Store
    committed


  80. @tyler_treat
    Dual Writes
    msg 1
    Raft
    Store
    committed


  81. @tyler_treat
    Dual Writes
    msg 1 msg 2
    Raft
    Store
    committed


  82. @tyler_treat
    Dual Writes
    msg 1 msg 2
    Raft
    msg 1 msg 2
    Store
    committed


  83. @tyler_treat
    Dual Writes
    msg 1 msg 2 sub
    Raft
    msg 1 msg 2
    Store
    committed


  84. @tyler_treat
    Dual Writes
    msg 1 msg 2 sub msg 3
    Raft
    msg 1 msg 2
    Store
    committed


  85. @tyler_treat
    Dual Writes
    msg 1 msg 2 sub msg 3
    add
    peer
    msg 4
    Raft
    msg 1 msg 2 msg 3
    Store
    committed


  86. @tyler_treat
    Dual Writes
    msg 1 msg 2 sub msg 3
    add
    peer
    msg 4
    Raft
    msg 1 msg 2 msg 3
    Store
    committed


  87. @tyler_treat
    Dual Writes
    msg 1 msg 2 sub msg 3
    add
    peer
    msg 4
    Raft
    msg 1 msg 2 msg 3 msg 4
    Store
    commit


  88. @tyler_treat
    Dual Writes
    msg 1 msg 2 sub msg 3
    add
    peer
    msg 4
    Raft
    msg 1 msg 2 msg 3 msg 4
    Store
    0 1 2 3 4 5
    0 1 2 3
    physical offset
    logical offset


  89. @tyler_treat
    Dual Writes
    msg 1 msg 2 sub msg 3
    add
    peer
    msg 4
    Raft
    msg 1 msg 2
    Index
    0 1 2 3 4 5
    0 1 2 3
    physical offset
    logical offset
    msg 3 msg 4


  90. @tyler_treat
    Treat the Raft log as our message
    write-ahead log.
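
Following the dual-writes slides above, one way to picture this: keep the single Raft log as the source of truth and derive the message log from it by assigning logical offsets only to committed message entries, skipping subscriptions and configuration changes. The types below are illustrative.

```go
// Derive logical message offsets from physical Raft log indexes, skipping
// non-message entries (subscriptions, config changes).
package main

import "fmt"

type entryType int

const (
	entryMsg entryType = iota
	entrySub
	entryAddPeer
)

type raftEntry struct {
	index uint64 // physical offset in the Raft log
	typ   entryType
}

// buildOffsetIndex assigns logical offsets 0..n-1 to message entries only.
func buildOffsetIndex(log []raftEntry) map[uint64]uint64 {
	idx := make(map[uint64]uint64) // logical offset -> physical Raft index
	var logical uint64
	for _, e := range log {
		if e.typ == entryMsg {
			idx[logical] = e.index
			logical++
		}
	}
	return idx
}

func main() {
	log := []raftEntry{
		{0, entryMsg}, {1, entryMsg}, {2, entrySub},
		{3, entryMsg}, {4, entryAddPeer}, {5, entryMsg},
	}
	fmt.Println(buildOffsetIndex(log)) // map[0:0 1:1 2:3 3:5]
}
```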


  91. @tyler_treat
    Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  92. @tyler_treat
    Performance
    1. Publisher acks 

    -> broker acks on commit (slow but safe)

    -> broker acks on local log append (fast but unsafe)

    -> publisher doesn’t wait for ack (fast but unsafe) 

    2. Don’t fsync, rely on replication for durability

    3. Keep disk access sequential and maximize zero-copy reads

    4. Batch aggressively
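
A sketch of point 4, batching aggressively: buffer incoming writes and flush either when a batch fills or when a short linger timer fires, trading a little latency for throughput. The `batcher` type and its parameters are hypothetical.

```go
// Batch accumulator: flush when the batch is full or the linger timer fires.
package main

import (
	"fmt"
	"time"
)

type batcher struct {
	in     chan []byte
	max    int
	linger time.Duration
	flush  func([][]byte)
}

func (b *batcher) run() {
	batch := make([][]byte, 0, b.max)
	timer := time.NewTimer(b.linger)
	for {
		select {
		case msg := <-b.in:
			batch = append(batch, msg)
			if len(batch) >= b.max {
				b.flush(batch)
				batch = batch[:0]
				timer.Reset(b.linger)
			}
		case <-timer.C:
			if len(batch) > 0 {
				b.flush(batch)
				batch = batch[:0]
			}
			timer.Reset(b.linger)
		}
	}
}

func main() {
	b := &batcher{
		in:     make(chan []byte, 1024),
		max:    100,
		linger: 5 * time.Millisecond,
		flush:  func(batch [][]byte) { fmt.Println("flushing", len(batch), "messages") },
	}
	go b.run()
	for i := 0; i < 250; i++ {
		b.in <- []byte("msg")
	}
	time.Sleep(20 * time.Millisecond)
}
```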


  93. @tyler_treat
    Questions:

    -> How do we ensure continuity of reads/writes?
    -> How do we replicate data?
    -> How do we ensure replicas are consistent?
    -> How do we keep things fast?
    -> How do we ensure data is durable?


  94. @tyler_treat
    Durability
    1. Quorum guarantees durability
    -> Comes for free with Raft
    -> In Kafka, need to configure min.insync.replicas and acks, e.g. topic
    with replication factor 3, min.insync.replicas=2, and acks=all
    2. Disable unclean leader elections
    3. At odds with availability, i.e. no quorum == no reads/writes
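
A toy version of the Kafka rule above (not Kafka's code): with acks=all, the leader only acknowledges produces while the ISR still contains at least min.insync.replicas members, which is exactly where durability trades against availability.

```go
// Quorum durability check: refuse to ack when the ISR has shrunk below
// min.insync.replicas.
package main

import (
	"errors"
	"fmt"
)

var errNotEnoughReplicas = errors.New("not enough in-sync replicas")

func canAck(isrSize, minInsyncReplicas int) error {
	if isrSize < minInsyncReplicas {
		return errNotEnoughReplicas
	}
	return nil
}

func main() {
	// Replication factor 3, min.insync.replicas=2, acks=all:
	fmt.Println(canAck(3, 2)) // <nil>  -> write is acknowledged
	fmt.Println(canAck(1, 2)) // error  -> availability traded for durability
}
```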


  95. @tyler_treat
    Scaling Message Delivery


  96. @tyler_treat
    Scaling Message Delivery
    1. Partitioning


  97. @tyler_treat
    Partitioning is how we scale linearly.


  98. @tyler_treat
    caches
    databases
    indexes
    writes


  99. @tyler_treat
    HELLA WRITES
    caches
    databases
    indexes


  100. @tyler_treat
    caches
    databases
    indexes
    HELLA WRITES


  101. @tyler_treat
    caches
    databases
    indexes
    writes
    writes
    writes
    writes
    Topic: purchases
    Topic: inventory


  102. @tyler_treat
    caches
    databases
    indexes
    writes
    writes
    writes
    writes
    Topic: purchases
    Topic: inventory
    Accounts A-M
    Accounts N-Z
    SKUs A-M
    SKUs N-Z
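
A sketch of the partitioning shown above, routing each write by its key (account or SKU) with a simple A–M / N–Z range split; hash partitioning is the more common choice, and this function is purely illustrative.

```go
// Range partitioner: route a message to a partition of its topic based on
// the first letter of its key. Assumes non-empty keys.
package main

import (
	"fmt"
	"strings"
)

func partitionFor(topic, key string) string {
	first := strings.ToUpper(key)[0]
	if first <= 'M' {
		return topic + "-A-M"
	}
	return topic + "-N-Z"
}

func main() {
	fmt.Println(partitionFor("purchases", "alice"))   // purchases-A-M
	fmt.Println(partitionFor("inventory", "widget9")) // inventory-N-Z
}
```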


  103. @tyler_treat
    Scaling Message Delivery
    1. Partitioning
    2. High fan-out


  104. @tyler_treat
    Kinesis Fan-Out
    consumers
    shard-1
    consumers
    shard-2
    consumers
    shard-3
    writes


  105. @tyler_treat
    Replication in Kafka and NATS
    Streaming is purely a means of HA.


  106. @tyler_treat
    High Fan-Out
    1. Observation: with an immutable log, there are no
    stale/phantom reads

    2. This should make it “easy” (in theory) to scale to a
    large number of consumers

    3. With Raft, we can use “non-voters” to act as read
    replicas and load balance consumers
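
With hashicorp/raft, adding such a read replica looks roughly like the sketch below: a non-voter receives the replicated log but never counts toward quorum. The server ID and address are placeholders.

```go
// Adding a non-voting member (read replica) to a hashicorp/raft cluster.
package cluster

import (
	"time"

	"github.com/hashicorp/raft"
)

func addReadReplica(r *raft.Raft) error {
	// Must be called on the leader.
	future := r.AddNonvoter(
		raft.ServerID("read-replica-1"),
		raft.ServerAddress("10.0.0.4:7000"),
		0,             // prevIndex: 0 skips the concurrent-change check
		5*time.Second, // timeout
	)
	return future.Error()
}
```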


  107. @tyler_treat
    Scaling Message Delivery
    1. Partitioning
    2. High fan-out
    3. Push vs. pull


  108. @tyler_treat
    Push vs. Pull
    • In Kafka, consumers pull data from brokers
    • In NATS Streaming, brokers push data to consumers
    • Design implications:
    • Fan-out
    • Flow control
    • Optimizing for latency vs. throughput
    • Client complexity
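
One way the flow-control implication shows up in a push model: the server caps the number of unacknowledged messages in flight per subscriber and resumes delivery as acks come back (similar in spirit to a max-in-flight subscription setting). The types below are hypothetical.

```go
// Push-side flow control: at most maxInflight unacked messages per subscriber.
package main

import "fmt"

type subscriber struct {
	maxInflight int
	inflight    int
}

// canPush reports whether another message may be pushed to the subscriber.
func (s *subscriber) canPush() bool { return s.inflight < s.maxInflight }

func (s *subscriber) push(offset uint64) { s.inflight++ }
func (s *subscriber) ack(offset uint64)  { s.inflight-- }

func main() {
	sub := &subscriber{maxInflight: 2}
	for off := uint64(0); off < 5; off++ {
		if !sub.canPush() {
			fmt.Println("pausing delivery at offset", off, "until acks arrive")
			sub.ack(off - 2) // simulate an ack for an earlier message
		}
		sub.push(off)
	}
}
```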


  109. @tyler_treat
    Trade-Offs and Lessons Learned


  110. @tyler_treat
    Trade-Offs and Lessons Learned
    1. Competing goals


  111. @tyler_treat
    Competing Goals
    1. Performance
    -> Easy to make something fast that’s not fault-tolerant or scalable
    -> Simplicity of mechanism makes this easier
    -> Simplicity of “UX” makes this harder
    2. Scalability and fault-tolerance
    -> At odds with simplicity
    -> Cannot be an afterthought
    3. Simplicity
    -> Simplicity of mechanism shifts complexity elsewhere (e.g. client)
    -> Easy to let server handle complexity; hard when that needs to be
    distributed, consistent, and fast


  112. @tyler_treat
    Trade-Offs and Lessons Learned
    1. Competing goals
    2. Aim for simplicity


  113. @tyler_treat
    Distributed systems are complex enough.

    Simple is usually better (and faster).


  114. @tyler_treat
    “A complex system that works
    is invariably found to have
    evolved from a simple system
    that works.”


  115. @tyler_treat
    Trade-Offs and Lessons Learned
    1. Competing goals
    2. Aim for simplicity
    3. You can’t effectively bolt on fault-tolerance


  116. @tyler_treat
    “A complex system designed from
    scratch never works and cannot
    be patched up to make it work.
    You have to start over, beginning
    with a working simple system.”


  117. @tyler_treat
    Trade-Offs and Lessons Learned
    1. Competing goals
    2. Aim for simplicity
    3. You can’t effectively bolt on fault-tolerance
    4. Lean on existing work


  118. @tyler_treat
    Don’t roll your own coordination protocol,

    use Raft, ZooKeeper, etc.


  119. @tyler_treat
    Trade-Offs and Lessons Learned
    1. Competing goals
    2. Aim for simplicity
    3. You can’t effectively bolt on fault-tolerance
    4. Lean on existing work
    5. There are probably edge cases for which you
    haven’t written tests


  120. @tyler_treat
    There are many failure modes, and you can
    only write so many tests.


    Formal methods and property-based/
    generative testing can help.


  121. @tyler_treat
    @tyler_treat


  122. @tyler_treat
    Trade-Offs and Lessons Learned
    1. Competing goals
    2. Aim for simplicity
    3. You can’t effectively bolt on fault-tolerance
    4. Lean on existing work
    5. There are probably edge cases for which you
    haven’t written tests
    6. Be honest with your users


  123. @tyler_treat
    Don’t try to be everything to everyone.

    Be explicit about design decisions, trade-offs, guarantees, defaults, etc.


  124. @tyler_treat
    https://bravenewgeek.com/tag/building-a-distributed-log-from-scratch/


  125. @tyler_treat
    Thanks!
    bravenewgeek.com

    realkinetic.com
