
Building a Distributed Message Log from Scratch - SCaLE 16x

Apache Kafka has shown that the log is a powerful abstraction for data-intensive applications. It can play a key role in managing data and distributing it across the enterprise efficiently. Vital to any data plane is not just performance, but availability and scalability. In this session, we examine what a distributed log is, how it works, and how it can achieve these goals. Specifically, we'll discuss lessons learned while building NATS Streaming, a reliable messaging layer built on NATS that provides similar semantics. We'll cover core components like leader election, data replication, log persistence, and message delivery. Come learn about distributed systems!

Tyler Treat

March 11, 2018

Transcript

  1. @tyler_treat Building a Distributed Message Log from Scratch Tyler Treat

    · SCALE 16x · 3/11/18
  2. @tyler_treat Tyler Treat
     - Managing Partner @ Real Kinetic
     - Messaging & distributed systems
     - Former nats.io core contributor
     - bravenewgeek.com
  3. @tyler_treat @tyler_treat

  4. @tyler_treat Outline
     - The Log
       -> What?
       -> Why?
     - Implementation
       -> Storage mechanics
       -> Data-replication techniques
       -> Scaling message delivery
       -> Trade-offs and lessons learned
  5. @tyler_treat The Log

  6. @tyler_treat The Log A totally-ordered, append-only data structure.

  7. @tyler_treat The Log 0

  8. @tyler_treat 0 1 The Log

  9. @tyler_treat 0 1 2 The Log

  10. @tyler_treat 0 1 2 3 The Log

  11. @tyler_treat 0 1 2 3 4 The Log

  12. @tyler_treat 0 1 2 3 4 5 The Log

  13. @tyler_treat 0 1 2 3 4 5 newest record oldest

    record The Log
  14. @tyler_treat newest record oldest record The Log

  15. @tyler_treat Logs record what happened and when.

  16. @tyler_treat caches databases indexes writes

  17. @tyler_treat https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

  18. @tyler_treat Examples in the wild:
     -> Apache Kafka
     -> Amazon Kinesis
     -> NATS Streaming
     -> Apache Pulsar
  19. @tyler_treat Key Goals: -> Performance -> High Availability -> Scalability

  20. @tyler_treat The purpose of this talk is to learn…
     -> a bit about the internals of a log abstraction.
     -> how it can achieve these goals.
     -> some applied distributed systems theory.
  21. @tyler_treat You will probably never need to build something like

    this yourself, but it helps to know how it works.
  22. @tyler_treat Implemen- tation

  23. @tyler_treat Implemen- tation Don’t try this at home.

  24. @tyler_treat Storage
 Mechanics

  25. @tyler_treat Some first principles…
     • The log is an ordered, immutable sequence of messages
     • Messages are atomic (meaning they can’t be broken up)
     • The log has a notion of message retention based on some policies (time, number of messages, bytes, etc.)
     • The log can be played back from any arbitrary position
     • The log is stored on disk
     • Sequential disk access is fast*
     • OS page cache means sequential access often avoids disk
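     [Editor's note: to make these first principles concrete, here is a minimal in-memory sketch in Go. It is illustrative only, not the NATS Streaming implementation: an append-only, totally ordered log whose records get monotonic offsets and can be played back from any position. A real log would persist records to segment files on disk.]

```go
package log

import "sync"

// Record is a single, atomic entry in the log.
type Record struct {
	Offset uint64
	Data   []byte
}

// Log is a minimal in-memory, append-only, totally ordered log.
// A real implementation would persist records to segment files on disk.
type Log struct {
	mu      sync.RWMutex
	records []Record
}

// Append adds a record to the tail of the log and returns its offset.
func (l *Log) Append(data []byte) uint64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	offset := uint64(len(l.records))
	l.records = append(l.records, Record{Offset: offset, Data: data})
	return offset
}

// Read returns all records starting at the given offset, allowing
// playback from any arbitrary position.
func (l *Log) Read(from uint64) []Record {
	l.mu.RLock()
	defer l.mu.RUnlock()
	if from >= uint64(len(l.records)) {
		return nil
	}
	return l.records[from:]
}
```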
  26. @tyler_treat http://queue.acm.org/detail.cfm?id=1563874

  27. @tyler_treat iostat
     avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
              13.53   0.00    11.28     0.00    0.00   75.19
     Device:   tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
     xvda     0.00         0.00         0.00          0          0
  28. @tyler_treat Storage Mechanics log file 0

  29. @tyler_treat Storage Mechanics log file 0 1

  30. @tyler_treat Storage Mechanics log file 0 1 2

  31. @tyler_treat Storage Mechanics log file 0 1 2 3

  32. @tyler_treat Storage Mechanics log file 0 1 2 3 4

  33. @tyler_treat Storage Mechanics log file 0 1 2 3 4

    5
  34. @tyler_treat Storage Mechanics log file … 0 1 2 3

    4 5
  35. @tyler_treat Storage Mechanics
     log segment 0 file: 0 1 2 | log segment 3 file: 3 4 5
  36. @tyler_treat Storage Mechanics
     log segment 0 file: 0 1 2 | log segment 3 file: 3 4 5
     index segment 0 file: 0 1 2 | index segment 3 file: 0 1 2
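     [Editor's note: a sketch of how segmented storage can be navigated, assuming Kafka-style segments named by their base offset and a dense index per segment. This is illustrative, not the actual Kafka or NATS Streaming on-disk format.]

```go
package log

import "sort"

// indexEntry maps an offset relative to the segment's base offset
// to a physical byte position in the segment's log file.
type indexEntry struct {
	RelativeOffset uint64
	Position       int64
}

// segment pairs a log file with its index; it is named by the base
// offset of the first record it contains (e.g. segment 0, segment 3).
type segment struct {
	baseOffset uint64
	index      []indexEntry // dense: one entry per record, sorted
}

// findSegment picks the segment whose base offset is the largest one
// that is <= the requested offset (segments sorted by baseOffset).
func findSegment(segments []*segment, offset uint64) *segment {
	i := sort.Search(len(segments), func(i int) bool {
		return segments[i].baseOffset > offset
	})
	if i == 0 {
		return nil
	}
	return segments[i-1]
}

// position looks up the physical byte position of a logical offset,
// assuming one index entry per record.
func (s *segment) position(offset uint64) int64 {
	rel := offset - s.baseOffset
	return s.index[rel].Position
}
```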
  37. @tyler_treat Zero-Copy Reads user space kernel space page cache disk

    socket NIC application read send
  38. @tyler_treat Zero-Copy Reads user space kernel space page cache disk

    NIC sendfile
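     [Editor's note: a hedged Go sketch of the zero-copy read path using sendfile(2) on Linux via the syscall package, so segment bytes flow from the page cache to the socket without passing through user space. The sendSegment helper and its parameters are illustrative; note that io.Copy from an *os.File to a *net.TCPConn already uses sendfile under the hood on Linux.]

```go
//go:build linux

package broker

import (
	"net"
	"os"
	"syscall"
)

// sendSegment streams part of a log segment file to a consumer's TCP
// connection using sendfile(2): the kernel copies from the page cache
// straight to the socket, with no user-space buffer in between.
func sendSegment(conn *net.TCPConn, path string, offset int64, count int) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// File() duplicates the connection's descriptor; it keeps this
	// example short at the cost of bypassing the runtime's poller.
	connFile, err := conn.File()
	if err != nil {
		return err
	}
	defer connFile.Close()

	_, err = syscall.Sendfile(int(connFile.Fd()), int(f.Fd()), &offset, count)
	return err
}
```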
  39. @tyler_treat Left as an exercise for the listener…
     -> Batching
     -> Compression
  40. @tyler_treat Data-Replication
 Techniques

  41. @tyler_treat caches databases indexes writes

  42. @tyler_treat caches databases indexes writes

  43. @tyler_treat caches databases indexes writes

  44. @tyler_treat How do we achieve high availability and fault tolerance?

  45. @tyler_treat Questions:
 -> How do we ensure continuity of reads/writes?

    -> How do we replicate data? -> How do we ensure replicas are consistent? -> How do we keep things fast? -> How do we ensure data is durable?
  46. @tyler_treat Questions:
 -> How do we ensure continuity of reads/writes?

    -> How do we replicate data? -> How do we ensure replicas are consistent? -> How do we keep things fast? -> How do we ensure data is durable?
  47. @tyler_treat caches databases indexes writes

  48. @tyler_treat Questions:
 -> How do we ensure continuity of reads/writes?

    -> How do we replicate data? -> How do we ensure replicas are consistent? -> How do we keep things fast? -> How do we ensure data is durable?
  49. @tyler_treat Data-Replication Techniques
     1. Gossip/multicast protocols: epidemic broadcast trees, bimodal multicast, SWIM, HyParView
     2. Consensus protocols: 2PC/3PC, Paxos, Raft, Zab, chain replication
  50. @tyler_treat Questions:
 -> How do we ensure continuity of reads/writes?

    -> How do we replicate data? -> How do we ensure replicas are consistent? -> How do we keep things fast? -> How do we ensure data is durable?
  51. @tyler_treat Data-Replication Techniques
     1. Gossip/multicast protocols: epidemic broadcast trees, bimodal multicast, SWIM, HyParView, NeEM
     2. Consensus protocols: 2PC/3PC, Paxos, Raft, Zab, chain replication
  52. @tyler_treat Replication in Kafka
     1. Select a leader
     2. Maintain in-sync replica set (ISR) (initially every replica)
     3. Leader writes messages to write-ahead log (WAL)
     4. Leader commits messages when all replicas in ISR ack
     5. Leader maintains high-water mark (HW) of last committed message
     6. Piggyback HW on replica fetch responses, which replicas periodically checkpoint to disk
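     [Editor's note: a toy Go model of steps 4–6 above, not Kafka's actual code: the leader advances the high-water mark to the minimum offset acknowledged across the ISR, and only messages below the HW count as committed.]

```go
package replication

// follower tracks the highest offset a replica has acknowledged.
type follower struct {
	id      string
	ackedTo uint64 // highest log offset this replica has fetched/acked
}

// leader is a toy model of a Kafka-style partition leader.
type leader struct {
	log       [][]byte // write-ahead log of messages
	isr       []*follower
	highWater uint64 // everything below this offset is committed
}

// append writes a message to the leader's local log; it is not yet committed.
func (l *leader) append(msg []byte) uint64 {
	l.log = append(l.log, msg)
	return uint64(len(l.log)) // log end offset
}

// maybeAdvanceHW commits messages once every replica in the ISR has
// acknowledged them: the new high-water mark is the minimum acked
// offset across the ISR, and it is piggybacked on fetch responses.
func (l *leader) maybeAdvanceHW() {
	min := uint64(len(l.log))
	for _, f := range l.isr {
		if f.ackedTo < min {
			min = f.ackedTo
		}
	}
	if min > l.highWater {
		l.highWater = min
	}
}
```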
  53. @tyler_treat 0 1 2 3 4 5 b1 (leader) 0

    1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes Replication in Kafka
  54. @tyler_treat Failure Modes 1. Leader fails

  55. @tyler_treat 0 1 2 3 4 5 b1 (leader) 0

    1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes Leader fails
  56. @tyler_treat 0 1 2 3 4 5 b1 (leader) 0

    1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes Leader fails
  57. @tyler_treat 0 1 2 3 4 5 b1 (leader) 0

    1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes Leader fails
  58. @tyler_treat 0 1 2 3 HW: 3 0 1 2

    3 HW: 3 b2 (leader) b3 (follower) ISR: {b2, b3} writes Leader fails
  59. @tyler_treat Failure Modes 1. Leader fails
 2. Follower fails

  60. @tyler_treat 0 1 2 3 4 5 b1 (leader) 0

    1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes Follower fails
  61. @tyler_treat Follower fails 0 1 2 3 4 5 b1

    (leader) 0 1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes
  62. @tyler_treat Follower fails 0 1 2 3 4 5 b1

    (leader) 0 1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes replica.lag.time.max.ms
  63. @tyler_treat Follower fails 0 1 2 3 4 5 b1

    (leader) 0 1 2 3 4 HW: 3 0 1 2 3 HW: 3 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2} writes replica.lag.time.max.ms
  64. @tyler_treat Follower fails 0 1 2 3 4 5 b1

    (leader) 0 1 2 3 4 HW: 5 0 1 2 3 HW: 5 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2} writes 5
  65. @tyler_treat Follower fails 0 1 2 3 4 5 b1

    (leader) 0 1 2 3 4 HW: 5 0 1 2 3 HW: 5 HW: 3 b2 (follower) b3 (follower) ISR: {b1, b2} writes 5
  66. @tyler_treat Follower fails 0 1 2 3 4 5 b1

    (leader) 0 1 2 3 4 HW: 5 0 1 2 3 HW: 5 HW: 4 b2 (follower) b3 (follower) ISR: {b1, b2} writes 5 4
  67. @tyler_treat Follower fails 0 1 2 3 4 5 b1

    (leader) 0 1 2 3 4 HW: 5 0 1 2 3 HW: 5 HW: 5 b2 (follower) b3 (follower) ISR: {b1, b2} writes 5 4 5
  68. @tyler_treat Follower fails 0 1 2 3 4 5 b1

    (leader) 0 1 2 3 4 HW: 5 0 1 2 3 HW: 5 HW: 5 b2 (follower) b3 (follower) ISR: {b1, b2, b3} writes 5 4 5
  69. @tyler_treat Replication in NATS Streaming
     1. Raft replicates client state, messages, and subscriptions
     2. Conceptually, two logs: Raft log and message log
     3. Parallels work implementing Raft in RabbitMQ
  70. @tyler_treat http://thesecretlivesofdata.com/raft

  71. @tyler_treat Replication in NATS Streaming
     • Initially used a Raft group per topic and a separate metadata group
     • A couple of issues with this:
       -> Topic scalability
       -> Increased complexity due to lack of ordering between Raft groups
  72. @tyler_treat Challenges 1. Scaling topics

  73. @tyler_treat Scaling Raft With a single topic, one node is

    elected leader and it heartbeats messages to followers
  74. @tyler_treat Scaling Raft As the number of topics increases, so

    does the number of Raft groups.
  75. @tyler_treat Scaling Raft Technique 1: run a fixed number of

    Raft groups and use a consistent hash to map a topic to a group.
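     [Editor's note: one way Technique 1 might look, sketched as a minimal consistent-hash ring in Go. The ring, group IDs, and virtual-node count are illustrative, not the NATS Streaming implementation; the point is that the number of Raft groups stays fixed while topics are spread across them.]

```go
package broker

import (
	"hash/fnv"
	"sort"
	"strconv"
)

// ring is a minimal consistent-hash ring mapping topics to a fixed set
// of Raft groups, so adding or removing a group only remaps a fraction
// of topics.
type ring struct {
	points []uint32          // sorted hash points on the ring
	groups map[uint32]string // hash point -> Raft group ID
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newRing places each Raft group on the ring at several virtual points.
func newRing(groupIDs []string, virtualNodes int) *ring {
	r := &ring{groups: make(map[uint32]string)}
	for _, id := range groupIDs {
		for i := 0; i < virtualNodes; i++ {
			p := hashKey(id + "#" + strconv.Itoa(i))
			r.points = append(r.points, p)
			r.groups[p] = id
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// groupFor maps a topic to the first group point at or after its hash,
// wrapping around the ring if necessary.
func (r *ring) groupFor(topic string) string {
	h := hashKey(topic)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.groups[r.points[i]]
}
```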
  76. @tyler_treat Scaling Raft Technique 2: run an entire node’s worth

    of topics as a single group using a layer on top of Raft. https://www.cockroachlabs.com/blog/scaling-raft
  77. @tyler_treat Scaling Raft Technique 3: use a single Raft group

    for all topics and metadata.
  78. @tyler_treat Challenges 1. Scaling topics 2. Dual writes

  79. @tyler_treat Dual Writes Raft Store committed

  80. @tyler_treat Dual Writes msg 1 Raft Store committed

  81. @tyler_treat Dual Writes msg 1 msg 2 Raft Store committed

  82. @tyler_treat Dual Writes msg 1 msg 2 Raft msg 1

    msg 2 Store committed
  83. @tyler_treat Dual Writes msg 1 msg 2 sub Raft msg

    1 msg 2 Store committed
  84. @tyler_treat Dual Writes msg 1 msg 2 sub msg 3

    Raft msg 1 msg 2 Store committed
  85. @tyler_treat Dual Writes msg 1 msg 2 sub msg 3

    add peer msg 4 Raft msg 1 msg 2 msg 3 Store committed
  86. @tyler_treat Dual Writes msg 1 msg 2 sub msg 3

    add peer msg 4 Raft msg 1 msg 2 msg 3 Store committed
  87. @tyler_treat Dual Writes msg 1 msg 2 sub msg 3

    add peer msg 4 Raft msg 1 msg 2 msg 3 msg 4 Store commit
  88. @tyler_treat Dual Writes msg 1 msg 2 sub msg 3

    add peer msg 4 Raft msg 1 msg 2 msg 3 msg 4 Store 0 1 2 3 4 5 0 1 2 3 physical offset logical offset
  89. @tyler_treat Dual Writes msg 1 msg 2 sub msg 3

    add peer msg 4 Raft msg 1 msg 2 Index 0 1 2 3 4 5 0 1 2 3 physical offset logical offset msg 3 msg 4
  90. @tyler_treat Treat the Raft log as our message write-ahead log.
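     [Editor's note: an illustrative sketch of that idea in Go. Committed Raft entries are applied in order; publishes claim logical offsets in the message log, while control operations (subscriptions, membership changes) are applied but not stored, which is why logical and physical offsets differ. The isPublish and applyControlOp helpers are hypothetical stand-ins for real decoding and control-plane handling.]

```go
package server

// entry is a committed Raft log entry; Index is the Raft log index and
// Data is the opaque operation (publish, subscribe, add peer, ...).
type entry struct {
	Index uint64
	Data  []byte
}

// fsm applies committed Raft entries in order. Because Raft only hands
// us entries after they are committed, the Raft log itself acts as the
// message write-ahead log and no separate "committed" marker is needed.
type fsm struct {
	messages   []entry // message log, in commit order
	nextOffset uint64  // logical offset, independent of Raft index
}

// apply is invoked once per committed entry, in log order.
func (f *fsm) apply(e entry) {
	if isPublish(e.Data) {
		// Only publishes consume a logical offset.
		f.messages = append(f.messages, e)
		f.nextOffset++
	} else {
		// Subscriptions, peer changes, etc. mutate other state.
		applyControlOp(e.Data)
	}
}

// isPublish and applyControlOp are hypothetical stubs.
func isPublish(data []byte) bool { return len(data) > 0 }
func applyControlOp(data []byte) {}
```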

  91. @tyler_treat Questions:
 -> How do we ensure continuity of reads/writes?

    -> How do we replicate data? -> How do we ensure replicas are consistent? -> How do we keep things fast? -> How do we ensure data is durable?
  92. @tyler_treat Performance
     1. Publisher acks
        -> broker acks on commit (slow but safe)
        -> broker acks on local log append (fast but unsafe)
        -> publisher doesn’t wait for ack (fast but unsafe)
     2. Don’t fsync, rely on replication for durability
     3. Keep disk access sequential and maximize zero-copy reads
     4. Batch aggressively
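     [Editor's note: a sketch of the three publisher ack modes as they might appear in a broker's publish path. The broker type and its appendLocal/onCommitted helpers are hypothetical stubs, included only to show where each ack fires.]

```go
package broker

// AckPolicy controls when (or whether) the broker acknowledges a publish.
type AckPolicy int

const (
	// AckNone: publisher doesn't wait for an ack (fast but unsafe).
	AckNone AckPolicy = iota
	// AckLeader: ack after appending to the leader's local log
	// (fast, but unsafe if the leader fails before replication).
	AckLeader
	// AckCommit: ack only once the message is committed, i.e.
	// replicated to the quorum/ISR (slow but safe).
	AckCommit
)

type broker struct{}

// Hypothetical stubs standing in for the real log and replication machinery.
func (b *broker) appendLocal(msg []byte) uint64        { return 0 }
func (b *broker) onCommitted(offset uint64, fn func()) {}

// publish appends the message and acks according to the policy.
func (b *broker) publish(msg []byte, policy AckPolicy, ack func()) {
	offset := b.appendLocal(msg)
	switch policy {
	case AckNone:
		// no ack at all
	case AckLeader:
		ack()
	case AckCommit:
		// ack fires later, once the commit point passes this offset.
		b.onCommitted(offset, ack)
	}
}
```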
  93. @tyler_treat Questions:
 -> How do we ensure continuity of reads/writes?

    -> How do we replicate data? -> How do we ensure replicas are consistent? -> How do we keep things fast? -> How do we ensure data is durable?
  94. @tyler_treat Durability
     1. Quorum guarantees durability
        -> Comes for free with Raft
        -> In Kafka, need to configure min.insync.replicas and acks, e.g. topic with replication factor 3, min.insync.replicas=2, and acks=all
     2. Disable unclean leader elections
     3. At odds with availability, i.e. no quorum == no reads/writes
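     [Editor's note: for the Kafka case, a hedged sketch of the producer-side half using the sarama Go client (one client library among several, not part of the talk). The topic-side settings, replication factor 3 and min.insync.replicas=2, are configured separately on the cluster.]

```go
package main

import (
	"log"

	"github.com/Shopify/sarama" // example third-party Kafka client
)

func main() {
	// Producer side: require acknowledgement from all in-sync replicas.
	config := sarama.NewConfig()
	config.Producer.RequiredAcks = sarama.WaitForAll // acks=all
	config.Producer.Return.Successes = true          // required for SyncProducer

	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, config)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// With replication factor 3 and min.insync.replicas=2 on the topic,
	// this publish is only acked once at least two replicas have it.
	_, _, err = producer.SendMessage(&sarama.ProducerMessage{
		Topic: "purchases",
		Value: sarama.StringEncoder("order-123"),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```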
  95. @tyler_treat Scaling Message
 Delivery

  96. @tyler_treat Scaling Message Delivery 1. Partitioning

  97. @tyler_treat Partitioning is how we scale linearly.

  98. @tyler_treat caches databases indexes writes

  99. @tyler_treat HELLA WRITES caches databases indexes

  100. @tyler_treat caches databases indexes HELLA WRITES

  101. @tyler_treat caches databases indexes writes writes writes writes Topic: purchases

    Topic: inventory
  102. @tyler_treat caches databases indexes writes writes writes writes Topic: purchases

    Topic: inventory Accounts A-M Accounts N-Z SKUs A-M SKUs N-Z
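     [Editor's note: a minimal sketch of key-based partitioning, hashing an account ID or SKU to a partition so writes for the same key stay ordered within one partition while load spreads across partitions. The FNV hash is illustrative; real clients choose their own partitioners.]

```go
package broker

import "hash/fnv"

// partitionFor maps a message key (e.g. an account ID or SKU) to one of
// a topic's partitions. All messages with the same key land on the same
// partition, preserving per-key ordering while scaling writes linearly.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % numPartitions
}
```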
  103. @tyler_treat Scaling Message Delivery 1. Partitioning 2. High fan-out

  104. @tyler_treat Kinesis Fan-Out consumers shard-1 consumers shard-2 consumers shard-3 writes

  105. @tyler_treat Replication in Kafka and NATS Streaming is purely a

    means of HA.
  106. @tyler_treat High Fan-Out
     1. Observation: with an immutable log, there are no stale/phantom reads
     2. This should make it “easy” (in theory) to scale to a large number of consumers
     3. With Raft, we can use “non-voters” to act as read replicas and load balance consumers
  107. @tyler_treat Scaling Message Delivery 1. Partitioning 2. High fan-out 3.

    Push vs. pull
  108. @tyler_treat Push vs. Pull
     • In Kafka, consumers pull data from brokers
     • In NATS Streaming, brokers push data to consumers
     • Design implications:
       • Fan-out
       • Flow control
       • Optimizing for latency vs. throughput
       • Client complexity
  109. @tyler_treat Trade-Offs and
 Lessons Learned

  110. @tyler_treat Trade-Offs and Lessons Learned 1. Competing goals

  111. @tyler_treat Competing Goals
     1. Performance
        -> Easy to make something fast that’s not fault-tolerant or scalable
        -> Simplicity of mechanism makes this easier
        -> Simplicity of “UX” makes this harder
     2. Scalability and fault-tolerance
        -> At odds with simplicity
        -> Cannot be an afterthought
     3. Simplicity
        -> Simplicity of mechanism shifts complexity elsewhere (e.g. client)
        -> Easy to let server handle complexity; hard when that needs to be distributed, consistent, and fast
  112. @tyler_treat Trade-Offs and Lessons Learned 1. Competing goals 2. Aim

    for simplicity
  113. @tyler_treat Distributed systems are complex enough.
 Simple is usually better

    (and faster).
  114. @tyler_treat “A complex system that works is invariably found to

    have evolved from a simple system that works.”
  115. @tyler_treat Trade-Offs and Lessons Learned 1. Competing goals 2. Aim

    for simplicity 3. You can’t effectively bolt on fault-tolerance
  116. @tyler_treat “A complex system designed from scratch never works and

    cannot be patched up to make it work. You have to start over, beginning with a working simple system.”
  117. @tyler_treat Trade-Offs and Lessons Learned 1. Competing goals 2. Aim

    for simplicity 3. You can’t effectively bolt on fault-tolerance 4. Lean on existing work
  118. @tyler_treat Don’t roll your own coordination protocol,
 use Raft, ZooKeeper,

    etc.
  119. @tyler_treat Trade-Offs and Lessons Learned 1. Competing goals 2. Aim

    for simplicity 3. You can’t effectively bolt on fault-tolerance 4. Lean on existing work 5. There are probably edge cases for which you haven’t written tests
  120. @tyler_treat There are many failure modes, and you can only write so many tests.
     Formal methods and property-based/generative testing can help.
  121. @tyler_treat @tyler_treat

  122. @tyler_treat Trade-Offs and Lessons Learned 1. Competing goals 2. Aim

    for simplicity 3. You can’t effectively bolt on fault-tolerance 4. Lean on existing work 5. There are probably edge cases for which you haven’t written tests 6. Be honest with your users
  123. @tyler_treat Don’t try to be everything to everyone.
 Be explicit

    about design decisions, trade- offs, guarantees, defaults, etc.
  124. @tyler_treat https://bravenewgeek.com/tag/building-a-distributed-log-from-scratch/

  125. @tyler_treat Thanks! bravenewgeek.com realkinetic.com