Kafka, the hard parts

This talk tries to summarize a lot of the lessons I've learned building systems on Kafka.

Chris Keathley

January 10, 2019

Transcript

  1. Kafka The Hard Parts Chris Keathley / @ChrisKeathley / keathley.io

  2. Kafka is great

  3. Kafka is just a log

  4. https://flic.kr/p/9aXr88

  5. https://flic.kr/p/9aXr88 Kafka

  6. Kafka https://flic.kr/p/9aXr88 (metaphor)

  7. Kafka Uses: Log aggregation; Analytics and activity tracking; Queuing; ETL; Messaging; Stream Processing

  8. Event Sourcing

  9. Kafka Uses: Log aggregation; Analytics and activity tracking; Queuing; ETL; Messaging; Stream Processing

  10. https://flic.kr/p/hrrbVx

  11. https://flic.kr/p/hrrbVx (still a metaphor) Kafka

  12. Large consequences for failure

  13. Joke about Mr. Glass

  14. Joke about Mr. Glass

  15. Iteration Is Hard

  16. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Finding Errors; Monitoring; Capacity Planning; #hottakes

  17. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Finding Errors; Monitoring; Capacity Planning; #hottakes

  18. Topic

  19. Topic

  20. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

  21. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

  22. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

  23. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

    Written to the File system
  24. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

  25. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

    Messages are ordered
  26. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

  27. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

  28. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

    Consumer
  29. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

    Consumer
  30. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

    Consumer
  31. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

    Consumer
  32. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

    Consumer
  33. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

    Consumer Consumer
  34. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

    Consumer Consumer
  35. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

    Consumer Consumer
  36. Partition 1 Partition 2 Partition 3 Partition 4 Partition 5

    Consumer Consumer
  37. Topic

  38. Topic Topic Topic Topic Broker

  39. Broker Broker Broker

  40. None
  41. Replication Leader

  42. Clients Java Client librdkafka

  43. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Finding Errors; Monitoring; Capacity Planning; #hottakes

  44. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Finding Errors; Monitoring; Capacity Planning; #hottakes

  45. Order is important User Events

  46. Order is important User Events

  47. Order is important Follow

  48. Order is important Follow

  49. Order is important Follow Message

  50. Order is important Follow Message Unfollow

  51. Order is important Follow Message Unfollow Causal

  52. Order is important Follow Message Unfollow Consumer

  53. Order is important Follow Message Unfollow Consumer

  54. Order is important Follow Message Unfollow Consumer

  55. Order is important Follow Message Unfollow

  56. Order is important Follow Message Unfollow Consumer

  57. Order is important Follow Message Unfollow Consumer

  58. Order is important Follow Message Unfollow Consumer

  59. Group records based on order

  60. Partitioner to_int(hash(key)) % partitions

  61. Partitioner to_int(hash(user_id)) % partitions

  62. Follow Message Unfollow Grouping Consumers

  63. Follow Message Unfollow Causal Grouping Consumers

  64. Follow Message Unfollow Grouping Consumers Follow Processor Message Processor

  65. Follow Message Unfollow Grouping Consumers Follow Processor Message Processor

  66. Follow Message Unfollow Grouping Consumers Follow Processor Message Processor

  67. Follow Message Unfollow Grouping Consumers User event processor

  68. Follow Message Unfollow Grouping Consumers User event processor

  69. Follow Message Unfollow Grouping Consumers User event processor

  70. User Events Create pipelines User event processor Messages

  71. User Events Create pipelines User event processor Messages Consumes

  72. User Events Create pipelines User event processor Messages Consumes Produces

  73. "Commander: Better Distributed Applications through CQRS and Event Sourcing" by

    Bobby Calderwood https://youtu.be/B1-gS0oEtYc
  74. The less dependence you can have between consumers the better

  75. Random partitioning is best if you can avoid ordering
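
The partitioner on slides 60 and 61 can be sketched in a few lines. This is a minimal Python illustration, not Kafka's actual partitioner: `zlib.crc32` stands in for whatever hash the client library uses, and `partition_for`, `route`, and the sample events are hypothetical names made up for the example. It shows that keying by `user_id` puts all of one user's events in a single partition, so their relative order (Follow, Message, Unfollow) survives.

```python
import zlib


def partition_for(key: str, partitions: int) -> int:
    """to_int(hash(key)) % partitions, with CRC32 as a stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def route(events, partitions):
    """Append each (user_id, event_type) pair to the log of its partition."""
    logs = {p: [] for p in range(partitions)}
    for user_id, event_type in events:
        logs[partition_for(user_id, partitions)].append((user_id, event_type))
    return logs


# Hypothetical user events, listed in the order they were produced.
events = [
    ("user-1", "follow"),
    ("user-2", "follow"),
    ("user-1", "message"),
    ("user-1", "unfollow"),
]

logs = route(events, partitions=5)

# Every event for user-1 sits in one partition, in production order.
user_1_log = logs[partition_for("user-1", 5)]
user_1_events = [event for user_id, event in user_1_log if user_id == "user-1"]
print(user_1_events)
```

Ordering is only preserved per key. Events for different users may land in different partitions and be consumed in any relative order.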

  76. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Finding Errors; Monitoring; Capacity Planning; #hottakes

  77. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Finding Errors; Monitoring; Capacity Planning; #hottakes

  78. Errors have the potential to wreck your day

  79. Consumer Errors

  80. Consumer Errors

  81. Consumer Errors

  82. Consumer Errors

  83. Consumer Errors Blocking the head of the line

  84. Consumer What should we do? Errors

  85. Non-Blocking vs. Blocking

  86. Non-Blocking vs. Blocking

  87. Non-Blocking Errors Consumer 42 1337 “Robert’);drop table students;—”

  88. Non-Blocking Errors Consumer 42 1337 “Robert’);drop table students;—” What do

    we do?
  89. Non-Blocking Errors Consumer

  90. Non-Blocking Errors Consumer

  91. Non-Blocking Errors Consumer Error Topic

  92. Non-Blocking Errors Consumer

  93. Non-Blocking Errors Consumer

  94. Non-Blocking Errors Consumer

  95. Non-Blocking vs. Blocking

  96. Non-Blocking vs. Blocking

  97. Blocking Errors Database Consumer

  98. Blocking Errors Database Consumer Process messages Store Information

  99. Blocking Errors Database Consumer

  100. Blocking Errors Database Consumer

  101. Blocking Errors Database Consumer What do we do?

  102. Blocking Errors Database Consumer Retry

  103. Blocking Errors Database Consumer Send alerts

  104. Skip non-blocking errors & Retry blocking errors

  105. Design errors out of existence
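
The rule from slide 104 (skip non-blocking errors, retry blocking errors) could look roughly like the following. This is a sketch under assumptions: the two exception classes, `handle_record`, and the in-memory `error_topic` list are hypothetical stand-ins, and a real consumer would publish to an actual error topic and use its own alerting.

```python
import time


class NonBlockingError(Exception):
    """The record itself is bad (e.g. a malformed payload); retrying cannot help."""


class BlockingError(Exception):
    """A dependency is down (e.g. the database); later records would fail too."""


def handle_record(record, process, error_topic, alert, sleep=time.sleep, max_backoff=1.0):
    """Skip non-blocking errors, retry blocking errors with capped backoff."""
    backoff = 0.01
    while True:
        try:
            process(record)
            return
        except NonBlockingError as err:
            # Park the bad record on the error topic and move on.
            error_topic.append((record, str(err)))
            return
        except BlockingError as err:
            # Do not advance: alert, wait, and try the same record again.
            alert(f"blocked on {record!r}: {err}")
            sleep(backoff)
            backoff = min(backoff * 2, max_backoff)


# Demo with a fake database that is unavailable for the first two writes.
error_topic, alerts, stored = [], [], []
db_failures = {"remaining": 2}


def process(record):
    if not isinstance(record, int):
        raise NonBlockingError("not an integer")
    if db_failures["remaining"] > 0:
        db_failures["remaining"] -= 1
        raise BlockingError("database unavailable")
    stored.append(record)


for record in [42, "Robert');drop table students;--", 1337]:
    handle_record(record, process, error_topic, alerts.append, sleep=lambda _s: None)

print(stored, len(error_topic), len(alerts))
```

Retrying forever keeps ordering intact but blocks the partition until the dependency recovers, which is why the slides pair retries with alerts.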

  106. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Finding Errors; Monitoring; Capacity Planning; #hottakes

  107. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Finding Errors; Monitoring; Capacity Planning; #hottakes

  108. Delivery Guarantees

  109. Computer A Communication is hard Computer B What time is

    it?
  110. Computer A Communication is hard Computer B

  111. Computer A Communication is hard Computer B Did you get

    it?
  112. Computer A Communication is hard Computer B How about now?

  113. Computer A Communication is hard Computer B Now?

  114. Delivery: 0 <= 1 <= n (At most once; Impossible-ish; At least once)

  115. Consumers should *ALWAYS* assume “At Least Once”
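
One way to see why consumers have to assume at-least-once delivery is to simulate a consumer that processes a record before committing its offset. The sketch below is a toy model in Python, not a real Kafka client; `consume` and its `crash_after_processing_at` parameter are hypothetical.

```python
def consume(log, committed_offset, processed, crash_after_processing_at=None):
    """Process records from committed_offset onward, committing after each one.

    Returns the committed offset. If crash_after_processing_at is set, the
    consumer "crashes" after processing that offset but before committing it.
    """
    offset = committed_offset
    while offset < len(log):
        processed.append(log[offset])  # the side effect happens first
        if offset == crash_after_processing_at:
            return offset  # crash: the commit for this record never happened
        offset += 1  # commit: the next read starts after this record
    return offset


log = ["a", "b", "c"]
processed = []

# First run crashes after handling "b" but before committing it.
committed = consume(log, 0, processed, crash_after_processing_at=1)

# After a restart the consumer resumes from the last committed offset.
committed = consume(log, committed, processed)

print(processed)  # "b" shows up twice
```

Committing before processing would flip the failure mode: the crashed record would be lost instead of repeated, which is at-most-once.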

  116. The Joys of Functional Programming

  117. None
  118. You

  119. You Functional Programming

  120. Immutability and Idempotence

  121. Immutability: An immutable object is an object whose state cannot

    be modified after it is created.
  122. Idempotence: …the property of certain operations in mathematics and computer

    science whereby they can be applied multiple times without changing the result beyond the initial application.
  123. Idempotence: Execute the same operation more than once but only

    see the effect once.
  124. Idempotent Operations

  125. Counting comments comment comment comment increment 1

  126. Counting comments comment comment comment increment 1

  127. Counting comments comment comment comment increment 2

  128. Counting comments comment comment comment increment 2

  129. Counting comments comment comment comment increment 3

  130. Counting comments comment comment comment increment 3 Some Error

  131. Counting comments comment comment comment increment 3

  132. Counting comments comment comment comment increment 3

  133. Counting comments comment comment comment increment 4

  134. Counting comments comment comment comment increment 4

  135. Counting comments comment comment comment increment 5

  136. Counting comments comment comment comment increment 5

  137. Counting comments comment comment comment increment 6

  138. Kafka Record { data: {}, type: "comment.created" }

  139. Kafka Record { data: {}, type: "comment.created", msg_id: UUIDv4 }

  140. Kafka Record { data: {}, type: "comment.created", msg_id: UUIDv4 }

    Used for managing idempotence
  141. Counting comments comment comment comment increment 1

  142. Counting comments comment comment comment Set.add(id) id: 1 id: 2

    id: 3 (1)
  143. Counting comments comment comment comment Set.add(id) id: 1 id: 2

    id: 3 (1)
  144. Counting comments comment comment comment Set.add(id) id: 1 id: 2

    id: 3 (1, 2)
  145. Counting comments comment comment comment Set.add(id) id: 1 id: 2

    id: 3 (1, 2)
  146. Counting comments comment comment comment Set.add(id) id: 1 id: 2

    id: 3 (1, 2, 3)
  147. Counting comments comment comment comment Set.add(id) id: 1 id: 2

    id: 3 (1, 2, 3) Some Error
  148. Counting comments comment comment comment id: 1 id: 2 id:

    3 Set.add(id) (1, 2, 3)
  149. Counting comments comment comment comment id: 1 id: 2 id:

    3 Set.add(id) (1, 2, 3)
  150. Counting comments comment comment comment id: 1 id: 2 id:

    3 Set.add(id) (1, 2, 3)
  151. Counting comments (1, 2, 3)

  152. Counting comments cardinality(1, 2, 3)

  153. Counting comments cardinality(1, 2, 3) => 3
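
The difference between the two counting strategies above can be reproduced in a short Python sketch. The record shape follows slide 139; the replay of all three records after "Some Error" is simulated by delivering the list twice. Function names are hypothetical.

```python
import uuid


def apply_increment(count, _record):
    """Not idempotent: every delivery changes the result."""
    return count + 1


def apply_set_add(seen_ids, record):
    """Idempotent: adding the same msg_id again leaves the set unchanged."""
    return seen_ids | {record["msg_id"]}


records = [
    {"type": "comment.created", "msg_id": str(uuid.uuid4()), "data": {}}
    for _ in range(3)
]

# An error after the third record causes all three to be delivered again.
delivered = records + records

count = 0
seen_ids = set()
for record in delivered:
    count = apply_increment(count, record)
    seen_ids = apply_set_add(seen_ids, record)

print(count)          # counts every delivery
print(len(seen_ids))  # cardinality of the set counts each comment once
```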

  154. Idempotent Side-Effects

  155. smtp send_email Sending Emails email id: 1 email id: 2

    email id: 3
  156. smtp send_email Sending Emails email id: 1 email id: 2

    email id: 3 What do we do if this fails?
  157. smtp send_email Sending Emails email id: 1 email id: 2

    email id: 3 Send at most once
  158. smtp send_email Sending Emails email id: 1

  159. Cache send_email Sending Emails email id: 1 smtp

  160. Cache send_email Sending Emails email id: 1 smtp id?(1)

  161. Cache send_email Sending Emails email id: 1 smtp id?(1) If

    id exists then skip it
  162. Cache send_email Sending Emails email id: 1 smtp

  163. Cache send_email Sending Emails email id: 1 smtp add(1)

  164. Cache send_email Sending Emails email id: 1 smtp

  165. Cache send_email Sending Emails email id: 1 smtp

  166. Cache send_email Sending Emails email id: 1 smtp

  167. send_email Sending Emails email id: 1

  168. send_email Sending Emails email id: 1 If we see this

    message again move it to an audit topic
  169. send_email Sending Emails If we see this message again move

    it to an audit topic email id: 1
  170. send_email Sending Emails
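
The email flow above (check the cache, record the id, send, and divert repeats to an audit topic) might be sketched as follows. This is an assumption-laden illustration: `send_email_once` is a hypothetical name, the cache is a plain Python set, and the SMTP call and audit topic are passed in as simple stand-ins.

```python
def send_email_once(email, sent_ids, smtp_send, audit_topic):
    """Send an email at most once, keyed by its id.

    The id is recorded before the send, so a failed or repeated delivery is
    never retried automatically; it is moved to the audit topic instead.
    """
    if email["id"] in sent_ids:
        audit_topic.append(email)  # seen before: skip it and keep a trace
        return False
    sent_ids.add(email["id"])  # mark first, then perform the side effect
    smtp_send(email)
    return True


sent_ids, outbox, audit_topic = set(), [], []
email = {"id": 1, "to": "user@example.com"}

first = send_email_once(email, sent_ids, outbox.append, audit_topic)
second = send_email_once(email, sent_ids, outbox.append, audit_topic)

print(first, second, len(outbox), len(audit_topic))
```

Marking the id before sending trades a possible lost email for the guarantee that nobody receives it twice. The audit topic is where a person or a separate process decides what to do with the leftovers.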

  171. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Finding Errors; Monitoring; Capacity Planning; #hottakes

  172. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Finding Errors; Monitoring; Capacity Planning; #hottakes

  173. User Events Teams User event processor Messages Notifications Notifications Notification

    Sender
  174. User Events Teams User event processor Messages Notifications Notifications Notification

    Sender Teams
  175. Data is the language of the system

  176. { msg_id: "8700635f-1802-417e-89e7-595ad3600104", type: "comment.created", data: { user_id: 1234, msg:

    "This is a super fun conference!" } } Data payloads
  177. { msg_id: String, type: String, data: { user_id: Integer, msg:

    String } } Data payloads
  178. { msg_id: String, type: String, data: { user_id: Integer, msg:

    String } } Data payloads None of this tells you anything useful about your data
  179. { msg_id: String, type: String, data: { user_id: Integer, msg:

    String } } Data payloads What do we do when these things change?
  180. { msg_id: String, type: String, data: { user_id: String, msg:

    String } } Data payloads What do we do when these things change?
  181. { msg_id: String, type: String, data: { user_id: String, msg:

    String } } Data payloads Lets just use versions!
  182. { msg_id: String, type: String, data: { user_id: String, msg:

    String } } Data payloads Lets just use versions! (spoiler: this isn’t great)
  183. { msg_id: String, type: String, data: { user_id: String, msg:

    String } } Data payloads
  184. { msg_id: String, type: String, data: { user_id: String, msg:

    String }, meta: { version: 2 } } Data payloads
  185. Data Versions Consumer v1 v1 v1 v1 v2

  186. Data Versions Consumer v1 v1 v1 v1 v2 This consumer

    needs to understand both versions
  187. Data Versions Consumer v1 v1 v1 v1 v2 This team

    needs to know to make these changes
  188. Versioning is broken

  189. (sem)Versioning is broken

  190. Change Growth Breakage

  191. Change Growth Breakage Never do this

  192. Growing schemas should be the default

  193. { msg_id: String, type: String, data: { user_id: String, msg:

    String } } Data payloads
  194. { msg_id: String, type: String, data: { user_id: Integer, msg:

    String } } Data payloads What are these?
  195. Dependent Types

  196. { msg_id: String, type: String, data: { user_id: Integer, msg:

    String } } Data payloads What are these?
  197. Norm

  198. { msg_id: String, type: String, data: { user_id: String, msg:

    String } } Data payloads
  199. Data payloads

    UUID = string? & re_matches?(/^[0-9A-F]{8}-[0-9A-F]{4}-4[0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i)

    CommentCreated = schema {
      req :msg_id, UUID
      req :type, lit("comment.created")
      req :data, schema {
        req :user_id, integer? | UUID
        req :msg, string?
      }
    }
  200. Data payloads

    json = {type: "comment.created", msg: "Hello world"}
    Norm.decode(CommentEvent, json) => {:ok, data}
    Norm.decode(CommentEvent, {}) => {:error, errors}
    Norm.explain(CommentEvent, {}) =>
      "In :msg_id, val: {} fails spec: required
       In :type, val: {} fails spec: required
       In :data, val: {} fails spec: required"
  201. Norm is built for extensibility

  202. Norm is extensible

    CommentEvent = schema {
      req :type, lit("comment.created")
      req :msg, string?
    }

    json = { type: "comment.created", msg: "Hello world", data: { msg: "Hello world" } }
    Norm.decode(CommentEvent, json) => {:ok, data}
  203. Norm is extensible

    CommentEvent = schema {
      req :type, lit("comment.created")
      req :msg, string?
    }

    json = { type: "comment.created", msg: "Hello world", data: { msg: "Hello world" } }
    Norm.decode(CommentEvent, json) => {:ok, data}

    This will still get passed through
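
The schema idea from the Norm slides can be approximated in Python. This sketch is not Norm and does not reproduce its API; `validate`, `literal`, and the predicate helpers are hypothetical names. It assumes an open schema: required keys are checked, and any extra keys are ignored rather than rejected, which is what lets a payload grow without breaking existing consumers.

```python
import re

UUID_V4 = re.compile(
    r"^[0-9A-F]{8}-[0-9A-F]{4}-4[0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$",
    re.IGNORECASE,
)


def is_string(value):
    return isinstance(value, str)


def is_integer(value):
    return isinstance(value, int) and not isinstance(value, bool)


def is_uuid(value):
    return is_string(value) and UUID_V4.match(value) is not None


def literal(expected):
    return lambda value: value == expected


def validate(schema, value, path=""):
    """Return a list of error strings; an empty list means the value conforms."""
    if not isinstance(value, dict):
        return [f"In {path or 'root'}, val: {value!r} fails spec: map"]
    errors = []
    for key, spec in schema.items():
        where = f"{path}.{key}" if path else key
        if key not in value:
            errors.append(f"In {where}, val: {value!r} fails spec: required")
        elif isinstance(spec, dict):
            errors.extend(validate(spec, value[key], where))  # nested schema
        elif not spec(value[key]):
            errors.append(f"In {where}, val: {value[key]!r} fails spec")
    return errors


comment_created = {
    "msg_id": is_uuid,
    "type": literal("comment.created"),
    "data": {
        "user_id": lambda v: is_integer(v) or is_uuid(v),
        "msg": is_string,
    },
}

event = {
    "msg_id": "8700635f-1802-417e-89e7-595ad3600104",
    "type": "comment.created",
    "data": {"user_id": 1234, "msg": "This is a super fun conference!"},
    "meta": {"source": "web"},  # extra key that the schema does not mention
}

print(validate(comment_created, event))
print(validate(comment_created, {}))
```

Accepting either an integer or a UUID for `user_id` mirrors slide 199: the schema grows to admit the new form instead of replacing the old one.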
  204. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Finding Errors; Monitoring; Capacity Planning; #hottakes

  205. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Finding Errors; Monitoring; Capacity Planning; #hottakes

  206. Property Based Testing

  207. Property based testing Database Consumer

  208. Property based testing Database Consumer id: 1 id: 2 id:

    3 id: 1
  209. Property based testing Database Consumer id: 1 id: 2 id:

    3 id: 1 Information should end up here
  210. Property based testing Database Consumer id: 1 id: 2 id:

    3 id: 1 Some combination of these messages causes a failure
  211. Property based testing Database id: 1 id: 1 Consumer

  212. Property based testing Database id: 1 id: 1 Looks like

    we aren’t handling duplicates correctly Consumer
  213. Property based testing Database id: 1 id: 1 Consumer

  214. Property based testing Database Consumer id: 1 id: 1 Deterministically

    fail this connection
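
A property such as "duplicate deliveries do not change what ends up in the database" can be checked with a hand-rolled generator. The sketch below uses only Python's standard library and a hypothetical in-memory consumer; a dedicated property-based testing library would additionally shrink a failing sequence down to a minimal case like `id: 1, id: 1`, which this sketch does not attempt.

```python
import random


def consume(messages):
    """Hypothetical consumer: store each message body once, keyed by message id."""
    database = {}
    for message in messages:
        database.setdefault(message["id"], message["body"])
    return database


def random_delivery(rng, max_id=5, max_length=20):
    """Generate a delivery sequence in which ids may repeat."""
    ids = [rng.randint(1, max_id) for _ in range(rng.randint(0, max_length))]
    return [{"id": i, "body": f"body-{i}"} for i in ids]


def without_duplicates(messages):
    """Keep only the first occurrence of each id, preserving order."""
    seen, unique = set(), []
    for message in messages:
        if message["id"] not in seen:
            seen.add(message["id"])
            unique.append(message)
    return unique


def check_duplicates_are_harmless(runs=200, seed=0):
    """Property: consuming with duplicates equals consuming without them."""
    rng = random.Random(seed)
    for _ in range(runs):
        delivery = random_delivery(rng)
        expected = consume(without_duplicates(delivery))
        assert consume(delivery) == expected, delivery
        assert consume(delivery + delivery) == expected, delivery
    return runs


checked = check_duplicates_are_harmless()
print(f"property held for {checked} random deliveries")
```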
  215. Chaos Engineering

  216. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Finding Errors; Monitoring; Capacity Planning; #hottakes

  217. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Finding Errors; Monitoring; Capacity Planning; #hottakes

  218. Monitoring vs. Observability

  219. Monitoring: Figuring out that there’s a problem

  220. Observability: Determining what the problem is.

  221. Goal: Detect lagging or blocked consumers

  222. Wisen

  223. Wisen User Events User Consumer

  224. metadata topic Wisen User Events Checkpoints its position in the

    log to an offset topic User Consumer
  225. Wisen metadata topic Wisen User Consumer User Events

  226. Wisen metadata topic Wisen User Consumer User Events Compares farthest

    offset from checkpoints over a time-window
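
The checkpoint comparison described above could be approximated like this. It is a guess at the general idea, not Wisen's implementation: `consumer_status`, its thresholds, and the three status labels are all hypothetical.

```python
def consumer_status(checkpoints, log_end_offset, window_start, max_lag):
    """Classify a consumer from its (timestamp, committed_offset) checkpoints.

    "blocked": behind the log end with no progress inside the time window.
    "lagging": making progress, but more than max_lag records behind.
    "ok":      anything else.
    """
    recent = [offset for timestamp, offset in checkpoints if timestamp >= window_start]
    if not recent:
        return "blocked"  # no checkpoint at all inside the window
    farthest = max(recent)
    behind = log_end_offset - farthest
    if behind > 0 and len(recent) > 1 and farthest == min(recent):
        return "blocked"  # offset did not move across the window
    if behind > max_lag:
        return "lagging"
    return "ok"


healthy = [(0, 10), (10, 20), (20, 30)]
stuck = [(0, 10), (10, 10), (20, 10)]
slow = [(0, 10), (10, 12), (20, 14)]

print(consumer_status(healthy, log_end_offset=35, window_start=0, max_lag=10))
print(consumer_status(stuck, log_end_offset=35, window_start=0, max_lag=10))
print(consumer_status(slow, log_end_offset=100, window_start=0, max_lag=10))
```
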
  227. Wisen user_consumer_errors Wisen User Consumer User Events

  228. Wisen user_consumer_errors Wisen User Consumer User Events

  229. Wisen user_consumer_errors Wisen User Consumer User Events Alert if we

    see a rise in errors
  230. Other useful metrics: Median and Tail latencies Internal buffers DB/Cache/RPC

    latencies
  231. OpenTracing

  232. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Monitoring; Capacity Planning; #hottakes

  233. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Monitoring; Capacity Planning; #hottakes

  234. This has to be done up-front

  235. Calculating partitions: messages in the system = arrival rate * mean time in system

  236. Calculating partitions: desired throughput / measured throughput on one partition => partitions needed

  237. Calculating partitions: partitions < 100 x brokers x replication factor

    source: https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster
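
The three sizing rules above are plain arithmetic, so a worked example may help. All numbers below are invented for illustration, and the function names are hypothetical; only the formulas come from the slides.

```python
import math


def messages_in_system(arrival_rate_per_s, mean_time_in_system_s):
    """messages in the system = arrival rate * mean time in system."""
    return arrival_rate_per_s * mean_time_in_system_s


def partitions_needed(desired_throughput, per_partition_throughput):
    """Desired throughput / measured throughput on one partition, rounded up."""
    return math.ceil(desired_throughput / per_partition_throughput)


def partition_ceiling(brokers, replication_factor):
    """Upper bound from the slides: 100 x brokers x replication factor."""
    return 100 * brokers * replication_factor


# Made-up inputs: 2,000 msg/s arriving, each spending 0.5 s in the system.
in_flight = messages_in_system(2_000, 0.5)

# Made-up inputs: 50,000 msg/s wanted, 6,000 msg/s measured on one partition.
needed = partitions_needed(50_000, 6_000)

# Made-up cluster: 3 brokers, replication factor 3.
ceiling = partition_ceiling(brokers=3, replication_factor=3)

print(in_flight, needed, ceiling, needed < ceiling)
```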
  238. Increasing partitions is tricky if you rely on ordering

  239. to_int(hash(user_id)) % partitions

  240. to_int(hash(user_id)) % partitions Existing data is not reshuffled if partitions

    are increased
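
Why increasing partitions is tricky for ordered data can be shown by running the same partitioner with two partition counts. As before, this is a sketch with `zlib.crc32` as a stand-in hash and made-up user ids, not the behavior of any specific Kafka client.

```python
import zlib


def partition_for(user_id: str, partitions: int) -> int:
    """to_int(hash(user_id)) % partitions, with CRC32 as a stand-in hash."""
    return zlib.crc32(user_id.encode("utf-8")) % partitions


user_ids = [f"user-{n}" for n in range(1000)]

# Keys whose partition changes when the topic grows from 5 to 6 partitions.
moved = [u for u in user_ids if partition_for(u, 5) != partition_for(u, 6)]

print(f"{len(moved)} of {len(user_ids)} keys map to a different partition")
```

For every moved key, old events stay in the old partition while new events go to the new one, so a consumer can no longer rely on a single partition for that user's full, ordered history.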
  241. Data is not forever.

  242. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Monitoring; Capacity Planning; #hottakes

  243. Let's talk about… Kafka Terminology; Maintaining Order; Errors; Distributed Systems and the joys of functional programming; Data Validation; Monitoring; Capacity Planning; #hottakes

  244. CQRS & Event Sourcing

  245. Don’t rush to democratize your data

  246. Embrace data and design

  247. Go forth and build awesome stuff!

  248. Thanks Chris Keathley / @ChrisKeathley / keathley.io