
Kafka from A to Z

This presentation introduces every aspect of the Kafka ecosystem:
- Concepts: explains concepts that are easily confused, such as topic vs partition vs replication, producer vs consumer vs consumer group, group leader vs group coordinator, ...
- Advanced concepts: delivery semantics; idempotent producers; isolation levels; differences between offsets such as High Watermark, Log End Offset, Last Stable Offset ...
- Kafka architecture: explains all Kafka components such as brokers, controllers, Zookeeper, ...
- Overview of Kafka security: TLS/SSL, SASL, Kerberos, ...
- Overview of the Kafka ecosystem: Kafka Streams, Kafka Connect, Schema Registry, monitoring tools.
- Kafka in Golang: How to use Kafka client in Golang.
- Comparison with other message queues such as RabbitMQ.

Huỳnh Quang Thảo

September 09, 2019

Transcript

  1. Message Queue
    - Message queues provide an asynchronous communications protocol, meaning that the sender and receiver of the message do not need to interact with the message queue at the same time.
    - Messages placed onto the queue are stored until the recipient retrieves them. (Wikipedia)
  5. Zookeeper
    - It is essentially a centralized service for distributed systems offering a hierarchical key-value store, which is used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems. (Wikipedia)
    - Used to provide high availability for Kafka, Spark, Solr, Hadoop.
    - Other similar services: etcd (used in Kubernetes), Consul.
    - Kafka uses Zookeeper to maintain broker configurations, consumer configurations (old versions), the list of available broker nodes, controller election, service discovery, …
  6. Broker & Controller
    - A Kafka cluster consists of one or more servers, which are called brokers.
    - Producers are processes that publish data into Kafka topics, which are stored on the brokers.
  8. Broker & Controller
    - In a Kafka cluster, one of the brokers serves as the controller.
    - Responsible for managing the states of partitions and replicas.
    - Elects the partition leaders.
    - Elected using leader election on Zookeeper.
  12. Topic-Partition-Replication
    - The topic is the place where data is published by producers and is pulled by consumers.
    - The topic is divided into several partitions.
    - Each partition is ordered, and messages within a partition get an incrementing id, called the offset.
    - Partitions allow us to parallelize the consuming work by splitting the data into multiple places.
    - Every topic partition in Kafka is replicated n times, where n is the replication factor of the topic.
    - Each replica is placed on a separate broker.
    - The replication factor must be less than or equal to the total number of brokers.
    Question: Can the consumer read data from the replica?
  13. Topic-Partition-Replication
    - Order is guaranteed only within a partition.
    - Data is kept only for a limited time. (default: 7 days)
    - Once data is written to a partition, it can't be changed.
    - Data is assigned to a partition arbitrarily (for example round-robin) unless a key is provided.
  15. Partition Leader/Follower/ISR
    - Kafka chooses one of a partition's replicas as the leader; all other replicas of that partition are followers.
    - All producers write to this leader and all consumers read from this leader.
    - A follower that is in sync with the leader is called an in-sync replica (ISR).
    - Once the leader is down, Kafka elects one follower from the "ISR set" to become the leader.
    Question: How does Kafka maintain the ISR set for each partition?
  16. Producer and Consumer
    - Producer: a Kafka client that publishes messages to the Kafka cluster.
    - Consumer: a Kafka client that reads the messages published by producers, at its own pace.
  17. Producer
    - The producer has to specify the topic name and at least one broker to connect to.
    - It is better to specify multiple brokers for high availability.
    - Kafka automatically takes care of routing the data to the right brokers.
  19. Producer - acknowledgement
    Producers can set the acknowledgment configuration for sending data.
    Acks = 0
    - The message is pushed to the socket buffer.
    - The producer won't wait for the acknowledgment from the leader.
    Acks = 1
    - The leader appends the message to its log and then returns the acknowledgment to the producer.
    - The producer waits for the acknowledgment from the leader.
    Acks = all
    - The leader appends the message to its log.
    - The leader waits for acknowledgments from all in-sync replicas.
    - The producer waits for the leader's acknowledgment.
    - No data loss as long as at least one in-sync replica remains available.
    Question: How does choosing the ACK value affect the in-sync replica set?
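
The three acknowledgment levels above can be summarized as a single decision: when may the producer treat a send as successful? The following is a minimal Go sketch of that rule only. The function `ackSatisfied` and its parameters are hypothetical names for illustration, not part of any Kafka client API.

```go
package main

import "fmt"

// ackSatisfied reports whether the producer may treat a send as successful.
// acks is "0", "1" or "all"; leaderAppended says whether the leader wrote the
// message to its log; isrAcked and isrSize count in-sync follower acks.
// (Hypothetical helper: real clients handle this inside the protocol.)
func ackSatisfied(acks string, leaderAppended bool, isrAcked, isrSize int) bool {
	switch acks {
	case "0":
		// Fire and forget: the producer never waits for the broker.
		return true
	case "1":
		// Only the leader's append is required.
		return leaderAppended
	case "all":
		// The leader's append plus every in-sync follower's acknowledgment.
		return leaderAppended && isrAcked == isrSize
	default:
		return false
	}
}

func main() {
	// The leader has appended, but only 1 of 2 in-sync followers has replicated.
	for _, acks := range []string{"0", "1", "all"} {
		fmt.Printf("acks=%s -> %v\n", acks, ackSatisfied(acks, true, 1, 2))
	}
}
```

The stricter the setting, the longer the producer waits, which is the usual trade-off between latency and durability.
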
  20. Producer - key message
    - Producers can choose to send a key with the message.
    - If a key is sent along with the message, all messages with the same key are guaranteed to be stored in the same partition.
    - Some Kafka features rely on the message key: log compaction, cleaning up offsets ...
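
Why do equal keys always land in the same partition? A common approach is to hash the key and take the result modulo the partition count. The sketch below assumes FNV-1a from the Go standard library purely for illustration; the hash function actually used depends on the client (the Java client's default partitioner uses murmur2, for example). `partitionFor` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a message key to a partition index by hashing the key.
// The same key always yields the same index while numPartitions is unchanged.
// (FNV-1a is an assumption made for this sketch.)
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numPartitions))
}

func main() {
	for _, k := range []string{"user-1", "user-2", "user-1"} {
		fmt.Printf("key=%s -> partition %d\n", k, partitionFor(k, 3))
	}
}
```

Note that the mapping changes when the number of partitions changes, so the same-key guarantee holds only for a fixed partition count.
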
  21. Consumer
    - The consumer has to specify the topic name, the partition (optional) and at least one broker to connect to.
    - It is better to specify multiple brokers for high availability.
    - Kafka automatically takes care of pulling data from the right brokers.
    - Data is read in order within each partition.
  22. Consumer - Offset types
    High Watermark:
    - The offset of messages that are fully replicated to all ISR-replicas.
    Log End Offset:
    - The latest offset of messages on the leader partition.
    - Consumer only reads up to the high watermark.
    Leader Partition: 1 2 3 4 (HW) 5 6 7 (LEO)
    ISR-Replica 1:    1 2 3 4 5 6 7
    ISR-Replica 2:    1 2 3 4
    Out-of-sync:      1 2
    replica.lag.max.messages=4
  23. Consumer - Offset types
    High Watermark:
    - The offset of messages that are fully replicated to all ISR-replicas.
    Log End Offset:
    - The latest offset of messages on the leader partition.
    - Consumer only reads up to the high watermark.
    Leader Partition: 1 2 3 4 5 6 7 (HW + LEO)
    ISR-Replica 1:    1 2 3 4 5 6 7
    ISR-Replica 2:    1 2 3 4 5 6 7
    Out-of-sync:      1 2
    replica.lag.max.messages=4
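
The two diagrams can be read as one rule: the high watermark is the smallest log end offset among the leader and its in-sync replicas, and the out-of-sync replica is ignored. Below is a minimal Go sketch of that rule, using the slide's numbering in which an offset labels the last message a replica holds. `highWatermark` is a hypothetical helper, not broker code.

```go
package main

import "fmt"

// highWatermark returns the smallest log end offset among the leader and its
// in-sync replicas: every message up to that offset exists on all of them.
// Out-of-sync replicas must not be passed in.
func highWatermark(logEndOffsets []int) int {
	if len(logEndOffsets) == 0 {
		return 0
	}
	hw := logEndOffsets[0]
	for _, leo := range logEndOffsets[1:] {
		if leo < hw {
			hw = leo
		}
	}
	return hw
}

func main() {
	// Leader at 7, ISR-Replica 1 at 7, ISR-Replica 2 still at 4.
	fmt.Println(highWatermark([]int{7, 7, 4}))
	// Once ISR-Replica 2 catches up, HW and LEO coincide.
	fmt.Println(highWatermark([]int{7, 7, 7}))
}
```
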
  24. Consumer Group
    - Messaging traditionally has two models: queuing and publish-subscribe.
    - Queuing: allows us to divide up the processing of data over multiple consumer instances, which lets us scale the processing.
    - Publish-subscribe: allows us to broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.
    - The consumer group concept in Kafka generalizes these two concepts.
  28. Consumer Group
    - Consumers can join a group by using the same group.id.
    - Kafka assigns partitions to the consumers in the same group.
    - Each partition is consumed by exactly one consumer in the group.
    - Each consumer within a group reads from exclusive partitions.
    - One consumer can consume multiple partitions.
    - A group should not have more consumers than partitions; otherwise, the extra consumers will be idle.
    Question: Does Kafka allow one topic to have multiple consumer groups?
    Answer: Yes. E.g. one consumer group saves the data to a database, while another consumer group transforms the data for other systems.
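
The assignment rules above (one owner per partition, several partitions per consumer, idle consumers when the group is larger than the partition count) can be illustrated with a simple round-robin assignment. This is a sketch for a single topic only; Kafka's real assignment strategies are pluggable, and `assign` is a hypothetical function.

```go
package main

import "fmt"

// assign distributes partitions 0..numPartitions-1 over the consumers of one
// group in round-robin order, so each partition has exactly one owner.
func assign(numPartitions int, consumers []string) map[string][]int {
	out := make(map[string][]int, len(consumers))
	for _, c := range consumers {
		out[c] = nil // consumers that get no partition stay in the map, idle
	}
	if len(consumers) == 0 {
		return out
	}
	for p := 0; p < numPartitions; p++ {
		c := consumers[p%len(consumers)]
		out[c] = append(out[c], p)
	}
	return out
}

func main() {
	// Fewer consumers than partitions: one consumer owns two partitions.
	fmt.Println(assign(3, []string{"c1", "c2"}))
	// More consumers than partitions: c3 receives nothing.
	fmt.Println(assign(2, []string{"c1", "c2", "c3"}))
}
```
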
  30. Consumer Offset
    - Kafka stores the consumer offsets in the topic named "__consumer_offsets".
    - [Group, Topic, Partition] -> [Offset, MetaData, TimeStamp]
    - The topic is created automatically when a consumer using a group connects to the cluster.
    - This information is stored in the Kafka cluster (old versions: Zookeeper).
    - The consumer should commit the offset, automatically or manually, after reading the message.
    Question: What if the "__consumer_offsets" topic grows so large that lookups slow down?
  31. Consumer Group Coordinator
    - The group coordinator is the broker which receives heartbeats (or polls for messages) from all consumers within the consumer group.
    - Every consumer group has only one group coordinator.
    - When a consumer wants to join a consumer group, it sends a request to the group coordinator.
  32. Consumer Group Leader
    - The consumer group leader is one of the consumers in a consumer group.
    - When a consumer wants to join a consumer group, it sends a JoinGroup request to the group coordinator.
    - The first consumer to join the group becomes the group leader.
  35. Group Leader and Coordinator relationship
    A rebalancing event occurs when:
    - A new consumer joins the consumer group.
    - A consumer leaves the consumer group.
    - A consumer is "disconnected" from the consumer group (the group coordinator doesn't receive heartbeats from this consumer).
    Then:
    - The group leader makes the assignment of each partition to each consumer, then sends this assignment to the group coordinator.
    - The group coordinator sends the assignment back to all consumers.
    - The rebalancing event happens.
    Question: Why is assigning partitions to consumers the job of the group leader (a consumer) rather than the group coordinator (a broker)?
  36. Group Leader and Coordinator
    Algorithm:
    1. Members JoinGroup with their respective subscriptions.
    2. The leader collects member subscriptions from its JoinGroup response and performs the group assignment.
    3. All members (including the leader) send SyncGroup to find their assignment.
    4. Once created, there are two cases which can trigger reassignment:
      a. Topic metadata changes that have no impact on subscriptions cause re-sync. The leader computes the new assignment and sends SyncGroup.
      b. Membership or subscription changes cause rejoin.
  37. Idempotent Producer
    - A producer ID (PID) is generated for each producer.
    - The PID and a sequence number are bundled together with the message.
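
How do a PID and a sequence number make retries harmless? The broker can remember the next sequence number it expects from each producer and drop anything else. The toy model below sketches that idea with one counter per (PID, partition) pair. All names (`brokerLog`, `write`, `producerPartition`) are hypothetical, and a real broker handles out-of-order sequences with specific errors rather than a plain `false`.

```go
package main

import "fmt"

// producerPartition identifies one producer writing to one partition.
type producerPartition struct {
	pid       int64
	partition int
}

// brokerLog is a toy broker-side log with duplicate detection.
type brokerLog struct {
	nextSeq  map[producerPartition]int
	messages []string
}

func newBrokerLog() *brokerLog {
	return &brokerLog{nextSeq: make(map[producerPartition]int)}
}

// write stores msg only if seq is the next expected sequence number for this
// producer and partition. A retry of an already stored message is dropped.
func (b *brokerLog) write(pid int64, partition, seq int, msg string) bool {
	key := producerPartition{pid, partition}
	if seq != b.nextSeq[key] {
		return false // duplicate (seq too low) or gap (seq too high)
	}
	b.nextSeq[key] = seq + 1
	b.messages = append(b.messages, msg)
	return true
}

func main() {
	b := newBrokerLog()
	fmt.Println(b.write(42, 0, 0, "a")) // first send
	fmt.Println(b.write(42, 0, 0, "a")) // retry of the same message
	fmt.Println(b.write(42, 0, 1, "b")) // next message
	fmt.Println(len(b.messages))
}
```
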
  38. Isolation level for consumer
    Read Committed
    - Non-transactional and COMMITTED transactional messages are visible.
    - Read until LSO: Last Stable Offset
    Read Uncommitted
    - All messages are visible.
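
The difference between the two isolation levels can be sketched as a filter over a partition log. The model below assumes that the Last Stable Offset is the offset of the first record belonging to a still-open transaction. The types and functions (`record`, `lastStableOffset`, `read`) are hypothetical and simplified; a real log uses transaction markers rather than a per-record state.

```go
package main

import "fmt"

type txnState int

const (
	plainMsg     txnState = iota // non-transactional message
	txnOpen                      // transaction neither committed nor aborted yet
	txnCommitted                 // transaction committed
	txnAborted                   // transaction aborted
)

type record struct {
	offset int
	value  string
	state  txnState
}

// lastStableOffset returns the offset of the first record that belongs to a
// still-open transaction, or one past the last record if there is none.
func lastStableOffset(log []record) int {
	for _, r := range log {
		if r.state == txnOpen {
			return r.offset
		}
	}
	if len(log) == 0 {
		return 0
	}
	return log[len(log)-1].offset + 1
}

// read returns the values visible to a consumer. With readCommitted set, it
// stops at the LSO and skips records of aborted transactions.
func read(log []record, readCommitted bool) []string {
	var out []string
	lso := lastStableOffset(log)
	for _, r := range log {
		if readCommitted && (r.offset >= lso || r.state == txnAborted) {
			continue
		}
		out = append(out, r.value)
	}
	return out
}

func main() {
	log := []record{
		{0, "a", plainMsg}, {1, "b", txnCommitted}, {2, "c", txnAborted},
		{3, "d", txnOpen}, {4, "e", plainMsg},
	}
	fmt.Println(read(log, true))  // read committed
	fmt.Println(read(log, false)) // read uncommitted
}
```

In this model a non-transactional message written after an open transaction stays hidden from a read-committed consumer until that transaction finishes, because reading stops at the LSO.
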
  40. Exactly Once Semantics
    Part 1: Idempotent producer
    - Guarantees messages are produced once and in order.
    Part 2: Producer transaction support
    - Commit or fail a set of produced messages and consumer offsets across partitions.
    Part 3: Consumer transaction support
    - Isolation level is read committed.
    - Filter out messages for aborted transactions, and wait for transactions to be committed before processing.
    Question: What if the producer crashes, or the consumer fails to commit its processed offset?
  41. Authentication
    - TLS/SSL: integrating with PKI infrastructure
    - SASL/GSSAPI: integrating with Kerberos infrastructure
    - SASL/PLAIN: integrating with an existing password server/database
    - Custom SASL mechanism: integrating with an existing authentication database
    - Others: SASL/SCRAM, SASL/OAUTHBEARER (beta), SASL_SSL
  42. Kafka Connect
    Kafka Connect is a framework for connecting Kafka with external systems, e.g. databases, key-value stores, search indexes, and file systems.
  43. Kafka Streams
    Kafka Streams is a library for building streaming applications that transform input Kafka topics into output Kafka topics.
  44. Confluent Schema Registry
    - Confluent Schema Registry provides a RESTful interface for storing and retrieving Apache Avro® schemas.
    - It stores a versioned history of all schemas.
  45. Monitoring Tools
    - Kafka consumer lag checking: https://github.com/linkedin/Burrow
    - Kafka toolbox (Topic UI, Schema UI, Connect UI): https://www.landoop.com/lenses-box/
    - Kafka Manager: https://github.com/yahoo/kafka-manager
    - Zookeeper Configuration, Monitoring, Backup (Netflix): https://github.com/soabase/exhibitor
    - Monitor the availability of Kafka clusters: https://github.com/linkedin/kafka-monitor
    - Kafka Tools: https://github.com/linkedin/kafka-tools
    - Cluster Management: https://github.com/apache/helix
    - Security Manager: https://github.com/simplesteph/kafka-security-manager
    - ....
  46. Kafka vs RabbitMQ
    Advanced Message Queuing Protocol (RabbitMQ) vs log-based message protocol (Kafka):
    - A single message is slow to process
      RabbitMQ: No problem.
      Kafka: Holds up the processing of subsequent messages in that partition.
    - Message order
      RabbitMQ: Cannot keep order in case of multiple consumers.
      Kafka: Can keep the order within the partition.
    - Consumers cannot keep up with producers
      RabbitMQ: When a consumer dies, we should carefully delete any queues whose consumer has shut down. Otherwise, they continue to consume memory and affect other live consumers.
      Kafka: Just stores the messages in the queue (on the hard drive).
    We choose Kafka when:
    - Each message is fast to process and message throughput is high
    - Message order is important
    - Keyed messages
    - Replay old messages
    - Delivery semantics
  47. Sarama
    - Written in Golang. Open sourced by Shopify.
    - Easier to read. Lacks many features.
  48. Confluent Kafka Go
    - Wraps librdkafka, which is written in C (with a C++ wrapper).
    - Has many of the latest Kafka features.
    - Still behind the official Java Kafka client in features.
    - Confluent: founded by Kafka's original authors.
  50. Q&A