Intro to Apache Kafka #phpbnl20

A 30-minute introduction to the features of Apache Kafka, the anatomy of a Kafka cluster, and how to talk to a Kafka cluster once you've got one.

This talk was held as part of the PHPBenelux 2020 Unconference track. Please give me feedback for this talk on Joind.in: https://joind.in/talk/27a56


Tobias Gies

January 25, 2020

Transcript

  1. Slide 3: Kafka is not just a message queue
     • Distributed data streaming and storage platform
     • Publish / subscribe model
     • Fault-tolerant and scalable (partitioning and replication are first-class citizens)
     • Read: at-least-once or exactly-once* delivery
     • Write: consistency settings can be tuned to fit the use case (trade-off vs. data loss risk)
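A minimal sketch (plain PHP, deliberately not the rdkafka API) of why commit-after-processing gives at-least-once delivery: if a consumer crashes after processing a message but before committing its offset, that message is processed again after restart.

```php
<?php
// Illustrative sketch: at-least-once delivery falls out of
// committing offsets *after* processing a message.
$messages = ["m0", "m1", "m2", "m3"];
$committedOffset = 0;   // last durably committed read position
$processed = [];

// First run: process two messages, but crash before committing m1.
$processed[] = $messages[0];
$committedOffset = 1;   // commit after processing m0
$processed[] = $messages[1];
// -- crash here: offset 2 was never committed --

// Restart: resume from the last committed offset.
for ($i = $committedOffset; $i < count($messages); $i++) {
    $processed[] = $messages[$i];
    $committedOffset = $i + 1;
}

// "m1" was processed twice: delivered at least once, not exactly once.
```

Committing *before* processing would instead give at-most-once semantics: a crash after the commit but before processing loses the message.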
  2. Slide 4: Kafka is not just a message queue (2)
     • Reactive to the core: you always work with streams of data, never static tables
     • Strictly ordered (per partition, more on this later)
     • Data compaction: different ways of getting rid of old* data (more on this later)
  3. Slide 5: Kafka is FAST. Like, seriously.
     • Millions of messages per second are not a big problem, even on small clusters
     • Most often, throughput is I/O-limited (disk, network interface, …)
     • Very low latency for a single message
     • Does not need (comparatively) expensive hardware
  4. Slide 7: Brokers and Clusters (image from "Kafka in a Nutshell" by Kevin Sookocheff)
     • A Broker is what Kafka calls a single server.
     • Multiple brokers form a Cluster.
     • Data is replicated to several brokers in the cluster, with one broker acting as the Leader for a given partition of data.
     • Reads and writes for a partition are served by its leader.
     • The leader coordinates replication.
     • In case of failure, a replica will take over leadership.
  5. Slide 8: Kafka Topics
     • One Topic consists of one or more Partitions…
     • … which each contain any number of Messages.
     • Partitions have an ordering guarantee: messages are stored in the same order they are written.
  6. Slide 9: Messages consist of…
     • Headers (e.g. timestamp)
     • Key (byte array)
     • Value (byte array)
     The default maximum message size is ~1 MB.
  7. Slide 10: Topics (2) – Partition assignment
     Two options for partition assignment:
     • If the message has no key (key is null), partition assignment happens round-robin.
     • If the message has a key, partition assignment happens based on a hash of the key…
     • … which means messages with the same key will always be in the same partition.
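The key-hashing rule above can be sketched in a few lines of plain PHP. Note this is purely illustrative: the actual hash function differs between clients (the Java client uses murmur2, while librdkafka ships both CRC32- and murmur2-based partitioners), so `crc32` here is a stand-in, not the real algorithm.

```php
<?php
// Illustrative sketch of key-based partition assignment:
// hash the key, take it modulo the partition count.
function pickPartition(?string $key, int $numPartitions): int {
    if ($key === null) {
        // No key: in practice the producer spreads these round-robin.
        return random_int(0, $numPartitions - 1);
    }
    // Keyed message: deterministic hash => stable partition.
    return crc32($key) % $numPartitions;
}

// Messages with the same key always land in the same partition...
$a = pickPartition("User1", 6);
$b = pickPartition("User1", 6);
// ...so $a === $b on every run.
```

This determinism is what makes per-key ordering (and log compaction, below) possible: all records for "User1" live in one partition, in write order.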
  8. Slide 11: Messages (2) – Special cases
     • In an event stream topic (example: access log):
       • All messages have null keys, because there is no meaningful identity for an event.
     • In a data changelog topic (example: a topic ingested from a database):
       • A message with a null value marks a deleted record.
  9. Slide 12: Topics (3) – (Change-)Log Compaction

     Partition 1                   Partition 2
     Key    Value                  Key    Value
     User1  u1@example.com         User2  two@example.net
     User1  user1@gmail.com        User3  three@yahoo.com
 10. Slide 13: Topics (3) – (Change-)Log Compaction

     Partition 1                   Partition 2
     Key    Value                  Key    Value
     User1  u1@example.com         User2  two@example.net
     User1  user1@gmail.com        User3  three@yahoo.com
                                   User2  null
 11. Slide 14: Topics (3) – (Change-)Log Compaction

     Partition 1                   Partition 2
     Key    Value                  Key    Value
     User1  u1@example.com         User2  two@example.net
     User1  user1@gmail.com        User3  three@yahoo.com
                                   User2  null

 12. Slide 15: Topics (3) – (Change-)Log Compaction

     Partition 1                   Partition 2
     Key    Value                  Key    Value
     User1  u1@example.com         User2  two@example.net
     User1  user1@gmail.com        User3  three@yahoo.com
                                   User2  null

 13. Slide 16: Topics (3) – (Change-)Log Compaction

     Partition 1                   Partition 2
     Key    Value                  Key    Value
     User1  u1@example.com         User2  two@example.net
     User1  user1@gmail.com        User3  three@yahoo.com
                                   User2  null
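The compaction walked through on these slides can be summarized in a small plain-PHP sketch. This is the *semantics* only, not Kafka's actual implementation (which compacts segment files in the background and only deletes tombstones after a retention period):

```php
<?php
// Illustrative sketch of log-compaction semantics: per key, only the
// most recent value survives; a null value ("tombstone") eventually
// removes the key entirely.
function compactLog(array $records): array {
    $latest = [];
    foreach ($records as [$key, $value]) {
        $latest[$key] = $value; // later records overwrite earlier ones
    }
    // Tombstones are dropped once their retention period has passed.
    return array_filter($latest, fn($v) => $v !== null);
}

// The changelog from the slides, in write order:
$records = [
    ["User1", "u1@example.com"],
    ["User2", "two@example.net"],
    ["User1", "user1@gmail.com"],
    ["User3", "three@yahoo.com"],
    ["User2", null], // tombstone: User2 was deleted
];
$compacted = compactLog($records);
// Only User1's newest address and User3 remain; User2 is gone.
```

A consumer reading the compacted topic from the beginning can therefore still rebuild the full current state of the table, just without its history.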
 14. Slide 17: Part 3 – Interacting with Kafka
     How to get data in and out, and how to aggregate and transform it
 15. Slide 18: Terminology
     • Producers put data into Kafka
     • Consumers read data from Kafka
     • Connectors link Kafka to external data stores
     • Stream processors filter, merge, aggregate, and transform data
 16. Slide 19: A basic Kafka Producer in PHP

     $conf = new RdKafka\Conf();
     $conf->set('metadata.broker.list', 'localhost:9092');
     $producer = new RdKafka\Producer($conf);
     $topic = $producer->newTopic("test");
     for ($i = 0; $i < 10; $i++) {
         $topic->produce(RD_KAFKA_PARTITION_UA, 0, "Message $i");
         $producer->poll(0);
     }
     // Wait for all buffered messages to be delivered before shutdown,
     // otherwise unsent messages are silently lost.
     $producer->flush(10 * 1000);
 17. Slide 20: A basic Kafka Consumer in PHP

     $conf = new RdKafka\Conf();
     $conf->set('metadata.broker.list', 'localhost:9092');
     $conf->set('group.id', 'myConsumerGroup'); // required for KafkaConsumer
     $consumer = new RdKafka\KafkaConsumer($conf);
     $consumer->subscribe(['test']);
     while (true) { // A real application would use a proper event loop here.
         $message = $consumer->consume(120 * 1000);
         switch ($message->err) {
             case RD_KAFKA_RESP_ERR_NO_ERROR:
                 var_dump($message);
                 break;
             case RD_KAFKA_RESP_ERR__PARTITION_EOF:
             case RD_KAFKA_RESP_ERR__TIMED_OUT:
                 break; // No new messages; keep polling.
             default:
                 throw new \Exception($message->errstr(), $message->err);
         }
     }
 18. Slide 21: Kafka's own frameworks & services
     • Kafka Connect:
       • Links external data stores to Kafka using pre-built Connectors that only need configuration.
       • Great tool for making existing systems' data accessible in Kafka.
     • Kafka Streams:
       • Java framework for stream processing. Build your own stream processor in a single Java file.
       • Manages consumers, producers, data stores, etc. transparently.
     • KSQL:
       • Java not your thing? Write stream processors in an SQL-like language.
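To illustrate "only need configuration": the FileStreamSource connector that ships with Kafka can be set up with a properties file along these lines (file path, connector name, and topic are example placeholders, as in the standard Kafka quickstart):

```properties
# Example Kafka Connect source connector: tails a file into a topic.
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/test.txt
topic=connect-test
```

Each new line appended to the file becomes a message on the `connect-test` topic; sink connectors work the same way in the opposite direction.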
 19. Slide 22: Bonus slide: More cool tools
     • First-party CLI scripts: kafka-topics, kafka-consumer-groups, kafka-console-consumer, and many more
     • MirrorMaker: replicate topics across different Kafka clusters
     • Kaf: alternative open source CLI client written in Go
     • Cruise Control: Kafka cluster management, workload rebalancing, self-healing
     • Debezium: live replication of data from RDBMSs (MySQL & co.) into Kafka
     • Many more:
       • https://github.com/monksy/awesome-kafka/
       • https://github.com/infoslack/awesome-kafka/