Slide 1

Introduction to Apache Kafka
PHPBenelux Conference, 2020-01-25

Slide 2

Part 1: The Feature Pitch
Why I think this Kafka thing is pretty cool

Slide 3

Kafka is not just a message queue
• Distributed data streaming and storage platform
• Publish/subscribe model
• Fault-tolerant and scalable (partitioning and replication are first-class citizens)
• Read: at-least-once or exactly-once* delivery
• Write: consistency settings can be tuned to fit the use case (trade-off vs. data-loss risk)

Slide 4

Kafka is not just a message queue (2)
• Reactive to the core: you always work with streams of data, never static tables
• Strictly ordered (per partition; more on this later)
• Data compaction: different ways of getting rid of old* data (more on this later)

Slide 5

Kafka is FAST. Like, seriously.
• Millions of messages per second are not a big problem, even on small clusters
• Throughput is most often I/O-limited (disk, network interface…)
• Very low latency for a single message
• Does not need expensive hardware (comparatively)

Slide 6

Part 2: High-Level Concepts
A Kafka anatomy lesson

Slide 7

Brokers and Clusters
(Image from "Kafka in a Nutshell" by Kevin Sookocheff)
• A Broker is what Kafka calls a single server.
• Multiple brokers form a Cluster.
• Data is replicated to several brokers in the cluster, with one broker acting as the Leader for a given partition of data.
• Reads and writes for a partition are served by its leader.
• The leader coordinates replication.
• In case of failure, a replica takes over leadership.

Slide 8

Kafka Topics
• One Topic consists of one or more Partitions…
• …which each contain any number of Messages.
• Partitions have an ordering guarantee: messages are stored in the same order they are written.

Slide 9

Messages consist of…
• Headers (e.g. timestamp)
• Key (byte array)
• Value (byte array)
The default maximum message size is ~1 MB.
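All three parts can be set explicitly when producing. A minimal sketch, assuming the php-rdkafka extension and a local broker at localhost:9092; the topic name and header are made up, and producev() with a headers argument needs php-rdkafka ≥ 3.1 / librdkafka ≥ 0.11:

```php
<?php
// Sketch: producing a message with headers, key, and value set explicitly.
// Assumes php-rdkafka and a broker at localhost:9092.
$conf = new RdKafka\Conf();
$conf->set('metadata.broker.list', 'localhost:9092');
$producer = new RdKafka\Producer($conf);
$topic = $producer->newTopic('test');
$topic->producev(
    RD_KAFKA_PARTITION_UA,        // let Kafka pick the partition
    0,                            // msgflags
    'the value (a byte array)',   // value
    'the key (a byte array)',     // key
    ['trace-id' => 'abc123']      // headers
);
$producer->flush(10 * 1000);      // wait for delivery before exiting
```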

Slide 10

Topics (2) – Partition assignment
Two options for partition assignment:
• If the message has no key (key is null), partition assignment happens round-robin.
• If the message has a key, partition assignment happens based on the key's hash…
• …which means messages with the same key will always be in the same partition.
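The rule can be sketched in plain PHP. This is only an illustration of the idea, not librdkafka's actual partitioner (which has its own hash functions and configurable strategies):

```php
<?php
// Illustrative sketch of key-based partition assignment.
function pickPartition(?string $key, int $numPartitions): int
{
    if ($key === null) {
        // No key: any partition may be chosen (round-robin / random)
        return random_int(0, $numPartitions - 1);
    }
    // With a key: hash it, so the same key always maps to the same partition
    return crc32($key) % $numPartitions;
}

// The same key always lands in the same partition:
var_dump(pickPartition('User1', 2) === pickPartition('User1', 2)); // bool(true)
```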

Slide 11

Messages (2) – Special cases
• In an event stream topic (example: access log):
  • All messages have null keys, because there is no meaningful identity for an event.
• In a data changelog topic (example: a topic ingested from a database):
  • A message with a null value marks a deleted record (a "tombstone").
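Writing such a tombstone from PHP is just a produce call with a key and a null value. A sketch, assuming php-rdkafka, a broker at localhost:9092, and a made-up topic name:

```php
<?php
// Sketch: marking the record with key "User2" as deleted (a tombstone).
$conf = new RdKafka\Conf();
$conf->set('metadata.broker.list', 'localhost:9092');
$producer = new RdKafka\Producer($conf);
$topic = $producer->newTopic('users');
// produce(partition, msgflags, payload, key): null payload plus a key
$topic->produce(RD_KAFKA_PARTITION_UA, 0, null, 'User2');
$producer->flush(10 * 1000);
```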

Slide 12

Topics (3) – (Change-)Log Compaction

Partition 1                    Partition 2
Key    Value                   Key    Value
User1  [email protected]        User2  [email protected]
User1  [email protected]        User3  [email protected]

Slide 13

Topics (3) – (Change-)Log Compaction

Partition 1                    Partition 2
Key    Value                   Key    Value
User1  [email protected]        User2  [email protected]
User1  [email protected]        User3  [email protected]
                               User2  null
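Conceptually, compaction keeps only the latest record per key and eventually drops tombstoned keys altogether. A plain-PHP sketch of that idea (not Kafka code; the sample keys and values are made up):

```php
<?php
// Illustrative sketch of what log compaction does conceptually.
function compactLog(array $log): array
{
    $latest = [];
    foreach ($log as [$key, $value]) {
        $latest[$key] = $value;              // later records override earlier ones
    }
    // Keys whose latest value is null (tombstones) disappear entirely
    return array_filter($latest, fn ($v) => $v !== null);
}

$log = [
    ['User1', '[email protected]'],
    ['User2', '[email protected]'],
    ['User1', '[email protected]'],
    ['User2', null],                         // tombstone: User2 was deleted
];
var_export(compactLog($log));                // only User1's latest value survives
```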

Slide 17

Part 3: Interacting with Kafka
How to get data in and out, aggregate and transform it

Slide 18

Terminology
• Producers put data into Kafka
• Consumers read data from Kafka
• Connectors link Kafka to external data stores
• Stream processors filter, merge, aggregate, and transform data

Slide 19

A basic Kafka Producer in PHP

$conf = new RdKafka\Conf();
$conf->set('metadata.broker.list', 'localhost:9092');
$producer = new RdKafka\Producer($conf);
$topic = $producer->newTopic("test");
for ($i = 0; $i < 10; $i++) {
    $topic->produce(RD_KAFKA_PARTITION_UA, 0, "Message $i");
    $producer->poll(0);
}
// Make sure all buffered messages are delivered before shutdown
$producer->flush(10 * 1000);

Slide 20

A basic Kafka Consumer in PHP

$conf = new RdKafka\Conf();
$conf->set('metadata.broker.list', 'localhost:9092');
$conf->set('group.id', 'myConsumerGroup'); // required for KafkaConsumer
$consumer = new RdKafka\KafkaConsumer($conf);
$consumer->subscribe(['test']);
while (true) { // TODO: This should probably be a proper event loop...
    $message = $consumer->consume(120 * 1000);
    switch ($message->err) {
        case RD_KAFKA_RESP_ERR_NO_ERROR:
            var_dump($message);
            break;
        case RD_KAFKA_RESP_ERR__PARTITION_EOF:
        case RD_KAFKA_RESP_ERR__TIMED_OUT:
            break; // no new messages right now; keep polling
        default:
            throw new \Exception($message->errstr(), $message->err);
    }
}

Slide 21

Kafka's own frameworks & services
• Kafka Connect:
  • Links external data stores to Kafka using pre-built Connectors that only need configuration.
  • Great tool to make existing systems' data accessible in Kafka.
• Kafka Streams:
  • Java framework for stream processing. Build your own stream processor in a single Java file.
  • Manages consumers, producers, data stores, etc. transparently.
• KSQL:
  • Java not your thing? Write stream processors in an SQL-like language.
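To show how little a Connect setup needs, here is an illustrative config for the FileStreamSource connector that ships with Kafka; the connector name, file path, and topic are made-up examples:

```json
{
  "name": "local-file-source",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/input.txt",
    "topic": "file-lines"
  }
}
```

Posting this JSON to the Connect REST API starts a connector that streams each line of the file into the topic, with no custom code.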

Slide 22

Bonus slide: more cool tools
• First-party CLI scripts: kafka-topics, kafka-consumer-groups, kafka-console-consumer, and many more
• MirrorMaker: replicate topics across different Kafka clusters
• Kaf: alternative open-source CLI client written in Go
• Cruise Control: Kafka cluster management, workload rebalancing, self-healing
• Debezium: live replication of data from RDBMSs (MySQL & co.) into Kafka
• Many more:
  • https://github.com/monksy/awesome-kafka/
  • https://github.com/infoslack/awesome-kafka/

Slide 23

Thank you!
Got feedback? https://joind.in/talk/27a56