
An Introduction to Apache Kafka

Amir Sedighi
December 01, 2014

Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala.
Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.

Transcript

  1. At first, data pipelining looks easy! • It often starts with one data pipeline from a producer to a consumer.
  2. It also looks pretty wise to reuse things! • The existing pipeline is reused for new producers.
  3. Message Delivery Semantics • At most once – Messages may be lost but are never redelivered. • At least once – Messages are never lost but may be redelivered. • Exactly once – This is what people actually want. (A consumer sketch showing how the offset-commit strategy selects between these semantics follows after the transcript.)
  4. Apache Kafka • Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. – Kafka is super fast. – Kafka is scalable. – Kafka is durable. – Kafka is distributed by design.
  5. Apache Kafka • A single Kafka broker (server) can handle hundreds of megabytes of reads and writes per second from thousands of clients.
  6. Apache Kafka • Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime.
  7. Apache Kafka • Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
  8. Apache Kafka • Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
  9. Topic • Kafka maintains feeds of messages in categories called topics. • Topics are the highest level of abstraction that Kafka provides.
  10. Producer • We'll call processes that publish messages to a Kafka topic producers. (A minimal producer sketch follows after the transcript.)
  11. Consumer • We'll call processes that subscribe to topics and process the feed of published messages consumers. – Hadoop Consumer
  12. Broker • Kafka is run as a cluster comprised of one or more servers, each of which is called a broker.
  13. Topics • A topic is a category or feed name to which messages are published. • The Kafka cluster maintains a partitioned log for each topic.
  14. Partition • A partition is an ordered, immutable sequence of messages that is continually appended to a commit log. • The messages in the partitions are each assigned a sequential id number called the offset. (An offset sketch follows after the transcript.)
  15. Producer • The producer is responsible for choosing which message to assign to which partition within the topic. – Round-Robin – Load-Balanced – Key-Based (Semantic-Oriented) (A keyed-producer sketch follows after the transcript.)
  16. Use Cases • Messaging – Kafka is comparable to traditional messaging systems such as ActiveMQ and RabbitMQ. • Kafka provides customizable latency. • Kafka has better throughput. • Kafka is highly fault-tolerant.
  17. Use Cases • Log Aggregation – Many people use Kafka as a replacement for a log aggregation solution.
      – Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS, perhaps) for processing.
      – In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency. • Lower latency • Easier support
  18. Use Cases • Stream Processing – Storm and Samza are popular frameworks for stream processing. They both use Kafka.
      • Event Sourcing – Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.
      • Commit Log – Kafka can serve as a kind of external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data.
  19. Message Format • The format of an N byte message is the following:
      – If the magic byte is 0:
        1. 1 byte "magic" identifier to allow format changes
        2. 4 byte CRC32 of the payload
        3. N - 5 byte payload
      – If the magic byte is 1:
        1. 1 byte "magic" identifier to allow format changes
        2. 1 byte "attributes" identifier to allow annotations on the message independent of the version (e.g. compression enabled, type of codec used)
        3. 4 byte CRC32 of the payload
        4. N - 6 byte payload
      (A sketch encoding this layout follows after the transcript.)
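
The delivery-semantics item above comes down to when the consumer commits its offsets. The sketch below is a minimal illustration, not code from the deck: it assumes a broker at localhost:9092 and a topic named demo-topic (both placeholders), and it uses the newer Java consumer client, which postdates this 2014 presentation. With automatic commits disabled, committing after processing gives at-least-once delivery; committing before processing would give at-most-once.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class AtLeastOnceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("group.id", "demo-group");                 // assumed consumer group
            props.put("enable.auto.commit", "false");            // we commit offsets ourselves
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("demo-topic"));        // assumed topic name
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record);                          // placeholder processing step
                    }
                    // Committing *after* processing: a crash between process() and commitSync()
                    // causes the batch to be read again, so messages may be redelivered but are
                    // never lost (at-least-once). Committing *before* processing would flip this
                    // to at-most-once.
                    consumer.commitSync();
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) {
            System.out.printf("partition=%d offset=%d value=%s%n",
                    record.partition(), record.offset(), record.value());
        }
    }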
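
For the producer item above, here is a hedged sketch of a publisher built on the Java producer client; the broker address localhost:9092, the topic name demo-topic, and the ten-message loop are illustrative assumptions, not details from the deck.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class DemoProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("acks", "all");                          // wait for replication before acking
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 10; i++) {
                    ProducerRecord<String, String> record =
                            new ProducerRecord<>("demo-topic", "key-" + i, "message-" + i);
                    // send() is asynchronous; get() blocks until the broker acknowledges the write
                    RecordMetadata meta = producer.send(record).get();
                    System.out.printf("wrote to partition %d at offset %d%n",
                            meta.partition(), meta.offset());
                }
            }
        }
    }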
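
To make the offset idea from the partition item concrete, the next sketch attaches a consumer directly to partition 0 of the assumed topic demo-topic and seeks to an arbitrary position (offset 42 is just a placeholder). Because a partition is an append-only log, an offset is simply a position in that log that a reader can jump to.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ReadFromOffset {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");    // assumed broker address
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            TopicPartition partition = new TopicPartition("demo-topic", 0);  // assumed topic, partition 0
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.assign(List.of(partition));  // attach to one partition explicitly, no group needed
                consumer.seek(partition, 42L);        // jump to a specific offset in the partition's log
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("offset %d -> %s%n", record.offset(), record.value());
                }
            }
        }
    }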
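
The partition-assignment strategies listed on the producer item (round-robin / load-balanced vs. key-based) can be shown with the Java client's default partitioner: records carrying the same non-null key are hashed to the same partition, so they stay ordered relative to each other, while records with a null key are spread across partitions by the producer itself. Broker address and topic name are again assumptions.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class KeyedProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Key-based (semantic) partitioning: all events for "user-42" hash to one partition.
                for (int i = 0; i < 3; i++) {
                    RecordMetadata m = producer
                            .send(new ProducerRecord<>("demo-topic", "user-42", "click-" + i))
                            .get();
                    System.out.println("user-42 -> partition " + m.partition());
                }

                // Null key: the producer balances records across partitions itself,
                // trading per-key ordering for load distribution.
                RecordMetadata m = producer
                        .send(new ProducerRecord<>("demo-topic", null, "unkeyed"))
                        .get();
                System.out.println("unkeyed -> partition " + m.partition());
            }
        }
    }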
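
Finally, a small sketch of packing a payload into the magic-byte-0 layout quoted in the message-format item (1 byte magic, 4 byte CRC32 of the payload, then the payload, for N total bytes). It uses only the JDK and illustrates the layout as described on the slide; it is not Kafka's own serialization code.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class MessageFormatSketch {
        public static void main(String[] args) {
            byte[] payload = "hello kafka".getBytes(StandardCharsets.UTF_8);

            // 4-byte CRC32 checksum computed over the payload bytes only
            CRC32 crc = new CRC32();
            crc.update(payload);

            // Layout for magic byte 0: | magic (1) | crc (4) | payload (N - 5) |
            ByteBuffer message = ByteBuffer.allocate(1 + 4 + payload.length);
            message.put((byte) 0);                 // "magic" format-version identifier
            message.putInt((int) crc.getValue());  // CRC32, truncated to its low 4 bytes
            message.put(payload);                  // the payload itself

            System.out.println("total message size N = " + message.capacity()
                    + " bytes, payload = " + payload.length + " bytes (N - 5)");
        }
    }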