
An Introduction to Apache Kafka

Amir Sedighi
December 01, 2014

Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala.
Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.

Transcript

  1. At first, data pipelining looks easy! • It often starts with one data pipeline from a producer to a consumer.
  2. It also looks pretty wise to reuse things! • The existing pipeline is reused for new producers.
  3. Message Delivery Semantics • At most once – Messages may be lost but are never redelivered. • At least once – Messages are never lost but may be redelivered. • Exactly once – This is what people actually want. (A consumer sketch showing how the offset-commit strategy selects between these semantics follows after the transcript.)
  4. Apache Kafka • Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. – Kafka is super fast. – Kafka is scalable. – Kafka is durable. – Kafka is distributed by design.
  5. Apache Kafka • A single Kafka broker (server) can handle hundreds of megabytes of reads and writes per second from thousands of clients.
  6. Apache Kafka • Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime.
  7. Apache Kafka • Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
  8. Apache Kafka • Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
  9. Topic • Kafka maintains feeds of messages in categories called topics. • Topics are the highest level of abstraction that Kafka provides.
  10. Producer • We'll call processes that publish messages to a Kafka topic producers. (A minimal producer sketch follows after the transcript.)
  11. Consumer • We'll call processes that subscribe to topics and process the feed of published messages consumers. – Hadoop Consumer
  12. Broker • Kafka is run as a cluster comprised of one or more servers, each of which is called a broker.
  13. Topics • A topic is a category or feed name to which messages are published. • The Kafka cluster maintains a partitioned log for each topic.
  14. Partition • A partition is an ordered, immutable sequence of messages that is continually appended to a commit log. • The messages in the partitions are each assigned a sequential id number called the offset. (An offset sketch follows after the transcript.)
  15. Producer • The producer is responsible for choosing which message to assign to which partition within the topic. – Round-Robin – Load-Balanced – Key-Based (Semantic-Oriented) (A keyed-producer sketch follows after the transcript.)
  16. Use Cases • Messaging – Kafka is comparable to traditional messaging systems such as ActiveMQ and RabbitMQ. • Kafka provides customizable latency. • Kafka has better throughput. • Kafka is highly fault-tolerant.
  17. Use Cases • Log Aggregation – Many people use Kafka as a replacement for a log aggregation solution.
      – Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS, perhaps) for processing.
      – In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency. • Lower latency • Easier support
  18. Use Cases • Stream Processing – Storm and Samza are popular frameworks for stream processing. They both use Kafka.
      • Event Sourcing – Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.
      • Commit Log – Kafka can serve as a kind of external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data.
  19. Message Format • The format of an N byte message is the following:
      – If the magic byte is 0:
        1. 1 byte "magic" identifier to allow format changes
        2. 4 byte CRC32 of the payload
        3. N - 5 byte payload
      – If the magic byte is 1:
        1. 1 byte "magic" identifier to allow format changes
        2. 1 byte "attributes" identifier to allow annotations on the message independent of the version (e.g. compression enabled, type of codec used)
        3. 4 byte CRC32 of the payload
        4. N - 6 byte payload
      (A sketch encoding this layout follows after the transcript.)
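
The delivery-semantics item above comes down to when the consumer commits its offsets. The sketch below is a minimal illustration, not code from the deck: it assumes a broker at localhost:9092 and a topic named demo-topic (both placeholders), and it uses the newer Java consumer client, which postdates this 2014 presentation. With automatic commits disabled, committing after processing gives at-least-once delivery; committing before processing would give at-most-once.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class AtLeastOnceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("group.id", "demo-group");                 // assumed consumer group
            props.put("enable.auto.commit", "false");            // we commit offsets ourselves
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("demo-topic"));        // assumed topic name
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record);                          // placeholder processing step
                    }
                    // Committing *after* processing: a crash between process() and commitSync()
                    // causes the batch to be read again, so messages may be redelivered but are
                    // never lost (at-least-once). Committing *before* processing would flip this
                    // to at-most-once.
                    consumer.commitSync();
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) {
            System.out.printf("partition=%d offset=%d value=%s%n",
                    record.partition(), record.offset(), record.value());
        }
    }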
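
For the producer item above, here is a hedged sketch of a publisher built on the Java producer client; the broker address localhost:9092, the topic name demo-topic, and the ten-message loop are illustrative assumptions, not details from the deck.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class DemoProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("acks", "all");                          // wait for replication before acking
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 10; i++) {
                    ProducerRecord<String, String> record =
                            new ProducerRecord<>("demo-topic", "key-" + i, "message-" + i);
                    // send() is asynchronous; get() blocks until the broker acknowledges the write
                    RecordMetadata meta = producer.send(record).get();
                    System.out.printf("wrote to partition %d at offset %d%n",
                            meta.partition(), meta.offset());
                }
            }
        }
    }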
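
To make the offset idea from the partition item concrete, the next sketch attaches a consumer directly to partition 0 of the assumed topic demo-topic and seeks to an arbitrary position (offset 42 is just a placeholder). Because a partition is an append-only log, an offset is simply a position in that log that a reader can jump to.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ReadFromOffset {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");    // assumed broker address
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            TopicPartition partition = new TopicPartition("demo-topic", 0);  // assumed topic, partition 0
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.assign(List.of(partition));  // attach to one partition explicitly, no group needed
                consumer.seek(partition, 42L);        // jump to a specific offset in the partition's log
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("offset %d -> %s%n", record.offset(), record.value());
                }
            }
        }
    }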
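
The partition-assignment strategies listed on the producer item (round-robin / load-balanced vs. key-based) can be shown with the Java client's default partitioner: records carrying the same non-null key are hashed to the same partition, so they stay ordered relative to each other, while records with a null key are spread across partitions by the producer itself. Broker address and topic name are again assumptions.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class KeyedProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Key-based (semantic) partitioning: all events for "user-42" hash to one partition.
                for (int i = 0; i < 3; i++) {
                    RecordMetadata m = producer
                            .send(new ProducerRecord<>("demo-topic", "user-42", "click-" + i))
                            .get();
                    System.out.println("user-42 -> partition " + m.partition());
                }

                // Null key: the producer balances records across partitions itself,
                // trading per-key ordering for load distribution.
                RecordMetadata m = producer
                        .send(new ProducerRecord<>("demo-topic", null, "unkeyed"))
                        .get();
                System.out.println("unkeyed -> partition " + m.partition());
            }
        }
    }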
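
Finally, a small sketch of packing a payload into the magic-byte-0 layout quoted in the message-format item (1 byte magic, 4 byte CRC32 of the payload, then the payload, for N total bytes). It uses only the JDK and illustrates the layout as described on the slide; it is not Kafka's own serialization code.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class MessageFormatSketch {
        public static void main(String[] args) {
            byte[] payload = "hello kafka".getBytes(StandardCharsets.UTF_8);

            // 4-byte CRC32 checksum computed over the payload bytes only
            CRC32 crc = new CRC32();
            crc.update(payload);

            // Layout for magic byte 0: | magic (1) | crc (4) | payload (N - 5) |
            ByteBuffer message = ByteBuffer.allocate(1 + 4 + payload.length);
            message.put((byte) 0);                 // "magic" format-version identifier
            message.putInt((int) crc.getValue());  // CRC32, truncated to its low 4 bytes
            message.put(payload);                  // the payload itself

            System.out.println("total message size N = " + message.capacity()
                    + " bytes, payload = " + payload.length + " bytes (N - 5)");
        }
    }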