An Introduction to Apache Kafka

Amir Sedighi
December 01, 2014

Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala.
Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.

Transcript

  1. 3 At first, data pipelining looks easy!

    • It often starts with one data pipeline from a producer to a consumer.
  2. 4 It also looks wise to reuse things!

    • Reusing the pipeline for new producers.
  3. 10 Message Delivery Semantics

    • At most once – Messages may be lost but are never redelivered.
    • At least once – Messages are never lost but may be redelivered.
    • Exactly once – This is what people actually want.
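The difference between the first two guarantees can be illustrated by when a consumer records its progress relative to processing a message. The following is a minimal in-memory sketch, not Kafka client code; the function `consume` and its parameters `commit_first` and `crash_at` are hypothetical names chosen for illustration, and a single crash is simulated between the "commit" and "process" steps.

```python
def consume(log, commit_first, crash_at):
    """Process `log` with one simulated crash, then restart and finish.

    commit_first=True  -> progress is saved before processing (at most once).
    commit_first=False -> progress is saved after processing (at least once).
    Returns the list of messages that were actually processed.
    """
    committed = 0      # next offset to read after a restart
    processed = []
    crashed = False
    while committed < len(log):
        offset = committed
        if commit_first:
            committed = offset + 1
        # Simulate a single crash in the gap between the two steps.
        if offset == crash_at and not crashed:
            crashed = True
            if not commit_first:
                # The work was done, but the commit never happened.
                processed.append(log[offset])
            continue   # restart from the last committed offset
        processed.append(log[offset])
        if not commit_first:
            committed = offset + 1
    return processed


if __name__ == "__main__":
    messages = ["a", "b", "c"]
    print(consume(messages, commit_first=True, crash_at=1))   # "b" is lost
    print(consume(messages, commit_first=False, crash_at=1))  # "b" is seen twice
```

Committing first loses the message that was in flight during the crash; committing afterwards processes it a second time after the restart.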
  4. 12 Apache Kafka

    • Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
      – Kafka is super fast.
      – Kafka is scalable.
      – Kafka is durable.
      – Kafka is distributed by design.
  7. 15 Apache Kafka

    • A single Kafka broker (server) can handle hundreds of megabytes of reads and writes per second from thousands of clients.
  9. 17 Apache Kafka

    • Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime.
  11. 19 Apache Kafka

    • Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
  13. 21 Apache Kafka

    • Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
  15. 26 Topic

    • Topic • Producer • Consumer • Broker
    • Kafka maintains feeds of messages in categories called topics.
    • Topics are the highest level of abstraction that Kafka provides.
  16. 30 Producer

    • We'll call processes that publish messages to a Kafka topic producers.
  17. 34 Consumer

    • We'll call processes that subscribe to topics and process the feed of published messages consumers.
      – Hadoop Consumer
  18. 36 Broker

    • Kafka runs as a cluster of one or more servers, each of which is called a broker.
  19. 39 Topics

    • A topic is a category or feed name to which messages are published.
    • The Kafka cluster maintains a partitioned log for each topic.
  20. 40 Partition

    • A partition is an ordered, immutable sequence of messages that is continually appended to a commit log.
    • Each message in a partition is assigned a sequential id number called the offset.
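To make the idea of an append-only partition with offsets concrete, here is a small in-memory sketch. The class `Partition` and its methods are hypothetical names for illustration and are not part of any Kafka API; offsets are assumed to start at 0.

```python
class Partition:
    """An append-only sequence of messages, each addressed by its offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return the offset assigned to it."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages (offset, message) pairs starting at offset."""
        end = min(offset + max_messages, len(self._messages))
        return [(i, self._messages[i]) for i in range(offset, end)]


if __name__ == "__main__":
    partition = Partition()
    for text in ("first", "second", "third"):
        print(partition.append(text), text)
    # A reader can start from any offset it remembers.
    print(partition.read(1))
```

Because existing entries are never modified, a reader only needs to remember a single number, the next offset, to resume where it left off.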
  21. 44 Producer

    • The producer is responsible for choosing which message to assign to which partition within the topic.
      – Round-Robin
      – Load-Balanced
      – Key-Based (Semantic-Oriented)
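Two of the partitioning strategies listed above, round-robin and key-based, can be sketched as follows. The function names are hypothetical, and the use of CRC32 as the key hash is an assumption made for this example; it is not a statement about which hash a real Kafka producer uses.

```python
import itertools
import zlib


def round_robin_partitioner(num_partitions):
    """Return a function that cycles through partitions, ignoring any key."""
    counter = itertools.count()

    def choose(key=None):
        return next(counter) % num_partitions

    return choose


def key_based_partitioner(num_partitions):
    """Return a function that always maps equal keys to the same partition."""

    def choose(key):
        # CRC32 is stable across runs, unlike Python's built-in hash() on str.
        return zlib.crc32(key.encode("utf-8")) % num_partitions

    return choose


if __name__ == "__main__":
    rr = round_robin_partitioner(3)
    print([rr() for _ in range(5)])
    by_key = key_based_partitioner(3)
    print(by_key("user-42") == by_key("user-42"))
```

Round-robin spreads load evenly, while key-based assignment keeps all messages with the same key in one partition, which preserves their relative order.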
  22. 59 Use Cases

    • Messaging
      – Kafka is comparable to traditional messaging systems such as ActiveMQ and RabbitMQ.
    • Kafka provides customizable latency.
    • Kafka has better throughput.
    • Kafka is highly fault-tolerant.
  23. 60 Use Cases

    • Log Aggregation
      – Many people use Kafka as a replacement for a log aggregation solution.
      – Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing.
      – In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.
    • Lower latency
    • Easier support
  24. 61 Use Cases

    • Stream Processing
      – Storm and Samza are popular frameworks for stream processing. They both use Kafka.
    • Event Sourcing
      – Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.
    • Commit Log
      – Kafka can serve as a kind of external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data.
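The event sourcing and commit log use cases both rest on the same idea: state can be rebuilt by replaying a time-ordered log. The sketch below illustrates this with a plain Python list standing in for the log; the account-balance domain and the names `apply_event` and `replay` are assumptions made purely for illustration.

```python
def apply_event(balance, event):
    """Apply one (kind, amount) state change to the current balance."""
    kind, amount = event
    if kind == "deposit":
        return balance + amount
    if kind == "withdraw":
        return balance - amount
    raise ValueError(f"unknown event kind: {kind!r}")


def replay(events, start=0, initial=0):
    """Rebuild state by applying events[start:] on top of `initial`."""
    balance = initial
    for event in events[start:]:
        balance = apply_event(balance, event)
    return balance


if __name__ == "__main__":
    log = [("deposit", 100), ("withdraw", 30), ("deposit", 5)]
    # A fresh node replays the whole log.
    print(replay(log))
    # A node that failed after applying the first two events only needs the tail.
    print(replay(log, start=2, initial=70))
```

Both calls arrive at the same state, which is the property a failed node relies on when it re-syncs from the log.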
  25. 62 Message Format

    /**
     * A message. The format of an N byte message is the following:
     * If magic byte is 0
     *   1. 1 byte "magic" identifier to allow format changes
     *   2. 4 byte CRC32 of the payload
     *   3. N - 5 byte payload
     * If magic byte is 1
     *   1. 1 byte "magic" identifier to allow format changes
     *   2. 1 byte "attributes" identifier to allow annotations on the message independent of the version (e.g. compression enabled, type of codec used)
     *   3. 4 byte CRC32 of the payload
     *   4. N - 6 byte payload
     */
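The two layouts described in the comment can be sketched with Python's `struct` and `zlib` modules. This is an illustrative encoder and decoder for the layout as stated on the slide, not Kafka's own implementation; the function names are hypothetical, and big-endian byte order for the CRC field is an assumption.

```python
import struct
import zlib


def encode_message(payload, magic=0, attributes=0):
    """Build a message in the magic-0 or magic-1 layout described above."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    if magic == 0:
        header = struct.pack(">BI", magic, crc)               # 1 + 4 bytes
    elif magic == 1:
        header = struct.pack(">BBI", magic, attributes, crc)  # 1 + 1 + 4 bytes
    else:
        raise ValueError(f"unsupported magic byte: {magic}")
    return header + payload


def decode_message(data):
    """Return (magic, attributes, payload), verifying the payload CRC32."""
    magic = data[0]
    if magic == 0:
        (crc,) = struct.unpack_from(">I", data, 1)
        attributes, payload = None, data[5:]
    elif magic == 1:
        attributes, crc = struct.unpack_from(">BI", data, 1)
        payload = data[6:]
    else:
        raise ValueError(f"unsupported magic byte: {magic}")
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("CRC32 mismatch: payload is corrupted")
    return magic, attributes, payload


if __name__ == "__main__":
    raw = encode_message(b"hello", magic=1, attributes=0)
    print(len(raw))             # payload length plus 6 header bytes
    print(decode_message(raw))
```

The header sizes match the comment: a magic-0 message is the payload plus 5 bytes, and a magic-1 message is the payload plus 6 bytes.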