
Introduction to Apache Kafka

Alex Objelean
September 28, 2015

Transcript

  1. Overview (Project Info)
     • Distributed publish-subscribe messaging system
     • Created by LinkedIn
     • Open sourced in 2011 (donated to Apache)
     • Written in Scala
     • Multi-language client support
  2. Overview (Highlights)
     • Provides the functionality of a messaging system
     • Communication: high-performance, language-agnostic TCP protocol
     • Runs as a cluster of one or more servers (brokers)
     • Uses ZooKeeper for broker coordination
     • Maintains feeds of messages, organized into topics
     • Consumer: a process subscribed to a topic
     • Producer: a process publishing messages to a topic
  3. Overview (ZooKeeper)
     • Open-source, high-performance coordination service for distributed applications
     • Centralized service for configuration management, naming, and distributed synchronization
  4. Topic
     • A partitioned log is kept for each topic
     • Each partition is ordered
     • Offset: the sequence id of a message within its partition
     • Configurable message retention
     • Partitions are distributed across the cluster
     • Each partition is replicated
     • One leader and several follower replicas per partition
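The partition/offset model above can be sketched as a toy append-only log (plain Java with no Kafka dependency; the `PartitionLog` class name is made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch (not Kafka code): a partition is an ordered, append-only log,
// and an offset is simply a message's sequence id within that partition.
public class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    // Appending assigns the next sequential offset.
    public long append(String message) {
        messages.add(message);
        return messages.size() - 1; // offset of the appended message
    }

    // A consumer reads from whatever offset it is currently tracking.
    public String read(long offset) {
        return messages.get((int) offset);
    }
}
```

Retention and replication are omitted here; the point is only that an offset identifies a position in one partition, not a global message id.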
  5. Producer
     • Publishes data to topics of its choice
     • Before publishing, queries the cluster for the leader broker of each partition
     • Responsible for assigning a partition to each message
     • Can use round-robin or custom partitioning semantics
     • The partitioner is told the number of partitions
  6. Producer Configuration
     Several interesting properties:
     • metadata.broker.list
     • serializer.class
     • partitioner.class
     • request.required.acks
     • message.send.max.retries
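As a sketch, the properties listed above might be assembled like this for the old (0.8.x-era) producer API; the broker addresses, the `com.example.SimplePartitioner` class, and the chosen values are illustrative, not prescribed by the deck:

```java
import java.util.Properties;

// Illustrative producer properties for the old Scala producer API.
public class ProducerProps {
    public static Properties build() {
        Properties props = new Properties();
        // Bootstrap list of brokers used to fetch topic metadata
        props.put("metadata.broker.list", "localhost:9092,localhost:9093");
        // Encoder that turns message objects into bytes
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // Custom partitioner class (a hash-based one is the default)
        props.put("partitioner.class", "com.example.SimplePartitioner");
        // 1 = wait for the leader's acknowledgement before continuing
        props.put("request.required.acks", "1");
        // Retry transient send failures this many times
        props.put("message.send.max.retries", "3");
        return props;
    }
}
```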
  7. Producer Partitioner
     • Each producer configures a partitioner
     • Sample implementation of a partitioner:

       import kafka.producer.Partitioner;
       import kafka.utils.VerifiableProperties;

       public class SimplePartitioner implements Partitioner<String> {
           // Kafka instantiates the partitioner reflectively and requires
           // a constructor taking VerifiableProperties.
           public SimplePartitioner(VerifiableProperties props) { }

           public int partition(String key, int numberOfPartitions) {
               // Mask the sign bit rather than using Math.abs(): for a
               // hashCode of Integer.MIN_VALUE, Math.abs() stays negative.
               return (key.hashCode() & Integer.MAX_VALUE) % numberOfPartitions;
           }
       }
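The "round robin" strategy mentioned on slide 5 could be sketched like this. The sketch deliberately does not implement Kafka's `Partitioner` interface, so it runs without the Kafka jars; the class name is made up:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch of round-robin partitioning: ignore the key and spread
// messages evenly across partitions in arrival order.
public class RoundRobinPartitioner {
    private final AtomicInteger counter = new AtomicInteger(0);

    public int partition(int numberOfPartitions) {
        // Mask the sign bit so the result stays non-negative after
        // the counter overflows.
        return (counter.getAndIncrement() & Integer.MAX_VALUE) % numberOfPartitions;
    }
}
```

Round-robin maximizes balance but gives up key locality: messages with the same key no longer land in the same partition, so per-key ordering is lost.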
  8. Consumer
     • Traditionally two models: queuing vs. broadcast
     • The consumer-group abstraction generalizes both models
     • Strong ordering guarantee within each partition
     • Each partition is consumed by exactly one consumer in a group
     • Consumers keep track of their progress (offsets stored in ZooKeeper)
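The rule that each partition is owned by exactly one consumer in the group can be illustrated with a toy range-style assignment (illustrative only; this is not Kafka's actual rebalancing algorithm, and the class name is made up):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of a consumer-group assignment: every partition gets exactly
// one owner, and partitions are spread across the group's members.
public class GroupAssignment {
    // Returns, for each consumer index, the list of partitions it owns.
    public static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> result = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            result.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            // Each partition is appended to exactly one consumer's list.
            result.get(p % consumers).add(p);
        }
        return result;
    }
}
```

With more consumers than partitions, some consumers would own nothing, which is why the next slide bounds threads by the partition count.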
  9. Consumer
     • One consumer stream per thread
     • Each thread receives messages from its assigned partitions
     • Total threads <= number of partitions
  10. Consumer Configuration
      Several interesting properties:
      • zookeeper.connect
      • group.id
      • auto.commit.enable
      • auto.commit.interval.ms
      • auto.offset.reset (smallest, largest)
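A sketch of the consumer properties listed above, for the old ZooKeeper-based high-level consumer; the host name, group id, and interval are illustrative values:

```java
import java.util.Properties;

// Illustrative consumer properties for the old high-level consumer API.
public class ConsumerProps {
    public static Properties build() {
        Properties props = new Properties();
        // ZooKeeper ensemble the consumer registers with
        props.put("zookeeper.connect", "localhost:2181");
        // Consumers sharing a group.id split the topic's partitions
        props.put("group.id", "example-group");
        // Periodically commit consumed offsets to ZooKeeper
        props.put("auto.commit.enable", "true");
        props.put("auto.commit.interval.ms", "60000");
        // Where to start when no committed offset exists:
        // "smallest" = oldest retained message, "largest" = new messages only
        props.put("auto.offset.reset", "smallest");
        return props;
    }
}
```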
  11. Consumer API
      • Two types: High Level and Simple Consumer
      • High Level: easier and straightforward (no offset management)
      • Simple Consumer: not so simple; needed to
        ◦ read a message multiple times
        ◦ consume only a subset of partitions in a process
        ◦ manage transactions to ensure each message is processed only once
      • Using it requires several steps:
        ▪ find an active broker and the leader for a topic and partition
        ▪ determine the replica brokers
        ▪ build the request and fetch the data
        ▪ recover from leader changes
  12. Getting Started (1)
      • Start ZooKeeper
        ./bin/zookeeper-server-start.sh config/zookeeper.properties
      • Start the brokers (one command per broker config file)
        ./bin/kafka-server-start.sh config/server.properties
      • Create a topic
        ./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 4 --topic test
      • List topics
        ./bin/kafka-topics.sh --list --zookeeper localhost:2181
  13. Getting Started (2)
      • Describe a topic
        ./bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic test
      • Change the number of partitions (can only be increased)
        ./bin/kafka-topics.sh --alter --zookeeper localhost:2181 --partitions 5 --topic test
      • Start a console producer
        ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
      • Start a console consumer
        ./bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
  14. Kafka and CPP
      • Messages in JSON format
      • High Level Consumer
      • No security on the messaging layer (not supported by Kafka)
      • The producer can send duplicate messages (identified by transactionId)
      • The consumer is responsible for skipping duplicate messages by tracking transactionIds (storing a limited set in a document)
      • Number of partitions: 12
      • 3 ZooKeeper servers and 3 Kafka servers
      • Messages persisted for 1-2 days
      • Partitioning scheme to be decided later
      • Monitoring still needs to be addressed
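The consumer-side deduplication described above might be sketched like this; the class name, method names, and the bound are made up, and the slide keeps the set in a document rather than in memory:

```java
import java.util.LinkedHashSet;

// Hedged sketch: remember a limited set of recently seen transactionIds
// and skip any message whose id was already processed.
public class TransactionDeduplicator {
    private final int maxIds;
    private final LinkedHashSet<String> seen = new LinkedHashSet<>();

    public TransactionDeduplicator(int maxIds) {
        this.maxIds = maxIds;
    }

    // Returns true if the message should be processed (first time seen).
    public boolean shouldProcess(String transactionId) {
        if (seen.contains(transactionId)) {
            return false; // duplicate delivery, skip it
        }
        seen.add(transactionId);
        if (seen.size() > maxIds) {
            // Evict the oldest id to keep the set bounded.
            seen.remove(seen.iterator().next());
        }
        return true;
    }
}
```

Bounding the set trades memory for a window of safety: a duplicate that arrives after its id has been evicted would be processed again, so the bound should comfortably exceed the expected redelivery window.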
  15. References
      • http://kafka.apache.org/
      • http://kafka.apache.org/documentation.html#quickstart
      • Introduction to Kafka
      • Error codes
      • Ops guide
      • Why we didn’t use Kafka
      • Apache Kafka and ZooKeeper (slides)