
Connect "K" of SMACK:pykafka, kafka-python or ?

suci
June 09, 2017

Connect "K" of SMACK:pykafka, kafka-python or ?

Apache Kafka is a distributed streaming platform for building real-time data pipelines and streaming apps. You can also use Kafka as a commit log service that functions much like a publish/subscribe messaging system, but with better throughput, built-in partitioning, replication, and fault tolerance; it runs in production in thousands of companies. Recently, Kafka has been widely adopted as the "K" of the SMACK stack because of its role connecting Apache Hadoop, Apache Storm, and Spark Streaming in the data pipeline.

In this talk, I will start by introducing data stream processing and the general concepts of Kafka's architecture and components through several use cases. Then, Kafka's API will be introduced via Python clients with demos. Finally, the benchmarks, comparisons, and limitations of the different Python clients will be discussed.


Transcript

  1. About Me: Shuhsi Lin, data software engineer in EAD at Micron. Currently working with data and people. Lurking in PyHug, Taipei.py, and various meetups. Contact: sucitw@gmail.com
  2. Agenda » Pipeline to streaming » What is Apache Kafka

    ⋄ Overview ⋄ Architecture ⋄ Use cases » Kafka API ⋄ Python clients » Conclusion and More about Kafka
  3. What we will not focus on » Reliability and durability ⋄ Scaling, replication, guarantees ⋄ Zookeeper » Log compaction » Administration, configuration, operations » Kafka Connect » Kafka Streams » Apache Kafka vs XXX ⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ, ZeroMQ, Redis, and ....
  4. 3 Paradigms for Programming 1. Request/response 2. Batch 3. Stream

    processing https://qconnewyork.com/ny2016/ny2016/presentation/large-scale-stream-processing-apache-kafka.html
  5. What is stream processing? » Data comes from the rise of events (orders, sales, clicks, or trades) » Databases are event streams too ⋄ creating a backup or standby copy of a database means publishing the database changes
  6. Data pipeline https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini What often happens in a complex data pipeline: • Complexity meant the data was always unreliable • Reports were untrustworthy • Derived indexes and stores were questionable • Everyone spent a lot of time battling data quality issues of all kinds • Data discrepancies
  7. What is Apache Kafka? Apache Kafka is a distributed system

    designed for streams. It is built to be fault-tolerant, high-throughput, horizontally scalable, and allows geographically distributing data streams and processing. https://kafka.apache.org
  8. What a streaming data platform can provide » "Data integration" (ETL) ⋄ how to transport data between systems ⋄ captures streams of events or data changes and feeds them to other data systems » "Stream processing" (messaging) ⋄ continuous, real-time processing and transformation of these streams, with the results made available system-wide at very low latency (figure: various systems in LinkedIn) https://www.confluent.io/blog/stream-data-platform-1/
  9. Kafka terminology » Producer » Consumer ⋄ Consumer group ⋄

    offset » Broker » Topic » Partition » Message » Replica
  10. What Kafka Does Publish & subscribe • to streams of

    data like a messaging system Process • streams of data efficiently and in real time Store • streams of data safely in a distributed replicated cluster https://kafka.apache.org/
  11. The key abstraction in Kafka is a structured commit log of updates: producers append records to this log, and each data consumer has its own position in the log and advances independently. This allows a reliable, ordered stream of updates to be distributed to each consumer. The log can be sharded and spread over a cluster of machines, and each shard is replicated for fault tolerance. Parallel, ordered consumption (important to a change-capture system for database updates) scales to TBs of data. https://www.confluent.io/blog/stream-data-platform-1/
  12. Topics and Partitions » Topics are split into partitions » Partitions are strongly ordered and immutable » Partitions can exist on different servers » Partitions enable scalability » Producers assign a message to a partition within the topic ⋄ either round robin (simply to balance load) ⋄ or according to the message key (see the sketch below) https://kafka.apache.org/documentation/#gettingStarted
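    To make the partition-assignment rules concrete, here is a minimal sketch using kafka-python (one of the Python clients covered later); the broker address and topic name are assumptions for illustration:

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers='localhost:9092')  # assumed broker

    # No key: the default partitioner spreads messages across partitions
    # (round robin-style load balancing).
    producer.send('test', b'unkeyed message')

    # With a key: all messages sharing a key hash to the same partition,
    # preserving per-key ordering.
    producer.send('test', key=b'user-42', value=b'keyed message')
    producer.flush()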
  13. Offsets » Messages are assigned an offset within their partition » Consumers track their position with (offset, partition, topic), as in the sketch below https://kafka.apache.org/documentation/#gettingStarted A two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups
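    A minimal sketch of how a consumer sees these coordinates, again with kafka-python; the broker address, topic, and group id are assumed values:

    from kafka import KafkaConsumer

    consumer = KafkaConsumer('test',                               # assumed topic
                             bootstrap_servers='localhost:9092',  # assumed broker
                             group_id='demo-group')               # assumed group

    for msg in consumer:
        # Every record carries its (topic, partition, offset) coordinates,
        # which is exactly what the consumer group tracks.
        print(msg.topic, msg.partition, msg.offset, msg.value)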
  14. Consumers and Partitions » A consumer group jointly consumes a topic » Within a group, a partition is always sent to the same consumer instance https://kafka.apache.org/documentation/#gettingStarted
  15. Consumer • Messages are available to consumers only when they have been committed • Kafka does not push ◦ unlike JMS • Reading does not destroy messages ◦ unlike a JMS Topic • (Some) history is available ◦ offline consumers can catch up ◦ consumers can re-consume from the past (see the sketch below) • Delivery guarantees ◦ ordering maintained ◦ at-least-once (per consumer) by default; at-most-once and exactly-once can be implemented P11 at https://www.slideshare.net/lucasjellema/amis-sig-introducing-apache-kafka-scalable-reliable-event-bus-message-queue
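    Because reads are not destructive, a consumer can rewind and replay history. A sketch with kafka-python, assuming a topic "test" with partition 0:

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                             enable_auto_commit=False)  # manage offsets manually

    tp = TopicPartition('test', 0)   # assumed topic and partition
    consumer.assign([tp])
    consumer.seek_to_beginning(tp)   # rewind and re-consume the retained history

    for msg in consumer:
        print(msg.offset, msg.value)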
  16. ZooKeeper: the coordination interface between the Kafka broker and consumers

    https://hortonworks.com/hadoop-tutorial/realtime-event-processing-nifi-kafka-storm/#section_3 » Stores configuration data for distributed services » Used primarily by brokers » Used by consumers in 0.8 but not 0.9
  17. Apache Kafka timeline » 2010: creation in LinkedIn » 2011-Nov: Apache Software Foundation incubator » 2013-Nov: v0.8 (new producer, reassign-partitions) » 2014: Confluent founded » 2015-Nov: v0.9 (Kafka Connect, security, new consumer) » 2016-May: v0.10 (Kafka Streams, rack awareness) » Next version: v0.10.2 (Single Message Transforms for Kafka Connect)
  18. TLS connections: SSL is supported only for the new Kafka producer and consumer (Kafka 0.9.0 and higher); a configuration sketch follows below. http://kafka.apache.org/documentation.html#security_ssl http://docs.confluent.io/current/kafka/ssl.html http://maximilianchrist.com/blog/connect-to-apache-kafka-from-python-using-ssl https://github.com/edenhill/librdkafka/wiki/Using-SSL-with-librdkafka
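    A minimal kafka-python sketch of such an SSL connection; the listener address and certificate paths are assumptions:

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers='broker:9093',  # assumed SSL listener
        security_protocol='SSL',
        ssl_cafile='ca.pem',              # assumed CA certificate
        ssl_certfile='client.pem',        # assumed client certificate
        ssl_keyfile='client.key',         # assumed client key
    )
    producer.send('test', b'over TLS')    # assumed topic
    producer.flush()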
  19. Apache Kafka can be considered as: » a stream data platform » a commit log service » a messaging system » a circular buffer
  20. Cons of Apache Kafka » Consumer complexity (smart, but poor clients) » Lack of tooling/monitoring (3rd party) » Still pre-1.0 (at the time) » Operationally, it's more manual than desired » Requires ZooKeeper Sep 26, 2015 http://www.slideshare.net/jimplush/introduction-to-apache-kafka-53225326
  21. Use Cases » Website Activity Tracking » Log Aggregation »

    Stream Processing » Event Sourcing » Commit logs » Metrics (Performance index streaming) ⋄ CPU/IO/Memory usage ⋄ Application Specific: ⋄ Time taken to load a web-page ⋄ Time taken to build a web-page ⋄ No. of requests ⋄ No. of hits on a particular page/url
  22. Event-driven applications » how Kafka is first adopted and how its role evolves over time in an architecture https://aws.amazon.com/tw/kafka/
  23. Case: Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with

    FiloDB and Spark Streaming http://helenaedelson.com/?p=1186 (2016/03)
  24. Four Core APIs » Producer API » Consumer API » Connect API » Streams API » (plus legacy APIs) $ cat < in.txt | grep "python" | tr a-z A-Z > out.txt https://www.slideshare.net/ConfluentInc/apache-kafkaa-distributed-streaming-platform
  25. Kafka Clients » Java (officially maintained) » C/C++ (librdkafka) » Go (AKA golang) » Erlang » .NET » Clojure » Ruby » Node.js » Proxy (HTTP REST, etc.) » Perl » stdin/stdout » PHP » Rust » Alternative Java » Storm » Scala DSL » Clojure https://cwiki.apache.org/confluence/display/KAFKA/Clients » Python ⋄ confluent-kafka-python ⋄ kafka-python ⋄ pykafka
  26. Kafka clients survey https://www.confluent.io/blog/first-annual-state-apache-kafka-client-use-survey (February 14, 2017) How users choose a Kafka client; Kafka client language adoption (results from 187 responses). Reliability matters most: • stability should be a priority • good error handling • good testing • good metrics and logging
  27. See your brokers and topics • Kafka-topics-ui ◦ demo http://kafka-topics-ui.landoop.com/#/ • Kafka-connect-ui ◦ demo http://kafka-connect-ui.landoop.com/ • Kafka-manager (Yahoo) • Kafka Eagle • kafka-offset-monitor • Kafka Tool (GUI) https://www.datadoghq.com/
  28. Apache Kafka clients for Python » pykafka » kafka-python » confluent-kafka-python » librdkafka ⋄ the Apache Kafka C/C++ library
  29. pykafka https://github.com/Parsely/pykafka http://pykafka.readthedocs.io/en/latest/ » Similar level of abstraction to the JVM Kafka client » Optional C extension built on librdkafka https://blog.parse.ly/post/3886/pykafka-now/ (2016, June)
  30. kafka-python https://github.com/dpkp/kafka-python/ http://kafka-python.readthedocs.io/ kafka-python is designed to function much like the official Java client, with a sprinkling of pythonic interfaces. API: • Producer • Consumer • Message • TopicPartition • KafkaError • KafkaException
  31. confluent-kafka-python Confluent's Python client for Apache Kafka and the Confluent Platform. Features: • high performance ⋄ librdkafka • reliability • supported • future proof https://github.com/confluentinc/confluent-kafka-python http://docs.confluent.io/current/clients/confluent-kafka-python/index.html
  32. Producer API (Java) https://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html https://www.tutorialspoint.com/apache_kafka/apache_kafka_simple_producer_example.htm Writing data to Kafka: a client that publishes records (messages) to the Kafka cluster. • Class KafkaProducer<K,V> (sync and async) ◦ close() ◦ flush() ◦ metrics() ◦ partitionsFor(topic) ◦ send(ProducerRecord<K,V> record) • Class ProducerRecord<K,V>: a key/value pair to be sent to Kafka ◦ ProducerRecord(topic, V value) ◦ ProducerRecord(topic, Integer partition, K key, V value) • Configuration settings (externalized in a property file) ◦ client.id ◦ producer.type ◦ acks ◦ retries ◦ bootstrap.servers ◦ linger.ms ◦ key.serializer ◦ value.serializer ◦ batch.size ◦ buffer.memory
  33. Producer API - pykafka

    from pykafka import KafkaClient
    from settings import ….

    client = KafkaClient(hosts=bootstrap_servers)
    # index topics by a bytes name (renamed from the shadowed `topic` variable)
    topic = client.topics[topic_name.encode('utf-8')]
    producer = topic.get_producer(use_rdkafka=use_rdkafka)
    producer.produce(msg_payload)
    producer.stop()  # will flush the background queue

    Class pykafka.producer.Producer ◦ produce(msg, partition_key=None) ◦ stop() Class pykafka.topic.Topic(cluster, topic_metadata) ◦ get_producer(use_rdkafka=False, **kwargs) http://pykafka.readthedocs.io/en/latest/api/producer.html
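    For completeness, a hedged sketch of the consuming side with pykafka's get_simple_consumer(); the broker address and topic name are assumptions:

    from pykafka import KafkaClient

    client = KafkaClient(hosts='localhost:9092')  # assumed broker
    topic = client.topics[b'test']                # assumed topic

    consumer = topic.get_simple_consumer()
    for message in consumer:
        if message is not None:
            print(message.offset, message.value)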
  34. Producer API - kafka-python

    from kafka import KafkaProducer
    from settings import BOOTSTRAP_SERVERS, TOPICS, MSG

    p = KafkaProducer(bootstrap_servers=BOOTSTRAP_SERVERS)
    # values must be bytes, or serializable to bytes via a configured value_serializer
    p.send(TOPICS, MSG.encode('utf-8'))
    p.flush()

    Class kafka.KafkaProducer(**configs) • close(timeout=None) • flush(timeout=None) • partitions_for(topic) • send(topic, value=None, key=None, partition=None, timestamp_ms=None) http://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html https://kafka-python.readthedocs.io/en/master/_modules/kafka/producer/kafka.html#KafkaProducer
  35. Producer API - confluent-kafka-python

    from confluent_kafka import Producer
    from settings import BOOTSTRAP_SERVERS, TOPICS, MSG

    p = Producer({'bootstrap.servers': BOOTSTRAP_SERVERS})
    p.produce(TOPICS, MSG.encode('utf-8'))
    p.flush()

    http://docs.confluent.io/current/clients/confluent-kafka-python/#producer Class confluent_kafka.Producer(config) • len() • flush([timeout]) • poll([timeout]) • produce(topic[, value][, key][, partition][, on_delivery][, timestamp])
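    Since produce() is asynchronous, a delivery callback (served by poll()/flush()) is the usual way to confirm delivery. A sketch, with the broker address and topic name assumed:

    from confluent_kafka import Producer

    p = Producer({'bootstrap.servers': 'localhost:9092'})  # assumed broker

    def on_delivery(err, msg):
        # Invoked from poll()/flush() once the broker acks (or rejects) the message.
        if err is not None:
            print('delivery failed:', err)
        else:
            print('delivered to', msg.topic(), msg.partition(), msg.offset())

    p.produce('test', b'payload', on_delivery=on_delivery)  # assumed topic
    p.poll(0)   # serve pending delivery callbacks
    p.flush()   # block until all outstanding messages are delivered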
  36. Consumer configuration • Consumer group ◦ group.id ◦ session.timeout.ms ◦ max.poll.records ◦ heartbeat.interval.ms • Offset management ◦ enable.auto.commit ◦ auto.commit.interval.ms ◦ auto.offset.reset (a kafka-python mapping is sketched below) https://kafka.apache.org/documentation.html#newconsumerconfigs
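    These broker-side names map onto kafka-python's constructor keywords; a sketch with assumed values:

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        'test',                            # assumed topic
        bootstrap_servers='localhost:9092',
        group_id='demo-group',             # group.id
        session_timeout_ms=10000,          # session.timeout.ms
        heartbeat_interval_ms=3000,        # heartbeat.interval.ms
        max_poll_records=500,              # max.poll.records
        enable_auto_commit=True,           # enable.auto.commit
        auto_commit_interval_ms=5000,      # auto.commit.interval.ms
        auto_offset_reset='earliest',      # auto.offset.reset
    )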
  37. Consumer API (Java) https://kafka.apache.org/0102/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html Reading data from Kafka: a client that consumes records from a Kafka cluster (a Python equivalent follows below). Class KafkaConsumer<K,V> • assign(<TopicPartition> partitions) • assignment() • beginningOffsets(<TopicPartition> partitions) • close(long timeout, TimeUnit timeUnit) • commitAsync(Map<TopicPartition,OffsetAndMetadata> offsets, OffsetCommitCallback callback) • commitSync(Map<TopicPartition,OffsetAndMetadata> offsets) • committed(TopicPartition partition) • endOffsets(<TopicPartition> partitions) • listTopics() • metrics() • offsetsForTimes(Map<TopicPartition,Long> timestampsToSearch) • partitionsFor(topic) • pause(<TopicPartition> partitions) • poll(long timeout) • position(TopicPartition partition) • resume(<TopicPartition> partitions) • seek(TopicPartition partition, long offset) • seekToBeginning(<TopicPartition> partitions) • seekToEnd(<TopicPartition> partitions) • subscribe(topics, ConsumerRebalanceListener listener) • subscribe(Pattern pattern, ConsumerRebalanceListener listener) • subscription() • unsubscribe() • wakeup()
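    The slide lists the Java surface; a rough confluent-kafka-python equivalent of the subscribe/poll loop looks like this (broker, group id, and topic are assumptions):

    from confluent_kafka import Consumer

    c = Consumer({
        'bootstrap.servers': 'localhost:9092',  # assumed broker
        'group.id': 'demo-group',               # assumed group
        'auto.offset.reset': 'earliest',
    })
    c.subscribe(['test'])                       # assumed topic

    try:
        while True:
            msg = c.poll(timeout=1.0)           # analogue of Java poll(long timeout)
            if msg is None:
                continue
            if msg.error():
                print('consumer error:', msg.error())
                continue
            print(msg.topic(), msg.partition(), msg.offset(), msg.value())
    finally:
        c.close()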
  38. Create a Kafka topic » Create a topic named "test" with a single partition and only one replica: ⋄ bin/kafka-topics.sh --create --zookeeper zhost:2181 --replication-factor 1 --partitions 1 --topic test » See that topic: ⋄ bin/kafka-topics.sh --list --zookeeper zhost:2181 » bin/kafka-topics.sh can create, delete, describe, or alter a topic.
  39. More about Kafka » Reliability and durability ⋄ scaling, replication, guarantees, Zookeeper » Log compaction » Administration, configuration, operations, monitoring » Kafka Connect » Kafka Streams » Schema Registry » REST proxy » Apache Kafka vs XXX ⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ, ZeroMQ, Redis, and ....
  40. The other 2 APIs » Connect API ◦ JDBC, HDFS, S3, …. » Streams API ◦ map, filter, aggregate, join
  41. More references 1. The Log: What every software engineer should know about real-time data's unifying abstraction, Jay Kreps, 2013 2. pykafka or kafka-python? https://github.com/Parsely/pykafka/issues/559 3. Why I am not a fan of Apache Kafka (2015-2016 Sep) 4. Kafka vs RabbitMQ: a. What are the differences between Apache Kafka and RabbitMQ? b. Understanding When to Use RabbitMQ or Apache Kafka 5. Kafka Summit (2016~) 6. Future features of Kafka (Kafka Improvement Proposals) 7. Kafka: The Definitive Guide