
Connect "K" of SMACK: pykafka, kafka-python or ?

June 09, 2017

Apache Kafka is a distributed streaming platform for building real-time data pipelines and streaming apps. You can also think of Kafka as a commit log service that works much like a publish/subscribe messaging system, but with better throughput, built-in partitioning, replication, and fault tolerance; it runs in production at thousands of companies. Recently, Kafka has been widely adopted as one component of the SMACK stack because of its role connecting with Apache Hadoop, Apache Storm, and Spark Streaming in the data pipeline.

In this talk, I will start by introducing data stream processing and the general concepts of Kafka's architecture and components through several use cases. Then, Kafka's APIs will be introduced through Python clients, with a demo. Finally, benchmarks, comparisons, and limitations of the different Python clients will be discussed.




  1. Shuhsi Lin, 2017/06/09 at PyconTw 2017. Connect K of SMACK: pykafka, kafka-python or ?
  2. About Me: Data Software Engineer of EAD at the manufacturer Micron. Currently working with data and people. Lurking in PyHug, Taipei.py and various meetups. Shuhsi Lin, sucitw gmail.com
  3. K in SMACK

  4. • Apache Spark: Processing Engine.
    • Apache Mesos: The Container.
    • Akka: The Model.
    • Apache Cassandra: The Storage.
    • Apache Kafka: The Broker.
    http://datastrophic.io/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka/
    https://www.linkedin.com/pulse/smack-my-bdaas-why-2017-year-big-data-goes-tom-martin
    http://www.natalinobusa.com/2015/11/why-is-smack-stack-all-rage-lately.html
    https://dzone.com/articles/short-interview-with-smack-tech-stack-1
    https://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka
  5. Agenda
    » Pipeline to streaming
    » What is Apache Kafka
      ⋄ Overview
      ⋄ Architecture
      ⋄ Use cases
    » Kafka API
      ⋄ Python clients
    » Conclusion and more about Kafka
  6. What we will not focus on
    » Reliability and durability
      ⋄ Scaling, replication, guarantees
      ⋄ ZooKeeper
    » Compact log
    » Administration, configuration, operations
    » Kafka Connect
    » Kafka Streams
    » Apache Kafka vs. XXX
      ⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ, ZeroMQ, Redis, and ....
  7. What is Stream Processing

  8. 3 Paradigms for Programming
    1. Request/response
    2. Batch
    3. Stream processing
    https://qconnewyork.com/ny2016/ny2016/presentation/large-scale-stream-processing-apache-kafka.html
  9. Request/response

  10. Batch

  11. Stream Processing

  12. What is stream processing
    » Data comes from the rise of events (orders, sales, clicks or trades)
    » Databases are event streams
      ⋄ the process of creating a backup or standby copy of a database
      ⋄ publishing the database changes
  13. Data pipeline. What often happens in a complex data pipeline:
    • Complexity meant that the data was always unreliable
    • Reports were untrustworthy
    • Derived indexes and stores were questionable
    • Everyone spent a lot of time battling data quality issues of all kinds
    • Data discrepancy
    https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini
  14. Data pipeline Data streaming

  15. Apache Kafka 101

  16. The name, “Kafka”, came from? https://www.quora.com/What-is-the-relation-between-Kafka-the-writer-and-Apache-Kafka-the-distributed-messaging-system http://slideplayer.com/slide/4221536/ https://en.wikipedia.org/wiki/Franz_Kafka

  17. What is Apache Kafka? Apache Kafka is a distributed system designed for streams. It is built to be fault-tolerant, high-throughput, and horizontally scalable, and it allows geographically distributing data streams and processing. https://kafka.apache.org
  18. Why Apache Kafka

  19. Fast Scalable Durable Distributed https://pixabay.com/photo-2135057/

  20. Stream data platform (Original mechanism): integration mechanism between systems. https://www.confluent.io/blog/stream-data-platform-1/

  21. Kafka as a service https://www.confluent.io/

  22. What a streaming data platform can provide
    » “Data integration” (ETL)
      ⋄ How to transport data between systems
      ⋄ Captures streams of events or data changes and feeds these to other data systems
    » “Stream processing” (messaging)
      ⋄ Continuous, real-time processing and transformation of these streams, making the results available system-wide
    Figure labels: various systems in LinkedIn; analytical data processing with very low latency
    https://www.confluent.io/blog/stream-data-platform-1/
  23. Kafka terminology
    » Producer
    » Consumer
      ⋄ Consumer group
      ⋄ Offset
    » Broker
    » Topic
    » Partition
    » Message
    » Replica
  24. What Kafka Does
    • Publish & subscribe: to streams of data like a messaging system
    • Process: streams of data efficiently and in real time
    • Store: streams of data safely in a distributed replicated cluster
    https://kafka.apache.org/
  25. Publish/Subscribe. P14 at https://www.slideshare.net/lucasjellema/amis-sig-introducing-apache-kafka-scalable-reliable-event-bus-message-queue

  26. P15 at https://www.slideshare.net/rahuldausa/real-time-analytics-with-apache-kafka-and-apache-spark
    Figure labels: offset update in v0.8 vs. offset update in v0.10; smart consumer; ports 2181 and 9092 (the default ZooKeeper and Kafka broker ports)
  27. A modern stream-centric data architecture built around Apache Kafka https://www.confluent.io/blog/stream-data-platform-1/

    500 billion events per day
  28. The key abstraction in Kafka is a structured commit log of updates
    » Each of these data consumers has its own position in the log and advances independently. This allows a reliable, ordered stream of updates to be distributed to each consumer.
    » The log can be sharded and spread over a cluster of machines, and each shard is replicated for fault-tolerance.
    Figure labels: append records to this log; producers; consumers; parallel, ordered consumption (important to a change capture system for database updates); TBs of data
    https://www.confluent.io/blog/stream-data-platform-1/
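The log abstraction on slide 28 can be illustrated with a toy in-memory model. This is a minimal sketch, not how Kafka is implemented: `CommitLog` and `LogReader` are hypothetical names introduced here, and there is no sharding, replication, or persistence. It only shows that appends are ordered and that each reader advances its own position independently.

```python
class CommitLog:
    """Toy append-only log; each record gets the next sequential offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Appending never rewrites earlier records; the offset is the index.
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        # Reading is non-destructive: records stay in the log.
        return self._records[offset:offset + max_records]


class LogReader:
    """A consumer that keeps its own position in a shared log."""

    def __init__(self, log):
        self._log = log
        self.position = 0

    def poll(self, max_records=10):
        batch = self._log.read(self.position, max_records)
        self.position += len(batch)
        return batch


log = CommitLog()
for update in ("insert:1", "update:1", "delete:1"):
    log.append(update)

fast, slow = LogReader(log), LogReader(log)
fast_batch = fast.poll()               # reads all three updates
slow_batch = slow.poll(max_records=1)  # reads only the first update
```

The slow reader can call `poll()` again later and will see the remaining updates in the same order the fast reader saw them.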
  29. Topics and Partitions
    » Topics are split into partitions
    » Partitions are strongly ordered & immutable
    » Partitions can exist on different servers
    » Partitions enable scalability
    » Producers assign a message to a partition within the topic
      ⋄ Either round robin (simply to balance load)
      ⋄ or according to the keys
    https://kafka.apache.org/documentation/#gettingStarted
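The two partition-assignment strategies on slide 29 can be sketched in a few lines. This is an illustration under stated assumptions: `make_partitioner` is a hypothetical helper, and `zlib.crc32` is used here only as a stand-in stable hash, not the hash any particular Kafka client uses.

```python
import itertools
import zlib


def make_partitioner(num_partitions):
    """Return a function that picks a partition for a message key."""
    round_robin = itertools.cycle(range(num_partitions))

    def choose(key=None):
        if key is None:
            # No key: rotate through partitions simply to balance load.
            return next(round_robin)
        # Keyed: a stable hash keeps equal keys on the same partition.
        return zlib.crc32(key) % num_partitions

    return choose


choose = make_partitioner(num_partitions=4)
unkeyed = [choose() for _ in range(5)]          # 0, 1, 2, 3, then wraps to 0
keyed = [choose(b"user-42") for _ in range(3)]  # same partition every time
```

Keeping equal keys on one partition is what lets a consumer see all messages for that key in order, since ordering holds only within a partition.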
  30. Offsets
    » Messages are assigned an offset in the partition
    » Consumers track their position with (offset, partition, topic)
    Figure: a two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups
    https://kafka.apache.org/documentation/#gettingStarted
  31. Consumers and Partitions
    » A consumer group consumes one topic
    » A partition is always sent to the same consumer instance
    https://kafka.apache.org/documentation/#gettingStarted
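The rule that each partition goes to exactly one consumer instance within a group can be sketched as follows. `assign_partitions` is a hypothetical helper that spreads partitions round robin; Kafka's real assignment strategies run inside the group coordination protocol and are not reproduced here. The partition names P0-P3 follow the figure on slide 30.

```python
def assign_partitions(partitions, consumers):
    """Give every partition to exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    ordered = sorted(consumers)
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered[index % len(ordered)]
        assignment[owner].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]
# Two independent groups read the same partitions in full.
group_a = assign_partitions(partitions, ["c1", "c2"])
group_b = assign_partitions(partitions, ["c3", "c4", "c5", "c6"])
```

With two consumers each instance owns two partitions; with four consumers each owns one. Adding a fifth consumer to a group would leave it idle, because a partition is never shared inside one group.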
  32. Consumer
    • Messages are available to consumers only when they have been committed
    • Kafka does not push
      ◦ Unlike JMS
    • Reads by consumers do not destroy messages
      ◦ Unlike JMS Topic
    • (Some) history available
      ◦ Offline consumers can catch up
      ◦ Consumers can re-consume from the past
    • Delivery guarantees
      ◦ Ordering maintained
      ◦ At-least-once (per consumer) by default; at-most-once and exactly-once can be implemented
    P11 at https://www.slideshare.net/lucasjellema/amis-sig-introducing-apache-kafka-scalable-reliable-event-bus-message-queue
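The difference between at-least-once and at-most-once delivery comes down to whether a consumer commits its offset after or before processing a record. The sketch below is a simulation only, with no Kafka involved: `run`, `run_with_restart`, and `Crash` are hypothetical names, and the crash is injected between the two steps to show what a restart does in each mode.

```python
class Crash(Exception):
    """Simulated consumer failure."""


def run(log, state, handled, commit_first, crash_on=None):
    """Consume `log` from the committed offset; may crash between steps."""
    for offset in range(state["committed"], len(log)):
        record = log[offset]
        if commit_first:
            state["committed"] = offset + 1  # at-most-once: commit first
        else:
            handled.append(record)           # at-least-once: process first
        if record == crash_on:
            raise Crash(record)              # failure between the two steps
        if commit_first:
            handled.append(record)
        else:
            state["committed"] = offset + 1


def run_with_restart(commit_first):
    log = ["a", "b", "c"]
    state, handled = {"committed": 0}, []
    try:
        run(log, state, handled, commit_first, crash_on="b")
    except Crash:
        pass  # the consumer restarts from its last committed offset
    run(log, state, handled, commit_first)
    return handled


at_least_once = run_with_restart(commit_first=False)  # "b" is handled twice
at_most_once = run_with_restart(commit_first=True)    # "b" is lost
```

At-least-once never loses a record but can repeat one, so handlers should tolerate duplicates; at-most-once never repeats but can drop a record.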
  33. ZooKeeper: the coordination interface between the Kafka broker and consumers
    » Stores configuration data for distributed services
    » Used primarily by brokers
    » Used by consumers in 0.8 but not 0.9
    https://hortonworks.com/hadoop-tutorial/realtime-event-processing-nifi-kafka-storm/#section_3
  34. Apache Kafka timeline

  35. Apache Kafka timeline (figure). Dates marked: 2011-Nov, 2013-Nov, 2015-Nov, 2016-May.
    » 2010: creation at LinkedIn
    » Apache Software Foundation incubator
    » v0.8: New Producer, reassign-partitions
    » 2014: Confluent
    » v0.9: Kafka Connect, security, New Consumer
    » v0.10: Kafka Streams, rack awareness
    » v0.10.2: Single Message Transforms for Kafka Connect
    » Next version
  36. TLS connection: SSL is supported only for the new Kafka Producer and Consumer (Kafka versions 0.9.0 and higher)
    http://kafka.apache.org/documentation.html#security_ssl
    http://docs.confluent.io/current/kafka/ssl.html
    http://maximilianchrist.com/blog/connect-to-apache-kafka-from-python-using-ssl
    https://github.com/edenhill/librdkafka/wiki/Using-SSL-with-librdkafka
  37. Apache Kafka is considered a:
    » Stream data platform
    » Commit log service
    » Messaging system
    » Circular buffer
  38. Cons of Apache Kafka
    » Consumer complexity (smart, but poor client)
    » Lack of tooling/monitoring (3rd party)
    » Still pre-1.0 release
    » Operationally, it’s more manual than desired
    » Requires ZooKeeper
    Source (Sep 26, 2015): http://www.slideshare.net/jimplush/introduction-to-apache-kafka-53225326
  39. Use Cases
    » Website Activity Tracking
    » Log Aggregation
    » Stream Processing
    » Event Sourcing
    » Commit logs
    » Metrics (Performance index streaming)
      ⋄ CPU/IO/Memory usage
      ⋄ Application specific:
        ⋄ Time taken to load a web page
        ⋄ Time taken to build a web page
        ⋄ No. of requests
        ⋄ No. of hits on a particular page/URL
  40. Event-driven Applications
    » How it is first adopted and how its role evolves over time in their architecture.
    https://aws.amazon.com/tw/kafka/
  41. https://www.slideshare.net/ConfluentInc/iot-data-platforms-processing-iot-data-with-apache-kafka

  42. Conceptual Reference Architecture for Real-Time Processing in HDP 2.2 (February 12, 2015) https://hortonworks.com/blog/storm-kafka-together-real-time-data-refinery/
  43. Event delivery system design in Spotify https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/

  44. Case: Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spark Streaming (2016/03) http://helenaedelson.com/?p=1186
  45. 2 + 2 Core APIs

  46. Four Core APIs
    » Producer API
    » Consumer API
    » Connect API
    » Streams API
    » Legacy APIs
    $ cat < in.txt | grep "python" | tr a-z A-Z > out.txt
    https://www.slideshare.net/ConfluentInc/apache-kafkaa-distributed-streaming-platform
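The shell pipeline on slide 46 is an analogy for stream processing: each stage reads a stream and emits a stream. A rough Python equivalent using generators is sketched below. The function names `grep` and `upper` and the sample lines are illustrative only, and nothing here touches Kafka itself.

```python
def grep(lines, needle):
    """Pass through only the lines containing `needle` (like grep)."""
    for line in lines:
        if needle in line:
            yield line


def upper(lines):
    """Uppercase each line (like tr a-z A-Z)."""
    for line in lines:
        yield line.upper()


# The source stands in for in.txt; a real stream would never end.
source = ["python clients\n", "java client\n", "pykafka is python\n"]
result = list(upper(grep(source, "python")))
```

Because generators are lazy, each line flows through both stages before the next line is read, which is the same record-at-a-time shape the shell pipe has.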
  47. Kafka Clients
    » Java (officially maintained)
    » C/C++ (librdkafka)
    » Go (AKA golang)
    » Erlang
    » .NET
    » Clojure
    » Ruby
    » Node.js
    » Proxy (HTTP REST, etc)
    » Perl
    » stdin/stdout
    » PHP
    » Rust
    » Alternative Java
    » Storm
    » Scala DSL
    » Clojure
    » Python
      ⋄ confluent-kafka-python
      ⋄ kafka-python
      ⋄ pykafka
    https://cwiki.apache.org/confluence/display/KAFKA/Clients
  48. Kafka Clients survey (February 14, 2017), results from 187 responses
    » Kafka Client: Language Adoption
    » How users choose a Kafka client
      Reliability:
      • Stability should be priority
      • Good error handling
      • Good testing
      • Good metrics and logging
    https://www.confluent.io/blog/first-annual-state-apache-kafka-client-use-survey
  49. Create your own Kafka broker https://github.com/Landoop/fast-data-dev

  50. See your brokers and topics
    • Kafka-topics-ui
      ◦ Demo http://kafka-topics-ui.landoop.com/#/
    • Kafka-connect-ui
      ◦ Demo http://kafka-connect-ui.landoop.com/
    • Kafka-manager (yahoo)
    • Kafka Eagle
    • kafka-offset-monitor
    • Kafka Tool (GUI)
    https://www.datadoghq.com/
  51. Kafka Tool

  52. Kafka UI(landoop)

  53. 2 + 2 Core APIs And python clients

  54. Kafka API Documents https://kafka.apache.org/0102/javadoc/index.html?

  55. Apache Kafka client for Python
    » pykafka
    » kafka-python
    » confluent-kafka-python
    » librdkafka
      ⋄ The Apache Kafka C/C++ library
  56. Pykafka
    » Similar level of abstraction to the JVM Kafka client
    » Built on librdkafka
    https://github.com/Parsely/pykafka
    http://pykafka.readthedocs.io/en/latest/
    https://blog.parse.ly/post/3886/pykafka-now/ (June 2016)
  57. kafka-python
    kafka-python is designed to function much like the official Java client, with a sprinkling of Pythonic interfaces.
    API
    • Producer
    • Consumer
    • Message
    • TopicPartition
    • KafkaError
    • KafkaException
    https://github.com/dpkp/kafka-python/
    http://kafka-python.readthedocs.io/
  58. Confluent-kafka-python: Confluent's Python client for Apache Kafka and the Confluent Platform.
    Features:
    • High performance
      ⋄ librdkafka
    • Reliability
    • Supported
    • Future proof
    https://github.com/confluentinc/confluent-kafka-python
    http://docs.confluent.io/current/clients/confluent-kafka-python/index.html?
  59. Producer API (JAVA)
    Writing data to Kafka: a client that publishes records to the Kafka cluster.
    Class KafkaProducer<K,V>
    • KafkaProducer: sync and async
      ◦ close()
      ◦ flush()
      ◦ metrics()
      ◦ partitionsFor(topic)
      ◦ send(ProducerRecord<K,V> record)
    Class ProducerRecord<K,V>: a key/value pair to be sent to Kafka.
    • ProducerRecord(topic, V value)
    • ProducerRecord(topic, Integer partition, K key, V value)
    Configuration settings (configuration is externalized in a property file)
    • client.id
    • producer.type
    • acks
    • retries
    • bootstrap.servers
    • linger.ms
    • key.serializer
    • value.serializer
    • batch.size
    • buffer.memory
    https://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
    https://www.tutorialspoint.com/apache_kafka/apache_kafka_simple_producer_example.htm
  60. Producer API - pykafka
        from pykafka import KafkaClient
        from settings import ….
        client = KafkaClient(hosts=bootstrap_servers)
        topic = client.topics[topic.encode('UTF-8')]
        producer = topic.get_producer(use_rdkafka=use_rdkafka)
        producer.produce(msg_payload)
        producer.stop()  # Will flush background queue
    Class pykafka.producer.Producer()
    • produce(msg, partition_key=None)
    • stop()
    Class pykafka.topic.Topic(cluster, topic_metadata)
    • get_producer(use_rdkafka=False, **kwargs)
    http://pykafka.readthedocs.io/en/latest/api/producer.html
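The slide's pykafka snippet can be wrapped into a reusable function. This is a minimal sketch, assuming a reachable broker and an existing topic: `to_bytes` and `produce_with_pykafka` are hypothetical helpers, and the host and topic in the usage note are placeholders. It relies on pykafka producers being usable as context managers, which stops the producer on exit in place of the explicit `stop()` call.

```python
def to_bytes(text, encoding="utf-8"):
    """pykafka expects topic names and payloads as bytes."""
    return text if isinstance(text, bytes) else text.encode(encoding)


def produce_with_pykafka(hosts, topic_name, messages, use_rdkafka=False):
    """Send each message to the topic, then stop the producer."""
    # Imported lazily so the sketch loads even where pykafka is not installed.
    from pykafka import KafkaClient

    client = KafkaClient(hosts=hosts)
    topic = client.topics[to_bytes(topic_name)]
    # Leaving the with-block stops the producer, flushing its queue.
    with topic.get_producer(use_rdkafka=use_rdkafka) as producer:
        for message in messages:
            producer.produce(to_bytes(message))
```

A call such as `produce_with_pykafka("127.0.0.1:9092", "test", ["hello"])` would send one message; setting `use_rdkafka=True` only works if pykafka was built with its librdkafka extension.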
  61. Performance assessment https://blog.parse.ly/post/3886/pykafka-now/

  62. Producer API - kafka-python
        from kafka import KafkaConsumer, KafkaProducer
        from settings import BOOTSTRAP_SERVERS, TOPICS, MSG
        p = KafkaProducer(bootstrap_servers=BOOTSTRAP_SERVERS)
        p.send(TOPICS, MSG.encode('utf-8'))
        p.flush()
    Note on the message value: must be type bytes, or be serializable to bytes via configured value_serializer.
    Class kafka.KafkaProducer(**configs)
    • close(timeout=None)
    • flush(timeout=None)
    • partitions_for(topic)
    • send(topic, value=None, key=None, partition=None, timestamp_ms=None)
    https://kafka-python.readthedocs.io/en/master/_modules/kafka/producer/kafka.html#KafkaProducer
    http://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html
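The note about `value_serializer` can be made concrete. In this sketch the producer is given a serializer so callers can pass Python objects instead of pre-encoded bytes. `serialize_json` and `send_events` are hypothetical helpers, JSON is an arbitrary choice of wire format, and a reachable broker is assumed.

```python
import json


def serialize_json(value):
    """Turn a Python object into the bytes that send() requires."""
    return json.dumps(value, sort_keys=True).encode("utf-8")


def send_events(bootstrap_servers, topic, events, timeout=10):
    """Send events and return the broker's metadata for each one."""
    # Lazy import: lets the sketch load without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=serialize_json,
    )
    try:
        # send() is asynchronous and returns a future per message.
        futures = [producer.send(topic, value=event) for event in events]
        producer.flush()
        # Each future resolves to metadata (topic, partition, offset).
        return [future.get(timeout=timeout) for future in futures]
    finally:
        producer.close()
```

Waiting on the futures makes delivery failures surface as exceptions, at the cost of blocking until the broker has acknowledged each message.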
  63. Producer API - confluent-kafka-python
        from confluent_kafka import Producer
        from settings import BOOTSTRAP_SERVERS, TOPICS, MSG
        p = Producer({'bootstrap.servers': BOOTSTRAP_SERVERS})
        p.produce(TOPICS, MSG.encode('utf-8'))
        p.flush()
    Class confluent_kafka.Producer(*kwargs)
    • len()
    • flush([timeout])
    • poll([timeout])
    • produce(topic[, value][, key][, partition][, on_delivery][, timestamp])
    http://docs.confluent.io/current/clients/confluent-kafka-python/#producer
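The `on_delivery` argument listed for `produce()` lets the client report the outcome of each message. The following is a minimal sketch of that pattern: `make_delivery_recorder` and `produce_with_reports` are hypothetical helpers, and a reachable broker is assumed. Callbacks only fire from within `poll()` or `flush()`, which is why the loop calls `poll(0)`.

```python
def make_delivery_recorder(results):
    """Build an on_delivery callback that records each outcome."""
    def on_delivery(err, msg):
        # err is None on success; msg describes the delivered record.
        if err is not None:
            results.append(("failed", str(err)))
        else:
            results.append(("ok", msg.topic(), msg.partition(), msg.offset()))
    return on_delivery


def produce_with_reports(bootstrap_servers, topic, payloads):
    """Produce text payloads and return one report per message."""
    # Lazy import: lets the sketch load without confluent-kafka installed.
    from confluent_kafka import Producer

    results = []
    producer = Producer({"bootstrap.servers": bootstrap_servers})
    callback = make_delivery_recorder(results)
    for payload in payloads:
        producer.produce(topic, payload.encode("utf-8"), on_delivery=callback)
        producer.poll(0)  # serve pending delivery callbacks without blocking
    producer.flush()      # wait for all outstanding deliveries
    return results
```

Collecting reports in a list keeps the example small; a long-running producer would more likely log failures or retry them.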
  64. Consumer
    • Consumer group
      ◦ group.id
      ◦ session.timeout.ms
      ◦ max.poll.records
      ◦ heartbeat.interval.ms
    • Offset management
      ◦ enable.auto.commit
      ◦ auto.commit.interval.ms
      ◦ auto.offset.reset
    https://kafka.apache.org/documentation.html#newconsumerconfigs
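These consumer settings carry over to the Python clients with small naming differences: kafka-python takes them as keyword arguments with underscores, while confluent-kafka-python takes the dotted names in a dict. The sketch below shows the kafka-python form. The values are placeholders chosen for illustration, and `to_kafka_python_kwargs` and `build_consumer` are hypothetical helpers.

```python
# Placeholder values for the settings named on the slide.
JAVA_STYLE_CONFIG = {
    "group.id": "demo-group",
    "session.timeout.ms": 30000,
    "heartbeat.interval.ms": 3000,
    "max.poll.records": 500,
    "enable.auto.commit": True,
    "auto.commit.interval.ms": 5000,
    "auto.offset.reset": "earliest",
}


def to_kafka_python_kwargs(config):
    """Rename dotted property names to kafka-python keyword arguments."""
    return {name.replace(".", "_"): value for name, value in config.items()}


def build_consumer(topic, bootstrap_servers, config=JAVA_STYLE_CONFIG):
    """Create a kafka-python consumer from Java-style property names."""
    # Lazy import: lets the sketch load without kafka-python installed.
    from kafka import KafkaConsumer

    return KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        **to_kafka_python_kwargs(config)
    )


kwargs = to_kafka_python_kwargs(JAVA_STYLE_CONFIG)
```

The simple rename works for the settings listed here; it is not guaranteed for every Java property, since the Python clients do not support all of them.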
  65. Consumer API (JAVA)
    Reading data from Kafka: a client that consumes records from a Kafka cluster.
    Class KafkaConsumer<K,V>
    • assign(<TopicPartition> partitions)
    • assignment()
    • beginningOffsets(<TopicPartition> partitions)
    • close(long timeout, TimeUnit timeUnit)
    • commitAsync(Map<TopicPartition,OffsetAndMetadata> offsets, OffsetCommitCallback callback)
    • commitSync(Map<TopicPartition,OffsetAndMetadata> offsets)
    • committed(TopicPartition partition)
    • endOffsets(<TopicPartition> partitions)
    • listTopics()
    • metrics()
    • offsetsForTimes(Map<TopicPartition,Long> timestampsToSearch)
    • partitionsFor(topic)
    • pause(<TopicPartition> partitions)
    • poll(long timeout)
    • position(TopicPartition partition)
    • resume(<TopicPartition> partitions)
    • seek(TopicPartition partition, long offset)
    • seekToBeginning(<TopicPartition> partitions)
    • seekToEnd(<TopicPartition> partitions)
    • subscribe(topics, ConsumerRebalanceListener listener)
    • subscribe(Pattern pattern, ConsumerRebalanceListener listener)
    • subscription()
    • unsubscribe()
    • wakeup()
    https://kafka.apache.org/0102/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
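kafka-python's `KafkaConsumer` mirrors much of this Java interface, including `poll()` and a synchronous `commit()`. The sketch below shows one poll-process-commit cycle with manual commits; it is an illustration in Python rather than the Java API itself. `consume_batch` is a hypothetical helper that expects a consumer created with `enable_auto_commit=False`.

```python
def consume_batch(consumer, handle, timeout_ms=1000):
    """Poll once, handle every record, then commit the offsets.

    Committing only after handling gives at-least-once behaviour:
    a crash before commit() means the batch is delivered again.
    """
    handled = 0
    # poll() returns a dict mapping each partition to a list of records.
    batches = consumer.poll(timeout_ms=timeout_ms)
    for records in batches.values():
        for record in records:
            handle(record)
            handled += 1
    if handled:
        consumer.commit()  # synchronous commit of the polled positions
    return handled
```

In use, this would be called in a loop, for example `consume_batch(KafkaConsumer("test", bootstrap_servers=..., group_id=..., enable_auto_commit=False), print)`, where the topic name and connection details are placeholders.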
  66. Kafka shell scripts

  67. Create a Kafka Topic
    » Let's create a topic named "test" with a single partition and only one replica:
      ⋄ bin/kafka-topics.sh --create --zookeeper zhost:2181 --replication-factor 1 --partitions 1 --topic test
    » See that topic:
      ⋄ bin/kafka-topics.sh --list --zookeeper zhost:2181
    » bin/kafka-topics.sh: create, delete, describe, or change a topic.
  68. Python Kafka Client Benchmarking

  69. DEMO 1. http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/ 2. https://github.com/sucitw/benchmark-python-client-for-kafka

  70. http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/ Python Kafka Client Benchmarking
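A client comparison like the one in the demo needs a common way to time each producer. The following is a minimal timing harness, not the benchmark code from the linked repositories: `measure_throughput` is a hypothetical helper, and the example run uses a plain list as a stand-in sink so it works without a broker.

```python
import time


def measure_throughput(send, payload, count, finish=None):
    """Call send(payload) `count` times; return messages per second."""
    start = time.perf_counter()
    for _ in range(count):
        send(payload)
    if finish is not None:
        # For real producers pass their flush(), so that messages still
        # queued in the client are included in the measured time.
        finish()
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


sink = []
rate = measure_throughput(sink.append, b"x" * 100, count=10000)
```

To compare clients, `send` would wrap each library's produce call (for example `lambda m: p.send(topic, m)` for kafka-python) with `finish=p.flush`. Without the flush, asynchronous producers appear faster than they are because queued messages have not reached the broker.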

  71. Conclusion: pykafka, kafka-python or ?

  72. https://github.com/Parsely/pykafka/issues/559

  73. More about Kafka

  74. More about Kafka
    » Reliability and durability
      ⋄ Scaling, replication, guarantees, ZooKeeper
    » Compact log
    » Administration, configuration, operations, monitoring
    » Kafka Connect
    » Kafka Streams
    » Schema Registry
    » REST proxy
    » Apache Kafka vs. XXX
      ⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ, ZeroMQ, Redis, and ....
  75. The Other 2 APIs
    » Connect API
      ◦ JDBC, HDFS, S3, ….
    » Streams API
      ◦ map, filter, aggregate, join
  76. More references
    1. The Log: What every software engineer should know about real-time data's unifying abstraction, Jay Kreps, 2013
    2. Pykafka and Kafka-python? https://github.com/Parsely/pykafka/issues/559
    3. Why I am not a fan of Apache Kafka (2015-2016 Sep)
    4. Kafka vs RabbitMQ
      a. What are the differences between Apache Kafka and RabbitMQ?
      b. Understanding When to use RabbitMQ or Apache Kafka
    5. Kafka summit (2016~)
    6. Future features of Kafka (Kafka Improvement Proposals)
    7. Kafka: The Definitive Guide
  77. We’re hiring!! (104 link)