Slide 1

Connect K of SMACK: pykafka, kafka-python or ? Shuhsi Lin, 2017/06/09 at PyConTW 2017

Slide 2

About Me Shuhsi Lin (sucitw gmail.com) Data software engineer in EAD at Micron Currently working with data and people Lurking in PyHug, Taipei.py, and various meetups

Slide 3

K in SMACK

Slide 4

The SMACK stack ● Apache Spark: the processing engine ● Apache Mesos: the container ● Akka: the model ● Apache Cassandra: the storage ● Apache Kafka: the broker

References:
http://datastrophic.io/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka/
https://www.linkedin.com/pulse/smack-my-bdaas-why-2017-year-big-data-goes-tom-martin
http://www.natalinobusa.com/2015/11/why-is-smack-stack-all-rage-lately.html
https://dzone.com/articles/short-interview-with-smack-tech-stack-1
https://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka

Slide 5

Agenda » Pipeline to streaming » What is Apache Kafka ⋄ Overview ⋄ Architecture ⋄ Use cases » Kafka API ⋄ Python clients » Conclusion and More about Kafka

Slide 6

What we will not focus on » Reliability and durability ⋄ Scaling, replication, guarantees ⋄ ZooKeeper » Log compaction » Administration, configuration, operations » Kafka Connect » Kafka Streams » Apache Kafka vs XXX ⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ, ZeroMQ, Redis, and ....

Slide 7

What is Stream Processing

Slide 8

3 Paradigms for Programming 1. Request/response 2. Batch 3. Stream processing https://qconnewyork.com/ny2016/ny2016/presentation/large-scale-stream-processing-apache-kafka.html

Slide 9

Request/response

Slide 10

Batch

Slide 11

Stream Processing

Slide 12

What is stream processing? » Data comes as streams of events (orders, sales, clicks, or trades) » Databases are event streams, too ⋄ creating a backup or standby copy of a database means ⋄ publishing the stream of database changes

Slide 13

Data pipeline https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini What often happens in a complex data pipeline ● Complexity makes the data unreliable ● Reports become untrustworthy ● Derived indexes and stores are questionable ● Everyone spends a lot of time battling data quality issues of all kinds ● Data discrepancies

Slide 14

Data pipeline Data streaming

Slide 15

Apache Kafka 101

Slide 16

Where did the name “Kafka” come from? https://www.quora.com/What-is-the-relation-between-Kafka-the-writer-and-Apache-Kafka-the-distributed-messaging-system http://slideplayer.com/slide/4221536/ https://en.wikipedia.org/wiki/Franz_Kafka

Slide 17

What is Apache Kafka? Apache Kafka is a distributed system designed for streams. It is built to be fault-tolerant, high-throughput, horizontally scalable, and allows geographically distributing data streams and processing. https://kafka.apache.org

Slide 18

Why Apache Kafka

Slide 19

Fast Scalable Durable Distributed https://pixabay.com/photo-2135057/

Slide 20

Stream data platform (original mechanism) https://www.confluent.io/blog/stream-data-platform-1/ An integration mechanism between systems

Slide 21

Kafka as a service https://www.confluent.io/

Slide 22

What a streaming data platform can provide » “Data integration” (ETL) ⋄ Transports data between systems ⋄ Captures streams of events or data changes and feeds them to other data systems » “Stream processing” (messaging) ⋄ Continuous, real-time processing and transformation of these streams, making the results available system-wide (analytical data processing with very low latency) Various systems at LinkedIn: https://www.confluent.io/blog/stream-data-platform-1/

Slide 23

Kafka terminology » Producer » Consumer ⋄ Consumer group ⋄ offset » Broker » Topic » Partition » Message » Replica
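
To make the terminology concrete, here is a minimal sketch using kafka-python (an assumption: a broker on localhost:9092; the topic and group names are invented):

from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes messages to a topic on a broker
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('demo-topic', b'hello kafka')   # message -> topic
producer.flush()

# Consumer: joins a consumer group and tracks an offset per partition
consumer = KafkaConsumer('demo-topic',
                         bootstrap_servers='localhost:9092',
                         group_id='demo-group',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)  # stop when idle
for msg in consumer:
    print(msg.topic, msg.partition, msg.offset, msg.value)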

Slide 24

What Kafka does ● Publish & subscribe to streams of data, like a messaging system ● Process streams of data efficiently and in real time ● Store streams of data safely in a distributed, replicated cluster https://kafka.apache.org/

Slide 25

Publish/Subscribe P14 at https://www.slideshare.net/lucasjellema/amis-sig-introducing-a pache-kafka-scalable-reliable-event-bus-message-queue

Slide 26

P15 at https://www.slideshare.net/rahuldausa/real-time-analytics-with-apache-kafka-and-apache-spark (diagram: the “smart” consumer tracks its own position; in v0.8 offsets are updated in ZooKeeper, port 2181, while from v0.10 they are committed to the brokers, port 9092)

Slide 27

A modern stream-centric data architecture built around Apache Kafka https://www.confluent.io/blog/stream-data-platform-1/ 500 billion events per day

Slide 28

The key abstraction in Kafka is a structured commit log of updates. Producers append records to this log, and each data consumer has its own position in the log and advances independently. This allows a reliable, ordered stream of updates to be distributed to each consumer (parallel, ordered consumption is important to a change-capture system for database updates). The log can be sharded and spread over a cluster of machines, holding TBs of data, and each shard is replicated for fault tolerance. https://www.confluent.io/blog/stream-data-platform-1/

Slide 29

Topics and Partitions » Topics are split into partitions » Partitions are strongly ordered & immutable » Partitions can exist on different servers » Partitions enable scalability » Producers assign a message to a partition within the topic ⋄ either round robin (simply to balance load) ⋄ or according to the message key https://kafka.apache.org/documentation/#gettingStarted
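
A sketch of both partition-assignment strategies with kafka-python (the topic and key names are invented):

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')

# No key: the partitioner spreads messages across partitions to balance load
producer.send('orders', b'order-1')

# Same key -> same partition, so events for one key stay strictly ordered
producer.send('orders', key=b'customer-42', value=b'order-2')
producer.send('orders', key=b'customer-42', value=b'order-3')
producer.flush()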

Slide 30

Offsets » Messages are assigned an offset within their partition » Consumers track their position as (topic, partition, offset) https://kafka.apache.org/documentation/#gettingStarted A two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups
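
For example, with kafka-python the (topic, partition, offset) position can be inspected and moved explicitly (a sketch; the topic name is invented):

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')
tp = TopicPartition('demo-topic', 0)   # (topic, partition)
consumer.assign([tp])                  # take this partition explicitly

consumer.seek_to_beginning(tp)         # rewind to the oldest retained offset
print(consumer.position(tp))           # the consumer's current offset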

Slide 31

Consumers and Partitions » Consumers in a group divide the partitions of a topic among themselves » A partition is always sent to the same consumer instance within a group https://kafka.apache.org/documentation/#gettingStarted

Slide 32

Consumer ● Messages are available to consumers only when they have been committed ● Kafka does not push ○ unlike JMS ● Reading does not destroy messages ○ unlike JMS Topic ● (Some) history available ○ offline consumers can catch up ○ consumers can re-consume from the past ● Delivery guarantees ○ ordering maintained ○ at-least-once (per consumer) by default; at-most-once and exactly-once can be implemented P11 at https://www.slideshare.net/lucasjellema/amis-sig-introducing-apache-kafka-scalable-reliable-event-bus-message-queue
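
The default at-least-once guarantee becomes visible when offsets are committed manually after processing; a kafka-python sketch (handle() is a hypothetical processing function):

from kafka import KafkaConsumer

consumer = KafkaConsumer('demo-topic',
                         bootstrap_servers='localhost:9092',
                         group_id='demo-group',
                         enable_auto_commit=False)   # commit manually below
for msg in consumer:
    handle(msg.value)     # hypothetical processing step
    consumer.commit()     # crash before this line -> message is re-delivered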

Slide 33

ZooKeeper: the coordination interface between the Kafka broker and consumers https://hortonworks.com/hadoop-tutorial/realtime-event-processing-nifi-kafka-storm/#section_3 » Stores configuration data for distributed services » Used primarily by brokers » Used by consumers in 0.8 but not 0.9

Slide 34

Apache Kafka timeline

Slide 35

Apache Kafka timeline ● 2010: created at LinkedIn ● 2011-Nov: Apache Software Foundation incubator ● 2013-Nov: v0.8 (new producer, reassign-partitions) ● 2014: Confluent founded ● 2015-Nov: v0.9 (Kafka Connect, security, new consumer) ● 2016-May: v0.10 (Kafka Streams, rack awareness) ● Next version, v0.10.2: Single Message Transforms for Kafka Connect

Slide 36

TLS connection SSL is supported only for the new Kafka Producer and Consumer (Kafka versions 0.9.0 and higher) http://kafka.apache.org/documentation.html#security_ssl http://docs.confluent.io/current/kafka/ssl.html http://maximilianchrist.com/blog/connect-to-apache-kafka-from-python-using-ssl https://github.com/edenhill/librdkafka/wiki/Using-SSL-with-librdkafka
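
A hedged sketch of an SSL connection with kafka-python (the host, port, and file paths are placeholders; 9093 is just the conventional SSL listener port):

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='broker.example.com:9093',
    security_protocol='SSL',
    ssl_cafile='ca-cert.pem',        # CA that signed the broker certificate
    ssl_certfile='client-cert.pem',  # client cert, if client auth is enabled
    ssl_keyfile='client-key.pem')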

Slide 37

Apache Kafka can be considered as: » Stream data platform » Commit log service » Messaging system » Circular buffer

Slide 38

Cons of Apache Kafka » Consumer complexity (smart, but poor clients) » Lack of tooling/monitoring (third party only) » Still pre-1.0 release » Operationally, it's more manual than desired » Requires ZooKeeper (as of Sep 26, 2015) http://www.slideshare.net/jimplush/introduction-to-apache-kafka-53225326

Slide 39

Use Cases » Website Activity Tracking » Log Aggregation » Stream Processing » Event Sourcing » Commit logs » Metrics (Performance index streaming) ⋄ CPU/IO/Memory usage ⋄ Application Specific: ⋄ Time taken to load a web-page ⋄ Time taken to build a web-page ⋄ No. of requests ⋄ No. of hits on a particular page/url

Slide 40

Event-driven applications » how Kafka is first adopted and how its role evolves over time in an architecture https://aws.amazon.com/tw/kafka/

Slide 41

https://www.slideshare.net/ConfluentInc/iot-data-platforms-processing-iot-data-with-apache-kafka

Slide 42

Conceptual Reference Architecture for Real-Time Processing in HDP 2.2 https://hortonworks.com/blog/storm-kafka-together-real-time-data-refinery/ February 12, 2015

Slide 43

Event delivery system design at Spotify https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/

Slide 44

Case: Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spark Streaming http://helenaedelson.com/?p=1186 (2016/03)

Slide 45

2 + 2 Core APIs

Slide 46

Four Core APIs » Producer API » Consumer API » Connect API » Streams API » Legacy APIs $ cat < in.txt | grep "python" | tr a-z A-Z > out.txt https://www.slideshare.net/ConfluentInc/apache-kafkaa-distributed-streaming-platform
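
The shell analogy maps onto the APIs: Connect plays the role of cat and the output redirection (data in and out), while Streams plays grep and tr (transformation). The same loop, sketched by hand with the plain consumer and producer APIs in kafka-python (topic and group names are made up):

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer('in-topic', bootstrap_servers='localhost:9092',
                         group_id='pipeline')
producer = KafkaProducer(bootstrap_servers='localhost:9092')

for msg in consumer:                    # cat < in.txt
    line = msg.value.decode('utf-8')
    if 'python' in line:                # grep "python"
        producer.send('out-topic', line.upper().encode('utf-8'))  # tr a-z A-Z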

Slide 47

Kafka Clients » Java (officially maintained) » C/C++ (librdkafka) » Go (AKA golang) » Erlang » .NET » Clojure » Ruby » Node.js » Proxy (HTTP REST, etc.) » Perl » stdin/stdout » PHP » Rust » Alternative Java » Storm » Scala DSL https://cwiki.apache.org/confluence/display/KAFKA/Clients » Python ⋄ confluent-kafka-python ⋄ kafka-python ⋄ pykafka

Slide 48

Kafka clients survey https://www.confluent.io/blog/first-annual-state-apache-kafka-client-use-survey (February 14, 2017) Results from 187 responses on language adoption and on how users choose a Kafka client. Reliability matters most: ● stability should be a priority ● good error handling ● good testing ● good metrics and logging

Slide 49

Create your own Kafka broker https://github.com/Landoop/fast-data-dev

Slide 50

See your brokers and topics ● kafka-topics-ui ○ demo: http://kafka-topics-ui.landoop.com/#/ ● kafka-connect-ui ○ demo: http://kafka-connect-ui.landoop.com/ ● kafka-manager (Yahoo) ● Kafka Eagle ● kafka-offset-monitor ● Kafka Tool (GUI) ● Datadog: https://www.datadoghq.com/

Slide 51

Kafka Tool

Slide 52

Kafka UI (Landoop)

Slide 53

2 + 2 Core APIs and Python clients

Slide 54

Kafka API documentation https://kafka.apache.org/0102/javadoc/index.html?

Slide 55

Apache Kafka clients for Python » pykafka » kafka-python » confluent-kafka-python » librdkafka ⋄ the Apache Kafka C/C++ library that backs some of the Python clients

Slide 56

pykafka https://github.com/Parsely/pykafka http://pykafka.readthedocs.io/en/latest/ » Similar level of abstraction to the JVM Kafka client » Optional librdkafka-backed producer and consumer for speed https://blog.parse.ly/post/3886/pykafka-now/ (2016, June)

Slide 57

kafka-python https://github.com/dpkp/kafka-python/ http://kafka-python.readthedocs.io/ kafka-python is designed to function much like the official Java client, with a sprinkling of Pythonic interfaces. API ● Producer ● Consumer ● Message ● TopicPartition ● KafkaError ● KafkaException

Slide 58

confluent-kafka-python Confluent's Python client for Apache Kafka and the Confluent Platform. Features: ● High performance ⋄ built on librdkafka ● Reliability ● Supported ● Future proof https://github.com/confluentinc/confluent-kafka-python http://docs.confluent.io/current/clients/confluent-kafka-python/index.html?

Slide 59

Producer API (Java) Writing data to Kafka: a client that publishes records (messages) to the Kafka cluster. Class KafkaProducer (sync and async) ○ close() ○ flush() ○ metrics() ○ partitionsFor(topic) ○ send(ProducerRecord record) Class ProducerRecord: a key/value pair to be sent to Kafka ● ProducerRecord(topic, V value) ● ProducerRecord(topic, Integer partition, K key, V value) Configuration settings (configuration is externalized in a property file) ● client.id ● producer.type ● acks ● retries ● bootstrap.servers ● linger.ms ● key.serializer ● value.serializer ● batch.size ● buffer.memory https://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html https://www.tutorialspoint.com/apache_kafka/apache_kafka_simple_producer_example.htm
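
Most of these settings map directly onto kafka-python keyword arguments; a hedged sketch (values are illustrative, not recommendations; producer.type has no direct equivalent because the new clients always send asynchronously):

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    client_id='demo-producer',
    acks='all',            # wait for all in-sync replicas
    retries=3,
    linger_ms=5,           # wait up to 5 ms to batch messages
    batch_size=16384,
    buffer_memory=33554432,
    key_serializer=lambda k: k.encode('utf-8'),
    value_serializer=lambda v: v.encode('utf-8'))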

Slide 60

Producer API - pykafka http://pykafka.readthedocs.io/en/latest/api/producer.html

from pykafka import KafkaClient
from settings import ….
client = KafkaClient(hosts=bootstrap_servers)
topic = client.topics[topic.encode('UTF-8')]
producer = topic.get_producer(use_rdkafka=use_rdkafka)
producer.produce(msg_payload)
producer.stop()  # will flush the background queue

Class pykafka.producer.Producer ● produce(msg, partition_key=None) ● stop() Class pykafka.topic.Topic(cluster, topic_metadata) ● get_producer(use_rdkafka=False, **kwargs)
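
For symmetry, the consumer side in pykafka might look like this (a sketch; the topic and consumer-group names are assumptions):

from pykafka import KafkaClient

client = KafkaClient(hosts='localhost:9092')
topic = client.topics[b'demo-topic']

consumer = topic.get_simple_consumer(consumer_group=b'demo-group')
for message in consumer:
    if message is not None:
        print(message.offset, message.value)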

Slide 61

Performance assessment https://blog.parse.ly/post/3886/pykafka-now/

Slide 62

Producer API - kafka-python http://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html

from kafka import KafkaProducer
from settings import BOOTSTRAP_SERVERS, TOPICS, MSG
p = KafkaProducer(bootstrap_servers=BOOTSTRAP_SERVERS)
p.send(TOPICS, MSG.encode('utf-8'))
p.flush()

Class kafka.KafkaProducer(**configs) ● close(timeout=None) ● flush(timeout=None) ● partitions_for(topic) ● send(topic, value=None, key=None, partition=None, timestamp_ms=None) ⋄ value must be type bytes, or be serializable to bytes via the configured value_serializer https://kafka-python.readthedocs.io/en/master/_modules/kafka/producer/kafka.html#KafkaProducer
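
And a matching consumer sketch with kafka-python (reusing the hypothetical settings module from the producer example):

from kafka import KafkaConsumer
from settings import BOOTSTRAP_SERVERS, TOPICS

consumer = KafkaConsumer(TOPICS,
                         bootstrap_servers=BOOTSTRAP_SERVERS,
                         group_id='demo-group',
                         auto_offset_reset='earliest')
for msg in consumer:
    print(msg.topic, msg.partition, msg.offset, msg.value)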

Slide 63

Producer API - confluent-kafka-python http://docs.confluent.io/current/clients/confluent-kafka-python/#producer

from confluent_kafka import Producer
from settings import BOOTSTRAP_SERVERS, TOPICS, MSG
p = Producer({'bootstrap.servers': BOOTSTRAP_SERVERS})
p.produce(TOPICS, MSG.encode('utf-8'))
p.flush()

Class confluent_kafka.Producer(*kwargs) ● len() ● flush([timeout]) ● poll([timeout]) ● produce(topic[, value][, key][, partition][, on_delivery][, timestamp])
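
A matching consumer sketch with confluent-kafka-python (note: on older client versions auto.offset.reset must be nested under 'default.topic.config'):

from confluent_kafka import Consumer
from settings import BOOTSTRAP_SERVERS, TOPICS

c = Consumer({'bootstrap.servers': BOOTSTRAP_SERVERS,
              'group.id': 'demo-group',
              'auto.offset.reset': 'earliest'})
c.subscribe([TOPICS])
while True:
    msg = c.poll(1.0)          # None if nothing arrived within the timeout
    if msg is None:
        continue
    if msg.error():
        print(msg.error())
        continue
    print(msg.topic(), msg.partition(), msg.offset(), msg.value())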

Slide 64

Consumer configuration ● Consumer group ○ group.id ○ session.timeout.ms ○ max.poll.records ○ heartbeat.interval.ms ● Offset management ○ enable.auto.commit ○ auto.commit.interval.ms ○ auto.offset.reset https://kafka.apache.org/documentation.html#newconsumerconfigs

Slide 65

Consumer API (Java) Reading data from Kafka: a client that consumes records from a Kafka cluster. Class KafkaConsumer ● assign(partitions) ● assignment() ● beginningOffsets(partitions) ● close(long timeout, TimeUnit timeUnit) ● commitAsync(Map offsets, OffsetCommitCallback callback) ● commitSync(Map offsets) ● committed(TopicPartition partition) ● endOffsets(partitions) ● listTopics() ● metrics() ● offsetsForTimes(Map timestampsToSearch) ● partitionsFor(topic) ● pause(partitions) ● poll(long timeout) ● position(TopicPartition partition) ● resume(partitions) ● seek(TopicPartition partition, long offset) ● seekToBeginning(partitions) ● seekToEnd(partitions) ● subscribe(topics, ConsumerRebalanceListener listener) ● subscribe(Pattern pattern, ConsumerRebalanceListener listener) ● subscription() ● unsubscribe() ● wakeup() https://kafka.apache.org/0102/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html

Slide 66

Kafka shell scripts

Slide 67

Create a Kafka topic bin/kafka-topics.sh: create, delete, describe, or change a topic. » Let's create a topic named "test" with a single partition and only one replica: ⋄ bin/kafka-topics.sh --create --zookeeper zhost:2181 --replication-factor 1 --partitions 1 --topic test » See that topic: ⋄ bin/kafka-topics.sh --list --zookeeper zhost:2181
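
The same topic can be created from Python as well; a sketch assuming a recent kafka-python that ships KafkaAdminClient (it talks to the broker, not ZooKeeper):

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
admin.create_topics([NewTopic(name='test',
                              num_partitions=1,
                              replication_factor=1)])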

Slide 68

Python Kafka Client Benchmarking

Slide 69

DEMO 1. http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/ 2. https://github.com/sucitw/benchmark-python-client-for-kafka

Slide 70

http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/ Python Kafka Client Benchmarking
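
For flavor, a minimal throughput measurement with kafka-python might look like this (a sketch; absolute numbers depend heavily on message size, acks, batching, and client version):

import time
from kafka import KafkaProducer

N, MSG = 100000, b'x' * 100
producer = KafkaProducer(bootstrap_servers='localhost:9092')

start = time.time()
for _ in range(N):
    producer.send('bench', MSG)
producer.flush()                        # wait until everything is delivered
elapsed = time.time() - start
print('%d msgs in %.2fs -> %.0f msgs/s' % (N, elapsed, N / elapsed))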

Slide 71

Conclusion: pykafka, kafka-python or ?

Slide 72

https://github.com/Parsely/pykafka/issues/559

Slide 73

More about Kafka

Slide 74

More about Kafka » Reliability and durability ⋄ Scaling, replication, guarantees, ZooKeeper » Log compaction » Administration, configuration, operations, monitoring » Kafka Connect » Kafka Streams » Schema Registry » REST Proxy » Apache Kafka vs XXX ⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ, ZeroMQ, Redis, and ....

Slide 75

The Other 2 APIs » Connect API ○ JDBC, HDFS, S3, …. » Streams API ○ map, filter, aggregate, join

Slide 76

More references 1. The Log: What every software engineer should know about real-time data's unifying abstraction, Jay Kreps, 2013 2. pykafka or kafka-python? https://github.com/Parsely/pykafka/issues/559 3. Why I am not a fan of Apache Kafka (2015-2016 Sep) 4. Kafka vs RabbitMQ a. What are the differences between Apache Kafka and RabbitMQ? b. Understanding when to use RabbitMQ or Apache Kafka 5. Kafka Summit (2016~) 6. Future features of Kafka (Kafka Improvement Proposals) 7. Kafka: The Definitive Guide

Slide 77

We’re hiring!! (104 link)