Connect "K" of SMACK：pykafka, kafka-python or ?

Shuhsi Lin 2017/06/09 at PyconTw 2017 Connect K of SMACK:
pykafka, kafka-python or ?

About Me
Shuhsi Lin
» Data Software Engineer of EAD at the manufacturer Micron
» Currently working with data and people
» Lurking in PyHug, Taipei.py and various Meetups
sucitw gmail.com

K in SMACK

• Apache Spark: The Processing Engine.
• Apache Mesos: The Container.
• Akka: The Model.
• Apache Cassandra: The Storage.
• Apache Kafka: The Broker.
http://datastrophic.io/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka/
https://www.linkedin.com/pulse/smack-my-bdaas-why-2017-year-big-data-goes-tom-martin
http://www.natalinobusa.com/2015/11/why-is-smack-stack-all-rage-lately.html
https://dzone.com/articles/short-interview-with-smack-tech-stack-1
https://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka

Agenda
» Pipeline to streaming
» What is Apache Kafka
  ⋄ Overview
  ⋄ Architecture
  ⋄ Use cases
» Kafka API
  ⋄ Python clients
» Conclusion and More about Kafka

What we will not focus on
» Reliability and durability
  ⋄ Scaling, replication, guarantee
  ⋄ Zookeeper
» Compact log
» Administration, Configuration, Operations
» Kafka Connect
» Kafka Streams
» Apache Kafka vs XXX
  ⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ, ZeroMQ, Redis, and ....

What is Stream Processing

3 Paradigms for Programming
1. Request/response
2. Batch
3. Stream processing
https://qconnewyork.com/ny2016/ny2016/presentation/large-scale-stream-processing-apache-kafka.html

Request/response

Stream Processing

What is stream processing
» Data arises as a stream of events (orders, sales, clicks or trades)
» Databases are event streams
  ⋄ the process of creating a backup or standby copy of a database
  ⋄ publishing the database changes

Data pipeline
https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini
What often happens in a complex data pipeline
• Complexity meant that the data was always unreliable
• Reports were untrustworthy
• Derived indexes and stores were questionable
• Everyone spent a lot of time battling data quality issues of all kinds
• Data discrepancy

Data pipeline Data streaming

Apache Kafka 101

The name, “Kafka”, came from? https://www.quora.com/What-is-the-relation-between-Kafka-the-writer-and-Apache-Kafka-the-distributed-messaging-system http://slideplayer.com/slide/4221536/ https://en.wikipedia.org/wiki/Franz_Kafka

What is Apache Kafka? Apache Kafka is a distributed system
designed for streams. It is built to be fault-tolerant, high-throughput, horizontally scalable, and allows geographically distributing data streams and processing. https://kafka.apache.org

Why Apache Kafka

Fast Scalable Durable Distributed https://pixabay.com/photo-2135057/

Stream data platform (Original mechanism) https://www.confluent.io/blog/stream-data-platform-1/ Integration mechanism between systems

Kafka as a service https://www.confluent.io/

What a streaming data platform can provide
» “Data integration” (ETL)
  ⋄ How to transport data between systems
  ⋄ Captures streams of events or data changes and feeds these to other data systems
» “Stream processing” (messaging)
  ⋄ Continuous, real-time processing and transformation of these streams, making the results available system-wide
  ⋄ Analytical data processing with very low latency
[Figure: various systems in LinkedIn]
https://www.confluent.io/blog/stream-data-platform-1/

Kafka terminology » Producer » Consumer ⋄ Consumer group ⋄
offset » Broker » Topic » Partition » Message » Replica

What Kafka Does
• Publish & subscribe: to streams of data like a messaging system
• Process: streams of data efficiently and in real time
• Store: streams of data safely in a distributed replicated cluster
https://kafka.apache.org/

Publish/Subscribe P14 at https://www.slideshare.net/lucasjellema/amis-sig-introducing-apache-kafka-scalable-reliable-event-bus-message-queue

P15 at https://www.slideshare.net/rahuldausa/real-time-analytics-with-apache-kafka-and-apache-spark
[Figure labels: v0.8 update offset; v0.10 update offset; smart consumer; ports 2181 and 9092]

A modern stream-centric data architecture built around Apache Kafka https://www.confluent.io/blog/stream-data-platform-1/
500 billion events per day

The key abstraction in Kafka is a structured commit log of updates
https://www.confluent.io/blog/stream-data-platform-1/
» Producers append records to this log.
» Each of these data consumers has its own position in the log and advances independently. This allows a reliable, ordered stream of updates to be distributed to each consumer.
» The log can be sharded and spread over a cluster of machines, and each shard is replicated for fault-tolerance.
» Parallel, ordered consumption (important to a change capture system for database updates)
» TBs of data
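The commit-log idea above can be sketched in a few lines of plain Python. This is only an in-memory illustration, not how Kafka stores data: the class names CommitLog and LogReader are hypothetical, and sharding and replication are left out. It shows producers appending to one ordered log while each reader keeps and advances its own position.

```python
class CommitLog:
    """Append-only log; a record's offset is its index in the list."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset."""
        return self._records[offset:offset + max_records]


class LogReader:
    """A consumer that only remembers its own position in the log."""

    def __init__(self, log):
        self._log = log
        self.position = 0

    def poll(self, max_records=10):
        """Read the next batch and advance this reader's position."""
        batch = self._log.read(self.position, max_records)
        self.position += len(batch)
        return batch


log = CommitLog()
for event in ["order-1", "click-7", "order-2"]:
    log.append(event)

# Two independent readers over the same log.
fast, slow = LogReader(log), LogReader(log)
fast_batch = fast.poll()               # reads everything available
slow_batch = slow.poll(max_records=1)  # reads one record and stops
```

Reading never removes anything from the log, so the slow reader can catch up later from its own position.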

Topics and Partitions
» Topics are split into partitions
» Partitions are strongly ordered & immutable
» Partitions can exist on different servers
» Partitions enable scalability
» Producers assign a message to a partition within the topic
  ⋄ Either round robin (simply to balance load)
  ⋄ or according to the keys
https://kafka.apache.org/documentation/#gettingStarted
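The two assignment strategies for producers (round robin, or by key) can be illustrated with a small sketch. The function make_partitioner is hypothetical, and zlib.crc32 is used here only as a stand-in hash; real clients use their own hash functions, so the partition numbers will differ from what a Kafka client would pick.

```python
import itertools
import zlib


def make_partitioner(num_partitions):
    """Return a function that picks a partition for each message."""
    round_robin = itertools.cycle(range(num_partitions))

    def choose_partition(key=None):
        if key is None:
            # No key: spread messages evenly to balance load.
            return next(round_robin)
        # Keyed: the same key always maps to the same partition,
        # which keeps all messages for that key in order.
        return zlib.crc32(key) % num_partitions

    return choose_partition


choose = make_partitioner(4)
unkeyed = [choose() for _ in range(5)]
keyed_a = choose(b"user-42")
keyed_b = choose(b"user-42")
```

Keyed assignment is what makes per-key ordering possible, since ordering is only guaranteed within a partition.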

Offsets
» Messages are assigned an offset in the partition
» Consumers track their position with (offset, partition, topic)
https://kafka.apache.org/documentation/#gettingStarted
[Figure: a two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups]

Consumers and Partitions
» A consumer group consumes a topic as one logical subscriber
» Within a group, a partition is always sent to the same consumer instance
https://kafka.apache.org/documentation/#gettingStarted
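How partitions map to the members of a consumer group can be sketched as below. This is a simplified round-robin assignment written for illustration; the function assign_partitions and the consumer names are hypothetical, and the assignment strategies used by real clients may distribute partitions differently.

```python
def assign_partitions(partitions, consumers):
    """Give every partition to exactly one consumer in the group."""
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered[index % len(ordered)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Two groups read the same four partitions independently.
group_a = assign_partitions(partitions, ["a1", "a2"])
group_b = assign_partitions(partitions, ["b1", "b2", "b3", "b4"])
```

Each group sees every partition exactly once, so adding consumers to a group (up to the number of partitions) spreads the load without duplicating messages inside that group.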

Consumer
• Messages are available to consumers only when they have been committed
• Kafka does not push
  ◦ Unlike JMS
• Reading by consumers does not destroy messages
  ◦ Unlike JMS Topic
• (some) History available
  ◦ Offline consumers can catch up
  ◦ Consumers can re-consume from the past
• Delivery Guarantees
  ◦ Ordering maintained
  ◦ At-least-once (per consumer) by default; at-most-once and exactly-once can be implemented
P11 at https://www.slideshare.net/lucasjellema/amis-sig-introducing-apache-kafka-scalable-reliable-event-bus-message-queue
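The difference between at-least-once and at-most-once comes down to whether the consumer commits its offset after or before processing a record. The simulation below is a hypothetical sketch in plain Python (no Kafka client involved): the function consume and the crash_at parameter are made up to model a consumer that crashes once and then restarts from its last committed offset.

```python
def consume(records, committed, commit_first, crash_at=None):
    """Process records from the committed offset onward.

    Returns (processed_records, new_committed_offset). If crash_at is
    set, the consumer stops at that offset, between its two steps.
    """
    processed = []
    position = committed
    while position < len(records):
        if commit_first:
            # At-most-once: commit, then process.
            committed = position + 1
            if position == crash_at:
                return processed, committed  # record is skipped forever
            processed.append(records[position])
        else:
            # At-least-once: process, then commit.
            processed.append(records[position])
            if position == crash_at:
                return processed, committed  # record will be redelivered
            committed = position + 1
        position += 1
    return processed, committed


def run_with_restart(records, commit_first):
    """Crash while handling offset 1, then restart from the committed offset."""
    first, committed = consume(records, 0, commit_first, crash_at=1)
    second, _ = consume(records, committed, commit_first)
    return first + second


records = ["a", "b", "c"]
at_least_once = run_with_restart(records, commit_first=False)
at_most_once = run_with_restart(records, commit_first=True)
```

With process-then-commit the record at the crash point is seen twice; with commit-then-process it is lost. Ordering is preserved in both cases.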

ZooKeeper: the coordination interface between the Kafka broker and consumers
https://hortonworks.com/hadoop-tutorial/realtime-event-processing-nifi-kafka-storm/#section_3
» Stores configuration data for distributed services
» Used primarily by brokers
» Used by consumers in 0.8 but not 0.9

Apache Kafka timeline

Apache Kafka timeline
» 2010: Creation in LinkedIn
» 2011-Nov: Apache Software Foundation incubator
» 2013-Nov: v0.8 (New Producer, Reassign-partitions)
» 2014: Confluent
» 2015-Nov: v0.9 (Kafka Connect, Security, New Consumer)
» 2016-May: v0.10 (Kafka Streams, rack awareness)
» v0.10.2: Single Message Transforms for Kafka Connect
» Next version

TLS connection
SSL is supported only for the new Kafka Producer and Consumer (Kafka versions 0.9.0 and higher)
http://kafka.apache.org/documentation.html#security_ssl
http://docs.confluent.io/current/kafka/ssl.html
http://maximilianchrist.com/blog/connect-to-apache-kafka-from-python-using-ssl
https://github.com/edenhill/librdkafka/wiki/Using-SSL-with-librdkafka

Apache Kafka is considered as:
» Stream data platform
» Commit log service
» Messaging system
» Circular buffer

Cons of Apache Kafka
» Consumer complexity (smart, but poor client)
» Lack of tooling/monitoring (3rd party)
» Still pre 1.0 release
» Operationally, it’s more manual than desired
» Requires ZooKeeper
Sep 26, 2015 http://www.slideshare.net/jimplush/introduction-to-apache-kafka-53225326

Use Cases
» Website Activity Tracking
» Log Aggregation
» Stream Processing
» Event Sourcing
» Commit logs
» Metrics (Performance index streaming)
  ⋄ CPU/IO/Memory usage
  ⋄ Application specific:
    ◦ Time taken to load a web-page
    ◦ Time taken to build a web-page
    ◦ No. of requests
    ◦ No. of hits on a particular page/url

Event-driven Applications » how it first is adopted and how
its role evolves over time in their architecture. https://aws.amazon.com/tw/kafka/

https://www.slideshare.net/ConfluentInc/iot-data-platforms-processing-iot-data-with-apache-kafka

Conceptual Reference Architecture for Real-Time Processing in HDP 2.2 https://hortonworks.com/blog/storm-kafka-together-real-time-data-refinery/
February 12, 2015

Event delivery system design in Spotify 43 https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/

Case: Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with
FiloDB and Spark Streaming http://helenaedelson.com/?p=1186 (2016/03)

2 + 2 Core APIs

Four Core APIs
» Producer API
» Consumer API
» Connect API
» Streams API
» Legacy APIs
$ cat < in.txt | grep "python" | tr a-z A-Z > out.txt
https://www.slideshare.net/ConfluentInc/apache-kafkaa-distributed-streaming-platform
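The shell pipeline on this slide is an analogy: data flows in from a source, is filtered and transformed as a stream, and flows out to a sink. One way to read it is that the input and output ends play the role of the Producer/Connect side and the Consumer/Connect side, while the middle stages resemble the Streams API. The generator pipeline below is a plain-Python sketch of that reading; all function names are hypothetical and no Kafka API is involved.

```python
def source(lines):
    """Like `cat < in.txt`: emit records one at a time."""
    yield from lines


def keep_matching(stream, word):
    """Like `grep "python"`: pass through only matching records."""
    return (line for line in stream if word in line)


def to_upper(stream):
    """Like `tr a-z A-Z`: transform each record."""
    return (line.upper() for line in stream)


def sink(stream):
    """Like `> out.txt`: collect the results."""
    return list(stream)


lines = ["i like python", "kafka is a log", "python clients for kafka"]
out = sink(to_upper(keep_matching(source(lines), "python")))
```

Each stage handles one record at a time and never needs the whole input, which is the property that lets the same shape of program run over an unbounded stream.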

Kafka Clients
» Java (officially maintained)
» C/C++ (librdkafka)
» Python
  ⋄ confluent-kafka-python
  ⋄ kafka-python
  ⋄ pykafka
» Go (AKA golang)
» Erlang
» .NET
» Clojure
» Ruby
» Node.js
» Proxy (HTTP REST, etc)
» Perl
» stdin/stdout
» PHP
» Rust
» Alternative Java
» Storm
» Scala DSL
» Clojure
https://cwiki.apache.org/confluence/display/KAFKA/Clients

Kafka Clients survey
https://www.confluent.io/blog/first-annual-state-apache-kafka-client-use-survey (February 14, 2017)
Results from 187 responses
» How users choose a Kafka client
» Kafka Client: Language Adoption
Reliability:
• Stability should be priority
• Good error handling
• Good testing
• Good metrics and logging
[Figure label: 3rd]

Create your own Kafka broker https://github.com/Landoop/fast-data-dev

See your brokers and topics
• Kafka-topics-ui
  ◦ Demo http://kafka-topics-ui.landoop.com/#/
• Kafka-connect-ui
  ◦ Demo http://kafka-connect-ui.landoop.com/
• Kafka-manager (yahoo)
• Kafka Eagle
• kafka-offset-monitor
• Kafka Tool (GUI)
https://www.datadoghq.com/

Kafka Tool

Kafka UI(landoop)

2 + 2 Core APIs And python clients

Kafka API Documents https://kafka.apache.org/0102/javadoc/index.html?

Apache Kafka client for Python » Pykafka » kafka-python »
Confluent-kafka-python » Librdkafka ⋄ The Apache Kafka C/C++ library

Pykafka
https://github.com/Parsely/pykafka
http://pykafka.readthedocs.io/en/latest/
» Similar level of abstraction to the JVM Kafka client
» Python implementation, optionally backed by a C extension built on librdkafka
https://blog.parse.ly/post/3886/pykafka-now/ (2016, June)

kafka-python https://github.com/dpkp/kafka-python/ http://kafka-python.readthedocs.io/ API • Producer • Consumer • Message
• TopicPartition • KafkaError • KafkaException • kafka-python is designed to function much like the official java client, with a sprinkling of pythonic interfaces.

Confluent-kafka-python Confluent's Python client for Apache Kafka and the Confluent
Platform. Features: • High performance ⋄ librdkafka • Reliability • Supported • Future proof https://github.com/confluentinc/confluent-kafka-python http://docs.confluent.io/current/clients/confluent-kafka-python/index.html?

Producer API (JAVA)
https://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
https://www.tutorialspoint.com/apache_kafka/apache_kafka_simple_producer_example.htm
Writing data to Kafka
Class KafkaProducer<K,V>: A client that publishes records to the Kafka cluster.
• KafkaProducer – Sync and Async
  ◦ close()
  ◦ flush()
  ◦ metrics()
  ◦ partitionsFor(topic)
  ◦ send(ProducerRecord<K,V> record)
Class ProducerRecord<K,V>: A key/value pair to be sent to Kafka.
• ProducerRecord(topic, V value)
• ProducerRecord(topic, Integer partition, K key, V value)
Configuration Settings (configuration is externalized in a property file)
• client.id
• producer.type
• acks
• retries
• bootstrap.servers
• linger.ms
• key.serializer
• value.serializer
• batch.size
• buffer.memory

Producer API - pykafka
    from pykafka import KafkaClient
    from settings import ….
    client = KafkaClient(hosts=bootstrap_servers)
    topic = client.topics[topic.encode('UTF-8')]
    producer = topic.get_producer(use_rdkafka=use_rdkafka)
    producer.produce(msg_payload)
    producer.stop()  # Will flush background queue
Class pykafka.producer.Producer()
• produce(msg, partition_key=None)
• stop()
Class pykafka.topic.Topic(cluster, topic_metadata)
• get_producer(use_rdkafka=False, **kwargs)
http://pykafka.readthedocs.io/en/latest/api/producer.html

Performance assessment https://blog.parse.ly/post/3886/pykafka-now/

Producer API - kafka-python
    from kafka import KafkaConsumer, KafkaProducer
    from settings import BOOTSTRAP_SERVERS, TOPICS, MSG
    p = KafkaProducer(bootstrap_servers=BOOTSTRAP_SERVERS)
    # The value must be type bytes, or be serializable to bytes
    # via the configured value_serializer.
    p.send(TOPICS, MSG.encode('utf-8'))
    p.flush()
Class kafka.KafkaProducer(**configs)
https://kafka-python.readthedocs.io/en/master/_modules/kafka/producer/kafka.html#KafkaProducer
• close(timeout=None)
• flush(timeout=None)
• partitions_for(topic)
• send(topic, value=None, key=None, partition=None, timestamp_ms=None)
http://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html
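The note that a value "must be type bytes, or be serializable to bytes via configured value_serializer" can be made concrete with a JSON serializer. This is a minimal sketch assuming kafka-python is installed and a broker is reachable; the helper names json_serializer and make_json_producer, the broker address, and the sample payload are assumptions for illustration. The kafka import sits inside the function so the serializer can be tried without a broker.

```python
import json


def json_serializer(value):
    """Turn a JSON-compatible Python object into UTF-8 bytes."""
    return json.dumps(value).encode("utf-8")


def make_json_producer(bootstrap_servers):
    """Build a kafka-python producer that serializes values as JSON."""
    # Imported here so the serializer above works without kafka-python.
    from kafka import KafkaProducer

    return KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=json_serializer,
    )


# The serializer can be checked on its own, without a broker.
payload = json_serializer({"event": "click", "page": "/home"})

# With a running broker (address is an assumption):
#   producer = make_json_producer("localhost:9092")
#   producer.send("test", {"event": "click", "page": "/home"})
#   producer.flush()
```

With a serializer configured, send() accepts the Python object directly and the explicit .encode('utf-8') call is no longer needed.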

Producer API - confluent-kafka-python
    from confluent_kafka import Producer
    from settings import BOOTSTRAP_SERVERS, TOPICS, MSG
    p = Producer({'bootstrap.servers': BOOTSTRAP_SERVERS})
    p.produce(TOPICS, MSG.encode('utf-8'))
    p.flush()
http://docs.confluent.io/current/clients/confluent-kafka-python/#producer
Class confluent_kafka.Producer(*kwargs)
• len()
• flush([timeout])
• poll([timeout])
• produce(topic[, value][, key][, partition][, on_delivery][, timestamp])

Consumer
• Consumer group
  ◦ group.id
  ◦ session.timeout.ms
  ◦ max.poll.records
  ◦ heartbeat.interval.ms
• Offset Management
  ◦ enable.auto.commit
  ◦ auto.commit.interval.ms
  ◦ auto.offset.reset
https://kafka.apache.org/documentation.html#newconsumerconfigs
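In kafka-python these consumer settings are passed as keyword arguments with underscores instead of dots. The sketch below assumes kafka-python is installed and a broker is reachable; the helper names, the group name, and every numeric value are illustrative assumptions, not recommendations. Auto commit is switched off so that the offset is committed only after a record has been handled.

```python
def consumer_settings(group_id):
    """Map the consumer configs to kafka-python keyword arguments."""
    return {
        "group_id": group_id,            # group.id
        "session_timeout_ms": 10000,     # session.timeout.ms
        "heartbeat_interval_ms": 3000,   # heartbeat.interval.ms
        "max_poll_records": 500,         # max.poll.records
        "enable_auto_commit": False,     # enable.auto.commit
        "auto_offset_reset": "earliest", # auto.offset.reset
    }


def consume_forever(topic, bootstrap_servers, group_id):
    """Print every record of a topic and commit after each one."""
    # Imported here so the settings above can be inspected
    # without kafka-python installed.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        **consumer_settings(group_id),
    )
    try:
        for record in consumer:
            print(record.topic, record.partition, record.offset, record.value)
            consumer.commit()  # commit only after the record was handled
    finally:
        consumer.close()


settings = consumer_settings("demo-group")

# With a running broker (address and topic are assumptions):
#   consume_forever("test", "localhost:9092", "demo-group")
```

Committing after every record is simple but slow; committing per batch is a common compromise.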

Consumer API (JAVA)
https://kafka.apache.org/0102/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
Reading data from Kafka
Class KafkaConsumer<K,V>: A client that consumes records from a Kafka cluster.
• assign(<TopicPartition> partitions)
• assignment()
• beginningOffsets(<TopicPartition> partitions)
• close(long timeout, TimeUnit timeUnit)
• commitAsync(Map<TopicPartition,OffsetAndMetadata> offsets, OffsetCommitCallback callback)
• commitSync(Map<TopicPartition,OffsetAndMetadata> offsets)
• committed(TopicPartition partition)
• endOffsets(<TopicPartition> partitions)
• listTopics()
• metrics()
• offsetsForTimes(Map<TopicPartition,Long> timestampsToSearch)
• partitionsFor(topic)
• pause(<TopicPartition> partitions)
• poll(long timeout)
• position(TopicPartition partition)
• resume(<TopicPartition> partitions)
• seek(TopicPartition partition, long offset)
• seekToBeginning(<TopicPartition> partitions)
• seekToEnd(<TopicPartition> partitions)
• subscribe(topics, ConsumerRebalanceListener listener)
• subscribe(Pattern pattern, ConsumerRebalanceListener listener)
• subscription()
• unsubscribe()
• wakeup()

Kafka shell scripts

Create a Kafka Topic
» Let's create a topic named "test" with a single partition and only one replica:
  ⋄ bin/kafka-topics.sh --create --zookeeper zhost:2181 --replication-factor 1 --partitions 1 --topic test
» See that topic
  ⋄ bin/kafka-topics.sh --list --zookeeper zhost:2181
bin/kafka-topics.sh
» Create, delete, describe, or change a topic.

Python Kafka Client Benchmarking

DEMO 1. http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/ 2. https://github.com/sucitw/benchmark-python-client-for-kafka

http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/ Python Kafka Client Benchmarking

Conclusion: pykafka, kafka-python or ?

https://github.com/Parsely/pykafka/issues/559

More about Kafka

More about Kafka » Reliability and durability ⋄ Scaling, replication,
guarantee, Zookeeper » Compact log » Administration, Configuration, Operations, Monitoring » Kafka connect » Kafka Stream » Schema Registry » Rest proxy » Apache Kafka vs XXX ⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ, ZeroMQ, Redis, and ....

The Other 2 APIs
» Connect API
  ◦ JDBC, HDFS, S3, ….
» Streams API
  ◦ map, filter, aggregate, join

More references
1. The Log: What every software engineer should know about real-time data's unifying abstraction, Jay Kreps, 2013
2. Pykafka and Kafka-python? https://github.com/Parsely/pykafka/issues/559
3. Why I am not a fan of Apache Kafka (2015-2016 Sep)
4. Kafka vs RabbitMQ
   a. What are the differences between Apache Kafka and RabbitMQ?
   b. Understanding When to use RabbitMQ or Apache Kafka
5. Kafka summit (2016~)
6. Future features of Kafka (Kafka Improvement Proposals)
7. Kafka: The Definitive Guide

We’re hiring!! (104 link)
