Connect "K" of SMACK:pykafka, kafka-python or ?

suci
June 09, 2017

Connect "K" of SMACK:pykafka, kafka-python or ?

Apache Kafka is a distributed streaming platform for building real-time data pipelines and streaming apps. You can also treat Kafka as a commit log service that functions much like a publish/subscribe messaging system, but with better throughput, built-in partitioning, replication, and fault tolerance; it runs in production at thousands of companies. Recently, Kafka has been widely adopted as a component of the SMACK stack because of its role connecting Apache Hadoop, Apache Storm, and Spark Streaming in the data pipeline.

In this talk, I will start by introducing data stream processing and the general concepts of Kafka's architecture and components through several use cases. Then, Kafka's API will be introduced via its Python clients, with a demo. Finally, the benchmarks, comparisons and limitations of the different Python clients will be discussed.


Transcript

  1. Shuhsi Lin
    2017/06/09 at PyconTw 2017
    Connect K of SMACK:
    pykafka, kafka-python or ?

  2. About Me
    Shuhsi Lin
    Data software engineer in EAD at the manufacturer Micron
    Currently working with
    - data and people
    - Lurking in PyHug, Taipei.py and various meetups
    sucitw@gmail.com

  3. K in
    SMACK

  4. http://datastrophic.io/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka/
    https://www.linkedin.com/pulse/smack-my-bdaas-why-2017-year-big-data-goes-tom-martin
    http://www.natalinobusa.com/2015/11/why-is-smack-stack-all-rage-lately.html
    https://dzone.com/articles/short-interview-with-smack-tech-stack-1
    https://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka
    ● Apache Spark: Processing Engine.
    ● Apache Mesos: The Container.
    ● Akka: The Model.
    ● Apache Cassandra: The Storage.
    ● Apache Kafka: The Broker.

  5. Agenda
    » Pipeline to streaming
    » What is Apache Kafka
    ⋄ Overview
    ⋄ Architecture
    ⋄ Use cases
    » Kafka API
    ⋄ Python clients
    » Conclusion and More about Kafka

  6. What we will not focus on
    » Reliability and durability
    ⋄ Scaling, replication, guarantees
    ⋄ ZooKeeper
    » Log compaction
    » Administration, configuration, operations
    » Kafka Connect
    » Kafka Streams
    » Apache Kafka vs XXX
    ⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ,
    ZeroMQ, Redis, and ....

  7. What is
    Stream Processing

  8. 3 Paradigms for Programming
    1. Request/response
    2. Batch
    3. Stream processing
    https://qconnewyork.com/ny2016/ny2016/presentation/large-scale-stream-processing-apache-kafka.html

  9. Request/response

  10. Batch

  11. Stream Processing

  12. What is stream processing?
    » Data comes from the rise of events
    (orders, sales, clicks or trades)
    » Databases are event streams
    ⋄ the process of creating a backup or standby copy
    of a database
    ⋄ is really just publishing the database changes

  13. Data pipeline
    https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini
    What often happens in a complex data pipeline:
    ● Complexity meant that the data was always unreliable
    ● Reports were untrustworthy
    ● Derived indexes and stores were questionable
    ● Everyone spent a lot of time battling data quality
    issues of all kinds
    ● Data discrepancies

  14. Data pipeline
    Data streaming

  15. Apache Kafka 101

  16. The name, “Kafka”, came from?
    https://www.quora.com/What-is-the-relation-between-Kafka-the-writer-and-Apache-Kafka-the-distributed-messaging-system
    http://slideplayer.com/slide/4221536/
    https://en.wikipedia.org/wiki/Franz_Kafka

  17. What is Apache Kafka?
    Apache Kafka is a distributed system designed for streams. It is built to be
    fault-tolerant, high-throughput, horizontally scalable, and allows geographically
    distributing data streams and processing.
    https://kafka.apache.org

  18. Why Apache Kafka

  19. Fast
    Scalable
    Durable
    Distributed
    https://pixabay.com/photo-2135057/

  20. Stream data platform (Original mechanism)
    https://www.confluent.io/blog/stream-data-platform-1/
    Integration mechanism between systems

  21. Kafka as a service
    https://www.confluent.io/

  22. What a streaming data platform can provide
    » “Data integration” (ETL)
    ⋄ how to transport data between systems
    ⋄ captures streams of events or data changes and
    feeds these to other data systems
    » “Stream processing” (messaging)
    ⋄ continuous, real-time processing and transformation of these
    streams, with results made available system-wide
    [Figure: various systems in LinkedIn; analytical data processing
    with very low latency]
    https://www.confluent.io/blog/stream-data-platform-1/

  23. Kafka terminology
    » Producer
    » Consumer
    ⋄ Consumer group
    ⋄ offset
    » Broker
    » Topic
    » Partition
    » Message
    » Replica

  24. What Kafka Does
    Publish & subscribe
    ● to streams of data like a messaging system
    Process
    ● streams of data efficiently and in real time
    Store
    ● streams of data safely in a distributed replicated cluster
    https://kafka.apache.org/
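    To make the three roles concrete, here is a minimal publish/subscribe
    round trip sketched with kafka-python; the broker address localhost:9092
    and the topic name demo-topic are placeholder assumptions.

    from kafka import KafkaProducer, KafkaConsumer

    # Publish: append a record to the topic's log.
    producer = KafkaProducer(bootstrap_servers='localhost:9092')
    producer.send('demo-topic', b'hello kafka')
    producer.flush()  # block until the record is actually delivered

    # Subscribe: read the stream back, starting from the earliest offset.
    consumer = KafkaConsumer('demo-topic',
                             bootstrap_servers='localhost:9092',
                             auto_offset_reset='earliest',
                             consumer_timeout_ms=5000)  # stop iterating when idle
    for record in consumer:
        print(record.topic, record.partition, record.offset, record.value)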

  25. Publish/Subscribe
    P14 at https://www.slideshare.net/lucasjellema/amis-sig-introducing-apache-kafka-scalable-reliable-event-bus-message-queue

  26. P15 at https://www.slideshare.net/rahuldausa/real-time-analytics-with-apache-kafka-and-apache-spark
    [Diagram: consumer offset handling across versions: v0.8 consumers
    update offsets in ZooKeeper (port 2181), while v0.10 smart consumers
    update offsets through the brokers (port 9092)]

  27. A modern stream-centric data architecture built around Apache Kafka
    https://www.confluent.io/blog/stream-data-platform-1/
    500 billion events per day

  28. The key abstraction in Kafka is a structured commit log of updates.
    Producers append records to this log, and each data consumer has its
    own position in the log and advances independently. This allows a
    reliable, ordered stream of updates to be distributed to each consumer.
    The log can be sharded and spread over a cluster of machines, and each
    shard is replicated for fault-tolerance, handling TBs of data with
    parallel, ordered consumption (important to a change capture system
    for database updates).
    https://www.confluent.io/blog/stream-data-platform-1/

  29. Topics and Partitions
    » Topics are split into partitions
    » Partitions are strongly ordered & immutable
    » Partitions can exist on different servers
    » Partitions enable scalability
    » Producers assign a message to a partition within the topic
    ⋄ either round-robin (simply to balance load)
    ⋄ or according to the message key (see the sketch below)
    https://kafka.apache.org/documentation/#gettingStarted
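    The two assignment strategies look like this in kafka-python (a sketch;
    the topic name and key are illustrative placeholders):

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers='localhost:9092')

    # No key: the default partitioner spreads messages across partitions
    # to balance load.
    producer.send('clicks', value=b'page-view-1')

    # With a key: every message for user-42 hashes to the same partition,
    # so that user's events stay ordered.
    producer.send('clicks', key=b'user-42', value=b'page-view-2')
    producer.flush()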

  30. Offsets
    » Messages are assigned an offset within the partition
    » Consumers track their position as (offset, partition, topic)
    (see the sketch below)
    https://kafka.apache.org/documentation/#gettingStarted
    A two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups
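    In kafka-python that (offset, partition, topic) bookkeeping is visible
    through TopicPartition, position() and committed(); a sketch with
    placeholder names:

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                             group_id='demo-group')
    tp = TopicPartition('demo-topic', 0)
    consumer.assign([tp])            # take ownership of partition 0
    print(consumer.position(tp))     # next offset this consumer will read
    print(consumer.committed(tp))    # last offset committed by the group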

  31. Consumers and Partitions
    » A consumer group consumes a topic together
    » Within a group, a partition is always sent to the same consumer instance
    https://kafka.apache.org/documentation/#gettingStarted

  32. Consumer
    ● Messages are available to consumers only once they have been
    committed
    ● Kafka does not push
    ○ unlike JMS
    ● Reads do not destroy messages
    ○ unlike a JMS topic
    ● (Some) history is available
    ○ offline consumers can catch up
    ○ consumers can re-consume from the past (see the sketch below)
    ● Delivery guarantees
    ○ ordering maintained
    ○ at-least-once (per consumer) by default; at-most-once and exactly-once
    can be implemented
    P11 at https://www.slideshare.net/lucasjellema/amis-sig-introducing-apache-kafka-scalable-reliable-event-bus-message-queue
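    Because reads are non-destructive, re-consuming history is just a seek.
    A sketch with kafka-python (broker and topic names are placeholders):

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                             consumer_timeout_ms=5000)  # stop when caught up
    tp = TopicPartition('demo-topic', 0)
    consumer.assign([tp])
    consumer.seek_to_beginning(tp)   # rewind to the oldest retained offset
    for record in consumer:
        print(record.offset, record.value)   # replays past messages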

  33. ZooKeeper: the coordination interface
    between the Kafka brokers and consumers
    https://hortonworks.com/hadoop-tutorial/realtime-event-processing-nifi-kafka-storm/#section_3
    » Stores configuration data for distributed services
    » Used primarily by brokers
    » Used by consumers in 0.8, but no longer in 0.9

  34. Apache Kafka timeline

  35. Apache Kafka timeline
    » 2010: creation in LinkedIn
    » 2011-Nov: enters the Apache Software Foundation incubator
    » 2013-Nov: v0.8 (new producer, reassign-partitions)
    » 2014: Confluent founded
    » 2015-Nov: v0.9 (Kafka Connect, security, new consumer)
    » 2016-May: v0.10 (Kafka Streams, rack awareness)
    » v0.10.2: Single Message Transforms for Kafka Connect
    » Next version: ...

  36. TLS connection
    SSL is supported only for the new Kafka Producer and Consumer (Kafka versions 0.9.0 and higher)
    http://kafka.apache.org/documentation.html#security_ssl
    http://docs.confluent.io/current/kafka/ssl.html
    http://maximilianchrist.com/blog/connect-to-apache-kafka-from-python-using-ssl
    https://github.com/edenhill/librdkafka/wiki/Using-SSL-with-librdkafka
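    With kafka-python, a TLS connection is a matter of constructor
    arguments; this sketch assumes certificate files you have generated
    yourself and a broker listening for SSL on port 9093:

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers='broker:9093',
        security_protocol='SSL',
        ssl_cafile='ca.pem',         # CA certificate to verify the broker
        ssl_certfile='client.pem',   # client certificate
        ssl_keyfile='client.key',    # client private key
    )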

  37. Apache Kafka is considered as:
    a stream data platform
    » Commit log service
    » Messaging system
    » Circular buffer

  38. Cons of Apache Kafka
    » Consumer complexity (smart, but poor clients)
    » Lack of tooling/monitoring (3rd party)
    » Still a pre-1.0 release
    » Operationally, it’s more manual than desired
    » Requires ZooKeeper
    Sep 26, 2015
    http://www.slideshare.net/jimplush/introduction-to-apache-kafka-53225326

  39. Use Cases
    » Website activity tracking
    » Log aggregation
    » Stream processing
    » Event sourcing
    » Commit logs
    » Metrics (performance index streaming)
    ⋄ CPU/IO/memory usage
    ⋄ application-specific: time taken to load a web page,
    time taken to build a web page, no. of requests,
    no. of hits on a particular page/URL

  40. Event-driven Applications
    » How Kafka is first adopted, and how its role
    evolves over time in an architecture
    https://aws.amazon.com/tw/kafka/

  41. https://www.slideshare.net/ConfluentInc/iot-data-platforms-processing-iot-data-with-apache-kafka

  42. Conceptual Reference Architecture
    for Real-Time Processing in HDP 2.2
    https://hortonworks.com/blog/storm-kafka-together-real-time-data-refinery/ February 12, 2015

  43. Event delivery system design in Spotify
    https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/

  44. Case: Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spark Streaming
    http://helenaedelson.com/?p=1186 (2016/03)

  45. 2 + 2 Core APIs

  46. Four Core APIs
    » Producer API
    » Consumer API
    » Connect API
    » Streams API
    » (plus legacy APIs)
    A Unix-pipe analogy (Connect moves data in and out, Streams transforms
    it; see the consume-transform-produce sketch below):
    $ cat < in.txt | grep "python" | tr a-z A-Z > out.txt
    https://www.slideshare.net/ConfluentInc/apache-kafkaa-distributed-streaming-platform
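    The same pipeline, written as a consume-transform-produce loop with
    kafka-python (a sketch of the idea only; the Streams API itself is a
    JVM library, and the topic names here are placeholders):

    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer('in-topic', bootstrap_servers='localhost:9092')
    producer = KafkaProducer(bootstrap_servers='localhost:9092')

    for record in consumer:                    # cat < in.txt
        text = record.value.decode('utf-8')
        if 'python' in text:                   # grep "python"
            producer.send('out-topic',
                          text.upper().encode('utf-8'))  # tr a-z A-Z > out.txt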

  47. Kafka Clients
    » Java (officially maintained)
    » C/C++ (librdkafka)
    » Go (AKA golang)
    » Erlang
    » .NET
    » Clojure
    » Ruby
    » Node.js
    » Proxy (HTTP REST, etc.)
    » Perl
    » stdin/stdout
    » PHP
    » Rust
    » Alternative Java
    » Storm
    » Scala DSL
    » Clojure
    https://cwiki.apache.org/confluence/display/KAFKA/Clients
    » Python
    ⋄ confluent-kafka-python
    ⋄ kafka-python
    ⋄ pykafka

  48. Kafka Clients survey
    https://www.confluent.io/blog/first-annual-state-apache-kafka-client-use-survey (February 14, 2017)
    [Charts: Kafka client language adoption, and how users choose a Kafka
    client; results from 187 responses]
    Reliability:
    ● Stability should be a priority
    ● Good error handling
    ● Good testing
    ● Good metrics and logging

  49. Create your own Kafka broker
    https://github.com/Landoop/fast-data-dev

  50. See your brokers and topics
    ● kafka-topics-ui
    ○ demo: http://kafka-topics-ui.landoop.com/#/
    ● kafka-connect-ui
    ○ demo: http://kafka-connect-ui.landoop.com/
    ● kafka-manager (Yahoo)
    ● Kafka Eagle
    ● kafka-offset-monitor
    ● Kafka Tool (GUI)
    ● Datadog: https://www.datadoghq.com/

  51. Kafka Tool

  52. Kafka UI (Landoop)

  53. 2 + 2 Core APIs
    And python clients

  54. Kafka API Documents
    https://kafka.apache.org/0102/javadoc/index.html?

  55. Apache Kafka clients for Python
    » pykafka
    » kafka-python
    » confluent-kafka-python
    » librdkafka
    ⋄ the Apache Kafka C/C++ library underlying confluent-kafka-python

  56. pykafka
    https://github.com/Parsely/pykafka
    http://pykafka.readthedocs.io/en/latest/
    » Similar level of abstraction
    to the JVM Kafka client
    » Optional C backend built on librdkafka
    https://blog.parse.ly/post/3886/pykafka-now/ (2016, June)

  57. kafka-python
    https://github.com/dpkp/kafka-python/
    http://kafka-python.readthedocs.io/
    kafka-python is designed to function much like the official Java
    client, with a sprinkling of pythonic interfaces.
    API:
    ● Producer
    ● Consumer
    ● Message
    ● TopicPartition
    ● KafkaError
    ● KafkaException

  58. confluent-kafka-python
    Confluent's Python client for Apache Kafka and
    the Confluent Platform.
    Features:
    ● High performance (via librdkafka)
    ● Reliability
    ● Supported
    ● Future proof
    https://github.com/confluentinc/confluent-kafka-python
    http://docs.confluent.io/current/clients/confluent-kafka-python/index.html?

  59. Producer API (Java)
    https://kafka.apache.org/0102/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
    https://www.tutorialspoint.com/apache_kafka/apache_kafka_simple_producer_example.htm
    Writing data to Kafka: a client that publishes records to the Kafka cluster.
    Class KafkaProducer (sync and async)
    ○ close()
    ○ flush()
    ○ metrics()
    ○ partitionsFor(topic)
    ○ send(ProducerRecord record)
    Class ProducerRecord: a key/value pair to be sent to Kafka
    ● ProducerRecord(topic, V value)
    ● ProducerRecord(topic, Integer partition, K key, V value)
    Configuration settings (configuration is externalized in a property file):
    ● client.id
    ● producer.type
    ● acks
    ● retries
    ● bootstrap.servers
    ● linger.ms
    ● key.serializer
    ● value.serializer
    ● batch.size
    ● buffer.memory

  60. Producer API - pykafka
    from pykafka import KafkaClient
    from settings import ….

    client = KafkaClient(hosts=bootstrap_servers)
    topic = client.topics[topic_name.encode('utf-8')]
    producer = topic.get_producer(use_rdkafka=use_rdkafka)
    producer.produce(msg_payload)
    producer.stop()  # will flush the background queue

    Class pykafka.producer.Producer()
    ● produce(msg, partition_key=None)
    ● stop()
    Class pykafka.topic.Topic(cluster, topic_metadata)
    ● get_producer(use_rdkafka=False, **kwargs)
    http://pykafka.readthedocs.io/en/latest/api/producer.html
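    The consuming side of pykafka looks symmetric; a sketch with
    placeholder broker, topic and group names:

    from pykafka import KafkaClient

    client = KafkaClient(hosts='localhost:9092')
    topic = client.topics[b'demo-topic']
    consumer = topic.get_simple_consumer(consumer_group=b'demo-group')
    for message in consumer:
        if message is not None:
            print(message.offset, message.value)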

  61. Performance assessment
    https://blog.parse.ly/post/3886/pykafka-now/

  62. Producer API - kafka-python
    from kafka import KafkaProducer
    from settings import BOOTSTRAP_SERVERS, TOPICS, MSG

    p = KafkaProducer(bootstrap_servers=BOOTSTRAP_SERVERS)
    p.send(TOPICS, MSG.encode('utf-8'))
    p.flush()

    Values must be of type bytes, or be serializable to bytes via a
    configured value_serializer (see the sketch below).
    Class kafka.KafkaProducer(**configs)
    ● close(timeout=None)
    ● flush(timeout=None)
    ● partitions_for(topic)
    ● send(topic, value=None, key=None, partition=None, timestamp_ms=None)
    http://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html
    https://kafka-python.readthedocs.io/en/master/_modules/kafka/producer/kafka.html#KafkaProducer
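    Instead of encoding by hand, a value_serializer can turn Python
    objects into bytes for you; a sketch (the topic name and payload are
    placeholders):

    import json
    from kafka import KafkaProducer

    p = KafkaProducer(
        bootstrap_servers='localhost:9092',
        value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    )
    p.send('demo-topic', {'event': 'click', 'user': 42})  # a dict, not bytes
    p.flush()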

  63. Producer API - confluent-kafka-python
    from confluent_kafka import Producer
    from settings import BOOTSTRAP_SERVERS, TOPICS, MSG

    p = Producer({'bootstrap.servers': BOOTSTRAP_SERVERS})
    p.produce(TOPICS, MSG.encode('utf-8'))
    p.flush()

    Class confluent_kafka.Producer(config)
    ● len()
    ● flush([timeout])
    ● poll([timeout])
    ● produce(topic[, value][, key][, partition][, on_delivery][, timestamp])
    http://docs.confluent.io/current/clients/confluent-kafka-python/#producer
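    produce() is asynchronous, so delivery results arrive through the
    on_delivery callback, served by poll() or flush(); a sketch with
    placeholder broker and topic names:

    from confluent_kafka import Producer

    def delivery_report(err, msg):
        # Called once per message when poll() or flush() is invoked.
        if err is not None:
            print('delivery failed: {}'.format(err))
        else:
            print('delivered to {} [{}] at offset {}'.format(
                msg.topic(), msg.partition(), msg.offset()))

    p = Producer({'bootstrap.servers': 'localhost:9092'})
    p.produce('demo-topic', b'hello', on_delivery=delivery_report)
    p.poll(0)   # serve queued delivery callbacks
    p.flush()   # wait for all outstanding messages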

  64. Consumer
    ● Consumer group
    ○ group.id
    ○ session.timeout.ms
    ○ max.poll.records
    ○ heartbeat.interval.ms
    ● Offset management
    ○ enable.auto.commit
    ○ auto.commit.interval.ms
    ○ auto.offset.reset
    https://kafka.apache.org/documentation.html#newconsumerconfigs
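    kafka-python exposes these settings as constructor arguments; a sketch
    with illustrative values:

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        'demo-topic',
        bootstrap_servers='localhost:9092',
        group_id='demo-group',           # consumer group membership
        session_timeout_ms=30000,        # session.timeout.ms
        max_poll_records=500,            # max.poll.records
        heartbeat_interval_ms=3000,      # heartbeat.interval.ms
        enable_auto_commit=True,         # enable.auto.commit
        auto_commit_interval_ms=5000,    # auto.commit.interval.ms
        auto_offset_reset='earliest',    # auto.offset.reset
    )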

  65. Consumer API (Java)
    https://kafka.apache.org/0102/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
    Reading data from Kafka: a client that consumes records from a Kafka cluster.
    Class KafkaConsumer
    ● assign(partitions)
    ● assignment()
    ● beginningOffsets(partitions)
    ● close(long timeout, TimeUnit timeUnit)
    ● commitAsync(Map offsets, OffsetCommitCallback callback)
    ● commitSync(Map offsets)
    ● committed(TopicPartition partition)
    ● endOffsets(partitions)
    ● listTopics()
    ● metrics()
    ● offsetsForTimes(Map timestampsToSearch)
    ● partitionsFor(topic)
    ● pause(partitions)
    ● poll(long timeout)
    ● position(TopicPartition partition)
    ● resume(partitions)
    ● seek(TopicPartition partition, long offset)
    ● seekToBeginning(partitions)
    ● seekToEnd(partitions)
    ● subscribe(topics, ConsumerRebalanceListener listener)
    ● subscribe(Pattern pattern, ConsumerRebalanceListener listener)
    ● subscription()
    ● unsubscribe()
    ● wakeup()
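    The same subscribe/poll/close cycle, mirrored with
    confluent-kafka-python (a sketch; broker, topic and group id are
    placeholders):

    from confluent_kafka import Consumer

    c = Consumer({
        'bootstrap.servers': 'localhost:9092',
        'group.id': 'demo-group',
        'auto.offset.reset': 'earliest',
    })
    c.subscribe(['demo-topic'])
    try:
        while True:
            msg = c.poll(1.0)        # like the Java consumer's poll(timeout)
            if msg is None:
                continue
            if msg.error():
                print(msg.error())
                continue
            print(msg.partition(), msg.offset(), msg.value())
    except KeyboardInterrupt:
        pass
    finally:
        c.close()                    # like the Java consumer's close()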

  66. Kafka shell scripts

  67. Create a Kafka topic
    bin/kafka-topics.sh
    » Create, delete, describe, or change a topic.
    » Let's create a topic named "test" with a single partition and
    only one replica:
    ⋄ bin/kafka-topics.sh --create --zookeeper zhost:2181
    --replication-factor 1 --partitions 1 --topic test
    » List topics:
    ⋄ bin/kafka-topics.sh --list --zookeeper zhost:2181

  68. Python Kafka Client Benchmarking

  69. DEMO
    1. http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/
    2. https://github.com/sucitw/benchmark-python-client-for-kafka

  70. http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/
    Python Kafka Client Benchmarking

  71. Conclusion:
    pykafka, kafka-python or ?

  72. https://github.com/Parsely/pykafka/issues/559

  73. More about Kafka

  74. More about Kafka
    » Reliability and durability
    ⋄ Scaling, replication, guarantees, ZooKeeper
    » Log compaction
    » Administration, configuration, operations, monitoring
    » Kafka Connect
    » Kafka Streams
    » Schema Registry
    » REST Proxy
    » Apache Kafka vs XXX
    ⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ, ZeroMQ, Redis,
    and ....

  75. The Other 2 APIs
    » Connect API
    ○ JDBC, HDFS, S3, ….
    » Streams API
    ○ map, filter, aggregate, join

  76. More references
    1. The Log: What every software engineer should know about real-time data's unifying abstraction,
    Jay Kreps, 2013
    2. Pykafka and Kafka-python? https://github.com/Parsely/pykafka/issues/559
    3. Why I am not a fan of Apache Kafka (2015-2016 Sep)
    4. Kafka vs RabbitMQ
    a. What are the differences between Apache Kafka and RabbitMQ?
    b. Understanding When to use RabbitMQ or Apache Kafka
    5. Kafka summit (2016~)
    6. Future features of Kafka (Kafka Improvement Proposals)
    7. Kafka- The Definitive Guide

  77. We’re hiring!!
    (104 link)
