
How to build stream data pipeline with Apache Kafka and Spark Structured Streaming

A deck used for PyCon SG 2019

dstaka

October 11, 2019

Transcript

  1. How to build stream data pipeline with Apache Kafka and

    Spark Structured Streaming Takanori AOKI PyCon Singapore 2019, Oct. 11 2019
  2. Presenter profile 2 • Takanori Aoki • Based in Singapore

    since 2016 • Working as a Data Scientist • Natural language processing • Recommendation engine • Production system development • Professional background • IT infrastructure engineer in Tokyo • Data Scientist in Singapore
  3. Agenda • Session overview • Batch processing vs stream processing

    • Technology stack • Apache Kafka • Apache Spark Structured Streaming • Application examples in Python • IoT data aggregation • Tweet sentiment analysis 3
  4. Session overview • Target audience • Data Engineers, Data Scientists,

    and/or any Python developers who • are interested in stream processing • are not yet familiar with stream processing • What I will talk about • Overview of Apache Kafka and Spark Structured Streaming • How to create a simple stream data processing application in Python https://github.com/dstaka/kafka-spark-demo-pyconsg19 • What I will NOT talk about • System architectural design • Parameter tuning and optimization • System configuration procedure • Comparison with other stream processing technologies 4
  5. Batch processing vs Stream processing 5

    (Timeline diagram) Batch processing: data collected throughout 2019-10-01 (12:00AM to 11:59PM) is processed on 2019-10-02, and data collected throughout 2019-10-02 is processed on 2019-10-03. Stream processing: data is processed as it arrives, either periodically (e.g. every 10 seconds), upon data arrival, or by event trigger.
  6. Batch processing vs Stream processing 6

    • How it works: batch collects a set of data over a certain period and then processes it; stream processes data piece by piece.
    • When to run: batch runs at scheduled times (e.g. every day at 1:00AM, every Sunday at 3:00AM); stream runs periodically (e.g. every 10 seconds), upon data arrival, or by event trigger.
    • Data source: batch reads database records, files, etc.; stream reads data generated by IoT devices or any real-/semi-real-time data.
    • Output storage: batch writes to a database, filesystem, object storage, etc.; stream writes to a database, filesystem, or object storage, or does not store the output at all.
    • Performance metric: batch is measured by throughput (how many records are processed); stream is measured by latency (how long it takes to process each piece of data).
    • Use case: batch suits ETL for a data warehouse; stream suits capturing real-time events from mobile devices or fraud detection on online services.
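    To make the contrast concrete, here is a sketch (not from the deck) of how the same PySpark API splits into a batch read and a stream read; the file path, broker address, and topic name are illustrative placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("BatchVsStream").getOrCreate()

    # Batch processing: read a finite dataset that was collected earlier,
    # and process it once on a schedule (e.g. a daily job)
    df_batch = spark.read.json("/data/iot/2019-10-01/")   # placeholder path

    # Stream processing: treat records arriving on a Kafka topic as an
    # unbounded input that is processed piece by piece as it arrives
    df_stream = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "topic-iot-raw")
        .load())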
  7. Technology stack Example of stream processing system architecture 7

    (Diagram) Stream data source → Data hub → Stream processing → Data store
  8. Apache Kafka overview • Distributed, resilient, fault tolerant, and scalable

    messaging middleware • Distributed among multiple processes on multiple servers • Publish / Subscribe messaging model • Can be a “data hub” between diverse data sources (so-called “Producers”) and processors (so-called “Consumers”) • Decouples system dependencies • Integrates with other Big Data processing products such as Spark, Flink, and Storm • APIs are available for data transmission • Originally developed by LinkedIn • Currently an open source project mainly maintained by Confluent • Used by LinkedIn, Netflix, Airbnb, Uber, Walmart, etc. 8
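    As a minimal illustration of the publish/subscribe model described above (not from the deck; the topic name is a placeholder), a producer and a consumer using the kafka-python client might look like this:

    from kafka import KafkaProducer, KafkaConsumer

    # Producer (publisher): send a message to a topic on the broker
    producer = KafkaProducer(bootstrap_servers='localhost:9092')
    producer.send('demo-topic', b'hello from a producer')
    producer.flush()

    # Consumer (subscriber): read messages from the same topic
    consumer = KafkaConsumer('demo-topic',
                             bootstrap_servers='localhost:9092',
                             auto_offset_reset='earliest')
    for message in consumer:
        print(message.topic, message.offset, message.value)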
  9. Apache Kafka architecture 9

    (Diagram) Producers (Publishers) 1 through X publish messages to Topic A and Topic B, which are hosted on Brokers 1 through Y in a Kafka Cluster coordinated by ZooKeeper. Consumers (Subscribers) 1 through Z, organized into Consumer groups α and β, subscribe to those topics.
  10. Apache Kafka architecture in detail 10

    (Diagram) Producers (Publishers) append messages to Topic A on a Broker in the Kafka Cluster, which is coordinated by ZooKeeper; the messages occupy offsets 1 through 7. Consumers 1 and 2 belong to one Consumer group and read from the topic; their progress is tracked by the commit offset, the current offset, and the log end offset.
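    A sketch (not from the deck) of how a consumer group and its committed offsets appear in code, using the kafka-python client that the demo relies on later; process() is a hypothetical per-message handler:

    from kafka import KafkaConsumer

    # Consumers sharing a group_id split the topic's partitions between them;
    # the group's progress in each partition is recorded as the committed offset
    consumer = KafkaConsumer('demo-topic',
                             bootstrap_servers='localhost:9092',
                             group_id='consumer-group-alpha',
                             enable_auto_commit=False)   # commit offsets manually

    for message in consumer:
        process(message.value)   # hypothetical per-message handler
        consumer.commit()        # advance the committed offset past this message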
  11. Apache Spark Structured Streaming Basic concept 12 Consider the input

    data stream as the “Input Table”. Every data item arriving on the stream is like a new row being appended to the Input Table. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
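    The word-count example from the linked programming guide illustrates the idea: the stream is queried as if it were a growing table. A condensed sketch (the socket source and port come from the guide, not from the demo code):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

    # Each line arriving on the socket becomes a new row of the unbounded Input Table
    lines = (spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load())

    # Transformations are written as if `lines` were a static DataFrame
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    word_counts = words.groupBy("word").count()

    # The Result Table is recomputed incrementally as new rows arrive
    query = (word_counts.writeStream
        .outputMode("complete")
        .format("console")
        .start())
    query.awaitTermination()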
  12. Apache Spark Structured Streaming Basic concept 13 *1 Output mode

    of Spark Structured Streaming https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html (Example) Word count *1 Output mode Description Complete The entire updated Result Table will be written to the external storage. Append Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only on the queries where existing rows in the Result Table are not expected to change. Update Only the rows that were updated in the Result Table since the last trigger will be written to the external storage. This mode only outputs the rows that have changed since the last trigger.
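    In code, the output mode is simply an argument to writeStream; a minimal sketch reusing the hypothetical word_counts DataFrame from the previous sketch:

    # "complete" rewrites the whole Result Table each trigger; "update" emits only
    # the changed rows; "append" suits queries whose emitted rows never change later
    query = (word_counts.writeStream
        .outputMode("update")
        .format("console")
        .start())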
  13. Apache Spark Structured Streaming Event-time-window-based aggregation 14 https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html (Example) Word

    count • Count words within 10-minute time windows • Update every 5 minutes • i.e. word counts over words received in 10-minute windows • e.g. 12:00 - 12:10, 12:05 - 12:15, 12:10 - 12:20 • A word received at 12:07 should increment the counts for two windows: 12:00 - 12:10 and 12:05 - 12:15.
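    A sketch of the windowed word count described above, following the programming guide, reusing the `lines` streaming DataFrame from the earlier word-count sketch and assuming it also carries an event-time column named timestamp (e.g. via the socket source's includeTimestamp option):

    from pyspark.sql.functions import explode, split, window

    # Pair each word with the event time of the line it arrived on
    words = lines.select(
        explode(split(lines.value, " ")).alias("word"),
        lines.timestamp)

    # 10-minute windows sliding every 5 minutes: a word received at 12:07 is
    # counted in both the 12:00-12:10 and the 12:05-12:15 window
    windowed_counts = (words
        .withWatermark("timestamp", "10 minutes")
        .groupBy(window(words.timestamp, "10 minutes", "5 minutes"), words.word)
        .count())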
  14. Example 1: IoT data aggregation Data generated by IoT devices

    16 [{"device_id":10, "timestamp":"2019-10-01 09:00:00", "speed":12.245, "accelerometer_x":0.596, "accelerometer_y":-9.699, "accelerometer_z":-0.245}] [{"device_id":10, "timestamp":"2019-10-01 09:01:00", "speed":12.047, "accelerometer_x":-0.288, "accelerometer_y":-9.125, "accelerometer_z":-1.265}] [{"device_id":11, "timestamp":"2019-10-01 09:00:00", "speed":15.549, "accelerometer_x":-0.007, "accelerometer_y":9.005, "accelerometer_z":-2.079}] [{"device_id":11, "timestamp":"2019-10-01 09:01:00", "speed":15.29, "accelerometer_x":0.165, "accelerometer_y":9.287, "accelerometer_z":-1.2}]
  15. Example 1: IoT data aggregation Overview • IoT devices send

    data to Kafka • Data is sent in JSON format • Each JSON record includes device_id, timestamp, speed, accelerometer_x, accelerometer_y, and accelerometer_z • These JSON records are aggregated using stream processing • IoT data aggregation flow 0. Create Kafka topic 1. Send IoT data to Kafka broker 2. Spark Structured Streaming pulls IoT data from Kafka broker 3. Parse JSON format data 4. Aggregate data by device_id and time window 5. Output result • Demo code • https://github.com/dstaka/kafka-spark-demo-pyconsg19/tree/master/iot_demo 17
  16. Example 1: IoT data aggregation 0. Create Kafka topic 18

    0. Create Kafka topic $ kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic topic-iot-raw Created topic topic-iot-raw. $ kafka-topics --list --zookeeper localhost:2181 | grep iot topic-iot-raw bb Producers ZooKeeper Broker 1 Spark (Consumers) Consumer 1 Broker topic-iot-raw Consumer group Kafka Cluster Producer 1 Broker
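    The same topic can also be created from Python; a sketch (not part of the demo repository) using kafka-python's admin client, which talks to the broker rather than to ZooKeeper, unlike the CLI above:

    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers='localhost:9092')
    admin.create_topics(new_topics=[
        NewTopic(name='topic-iot-raw', num_partitions=1, replication_factor=1)])
    print(admin.list_topics())   # should now include 'topic-iot-raw'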
  17. Example 1: IoT data aggregation 1. Send IoT data to

    Kafka broker 19 import pandas as pd import time from kafka import KafkaProducer # Set Kafka config kafka_broker_hostname='localhost' kafka_broker_portno='9092' kafka_broker=kafka_broker_hostname + ':' + kafka_broker_portno kafka_topic='topic-iot-raw' data_send_interval=5 # Create KafkaProducer instance producer = KafkaProducer(bootstrap_servers=kafka_broker) # Load demo data iot_data_id10 = pd.read_csv('./data/iot_data_id10.csv') iot_data_id11 = pd.read_csv('./data/iot_data_id11.csv') iot_data_id12 = pd.read_csv('./data/iot_data_id12.csv') # Send demo data to Kafka broker for _index in range(0, len(iot_data_id10)): json_iot_id10 = iot_data_id10[iot_data_id10.index==_index].to_json(orient='records') producer.send(kafka_topic, bytes(json_iot_id10, 'utf-8')) json_iot_id11 = iot_data_id11[iot_data_id11.index==_index].to_json(orient='records') producer.send(kafka_topic, bytes(json_iot_id11, 'utf-8')) json_iot_id12 = iot_data_id12[iot_data_id12.index==_index].to_json(orient='records') producer.send(kafka_topic, bytes(json_iot_id12, 'utf-8')) time.sleep(data_send_interval) [{"device_id":10, "timestamp":"2019-10-01 09:01:00", "speed":12.047, "accelerometer_x":-0.288, "accelerometer_y":-9.125, "accelerometer_z":-1.265}] Producer iot_demo/kafka_producer_iot.py
  18. Example 1: IoT data aggregation 2. Spark Structured Streaming pulls

    IoT data from Kafka broker 20 from pyspark.sql import SparkSession from pyspark.sql.functions import * from pyspark.sql.types import * from kafka import KafkaConsumer # Set Kafka config kafka_broker_hostname='localhost' kafka_consumer_portno='9092' kafka_broker=kafka_broker_hostname + ':' + kafka_consumer_portno kafka_topic_input='topic-iot-raw' # Create Spark session spark = SparkSession.builder.appName("AggregateIoTdata").getOrCreate() spark.sparkContext.setLogLevel("WARN") # Pull data from Kafka topic consumer = KafkaConsumer(kafka_topic_input) df_kafka = spark.readStream \ .format("kafka") \ .option("kafka.bootstrap.servers", kafka_broker) \ .option("subscribe", kafka_topic_input) \ .load() # Convert data from Kafka broker into String type df_kafka_string = df_kafka.selectExpr( "CAST(value AS STRING) as value") Consumer iot_demo/kafka_consumer_iot_agg_to_console.py
  19. Example 1: IoT data aggregation 3. Parse JSON format data

    21 # Define schema to read JSON format data iot_schema = StructType() \ .add("device_id", LongType()) \ .add("timestamp", StringType()) \ .add("speed", DoubleType()) \ .add("accelerometer_x", DoubleType()) \ .add("accelerometer_y", DoubleType()) \ .add("accelerometer_z", DoubleType()) # Parse JSON data df_kafka_string_parsed = df_kafka_string.select(from_json(df_kafka_string.value, iot_schema).alias("iot_data")) [{"device_id":10, "timestamp":"2019-10-01 09:00:00", "speed":12.245, "accelerometer_x":0.596, "accelerometer_y":-9.699, "accelerometer_z":-0.245}] [{"device_id":11, "timestamp":"2019-10-01 09:00:00", "speed":15.549, "accelerometer_x":-0.007, "accelerometer_y":9.005, "accelerometer_z":-2.079}] device_id timestamp speed accelerometer_x accelerometer_y accelerometer_z 10 2019-10-01 09:00:00 12.245 0.596 -9.699 -0.245 11 2019-10-01 09:00:00 15.549 -0.007 9.005 -2.079 12 2019-10-01 09:00:00 10.076 -0.042 9.005 0.573 10 2019-10-01 09:01:00 12.047 -0.288 -9.125 -1.265 [{"device_id":12, "timestamp":"2019-10-01 09:00:00", "speed":10.076, "accelerometer_x":-0.042, "accelerometer_y":9.005, "accelerometer_z":0.573}] [{"device_id":10, "timestamp":"2019-10-01 09:01:00", "speed":12.047, "accelerometer_x":-0.288, "accelerometer_y":-9.125, "accelerometer_z":-1.265}] iot_demo/kafka_consumer_iot_agg_to_console.py Consumer
  20. Example 1: IoT data aggregation 3. Parse JSON format data

    22 df_kafka_string_parsed_formatted = df_kafka_string_parsed.select( col("iot_data.device_id").alias("device_id"), col("iot_data.timestamp").alias("timestamp"), col("iot_data.speed").alias("speed"), col("iot_data.accelerometer_x").alias("accelerometer_x"), col("iot_data.accelerometer_y").alias("accelerometer_y"), col("iot_data.accelerometer_z").alias("accelerometer_z")) # Convert timestamp field from string to Timestamp format df_kafka_string_parsed_formatted_timestamped = df_kafka_string_parsed_formatted.withColumn("timestamp", to_timestamp(df_kafka_string_parsed_formatted.timestamp, 'yyyy-MM-dd HH:mm:ss')) [{"device_id":10, "timestamp":"2019-10-01 09:00:00", "speed":12.245, "accelerometer_x":0.596, "accelerometer_y":-9.699, "accelerometer_z":-0.245}] [{"device_id":11, "timestamp":"2019-10-01 09:00:00", "speed":15.549, "accelerometer_x":-0.007, "accelerometer_y":9.005, "accelerometer_z":-2.079}] device_id timestamp speed accelerometer_x accelerometer_y accelerometer_z 10 2019-10-01 09:00:00 12.245 0.596 -9.699 -0.245 11 2019-10-01 09:00:00 15.549 -0.007 9.005 -2.079 12 2019-10-01 09:00:00 10.076 -0.042 9.005 0.573 10 2019-10-01 09:01:00 12.047 -0.288 -9.125 -1.265 [{"device_id":12, "timestamp":"2019-10-01 09:00:00", "speed":10.076, "accelerometer_x":-0.042, "accelerometer_y":9.005, "accelerometer_z":0.573}] [{"device_id":10, "timestamp":"2019-10-01 09:01:00", "speed":12.047, "accelerometer_x":-0.288, "accelerometer_y":-9.125, "accelerometer_z":-1.265}] iot_demo/kafka_consumer_iot_agg_to_console.py Consumer
  21. Example 1: IoT data aggregation 4. Aggregate data by device_id

    and time-window 23 # Compute the average of speed, accelerometer_x, accelerometer_y, and accelerometer_z over 5-minute windows # Data arriving more than 10 minutes late is ignored (watermark) df_windowavg = df_kafka_string_parsed_formatted_timestamped.withWatermark("timestamp", "10 minutes").groupBy( window(df_kafka_string_parsed_formatted_timestamped.timestamp, "5 minutes"), df_kafka_string_parsed_formatted_timestamped.device_id).avg("speed", "accelerometer_x", "accelerometer_y", "accelerometer_z") # Add columns showing each window start and end timestamp df_windowavg_timewindow = df_windowavg.select( "device_id", col("window.start").alias("window_start"), col("window.end").alias("window_end"), col("avg(speed)").alias("avg_speed"), col("avg(accelerometer_x)").alias("avg_accelerometer_x"), col("avg(accelerometer_y)").alias("avg_accelerometer_y"), col("avg(accelerometer_z)").alias("avg_accelerometer_z") ).orderBy(asc("device_id"), asc("window_start")) iot_demo/kafka_consumer_iot_agg_to_console.py Consumer
  22. Example 1: IoT data aggregation 5. Output to console 24

    # Print output to console query_console = df_windowavg_timewindow.writeStream.outputMode("complete").format("console").start() query_console.awaitTermination() iot_demo/kafka_consumer_iot_agg_to_console.py Consumer
  23. Example 1: IoT data aggregation 5. Output to Kafka 25

    kafka_topic_output='topic-iot-agg' # Send output to Kafka query_kafka = df_windowavg_timewindow \ .selectExpr("to_json(struct(*)) AS value") \ .writeStream.outputMode("update") \ .format("kafka") \ .option("kafka.bootstrap.servers", kafka_broker) \ .option("topic", kafka_topic_output) \ .option("checkpointLocation", "checkpoint/send_to_kafka") \ .start() query_kafka.awaitTermination() iot_demo/kafka_consumer_iot_agg_to_kafka.py Consumer
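    To check that the aggregated records actually land in the output topic, one could read them back with a plain consumer; a sketch (not part of the demo repository):

    from kafka import KafkaConsumer

    # Inspect the aggregated JSON records written by the streaming query
    consumer = KafkaConsumer('topic-iot-agg',
                             bootstrap_servers='localhost:9092',
                             auto_offset_reset='earliest')
    for message in consumer:
        print(message.value.decode('utf-8'))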
  24. Example 1: IoT data aggregation 5. Output to Elasticsearch 26

    iot_demo/kafka_consumer_iot_agg_to_es.py # Set Elasticsearch config es_hostname='localhost' es_portno='9200' es_doc_type_name='doc-iot-demo/doc' # Send output to Elasticsearch query_es = df_kafka_string_parsed_formatted_timestamped \ .selectExpr("to_json(struct(*)) AS value") \ .writeStream \ .format("es") \ .outputMode("append") \ .option("es.nodes", es_hostname) \ .option("es.port", es_portno) \ .option("checkpointLocation", "checkpoint/send_to_es") \ .option('es.resource', es_doc_type_name) \ .start("orders/log") query_es.awaitTermination() Consumer
  25. Example 2: Tweet Sentiment Analysis Overview • Not only “SQL-like”

    processing, but also more complex logic can be implemented on Spark Structured Streaming • Tweet sentiment analysis flow 0. Create Kafka topic 1. Extract tweets from the timeline and send them to the Kafka broker 2. Spark Structured Streaming pulls tweet data from the Kafka broker 3. Tokenize tweet text data 4. Compute a sentiment score for each tweet 5. Output the result • Demo code • https://github.com/dstaka/kafka-spark-demo-pyconsg19/tree/master/tweet_sentiment_analysis_demo 27
  26. Example 2: Tweet Sentiment Analysis 1. Extract tweets from timeline

    and send to Kafka broker 28 import json from requests_oauthlib import OAuth1Session import pandas as pd import time import configparser from kafka import KafkaProducer # Set Kafka config kafka_broker_hostname='localhost' kafka_broker_portno='9092' kafka_broker=kafka_broker_hostname + ':' + kafka_broker_portno kafka_topic='topic-tweet-raw' producer = KafkaProducer(bootstrap_servers=kafka_broker) Producer tweet_sentiment_analysis_demo/kafka_producer_sentiment.py # Read Twitter API credentials information from config file config = configparser.RawConfigParser() config.read('./config/twitter_credentials.conf') CONSUMER_KEY = config.get('Twitter_API', 'CONSUMER_KEY') CONSUMER_SECRET = config.get('Twitter_API', 'CONSUMER_SECRET') ACCESS_TOKEN = config.get('Twitter_API', 'ACCESS_TOKEN') ACCESS_TOKEN_SECRET = config.get('Twitter_API', 'ACCESS_TOKEN_SECRET') endpoint_url = "https://api.twitter.com/1.1/statuses/home_timeline.json"
  27. Example 2: Tweet Sentiment Analysis 1. Extract tweets from timeline

    and send to Kafka broker 29 # Authenticate Twitter API twitter = OAuth1Session(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET) res = twitter.get(endpoint_url, params = params) # Parse JSON data from Twitter API and send to Kafka timelines = json.loads(res.text) # Parse JSON in each tweet to send to Kafka for line_dict in timelines: # Remove retweet records if "RT @" in line_dict['text']: continue kafka_msg_str='' line_str = json.dumps(line_dict) kafka_msg_str = (kafka_msg_str + '{ "tweet_id" : ' + line_dict['id_str'] + ', "tweet_property" : { ' + '"user_id" : ' + str(line_dict['user']['id']) + ', ' + '"user_name" : "' + line_dict['user']['name'] + '", ' + '"tweet" : "' + line_dict['text'].replace('\n', ' ').replace('"', ' ') + '", ' + '"created_at" : "' + line_dict['created_at'] + '"} }\n') producer.send(kafka_topic, bytes(kafka_msg_str, 'utf-8')) Producer tweet_sentiment_analysis_demo/kafka_producer_sentiment.py
  28. Example 2: Tweet Sentiment Analysis 1. Extract tweets from timeline

    and send to Kafka broker 30 { "tweet_id" : 136, "tweet_property" : {"user_name" : "Taro", "tweet" : "タイム圧倒的やんけ、もったいね〜(...", "created_at" : "Sat Oct 05 09:20 +0000 2019"} } { "tweet_id" : 158, "tweet_property" : { "user_name" : "Rikako", "tweet" : "『その行為は子供の最も基本的な信頼", "created_at" : "Sat Oct 05 09:20 +0000 2019"} } { "tweet_id" : 141, "tweet_property" : {"user_name" : "Satoshi", "tweet" : "【ゆる募】コンパクト距離空間 X ", "created_at" : "Sat Oct 05 09:20 +0000 2019"} } Producer (Output examples) Tweet body Tweet body Tweet body
  29. Example 2: Tweet Sentiment Analysis 2. Spark Structured Streaming pulls

    tweet data from Kafka broker 31 # Set Kafka config kafka_broker_hostname='localhost' kafka_broker_portno='9092' kafka_broker=kafka_broker_hostname + ':' + kafka_broker_portno kafka_topic_input='topic-tweet-raw' # Create Spark session spark = SparkSession.builder.appName("tweet_sentiment_analysis") .getOrCreate() spark.sparkContext.setLogLevel("WARN") ## Input from Kafka # Pull data from Kafka topic df_kafka = spark.readStream \ .format("kafka") \ .option("kafka.bootstrap.servers", kafka_broker) \ .option("subscribe", kafka_topic_input) \ .load() # Convert data from Kafka into String type df_kafka_string = df_kafka.selectExpr("CAST(value AS STRING) as value") # Define schema to read JSON format data # Deal with nested structure tweet_property_schema = StructType() \ .add("user_name", StringType()) \ .add("tweet", StringType()) \ .add("created_at", StringType()) tweet_schema = StructType().add("tweet_id", LongType()).add("tweet_property", tweet_property_schema) # Parse JSON data df_kafka_string_parsed = df_kafka_string.select( from_json(df_kafka_string.value, tweet_schema).alias("tweet_data")) df_kafka_string_parsed_formatted = df_kafka_string_parsed.select( col("tweet_data.tweet_id").alias("tweet_id"), col("tweet_data.tweet_property.user_name").alias("user_name"), col("tweet_data.tweet_property.tweet").alias("tweet"), col("tweet_data.tweet_property.created_at").alias("created_at")) Consumer tweet_sentiment_analysis_demo/kafka_consumer_sentiment.py
  30. Example 2: Tweet Sentiment Analysis 3. Tokenize tweet text data

    4. Compute sentiment score in each tweet 32 # Define tokenize function as UDF to run on Spark def tokenize(text: str): tokenizer = JapaneseTokenizer() return tokenizer.create_wordlist(text) udf_tokenize = udf(tokenize) # Tokenize df_kafka_string_parsed_formatted_tokenized = df_kafka_string_parsed_formatted.select('*', udf_tokenize('tweet').alias('tweet_tokenized')) # Define score function as UDF to run on Spark # Derive polarity score from a list of tokens (pn_dic is a word-polarity dictionary loaded elsewhere in the script) def get_pn_scores(tokens): counter=1 scores=0 for surface in tokens: if surface in pn_dic: counter=counter+1 scores=scores+(pn_dic[surface]) return scores/counter udf_get_pn_scores = udf(get_pn_scores) # Compute sentiment score df_kafka_string_parsed_formatted_score = df_kafka_string_parsed_formatted_tokenized.select( 'tweet_id', 'user_name', 'tweet', 'created_at', udf_get_pn_scores('tweet_tokenized').alias('sentiment_score')) Consumer tweet_sentiment_analysis_demo/kafka_consumer_sentiment.py
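    Step 5 of the flow ("Output result") is not shown on the slides; a minimal sketch that writes the scored stream to the console, mirroring the console sink of Example 1 (the actual sink used in the demo code may differ):

    # Print tweets together with their sentiment scores to the console
    query_console = (df_kafka_string_parsed_formatted_score.writeStream
        .outputMode("append")
        .format("console")
        .option("truncate", "false")
        .start())
    query_console.awaitTermination()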
  31. Summary • Stream processing processes data piece by piece •

    We can use Python to implement stream processing with the Apache Kafka and Spark Structured Streaming frameworks • Let’s play with stream data ☺ 34
  32. References • Structured Streaming Programming Guide https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html • Real-Time End-to-End

    Integration with Apache Kafka in Apache Spark’s Structured Streaming https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html • Introducing Low-latency Continuous Processing Mode in Structured Streaming in Apache Spark 2.3 https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html • Building Streaming Applications with Apache Spark https://pages.databricks.com/structured-streaming-apache-spark.html • Kafka use cases https://cwiki.apache.org/confluence/display/KAFKA/Powered+By 35