Slide 1

How to build stream data pipeline with Apache Kafka and Spark Structured Streaming Takanori AOKI PyCon Singapore 2019, Oct. 11 2019

Slide 2

Presenter profile 2 • Takanori Aoki • Based in Singapore since 2016 • Working as a Data Scientist • Natural language processing • Recommendation engines • Production system development • Professional background • IT infrastructure engineer in Tokyo • Data Scientist in Singapore

Slide 3

Agenda • Session overview • Batch processing vs stream processing • Technology stack • Apache Kafka • Apache Spark Structured Streaming • Application examples in Python • IoT data aggregation • Tweet sentiment analysis 3

Slide 4

Session overview • Target audience • Data Engineers, Data Scientists, and/or any Python developers who • are interested in stream processing • are not familiar with stream processing yet • What I will talk about • Overview of Apache Kafka and Spark Structured Streaming • How to create a simple stream data processing application in Python https://github.com/dstaka/kafka-spark-demo-pyconsg19 • What I will NOT talk about • System architectural design • Parameter tuning and optimization • System configuration procedure • Comparison with other stream processing technologies 4

Slide 5

Batch processing vs Stream processing 5 • Batch processing: all data generated during 2019-10-01 (12:00AM to 11:59PM) is processed on 2019-10-02, and all data generated during 2019-10-02 is processed on 2019-10-03. • Stream processing: data for 2019-10-01, 2019-10-02, … is processed periodically (e.g. every 10 seconds), upon data arrival, or by event trigger.

Slide 6

Batch processing vs Stream processing 6 • How the process works: batch collects a set of data over a certain period and then processes it; stream processes data piece by piece. • When to run: batch runs at scheduled times (e.g. every day at 1:00AM, every Sunday at 3:00AM); stream runs periodically (e.g. every 10 seconds), upon data arrival, or by event trigger. • Data source: batch reads database records, files, etc.; stream reads data generated by IoT devices, or any real-time / semi-real-time data. • Output storage: batch writes to a database, filesystem, object storage, etc.; stream writes to a database, filesystem, object storage, or does not store the result at all. • Performance metric: batch is measured by throughput (how many records are processed); stream is measured by latency (how long it takes to process each piece of data). • Use cases: batch is used for ETL into a data warehouse; stream is used to capture real-time events from mobile devices or for fraud detection on online services.
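To make the contrast concrete, here is a minimal Python sketch (the file name and record format are hypothetical): the batch job processes a whole day of records in one run, while the stream handler updates its result as each record arrives.

import json

# Batch: collect a full day of records, then process them in one run
def batch_daily_average(path):
    with open(path) as f:                      # e.g. all records for 2019-10-01
        records = [json.loads(line) for line in f]
    speeds = [r["speed"] for r in records]
    return sum(speeds) / len(speeds)           # one result for the whole day

# Stream: process each record the moment it arrives (running average)
class StreamAverager:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def on_record(self, record):               # called once per incoming record
        self.count += 1
        self.total += record["speed"]
        return self.total / self.count         # result is updated continuously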

Slide 7

Technology stack 7 Example of a stream processing system architecture: stream data source → data hub → stream processing → data store

Slide 8

Apache Kafka overview • Distributed, resilient, fault-tolerant, and scalable messaging middleware • Distributed among multiple processes on multiple servers • Publish/Subscribe messaging model • Can be a "data hub" between diverse data sources (so-called "Producers") and processors (so-called "Consumers") • Decouples system dependencies • Integrates with other Big Data processing products such as Spark, Flink, and Storm • APIs are available for data transmission • Originally developed by LinkedIn • Currently an open source Apache project, mainly maintained by Confluent • Used by LinkedIn, Netflix, Airbnb, Uber, Walmart, etc. 8
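A minimal publish/subscribe round trip with the kafka-python client used later in this deck; the broker address and topic name here are assumptions for illustration, not part of the demo.

from kafka import KafkaProducer, KafkaConsumer

broker = 'localhost:9092'        # assumed local broker
topic = 'demo-topic'             # hypothetical topic name

# Producer: publish a message to the topic
producer = KafkaProducer(bootstrap_servers=broker)
producer.send(topic, b'hello from a producer')
producer.flush()                 # make sure the message is actually sent

# Consumer: subscribe to the topic and read messages
consumer = KafkaConsumer(
    topic,
    bootstrap_servers=broker,
    auto_offset_reset='earliest',     # start from the beginning of the topic
    consumer_timeout_ms=5000,         # stop iterating after 5s without messages
)
for message in consumer:
    print(message.topic, message.offset, message.value)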

Slide 9

Apache Kafka architecture 9 • Producers (Publishers): Producer 1, Producer 2, … Producer X publish messages to topics. • Kafka Cluster: Broker 1, Broker 2, … Broker Y host the topics (e.g. Topic A, Topic B); ZooKeeper coordinates the brokers. • Consumers (Subscribers): Consumer 1, Consumer 2, … Consumer Z subscribe to topics, organized into consumer groups (e.g. consumer group α, consumer group β).

Slide 10

Apache Kafka architecture in detail 10 • Producers (Publishers) append messages to a topic on the broker; messages are stored in an ordered log with numbered offsets (1, 2, 3, … 7 in the diagram). • Consumers (Subscribers) in a consumer group (Consumer 1, Consumer 2) track their position in the log via the commit offset, the current offset, and the log end offset. • ZooKeeper coordinates the Kafka cluster.
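A sketch of how those three offsets can be inspected with the kafka-python client; the topic, partition, and group names are hypothetical.

from kafka import KafkaConsumer, TopicPartition

# Hypothetical topic/partition and consumer group on a local broker
tp = TopicPartition('demo-topic', 0)
consumer = KafkaConsumer(
    bootstrap_servers='localhost:9092',
    group_id='demo-group',
    enable_auto_commit=False,           # commit offsets explicitly
    auto_offset_reset='earliest',
)
consumer.assign([tp])

records = consumer.poll(timeout_ms=1000)   # fetch a batch of records
consumer.commit()                          # commit the position reached so far

print('committed offset :', consumer.committed(tp))          # durable progress marker
print('current offset   :', consumer.position(tp))           # next offset to be fetched
print('log end offset   :', consumer.end_offsets([tp])[tp])  # newest offset in the log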

Slide 11

Apache Spark technology stack 11 https://www.slideshare.net/Cloudera_jp/spark-164769207

Slide 12

Apache Spark Structured Streaming Basic concept 12 Consider the input data stream as the “Input Table”. Every data item arriving on the stream is like a new row being appended to the Input Table. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Slide 13

Apache Spark Structured Streaming Basic concept 13 (Example) Word count *1 https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html *1 Output modes of Spark Structured Streaming: • Complete: the entire updated Result Table is written to the external storage. • Append: only the new rows appended to the Result Table since the last trigger are written to the external storage; applicable only to queries where existing rows in the Result Table are not expected to change. • Update: only the rows that were updated in the Result Table since the last trigger are written to the external storage.
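A minimal streaming word count to illustrate the Input Table / Result Table idea and the output modes; it assumes a local socket source on port 9999, as in the Spark programming guide, rather than the Kafka source used later.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("WordCountModes").getOrCreate()

# Lines arriving on the socket behave like rows appended to the unbounded Input Table
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

words = lines.select(explode(split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()     # the Result Table

# "complete" rewrites the whole Result Table on every trigger;
# "update" would emit only the rows whose counts changed since the last trigger
query = word_counts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
query.awaitTermination()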

Slide 14

Apache Spark Structured Streaming Event-time-window-based aggregation 14 https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html (Example) Word count • Count words within 10-minute time windows • Update the counts every 5 minutes • Word counts are computed over sliding 10-minute windows • e.g. 12:00 - 12:10, 12:05 - 12:15, 12:10 - 12:20 • A word received at 12:07 should increment the counts of two windows: 12:00 - 12:10 and 12:05 - 12:15.
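A sketch of the windowed count described above, assuming a streaming DataFrame `lines` with a text column and an event-time column (e.g. the socket source with includeTimestamp enabled); this mirrors the Spark guide's example rather than the demo code.

from pyspark.sql.functions import window, explode, split, col

# Assume `lines` has columns: value (text), timestamp (event time)
words = lines.select(
    explode(split(col("value"), " ")).alias("word"),
    col("timestamp"),
)

# 10-minute windows sliding every 5 minutes: a word at 12:07 falls into
# both the 12:00-12:10 and the 12:05-12:15 window
windowed_counts = words.groupBy(
    window(col("timestamp"), "10 minutes", "5 minutes"),
    col("word"),
).count()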

Slide 15

Apache Spark Structured Streaming Event-time-window-based aggregation 15 https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html (Example) Word count • Data that arrives later than the watermark threshold is dropped (ignored) by watermarking
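The same windowed count with a watermark added, reusing the `words` DataFrame from the previous sketch: Spark keeps window state only for data up to the stated lateness, and anything arriving later than that is ignored.

from pyspark.sql.functions import window, col

# Keep state for data arriving up to 10 minutes late; drop anything later
windowed_counts = words \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(col("timestamp"), "10 minutes", "5 minutes"),
        col("word"),
    ).count()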

Slide 16

Example 1: IoT data aggregation Data generated by IoT devices 16 [{"device_id":10, "timestamp":"2019-10-01 09:00:00", "speed":12.245, "accelerometer_x":0.596, "accelerometer_y":-9.699, "accelerometer_z":-0.245}] [{"device_id":10, "timestamp":"2019-10-01 09:01:00", "speed":12.047, "accelerometer_x":-0.288, "accelerometer_y":-9.125, "accelerometer_z":-1.265}] [{"device_id":11, "timestamp":"2019-10-01 09:00:00", "speed":15.549, "accelerometer_x":-0.007, "accelerometer_y":9.005, "accelerometer_z":-2.079}] [{"device_id":11, "timestamp":"2019-10-01 09:01:00", "speed":15.29, "accelerometer_x":0.165, "accelerometer_y":9.287, "accelerometer_z":-1.2}]

Slide 17

Example 1: IoT data aggregation Overview • Three IoT devices send data to Kafka • Data is sent in JSON format • Each JSON record includes device_id, timestamp, speed, accelerometer_x, accelerometer_y, and accelerometer_z • These JSON records are aggregated using stream processing • IoT data aggregation flow 0. Create Kafka topic 1. Send IoT data to Kafka broker 2. Spark Structured Streaming pulls IoT data from Kafka broker 3. Parse JSON format data 4. Aggregate data by device_id and time window 5. Output result • Demo code • https://github.com/dstaka/kafka-spark-demo-pyconsg19/tree/master/iot_demo 17

Slide 18

Example 1: IoT data aggregation
0. Create Kafka topic 18

$ kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic topic-iot-raw
Created topic topic-iot-raw.
$ kafka-topics --list --zookeeper localhost:2181 | grep iot
topic-iot-raw

(Diagram: Producer 1 publishes to topic topic-iot-raw on a broker in the Kafka cluster, coordinated by ZooKeeper; Spark subscribes as Consumer 1 in a consumer group.)

Slide 19

Example 1: IoT data aggregation
1. Send IoT data to Kafka broker 19
Producer: iot_demo/kafka_producer_iot.py

import pandas as pd
import time
from kafka import KafkaProducer

# Set Kafka config
kafka_broker_hostname='localhost'
kafka_broker_portno='9092'
kafka_broker=kafka_broker_hostname + ':' + kafka_broker_portno
kafka_topic='topic-iot-raw'
data_send_interval=5

# Create KafkaProducer instance
producer = KafkaProducer(bootstrap_servers=kafka_broker)

# Load demo data
iot_data_id10 = pd.read_csv('./data/iot_data_id10.csv')
iot_data_id11 = pd.read_csv('./data/iot_data_id11.csv')
iot_data_id12 = pd.read_csv('./data/iot_data_id12.csv')

# Send demo data to Kafka broker
for _index in range(0, len(iot_data_id10)):
    json_iot_id10 = iot_data_id10[iot_data_id10.index==_index].to_json(orient='records')
    producer.send(kafka_topic, bytes(json_iot_id10, 'utf-8'))
    json_iot_id11 = iot_data_id11[iot_data_id11.index==_index].to_json(orient='records')
    producer.send(kafka_topic, bytes(json_iot_id11, 'utf-8'))
    json_iot_id12 = iot_data_id12[iot_data_id12.index==_index].to_json(orient='records')
    producer.send(kafka_topic, bytes(json_iot_id12, 'utf-8'))
    time.sleep(data_send_interval)

(Example message) [{"device_id":10, "timestamp":"2019-10-01 09:01:00", "speed":12.047, "accelerometer_x":-0.288, "accelerometer_y":-9.125, "accelerometer_z":-1.265}]

Slide 20

Example 1: IoT data aggregation
2. Spark Structured Streaming pulls IoT data from Kafka broker 20
Consumer: iot_demo/kafka_consumer_iot_agg_to_console.py

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from kafka import KafkaConsumer

# Set Kafka config
kafka_broker_hostname='localhost'
kafka_consumer_portno='9092'
kafka_broker=kafka_broker_hostname + ':' + kafka_consumer_portno
kafka_topic_input='topic-iot-raw'

# Create Spark session
spark = SparkSession.builder.appName("AggregateIoTdata").getOrCreate()
spark.sparkContext.setLogLevel("WARN")

# Pull data from Kafka topic
# (the kafka-python consumer below is not used by the streaming query;
#  Spark's Kafka source handles the consumption itself)
consumer = KafkaConsumer(kafka_topic_input)
df_kafka = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_broker) \
    .option("subscribe", kafka_topic_input) \
    .load()

# Convert data from Kafka broker into String type
df_kafka_string = df_kafka.selectExpr("CAST(value AS STRING) as value")

Slide 21

Example 1: IoT data aggregation
3. Parse JSON format data 21
Consumer: iot_demo/kafka_consumer_iot_agg_to_console.py

# Define schema to read JSON format data
iot_schema = StructType() \
    .add("device_id", LongType()) \
    .add("timestamp", StringType()) \
    .add("speed", DoubleType()) \
    .add("accelerometer_x", DoubleType()) \
    .add("accelerometer_y", DoubleType()) \
    .add("accelerometer_z", DoubleType())

# Parse JSON data
df_kafka_string_parsed = df_kafka_string.select(from_json(df_kafka_string.value, iot_schema).alias("iot_data"))

(Input examples)
[{"device_id":10, "timestamp":"2019-10-01 09:00:00", "speed":12.245, "accelerometer_x":0.596, "accelerometer_y":-9.699, "accelerometer_z":-0.245}]
[{"device_id":11, "timestamp":"2019-10-01 09:00:00", "speed":15.549, "accelerometer_x":-0.007, "accelerometer_y":9.005, "accelerometer_z":-2.079}]
[{"device_id":12, "timestamp":"2019-10-01 09:00:00", "speed":10.076, "accelerometer_x":-0.042, "accelerometer_y":9.005, "accelerometer_z":0.573}]
[{"device_id":10, "timestamp":"2019-10-01 09:01:00", "speed":12.047, "accelerometer_x":-0.288, "accelerometer_y":-9.125, "accelerometer_z":-1.265}]

(Parsed result)
device_id | timestamp           | speed  | accelerometer_x | accelerometer_y | accelerometer_z
10        | 2019-10-01 09:00:00 | 12.245 | 0.596           | -9.699          | -0.245
11        | 2019-10-01 09:00:00 | 15.549 | -0.007          | 9.005           | -2.079
12        | 2019-10-01 09:00:00 | 10.076 | -0.042          | 9.005           | 0.573
10        | 2019-10-01 09:01:00 | 12.047 | -0.288          | -9.125          | -1.265

Slide 22

Example 1: IoT data aggregation
3. Parse JSON format data 22
Consumer: iot_demo/kafka_consumer_iot_agg_to_console.py

df_kafka_string_parsed_formatted = df_kafka_string_parsed.select(
    col("iot_data.device_id").alias("device_id"),
    col("iot_data.timestamp").alias("timestamp"),
    col("iot_data.speed").alias("speed"),
    col("iot_data.accelerometer_x").alias("accelerometer_x"),
    col("iot_data.accelerometer_y").alias("accelerometer_y"),
    col("iot_data.accelerometer_z").alias("accelerometer_z"))

# Convert timestamp field from string to Timestamp format
df_kafka_string_parsed_formatted_timestamped = df_kafka_string_parsed_formatted.withColumn(
    "timestamp",
    to_timestamp(df_kafka_string_parsed_formatted.timestamp, 'yyyy-MM-dd HH:mm:ss'))

(Input examples and parsed result: same as on the previous slide)

Slide 23

Example 1: IoT data aggregation
4. Aggregate data by device_id and time-window 23
Consumer: iot_demo/kafka_consumer_iot_agg_to_console.py

# Compute averages of speed, accelerometer_x, accelerometer_y, and accelerometer_z over 5-minute windows
# Data that arrives more than 10 minutes late is ignored (watermark)
df_windowavg = df_kafka_string_parsed_formatted_timestamped \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(df_kafka_string_parsed_formatted_timestamped.timestamp, "5 minutes"),
        df_kafka_string_parsed_formatted_timestamped.device_id) \
    .avg("speed", "accelerometer_x", "accelerometer_y", "accelerometer_z")

# Add columns showing each window's start and end timestamp
df_windowavg_timewindow = df_windowavg.select(
    "device_id",
    col("window.start").alias("window_start"),
    col("window.end").alias("window_end"),
    col("avg(speed)").alias("avg_speed"),
    col("avg(accelerometer_x)").alias("avg_accelerometer_x"),
    col("avg(accelerometer_y)").alias("avg_accelerometer_y"),
    col("avg(accelerometer_z)").alias("avg_accelerometer_z")
).orderBy(asc("device_id"), asc("window_start"))

Slide 24

Example 1: IoT data aggregation
5. Output to console 24
Consumer: iot_demo/kafka_consumer_iot_agg_to_console.py

# Print output to console
query_console = df_windowavg_timewindow.writeStream.outputMode("complete").format("console").start()
query_console.awaitTermination()

Slide 25

Example 1: IoT data aggregation
5. Output to Kafka 25
Consumer: iot_demo/kafka_consumer_iot_agg_to_kafka.py

kafka_topic_output='topic-iot-agg'

# Send output to Kafka
query_kafka = df_windowavg_timewindow \
    .selectExpr("to_json(struct(*)) AS value") \
    .writeStream.outputMode("update") \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_broker) \
    .option("topic", kafka_topic_output) \
    .option("checkpointLocation", "checkpoint/send_to_kafka") \
    .start()
query_kafka.awaitTermination()

Slide 26

Example 1: IoT data aggregation
5. Output to Elasticsearch 26
Consumer: iot_demo/kafka_consumer_iot_agg_to_es.py

# Set Elasticsearch config
es_hostname='localhost'
es_portno='9200'
es_doc_type_name='doc-iot-demo/doc'

# Send output to Elasticsearch
# (the "es" format is provided by the elasticsearch-hadoop Spark connector)
query_es = df_kafka_string_parsed_formatted_timestamped \
    .selectExpr("to_json(struct(*)) AS value") \
    .writeStream \
    .format("es") \
    .outputMode("append") \
    .option("es.nodes", es_hostname) \
    .option("es.port", es_portno) \
    .option("checkpointLocation", "checkpoint/send_to_es") \
    .option('es.resource', es_doc_type_name) \
    .start("orders/log")
query_es.awaitTermination()

Slide 27

Example 2: Tweet Sentiment Analysis Overview • Not only "SQL-like" processing but also more complex logic can be implemented on Spark Structured Streaming • Tweet sentiment analysis flow 0. Create Kafka topic 1. Extract tweets from the timeline and send them to the Kafka broker 2. Spark Structured Streaming pulls tweet data from the Kafka broker 3. Tokenize tweet text data 4. Compute a sentiment score for each tweet 5. Output result • Demo code • https://github.com/dstaka/kafka-spark-demo-pyconsg19/tree/master/tweet_sentiment_analysis_demo 27

Slide 28

Example 2: Tweet Sentiment Analysis
1. Extract tweets from timeline and send to Kafka broker 28
Producer: tweet_sentiment_analysis_demo/kafka_producer_sentiment.py

import json
from requests_oauthlib import OAuth1Session
import pandas as pd
import time
import configparser
from kafka import KafkaProducer

# Set Kafka config
kafka_broker_hostname='localhost'
kafka_broker_portno='9092'
kafka_broker=kafka_broker_hostname + ':' + kafka_broker_portno
kafka_topic='topic-tweet-raw'
producer = KafkaProducer(bootstrap_servers=kafka_broker)

# Read Twitter API credentials from config file
config = configparser.RawConfigParser()
config.read('./config/twitter_credentials.conf')
CONSUMER_KEY = config.get('Twitter_API', 'CONSUMER_KEY')
CONSUMER_SECRET = config.get('Twitter_API', 'CONSUMER_SECRET')
ACCESS_TOKEN = config.get('Twitter_API', 'ACCESS_TOKEN')
ACCESS_TOKEN_SECRET = config.get('Twitter_API', 'ACCESS_TOKEN_SECRET')

endpoint_url = "https://api.twitter.com/1.1/statuses/home_timeline.json"

Slide 29

Example 2: Tweet Sentiment Analysis
1. Extract tweets from timeline and send to Kafka broker 29
Producer: tweet_sentiment_analysis_demo/kafka_producer_sentiment.py

# Authenticate Twitter API
twitter = OAuth1Session(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
params = {'count': 200}  # request parameters for the timeline API (example value, not shown on the slide)
res = twitter.get(endpoint_url, params = params)

# Parse JSON data from Twitter API and send to Kafka
timelines = json.loads(res.text)

# Parse JSON in each tweet to send to Kafka
for line_dict in timelines:
    # Remove retweet records
    if "RT @" in line_dict['text']:
        continue
    kafka_msg_str=''
    line_str = json.dumps(line_dict)
    kafka_msg_str = (kafka_msg_str
        + '{ "tweet_id" : ' + line_dict['id_str']
        + ', "tweet_property" : { '
        + '"user_id" : ' + str(line_dict['user']['id']) + ', '
        + '"user_name" : "' + line_dict['user']['name'] + '", '
        + '"tweet" : "' + line_dict['text'].replace('\n', ' ').replace('"', ' ') + '", '
        + '"created_at" : "' + line_dict['created_at'] + '"} }\n')
    producer.send(kafka_topic, bytes(kafka_msg_str, 'utf-8'))

Slide 30

Example 2: Tweet Sentiment Analysis
1. Extract tweets from timeline and send to Kafka broker 30
Producer (output examples; the tweet bodies are Japanese text)

{ "tweet_id" : 136, "tweet_property" : {"user_name" : "Taro", "tweet" : "タイム圧倒的やんけ、もったいね〜(...", "created_at" : "Sat Oct 05 09:20 +0000 2019"} }
{ "tweet_id" : 158, "tweet_property" : { "user_name" : "Rikako", "tweet" : "『その行為は子供の最も基本的な信頼", "created_at" : "Sat Oct 05 09:20 +0000 2019"} }
{ "tweet_id" : 141, "tweet_property" : {"user_name" : "Satoshi", "tweet" : "【ゆる募】コンパクト距離空間 X ", "created_at" : "Sat Oct 05 09:20 +0000 2019"} }

Slide 31

Example 2: Tweet Sentiment Analysis
2. Spark Structured Streaming pulls tweet data from Kafka broker 31
Consumer: tweet_sentiment_analysis_demo/kafka_consumer_sentiment.py

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Set Kafka config
kafka_broker_hostname='localhost'
kafka_broker_portno='9092'
kafka_broker=kafka_broker_hostname + ':' + kafka_broker_portno
kafka_topic_input='topic-tweet-raw'

# Create Spark session
spark = SparkSession.builder.appName("tweet_sentiment_analysis").getOrCreate()
spark.sparkContext.setLogLevel("WARN")

## Input from Kafka
# Pull data from Kafka topic
df_kafka = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_broker) \
    .option("subscribe", kafka_topic_input) \
    .load()

# Convert data from Kafka into String type
df_kafka_string = df_kafka.selectExpr("CAST(value AS STRING) as value")

# Define schema to read JSON format data
# Deal with the nested structure
tweet_property_schema = StructType() \
    .add("user_name", StringType()) \
    .add("tweet", StringType()) \
    .add("created_at", StringType())
tweet_schema = StructType().add("tweet_id", LongType()).add("tweet_property", tweet_property_schema)

# Parse JSON data
df_kafka_string_parsed = df_kafka_string.select(
    from_json(df_kafka_string.value, tweet_schema).alias("tweet_data"))
df_kafka_string_parsed_formatted = df_kafka_string_parsed.select(
    col("tweet_data.tweet_id").alias("tweet_id"),
    col("tweet_data.tweet_property.user_name").alias("user_name"),
    col("tweet_data.tweet_property.tweet").alias("tweet"),
    col("tweet_data.tweet_property.created_at").alias("created_at"))

Slide 32

Example 2: Tweet Sentiment Analysis
3. Tokenize tweet text data
4. Compute sentiment score for each tweet 32
Consumer: tweet_sentiment_analysis_demo/kafka_consumer_sentiment.py

# JapaneseTokenizer and pn_dic (the word polarity dictionary) are defined/loaded elsewhere in the script

# Define tokenize function as UDF to run on Spark
def tokenize(text: str):
    tokenizer = JapaneseTokenizer()
    return tokenizer.create_wordlist(text)
udf_tokenize = udf(tokenize)

# Tokenize
df_kafka_string_parsed_formatted_tokenized = df_kafka_string_parsed_formatted.select(
    '*', udf_tokenize('tweet').alias('tweet_tokenized'))

# Define score function as UDF to run on Spark
# Derive polarity score of words from a list of tokens
def get_pn_scores(tokens):
    counter=1
    scores=0
    for surface in tokens:
        if surface in pn_dic:
            counter=counter+1
            scores=scores+(pn_dic[surface])
    return scores/counter
udf_get_pn_scores = udf(get_pn_scores)

# Compute sentiment score
df_kafka_string_parsed_formatted_score = df_kafka_string_parsed_formatted_tokenized.select(
    'tweet_id', 'user_name', 'tweet', 'created_at',
    udf_get_pn_scores('tweet_tokenized').alias('sentiment_score'))

Slide 33

Example 2: Tweet Sentiment Analysis 5. Output result 33 Consumer

Slide 34

Summary • Stream processing processes data piece by piece • We can use Python to implement stream processing with the Apache Kafka and Spark Structured Streaming frameworks • Let's play with stream data ☺ 34

Slide 35

References • Structured Streaming Programming Guide https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html • Real-Time End-to-End Integration with Apache Kafka in Apache Spark's Structured Streaming https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html • Introducing Low-latency Continuous Processing Mode in Structured Streaming in Apache Spark 2.3 https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html • Building Streaming Applications with Apache Spark https://pages.databricks.com/structured-streaming-apache-spark.html • Kafka use cases (Powered By) https://cwiki.apache.org/confluence/display/KAFKA/Powered+By 35