1 Streaming data processing using Apache Spark
Roksolana Diachuk, WWCode Kyiv Lead, Data Engineer at Ciklum

2 Agenda
1. Streaming data concept and use cases
2. Spark Streaming introduction and main types
3. Examples
3.1. Basic streaming examples
3.2. Structured streaming examples
4. Use case
5. Spark Streaming future

3 Real-time data

4 Data stream
A data stream is a sequence of data units. Streaming data is data generated continuously, in real time, by multiple sources.

5 Batch processing Stream processing

6 Batch-processing

7 Streaming data processing aspects
1. Performance
2. Scalability and reliability
3. Tool/framework ecosystem
4. Message delivery semantics

11 Message delivery guarantees
1. At-most-once
2. At-least-once
3. Exactly-once
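The three guarantees can be illustrated with a small plain-Scala model. This is only a sketch under assumptions, not Spark code: the source is replayable by offset, the consumer crashes exactly once at offset crashAt, and the names atMostOnce, atLeastOnce and exactlyOnce are hypothetical. The difference lies in whether the offset is committed before or after processing, and whether the sink is idempotent.

```scala
// Hypothetical model: each function returns what ends up in the sink.

// At-most-once: commit the offset BEFORE processing, so a crash loses the message.
def atMostOnce(messages: Vector[Int], crashAt: Int): Vector[Int] = {
  var committed = 0
  var crashed = false
  val out = Vector.newBuilder[Int]
  while (committed < messages.length) {
    val i = committed
    committed += 1                                // commit first
    if (i == crashAt && !crashed) crashed = true  // crash before processing: message lost
    else out += messages(i)
  }
  out.result()
}

// At-least-once: process first, commit afterwards, so a crash causes a replay.
def atLeastOnce(messages: Vector[Int], crashAt: Int): Vector[Int] = {
  var committed = 0
  var crashed = false
  val out = Vector.newBuilder[Int]
  while (committed < messages.length) {
    val i = committed
    out += messages(i)                            // process first
    if (i == crashAt && !crashed) crashed = true  // crash before commit: offset not advanced
    else committed += 1
  }
  out.result()
}

// Exactly-once (effectively): at-least-once delivery plus an idempotent sink keyed by offset.
def exactlyOnce(messages: Vector[Int], crashAt: Int): Vector[Int] = {
  var committed = 0
  var crashed = false
  var sink = Map.empty[Int, Int]                  // offset -> value; rewriting an offset changes nothing
  while (committed < messages.length) {
    val i = committed
    sink = sink.updated(i, messages(i))
    if (i == crashAt && !crashed) crashed = true
    else committed += 1
  }
  sink.toVector.sortBy(_._1).map(_._2)
}
```

With messages Vector(10, 20, 30) and a crash at offset 1, the first variant drops 20, the second delivers 20 twice, and the third stores each value once.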

12 It all depends...

13 Apache Spark ecosystem

14 Spark streaming types
Basic streaming: DStream[RDD[T]]
Structured streaming: DataFrame

15 Micro-batching
[Diagram: Spark driver launching short tasks over a sequence of micro-batches of a DStream, with a WAL (write-ahead log)]

16 DStream
[Diagram: a DStream as a sequence of RDDs, one per interval: time 1, time 2, time 3, time 4]
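The idea of a DStream as one batch of records per interval can be sketched in plain Scala. This is a simplified model, not the Spark API: Event, microBatches and the millisecond timestamps are hypothetical names, and time is assumed to start at 0.

```scala
final case class Event(timeMs: Long, value: String)

// Cut a timeline of events into consecutive batches of batchIntervalMs,
// mirroring how a DStream exposes one RDD per batch interval.
// Intervals without events are kept as empty batches.
def microBatches(events: Seq[Event], batchIntervalMs: Long): Vector[Vector[String]] =
  if (events.isEmpty) Vector.empty
  else {
    val byBatch = events.groupBy(_.timeMs / batchIntervalMs)
    val lastBatch = byBatch.keys.max
    (0L to lastBatch).map { i =>
      byBatch.getOrElse(i, Seq.empty[Event]).sortBy(_.timeMs).map(_.value).toVector
    }.toVector
  }

val sampleEvents = Seq(Event(0L, "a"), Event(1000L, "b"), Event(3500L, "c"), Event(9100L, "d"))
val batches = microBatches(sampleEvents, batchIntervalMs = 3000L)
```

With a 3-second interval the four sample events land in batches 0, 0, 1 and 3, and batch 2 stays empty.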

17 DStream transformations
[Diagram: flatMap applied to each RDD of the lines DStream produces the corresponding RDD of the words DStream]
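A DStream transformation is applied batch by batch. The sketch below models that with plain Scala collections, where each inner Vector stands in for one RDD; flatMapBatches and the sample lines are hypothetical and only illustrate the shape of the operation.

```scala
// Each inner Vector plays the role of one RDD in the "lines" DStream.
val linesBatches: Vector[Vector[String]] = Vector(
  Vector("big data", "spark streaming"),
  Vector("structured streaming rocks")
)

// Apply flatMap independently to every batch, as a DStream does per RDD.
def flatMapBatches[A, B](batches: Vector[Vector[A]])(f: A => Seq[B]): Vector[Vector[B]] =
  batches.map(batch => batch.flatMap(f))

// "words" keeps the batch boundaries of "lines"; only the records change.
val wordsBatches: Vector[Vector[String]] =
  flatMapBatches(linesBatches)(line => line.split(" ").toSeq)
```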

18 Structured streaming
[Diagram: a data stream treated as an unbounded table; new rows from the stream are appended to the unbounded table]

19 Incrementalization process
[Diagram: input table -> incremental query -> result table]
Output modes:
● Append
● Complete
● Update
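The output modes differ in which part of the result table is written after each trigger. The following plain-Scala sketch models a running word count; step, completeOutput, updateOutput and appendOutput are hypothetical helpers, not Spark functions.

```scala
// Fold the rows that arrived in one trigger into the running result table.
def step(previous: Map[String, Long], newRows: Seq[String]): Map[String, Long] =
  newRows.foldLeft(previous)((acc, word) => acc.updated(word, acc.getOrElse(word, 0L) + 1L))

// Complete: the whole result table is written after every trigger.
def completeOutput(current: Map[String, Long]): Map[String, Long] = current

// Update: only the rows that are new or changed since the last trigger are written.
def updateOutput(previous: Map[String, Long], current: Map[String, Long]): Map[String, Long] =
  current.filter { case (word, count) => !previous.get(word).contains(count) }

// Append: only rows newly added to the result table are written. It fits queries whose
// existing rows never change, modelled here as a plain filter without aggregation.
def appendOutput(newRows: Seq[String])(keep: String => Boolean): Seq[String] =
  newRows.filter(keep)

val table1 = step(Map.empty, Seq("spark", "kafka", "spark"))
val table2 = step(table1, Seq("spark", "flink"))
```

After the second trigger, complete mode writes all three rows of table2, while update mode writes only the rows for "spark" and "flink".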

20 Window transformations

21 Window operations on DStream
[Diagram: a window-based operation turns the original DStream (time 1 to time 5) into a windowed DStream (window 1, window 2)]

22 Sliding window operations on DStream
[Diagram: a window-based operation turns the original DStream (time 1 to time 5) into a windowed DStream of overlapping windows (window 1, window 2, window 3)]
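A window operation combines several consecutive batches into one. The sketch below models this with plain Scala collections; windowed is a hypothetical helper, and the window length of 3 batches with a slide of 2 is an assumed example, not a value taken from the slides.

```scala
// Combine `windowLength` consecutive batches into one window, moving forward by
// `slideInterval` batches each time. Incomplete trailing windows are dropped.
def windowed[A](batches: Vector[Vector[A]], windowLength: Int, slideInterval: Int): Vector[Vector[A]] =
  batches
    .sliding(windowLength, slideInterval)
    .filter(_.size == windowLength)
    .map(_.flatten)
    .toVector

// One record per batch interval: time 1 .. time 5.
val timeline = Vector(Vector(1), Vector(2), Vector(3), Vector(4), Vector(5))

val slidingWindows  = windowed(timeline, windowLength = 3, slideInterval = 2) // windows overlap
val tumblingWindows = windowed(timeline, windowLength = 2, slideInterval = 2) // windows do not overlap
val windowSums      = slidingWindows.map(_.sum)
```

When the slide interval is shorter than the window length, a batch (here time 3) is counted in more than one window.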

23 Structured streaming windows Data stream

24 Window transformations
[Figure: time axis 11.00, 11.05, 11.10, 11.15 with three successive result tables: (t1 1, t1 2); (t1 1, t1 2, t2 8, t2 5, t2 3); (t1 1, t1 2, t2 8, t2 5, t2 3, t3 4, t3 11)]

25 Late data
[Figure: time axis 11.00, 11.05, 11.10, 11.15 with three successive result tables: (t1 1, t1 2); (t1 1, t1 2, t2 8, t2 5, t2 3); (t1 1, t1 3, t2 8, t2 6, t2 3, t3 4, t3 11)]

26 Watermarking
Watermark = 5 minutes
[Figure: time axis 11.00, 11.05, 11.10, 11.15 with three successive result tables: (t1 1, t1 2); (t1 1, t1 2, t2 8, t2 5, t2 3); (t1 2, t1 2, t2 8, t2 6, t2 3, t3 4, t3 11)]
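The effect of a watermark on late data can be sketched in plain Scala. This is a deliberately simplified model: it updates the watermark after every record, whereas Spark updates it per trigger. Reading, countWithWatermark and the minute-based timestamps (minutes after 11.00) are hypothetical.

```scala
final case class Reading(eventTimeMin: Int, key: String) // event time in minutes after 11.00

// Count readings per (window start, key) over tumbling windows, dropping a reading
// when its window already ended at or before the watermark
// (watermark = latest event time seen so far minus the allowed delay).
def countWithWatermark(arrivals: Seq[Reading], windowMin: Int, watermarkMin: Int): Map[(Int, String), Int] = {
  var maxEventTime = Int.MinValue
  var counts = Map.empty[(Int, String), Int]
  for (r <- arrivals) {
    val watermark = if (maxEventTime == Int.MinValue) Int.MinValue else maxEventTime - watermarkMin
    val windowStart = (r.eventTimeMin / windowMin) * windowMin
    val windowEnd = windowStart + windowMin
    if (windowEnd > watermark) {                 // window still open: late data is accepted
      val k = (windowStart, r.key)
      counts = counts.updated(k, counts.getOrElse(k, 0) + 1)
    }                                            // otherwise the reading is too late and is dropped
    maxEventTime = math.max(maxEventTime, r.eventTimeMin)
  }
  counts
}

// Arrival order differs from event-time order: the 3rd and 5th readings are late.
val arrivals = Seq(Reading(1, "t1"), Reading(7, "t2"), Reading(3, "t1"), Reading(14, "t3"), Reading(4, "t1"))
val counted = countWithWatermark(arrivals, windowMin = 5, watermarkMin = 5)
```

The reading at minute 3 arrives late but within the 5-minute watermark and is still counted; the reading at minute 4 arrives after an event from minute 14 has been seen and is dropped.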

27 Output operations
01 print()
02 saveAsTextFiles(filename)
03 saveAsObjectFiles(filename)
04 saveAsHadoopFiles(filename)

28 Application execution modes
Deploy modes: client mode, cluster mode
Cluster managers: Standalone, YARN, Mesos, Kubernetes

29 Examples

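30 Streaming application entrypoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A new micro-batch is produced every 3 seconds
val batchInterval = Seconds(3)
val streamingContext = new StreamingContext(sparkContext, batchInterval)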

31 Example data source

32 Spark streaming example
val lines = KafkaUtils.createDirectStream[String, String](streamingContext, ...)
// Keep only the payload of each Kafka record
val values = lines.map(_.value())

33 Spark streaming example
[Diagram: lines: DStream[ConsumerRecord[String, String]] -> (_.value()), applied to each record (ConsumerRecord[String, String] => String) -> values: DStream[String]]

34 Filtering
val filteredData = values.filter(tweetText => tweetText.contains("big data"))
[Diagram: values: DStream[String] -> filter(...) -> filteredData: DStream[String]]

35 Spark streaming example
values.foreachRDD(rdd => {
  val dataFrame = /* DataFrameReader expression, missing on the slide */
    .schema(dataSchema)
    .load()
})

36 Streaming example visualization
[Diagram: values: DStream[String] -> foreachRDD -> rdd: RDD[String] -> schema(schema) (StructType => DataFrameReader) -> .load() -> dataFrame: DataFrame]

37 Aggregations
val colName = "user.followers_count"
values.foreachRDD(rdd => {
  ...
  // Smallest and largest follower count within the current micro-batch
  dataFrame.agg(
    min(dataFrame.col(colName)),
    max(dataFrame.col(colName)))
})

38 Application launch
streamingContext.start()
streamingContext.awaitTermination()

39 Structured streaming

40 Structured streaming example
// tweetsStream is a DataFrame
val tweetsStream = sparkSession.readStream
  .format("kafka")
  .options(...)
  .load()

41 Selection
val structuredData = tweetsStream
  .select(tweetsStream("value").cast(StringType))
  .select(from_json($"value", dataStruct).as("tweet"))

42 Selection
[Diagram: tweetsStream: DataFrame -> .select(tweetsStream("value").cast(StringType)) -> .select(from_json($"value", dataStruct)) -> structuredData: DataFrame]

43 Filtering
val filteredData = structuredData
  .filter(structuredData.col("tweet.user.lang").contains("en"))

44 Filtering
[Diagram: structuredData: DataFrame -> .filter(structuredData.col("tweet.user.lang").contains("en")) (Column => Dataset[Row]) -> filteredData: Dataset[Row]]

45 Aggregations
val groupedData = structuredData
  .groupBy("tweet.user.lang")
  .count()
  .orderBy(desc("count"))

46 Aggregations
[Diagram: structuredData: DataFrame -> .groupBy("tweet.user.lang") ((String, String) => RelationalGroupedDataset) -> .count() (DataFrame) -> .orderBy(desc("count")) -> groupedData: Dataset[Row]]

47 Windowed aggregations
val windowedData = structuredData
  .groupBy(
    window(structuredData.col("timestamp_ms"), "7 minutes", "3 minutes"),
    updatedTweets.col(userLanguage))
  .count()

48 Windowed aggregations
[Diagram: structuredData: DataFrame -> window(structuredData.col(""), "7 minutes", "3 minutes") ((Column, String, String) => Column) -> .groupBy(window(..), updatedTweets.col(userLanguage)) (Column => RelationalGroupedDataset) -> .count() -> windowedData: DataFrame]
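With a 7-minute window sliding every 3 minutes, each event belongs to more than one window. The plain-Scala sketch below computes which windows contain a given timestamp. It assumes windows are half-open intervals [start, end) whose starts are multiples of the slide; windowsFor and the minute-based timestamps are hypothetical.

```scala
// All sliding windows [start, end) that contain the timestamp tsMin (assumed >= 0).
def windowsFor(tsMin: Long, windowMin: Long, slideMin: Long): Vector[(Long, Long)] = {
  val lastStart = (tsMin / slideMin) * slideMin       // latest window start that is <= tsMin
  Iterator.iterate(lastStart)(_ - slideMin)           // walk back through earlier window starts
    .takeWhile(start => start > tsMin - windowMin)    // keep windows whose end is still after tsMin
    .map(start => (start, start + windowMin))
    .toVector
    .reverse                                          // earliest window first
}

val atMinute10 = windowsFor(tsMin = 10L, windowMin = 7L, slideMin = 3L)
val atMinute12 = windowsFor(tsMin = 12L, windowMin = 7L, slideMin = 3L)
```

An event at minute 10 is counted in the windows starting at 6 and 9, while an event at minute 12 is counted in three windows, so the same record contributes to several rows of the result.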

49 Windowed aggregations with watermark
val watermarkedData = structuredData
  .withWatermark("timestamp_ms", "3 minutes")
  .groupBy(
    window(structuredData.col("timestamp_ms"), "7 minutes", "3 minutes"),
    updatedTweets.col(userLanguage))
  .count()

50 Windowed aggregations with watermark
[Diagram: structuredData: DataFrame -> .withWatermark("", "3 minutes") ((String, String) => Dataset[Row]) -> .groupBy(...) (Column => RelationalGroupedDataset) -> .count() -> watermarkedData: DataFrame]

51 Structured streaming application launch
val resultsOutput = dataTransformation
  .writeStream
  .outputMode(outputMode)
  .format(outputFormat)
  .start()
resultsOutput.awaitTermination()

52 Monitoring

53 Access SparkUI
The driver log reports the UI address: "Bound Spark UI to, and started at http://host:port"

54 Monitoring with SparkUI

58 Monitoring generated load

60 Use Case

61 Use case schema

62 Spark streaming future

63 Continuous streaming
[Diagram: Spark driver, long-running tasks, a sequence of epochs, WAL]

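64 Continuous streaming example
import org.apache.spark.sql.streaming.Trigger

val continuousOutput = filteredData.writeStream
  .outputMode("append")
  .format(outputFormat)
  // Continuous processing with a 1-second checkpoint interval
  .trigger(Trigger.Continuous("1 second"))
  .start()
continuousOutput.awaitTermination()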

65 Upcoming features
Stable continuous streaming
Streaming with deep learning
Stable Kubernetes support

66 Resources
1. Spark documentation
2. Learning Spark Streaming. F. Garillot, G. Maas
3. High Performance Spark. H. Karau, R. Warren
4. Spark Summit sessions
5. Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark. Z. Nabi

67 Contacts roksolanadiachuk roksolana-d dead_flowers22

68 Examples repository

69 Thank you for your attention