Slide 1

Slide 1 text

1 Streaming data processing using Apache Spark
Roksolana Diachuk
WWCode Kyiv Lead, Data Engineer at Ciklum

Slide 2

Slide 2 text

2 Agenda
1. Streaming data concept and use cases
2. Spark Streaming introduction and main types
3. Examples
3.1. Basic streaming examples
3.2. Structured streaming examples
4. Use case
5. Spark Streaming future

Slide 3

Slide 3 text

3 Real-time data

Slide 4

Slide 4 text

4 Data stream
A data stream is just a sequence of data units. Streaming data is data that is generated in real time by multiple sources.

Slide 5

Slide 5 text

5 Batch processing Stream processing

Slide 6

Slide 6 text

6 Batch processing

Slide 7

Slide 7 text

7 Streaming data processing aspects
1. Performance
2. Scalability and reliability
3. Tool/framework ecosystem
4. Message delivery semantics

Slide 8

Slide 8 text

8 Streaming data processing aspects
1. Performance
2. Scalability and reliability
3. Tool/framework ecosystem
4. Message delivery semantics

Slide 9

Slide 9 text

9 Streaming data processing aspects
1. Performance
2. Scalability and reliability
3. Tool/framework ecosystem
4. Message delivery semantics

Slide 10

Slide 10 text

10 Streaming data processing aspects
1. Performance
2. Scalability and reliability
3. Tool/framework ecosystem
4. Message delivery semantics

Slide 11

Slide 11 text

11 Message delivery guarantees
1. At-most-once
2. At-least-once
3. Exactly-once

Slide 12

Slide 12 text

12 It all depends...

Slide 13

Slide 13 text

13 Apache Spark ecosystem

Slide 14

Slide 14 text

14 Spark streaming types
Basic streaming: DStream[RDD[T]]
Structured streaming: DataFrame

Slide 15

Slide 15 text

15 Micro-batching (diagram: the Spark driver splits the DStream into micro-batches that are processed as short tasks, with a write-ahead log (WAL))

Slide 16

Slide 16 text

16 DStream (diagram: a DStream is a sequence of RDDs, one per batch interval: time 1, time 2, time 3, time 4)

Slide 17

Slide 17 text

17 DStream transformations (diagram: flatMap applied to each RDD of the lines DStream produces the words DStream)
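
A minimal sketch of the transformation in the diagram, assuming the streamingContext created later in the talk and a hypothetical socket text source on localhost:9999:

// lines DStream: one String element per line received from the socket source
val lines = streamingContext.socketTextStream("localhost", 9999)
// words DStream: flatMap splits every line into words, so each input
// element can produce zero or more output elements
val words = lines.flatMap(_.split(" "))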

Slide 18

Slide 18 text

18 Structured streaming (diagram: the data stream is treated as an unbounded table, with new rows appended to it as data arrives)

Slide 19

Slide 19 text

19 Incrementalization process (diagram: input table, incremental query, result table)
Output modes (see the sketch below):
● Append
● Complete
● Update
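
A sketch of where the output mode is chosen, assuming a hypothetical streaming DataFrame named resultTable and a console sink used only for illustration:

// The output mode is set on the stream writer when the query is started.
resultTable.writeStream
  .outputMode("complete")   // "append": only new result rows
                            // "complete": the whole result table (aggregation queries)
                            // "update": only rows changed since the last trigger
  .format("console")
  .start()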

Slide 20

Slide 20 text

20 Window transformations

Slide 21

Slide 21 text

21 Window operations on DStream (diagram: a window-based operation turns the original DStream into a windowed DStream; windows 1 and 2 group the RDDs from time 1 through time 5)

Slide 22

Slide 22 text

22 Sliding window operations on DStream (diagram: windows 1, 2 and 3 overlap as the window slides over the RDDs from time 1 through time 5)
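
A sketch of a window-based operation in code, assuming the words DStream from the earlier sketch and a 3-second batch interval (window length and sliding interval must be multiples of it):

import org.apache.spark.streaming.Seconds

// pair DStream of (word, 1) tuples
val pairs = words.map(word => (word, 1))

// word counts over the last 30 seconds, recomputed every 6 seconds
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,
  Seconds(30),   // window length
  Seconds(6))    // sliding interval

windowedCounts.print()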

Slide 23

Slide 23 text

23 Structured streaming windows (diagram: data stream)

Slide 24

Slide 24 text

24 Window transformations (diagram: per-window counts over event time, with windows at 11:00, 11:05, 11:10 and 11:15)

Slide 25

Slide 25 text

25 Late data (diagram: an event arriving late updates the count of the earlier window it belongs to)

Slide 26

Slide 26 text

26 Watermarking (diagram: windowed counts with Watermark = 5 minutes; windows older than the watermark no longer accept late data)

Slide 27

Slide 27 text

27 Output operations (see the sketch below)
1. print()
2. saveAsTextFiles(filename)
3. saveAsObjectFiles(filename)
4. saveAsHadoopFiles(filename)
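
A sketch of how these output operations are called, assuming the words DStream from the earlier sketches; the "tweets" prefix and the suffixes are hypothetical:

words.print()                            // print the first elements of every batch
words.saveAsTextFiles("tweets", "txt")   // one directory of text files per batch interval
words.saveAsObjectFiles("tweets", "obj") // serialized objects per batch interval
// saveAsHadoopFiles is defined on pair DStreams and additionally takes the
// key, value and OutputFormat classes of the Hadoop files to write.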

Slide 28

Slide 28 text

28 Application execution modes
Deploy modes: client mode, cluster mode
Cluster managers: Standalone, YARN, Mesos, Kubernetes

Slide 29

Slide 29 text

29 Examples

Slide 30

Slide 30 text

30 Streaming application entry point
val batchInterval = Seconds(3)
val streamingContext = new StreamingContext(sparkContext, batchInterval)
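
A self-contained version of this entry point; the application name and local master are assumptions for running the example locally:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("spark-streaming-example")  // hypothetical application name
  .setMaster("local[2]")                  // at least 2 threads: receiver + processing

val batchInterval = Seconds(3)
val streamingContext = new StreamingContext(conf, batchInterval)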

Slide 31

Slide 31 text

31 Example data source

Slide 32

Slide 32 text

32 Spark streaming example
val lines = KafkaUtils.createDirectStream[String, String](
  streamingContext, ...)
val values = lines.map(_.value())
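
A fuller sketch of the direct stream creation with the arguments elided above, using the spark-streaming-kafka-0-10 API; the broker address, topic name and group id are assumptions:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",          // hypothetical broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "tweets-consumer",         // hypothetical group id
  "auto.offset.reset"  -> "latest")

val lines = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](Seq("tweets"), kafkaParams))  // hypothetical topic

val values = lines.map(_.value())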

Slide 33

Slide 33 text

33 Spark streaming example (diagram: lines.map(_.value()) turns the lines DStream[ConsumerRecord[String, String]] into a DStream[String] of record values)

Slide 34

Slide 34 text

34 Filtering
val filteredData = values.filter(tweetText =>
  tweetText.contains("big data"))
(diagram: filter(...) turns the values DStream[String] into the filteredData DStream[String])

Slide 35

Slide 35 text

35 Spark streaming example
values.foreachRDD(rdd => {
  val dataFrame = sparkSession.read
    .schema(dataSchema)
    .load()
})
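
A runnable variant of the same idea, assuming the values DStream carries JSON strings; the schema here is a minimal assumption, and DataFrameReader.json is used on the per-batch data instead of the elided load() arguments:

import org.apache.spark.sql.types.{StringType, StructField, StructType}
import sparkSession.implicits._

// minimal hypothetical schema for the incoming tweet JSON
val dataSchema = StructType(Seq(
  StructField("text", StringType),
  StructField("timestamp_ms", StringType)))

values.foreachRDD { rdd =>
  // turn the micro-batch RDD of JSON strings into a DataFrame
  val dataFrame = sparkSession.read
    .schema(dataSchema)
    .json(sparkSession.createDataset(rdd))
  dataFrame.show()
}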

Slide 36

Slide 36 text

36 Streaming example visualization (diagram: foreachRDD exposes each micro-batch as an RDD[String]; sparkSession.read.schema(schema).load() builds the dataFrame DataFrame from it)

Slide 37

Slide 37 text

37 Aggregations
val colName = "user.followers_count"
values.foreachRDD(rdd => {
  ...
  dataFrame.agg(
    min(dataFrame.col(colName)),
    max(dataFrame.col(colName)))
})
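
The aggregation returns a new single-row DataFrame for every micro-batch; an action such as show() is what actually surfaces it. A sketch of completing the block above, with min and max imported from the SQL functions:

import org.apache.spark.sql.functions.{max, min}

// inside the foreachRDD block above: trigger and print the aggregation
dataFrame
  .agg(min(dataFrame.col(colName)), max(dataFrame.col(colName)))
  .show()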

Slide 38

Slide 38 text

38 Application launch
streamingContext.start()
streamingContext.awaitTermination()

Slide 39

Slide 39 text

39 Structured streaming

Slide 40

Slide 40 text

40 Structured streaming example
val tweetsStream = sparkSession.readStream
  .format("kafka")
  .options(...)
  .load()
tweetsStream is a DataFrame
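
The Kafka source options elided above, spelled out; the broker address and topic name are assumptions:

val tweetsStream = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // hypothetical broker
  .option("subscribe", "tweets")                        // hypothetical topic
  .option("startingOffsets", "latest")
  .load()
// tweetsStream is a streaming DataFrame with key, value, topic,
// partition, offset and timestamp columns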

Slide 41

Slide 41 text

41 Selection
val structuredData = tweetsStream
  .select(tweetsStream("value").cast(StringType))
  .select(from_json($"value", dataStruct).as("tweet"))
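
from_json needs a schema; a minimal, assumed version of dataStruct and the imports the selection relies on could look like this:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import sparkSession.implicits._   // enables the $"value" column syntax

// hypothetical minimal tweet schema
val dataStruct = new StructType()
  .add("text", StringType)
  .add("timestamp_ms", StringType)
  .add("user", new StructType()
    .add("lang", StringType)
    .add("followers_count", LongType))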

Slide 42

Slide 42 text

42 Selection (diagram: the two select calls turn the tweetsStream DataFrame into the structuredData DataFrame)

Slide 43

Slide 43 text

43 Filtering
val filteredData = structuredData
  .filter(structuredData.col("tweet.user.lang")
    .contains("en"))

Slide 44

Slide 44 text

44 Filtering (diagram: filter(...) turns the structuredData DataFrame into the filteredData Dataset[Row])

Slide 45

Slide 45 text

45 Aggregations
val groupedData = structuredData
  .groupBy("tweet.user.lang")
  .count()
  .orderBy(desc("count"))

Slide 46

Slide 46 text

46 Aggregations (diagram: groupBy("tweet.user.lang") turns the structuredData DataFrame into a RelationalGroupedDataset; count() and orderBy(desc("count")) produce the groupedData DataFrame)

Slide 47

Slide 47 text

47 Windowed aggregations
val windowedData = structuredData
  .groupBy(
    window(structuredData.col("timestamp_ms"), "7 minutes", "3 minutes"),
    structuredData.col(userLanguage))
  .count()

Slide 48

Slide 48 text

48 Windowed aggregations (diagram: window(structuredData.col("timestamp_ms"), "7 minutes", "3 minutes") defines the grouping column; groupBy produces a RelationalGroupedDataset and count() produces the windowedData DataFrame)

Slide 49

Slide 49 text

49 Windowed aggregations with watermark
val watermarkedData = structuredData
  .withWatermark("timestamp_ms", "3 minutes")
  .groupBy(
    window(structuredData.col("timestamp_ms"), "7 minutes", "3 minutes"),
    structuredData.col(userLanguage))
  .count()

Slide 50

Slide 50 text

50 Windowed aggregations with watermark (diagram: withWatermark("timestamp_ms", "3 minutes") is applied before groupBy; count() produces the watermarkedData DataFrame)

Slide 51

Slide 51 text

51 Structured streaming application launch
val resultsOutput = dataTransformation
  .writeStream
  .outputMode(outputMode)
  .format(outputFormat)
  .start()
resultsOutput.awaitTermination()
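
A concrete instantiation of the launch, assuming the groupedData aggregation from the earlier slides and a console sink used for illustration:

val resultsOutput = groupedData.writeStream
  .outputMode("complete")   // complete mode because the query aggregates
  .format("console")
  .start()

resultsOutput.awaitTermination()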

Slide 52

Slide 52 text

52 Monitoring

Slide 53

Slide 53 text

53 Access SparkUI
Bound Spark UI to 0.0.0.0, and started at http://host:port

Slide 54

Slide 54 text

54 Monitoring with SparkUI

Slide 55

Slide 55 text

55 Monitoring with SparkUI

Slide 56

Slide 56 text

56 Monitoring with SparkUI

Slide 57

Slide 57 text

57 Monitoring with SparkUI

Slide 58

Slide 58 text

58 Monitoring generated load

Slide 59

Slide 59 text

59 Monitoring generated load

Slide 60

Slide 60 text

60 Use Case

Slide 61

Slide 61 text

61 Use case schema

Slide 62

Slide 62 text

62 Spark streaming future

Slide 63

Slide 63 text

63 Continuous streaming (diagram: the Spark driver launches long-running tasks that process data continuously, tracking progress in epochs recorded in a write-ahead log (WAL))

Slide 64

Slide 64 text

64 Continuous streaming example
val continuousOutput = filteredData.writeStream
  .outputMode("append")
  .format(outputFormat)
  .trigger(Trigger.Continuous("1 second"))
  .start()
continuousOutput.awaitTermination()

Slide 65

Slide 65 text

65 Upcoming features
● Stable continuous streaming
● Streaming with deep learning
● Stable Kubernetes support

Slide 66

Slide 66 text

66 Resources
1. Spark documentation
2. Learning Spark Streaming. F. Garillot, G. Maas
3. High Performance Spark. H. Karau, R. Warren
4. Spark Summit sessions
5. Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark. Z. Nabi

Slide 67

Slide 67 text

67 Contacts roksolana.diachuk@gmail.com roksolanadiachuk roksolana-d dead_flowers22

Slide 68

Slide 68 text

68 Examples repository

Slide 69

Slide 69 text

69 Thank you for your attention