Slide 1

Slide 1 text

Introduction to streaming data processing using Apache Spark
Roksolana Diachuk, Data Engineer at Ciklum

Slide 2

Slide 2 text

Business decisions in various domains need to be made fast and should be based on real-time data. This data is hard to process due to its high velocity.

Slide 3

Slide 3 text

Streaming data
A data stream is just a sequence of data units. Streaming data is data generated in real time by multiple sources.

Slide 4

Slide 4 text

Streaming data processing

Slide 5

Slide 5 text

Streaming data processing
There is a wide range of tools, but some aspects to consider when choosing the right tool for the job are:
● performance
● scalability and reliability
● tool/framework ecosystem
● support for specific sources and sinks
● message delivery semantics
and so on...

Slide 6

Slide 6 text

...it all depends!

Slide 7

Slide 7 text

Message delivery semantics
Message delivery semantics can be defined in terms of message sending guarantees:
● at-most-once (0 or 1 time)
● at-least-once (1 or more times)
● exactly-once (exactly 1 time)
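One concrete way to see these guarantees, assuming Kafka as the broker (the source used later in this talk): on the producer side they map to configuration. A minimal sketch, not part of the talk's code:

import java.util.Properties
import org.apache.kafka.clients.producer.ProducerConfig

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
// at-most-once: fire-and-forget, no acknowledgement, no retries
// props.put(ProducerConfig.ACKS_CONFIG, "0")
// at-least-once: wait for all in-sync replicas and retry on failure (may duplicate)
props.put(ProducerConfig.ACKS_CONFIG, "all")
props.put(ProducerConfig.RETRIES_CONFIG, "3")
// exactly-once per partition: idempotent producer (Kafka 0.11+)
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")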

Slide 8

Slide 8 text

Apache Spark ecosystem

Slide 9

Slide 9 text

Spark Streaming abstractions
DStream (Discretized Stream) is the basic abstraction of Spark Streaming, representing a continuous stream of data as a sequence of RDDs.

Slide 10

Slide 10 text

Micro-batching

Slide 11

Slide 11 text

Batch interval
The batch interval is a key tuning value which directly affects performance: it defines how often incoming data is grouped into a micro-batch.

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Streaming application entrypoint
1. Streaming context creation with SparkContext

import org.apache.spark.streaming._

val sc = ... // existing SparkContext
val ssc = new StreamingContext(sc, Seconds(1))

Slide 14

Slide 14 text

Streaming application entrypoint
2. Streaming context creation with SparkConf

import org.apache.spark._
import org.apache.spark.streaming._

val conf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(conf, Seconds(1))

Slide 15

Slide 15 text

Stream sources
Built-in streaming sources:
● Basic sources - file systems, socket connections (see the sketch below)

streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
streamingContext.textFileStream(dataDirectory)

● Advanced sources - Kafka, Flume, Kinesis
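A minimal sketch of the simplest basic source, a socket connection (host and port here are placeholders):

val socketLines = streamingContext.socketTextStream("localhost", 9999) // DStream[String], one element per text line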

Slide 16

Slide 16 text

Stream sinks
● Console
● Files (Hadoop file, object file, text file)
● Storage systems (ElasticSearch, relational and NoSQL databases)
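A sketch of the two simplest sinks, assuming `socketLines` is the DStream from the previous slide (the output path is a placeholder):

socketLines.print()                                   // console sink: prints the first 10 elements of each batch on the driver
socketLines.saveAsTextFiles("hdfs:///out/words", "txt") // file sink: one output directory per batch, named words-<timestamp>.txt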

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

DStream transformations
1) Common transformations on RDDs (map, flatMap, filter, repartition, reduce, count, etc.)
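For instance, a word count can be expressed with just these transformations; a sketch assuming `lines` is a DStream[String]:

val words = lines.flatMap(_.split(" "))   // DStream[String]: one element per word
val pairs = words.map(word => (word, 1))  // DStream[(String, Int)]
val wordCounts = pairs.reduceByKey(_ + _) // counts within each batch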

Slide 19

Slide 19 text

DStream transformations
2) Window transformations (countByWindow, window)
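A sketch of both calls on the `words` stream from the previous example; countByWindow maintains its count incrementally, so it requires a checkpoint directory (the path is a placeholder):

streamingContext.checkpoint("hdfs:///checkpoints")               // required by countByWindow
val windowedWords = words.window(Seconds(30), Seconds(10))       // last 30s of data, emitted every 10s
val windowCounts = words.countByWindow(Seconds(30), Seconds(10)) // DStream[Long]: element count per window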

Slide 20

Slide 20 text

Fault-tolerance mechanisms
1. Caching/Persistence
2. Metadata checkpointing
3. Data checkpointing
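A minimal sketch of enabling each mechanism (the checkpoint directory and the `createContext` factory are placeholders):

messages.persist()                                 // 1. keep the stream's RDDs cached across operations
streamingContext.checkpoint("hdfs:///checkpoints") // 2./3. enables metadata and data checkpointing
// on restart, rebuild the context from checkpointed metadata if it exists
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints", () => createContext())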

Slide 21

Slide 21 text

Use Case

Slide 22

Slide 22 text

Use case architecture

Slide 23

Slide 23 text

Example

val streamingContext = new StreamingContext(sparkContext, Seconds(3))
val numStreams = 3
val lines = (1 to numStreams).map(i => kafkaConfig) // kafkaConfig stands for the KafkaUtils call shown on the next slide

Slide 24

Slide 24 text

Example

KafkaUtils.createDirectStream[String, String](
  streamingContext,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams(i))
) // this call is the body of the map from the previous slide

Slide 25

Slide 25 text

More code...

val messages = streamingContext.union(lines)                 // DStream[ConsumerRecord[String, String]]
val values = messages.map(record => record.value().toString) // DStream[String]
val wordsArrays = values.map(_.split("!&"))                  // DStream[Array[String]]

Slide 26

Slide 26 text

And more...

wordsArrays.foreachRDD(rdd =>   // rdd: RDD[Array[String]]
  rdd.flatMap(record =>         // record: Array[String]
    convertToTweetMap(record)
  ).saveToEs("tweets-time/output"))

Slide 27

Slide 27 text

Application launch

streamingContext.start()
streamingContext.awaitTermination()

THIS IS THE END!!!

Slide 28

Slide 28 text

Visualizations dashboard

Slide 29

Slide 29 text

Spark Streaming metrics

Slide 30

Slide 30 text

Experimental features

Slide 31

Slide 31 text

Continuous streaming
Continuous processing is an optional low-latency execution mode for Structured Streaming, introduced in Spark 2.3.0.
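It is enabled per query with a trigger; a minimal sketch, assuming `resultDf` is a streaming DataFrame with only map-like operations (a restriction of this mode) and a placeholder checkpoint path:

import org.apache.spark.sql.streaming.Trigger

resultDf.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))            // checkpoint interval; processing itself is continuous
  .option("checkpointLocation", "hdfs:///checkpoints")
  .start()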

Slide 32

Slide 32 text

Structured streaming

Slide 33

Slide 33 text

Incrementalization process
Output modes:
● Append
● Complete
● Update
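A minimal sketch showing where the output mode is chosen, assuming a socket source and a running aggregation (Complete mode here; Append would be rejected for this query because it aggregates):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredCounts").getOrCreate()
import spark.implicits._

val lines = spark.readStream.format("socket")
  .option("host", "localhost").option("port", 9999).load()

val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

counts.writeStream
  .outputMode("complete") // re-emit the whole updated result table on every trigger
  .format("console")
  .start()
  .awaitTermination()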

Slide 34

Slide 34 text

And other interesting things...
● Task parallelism
● Streaming jobs tuning (choosing the number of executors/partitions, block and batch interval tuning, scheduling, memory setup) - see the sketch below
● Spark Streaming in cluster mode
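For instance, the block interval and memory setup mentioned above are plain configuration values on the SparkConf; the settings below are illustrative placeholders, not recommendations:

conf.set("spark.streaming.blockInterval", "100ms") // tasks per receiver per batch = batch interval / block interval
conf.set("spark.executor.instances", "4")          // number of executors
conf.set("spark.executor.memory", "4g")            // memory per executor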

Slide 35

Slide 35 text

Resources
1. Spark documentation
2. Learning Spark. M. Zaharia, P. Wendell, A. Konwinski, H. Karau (2015)
3. Learning Spark Streaming. F. Garillot, G. Maas (2017)
4. Streaming Data: Understanding the real-time pipeline. A. G. Psaltis (2017)
5. High Performance Spark. H. Karau, R. Warren (2017)
6. Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark. Z. Nabi (2016)
7. Spark Summit sessions
8. Databricks engineering blog
9. Cloudera engineering blog

Slide 36

Slide 36 text

Thank you for your attention

Slide 37

Slide 37 text

Contact me
roksolana.diachuk@gmail.com
LinkedIn
Facebook