Introduction to streaming data processing using Apache Spark
Roksolana Diachuk
Data Engineer at Ciklum

Business decisions in many domains need to be made quickly and should be based on real-time data. Such data is hard to process because of its high velocity.

Streaming data
A data stream is a sequence of data units. Streaming data is data that is generated continuously, in real time, by multiple sources.

Streaming data processing

Streaming data processing
There is a wide range of tools. Some aspects to consider when choosing the right tool for the job are:
● performance
● scalability and reliability
● tool/framework ecosystem
● support for specific sources and sinks
● message delivery semantics
and so on...

It all depends!

Message delivery semantics
Message delivery semantics can be defined in terms of delivery guarantees:
● at-most-once (0 or 1 time)
● at-least-once (1 or more times)
● exactly-once (1 time)
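The three guarantees can be illustrated without any framework. The following is a minimal sketch in plain Scala, not Spark code: the Message type and the simulated loss and redelivery are assumptions made for illustration. It shows how at-least-once delivery produces duplicates and how a consumer that remembers processed ids ends up processing each message once.

```scala
object DeliverySemantics {
  // Hypothetical message type for illustration; not part of any Spark API.
  final case class Message(id: Long, payload: String)

  // Simulates at-least-once delivery: the message with id `retriedId` is
  // delivered twice, as if its acknowledgement had been lost.
  def deliverAtLeastOnce(messages: Seq[Message], retriedId: Long): Seq[Message] =
    messages.flatMap(m => if (m.id == retriedId) Seq(m, m) else Seq(m))

  // Simulates at-most-once delivery: the message with id `lostId` is dropped
  // and never retried.
  def deliverAtMostOnce(messages: Seq[Message], lostId: Long): Seq[Message] =
    messages.filterNot(_.id == lostId)

  // An idempotent consumer: skipping ids that were already seen turns
  // at-least-once delivery into effectively exactly-once processing.
  def processOnce(delivered: Seq[Message]): Seq[Message] = {
    val (_, kept) = delivered.foldLeft((Set.empty[Long], Vector.empty[Message])) {
      case ((seen, acc), m) =>
        if (seen.contains(m.id)) (seen, acc) else (seen + m.id, acc :+ m)
    }
    kept
  }

  val sent: Seq[Message] = Seq(Message(1, "a"), Message(2, "b"), Message(3, "c"))
  val atLeastOnce: Seq[Message] = deliverAtLeastOnce(sent, retriedId = 2)
  val atMostOnce: Seq[Message] = deliverAtMostOnce(sent, lostId = 2)
  val exactlyOnce: Seq[Message] = processOnce(atLeastOnce)
}
```

In practice, exactly-once results are usually obtained this way: at-least-once delivery combined with idempotent or transactional writes on the sink side.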

Apache Spark ecosystem

Spark Streaming abstractions
A DStream (Discretized Stream) is the basic abstraction of Spark Streaming. It represents a continuous stream of data as a sequence of RDDs.


Batch interval
The batch interval is a key setting that directly affects performance: it determines how often incoming data is grouped into a new batch (RDD) for processing.
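To make the effect of the batch interval concrete, here is a small sketch in plain Scala (not Spark code) that assigns timestamped events to batches. The Event type, the timestamps, and the discretize helper are hypothetical and only model the idea of discretization.

```scala
object MicroBatching {
  // Hypothetical event type: an arrival time in milliseconds and a value.
  final case class Event(timestampMs: Long, value: String)

  // Assigns each event to the batch whose interval contains its arrival time,
  // mimicking how a stream is cut into one collection per batch interval.
  def discretize(events: Seq[Event], batchIntervalMs: Long): Map[Long, Seq[String]] =
    events
      .groupBy(e => e.timestampMs / batchIntervalMs)
      .map { case (batchIndex, batch) => batchIndex -> batch.map(_.value) }

  val events: Seq[Event] =
    Seq(Event(100, "a"), Event(900, "b"), Event(1200, "c"), Event(3100, "d"))

  // The same events, discretized with a 1-second and a 2-second interval.
  val oneSecondBatches: Map[Long, Seq[String]] = discretize(events, batchIntervalMs = 1000)
  val twoSecondBatches: Map[Long, Seq[String]] = discretize(events, batchIntervalMs = 2000)
}
```

A shorter interval gives lower latency but more, smaller batches to schedule; a longer interval gives fewer, larger batches. For a stable application, the time needed to process a batch has to stay below the batch interval. Unlike this model, Spark Streaming also emits an (empty) RDD for intervals in which no data arrived.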

Streaming application entrypoint
1. Streaming context creation with SparkContext

    import org.apache.spark.streaming._

    val sc = ... // existing SparkContext
    val ssc = new StreamingContext(sc, Seconds(1))

Streaming application entrypoint
2. Streaming context creation with SparkConf

    import org.apache.spark._
    import org.apache.spark.streaming._

    val conf = new SparkConf()
      .setAppName(appName)
      .setMaster(master)
    val ssc = new StreamingContext(conf, Seconds(1))

Stream sources
Built-in streaming sources:
● Basic sources - file systems, socket connections
    streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
    streamingContext.textFileStream(dataDirectory)
● Advanced sources - Kafka, Flume, Kinesis

Stream sinks
● Console
● Files (Hadoop file, object file, text file)
● Storage systems (Elasticsearch, relational and NoSQL databases)

DStream transformations
1) Common transformations on RDDs (map, flatMap, filter, repartition, reduce, count, etc.)

DStream transformations
2) Window transformations (countByWindow, window)
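The two window transformations named above could be used roughly as follows. This is a sketch under assumptions, not code from the talk: the socket source on localhost:9999, the checkpoint directory, and the chosen durations are all hypothetical. Window and slide durations have to be multiples of the batch interval.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowSketch {
  // Assumed durations; window and slide must be multiples of the batch interval.
  val batchInterval = Seconds(5)
  val windowDuration = Seconds(30)
  val slideDuration = Seconds(10)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, batchInterval)
    // countByWindow reduces incrementally, which requires a checkpoint directory.
    ssc.checkpoint("/tmp/window-sketch-checkpoint") // hypothetical directory

    // Hypothetical source: lines of text from a local socket (e.g. `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)

    // window: all lines received during the last 30 seconds, re-evaluated every 10 seconds.
    val windowed = lines.window(windowDuration, slideDuration)
    windowed.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()

    // countByWindow: the number of lines in the same sliding window.
    lines.countByWindow(windowDuration, slideDuration).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With these values each window covers six batches and a new result is produced every two batches.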

Fault-tolerance mechanisms
1. Caching/Persistence
2. Metadata checkpointing
3. Data checkpointing
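A minimal sketch of how these mechanisms typically appear in application code, assuming a hypothetical socket source and a hypothetical local checkpoint directory (a cluster deployment would use a fault-tolerant store such as HDFS). Setting a checkpoint directory enables metadata checkpointing; data checkpointing additionally applies to stateful transformations, which this sketch does not use.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointSketch {
  // Hypothetical location, chosen for illustration only.
  val checkpointDir = "/tmp/streaming-checkpoint"

  // Builds a fresh context; only invoked when no checkpoint exists yet.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(checkpointDir) // turns checkpointing on

    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    val words = lines.flatMap(_.split(" "))

    // Caching/persistence: the RDDs of `words` feed two outputs, so keep them in memory.
    words.persist(StorageLevel.MEMORY_ONLY_SER)
    words.count().print()
    words.map(w => (w, 1)).reduceByKey(_ + _).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover the context from the checkpoint if one exists, otherwise create it.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```

All stream setup has to happen inside the creating function, so that a restarted driver can rebuild the same computation from the checkpointed metadata.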

Use Case

Use case architecture

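Example

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    val streamingContext = new StreamingContext(sparkContext, Seconds(3))
    val numStreams = 3
    // One direct Kafka stream per index; topics and kafkaParams are defined elsewhere
    // (the first slide abbreviated this call as kafkaConfig).
    val lines = (1 to numStreams).map(i =>
      KafkaUtils.createDirectStream[String, String](
        streamingContext,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](topics, kafkaParams(i))))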

More code...

    // messages: DStream[ConsumerRecord[String, String]]
    val messages = streamingContext.union(lines)
    // values: DStream[String]
    val values = messages.map(record => record.value().toString)
    // wordsArrays: DStream[Array[String]]
    val wordsArrays = values.map(_.split("!&"))

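And more...

    // saveToEs is provided by the elasticsearch-hadoop connector (import org.elasticsearch.spark._)
    wordsArrays.foreachRDD(rdd =>          // rdd: RDD[Array[String]]
      rdd.flatMap(record =>                // record: Array[String]
        convertToTweetMap(record)
      ).saveToEs("tweets-time/output"))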

Application launch

    streamingContext.start()
    streamingContext.awaitTermination()

THIS IS THE END!!!

Visualizations dashboard

Spark Streaming metrics

Experimental features

Continuous streaming
Continuous processing is an optional, experimental low-latency execution mode for Structured Streaming, introduced in Spark 2.3.0.

Structured streaming

Incrementalization process
Output modes:
● Append
● Complete
● Update
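How an output mode is selected can be sketched with a small Structured Streaming query. The example below is an assumption-laden illustration rather than code from the talk: it uses the built-in "rate" test source, an arbitrary bucketing expression, and the console sink, and it runs for about ten seconds before stopping.

```scala
import org.apache.spark.sql.SparkSession

object OutputModeSketch {
  // One of "append", "complete", "update"; chosen here for illustration.
  val outputMode = "update"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("OutputModeSketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Built-in "rate" test source: emits rows with `timestamp` and `value` columns.
    val input = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    // Running count per bucket (value modulo 3); the bucketing is arbitrary.
    val counts = input.groupBy(($"value" % 3).as("bucket")).count()

    // Update mode writes only the rows whose count changed since the last trigger.
    val query = counts.writeStream
      .outputMode(outputMode)
      .format("console")
      .start()

    query.awaitTermination(10000) // wait up to ten seconds
    query.stop()
    spark.stop()
  }
}
```

Switching outputMode to "complete" would print the whole result table on every trigger. Append mode is rejected for this particular query, because an aggregation without a watermark never finalizes its rows.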

And other interesting things...
● Task parallelism
● Streaming job tuning (choice of executor and partition counts, block and batch interval tuning, scheduling, memory setup)
● Spark Streaming in cluster mode

Resources
1. Spark documentation
2. Learning Spark. M. Zaharia, P. Wendell, A. Konwinski, H. Karau (2015)
3. Learning Spark Streaming. F. Garillot, G. Maas (2017)
4. Streaming Data: Understanding the real-time pipeline. A. G. Psaltis (2017)
5. High Performance Spark. H. Karau, R. Warren (2017)
6. Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark. Z. Nabi (2016)
7. Spark Summit sessions
8. Databricks engineering blog
9. Cloudera engineering blog

Thank you for your attention

Contact me LinkedIn Facebook