
Introduction to streaming data processing using Apache Spark

Roksolana
September 11, 2018


My talk from the Women Who Code Kyiv event called "Data Engineering talk", in which I covered the basic principles of streaming data processing using the Apache Spark framework, with some code examples.

Video recording: https://www.youtube.com/watch?v=Pi0Z0XWilHA

Transcript

  1. Business decisions in various domains need to be made fast and should be based on
     real-time data. This data is hard to process because of its high velocity.
  2. Streaming data: A data stream is simply a sequence of data units. Streaming data is
     data that is generated in real time by multiple sources.
  3. Streaming data processing: There is a wide range of tools, but some aspects to
     consider when choosing the right tool for the job are:
     • performance
     • scalability and reliability
     • tool/framework ecosystem
     • support for specific sources and sinks
     • message delivery semantics
     and so on...
  4. Message delivery semantics: Message delivery semantics can be defined in terms of
     message sending guarantees:
     • at-most-once (0 or 1 time)
     • at-least-once (1 or more times)
     • exactly-once (1 time)
  5. Spark Streaming abstractions: DStream (Discretized Stream) is the basic abstraction
     of Spark Streaming; it represents a continuous stream of data as a sequence of RDDs
     (a code sketch follows this item).
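     As an illustration of that micro-batch model, here is a minimal hedged sketch (the
     socket source, host and port are assumptions; sc stands for an existing SparkContext,
     created as on the next slides): every batch interval the DStream materializes one RDD
     that can be processed with regular Spark operations.

        import org.apache.spark.streaming.{Seconds, StreamingContext}
        import org.apache.spark.streaming.dstream.DStream

        val ssc = new StreamingContext(sc, Seconds(1))   // sc: an existing SparkContext

        // Hypothetical source; any DStream behaves the same way.
        val lines: DStream[String] = ssc.socketTextStream("localhost", 9999)

        // Each 1-second batch interval is materialized as one RDD.
        lines.foreachRDD { rdd =>
          println(s"This batch contains ${rdd.count()} records")
        }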
  6. Streaming application entry point: 1. Streaming context creation with an existing SparkContext
        import org.apache.spark.streaming._

        val sc = ...  // existing SparkContext
        val ssc = new StreamingContext(sc, Seconds(1))
  7. Streaming application entry point: 2. Streaming context creation with a SparkConf
     (the usual lifecycle around the entry point is sketched after this item)
        import org.apache.spark._
        import org.apache.spark.streaming._

        val conf = new SparkConf().setAppName(appName).setMaster(master)
        val ssc = new StreamingContext(conf, Seconds(1))
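     Either way of creating the context is typically followed by the same lifecycle. A
     hedged sketch (the app name, master and socket source are placeholder assumptions):

        import org.apache.spark._
        import org.apache.spark.streaming._

        val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(1))

        // 1. Define the input streams and their transformations before starting.
        val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
        lines.print()                                         // placeholder output

        // 2. Start the computation and block until it is stopped or fails.
        ssc.start()
        ssc.awaitTermination()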
  8. Stream sources: Built-in streaming sources (a sketch of both kinds follows this item):
     • Basic sources - file systems, socket connections
          streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
          streamingContext.textFileStream(dataDirectory)
     • Advanced sources - Kafka, Flume, Kinesis
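     A hedged sketch of both kinds of sources (host, port, topic and the Kafka parameters
     are assumptions; the Kafka direct stream also needs the spark-streaming-kafka-0-10
     dependency):

        import org.apache.kafka.common.serialization.StringDeserializer
        import org.apache.spark.streaming.kafka010._
        import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
        import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

        // Basic source: a TCP socket, one line of text per record.
        val socketLines = streamingContext.socketTextStream("localhost", 9999)

        // Advanced source: a Kafka direct stream.
        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092",            // assumed broker address
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "demo-group"                 // assumed consumer group
        )
        val kafkaLines = KafkaUtils.createDirectStream[String, String](
          streamingContext,
          PreferConsistent,
          Subscribe[String, String](Seq("demo-topic"), kafkaParams)  // assumed topic
        )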
  9. Stream sinks (the matching output operations are sketched after this item):
     • Console
     • Files (Hadoop files, object files, text files)
     • Storage systems (Elasticsearch, relational and NoSQL databases)
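     A hedged sketch of the output operations corresponding to these sinks, applied to
     some DStream[String] called stream (saveToExternalStore is a hypothetical helper
     standing in for an Elasticsearch/JDBC/NoSQL client):

        // Console: print the first elements of every batch.
        stream.print()

        // Files: write each batch under a timestamped path built from prefix and suffix.
        stream.saveAsTextFiles("hdfs:///streams/words", "txt")

        // Storage systems: write through foreachRDD, creating one writer per partition.
        stream.foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            // saveToExternalStore is hypothetical and wraps the actual storage client.
            saveToExternalStore(records)
          }
        }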
  10. Example:
         val streamingContext = new StreamingContext(sparkContext, Seconds(3))
         val numStreams = 3
         // presumably each of the three streams is a Kafka direct stream built from kafkaConfig
         val lines = (1 to numStreams).map(i => kafkaConfig)
  11. More code... (a possible continuation is sketched after this item)
         val messages = streamingContext.union(lines)                   // DStream[ConsumerRecord[String, String]]
         val values = messages.map(record => record.value().toString)   // DStream[String]
         val wordsArrays = values.map(_.split("!&"))                    // DStream[Array[String]]
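     A possible continuation of this example (hedged: the word count and the output
     operation are additions, not from the slides), flattening the word arrays and
     counting words in every 3-second batch:

        // DStream[Array[String]] -> DStream[String]: one word per record
        val words = wordsArrays.flatMap(array => array)

        // Count occurrences of each word within each batch and print a sample.
        val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
        wordCounts.print()

        streamingContext.start()
        streamingContext.awaitTermination()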
  12. And other interesting things... (a few of the tuning knobs are sketched after this item)
      • Task parallelism
      • Streaming job tuning (choice of executor/partition numbers, block and batch
        interval tuning, scheduling, memory setup)
      • Spark Streaming in cluster mode
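     A hedged sketch of some of those tuning knobs set through SparkConf (the values are
     placeholders, not recommendations; executor counts and memory are more commonly
     passed to spark-submit):

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        val conf = new SparkConf()
          .setAppName("tuned-streaming-app")
          // Block interval controls how many tasks each receiver's batch is split into.
          .set("spark.streaming.blockInterval", "200ms")
          // Backpressure lets Spark adapt the ingestion rate to the processing speed.
          .set("spark.streaming.backpressure.enabled", "true")
          // Executor memory (also settable via spark-submit --executor-memory).
          .set("spark.executor.memory", "4g")

        // The batch interval is chosen when the StreamingContext is created.
        val ssc = new StreamingContext(conf, Seconds(3))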
  13. Resources:
      1. Spark documentation
      2. Learning Spark. M. Zaharia, P. Wendell, A. Konwinski, H. Karau (2015)
      3. Learning Spark Streaming. F. Garillot, G. Maas (2017)
      4. Streaming Data: Understanding the Real-Time Pipeline. A. G. Psaltis (2017)
      5. High Performance Spark. H. Karau, R. Warren (2017)
      6. Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark. Z. Nabi (2016)
      7. Spark Summit sessions
      8. Databricks engineering blog
      9. Cloudera engineering blog