
Introduction to streaming data processing using Apache Spark

Roksolana
September 11, 2018


My talk from the Women Who Code Kyiv event called "Data Engineering talk", in which I covered the basic principles of streaming data processing using the Apache Spark framework, with some code examples.

Video recording: https://www.youtube.com/watch?v=Pi0Z0XWilHA

Transcript

  1. Business decisions in various domains need to be made fast and should be based on
     real-time data. This data is hard to process because of its high velocity.
  2. Streaming data: A data stream is simply a sequence of data units. Streaming data is
     data that is generated in real time by multiple sources.
  3. Streaming data processing: There is a wide range of tools, but some aspects to
     consider when choosing the right tool for the job are:
     • performance
     • scalability and reliability
     • tool/framework ecosystem
     • support for specific sources and sinks
     • message delivery semantics
     and so on...
  4. Message delivery semantics: Message delivery semantics can be defined in terms of
     message sending guarantees:
     • at-most-once (0 or 1 time)
     • at-least-once (1 or more times)
     • exactly-once (1 time)
  5. Spark Streaming abstractions: DStream (Discretized Stream) is the basic abstraction
     of Spark Streaming; it represents a continuous stream of data as a sequence of RDDs
     (a code sketch follows this item).
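     As an illustration of that micro-batch model, here is a minimal hedged sketch (the
     socket source, host and port are assumptions; sc stands for an existing SparkContext,
     created as on the next slides): every batch interval the DStream materializes one RDD
     that can be processed with regular Spark operations.

        import org.apache.spark.streaming.{Seconds, StreamingContext}
        import org.apache.spark.streaming.dstream.DStream

        val ssc = new StreamingContext(sc, Seconds(1))   // sc: an existing SparkContext

        // Hypothetical source; any DStream behaves the same way.
        val lines: DStream[String] = ssc.socketTextStream("localhost", 9999)

        // Each 1-second batch interval is materialized as one RDD.
        lines.foreachRDD { rdd =>
          println(s"This batch contains ${rdd.count()} records")
        }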
  6. Streaming application entry point: 1. Streaming context creation with an existing SparkContext
        import org.apache.spark.streaming._

        val sc = ...  // existing SparkContext
        val ssc = new StreamingContext(sc, Seconds(1))
  7. Streaming application entry point: 2. Streaming context creation with a SparkConf
     (the usual lifecycle around the entry point is sketched after this item)
        import org.apache.spark._
        import org.apache.spark.streaming._

        val conf = new SparkConf().setAppName(appName).setMaster(master)
        val ssc = new StreamingContext(conf, Seconds(1))
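     Either way of creating the context is typically followed by the same lifecycle. A
     hedged sketch (the app name, master and socket source are placeholder assumptions):

        import org.apache.spark._
        import org.apache.spark.streaming._

        val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(1))

        // 1. Define the input streams and their transformations before starting.
        val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
        lines.print()                                         // placeholder output

        // 2. Start the computation and block until it is stopped or fails.
        ssc.start()
        ssc.awaitTermination()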
  8. Stream sources: Built-in streaming sources (a sketch of both kinds follows this item):
     • Basic sources - file systems, socket connections
          streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
          streamingContext.textFileStream(dataDirectory)
     • Advanced sources - Kafka, Flume, Kinesis
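     A hedged sketch of both kinds of sources (host, port, topic and the Kafka parameters
     are assumptions; the Kafka direct stream also needs the spark-streaming-kafka-0-10
     dependency):

        import org.apache.kafka.common.serialization.StringDeserializer
        import org.apache.spark.streaming.kafka010._
        import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
        import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

        // Basic source: a TCP socket, one line of text per record.
        val socketLines = streamingContext.socketTextStream("localhost", 9999)

        // Advanced source: a Kafka direct stream.
        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092",            // assumed broker address
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "demo-group"                 // assumed consumer group
        )
        val kafkaLines = KafkaUtils.createDirectStream[String, String](
          streamingContext,
          PreferConsistent,
          Subscribe[String, String](Seq("demo-topic"), kafkaParams)  // assumed topic
        )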
  9. Stream sinks (the matching output operations are sketched after this item):
     • Console
     • Files (Hadoop files, object files, text files)
     • Storage systems (Elasticsearch, relational and NoSQL databases)
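     A hedged sketch of the output operations corresponding to these sinks, applied to
     some DStream[String] called stream (saveToExternalStore is a hypothetical helper
     standing in for an Elasticsearch/JDBC/NoSQL client):

        // Console: print the first elements of every batch.
        stream.print()

        // Files: write each batch under a timestamped path built from prefix and suffix.
        stream.saveAsTextFiles("hdfs:///streams/words", "txt")

        // Storage systems: write through foreachRDD, creating one writer per partition.
        stream.foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            // saveToExternalStore is hypothetical and wraps the actual storage client.
            saveToExternalStore(records)
          }
        }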
  10. Example:
         val streamingContext = new StreamingContext(sparkContext, Seconds(3))
         val numStreams = 3
         // presumably each of the three streams is a Kafka direct stream built from kafkaConfig
         val lines = (1 to numStreams).map(i => kafkaConfig)
  11. More code... (a possible continuation is sketched after this item)
         val messages = streamingContext.union(lines)                   // DStream[ConsumerRecord[String, String]]
         val values = messages.map(record => record.value().toString)   // DStream[String]
         val wordsArrays = values.map(_.split("!&"))                    // DStream[Array[String]]
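     A possible continuation of this example (hedged: the word count and the output
     operation are additions, not from the slides), flattening the word arrays and
     counting words in every 3-second batch:

        // DStream[Array[String]] -> DStream[String]: one word per record
        val words = wordsArrays.flatMap(array => array)

        // Count occurrences of each word within each batch and print a sample.
        val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
        wordCounts.print()

        streamingContext.start()
        streamingContext.awaitTermination()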
  12. And other interesting things... (a few of the tuning knobs are sketched after this item)
      • Task parallelism
      • Streaming job tuning (choice of executor/partition numbers, block and batch
        interval tuning, scheduling, memory setup)
      • Spark Streaming in cluster mode
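     A hedged sketch of some of those tuning knobs set through SparkConf (the values are
     placeholders, not recommendations; executor counts and memory are more commonly
     passed to spark-submit):

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        val conf = new SparkConf()
          .setAppName("tuned-streaming-app")
          // Block interval controls how many tasks each receiver's batch is split into.
          .set("spark.streaming.blockInterval", "200ms")
          // Backpressure lets Spark adapt the ingestion rate to the processing speed.
          .set("spark.streaming.backpressure.enabled", "true")
          // Executor memory (also settable via spark-submit --executor-memory).
          .set("spark.executor.memory", "4g")

        // The batch interval is chosen when the StreamingContext is created.
        val ssc = new StreamingContext(conf, Seconds(3))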
  13. Resources:
      1. Spark documentation
      2. Learning Spark. M. Zaharia, P. Wendell, A. Konwinski, H. Karau (2015)
      3. Learning Spark Streaming. F. Garillot, G. Maas (2017)
      4. Streaming Data: Understanding the Real-Time Pipeline. A. G. Psaltis (2017)
      5. High Performance Spark. H. Karau, R. Warren (2017)
      6. Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark. Z. Nabi (2016)
      7. Spark Summit sessions
      8. Databricks engineering blog
      9. Cloudera engineering blog