
Streaming data processing using Apache Spark at ScalaUA 2019


The talk for ScalaUA (Kyiv, Ukraine) 2019:
"Many businesses strive to use real-time data in order to produce more accurate results, and there are specific approaches to such data processing. One of the most popular is implemented in the Spark Streaming module of Apache Spark. This talk will help you learn about legacy streaming data processing as well as structured streaming, possible use cases, and challenges. A set of short visual examples will guide you through the features of both types of streaming data processing."

Video recording: https://www.youtube.com/watch?v=LZHeOeQGNbE

Roksolana

March 29, 2019

Transcript

  1. Agenda:
     1. Streaming data concept and use cases
     2. Spark Streaming introduction and main types
     3. Examples: 3.1. Basic streaming examples; 3.2. Structured streaming examples
     4. Use case
     5. Spark Streaming future
  2. Data stream: a data stream is a sequence of data units. Streaming data is data that is generated in real time by multiple sources.
  3. Streaming data processing aspects:
     1. Performance
     2. Scalability and reliability
     3. Tool/framework ecosystem
     4. Message delivery semantics
  4. Structured streaming: the data stream is treated as an unbounded table; new rows are continuously appended to that table as data arrives.
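For context, a minimal sketch of how such an unbounded input table is set up in code; the socket source, host, and port here are illustrative assumptions, not taken from the talk:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("structured-streaming-sketch")
      .master("local[*]")
      .getOrCreate()

    // Each incoming line becomes a new row appended to the unbounded input table.
    val lines = spark.readStream
      .format("socket")             // illustrative source; a real pipeline would use e.g. Kafka
      .option("host", "localhost")
      .option("port", 9999)
      .load()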
  5. Incrementalization process: an incremental query turns the input table into a result table. Output modes: append, complete, update.
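A minimal sketch of such an incremental query, continuing the lines stream from the previous sketch; the word-count query and the console sink are assumptions for illustration:

    import org.apache.spark.sql.functions.{col, explode, split}

    // The query runs incrementally: the result table is updated as new rows arrive.
    val wordCounts = lines
      .select(explode(split(col("value"), " ")).as("word"))
      .groupBy("word")
      .count()

    // "complete" re-emits the whole result table each trigger; "update" emits only
    // changed rows; "append" emits only rows that will never change again.
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()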
  6. Window operations on DStream (diagram: a window-based operation groups several consecutive time steps of the original DStream into each window of the windowed DStream).
  7. Sliding window operations on DStream (diagram: as above, but consecutive windows overlap, each sliding forward over the original DStream).
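A minimal sketch of both operations in code, assuming a StreamingContext built from an existing SparkConf and a socket text source; all names and durations are illustrative:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sparkConf, Seconds(1)) // 1-second micro-batches
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // Window covering the last 30 seconds of data, recomputed every 10 seconds.
    val windowed = words.window(Seconds(30), Seconds(10))

    // Windowed aggregation: count occurrences per word over the same window.
    val counts = words.map((_, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))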
  8. Window transformations (diagram: per-key counts accumulating into overlapping windows as event time advances from 11:00 to 11:15).
  9. Late data (diagram: a record that arrives late is added to the window of its event time, updating the counts of an earlier window).
  10. Watermarking (diagram: with a watermark of 5 minutes, windows older than the watermark are finalized and records arriving later than that are dropped).
  11. Spark streaming example: extracting the value from each Kafka record turns a DStream[ConsumerRecord[String, String]] into a DStream[String]:

      lines.map(_.value())
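A sketch of where such a DStream[ConsumerRecord[String, String]] typically comes from, assuming the spark-streaming-kafka-0-10 integration and reusing the StreamingContext ssc from the earlier sketch; the topic name, Kafka parameters, and consumer group are illustrative:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010._

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "scalaua-demo" // illustrative consumer group
    )

    // DStream[ConsumerRecord[String, String]]
    val lines = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("tweets"), kafkaParams))

    // DStream[String]: keep only the record values
    val values = lines.map(_.value())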
  12. Spark streaming example:

      values.foreachRDD(rdd => {
        val dataFrame = sparkSession.read
          .schema(dataSchema)
          .load()
      })
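As shown on the slide, the reader never consumes rdd; a plausible completion, assuming each RDD element is a JSON document, would feed the RDD to the reader as a dataset (the json call is an assumption, not taken from the slide):

    import org.apache.spark.sql.Encoders

    values.foreachRDD(rdd => {
      // Assumption: every element of the RDD is one JSON document.
      val dataFrame = sparkSession.read
        .schema(dataSchema)
        .json(sparkSession.createDataset(rdd)(Encoders.STRING))
    })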
  13. Streaming example visualization (diagram: foreachRDD takes each RDD[String] of the DStream through sparkSession.read.schema(schema), yielding a DataFrameReader, and .load() produces a DataFrame).
  14. Aggregations:

      val colName = "user.followers_count"
      values.foreachRDD(rdd => {
        ...
        dataFrame.agg(
          min(dataFrame.col(colName)),
          max(dataFrame.col(colName)))
      })
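The min and max used here are Spark's SQL aggregate functions, so for the snippet to compile they need to be imported:

    import org.apache.spark.sql.functions.{min, max}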
  15. Windowed aggregations (the slide walks through the type of each step: .groupBy with a window column yields a RelationalGroupedDataset, and .count() yields the result DataFrame):

      val windowedData = structuredData
        .groupBy(window(structuredData.col("timestamp_ms"), "7 minutes", "3 minutes"),
          updatedTweets.col(userLanguage))
        .count()
  16. Windowed aggregations with watermark:

      val watermarkedData = structuredData
        .withWatermark("timestamp_ms", "3 minutes")
        .groupBy(window(structuredData.col("timestamp_ms"), "7 minutes", "3 minutes"),
          updatedTweets.col(userLanguage))
        .count()
  17. Windowed aggregations with watermark (diagram: .withWatermark("timestamp_ms", "3 minutes") on structuredData, then .groupBy(...) to a RelationalGroupedDataset, and .count() to the watermarkedData DataFrame).
  18. Structured streaming application launch:

      val resultsOutput = dataTransformation.writeStream
        .outputMode(outputMode)
        .format(outputFormat)
        .start()

      resultsOutput.awaitTermination()
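Concrete values of outputMode and outputFormat are not on the slide; a typical assumption for a demo run would be:

    val outputMode = "update"    // or "append" / "complete", depending on the query
    val outputFormat = "console" // console sink for demos; e.g. Kafka or Parquet in production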
  19. Continuous streaming example:

      val continuousOutput = filteredData.writeStream
        .outputMode("append")
        .format(outputFormat)
        .trigger(Trigger.Continuous("1 second"))
        .start()

      continuousOutput.awaitTermination()
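The trigger type comes from Spark's streaming package; note that continuous processing (experimental since Spark 2.3) supports only map-like queries such as selections and filters, which is presumably why this example only filters:

    import org.apache.spark.sql.streaming.Trigger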
  20. Resources:
      1. Spark documentation
      2. Learning Spark Streaming. F. Garillot, G. Maas
      3. High Performance Spark. H. Karau, R. Warren
      4. Spark Summit sessions
      5. Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark. Z. Nabi