
Mini-batch processing with Spark Streaming

Ruurtjan
November 22, 2017


These are the slides for this meetup: https://www.meetup.com/Hands-On-Big-Data-Architecture/events/243703572/

In this presentation, we introduced Spark and Spark Streaming. We went on to discuss caveats when reading from Kafka in Spark Streaming, covered the concept of windowing, and concluded with a pros and cons comparison of Spark Streaming.


Transcript

  1. DATA SCIENCE | BIG DATA ENGINEERING | BIG DATA ARCHITECTURES

    STREAMING DATA PIPELINES PART 2: STREAM PROCESSING WITH SPARK STREAMING
  2. Agenda

     Time            Activity
     17:00 – 18:30   Welcome, food
     18:30 – 19:30   Theory: intro to Spark & Spark Streaming
     19:30 – 20:45   Hands-on
     20:45           Wrap-up
  3. Hands-on

     • Pre-requisites: docker and docker-compose
     • Download https://pastebin.com/raw/T10CJXhF and save it as ‘docker-compose.yml’
     • docker-compose up
     • Open a browser at http://localhost:8080
     • Navigate to ‘Mini-batch processing with Spark Streaming’
  4. Meetup Sponsor

     Actionable insights • Embedded analytics • Use-case discovery • Data science as-a-service •
     Integrated data solutions • Big data awareness • Training & consultancy
  5. Big Data Production Infrastructure

     [Architecture diagram: data ingest feeding a BATCH path (data lake, processing framework, data warehouse, context data) and a REAL TIME path (stream processing), both serving a PRESENTATION layer (endpoint, data cache)]
  6. Streaming Ecosystem

     [Diagram of a streaming pipeline: apps publishing to a message broker, stream processors (applying models), and a data warehouse and data lake as sinks]
  7. Data structures

     • RDD, Dataset, DataFrame: what’s in a name?
       ◦ RDD - the most basic data structure
       ◦ Dataset, DataFrame - added for SQL-like operations
       ◦ DataFrame = Dataset[Row], with convenience functions
     • A fault-tolerant collection of elements
     • That can be operated on in parallel
     • Lazy evaluation
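
     To make the names concrete, here is a minimal sketch for spark-shell (where a SparkSession named spark is already in scope; the Person case class is only for illustration):

        import org.apache.spark.rdd.RDD
        import org.apache.spark.sql.{DataFrame, Dataset, Row}
        import spark.implicits._

        case class Person(name: String, age: Int)

        // RDD: a distributed collection of plain JVM objects, without a schema
        val rdd: RDD[Person] = spark.sparkContext.parallelize(Seq(Person("Ann", 34), Person("Bob", 28)))

        // Dataset: typed and schema-aware, enabling SQL-like operations
        val ds: Dataset[Person] = rdd.toDS()

        // DataFrame: just an alias for Dataset[Row]
        val df: DataFrame = ds.toDF()

        // Everything above is lazy; nothing runs until an action such as count or show
        df.groupBy("age").count().show()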
  8. Some common operations

     val lines: RDD[String] = sc.textFile("data.txt")

     • RDD.map ⇒ a new RDD with each element transformed by a given function
       val lowerCaseLines: RDD[String] = lines.map { _.toLowerCase }
     • RDD.filter ⇒ a new RDD with only the elements that satisfy a given predicate
       val linesWithRuurtjan: RDD[String] = lowerCaseLines.filter { line => line.contains("ruurtjan") }
     • RDD.flatMap ⇒ a new RDD with each element transformed into 0 or more elements
       val bertjanRdd: RDD[String] = lowerCaseLines.flatMap { line => line.split(" ").filter(_ == "bertjan") }
     • RDD.count ⇒ the number of rows in the RDD
       val awesomeness: Long = bertjanRdd.count // So often mentioned together with my colleague Ruurtjan!
  9. Implicit functionality

     • In some cases, the Spark API provides additional convenience through implicit conversions
       ◦ RDD[Double] ⇒ DoubleRDDFunctions
         ▪ rdd.histogram
         ▪ rdd.stdev
         ▪ rdd.variance
       ◦ RDD[(K, V)] ⇒ PairRDDFunctions
         ▪ rdd.countByKey
         ▪ rdd.groupByKey
         ▪ rdd.aggregateByKey
         ▪ rdd.reduceByKey
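
     A minimal sketch of these implicit conversions in action (assuming a SparkContext named sc, e.g. in spark-shell):

        import org.apache.spark.rdd.RDD

        val temps: RDD[Double] = sc.parallelize(Seq(20.1, 21.4, 19.8))
        temps.stdev()              // works because RDD[Double] is implicitly wrapped in DoubleRDDFunctions

        val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
        pairs.reduceByKey(_ + _)   // works because RDD[(K, V)] is implicitly wrapped in PairRDDFunctions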
  10. Shuffling

     • Shuffling: e.g. reduceByKey
       ◦ We initially have an RDD[(K, V)]
       ◦ Different values for the same key may live on different nodes
       ◦ After the reduce we again have an RDD[(K, V)], with one value per key
         ▪ meaning: all values for a given K must first be brought to the same node
       ◦ Shuffling involves disk I/O, data serialization and network I/O
         ▪ i.e. (relatively) slow!
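
     A minimal word-count sketch that triggers a shuffle (assuming a SparkContext named sc; the file name is a placeholder):

        import org.apache.spark.rdd.RDD

        val lines: RDD[String] = sc.textFile("data.txt")
        val wordCounts: RDD[(String, Int)] = lines
          .flatMap(_.split(" "))    // narrow transformation: no shuffle
          .map(word => (word, 1))   // narrow transformation: no shuffle
          .reduceByKey(_ + _)       // wide transformation: values for each key are shuffled to one node

        wordCounts.take(10).foreach(println)

     Note that reduceByKey combines values on the map side before shuffling, so it typically moves far less data over the network than groupByKey.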
  11. First blood code

     import org.apache.spark._
     import org.apache.spark.streaming._

     val ssc = new StreamingContext(sc, Seconds(1))

     ssc
       .socketTextStream("localhost", 9999)
       .foreachRDD { rdd =>
         // do something useful with each mini-batch RDD, e.g.:
         println(rdd.count())
       }

     ssc.start()
     ssc.awaitTermination()
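
     More typically, the per-batch work is expressed as DStream transformations. A minimal sketch, assuming a freshly created StreamingContext ssc like the one above (before start() has been called) and netcat on port 9999 as a stand-in source:

        val lines = ssc.socketTextStream("localhost", 9999)

        val errorCounts = lines
          .filter(_.contains("ERROR"))   // applied to every mini-batch
          .count()                       // a DStream holding one count per batch

        errorCounts.print()              // an output operation: prints each batch's result

        ssc.start()
        ssc.awaitTermination()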
  12. Reading from Kafka

     [Diagram: topic XYZ with partitions 1–3; the Spark Streaming driver hands offset ranges such as { part: 1, start: 0, end: 100 } and { part: 3, start: 20, end: 80 } to workers 1…N, which read those offsets from Kafka]
  13. Reading from Kafka - Challenges / Consequences

     • The number of partitions and the number of workers have to match
       ◦ 100 workers with 1 partition
       ◦ 1 worker with 100 partitions
       ◦ Both cases: a parallelism of 1
     • Initial load when processing a topic that already contains a lot of data
       ◦ Too many rows in an RDD at once cause out-of-memory errors on the workers
       ◦ You need to configure the batch size (see the sketch below)
         ▪ Too small: processing the initial load takes too long
         ▪ I.e. manual tuning, dependent on the cluster configuration
     • Processing time of mini-batches
       ◦ What if processing time depends on the number of rows?
       ◦ What if processing depends on latency between backend systems?
       ◦ MOAR TUNING!!!
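
     A minimal sketch of a direct Kafka stream with a per-partition rate cap (assuming the spark-streaming-kafka-0-10 connector; the broker address, group id and topic name XYZ are placeholders):

        import org.apache.kafka.common.serialization.StringDeserializer
        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        import org.apache.spark.streaming.kafka010._

        val conf = new SparkConf()
          .setAppName("kafka-example")
          .setMaster("local[2]") // streaming needs at least 2 local threads
          // cap the records each partition contributes per batch, to avoid OOM on the initial load
          .set("spark.streaming.kafka.maxRatePerPartition", "1000")
        val ssc = new StreamingContext(conf, Seconds(1))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "localhost:9092",
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "example-group",
          "auto.offset.reset" -> "earliest"
        )

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("XYZ"), kafkaParams)
        )

        stream.map(_.value).foreachRDD(rdd => println(s"Batch size: ${rdd.count()}"))
        ssc.start()
        ssc.awaitTermination()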
  14. Windowing - What you want

     NOTE: This is supported since Spark 2.1 in the Structured Streaming API
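
     A minimal sketch of such an event-time window with a watermark in Structured Streaming (the socket source, column names and durations are placeholders; a real job would use an event-time column from the data instead of current_timestamp):

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.{col, current_timestamp, window}

        val spark = SparkSession.builder.appName("windowing").master("local[*]").getOrCreate()

        val events = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", "9999")
          .load()
          .withColumn("timestamp", current_timestamp())   // stand-in for a real event-time column

        val counts = events
          .withWatermark("timestamp", "10 minutes")        // drop events arriving more than 10 minutes late
          .groupBy(window(col("timestamp"), "5 minutes"))  // tumbling 5-minute event-time windows
          .count()

        counts.writeStream
          .outputMode("update")
          .format("console")
          .start()
          .awaitTermination()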
  15. Pros / cons

     Pros:
     • DStream → a stream of RDDs - potential for efficient processing of (relatively) large batches of data
       ◦ Useful for map / reduce operations!

     Cons:
     • Event time / watermarking is only available in Structured Streaming
       ◦ SQLContext has a limited API
     • Not really real-time processing
       ◦ The smallest (sensible) batch duration is 500 ms
       ◦ However, does your use case really, *really* need to be faster than that?
  16. Hands-on

     • Pre-requisites: docker and docker-compose
     • Download: https://pastebin.com/raw/T10CJXhF
     • docker-compose up
     • Open a browser at http://localhost:8080
     • Navigate to ‘Mini-batch processing with Spark Streaming’
  17. Upcoming Events

     • Hands-on Data Science Meetup: Recommender Systems in Practice (7th of December 2017)
     • Hands-on Big Data Architecture Meetup: Streaming Data Pipelines #3: Real-time Event Processing with Apache Flink (7th of February 2018)