
Mini-batch processing with Spark Streaming

Ruurtjan
November 22, 2017


These are the slides for this meetup: https://www.meetup.com/Hands-On-Big-Data-Architecture/events/243703572/

In this presentation, we introduced Spark and Spark Streaming. We went on to discuss caveats when reading from Kafka in Spark Streaming, covered the concept of windowing, and concluded with a pros and cons comparison of Spark Streaming.


Transcript

  1. DATA SCIENCE | BIG DATA ENGINEERING | BIG DATA ARCHITECTURES

    STREAMING DATA PIPELINES PART 2: STREAM PROCESSING WITH SPARK STREAMING
  2. Agenda

     Time            Activity
     17:00 – 18:30   Welcome, food
     18:30 – 19:30   Theory: intro to Spark & Spark Streaming
     19:30 – 20:45   Hands-on
     20:45           Wrap-up
  3. Hands-on

     • Pre-requisites: docker and docker-compose
     • Download https://pastebin.com/raw/T10CJXhF and save it as ‘docker-compose.yml’
     • docker-compose up
     • Open a browser at http://localhost:8080
     • Navigate to ‘Mini-batch processing with Spark Streaming’
  4. Meetup Sponsor

     Actionable insights • Embedded analytics • Use-case discovery • Data science as-a-service •
     Integrated data solutions • Big data awareness • Training & consultancy
  5. Big Data Production Infrastructure

     [Architecture diagram: data ingest feeding a BATCH path (data lake, processing framework, data warehouse, context data) and a REAL TIME path (stream processing), both serving a PRESENTATION layer (endpoint, data cache)]
  6. Streaming Ecosystem

     [Diagram of a streaming pipeline: apps publishing to a message broker, stream processors (applying models), and a data warehouse and data lake as sinks]
  7. Data structures

     • RDD, Dataset, DataFrame: what’s in a name?
       ◦ RDD - the most basic data structure
       ◦ Dataset, DataFrame - added for SQL-like operations
       ◦ DataFrame = Dataset[Row], with convenience functions
     • A fault-tolerant collection of elements
     • That can be operated on in parallel
     • Lazy evaluation
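
     To make the names concrete, here is a minimal sketch for spark-shell (where a SparkSession named spark is already in scope; the Person case class is only for illustration):

        import org.apache.spark.rdd.RDD
        import org.apache.spark.sql.{DataFrame, Dataset, Row}
        import spark.implicits._

        case class Person(name: String, age: Int)

        // RDD: a distributed collection of plain JVM objects, without a schema
        val rdd: RDD[Person] = spark.sparkContext.parallelize(Seq(Person("Ann", 34), Person("Bob", 28)))

        // Dataset: typed and schema-aware, enabling SQL-like operations
        val ds: Dataset[Person] = rdd.toDS()

        // DataFrame: just an alias for Dataset[Row]
        val df: DataFrame = ds.toDF()

        // Everything above is lazy; nothing runs until an action such as count or show
        df.groupBy("age").count().show()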
  8. Some common operations

     val lines: RDD[String] = sc.textFile("data.txt")

     • RDD.map ⇒ a new RDD with each element transformed by a given function
       val lowerCaseLines: RDD[String] = lines.map { _.toLowerCase }
     • RDD.filter ⇒ a new RDD with only the elements that satisfy a given predicate
       val linesWithRuurtjan: RDD[String] = lowerCaseLines.filter { line => line.contains("ruurtjan") }
     • RDD.flatMap ⇒ a new RDD with each element transformed into 0 or more elements
       val bertjanRdd: RDD[String] = lowerCaseLines.flatMap { line => line.split(" ").filter(_ == "bertjan") }
     • RDD.count ⇒ the number of rows in the RDD
       val awesomeness: Long = bertjanRdd.count // So often mentioned together with my colleague Ruurtjan!
  9. Implicit functionality

     • In some cases, the Spark API provides additional convenience through implicit conversions
       ◦ RDD[Double] ⇒ DoubleRDDFunctions
         ▪ rdd.histogram
         ▪ rdd.stdev
         ▪ rdd.variance
       ◦ RDD[(K, V)] ⇒ PairRDDFunctions
         ▪ rdd.countByKey
         ▪ rdd.groupByKey
         ▪ rdd.aggregateByKey
         ▪ rdd.reduceByKey
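
     A minimal sketch of these implicit conversions in action (assuming a SparkContext named sc, e.g. in spark-shell):

        import org.apache.spark.rdd.RDD

        val temps: RDD[Double] = sc.parallelize(Seq(20.1, 21.4, 19.8))
        temps.stdev()              // works because RDD[Double] is implicitly wrapped in DoubleRDDFunctions

        val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
        pairs.reduceByKey(_ + _)   // works because RDD[(K, V)] is implicitly wrapped in PairRDDFunctions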
  10. Shuffling

     • Shuffling: e.g. reduceByKey
       ◦ We initially have an RDD[(K, V)]
       ◦ Different values for the same key may live on different nodes
       ◦ After the reduce we again have an RDD[(K, V)], with one value per key
         ▪ meaning: all values for a given K must first be brought to the same node
       ◦ Shuffling involves disk I/O, data serialization and network I/O
         ▪ i.e. (relatively) slow!
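
     A minimal word-count sketch that triggers a shuffle (assuming a SparkContext named sc; the file name is a placeholder):

        import org.apache.spark.rdd.RDD

        val lines: RDD[String] = sc.textFile("data.txt")
        val wordCounts: RDD[(String, Int)] = lines
          .flatMap(_.split(" "))    // narrow transformation: no shuffle
          .map(word => (word, 1))   // narrow transformation: no shuffle
          .reduceByKey(_ + _)       // wide transformation: values for each key are shuffled to one node

        wordCounts.take(10).foreach(println)

     Note that reduceByKey combines values on the map side before shuffling, so it typically moves far less data over the network than groupByKey.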
  11. First blood code

     import org.apache.spark._
     import org.apache.spark.streaming._

     val ssc = new StreamingContext(sc, Seconds(1))

     ssc
       .socketTextStream("localhost", 9999)
       .foreachRDD { rdd =>
         // do something useful with each mini-batch RDD, e.g.:
         println(rdd.count())
       }

     ssc.start()
     ssc.awaitTermination()
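
     More typically, the per-batch work is expressed as DStream transformations. A minimal sketch, assuming a freshly created StreamingContext ssc like the one above (before start() has been called) and netcat on port 9999 as a stand-in source:

        val lines = ssc.socketTextStream("localhost", 9999)

        val errorCounts = lines
          .filter(_.contains("ERROR"))   // applied to every mini-batch
          .count()                       // a DStream holding one count per batch

        errorCounts.print()              // an output operation: prints each batch's result

        ssc.start()
        ssc.awaitTermination()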
  12. Reading from Kafka

     [Diagram: topic XYZ with partitions 1–3; the Spark Streaming driver hands offset ranges such as { part: 1, start: 0, end: 100 } and { part: 3, start: 20, end: 80 } to workers 1…N, which read those offsets from Kafka]
  13. Reading from Kafka - Challenges / Consequences

     • The number of partitions and the number of workers have to match
       ◦ 100 workers with 1 partition
       ◦ 1 worker with 100 partitions
       ◦ Both cases: a parallelism of 1
     • Initial load when processing a topic that already contains a lot of data
       ◦ Too many rows in an RDD at once cause out-of-memory errors on the workers
       ◦ You need to configure the batch size (see the sketch below)
         ▪ Too small: processing the initial load takes too long
         ▪ I.e. manual tuning, dependent on the cluster configuration
     • Processing time of mini-batches
       ◦ What if processing time depends on the number of rows?
       ◦ What if processing depends on latency between backend systems?
       ◦ MOAR TUNING!!!
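
     A minimal sketch of a direct Kafka stream with a per-partition rate cap (assuming the spark-streaming-kafka-0-10 connector; the broker address, group id and topic name XYZ are placeholders):

        import org.apache.kafka.common.serialization.StringDeserializer
        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        import org.apache.spark.streaming.kafka010._

        val conf = new SparkConf()
          .setAppName("kafka-example")
          .setMaster("local[2]") // streaming needs at least 2 local threads
          // cap the records each partition contributes per batch, to avoid OOM on the initial load
          .set("spark.streaming.kafka.maxRatePerPartition", "1000")
        val ssc = new StreamingContext(conf, Seconds(1))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "localhost:9092",
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "example-group",
          "auto.offset.reset" -> "earliest"
        )

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("XYZ"), kafkaParams)
        )

        stream.map(_.value).foreachRDD(rdd => println(s"Batch size: ${rdd.count()}"))
        ssc.start()
        ssc.awaitTermination()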
  14. Windowing - What you want

     NOTE: This is supported since Spark 2.1 in the Structured Streaming API
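
     A minimal sketch of such an event-time window with a watermark in Structured Streaming (the socket source, column names and durations are placeholders; a real job would use an event-time column from the data instead of current_timestamp):

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.{col, current_timestamp, window}

        val spark = SparkSession.builder.appName("windowing").master("local[*]").getOrCreate()

        val events = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", "9999")
          .load()
          .withColumn("timestamp", current_timestamp())   // stand-in for a real event-time column

        val counts = events
          .withWatermark("timestamp", "10 minutes")        // drop events arriving more than 10 minutes late
          .groupBy(window(col("timestamp"), "5 minutes"))  // tumbling 5-minute event-time windows
          .count()

        counts.writeStream
          .outputMode("update")
          .format("console")
          .start()
          .awaitTermination()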
  15. Pros / cons

     Pros:
     • DStream → a stream of RDDs - potential for efficient processing of (relatively) large batches of data
       ◦ Useful for map / reduce operations!

     Cons:
     • Event time / watermarking is only available in Structured Streaming
       ◦ SQLContext has a limited API
     • Not really real-time processing
       ◦ The smallest (sensible) batch duration is 500 ms
       ◦ However, does your use case really, *really* need to be faster than that?
  16. Hands-on

     • Pre-requisites: docker and docker-compose
     • Download: https://pastebin.com/raw/T10CJXhF
     • docker-compose up
     • Open a browser at http://localhost:8080
     • Navigate to ‘Mini-batch processing with Spark Streaming’
  17. Upcoming Events

     • Hands-on Data Science Meetup: Recommender Systems in Practice (7th of December 2017)
     • Hands-on Big Data Architecture Meetup: Streaming Data Pipelines #3: Real-time Event Processing with Apache Flink (7th of February 2018)