Slide 1

Slide 1 text

© 2017 MapR Technologies MapR Confidential 1 Introduction to Stream Processing with Apache Flink Tugdual Grall @tgrall

Slide 2

Slide 2 text

{“about” : “me”}
Tugdual “Tug” Grall
• MapR : Technical Evangelist
• MongoDB, Couchbase, eXo, Oracle
• NantesJUG co-founder
• @tgrall
• http://tgrall.github.io

Slide 3

Slide 3 text

MapR Converged Data Platform
[Diagram]
Top layer: Open Source Engines & Tools; Commercial Engines & Applications; Cloud and Managed Services; Custom Apps; Search and Others
Data Processing; Unified Management and Monitoring
Utility-Grade Platform Services: Web-Scale Storage (MapR-FS), Database (MapR-DB), Event Streaming (MapR Streams)
Platform properties: Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace, High Availability

Slide 4

Slide 4 text

Streaming
Streaming technology is enabling the obvious: continuous processing on data that is continuously produced.
Hint: you already have streaming data.

Slide 5

Slide 5 text

Decoupling
[Diagram] App A, App B, App C: state managed centrally
[Diagram] App A, App B, App C: applications build their own state

Slide 6

Slide 6 text

Event Stream = Data Pipelines

Slide 7

Slide 7 text

Streaming and Batch
[Figure: hourly events from 2016-3-1 12:00 am through 2016-3-12 3:00 am, laid out on a timeline and split across two partitions]

Slide 8

Slide 8 text

Streaming and Batch
[Figure: hourly events from 2016-3-1 12:00 am through 2016-3-12 3:00 am, laid out on a timeline and split across two partitions. Labels: Stream (low latency), Stream (high latency)]

Slide 9

Slide 9 text

Streaming and Batch
[Figure: hourly events from 2016-3-1 12:00 am through 2016-3-12 3:00 am, laid out on a timeline and split across two partitions. Labels: Stream (low latency), Batch (bounded stream), Stream (high latency)]

Slide 10

Slide 10 text

Processing
• Request / Response

Slide 11

Slide 11 text

Processing
• Request / Response
• Batch

Slide 12

Slide 12 text

Processing
• Request / Response
• Batch
• Stream Processing

Slide 13

Slide 13 text

Processing
• Request / Response
• Batch
• Stream Processing
• Real-time reaction to events
• Continuous applications
• Process both real-time and historical data

Slide 14

Slide 14 text


Slide 15

Slide 15 text

Flink Architecture

Slide 16

Slide 16 text

Flink Architecture
Deployment: Local (Single JVM), Cluster (Standalone, YARN, Mesos), Cloud (AWS, Google)

Slide 17

Slide 17 text

Flink Architecture
Deployment: Local (Single JVM), Cluster (Standalone, YARN, Mesos), Cloud (AWS, Google)
Core: Runtime (Distributed Streaming Dataflow)

Slide 18

Slide 18 text

Flink Architecture
Deployment: Local (Single JVM), Cluster (Standalone, YARN, Mesos), Cloud (AWS, Google)
Core: Runtime (Distributed Streaming Dataflow)
API & Libraries: DataSet API (Batch Processing)

Slide 19

Slide 19 text

Flink Architecture
Deployment: Local (Single JVM), Cluster (Standalone, YARN, Mesos), Cloud (AWS, Google)
Core: Runtime (Distributed Streaming Dataflow)
API & Libraries: DataSet API (Batch Processing)
Libraries: FlinkML (Machine Learning), Gelly (Graph Processing), Table (Relational)

Slide 20

Slide 20 text

Flink Architecture
Deployment: Local (Single JVM), Cluster (Standalone, YARN, Mesos), Cloud (AWS, Google)
Core: Runtime (Distributed Streaming Dataflow)
API & Libraries: DataSet API (Batch Processing), DataStream API (Stream Processing)
Libraries: FlinkML (Machine Learning), Gelly (Graph Processing), Table (Relational)

Slide 21

Slide 21 text

Flink Architecture
Deployment: Local (Single JVM), Cluster (Standalone, YARN, Mesos), Cloud (AWS, Google)
Core: Runtime (Distributed Streaming Dataflow)
API & Libraries: DataSet API (Batch Processing), DataStream API (Stream Processing)
Libraries: FlinkML (Machine Learning), Gelly (Graph Processing), Table (Relational), CEP (Event Processing), Table (Relational)

Slide 22

Slide 22 text

Demonstration: Flink Basics

Slide 23

Slide 23 text

Batch & Stream

case class Word(word: String, frequency: Int)

// DataSet API - Batch
val lines: DataSet[String] = env.readTextFile(…)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()

// DataStream API - Streaming
// Note: fromSocketStream and window(...).every(...) are pre-1.0 names;
// Flink 1.x uses env.socketTextStream(...) and
// timeWindow(Time.seconds(5), Time.seconds(1)) for the same sliding window.
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .keyBy("word").window(Time.of(5, SECONDS))
  .every(Time.of(1, SECONDS)).sum("frequency")
  .print()
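
The streaming snippet sums word counts over a 5-second window that is evaluated every second, which is a sliding window. As a rough sketch of what that means for one element, the plain-Scala helper below lists every window a timestamp falls into. It is not Flink code: `SlidingWindows` and `windowStarts` are hypothetical names, and aligning window starts to multiples of the slide is an assumption of this sketch.

```scala
object SlidingWindows {
  /** Start timestamps (in ms) of every sliding window [start, start + sizeMs)
    * that contains `ts`. Assumes ts >= 0 and that window starts are aligned
    * to multiples of `slideMs` (an assumption of this sketch).
    */
  def windowStarts(ts: Long, sizeMs: Long, slideMs: Long): List[Long] = {
    // Latest aligned window start that is <= ts.
    val lastStart = ts - (ts % slideMs)
    // Walk backwards one slide at a time while the window still covers ts.
    Iterator.iterate(lastStart)(_ - slideMs)
      .takeWhile(start => start > ts - sizeMs)
      .toList
      .reverse
  }

  def main(args: Array[String]): Unit = {
    // 5-second windows sliding every second, as in the word-count example.
    val starts = windowStarts(ts = 7300L, sizeMs = 5000L, slideMs = 1000L)
    starts.foreach(s => println(s"[$s, ${s + 5000L})"))
  }
}
```

With a 5 s size and a 1 s slide, the element at 7300 ms lands in five overlapping windows, from [3000, 8000) to [7000, 12000), so each word contributes to five consecutive sums.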

Slide 24

Slide 24 text

Stream Processing
Source → Filter / Transform → Sink

Slide 25

Slide 25 text

Flink Ecosystem
Source: Apache Kafka, MapR Streams, AWS Kinesis, RabbitMQ, Twitter, Apache Bahir, …
Sink: Apache Kafka, MapR Streams, AWS Kinesis, RabbitMQ, Elasticsearch, HDFS/MapR-FS, …

Slide 26

Slide 26 text

Stateful Stream Processing
Source → Filter / Transform (state read/write) → Sink
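
To make the source, filter/transform, state, and sink stages concrete, here is a minimal plain-Scala sketch of a stateful pipeline that keeps a running count per key. It only illustrates the shape of the dataflow; it is not Flink's API, and `StatefulPipeline`, `Reading`, and `run` are hypothetical names. A real stream processor would also checkpoint the state, which this sketch does not attempt.

```scala
import scala.collection.mutable

object StatefulPipeline {
  // Hypothetical input record, used only for this illustration.
  final case class Reading(sensorId: String, value: Double)

  /** Runs source -> filter -> transform -> stateful count -> sink
    * and returns everything the sink received, in arrival order.
    */
  def run(source: Iterator[Reading]): List[(String, Long)] = {
    val state = mutable.Map.empty[String, Long]        // per-key state
    val sink  = mutable.ListBuffer.empty[(String, Long)]

    source
      .filter(_.value >= 0.0)                          // Filter: drop negative readings
      .map(_.sensorId)                                 // Transform: keep only the key
      .foreach { key =>
        val updated = state.getOrElse(key, 0L) + 1L    // State read
        state.update(key, updated)                     // State write
        sink += ((key, updated))                       // Emit to the sink
      }

    sink.toList
  }

  def main(args: Array[String]): Unit = {
    val input = Iterator(Reading("a", 1.0), Reading("b", -1.0), Reading("a", 2.0))
    run(input).foreach(println)
  }
}
```

The result for each element depends on what was seen before it, which is what distinguishes this from the stateless Source → Filter / Transform → Sink pipeline.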

Slide 27

Slide 27 text

Is Flink used?

Slide 28

Slide 28 text

Powered by Flink

Slide 29

Slide 29 text

• 10 billion events/day
• 2 TB of data/day
• 30 applications
• 2 PB of storage and growing
Source: Bouygues Telecom, http://berlin.flink-forward.org/wp-content/uploads/2016/07/Thomas-Lamirault_Mohamed-Amine-Abdessemed-A-brief-history-of-time-with-Apache-Flink.pdf

Slide 30

Slide 30 text

Stream Processing: Windowing

Slide 31

Slide 31 text

Stream Windows

Slide 32

Slide 32 text

Stream Windows

Slide 33

Slide 33 text

Stream Windows

Slide 34

Slide 34 text

Stream Windows

Slide 35

Slide 35 text

Stream Windows

Slide 36

Slide 36 text

Demonstration: Flink Windowing

Slide 37

Slide 37 text

Time
What about it?

Slide 38

Slide 38 text

Time in Flink
• Multiple notions of “Time” in Flink
• Event Time
• Ingestion Time
• Processing Time

Slide 39

Slide 39 text

What Is Event-Time Processing
Processing Time → Event Time
1977 → Episode IV
1980 → Episode V
1983 → Episode VI
1999 → Episode I
2002 → Episode II
2005 → Episode III
2015 → Episode VII
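
The Star Wars timeline can be written down as a small plain-Scala sketch: the release year plays the role of processing time (when the event reached the audience) and the episode number plays the role of event time (when it happened in the story). The names `StarWarsTime` and `Release` are hypothetical and exist only for this illustration; the data is taken from the slide.

```scala
object StarWarsTime {
  // episode = event time (position in the story),
  // year    = processing time (when it was released).
  final case class Release(episode: Int, year: Int)

  val releases: List[Release] = List(
    Release(4, 1977), Release(5, 1980), Release(6, 1983),
    Release(1, 1999), Release(2, 2002), Release(3, 2005),
    Release(7, 2015)
  )

  // Order in which a processing-time system would see the events.
  def byProcessingTime: List[Int] = releases.sortBy(_.year).map(_.episode)

  // Order an event-time system reconstructs from the timestamps carried by the events.
  def byEventTime: List[Int] = releases.sortBy(_.episode).map(_.episode)

  def main(args: Array[String]): Unit = {
    println(s"processing time: ${byProcessingTime.mkString(", ")}")
    println(s"event time:      ${byEventTime.mkString(", ")}")
  }
}
```

Sorting by release year yields episodes 4, 5, 6, 1, 2, 3, 7, while sorting by episode number restores 1 through 7, which is the gap event-time processing is meant to close.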

Slide 40

Slide 40 text

Time in Flink

Slide 41

Slide 41 text

Complex Event Processing

Slide 42

Slide 42 text

Complex Event Processing
• Analyzing a stream of events and drawing conclusions
• “if A and then B → infer event C”
• Demanding requirements on stream processor
• Low latency!
• Exactly-once semantics & event-time support

Slide 43

Slide 43 text

Use Case

Slide 44

Slide 44 text

Order Events
Process is reflected in a stream of order events
Order(orderId, tStamp, “received”)
Shipment(orderId, tStamp, “shipped”)
Delivery(orderId, tStamp, “delivered”)
orderId: Identifies the order
tStamp: Time at which the event happened

Slide 45

Slide 45 text

Real-time Warnings

Slide 46

Slide 46 text

CEP to the Rescue
Define processing and delivery intervals (SLAs)
ProcessSucc(orderId, tStamp, duration)
ProcessWarn(orderId, tStamp)
DeliverySucc(orderId, tStamp, duration)
DeliveryWarn(orderId, tStamp)
orderId: Identifies the order
tStamp: Time when the event happened
duration: Duration of the processing/delivery
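
As a rough illustration of how the processing SLA turns order events into `ProcessSucc` and `ProcessWarn`, the plain-Scala sketch below evaluates a finished batch of events. It is not the Flink CEP API and does not run on a stream. The object `OrderSla`, the function `evaluate`, the one-hour default, and the choices to treat a shipment exactly at the limit as on time and to stamp the warning at the moment the SLA expires are all assumptions of this sketch.

```scala
object OrderSla {
  sealed trait Event { def orderId: Long; def tStamp: Long }
  final case class Order(orderId: Long, tStamp: Long) extends Event     // "received"
  final case class Shipment(orderId: Long, tStamp: Long) extends Event  // "shipped"

  final case class ProcessSucc(orderId: Long, tStamp: Long, duration: Long)
  final case class ProcessWarn(orderId: Long, tStamp: Long)

  val OneHourMs: Long = 60L * 60L * 1000L

  /** For each order: Right(ProcessSucc) if a shipment follows within `slaMs`,
    * otherwise Left(ProcessWarn) stamped at the moment the SLA expired.
    * Results are sorted by orderId to keep the output deterministic.
    */
  def evaluate(events: Seq[Event],
               slaMs: Long = OneHourMs): List[Either[ProcessWarn, ProcessSucc]] =
    for {
      (id, evs) <- events.groupBy(_.orderId).toList.sortBy(_._1)
      order     <- evs.collectFirst { case o: Order => o }.toList
    } yield {
      // Earliest shipment that follows the order and respects the SLA.
      val onTime = evs
        .collect { case s: Shipment if s.tStamp >= order.tStamp &&
                                       s.tStamp - order.tStamp <= slaMs => s }
        .sortBy(_.tStamp)
        .headOption
      onTime match {
        case Some(s) => Right(ProcessSucc(id, s.tStamp, s.tStamp - order.tStamp))
        case None    => Left(ProcessWarn(id, order.tStamp + slaMs))
      }
    }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Order(1L, 0L), Shipment(1L, 1800000L), // shipped after 30 minutes
      Order(2L, 0L)                          // never shipped
    )
    evaluate(events).foreach(println)
  }
}
```

The `Either[ProcessWarn, ProcessSucc]` result mirrors the split between a timeout branch and a match branch; the delivery SLA (`DeliverySucc` / `DeliveryWarn`) would follow the same shape with `Shipment` and `Delivery` events.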

Slide 47

Slide 47 text

CEP Example

Slide 48

Slide 48 text

Processing: Order → Shipment

Slide 49

Slide 49 text

Processing: Order → Shipment

val processingPattern = Pattern
  .begin[Event]("received").subtype(classOf[Order])
  .followedBy("shipped").where(_.status == "shipped")
  .within(Time.hours(1))

Slide 50

Slide 50 text

Processing: Order → Shipment

val processingPattern = Pattern
  .begin[Event]("received").subtype(classOf[Order])
  .followedBy("shipped").where(_.status == "shipped")
  .within(Time.hours(1))

val processingPatternStream = CEP.pattern(
  input.keyBy("orderId"),
  processingPattern)

Slide 51

Slide 51 text

Processing: Order → Shipment

val processingPattern = Pattern
  .begin[Event]("received").subtype(classOf[Order])
  .followedBy("shipped").where(_.status == "shipped")
  .within(Time.hours(1))

val processingPatternStream = CEP.pattern(
  input.keyBy("orderId"),
  processingPattern)

val procResult: DataStream[Either[ProcessWarn, ProcessSucc]] =
  processingPatternStream.select {
    (pP, timestamp) => // Timeout handler
      ProcessWarn(pP("received").orderId, timestamp)
  } {
    fP => // Select function
      ProcessSucc(
        fP("received").orderId,
        fP("shipped").tStamp,
        fP("shipped").tStamp - fP("received").tStamp)
  }

Slide 52

Slide 52 text

Count Delayed Shipments

Slide 53

Slide 53 text

Compute Avg Processing Time

Slide 54

Slide 54 text

Demonstration: Streaming Analytics

Slide 55

Slide 55 text

Demonstration
• https://github.com/mapr-demos/mapr-streams-flink-demo
• https://github.com/mapr-demos/wifi-sensor-demo
• http://tgrall.github.io/blog/2016/10/12/getting-started-with-apache-flink-and-kafka/
• http://tgrall.github.io/blog/2016/10/17/getting-started-with-apache-flink-and-mapr-streams/
• more soon…

Slide 56

Slide 56 text

Thanks to
Kostas Tzoumas, Stephan Ewen, Fabian Hueske, Till Rohrmann, Jamie Grier

Slide 57

Slide 57 text

Free ebooks & Online training
Streaming Architecture (ebook): http://mapr.com/ebooks/
Training: http://mapr.com/training/

Slide 58

Slide 58 text

Stream Processing with Apache Flink Tugdual Grall @tgrall