Introduction to Streaming with Apache Flink

After a quick description of event streams and stream processing, this presentation moves to an introduction of Apache Flink:
- basic architecture
- sample code
- windowing and time concepts
- complex event processing (CEP)

This presentation was delivered at Devoxx France 2017.

Tugdual Grall

April 06, 2017

Transcript

  1. 2.

    #DevoxxFR {“about” : “me”}

    Tugdual “Tug” Grall
    • MapR: Technical Evangelist
    • MongoDB, Couchbase, eXo, Oracle
    • NantesJUG co-founder
    • @tgrall
    • http://tgrall.github.io
    • tug@mapr.com / tugdual@gmail.com
  2. 3.

    [Diagram: MapR Converged Data Platform]
    • Top layer: Open Source Engines & Tools, Commercial Engines & Applications, Custom Apps, Cloud and Managed Services
    • Data Processing: Search and Others
    • APIs: HDFS API, POSIX, NFS, HBase API, JSON API, Kafka API
    • Web-Scale Storage: MapR-FS; Database: MapR-DB; Event Streaming: MapR Streams
    • Enterprise-Grade Platform Services: Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace, High Availability
    • Unified Management and Monitoring
  3. 4.

    Streaming technology is enabling the obvious: continuous processing on data that is continuously produced.

    Hint: you already have streaming data.
  4. 5.

    Decoupling
    • State managed centrally: App A, App B, App C
    • Applications build their own state: App A, App B, App C
  5. 7.

    Streaming and Batch
    [Timeline diagram: a partitioned log of events with hourly timestamps, from 2016-3-1 12:00 am to 2016-3-12 3:00 am and onward.]
  6. 8.

    Streaming and Batch
    [Timeline diagram: a partitioned log of events with hourly timestamps, from 2016-3-1 12:00 am to 2016-3-12 3:00 am and onward. Labels: Stream (low latency), Stream (high latency).]
  7. 9.

    Streaming and Batch
    [Timeline diagram: a partitioned log of events with hourly timestamps, from 2016-3-1 12:00 am to 2016-3-12 3:00 am and onward. Labels: Stream (low latency), Batch (bounded stream), Stream (high latency).]
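The "Batch (bounded stream)" label can be illustrated without Flink: one processing function serves both cases, and a batch is simply the stream cut at some point. The sketch below uses plain Scala iterators; `Event`, `countPerHour`, and the generated data are hypothetical names chosen for illustration and do not come from the talk.

```scala
object BoundedStreamSketch {
  // Hypothetical event: an hour bucket and a payload.
  final case class Event(hour: Int, payload: String)

  // One processing function for both cases: counts events per hour.
  // It consumes whatever the iterator yields, so the caller decides
  // where the input ends.
  def countPerHour(events: Iterator[Event]): Map[Int, Int] =
    events.foldLeft(Map.empty[Int, Int]) { (acc, e) =>
      acc.updated(e.hour, acc.getOrElse(e.hour, 0) + 1)
    }

  // An unbounded source: event i falls into hour i / 3.
  def unbounded: Iterator[Event] =
    Iterator.from(0).map(i => Event(i / 3, s"event-$i"))

  def main(args: Array[String]): Unit = {
    // A "batch" is the same stream with a bound: here its first 6 events.
    val batch = unbounded.take(6)
    println(countPerHour(batch))
  }
}
```

Calling `countPerHour` on the unbounded iterator directly would never return, which is why a streaming engine emits results per window instead of once at the end.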
  8. 13.

    Processing
    • Request / Response
    • Batch
    • Stream Processing
    • Real-time reaction to events
    • Continuous applications
    • Process both real-time and historical data
  9. 17.

    Flink Architecture
    • Deployment: Local (Single JVM), Cluster (Standalone, YARN, Mesos), Cloud (AWS, Google)
    • Core: Runtime (Distributed Streaming Dataflow)
  10. 18.

    • Deployment: Local (Single JVM), Cluster (Standalone, YARN, Mesos), Cloud (AWS, Google)
    • Core: Runtime (Distributed Streaming Dataflow)
    • API & Libraries: DataSet API (Batch Processing)
  11. 19.

    Flink Architecture
    • Deployment: Local (Single JVM), Cluster (Standalone, YARN, Mesos), Cloud (AWS, Google)
    • Core: Runtime (Distributed Streaming Dataflow)
    • API & Libraries: DataSet API (Batch Processing)
    • Libraries: FlinkML (Machine Learning), Gelly (Graph Processing), Table (Relational)
  12. 20.

    Flink Architecture
    • Deployment: Local (Single JVM), Cluster (Standalone, YARN, Mesos), Cloud (AWS, Google)
    • Core: Runtime (Distributed Streaming Dataflow)
    • API & Libraries: DataSet API (Batch Processing), DataStream API (Stream Processing)
    • Libraries: FlinkML (Machine Learning), Gelly (Graph Processing), Table (Relational)
  13. 21.

    Flink Architecture
    • Deployment: Local (Single JVM), Cluster (Standalone, YARN, Mesos), Cloud (AWS, Google)
    • Core: Runtime (Distributed Streaming Dataflow)
    • API & Libraries: DataSet API (Batch Processing), DataStream API (Stream Processing)
    • Libraries on the DataSet API: FlinkML (Machine Learning), Gelly (Graph Processing), Table (Relational)
    • Libraries on the DataStream API: CEP (Event Processing), Table (Relational)
  14. 23.

    Batch & Stream

    case class Word(word: String, frequency: Int)

    // DataSet API - Batch
    val lines: DataSet[String] = env.readTextFile(…)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .groupBy("word").sum("frequency")
      .print()

    // DataStream API - Streaming
    val lines: DataStream[String] = env.fromSocketStream(...)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .keyBy("word").window(Time.of(5, SECONDS))
      .every(Time.of(1, SECONDS)).sum("frequency")
      .print()
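The streaming half of the slide counts words over a 5-second window evaluated every second. What that sliding window computes can be sketched in plain Scala, with no Flink dependency. The names `windowStarts` and `count`, the millisecond timestamps, and the choice to drop windows that would start before time 0 are assumptions of this sketch, not Flink's API.

```scala
object SlidingWordCountSketch {
  // Start times (ms) of every sliding window [start, start + sizeMs)
  // that contains timestamp `ts`. Windows are aligned to multiples of
  // `slideMs`; windows starting before 0 are dropped in this sketch.
  def windowStarts(ts: Long, sizeMs: Long, slideMs: Long): Seq[Long] = {
    val lastStart = ts - (ts % slideMs) // latest window start <= ts
    (lastStart to (ts - sizeMs + 1) by -slideMs).filter(_ >= 0)
  }

  // Word frequency per (windowStart, word) for timestamped lines.
  def count(lines: Seq[(Long, String)],
            sizeMs: Long,
            slideMs: Long): Map[(Long, String), Int] = {
    val assigned = for {
      (ts, line) <- lines
      word <- line.split(" ").toSeq if word.nonEmpty
      start <- windowStarts(ts, sizeMs, slideMs)
    } yield ((start, word), 1)
    // Sum the ones per (window, word) key.
    assigned.groupBy(_._1).map { case (key, ones) => key -> ones.map(_._2).sum }
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq((500L, "a b"), (1500L, "a"))
    // 5 s windows sliding every 1 s, as on the slide.
    count(lines, sizeMs = 5000L, slideMs = 1000L).foreach(println)
  }
}
```

Each element lands in several overlapping windows, so the same word is counted once per window it falls into. This sketch works on a complete list; a streaming engine produces the same per-window results incrementally as windows close.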
  15. 25.

    Flink Ecosystem
    • Source: Apache Kafka, MapR Streams, AWS Kinesis, RabbitMQ, Twitter, Apache Bahir, …
    • Sink: Apache Kafka, MapR Streams, AWS Kinesis, RabbitMQ, Elasticsearch, HDFS/MapR-FS, …
  16. 29.

    • 10 billion events/day
    • 2 TB of data/day
    • 30 applications
    • 2 PB of storage and growing

    Source: Bouygues Telecom: http://berlin.flink-forward.org/wp-content/uploads/2016/07/Thomas-Lamirault_Mohamed-Amine-Abdessemed-A-brief-history-of-time-with-Apache-Flink.pdf
  17. 38.

    Demonstration
    • Multiple notions of “Time” in Flink
    • Event Time
    • Ingestion Time
    • Processing Time
  18. 39.

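    What Is Event-Time Processing
    [Diagram: Star Wars episodes placed on two time axes.]
    • Processing Time (release year): 1977 Episode IV, 1980 Episode V, 1983 Episode VI, 1999 Episode I, 2002 Episode II, 2005 Episode III, 2015 Episode VII
    • Event Time (story order): Episode I, Episode II, Episode III, Episode IV, Episode V, Episode VI, Episode VII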
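The Star Wars analogy maps directly onto two timestamps per record: when something happened in the story (event time) and when it reached the audience (processing time). A minimal plain-Scala sketch follows, using the release years from the slide. The `Episode` case class and its field names are hypothetical, and the episode number stands in for an event-time timestamp.

```scala
object EventTimeSketch {
  // `eventTime`: position in the story (episode number).
  // `processingTime`: release year, as listed on the slide.
  final case class Episode(title: String, eventTime: Int, processingTime: Int)

  // Records in arrival (release) order.
  val releases: Seq[Episode] = Seq(
    Episode("Episode IV", 4, 1977),
    Episode("Episode V", 5, 1980),
    Episode("Episode VI", 6, 1983),
    Episode("Episode I", 1, 1999),
    Episode("Episode II", 2, 2002),
    Episode("Episode III", 3, 2005),
    Episode("Episode VII", 7, 2015)
  )

  // Order in which the records were seen by the system.
  def byProcessingTime: Seq[String] =
    releases.sortBy(_.processingTime).map(_.title)

  // Order in which the events actually happened.
  def byEventTime: Seq[String] =
    releases.sortBy(_.eventTime).map(_.title)

  def main(args: Array[String]): Unit = {
    println(byProcessingTime.mkString(", "))
    println(byEventTime.mkString(", "))
  }
}
```

Any result that depends on ordering, such as a window or a sequence pattern, differs between the two orderings, which is why out-of-order data needs event-time support.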
  19. 42.

    Complex Event Processing
    • Analyzing a stream of events and drawing conclusions
    • “if A and then B → infer event C”
    • Demanding requirements on stream processor
    • Low latency!
    • Exactly-once semantics & event-time support
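The rule “if A and then B → infer event C” can be sketched as a small state machine over a sequence of events. This is plain Scala rather than a CEP library; the `infer` function and the use of strings as events are assumptions made for illustration. Events between A and B are allowed, and each A is consumed by the first B that follows it.

```scala
object InferSketch {
  // Emits "C" each time an "A" is followed, not necessarily
  // immediately, by a "B". State: whether an unmatched "A" was seen.
  def infer(events: Seq[String]): Seq[String] = {
    val (_, inferred) = events.foldLeft((false, Vector.empty[String])) {
      case ((_, acc), "A")    => (true, acc)         // remember the A
      case ((true, acc), "B") => (false, acc :+ "C") // A then B: infer C
      case (state, _)         => state               // ignore everything else
    }
    inferred
  }

  def main(args: Array[String]): Unit = {
    println(infer(Seq("A", "x", "B", "B", "A", "B")))
  }
}
```

A stream processor has to keep this per-pattern state reliably across failures and apply it in event-time order, which is where the requirements listed on the slide come from.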
  20. 44.

    Order Events
    Process is reflected in a stream of order events:
    Order(orderId, tStamp, “received”)
    Shipment(orderId, tStamp, “shipped”)
    Delivery(orderId, tStamp, “delivered”)
    orderId: Identifies the order
    tStamp: Time at which the event happened
  21. 46.

    CEP to the Rescue
    Define processing and delivery intervals (SLAs):
    ProcessSucc(orderId, tStamp, duration)
    ProcessWarn(orderId, tStamp)
    DeliverySucc(orderId, tStamp, duration)
    DeliveryWarn(orderId, tStamp)
    orderId: Identifies the order
    tStamp: Time when the event happened
    duration: Duration of the processing/delivery
  22. 49.

    Processing: Order → Shipment

    val processingPattern = Pattern
      .begin[Event]("received").subtype(classOf[Order])
      .followedBy("shipped").where(_.status == "shipped")
      .within(Time.hours(1))
  23. 50.

    Processing: Order → Shipment

    val processingPattern = Pattern
      .begin[Event]("received").subtype(classOf[Order])
      .followedBy("shipped").where(_.status == "shipped")
      .within(Time.hours(1))

    val processingPatternStream = CEP.pattern(
      input.keyBy("orderId"),
      processingPattern)
  24. 51.

    Processing: Order → Shipment

    val processingPattern = Pattern
      .begin[Event]("received").subtype(classOf[Order])
      .followedBy("shipped").where(_.status == "shipped")
      .within(Time.hours(1))

    val processingPatternStream = CEP.pattern(
      input.keyBy("orderId"),
      processingPattern)

    val procResult: DataStream[Either[ProcessWarn, ProcessSucc]] =
      processingPatternStream.select {
        (pP, timestamp) => // Timeout handler
          ProcessWarn(pP("received").orderId, timestamp)
      } {
        fP => // Select function
          ProcessSucc(
            fP("received").orderId, fP("shipped").tStamp,
            fP("shipped").tStamp - fP("received").tStamp)
      }
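The outcome of the Order → Shipment pattern can be reproduced on a finite list of events in plain Scala, which may help when reading the CEP code above. This is a sketch under stated assumptions, not the Flink CEP implementation: the field types, the `check` function, treating the one-hour limit as inclusive, and stamping a warning at the deadline are all choices made here for illustration.

```scala
object OrderSlaSketch {
  // Event model named after the slides; field types are assumptions.
  sealed trait Event { def orderId: Long; def tStamp: Long }
  final case class Order(orderId: Long, tStamp: Long) extends Event
  final case class Shipment(orderId: Long, tStamp: Long) extends Event

  final case class ProcessSucc(orderId: Long, tStamp: Long, duration: Long)
  final case class ProcessWarn(orderId: Long, tStamp: Long)

  val OneHourMs: Long = 60L * 60L * 1000L

  // For each order (by receive time): Right(ProcessSucc) if a shipment
  // for the same orderId follows within `limitMs`, otherwise
  // Left(ProcessWarn) stamped at the deadline.
  def check(events: Seq[Event],
            limitMs: Long = OneHourMs): Seq[Either[ProcessWarn, ProcessSucc]] =
    events.collect { case o: Order => o }.sortBy(_.tStamp).map { order =>
      val shipped = events.collectFirst {
        case s: Shipment
            if s.orderId == order.orderId &&
              s.tStamp >= order.tStamp &&
              s.tStamp - order.tStamp <= limitMs => s
      }
      shipped match {
        case Some(s) =>
          Right(ProcessSucc(order.orderId, s.tStamp, s.tStamp - order.tStamp))
        case None =>
          Left(ProcessWarn(order.orderId, order.tStamp + limitMs))
      }
    }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Order(1L, 0L), Shipment(1L, 30L * 60L * 1000L), // shipped after 30 min
      Order(2L, 0L), Shipment(2L, 2L * OneHourMs)     // shipped after 2 h
    )
    check(events).foreach(println)
  }
}
```

The `Either` result mirrors the two handlers on the slide: the left side corresponds to the timeout handler and the right side to the select function.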
  25. 54.

    The End
    • Process events in real time and/or batch
    • Complex Event Processing (CEP)
    • Many other things to discover
    • Deployment
    • High Availability
    • Table/Relational API
    • …
    https://mapr.com/ebooks/
  26. 55.

    Flink Community & Thanks to: Kostas Tzoumas, Stephan Ewen, Fabian Hueske, Till Rohrmann, Jamie Grier