Stream Processing with Apache Flink

Kristian Kottke

September 13, 2018

Transcript

  1. Whoami: Kristian Kottke › Senior Software Engineer at iteratec › Interests: Software Architecture, Big Data Technologies
     Contact: [email protected] › github.com/kkottke › xing.to/kkottke › speakerdeck.com/kkottke
  2. Windowing building blocks: Window › Watermark › Trigger
     Handling late data: discard it › redirect it into a separate stream › update the already emitted result
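
     The three options for handling late data map onto the windowing API roughly as follows. A minimal sketch, reusing the keyed Tick stream from the later slides, with an illustrative allowed lateness and a hypothetical late-ticks output tag:

     final OutputTag<Tick> lateTicks = new OutputTag<Tick>("late-ticks") {};

     SingleOutputStreamOperator<Tick> maxValues = ticks
         .keyBy("id")
         .timeWindow(Time.seconds(10))
         .allowedLateness(Time.seconds(30))   // "update result": late ticks within 30s re-fire the window
         .sideOutputLateData(lateTicks)       // "redirect": anything later goes to the side output
         .maxBy("value");                     // without either call, late ticks are simply discarded

     DataStream<Tick> redirected = maxValues.getSideOutput(lateTicks);
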
  3. Processing guarantees: at most once › at least once › exactly once (processor state)
     End-to-end exactly once additionally requires a resettable / replayable source and an idempotent sink
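
     The guarantee for processor state is configured through checkpointing. A minimal sketch with illustrative interval, pause, and state-backend values; end-to-end exactly once additionally needs the source and sink properties listed above:

     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

     // snapshot all operator state consistently every 10 seconds
     env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);
     env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);

     // where snapshots are stored, so state can be restored after a failure
     env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));
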
  4. Apache Flink: "...framework and distributed processing engine for stateful computations over unbounded and bounded data streams"
     Input: streams (transactions, logs, IoT, clicks, ...) and historic data from storage › Output: applications, databases, storage
     (following https://ci.apache.org/projects/flink/flink-docs-release-1.6/)
  5. The Flink stack (following https://ci.apache.org/projects/flink/flink-docs-release-1.6/):
     Storage: Files, HDFS, S3, JDBC, Kafka, ... › Deployment: Local, Cluster, Cloud › Runtime › APIs: DataStream API, DataSet API › Libraries: Table & SQL, CEP, FlinkML, Gelly
  6. Sources, transformations and sinks with the DataStream API:
     DataStream<String> messages = env.addSource(new FlinkKafkaConsumer<>(...));
     DataStream<Tick> ticks = messages.map(Tick::parse);
     DataStream<Tick> maxValues = ticks
         .keyBy("id")
         .timeWindow(Time.seconds(10))
         .maxBy("value");
     maxValues.addSink(new BucketingSink<>("/path/to/dir"));
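
     Filled out into a complete job, the pipeline above might look like the sketch below; the topic name, Kafka properties, and output path are placeholders, and Tick is the speaker's domain class:

     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

     Properties kafkaProperties = new Properties();
     kafkaProperties.setProperty("bootstrap.servers", "localhost:9092");

     DataStream<String> messages = env.addSource(
         new FlinkKafkaConsumer<>("ticks", new SimpleStringSchema(), kafkaProperties));

     DataStream<Tick> ticks = messages.map(Tick::parse);

     DataStream<Tick> maxValues = ticks
         .keyBy("id")
         .timeWindow(Time.seconds(10))
         .maxBy("value");

     maxValues.addSink(new BucketingSink<>("/path/to/dir"));

     // nothing runs until the job graph is submitted
     env.execute("Max tick value per 10-second window");
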
  7. Window Functions:
     DataStream<String> inputStream = env.addSource(new FlinkKafkaConsumer<>(...));
     DataStream<Tick> ticks = inputStream
         .map(Tick::parse)
         .assignTimestampsAndWatermarks(new PeriodicAssigner(Time.seconds(5)));
     DataStream<Tick> maxValues = ticks
         .keyBy("id")
         .timeWindow(Time.seconds(10))
         .maxBy("value");
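
     PeriodicAssigner is the speaker's own class and is not shown on the slides. One plausible implementation, assuming Tick carries an event-time timestamp field, is Flink's bounded-out-of-orderness extractor:

     import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
     import org.apache.flink.streaming.api.windowing.time.Time;

     public class PeriodicAssigner extends BoundedOutOfOrdernessTimestampExtractor<Tick> {

         public PeriodicAssigner(Time maxOutOfOrderness) {
             // watermarks trail the highest seen timestamp by maxOutOfOrderness (5 seconds above)
             super(maxOutOfOrderness);
         }

         @Override
         public long extractTimestamp(Tick tick) {
             // use the event time recorded in the tick itself (assumed field), not processing time
             return tick.timestamp;
         }
     }
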
  8. Window Functions with a custom trigger:
     DataStream<Tick> performanceValues = ticks
         .keyBy("id")
         .timeWindow(Time.seconds(10))
         .trigger(new ThresholdTrigger(10d))
         .process(new PerformanceFunction());

     // inside PerformanceFunction (a ProcessWindowFunction)
     public void process(Tuple key, Context ctx, Iterable<Tick> ticks, Collector<Tick> out) {
         /* calculate min / max value */
         out.collect(tick);
     }
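
     ThresholdTrigger is likewise custom. A sketch of how such a trigger might look, assuming it fires early whenever a tick's value exceeds the threshold and fires-and-purges at the regular end of the window:

     import org.apache.flink.streaming.api.windowing.triggers.Trigger;
     import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
     import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

     public class ThresholdTrigger extends Trigger<Tick, TimeWindow> {

         private final double threshold;

         public ThresholdTrigger(double threshold) {
             this.threshold = threshold;
         }

         @Override
         public TriggerResult onElement(Tick tick, long timestamp, TimeWindow window, TriggerContext ctx) {
             // keep the regular end-of-window firing by registering the window's event-time timer
             ctx.registerEventTimeTimer(window.maxTimestamp());
             // fire early (without purging) as soon as the threshold is exceeded
             return tick.value > threshold ? TriggerResult.FIRE : TriggerResult.CONTINUE;
         }

         @Override
         public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
             return time == window.maxTimestamp() ? TriggerResult.FIRE_AND_PURGE : TriggerResult.CONTINUE;
         }

         @Override
         public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
             return TriggerResult.CONTINUE;
         }

         @Override
         public void clear(TimeWindow window, TriggerContext ctx) {
             ctx.deleteEventTimeTimer(window.maxTimestamp());
         }
     }

     Because the early firings use FIRE rather than FIRE_AND_PURGE, the window contents remain available for the final end-of-window evaluation.
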
  9. Timer Service:
     public void processElement(Tick tick, Context ctx, Collector<Tick> out) {
         ...
         ctx.timerService().registerEventTimeTimer(timerTimestamp);
         ...
     }

     public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tick> out) {
         ...
         ctx.output(outputTag, ctx.getCurrentKey());
         ...
     }
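
     Wired into a complete function, the two callbacks might be used like this. The KeyedProcessFunction subclass, the one-minute timer offset, the output tag, and the Tick.timestamp field are illustrative assumptions; a real timeout detector would also keep the last timer timestamp in keyed state and delete superseded timers:

     import org.apache.flink.api.java.tuple.Tuple;
     import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
     import org.apache.flink.util.Collector;
     import org.apache.flink.util.OutputTag;

     public class TickTimeoutFunction extends KeyedProcessFunction<Tuple, Tick, Tick> {

         // side output receiving the keys whose timers fired
         private final OutputTag<Tuple> outputTag = new OutputTag<Tuple>("expired-keys") {};

         @Override
         public void processElement(Tick tick, Context ctx, Collector<Tick> out) {
             out.collect(tick);
             // ask to be called back one minute of event time after this tick
             ctx.timerService().registerEventTimeTimer(tick.timestamp + 60_000);
         }

         @Override
         public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tick> out) {
             // emit the current key to the side output when its timer fires
             ctx.output(outputTag, ctx.getCurrentKey());
         }
     }
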
  10. Value State:
      DataStream<Tick> priceAlerts = ticks
          .keyBy("id")
          .flatMap(new PriceAlertFunction(10d));

      // inside PriceAlertFunction
      public void open(Configuration parameters) {
          // ...
          previousPriceState = getRuntimeContext().getState(previousPriceDescriptor);
      }

      public void flatMap(Tick tick, Collector<Tick> out) throws Exception {
          if (Math.abs(tick.value - previousPriceState.value()) > threshold) {
              out.collect(tick);
          }
          previousPriceState.update(tick.value);
      }
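
     A filled-out version of the PriceAlertFunction fragment, assuming Tick.value is a double; the added null check covers the very first tick per key, which the fragment above does not handle:

     import org.apache.flink.api.common.functions.RichFlatMapFunction;
     import org.apache.flink.api.common.state.ValueState;
     import org.apache.flink.api.common.state.ValueStateDescriptor;
     import org.apache.flink.configuration.Configuration;
     import org.apache.flink.util.Collector;

     public class PriceAlertFunction extends RichFlatMapFunction<Tick, Tick> {

         private final double threshold;

         private transient ValueState<Double> previousPriceState;

         public PriceAlertFunction(double threshold) {
             this.threshold = threshold;
         }

         @Override
         public void open(Configuration parameters) {
             // keyed state: one previous price per tick id
             ValueStateDescriptor<Double> previousPriceDescriptor =
                     new ValueStateDescriptor<>("previousPrice", Double.class);
             previousPriceState = getRuntimeContext().getState(previousPriceDescriptor);
         }

         @Override
         public void flatMap(Tick tick, Collector<Tick> out) throws Exception {
             Double previousPrice = previousPriceState.value();
             // alert when the price moved by more than the configured threshold
             if (previousPrice != null && Math.abs(tick.value - previousPrice) > threshold) {
                 out.collect(tick);
             }
             previousPriceState.update(tick.value);
         }
     }
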
  11. Broadcast State:
      DataStream<Threshold> thresholds = env.addSource(...);
      BroadcastStream<Threshold> thresholdBroadcast = thresholds.broadcast(thresholdsDescriptor);

      DataStream<Tick> priceAlerts = ticks
          .keyBy("id")
          .connect(thresholdBroadcast)
          .process(new UpdatablePriceDiffFunction());
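
     The UpdatablePriceDiffFunction itself is not shown. A sketch under the assumption that it combines the preceding value-state logic with thresholds that can be updated via the broadcast stream; Threshold.id, Threshold.value, and a String-typed Tick.id are assumptions:

     import org.apache.flink.api.common.state.MapStateDescriptor;
     import org.apache.flink.api.common.state.ValueState;
     import org.apache.flink.api.common.state.ValueStateDescriptor;
     import org.apache.flink.api.java.tuple.Tuple;
     import org.apache.flink.configuration.Configuration;
     import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
     import org.apache.flink.util.Collector;

     public class UpdatablePriceDiffFunction
             extends KeyedBroadcastProcessFunction<Tuple, Tick, Threshold, Tick> {

         // must match the descriptor passed to thresholds.broadcast(...)
         private final MapStateDescriptor<String, Threshold> thresholdsDescriptor =
                 new MapStateDescriptor<>("thresholds", String.class, Threshold.class);

         private transient ValueState<Double> previousPriceState;

         @Override
         public void open(Configuration parameters) {
             previousPriceState = getRuntimeContext().getState(
                     new ValueStateDescriptor<>("previousPrice", Double.class));
         }

         @Override
         public void processBroadcastElement(Threshold threshold, Context ctx, Collector<Tick> out) throws Exception {
             // every parallel instance stores / overwrites the threshold for this id
             ctx.getBroadcastState(thresholdsDescriptor).put(threshold.id, threshold);
         }

         @Override
         public void processElement(Tick tick, ReadOnlyContext ctx, Collector<Tick> out) throws Exception {
             Threshold threshold = ctx.getBroadcastState(thresholdsDescriptor).get(tick.id);
             Double previousPrice = previousPriceState.value();
             // alert when the price difference exceeds the currently broadcast threshold for this id
             if (threshold != null && previousPrice != null
                     && Math.abs(tick.value - previousPrice) > threshold.value) {
                 out.collect(tick);
             }
             previousPriceState.update(tick.value);
         }
     }
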
  12. Wrap Up › Data usually occurs in streams › Batch processing does not meet modern requirements for continuous data streams
      Stream Processing › powerful › real-time / low latency › intuitive › higher, but manageable, complexity