Stream Processing with Apache Flink

Kristian Kottke

September 13, 2018

Transcript

  1. ©iteratec Whoami
    Kristian Kottke
    › Senior Software Engineer -> iteratec
    Interests
    › Software Architecture
    › Big Data Technologies
    github.com/kkottke, xing.to/kkottke, speakerdeck.com/kkottke
  2. Window
    › Watermark
    › Trigger
    › Late Data
      › Discard
      › Redirect into separate stream
      › Update result
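The interplay of window, watermark and late data on this slide can be illustrated without any Flink dependency. The following is a minimal plain-Java sketch for a single key, not Flink's implementation: the class `TumblingWindowSketch` and its methods are hypothetical names, a watermark `W` is taken to mean "no further events with timestamp <= W are expected", and late events are redirected into a separate list (the second of the three options above).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of tumbling event-time windows for one key (not the Flink API). */
final class TumblingWindowSketch {
    private final long windowSize;
    private long watermark = Long.MIN_VALUE;
    // Open windows: window start -> maximum value seen so far.
    private final TreeMap<Long, Double> openWindows = new TreeMap<>();
    final List<String> results = new ArrayList<>();
    final List<String> lateEvents = new ArrayList<>();

    TumblingWindowSketch(long windowSize) {
        this.windowSize = windowSize;
    }

    /** Assigns the event to its window, or redirects it if that window has already fired. */
    void onEvent(long timestamp, double value) {
        long start = timestamp - Math.floorMod(timestamp, windowSize);
        long lastTimestampInWindow = start + windowSize - 1;
        if (lastTimestampInWindow <= watermark) {
            lateEvents.add(timestamp + ":" + value); // late data: redirect
            return;
        }
        openWindows.merge(start, value, Math::max);
    }

    /** Advances the watermark and fires every window it has passed. */
    void onWatermark(long newWatermark) {
        watermark = Math.max(watermark, newWatermark);
        while (!openWindows.isEmpty()
                && openWindows.firstKey() + windowSize - 1 <= watermark) {
            Map.Entry<Long, Double> w = openWindows.pollFirstEntry();
            long end = w.getKey() + windowSize;
            results.add("[" + w.getKey() + "," + end + ") max=" + w.getValue());
        }
    }

    public static void main(String[] args) {
        TumblingWindowSketch w = new TumblingWindowSketch(10);
        w.onEvent(1, 3.0);
        w.onEvent(7, 5.0);
        w.onWatermark(9);   // fires window [0,10)
        w.onEvent(8, 9.0);  // arrives after its window fired
        System.out.println(w.results);
        System.out.println(w.lateEvents);
    }
}
```

Discarding late data would simply drop the event instead of adding it to `lateEvents`; updating the result would require keeping the fired window's state around for some allowed lateness.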
  3. Guarantees
    › At most once
    › At least once
    › Exactly once
      › Processor State
    › End-2-End Exactly once
      › Resettable / Replayable Source & Sink
      › Idempotency
    Diagram labels: Source, State, Sink
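Why idempotency matters for end-to-end guarantees can be shown with a small plain-Java sketch. It assumes a replayable source that re-delivers records after a failure (at-least-once delivery); the class `IdempotentSinkSketch`, the `Delivery` type and both sink variants are hypothetical illustrations, not Flink classes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: the same replayed input written to two kinds of sink. */
final class IdempotentSinkSketch {

    /** One delivered record, identified by a stable id. */
    static final class Delivery {
        final long id;
        final long amount;

        Delivery(long id, long amount) {
            this.id = id;
            this.amount = amount;
        }
    }

    /** Non-idempotent sink: every delivery is added again, so replays are double counted. */
    static long appendTotal(List<Delivery> deliveries) {
        long total = 0;
        for (Delivery d : deliveries) {
            total += d.amount;
        }
        return total;
    }

    /** Idempotent sink: an upsert by id, so writing the same record twice changes nothing. */
    static long upsertTotal(List<Delivery> deliveries) {
        Map<Long, Long> store = new HashMap<>();
        for (Delivery d : deliveries) {
            store.put(d.id, d.amount);
        }
        long total = 0;
        for (long amount : store.values()) {
            total += amount;
        }
        return total;
    }

    public static void main(String[] args) {
        // Records 2 and 3 are delivered twice: once before a failure, once on replay.
        List<Delivery> deliveries = Arrays.asList(
                new Delivery(1, 10), new Delivery(2, 20), new Delivery(3, 30),
                new Delivery(2, 20), new Delivery(3, 30));
        System.out.println(appendTotal(deliveries));
        System.out.println(upsertTotal(deliveries));
    }
}
```

With the upsert variant, at-least-once delivery produces the same stored result as exactly-once delivery would, which is the role idempotency plays on the slide.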
  4. Apache Flink
    "...framework and distributed processing engine for stateful computations over unbounded and bounded data streams"
    Diagram labels: Databases, Stream, Storage, Application, Streams, Historic Data, Transactions, Logs, IoT, Clicks, ...
    Based on: https://ci.apache.org/projects/flink/flink-docs-release-1.6/
  5. Apache Flink
    › Libraries: CEP, Table & SQL, FlinkML, Gelly
    › API: DataStream API, DataSet API
    › Runtime
    › Deployment: Local, Cluster, Cloud
    › Storage: Files, HDFS, S3, JDBC, Kafka, ...
    Based on: https://ci.apache.org/projects/flink/flink-docs-release-1.6/
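  6. Apache Flink
    // Source: read raw messages from Kafka
    DataStream<String> messages = env.addSource(
        new FlinkKafkaConsumer<>(...));
    // Transformation: parse each message into a Tick
    DataStream<Tick> ticks = messages.map(Tick::parse);
    // Transformation: per id, keep the tick with the highest value in each 10-second window
    DataStream<Tick> maxValues = ticks
        .keyBy("id")
        .timeWindow(Time.seconds(10))
        .maxBy("value");
    // Sink: write the results to files
    maxValues.addSink(new BucketingSink<>("/path/to/dir"));
    Diagram: Source, Transformation, Transformation, Sink (each one an operator, OP)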
  7. Window Functions
    DataStream<String> inputStream = env.addSource(new FlinkKafkaConsumer<>(...));
    DataStream<Tick> ticks = inputStream
        .map(Tick::parse)
        .assignTimestampsAndWatermarks(new PeriodicAssigner(Time.seconds(5)));
    DataStream<Tick> maxValues = ticks
        .keyBy("id")
        .timeWindow(Time.seconds(10))
        .maxBy("value");
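The slide does not show `PeriodicAssigner`. Assuming it behaves like a bounded-out-of-orderness assigner, where `Time.seconds(5)` is the maximum expected delay, its core logic might look roughly like the plain-Java sketch below. The class and method names are hypothetical and it deliberately does not implement any Flink interface.

```java
/** Hypothetical sketch of a periodic, bounded-out-of-orderness watermark generator. */
final class BoundedOutOfOrdernessSketch {
    private final long maxOutOfOrdernessMs;
    private long maxSeenTimestamp = Long.MIN_VALUE;

    BoundedOutOfOrdernessSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    /** Called per element: remembers the highest event timestamp seen so far. */
    long extractTimestamp(long eventTimestampMs) {
        maxSeenTimestamp = Math.max(maxSeenTimestamp, eventTimestampMs);
        return eventTimestampMs;
    }

    /** Called periodically: the watermark trails the highest timestamp by the allowed delay. */
    long currentWatermark() {
        if (maxSeenTimestamp == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // nothing seen yet
        }
        return maxSeenTimestamp - maxOutOfOrdernessMs;
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessSketch assigner = new BoundedOutOfOrdernessSketch(5_000);
        assigner.extractTimestamp(10_000);
        assigner.extractTimestamp(8_000);  // out of order, does not move the watermark back
        assigner.extractTimestamp(12_000);
        System.out.println(assigner.currentWatermark());
    }
}
```

A larger bound tolerates more disorder but delays window results by the same amount, since windows only fire once the watermark passes their end.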
  8. Window Functions
    DataStream<Tick> performanceValues = ticks
        .keyBy("id")
        .timeWindow(Time.seconds(10))
        .trigger(new ThresholdTrigger(10d))
        .process(new PerformanceFunction());

    public void process(Tuple key, Context ctx, Iterable<Tick> ticks, Collector<Tick> out) {
        /* calculate min / max value */
        out.collect(tick);
    }
  9. Timer Service
    public void processElement(Tick tick, Context ctx, Collector<Tick> out) {
        ...
        ctx.timerService().registerEventTimeTimer(timerTimestamp);
        ...
    }

    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tick> out) {
        ...
        ctx.output(outputTag, ctx.getCurrentKey());
        ...
    }
  10. Value State
    DataStream<Tick> priceAlerts = ticks
        .keyBy("id")
        .flatMap(new PriceAlertFunction(10d));

    public void open(Configuration parameters) {
        // ...
        previousPriceState = getRuntimeContext().getState(previousPriceDescriptor);
    }

    public void flatMap(Tick tick, Collector<Tick> out) throws Exception {
        if (Math.abs(tick.value - previousPriceState.value()) > threshold) {
            out.collect(tick);
        }
        previousPriceState.update(tick.value);
    }
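The per-key behaviour of the value state can be imitated in plain Java with a map from key to previous price. This is only a sketch of the logic, with the hypothetical class `PriceAlertSketch` standing in for `PriceAlertFunction`; in Flink the keyed state is scoped to the current key automatically. One detail worth noting: `ValueState.value()` returns `null` for a key that has no value yet (unless the descriptor defines a default, which the slide does not show), so the sketch handles the first tick of each key explicitly.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a keyed "previous price" state and a price-jump alert. */
final class PriceAlertSketch {
    private final double threshold;
    // Stands in for keyed ValueState<Double>: one previous price per id.
    private final Map<String, Double> previousPriceById = new HashMap<>();

    PriceAlertSketch(double threshold) {
        this.threshold = threshold;
    }

    /** Returns true if the price moved by more than the threshold since the last tick of this id. */
    boolean onTick(String id, double value) {
        Double previous = previousPriceById.get(id); // null on the first tick of an id
        boolean alert = previous != null && Math.abs(value - previous) > threshold;
        previousPriceById.put(id, value);
        return alert;
    }

    public static void main(String[] args) {
        PriceAlertSketch alerts = new PriceAlertSketch(10d);
        System.out.println(alerts.onTick("A", 100d)); // first tick, nothing to compare against
        System.out.println(alerts.onTick("A", 120d)); // jump of 20
    }
}
```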
  11. Broadcast State
    DataStream<Threshold> thresholds = env.addSource(...);
    BroadcastStream<Threshold> thresholdBroadcast = thresholds.broadcast(thresholdsDescriptor);
    DataStream<Tick> priceAlerts = ticks
        .keyBy("id")
        .connect(thresholdBroadcast)
        .process(new UpdatablePriceDiffFunction());
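The idea behind the broadcast state is that the threshold is no longer a constructor argument but can be changed while the job runs. The slide shows neither `Threshold` nor `UpdatablePriceDiffFunction`, so the plain-Java sketch below makes its own assumptions: thresholds are keyed by tick id, and a default applies until one has been broadcast. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: price alerts whose thresholds can be updated at runtime. */
final class BroadcastThresholdSketch {
    private final double defaultThreshold;
    // Stands in for the broadcast state: the same id -> threshold map on every parallel instance.
    private final Map<String, Double> thresholdById = new HashMap<>();
    // Stands in for the keyed state: previous price per id.
    private final Map<String, Double> previousPriceById = new HashMap<>();

    BroadcastThresholdSketch(double defaultThreshold) {
        this.defaultThreshold = defaultThreshold;
    }

    /** Broadcast side: a new threshold replaces the old one for that id. */
    void onThreshold(String id, double threshold) {
        thresholdById.put(id, threshold);
    }

    /** Keyed side: compares against whichever threshold is current when the tick arrives. */
    boolean onTick(String id, double value) {
        double threshold = thresholdById.getOrDefault(id, defaultThreshold);
        Double previous = previousPriceById.put(id, value); // put returns the old value or null
        return previous != null && Math.abs(value - previous) > threshold;
    }

    public static void main(String[] args) {
        BroadcastThresholdSketch alerts = new BroadcastThresholdSketch(10d);
        alerts.onTick("A", 100d);
        System.out.println(alerts.onTick("A", 108d)); // change of 8, default threshold 10
        alerts.onThreshold("A", 5d);
        System.out.println(alerts.onTick("A", 114d)); // change of 6, updated threshold 5
    }
}
```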
  12. Wrap Up
    › Data usually occurs in streams
    › Batch processing does not meet modern requirements for continuous data streams
    › Stream Processing
      › Powerful
      › Higher / manageable complexity
      › Real-time / low latency
      › Intuitiveness