Apache Flink 1.7 and Beyond

The streaming space is evolving at an ever-increasing pace. This trend is also reflected in Apache Flink, whose latest major release again included many new features. For streaming practitioners, it is essential to learn about Flink's newest capabilities because they often enable completely new use cases and applications.

In this talk, I give a brief overview of Apache Flink and its latest feature additions, including the integration of CEP with streaming SQL, proper support for state evolution, temporal joins, and more. I also put these features in perspective with respect to Flink's future direction by sharing insights into ongoing development threads in the community. The goal is to give attendees a better picture of Flink's current and future capabilities.

Till Rohrmann

December 20, 2018


  1. Apache Flink® 1.7 and Beyond. Company: data Artisans. Position: Engineering Lead. Speaker: Till Rohrmann (@stsffap)
  2. Original creators of Apache Flink®. dA Platform: Stream Processing for the Enterprise
  3. What is Apache Flink? Stateful Computations Over Data Streams
    • Batch Processing: process static and historic data
    • Data Stream Processing: realtime results from data streams
    • Event-driven Applications: data-driven actions and services
  4. Flink 1.7: What happened so far?

  5. Flink 1.7.0 in Numbers
    • Contributors: 112
    • Resolved issues: 430
    • Commits: 970
    • Changed LOC: +103824/-63124
  6. Flink Applications Need to Evolve
    • E.g. changing requirements, new algorithms, better serializers, bug fixes, etc.
    • Expensive to restart application from scratch (maintain state)
  7. State Schema Evolution
    • Support for changing state schema
    • Adding/Removing fields
    • Changing type of fields
    • Currently fully supported when using Avro types
    Related talk: “Upgrading Stateful Flink Streaming Applications: State of the Union” by Tzu-Li Tai, today @ 5:20 pm, Room 2
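The field additions and removals listed on this slide can be illustrated without Flink. The sketch below is a plain-Python emulation of reader-schema resolution in the style Avro uses: fields the new schema adds take their declared default, and fields it drops are ignored. The function `migrate_record`, the schema dictionaries, and the field names are all hypothetical; this is neither Flink's nor Avro's API, only a picture of the idea.

```python
from typing import Any, Dict

# Hypothetical schemas: field name -> default used when the field is missing.
SCHEMA_V1 = {"user_id": None, "count": 0}
SCHEMA_V2 = {"user_id": None, "count": 0, "country": "unknown"}  # adds "country"
SCHEMA_V3 = {"user_id": None, "country": "unknown"}              # removes "count"


def migrate_record(record: Dict[str, Any], reader_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Project a record written with an older schema onto the reader schema.

    Fields present in the reader schema but absent from the record receive
    the reader's default; fields absent from the reader schema are dropped.
    """
    return {name: record.get(name, default) for name, default in reader_schema.items()}


old_state = {"user_id": "u1", "count": 7}  # state written with SCHEMA_V1
print(migrate_record(old_state, SCHEMA_V2))
print(migrate_record(old_state, SCHEMA_V3))
```

The point of the design is that old state stays readable after the application is upgraded, so the job does not have to be restarted from scratch.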
  8. Converting Currencies (figure: timestamps 7:12pm, 9:37am, 8:45am; rates € 1, $ 1.13, CN¥ 7.8)
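  9. Temporal Tables and Joins (figure: a rates table with columns Currency, Rate, Time containing CN¥ 7.8 at time 3, CN¥ 7.89 at time 5, CN¥ 7.75 at time 9; further figure values: 13, 11, 7 and 15, 14, 12, 7, 4)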
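A temporal join looks up, for each incoming row, the version of the rate that was valid at that row's time. A minimal sketch of the lookup in plain Python follows. The three rate versions are the ones readable on the slide; the orders, the function names, and the inner-join behaviour for rows that precede the first version are assumptions made for illustration, and nothing here is Flink API.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Versioned CN¥ rate history as (time, rate), sorted by time (values from the slide).
RATE_HISTORY: List[Tuple[int, float]] = [(3, 7.8), (5, 7.89), (9, 7.75)]


def rate_as_of(history: List[Tuple[int, float]], time: int) -> Optional[float]:
    """Return the rate version valid at `time`, or None if no version exists yet."""
    times = [t for t, _ in history]
    idx = bisect_right(times, time)  # number of versions with t <= time
    return history[idx - 1][1] if idx else None


def temporal_join(orders, history):
    """Join each (time, amount) order with the rate valid at the order's time."""
    joined = []
    for time, amount in orders:
        rate = rate_as_of(history, time)
        if rate is not None:  # assumed inner join: drop orders before the first version
            joined.append((time, amount, rate, round(amount * rate, 2)))
    return joined


# Hypothetical orders as (time, amount).
orders = [(2, 10), (4, 13), (7, 11), (12, 7)]
for row in temporal_join(orders, RATE_HISTORY):
    print(row)
```

The order at time 2 is dropped because no rate version exists yet; the others pick up the versions from times 3, 5, and 9 respectively.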
  10. SQL for Pattern Analysis: SELECT * from ?
  11. […] driverId ORDER BY rideTime MEASURES S.rideId as sRideId AFTER MATCH SKIP PAST LAST ROW PATTERN (S M{2,} E) DEFINE S AS S.isStart = true, M AS M.rideId <> S.rideId, E AS E.isStart = false AND E.rideId = S.rideId )
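To make the pattern `(S M{2,} E)` concrete, here is a small Python sketch that applies the same three conditions to an in-memory list of rows. It assumes the rows already belong to one driver and are sorted by ride time, since the beginning of the query is cut off in the transcript. The function `find_matches` and the sample rows are hypothetical; the code imitates the matching logic and is not how Flink evaluates the query.

```python
from typing import Dict, List


def find_matches(rows: List[Dict]) -> List[int]:
    """Return S.rideId for every match of the pattern (S M{2,} E).

    S: a start event; M: at least two consecutive events of other rides;
    E: the end event of the ride that S started.
    """
    matches = []
    i = 0
    while i < len(rows):
        s = rows[i]
        if not s["isStart"]:          # S AS S.isStart = true
            i += 1
            continue
        j = i + 1
        # M AS M.rideId <> S.rideId, consumed greedily
        while j < len(rows) and rows[j]["rideId"] != s["rideId"]:
            j += 1
        m_count = j - i - 1
        # rows[j], if present, has S's rideId; E also requires isStart = false
        is_end = j < len(rows) and not rows[j]["isStart"]
        if m_count >= 2 and is_end:
            matches.append(s["rideId"])   # MEASURES S.rideId AS sRideId
            i = j + 1                     # AFTER MATCH SKIP PAST LAST ROW
        else:
            i += 1
    return matches


# Hypothetical events: ride 2 starts and ends while ride 1 is still running.
rows = [
    {"rideId": 1, "isStart": True},
    {"rideId": 2, "isStart": True},
    {"rideId": 2, "isStart": False},
    {"rideId": 1, "isStart": False},
    {"rideId": 3, "isStart": True},
    {"rideId": 3, "isStart": False},
]
print(find_matches(rows))
```

Ride 1 matches because two events of another ride fall between its start and end; ride 3 does not, because nothing happens in between.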
  12. More SQL Improvements
    • Elasticsearch 6 Table Sink
    • Support for views in SQL Client
    • More built-in functions: TO_BASE64, LOG2, REPLACE, COSH, …
    Related talk: “Flink Streaming SQL 2018” by Piotr Nowojski, today @ 4:00 pm, Room 2
  13. Other Notable Features
    • Scala 2.12 Support
    • Exactly-once S3 StreamingFileSink
    • Kafka 2.0 connector
    • Versioned REST API
    • Removal of legacy mode
  14. Flink 1.8+: What is happening next?

  15. Capability Spectrum (figure: a spectrum from offline to real time with the labels Batch, Event-driven applications, Streaming analytics, Strict SLA applications, and Flink)
  16. Flink as a Library
    • Deploying Flink applications should be as easy as starting a process
    • Bundle application code and Flink into a single image
    • Process connects to other application processes and figures out its role
    • Taking the cluster out of the equation
    (figure: processes P1, P2, P3, P4 and a new process)
  17. Reactive vs. Active
    • Active mode
      • Flink is aware of underlying cluster framework
      • Flink allocates resources
      • E.g. existing YARN and Mesos integration
    • Reactive mode
      • Flink is oblivious to its runtime environment
      • External system allocates and releases resources
      • Flink scales with respect to available resources
      • Relevant for environments: Kubernetes, Docker, as a library
  18. Dynamic Scaling
    • Latency
    • Throughput
    • Resource utilization
    • Connector signals
  19. Batch-Streaming Unification
    • No fundamental difference between batch and stream processing
    • Batch allows optimizations because data is bounded and "complete"
    • Batch and streaming are still treated separately from the task level upwards
    • Working toward a single runtime for batch and streaming workloads
  20. Flink Scheduler
    • Lazy scheduling (batch case)
    • Deploy tasks starting from the sources
    • Whenever data is produced start consumers
    • Scheduling of idling tasks → resource under-utilization
    (figure: sources and joins, with build side and probe side inputs)
  21. Batch Scheduler
    • More efficient scheduling by taking dependencies into account
    • E.g. probe side is only scheduled after build side has been processed
    (figure: sources and joins, with build side and probe side inputs, annotated with the scheduling order (1), (2), (2), (3))
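Dependency-aware scheduling of this kind amounts to releasing tasks in waves: a task becomes eligible only once everything it has to wait for has finished. The sketch below groups a small task graph into such waves using Python's standard `graphlib` module (Python 3.9+). The task names and the dependency map are hypothetical and only loosely modelled on the figure; this is not Flink's scheduler.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def schedule_in_stages(must_finish_first: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into stages; a task runs only after its prerequisites are done."""
    sorter = TopologicalSorter(must_finish_first)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are cyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose prerequisites are done
        stages.append(ready)
        sorter.done(*ready)                 # mark the whole stage as finished
    return stages


# Hypothetical plan: task -> tasks that must finish before it is scheduled.
plan = {
    "src_a": set(),         # build side of join_1
    "src_b": {"src_a"},     # probe side of join_1 waits for the build side
    "join_1": {"src_a"},    # also serves as the build side of join_2
    "src_c": {"join_1"},    # probe side of join_2
    "join_2": {"join_1"},
}
for number, stage in enumerate(schedule_in_stages(plan), start=1):
    print(number, stage)
```

Compared with lazy scheduling from the sources, no probe-side task occupies a slot while its build side is still being produced.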
  22. Extendable Scheduler
    • Make Flink’s scheduler extendable & pluggable
    • Scheduler considers dependencies and reacts to signals from ExecutionGraph
    • Specialized scheduler for different use cases
    (figure: Scheduler with the variants Streaming Scheduler, Batch Scheduler, Speculative Scheduler)
  23. Flink’s Shuffle Service
    • Tasks own produced result partitions
    • Containers cannot be freed until result is consumed
    • One implementation for streaming and batch loads
    (figure: Result partition, Container)
  24. External & Persistent Shuffle Service
    • Result partitions are written to an external shuffle service
    • Containers can be freed early
    • Different implementations based on use case
    (figure: External shuffle service, e.g. Yarn, DFS)
  25. End-to-end SQL Only Pipelines
    • Support for external catalogs (Confluent Schema Registry, Hive Meta Store)
    • Data definition language (DDL)
    (figure: Hive Meta Store, Table Source, Table Sink, SQL Query, Input schema information, Output schema information)
  26. TL;DL
    • Flink 1.7.0 added many new features around SQL, connectors and state evolution
    • A lot of new features in the pipeline
    • Join the community!
    • Subscribe to mailing lists
    • Participate in Flink development
    • Become active
  27. Thanks