Apache Flink 1.7 and Beyond

The streaming space is evolving at an ever-increasing pace. This trend is also reflected in Apache Flink, whose latest major release again included many new features. For streaming practitioners it is essential to learn about Flink's newest capabilities because they often enable completely new use cases and applications.

In this talk, I want to give a brief overview of Apache Flink and its latest feature additions, including the integration of CEP with streaming SQL, proper support for state evolution, temporal joins and many more. Furthermore, I want to put them in perspective with respect to Flink's future direction by giving some insights into ongoing development threads in the community. In doing so, I intend to give attendees a better picture of Flink's current and future capabilities.

Till Rohrmann

December 20, 2018

Transcript

  1. What is Apache Flink?
    • Batch Processing: process static and historic data
    • Data Stream Processing: realtime results from data streams
    • Event-driven Applications: data-driven actions and services
    Stateful Computations Over Data Streams
  2. Flink 1.7.0 in Numbers
    • Contributors: 112
    • Resolved issues: 430
    • Commits: 970
    • Changed LOC: +103824/-63124
  3. Flink Applications Need to Evolve
    • E.g. changing requirements, new algorithms, better serializers, bug fixes, etc.
    • Expensive to restart application from scratch (maintain state)
  4. State Schema Evolution
    • Support for changing state schema
    • Adding/Removing fields
    • Changing type of fields
    • Currently fully supported when using Avro types
    Related talk: “Upgrading Stateful Flink Streaming Applications: State of the Union” by Tzu-Li Tai, Today @ 5:20 pm, Room 2
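To make "adding/removing fields" concrete, here is a minimal plain-Python sketch of how an old state record could be read under a newer schema. It does not use Flink or the Avro library; the `migrate_record` helper, the field names, and the defaults are all hypothetical and only illustrate the general idea of schema resolution with defaults.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema representation: an ordered list of (field name, default value).
Schema = List[Tuple[str, Any]]


def migrate_record(old_record: Dict[str, Any], new_schema: Schema) -> Dict[str, Any]:
    """Read a record written with an older schema using a newer schema."""
    migrated = {}
    for name, default in new_schema:
        # Fields that already existed are carried over; added fields get their default.
        migrated[name] = old_record.get(name, default)
    # Fields that are absent from new_schema are dropped implicitly (removed fields).
    return migrated


# State written by an earlier version of the application (illustrative data).
old_state = {"user": "alice", "clicks": 3, "legacy_flag": True}

# Newer schema: "legacy_flag" was removed, "country" was added with a default.
new_schema = [("user", ""), ("clicks", 0), ("country", "unknown")]

print(migrate_record(old_state, new_schema))
```

The point of the sketch is that the stored state does not have to be rewritten by hand: as long as the new schema supplies defaults for added fields, old records remain readable after an upgrade.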
  5. Temporal Tables and Joins
    [Figure: currency rates changing over time (CN¥ 7.8, CN¥ 7.89, CN¥ 7.75) alongside timestamped events]
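The idea behind a temporal table join can be sketched in plain Python: each event is joined with the version of the rate that was valid at the event's time. This is not Flink API. The pairing of times 3, 5 and 9 with the three rates is one assumed reading of the slide figure, and `rate_as_of`, `temporal_join` and the order data are hypothetical names chosen for illustration.

```python
from bisect import bisect_right

# Versioned rate table as (valid-from time, rate), sorted by time.
# Assumption: these pairs are one reading of the slide figure, not confirmed data.
rate_history = [(3, 7.8), (5, 7.89), (9, 7.75)]
rate_times = [valid_from for valid_from, _ in rate_history]


def rate_as_of(event_time):
    """Return the rate that was valid at event_time, or None if none existed yet."""
    idx = bisect_right(rate_times, event_time) - 1
    if idx < 0:
        return None
    return rate_history[idx][1]


def temporal_join(orders):
    """Join each (event_time, amount) order with the rate valid at its event time."""
    joined = []
    for event_time, amount in orders:
        rate = rate_as_of(event_time)
        if rate is not None:  # orders before the first rate version find no match
            joined.append((event_time, amount, rate))
    return joined


# Hypothetical orders as (event time, amount).
print(temporal_join([(4, 10), (7, 2), (1, 5)]))
```

Unlike a regular join against the latest table contents, the result for an order does not change when a newer rate arrives later, because the lookup is pinned to the order's own timestamp.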
  6. MATCH_RECOGNIZE
    SELECT * FROM TaxiRides
    MATCH_RECOGNIZE (
      PARTITION BY driverId
      ORDER BY rideTime
      MEASURES S.rideId as sRideId
      AFTER MATCH SKIP PAST LAST ROW
      PATTERN (S M{2,} E)
      DEFINE
        S AS S.isStart = true,
        M AS M.rideId <> S.rideId,
        E AS E.isStart = false AND E.rideId = S.rideId
    )
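What the query above looks for can be sketched over an in-memory list of rows. This is a plain-Python approximation of the pattern `S M{2,} E` for bounded input, not how Flink's CEP engine evaluates it; `match_rides`, `_match_from` and the sample rows are hypothetical. Because the `M` and `E` conditions are mutually exclusive on `rideId`, the only candidate for `E` is the first row after `S` with the same `rideId`, which keeps the matcher simple.

```python
from collections import defaultdict


def _match_from(events, start):
    """Try to match S M{2,} E beginning at index start; return index of E or None."""
    s = events[start]
    if not s["isStart"]:  # S AS S.isStart = true
        return None
    j = start + 1
    # M AS M.rideId <> S.rideId: consume rows that belong to other rides.
    while j < len(events) and events[j]["rideId"] != s["rideId"]:
        j += 1
    m_count = j - start - 1
    if m_count < 2 or j >= len(events):  # M{2,} needs at least two rows, E must exist
        return None
    # events[j] has S's rideId; E additionally requires isStart = false.
    if events[j]["isStart"]:
        return None
    return j


def match_rides(rows):
    """Return (driverId, sRideId) for every match, per driver in rideTime order."""
    partitions = defaultdict(list)
    for row in rows:  # PARTITION BY driverId
        partitions[row["driverId"]].append(row)
    matches = []
    for driver_id in sorted(partitions):
        events = sorted(partitions[driver_id], key=lambda r: r["rideTime"])  # ORDER BY rideTime
        i = 0
        while i < len(events):
            end = _match_from(events, i)
            if end is None:
                i += 1
            else:
                matches.append((driver_id, events[i]["rideId"]))  # MEASURES S.rideId
                i = end + 1  # AFTER MATCH SKIP PAST LAST ROW
    return matches


# Hypothetical rows: driver 1 has two events of ride 11 between start and end of ride 10.
rows = [
    {"driverId": 1, "rideTime": 1, "rideId": 10, "isStart": True},
    {"driverId": 1, "rideTime": 2, "rideId": 11, "isStart": True},
    {"driverId": 1, "rideTime": 3, "rideId": 11, "isStart": False},
    {"driverId": 1, "rideTime": 4, "rideId": 10, "isStart": False},
    {"driverId": 2, "rideTime": 1, "rideId": 20, "isStart": True},
    {"driverId": 2, "rideTime": 2, "rideId": 21, "isStart": True},
    {"driverId": 2, "rideTime": 3, "rideId": 20, "isStart": False},
]
print(match_rides(rows))
```

Driver 2 produces no match in this data because only one row of another ride falls between the start and end of ride 20, which is fewer than `M{2,}` requires.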
  7. More SQL Improvements
    • ElasticSearch 6 Table Sink
    • Support for views in SQL Client
    • More built-in functions: TO_BASE64, LOG2, REPLACE, COSH,…
    Related talk: “Flink Streaming SQL 2018” by Piotr Nowojski, Today @ 4:00 pm, Room 2
  8. Other Notable Features
    • Scala 2.12 Support
    • Exactly-once S3 StreamingFileSink
    • Kafka 2.0 connector
    • Versioned REST API
    • Removal of legacy mode
  9. Flink as a Library
    • Deploying Flink applications should be as easy as starting a process
    • Bundle application code and Flink into a single image
    • Process connects to other application processes and figures out its role
    • Removing the cluster out of the equation
    [Figure: processes P1, P2, P3, P4 and a new process]
  10. Reactive vs. Active
    • Active mode
      • Flink is aware of underlying cluster framework
      • Flink allocates resources
      • E.g. existing YARN and Mesos integration
    • Reactive mode
      • Flink is oblivious to its runtime environment
      • External system allocates and releases resources
      • Flink scales with respect to available resources
      • Relevant for environments: Kubernetes, Docker, as a library
  11. Batch-Streaming Unification
    • No fundamental difference between batch and stream processing
    • Batch allows optimizations because data is bounded and “complete”
    • Batch and streaming still separately treated from task level upwards
    • Working toward a single runtime for batch and streaming workloads
  12. Flink Scheduler
    • Lazy scheduling (batch case)
    • Deploy tasks starting from the sources
    • Whenever data is produced start consumers
    • Scheduling of idling tasks → resource under-utilization
    [Figure: job graph with src and join tasks; join inputs labeled build side and probe side]
  13. Batch Scheduler
    • More efficient scheduling by taking dependencies into account
    • E.g. probe side is only scheduled after build side has been processed
    [Figure: job graph with src and join tasks; join inputs labeled build side and probe side; tasks annotated with scheduling order (1), (2), (2), (3)]
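Dependency-aware scheduling of this kind can be sketched as assigning each task to a scheduling wave. The model below is a simplification in plain Python and not Flink's scheduler: the task names, the two dependency kinds ("pipelined" meaning the task may run alongside its upstream, "blocking" meaning the upstream must finish first), and the `scheduling_waves` function are assumptions made for illustration.

```python
def scheduling_waves(dependencies):
    """Assign each task a wave number (1 = scheduled first).

    dependencies maps a task to a list of (upstream task, kind) pairs, where
    kind is "pipelined" (same wave as upstream allowed) or "blocking"
    (task may only be scheduled in a later wave than upstream).
    """
    waves = {}

    def wave_of(task, visiting=()):
        if task in waves:
            return waves[task]
        if task in visiting:
            raise ValueError(f"cyclic dependency involving {task!r}")
        wave = 1
        for upstream, kind in dependencies.get(task, []):
            upstream_wave = wave_of(upstream, visiting + (task,))
            # A blocking dependency pushes the task into the next wave.
            wave = max(wave, upstream_wave + (1 if kind == "blocking" else 0))
        waves[task] = wave
        return wave

    for task in dependencies:
        wave_of(task)
    return waves


# Hypothetical hash-join job: the probe side waits until the build side is done.
job = {
    "build_src": [],
    "join_build": [("build_src", "pipelined")],
    "probe_src": [("join_build", "blocking")],
    "join_probe": [("probe_src", "pipelined")],
}
print(scheduling_waves(job))
```

Under lazy scheduling all four tasks would be deployed as soon as data flows, leaving the probe side idle while the hash table is built; in this sketch the probe-side tasks land in a later wave and do not hold resources in the meantime.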
  14. Extendable Scheduler
    • Make Flink’s scheduler extendable & pluggable
    • Scheduler considers dependencies and reacts to signals from ExecutionGraph
    • Specialized scheduler for different use cases
    [Figure: Scheduler with Streaming Scheduler, Batch Scheduler, Speculative Scheduler]
  15. Flink’s Shuffle Service
    • Tasks own produced result partitions
    • Containers cannot be freed until result is consumed
    • One implementation for streaming and batch loads
    [Figure: result partition inside a container]
  16. External & Persistent Shuffle Service
    • Result partitions are written to an external shuffle service
    • Containers can be freed early
    • Different implementations based on use case
    [Figure: external shuffle service (e.g. Yarn, DFS)]
  17. End-to-end SQL Only Pipelines
    • Support for external catalogs (Confluent Schema Registry, Hive Meta Store)
    • Data definition language (DDL)
    [Figure: Hive Meta Store provides input schema information to the Table Source and output schema information to the Table Sink around a SQL Query]
  18. TL;DL
    • Flink 1.7.0 added many new features around SQL, connectors and state evolution
    • A lot of new features in the pipeline
    • Join the community!
      • Subscribe to mailing lists
      • Participate in Flink development
      • Become active