Apache Flink 1.7 and Beyond

1 Apache Flink® 1.7 and Beyond
Company: data Artisans
Title: Engineering Lead
Speaker: Till Rohrmann (@stsffap)

2 Original creators of Apache Flink®
dA Platform: Stream Processing for the Enterprise

3 What is Apache Flink?
Stateful Computations Over Data Streams
• Batch Processing: process static and historic data
• Data Stream Processing: realtime results from data streams
• Event-driven Applications: data-driven actions and services

4 Flink 1.7: What happened so far?

5 Flink 1.7.0 in Numbers
• Contributors: 112
• Resolved issues: 430
• Commits: 970
• Changes (LOC): +103824/-63124

6 Flink Applications Need to Evolve
• E.g. changing requirements, new algorithms, better serializers, bug fixes, etc.
• Expensive to restart application from scratch (maintain state)

7 State Schema Evolution
• Support for changing state schema
  • Adding/Removing fields
  • Changing type of fields
• Currently fully supported when using Avro types
Related talk: “Upgrading Stateful Flink Streaming Applications: State of the Union” by Tzu-Li Tai, today @ 5:20 pm, Room 2
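The add/remove-field cases above can be illustrated with a minimal sketch in plain Python. It models the general idea of reading old state through a newer schema with defaults, similar in spirit to Avro schema resolution. The schemas, field names, and the `migrate` helper are hypothetical and are not Flink or Avro APIs.

```python
from typing import Any, Dict

# Hypothetical schemas: field name -> default used when the field is
# missing from a record that was written with an older schema.
SCHEMA_V1 = {"user": "", "amount": 0}
SCHEMA_V2 = {"user": "", "amount": 0, "currency": "EUR"}  # "currency" added


def migrate(record: Dict[str, Any], reader_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Read `record` through `reader_schema`.

    Fields unknown to the reader schema are dropped (field removed);
    fields missing from the record take the reader's default (field added).
    """
    return {name: record.get(name, default) for name, default in reader_schema.items()}


old_state = {"user": "alice", "amount": 42}
new_state = migrate(old_state, SCHEMA_V2)
print(new_state)
```

Changing the type of a field is not covered by this sketch; it would additionally need a per-field conversion rule.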

8 Converting Currencies
[Figure labels: 7:12pm, 9:37am, 8:45am; € 1, $ 1.13, CN¥ 7.8]

9 Temporal Tables and Joins
Currency | Rate | Time
CN¥      | 7.8  | 3
CN¥      | 7.89 | 5
CN¥      | 7.75 | 9
[Figure: remaining diagram labels: 13, 11, 7 and 15, 14, 12, 7, 4]
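A temporal join pairs each record with the version of a table that was valid at the record's time. The following is a minimal sketch in plain Python, not Flink's Table API. The rate history is the CN¥ table as read from the slide; the order times and amounts, and the helper names `rate_at` and `temporal_join`, are hypothetical.

```python
from bisect import bisect_right

# Rate history for CN¥ as read from the slide: (valid-from time, rate).
RATE_HISTORY = [(3, 7.8), (5, 7.89), (9, 7.75)]


def rate_at(time, history=RATE_HISTORY):
    """Return the rate valid at `time`, or None if no version exists yet."""
    times = [t for t, _ in history]
    idx = bisect_right(times, time) - 1  # latest version with valid-from <= time
    return history[idx][1] if idx >= 0 else None


def temporal_join(orders, history=RATE_HISTORY):
    """Join each (time, amount) order with the rate valid at its time."""
    joined = []
    for time, amount in orders:
        rate = rate_at(time, history)
        if rate is not None:  # inner-join semantics: drop orders without a rate
            joined.append((time, amount, rate))
    return joined


# Hypothetical orders: (event time, amount).
orders = [(2, 10), (4, 10), (7, 10), (12, 10)]
print(temporal_join(orders))
```

The order at time 2 finds no rate version and is dropped; the others pick up the rates that became valid at times 3, 5, and 9 respectively.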

10 SQL for Pattern Analysis
SELECT * FROM ?

11 MATCH_RECOGNIZE
SELECT *
FROM TaxiRides
MATCH_RECOGNIZE (
  PARTITION BY driverId
  ORDER BY rideTime
  MEASURES
    S.rideId AS sRideId
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (S M{2,} E)
  DEFINE
    S AS S.isStart = true,
    M AS M.rideId <> S.rideId,
    E AS E.isStart = false AND E.rideId = S.rideId
)
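To make the pattern's meaning concrete, here is a rough plain-Python sketch of what the query asks for on a finite list of rows: a start event, at least two events of other rides, then the end event of the first ride. It is an illustration of the semantics under simplifying assumptions (bounded input, no NULLs), not how Flink evaluates MATCH_RECOGNIZE. The function name `match_rides` and the dict-based rows are hypothetical.

```python
from collections import defaultdict


def match_rides(rides):
    """Return (driverId, sRideId) for each match of PATTERN (S M{2,} E).

    `rides` is an iterable of dicts with keys driverId, rideTime, rideId, isStart.
    """
    by_driver = defaultdict(list)
    for ride in rides:  # PARTITION BY driverId
        by_driver[ride["driverId"]].append(ride)

    matches = []
    for driver, rows in by_driver.items():
        rows.sort(key=lambda r: r["rideTime"])  # ORDER BY rideTime
        i = 0
        while i < len(rows):
            s = rows[i]
            if not s["isStart"]:  # S AS S.isStart = true
                i += 1
                continue
            j = i + 1
            # M AS M.rideId <> S.rideId: consume rows of other rides
            while j < len(rows) and rows[j]["rideId"] != s["rideId"]:
                j += 1
            enough_m = (j - (i + 1)) >= 2  # M{2,}
            # E AS E.isStart = false AND E.rideId = S.rideId
            if enough_m and j < len(rows) and not rows[j]["isStart"]:
                matches.append((driver, s["rideId"]))
                i = j + 1  # AFTER MATCH SKIP PAST LAST ROW
            else:
                i += 1
    return matches


rides = [
    {"driverId": 1, "rideTime": 1, "rideId": 1, "isStart": True},
    {"driverId": 1, "rideTime": 2, "rideId": 2, "isStart": True},
    {"driverId": 1, "rideTime": 3, "rideId": 2, "isStart": False},
    {"driverId": 1, "rideTime": 4, "rideId": 1, "isStart": False},
]
print(match_rides(rides))
```

Because the M and E conditions exclude each other (M requires a different rideId than S, E requires the same one), a single forward scan is enough here; a general pattern matcher would need backtracking.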

12 More SQL Improvements
• ElasticSearch 6 Table Sink
• Support for views in SQL Client
• More built-in functions: TO_BASE64, LOG2, REPLACE, COSH, …
Related talk: “Flink Streaming SQL 2018” by Piotr Nowojski, today @ 4:00 pm, Room 2

13 Other Notable Features
• Scala 2.12 Support
• Exactly-once S3 StreamingFileSink
• Kafka 2.0 connector
• Versioned REST API
• Removal of legacy mode

14 Flink 1.8+: What is happening next?

15 Capability Spectrum
[Figure: spectrum from offline to real time, with labels Batch, Event-driven applications, Streaming analytics, Strict SLA applications, and Flink]

16 Flink as a Library
• Deploying Flink applications should be as easy as starting a process
• Bundle application code and Flink into a single image
• Process connects to other application processes and figures out its role
• Taking the cluster out of the equation
[Figure labels: P1, P2, P3, P4, New process]

17 Reactive vs. Active
• Active mode
  • Flink is aware of the underlying cluster framework
  • Flink allocates resources
  • E.g. existing YARN and Mesos integration
• Reactive mode
  • Flink is oblivious to its runtime environment
  • External system allocates and releases resources
  • Flink scales with respect to available resources
  • Relevant for environments: Kubernetes, Docker, as a library
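The reactive-mode idea, where the job adapts to whatever resources an external system provides, can be sketched as a small plain-Python model. This is only an illustration of the principle. The function `reactive_parallelism`, the notion of a slot count, and the sample values are assumptions, not Flink APIs or defaults.

```python
def reactive_parallelism(available_slots: int, max_parallelism: int) -> int:
    """Parallelism a job would run with, given the slots it currently has.

    The job never asks for resources itself; it only uses what is available,
    capped by its own upper bound.
    """
    if available_slots < 1:
        return 0  # nothing to run on; the job has to wait
    return min(available_slots, max_parallelism)


# Hypothetical sequence of slot counts reported as an external system
# (e.g. a container orchestrator) adds and removes processes.
for slots in [2, 4, 10, 3, 0]:
    print(slots, "->", reactive_parallelism(slots, max_parallelism=8))
```

In active mode the direction is reversed: the job decides on a parallelism first and then requests matching resources from the cluster framework.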

18 Dynamic Scaling
• Latency
• Throughput
• Resource utilization
• Connector signals

19 Batch-Streaming Unification
• No fundamental difference between batch and stream processing
• Batch allows optimizations because data is bounded and “complete”
• Batch and streaming are still treated separately from the task level upwards
• Working toward a single runtime for batch and streaming workloads

20 Flink Scheduler
• Lazy scheduling (batch case)
  • Deploy tasks starting from the sources
  • Whenever data is produced, start consumers
• Scheduling of idling tasks → resource under-utilization
[Figure labels: src, src, src, join, join; build side, probe side]

21 Batch Scheduler
• More efficient scheduling by taking dependencies into account
• E.g. probe side is only scheduled after build side has been processed
[Figure labels: src, src, src, join, join; build side, probe side; scheduling order (1), (2), (2), (3)]
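Dependency-aware scheduling can be sketched as assigning each task the earliest stage at which everything it waits for has finished. The following plain-Python model is a sketch under assumptions: the task names and the `waits_for` map are hypothetical, the stage numbers are not meant to reproduce the numbering in the slide's figure, and this is not Flink's scheduler code.

```python
def schedule_stages(waits_for):
    """Assign each task the earliest stage in which it may be deployed.

    `waits_for` maps a task to the tasks that must have finished first.
    Tasks without prerequisites land in stage 1.
    """
    stages = {}

    def stage_of(task, visiting=()):
        if task in stages:
            return stages[task]
        if task in visiting:
            raise ValueError(f"cyclic dependency at {task!r}")
        deps = waits_for.get(task, ())
        stages[task] = 1 + max(
            (stage_of(dep, visiting + (task,)) for dep in deps), default=0
        )
        return stages[task]

    for task in waits_for:
        stage_of(task)
    return stages


# Hypothetical hash join: the probe side is only scheduled
# after the build side has been processed.
waits_for = {
    "build_src": [],
    "build_side": ["build_src"],
    "probe_side": ["build_side"],
}
print(schedule_stages(waits_for))
```

Compared with lazy scheduling, the probe-side tasks here do not occupy resources while the build side is still running.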

22 Extendable Scheduler
• Make Flink’s scheduler extendable & pluggable
• Scheduler considers dependencies and reacts to signals from ExecutionGraph
• Specialized schedulers for different use cases
[Figure: Scheduler, with variants Streaming Scheduler, Batch Scheduler, Speculative Scheduler]

23 Flink’s Shuffle Service
• Tasks own produced result partitions
• Containers cannot be freed until result is consumed
• One implementation for streaming and batch loads
[Figure labels: Result partition, Container]

24 External & Persistent Shuffle Service
• Result partitions are written to an external shuffle service
• Containers can be freed early
• Different implementations based on use case
[Figure label: External shuffle service (e.g. Yarn, DFS)]

25 End-to-end SQL Only Pipelines
• Support for external catalogs (Confluent Schema Registry, Hive Meta Store)
• Data definition language (DDL)
[Figure labels: Hive Meta Store, Table Source, Table Sink, SQL Query, Input schema information, Output schema information]

26 TL;DL
• Flink 1.7.0 added many new features around SQL, connectors and state evolution
• A lot of new features in the pipeline
• Join the community!
  • Subscribe to mailing lists
  • Participate in Flink development
  • Become active

27 Thanks
