Slide 1

Slide 1 text

Apache Flink® 1.7 and Beyond
Company: data Artisans
Title: Engineering Lead
Speaker: Till Rohrmann (@stsffap)

Slide 2

Slide 2 text

Original creators of Apache Flink®
dA Platform: Stream Processing for the Enterprise

Slide 3

Slide 3 text

What is Apache Flink?
Stateful Computations Over Data Streams
• Batch Processing: process static and historic data
• Data Stream Processing: real-time results from data streams
• Event-driven Applications: data-driven actions and services

Slide 4

Slide 4 text

Flink 1.7: What happened so far?

Slide 5

Slide 5 text

Flink 1.7.0 in Numbers
• Contributors: 112
• Resolved issues: 430
• Commits: 970
• Changed LOC: +103824/-63124

Slide 6

Slide 6 text

Flink Applications Need to Evolve
• Applications change over time, e.g. changing requirements, new algorithms, better serializers, bug fixes
• Restarting an application from scratch is expensive, so its state has to be maintained

Slide 7

Slide 7 text

State Schema Evolution
• Support for changing state schema
  • Adding/removing fields
  • Changing type of fields
• Currently fully supported when using Avro types
Related talk: “Upgrading Stateful Flink Streaming Applications: State of the Union” by Tzu-Li Tai, today @ 5:20 pm, Room 2
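
The idea of reading old state through a new schema can be sketched in a few lines of plain Python. This is only an illustration of adding and removing fields, not Flink's or Avro's API: the `migrate` helper, the dict-based "schema" (field name mapped to a default value) and the sample field names are all assumptions made for this example. Type changes are not covered.

```python
# Hypothetical stand-in for Avro-style schema resolution.
# A "schema" here is just a dict mapping field name -> default value.

def migrate(record, new_schema):
    """Read an old state record through a new schema.

    Fields missing from the old record take the new schema's default
    (field added); fields absent from the new schema are dropped
    (field removed).
    """
    return {name: record.get(name, default) for name, default in new_schema.items()}


# Old state written before the upgrade (hypothetical fields).
old_state = {"user": "alice", "clicks": 3, "legacy_flag": True}

# New schema: "legacy_flag" was removed, "country" was added with a default.
new_schema = {"user": "", "clicks": 0, "country": "unknown"}

migrated = migrate(old_state, new_schema)
print(migrated)
```

The point of the sketch is that the application keeps its existing values for unchanged fields instead of rebuilding state from scratch.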

Slide 8

Slide 8 text

Converting Currencies
[Figure: currency conversion example with € 1, $ 1.13 and CN¥ 7.8, and timestamps 7:12pm, 9:37am and 8:45am]

Slide 9

Slide 9 text

Temporal Tables and Joins
Currency  Rate  Time
CN¥       7.8   3
CN¥       7.89  5
CN¥       7.75  9
[Figure: timeline of records to be joined against the rate table; labels 13, 11, 7 and 15, 14, 12, 7, 4]
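
A temporal join looks up, for each incoming record, the version of the rate that was valid at that record's time. The following is a minimal sketch in plain Python rather than Flink SQL. The rate history is taken from the slide; the `orders` list, the function names and the rule that records older than the first rate are dropped are assumptions for illustration.

```python
from bisect import bisect_right

# Rate history for CN¥ from the slide, as (time, rate), sorted by time.
RATE_HISTORY = [(3, 7.8), (5, 7.89), (9, 7.75)]
RATE_TIMES = [t for t, _ in RATE_HISTORY]


def rate_as_of(time):
    """Return the most recent rate with rate time <= time, or None."""
    idx = bisect_right(RATE_TIMES, time)
    if idx == 0:
        return None  # no rate was known yet at this time
    return RATE_HISTORY[idx - 1][1]


def join_orders(orders):
    """Attach to each (time, amount) order the rate valid at its time."""
    joined = []
    for time, amount in orders:
        rate = rate_as_of(time)
        if rate is not None:  # assumption: orders without a rate are dropped
            joined.append((time, amount, rate))
    return joined


# Hypothetical orders as (time, amount).
orders = [(2, 1), (4, 2), (7, 1), (12, 3)]
result = join_orders(orders)
print(result)
```

The order at time 7 is joined with the rate from time 5, not with the later rate from time 9, which is what distinguishes a temporal join from a join against the latest table contents.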

Slide 10

Slide 10 text

SQL for Pattern Analysis
SELECT * FROM ?

Slide 11

Slide 11 text

MATCH_RECOGNIZE
SELECT *
FROM TaxiRides
MATCH_RECOGNIZE (
  PARTITION BY driverId
  ORDER BY rideTime
  MEASURES
    S.rideId AS sRideId
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (S M{2,} E)
  DEFINE
    S AS S.isStart = true,
    M AS M.rideId <> S.rideId,
    E AS E.isStart = false AND E.rideId = S.rideId
)
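
To make the pattern `S M{2,} E` concrete, here is a rough plain-Python approximation of what the query computes for each driver: a start event, then at least two events belonging to other rides, then the end event of the first ride. It is a sketch of the matching logic only, not how Flink evaluates `MATCH_RECOGNIZE`; the `match_rides` function and the sample rows are assumptions.

```python
from collections import defaultdict


def match_rides(rows):
    """Return (driverId, sRideId) for every match of S M{2,} E."""
    by_driver = defaultdict(list)          # PARTITION BY driverId
    for row in rows:
        by_driver[row["driverId"]].append(row)

    matches = []
    for driver_id, part in by_driver.items():
        part.sort(key=lambda r: r["rideTime"])   # ORDER BY rideTime
        i = 0
        while i < len(part):
            s = part[i]
            if not s["isStart"]:                 # S AS S.isStart = true
                i += 1
                continue
            j = i + 1
            # M AS M.rideId <> S.rideId: consume rows of other rides
            while j < len(part) and part[j]["rideId"] != s["rideId"]:
                j += 1
            m_count = j - i - 1
            # E AS E.isStart = false AND E.rideId = S.rideId, with M{2,}
            if j < len(part) and m_count >= 2 and not part[j]["isStart"]:
                matches.append((driver_id, s["rideId"]))
                i = j + 1                        # AFTER MATCH SKIP PAST LAST ROW
            else:
                i += 1
    return matches


# Hypothetical events: driver 1 starts ride 1, then starts and ends ride 2
# before ride 1 ends; driver 2 has a single plain ride.
rows = [
    {"driverId": 1, "rideId": 1, "rideTime": 1, "isStart": True},
    {"driverId": 1, "rideId": 2, "rideTime": 2, "isStart": True},
    {"driverId": 1, "rideId": 2, "rideTime": 3, "isStart": False},
    {"driverId": 1, "rideId": 1, "rideTime": 4, "isStart": False},
    {"driverId": 2, "rideId": 5, "rideTime": 1, "isStart": True},
    {"driverId": 2, "rideId": 5, "rideTime": 2, "isStart": False},
]
found = match_rides(rows)
print(found)
```

Because the conditions for `M` and `E` are mutually exclusive, the greedy scan over `M` rows cannot overshoot the end event, which keeps the sketch simple.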

Slide 12

Slide 12 text

More SQL Improvements
• Elasticsearch 6 Table Sink
• Support for views in SQL Client
• More built-in functions: TO_BASE64, LOG2, REPLACE, COSH, …
Related talk: “Flink Streaming SQL 2018” by Piotr Nowojski, today @ 4:00 pm, Room 2

Slide 13

Slide 13 text

Other Notable Features
• Scala 2.12 support
• Exactly-once S3 StreamingFileSink
• Kafka 2.0 connector
• Versioned REST API
• Removal of legacy mode

Slide 14

Slide 14 text

Flink 1.8+: What is happening next?

Slide 15

Slide 15 text

Capability Spectrum
[Figure: spectrum from offline to real time, with batch, event-driven applications, streaming analytics and strict-SLA applications, and Flink placed on it]

Slide 16

Slide 16 text

Flink as a Library
• Deploying Flink applications should be as easy as starting a process
• Bundle application code and Flink into a single image
• A process connects to the other application processes and figures out its role
• Takes the cluster out of the equation
[Figure: processes P1, P2, P3 and P4, plus a new process joining them]

Slide 17

Slide 17 text

Reactive vs. Active
• Active mode
  • Flink is aware of the underlying cluster framework
  • Flink allocates resources
  • E.g. existing YARN and Mesos integration
• Reactive mode
  • Flink is oblivious to its runtime environment
  • External system allocates and releases resources
  • Flink scales with respect to available resources
  • Relevant for environments: Kubernetes, Docker, as a library
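
The difference in reactive mode is that the job never requests resources; it only adapts to what it is given. A minimal sketch of that rule, assuming a hypothetical `reactive_parallelism` helper and an upper bound `max_parallelism` that are not taken from the slides:

```python
def reactive_parallelism(available_slots, max_parallelism):
    """Pick the job parallelism from the resources currently available.

    The job uses whatever the external system has provided, capped by an
    assumed upper bound; with no slots the job cannot run at all.
    """
    if available_slots < 1:
        return 0
    return min(available_slots, max_parallelism)


# Hypothetical sequence of slot counts as an external system (for example a
# container orchestrator) adds and removes processes over time.
slot_history = [2, 4, 16, 0]
parallelism_history = [reactive_parallelism(s, 8) for s in slot_history]
print(parallelism_history)
```

In active mode the direction is reversed: the job decides on a parallelism first and then asks the cluster framework for matching resources.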

Slide 18

Slide 18 text

Dynamic Scaling
• Latency
• Throughput
• Resource utilization
• Connector signals

Slide 19

Slide 19 text

Batch-Streaming Unification
• No fundamental difference between batch and stream processing
• Batch allows optimizations because data is bounded and “complete”
• Batch and streaming are still treated separately from the task level upwards
• Working toward a single runtime for batch and streaming workloads

Slide 20

Slide 20 text

Flink Scheduler
• Lazy scheduling (batch case)
  • Deploy tasks starting from the sources
  • Whenever data is produced, start consumers
• Scheduling of idling tasks → resource under-utilization
[Figure: job graph with sources (src) feeding joins through build-side and probe-side inputs]

Slide 21

Slide 21 text

Batch Scheduler
• More efficient scheduling by taking dependencies into account
• E.g. probe side is only scheduled after build side has been processed
[Figure: same job graph with sources (src), joins, build sides and probe sides, annotated with scheduling order (1), (2), (2), (3)]
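
Dependency-aware scheduling can be pictured as releasing tasks in waves: a task becomes eligible only once everything it depends on has finished. The sketch below is a generic illustration of that idea, not Flink's scheduler; the `schedule_in_waves` function and the task names are assumptions and do not reproduce the exact graph from the slide.

```python
def schedule_in_waves(deps):
    """Group tasks into waves; deps maps task -> tasks that must finish first."""
    remaining = {task: set(d) for task, d in deps.items()}
    done = set()
    waves = []
    while remaining:
        # A task is ready when all of its prerequisites are done.
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        waves.append(ready)
        done.update(ready)
        for task in ready:
            del remaining[task]
    return waves


# Hypothetical hash join: the probe-side source is held back until the
# build side has been fully processed, so it does not sit idle.
deps = {
    "build_src": [],
    "build_side": ["build_src"],
    "probe_src": ["build_side"],
}
waves = schedule_in_waves(deps)
print(waves)
```

With lazy scheduling, by contrast, both sources would be deployed right away and the probe side would hold its resources while waiting for the build side.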

Slide 22

Slide 22 text

Extendable Scheduler
• Make Flink’s scheduler extendable and pluggable
• Scheduler considers dependencies and reacts to signals from the ExecutionGraph
• Specialized schedulers for different use cases
[Figure: Scheduler with specializations Streaming Scheduler, Batch Scheduler and Speculative Scheduler]

Slide 23

Slide 23 text

Flink’s Shuffle Service
• Tasks own produced result partitions
• Containers cannot be freed until result is consumed
• One implementation for streaming and batch loads
[Figure: result partition held inside a container]

Slide 24

Slide 24 text

External & Persistent Shuffle Service
• Result partitions are written to an external shuffle service
• Containers can be freed early
• Different implementations based on use case
[Figure: external shuffle service (e.g. YARN, DFS)]

Slide 25

Slide 25 text

End-to-end SQL Only Pipelines
• Support for external catalogs (Confluent Schema Registry, Hive Meta Store)
• Data definition language (DDL)
[Figure: Hive Meta Store supplies input schema information to the table source and output schema information to the table sink, with a SQL query between source and sink]

Slide 26

Slide 26 text

TL;DL
• Flink 1.7.0 added many new features around SQL, connectors and state evolution
• A lot of new features in the pipeline
• Join the community!
  • Subscribe to mailing lists
  • Participate in Flink development
  • Become active

Slide 27

Slide 27 text

THANKS