Slide 1


© 2019 Ververica

Slide 2

Batch Is Just a Special Case of Streaming

• Batch is just a bounded stream

(Only half of the truth)

Slide 3

Processing Data as It Arrives

[Diagram: records flowing from sources, older to more recent, annotated with watermarks]

Slide 4

Having All Data Available

[Diagram: the same bounded streams, with all data from older to more recent available from the sources at once]

Slide 5

Different Properties

Continuous Streaming:
• Watermarks to model the completeness/latency trade-off
• Incremental results
• In-receive-order ingestion
• All operators need to be running

Batch Processing:
• No watermarks
• Result ready at end of program
• Massively parallel out-of-order ingestion
• Job can be executed in stages
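The completeness/latency trade-off that watermarks model can be sketched in plain Java. This is illustrative only, not Flink's actual watermark API; the class and the `maxDelayMillis` parameter are assumptions for the sketch:

```java
// Illustrative sketch (not Flink's WatermarkGenerator API): a bounded
// out-of-orderness watermark. The watermark trails the largest timestamp
// seen so far by a fixed delay: a larger delay tolerates more out-of-order
// (late) events, i.e. better completeness, at the cost of result latency.
public class WatermarkSketch {
    private long maxSeenTimestamp = Long.MIN_VALUE;
    private final long maxDelayMillis;

    public WatermarkSketch(long maxDelayMillis) {
        this.maxDelayMillis = maxDelayMillis;
    }

    // Observe an event timestamp; events may arrive out of order.
    public void onEvent(long timestampMillis) {
        maxSeenTimestamp = Math.max(maxSeenTimestamp, timestampMillis);
    }

    // The watermark asserts "no events with a smaller timestamp are expected".
    public long currentWatermark() {
        return maxSeenTimestamp - maxDelayMillis;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(100);
        wm.onEvent(1_000);
        wm.onEvent(900);   // out-of-order, but still above the watermark
        System.out.println(wm.currentWatermark()); // 900
    }
}
```

A batch job needs none of this: the input is bounded, so completeness is guaranteed once all data has been read.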

Slide 6

Exploiting Boundedness

SELECT AVG(temperature) FROM sensors GROUP BY location

• Default execution: the Flink optimizer produces continuous operators
• Batch execution (bounded streams): the Flink optimizer produces specialized batch operators
• Efficient execution
  ○ Improved scheduling
  ○ Improved resource utilization
  ○ Faster failovers
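What a specialized batch operator buys for this query can be sketched in plain Java (illustrative, not Flink code; the class and record representation are assumptions): with bounded input, `AVG(temperature) GROUP BY location` can be computed in a single pass and emitted once at the end, instead of emitting an updated average per incoming record as a continuous query must.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: one-pass grouped average over a bounded input.
public class BoundedAvg {
    // Each reading is a (location, temperature) pair.
    public static Map<String, Double> avgByLocation(List<Map.Entry<String, Double>> readings) {
        Map<String, double[]> acc = new HashMap<>(); // location -> {sum, count}
        for (Map.Entry<String, Double> r : readings) {
            double[] a = acc.computeIfAbsent(r.getKey(), k -> new double[2]);
            a[0] += r.getValue();
            a[1] += 1;
        }
        // The input is bounded, so the final result is emitted exactly once.
        Map<String, Double> result = new HashMap<>();
        acc.forEach((loc, a) -> result.put(loc, a[0] / a[1]));
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> readings = List.of(
                Map.entry("berlin", 20.0),
                Map.entry("berlin", 22.0),
                Map.entry("paris", 18.0));
        System.out.println(avgByLocation(readings));
    }
}
```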

Slide 7


Slide 8

Lazy From Sources

• Lazy scheduling (batch case)
  ─ Start deploying all sources
  ─ If data is produced, then start consumers
• Resource underutilization
  ─ Idling tasks due to deploying them too early

[Diagram: Source #1, Source #2, and Source #3 feeding Join #1 and Join #2 via build and probe sides]

Slide 9

Improved Scheduling

• More efficient scheduling by taking dependencies into account
• E.g. the probe side is only scheduled after the build side has been completed

[Diagram: the same topology, with tasks scheduled in stages (1)–(3)]
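Dependency-aware scheduling can be sketched in plain Java (illustrative only, not Flink's scheduler; the task names and the exact wiring of the slide's topology are assumptions): each task lists the tasks that must complete before it may be scheduled, and its stage number is one more than the latest of its prerequisites.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of dependency-aware stage assignment: a task (e.g. a
// hash join's probe side) is only scheduled once its prerequisites (the
// build side) have completed.
public class StageScheduler {
    // inputsOf maps each task to the tasks that must finish before it starts.
    public static Map<String, Integer> stages(Map<String, List<String>> inputsOf) {
        Map<String, Integer> stage = new HashMap<>();
        for (String task : inputsOf.keySet()) stageOf(task, inputsOf, stage);
        return stage;
    }

    private static int stageOf(String task, Map<String, List<String>> inputsOf,
                               Map<String, Integer> stage) {
        Integer s = stage.get(task);
        if (s != null) return s;
        int latestPrerequisite = 0;
        for (String dep : inputsOf.getOrDefault(task, List.of()))
            latestPrerequisite = Math.max(latestPrerequisite, stageOf(dep, inputsOf, stage));
        stage.put(task, latestPrerequisite + 1);
        return latestPrerequisite + 1;
    }

    public static void main(String[] args) {
        // Assumed wiring: source1 builds join1; source2 probes it; join1 is
        // join2's build side, probed by source3.
        Map<String, List<String>> inputs = new HashMap<>();
        inputs.put("source1", List.of());
        inputs.put("join1", List.of("source1"));   // build side completed
        inputs.put("source2", List.of("source1")); // probe starts with join1
        inputs.put("join2", List.of("join1"));
        inputs.put("source3", List.of("join1"));
        System.out.println(stages(inputs));
    }
}
```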

Slide 10

• The Flink community started an effort to make the scheduler pluggable
• The scheduler abstraction makes it possible to react to ExecutionGraph signals
• Specialized scheduler implementations become possible
• See FLINK-10429 for more details

[Diagram: Scheduler abstraction with implementations such as a Pipelined-Regions Scheduler, an Advanced Batch Scheduler, and a Speculative Scheduler]

Slide 11

One Slot Needs to Serve Them All

• Executing a heterogeneous job requires different resources over time
• Statically sized slots are only optimal for one type of task
• Resource underutilization

[Diagram: two TaskExecutors with fixed 10 GB slots; a 10 GB first stage fills them, while a 5 GB second stage leaves half of each slot unused]

Slide 12

Improved Resource Utilization for Heterogeneous Jobs

• Size slots dynamically
  ─ TaskExecutors announce their resources
  ─ Slots are sliced off from the available resources
• Improved resource utilization and elasticity
• See FLIP-56 for more details

[Diagram: TaskExecutors slicing off slots of varying size and freeing the resources as stages complete]
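The dynamic slicing idea can be sketched in plain Java (illustrative only, not the FLIP-56 implementation; the class and slot identifiers are assumptions): instead of pre-cutting fixed-size slots, the TaskExecutor announces a resource budget and slots of varying size are sliced off and returned.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of dynamic slot allocation on one TaskExecutor.
public class DynamicSlots {
    private final int totalGb;
    private int usedGb = 0;
    private final Map<String, Integer> slots = new HashMap<>();

    public DynamicSlots(int totalGb) {
        this.totalGb = totalGb;
    }

    // Slice off a slot of the requested size, if enough resources are free.
    public boolean allocate(String slotId, int gb) {
        if (usedGb + gb > totalGb) return false;
        usedGb += gb;
        slots.put(slotId, gb);
        return true;
    }

    // Free the slot so a differently sized slot can be sliced off later.
    public void free(String slotId) {
        Integer gb = slots.remove(slotId);
        if (gb != null) usedGb -= gb;
    }

    public int freeGb() {
        return totalGb - usedGb;
    }

    public static void main(String[] args) {
        DynamicSlots te = new DynamicSlots(10);
        te.allocate("stage1", 10);  // 1st stage uses the whole executor
        te.free("stage1");
        te.allocate("stage2-a", 5); // 2nd stage: two 5 GB slots fit instead
        te.allocate("stage2-b", 5);
        System.out.println(te.freeGb()); // 0
    }
}
```

With statically sized 10 GB slots, the two 5 GB tasks of the second stage would each occupy a full slot and waste half of it.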

Slide 13

[Chart: comparison of Flink < 1.9 and Flink ≥ 1.9]

Slide 14

Faster Recovery for Flink Jobs

• Try to avoid redundant work
• Data exchange modes
  ─ Pipelined
  ─ Blocking
• Persist intermediate results at blocking data exchanges
• Backtrack from the failed task to the latest intermediate result in case of a failover

Slide 15

Initial Topology

[Diagram: a job graph with pipelined and blocking data exchanges, split into two pipelined regions]

Slide 16

Executing the First Pipelined Regions

[Diagram: the first pipelined region running; downstream regions are not yet deployed]

Slide 17

Execute the Consuming Pipelined Region

[Diagram: the consuming pipelined region running on top of the persisted intermediate results]

Slide 18

Backtrack What Needs to Be Recomputed

[Diagram: after a failure, backtracking from the failed task through the graph to the latest available intermediate results]
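The backtracking step can be sketched in plain Java (illustrative only, not Flink's failover logic; task names are assumptions): starting from the failed task, walk the inputs backwards and stop wherever a persisted intermediate result from a blocking data exchange is still available. Only the visited tasks need to be re-executed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of backtracking recovery over a task graph.
public class Backtracking {
    public static Set<String> tasksToRestart(String failedTask,
                                             Map<String, List<String>> inputsOf,
                                             Set<String> availableResults) {
        Set<String> restart = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(failedTask));
        while (!queue.isEmpty()) {
            String task = queue.pop();
            if (!restart.add(task)) continue; // already scheduled for restart
            for (String input : inputsOf.getOrDefault(task, List.of()))
                if (!availableResults.contains(input)) // persisted result: stop
                    queue.push(input); // result lost: recompute the producer too
        }
        return restart;
    }

    public static void main(String[] args) {
        Map<String, List<String>> inputs = Map.of(
                "sink", List.of("join"),
                "join", List.of("map1", "map2"),
                "map1", List.of(),
                "map2", List.of());
        // map1's blocking result is still persisted, map2's result was lost.
        System.out.println(tasksToRestart("join", inputs, Set.of("map1"))); // [join, map2]
    }
}
```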

Slide 19

[Diagram: only the affected pipelined region is restarted; the persisted intermediate results are reused]

Slide 20

• How to activate
  ─ ExecutionConfig.setExecutionMode(ExecutionMode.BATCH) or ExecutionConfig.setExecutionMode(ExecutionMode.BATCH_FORCED)
  ─ jobmanager.execution.failover-strategy: region
• See FLIP-1 for more details

Slide 21

How to Handle Intermediate Results

• Streaming and batch workloads can have conflicting requirements for shuffling data
  ─ High throughput vs. low latency
• Flink's Netty-based shuffle service
  ─ High-throughput and low-latency shuffle service
  ─ Results are bound to tasks → a container cannot be freed if its data has not been consumed

[Diagram: a container holding a result partition]

Slide 22

• Optimized implementations for batch and streaming become possible
• Shuffle services with different characteristics
  ─ Hash-based vs. sort-merge-based shuffles
• An external shuffle service makes it possible to decouple intermediate result storage from Flink
• See FLIP-31 for more details

[Diagram: external shuffle service (e.g. YARN, DFS)]
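The hash-based variant can be sketched in plain Java (illustrative only, not a Flink shuffle implementation; the class and partition count are assumptions): each record is routed to one of N downstream consumers by the hash of its key, so equal keys always meet in the same partition. A sort-merge-based shuffle would instead write sorted runs and merge them on the consumer side.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of hash-based shuffle partitioning.
public class HashShuffle {
    public static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static List<List<String>> shuffle(List<String> keys, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());
        for (String key : keys)
            partitions.get(partitionFor(key, numPartitions)).add(key);
        return partitions;
    }

    public static void main(String[] args) {
        List<List<String>> parts = shuffle(List.of("berlin", "paris", "berlin"), 2);
        // Equal keys always land in the same partition.
        System.out.println(parts.get(partitionFor("berlin", 2)));
    }
}
```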

Slide 23


Slide 24

Pluggable Scheduling Strategy

Slide 25

Pluggable Scheduling Strategy

Slide 26

Pluggable Shuffle Architecture

• Motivation
  ─ Improve the resource utilization of the cluster
  ─ Global disk I/O management

Slide 27

Yarn External Shuffle Service

• A dedicated YARN auxiliary service for transferring the shuffle data
• The TaskManager is no longer responsible for shuffling the data
• The ResourceManager can release idle TaskManagers faster

Slide 28

Performance Improvements

Slide 29

Performance Improvements

Slide 30

Performance Improvements

Slide 31

• A lot of engine improvements to better execute batch jobs
  ─ Smarter scheduling (Flink 1.10+)
  ─ Better resource utilization (Flink 1.10)
  ─ Faster recovery (Flink 1.9)
  ─ More flexible data exchange (Flink 1.9)
• The Flink community is dedicated to further improving the batch experience
• There are plans to contribute the external shuffle service back to Flink

Slide 32


Slide 33


Slide 34

www.ververica.com
@VervericaData
[email protected]