
Flink’s New Batch Architecture


Since its inception, Flink has supported executing batch workloads. Using specialized operators for processing bounded streams allows Flink to achieve very decent batch performance. However, Flink’s fault recovery in particular, which restarts the whole topology in case of task failures, caused problems for large and complex batch jobs. Moreover, supporting batch and streaming alike required generalizations in some components that prevented further batch optimizations:

* The scheduler needs to schedule topologies with complex dependencies as well as low latency requirements

* The shuffle service needs to support high-throughput batch as well as fast streaming data exchanges

In this talk, we will shed some light on the community’s effort to address these limitations and on the new components that compose Flink’s improved batch architecture. We will demonstrate how the new fine-grained recovery feature minimizes the set of computations to restart in case of a failover. Moreover, we will explain how a batch job differs from a streaming job and what this means for the scheduler. We will also discuss why it can be beneficial to separate results from computation and how Flink supports this. Last but not least, we want to give an outlook on possible future improvements, such as support for speculative execution and RDMA-based data exchanges, and how they relate to Flink’s new batch architecture.

Till Rohrmann

October 08, 2019

Transcript

  1. © 2019 Ververica

  2. © 2019 Ververica Batch Is Just a Special Case of Streaming • Batch is just a bounded stream • Only half of the truth
  3. © 2019 Ververica Processing Data as It Arrives (diagram: sources emitting watermarks, older → more recent)
  4. © 2019 Ververica Having All Data Available (diagram: sources with watermarks, older → more recent)
  5. © 2019 Ververica Different Properties • Continuous Streaming: watermarks to model the completeness/latency trade-off; incremental results; ingestion in receive order; all operators need to be running • Batch Processing: no watermarks; result ready at end of program; massively parallel out-of-order ingestion; job can be executed in stages
  6. © 2019 Ververica Exploiting Boundedness • SELECT AVG(temperature) FROM sensors GROUP BY location • Flink optimizer: default execution → continuous operators; batch execution on bounded streams → specialized batch operators • Efficient execution: improved scheduling, improved resource utilization, faster failovers
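To make the contrast between the two execution modes concrete, here is a minimal plain-Java sketch (no Flink APIs; all names are hypothetical) of what a specialized batch operator can do for the query above: because the input is bounded, a single pass with one (sum, count) accumulator per location suffices, and the final averages are emitted only once all data has been seen.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchAvgSketch {
    // Batch-style evaluation of: SELECT AVG(temperature) FROM sensors GROUP BY location.
    // One (sum, count) accumulator per location; result is ready at end of input.
    public static Map<String, Double> avgByLocation(List<Object[]> rows) {
        Map<String, double[]> acc = new HashMap<>();
        for (Object[] row : rows) {                      // row = {location, temperature}
            double[] a = acc.computeIfAbsent((String) row[0], k -> new double[2]);
            a[0] += (Double) row[1];                     // running sum
            a[1] += 1.0;                                 // running count
        }
        Map<String, Double> result = new HashMap<>();
        acc.forEach((location, a) -> result.put(location, a[0] / a[1]));
        return result;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[]{"berlin", 10.0});
        rows.add(new Object[]{"berlin", 20.0});
        rows.add(new Object[]{"paris", 30.0});
        // A streaming version would have to emit an updated average per record;
        // the batch version produces the final result exactly once.
        System.out.println(avgByLocation(rows));
    }
}
```

A streaming operator cannot take this shortcut: without boundedness it must keep emitting (and retracting) intermediate averages, which is exactly the "incremental results" property from slide 5.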
  7. © 2019 Ververica

  8. © 2019 Ververica Lazy From Sources • Lazy scheduling (batch case): start by deploying all sources; if data is produced, then start the consumers • Resource underutilization: idling tasks due to deploying them too early (diagram: Source #1/#2 feed Join #1's build and probe sides; Join #1 and Source #3 feed Join #2's build and probe sides)
  9. © 2019 Ververica Improved Scheduling • More efficient scheduling by taking dependencies into account • E.g. the probe side is only scheduled after the build side has completed (diagram: same topology, scheduled in order (1), (2), (2), (3))
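As a toy illustration of dependency-aware scheduling, the following plain-Java sketch (no Flink APIs; the names and the dependency encoding are hypothetical) assigns each task the earliest stage in which all of its scheduling prerequisites have finished. Modeling "the probe side is scheduled only after the build side completed" as such a prerequisite reproduces the (1), (2), (2), (3) ordering from the slide.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StageAssignment {
    // deps maps a task to the tasks that must have *finished* before it may start.
    // A task's stage is one more than the latest stage among its prerequisites.
    public static Map<String, Integer> stages(Map<String, List<String>> deps) {
        Map<String, Integer> stage = new HashMap<>();
        for (String task : deps.keySet()) stageOf(task, deps, stage);
        return stage;
    }

    private static int stageOf(String task, Map<String, List<String>> deps,
                               Map<String, Integer> stage) {
        Integer known = stage.get(task);
        if (known != null) return known;
        int s = 1;                                   // tasks without prerequisites run first
        for (String dep : deps.getOrDefault(task, List.of()))
            s = Math.max(s, stageOf(dep, deps, stage) + 1);
        stage.put(task, s);
        return s;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("source1", List.of());              // build input of join1
        deps.put("source2", List.of("source1"));     // probe input: waits for the build side
        deps.put("join1", List.of("source1"));       // join1 starts once its build side is done
        deps.put("source3", List.of("join1"));       // probe input of join2
        deps.put("join2", List.of("join1"));
        // Computed stages: source1 → 1, source2 and join1 → 2, source3 and join2 → 3.
        System.out.println(stages(deps));
    }
}
```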
  10. © 2019 Ververica • The Flink community started an effort to make the scheduler pluggable • The scheduler abstraction allows reacting to ExecutionGraph signals • Specialized scheduler implementations become possible • See FLINK-10429 for more details (diagram: Scheduler → Pipelined-Regions Scheduler, Advanced Batch Scheduler, Speculative Scheduler)
  11. © 2019 Ververica One Slot Needs to Serve Them All • Executing a heterogeneous job requires different resources over time • Statically sized slots are only optimal for one type of task • Resource underutilization (diagram: TaskExecutors with fixed 10 GB slots serving a 10 GB first stage and a 5 GB second stage)
  12. © 2019 Ververica Improved Resource Utilization for Heterogeneous Jobs • Size slots dynamically: TaskExecutors announce their resources; slots are sliced off from the available resources • Improved resource utilization and elasticity • See FLIP-56 for more details (diagram: TaskExecutors slicing off and freeing dynamically sized slots as the 10 GB and 5 GB stages run)
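The idea of slicing dynamically sized slots off an announced resource budget can be sketched in a few lines of plain Java (hypothetical names and a deliberately simplified model; the real mechanism is described in FLIP-56):

```java
public class DynamicSlotPoolSketch {
    // A TaskExecutor announces a total memory budget once; slots of varying
    // size are sliced off on demand and given back when a stage finishes.
    private final int totalMb;
    private int usedMb = 0;

    public DynamicSlotPoolSketch(int totalMb) { this.totalMb = totalMb; }

    /** Slice a slot off the remaining budget; refuse instead of over-committing. */
    public boolean allocate(int slotMb) {
        if (usedMb + slotMb > totalMb) return false;
        usedMb += slotMb;
        return true;
    }

    public void free(int slotMb) { usedMb -= slotMb; }

    public int availableMb() { return totalMb - usedMb; }

    public static void main(String[] args) {
        DynamicSlotPoolSketch executor = new DynamicSlotPoolSketch(10_240); // 10 GB
        // 1st stage: one 10 GB task fills the executor completely.
        executor.allocate(10_240);
        executor.free(10_240);
        // 2nd stage: two 5 GB tasks now fit into the same executor, instead of
        // each wasting half of a statically sized 10 GB slot.
        executor.allocate(5_120);
        executor.allocate(5_120);
        System.out.println(executor.availableMb());
    }
}
```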
  13. © 2019 Ververica (diagram: Flink < 1.9 vs. Flink ≥ 1.9)

  14. © 2019 Ververica Faster Recovery for Flink Jobs • Try to avoid redundant work • Data exchange modes: pipelined and blocking • Persist intermediate results at blocking data exchanges • Backtrack from the failed task to the latest intermediate result in case of a failover
  15. © 2019 Ververica Initial Topology (diagram: pipelined and blocking data exchanges, two pipelined regions)
  16. © 2019 Ververica Executing the First Pipelined Regions (diagram: first pipelined region running, separated by a blocking data exchange)
  17. © 2019 Ververica Execute the Consuming Pipelined Region (diagram: pipelined regions with persisted intermediate results at the blocking data exchange)
  18. © 2019 Ververica Backtrack What Needs to Be Recomputed (diagram: pipelined regions, blocking data exchange, persisted intermediate results)
  19. © 2019 Ververica (diagram: pipelined regions, blocking data exchanges, and intermediate results)
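The two building blocks of slides 15–19, pipelined regions and backtracking, can be sketched in plain Java (hypothetical names, a deliberately simplified model, not Flink's implementation): a region is a connected component of the graph restricted to pipelined edges, and on a failure the failed task's region is restarted together with any upstream region whose persisted intermediate result has been lost.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FineGrainedRecoverySketch {
    // Tasks linked by pipelined edges must fail over together, so a region is
    // a connected component over pipelined edges only (simple union-find).
    public static Map<String, Integer> regions(List<String> tasks,
                                               List<String[]> pipelinedEdges) {
        Map<String, String> parent = new HashMap<>();
        for (String t : tasks) parent.put(t, t);
        for (String[] e : pipelinedEdges)                 // e = {from, to}
            parent.put(find(parent, e[0]), find(parent, e[1]));
        Map<String, Integer> regionOf = new HashMap<>();
        Map<String, Integer> idByRoot = new HashMap<>();
        for (String t : tasks) {
            String root = find(parent, t);
            if (!idByRoot.containsKey(root)) idByRoot.put(root, idByRoot.size());
            regionOf.put(t, idByRoot.get(root));
        }
        return regionOf;
    }

    private static String find(Map<String, String> parent, String x) {
        while (!parent.get(x).equals(x)) x = parent.get(x);
        return x;
    }

    // Backtracking: restart the failed task's region; for every blocking input
    // edge into a restarted region whose persisted result was lost, also
    // restart the producing region, transitively.
    public static Set<Integer> restartSet(String failedTask,
                                          Map<String, Integer> regionOf,
                                          List<String[]> blockingEdges,
                                          Set<String> lostResults) {
        Set<Integer> restart = new HashSet<>();
        Deque<Integer> work = new ArrayDeque<>();
        work.push(regionOf.get(failedTask));
        while (!work.isEmpty()) {
            int region = work.pop();
            if (!restart.add(region)) continue;
            for (String[] e : blockingEdges)              // e = {producer, consumer}
                if (regionOf.get(e[1]) == region && lostResults.contains(e[0]))
                    work.push(regionOf.get(e[0]));
        }
        return restart;
    }

    public static void main(String[] args) {
        List<String> tasks = List.of("src", "map", "agg", "sink");
        List<String[]> pipelined = List.of(new String[]{"src", "map"},
                                           new String[]{"agg", "sink"});
        List<String[]> blocking = List.of(new String[]{"map", "agg"});
        Map<String, Integer> regionOf = regions(tasks, pipelined);
        // If agg fails and map's intermediate result is still available, only
        // the downstream region restarts; if the result was lost, both do.
        System.out.println(restartSet("agg", regionOf, blocking, Set.of()));
        System.out.println(restartSet("agg", regionOf, blocking, Set.of("map")));
    }
}
```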
  20. © 2019 Ververica • How to activate: ExecutionConfig.setExecutionMode(ExecutionMode.BATCH) or ExecutionConfig.setExecutionMode(ExecutionMode.BATCH_FORCED), plus jobmanager.execution.failover-strategy: region • See FLIP-1 for more details
  21. © 2019 Ververica How to Handle Intermediate Results • Streaming and batch workloads can have conflicting requirements for shuffling data: high throughput vs. low latency • Flink's Netty-based shuffle service offers both high throughput and low latency, but results are bound to tasks → a container cannot be freed as long as its data has not been consumed (diagram: result partition living inside the container)
  22. © 2019 Ververica • Optimized implementations for batch and streaming become possible • Shuffle services with different characteristics: hash-based vs. sort-merge-based shuffles • An external shuffle service allows decoupling intermediate result storage from Flink • See FLIP-31 for more details (diagram: external shuffle service, e.g. Yarn, DFS)
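To illustrate the hash-based vs. sort-merge-based trade-off mentioned above (a simplified plain-Java sketch with hypothetical names, not Flink's implementation): a hash-based writer routes each record directly into one open buffer per downstream subpartition, which gets expensive at high parallelism, while a sort-based writer tags each record with its target subpartition and sorts once, so a single sequential file contains one contiguous range per consumer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShuffleWriterSketch {
    public static int targetSubpartition(String record, int numSubpartitions) {
        return Math.floorMod(record.hashCode(), numSubpartitions);
    }

    // Hash-based: write each record straight into its consumer's buffer.
    // Simple and low-latency, but needs one open buffer/file per subpartition.
    public static Map<Integer, List<String>> hashShuffle(List<String> records, int n) {
        Map<Integer, List<String>> buffers = new HashMap<>();
        for (String r : records)
            buffers.computeIfAbsent(targetSubpartition(r, n),
                                    k -> new ArrayList<>()).add(r);
        return buffers;
    }

    // Sort-merge-based: sort all records by target subpartition, then write one
    // sequential file; each consumer later reads its contiguous range.
    public static List<String> sortShuffle(List<String> records, int n) {
        List<String> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparingInt(r -> targetSubpartition(r, n)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> records = List.of("a", "b", "c", "d", "e");
        System.out.println(hashShuffle(records, 2));
        System.out.println(sortShuffle(records, 2));
    }
}
```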
  23. © 2019 Ververica

  24. © 2019 Ververica Pluggable Scheduling Strategy

  25. © 2019 Ververica Pluggable Scheduling Strategy

  26. © 2019 Ververica Pluggable Shuffle Architecture • Motivation: improve the resource utilization of the cluster; global disk I/O management
  27. © 2019 Ververica Yarn External Shuffle Service • A dedicated Yarn auxiliary service for transferring the shuffle data • The TaskManager is no longer responsible for shuffling the data • The ResourceManager releases idle TaskManagers faster
  28. © 2019 Ververica Performance Improvements

  29. © 2019 Ververica Performance Improvements

  30. © 2019 Ververica Performance Improvements

  31. © 2019 Ververica • A lot of engine improvements to better execute batch jobs: smarter scheduling (Flink 1.10+), better resource utilization (Flink 1.10), faster recovery (Flink 1.9), more flexible data exchange (Flink 1.9) • The Flink community is dedicated to further improving the batch experience • There are plans to contribute the external shuffle service back to Flink
  32. © 2019 Ververica

  33. © 2019 Ververica

  34. © 2019 Ververica www.ververica.com @VervericaData info@ververica.com