Slide 1


© 2019 Ververica

Slide 2

Batch Is Just a Special Case of Streaming

• Batch is just a bounded stream

(Only half of the truth)

Slide 3

Processing Data as It Arrives

[Diagram: records flowing from sources, older to more recent, annotated with watermarks]

Slide 4

Having All Data Available

[Diagram: the same bounded streams, with all data from older to more recent available from the sources at once]

Slide 5

Different Properties

Continuous Streaming:
• Watermarks to model the completeness/latency trade-off
• Incremental results
• In-receive-order ingestion
• All operators need to be running

Batch Processing:
• No watermarks
• Result ready at end of program
• Massively parallel out-of-order ingestion
• Job can be executed in stages
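The completeness/latency trade-off that watermarks model can be sketched in plain Java. This is illustrative only, not Flink's actual watermark API; the class and the `maxDelayMillis` parameter are assumptions for the sketch:

```java
// Illustrative sketch (not Flink's WatermarkGenerator API): a bounded
// out-of-orderness watermark. The watermark trails the largest timestamp
// seen so far by a fixed delay: a larger delay tolerates more out-of-order
// (late) events, i.e. better completeness, at the cost of result latency.
public class WatermarkSketch {
    private long maxSeenTimestamp = Long.MIN_VALUE;
    private final long maxDelayMillis;

    public WatermarkSketch(long maxDelayMillis) {
        this.maxDelayMillis = maxDelayMillis;
    }

    // Observe an event timestamp; events may arrive out of order.
    public void onEvent(long timestampMillis) {
        maxSeenTimestamp = Math.max(maxSeenTimestamp, timestampMillis);
    }

    // The watermark asserts "no events with a smaller timestamp are expected".
    public long currentWatermark() {
        return maxSeenTimestamp - maxDelayMillis;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(100);
        wm.onEvent(1_000);
        wm.onEvent(900);   // out-of-order, but still above the watermark
        System.out.println(wm.currentWatermark()); // 900
    }
}
```

A batch job needs none of this: the input is bounded, so completeness is guaranteed once all data has been read.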

Slide 6

Exploiting Boundedness

SELECT AVG(temperature) FROM sensors GROUP BY location

• Default execution: the Flink optimizer produces continuous operators
• Batch execution (bounded streams): the Flink optimizer produces specialized batch operators
• Efficient execution
  ○ Improved scheduling
  ○ Improved resource utilization
  ○ Faster failovers
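What a specialized batch operator buys for this query can be sketched in plain Java (illustrative, not Flink code; the class and record representation are assumptions): with bounded input, `AVG(temperature) GROUP BY location` can be computed in a single pass and emitted once at the end, instead of emitting an updated average per incoming record as a continuous query must.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: one-pass grouped average over a bounded input.
public class BoundedAvg {
    // Each reading is a (location, temperature) pair.
    public static Map<String, Double> avgByLocation(List<Map.Entry<String, Double>> readings) {
        Map<String, double[]> acc = new HashMap<>(); // location -> {sum, count}
        for (Map.Entry<String, Double> r : readings) {
            double[] a = acc.computeIfAbsent(r.getKey(), k -> new double[2]);
            a[0] += r.getValue();
            a[1] += 1;
        }
        // The input is bounded, so the final result is emitted exactly once.
        Map<String, Double> result = new HashMap<>();
        acc.forEach((loc, a) -> result.put(loc, a[0] / a[1]));
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> readings = List.of(
                Map.entry("berlin", 20.0),
                Map.entry("berlin", 22.0),
                Map.entry("paris", 18.0));
        System.out.println(avgByLocation(readings));
    }
}
```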

Slide 7


Slide 8

Lazy From Sources

• Lazy scheduling (batch case)
  ─ Start deploying all sources
  ─ If data is produced, then start consumers
• Resource underutilization
  ─ Idling tasks due to deploying them too early

[Diagram: Source #1, Source #2, and Source #3 feeding Join #1 and Join #2 via build and probe sides]

Slide 9

Improved Scheduling

• More efficient scheduling by taking dependencies into account
• E.g. the probe side is only scheduled after the build side has been completed

[Diagram: the same topology, with tasks scheduled in stages (1)–(3)]
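Dependency-aware scheduling can be sketched in plain Java (illustrative only, not Flink's scheduler; the task names and the exact wiring of the slide's topology are assumptions): each task lists the tasks that must complete before it may be scheduled, and its stage number is one more than the latest of its prerequisites.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of dependency-aware stage assignment: a task (e.g. a
// hash join's probe side) is only scheduled once its prerequisites (the
// build side) have completed.
public class StageScheduler {
    // inputsOf maps each task to the tasks that must finish before it starts.
    public static Map<String, Integer> stages(Map<String, List<String>> inputsOf) {
        Map<String, Integer> stage = new HashMap<>();
        for (String task : inputsOf.keySet()) stageOf(task, inputsOf, stage);
        return stage;
    }

    private static int stageOf(String task, Map<String, List<String>> inputsOf,
                               Map<String, Integer> stage) {
        Integer s = stage.get(task);
        if (s != null) return s;
        int latestPrerequisite = 0;
        for (String dep : inputsOf.getOrDefault(task, List.of()))
            latestPrerequisite = Math.max(latestPrerequisite, stageOf(dep, inputsOf, stage));
        stage.put(task, latestPrerequisite + 1);
        return latestPrerequisite + 1;
    }

    public static void main(String[] args) {
        // Assumed wiring: source1 builds join1; source2 probes it; join1 is
        // join2's build side, probed by source3.
        Map<String, List<String>> inputs = new HashMap<>();
        inputs.put("source1", List.of());
        inputs.put("join1", List.of("source1"));   // build side completed
        inputs.put("source2", List.of("source1")); // probe starts with join1
        inputs.put("join2", List.of("join1"));
        inputs.put("source3", List.of("join1"));
        System.out.println(stages(inputs));
    }
}
```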

Slide 10

• The Flink community started an effort to make the scheduler pluggable
• The scheduler abstraction makes it possible to react to ExecutionGraph signals
• Specialized scheduler implementations become possible
• See FLINK-10429 for more details

[Diagram: Scheduler abstraction with implementations such as a Pipelined-Regions Scheduler, an Advanced Batch Scheduler, and a Speculative Scheduler]

Slide 11

One Slot Needs to Serve Them All

• Executing a heterogeneous job requires different resources over time
• Statically sized slots are only optimal for one type of task
• Resource underutilization

[Diagram: two TaskExecutors with fixed 10 GB slots; a 10 GB first stage fills them, while a 5 GB second stage leaves half of each slot unused]

Slide 12

Improved Resource Utilization for Heterogeneous Jobs

• Size slots dynamically
  ─ TaskExecutors announce their resources
  ─ Slots are sliced off from the available resources
• Improved resource utilization and elasticity
• See FLIP-56 for more details

[Diagram: TaskExecutors slicing off slots of varying size and freeing the resources as stages complete]
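The dynamic slicing idea can be sketched in plain Java (illustrative only, not the FLIP-56 implementation; the class and slot identifiers are assumptions): instead of pre-cutting fixed-size slots, the TaskExecutor announces a resource budget and slots of varying size are sliced off and returned.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of dynamic slot allocation on one TaskExecutor.
public class DynamicSlots {
    private final int totalGb;
    private int usedGb = 0;
    private final Map<String, Integer> slots = new HashMap<>();

    public DynamicSlots(int totalGb) {
        this.totalGb = totalGb;
    }

    // Slice off a slot of the requested size, if enough resources are free.
    public boolean allocate(String slotId, int gb) {
        if (usedGb + gb > totalGb) return false;
        usedGb += gb;
        slots.put(slotId, gb);
        return true;
    }

    // Free the slot so a differently sized slot can be sliced off later.
    public void free(String slotId) {
        Integer gb = slots.remove(slotId);
        if (gb != null) usedGb -= gb;
    }

    public int freeGb() {
        return totalGb - usedGb;
    }

    public static void main(String[] args) {
        DynamicSlots te = new DynamicSlots(10);
        te.allocate("stage1", 10);  // 1st stage uses the whole executor
        te.free("stage1");
        te.allocate("stage2-a", 5); // 2nd stage: two 5 GB slots fit instead
        te.allocate("stage2-b", 5);
        System.out.println(te.freeGb()); // 0
    }
}
```

With statically sized 10 GB slots, the two 5 GB tasks of the second stage would each occupy a full slot and waste half of it.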

Slide 13

[Chart: comparison of Flink < 1.9 and Flink ≥ 1.9]

Slide 14

Faster Recovery for Flink Jobs

• Try to avoid redundant work
• Data exchange modes
  ─ Pipelined
  ─ Blocking
• Persist intermediate results at blocking data exchanges
• Backtrack from the failed task to the latest intermediate result in case of a failover

Slide 15

Initial Topology

[Diagram: a job graph with pipelined and blocking data exchanges, split into two pipelined regions]

Slide 16

Executing the First Pipelined Regions

[Diagram: the first pipelined region running; downstream regions are not yet deployed]

Slide 17

Execute the Consuming Pipelined Region

[Diagram: the consuming pipelined region running on top of the persisted intermediate results]

Slide 18

Backtrack What Needs to Be Recomputed

[Diagram: after a failure, backtracking from the failed task through the graph to the latest available intermediate results]
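The backtracking step can be sketched in plain Java (illustrative only, not Flink's failover logic; task names are assumptions): starting from the failed task, walk the inputs backwards and stop wherever a persisted intermediate result from a blocking data exchange is still available. Only the visited tasks need to be re-executed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of backtracking recovery over a task graph.
public class Backtracking {
    public static Set<String> tasksToRestart(String failedTask,
                                             Map<String, List<String>> inputsOf,
                                             Set<String> availableResults) {
        Set<String> restart = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(failedTask));
        while (!queue.isEmpty()) {
            String task = queue.pop();
            if (!restart.add(task)) continue; // already scheduled for restart
            for (String input : inputsOf.getOrDefault(task, List.of()))
                if (!availableResults.contains(input)) // persisted result: stop
                    queue.push(input); // result lost: recompute the producer too
        }
        return restart;
    }

    public static void main(String[] args) {
        Map<String, List<String>> inputs = Map.of(
                "sink", List.of("join"),
                "join", List.of("map1", "map2"),
                "map1", List.of(),
                "map2", List.of());
        // map1's blocking result is still persisted, map2's result was lost.
        System.out.println(tasksToRestart("join", inputs, Set.of("map1"))); // [join, map2]
    }
}
```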

Slide 19

[Diagram: only the affected pipelined region is restarted; the persisted intermediate results are reused]

Slide 20

• How to activate
  ─ ExecutionConfig.setExecutionMode(ExecutionMode.BATCH) or ExecutionConfig.setExecutionMode(ExecutionMode.BATCH_FORCED)
  ─ jobmanager.execution.failover-strategy: region
• See FLIP-1 for more details

Slide 21

How to Handle Intermediate Results

• Streaming and batch workloads can have conflicting requirements for shuffling data
  ─ High throughput vs. low latency
• Flink's Netty-based shuffle service
  ─ High-throughput and low-latency shuffle service
  ─ Results are bound to tasks → a container cannot be freed if its data has not been consumed

[Diagram: a container holding a result partition]

Slide 22

• Optimized implementations for batch and streaming become possible
• Shuffle services with different characteristics
  ─ Hash-based vs. sort-merge-based shuffles
• An external shuffle service makes it possible to decouple intermediate result storage from Flink
• See FLIP-31 for more details

[Diagram: external shuffle service (e.g. YARN, DFS)]
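The hash-based variant can be sketched in plain Java (illustrative only, not a Flink shuffle implementation; the class and partition count are assumptions): each record is routed to one of N downstream consumers by the hash of its key, so equal keys always meet in the same partition. A sort-merge-based shuffle would instead write sorted runs and merge them on the consumer side.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of hash-based shuffle partitioning.
public class HashShuffle {
    public static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static List<List<String>> shuffle(List<String> keys, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());
        for (String key : keys)
            partitions.get(partitionFor(key, numPartitions)).add(key);
        return partitions;
    }

    public static void main(String[] args) {
        List<List<String>> parts = shuffle(List.of("berlin", "paris", "berlin"), 2);
        // Equal keys always land in the same partition.
        System.out.println(parts.get(partitionFor("berlin", 2)));
    }
}
```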

Slide 23


Slide 24

Pluggable Scheduling Strategy

Slide 25

Pluggable Scheduling Strategy

Slide 26

Pluggable Shuffle Architecture

• Motivation
  ─ Improve the resource utilization of the cluster
  ─ Global disk I/O management

Slide 27

Yarn External Shuffle Service

• A dedicated YARN auxiliary service for transferring the shuffle data
• The TaskManager is no longer responsible for shuffling the data
• The ResourceManager can release idle TaskManagers faster

Slide 28

Performance Improvements

Slide 29

Performance Improvements

Slide 30

Performance Improvements

Slide 31

• A lot of engine improvements to better execute batch jobs
  ─ Smarter scheduling (Flink 1.10+)
  ─ Better resource utilization (Flink 1.10)
  ─ Faster recovery (Flink 1.9)
  ─ More flexible data exchange (Flink 1.9)
• The Flink community is dedicated to further improving the batch experience
• There are plans to contribute the external shuffle service back to Flink

Slide 32


Slide 33


Slide 34

www.ververica.com
@VervericaData
[email protected]