
Flink’s New Batch Architecture


Since its inception, Flink has supported executing batch workloads. Using specialized operators for processing bounded streams allows Flink to achieve very decent batch performance. However, Flink’s fault recovery in particular, which restarts the whole topology in case of task failures, caused problems for large and complex batch jobs. Moreover, supporting batch and streaming alike required generalizations in some components that prevented further batch optimizations:

* The scheduler needs to schedule topologies with complex dependencies as well as low latency requirements

* The shuffle service needs to support high-throughput batch as well as fast streaming data exchanges

In this talk, we will shed some light on the community’s effort to address these limitations and on the new components that compose Flink’s improved batch architecture. We will demonstrate how the new fine-grained recovery feature minimizes the set of computations to restart in case of a failover. Moreover, we will explain how a batch job differs from a streaming job and what this means for the scheduler. We will also discuss why it can be beneficial to separate results from computation and how Flink supports this. Last but not least, we want to give an outlook on possible future improvements, such as support for speculative execution and RDMA-based data exchanges, and how they relate to Flink’s new batch architecture.

Till Rohrmann

October 08, 2019

Transcript

  1. © 2019 Ververica

  2. © 2019 Ververica Batch Is Just a Special Case of Streaming • Batch is just a bounded stream • Only half of the truth
  3. © 2019 Ververica Processing Data as It Arrives (diagram: sources emitting watermarks, older → more recent)
  4. © 2019 Ververica Having All Data Available (diagram: sources with watermarks, older → more recent)
  5. © 2019 Ververica Different Properties • Continuous Streaming: watermarks to model the completeness/latency trade-off; incremental results; ingestion in receive order; all operators need to be running • Batch Processing: no watermarks; result ready at end of program; massively parallel out-of-order ingestion; job can be executed in stages
  6. © 2019 Ververica Exploiting Boundedness • SELECT AVG(temperature) FROM sensors GROUP BY location • Flink optimizer: default execution → continuous operators; batch execution on bounded streams → specialized batch operators • Efficient execution: improved scheduling, improved resource utilization, faster failovers
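To make the contrast between the two execution modes concrete, here is a minimal plain-Java sketch (no Flink APIs; all names are hypothetical) of what a specialized batch operator can do for the query above: because the input is bounded, a single pass with one (sum, count) accumulator per location suffices, and the final averages are emitted only once all data has been seen.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchAvgSketch {
    // Batch-style evaluation of: SELECT AVG(temperature) FROM sensors GROUP BY location.
    // One (sum, count) accumulator per location; result is ready at end of input.
    public static Map<String, Double> avgByLocation(List<Object[]> rows) {
        Map<String, double[]> acc = new HashMap<>();
        for (Object[] row : rows) {                      // row = {location, temperature}
            double[] a = acc.computeIfAbsent((String) row[0], k -> new double[2]);
            a[0] += (Double) row[1];                     // running sum
            a[1] += 1.0;                                 // running count
        }
        Map<String, Double> result = new HashMap<>();
        acc.forEach((location, a) -> result.put(location, a[0] / a[1]));
        return result;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[]{"berlin", 10.0});
        rows.add(new Object[]{"berlin", 20.0});
        rows.add(new Object[]{"paris", 30.0});
        // A streaming version would have to emit an updated average per record;
        // the batch version produces the final result exactly once.
        System.out.println(avgByLocation(rows));
    }
}
```

A streaming operator cannot take this shortcut: without boundedness it must keep emitting (and retracting) intermediate averages, which is exactly the "incremental results" property from slide 5.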
  7. © 2019 Ververica

  8. © 2019 Ververica Lazy From Sources • Lazy scheduling (batch case): start by deploying all sources; if data is produced, then start the consumers • Resource underutilization: idling tasks due to deploying them too early (diagram: Source #1/#2 feed Join #1's build and probe sides; Join #1 and Source #3 feed Join #2's build and probe sides)
  9. © 2019 Ververica Improved Scheduling • More efficient scheduling by taking dependencies into account • E.g. the probe side is only scheduled after the build side has completed (diagram: same topology, scheduled in order (1), (2), (2), (3))
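As a toy illustration of dependency-aware scheduling, the following plain-Java sketch (no Flink APIs; the names and the dependency encoding are hypothetical) assigns each task the earliest stage in which all of its scheduling prerequisites have finished. Modeling "the probe side is scheduled only after the build side completed" as such a prerequisite reproduces the (1), (2), (2), (3) ordering from the slide.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StageAssignment {
    // deps maps a task to the tasks that must have *finished* before it may start.
    // A task's stage is one more than the latest stage among its prerequisites.
    public static Map<String, Integer> stages(Map<String, List<String>> deps) {
        Map<String, Integer> stage = new HashMap<>();
        for (String task : deps.keySet()) stageOf(task, deps, stage);
        return stage;
    }

    private static int stageOf(String task, Map<String, List<String>> deps,
                               Map<String, Integer> stage) {
        Integer known = stage.get(task);
        if (known != null) return known;
        int s = 1;                                   // tasks without prerequisites run first
        for (String dep : deps.getOrDefault(task, List.of()))
            s = Math.max(s, stageOf(dep, deps, stage) + 1);
        stage.put(task, s);
        return s;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("source1", List.of());              // build input of join1
        deps.put("source2", List.of("source1"));     // probe input: waits for the build side
        deps.put("join1", List.of("source1"));       // join1 starts once its build side is done
        deps.put("source3", List.of("join1"));       // probe input of join2
        deps.put("join2", List.of("join1"));
        // Computed stages: source1 → 1, source2 and join1 → 2, source3 and join2 → 3.
        System.out.println(stages(deps));
    }
}
```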
  10. © 2019 Ververica • The Flink community started an effort to make the scheduler pluggable • The scheduler abstraction allows reacting to ExecutionGraph signals • Specialized scheduler implementations become possible • See FLINK-10429 for more details (diagram: Scheduler → Pipelined-Regions Scheduler, Advanced Batch Scheduler, Speculative Scheduler)
  11. © 2019 Ververica One Slot Needs to Serve Them All • Executing a heterogeneous job requires different resources over time • Statically sized slots are only optimal for one type of task • Resource underutilization (diagram: TaskExecutors with fixed 10 GB slots serving a 10 GB first stage and a 5 GB second stage)
  12. © 2019 Ververica Improved Resource Utilization for Heterogeneous Jobs • Size slots dynamically: TaskExecutors announce their resources; slots are sliced off from the available resources • Improved resource utilization and elasticity • See FLIP-56 for more details (diagram: TaskExecutors slicing off and freeing dynamically sized slots as the 10 GB and 5 GB stages run)
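The idea of slicing dynamically sized slots off an announced resource budget can be sketched in a few lines of plain Java (hypothetical names and a deliberately simplified model; the real mechanism is described in FLIP-56):

```java
public class DynamicSlotPoolSketch {
    // A TaskExecutor announces a total memory budget once; slots of varying
    // size are sliced off on demand and given back when a stage finishes.
    private final int totalMb;
    private int usedMb = 0;

    public DynamicSlotPoolSketch(int totalMb) { this.totalMb = totalMb; }

    /** Slice a slot off the remaining budget; refuse instead of over-committing. */
    public boolean allocate(int slotMb) {
        if (usedMb + slotMb > totalMb) return false;
        usedMb += slotMb;
        return true;
    }

    public void free(int slotMb) { usedMb -= slotMb; }

    public int availableMb() { return totalMb - usedMb; }

    public static void main(String[] args) {
        DynamicSlotPoolSketch executor = new DynamicSlotPoolSketch(10_240); // 10 GB
        // 1st stage: one 10 GB task fills the executor completely.
        executor.allocate(10_240);
        executor.free(10_240);
        // 2nd stage: two 5 GB tasks now fit into the same executor, instead of
        // each wasting half of a statically sized 10 GB slot.
        executor.allocate(5_120);
        executor.allocate(5_120);
        System.out.println(executor.availableMb());
    }
}
```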
  13. © 2019 Ververica (diagram: Flink < 1.9 vs. Flink ≥ 1.9)

  14. © 2019 Ververica Faster Recovery for Flink Jobs • Try to avoid redundant work • Data exchange modes: pipelined and blocking • Persist intermediate results at blocking data exchanges • Backtrack from the failed task to the latest intermediate result in case of a failover
  15. © 2019 Ververica Initial Topology (diagram: pipelined and blocking data exchanges, two pipelined regions)
  16. © 2019 Ververica Executing the First Pipelined Regions (diagram: first pipelined region running, separated by a blocking data exchange)
  17. © 2019 Ververica Execute the Consuming Pipelined Region (diagram: pipelined regions with persisted intermediate results at the blocking data exchange)
  18. © 2019 Ververica Backtrack What Needs to Be Recomputed (diagram: pipelined regions, blocking data exchange, persisted intermediate results)
  19. © 2019 Ververica (diagram: pipelined regions, blocking data exchanges, and intermediate results)
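The two building blocks of slides 15–19, pipelined regions and backtracking, can be sketched in plain Java (hypothetical names, a deliberately simplified model, not Flink's implementation): a region is a connected component of the graph restricted to pipelined edges, and on a failure the failed task's region is restarted together with any upstream region whose persisted intermediate result has been lost.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FineGrainedRecoverySketch {
    // Tasks linked by pipelined edges must fail over together, so a region is
    // a connected component over pipelined edges only (simple union-find).
    public static Map<String, Integer> regions(List<String> tasks,
                                               List<String[]> pipelinedEdges) {
        Map<String, String> parent = new HashMap<>();
        for (String t : tasks) parent.put(t, t);
        for (String[] e : pipelinedEdges)                 // e = {from, to}
            parent.put(find(parent, e[0]), find(parent, e[1]));
        Map<String, Integer> regionOf = new HashMap<>();
        Map<String, Integer> idByRoot = new HashMap<>();
        for (String t : tasks) {
            String root = find(parent, t);
            if (!idByRoot.containsKey(root)) idByRoot.put(root, idByRoot.size());
            regionOf.put(t, idByRoot.get(root));
        }
        return regionOf;
    }

    private static String find(Map<String, String> parent, String x) {
        while (!parent.get(x).equals(x)) x = parent.get(x);
        return x;
    }

    // Backtracking: restart the failed task's region; for every blocking input
    // edge into a restarted region whose persisted result was lost, also
    // restart the producing region, transitively.
    public static Set<Integer> restartSet(String failedTask,
                                          Map<String, Integer> regionOf,
                                          List<String[]> blockingEdges,
                                          Set<String> lostResults) {
        Set<Integer> restart = new HashSet<>();
        Deque<Integer> work = new ArrayDeque<>();
        work.push(regionOf.get(failedTask));
        while (!work.isEmpty()) {
            int region = work.pop();
            if (!restart.add(region)) continue;
            for (String[] e : blockingEdges)              // e = {producer, consumer}
                if (regionOf.get(e[1]) == region && lostResults.contains(e[0]))
                    work.push(regionOf.get(e[0]));
        }
        return restart;
    }

    public static void main(String[] args) {
        List<String> tasks = List.of("src", "map", "agg", "sink");
        List<String[]> pipelined = List.of(new String[]{"src", "map"},
                                           new String[]{"agg", "sink"});
        List<String[]> blocking = List.of(new String[]{"map", "agg"});
        Map<String, Integer> regionOf = regions(tasks, pipelined);
        // If agg fails and map's intermediate result is still available, only
        // the downstream region restarts; if the result was lost, both do.
        System.out.println(restartSet("agg", regionOf, blocking, Set.of()));
        System.out.println(restartSet("agg", regionOf, blocking, Set.of("map")));
    }
}
```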
  20. © 2019 Ververica • How to activate: ExecutionConfig.setExecutionMode(ExecutionMode.BATCH) or ExecutionConfig.setExecutionMode(ExecutionMode.BATCH_FORCED), plus jobmanager.execution.failover-strategy: region • See FLIP-1 for more details
  21. © 2019 Ververica How to Handle Intermediate Results • Streaming and batch workloads can have conflicting requirements for shuffling data: high throughput vs. low latency • Flink's Netty-based shuffle service offers both high throughput and low latency, but results are bound to tasks → a container cannot be freed as long as its data has not been consumed (diagram: result partition living inside the container)
  22. © 2019 Ververica • Optimized implementations for batch and streaming become possible • Shuffle services with different characteristics: hash-based vs. sort-merge-based shuffles • An external shuffle service allows decoupling intermediate result storage from Flink • See FLIP-31 for more details (diagram: external shuffle service, e.g. Yarn, DFS)
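To illustrate the hash-based vs. sort-merge-based trade-off mentioned above (a simplified plain-Java sketch with hypothetical names, not Flink's implementation): a hash-based writer routes each record directly into one open buffer per downstream subpartition, which gets expensive at high parallelism, while a sort-based writer tags each record with its target subpartition and sorts once, so a single sequential file contains one contiguous range per consumer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShuffleWriterSketch {
    public static int targetSubpartition(String record, int numSubpartitions) {
        return Math.floorMod(record.hashCode(), numSubpartitions);
    }

    // Hash-based: write each record straight into its consumer's buffer.
    // Simple and low-latency, but needs one open buffer/file per subpartition.
    public static Map<Integer, List<String>> hashShuffle(List<String> records, int n) {
        Map<Integer, List<String>> buffers = new HashMap<>();
        for (String r : records)
            buffers.computeIfAbsent(targetSubpartition(r, n),
                                    k -> new ArrayList<>()).add(r);
        return buffers;
    }

    // Sort-merge-based: sort all records by target subpartition, then write one
    // sequential file; each consumer later reads its contiguous range.
    public static List<String> sortShuffle(List<String> records, int n) {
        List<String> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparingInt(r -> targetSubpartition(r, n)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> records = List.of("a", "b", "c", "d", "e");
        System.out.println(hashShuffle(records, 2));
        System.out.println(sortShuffle(records, 2));
    }
}
```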
  23. © 2019 Ververica

  24. © 2019 Ververica Pluggable Scheduling Strategy

  25. © 2019 Ververica Pluggable Scheduling Strategy

  26. © 2019 Ververica Pluggable Shuffle Architecture • Motivation: improve the resource utilization of the cluster; global disk I/O management
  27. © 2019 Ververica Yarn External Shuffle Service • A dedicated Yarn auxiliary service for transferring the shuffle data • The TaskManager is no longer responsible for shuffling the data • The ResourceManager releases idle TaskManagers faster
  28. © 2019 Ververica Performance Improvements

  29. © 2019 Ververica Performance Improvements

  30. © 2019 Ververica Performance Improvements

  31. © 2019 Ververica • A lot of engine improvements to better execute batch jobs: smarter scheduling (Flink 1.10+), better resource utilization (Flink 1.10), faster recovery (Flink 1.9), more flexible data exchange (Flink 1.9) • The Flink community is dedicated to further improving the batch experience • There are plans to contribute the external shuffle service back to Flink
  32. © 2019 Ververica

  33. © 2019 Ververica

  34. © 2019 Ververica www.ververica.com @VervericaData info@ververica.com