
News From Flink's Engine Room: "Full Steam Ahead"

With every release, Flink gets closer to becoming a truly unified batch and stream processor. This unification is not only a big effort for the API layer; it also poses great challenges for the runtime, which has to handle streaming and batch workloads alike, and do so efficiently. Over the past two years, the community has therefore spent a lot of effort on preparing Flink’s engine for these challenges. At the same time, the community started addressing one of Flink’s biggest operational limitations: making it fully resource elastic.

In this talk I want to explain the big changes that happened in Flink’s runtime, how they help our users, and in which direction the runtime will evolve from here. We will start by recapping how the community improved Flink’s batch capabilities and made the system more extensible. Next, we will take a look at Flink’s new unified pipelined region scheduler and see how it improves resource utilization. Last but not least, I want to explain how the reactive execution mode will enable auto-scaling and make Flink fully resource elastic.

Till Rohrmann

October 22, 2020

Transcript

  1. © 2020 Ververica News from Flink’s engine room: “Full steam

    ahead” Till Rohrmann @stsffap
  2. © 2020 Ververica Scheduling And Failover

  3. © 2020 Ververica Recap: Batch & Streaming Unification One engine

    to rule them all • Batch is just a bounded stream! 3
  4. © 2020 Ververica Unbounded Stream Processing Processing Data as It

    Arrives 4 older more recent Watermarks Sources
  5. © 2020 Ververica Bounded Stream Processing Having All Data Available

    5 older more recent Watermarks Sources
  6. © 2020 Ververica How to Process Bounded Streams Fast? •

    All data is available at start time ─ Massively parallel out-of-order ingestion ─ Latency not very important → efficient batching of records ─ Optimized operators ─ Results are ready at the end → no watermarks, no incremental results ─ Job can be executed in stages Boundedness Allows Different Execution Strategies 6
  7. © 2020 Ververica Recap: Faster Failover for Bounded Streams •

    Avoid redundant work due to failovers • Separate the topology into pipelined regions • Store the results produced by each pipelined region • Resume the computation from the latest available result FLIP-1: Fine Grained Recovery 7 Src Map Sink Operator Result
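To make the pipelined-region idea concrete, here is a minimal Python sketch (not Flink's actual implementation) of how a topology can be split into pipelined regions: tasks joined by pipelined edges must run together, so the graph is partitioned by cutting every blocking edge, using a small union-find.

```python
# Sketch (not Flink's code): compute pipelined regions of a job graph.
# Tasks connected by pipelined edges must run at the same time;
# blocking edges cut the graph into independently schedulable regions.

def pipelined_regions(tasks, edges):
    """edges: list of (src, dst, kind) with kind 'pipelined' or 'blocking'."""
    parent = {t: t for t in tasks}

    def find(t):
        # Union-find root lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    # Merge the endpoints of every pipelined edge into one region.
    for src, dst, kind in edges:
        if kind == "pipelined":
            parent[find(src)] = find(dst)

    regions = {}
    for t in tasks:
        regions.setdefault(find(t), set()).add(t)
    return sorted(map(sorted, regions.values()))

# Src -> Map is pipelined, Map -> Sink is blocking (as on the slide):
print(pipelined_regions(
    ["Src", "Map", "Sink"],
    [("Src", "Map", "pipelined"), ("Map", "Sink", "blocking")]))
# -> [['Map', 'Src'], ['Sink']]
```

With the blocking edge in between, `Src` and `Map` form one region whose result can be stored, and `Sink` forms a second region that can be resumed from that result after a failover.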
  8. © 2020 Ververica What’s The Point? • TPC-H Query 3

    • Exponential failure rate Benefits of FLIP-1 8
    SELECT l_orderkey, SUM(l_extendedprice*(1-l_discount)) AS revenue,
           o_orderdate, o_shippriority
    FROM customer, orders, lineitem
    WHERE c_mktsegment = '[SEGMENT]'
      AND c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND o_orderdate < date '[DATE]'
      AND l_shipdate > date '[DATE]'
    GROUP BY l_orderkey, o_orderdate, o_shippriority;
  9. © 2020 Ververica How to Benefit From FLIP-1? • FLIP-1

    introduced with Flink 1.9 • FLIP-1 is used when using the Blink Table Planner • DataSet jobs use pipelined mode by default → FLIP-1 won’t have any effect unless the ExecutionMode is changed ─ ExecutionConfig.setExecutionMode(ExecutionMode.BATCH) ─ ExecutionConfig.setExecutionMode(ExecutionMode.BATCH_FORCED) It is not always on! 9
  10. © 2020 Ververica Problems When Scheduling Bounded Streams • Lazy-from-sources

    scheduling strategy ─ Task centric view ─ Schedule tasks as soon as inputs are ready Flink’s Old Scheduler 10 SELECT customerId, name FROM customers, orders WHERE customerId = orderCustomerId Csts Ords Join Blocking Pipelined Tasks to schedule: Csts, Ords, Join #Available slots: 1 Scheduling Order 1: 1. Ords 2. ? Scheduling Order 2: 1. Csts 2. Ords 3. Join
  11. © 2020 Ververica Pipelined Regions Scheduler • The scheduling unit is

    the pipelined region (all tasks that need to run at the same time) • Schedule a pipelined region as soon as all of its inputs are ready Pipelined Region Centric View 11 Csts Ords Join Blocking Pipelined Pipelined region Pipelined region Scheduling order: 1. Csts 2. Ords + Join
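The scheduling order on the slide can be sketched in a few lines of Python (an illustrative simulation, not Flink's scheduler): a region becomes schedulable once every region feeding it over a blocking edge has finished, so `Csts` runs first and `Ords` + `Join` run together afterwards.

```python
# Sketch: derive the scheduling order of pipelined regions.
# A region is ready once all regions producing its blocking inputs
# have finished (each "round" below is one scheduling step).

def region_schedule(regions, blocking_edges):
    """regions: list of task sets; blocking_edges: (producer, consumer) tasks."""
    of = {t: i for i, r in enumerate(regions) for t in r}
    deps = {i: set() for i in range(len(regions))}
    for src, dst in blocking_edges:
        if of[src] != of[dst]:
            deps[of[dst]].add(of[src])

    done, order = set(), []
    while len(done) < len(regions):
        ready = [i for i in range(len(regions))
                 if i not in done and deps[i] <= done]
        order.append([sorted(regions[i]) for i in ready])
        done.update(ready)
    return order

# Csts feeds Join over a blocking edge; Ords and Join are pipelined:
print(region_schedule([{"Csts"}, {"Ords", "Join"}], [("Csts", "Join")]))
# -> [[['Csts']], [['Join', 'Ords']]]
```

This is also why one slot suffices: only the region that can actually make progress occupies resources at any point in time.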
  12. © 2020 Ververica Benefits of Pipelined Region Scheduler • Reliable

    scheduling of bounded jobs under constrained resources ─ Guarantees to make progress as long as the largest pipelined region can be run ─ No more deadlocks due to bad scheduling decisions • Better resource utilization ─ Only schedule tasks which can actually make progress 12
  13. © 2020 Ververica Unified Batch & Streaming Scheduling & Failover

    • Pipelined regions are units for scheduling & failover • Generalizes well to streaming/unbounded workloads → Just a single large pipelined region which produces infinite results ─ If single pipelined region: Pipelined region scheduling == “All at once” scheduling strategy Putting the Pieces Together 13 Pipelined region
  14. © 2020 Ververica Elastic Streaming Pipelines

  15. © 2020 Ververica Changing Workloads Change is The Only Constant

    15
  16. © 2020 Ververica Elastic Streaming Pipelines Adjust to The Actual

    Workload 16
  17. © 2020 Ververica Deployment Modes Flink is Not Always in

    Charge 17 Active deployments: • Yarn, Mesos, Kubernetes • Flink can ask for more resources Oblivious deployments: • Standalone, Containerized • Resources are assigned by a third party
  18. © 2020 Ververica Reactive Execution Mode Reacting to Available Resources

    18 Job Master Resource Manager Need ∞ resources TaskExecutor Register( ) Assign( ) TaskExecutor Register( )
  19. © 2020 Ververica How Can Flink Declare ∞ Resources? Old

    slot allocation protocol • Every task asks for its slot individually • Fails if we cannot obtain all slots ⇒ Won’t work if we want to react to available resources Declarative slot allocation protocol • Declare the amount of required resources • ResourceManager tries to fulfill the declared resources as well as possible • Reactive mode declares ∞ resource requirements → all slots go to the JobMaster as soon as they arrive • FLIP-138: Declarative Resource Management A New Slot Allocation Protocol 19
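The declarative protocol can be illustrated with a small simulation (class and method names here are illustrative, not Flink's internal API): the job declares its requirement once, and the ResourceManager forwards arriving slots for as long as the requirement is not yet met. Declaring an infinite requirement, as reactive mode does, means every slot that ever registers goes to the job.

```python
import math

# Sketch of FLIP-138-style declarative slot allocation:
# the requirement is declared up front, and arriving TaskExecutor
# slots are assigned until the declaration is fulfilled.

class ResourceManager:
    def __init__(self):
        self.required = 0
        self.assigned = []

    def declare(self, slots):
        # One declaration replaces per-task slot requests.
        self.required = slots

    def register_slot(self, slot):
        # A TaskExecutor slot arrives; hand it over if still needed.
        if len(self.assigned) < self.required:
            self.assigned.append(slot)

rm = ResourceManager()
rm.declare(math.inf)  # reactive mode: "need infinite resources"
for slot in ["slot-1", "slot-2", "slot-3"]:
    rm.register_slot(slot)
print(rm.assigned)  # all three slots reach the job
```

In contrast, the old protocol would have failed the job outright if any individual slot request could not be served.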
  20. © 2020 Ververica How to Make Use of Changing Resources?

    Old scheduler 1. Pre-determine the parallelism 2. Ask for slots 3. Execute the job Declarative scheduler 1. Declare required resources 2. Wait for resources to arrive 3. Decide on the parallelism based on available resources ⇒ Invert resource declaration and deciding on parallelism 4. Adjust parallelism if more resources arrive The Declarative Scheduler 20
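The inversion described above can be sketched as a single function (illustrative only, not Flink's API), using the numbers from the example on the following slides: the job declares 4 required slots for a target parallelism of 2, so each parallel pipeline needs 2 slots, and the actual parallelism is decided only after seeing how many slots arrived.

```python
# Sketch: the declarative scheduler decides the parallelism from the
# slots that actually arrived, instead of pre-determining it.

def decide_parallelism(required_slots, target_parallelism, available_slots):
    # Slots needed per parallel pipeline (2 in the slides' example:
    # 4 required slots for a target parallelism of 2).
    per_pipeline = required_slots // target_parallelism
    return min(target_parallelism, available_slots // per_pipeline)

print(decide_parallelism(4, 2, 0))  # no slots yet -> parallelism 0
print(decide_parallelism(4, 2, 2))  # two slots    -> parallelism 1
print(decide_parallelism(4, 2, 4))  # all four     -> parallelism 2
```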
  21. © 2020 Ververica The Declarative Scheduler A Small Example 21

    JobGraph Resources ExecutionGraph ∅ Required Available Used 4 0 0 Parallelism: 0
  22. © 2020 Ververica The Declarative Scheduler A Small Example 22

    JobGraph ExecutionGraph Resources Required Available Used 4 2 2 Parallelism: 1
  23. © 2020 Ververica The Declarative Scheduler A Small Example 23

    JobGraph ExecutionGraph Resources Required Available Used 4 4 2 ⇒ Take checkpoint and trigger job restart How to make use of the new resources? Parallelism: 1
  24. © 2020 Ververica The Declarative Scheduler A Small Example 24

    JobGraph ExecutionGraph Resources Required Available Used 4 4 4 Parallelism: 2
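The rescaling step in the example above can be summarized as a tiny sketch (names are illustrative): when more slots become available than the running parallelism uses, the scheduler takes a checkpoint to preserve state and restarts the job at the higher parallelism.

```python
# Sketch of the rescaling decision from the example: going from
# parallelism 1 to 2 once the two additional slots have arrived.

def maybe_rescale(current_parallelism, new_parallelism, events):
    if new_parallelism > current_parallelism:
        events.append("checkpoint")                    # preserve state
        events.append(f"restart@p={new_parallelism}")  # redeploy tasks
        return new_parallelism
    return current_parallelism

events = []
p = maybe_rescale(1, 2, events)  # two more slots arrived: 1 -> 2
print(p, events)
```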
  25. © 2020 Ververica Outlook Autoscaling • User defined RescalingPolicies set

    target value ─ target: Ideal parallelism to run the job with • Periodically querying the RescalingPolicies for target values • Declare target resource requirements • Rely on declarative scheduler to rescale job when new resources arrive Enabling Flink to Scale an Application 25 t = 1 t = 1 ResourceManager
  26. © 2020 Ververica Outlook Autoscaling • User defined RescalingPolicies set

    target value ─ target: Ideal parallelism to run the job with • Periodically querying the RescalingPolicies for target values • Declare target resource requirements • Rely on declarative scheduler to rescale job when new resources arrive Enabling Flink to Scale an Application 26 t = 2 t = 1 ResourceManager #Target Slots: 3
  27. © 2020 Ververica Outlook Autoscaling • User defined RescalingPolicies set

    target value ─ target: Ideal parallelism to run the job with • Periodically querying the RescalingPolicies for target values • Declare target resource requirements • Rely on declarative scheduler to rescale job when new resources arrive Enabling Flink to Scale an Application 27 t = 2 t = 1 ResourceManager Allocate 3rd slot
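The autoscaling loop sketched on these slides reduces to a simple cycle (the `RescalingPolicy` interface is an outlook and the shape below is an assumption): poll the policy for its target parallelism, translate it into a slot declaration, and let the declarative scheduler do the actual rescaling when those slots arrive.

```python
# Sketch of the autoscaling outlook: each polling step turns the
# policy's target parallelism into a declared slot requirement.

def autoscale_step(policy_target, slots_per_pipeline, declare):
    declare(policy_target * slots_per_pipeline)

declared = []
# t=1: the policy's target is 2; t=2: load grew, the target is 3.
for target in (2, 3):
    autoscale_step(target, 1, declared.append)
print(declared)  # slot declarations over time
```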
  28. © 2020 Ververica User Benefits • Better resource utilization under

    changing workloads (no more under/over-provisioning) • Easier operations ─ Resources can be added on the fly ─ Flink can better tolerate resource loss • Easier deployments ─ Application style deployments w/o running a cluster 28
  29. © 2020 Ververica Conclusion • Unified scheduling and failover for

    batch & streaming • Flink schedules and fails over batch jobs now more efficiently • Flink will soon support fully elastic streaming pipelines ─ Being able to better handle changing workloads • Reactive mode will ease operations and deployment significantly What to take home? 29
  30. © 2020 Ververica THANK YOU!

  31. © 2020 Ververica Ververica is hiring! Write me (till@ververica.com) or

    visit https://www.ververica.com/careers
  32. © 2020 Ververica QUESTIONS?