Slide 1

Slide 1 text

Comparing Apache Flink and Spark for Modern Stream Data Processing Sharon Xie Eric Xiao

Slide 2

Slide 2 text

What are we going to learn? 01. Architecture Design 02. Decodable's Evaluation Framework 03. Detailed Comparison: Spark Streaming vs. Flink Streaming Mode

Slide 3

Slide 3 text

Your Presenters
Sharon Xie, Head of Product ● Founding Engineer @ Decodable ● 7 years of building real-time data platforms
Eric Xiao, Data Platform ● Senior Software Engineer @ Decodable ● Previously on the Streaming Platform teams @ Shopify and Walmart

Slide 4

Slide 4 text

Decodable is powered by Flink and Debezium

Slide 5

Slide 5 text

How did Flink and Spark solve the problem? 01. Architecture Design

Slide 6

Slide 6 text

Context Typical real-time data platform

Slide 7

Slide 7 text

Design - Spark Streaming. Micro-batch: receives live input data streams and divides the data into batches. Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html
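
A minimal PySpark sketch of the micro-batch model described above, using the legacy DStream API from the linked guide; the socket source, port, and 5-second interval are illustrative assumptions, not part of the talk.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "MicroBatchSketch")
    ssc = StreamingContext(sc, batchDuration=5)   # input buffered for 5 seconds becomes one batch

    lines = ssc.socketTextStream("localhost", 9999)   # placeholder source host/port
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))  # recomputed once per micro-batch
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()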

Slide 8

Slide 8 text

Can batch size = 1 for Spark Streaming? No. - Data is batched on a fixed interval - Job scheduling overhead >> data processing time Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html
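
For comparison, a hedged Structured Streaming sketch (not from the slides): output is still driven by a scheduled trigger interval, so shrinking the interval toward single-record batches mostly adds job-scheduling overhead; the rate source and 1-second trigger are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rate = spark.readStream.format("rate").load()   # built-in test source

    query = (rate.writeStream
                 .format("console")
                 .trigger(processingTime="1 second")   # a micro-batch is scheduled every second
                 .start())
    query.awaitTermination()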

Slide 9

Slide 9 text

Design - Flink. Stream processing: continuously processes the data as it arrives. Source: https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/learn-flink/overview/
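
A minimal PyFlink DataStream sketch of the record-at-a-time model; the in-memory collection is an assumption standing in for a real unbounded source such as Kafka.

    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    words = env.from_collection(["stream", "processing", "stream"])  # placeholder for an unbounded source
    words.map(lambda w: (w, 1)).print()   # each record flows through the operators as it arrives
    env.execute("continuous_processing_sketch")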

Slide 10

Slide 10 text

Can batch size > 1 for Flink? Yes. Some operators (aggregation / join) can enable mini-batches. Source: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tuning/#minibatch-regular-joins
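
A hedged PyFlink sketch of the mini-batch settings the linked tuning page describes; the latency and size values are illustrative.

    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
    config = t_env.get_config()
    config.set("table.exec.mini-batch.enabled", "true")        # buffer input records before firing
    config.set("table.exec.mini-batch.allow-latency", "5 s")   # flush at most every 5 seconds...
    config.set("table.exec.mini-batch.size", "5000")           # ...or every 5000 records, whichever comes first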

Slide 11

Slide 11 text

02. Evaluation Framework: What's important to Decodable?

Slide 12

Slide 12 text

Benchmark

Slide 13

Slide 13 text

Benchmark Most of the time

Slide 14

Slide 14 text

Benchmark Most of the time

Slide 15

Slide 15 text

How did we approach benchmarks? ● Context matters ○ Latency vs. throughput ● Design reveals potential ○ Whether it can scale matters more than static numbers

Slide 16

Slide 16 text

How did we choose? The ultimate test

Slide 17

Slide 17 text

What do we believe? Expected Value: ⚡ Low Latency = Competitive Edge ✨ Streaming > Batch

Slide 18

Slide 18 text

How much does it cost to run the workload? Cost: Development Complexity, Infrastructure Cost, Operational Overhead

Slide 19

Slide 19 text

Detailed Comparison Eric Xiao

Slide 20

Slide 20 text

Example Use Case

Slide 21

Slide 21 text

Example Use Case

Slide 22

Slide 22 text

Example Use Case

Slide 23

Slide 23 text

Development Complexity

Slide 24

Slide 24 text

API and Language Support
Language Support: Flink - Java, Python; Spark - Java/Scala, Python, R
Higher Level APIs: Flink - SQL, Table API; Spark - SQL, DataFrames
Lower Level APIs: Flink - DataStream, ProcessFunction; Spark - N/A
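
To make the higher-level APIs concrete, a hedged sketch of the same kind of filter in Flink SQL (via PyFlink) and Spark DataFrames (via PySpark); both use built-in test sources, and the table and column names are illustrative.

    from pyflink.table import EnvironmentSettings, TableEnvironment
    from pyspark.sql import SparkSession

    # Flink SQL / Table API
    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
    t_env.execute_sql("""
        CREATE TABLE orders (user_id BIGINT, amount DOUBLE)
        WITH ('connector' = 'datagen')   -- built-in test source
    """)
    big_orders = t_env.sql_query("SELECT user_id, amount FROM orders WHERE amount > 100")

    # Spark SQL / DataFrames
    spark = SparkSession.builder.getOrCreate()
    rate = spark.readStream.format("rate").load()              # built-in test source: timestamp, value
    big_values = rate.filter(rate.value > 100).select("value", "timestamp")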

Slide 25

Slide 25 text

(Stateful) Streaming Transformations: Characteristics
"Stateless": simple transformations (filters, projections, etc.); requires no other "context" (other streams, other events).
Stateful (Temporary): aggregations, "timely" joins, windows; requires "context" that can be dropped after some time.
Stateful (Indefinite): Top N, joining with a "static" dataset; requires "context" to be stored forever.
More state is required as you move down the list.
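
A hedged example of the "Stateful (Temporary)" category in PySpark: a windowed count keeps per-window state, and the watermark lets the engine drop that state once a window can no longer receive late events; the source and durations are illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    events = spark.readStream.format("rate").load()   # built-in test source: timestamp, value

    counts = (events
              .withWatermark("timestamp", "10 seconds")      # bound how long window state is kept
              .groupBy(F.window("timestamp", "1 minute"))    # per-window "context"
              .count())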

Slide 26

Slide 26 text

(Stateful) Streaming Transformations: Flink vs. Spark
"Stateless": well/fully supported in both Flink and Spark
Stateful (Temporary): Flink - fully supported; Spark - limited transformational support, limited state support
Stateful (Indefinite): Flink - fully supported; Spark - limited transformational support, limited state support

Slide 27

Slide 27 text

Complexity of Expressing a Stateful Application in Spark

Slide 28

Slide 28 text

Complexity of Expressing a Stateful Application in Spark ● Anti-patterns used as workarounds ● More about Spark's limitations
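
For a sense of what arbitrary stateful logic looks like in PySpark, a hedged sketch using applyInPandasWithState (available in Spark 3.4+); the key derivation, schemas, and running-sum logic are illustrative, not the workload discussed in the talk.

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.streaming.state import GroupStateTimeout

    spark = SparkSession.builder.getOrCreate()
    events = spark.readStream.format("rate").load()   # built-in test source: timestamp, value

    def running_sum(key, pdf_iter, state):
        # one running total per key, kept in group state between micro-batches
        total = state.get[0] if state.exists else 0
        for pdf in pdf_iter:
            total += int(pdf["value"].sum())
        state.update((total,))
        yield pd.DataFrame({"key": [key[0]], "total": [total]})

    result = (events
              .withColumn("key", events.value % 10)
              .groupBy("key")
              .applyInPandasWithState(
                  running_sum,
                  outputStructType="key LONG, total LONG",
                  stateStructType="total LONG",
                  outputMode="update",
                  timeoutConf=GroupStateTimeout.NoTimeout))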

Slide 29

Slide 29 text

Development Complexity ● Spark is better at being language agnostic. ● The more state is involved, the more Flink outperforms. ● More complicated applications are easier to express in Flink.

Slide 30

Slide 30 text

Example Use Case

Slide 31

Slide 31 text

Infrastructure Costs

Slide 32

Slide 32 text

Infrastructure Costs OR

Slide 33

Slide 33 text

Infrastructure Costs ● Flink excels at CDC use cases: ○ Built-in Debezium engine ○ Leads to lower latency
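
A hedged PyFlink sketch of a CDC source; it assumes the flink-connector-mysql-cdc jar (which embeds the Debezium engine) is on the classpath, and all connection details below are placeholders.

    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
    t_env.execute_sql("""
        CREATE TABLE orders_cdc (
            id BIGINT,
            amount DOUBLE,
            PRIMARY KEY (id) NOT ENFORCED
        ) WITH (
            'connector' = 'mysql-cdc',               -- Debezium-based CDC connector
            'hostname' = 'mysql.example.internal',   -- placeholder
            'port' = '3306',
            'username' = 'flink',
            'password' = '******',
            'database-name' = 'shop',
            'table-name' = 'orders'
        )
    """)
    # downstream queries consume the changelog (inserts, updates, deletes) directly
    changes = t_env.sql_query("SELECT id, amount FROM orders_cdc")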

Slide 34

Slide 34 text

Operational Challenges ● Observability/debuggability ● Failure handling ● Performance tuning

Slide 35

Slide 35 text

Tools Provided to Handle Operational Tasks (Flink and Spark)
Observability / Debuggability: ● Kubernetes/task-specific metrics ● Parallelism tuning ● Checkpoint-duration / batch-size tuning
Failure Handling: ● Restores from the latest checkpoint (Flink) or micro-batch (Spark)

Slide 36

Slide 36 text

Performance Tuning
Backpressure Handling: Flink - automatic backpressure handling; Spark - N/A, manually configured
Auto-scaling: Flink - operator-level auto-scaling; Spark - TaskManager-level auto-scaling
Memory Auto-tuning: Flink - supported; Spark - N/A, not supported

Slide 37

Slide 37 text

Operational Overhead ● Very similar challenges and tasks. ● Flink has finer-grained auto-tuning.

Slide 38

Slide 38 text

Wrap Up
Development Complexity: Flink - more advanced state management capabilities; additional lower-level streaming APIs (DataStream / ProcessFunction)
Infrastructure Cost: Flink - native CDC connectors
Operational Overhead: Flink - advanced auto-control for resource allocation; automatic backpressure handling

Slide 39

Slide 39 text

Additional Information ● Blog Post: Comparing Apache Flink and Spark for Modern Stream Data Processing.

Slide 40

Slide 40 text

Thank You Q&A Sharon Xie Eric Xiao

Slide 41

Slide 41 text

How to Benchmark? Use real-world examples, or your actual workload. Source: https://www.flink-forward.org/global-2021/conference-program#sources--sinks--and-operators--a-performance-deep-dive More: https://github.com/dttung2905/flink-at-scale/