Sharon Xie
October 25, 2024

Comparing Apache Flink and Spark for Modern Stream Data Processing

These are the slides for a talk given at Flink Forward 2024 (https://www.flink-forward.org/berlin-2024/agenda#comparing-apache-flink-and-spark-for-modern-stream-data-processing).

Real-time data processing is essential for staying competitive in today’s fast-paced business environment, and choosing the right tool is a key decision. Apache Flink and Spark Structured Streaming are two leading stream processing frameworks, each with unique strengths and trade-offs.

This talk takes a look at our journey at Decodable, where we evaluated both tools and ultimately chose Apache Flink over Spark Structured Streaming for our stream data processing needs. By examining key differences between the two systems, we aim to provide a clear, technical comparison that will help you make informed decisions for your streaming data use cases.

Join us for this talk where we will discuss:
- Design philosophies: Learn about the origins of both systems and some of the fundamental architectural design choices that make Flink more attractive for streaming use cases.
- (Stateful) streaming capabilities: We will dive into and compare similar features that Spark and Flink offer across their APIs, and share some features available only in Flink that make it a much richer streaming library. We will also cover some of the data ecosystem tools and connectors that Flink supports natively, such as Debezium.
- Production readiness: We will cover some of the recent Flink features that make running Flink at scale easier, such as the Kubernetes operator and its sophisticated auto-scaler.


Transcript

  1. What are we going to learn?
     01. Architecture Design
     02. Decodable's Evaluation Framework
     03. Detailed Comparison: Spark Streaming vs Flink Streaming Mode
  2. Your Presenters
     • Sharon Xie: Head of Product and Founding Engineer @ Decodable; 7 years of real-time data platform building.
     • Eric Xiao: Senior Software Engineer, Data Platform @ Decodable; previously on the Streaming Platform teams @ Shopify and Walmart.
  3. Can batch size = 1 for Spark Streaming? No.
     - Records are batched via a trigger interval.
     - Job-scheduling overhead >> data processing.
     Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html
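For context, here is a minimal sketch (not from the deck) of how Spark Structured Streaming schedules micro-batches with a processing-time trigger; the local master, rate source, and one-second interval are illustrative choices:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class MicroBatchTrigger {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("micro-batch-trigger")
                .master("local[2]") // local run for illustration
                .getOrCreate();

        // Built-in "rate" source that continuously generates test rows.
        Dataset<Row> events = spark.readStream().format("rate").load();

        // Spark groups incoming records into micro-batches; the trigger sets how
        // often a batch is scheduled, so the effective batch size depends on the
        // interval and the input rate rather than being set per record.
        events.writeStream()
                .format("console")
                .trigger(Trigger.ProcessingTime("1 second"))
                .start()
                .awaitTermination();
    }
}
```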
  4. Can batch size be > 1 for Flink? Yes. Some operators (aggregation / join) can enable mini-batches.
     Source: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tuning/#minibatch-regular-joins
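And a minimal sketch (not from the deck) of enabling mini-batching in Flink's Table/SQL API, per the linked tuning docs; the 500 ms latency bound and 1000-record cap are illustrative values, not recommendations from the talk:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MiniBatchConfig {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Buffer input records briefly and process them together, reducing
        // per-record state accesses for aggregations and regular joins.
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.mini-batch.enabled", "true");
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.mini-batch.allow-latency", "500 ms");
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.mini-batch.size", "1000");
    }
}
```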
  5. How did we approach the benchmark?
     • Context matters: latency vs throughput.
     • Design reveals potential: "can it scale?" > static numbers.
  6. API and Language Support
     • Language support: Flink offers Java and Python; Spark offers Java/Scala, Python, and R.
     • Higher-level APIs: Flink has SQL and the Table API; Spark has SQL and DataFrames.
     • Lower-level APIs: Flink has DataStream and ProcessFunction; Spark has no equivalent (N/A).
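To make the higher- vs lower-level split concrete on the Flink side, here is a small sketch (not from the deck) that mixes both levels in one job; the number sequence, the even-number filter, and the view name are arbitrary placeholders:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class ApiLevels {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Lower-level DataStream API: explicit, record-at-a-time control.
        DataStream<Long> evens = env.fromSequence(1, 100).filter(n -> n % 2 == 0);

        // Higher-level Table API / SQL over the same stream.
        Table table = tEnv.fromDataStream(evens);
        tEnv.createTemporaryView("evens", table);
        tEnv.executeSql("SELECT COUNT(*) AS even_count FROM evens").print();
    }
}
```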
  7. (Stateful) Streaming Transformations: Characteristics (in order of increasing state)
     • "Stateless": simple transformations such as filters and projections; requires no other "context" (other streams, other events).
     • Stateful (Temporary): aggregations, "timely" joins, windows; requires "context", but it can be dropped after some time.
     • Stateful (Indefinite): Top N, joining with a "static" dataset; requires "context" to be stored forever.
  8. (Stateful) Streaming Transformations: Flink vs Spark
     • "Stateless": well/fully supported by both.
     • Stateful (Temporary) and Stateful (Indefinite): fully supported in Flink; in Spark, limited transformational support and limited state support.
  9. Complexity of Expressing a Stateful Application in Spark
     • Anti-patterns used as workarounds.
     • More about Spark's limitations.
 10. Development Complexity
     • Spark is better at being language agnostic.
     • Flink outperforms when more state is involved.
     • More complicated applications are easier to express in Flink.
 11. Infrastructure Costs
     • Flink excels at CDC use cases:
       ◦ Built-in Debezium engine.
       ◦ Leads to lower latency.
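A minimal sketch (not from the deck) of what a Debezium-backed CDC source looks like in Flink SQL, using the MySQL connector from the Flink CDC project (requires flink-connector-mysql-cdc on the classpath); the hostname, credentials, and table schema are placeholders:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MySqlCdcSource {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // The connector embeds the Debezium engine and reads the MySQL binlog
        // directly, so changes arrive without an intermediate Kafka hop.
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  order_id BIGINT," +
                "  amount DECIMAL(10, 2)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = 'mysql.example.com'," +
                "  'port' = '3306'," +
                "  'username' = 'flink'," +
                "  'password' = '***'," +
                "  'database-name' = 'shop'," +
                "  'table-name' = 'orders'" +
                ")");

        // Downstream SQL sees inserts, updates, and deletes as a changelog stream.
        tEnv.executeSql("SELECT order_id, amount FROM orders").print();
    }
}
```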
 12. Tools Provided to Handle Operational Tasks (Flink / Spark)
     • Observability / debuggability: Kubernetes- and task-specific metrics; parallelism tuning; checkpoint-duration / batch-size tuning.
     • Failure handling: restores from the latest checkpoint or micro-batch.
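On the Flink side of failure handling, a minimal sketch (not from the deck) of enabling periodic checkpoints so a restarted job resumes from the latest completed checkpoint; the 60-second interval and the sequence source are arbitrary:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all operator state every 60 seconds with exactly-once semantics;
        // after a failure, the job restores from the latest completed checkpoint.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Placeholder pipeline so the job has something to run.
        env.fromSequence(1, 1_000_000).print();

        env.execute("checkpointed-job");
    }
}
```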
 13. Performance Tuning
     • Backpressure handling: automatic in Flink; N/A in Spark (manually configured).
     • Auto-scaling: operator-level auto-scaling in Flink; TaskManager-level auto-scaling in Spark.
     • Memory auto-tuning: supported in Flink; N/A in Spark (not supported).
 14. Wrap Up (Flink advantages by category)
     • Development complexity: more advanced state management capabilities; an additional lower-level streaming API (DataStream / ProcessFunction).
     • Infrastructure cost: native CDC connectors.
     • Operational overhead: advanced automatic control of resource allocation; automatic backpressure handling.
 15. How to Benchmark? Use real-world examples, or your actual workload.
     Source: https://www.flink-forward.org/global-2021/conference-program#sources--sinks--and-operators--a-performance-deep-dive
     More: https://github.com/dttung2905/flink-at-scale/