Sharon Xie
October 25, 2024

Comparing Apache Flink and Spark for Modern Stream Data Processing

These are the slides for a talk given at Flink Forward 2024 (https://www.flink-forward.org/berlin-2024/agenda#comparing-apache-flink-and-spark-for-modern-stream-data-processing).

Real-time data processing is essential for staying competitive in today’s fast-paced business environment, and choosing the right tool is a key decision. Apache Flink and Spark Structured Streaming are two leading stream processing frameworks, each with unique strengths and trade-offs.

This talk takes a look at our journey at Decodable, where we evaluated both tools and ultimately chose Apache Flink over Spark Structured Streaming for our stream data processing needs. By examining key differences between the two systems, we aim to provide a clear, technical comparison that will help you make informed decisions for your streaming data use cases.

Join us for this talk where we will discuss:
- Design philosophies: Learn about the origins of both systems and some of the fundamental architectural design choices that make Flink more attractive for streaming use cases.
- (Stateful) streaming capabilities: We will dive into and compare similar features that Spark and Flink offer across their APIs, and share some features available only in Flink that make it a much richer streaming library. We will also cover some of the data ecosystem tools and connectors that Flink supports natively, such as Debezium.
- Production readiness: We will cover some of the recent Flink features that make running Flink at scale easier, such as the Kubernetes operator and its sophisticated auto-scaler.


Transcript

  1. What are we going to learn?
     01. Architecture Design
     02. Decodable's Evaluation Framework
     03. Detailed Comparison: Spark Streaming vs Flink Streaming Mode
  2. Your Presenters
     • Sharon Xie: Head of Product and Founding Engineer @ Decodable; 7 years of real-time data platform building.
     • Eric Xiao: Senior Software Engineer, Data Platform @ Decodable; previously on the Streaming Platform teams @ Shopify and Walmart.
  3. Can batch size = 1 for Spark Streaming? No.
     - Records are batched via a trigger interval.
     - Job-scheduling overhead >> data processing.
     Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html
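For context, here is a minimal sketch (not from the deck) of how Spark Structured Streaming schedules micro-batches with a processing-time trigger; the local master, rate source, and one-second interval are illustrative choices:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class MicroBatchTrigger {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("micro-batch-trigger")
                .master("local[2]") // local run for illustration
                .getOrCreate();

        // Built-in "rate" source that continuously generates test rows.
        Dataset<Row> events = spark.readStream().format("rate").load();

        // Spark groups incoming records into micro-batches; the trigger sets how
        // often a batch is scheduled, so the effective batch size depends on the
        // interval and the input rate rather than being set per record.
        events.writeStream()
                .format("console")
                .trigger(Trigger.ProcessingTime("1 second"))
                .start()
                .awaitTermination();
    }
}
```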
  4. Can batch size be > 1 for Flink? Yes. Some operators (aggregation / join) can enable mini-batches.
     Source: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tuning/#minibatch-regular-joins
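And a minimal sketch (not from the deck) of enabling mini-batching in Flink's Table/SQL API, per the linked tuning docs; the 500 ms latency bound and 1000-record cap are illustrative values, not recommendations from the talk:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MiniBatchConfig {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Buffer input records briefly and process them together, reducing
        // per-record state accesses for aggregations and regular joins.
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.mini-batch.enabled", "true");
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.mini-batch.allow-latency", "500 ms");
        tEnv.getConfig().getConfiguration()
                .setString("table.exec.mini-batch.size", "1000");
    }
}
```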
  5. How did we approach the benchmark?
     • Context matters: latency vs throughput.
     • Design reveals potential: "can it scale?" > static numbers.
  6. API and Language Support
     • Language support: Flink offers Java and Python; Spark offers Java/Scala, Python, and R.
     • Higher-level APIs: Flink has SQL and the Table API; Spark has SQL and DataFrames.
     • Lower-level APIs: Flink has DataStream and ProcessFunction; Spark has no equivalent (N/A).
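To make the higher- vs lower-level split concrete on the Flink side, here is a small sketch (not from the deck) that mixes both levels in one job; the number sequence, the even-number filter, and the view name are arbitrary placeholders:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class ApiLevels {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Lower-level DataStream API: explicit, record-at-a-time control.
        DataStream<Long> evens = env.fromSequence(1, 100).filter(n -> n % 2 == 0);

        // Higher-level Table API / SQL over the same stream.
        Table table = tEnv.fromDataStream(evens);
        tEnv.createTemporaryView("evens", table);
        tEnv.executeSql("SELECT COUNT(*) AS even_count FROM evens").print();
    }
}
```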
  7. (Stateful) Streaming Transformations: Characteristics (in order of increasing state)
     • "Stateless": simple transformations such as filters and projections; requires no other "context" (other streams, other events).
     • Stateful (Temporary): aggregations, "timely" joins, windows; requires "context", but it can be dropped after some time.
     • Stateful (Indefinite): Top N, joining with a "static" dataset; requires "context" to be stored forever.
  8. (Stateful) Streaming Transformations: Flink vs Spark
     • "Stateless": well/fully supported by both.
     • Stateful (Temporary) and Stateful (Indefinite): fully supported in Flink; in Spark, limited transformational support and limited state support.
  9. Complexity of Expressing a Stateful Application in Spark
     • Anti-patterns used as workarounds.
     • More about Spark's limitations.
 10. Development Complexity
     • Spark is better at being language agnostic.
     • Flink outperforms when more state is involved.
     • More complicated applications are easier to express in Flink.
 11. Infrastructure Costs
     • Flink excels at CDC use cases:
       ◦ Built-in Debezium engine.
       ◦ Leads to lower latency.
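A minimal sketch (not from the deck) of what a Debezium-backed CDC source looks like in Flink SQL, using the MySQL connector from the Flink CDC project (requires flink-connector-mysql-cdc on the classpath); the hostname, credentials, and table schema are placeholders:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MySqlCdcSource {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // The connector embeds the Debezium engine and reads the MySQL binlog
        // directly, so changes arrive without an intermediate Kafka hop.
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  order_id BIGINT," +
                "  amount DECIMAL(10, 2)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = 'mysql.example.com'," +
                "  'port' = '3306'," +
                "  'username' = 'flink'," +
                "  'password' = '***'," +
                "  'database-name' = 'shop'," +
                "  'table-name' = 'orders'" +
                ")");

        // Downstream SQL sees inserts, updates, and deletes as a changelog stream.
        tEnv.executeSql("SELECT order_id, amount FROM orders").print();
    }
}
```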
 12. Tools Provided to Handle Operational Tasks (Flink / Spark)
     • Observability / debuggability: Kubernetes- and task-specific metrics; parallelism tuning; checkpoint-duration / batch-size tuning.
     • Failure handling: restores from the latest checkpoint or micro-batch.
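On the Flink side of failure handling, a minimal sketch (not from the deck) of enabling periodic checkpoints so a restarted job resumes from the latest completed checkpoint; the 60-second interval and the sequence source are arbitrary:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all operator state every 60 seconds with exactly-once semantics;
        // after a failure, the job restores from the latest completed checkpoint.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Placeholder pipeline so the job has something to run.
        env.fromSequence(1, 1_000_000).print();

        env.execute("checkpointed-job");
    }
}
```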
 13. Performance Tuning
     • Backpressure handling: automatic in Flink; N/A in Spark (manually configured).
     • Auto-scaling: operator-level auto-scaling in Flink; TaskManager-level auto-scaling in Spark.
     • Memory auto-tuning: supported in Flink; N/A in Spark (not supported).
 14. Wrap Up (Flink advantages by category)
     • Development complexity: more advanced state management capabilities; an additional lower-level streaming API (DataStream / ProcessFunction).
     • Infrastructure cost: native CDC connectors.
     • Operational overhead: advanced automatic control of resource allocation; automatic backpressure handling.
 15. How to Benchmark? Use real-world examples, or your actual workload.
     Source: https://www.flink-forward.org/global-2021/conference-program#sources--sinks--and-operators--a-performance-deep-dive
     More: https://github.com/dttung2905/flink-at-scale/