Kafka Streams vs. Spark Structured Streaming (extended)

Kafka Streams vs. Spark Structured Streaming (extended ver.) Lee Dongjin
- [email protected]

Introduction • So many Streaming Frameworks / Libraries ◦ RxJava,
Spring Reactor, AKKA streams, Flink, Samza, Storm, … ◦ What to use?! • Spark Structured Streaming vs. Kafka Streams ◦ Advantages, Disadvantages, and Trade-offs. ◦ When to use, or not to use.

Spark Structured Streaming: Overview (1) • Stream processing engine based
on Spark SQL (2.0) API: RDD → Dataframe Execution: Batch → Streaming (microbatch, continuous) Spark Core (1.x) Spark SQL (2.x) Spark Streaming Spark Structured Streaming

Spark Structured Streaming: Overview (2) • Describes the processing logic
with Spark SQL Operations ◦ Easy to learn: almost identical to normal Batch SQL ▪ Except: Source, Sink, Trigger, Output Mode, Watermark, etc ... ◦ Provides various data sources and functions ◦ Optimized by Catalyst Optimizer

Spark Structured Streaming: WordCount (1) // Create DataFrame representing the
stream of input lines // from connection to localhost:9999 val lines = spark.readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load() // Split the lines into words & generate running word count val words = lines.as[String].flatMap(_.split(" ")) val wordCounts = words.groupBy("value").count()

Spark Structured Streaming: WordCount (2) // Start running the query
that prints // the running counts to the console val query = wordCounts.writeStream .outputMode("complete") .format("console") .start() query.awaitTermination()

Kafka Streams: Overview (1) • A stream processing library that
directly integrates with Kafka ◦ from 0.10.0. ◦ Doesn't need a special runtime like YARN: runs in a normal java process. ◦ Masterless: runs on top of Kafka’s consumer group.

Kafka Streams: Overview (2) • Describes the processing logic as
a graph of processors ◦ ‘Processing Topology’ ◦ Source, Sink: Subclass of Processor • With… ◦ High level DSL (a.k.a. KStream API) ▪ Recommended ◦ Low level API

Kafka Streams: WordCount (1) // Build Topology with StreamsBuilder final
StreamsBuilder builder = new StreamsBuilder(); // KStream: unbounded series of records final KStream<String, String> source = builder.stream(inputTopic); // Transform input records into stream of words with `flatMapValues` method final KStream<String, String> tokenized = source .flatMapValues(value -> Arrays.asList( value.toLowerCase(Locale.getDefault()).split(" ")) );

Kafka Streams: WordCount (2) // KTable: Stateful abstraction of aggregated
stream // Build KTable from KStream by group and aggregate operations final KTable<String, Long> counts = tokenized .groupBy((key, value) -> value) .count(); // Convert KTable to KStream KStream<String, Long> changeLog = counts.toStream(); // Write back to output kafka topic changeLog.to(outputTopic, Produced.with(Serdes.String(), Serdes.Long())); // Build Topology instance return builder.build();

Kafka Streams: WordCount (3) Properties props = ... // Configuration
properties Topology topology = ... // Topology object final KafkaStreams streams = new KafkaStreams(topology, props); /* Omit some boilerplate codes... */ // Start the Kafka Streams application streams.start();

KStream, KTable, ... KStream KTable KGroupedTable KGroupedStream SessionWindowedStream TimeWindowedStream group
windowedBy aggregate aggregate group toStream

Kafka Streams: Low Level API • You can define Processor
classes manually ◦ Example: WordCountProcessor ◦ In fact, DSL builds Processor instances internally. • Achieves efficiency & fault-tolerance with… ◦ Intermediate Topic ◦ StateStore

Kafka Streams: StateStore • Local, in-memory Key-Value store ◦ Implemented
with RocksDB - Fast! ◦ Backed by changelog topic - Easy to restore! • Under the hood ◦ ex) KTable

How Spark Structured Streaming works (1) • Dataframe: Container of
QueryExecution ◦ Transformation method: Returns new Dataframe object with updated LogicalPlan. ▪ map, filter, select, ... ◦ Action method: trigger computation and return results. ▪ count, show, ... Dataframe QueryExecution LogicalPlan Provides API Handles the primary workflow for executing LogicalPlan Describes logical operation

How Spark Structured Streaming works (2) • When action method
is called, the LogicalPlan is translated into RDD operations ◦ And finally, Tasks. (Unresolved) LogicalPlan (Resolved) LogicalPlan (Optimized) LogicalPlan SparkPlan Resolve variables w/ Catalog Logical Optimization (ex. Predicate Pushdown) Convert into RDD Operations * RDD Operations are divided into Tasks and run by Executors.

How Spark Structured Streaming works (3) • Then, What happens
with Streaming? ◦ StreamExecution ▪ (Almost) Resolved LogicalPlan ▪ Trigger ▪ Output Mode ▪ Output Sink StreamExecution LogicalPlan

How Spark Structured Streaming works (4) • Driver triggers StreamExecution
periodically. ◦ Driver checks newly arrived records. (ex. checks the latest offset in Kafka topic.) ◦ Clones LogicalPlan, fill with arrived records, and run in normal workflow. ▪ In other words, all workflow is identical with Batch computations from then on. ◦ For each Task, Executor requests the records of given offset range to Kafka brokers.

How Spark Structured Streaming works (5) val lines = spark.readStream
... .load() ... val query = wordCounts.writeStream .outputMode("complete") .format("console") .start() query.awaitTermination() (Create Dataframe, along with contained LogicalPlan.) (Resolve LogicalPlan, Create StreamExecution instance, and have StreamingQueryManager to start it.)

How Kafka Streams works (1) • Built on top of
Kafka’s consumer group feature ◦ Automatically divides the records into disjoint sets. • StreamTask ◦ Created per input partition. ◦ ex) Streams Topology with input topic A (2 partitions) and B (3 partitions): 3 StreamTasks! ◦ Run by thread pool.

How Kafka Streams works (2) • Run with 1 machine,
1 process with 2 threads

How Kafka Streams works (3) • Run with 2 machines,
2 processes with 2 threads each

Comparison Kafka Streams Spark Structured Streaming Deployment Standalone Java Application
Spark Executor (mostly, YARN cluster) Streaming Source Kafka only Kafka, File System, Kinesis, ... Fault-Tolerance StateStore, backed by changelog RDD cache Syntax Low level Processor API / High Level DSL Spark SQL Semantics Simple Rich (w/ query optimization)

Conclusion • Spark Structured Streaming for rich semantics ◦ ETL
Tasks. ◦ ex) Join records with RDBMS, run ML pipeline, etc., ... • Kafka Streams for lightweight manipulation of Kafka topics ◦ Preprocess Kafka topics. ◦ Microservice run with Kafka topics. ◦ Event-based prediction. (e.g., Kafka Streams w/ tensorflow)

Questions? • Slides ◦ https://speakerdeck.com/dongjin • 한국 스파크 사용자 모임
◦ https://www.facebook.com/groups/sparkkoreauser/ • Kafka 한국 사용자 모임 ◦ https://www.facebook.com/groups/kafkakorea/

Kafka Streams vs. Spark Structured Streaming (e...

Kafka Streams vs. Spark Structured Streaming (extended)

Lee Dongjin

More Decks by Lee Dongjin

Other Decks in Technology

Featured

Transcript

Kafka Streams vs. Spark Structured Streaming (extended ver.) Lee Dongjin

Introduction • So many Streaming Frameworks / Libraries ◦ RxJava,

Spark Structured Streaming: Overview (1) • Stream processing engine based

Spark Structured Streaming: Overview (2) • Describes the processing logic

Spark Structured Streaming: WordCount (1) // Create DataFrame representing the

Spark Structured Streaming: WordCount (2) // Start running the query

Kafka Streams: Overview (1) • A stream processing library that

Kafka Streams: Overview (2) • Describes the processing logic as

Kafka Streams: WordCount (1) // Build Topology with StreamsBuilder final

Kafka Streams: WordCount (2) // KTable: Stateful abstraction of aggregated

Kafka Streams: WordCount (3) Properties props = ... // Configuration

KStream, KTable, ... KStream KTable KGroupedTable KGroupedStream SessionWindowedStream TimeWindowedStream group

Kafka Streams: Low Level API • You can define Processor

Kafka Streams: StateStore • Local, in-memory Key-Value store ◦ Implemented

How Spark Structured Streaming works (1) • Dataframe: Container of

How Spark Structured Streaming works (2) • When action method

How Spark Structured Streaming works (3) • Then, What happens

How Spark Structured Streaming works (4) • Driver triggers StreamExecution

How Spark Structured Streaming works (5) val lines = spark.readStream

How Kafka Streams works (1) • Built on top of

How Kafka Streams works (2) • Run with 1 machine,

How Kafka Streams works (3) • Run with 2 machines,

Comparison Kafka Streams Spark Structured Streaming Deployment Standalone Java Application

Conclusion • Spark Structured Streaming for rich semantics ◦ ETL

Questions? • Slides ◦ https://speakerdeck.com/dongjin • 한국 스파크 사용자 모임