Slide 1

Kafka Streams vs. Spark Structured Streaming (extended ver.)
Lee Dongjin - dongjin@apache.org

Slide 2

Introduction
● So many streaming frameworks / libraries
  ○ RxJava, Spring Reactor, Akka Streams, Flink, Samza, Storm, …
  ○ What to use?!
● Spark Structured Streaming vs. Kafka Streams
  ○ Advantages, disadvantages, and trade-offs.
  ○ When to use each, and when not to.

Slide 3

Spark Structured Streaming: Overview (1)
● Stream processing engine based on Spark SQL (since Spark 2.0)
  ○ API: RDD → DataFrame
  ○ Execution: batch → streaming (micro-batch or continuous)
(Diagram: Spark Core (1.x) → Spark SQL (2.x); Spark Streaming → Spark Structured Streaming)

Slide 4

Spark Structured Streaming: Overview (2)
● Describes the processing logic with Spark SQL operations
  ○ Easy to learn: almost identical to normal batch SQL
    ■ Except: Source, Sink, Trigger, Output Mode, Watermark, etc.
  ○ Provides various data sources and functions
  ○ Optimized by the Catalyst optimizer

Slide 5

Spark Structured Streaming: WordCount (1)

// Create a DataFrame representing the stream of input lines
// from a connection to localhost:9999
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words & generate a running word count
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()

Slide 6

Spark Structured Streaming: WordCount (2)

// Start running the query that prints
// the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()

Slide 7

Kafka Streams: Overview (1)
● A stream processing library that directly integrates with Kafka
  ○ Available since Kafka 0.10.0.
  ○ Needs no special runtime like YARN: runs in a normal Java process.
  ○ Masterless: runs on top of Kafka's consumer groups.

Slide 8

Kafka Streams: Overview (2)
● Describes the processing logic as a graph of processors
  ○ The 'processing topology'
  ○ Source, Sink: subclasses of Processor
● With…
  ○ High-level DSL (a.k.a. the KStream API)
    ■ Recommended
  ○ Low-level API

Slide 9

Kafka Streams: WordCount (1)

// Build the Topology with StreamsBuilder
final StreamsBuilder builder = new StreamsBuilder();

// KStream: an unbounded series of records
final KStream<String, String> source = builder.stream(inputTopic);

// Transform the input records into a stream of words with the `flatMapValues` method
final KStream<String, String> tokenized = source
    .flatMapValues(value -> Arrays.asList(
        value.toLowerCase(Locale.getDefault()).split(" ")));

Slide 10

Kafka Streams: WordCount (2)

// KTable: a stateful abstraction of an aggregated stream
// Build a KTable from the KStream with group and aggregate operations
final KTable<String, Long> counts = tokenized
    .groupBy((key, value) -> value)
    .count();

// Convert the KTable to a KStream
final KStream<String, Long> changeLog = counts.toStream();

// Write back to the output Kafka topic
changeLog.to(outputTopic, Produced.with(Serdes.String(), Serdes.Long()));

// Build the Topology instance
return builder.build();

Slide 11

Kafka Streams: WordCount (3)

Properties props = ... // Configuration properties
Topology topology = ... // Topology object
final KafkaStreams streams = new KafkaStreams(topology, props);

/* Omit some boilerplate code... */

// Start the Kafka Streams application
streams.start();

Slide 12

KStream, KTable, ...
(Diagram of the DSL type transitions:
  KStream → group → KGroupedStream
  KGroupedStream → windowedBy → TimeWindowedStream / SessionWindowedStream
  KGroupedStream / windowed streams → aggregate → KTable
  KTable → group → KGroupedTable → aggregate → KTable
  KTable → toStream → KStream)
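
To make the diagram concrete, here is a minimal Java sketch (the "clicks" input topic is hypothetical) that walks one path through it: group, then windowedBy, then aggregate, then back to a KStream. The code uses the actual DSL class name TimeWindowedKStream rather than the diagram's shorthand.

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

final StreamsBuilder builder = new StreamsBuilder();

// KStream → group → KGroupedStream
final KStream<String, String> clicks = builder.stream("clicks");
final KGroupedStream<String, String> grouped = clicks.groupByKey();

// KGroupedStream → windowedBy → TimeWindowedKStream
// (SessionWindows would yield a SessionWindowedKStream instead)
final TimeWindowedKStream<String, String> windowed =
    grouped.windowedBy(TimeWindows.of(Duration.ofMinutes(5)));

// → aggregate (here: count) → KTable, → toStream → back to a KStream
final KTable<Windowed<String>, Long> counts = windowed.count();
final KStream<Windowed<String>, Long> changeLog = counts.toStream();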

Slide 13

Kafka Streams: Low Level API
● You can define Processor classes manually
  ○ Example: WordCountProcessor (sketched below)
  ○ In fact, the DSL builds Processor instances internally.
● Achieves efficiency & fault-tolerance with…
  ○ Intermediate topics
  ○ StateStore
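
A minimal sketch of such a Processor, using the classic (pre-2.7) Processor interface. The store name "counts" is illustrative, and the store must be attached to the Topology separately (Topology#addStateStore):

import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class WordCountProcessor implements Processor<String, String> {

    private KeyValueStore<String, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        // Fetch the StateStore registered under the name "counts".
        store = (KeyValueStore<String, Long>) context.getStateStore("counts");
    }

    @Override
    public void process(final String key, final String line) {
        // Tokenize the record and update the running count of each word.
        for (final String word : line.toLowerCase().split(" ")) {
            final Long count = store.get(word);
            store.put(word, count == null ? 1L : count + 1);
        }
    }

    @Override
    public void close() {}
}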

Slide 14

Kafka Streams: StateStore
● Local key-value store (in-memory or RocksDB-backed)
  ○ Implemented with RocksDB by default: fast!
  ○ Backed by a changelog topic: easy to restore!
● Used under the hood
  ○ e.g., by KTable
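
For example, a hypothetical sketch reusing tokenized and streams from the WordCount example (and the classic pre-2.5 interactive-query API): Materialized names the StateStore behind a KTable, and the store can then be read directly.

import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// Name the store backing this KTable; Kafka Streams also maintains the
// "<application.id>-word-counts-changelog" topic to restore it from.
final KTable<String, Long> counts = tokenized
    .groupBy((key, value) -> value)
    .count(Materialized.as("word-counts"));

// After streams.start(): query the local store (interactive queries).
final ReadOnlyKeyValueStore<String, Long> view =
    streams.store("word-counts", QueryableStoreTypes.keyValueStore());
final Long occurrences = view.get("kafka");  // current count for "kafka"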

Slide 15

How Spark Structured Streaming works (1)
● DataFrame: a container of a QueryExecution
  ○ Transformation methods: return a new DataFrame object with an updated LogicalPlan.
    ■ map, filter, select, ...
  ○ Action methods: trigger computation and return results.
    ■ count, show, ...
(Diagram: DataFrame provides the API; QueryExecution handles the primary workflow for executing the LogicalPlan; LogicalPlan describes the logical operation.)
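
A minimal Java sketch of this laziness (the app name is illustrative): transformations only stack up LogicalPlan nodes; the action at the end triggers the actual work.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

final SparkSession spark = SparkSession.builder()
    .appName("plan-demo").master("local[*]").getOrCreate();

// Transformations: each call only wraps the current LogicalPlan in a new node.
final Dataset<Row> ids = spark.range(1000).toDF("id");  // Range
final Dataset<Row> even = ids.filter("id % 2 = 0");     // Filter(Range)
final Dataset<Row> projected = even.select("id");       // Project(Filter(Range))
// Nothing has executed so far.

// Action: pushes the LogicalPlan through analysis, optimization and execution.
final long n = projected.count();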

Slide 16

How Spark Structured Streaming works (2)
● When an action method is called, the LogicalPlan is translated into RDD operations
  ○ And finally, into Tasks.
(Diagram: (Unresolved) LogicalPlan → resolve variables w/ the Catalog → (Resolved) LogicalPlan → logical optimization (ex. predicate pushdown) → (Optimized) LogicalPlan → convert into RDD operations → SparkPlan)
* RDD operations are divided into Tasks and run by Executors.
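
Continuing the sketch above: Dataset#explain(true) prints exactly these stages, which is a handy way to watch the pipeline at work.

// Print every plan stage for the hypothetical `projected` Dataset above.
projected.explain(true);
// == Parsed Logical Plan ==     (unresolved)
// == Analyzed Logical Plan ==   (resolved against the Catalog)
// == Optimized Logical Plan ==  (after Catalyst rules, e.g. predicate pushdown)
// == Physical Plan ==           (the SparkPlan, executed as RDD operations / Tasks)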

Slide 17

How Spark Structured Streaming works (3)
● Then, what happens with streaming?
  ○ StreamExecution holds:
    ■ An (almost) resolved LogicalPlan
    ■ A Trigger
    ■ An Output Mode
    ■ An Output Sink
(Diagram: StreamExecution wrapping the LogicalPlan.)
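
These four ingredients are all supplied through the stream writer. A hedged Java sketch, assuming a wordCounts Dataset like the one in the WordCount example:

import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

final StreamingQuery query = wordCounts.writeStream()
    .outputMode("complete")                         // Output Mode
    .format("console")                              // Output Sink
    .trigger(Trigger.ProcessingTime("10 seconds"))  // Trigger: micro-batch every 10s
    .start();                                       // builds & starts the StreamExecution
// Trigger.Continuous("1 second") would select the continuous engine instead.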

Slide 18

How Spark Structured Streaming works (4)
● The Driver triggers the StreamExecution periodically.
  ○ The Driver checks for newly arrived records. (e.g., checks the latest offsets in a Kafka topic.)
  ○ It clones the LogicalPlan, fills it with the arrived records, and runs it through the normal workflow.
    ■ In other words, from then on the whole workflow is identical to a batch computation.
  ○ For each Task, an Executor requests the records of the given offset range from the Kafka brokers.
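
For instance, with the built-in Kafka source (topic name hypothetical, spark reused from the earlier sketch), the Driver polls the brokers for the latest offsets at every trigger, and each Task then fetches its assigned offset range:

// Hypothetical Kafka source for a streaming DataFrame.
final Dataset<Row> records = spark.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "input-topic")
    .load();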

Slide 19

How Spark Structured Streaming works (5)

// Creates the DataFrame, along with the contained LogicalPlan.
val lines = spark.readStream
  ...
  .load()

...

// Resolves the LogicalPlan, creates a StreamExecution instance,
// and has the StreamingQueryManager start it.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()

Slide 20

How Kafka Streams works (1)
● Built on top of Kafka's consumer group feature
  ○ Automatically divides the records into disjoint sets.
● StreamTask
  ○ Created per input partition.
  ○ e.g., a Streams topology with input topics A (2 partitions) and B (3 partitions): 3 StreamTasks!
  ○ Run by a thread pool.
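
The task count is fixed by the input partitions; what you configure is the thread pool that runs them. A minimal sketch (application id and broker address hypothetical):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

final Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// The 3 StreamTasks above would be spread over these 2 threads.
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);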

Slide 21

How Kafka Streams works (2)
● Running with 1 machine: 1 process with 2 threads

Slide 22

How Kafka Streams works (3)
● Running with 2 machines: 2 processes with 2 threads each

Slide 23

Comparison

                  | Kafka Streams                             | Spark Structured Streaming
Deployment        | Standalone Java application               | Spark Executors (mostly on a YARN cluster)
Streaming source  | Kafka only                                | Kafka, file systems, Kinesis, ...
Fault-tolerance   | StateStore, backed by a changelog topic   | Checkpointing of offsets & state
Syntax            | Low-level Processor API / high-level DSL  | Spark SQL
Semantics         | Simple                                    | Rich (w/ query optimization)

Slide 24

Conclusion
● Spark Structured Streaming for rich semantics
  ○ ETL tasks.
  ○ e.g., joining records with an RDBMS, running an ML pipeline, etc.
● Kafka Streams for lightweight manipulation of Kafka topics
  ○ Preprocessing Kafka topics.
  ○ Microservices running on Kafka topics.
  ○ Event-based prediction. (e.g., Kafka Streams w/ TensorFlow)

Slide 25

Questions?
● Slides
  ○ https://speakerdeck.com/dongjin
● Korea Spark User Group (한국 스파크 사용자 모임)
  ○ https://www.facebook.com/groups/sparkkoreauser/
● Korea Kafka User Group (Kafka 한국 사용자 모임)
  ○ https://www.facebook.com/groups/kafkakorea/