Kafka Streams vs. Spark Structured Streaming (extended)

143f88e8c2b2a1123e87c81d9bbefa02?s=47 Lee Dongjin
October 25, 2018

Kafka Streams vs. Spark Structured Streaming (extended)

Kafka Streams vs. Spark Structured Streaming. 어떻게 사용할 수 있고, 내부는 어떻게 되어 있으며, 장단점은 무엇이고 어디에 써야 하는가?
2018년 10월, SKT 사내 세미나에서 발표.

Kafka Streams vs. Spark Structured Streaming: How you can use, How it works under the hood, advantages and disadvantages, and when to use it?
Presented in SK Telecom, October 2018.

Slides: English. Presentation: Korean.


Lee Dongjin

October 25, 2018


  1. Kafka Streams vs. Spark Structured Streaming (extended ver.) Lee Dongjin

    - dongjin@apache.org
  2. Introduction • So many Streaming Frameworks / Libraries ◦ RxJava,

    Spring Reactor, AKKA streams, Flink, Samza, Storm, … ◦ What to use?! • Spark Structured Streaming vs. Kafka Streams ◦ Advantages, Disadvantages, and Trade-offs. ◦ When to use, or not to use.
  3. Spark Structured Streaming: Overview (1) • Stream processing engine based

    on Spark SQL (2.0) API: RDD → Dataframe Execution: Batch → Streaming (microbatch, continuous) Spark Core (1.x) Spark SQL (2.x) Spark Streaming Spark Structured Streaming
  4. Spark Structured Streaming: Overview (2) • Describes the processing logic

    with Spark SQL Operations ◦ Easy to learn: almost identical to normal Batch SQL ▪ Except: Source, Sink, Trigger, Output Mode, Watermark, etc ... ◦ Provides various data sources and functions ◦ Optimized by Catalyst Optimizer
  5. Spark Structured Streaming: WordCount (1) // Create DataFrame representing the

    stream of input lines // from connection to localhost:9999 val lines = spark.readStream .format("socket") .option("host", "localhost") .option("port", 9999) .load() // Split the lines into words & generate running word count val words = lines.as[String].flatMap(_.split(" ")) val wordCounts = words.groupBy("value").count()
  6. Spark Structured Streaming: WordCount (2) // Start running the query

    that prints // the running counts to the console val query = wordCounts.writeStream .outputMode("complete") .format("console") .start() query.awaitTermination()
  7. Kafka Streams: Overview (1) • A stream processing library that

    directly integrates with Kafka ◦ from 0.10.0. ◦ Doesn't need a special runtime like YARN: runs in a normal java process. ◦ Masterless: runs on top of Kafka’s consumer group.
  8. Kafka Streams: Overview (2) • Describes the processing logic as

    a graph of processors ◦ ‘Processing Topology’ ◦ Source, Sink: Subclass of Processor • With… ◦ High level DSL (a.k.a. KStream API) ▪ Recommended ◦ Low level API
  9. Kafka Streams: WordCount (1) // Build Topology with StreamsBuilder final

    StreamsBuilder builder = new StreamsBuilder(); // KStream: unbounded series of records final KStream<String, String> source = builder.stream(inputTopic); // Transform input records into stream of words with `flatMapValues` method final KStream<String, String> tokenized = source .flatMapValues(value -> Arrays.asList( value.toLowerCase(Locale.getDefault()).split(" ")) );
  10. Kafka Streams: WordCount (2) // KTable: Stateful abstraction of aggregated

    stream // Build KTable from KStream by group and aggregate operations final KTable<String, Long> counts = tokenized .groupBy((key, value) -> value) .count(); // Convert KTable to KStream KStream<String, Long> changeLog = counts.toStream(); // Write back to output kafka topic changeLog.to(outputTopic, Produced.with(Serdes.String(), Serdes.Long())); // Build Topology instance return builder.build();
  11. Kafka Streams: WordCount (3) Properties props = ... // Configuration

    properties Topology topology = ... // Topology object final KafkaStreams streams = new KafkaStreams(topology, props); /* Omit some boilerplate codes... */ // Start the Kafka Streams application streams.start();
  12. KStream, KTable, ... KStream KTable KGroupedTable KGroupedStream SessionWindowedStream TimeWindowedStream group

    windowedBy aggregate aggregate group toStream
  13. Kafka Streams: Low Level API • You can define Processor

    classes manually ◦ Example: WordCountProcessor ◦ In fact, DSL builds Processor instances internally. • Achieves efficiency & fault-tolerance with… ◦ Intermediate Topic ◦ StateStore
  14. Kafka Streams: StateStore • Local, in-memory Key-Value store ◦ Implemented

    with RocksDB - Fast! ◦ Backed by changelog topic - Easy to restore! • Under the hood ◦ ex) KTable
  15. How Spark Structured Streaming works (1) • Dataframe: Container of

    QueryExecution ◦ Transformation method: Returns new Dataframe object with updated LogicalPlan. ▪ map, filter, select, ... ◦ Action method: trigger computation and return results. ▪ count, show, ... Dataframe QueryExecution LogicalPlan Provides API Handles the primary workflow for executing LogicalPlan Describes logical operation
  16. How Spark Structured Streaming works (2) • When action method

    is called, the LogicalPlan is translated into RDD operations ◦ And finally, Tasks. (Unresolved) LogicalPlan (Resolved) LogicalPlan (Optimized) LogicalPlan SparkPlan Resolve variables w/ Catalog Logical Optimization (ex. Predicate Pushdown) Convert into RDD Operations * RDD Operations are divided into Tasks and run by Executors.
  17. How Spark Structured Streaming works (3) • Then, What happens

    with Streaming? ◦ StreamExecution ▪ (Almost) Resolved LogicalPlan ▪ Trigger ▪ Output Mode ▪ Output Sink StreamExecution LogicalPlan
  18. How Spark Structured Streaming works (4) • Driver triggers StreamExecution

    periodically. ◦ Driver checks newly arrived records. (ex. checks the latest offset in Kafka topic.) ◦ Clones LogicalPlan, fill with arrived records, and run in normal workflow. ▪ In other words, all workflow is identical with Batch computations from then on. ◦ For each Task, Executor requests the records of given offset range to Kafka brokers.
  19. How Spark Structured Streaming works (5) val lines = spark.readStream

    ... .load() ... val query = wordCounts.writeStream .outputMode("complete") .format("console") .start() query.awaitTermination() (Create Dataframe, along with contained LogicalPlan.) (Resolve LogicalPlan, Create StreamExecution instance, and have StreamingQueryManager to start it.)
  20. How Kafka Streams works (1) • Built on top of

    Kafka’s consumer group feature ◦ Automatically divides the records into disjoint sets. • StreamTask ◦ Created per input partition. ◦ ex) Streams Topology with input topic A (2 partitions) and B (3 partitions): 3 StreamTasks! ◦ Run by thread pool.
  21. How Kafka Streams works (2) • Run with 1 machine,

    1 process with 2 threads
  22. How Kafka Streams works (3) • Run with 2 machines,

    2 processes with 2 threads each
  23. Comparison Kafka Streams Spark Structured Streaming Deployment Standalone Java Application

    Spark Executor (mostly, YARN cluster) Streaming Source Kafka only Kafka, File System, Kinesis, ... Fault-Tolerance StateStore, backed by changelog RDD cache Syntax Low level Processor API / High Level DSL Spark SQL Semantics Simple Rich (w/ query optimization)
  24. Conclusion • Spark Structured Streaming for rich semantics ◦ ETL

    Tasks. ◦ ex) Join records with RDBMS, run ML pipeline, etc., ... • Kafka Streams for lightweight manipulation of Kafka topics ◦ Preprocess Kafka topics. ◦ Microservice run with Kafka topics. ◦ Event-based prediction. (e.g., Kafka Streams w/ tensorflow)
  25. Questions? • Slides ◦ https://speakerdeck.com/dongjin • 한국 스파크 사용자 모임

    ◦ https://www.facebook.com/groups/sparkkoreauser/ • Kafka 한국 사용자 모임 ◦ https://www.facebook.com/groups/kafkakorea/