
Kafka Streams vs. Spark Structured Streaming (extended)


Kafka Streams vs. Spark Structured Streaming: how to use them, how they work under the hood, what their advantages and disadvantages are, and when to use each.
Presented at an SK Telecom internal seminar, October 2018.

Slides: English. Presentation: Korean.

Lee Dongjin

October 25, 2018

Transcript

  1. Kafka Streams vs.
    Spark Structured Streaming
    (extended ver.)
    Lee Dongjin - [email protected]


  2. Introduction
    ● So many Streaming Frameworks / Libraries
    ○ RxJava, Spring Reactor, Akka Streams, Flink, Samza, Storm, …
    ○ What to use?!
    ● Spark Structured Streaming vs. Kafka Streams
    ○ Advantages, Disadvantages, and Trade-offs.
    ○ When to use, or not to use.


  3. Spark Structured Streaming: Overview (1)
    ● Stream processing engine based on Spark SQL (introduced in Spark 2.0)
    ○ API: RDD → Dataframe
    ○ Execution: Batch → Streaming (microbatch, continuous)
    [Diagram: Spark Streaming builds on Spark Core (1.x) and the RDD API;
    Spark Structured Streaming builds on Spark SQL (2.x) and the Dataframe API.]


  4. Spark Structured Streaming: Overview (2)
    ● Describes the processing logic with Spark SQL Operations
    ○ Easy to learn: almost identical to normal Batch SQL
    ■ Except: Source, Sink, Trigger, Output Mode, Watermark, etc. (see the sketch below this list)
    ○ Provides various data sources and functions
    ○ Optimized by Catalyst Optimizer
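    Below is a minimal sketch (in Java, not from the deck; it assumes an existing SparkSession named `spark`, and the topic and column names are placeholders) of how those streaming-only concepts attach to otherwise ordinary Spark SQL operations:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.window;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.streaming.Trigger;

    Dataset<Row> events = spark.readStream()
        .format("kafka")                                        // Source
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "events")
        .load();

    Dataset<Row> counts = events
        .selectExpr("CAST(value AS STRING) AS word", "timestamp")
        .withWatermark("timestamp", "10 minutes")               // Watermark: bounds state kept for late data
        .groupBy(window(col("timestamp"), "5 minutes"), col("word"))
        .count();

    counts.writeStream()
        .outputMode("update")                                   // Output Mode
        .format("console")                                      // Sink
        .trigger(Trigger.ProcessingTime("1 minute"))            // Trigger
        .start();

    Everything between load() and writeStream() is plain Spark SQL; only the source, sink, trigger, output mode, and watermark are streaming-specific.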


  5. Spark Structured Streaming: WordCount (1)
    // Create DataFrame representing the stream of input lines
    // from connection to localhost:9999
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Split the lines into words & generate running word count
    val words = lines.as[String].flatMap(_.split(" "))
    val wordCounts = words.groupBy("value").count()


  6. Spark Structured Streaming: WordCount (2)
    // Start running the query that prints
    // the running counts to the console
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()


  7. Kafka Streams: Overview (1)
    ● A stream processing library that directly integrates with Kafka
    ○ Shipped with Kafka since 0.10.0.
    ○ Doesn't need a special runtime like YARN: runs in a normal Java process.
    ○ Masterless: runs on top of Kafka’s consumer group.


  8. Kafka Streams: Overview (2)
    ● Describes the processing logic as a graph
    of processors
    ○ ‘Processing Topology’
    ○ Source, Sink: subclasses of Processor
    ● With…
    ○ High level DSL (a.k.a. KStream API)
    ■ Recommended
    ○ Low level API


  9. Kafka Streams: WordCount (1)
    // Build the Topology with a StreamsBuilder
    final StreamsBuilder builder = new StreamsBuilder();

    // KStream: an unbounded series of records
    final KStream<String, String> source = builder.stream(inputTopic);

    // Transform input records into a stream of words with the `flatMapValues` method
    final KStream<String, String> tokenized = source
        .flatMapValues(value -> Arrays.asList(
            value.toLowerCase(Locale.getDefault()).split(" ")));


  10. Kafka Streams: WordCount (2)
    // KTable: a stateful abstraction of an aggregated stream
    // Build a KTable from the KStream with group and aggregate operations
    final KTable<String, Long> counts = tokenized
        .groupBy((key, value) -> value)
        .count();

    // Convert the KTable to a KStream
    final KStream<String, Long> changeLog = counts.toStream();

    // Write back to the output Kafka topic
    changeLog.to(outputTopic, Produced.with(Serdes.String(), Serdes.Long()));

    // Build the Topology instance
    return builder.build();


  11. Kafka Streams: WordCount (3)
    Properties props = ... // Configuration properties
    Topology topology = ... // Topology object
    final KafkaStreams streams = new KafkaStreams(topology, props);
    /* Some boilerplate code omitted... (a sketch follows below) */
    // Start the Kafka Streams application
    streams.start();
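    The omitted boilerplate above typically covers configuration and shutdown. A hedged sketch of one common shape (the application id, bootstrap servers, and serdes below are assumptions, not the deck's values; `topology` is the object built on the previous slides):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.Topology;

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount"); // doubles as the consumer group id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    final KafkaStreams streams = new KafkaStreams(topology, props);
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close)); // close cleanly on Ctrl-C / SIGTERM
    streams.start();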


  12. KStream, KTable, ...
    [Diagram: conversions among the DSL abstractions —
    KStream → (group) → KGroupedStream → (windowedBy) → TimeWindowedStream / SessionWindowedStream,
    KGroupedStream and the windowed streams → (aggregate) → KTable,
    KTable → (group) → KGroupedTable → (aggregate) → KTable,
    KTable → (toStream) → KStream.]
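    Traced in code, the diagram's conversions look roughly like this (a hypothetical Java sketch against the Kafka 2.0-era DSL; "input" is a placeholder topic, and the windowed types are actually named TimeWindowedKStream / SessionWindowedKStream):

    import java.util.concurrent.TimeUnit;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KGroupedStream;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.TimeWindowedKStream;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import org.apache.kafka.streams.kstream.Windowed;

    StreamsBuilder builder = new StreamsBuilder();

    KStream<String, String> stream = builder.stream("input");              // KStream
    KGroupedStream<String, String> grouped = stream.groupByKey();          // group
    TimeWindowedKStream<String, String> windowed =
        grouped.windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(5)));  // windowedBy
    KTable<Windowed<String>, Long> counts = windowed.count();              // aggregate
    KStream<Windowed<String>, Long> changelog = counts.toStream();         // toStream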


  13. Kafka Streams: Low Level API
    ● You can define Processor classes manually
    ○ Example: WordCountProcessor (a sketch follows after this list)
    ○ In fact, the DSL builds Processor instances internally.
    ● Achieves efficiency & fault-tolerance with…
    ○ Intermediate Topic
    ○ StateStore
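    A hedged sketch of what such a Processor can look like (an assumption of the general shape, not the deck's actual WordCountProcessor; the store name "counts" is a placeholder):

    import org.apache.kafka.streams.processor.AbstractProcessor;
    import org.apache.kafka.streams.processor.ProcessorContext;
    import org.apache.kafka.streams.state.KeyValueStore;

    public class WordCountProcessor extends AbstractProcessor<String, String> {
        private KeyValueStore<String, Long> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(final ProcessorContext context) {
            super.init(context);
            // The store must be registered on the Topology under the same name.
            store = (KeyValueStore<String, Long>) context.getStateStore("counts");
        }

        @Override
        public void process(final String key, final String line) {
            // Count each word into the local StateStore; fault-tolerance comes
            // from the store's changelog topic, not from this code.
            for (final String word : line.toLowerCase().split(" ")) {
                final Long count = store.get(word);
                store.put(word, count == null ? 1L : count + 1);
            }
        }
    }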


  14. Kafka Streams: StateStore
    ● Local Key-Value store
    ○ Implemented with RocksDB - Fast!
    ○ Backed by changelog topic - Easy to restore!
    ● Under the hood
    ○ ex) KTable is materialized as a StateStore (a wiring sketch follows below)
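    For example, wiring the Processor sketched above to a RocksDB-backed, changelog-logged store could look like this (again a hypothetical sketch; the node and store names are placeholders):

    import java.util.HashMap;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.state.KeyValueStore;
    import org.apache.kafka.streams.state.StoreBuilder;
    import org.apache.kafka.streams.state.Stores;

    StoreBuilder<KeyValueStore<String, Long>> storeBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("counts"),   // RocksDB-backed local store
                Serdes.String(), Serdes.Long())
            .withLoggingEnabled(new HashMap<>());           // backed by a changelog topic

    Topology topology = new Topology()
        .addSource("Source", "input")                             // read from topic "input"
        .addProcessor("Count", WordCountProcessor::new, "Source")
        .addStateStore(storeBuilder, "Count");                    // attach the store to the processor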


  15. How Spark Structured Streaming works (1)
    ● Dataframe: a container of a QueryExecution
    ○ Transformation methods: return a new Dataframe object with an updated LogicalPlan.
    ■ map, filter, select, ...
    ○ Action methods: trigger computation and return results.
    ■ count, show, ...
    [Diagram: Dataframe (provides the API) → QueryExecution (handles the primary
    workflow for executing the LogicalPlan) → LogicalPlan (describes the logical operation).]


  16. How Spark Structured Streaming works (2)
    ● When an action method is called, the LogicalPlan is translated into RDD operations
    ○ And finally, into Tasks.
    [Diagram: (Unresolved) LogicalPlan → resolve variables w/ Catalog → (Resolved) LogicalPlan
    → logical optimization (ex. predicate pushdown) → (Optimized) LogicalPlan
    → convert into RDD operations → SparkPlan.]
    * The RDD operations are divided into Tasks and run by Executors.
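    The laziness described on the two slides above can be observed directly; a minimal Java sketch (the file "people.json" and its columns are hypothetical), where explain(true) prints the parsed, analyzed (resolved), and optimized logical plans plus the physical plan, i.e. the pipeline drawn above:

    import static org.apache.spark.sql.functions.col;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
        .appName("plans").master("local[*]").getOrCreate();

    Dataset<Row> df = spark.read().json("people.json");
    Dataset<Row> adults = df
        .filter(col("age").gt(21))  // transformation: new Dataset with an updated LogicalPlan
        .select(col("name"));       // still no computation triggered

    adults.explain(true);           // show each plan stage of the pipeline above
    long n = adults.count();        // action: the plan becomes RDD operations / Tasks and runs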


  17. How Spark Structured Streaming works (3)
    ● Then, what happens with Streaming?
    ○ A StreamExecution wraps:
    ■ An (almost) resolved LogicalPlan
    ■ Trigger
    ■ Output Mode
    ■ Output Sink
    [Diagram: StreamExecution holding its LogicalPlan.]


  18. How Spark Structured Streaming works (4)
    ● The Driver triggers the StreamExecution periodically.
    ○ The Driver checks for newly arrived records. (ex. checks the latest offset of the Kafka topic.)
    ○ It clones the LogicalPlan, fills it with the arrived records, and runs it through the normal workflow.
    ■ In other words, from then on the workflow is identical to a Batch computation.
    ○ For each Task, an Executor requests the records of the given offset range from the Kafka brokers.


  19. How Spark Structured Streaming works (5)
    // Create the Dataframe, along with its contained LogicalPlan.
    val lines = spark.readStream
      ...
      .load()
    ...
    // Resolve the LogicalPlan, create a StreamExecution instance,
    // and have the StreamingQueryManager start it.
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()


  20. How Kafka Streams works (1)
    ● Built on top of Kafka’s consumer group feature
    ○ Automatically divides the records into disjoint sets.
    ● StreamTask
    ○ Created per input partition.
    ○ ex) A topology with input topics A (2 partitions) and B (3 partitions) gets max(2, 3) = 3 StreamTasks.
    ○ Run by a thread pool.


  21. How Kafka Streams works (2)
    ● Run with 1 machine, 1 process with 2
    threads


  22. How Kafka Streams works (3)
    ● Run with 2 machines, 2 processes with 2
    threads each
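    In code, scaling across those two pictures is only configuration (a hedged sketch; the application id and servers are placeholders, and `topology` is assumed to be built as before). Every instance started with the same application.id joins the same consumer group, and the StreamTasks are rebalanced across all threads automatically:

    import java.util.Properties;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsConfig;

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount");  // same id on every instance
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 2);        // 2 StreamThreads per process

    // Starting this same application on a second machine triggers a rebalance
    // that redistributes the StreamTasks; no master or resource manager is involved.
    new KafkaStreams(topology, props).start();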


  23. Comparison
                        Kafka Streams                              Spark Structured Streaming
    Deployment          Standalone Java application                Spark Executors (mostly on a YARN cluster)
    Streaming Source    Kafka only                                 Kafka, File System, Kinesis, ...
    Fault-Tolerance     StateStore, backed by a changelog topic    RDD cache
    Syntax              Low-level Processor API / High-level DSL   Spark SQL
    Semantics           Simple                                     Rich (w/ query optimization)


  24. Conclusion
    ● Spark Structured Streaming for rich semantics
    ○ ETL tasks.
    ○ ex) Joining records with an RDBMS, running ML pipelines, etc.
    ● Kafka Streams for lightweight manipulation of Kafka topics
    ○ Preprocessing Kafka topics.
    ○ Microservices running on Kafka topics.
    ○ Event-based prediction. (e.g., Kafka Streams w/ TensorFlow)


  25. Questions?
    ● Slides
    ○ https://speakerdeck.com/dongjin
    ● Korean Spark User Group
    ○ https://www.facebook.com/groups/sparkkoreauser/
    ● Kafka Korea User Group
    ○ https://www.facebook.com/groups/kafkakorea/
