Kafka Streams vs. Spark Structured Streaming

Lee Dongjin
November 08, 2018

Kafka Streams vs. Spark Structured Streaming: how to use each, what their advantages and disadvantages are, and when to use which.
Presented at Cloudera Sessions 2018 on November 8, 2018.

Slides: Korean. Presentation: Korean.

Transcript

  1. © Cloudera, Inc. All rights reserved.

    Kafka Streams vs. Spark Structured Streaming
    Apache Software Foundation / Lee Dongjin (dongjin@apache.org)
  2. Speaker introduction

    • Open source contributor
      ◦ Committer, Apache Software Foundation
      ◦ Spark, Kafka, etc.
    • Open source community organizer
      ◦ Korea Spark User Group (한국 스파크 사용자 모임)
      ◦ Kafka Korea User Group (Kafka 한국 사용자 모임)
  3. Stream processing: what should you use?

    • Stream processing technologies
      ◦ RxJava, Spring Reactor, Akka Streams, Flink, Samza, Storm, …
      ◦ "Which technology should be chosen in which situation?"
        ▪ Spark Structured Streaming
        ▪ Kafka Streams
  4. Spark Structured Streaming: Overview

    • Streaming extension of Spark SQL (since v2.0)
      ◦ 'Stream = a table that is updated without end'
      ◦ Jobs are defined with the DataFrame API
        ▪ Example: Wordcount

    Batch version:

        // Needed for .as[String]; assumes an existing SparkSession named `spark`.
        import spark.implicits._

        // Batch: create a DataFrame from a text file.
        val lines = spark.read
          .text("file.txt")

        // Split the lines into words and count each word.
        val wordCounts = lines.as[String]
          .flatMap(_.split(" "))
          .groupBy("value").count()

    Streaming version:

        // Needed for .as[String]; assumes an existing SparkSession named `spark`.
        import spark.implicits._

        // Streaming: create a DataFrame from a socket connection to localhost:9999.
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()

        // Split the lines into words and count each word (same code as the batch version).
        val wordCounts = lines.as[String]
          .flatMap(_.split(" "))
          .groupBy("value").count()
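The "stream = a table that is updated without end" idea on the slide above can be illustrated without Spark. The following is a minimal plain-Java sketch, not Spark code: the class `UnboundedTableSketch` and its methods are hypothetical names invented for this illustration. Each arriving line is appended as a new row of an input table, and the word-count result table is updated incrementally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the "stream = unbounded table" model; not Spark API.
class UnboundedTableSketch {
    // Input table: grows by one row for every line that arrives.
    private final List<String> inputRows = new ArrayList<>();
    // Result table: word -> count, kept up to date as rows arrive.
    private final Map<String, Long> wordCounts = new TreeMap<>();

    // Append one row to the input table and update the result table.
    void append(String line) {
        inputRows.add(line);
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                wordCounts.merge(word, 1L, Long::sum);
            }
        }
    }

    // Number of rows appended so far.
    int rowCount() {
        return inputRows.size();
    }

    // Snapshot of the current result table.
    Map<String, Long> result() {
        return new TreeMap<>(wordCounts);
    }

    public static void main(String[] args) {
        UnboundedTableSketch table = new UnboundedTableSketch();
        table.append("apple banana");
        table.append("apple");
        System.out.println(table.result()); // prints {apple=2, banana=1}
    }
}
```

The query logic (split, group, count) is the same whether the input is a fixed file or rows that keep arriving, which is why the batch and streaming Wordcount snippets on the slide differ only in how `lines` is created.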
  5. Spark Structured Streaming: Advantages

    • Usable right away after learning a few additional concepts.
      ◦ Source, Sink, Trigger, Watermark, …
    • Many functions and data sources are supported.
      ◦ Function: Join, ML Pipeline, ...
      ◦ Datasource: RDBMS, Parquet, JSON, ...
      ◦ Example: join records arriving through Kafka with information stored in an RDBMS and in Parquet.
    • You only need to care about what the job should do.
      ◦ Catalyst Optimizer
  6. Kafka Streams: Overview (1)

    • Stream processing library provided by Kafka
      ◦ since v0.10.0
    • Characteristics
      ◦ 'Stream processing' = 'a set of processing steps that exchange stream data with one another'
        ▪ Processor Topology
      ◦ Jobs are defined with a DSL
        ▪ KStream, ...

    [Figure: a processor topology with nodes A, B and C. Legend: Source Processor (reads a Kafka topic), Processor (stateless/stateful operation), Sink Processor (writes to a Kafka topic); arrows denote data forwarding.]

    Topic A              Topic B          Topic C
    Key   Value          Key   Value      Key        Value
    "a"   "apple"        "g"   5          "apple"    12
    "b"   "banana"       "a"   12         "banana"   42
    "c"   "cinnamon"     "b"   42
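The sample data in the figure above can be reproduced with a small plain-Java sketch. This is not Kafka Streams code: the class `TopologySketch` and the method `joinAndRekey` are hypothetical names, and reading the figure as an inner join of Topic A and Topic B on the key, re-keyed by Topic A's value, is an assumption based only on the sample rows.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical plain-Java stand-in for the A/B/C topology in the figure; not Kafka Streams API.
class TopologySketch {
    // Joins Topic A (key -> name) with Topic B (key -> number) on the key and
    // emits name -> number. Keys present in only one topic produce no output.
    static Map<String, Integer> joinAndRekey(Map<String, String> topicA,
                                             Map<String, Integer> topicB) {
        Map<String, Integer> topicC = new LinkedHashMap<>();
        for (Map.Entry<String, String> a : topicA.entrySet()) {
            Integer number = topicB.get(a.getKey());
            if (number != null) {
                topicC.put(a.getValue(), number);
            }
        }
        return topicC;
    }

    public static void main(String[] args) {
        Map<String, String> topicA = new LinkedHashMap<>();
        topicA.put("a", "apple");
        topicA.put("b", "banana");
        topicA.put("c", "cinnamon");

        Map<String, Integer> topicB = new LinkedHashMap<>();
        topicB.put("g", 5);
        topicB.put("a", 12);
        topicB.put("b", 42);

        System.out.println(joinAndRekey(topicA, topicB)); // prints {apple=12, banana=42}
    }
}
```

Keys "c" and "g" appear in only one input, so they do not reach Topic C, which matches the two rows shown in the figure.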
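  7. Kafka Streams: Overview (2)

    • Example: Instantiate Wordcount Topology

        import java.util.Arrays;
        import java.util.Locale;
        import org.apache.kafka.common.serialization.Serdes;
        import org.apache.kafka.streams.StreamsBuilder;
        import org.apache.kafka.streams.Topology;
        import org.apache.kafka.streams.kstream.KStream;
        import org.apache.kafka.streams.kstream.KTable;
        import org.apache.kafka.streams.kstream.Produced;

        // The enclosing method name and parameters are illustrative; the slide shows only the body.
        static Topology buildTopology(final String inputTopic, final String outputTopic) {
            // Build Topology with StreamsBuilder
            final StreamsBuilder builder = new StreamsBuilder();

            // KStream: unbounded series of records
            final KStream<String, String> source = builder.stream(inputTopic);

            // Transform input records into a stream of words with the `flatMapValues` method
            final KStream<String, String> tokenized = source
                .flatMapValues(value -> Arrays.asList(value.toLowerCase(Locale.getDefault()).split(" ")));

            // KTable: stateful abstraction of an aggregated stream
            // Build a KTable from the KStream by group and aggregate operations
            final KTable<String, Long> counts = tokenized.groupBy((key, value) -> value).count();

            // Write the aggregated state back to the output Kafka topic.
            // KTable has no to() method since Kafka 2.0, so convert it to a stream first.
            counts.toStream().to(outputTopic, Produced.with(Serdes.String(), Serdes.Long()));

            // Build the Topology instance
            return builder.build();
        }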
  8. Kafka Streams: Overview (3)

    • Example: Run Wordcount Topology

        public static void main(String[] args) {
            Properties props = ... // Configuration properties
            Topology topology = ... // Topology object, built with StreamsBuilder

            final KafkaStreams streams = new KafkaStreams(topology, props);

            /* Some boilerplate code omitted... */

            // Start the Kafka Streams application
            streams.start();
        }
  9. Kafka Streams: Advantages

    • Library, not a framework
      ◦ "The user decides how it is run."
    • Masterless
      ◦ "No dedicated master is needed for coordination."
        ▪ The whole job (Processor Topology) is split into fully independent pieces (StreamTasks).
        ▪ The number of StreamTasks is determined by the number of partitions of the source Kafka topic.
        ▪ Which host works on which StreamTask is coordinated through Kafka's Consumer Group feature.
        ▪ "No matter how many processes are started on which hosts, each one is assigned StreamTasks from the predetermined set and works on them."
    • Fault-tolerance
      ◦ StateStore
        ▪ A local key-value store that holds a Processor's state (RocksDB by default, which keeps data on local disk rather than purely in memory)
        ▪ Changes are recorded in a changelog topic
        ▪ "When a process terminates, or when work starts on another host, the StateStore contents are restored."
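The changelog-based recovery described on the slide above can be sketched in plain Java. This is a conceptual illustration only, not the Kafka Streams StateStore API: `ChangelogStoreSketch`, `Change`, `put`, `get` and `restore` are hypothetical names, and an in-process list stands in for the changelog topic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a changelog-backed state store; not the Kafka Streams API.
class ChangelogStoreSketch {
    // One entry of the changelog: the latest value written for a key.
    static final class Change {
        final String key;
        final long value;

        Change(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    // Local state, lost when the process dies.
    private final Map<String, Long> state = new HashMap<>();
    // Stand-in for the changelog topic: an append-only list that outlives the store.
    private final List<Change> changelog;

    ChangelogStoreSketch(List<Change> changelog) {
        this.changelog = changelog;
    }

    // Every write updates the local state and is also appended to the changelog.
    void put(String key, long value) {
        state.put(key, value);
        changelog.add(new Change(key, value));
    }

    // Returns the current value for the key, or null if the key is unknown.
    Long get(String key) {
        return state.get(key);
    }

    // Rebuilds local state on a fresh instance by replaying the changelog in order.
    static ChangelogStoreSketch restore(List<Change> changelog) {
        ChangelogStoreSketch store = new ChangelogStoreSketch(changelog);
        for (Change change : changelog) {
            store.state.put(change.key, change.value);
        }
        return store;
    }

    public static void main(String[] args) {
        List<Change> changelog = new ArrayList<>();
        ChangelogStoreSketch first = new ChangelogStoreSketch(changelog);
        first.put("apple", 1L);
        first.put("apple", 2L);
        first.put("banana", 1L);

        // Simulate a restart on another host: only the changelog survives.
        ChangelogStoreSketch second = ChangelogStoreSketch.restore(changelog);
        System.out.println(second.get("apple") + " " + second.get("banana")); // prints 2 1
    }
}
```

Because later entries overwrite earlier ones during replay, the restored store ends up with the last value written for each key, which is the property the changelog needs to provide.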
  10. Comparison

                        Kafka Streams                               Spark Structured Streaming
    Deployment          Standalone Java process                     Spark executor (mostly on a YARN cluster)
    Streaming Source    Kafka only                                  Socket, file system, Kafka, ...
    Execution Model     Masterless                                  Driver + Executor(s)
    Fault-Tolerance     StateStore, backed by changelog             RDD Cache
    Syntax              Low-level Processor API / high-level DSL    Spark SQL
    Semantics           Simple                                      Rich (w/ query optimization)
  11. Conclusion (1)

    • Spark Structured Streaming
      ◦ When data must be read from multiple data sources
      ◦ When complex processing is required
        ▪ Join, Pivot, ML Pipeline, …
      ◦ Example: ETL processing
  12. Conclusion (2)

    • Kafka Streams
      ◦ When developing a (lightweight) application that mainly processes Kafka topics
        ▪ Microservices that use Kafka topics
          • e.g. read topics, cache them, and provide a query feature (Interactive Query)
        ▪ Preprocessing of Kafka topics
          • e.g. read topic(s) and store the data in processed form in another topic
        ▪ Instant prediction on events
          • e.g. apply an ML model to events stored in a topic and store the predicted values in another topic
  13. Questions?

    • Slides
      ◦ https://speakerdeck.com/dongjin/kafka-streams-vs-spark-structured-streaming
    • Korea Spark User Group (한국 스파크 사용자 모임)
      ◦ https://www.facebook.com/groups/sparkkoreauser/
    • Kafka Korea User Group (Kafka 한국 사용자 모임)
      ◦ https://www.facebook.com/groups/kafkakorea/
  14. THANK YOU