Slide 1

© Cloudera, Inc. All rights reserved.

Kafka Streams vs. Spark Structured Streaming
Apache Software Foundation / Lee Dongjin ([email protected])

Slide 2

Speaker Introduction

● Open source contributor
  ○ Committer, Apache Software Foundation
  ○ Spark, Kafka, etc.
● Open source community organizer
  ○ Korea Spark User Group
  ○ Kafka Korea User Group

Slide 3

Stream Processing: What Should You Use?

● Stream processing technologies
  ○ RxJava, Spring Reactor, Akka Streams, Flink, Samza, Storm, …
  ○ "Which technology should you choose in which situation?"
    ■ Spark Structured Streaming
    ■ Kafka Streams

Slide 4

Spark Structured Streaming: Overview

● A streaming extension of Spark SQL (v2.0+)
  ○ 'Stream = an infinitely updating Table'
  ○ Jobs are defined with the DataFrame API
    ■ Example: WordCount

  // Batch: create a DataFrame from a text file.
  val lines = spark.read.text("file.txt")

  // Split the lines into words and count them.
  val wordCounts = lines.as[String]
    .flatMap(_.split(" "))
    .groupBy("value").count()

  // Streaming: create a DataFrame from a socket connection to localhost:9999.
  val lines = spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()

  // Split the lines into words and count them.
  val wordCounts = lines.as[String]
    .flatMap(_.split(" "))
    .groupBy("value").count()
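The 'Stream = infinitely updating Table' idea above can be sketched in plain Scala: each arriving micro-batch is folded into a running result table, so the streaming result after n batches equals the batch result over all input seen so far. This is a conceptual model only, not the Spark API; the names (`UnboundedTableModel`, `update`, `run`) are illustrative.

```scala
// A minimal plain-Scala model of the 'Stream = infinitely updating Table' idea:
// each micro-batch of new lines is folded into a running word-count table.
// Conceptual sketch only; not the Spark Structured Streaming API.
object UnboundedTableModel {
  type Table = Map[String, Long]

  // Incorporate one micro-batch of lines into the running word-count table.
  def update(table: Table, batch: Seq[String]): Table =
    batch
      .flatMap(_.split(" "))
      .foldLeft(table) { (t, word) => t.updated(word, t.getOrElse(word, 0L) + 1L) }

  // Folding all batches seen so far reproduces the batch result over the same input.
  def run(batches: Seq[Seq[String]]): Table =
    batches.foldLeft(Map.empty: Table)(update)
}
```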

Slide 5

Spark Structured Streaming: Advantages

● Usable right away after learning only a few additional concepts
  ○ Source, Sink, Trigger, Watermark, …
● A wide range of functions and data sources are supported
  ○ Functions: Join, ML Pipeline, …
  ○ Data sources: RDBMS, Parquet, JSON, …
  ○ Example: join records arriving from Kafka with data stored in an RDBMS and in Parquet files
● You only have to think about what to compute, not how
  ○ Catalyst Optimizer
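The stream-with-static-data join mentioned above can be modeled in plain Scala: records arriving from a stream (here just a `Seq`) are enriched against two static lookup tables, standing in for data loaded from an RDBMS and from Parquet files. All names and data (`Record`, `enrich`, `userNames`, `userRegions`) are illustrative assumptions; this is not the Spark API.

```scala
// A plain-Scala sketch of joining streaming records with static reference data.
// Conceptual model only; in Spark this would be a DataFrame join against
// tables loaded from an RDBMS and from Parquet.
object StreamStaticJoinModel {
  final case class Record(userId: String, amount: Long)

  def enrich(
      stream: Seq[Record],
      userNames: Map[String, String],   // stand-in for an RDBMS lookup table
      userRegions: Map[String, String]  // stand-in for Parquet-backed data
  ): Seq[(String, String, Long)] =
    stream.flatMap { r =>
      for {
        name   <- userNames.get(r.userId)   // inner join: drop records with no match
        region <- userRegions.get(r.userId)
      } yield (name, region, r.amount)
    }
}
```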

Slide 6

Kafka Streams: Overview (1)

● The stream processing library shipped with Kafka
  ○ v0.10.0+
● Characteristics
  ○ 'Stream processing' = 'a set of processing steps that pass stream data to one another'
    ■ Processor Topology
  ○ Jobs are defined with a DSL
    ■ KStream, …

[Diagram: a Processor Topology A → B → C with data forwarding between nodes. A is a Source Processor reading Topic A ("a" → "apple", "b" → "banana", "c" → "cinnamon"); B is a Processor applying a stateless/stateful operation, alongside Topic B ("g" → 5, "a" → 12, "b" → 42); C is a Sink Processor writing to Topic C ("apple" → 12, "banana" → 42).]
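The Processor Topology idea above can be sketched as function composition: a source stage produces records, each processor transforms them, and a sink stage receives the result. This is a conceptual model only, not the Kafka Streams API; the names (`TopologyModel`, `topology`) are illustrative.

```scala
// A minimal model of a Processor Topology as function composition:
// each processor maps a batch of key-value records to another batch,
// and a topology is the left-to-right composition of its processors.
// Conceptual sketch only; not the Kafka Streams Topology API.
object TopologyModel {
  type Records = Seq[(String, String)]

  // Compose processors in order: the output of each feeds the next.
  def topology(processors: Seq[Records => Records]): Records => Records =
    processors.reduce(_ andThen _)
}
```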

Slide 7

Kafka Streams: Overview (2)

● Example: instantiate a WordCount Topology

  // Build a Topology with StreamsBuilder.
  final StreamsBuilder builder = new StreamsBuilder();

  // KStream: an unbounded series of records.
  final KStream<String, String> source = builder.stream(inputTopic);

  // Transform the input records into a stream of words with the `flatMapValues` method.
  final KStream<String, String> tokenized = source
      .flatMapValues(value -> Arrays.asList(value.toLowerCase(Locale.getDefault()).split(" ")));

  // KTable: a stateful abstraction of an aggregated stream.
  // Build a KTable from a KStream with group and aggregate operations.
  final KTable<String, Long> counts = tokenized.groupBy((key, value) -> value).count();

  // Write the aggregated state back to the output Kafka topic.
  counts.toStream().to(outputTopic, Produced.with(Serdes.String(), Serdes.Long()));

  // Build the Topology instance.
  return builder.build();

Slide 8

Kafka Streams: Overview (3)

● Example: run the WordCount Topology

  public static void main(final String[] args) {
      final Properties props = ...   // configuration properties
      final Topology topology = ...  // Topology object
      final KafkaStreams streams = new KafkaStreams(topology, props);

      /* Omit some boilerplate code... */

      // Start the Kafka Streams application.
      streams.start();
  }

Slide 9

Kafka Streams: Advantages

● A library, not a framework
  ○ "The user decides how it is executed."
● Masterless
  ○ "No coordination is needed."
    ■ The whole job (the Processing Topology) is split into fully independent pieces (StreamTasks).
    ■ The number of StreamTasks is determined by the number of partitions of the source Kafka topic.
    ■ Which host works on which StreamTask is coordinated through Kafka's Consumer Group mechanism.
    ■ "No matter how many processes you launch, or on which hosts, each one is assigned one of the predetermined StreamTasks to work on."
● Fault tolerance
  ○ StateStore
    ■ A local key-value store (RocksDB) holding each Processor's state
    ■ Changes are recorded as a changelog topic
    ■ "When a process dies, or the work resumes on another host, the StateStore contents are restored."
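The masterless StreamTask assignment described above can be sketched in plain Scala: the task count is fixed by the source topic's partition count, and tasks are spread over whatever instances happen to be running. The round-robin scheme and names (`TaskAssignmentModel`, `assign`) are illustrative assumptions; the real assignment is negotiated through Kafka's consumer group protocol.

```scala
// A sketch of masterless task assignment: one StreamTask per source partition,
// dealt round-robin over the running instances, consumer-group style.
// Conceptual model only; Kafka's group protocol does the real coordination.
object TaskAssignmentModel {
  // Map each instance (host) to the StreamTask/partition ids it works on.
  def assign(numPartitions: Int, instances: Seq[String]): Map[String, Seq[Int]] =
    (0 until numPartitions).groupBy(p => instances(p % instances.size))
}
```

Note that adding or removing an instance only reshuffles which host owns which task; the set of tasks itself never changes, which is what makes the rebalance safe.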

Slide 10

Comparison

                    Kafka Streams                               Spark Structured Streaming
Deployment          Standalone Java process                     Spark executors (mostly on a YARN cluster)
Streaming source    Kafka only                                  Socket, file system, Kafka, …
Execution model     Masterless                                  Driver + executor(s)
Fault tolerance     StateStore, backed by a changelog           RDD cache
Syntax              Low-level Processor API / high-level DSL    Spark SQL
Semantics           Simple                                      Rich (with query optimization)

Slide 11

Conclusion (1)

● Spark Structured Streaming
  ○ When you need to read data from multiple data sources
  ○ When you need complex processing
    ■ Join, Pivot, ML Pipeline, …
  ○ Example: ETL jobs

Slide 12

Conclusion (2)

● Kafka Streams
  ○ When developing (lightweight) applications that mainly process Kafka topics
    ■ Microservices that consume Kafka topics
      ● e.g. read topics into a cache and serve queries over it (Interactive Query)
    ■ Preprocessing Kafka topics
      ● e.g. read topic(s) and write the transformed records to another topic
    ■ Instant prediction on events
      ● e.g. apply an ML model to events stored in a topic and write the predicted values to another topic

Slide 13

Questions?

● Slides
  ○ https://speakerdeck.com/dongjin/kafka-streams-vs-spark-structured-streaming
● Korea Spark User Group
  ○ https://www.facebook.com/groups/sparkkoreauser/
● Kafka Korea User Group
  ○ https://www.facebook.com/groups/kafkakorea/

Slide 14

THANK YOU