Quickly Build Kafka Stream ETL System with KSETL

by LINE DEVDAY 2021

Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

KSETL
• KSETL = Kafka Stream ETL
  – Extract, Transform, Load stream on Kafka
• Stream
  – Unbounded continuously generated data
  – User log
  – Sensor data

Slide 3

Slide 3 text

Why Stream Processing?
• To process data with low latency
• For business requirements
  – Delivery status, order status for food delivery
  – Fraud detection in financial transactions
• For better performance
  – Content recommendation

Slide 4

Slide 4 text

Stream processing system
• Batch processing system is not enough
  – Daily → Hourly → Minutely → Secondly?
• Stream processing system
  – Reflect the latest data quickly

Slide 5

Slide 5 text

Building systems
• Many stream processing systems are needed
• Common work
  – Write and debug programs
  – Build programs
  – Deploy programs
  – Monitor programs
• Many data engineers do similar work

Slide 6

Slide 6 text

KSETL
• KSETL
  – Kafka Stream ETL for LINE
  – Input and output are Kafka topics
  – (Kafka is widely used for streams in LINE)
• Build stream processing systems easily
  – Let data engineers build their systems by themselves

Slide 7

Slide 7 text

Goal – Ease of use
• Express ETL logic easily
  – Introduce SQL-like syntax
• Build systems easily
  – Create ksqlDB clusters dynamically on k8s

Slide 8

Slide 8 text

Express ETL logic easily
• For data engineers without programming expertise
  – Every data engineer knows SQL
• Prototyped using various SQL engines
  – ksqlDB, FlinkSQL, Spark Structured Streaming
  – Left join stream-stream
  – Query a table and write to Kafka topic

Slide 9

Slide 9 text

ksqlDB
• Full features
  – Join
  – Window aggregation
  – User Defined Function (UDF)
• Based on Kafka Streams API
  – Easy to understand
• Only for stream processing
  – No extra parts for batch processing
• Everything on Kafka
  – Good Kafka team in LINE

Slide 10

Slide 10 text

Internals of ksqlDB Join
[Diagram]
• Left topic (p1, p2) and right topic (p1, p2) → joined topic (p1, p2)
• Join partition 1 reads p1 of both topics; join partition 2 reads p2 of both topics
• Local state stores, with a left changelog topic (p1, p2) and a right changelog topic (p1, p2)
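
The partition layout named on this slide can be illustrated with a small Python simulation. This is only a sketch of the idea, not ksqlDB or Kafka Streams code: the `partition_for` hash, the `JoinTask` class, and the record values are all hypothetical, and plain in-memory dicts stand in for the local state stores (which, per the slide, have changelog topics behind them).

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 2  # the slide shows two partitions, p1 and p2


def partition_for(key: str) -> int:
    """Hypothetical partitioner: stable hash of the key modulo partition count."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


class JoinTask:
    """One join task per partition, holding a local store for each side."""

    def __init__(self):
        self.left_store = defaultdict(list)   # stand-in for a local state store
        self.right_store = defaultdict(list)  # stand-in for a local state store

    def on_left(self, key, value):
        # Remember the left record, then join it with right records seen so far.
        self.left_store[key].append(value)
        return [(key, value, r) for r in self.right_store[key]]

    def on_right(self, key, value):
        # Remember the right record, then join it with left records seen so far.
        self.right_store[key].append(value)
        return [(key, l, value) for l in self.left_store[key]]


tasks = [JoinTask() for _ in range(NUM_PARTITIONS)]


def send(side, key, value):
    """Route a record to the task that owns the key's partition."""
    task = tasks[partition_for(key)]
    if side == "left":
        return task.on_left(key, value)
    return task.on_right(key, value)


joined = []
joined += send("left", "user-1", "request-A")
joined += send("right", "user-2", "click-X")
joined += send("right", "user-1", "click-Y")
print(joined)
```

Because both topics use the same partitioner, records with the same key always reach the same task, so each task can join using only its own local stores.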

Slide 11

Slide 11 text

Build systems easily
• Create ksqlDB clusters dynamically
  – ODA (On-Demand Applications)
• Provide logging/monitoring facilities
• Run queries against a ksqlDB cluster

Slide 12

Slide 12 text

KSETL ODA Architecture
• Many ksqlDB clusters in a k8s cluster

Slide 13

Slide 13 text

KSETL Logging/Monitoring

Slide 14

Slide 14 text

Summary

          Traditional     KSETL
Language  Java, Scala     SQL
Build     Compile         Interactive shell
Deploy    CI/CD tools     On-demand cluster
Monitor   Custom tools    Prebuilt dashboards

Slide 15

Slide 15 text

Example system
• AB test report
  – LINE runs AB tests before releasing new features
  – Request logs from LINE server
    • 50k logs/sec at peak time
  – Event (impression, click) logs from LINE client
  – Find the client reaction for each request and aggregate
  – Stream join and windowed aggregation required

Slide 16

Slide 16 text

Prev. AB test report
• Prev. system to join streams
  – Event log and request log with the same key
  – Store event (impression, click) logs to Redis
  – Delay request logs and look up Redis to implement join window
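
The delay-and-lookup approach described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the previous system's code: a plain dict stands in for Redis, and the `DelayedLookupJoin` class, its method names, and the sample records are hypothetical.

```python
import heapq


class DelayedLookupJoin:
    """Hold each request for `delay_sec`, then look up events stored under its key."""

    def __init__(self, delay_sec):
        self.delay_sec = delay_sec
        self.event_store = {}  # stand-in for Redis: key -> list of events
        self.pending = []      # min-heap of (release_time, seq, key, request)
        self._seq = 0          # tie-breaker so heap entries always compare

    def on_event(self, key, event):
        # Event logs (impression, click) are written to the store as they arrive.
        self.event_store.setdefault(key, []).append(event)

    def on_request(self, now, key, request):
        # Request logs are not joined immediately; they wait for the delay.
        heapq.heappush(self.pending, (now + self.delay_sec, self._seq, key, request))
        self._seq += 1

    def release(self, now):
        # Join every request whose delay has elapsed with the events seen so far.
        out = []
        while self.pending and self.pending[0][0] <= now:
            _, _, key, request = heapq.heappop(self.pending)
            out.append((key, request, list(self.event_store.get(key, []))))
        return out


join = DelayedLookupJoin(delay_sec=10)
join.on_request(now=0, key="req-1", request="served banner")
join.on_event("req-1", "impression")
early = join.release(now=5)    # delay not elapsed yet, nothing is emitted
join.on_event("req-1", "click")
late = join.release(now=10)    # request is released and joined with both events
print(early, late)
```

The delay acts as the join window: any event that arrives before the request is released is included in the join.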

Slide 17

Slide 17 text

Join
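
The query shown on this slide is not part of the transcript. As a rough illustration of what a windowed left join between request logs and event logs computes, here is a small Python sketch. The function name, record layout, and sample data are assumptions made for illustration, and the function works on finite lists rather than on live streams, so it does not reproduce when a streaming engine would emit each result.

```python
def windowed_left_join(requests, events, within_sec):
    """Left-join requests to events that share a key and lie within the time window.

    `requests` and `events` are lists of (timestamp_sec, key, value) tuples.
    A request with no matching event is still emitted, with None on the right.
    """
    events_by_key = {}
    for ts, key, value in events:
        events_by_key.setdefault(key, []).append((ts, value))

    out = []
    for ts, key, value in requests:
        matches = [
            ev_value
            for ev_ts, ev_value in events_by_key.get(key, [])
            if abs(ev_ts - ts) <= within_sec
        ]
        if matches:
            out.extend((key, value, ev_value) for ev_value in matches)
        else:
            out.append((key, value, None))
    return out


requests = [(0, "r1", "banner"), (5, "r2", "popup")]
events = [(3, "r1", "impression"), (100, "r1", "click"), (50, "r2", "click")]
result = windowed_left_join(requests, events, within_sec=10)
print(result)
```

Only the impression for `r1` falls inside the 10-second window, so `r2` is emitted with an empty right side.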

Slide 18

Slide 18 text

Window aggregation
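
The query shown on this slide is not part of the transcript. As a sketch of the general idea, the Python below counts records per key in fixed-size, non-overlapping (tumbling) windows. The function name, the window size, and the sample records are assumptions for illustration only; the window type used in the talk's query is not stated in the transcript.

```python
from collections import Counter


def tumbling_window_counts(records, window_sec):
    """Count records per (window_start, key) using fixed, non-overlapping windows.

    `records` is an iterable of (timestamp_sec, key) tuples.
    """
    counts = Counter()
    for ts, key in records:
        # Align the timestamp down to the start of its window.
        window_start = (ts // window_sec) * window_sec
        counts[(window_start, key)] += 1
    return dict(counts)


records = [(1, "click"), (30, "impression"), (59, "click"), (61, "click")]
counts = tumbling_window_counts(records, window_sec=60)
print(counts)
```

Each record belongs to exactly one window, so the click at second 61 is counted in the window starting at 60 rather than in the first one.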

Slide 19

Slide 19 text

Results
• Simple architecture
  – No Redis to join two streams
• Fast release
  – Interactive development
• Fast monitoring and update
  – Get a performance dashboard
  – Tune fast

Slide 20

Slide 20 text

Limits
• KSETL depends on
  – ksqlDB
  – Company-wide Kafka
• ksqlDB
  – Some features are missing (still in active development)
  – FlinkSQL may be an alternative
• Company-wide Kafka
  – Good support for all Kafka in LINE
  – But dynamic topic creation is prohibited

Slide 21

Slide 21 text

Future work
• Data import from Hive
  – Hive tables to enrich Kafka topics
• Enhancing query deployment
  – Better way to execute query scripts

Slide 22

Slide 22 text

Thank you