
Quickly Build Kafka Stream ETL System with KSETL

LINE DEVDAY 2021

November 10, 2021

Transcript

  1. KSETL
     • KSETL = Kafka Stream ETL – Extract, Transform, Load streams on Kafka
     • Stream – unbounded, continuously generated data
       – User logs
       – Sensor data

  2. Why Stream Processing?
     • To process data with low latency
     • For business requirements
       – Delivery status and order status for food delivery
       – Fraud detection in financial transactions
     • For better performance
       – Content recommendation

  3. Stream processing system
     • Batch processing systems are not enough
       – Daily → Hourly → Every minute → Every second?
     • Stream processing system
       – Reflects the latest data quickly

  4. Building systems
     • Many stream processing systems are needed
     • Common work
       – Write and debug programs
       – Build programs
       – Deploy programs
       – Monitor programs
     • Many data engineers do similar work

  5. KSETL
     • KSETL – Kafka Stream ETL for LINE
       – Input and output are Kafka topics
       – (Kafka is widely used for streams in LINE)
     • Build stream processing systems easily
       – Let data engineers build their systems by themselves

  6. Goal – Easiness
     • Express ETL logic easily
       – Introduce SQL-like syntax
     • Build systems easily
       – Create ksqlDB clusters dynamically on k8s

  7. Express ETL logic easily
     • For data engineers without programming expertise
       – Every data engineer knows SQL
     • Prototyped using various SQL engines
       – ksqlDB, FlinkSQL, Spark Structured Streaming
       – Stream-stream left join
       – Query a table and write to a Kafka topic (both sketched below)

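     A minimal ksqlDB-style sketch of the two prototyped workloads. The stream,
     table, topic, and column names are illustrative assumptions, not taken from
     the talk:

       -- Stream-stream left join within a time window
       CREATE STREAM orders (order_id VARCHAR, amount DOUBLE)
         WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');
       CREATE STREAM payments (order_id VARCHAR, status VARCHAR)
         WITH (KAFKA_TOPIC='payments', VALUE_FORMAT='JSON');

       CREATE STREAM orders_with_payment AS
         SELECT o.order_id AS order_id, o.amount AS amount, p.status AS status
         FROM orders o
         LEFT JOIN payments p WITHIN 1 HOUR ON o.order_id = p.order_id
         EMIT CHANGES;

       -- Aggregate into a table, then query that table and write the result
       -- to a named Kafka topic
       CREATE TABLE order_totals AS
         SELECT order_id, SUM(amount) AS total
         FROM orders
         GROUP BY order_id;

       CREATE TABLE large_order_totals WITH (KAFKA_TOPIC='large_order_totals') AS
         SELECT order_id, total
         FROM order_totals
         WHERE total > 100;
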
  8. ksqlDB
     • Full feature set
       – Join
       – Window aggregation (see the sketch below)
       – User Defined Functions (UDF)
     • Based on the Kafka Streams API
       – Easy to understand
     • Only for stream processing
       – No extra parts for batch processing
     • Everything on Kafka
       – Good Kafka team in LINE

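     For instance, a windowed aggregation over the hypothetical payments stream
     declared above could look like this (the window size is an assumption);
     custom UDFs, written in Java and deployed to the ksqlDB servers, can be
     called in such queries like any built-in function:

       -- Count payments per status in one-minute tumbling windows
       CREATE TABLE payments_per_minute AS
         SELECT status,
                COUNT(*) AS payment_count
         FROM payments
         WINDOW TUMBLING (SIZE 1 MINUTE)
         GROUP BY status
         EMIT CHANGES;
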
  9. Internals of ksqlDB Join
     [Diagram: partitions p1/p2 of the left and right topics are routed to a
     join task per partition; each task keeps local state stores backed by left
     and right changelog topics and writes its output to the joined topic]

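     Because the join runs per partition, both inputs must be co-partitioned on
     the join key; ksqlDB can repartition internally, or a stream can be re-keyed
     explicitly before joining. A sketch with hypothetical names:

       -- Re-key the payments stream by the join column so both sides of the
       -- join are partitioned the same way
       CREATE STREAM payments_by_order WITH (PARTITIONS=2) AS
         SELECT *
         FROM payments
         PARTITION BY order_id;
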
  10. Build systems easily
      • Create ksqlDB clusters dynamically
        – ODA (On-Demand Applications)
      • Provide logging/monitoring facilities
      • Run queries against a ksqlDB cluster (see the example below)

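      From the interactive shell (or the ksqlDB REST API) an engineer can run
      ad-hoc push queries against the running cluster, e.g. against the
      hypothetical joined stream from the earlier sketch:

        -- Watch live results before turning the query into a persistent one
        SELECT order_id, status
        FROM orders_with_payment
        EMIT CHANGES
        LIMIT 5;
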
  11. Summary

                   Traditional     KSETL
      Language     Java, Scala     SQL
      Build        Compile         Interactive shell
      Deploy       CI/CD tools     On-demand cluster
      Monitor      Custom tools    Prebuilt dashboards

  12. Example system
      • AB test report
        – LINE runs AB tests before releasing new features
        – Request logs from the LINE server
          • 50k logs/sec at peak time
        – Event (impression, click) logs from the LINE client
        – Find the client reaction for each request and aggregate
        – Stream join and windowed aggregation required (see the sketch below)

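      A sketch of how the AB-test pipeline might be expressed in KSETL, combining
      the stream-stream join and windowed aggregation shown earlier. Topic,
      column, and window choices are illustrative assumptions, not the production
      queries:

        -- Request logs from the server and event logs from the client
        CREATE STREAM requests (request_id VARCHAR, variant VARCHAR)
          WITH (KAFKA_TOPIC='ab_request_log', VALUE_FORMAT='JSON');
        CREATE STREAM events (request_id VARCHAR, event_type VARCHAR)
          WITH (KAFKA_TOPIC='ab_event_log', VALUE_FORMAT='JSON');

        -- Attach the client reaction (impression/click) to each request
        CREATE STREAM request_reactions AS
          SELECT r.request_id AS request_id,
                 r.variant    AS variant,
                 e.event_type AS event_type
          FROM requests r
          LEFT JOIN events e WITHIN 10 MINUTES ON r.request_id = e.request_id
          EMIT CHANGES;

        -- Aggregate reactions per AB-test variant in one-minute windows
        CREATE TABLE ab_report AS
          SELECT variant,
                 SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS impressions,
                 SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) AS clicks
          FROM request_reactions
          WINDOW TUMBLING (SIZE 1 MINUTE)
          GROUP BY variant
          EMIT CHANGES;
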
  13. Previous AB test report
      • The previous system to join the streams
        – Event log and request log share the same key
        – Store event (impression, click) logs in Redis
        – Delay request logs and look up Redis to implement the join window

  14. Results
      • Simple architecture
        – No Redis needed to join the two streams
      • Fast release
        – Interactive development
      • Fast monitoring and updates
        – Get a performance dashboard
        – Tune quickly

  15. Limits
      • KSETL depends on
        – ksqlDB
        – Company-wide Kafka
      • ksqlDB
        – Some features are still missing (ksqlDB is in active development)
        – FlinkSQL may be an alternative
      • Company-wide Kafka
        – Good support for all Kafka usage in LINE
        – But dynamic topic creation is prohibited

  16. Future work
      • Data import from Hive
        – Hive tables to enrich Kafka topics
      • Enhancing query deployment
        – A better way to execute query scripts