Quickly Build Kafka Stream ETL System with KSETL

LINE DEVDAY 2021

November 10, 2021
Transcript

  1. KSETL
    • KSETL = Kafka Stream ETL
    – Extract, Transform, Load for streams on Kafka
    • Stream
    – Unbounded, continuously generated data
    – User logs
    – Sensor data

  2. Why Stream Processing?
    • To process data with low latency
    • For business requirements
    – Delivery status, order status for food delivery
    – Fraud detection in financial transactions
    • For better performance
    – Content recommendation

  3. Stream processing system
    • Batch processing systems are not enough
    – Daily → Hourly → Minutely → Secondly?
    • Stream processing system
    – Reflect the latest data quickly

  4. Building systems
    • Many stream processing systems are needed
    • Common tasks
    – Write and debug programs
    – Build programs
    – Deploy programs
    – Monitor programs
    • Many data engineers repeat the same kind of work

  5. KSETL
    • KSETL
    – Kafka Stream ETL for LINE
    – Input and output are Kafka topics
    – (Kafka is widely used for streams in LINE)
    • Build stream processing systems easily
    – Let data engineers build their systems by themselves

  6. Goal – Ease of use
    • Express ETL logic easily
    – Introduce SQL-like syntax
    • Build systems easily
    – Create ksqlDB clusters dynamically on k8s

  7. Express ETL logic easily
    • For data engineers without programming expertise
    – Every data engineer knows SQL
    • Prototyped using various SQL engines
    – ksqlDB, Flink SQL, Spark Structured Streaming
    – Stream-stream left join
    – Query a table and write to a Kafka topic

  8. ksqlDB
    • Full-featured
    – Join
    – Window aggregation
    – User Defined Function (UDF)
    • Based on Kafka Streams API
    – Easy to understand
    • Only for stream processing
    – No extra parts for batch processing
    • Everything on Kafka
    – LINE has a strong Kafka team

  9. Internals of ksqlDB Join
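    [Diagram]
    • Left topic (p1, p2) and Right topic (p1, p2) are the inputs
    • Join partition1 takes p1 of both topics; Join partition2 takes p2 of both topics
    • Each join partition has a local state store
    • Left changelog topic (p1, p2) and Right changelog topic (p1, p2)
    • Joined topic (p1, p2) is the output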
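
The partition layout on this slide can be illustrated with a small Python sketch. It is not ksqlDB or Kafka Streams code: `partition_for`, `JoinTask`, and `route` are hypothetical names, `zlib.crc32` stands in for Kafka's real partitioner, and the changelog topics and the join window are left out. It only shows why records with the same key must land on the same partition number in both topics so that one join task can match them from its local state stores.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 2  # assumption: both input topics have the same partition count


def partition_for(key: str) -> int:
    """Map a key to a partition number (crc32 is a stand-in partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


class JoinTask:
    """One join task: owns the same partition number of both input topics."""

    def __init__(self) -> None:
        self.left_store = defaultdict(list)   # local state store, left records
        self.right_store = defaultdict(list)  # local state store, right records

    def on_left(self, key, value):
        # Remember the left record, then match it against stored right records.
        self.left_store[key].append(value)
        return [(key, value, right) for right in self.right_store[key]]

    def on_right(self, key, value):
        # Remember the right record, then match it against stored left records.
        self.right_store[key].append(value)
        return [(key, left, value) for left in self.left_store[key]]


tasks = [JoinTask() for _ in range(NUM_PARTITIONS)]


def route(side: str, key: str, value: str):
    """Send a record to the task that owns its partition; return joined rows."""
    task = tasks[partition_for(key)]
    if side == "left":
        return task.on_left(key, value)
    return task.on_right(key, value)


joined = []
joined += route("left", "user-1", "request-A")
joined += route("right", "user-1", "click-X")   # same key, same task: match
joined += route("right", "user-2", "click-Y")   # no left record for this key
print(joined)  # [('user-1', 'request-A', 'click-X')]
```

Because both sides use the same partitioner and partition count, a task never needs data held by another task, which is what lets the join scale out by partition.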

  10. Build systems easily
    • Create ksqlDB clusters dynamically
    – ODA (On-Demand Applications)
    • Provide logging/monitoring facilities
    • Run queries against a ksqlDB cluster

  11. KSETL ODA Architecture
    • Many ksqlDB clusters in one k8s cluster

  12. KSETL Logging/Monitoring

  13. Summary
              Traditional     KSETL
    Language  Java, Scala     SQL
    Build     Compile         Interactive shell
    Deploy    CI/CD tools     On-demand cluster
    Monitor   Custom tools    Prebuilt dashboards

  14. Example system
    • AB test report
    – LINE runs AB tests before releasing new features
    – Request logs from the LINE server (50k logs/sec at peak time)
    – Event (impression, click) logs from the LINE client
    – Find the client reaction to each request and aggregate
    – Stream join and windowed aggregation required
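
To make "stream join and windowed aggregation" concrete, here is a minimal Python sketch of the two steps on in-memory lists. It is an illustration under assumptions, not the KSETL query and not ksqlDB's exact join semantics: the field names (`request_id`, `ab_group`), the 60-second join window, and the 5-minute tumbling window are all hypothetical, and a real stream processor would handle records incrementally rather than from complete lists.

```python
from collections import Counter, defaultdict

JOIN_WINDOW_SEC = 60   # assumption: an event matches a request within 60 s
AGG_WINDOW_SEC = 300   # assumption: 5-minute tumbling report windows


def windowed_left_join(requests, events, window_sec=JOIN_WINDOW_SEC):
    """Left-join requests to events on request_id within +/- window_sec.

    requests: list of (ts, request_id, ab_group)
    events:   list of (ts, request_id, event_type)
    Yields (request_ts, ab_group, event_type or None).
    """
    events_by_id = defaultdict(list)
    for ts, request_id, event_type in events:
        events_by_id[request_id].append((ts, event_type))

    for req_ts, request_id, ab_group in requests:
        matches = [etype for ets, etype in events_by_id.get(request_id, [])
                   if abs(ets - req_ts) <= window_sec]
        if not matches:
            # Left join: keep the request even when no event arrived in time.
            yield (req_ts, ab_group, None)
        for etype in matches:
            yield (req_ts, ab_group, etype)


def tumbling_counts(joined, window_sec=AGG_WINDOW_SEC):
    """Count joined rows per (window start, AB group, event type)."""
    counts = Counter()
    for req_ts, ab_group, event_type in joined:
        window_start = req_ts - req_ts % window_sec  # start of tumbling window
        counts[(window_start, ab_group, event_type)] += 1
    return counts


requests = [(10, "r1", "A"), (20, "r2", "B"), (310, "r3", "A")]
events = [(12, "r1", "impression"), (15, "r1", "click"),
          (200, "r2", "click"),       # too late: outside the join window
          (320, "r3", "impression")]

report = tumbling_counts(windowed_left_join(requests, events))
for key in sorted(report, key=str):
    print(key, report[key])
```

In the sample data, request `r2` has a click 180 seconds later, so it falls outside the join window and is reported with no event.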

  15. Previous AB test report
    • Previous system for joining streams
    – Event log and request log with the same key
    – Store event (impression, click) logs in Redis
    – Delay request logs, then look them up in Redis, to implement the join window
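
The delay-and-lookup idea can be sketched as follows. The slides do not show the previous implementation, so everything here is an assumption for illustration: `DelayedLookupJoin` is a hypothetical class, a plain dict stands in for Redis (no expiry is modeled), and the 60-second delay is an arbitrary value.

```python
import heapq

DELAY_SEC = 60  # assumption: how long a request waits for its events


class DelayedLookupJoin:
    """Hold each request for a fixed delay, then look up events by key."""

    def __init__(self, delay_sec=DELAY_SEC):
        self.delay_sec = delay_sec
        self.event_store = {}  # stand-in for Redis: key -> list of event types
        self.pending = []      # min-heap of (release_ts, seq, key, request)
        self._seq = 0          # tie-breaker so the heap never compares payloads

    def on_event(self, key, event_type):
        # Event logs are written to the store as soon as they arrive.
        self.event_store.setdefault(key, []).append(event_type)

    def on_request(self, ts, key, request):
        # Request logs are held back until ts + delay_sec.
        heapq.heappush(self.pending,
                       (ts + self.delay_sec, self._seq, key, request))
        self._seq += 1

    def advance_to(self, now):
        """Release every request whose delay has elapsed and join it."""
        out = []
        while self.pending and self.pending[0][0] <= now:
            _, _, key, request = heapq.heappop(self.pending)
            out.append((key, request, list(self.event_store.get(key, []))))
        return out


join = DelayedLookupJoin(delay_sec=60)
join.on_request(0, "k1", "req-1")
join.on_event("k1", "impression")
early = join.advance_to(30)   # delay not elapsed yet: nothing released
join.on_event("k1", "click")
late = join.advance_to(60)    # released with both events attached
print(early, late)
```

The delay plays the role of the join window: any event stored before the request is released is joined, which is why this design needs an external store and a delay queue in addition to the stream consumers.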

  16. Window aggregation

  17. Results
    • Simple architecture
    – No Redis needed to join two streams
    • Fast release
    – Interactive development
    • Fast monitoring and update
    – Get a performance dashboard
    – Tune fast

  18. Limits
    • KSETL depends on
    – ksqlDB
    – Company-wide Kafka
    • ksqlDB
    – Some features are missing (still in active development)
    – Flink SQL may be an alternative
    • Company-wide Kafka
    – Good support for all Kafka in LINE
    – But dynamic topic creation is prohibited

  19. Future work
    • Data import from Hive
    – Hive tables to enrich Kafka topics
    • Enhancing query deployment
    – A better way to execute query scripts
