Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Query Your Streaming Data on Kafka using SQL

Gang Tao
February 22, 2024
17

Query Your Streaming Data on Kafka using SQL

Streaming data is rapidly becoming a key component in modern applications, and Apache Kafka has emerged as a popular and powerful platform for managing and processing these data streams. However, as the volume and complexity of streaming data continue to grow, it becomes increasingly critical to have efficient and effective ways of querying and analyzing this data.

This is where query engines like Apache Flink, Trino, Timeplus, Materialize, and ksqlDB come in. These powerful tools offer flexible and scalable ways of processing and analyzing streaming data in real-time, enabling users to extract valuable insights from their data streams.

In this talk, we will introduce the audience to the world of querying streaming data on Apache Kafka with SQL, compare and contrast the features and capabilities of each of these tools, and provide an in-depth analysis of their respective Pros and Cons. We will also discuss the best practices and scenarios where each tool is most effective.

Gang Tao

February 22, 2024
Tweet

Transcript

  1. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Query Your Streaming Data on Kafka using SQL Why, How and What Gang Tao Co-Founder and CTO, Timeplus Data Driven Community | Cloud Data Driven
  2. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL About Me 󰗞 • Co-Founder and CTO of Timeplus • Based in Vancouver • Previously worked for Splunk, SAP, EMC • Full-stack developer for 25+ years • Data Science and Machine Learning Architect • Data visualization expert
  3. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL 46 ZB of data created by billions of IoT by 2025 30% of data generated will be real-time by 2025 Only 1% of data is analyzed and streaming data is primarily untapped Real-time data is everywhere, at the edge and cloud
  4. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Why Apache Kafka? WarpStream Redpanda Apache Kafka Apache Pulsar Apache Kafka is an open-source distributed event streaming platform designed to handle real-time data feeds. Originally developed by LinkedIn. • High performance ◦ High throughput ◦ Low Latency • Scalability • Fault tolerant • Durability
  5. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Kafka Fundamentals Append-only log
  6. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Reliable Fast Easy Powerful Descriptive Why SQL? 󰗔 SQL is the most popular programming language used for data processing, data analytics, and data science.
  7. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Is Kafka a Database?
  8. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Database Solutions Druid Pinot Trino ClickHouse StarRocks Databend SQL
  9. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL 1. Load the druid-kafka-indexing-service extension on both the Overlord and the MiddleManagers 2. Create a supervisor-spec.json containing the Kafka supervisor spec file 3. curl -X POST -H 'Content-Type: application/json' -d @supervisor-spec.json http://localhost:8090/druid/indexer/v1/supervisor Apache Druid is an open-source distributed data store designed to handle large-scale, real-time analytics on streaming and batch data.
  10. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL ClickHouse ClickHouse features highlights: • Table engine and table function • Rich functions 1500+ • Rich data types - Array, Map, etc. ClickHouse is an open-source columnar database management system specifically designed for OLAP workloads.
  11. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Add a catalog properties file etc/catalog/kafka.properties for the Kafka connector. $ ./trino --catalog kafka --schema aSchema trino:aSchema> SELECT count(*) FROM customer; Trino, formerly known as Presto SQL, is an open-source distributed SQL query engine designed for high-performance, interactive analytics on large-scale datasets.
  12. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Database Solutions Druid Pinot Trino ClickHouse StarRocks Databend Persist Data ✔ ✔ ✔ ✔ query on the fly ✗* Streaming ✗ ✗ ✗ ✗ * ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗*
  13. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Streaming Processor Solution Flink SQL
  14. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Apache Flink Apache Flink is an open-source stream processing framework for distributed, high-performing, and fault-tolerant data streaming and batch processing.
  15. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Spark Apache Spark is an open-source distributed computing system designed for big data processing and analytics.
  16. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Arroyo is a distributed stream processing engine written in Rust, designed to efficiently perform stateful computations on streams of data.
  17. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Streaming Processor Solution Flink Persist Data Query on the fly Streaming ✗ ✗ ✔ ✔ ✔ ✔ ✔ ✔ ✗
  18. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL SQL Streaming Database Solutions ksqlDB RisingWave
  19. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL ksqlDB KSQLDB is an open-source streaming SQL engine built on top of Apache Kafka, designed for real-time stream processing and analytics.
  20. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL RisingWave is a Postgres-compatible streaming database engineered to provide the simplest and most cost-efficient approach for processing, analyzing, and managing real-time event streaming data.
  21. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Proton is a streaming SQL engine, a fast and lightweight alternative to Apache Flink, powered by ClickHouse. SQL with streaming extension Data Ingestion Unified Query Processing Pipeline ingest append stream read historical read streaming storage historical storage query Kafka External Stream CREATE EXTERNAL STREAM stream_name (<col_name1> <col_type>) SETTINGS type='kafka', brokers='ip:9092', topic='..' … …
  22. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL SELECT * FROM car_live_data Stream tail SELECT count(*) FROM car_live_data Global aggregation SELECT window_start, count(*) FROM tumble(car_live_data, 1m) GROUP BY window_start Window aggregation SELECT cid, speed_kmh, lag(speed_kmh) OVER (PARTITION BY cid) AS last_spd FROM car_live_data Sub streams SELECT window_start, count(*) FROM tumble(car_live_data, 5s) GROUP BY window_start EMIT AFTER WATERMARK AND DELAY 2s Late event SELECT * FROM car_live_data WHERE _tp_time > now() - 1d Time travel SELECT device, cpu_usage, timestamp FROM device_utils INNER JOIN table(device_products_info) AS dim ON device_utils.product_id = dim.id Stream join SELECT * FROM table(car_live_data) Historical query Proton
  23. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Streaming Database Solutions ksqlDB RisingWave Persist Data Query on the fly Streaming ✔ ✔ ✔ ✔ ✔ ✔ ✔* ✔* ✔ Proton
  24. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL source Streaming Processor • SQL as data pipeline • No data storage • Unbounded real-time query ETL / Data Pipeline ingest external Real-Time Database • Mostly leveraging Kafka to ingest data • Federation search/query ◦ ClickHouse Kafka Engine ◦ Trino • Bounded batch query, no streaming query Historical Report / Ad hoc Analysis source Streaming Database • Supports Kafka data storage • Unbounded real-time query • Combination of real-time data and historical data Hybrid
  25. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL human machine 1GL - machine language 2GL - assembly language 3GL - imperative language 4GL - descriptive language 5GL - intelligent language data insight Programing: Turn data into insights
  26. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Flink ksqlDB Hazelcast Druid Pinot Trino ClickHouse StarRocks RisingWave Databend Streaming Processor Streaming Database Realtime Database Query Kafka with SQL: More Options
  27. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Want to learn more? ⭐ https://github.com/timeplus-io/proton
  28. Data Driven Community | Cloud Data Driven Gang Tao |

    Query Your Streaming Data on Kafka using SQL Real-time streaming analytics made powerful and accessible! Thank you. Gang Tao [email protected] Data Driven Community | Cloud Data Driven