Slide 1

Slide 1 text

Data Driven Community | Cloud Data Driven
Gang Tao | Query Your Streaming Data on Kafka using SQL
Query Your Streaming Data on Kafka using SQL
Why, How and What
Gang Tao
Co-Founder and CTO, Timeplus

Slide 2

Slide 2 text

About Me
● Co-Founder and CTO of Timeplus
● Based in Vancouver
● Previously worked for Splunk, SAP, EMC
● Full-stack developer for 25+ years
● Data Science and Machine Learning Architect
● Data visualization expert

Slide 3

Slide 3 text

● 46 ZB of data created by billions of IoT devices by 2025
● 30% of data generated will be real-time by 2025
● Only 1% of data is analyzed, and streaming data is primarily untapped
Real-time data is everywhere, at the edge and in the cloud

Slide 4

Slide 4 text

Why Apache Kafka?
WarpStream, Redpanda, Apache Kafka, Apache Pulsar
Apache Kafka is an open-source distributed event streaming platform designed to handle real-time data feeds. It was originally developed at LinkedIn.
● High performance
  ○ High throughput
  ○ Low latency
● Scalability
● Fault tolerance
● Durability

Slide 5

Slide 5 text

Kafka Fundamentals
Append-only log
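To make the append-only log idea concrete, here is a minimal sketch of a single-partition log in Python. It is not Kafka's implementation; the class name `AppendOnlyLog` and its methods are hypothetical and only illustrate that records get a monotonically increasing offset and that readers track their own position.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AppendOnlyLog:
    """Toy single-partition log: records are only appended, never modified."""

    _records: List[bytes] = field(default_factory=list)

    def append(self, record: bytes) -> int:
        """Add a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[Tuple[int, bytes]]:
        """Return up to max_records (offset, record) pairs starting at offset."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        end = min(offset + max_records, len(self._records))
        return [(i, self._records[i]) for i in range(offset, end)]


log = AppendOnlyLog()
first = log.append(b"car-1 speed=50")
second = log.append(b"car-2 speed=62")

# Two independent readers start from their own offsets.
# Reading does not remove anything from the log.
reader_a = log.read(0)
reader_b = log.read(1)
```

Because reads never consume records, any number of consumers can replay the same data from any offset, which is the property the SQL engines later in this deck build on.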

Slide 6

Slide 6 text

Why SQL?
Reliable, Fast, Easy, Powerful, Descriptive
SQL is the most popular programming language used for data processing, data analytics, and data science.

Slide 7

Slide 7 text

Is Kafka a Database?

Slide 8

Slide 8 text

Database Solutions
Druid, Pinot, Trino, ClickHouse, StarRocks, Databend
SQL

Slide 9

Slide 9 text

Apache Druid is an open-source distributed data store designed to handle large-scale, real-time analytics on streaming and batch data.
1. Load the druid-kafka-indexing-service extension on both the Overlord and the MiddleManagers.
2. Create a supervisor-spec.json file containing the Kafka supervisor spec.
3. Submit the spec to the Overlord:
   curl -X POST -H 'Content-Type: application/json' -d @supervisor-spec.json http://localhost:8090/druid/indexer/v1/supervisor
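The slide does not show what goes into supervisor-spec.json. The sketch below builds one plausible spec in Python and serializes it to JSON. The field layout follows the general shape of a Druid Kafka supervisor spec as I understand it and should be checked against the Druid version in use. The data source name, topic, column names, and broker address are hypothetical placeholders, not values from the talk.

```python
import json


def build_supervisor_spec(datasource: str, topic: str, brokers: str) -> dict:
    """Assemble a minimal Kafka supervisor spec (assumed layout, see lead-in)."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Hypothetical column names for illustration only.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["cid", "speed_kmh"]},
                "granularitySpec": {
                    "segmentGranularity": "hour",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": brokers},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


spec = build_supervisor_spec("car_live_data", "car_live_data", "localhost:9092")
spec_json = json.dumps(spec, indent=2)
```

Saving `spec_json` as supervisor-spec.json would give the file that the curl command in step 3 posts.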

Slide 10

Slide 10 text

ClickHouse
ClickHouse is an open-source columnar database management system specifically designed for OLAP workloads.
ClickHouse feature highlights:
● Table engines and table functions
● Rich function library (1500+)
● Rich data types: Array, Map, etc.

Slide 11

Slide 11 text

Trino, formerly known as PrestoSQL, is an open-source distributed SQL query engine designed for high-performance, interactive analytics on large-scale datasets.
Add a catalog properties file etc/catalog/kafka.properties for the Kafka connector.
   $ ./trino --catalog kafka --schema aSchema
   trino:aSchema> SELECT count(*) FROM customer;

Slide 12

Slide 12 text

Database Solutions
Druid Pinot Trino ClickHouse StarRocks Databend Persist Data ✔ ✔ ✔ ✔ query on the fly ✗* Streaming ✗ ✗ ✗ ✗ * ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗*

Slide 13

Slide 13 text

Streaming Processor Solution
Flink
SQL

Slide 14

Slide 14 text

Apache Flink
Apache Flink is an open-source stream processing framework for distributed, high-performing, and fault-tolerant data streaming and batch processing.

Slide 15

Slide 15 text

Spark
Apache Spark is an open-source distributed computing system designed for big data processing and analytics.

Slide 16

Slide 16 text

Arroyo is a distributed stream processing engine written in Rust, designed to efficiently perform stateful computations on streams of data.

Slide 17

Slide 17 text

Streaming Processor Solution
Flink Persist Data Query on the fly Streaming ✗ ✗ ✔ ✔ ✔ ✔ ✔ ✔ ✗

Slide 18

Slide 18 text

Streaming Database Solutions
ksqlDB, RisingWave
SQL

Slide 19

Slide 19 text

ksqlDB
ksqlDB is an open-source streaming SQL engine built on top of Apache Kafka, designed for real-time stream processing and analytics.

Slide 20

Slide 20 text

RisingWave is a Postgres-compatible streaming database engineered to provide the simplest and most cost-efficient approach for processing, analyzing, and managing real-time event streaming data.

Slide 21

Slide 21 text

Proton is a streaming SQL engine, a fast and lightweight alternative to Apache Flink, powered by ClickHouse.
[Figure: Proton architecture. Labels: SQL with streaming extension; Data Ingestion; Unified Query Processing Pipeline; ingest, append, stream read, historical read, query; streaming storage, historical storage.]
Kafka External Stream
   CREATE EXTERNAL STREAM stream_name ( … )
   SETTINGS type='kafka', brokers='ip:9092', topic='..' …

Slide 22

Slide 22 text

Proton
Stream tail
   SELECT * FROM car_live_data
Global aggregation
   SELECT count(*) FROM car_live_data
Window aggregation
   SELECT window_start, count(*) FROM tumble(car_live_data, 1m) GROUP BY window_start
Sub streams
   SELECT cid, speed_kmh, lag(speed_kmh) OVER (PARTITION BY cid) AS last_spd FROM car_live_data
Late event
   SELECT window_start, count(*) FROM tumble(car_live_data, 5s) GROUP BY window_start EMIT AFTER WATERMARK AND DELAY 2s
Time travel
   SELECT * FROM car_live_data WHERE _tp_time > now() - 1d
Stream join
   SELECT device, cpu_usage, timestamp FROM device_utils INNER JOIN table(device_products_info) AS dim ON device_utils.product_id = dim.id
Stream historical query
   SELECT * FROM table(car_live_data)
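The window aggregation query groups events into fixed, non-overlapping time buckets. The Python sketch below shows that bucketing logic only, assuming events are (event_time_seconds, payload) tuples. The function name `tumble_counts` and the sample events are hypothetical. It runs over a finite list, whereas a streaming engine such as Proton would emit each window's result incrementally as the window closes.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tumble_counts(
    events: Iterable[Tuple[float, str]], window_seconds: int
) -> Dict[int, int]:
    """Count events per tumbling window, keyed by window start time."""
    counts: Counter = Counter()
    for event_time, _payload in events:
        # Each event belongs to exactly one window: floor to the window size.
        window_start = int(event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical sample events: (event time in seconds, car id).
events = [
    (0.5, "car-1"),
    (3.0, "car-2"),
    (5.0, "car-1"),
    (9.9, "car-3"),
    (12.1, "car-2"),
]
result = tumble_counts(events, 5)
```

With a 5-second window, the events at 0.5 and 3.0 fall in the window starting at 0, those at 5.0 and 9.9 in the window starting at 5, and the event at 12.1 in the window starting at 10.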

Slide 23

Slide 23 text

Streaming Database Solutions
ksqlDB RisingWave Persist Data Query on the fly Streaming ✔ ✔ ✔ ✔ ✔ ✔ ✔* ✔* ✔ Proton

Slide 24

Slide 24 text

Streaming Processor (diagram label: source)
● SQL as data pipeline
● No data storage
● Unbounded real-time query
ETL / Data Pipeline

Real-Time Database (diagram labels: ingest, external)
● Mostly leveraging Kafka to ingest data
● Federated search/query
  ○ ClickHouse Kafka Engine
  ○ Trino
● Bounded batch query, no streaming query
Historical Report / Ad hoc Analysis

Streaming Database (diagram label: source)
● Supports Kafka data storage
● Unbounded real-time query
● Combination of real-time data and historical data
Hybrid
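The difference between a bounded batch query and an unbounded real-time query can be sketched in a few lines of Python. This is an illustration of the concept only, not how any of the engines above are implemented. The function names and the `endless_events` generator are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def bounded_count(snapshot: List[str]) -> int:
    """Batch query: scan the data stored so far and return one final answer."""
    return len(snapshot)


def unbounded_count(stream: Iterable[str]) -> Iterator[int]:
    """Streaming query: never finishes, emits an updated answer per event."""
    total = 0
    for _event in stream:
        total += 1
        yield total


def endless_events() -> Iterator[str]:
    """Stand-in for a Kafka topic that keeps producing events."""
    n = 0
    while True:
        yield f"event-{n}"
        n += 1


stored = ["event-0", "event-1", "event-2"]
batch_result = bounded_count(stored)

# The streaming query has no end, so only the first three updates are taken.
first_three_updates = list(islice(unbounded_count(endless_events()), 3))
```

A streaming database in the sense of this slide supports both styles against the same data: the batch form over what is already stored, and the continuous form over what is still arriving.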

Slide 25

Slide 25 text

Programming: Turn data into insights
1GL - machine language
2GL - assembly language
3GL - imperative language
4GL - descriptive language
5GL - intelligent language
[Figure labels: human, machine; data, insight]

Slide 26

Slide 26 text

Query Kafka with SQL: More Options
Streaming Processor, Streaming Database, Realtime Database
Flink, ksqlDB, Hazelcast, Druid, Pinot, Trino, ClickHouse, StarRocks, RisingWave, Databend

Slide 27

Slide 27 text

Want to learn more?
⭐ https://github.com/timeplus-io/proton

Slide 28

Slide 28 text

Real-time streaming analytics made powerful and accessible!
Thank you.
Gang Tao