apidays Singapore 2025 - Streaming Lakehouse with Kafka, Flink and Iceberg by Zabeer Farook (CACIB)

Slide 1

Slide 1 text

STREAMING LAKEHOUSE WITH KAFKA, FLINK & ICEBERG ZABEER FAROOK

Slide 2

Slide 2 text

HELLO!!
I’m Zabeer Farook
Technical Architect, Credit Agricole CIB
- Passionate about stream data processing, Event Driven Architecture, Cloud & DevOps
- Love travelling & exploring places
https://sg.linkedin.com/in/zabeer-farook

Slide 3

Slide 3 text

AGENDA
01 Challenges with Data Warehouse & Data Lake
02 What’s a Lakehouse & how it addresses some of these challenges
03 High level overview of Apache Iceberg
04 Quick look at Kafka & Flink
05 Building a Streaming Lakehouse
06 Take Home
Demo followed by Q&A

Slide 4

Slide 4 text

CHALLENGES WITH DATA WAREHOUSE
Data Warehouse (80’s)
Ingest: Batch, Structured Data
Store & Process: Data warehouse, Data Marts
Consume: BI Applications, Dashboards, Reports
● CLOSED ARCHITECTURE
● VENDOR LOCK-IN
● DATA LOCK-IN
● HIGHER COST

Slide 5

Slide 5 text

CHALLENGES WITH DATA LAKE
Data Lake (2010)
Ingest: Batch; Structured, Semi Structured & Unstructured Data
Store & Process: Data Lake (Raw Data, Cleansed Data)
Consume: Machine Learning, Batch Analytics, Reports & BI
● DATA GOVERNANCE ISSUES
● NO UNIFIED METADATA LAYER
● DATA SWAMP
● NO ACID GUARANTEES (Atomicity, Consistency, Isolation, Durability)

Slide 6

Slide 6 text

CHALLENGES WITH HYBRID DATA PLATFORMS
● DATA SILOS: Data lives in different platforms without interoperability
● DATA DUPLICATION: Data is copied across platforms so that different engines can process it
● DATA SYNC ISSUES: Data copies are not always in sync
● EXPENSIVE: Drives costs higher

Slide 7

Slide 7 text

LAKEHOUSE - BRIDGING THE GAP
Data Lakehouse (2020)
Combines the best of both worlds!
Ingest: Batch; Structured, Semi Structured & Unstructured Data
Store & Process: Data Lake (Raw Data, Cleansed Data); Metadata Layer with Data Governance, Indexing and Data Management; API Layer
Consume: AI/ML & Data Science, Batch Analytics, Reports & BI
● Reliability
● Performance
● ACID Guarantees
● Open Architecture
● Cost Efficiency
● No Lock-in
● Interoperability
● Data Governance

Slide 8

Slide 8 text

WHAT ABOUT REAL TIME DATA?
● Why does real time matter?
  ○ Data freshness
  ○ Real time analytics
  ○ Faster insights
  ○ Faster decision making

Slide 9

Slide 9 text

STREAMING LAKEHOUSE
Data Lakehouse
Ingest: Batch; Structured, Semi Structured & Unstructured Data
Store & Process: Data Lake (Raw Data, Cleansed Data); Metadata Layer with Data Governance, Indexing and Data Management; API Layer
Consume: AI/ML & Data Science, Batch Analytics, Reports & BI
Streaming Lakehouse
Ingest: Stream & Batch; Structured, Semi Structured & Unstructured Data
Store & Process: Data Lake (Raw Data, Cleansed Data); Metadata Layer with Data Governance, Indexing and Data Management; API Layer
Consume: AI/ML & Data Science, Batch & Real time Analytics, Reports & BI
● Real time ingestion layer to ingest data from real time sources like Kafka
● Stream processing capabilities in the lakehouse with engines like Flink
● Real time analytics through distributed query engines
● Real time machine learning use cases

Slide 10

Slide 10 text

BUILDING A STREAMING LAKEHOUSE
Stages: Data Sources → Ingestion → Storage & Processing → Serving → Consumption
Data: Stream, Batch; Structured, Semi Structured & Unstructured Data
Processing: Clean, Transform, Aggregate
Other diagram labels: Metadata Catalog, SQL, Declarative, Storage, API, REST CATALOG
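The Clean, Transform and Aggregate steps named on this slide can be illustrated with a minimal Python sketch. It is not the pipeline from the talk: the record fields (`currency`, `amount`) and the function names are hypothetical, and a real deployment would express these steps in an engine such as Flink rather than in plain Python.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator

Record = Dict[str, Any]  # hypothetical record shape: {"currency": ..., "amount": ...}


def clean(records: Iterable[Record]) -> Iterator[Record]:
    """Drop records that lack a currency or an amount."""
    for record in records:
        if record.get("currency") and record.get("amount") is not None:
            yield record


def transform(records: Iterable[Record]) -> Iterator[Record]:
    """Normalise the currency code and cast the amount to float."""
    for record in records:
        yield {
            "currency": str(record["currency"]).strip().upper(),
            "amount": float(record["amount"]),
        }


def aggregate(records: Iterable[Record]) -> Dict[str, float]:
    """Sum amounts per currency."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record["currency"]] += record["amount"]
    return dict(totals)


# Made-up input records, for illustration only.
raw = [
    {"currency": " eur", "amount": "10.5"},
    {"currency": None, "amount": 3},      # dropped by clean(): no currency
    {"currency": "USD", "amount": 2},
    {"currency": "SGD"},                  # dropped by clean(): no amount
    {"currency": "EUR", "amount": 1.5},
]
totals = aggregate(transform(clean(raw)))
```

Because each step is a generator, records flow through one at a time, which loosely mirrors how a streaming job chains operators.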

Slide 11

Slide 11 text

APACHE ICEBERG & ITS ROLE IN A LAKEHOUSE
Apache Iceberg is a high performance open table format purpose-built for large scale analytics. It plays the role of the metadata layer in a Lakehouse architecture.
● Catalog: Tracks the location of a table’s current metadata file
● Metadata file: Defines a table’s schema, partitioning, snapshot list etc.
● Snapshot: The state of the table’s data after a write
● Manifest file: Contains the location, path and metadata about a list of data files
● Manifest list: Defines a single snapshot as a list of manifest files along with stats
● Data file: File containing the data of the table (Parquet, ORC, Avro etc.)
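The way these layers point at each other can be sketched with a toy in-memory model. This is only an illustration of the hierarchy described above, not the Iceberg library or its file formats: all class names, fields and the `s3://` paths are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataFile:
    path: str            # location of a Parquet/ORC/Avro file
    record_count: int    # example of per-file metadata


@dataclass
class ManifestFile:
    data_files: List[DataFile]


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: List[ManifestFile]   # the manifests that make up this snapshot


@dataclass
class TableMetadata:
    schema: Dict[str, str]
    snapshots: List[Snapshot] = field(default_factory=list)
    current_snapshot_id: Optional[int] = None


class Catalog:
    """Toy catalog: maps a table name to its current metadata."""

    def __init__(self) -> None:
        self._current: Dict[str, TableMetadata] = {}

    def commit(self, table: str, metadata: TableMetadata) -> None:
        self._current[table] = metadata

    def load(self, table: str) -> TableMetadata:
        return self._current[table]


def plan_scan(catalog: Catalog, table: str) -> List[str]:
    """Walk catalog -> metadata -> current snapshot -> manifests -> data files."""
    metadata = catalog.load(table)
    snapshot = next(
        s for s in metadata.snapshots
        if s.snapshot_id == metadata.current_snapshot_id
    )
    return [f.path for m in snapshot.manifest_list for f in m.data_files]


# Two writes: the second snapshot reuses the first manifest and adds a new one.
m1 = ManifestFile([DataFile("s3://bucket/trades/data-001.parquet", 100)])
m2 = ManifestFile([DataFile("s3://bucket/trades/data-002.parquet", 50)])
metadata = TableMetadata(
    schema={"trade_id": "long", "amount": "double"},
    snapshots=[Snapshot(1, [m1]), Snapshot(2, [m1, m2])],
    current_snapshot_id=2,
)
catalog = Catalog()
catalog.commit("trades", metadata)
files = plan_scan(catalog, "trades")
```

In this sketch a write never modifies existing files. It adds a new snapshot that references old and new manifests, and readers always start from the catalog pointer.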

Slide 12

Slide 12 text

APACHE ICEBERG - FEATURES & BENEFITS
● Expressive SQL
● Open Specification
● Schema Evolution
● Partition Evolution
● Time Travel & Rollback
● ACID Compliant
● Branching, Merging & Tagging
● Data Compaction
● Hidden Partitioning
Image Credits: Starburst
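Time travel and rollback both follow from keeping a history of snapshots. The sketch below shows the idea with a plain list of `(commit_timestamp_ms, snapshot_id)` pairs. The function names and the log layout are assumptions made for illustration and are not Iceberg's API.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical snapshot log: (commit_timestamp_ms, snapshot_id), ordered by commit time.
SnapshotLog = List[Tuple[int, int]]


def snapshot_as_of(log: SnapshotLog, ts_ms: int) -> Optional[int]:
    """Time travel: return the latest snapshot committed at or before ts_ms."""
    timestamps = [ts for ts, _ in log]
    idx = bisect_right(timestamps, ts_ms)
    return log[idx - 1][1] if idx > 0 else None


def rollback(log: SnapshotLog, target_snapshot_id: int) -> int:
    """Rollback: return the new current snapshot id, which must exist in the log."""
    if target_snapshot_id not in {sid for _, sid in log}:
        raise ValueError(f"unknown snapshot {target_snapshot_id}")
    return target_snapshot_id


log = [(1000, 1), (2000, 2), (3000, 3)]
as_of = snapshot_as_of(log, 2500)   # snapshot visible at t=2500
current = rollback(log, 1)          # point the table back at snapshot 1
```

Neither operation rewrites data. Each one only selects which existing snapshot a reader or the table pointer should use, which is why both depend on old snapshots not having been expired yet.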

Slide 13

Slide 13 text

THE ICE WARS
- Snowflake open sourced Polaris, an Iceberg catalog
- Databricks acquired Tabular, a company founded by the original creators of Iceberg, and also open sourced Unity Catalog
- All major cloud & data platform providers support Iceberg (Confluent Tableflow, AWS S3 Tables, GCP BigQuery Tables etc.)
- Cloudflare announced R2 Data Catalog (an Iceberg REST Catalog) just last week
And the winner is Iceberg

Slide 14

Slide 14 text

KAFKA & FLINK IN THE LAKEHOUSE
Kafka
● Distributed pub/sub messaging system to handle, store and distribute data in real time
● Streaming of data in real time
● Handles huge volumes of data
● High throughput, low latency & fault tolerance
Flink
● Unified stream and batch processing
● Highly efficient stream processing engine
● Handles large scale stateful stream processing with low latency and high throughput
● Can work with multiple different sources and sinks
Kafka and Flink together can transform a Lakehouse into a streaming lakehouse
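As a rough illustration of what "stateful stream processing" means, the sketch below keeps a running sum per key and per tumbling (fixed, non-overlapping) window in plain Python. It does not use the Flink or Kafka APIs. The event layout and the function name are hypothetical, and it ignores concerns a real engine handles, such as late events, watermarks and checkpointed state.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event layout: (event_time_ms, key, amount)
Event = Tuple[int, str, float]


def tumbling_window_sums(
    events: Iterable[Event], window_ms: int
) -> Dict[Tuple[int, str], float]:
    """Sum amounts per (window_start_ms, key) over fixed-size windows."""
    state: Dict[Tuple[int, str], float] = defaultdict(float)  # the operator's state
    for event_time_ms, key, amount in events:
        # Align the event time down to the start of its window.
        window_start = event_time_ms - (event_time_ms % window_ms)
        state[(window_start, key)] += amount
    return dict(state)


# Made-up events, standing in for records consumed from a Kafka topic.
events = [
    (0, "EUR", 10.0),
    (59_999, "EUR", 5.0),   # still inside the first one-minute window
    (60_000, "EUR", 1.0),   # first event of the second window
    (61_000, "USD", 2.0),
]
sums = tumbling_window_sums(events, window_ms=60_000)
```

The dictionary plays the role of operator state: it is the part a stream processor has to persist and restore to stay fault tolerant.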

Slide 15

Slide 15 text

QUICK RECAP
● DATA LAKEHOUSE: A Data Lakehouse offers a cost effective alternative to a Data Warehouse. It also avoids vendor and data lock-in
● OPEN TABLE FORMATS: Powered by open table formats like Iceberg, which offer consistent performance and an open architecture
● ICEBERG: Iceberg’s out of the box features such as Schema Evolution, Partition Evolution, Time Travel etc. reduce operational & maintenance costs
● STREAMING LAKEHOUSE: A Streaming Lakehouse offers streaming ingestion and processing capabilities to power real time analytics & decision making
● KAFKA & FLINK: Kafka and Flink offer real time capabilities to a Lakehouse
● TRINO: Distributed query engines like Trino help to integrate a Lakehouse with BI & Analytics tools

Slide 16

Slide 16 text

REMEMBER
STREAMING LAKEHOUSE IS POWERFUL, BUT NOT A MAGIC WAND...
● Data Quality: Data validation checks
● Security & Compliance: Storage layer security & using a catalog with RBAC policies
● Maintenance of Iceberg Tables: Compaction of files, snapshot expiration
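The two maintenance tasks named here can be sketched as simple planning functions. This is an assumption-laden illustration, not Iceberg's maintenance API: the function names, the retention rule (expire snapshots older than a cutoff unless they are among the most recent N) and the greedy grouping of small files are all choices made for this example.

```python
from typing import List, Tuple


def expire_snapshots(
    snapshots: List[Tuple[int, int]], older_than_ms: int, retain_last: int
) -> Tuple[List[int], List[int]]:
    """Split (timestamp_ms, snapshot_id) pairs, ordered by time, into kept and expired ids."""
    # The most recent `retain_last` snapshots are always kept.
    protected = {sid for _, sid in snapshots[-retain_last:]} if retain_last > 0 else set()
    kept: List[int] = []
    expired: List[int] = []
    for ts, sid in snapshots:
        if ts < older_than_ms and sid not in protected:
            expired.append(sid)
        else:
            kept.append(sid)
    return kept, expired


def plan_compaction(file_sizes_mb: List[int], target_mb: int) -> List[List[int]]:
    """Greedily group files smaller than target_mb into rewrite groups of at most target_mb."""
    small = sorted(size for size in file_sizes_mb if size < target_mb)
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in small:
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    # A group with a single file gains nothing from being rewritten.
    return [group for group in groups if len(group) > 1]


kept, expired = expire_snapshots(
    [(1000, 1), (2000, 2), (3000, 3), (4000, 4)], older_than_ms=3500, retain_last=2
)
groups = plan_compaction([5, 10, 200, 20, 100, 128], target_mb=128)
```

Streaming writers tend to commit often and produce many small files, so both tasks usually need to run on a schedule rather than on demand.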

Slide 17

Slide 17 text

DEMO

Slide 18

Slide 18 text

THE FUTURE BELONGS TO OPEN DATA ARCHITECTURE
“What REST did for the web, Apache Iceberg is doing for data architecture - creating open interoperable standards”

Slide 19

Slide 19 text

Q&A