SG Kafka Meetup August 2024: Streaming Lakehouse with Kafka, Flink and Iceberg

Slide 1

Slide 1 text

Streaming Lakehouse with Kafka, Flink & Iceberg

Slide 2

Slide 2 text

HELLO!
I’m Zabeer Farook, Technical Architect, Credit Agricole CIB
- Passionate about stream data processing, Event Driven Architecture, Cloud & DevOps
- Love travelling & exploring places

Slide 3

Slide 3 text

Your journey with me today..
- OLTP vs OLAP
- Data Warehouse vs Data Lake vs Lakehouse
- Streaming Lakehouse
- Open Table Formats
- Apache Iceberg - Features, Benefits & Challenges
- Iceberg vs other open table formats
- Connecting the dots with Kafka & Flink
- Demo with Kafka, Flink & Iceberg in Action
- Q&A
KNOW | DISCOVER | DO

Slide 4

Slide 4 text

OLTP vs OLAP
- OLTP: Online Transaction Processing
- OLAP: Online Analytical Processing

Slide 5

Slide 5 text

OLAP System Technical Components
- Storage Engine: Laying out data, maintenance, optimization etc
- Table Format: Metadata layer on top of File Format
- File Format: Format of data (CSV, Avro, Parquet, ORC etc)
- Storage: Storage infra (File System, HDFS, Object Storage)
- Compute Engine: Run user workloads to process data
- Catalog: Dictionary to discover table metadata

Slide 6

Slide 6 text

Data Warehouse (since 1980s)
[Timeline: late 1980s to 2020]
- Centralized OLAP database for BI and Reporting
- Only Structured Data
- Schema On Write
- ACID guarantees
- Effective Data Governance
- Storage & Compute tightly coupled
- No native support for ML workloads
- High Cost
e.g. Teradata, Oracle Exadata (legacy EDW)
Cloud Data Warehouses (since 2010) like Redshift, BigQuery and Snowflake separate storage and compute, and support unstructured data and ML workloads as well.
[Figure: OLAP Cubes]

Slide 7

Slide 7 text

Data Lake (since 2010)
- Structured, Semi-Structured & Unstructured Data
- Schema on Read
- Hive Table Format
- Storage and Compute decoupling
- Open Data formats like CSV, Avro, Parquet, ORC
- Lower cost
- Supports ML use cases
- No metadata layer, no ACID support
A Data Lake is often used in conjunction with a Data Warehouse: raw data is stored in the lake, then further cleansed and aggregated in the warehouse.
Started with Hadoop MapReduce and HDFS as storage.
Evolved with cloud object storage (S3, ADLS, GCS) and query engines (Spark, Presto).

Slide 8

Slide 8 text

Beware of Data Swamp!!
- No default storage engine function to optimize data layout
- Data is hardly revisited / optimized
- Apply Data Governance, Data Catalogue, Cleansing

Slide 9

Slide 9 text

Data Lakehouse (since 2020)
- Term made popular by Databricks
- Metadata layer with Open table formats like Hudi, Delta Lake, Iceberg
- Cost Efficient
- ACID guarantees
- Schema Evolution
- Open Architecture
- Faster Queries
Combines the best of both worlds! A Lakehouse can also double up as a data lake and a warehouse.

Slide 10

Slide 10 text

So what is a Streaming Lakehouse?
- Real-time data ingestion
- Stream processing capabilities on Lakehouse
- Real-time analytics through distributed query engines
- Supports faster decision making

Slide 11

Slide 11 text

Open Table Formats
Originally came into the picture to overcome limitations of the Hive Table Format:
🖓 Invisible Specification
🖓 Schema Evolution & Partition Evolution need data rewrites
🖓 Metadata and Data often not in sync
🖓 No Transactional guarantees
🖓 No Time travel & rollback
- Hudi, Delta Lake and Iceberg: Lakehouse Open Table formats solve most of these limitations
- Apache XTable (incubating) provides cross-table omni-directional interoperability between lakehouse table formats
- Apache Paimon is a recent top-level Apache project which is optimized for stream processing in the Lakehouse

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Apache Iceberg
Apache Iceberg is a high performance open table format purpose-built for large scale analytics. It brings the reliability and simplicity of SQL tables to big data while making it possible to work with multiple engines like Spark, Trino, PrestoDB, Flink, Hive etc.
● 2017 - Created by Netflix’s Ryan Blue and Daniel Weeks
● 2018 - Open-sourced and donated to Apache Software Foundation
● Overcomes performance, consistency and many other challenges with the Hive table format

Slide 14

Slide 14 text

Apache Iceberg - Architecture
- Catalog: Tracks the location of a table’s current metadata file
- Metadata file: Defines a table’s structure, schema, partition scheme, snapshot list etc
- Snapshot: State of the table’s data after a write
- Manifest list: Defines a single snapshot as a list of manifest files along with stats
- Manifest file: Contains location, path and metadata about a list of data files
- Data file: File containing the data of the table (Parquet, ORC, Avro etc)
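The layering above can be sketched as a toy in-memory model. This is only an illustration under stated assumptions, not Iceberg's actual implementation or API: every class name, the `append` helper, the bucket path and the file names are hypothetical. It shows the catalog pointing at table metadata, each write producing a new snapshot, and a snapshot resolving to data files through its manifests.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataFile:
    path: str          # hypothetical path, e.g. a Parquet file in object storage
    record_count: int


@dataclass
class ManifestFile:
    path: str
    data_files: List[DataFile]


@dataclass
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    manifest_list: List[ManifestFile]  # the manifest list of this snapshot

    def data_files(self) -> List[DataFile]:
        # A snapshot reaches its data files only through its manifests.
        return [f for m in self.manifest_list for f in m.data_files]


@dataclass
class TableMetadata:
    location: str
    schema: Dict[str, str]
    snapshots: List[Snapshot] = field(default_factory=list)
    current_snapshot_id: Optional[int] = None

    def current_snapshot(self) -> Optional[Snapshot]:
        for snap in self.snapshots:
            if snap.snapshot_id == self.current_snapshot_id:
                return snap
        return None


class Catalog:
    """Toy catalog: maps a table name to its current metadata."""

    def __init__(self) -> None:
        self._tables: Dict[str, TableMetadata] = {}

    def commit(self, name: str, metadata: TableMetadata) -> None:
        self._tables[name] = metadata

    def load(self, name: str) -> TableMetadata:
        return self._tables[name]


def append(metadata: TableMetadata, new_files: List[DataFile], ts_ms: int) -> TableMetadata:
    """Return new metadata with one extra snapshot; older snapshots stay untouched."""
    parent = metadata.current_snapshot()
    manifests = list(parent.manifest_list) if parent else []
    snapshot_id = len(metadata.snapshots) + 1
    manifest_path = f"{metadata.location}/metadata/manifest-{snapshot_id}.avro"
    manifests.append(ManifestFile(manifest_path, new_files))
    snapshot = Snapshot(snapshot_id, ts_ms, manifests)
    return TableMetadata(metadata.location, metadata.schema,
                         metadata.snapshots + [snapshot], snapshot_id)


catalog = Catalog()
table = TableMetadata("s3://demo-bucket/employee", {"id": "long", "name": "string"})
table = append(table, [DataFile("data/a.parquet", 100)], ts_ms=1000)
table = append(table, [DataFile("data/b.parquet", 50)], ts_ms=2000)
catalog.commit("db.employee", table)

current = catalog.load("db.employee").current_snapshot()
total_records = sum(f.record_count for f in current.data_files())
print(total_records)
```

Because each write adds a snapshot instead of mutating the previous one, the first snapshot in this toy still lists only `data/a.parquet`, which is the property time travel and rollback rely on.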

Slide 15

Slide 15 text

Apache Iceberg - Features
● Expressive SQL
● Open Specification
● Schema Evolution
● Partition Evolution
● Time Travel & Rollback
● ACID Compliant
● Branching, Merging & Tagging
● Data Compaction
● Hidden Partitioning

Create Table with partition:

    CREATE TABLE employee (
      id BIGINT,
      name STRING,
      dept STRING,
      dob DATE
    ) PARTITIONED BY (dob);

Time Travel based on time:

    SELECT * FROM employee /*+ OPTIONS('as-of-timestamp'='1723566414000') */;

Time Travel based on snapshot:

    SELECT * FROM employee /*+ OPTIONS('snapshot-id'='483890958221556534') */;
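How a time-based time travel query picks a snapshot can be sketched as follows. This is a minimal illustration, assuming the table keeps a log of (commit timestamp, snapshot id) pairs sorted by time and that an as-of read uses the latest snapshot committed at or before the requested timestamp. The `snapshot_as_of` function and the snapshot ids 101 to 103 are hypothetical; only the query timestamp is taken from the slide.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Each entry: (commit timestamp in epoch millis, snapshot id), sorted by timestamp.
SnapshotLog = List[Tuple[int, int]]


def snapshot_as_of(log: SnapshotLog, as_of_ms: int) -> Optional[int]:
    """Return the id of the latest snapshot committed at or before as_of_ms."""
    timestamps = [ts for ts, _ in log]
    idx = bisect_right(timestamps, as_of_ms)
    if idx == 0:
        return None  # the table had no snapshot yet at that time
    return log[idx - 1][1]


# Hypothetical snapshot log for the employee table.
log: SnapshotLog = [
    (1723560000000, 101),
    (1723566000000, 102),
    (1723570000000, 103),
]

resolved = snapshot_as_of(log, 1723566414000)
print(resolved)
```

A snapshot-id based read skips this lookup entirely and goes straight to the named snapshot.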

Slide 16

Slide 16 text

Apache Iceberg - Catalogs & Compute Engines
- Popular Catalogs (metadata store for Iceberg tables), including REST Catalog
- Compute Engines

Slide 17

Slide 17 text

Why Apache Iceberg? - Benefits
● Avoid Data Silos - Interoperability across different data landscapes
● Avoid Data Duplication - work with different compute engines
● Bring your own Compute Engine
● No more data / vendor lock-in
● Seamless DML operations to adhere to regulations such as GDPR
● Optimized Cost & Performance
● SQL database like feel
Image Credits: Starburst

Slide 18

Slide 18 text

The Ice Wars
- Snowflake open sources Polaris, an Iceberg Catalog
- Databricks acquires Tabular, a company founded by the original creators of Iceberg
- Databricks open sources Unity Catalog
And the Winner is Iceberg

Slide 19

Slide 19 text

Apache Iceberg - Challenges & Mitigations
● Optimizing and maintaining Iceberg Tables
  ○ Compaction of files (avoid small file problem with streaming)
  ○ Retention & Expiration of snapshots
  ○ Old metadata file removal
  ○ Orphan file cleanup
● Security
  ○ Storage layer security
  ○ Catalog with RBAC policies
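Two of the maintenance tasks above, compaction and snapshot expiration, can be sketched as planning functions. This is a toy sketch under assumptions, not Iceberg's maintenance procedures: `plan_compaction` uses a simple first-fit-decreasing packing and `expire_snapshots` uses an age cutoff with a retain-last guard. The function names, parameters, file sizes and the 128 MB target are all hypothetical.

```python
from typing import List, Tuple


def plan_compaction(file_sizes_mb: List[int], target_mb: int = 128) -> List[List[int]]:
    """Group small files into rewrite groups whose total size stays within target_mb."""
    groups: List[List[int]] = []
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            continue  # already large enough, leave it alone
        for group in groups:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            groups.append([size])
    # Rewriting a single file on its own gains nothing.
    return [g for g in groups if len(g) > 1]


Snap = Tuple[int, int]  # (snapshot id, commit timestamp in millis)


def expire_snapshots(snapshots: List[Snap], now_ms: int, max_age_ms: int,
                     retain_last: int = 1) -> Tuple[List[Snap], List[Snap]]:
    """Split snapshots (oldest first) into (kept, expired)."""
    cutoff = now_ms - max_age_ms
    protected = snapshots[-retain_last:] if retain_last > 0 else []
    kept: List[Snap] = []
    expired: List[Snap] = []
    for snap in snapshots:
        if snap in protected or snap[1] >= cutoff:
            kept.append(snap)
        else:
            expired.append(snap)
    return kept, expired


# Hypothetical file sizes, as a streaming job might produce between checkpoints.
groups = plan_compaction([5, 10, 200, 60, 70, 3], target_mb=128)
kept, expired = expire_snapshots([(1, 1000), (2, 5000), (3, 9000)],
                                 now_ms=10000, max_age_ms=4000)
print(groups, kept, expired)
```

In this toy, data files referenced only by expired snapshots would become candidates for deletion, which is why expiration and orphan file cleanup are usually scheduled together with compaction.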

Slide 20

Slide 20 text

Apache Iceberg - Comparison with Hudi & Delta
Source: Dremio

Slide 21

Slide 21 text

Streaming Kafka Data to Iceberg
- Apache Iceberg Kafka Sink Connector
- Flink Streaming
- Spark Streaming
- Confluent Tableflow (in Private Preview)
- Other managed vendor solutions

Slide 22

Slide 22 text

Connecting the dots with Kafka & Flink
Kafka
● Distributed pub-sub messaging system to handle, store and distribute data in real time
● Streaming of data in real time
● Handles huge volumes of data
● High throughput, low latency & fault tolerance
Flink
● Unified Stream and Batch Processing
● Highly efficient stream processing engine
● Handles large scale stateful stream processing with low latency and high throughput
● Can work with multiple different sources and sinks
Kafka and Flink together can transform a Lakehouse into a streaming lakehouse.
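One way to picture how a stream becomes table snapshots is a sink that buffers records and commits them once per checkpoint. The sketch below is a toy simulation under assumptions, not the Flink or Iceberg connector API: `ToyIcebergSink`, `run`, the list standing in for a Kafka topic, and the checkpoint interval are all hypothetical. It shows rows becoming visible only at commit time and a replay after a failure not producing duplicates.

```python
from typing import Dict, List


class ToyIcebergSink:
    """Buffers records and commits them as one snapshot per checkpoint."""

    def __init__(self) -> None:
        self.pending: List[Dict] = []
        self.snapshots: List[List[Dict]] = []  # rows added by each commit
        self.committed_offset = -1             # last source offset made visible
        self._pending_offset = -1

    def write(self, offset: int, record: Dict) -> None:
        self.pending.append(record)
        self._pending_offset = offset

    def checkpoint(self) -> None:
        if not self.pending:
            return
        # Rows and the source offset are committed together.
        self.snapshots.append(self.pending)
        self.committed_offset = self._pending_offset
        self.pending = []

    def visible_rows(self) -> List[Dict]:
        return [row for snap in self.snapshots for row in snap]


def run(topic: List[Dict], sink: ToyIcebergSink, checkpoint_every: int) -> None:
    """Read the topic from the start, skipping offsets that are already committed."""
    for offset, record in enumerate(topic):
        if offset <= sink.committed_offset:
            continue
        sink.write(offset, record)
        if (offset + 1) % checkpoint_every == 0:
            sink.checkpoint()


topic = [{"id": i} for i in range(5)]  # stand-in for a Kafka topic partition
sink = ToyIcebergSink()

run(topic, sink, checkpoint_every=2)
rows_before_failure = len(sink.visible_rows())

sink.pending = []                      # simulated crash: uncommitted buffer is lost
run(topic, sink, checkpoint_every=2)   # replay from the beginning of the topic
sink.checkpoint()                      # final checkpoint flushes the last record
print(rows_before_failure, len(sink.visible_rows()), len(sink.snapshots))
```

The checkpoint interval is the trade-off to watch: shorter intervals make data visible sooner but create more snapshots and smaller files, which feeds directly into the compaction and snapshot expiration work that streaming tables need.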