
Zabeer Farook
September 05, 2024

SG Kafka Meetup August 2024: Streaming Lakehouse with Kafka, Flink and Iceberg

Slides from my SG Kafka Meetup talk on Streaming Lakehouse with Kafka, Flink and Iceberg, presented on August 20th, 2024 in Singapore.


Transcript

  1. HELLO! I’m Zabeer Farook, Technical Architect, Credit Agricole CIB
    - Passionate about stream data processing, event-driven architecture, cloud & DevOps
    - Love travelling & exploring places
  2. Your journey with me today..
    - OLTP vs OLAP
    - Data Warehouse vs Data Lake vs Lakehouse
    - Streaming Lakehouse
    - Open Table Formats
    - Apache Iceberg - Features, Benefits & Challenges
    - Iceberg vs other open table formats
    - Connecting the dots with Kafka & Flink
    - Demo with Kafka, Flink & Iceberg in Action
    - Q&A
    KNOW · DISCOVER · DO
  3. OLAP System Technical Components
    - Storage Engine: laying out data, maintenance, optimization, etc.
    - Table Format: metadata layer on top of the file format
    - File Format: format of the data (CSV, Avro, Parquet, ORC, etc.)
    - Storage: storage infrastructure (file system, HDFS, object storage)
    - Compute Engine: runs user workloads to process data
    - Catalog: dictionary to discover table metadata
  4. Data Warehouse (since 1980s)
    [Timeline graphic: late 1980s to 2020]
    - Centralized OLAP database for BI and reporting
    - Only structured data
    - Schema on write
    - ACID guarantees
    - Effective data governance
    - Storage & compute tightly coupled
    - No native support for ML workloads
    - High cost
    e.g. Teradata, Oracle Exadata (legacy EDW)
    Cloud data warehouses (since 2010) such as Redshift, BigQuery and Snowflake separate storage and compute, and also support unstructured data and ML workloads.
    OLAP Cubes
  5. Data Lake (since 2010)
    - Structured, semi-structured & unstructured data
    - Schema on read
    - Hive table format
    - Storage and compute decoupling
    - Open data formats like CSV, Avro, Parquet, ORC
    - Lower cost
    - Supports ML use cases
    - No metadata layer, no ACID support
    A data lake is often used in conjunction with a data warehouse: raw data is stored in the lake, then further cleansed and aggregated in the warehouse.
    Started with Hadoop MapReduce and HDFS as storage.
    Evolved with cloud object storage (S3, ADLS, GCS) and query engines (Spark, Presto).
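The "schema on write" (warehouse) and "schema on read" (lake) bullets can be contrasted with a small sketch. This is a toy model in plain Python, not how any warehouse or lake engine is implemented; `SCHEMA`, `write_validated` and `read_with_schema` are hypothetical names made up for illustration.

```python
import json

# Hypothetical schema for illustration: column name -> expected Python type.
SCHEMA = {"id": int, "name": str}

def write_validated(table, record):
    """Schema on write: reject a record that does not match SCHEMA before storing it."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected in SCHEMA.items():
        if not isinstance(record[column], expected):
            raise ValueError(f"{column} must be {expected.__name__}")
    table.append(record)

def read_with_schema(raw_lines):
    """Schema on read: raw lines are stored untouched; the schema is applied at query time."""
    rows = []
    for line in raw_lines:
        data = json.loads(line)
        # Keep only the schema's columns and cast each value on the way out.
        rows.append({column: cast(data[column]) for column, cast in SCHEMA.items()})
    return rows

# Warehouse-style: validation happens when the record is written.
warehouse_table = []
write_validated(warehouse_table, {"id": 1, "name": "Ana"})

# Lake-style: the raw line is kept as-is, including a string id and an extra field.
lake_files = ['{"id": "2", "name": "Bo", "extra": true}']
lake_rows = read_with_schema(lake_files)
```

In the lake case nothing stops badly shaped data from landing in storage; problems only surface when a reader applies a schema, which is one way a lake drifts toward a swamp.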
  6. Beware of Data Swamp!!
    - No default storage engine function to optimize data layout
    - Data is hardly revisited / optimized
    - Apply data governance, data catalogue, cleansing
  7. Data Lakehouse (since 2020)
    - Term made popular by Databricks
    - Metadata layer with open table formats like Hudi, Delta Lake, Iceberg
    - Cost efficient
    - ACID guarantees
    - Schema evolution
    - Open architecture
    - Faster queries
    Combines the best of both worlds! A lakehouse can also double up as a data lake and a warehouse.
  8. So what is a Streaming Lakehouse?
    - Real-time data ingestion
    - Stream processing capabilities on the lakehouse
    - Real-time analytics through distributed query engines
    - Supports faster decision making
  9. Open Table Formats
    Originally came into the picture to overcome limitations of the Hive table format:
    🖓 Invisible specification
    🖓 Schema evolution & partition evolution need data rewrites
    🖓 Metadata and data are often not in sync
    🖓 No transactional guarantees
    🖓 No time travel & rollback
    - Lakehouse open table formats (Hudi, Delta Lake and Iceberg) solve most of these limitations
    - Apache XTable (incubating) provides cross-table omni-directional interoperability between lakehouse table formats
    - Apache Paimon is a recent top-level Apache project optimized for stream processing in the lakehouse
  11. Apache Iceberg
    Apache Iceberg is a high-performance open table format purpose-built for large-scale analytics. It brings the reliability and simplicity of SQL tables to big data while making it possible to work with multiple engines like Spark, Trino, PrestoDB, Flink, Hive, etc.
    • 2017 - Created by Netflix’s Ryan Blue and Daniel Weeks
    • 2018 - Open-sourced and donated to the Apache Software Foundation
    • Overcomes performance, consistency and many other challenges with the Hive table format
  12. Apache Iceberg - Architecture
    - Catalog: tracks the location of a table’s current metadata file
    - Metadata file: defines a table’s structure, schema, partition scheme, snapshot list, etc.
    - Snapshot: state of the table’s data after a write
    - Manifest file: contains the location, path and metadata for a list of data files
    - Manifest list: defines a single snapshot as a list of manifest files, along with stats
    - Data file: file containing the table’s data (Parquet, ORC, Avro, etc.)
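The layers listed on this slide can be pictured as a tree that a reader walks from the catalog down to the data files. The sketch below is a simplified in-memory model, not the Iceberg specification or any Iceberg library API. All class and function names are hypothetical, and the catalog is reduced to a dictionary that holds the metadata object directly rather than a pointer to a metadata file location.

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str          # e.g. a Parquet, ORC or Avro file
    record_count: int  # per-file stat kept alongside the path

@dataclass
class ManifestFile:
    data_files: list   # DataFile entries tracked by this manifest

@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: list  # ManifestFile entries that make up this snapshot

@dataclass
class TableMetadata:
    schema: dict
    snapshots: list
    current_snapshot_id: int

def plan_scan(catalog, table_name):
    """Walk catalog -> metadata -> current snapshot -> manifests -> data files."""
    metadata = catalog[table_name]
    current = next(s for s in metadata.snapshots
                   if s.snapshot_id == metadata.current_snapshot_id)
    return [f.path for m in current.manifest_list for f in m.data_files]

# Two writes: the second snapshot reuses the first manifest and adds a new one.
m1 = ManifestFile([DataFile("data/a.parquet", 100)])
m2 = ManifestFile([DataFile("data/b.parquet", 50)])
metadata = TableMetadata(
    schema={"id": "bigint", "name": "string"},
    snapshots=[Snapshot(1, [m1]), Snapshot(2, [m1, m2])],
    current_snapshot_id=2,
)
catalog = {"db.employee": metadata}
paths = plan_scan(catalog, "db.employee")
```

Because older snapshots stay in the snapshot list, pointing `current_snapshot_id` back at snapshot 1 is all a rollback needs in this model; no data file is rewritten.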
  13. Apache Iceberg - Features
    • Expressive SQL
    • Open Specification
    • Schema Evolution
    • Partition Evolution
    • Time Travel & Rollback
    • ACID Compliant
    • Branching, Merging & Tagging
    • Data Compaction
    • Hidden Partitioning
    Create Table with partition:
        CREATE TABLE employee (
          id BIGINT,
          name STRING,
          dept STRING,
          dob DATE
        ) PARTITIONED BY (dob);
    Time Travel based on time:
        SELECT * FROM employee /*+ OPTIONS('as-of-timestamp'='1723566414000') */;
    Time Travel based on snapshot:
        SELECT * FROM employee /*+ OPTIONS('snapshot-id'='483890958221556534') */;
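The two time-travel queries on this slide differ only in how the snapshot to read is chosen. A minimal sketch of that selection, assuming "as of" means the latest snapshot committed at or before the given millisecond timestamp: the snapshot log, its ids and both function names are invented for illustration, and only the timestamp 1723566414000 comes from the slide.

```python
# Hypothetical snapshot log; ids and commit times are made up for illustration.
SNAPSHOTS = [
    {"snapshot_id": 101, "timestamp_ms": 1723566000000},
    {"snapshot_id": 102, "timestamp_ms": 1723566400000},
    {"snapshot_id": 103, "timestamp_ms": 1723566500000},
]

def snapshot_as_of(snapshots, timestamp_ms):
    """Return the latest snapshot committed at or before timestamp_ms."""
    candidates = [s for s in snapshots if s["timestamp_ms"] <= timestamp_ms]
    if not candidates:
        raise ValueError("no snapshot at or before the requested timestamp")
    return max(candidates, key=lambda s: s["timestamp_ms"])

def snapshot_by_id(snapshots, snapshot_id):
    """Return the snapshot with exactly this id."""
    for snapshot in snapshots:
        if snapshot["snapshot_id"] == snapshot_id:
            return snapshot
    raise ValueError(f"unknown snapshot id: {snapshot_id}")

# Timestamp taken from the slide's 'as-of-timestamp' example.
by_time = snapshot_as_of(SNAPSHOTS, 1723566414000)
by_id = snapshot_by_id(SNAPSHOTS, 103)
```

In this toy log the timestamp falls between the second and third commits, so the time-based lookup resolves to snapshot 102 while the id-based lookup can still reach 103.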
  14. Apache Iceberg - Catalogs & Compute Engines
    - Popular Catalogs (metadata store for Iceberg tables), e.g. REST Catalog
    - Compute Engines
  15. Why Apache Iceberg? - Benefits
    • Avoid data silos - interoperability across different data landscapes
    • Avoid data duplication - work with different compute engines
    • Bring your own compute engine
    • No more data / vendor lock-in
    • Seamless DML operations to adhere to regulations such as GDPR
    • Optimized cost & performance
    • SQL database-like feel
    Image Credits: Starburst
  16. The Ice Wars
    - Snowflake open sources Polaris, an Iceberg catalog
    - Databricks acquires Tabular, a company founded by the original creators of Iceberg
    - Databricks open sources Unity Catalog
    And the winner is Iceberg
  17. Apache Iceberg - Challenges & Mitigations
    • Optimizing and maintaining Iceberg tables
      ◦ Compaction of files (avoids the small-file problem with streaming)
      ◦ Retention & expiration of snapshots
      ◦ Old metadata file removal
      ◦ Orphan file cleanup
    • Security
      ◦ Storage layer security
      ◦ Catalog with RBAC policies
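Two of the maintenance tasks above, file compaction and snapshot expiration, can be sketched as planning steps. This is a toy illustration under assumed policies (a greedy size-based grouping and an age-based retention rule with a minimum number of snapshots kept); it is not Iceberg's actual maintenance implementation, and `plan_compaction`, `expire_snapshots` and their parameters are hypothetical.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files so each group's total stays within target_mb."""
    groups, current, current_size = [], [], 0
    for size in file_sizes_mb:
        if size >= target_mb:
            continue  # already large enough, leave it alone
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    # Rewriting a single file gains nothing, so keep only multi-file groups.
    return [group for group in groups if len(group) > 1]

def expire_snapshots(snapshots, now_ms, max_age_ms, keep_last=1):
    """Split snapshot ids into (kept, expired) by age, always keeping the newest keep_last."""
    ordered = sorted(snapshots, key=lambda s: s["timestamp_ms"])
    protected = ordered[max(0, len(ordered) - keep_last):]
    kept, expired = [], []
    for snapshot in ordered:
        young = now_ms - snapshot["timestamp_ms"] <= max_age_ms
        target = kept if (young or snapshot in protected) else expired
        target.append(snapshot["snapshot_id"])
    return kept, expired

# Streaming writes tend to leave many small files behind (sizes in MB).
groups = plan_compaction([5, 10, 200, 60, 60, 40, 8])

snapshots = [
    {"snapshot_id": 1, "timestamp_ms": 0},
    {"snapshot_id": 2, "timestamp_ms": 1000},
    {"snapshot_id": 3, "timestamp_ms": 5000},
]
kept, expired = expire_snapshots(snapshots, now_ms=10000, max_age_ms=2000)
```

With these inputs every snapshot is older than the retention window, yet the newest one is still kept so the table always has a current state to read.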
  18. Streaming Kafka Data to Iceberg
    - Apache Iceberg Kafka Sink Connector
    - Flink Streaming
    - Spark Streaming
    - Confluent Tableflow (in private preview)
    - Other managed vendor solutions
  19. Connecting the dots with Kafka & Flink
    Kafka
    • Distributed pub-sub messaging system to handle, store and distribute data in real time
    • Streaming of data in real time
    • Handles huge volumes of data
    • High throughput, low latency & fault tolerance
    Flink
    • Unified stream and batch processing
    • Highly efficient stream processing engine
    • Handles large-scale stateful stream processing with low latency and high throughput
    • Can work with multiple different sources and sinks
    Kafka and Flink together can transform a lakehouse into a streaming lakehouse.
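One way to picture how a stream becomes table snapshots: records are buffered as they arrive and only become visible to readers when a checkpoint commits them as a new snapshot. The sketch below is a toy simulation of that idea and uses no Kafka, Flink or Iceberg API. `ToyIcebergSink` and `run_pipeline` are hypothetical names, and the checkpoint is triggered by record count here purely to keep the example deterministic, whereas a real job would typically checkpoint on a time interval.

```python
class ToyIcebergSink:
    """Buffers records and publishes them as a snapshot on each checkpoint."""

    def __init__(self):
        self.pending = []    # written but not yet committed
        self.snapshots = []  # committed, visible to readers

    def write(self, event):
        self.pending.append(event)

    def checkpoint(self):
        """Commit pending records as one new snapshot; skip empty commits."""
        if not self.pending:
            return None
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append({"snapshot_id": snapshot_id, "records": list(self.pending)})
        self.pending.clear()
        return snapshot_id

    def visible_records(self):
        return [r for s in self.snapshots for r in s["records"]]

def run_pipeline(events, checkpoint_every):
    """Feed events (standing in for a Kafka topic) into the sink, checkpointing periodically."""
    sink = ToyIcebergSink()
    for count, event in enumerate(events, start=1):
        sink.write(event)
        if count % checkpoint_every == 0:
            sink.checkpoint()
    return sink

sink = run_pipeline(["e1", "e2", "e3", "e4", "e5"], checkpoint_every=2)
```

Frequent commits keep the table fresh but produce many small snapshots and files, which is why the compaction and snapshot-expiration tasks from the challenges slide matter for streaming ingestion.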