

Exploring DuckLake: A First Look with Simple Data Flows


Open Data Circle

November 17, 2025



Transcript

  1. Yuqing Guan - Open Data Circle
     Exploring DuckLake: A First Look with Simple Data Flows
     November 11, 2025
  2. Agenda
     1. Introduction
        • Background of Lakehouse architecture
        • Why DuckLake emerged and the problems it solves
     2. Core Concepts
        • Key principles behind DuckLake
        • Architecture overview
     3. Demonstration
        • DuckDB and DuckLake demo
        • Integration with PySpark
     4. Conclusion & Discussion
        • Key takeaways
        • Q&A
  3. Background of Lakehouse: Understanding the foundation
     • Lakehouse architecture: unify data lake & warehouse
       • Data warehouse: structured, fast queries
       • Data lake: scalable storage, less structured
     • Lakehouse key features
       • Metadata layer for data lakes
       • ACID transactions, schema enforcement and governance…
     Solutions: Iceberg, Delta Lake, Paimon, DuckLake, …
     https://www.databricks.com/glossary/data-lakehouse
  4. What problems DuckLake addresses: Challenges in Iceberg
     • Manifest Structure Complexity
       Iceberg's use of multiple layers of manifest files adds complexity to the metadata structure. While this design improves scalability, it can be difficult to manage and debug.
     • High Communication Overhead When Reading Manifests
       Query engines must read multiple manifest and manifest list files to plan a query, leading to non-trivial I/O and communication overhead, especially in remote storage setups.
     • External Dependency for Transaction Management
       Although Iceberg supports atomic operations, it typically relies on an external metastore or catalog (e.g., Hive Metastore, AWS Glue, or a relational database) to track metadata versions.
  5. Challenges in Existing Lakehouses: High Communication Overhead When Reading Manifests
     • A lot of round trips just to get the metadata before scanning the actual data
     • If you’re updating or reading a single row, that’s a huge overhead
  6. What problems DuckLake addresses: Challenges in Iceberg
     • Manifest Structure Complexity
       Iceberg's use of multiple layers of manifest files adds complexity to the metadata structure. While this design improves scalability, it can be difficult to manage and debug.
     • High Communication Overhead When Reading Manifests
       Query engines must read multiple manifest and manifest list files to plan a query, leading to non-trivial I/O and communication overhead, especially in remote storage setups.
     • External Dependency for Transaction Management
       Although Iceberg supports atomic operations, it typically relies on an external metastore or catalog (e.g., Hive Metastore, AWS Glue, or a relational database) to track metadata versions.
  7. Core Ideas Behind DuckLake
     • Central concept: store all metadata in a SQL database
     • DuckLake's architecture: just a database and some Parquet files
     • Because the metadata lives in a database, a single SQL query can resolve everything (see the sketch below)
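     A minimal sketch of this idea with the DuckDB Python API and the ducklake extension (the catalog file name, data path, and table below are illustrative, not taken from the talk):

        import duckdb

        con = duckdb.connect()

        # Load the DuckLake extension; the catalog (all table metadata) is a plain
        # database -- here a local DuckDB file -- and the table data is written as
        # Parquet files under DATA_PATH.
        con.sql("INSTALL ducklake")
        con.sql("LOAD ducklake")
        con.sql("ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 'lake_data/')")
        con.sql("USE my_lake")

        # A regular SQL table; DuckLake records its schema, snapshots and file list
        # as rows in the catalog database, so query planning is a single SQL query.
        con.sql("CREATE TABLE events AS SELECT range AS id, range % 7 AS bucket FROM range(1000)")
        con.sql("SELECT bucket, count(*) AS n FROM events GROUP BY bucket ORDER BY bucket").show()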
  8. Core Ideas Behind DuckLake: Discussing scalability
     https://ducklake.select/2025/05/27/ducklake-01/#scalability
     • Storage: built on purpose-designed file storage (e.g., blob storage), allowing virtually unlimited scalability
     • Metadata (Catalog): the catalog handles only lightweight metadata transactions, making it highly efficient. DuckLake is not tied to a specific database; PostgreSQL alone can already scale to hundreds of terabytes and thousands of compute nodes (a PostgreSQL-backed variant is sketched below)
     • Compute: any number of compute nodes can independently read and write data while coordinating through the catalog, enabling infinite compute scaling
     Big Data queries are pretty rare (source: https://motherduck.com/blog/redshift-files-hunt-for-big-data/)
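     To illustrate the "not tied to a specific database" point, the same attach can, for example, point the catalog at PostgreSQL and the data path at object storage; the connection string and bucket below are placeholders, and the exact DSN options may differ per setup:

        import duckdb

        con = duckdb.connect()
        con.sql("INSTALL ducklake")
        con.sql("LOAD ducklake")

        # Same layout as before, but the catalog now lives in PostgreSQL and the
        # Parquet data files live in object storage, so any number of compute nodes
        # can attach the same lake and coordinate through the catalog.
        # (S3/MinIO credentials would be configured separately, e.g. with CREATE SECRET.)
        con.sql("""
            ATTACH 'ducklake:postgres:dbname=lake_catalog host=pg.example.internal' AS shared_lake
                (DATA_PATH 's3://analytics-lake/')
        """)
        con.sql("USE shared_lake")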
  9. Demo Overview: What you will see
     https://github.com/opendatacircle/ducklake-lab/tree/main/demos
     • DuckDB Basics – Demonstrate how DuckDB efficiently queries large Parquet files and serves as a lightweight, high-performance alternative to pandas for analytical workloads.
     • Introducing DuckLake – Use the DuckDB CLI to explore DuckLake features, including data versioning and time travel capabilities.
     • Spark–DuckLake Integration – Show how PySpark can seamlessly write to and read from DuckLake using Spark APIs for data processing.
     (A rough sketch of the first two parts follows below.)
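     A rough flavour of the first two demo parts (the file, table, and version number below are invented for illustration, not taken from the demo repository):

        import duckdb

        con = duckdb.connect()

        # Demo 1 flavour: run SQL straight over a Parquet file, no load step needed.
        con.sql("""
            SELECT passenger_count, avg(trip_distance) AS avg_distance
            FROM 'trips_2024.parquet'        -- hypothetical file
            GROUP BY passenger_count
            ORDER BY passenger_count
        """).show()

        # Demo 2 flavour: snapshots and time travel on a DuckLake table
        # (catalog attached as in the earlier sketch).
        con.sql("INSTALL ducklake")
        con.sql("LOAD ducklake")
        con.sql("ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 'lake_data/')")
        con.sql("FROM ducklake_snapshots('my_lake')").show()
        con.sql("SELECT count(*) FROM my_lake.events AT (VERSION => 1)").show()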
  10. Demo 3 - Spark x DuckLake Integration: Features Demonstrated
      • Use Python DuckDB to convert a raw Parquet file into a DuckLake table (a sketch of this step follows below).
      • Use Spark SQL to query the DuckLake table and perform analytics.
      • Use the PySpark DataFrame API to write the result as a new DuckLake table.
      • Use the PySpark DataFrame API to verify and inspect the output.
      The integration flow: PySpark app → DuckDB JDBC driver → JNI → DuckDB (native lib) → DuckLake extension → data files on S3 (MinIO)
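      The first bullet, converting a raw Parquet file into a DuckLake table from Python, might look roughly like this (file and table names are illustrative); the remaining PySpark steps then go through the DuckDB JDBC driver as shown in the integration flow above:

         import duckdb

         con = duckdb.connect()
         con.sql("INSTALL ducklake")
         con.sql("LOAD ducklake")
         con.sql("ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 'lake_data/')")

         # Step 1: turn a raw Parquet file into a managed DuckLake table. The rows are
         # rewritten as Parquet under DATA_PATH and registered in the catalog, so other
         # engines (e.g. Spark through the DuckDB JDBC driver) can then find and query it.
         con.sql("""
             CREATE TABLE my_lake.trips AS
             SELECT * FROM read_parquet('raw_trips.parquet')   -- hypothetical raw file
         """)
         con.sql("SELECT count(*) FROM my_lake.trips").show()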
  11. Summary / Takeaways: Key points to remember
      • Metadata is the database
      • Files are the storage
      • Fast query planning and simplified operations
      • Still maturing for production use
      • Tight coupling with DuckDB
  12. References: Recommended materials and further reading
      • DuckLake intro: https://ducklake.select/2025/05/27/ducklake-01/
      • Get started with DuckLake: https://motherduck.com/blog/getting-started-ducklake-table-format/
      • Lakehouse comparison: https://medium.com/fresha-data-engineering/how-tables-grew-a-brain-iceberg-hudi-delta-paimon-ducklake-a617f34da6ce
      • ODC Demo: https://github.com/opendatacircle/ducklake-lab