

Exploring DuckLake: A First Look with Simple Data Flows


Open Data Circle

November 17, 2025



Transcript

  1. Yuqing Guan - Open Data Circle
     Exploring DuckLake: A First Look with Simple Data Flows
     November 11, 2025
  2. Agenda
     1. Introduction
        • Background of Lakehouse architecture
        • Why DuckLake emerged and the problems it solves
     2. Core Concepts
        • Key principles behind DuckLake
        • Architecture overview
     3. Demonstration
        • DuckDB and DuckLake demo
        • Integration with PySpark
     4. Conclusion & Discussion
        • Key takeaways
        • Q&A
  3. Background of Lakehouse: Understanding the foundation
     • Lakehouse architecture: unify data lake & warehouse
       • Data warehouse: structured, fast queries
       • Data lake: scalable storage, less structured
     • Lakehouse key features
       • Metadata layer for data lakes
       • ACID transactions, schema enforcement and governance…
     Solutions: Iceberg, Delta Lake, Paimon, DuckLake, …
     https://www.databricks.com/glossary/data-lakehouse
  4. What problems DuckLake addresses: Challenges in Iceberg
     • Manifest Structure Complexity
       Iceberg's use of multiple layers of manifest files adds complexity to the metadata structure. While this design improves scalability, it can be difficult to manage and debug.
     • High Communication Overhead When Reading Manifests
       Query engines must read multiple manifest and manifest list files to plan a query, leading to non-trivial I/O and communication overhead, especially in remote storage setups.
     • External Dependency for Transaction Management
       Although Iceberg supports atomic operations, it typically relies on an external metastore or catalog (e.g., Hive Metastore, AWS Glue, or a relational database) to track metadata versions.
  5. Challenges in Existing Lakehouses: High Communication Overhead When Reading Manifests
     • A lot of round trips just to get the metadata before scanning the actual data
     • If you’re updating or reading a single row, that’s a huge overhead
  6. What problems DuckLake addresses: Challenges in Iceberg
     • Manifest Structure Complexity
       Iceberg's use of multiple layers of manifest files adds complexity to the metadata structure. While this design improves scalability, it can be difficult to manage and debug.
     • High Communication Overhead When Reading Manifests
       Query engines must read multiple manifest and manifest list files to plan a query, leading to non-trivial I/O and communication overhead, especially in remote storage setups.
     • External Dependency for Transaction Management
       Although Iceberg supports atomic operations, it typically relies on an external metastore or catalog (e.g., Hive Metastore, AWS Glue, or a relational database) to track metadata versions.
  7. Core Ideas Behind DuckLake
     • Central concept: store all metadata in a SQL database
     • DuckLake's architecture: just a database and some Parquet files
     • Because the metadata lives in a database, a single SQL query can resolve everything (see the sketch below)
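     A minimal sketch of this idea with the DuckDB Python API and the ducklake extension (the catalog file name, data path, and table below are illustrative, not taken from the talk):

        import duckdb

        con = duckdb.connect()

        # Load the DuckLake extension; the catalog (all table metadata) is a plain
        # database -- here a local DuckDB file -- and the table data is written as
        # Parquet files under DATA_PATH.
        con.sql("INSTALL ducklake")
        con.sql("LOAD ducklake")
        con.sql("ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 'lake_data/')")
        con.sql("USE my_lake")

        # A regular SQL table; DuckLake records its schema, snapshots and file list
        # as rows in the catalog database, so query planning is a single SQL query.
        con.sql("CREATE TABLE events AS SELECT range AS id, range % 7 AS bucket FROM range(1000)")
        con.sql("SELECT bucket, count(*) AS n FROM events GROUP BY bucket ORDER BY bucket").show()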
  8. Core Ideas Behind DuckLake: Discussing scalability
     https://ducklake.select/2025/05/27/ducklake-01/#scalability
     • Storage: built on purpose-designed file storage (e.g., blob storage), allowing virtually unlimited scalability
     • Metadata (Catalog): the catalog handles only lightweight metadata transactions, making it highly efficient. DuckLake is not tied to a specific database; PostgreSQL alone can already scale to hundreds of terabytes and thousands of compute nodes (a PostgreSQL-backed variant is sketched below)
     • Compute: any number of compute nodes can independently read and write data while coordinating through the catalog, enabling infinite compute scaling
     Big Data queries are pretty rare (source: https://motherduck.com/blog/redshift-files-hunt-for-big-data/)
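     To illustrate the "not tied to a specific database" point, the same attach can, for example, point the catalog at PostgreSQL and the data path at object storage; the connection string and bucket below are placeholders, and the exact DSN options may differ per setup:

        import duckdb

        con = duckdb.connect()
        con.sql("INSTALL ducklake")
        con.sql("LOAD ducklake")

        # Same layout as before, but the catalog now lives in PostgreSQL and the
        # Parquet data files live in object storage, so any number of compute nodes
        # can attach the same lake and coordinate through the catalog.
        # (S3/MinIO credentials would be configured separately, e.g. with CREATE SECRET.)
        con.sql("""
            ATTACH 'ducklake:postgres:dbname=lake_catalog host=pg.example.internal' AS shared_lake
                (DATA_PATH 's3://analytics-lake/')
        """)
        con.sql("USE shared_lake")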
  9. Demo Overview: What you will see
     https://github.com/opendatacircle/ducklake-lab/tree/main/demos
     • DuckDB Basics – Demonstrate how DuckDB efficiently queries large Parquet files and serves as a lightweight, high-performance alternative to pandas for analytical workloads.
     • Introducing DuckLake – Use the DuckDB CLI to explore DuckLake features, including data versioning and time travel capabilities.
     • Spark–DuckLake Integration – Show how PySpark can seamlessly write to and read from DuckLake using Spark APIs for data processing.
     (A rough sketch of the first two parts follows below.)
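     A rough flavour of the first two demo parts (the file, table, and version number below are invented for illustration, not taken from the demo repository):

        import duckdb

        con = duckdb.connect()

        # Demo 1 flavour: run SQL straight over a Parquet file, no load step needed.
        con.sql("""
            SELECT passenger_count, avg(trip_distance) AS avg_distance
            FROM 'trips_2024.parquet'        -- hypothetical file
            GROUP BY passenger_count
            ORDER BY passenger_count
        """).show()

        # Demo 2 flavour: snapshots and time travel on a DuckLake table
        # (catalog attached as in the earlier sketch).
        con.sql("INSTALL ducklake")
        con.sql("LOAD ducklake")
        con.sql("ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 'lake_data/')")
        con.sql("FROM ducklake_snapshots('my_lake')").show()
        con.sql("SELECT count(*) FROM my_lake.events AT (VERSION => 1)").show()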
  10. Demo 3 - Spark x DuckLake Integration: Features Demonstrated
      • Use Python DuckDB to convert a raw Parquet file into a DuckLake table (a sketch of this step follows below).
      • Use Spark SQL to query the DuckLake table and perform analytics.
      • Use the PySpark DataFrame API to write the result as a new DuckLake table.
      • Use the PySpark DataFrame API to verify and inspect the output.
      The integration flow: PySpark app → DuckDB JDBC driver → JNI → DuckDB (native lib) → DuckLake extension → data files on S3 (MinIO)
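      The first bullet, converting a raw Parquet file into a DuckLake table from Python, might look roughly like this (file and table names are illustrative); the remaining PySpark steps then go through the DuckDB JDBC driver as shown in the integration flow above:

         import duckdb

         con = duckdb.connect()
         con.sql("INSTALL ducklake")
         con.sql("LOAD ducklake")
         con.sql("ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 'lake_data/')")

         # Step 1: turn a raw Parquet file into a managed DuckLake table. The rows are
         # rewritten as Parquet under DATA_PATH and registered in the catalog, so other
         # engines (e.g. Spark through the DuckDB JDBC driver) can then find and query it.
         con.sql("""
             CREATE TABLE my_lake.trips AS
             SELECT * FROM read_parquet('raw_trips.parquet')   -- hypothetical raw file
         """)
         con.sql("SELECT count(*) FROM my_lake.trips").show()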
  11. Summary / Takeaways: Key points to remember
      • Metadata is the database
      • Files are the storage
      • Fast query planning and simplified operations
      • Still maturing for production use
      • Tight coupling with DuckDB
  12. References: Recommended materials and further reading
      • DuckLake intro: https://ducklake.select/2025/05/27/ducklake-01/
      • Get started with DuckLake: https://motherduck.com/blog/getting-started-ducklake-table-format/
      • Lakehouse comparison: https://medium.com/fresha-data-engineering/how-tables-grew-a-brain-iceberg-hudi-delta-paimon-ducklake-a617f34da6ce
      • ODC Demo: https://github.com/opendatacircle/ducklake-lab