Iceberg Meetup Japan #1 : Iceberg and Databricks

These are the slides used at Iceberg Meetup #1, held on February 21.
They introduce the catalogs involved when using Databricks with Iceberg.

Databricks Japan

February 24, 2025

Transcript

  1. Databricks Frozen Over: Bringing the best of data catalogs and Iceberg together
    Victoria Bukta - Product @ Databricks
    Feb 19, 2025
  2. Agenda
    • Hey! I’m Victoria
    • Data Catalogs
    • Unity Catalog
    • Iceberg at Databricks
    • Vision + Mission
  3. Lakehouse - Overview: Does your data lake look like this?
    • Files in object storage
    • Multiple apps
    • Producing, accessing, and processing data
    • Multiple use cases
    • Data engineering
    • ML / AI
    • Reporting
  4. Lakehouse - Zooming in: What does your organization look like?
    • Discovery
      ◦ How do you find the data you need?
    • Access
      ◦ How do I gain access to data?
    • Observability
      ◦ Who is accessing data?
      ◦ What is accessed + how?
    • Lineage
      ◦ How was this data produced?
  5. Introducing Hive Metastore (Solving how engines find datasets)
    • Discovery
    • You can now look through Hive and see your tables
    • Access
    • Tells me the following so that I know how to interact with the table
    • Location
    • Format
    • Schema
    • IAM permissions required to access storage locations
  6. Different Catalog Implementations: Data Catalogs with Iceberg
    https://iceberg.apache.org/concepts/catalog/?h=catalog#overview
    Hadoop: Directory based catalog
      spark.read.format("iceberg").load("hdfs://host:8020/catalog/schema/table");
    Hive / JDBC / Nessie / etc.
    Iceberg REST: Server-side catalog that’s exposed through a REST API
    • Single client to talk to any custom catalog backend.
    • Shifting responsibility from client to catalog
      ◦ Metadata file generation
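To make the difference between a directory-based catalog and a REST catalog on slide 6 more concrete, here is a minimal sketch that builds the Spark properties Iceberg uses to register a catalog. The helper name `iceberg_catalog_conf`, the catalog names `dir_cat` and `rest_cat`, and the REST URI are assumptions made for illustration; only the `spark.sql.catalog.*` property keys come from Iceberg's Spark configuration.

```python
def iceberg_catalog_conf(name, catalog_type, **options):
    """Build Spark properties for an Iceberg catalog called `name`.

    `catalog_type` is the Iceberg catalog type, e.g. "hadoop" or "rest".
    Extra keyword options become `spark.sql.catalog.<name>.<key>` entries.
    """
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": catalog_type,
    }
    for key, value in options.items():
        conf[f"{prefix}.{key}"] = value
    return conf


# Directory-based (Hadoop) catalog: the client resolves tables from a path.
hadoop_conf = iceberg_catalog_conf(
    "dir_cat", "hadoop", warehouse="hdfs://host:8020/catalog"
)

# REST catalog: the client only needs the server's URI (placeholder here).
rest_conf = iceberg_catalog_conf(
    "rest_cat", "rest", uri="https://catalog.example.com/api"
)
```

Each key/value pair could then be passed to `SparkSession.builder.config(key, value)`. The client-side configuration looks the same for both; the difference is that with the REST type the server behind the URI takes over work such as metadata file generation.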
  7. What do we want?
    1. Manage data and AI assets in one place
    2. Govern assets through a single source of truth
    3. Leverage best-of-breed tools with your data
  8. Unity Catalog
    • Multimodal: Universal catalog for tabular data, non-tabular data, and AI assets
    • Multi-format: Support any table format, including Delta, Iceberg, Parquet, CSV, JSON
    • Unified: Single catalog which can govern access across your entire data estate
  9. Open Lakehouse for Data + AI (diagram)
    • Open catalog: Governance, Discovery, Lineage, Observability
    • Asset groups: Tables, Objects, AI / ML
    • Assets: Tables, Views, Volumes, Functions, ML Models, Vector DBs
    • Formats: Delta, Iceberg, Hudi, Parquet, CSV, JSON, Image, Audio, PDF
    • Engines and platforms: Microsoft Fabric, Google Cloud, LlamaIndex
  10. Unify the lakehouse with Databricks Unity Catalog
    • Write and read from any Iceberg client using open APIs (Unity or Iceberg REST)
    • Access and govern data in Foreign Catalogs from Unity Catalog (and vice-versa)
    Diagram labels: Delta Clients, Iceberg Clients, Unity REST, Iceberg REST; Federation & Mirroring with Glue, Horizon REST, HMS
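For readers unfamiliar with what "any Iceberg client using open APIs" implies on slide 10, the sketch below builds the request path a client would use to load a table from an Iceberg REST catalog. The path shape follows the Iceberg REST catalog OpenAPI specification; the function name `load_table_path` and the sample namespace, table, and prefix values are assumptions for this example, and the slides do not state which base URL or prefix Unity Catalog uses.

```python
from urllib.parse import quote

# Multi-level namespaces are joined with the unit separator (0x1F)
# before being URL-encoded, per the Iceberg REST specification.
NAMESPACE_SEPARATOR = "\x1f"


def load_table_path(namespace, table, prefix=""):
    """Return the path for a load-table request (HTTP GET).

    namespace: list of namespace levels, e.g. ["sales"] or ["a", "b"].
    prefix: optional path prefix the catalog hands out via GET /v1/config;
            assumed here to already be path-safe.
    """
    encoded_ns = quote(NAMESPACE_SEPARATOR.join(namespace), safe="")
    parts = ["v1"]
    if prefix:
        parts.append(prefix.strip("/"))
    parts += ["namespaces", encoded_ns, "tables", quote(table, safe="")]
    return "/" + "/".join(parts)


simple = load_table_path(["sales"], "orders")
nested = load_table_path(["a", "b"], "t", prefix="main")
```

Because every engine issues the same request shape, the catalog server can apply governance in one place regardless of which client is asking.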
  11. Break format silos with Databricks Unity Catalog
    • Govern all Delta, Iceberg, and legacy formats (e.g. Parquet, CSV, JSON)
    • AI-driven predictive optimizations on all managed tables
      ◦ File compaction, snapshot expiry, etc.
    Diagram labels: Spark, Trino, Flink, Snowflake, DBX; Create table, Read table; Iceberg REST, Unity REST; Delta Lake, Iceberg; AI-driven Predictive Optimization
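Slide 11 names snapshot expiry as one of the maintenance tasks handled automatically. As a simplified sketch of what such a task decides, the function below picks snapshots older than a cutoff while always keeping the most recent ones. The parameter names mirror the `older_than` and `retain_last` arguments of Iceberg's `expire_snapshots` procedure, but the function itself, the tuple layout, and the sample data are hypothetical, and this is not how Databricks predictive optimization is implemented.

```python
from datetime import datetime, timedelta


def snapshots_to_expire(snapshots, older_than, retain_last=1):
    """Pick snapshot ids that a retention policy would expire.

    snapshots: list of (snapshot_id, committed_at) tuples.
    older_than: snapshots committed before this time are candidates.
    retain_last: number of most recent snapshots that are always kept.
    """
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    protected = {sid for sid, _ in newest_first[:retain_last]}
    return [
        sid
        for sid, committed_at in newest_first
        if committed_at < older_than and sid not in protected
    ]


now = datetime(2025, 2, 19)
history = [
    (1, now - timedelta(days=10)),
    (2, now - timedelta(days=5)),
    (3, now - timedelta(days=1)),
]
expired = snapshots_to_expire(history, older_than=now - timedelta(days=3))
```

A real implementation would also have to remove the data and metadata files that only the expired snapshots reference, which this sketch leaves out.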
  12. Long-term vision of Delta and Iceberg: Format Unification (Delta Lake, Iceberg)
    • Partnership with the Delta and Iceberg communities to unify the formats
    • Consistent data and delete files for flexibility and performance
    • Aligned table features to track row-level changes between versions of a table