Slide 1

Slide 1 text

Databricks Frozen Over
Bringing the best of data catalogs and Iceberg together
Victoria Bukta - Product @ Databricks
Feb 19, 2025

Slide 2

Slide 2 text

Agenda
• Hey! I’m Victoria
• Data Catalogs
• Unity Catalog
• Iceberg at Databricks
• Vision + Mission

Slide 3

Slide 3 text

Data Catalogs
The messy realities of data platforms

Slide 4

Slide 4 text

Does your data lake look like this?
Lakehouse - Overview
• Files in object storage
• Multiple apps
    • Producing, accessing, and processing data
• Multiple use cases
    • Data engineering
    • ML / AI
    • Reporting

Slide 5

Slide 5 text

What does your organization look like?
Lakehouse - Zooming in
● Discovery
    ○ How do you find the data you need?
● Access
    ○ How do I gain access to data?
● Observability
    ○ Who is accessing data?
    ○ What is accessed + how?
● Lineage
    ○ How was this data produced?

Slide 6

Slide 6 text

Then came Hive Metastore

Slide 7

Slide 7 text

Introducing Hive Metastore
Solving how engines find datasets
• Discovery
    • You can now look through Hive and see your tables
• Access
    • Tells me the following so that I know how to interact with the table
        • Location
        • Format
        • Schema
    • IAM permissions required to access storage locations
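As a rough illustration of the metadata listed on this slide, the sketch below models a toy in-memory metastore. The names `TableEntry` and `ToyMetastore`, and the example table and bucket path, are hypothetical, and this is not the Hive Metastore API. It only shows how an entry carrying location, format, and schema lets an engine discover a table and learn how to read it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TableEntry:
    """What a metastore records about one table (hypothetical model)."""
    location: str      # where the files live in object storage
    file_format: str   # how to decode them, e.g. "parquet"
    schema: dict       # column name -> type name


@dataclass
class ToyMetastore:
    """In-memory stand-in for a metastore; not the Hive Metastore API."""
    _tables: dict = field(default_factory=dict)

    def register(self, database, table, entry):
        self._tables[(database, table)] = entry

    def list_tables(self, database):
        # Discovery: browse what exists in a database.
        return sorted(t for (db, t) in self._tables if db == database)

    def lookup(self, database, table):
        # Access: tell the engine where and how to read the table.
        return self._tables[(database, table)]


metastore = ToyMetastore()
metastore.register(
    "sales", "orders",
    TableEntry(
        location="s3://example-bucket/sales/orders",  # made-up path
        file_format="parquet",
        schema={"order_id": "bigint", "amount": "double"},
    ),
)
```

An engine would call `list_tables` to browse and `lookup` to get the location, format, and schema before reading. Storage permissions (the IAM point above) still have to be granted separately in this model.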

Slide 8

Slide 8 text

What about data catalogs and Iceberg?

Slide 9

Slide 9 text

Data Catalogs with Iceberg
Different Catalog Implementations
https://iceberg.apache.org/concepts/catalog/?h=catalog#overview

Hadoop
Directory based catalog
spark.read.format("iceberg").load("hdfs://host:8020/catalog/schema/table");

Hive / JDBC / Nessie / etc.

Iceberg REST
Server-side catalog that’s exposed through a REST API
● Single client to talk to any custom catalog backend.
● Shifting responsibility from client to catalog
    ○ Metadata file generation
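To make the difference between catalog implementations concrete, here is a minimal sketch that builds the Spark properties Iceberg uses to register a catalog. The helper `iceberg_catalog_conf`, the catalog names, and the REST URI are illustrative assumptions. The property keys (`spark.sql.catalog.<name>`, `.type`, `.warehouse`, `.uri`) follow Iceberg's Spark configuration, and only a subset of catalog types is covered.

```python
def iceberg_catalog_conf(name, catalog_type, **props):
    """Build Spark properties for an Iceberg catalog called `name`.

    Hypothetical helper. `catalog_type` is restricted here to a subset of
    Iceberg's built-in types; extra keyword arguments become catalog
    properties such as `uri` or `warehouse`.
    """
    if catalog_type not in {"hadoop", "hive", "rest"}:
        raise ValueError(f"unsupported catalog type: {catalog_type}")
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": catalog_type,
    }
    for key, value in props.items():
        conf[f"{prefix}.{key}"] = value
    return conf


# Directory-based (Hadoop) catalog: tables are found by path under a warehouse.
hadoop_conf = iceberg_catalog_conf(
    "dir_cat", "hadoop", warehouse="hdfs://host:8020/catalog"
)

# REST catalog: the client only needs the server's URI (placeholder value).
rest_conf = iceberg_catalog_conf(
    "rest_cat", "rest", uri="https://catalog.example.com/api"
)
```

Each key and value pair would typically be passed to `SparkSession.builder.config(key, value)`. With the REST type, the same client configuration works against any server that implements the REST API, which is the "single client" point on the slide.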

Slide 10

Slide 10 text

Multiple Catalogs!

Slide 11

Slide 11 text

Metadata silos lead to fragmented discovery, governance, auditing, and lineage

Slide 12

Slide 12 text

Unity Catalog
The Multi-format, Multimodal, Unified Catalog

Slide 13

Slide 13 text

What do we want?
1. Manage data and AI assets in one place
2. Govern assets through a single source of truth
3. Leverage best-of-breed tools with your data

Slide 14

Slide 14 text

Unity Catalog
• Multimodal: Universal catalog for tabular data, non-tabular data, and AI assets
• Multi-format: Supports any table format, including Delta, Iceberg, Parquet, CSV, JSON
• Unified: Single catalog which can govern access across your entire data estate

Slide 15

Slide 15 text

Open Lakehouse for Data + AI
[Diagram labels]
• OPEN CATALOG: Tables, Objects, AI / ML
• Capabilities: Governance, Discovery, Lineage, Observability
• Assets: Tables, Views, Volumes, Functions, ML Models, Vector DBs
• Table formats: Delta, Iceberg, Hudi
• File types: Parquet, CSV, JSON, Image, Audio, PDF
• ENGINES AND PLATFORMS: Microsoft Fabric, Google Cloud, LlamaIndex

Slide 16

Slide 16 text

Iceberg at Databricks with Unity Catalog
Bring your data under one roof

Slide 17

Slide 17 text

Unify the lakehouse with Databricks Unity Catalog
Write and read from any Iceberg client using open APIs (Unity or Iceberg REST)
Access and govern data in Foreign Catalogs from Unity Catalog (and vice-versa)
[Diagram labels: Delta Clients, Iceberg Clients; Unity REST, Iceberg REST; Federation & Mirroring: Glue, Horizon, REST, HMS]
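As a small illustration of what "open APIs" means for an Iceberg client, the sketch below builds the URL for the Iceberg REST "load table" route. The helper name `load_table_url` and the base URL are assumptions (the slide does not give an endpoint). The route shape and the use of the unit separator (0x1F) to join multi-level namespaces follow the Iceberg REST catalog specification.

```python
from urllib.parse import quote


def load_table_url(base_url, namespace, table, prefix=""):
    """URL for the Iceberg REST 'load table' route (hypothetical helper).

    `namespace` is a list of levels, e.g. ["sales", "eu"]. The REST spec
    joins levels with the unit separator (0x1F), which is then
    percent-encoded. `prefix` is the optional catalog prefix a server may
    return from its config endpoint.
    """
    ns = quote("\x1f".join(namespace), safe="")
    parts = [base_url.rstrip("/"), "v1"]
    if prefix:
        parts.append(prefix.strip("/"))
    parts += ["namespaces", ns, "tables", quote(table, safe="")]
    return "/".join(parts)


# Placeholder base URL; a real deployment publishes its own endpoint.
url = load_table_url(
    "https://catalog.example.com/iceberg/", ["sales", "eu"], "orders"
)
```

Any client that speaks these routes can issue a `GET` on the resulting URL (with suitable credentials) and receive the table's metadata, regardless of which catalog server is behind it.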

Slide 18

Slide 18 text

Break format silos with Databricks Unity Catalog
Govern all Delta, Iceberg, and legacy formats (ex: Parquet, CSV, JSON)
AI-driven predictive optimizations on all managed tables
● File Compaction, Snapshot expiry, etc.
[Diagram labels: Spark, Trino, Flink, Snowflake, DBX; Create table, Read table; Iceberg REST, Unity REST; Delta Lake, Iceberg; AI-driven Predictive Optimization]
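For readers unfamiliar with snapshot expiry, here is a minimal sketch of the selection rule such maintenance applies. It is not how Databricks predictive optimization decides what to expire (the slide only names the operation). The `Snapshot` type and `snapshots_to_expire` function are hypothetical, and the two parameters loosely mirror the `older_than` and `retain_last` arguments of Iceberg's `expire_snapshots` procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int  # commit time in epoch milliseconds


def snapshots_to_expire(snapshots, older_than_ms, retain_last=1):
    """Pick snapshots older than the cutoff, always keeping the
    `retain_last` most recent ones (illustrative rule only)."""
    newest_first = sorted(snapshots, key=lambda s: s.timestamp_ms, reverse=True)
    protected = {s.snapshot_id for s in newest_first[:retain_last]}
    return [
        s for s in newest_first
        if s.timestamp_ms < older_than_ms and s.snapshot_id not in protected
    ]


history = [Snapshot(1, 1_000), Snapshot(2, 2_000), Snapshot(3, 3_000)]
expired = snapshots_to_expire(history, older_than_ms=2_500, retain_last=1)
```

Here snapshots 2 and 1 fall before the cutoff and are selected, while snapshot 3 is kept. Running this kind of maintenance automatically on managed tables is what removes the need to schedule it by hand.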

Slide 19

Slide 19 text

Long-term vision of Delta and Iceberg
[Diagram labels: Delta Lake, Iceberg]
Format Unification
● Partnership with the Delta and Iceberg communities to unify the formats
● Consistent data and delete files for flexibility and performance
● Aligned table features to track row-level changes between versions of a table
