object storage
• Multiple apps
  ◦ Producing, accessing, and processing data
• Multiple use cases
  ◦ Data engineering
  ◦ ML / AI
  ◦ Reporting

4 Lakehouse - Overview
do you find the data you need?
• Access
  ◦ How do I gain access to data?
• Observability
  ◦ Who is accessing data?
  ◦ What is accessed + how?
• Lineage
  ◦ How was this data produced?

5 Lakehouse - Zooming in
through Hive and see your tables
• Access
• Tells me the following so that I know how to interact with the table:
  ◦ Location
  ◦ Format
  ◦ Schema
  ◦ IAM permissions required to access the storage locations

7 Solving how engines find out datasets
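The lookup described above can be sketched as follows. This is an illustrative toy, not a real Hive client: the catalog is an in-memory dict, and all table names, paths, and schemas are hypothetical. It shows the three pieces of metadata an engine needs before it can read a table.

```python
# Illustrative sketch: the kind of metadata a Hive-style catalog hands
# back so an engine knows how to interact with a table.
# All names and values are hypothetical.

def lookup_table(catalog: dict, schema: str, table: str) -> dict:
    """Return the metadata an engine needs to read a table."""
    return catalog[f"{schema}.{table}"]

# Toy in-memory "catalog" holding the slide's three items:
# location, format, and schema.
catalog = {
    "sales.orders": {
        "location": "s3://warehouse/sales/orders/",  # where the files live
        "format": "iceberg",                         # how to parse them
        "schema": [("order_id", "bigint"), ("amount", "decimal(10,2)")],
    }
}

meta = lookup_table(catalog, "sales", "orders")
print(meta["format"])    # engine picks a reader based on this
print(meta["location"])  # IAM permissions gate access to this path
```

The engine never scans storage to discover tables; it asks the catalog, then uses the returned location and format to plan the read.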
exposed through a REST API
• Single client to talk to any custom catalog backend
• Shifting responsibility from client to catalog
  ◦ Metadata file generation

Hadoop / Hive / JDBC / Nessie / etc. → Iceberg REST

9 Different Catalog Implementations
https://iceberg.apache.org/concepts/catalog/?h=catalog#overview

spark.read
  .format("iceberg")
  .load("hdfs://host:8020/catalog/schema/table");
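The "single client" idea can be sketched as follows: the client only builds uniform HTTP requests, while the server (backed by Hive, JDBC, Nessie, or anything else) does the backend-specific work such as metadata file generation. The endpoint shape follows the Iceberg REST catalog specification; the host and table names are made up.

```python
# Sketch of addressing any REST-catalog backend through one uniform API.
# BASE is a hypothetical endpoint; the path shape follows the Iceberg
# REST catalog spec's loadTable operation.

BASE = "https://catalog.example.com/v1"

def load_table_url(namespace: str, table: str) -> str:
    # GET /v1/namespaces/{namespace}/tables/{table} returns the table's
    # current metadata, so the client never has to locate or parse
    # metadata files itself.
    return f"{BASE}/namespaces/{namespace}/tables/{table}"

print(load_table_url("schema", "table"))
# https://catalog.example.com/v1/namespaces/schema/tables/table
```

Because every backend speaks the same HTTP surface, swapping Hive for Nessie requires no client-side change, only a different base URI.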
AI assets
• Multi-format: support any table format, incl. Delta, Iceberg, Parquet, CSV, JSON
• Unified: single catalog which can govern access across your entire data estate
[Diagram: Open Lakehouse for Data + AI. A catalog governing Tables and Views, Objects (Image, Audio, PDF, Parquet, CSV, JSON), and AI/ML assets across engines and platforms (Microsoft Fabric, Google Cloud, LlamaIndex), providing Governance, Discovery, Lineage, and Observability.]
Horizon / REST / HMS

Unify the lakehouse with Databricks Unity Catalog
• Write and read from any Iceberg client using open APIs (Unity or Iceberg REST)
• Access and govern data in Foreign Catalogs from Unity Catalog (and vice versa)

Iceberg Clients
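A minimal sketch of what "any Iceberg client using open APIs" looks like from Spark's side, assuming Unity Catalog's Iceberg REST endpoint. The catalog name, workspace URL, endpoint path, and token below are placeholders; consult your workspace documentation for the exact REST URI.

```properties
# Hypothetical Spark conf pointing an Iceberg REST catalog at Unity Catalog.
# <workspace-url> and <personal-access-token> are placeholders.
spark.sql.catalog.unity         org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.unity.type    rest
spark.sql.catalog.unity.uri     https://<workspace-url>/api/2.1/unity-catalog/iceberg
spark.sql.catalog.unity.token   <personal-access-token>
```

With this in place, any Iceberg-capable Spark session can address governed tables as `spark.table("unity.schema.table")`, without a Databricks-specific client.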
Unification
• Partnership with the Delta and Iceberg communities to unify the formats
• Consistent data and delete files for flexibility and performance
• Aligned table features to track row-level changes between versions of a table