From Spec to Implementation: Iceberg REST Catalog with Hive Metastore

@okumin, Apache Iceberg Meetup Japan #5, Databricks Office, Tokyo, May 20, 2026

About Me
- Shohei Okumiya (@okumin)
- Apache Hive PMC Member
- Working at Treasure Data -> Treasure AI

ZooKage - Full-Featured Lakehouse, Locally
$ git clone --branch v0.5.0 https://github.com/zookage/zookage.git
$ cd zookage
$ ./bin/up

Agenda
- What is Apache Hive?
- Recap: Apache Iceberg Table Format
- Overview: Iceberg Catalogs
- Comparing Basic Operations: Hive Catalog vs. REST Catalog
  - How to read
  - How to write
- Advanced REST Catalog Features
  - Authentication and Authorization
  - Credential Vending
  - Metrics Reporting
  - Server-Side Planning

What is Apache Hive?

“Distributed Data Warehouse at Massive Scale”
- Hive Metastore: Metadata Repository
- HiveServer2: SQL Gateway
- Hive on Tez, Hive LLAP: Distributed Execution Engine
[Diagram labels: RDBMS, HDFS, Object Storage, HiveServer2, Hive Metastore, Hadoop, Hive on Tez, Kubernetes, Hive LLAP, Trino, Spark, Flink]

What Matters for Today’s Talk
- Hive Metastore: Metadata Repository
- HiveServer2: SQL Gateway
- Hive on Tez, Hive LLAP: Distributed Execution Engine
[Diagram labels: RDBMS, HDFS, Object Storage, HiveServer2, Hive Metastore, Hadoop, Hive on Tez, Kubernetes, Hive LLAP, Trino, Spark, Flink]

Recap: Apache Iceberg Table Format

Iceberg-related Data
- Iceberg Catalog: stores the current metadata pointer.
- Metadata File: stores the schema and the available snapshots.
- Manifest List: stores the list of Manifest Files included in a snapshot.
- Manifest File: stores the list of Data Files.
- Data File: stores the valid records as Parquet or another file format.
Source: Spec - Apache Iceberg™ https://iceberg.apache.org/spec/

What Matters for Today’s Talk
When operating an Iceberg table, we need to use a Catalog to obtain the metadata file location.
[Diagram labels: Catalog, Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), Data File (Parquet), Amazon S3]

What Matters for Today’s Talk
We also need three types of metadata-related files in distributed storage.
[Diagram labels: Catalog, Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), Data File (Parquet), Amazon S3]

What Matters for Today’s Talk
Finally, we need files storing the actual records as Parquet, ORC, and so on.
[Diagram labels: Catalog, Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), Data File (Parquet), Amazon S3]

Overview: Iceberg Catalogs

Iceberg Catalog Implementations
Catalog’s roles and responsibilities
- It is able to resolve the current metadata location by a table name
- It is able to update the mapping safely
[Diagram, client -> access layer -> backend]
- Spark + Hadoop Catalog -> Hadoop FileSystem API -> HDFS, S3
- Spark + JDBC Catalog -> JDBC Driver -> PostgreSQL, MySQL
- Spark + Hive Catalog -> Hive Thrift Client -> Hive Metastore
- Spark + REST Catalog -> Iceberg REST Client -> Iceberg REST API
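The two responsibilities above can be sketched as a toy in-memory catalog. This is only an illustration under stated assumptions: `InMemoryCatalog`, `swap`, and `CommitConflict` are hypothetical names, and real catalogs delegate the safe update to a filesystem, an RDBMS, Hive Metastore, or a REST server.

```python
import threading


class CommitConflict(Exception):
    """Raised when the pointer moved since the caller last read it."""


class InMemoryCatalog:
    """Toy catalog (hypothetical): maps a table name to its current metadata location."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current_metadata_location(self, table):
        # Role 1: resolve the current metadata location by table name.
        with self._lock:
            return self._pointers[table]

    def swap(self, table, expected, new):
        # Role 2: update the mapping safely (compare-and-swap under a lock).
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(table)
            self._pointers[table] = new


catalog = InMemoryCatalog()
catalog.swap("test", None, "s3://path/to/000-metadata.json")
catalog.swap("test", "s3://path/to/000-metadata.json", "s3://path/to/new-metadata.json")
```

A writer that still holds the old location as `expected` gets `CommitConflict` and must retry, which is the "safely" part of the second responsibility.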

Iceberg REST Catalog API Implementations
Managed Services
- Databricks Unity Catalog
- Cloudera Iceberg REST Catalog
- Snowflake Open Catalog
- AWS Glue
- Amazon S3 Tables
- Google Cloud's Lakehouse
- Microsoft Fabric OneLake
- Dremio Open Catalog
OSS
- Unity Catalog
- Apache Polaris
- Apache Gravitino
- Lakekeeper
- Project Nessie

Hive Metastore is a new one
OSS
- Unity Catalog
- Apache Polaris
- Apache Gravitino
- Lakekeeper
- Project Nessie
- Apache Hive - Hive Metastore
Managed Services
- Databricks Unity Catalog
- Cloudera Iceberg REST Catalog
- Snowflake Open Catalog
- AWS Glue
- Amazon S3 Tables
- Google Cloud's Lakehouse
- Microsoft Fabric OneLake
- Dremio Open Catalog

Iceberg REST Catalog API backed by Hive Metastore
- Hive Metastore translates an Iceberg REST API request into a Hive Catalog method call
- All semantics (e.g., what characters can be used as a table name) are identical to Hive Catalog
- We can deploy the REST API in HMS or add a standalone server
[Diagram, Embedded Mode: Trino REST Catalog -> (REST API) -> Hive Metastore (Iceberg REST API, Hive Catalog) -> RDBMS]
[Diagram, Standalone Mode (>= 4.3): Trino REST Catalog -> (REST API) -> API Server (Iceberg REST API, Hive Catalog) -> (Thrift RPC) -> Hive Metastore (Thrift API) -> RDBMS]

Comparing Basic Operations: Hive Catalog vs. REST Catalog

Hive Catalog vs REST Catalog: Read Path
What if we run a simple SELECT query using Trino?
-- Schema
CREATE TABLE test (name VARCHAR);
-- Query
SELECT * FROM test;

Read Path with Hive Catalog (1)
(1) Trino -> Hive Metastore (Thrift): GetTableRequest (Thrift)
(2) Hive Metastore -> PostgreSQL (Metadata Pointer): SELECT Current Pointer
(3) PostgreSQL -> Hive Metastore: s3://path/to/000-metadata.json
(4) Hive Metastore -> Trino: Hive Table

Read Path with Hive Catalog (2)
(5) Trino -> Amazon S3: GET s3://path/to/000-metadata.json
(6) Amazon S3 -> Trino: Schema, List of snapshots
(7) Trino -> Amazon S3: GET s3://path/to/snap-abc.avro
(8) Amazon S3 -> Trino: List of Manifest Files
(9) Trino -> Amazon S3: GET s3://path/to/m0.avro
(10) Amazon S3 -> Trino: List of Data Files
(11) Trino -> Amazon S3: GET s3://path/to/000.parquet
(12) Amazon S3 -> Trino: Records

Read Path with REST Catalog (1)
(1) Trino -> Hive Metastore (REST): Load Table (Iceberg REST)
(2) Hive Metastore -> PostgreSQL (Metadata Pointer): SELECT Current Pointer
(3) PostgreSQL -> Hive Metastore: s3://path/to/000-metadata.json

Read Path with REST Catalog (2)
(4) Hive Metastore -> Amazon S3: GET s3://path/to/000-metadata.json
(5) Amazon S3 -> Hive Metastore: Schema, List of snapshots
(6) Hive Metastore -> Trino: Iceberg Table

Read Path with REST Catalog (3)
(7) Trino -> Amazon S3: GET s3://path/to/snap-abc.avro
(8) Amazon S3 -> Trino: List of Manifest Files
(9) Trino -> Amazon S3: GET s3://path/to/m0.avro
(10) Amazon S3 -> Trino: List of Data Files
(11) Trino -> Amazon S3: GET s3://path/to/000.parquet
(12) Amazon S3 -> Trino: Records
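The traversal in both read paths (Metadata File, then Manifest List, then Manifest File, then Data File) can be sketched as follows. This is a simplified illustration: the object store is a plain dict keyed by the paths on the slides, and the Avro and Parquet files are stood in for by Python lists. Only the `current-snapshot-id`, `snapshots`, `snapshot-id`, and `manifest-list` keys mirror real table-metadata fields.

```python
# Toy object store; file contents are simplified stand-ins, not real Avro/Parquet.
OBJECT_STORE = {
    "s3://path/to/000-metadata.json": {
        "current-snapshot-id": 1,
        "snapshots": [
            {"snapshot-id": 1, "manifest-list": "s3://path/to/snap-abc.avro"},
        ],
    },
    "s3://path/to/snap-abc.avro": ["s3://path/to/m0.avro"],    # list of Manifest Files
    "s3://path/to/m0.avro": ["s3://path/to/000.parquet"],      # list of Data Files
    "s3://path/to/000.parquet": [{"name": "Alice"}],           # records
}


def scan(metadata_location, get=OBJECT_STORE.__getitem__):
    """Walk Metadata File -> Manifest List -> Manifest Files -> Data Files."""
    metadata = get(metadata_location)
    current = metadata["current-snapshot-id"]
    snapshot = next(s for s in metadata["snapshots"] if s["snapshot-id"] == current)
    records = []
    for manifest in get(snapshot["manifest-list"]):
        for data_file in get(manifest):
            records.extend(get(data_file))
    return records


rows = scan("s3://path/to/000-metadata.json")
```

With the Hive Catalog the client performs every `get` itself. With the REST Catalog the first `get` (the Metadata File) happens inside the catalog server.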

Wrap-up: Read Path
- With the REST Catalog, Hive Metastore reads the Metadata File on its own
- Hive Metastore has a few more optimization opportunities, e.g., caching Metadata Files (HIVE-29035)
[Diagram: side-by-side comparison of Hive Catalog (Hive Metastore (Thrift)) and REST Catalog (Hive Metastore (REST)), each with Trino, the Metadata Pointer in PostgreSQL, and the Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), and Data File (Parquet) in Amazon S3]

Hive Catalog vs REST Catalog: Write Path
What if we add a single row?
Note: An Iceberg client makes a few Load Table requests to learn the current table state. These slides omit that part.
-- For simplicity
SET SESSION <catalog>.merge_manifests_on_write = false;
-- Query
INSERT INTO test (name) VALUES ('Alice');

Write Path with Hive Catalog (1)
(1) Trino -> Amazon S3: PUT s3://path/to/new.parquet
(2) Amazon S3 -> Trino: 200 OK
(3) Trino -> Amazon S3: PUT s3://path/to/new.avro
(4) Amazon S3 -> Trino: 200 OK
(5) Trino -> Amazon S3: PUT s3://path/to/snap-xyz.avro
(6) Amazon S3 -> Trino: 200 OK
(7) Trino -> Amazon S3: PUT s3://path/to/new-metadata.json
(8) Amazon S3 -> Trino: 200 OK

Write Path with Hive Catalog (2)
(9) Trino -> Hive Metastore (Thrift): LockRequest (Thrift)
(10) Hive Metastore -> PostgreSQL: Acquire Lock
(11) PostgreSQL -> Hive Metastore: Lock Acquired
(12) Hive Metastore -> Trino: LockResponse

Write Path with Hive Catalog (3)
(13) Trino -> Hive Metastore (Thrift): GetTableRequest (Thrift)
(14) Hive Metastore -> PostgreSQL (Metadata Pointer): SELECT Current Pointer
(15) PostgreSQL -> Hive Metastore: Current Pointer
(16) Hive Metastore -> Trino: Hive Table
Retry or abort if the metadata location changes during lock acquisition

Write Path with Hive Catalog (4)
(17) Trino -> Hive Metastore (Thrift): AlterTableRequest (Thrift)
(18) Hive Metastore -> PostgreSQL (Metadata Pointer): Update Pointer
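Steps (9) to (18) amount to a client-driven "lock, re-check, alter" sequence. A minimal sketch, assuming a toy stand-in for Hive Metastore: `ToyMetastore` and `commit_with_hive_catalog` are hypothetical names, and the real client speaks Thrift rather than touching a dict. The final unlock is not shown on the slides and is included here as an assumption.

```python
import threading


class ToyMetastore:
    """Stand-in for Hive Metastore: one pointer and one table-level lock per table."""

    def __init__(self, pointers):
        self.pointers = dict(pointers)
        self.locks = {table: threading.Lock() for table in pointers}


def commit_with_hive_catalog(hms, table, base_location, new_location):
    """Client-side commit: returns True on success, False if the pointer moved."""
    lock = hms.locks[table]
    lock.acquire()                           # (9)-(12) LockRequest / LockResponse
    try:
        current = hms.pointers[table]        # (13)-(16) GetTableRequest
        if current != base_location:
            return False                     # caller retries or aborts
        hms.pointers[table] = new_location   # (17)-(18) AlterTableRequest
        return True
    finally:
        lock.release()                       # assumed unlock step, not on the slides


hms = ToyMetastore({"test": "s3://path/to/000-metadata.json"})
first = commit_with_hive_catalog(
    hms, "test", "s3://path/to/000-metadata.json", "s3://path/to/new-metadata.json")
stale = commit_with_hive_catalog(
    hms, "test", "s3://path/to/000-metadata.json", "s3://path/to/other-metadata.json")
```

The second call starts from a stale base location, so it returns `False` without touching the pointer.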

Write Path with REST Catalog (1)
(1) Trino -> Amazon S3: PUT s3://path/to/new.parquet
(2) Amazon S3 -> Trino: 200 OK
(3) Trino -> Amazon S3: PUT s3://path/to/new.avro
(4) Amazon S3 -> Trino: 200 OK
(5) Trino -> Amazon S3: PUT s3://path/to/snap-xyz.avro
(6) Amazon S3 -> Trino: 200 OK

Write Path with REST Catalog (2)
(7) Trino -> Hive Metastore (REST): Update Table w/ diff
(8) Hive Metastore -> Amazon S3: PUT s3://path/to/new-metadata.json
(9) Amazon S3 -> Hive Metastore: 200 OK

Write Path with REST Catalog (3)
(10) Hive Metastore -> PostgreSQL: Acquire Lock
(11) PostgreSQL -> Hive Metastore: Lock acquired

Write Path with REST Catalog (4)
(12) Hive Metastore -> PostgreSQL (Metadata Pointer): SELECT Current Pointer
(13) PostgreSQL -> Hive Metastore: Current Pointer
Retry or abort if the metadata location changes during lock acquisition

Write Path with REST Catalog (5)
(14) Hive Metastore -> PostgreSQL (Metadata Pointer): Update Pointer
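Step (7), "Update Table w/ diff", sends requirements and updates instead of a complete metadata file. A rough sketch of such a request body follows. The field names reflect my reading of the Iceberg REST OpenAPI spec and should be checked against the spec version in use. The snapshot IDs and the `build_commit_request` helper are made up for illustration.

```python
import json


def build_commit_request(parent_snapshot_id, snapshot):
    """Build an Update Table body: requirements guard the commit, updates carry the diff."""
    return {
        "requirements": [
            # Apply only if branch "main" still points at the snapshot we started from.
            {"type": "assert-ref-snapshot-id", "ref": "main",
             "snapshot-id": parent_snapshot_id},
        ],
        "updates": [
            {"action": "add-snapshot", "snapshot": snapshot},
            {"action": "set-snapshot-ref", "ref-name": "main", "type": "branch",
             "snapshot-id": snapshot["snapshot-id"]},
        ],
    }


# Hypothetical snapshot that points at the Manifest List written in step (5).
new_snapshot = {
    "snapshot-id": 2,
    "parent-snapshot-id": 1,
    "sequence-number": 2,
    "timestamp-ms": 0,
    "manifest-list": "s3://path/to/snap-xyz.avro",
    "summary": {"operation": "append"},
}
body = build_commit_request(parent_snapshot_id=1, snapshot=new_snapshot)
payload = json.dumps(body)  # would be POSTed to the table's update endpoint
```

The server, not the client, then writes the new Metadata File and swaps the pointer, which is why steps (8) to (14) originate from Hive Metastore.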

Wrap-up: Write Path
With the REST Catalog, Hive Metastore can abstract more operations, such as a table-level lock
[Diagram: side-by-side comparison of Hive Catalog (Hive Metastore (Thrift)) and REST Catalog (Hive Metastore (REST)), each with Trino, the Metadata Pointer in PostgreSQL, and the Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), and Data File (Parquet) in Amazon S3]

Advanced REST Catalog Features

Authentication and Authorization
- How to resolve user names and apply endpoint-level authorization?
  - OAuth 2, supported by HIVE-29020
  - AWS SigV4
  - Bearer Token
- Table-level access control
  - Apache Ranger
  - AWS Lake Formation
[Diagram: Catalog Server consults the Identity Provider ("She is Alice") and the Policy Store ("Alice can update Table X and Y")]

OAuth 2.0 + Authorization Plugin
• Configure Iceberg REST Catalog API endpoints as protected resources
• Resolve the username from the OAuth 2.0 token claims
• Retrieve the user's privileges from Apache Ranger and enforce table-level permissions
(1) Trino -> Authorization Server (e.g., Keycloak): Request Access Token
(2) Authorization Server -> Trino: Access Token
(3) Trino -> Hive Metastore: Request w/ Access Token
(4) Hive Metastore -> Authorization Server: Token Introspection
(5) Authorization Server -> Hive Metastore: Claim Set w/ ID source
(6) Hive Metastore (Ranger Plugin): Check Privileges
(7) Hive Metastore -> Trino: Response
Note: (4) and (5) are omitted when validating the Access Token as a JWT
[Diagram labels: Ranger Policy Store, Ranger Plugin, Sync]
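Steps (4) to (7) can be illustrated with a toy request handler. It is a sketch under assumptions, not how Hive Metastore or Ranger is implemented: the policy table is an in-memory dict, the username is taken from the `preferred_username` claim (one possible choice), and `handle_request` is a hypothetical name. The `active` flag is the one defined for OAuth 2.0 token introspection responses.

```python
# Hypothetical policy store: user -> table -> allowed operations.
POLICIES = {
    "alice": {"X": {"read", "update"}, "Y": {"read", "update"}},
}


def handle_request(claims, table, operation, policies=POLICIES,
                   username_claim="preferred_username"):
    """Return an HTTP-like status for a request carrying the given claim set."""
    # (4)-(5): an inactive or missing token means the caller is not authenticated.
    if not claims.get("active", False):
        return 401
    user = claims.get(username_claim)
    if user is None:
        return 401
    # (6): check table-level privileges for the resolved user.
    allowed = policies.get(user, {}).get(table, set())
    return 200 if operation in allowed else 403


alice = {"active": True, "preferred_username": "alice"}
ok = handle_request(alice, "X", "update")
forbidden = handle_request(alice, "Z", "update")
unauthenticated = handle_request({"active": False}, "X", "read")
```

Keeping authentication (who is calling) separate from authorization (what they may do) mirrors the Identity Provider and Policy Store split on the previous slide.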

Credential Vending
- Iceberg REST Catalog API returns storage credentials
- HIVE-29228 (work in progress)
[Diagram: Catalog Server -> Iceberg Client: Table Metadata with Credentials; Iceberg Client -> Storage: Read, Write, Delete Files]

Without Credential Vending
- Hive Metastore has read and write access to the entire /user/hive/warehouse
- Trino also has read and write access to the entire /user/hive/warehouse
[Diagram labels: Hive Metastore w/ 🔑, Trino w/ 🔑, Amazon S3 holding the Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), and Data File (Parquet)]

With Credential Vending
Trino uses credentials vended by Hive Metastore
(1) Trino -> Hive Metastore: LoadTable
(2) Hive Metastore -> AWS STS: Assume Role
(3) AWS STS -> Hive Metastore: Temporary Credentials 🔑
(4) Hive Metastore -> Trino: Iceberg Table w/ Temporary Credentials 🔑
(5) Trino -> Amazon S3: Access S3 w/ Temporary Credentials 🔑
Credentials are scoped to the requested table and permitted operations (a subset of GET, PUT, and DELETE)
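One way to picture "scoped to the requested table and permitted operations" is an IAM session policy that the catalog could attach when assuming a role in step (2). The sketch below only builds such a policy document. The bucket name, table location, and `scoped_session_policy` helper are hypothetical, and a real deployment would likely need further statements (for example, listing permissions).

```python
import json
from urllib.parse import urlparse

# Mapping from the operations on the slide to S3 object actions.
S3_ACTIONS = {"GET": "s3:GetObject", "PUT": "s3:PutObject", "DELETE": "s3:DeleteObject"}


def scoped_session_policy(table_location, operations):
    """Build an IAM policy limited to objects under one table's location."""
    parsed = urlparse(table_location)
    bucket, prefix = parsed.netloc, parsed.path.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": sorted(S3_ACTIONS[op] for op in operations),
            "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
        }],
    }


# Hypothetical table location; read-only access for a SELECT-only user.
policy = scoped_session_policy("s3://example-bucket/user/hive/warehouse/test", {"GET"})
policy_json = json.dumps(policy)  # could be passed as the session policy to AssumeRole
```

Because the temporary credentials carry this policy, Trino no longer needs standing access to the entire warehouse path.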

Metrics Reporting
- Iceberg clients can report useful metrics via the REST API
  - Scan Report: The table name, scan conditions, the number of scanned files, etc.
  - Commit Report: The table name, the number of created or deleted files and records
- The REST Catalog enables centralized server-side management of metrics
[Diagram: Iceberg Client 1, Iceberg Client 2, Iceberg Client 3 -> Catalog Server -> ???]

Metrics Reporting
- HIVE-29593 (>= 4.3): Hive Metastore administrators can implement and deploy metrics-handling plugins
[Diagram: Trino, Spark, Flink -> Hive Metastore -> Kafka, Iceberg, Datadog]
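To make the plugin idea concrete, here is a sketch of what a metrics handler might look like. It is purely illustrative: it does not reflect the plugin interface introduced by HIVE-29593, `CountingMetricsHandler` and its `handle` method are hypothetical, and the `report-type` and `table-name` keys follow my reading of the REST spec's metrics report payload.

```python
from collections import Counter


class CountingMetricsHandler:
    """Hypothetical plugin: counts incoming reports per table and report type."""

    def __init__(self):
        self.counts = Counter()

    def handle(self, report):
        # A real plugin might forward the report to Kafka, an Iceberg table, or Datadog.
        key = (report.get("table-name"), report.get("report-type"))
        self.counts[key] += 1


handler = CountingMetricsHandler()
# Simplified example reports; real reports carry many more fields.
handler.handle({"report-type": "scan-report", "table-name": "test"})
handler.handle({"report-type": "scan-report", "table-name": "test"})
handler.handle({"report-type": "commit-report", "table-name": "test"})
```

Since every engine reports through the same endpoint, one handler sees scans and commits from Trino, Spark, and Flink alike.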

Server-Side Planning
- Resolve the list of Data Files on the REST Catalog API
- The spec is available
- The Java client implementation was shipped with Apache Iceberg 1.11.0
-> We can start implementing and testing it
[Diagram: Iceberg Client -> Catalog Server: Snapshot ID, Projection, Predicate, etc.; Catalog Server -> Iceberg Client: List of Data Files]

Recap: Load Table + Client-Side Planning
[Diagram labels: Trino, Hive Metastore, Metadata Pointer in PostgreSQL, and the Metadata File (JSON), Manifest List (Avro), Manifest File (Avro), and Data File (Parquet) in Amazon S3]

With Server-Side Planning
(1) Trino -> Hive Metastore: Submit Scan Planning w/ Scan Conditions
(2) Hive Metastore -> Amazon S3: Scan Metadata
(3) Hive Metastore -> Trino: Locations of Data Files
(4) Trino -> Amazon S3: Read Data Files
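Steps (1) to (3) move file pruning to the server. The following toy planner shows the idea with per-file lower and upper bounds on the `name` column. The manifest entries, the second file path, and the `plan_scan` helper are invented for illustration, and real planning evaluates arbitrary predicates against partition data and column statistics.

```python
# Simplified manifest entries: data file path plus min/max of the "name" column.
MANIFEST_ENTRIES = [
    {"path": "s3://path/to/000.parquet", "lower": "Alice", "upper": "Carol"},
    {"path": "s3://path/to/001.parquet", "lower": "Dave", "upper": "Frank"},
]


def plan_scan(entries, name_equals):
    """Server side: keep only the Data Files whose bounds may contain the value."""
    return [e["path"] for e in entries if e["lower"] <= name_equals <= e["upper"]]


# The client submits only the condition and gets back the file locations to read.
files_to_read = plan_scan(MANIFEST_ENTRIES, "Bob")
```

The client never downloads the Manifest List or Manifest Files in this mode. It only reads the Data Files returned by the plan.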

Key Takeaways
- Apache Iceberg REST Catalog is gaining strong momentum
- Hive Metastore is actively adding support for the Iceberg REST API
- REST Catalog makes it easier to introduce advanced features
Special Thanks
- Treasure AI colleagues for their review
  - Keisuke Suzuki, Masafumi Koba
