rights reserved.
Tomohiro Tanaka
Senior Cloud Support Engineer, AWS Support, Amazon Web Services
• Responsible for resolving the most complex issues and advising on best practices with Iceberg
• Contributes to the Apache Iceberg OSS project
Iceberg format version
• Defines the specification that Iceberg clients must follow
• Not the same as the Iceberg library version (e.g. iceberg-1.10.1)
• Incremented when new features break forward compatibility, i.e. older clients cannot read tables that use them
• New features go into the next version once the current version's spec is finalized
• Stored in the table's metadata.json file
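As a minimal sketch of the last point: the format version appears in metadata.json under the `format-version` key, so a client can check it before reading. The sample document and the helper names `read_format_version` and `can_read` below are hypothetical and only illustrate the idea; real metadata files carry many more fields.

```python
import json

# Hypothetical, trimmed-down metadata.json content (assumption for illustration).
SAMPLE_METADATA = '{"format-version": 3, "location": "s3://bucket/db/tbl"}'


def read_format_version(metadata_json: str) -> int:
    """Return the table format version recorded in a metadata.json document."""
    metadata = json.loads(metadata_json)
    return int(metadata["format-version"])


def can_read(client_max_version: int, metadata_json: str) -> bool:
    """A client can read the table only if it supports the table's format version."""
    return read_format_version(metadata_json) <= client_max_version


print(read_format_version(SAMPLE_METADATA))  # format version of the sample table
print(can_read(2, SAMPLE_METADATA))          # a V2-only client against a V3 table
```

The check mirrors the compatibility rule on this slide: a client limited to V2 must not attempt to read a table whose metadata declares V3.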
History of Iceberg format versions
• Version 1: Iceberg's fundamental table mechanisms
• Version 2: Row-level deletes
§ https://github.com/apache/iceberg/milestone/4
• Version 3 (latest): Extending data types and capabilities
§ https://github.com/apache/iceberg/milestone/42
• Version 4 (under development): Adaptive table tree structure & single file commits, relative paths, column stats improvements, etc.
§ https://github.com/apache/iceberg/milestone/58
Ref: https://github.com/apache/iceberg/milestones
Features in Iceberg V3 Spec
• Extended data types
§ Unknown
§ Variant
§ Timestamp (9) with or without time zone
§ Geo (Geometry/Geography)
§ New type promotions
• New capabilities
§ Deletion Vectors
§ Row Lineage
§ Default values
§ Multi-argument transforms
§ Table encryption keys
Variant type
• Proposal: https://github.com/apache/iceberg/issues/10392
§ doc: https://docs.google.com/document/d/1sq70XDiWJ2DemWyA5dVB80gKzwi0CWoM0LOWM7VJVd8/
• Stores semi-structured data (e.g. JSON, Avro) in a single column, rather than as a String
• Binary encoding of the semi-structured data improves performance compared to storing it as a String
• To extract typed values from a variant, use the variant_get function (in Spark)
• Supported in Iceberg 1.10.0+ and Spark 4.0+
Example (3/3) – Read a variant field
SELECT
  variant_get(data, '$.device_id', 'string') AS dev_id,
  variant_get(data, '$.color', 'string') AS device_color
FROM variant
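To make the path-and-cast idea behind the query above concrete, here is a rough Python emulation. It is not Spark's implementation: `variant_get_sketch` is a hypothetical helper that only handles simple `$.a.b` object paths on an already parsed JSON value, and the sample record is made up for illustration.

```python
import json


def variant_get_sketch(value, path, cast=str):
    """Walk a '$.a.b' style path through nested dicts and cast the result.

    Returns None when the path does not exist (assumed behavior for this sketch).
    """
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = value
    # Split "$.device_id" into ["device_id"]; empty segments are skipped.
    for key in [part for part in path[1:].split(".") if part]:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return cast(current)


# Hypothetical record standing in for one value of the `data` column.
record = json.loads('{"device_id": 42, "color": "red"}')

dev_id = variant_get_sketch(record, "$.device_id")        # cast to string
device_color = variant_get_sketch(record, "$.color")
missing = variant_get_sketch(record, "$.firmware")        # path not present
```

The point of the sketch is only that the caller names a path and a target type; with a real variant column the engine does this against the binary encoding rather than a re-parsed string.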
V2 Row-level deletes
Files: 01234-5-abdcefg.parquet (data), 67890-1-hijklmn-deletes.parquet (deletes)

Before the delete:
id      drink        price
1       milk         3.00
2       cocoa        4.00
3       espresso     5.00
…       …            …
100000  white mocha  6.00

DELETE FROM db.tbl WHERE id = 2

• File-level deletes: Create a new file without deleted records.
id      drink        price
1       milk         3.00
3       espresso     5.00
…       …            …
100000  white mocha  6.00

• Row-level deletes: Only delete files are created.
file_path              pos
s3://bucket/a.parquet  1
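A minimal sketch of how a reader could apply the position delete record shown on this slide: the data file is left untouched, and rows whose 0-based position is listed for that file are skipped at read time. The names `PositionDelete` and `read_with_position_deletes` are hypothetical, and plain Python lists stand in for Parquet files.

```python
from typing import NamedTuple


class PositionDelete(NamedTuple):
    file_path: str  # data file the delete applies to
    pos: int        # 0-based row position within that data file


def read_with_position_deletes(file_path, rows, deletes):
    """Return the live rows of one data file, skipping deleted positions."""
    deleted = {d.pos for d in deletes if d.file_path == file_path}
    return [row for pos, row in enumerate(rows) if pos not in deleted]


# First three rows of the slide's table; position 1 is the row with id = 2.
rows = [(1, "milk", 3.00), (2, "cocoa", 4.00), (3, "espresso", 5.00)]
deletes = [PositionDelete("s3://bucket/a.parquet", 1)]

live = read_with_position_deletes("s3://bucket/a.parquet", rows, deletes)
```

The write is cheap because only the small delete record is added; the cost moves to readers, which must combine data files and delete files on every scan.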
Example (2/3) – Delete in V2
DELETE FROM review WHERE review_year <= 2015

/path/to/warehouse/data
- 00000-229-ff2ba929-305a-4a39-a1d4-1ac7f513c1d4-0-00001.parquet
- 00001-230-ff2ba929-305a-4a39-a1d4-1ac7f513c1d4-0-00001.parquet
- 00002-231-ff2ba929-305a-4a39-a1d4-1ac7f513c1d4-0-00001.parquet
- 00000-110-eb243d8e-bfd4-4b9c-833a-e1e041f07c1c-00001-deletes.parquet
- 00001-111-eb243d8e-bfd4-4b9c-833a-e1e041f07c1c-00001-deletes.parquet
- 00002-112-eb243d8e-bfd4-4b9c-833a-e1e041f07c1c-00001-deletes.parquet

Tip: The number of delete files can be controlled with write.delete.granularity (default: file in Spark).
How can I migrate Iceberg V2 to V3?
• Run:
§ ALTER TABLE db.tbl SET TBLPROPERTIES ('format-version'='3') (Spark SQL)
• Update the table format version last. Upgrade readers/writers (and the REST Catalog) first.
• Note that:
§ The compute engines you use must implement the V3 features you need.
§ A table cannot be changed back to an older version (e.g. V3 to V2).
§ Readers/writers that support a newer version can also read/write tables of older versions, but not the other way around.
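The ordering above (clients first, format version last, no downgrade) can be sketched as a small pre-flight check. Everything here is an assumption for illustration: `blocking_clients` and `plan_upgrade` are hypothetical helpers, and the client inventory would come from your own environment rather than from Iceberg.

```python
from typing import Dict, List


def blocking_clients(client_max_versions: Dict[str, int], target: int) -> List[str]:
    """Names of readers/writers that do not yet support the target format version."""
    return sorted(name for name, v in client_max_versions.items() if v < target)


def plan_upgrade(current: int, target: int, client_max_versions: Dict[str, int]) -> str:
    """Return the Spark SQL statement to run, or raise if the upgrade is unsafe."""
    if target < current:
        # Format versions cannot be changed back to an older version.
        raise ValueError(f"cannot downgrade from V{current} to V{target}")
    blockers = blocking_clients(client_max_versions, target)
    if blockers:
        # Upgrade these clients first; the table property change comes last.
        raise RuntimeError(f"upgrade these clients first: {blockers}")
    return f"ALTER TABLE db.tbl SET TBLPROPERTIES ('format-version'='{target}')"


# Hypothetical inventory: every client already supports V3.
statement = plan_upgrade(2, 3, {"spark-writer": 3, "query-engine": 3})
```

Running the check before issuing the ALTER TABLE matters because the version change is one-way: a client left behind on V2 would lose access to the table with no rollback.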
Other considerations for V3 migration
• Deletion Vectors:
§ Readers with V3 support can still read V2 position delete files after the V3 migration.
§ Once a V3 delete occurs, the content of the existing V2 position delete files is moved into V3 Puffin files.
§ Running the rewrite_position_delete_files Spark procedure on a V3 table merges the existing V2 position delete files into V3 Puffin files.
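The merge described above can be pictured, for a single data file, as taking the union of the positions already recorded in V2 position delete files and the positions removed by the new V3 delete. This is a simplified sketch: `merge_into_deletion_vector` is a hypothetical helper, and a Python set stands in for the compact bitmap that a real deletion vector stores in a Puffin file.

```python
from typing import Iterable, List


def merge_into_deletion_vector(
    existing_v2_positions: Iterable[int],
    new_positions: Iterable[int],
) -> List[int]:
    """Combine old V2 position deletes and new deletes for one data file.

    The result is the sorted set of deleted row positions, which is the
    information a single deletion vector for that data file would hold.
    """
    merged = set(existing_v2_positions)
    merged.update(new_positions)
    return sorted(merged)


# Hypothetical example: positions 1 and 7 were deleted under V2,
# then a V3 delete removes positions 3 and 7 from the same data file.
vector = merge_into_deletion_vector([1, 7], [3, 7])
```

After the merge, a reader needs only the one vector for that data file instead of scanning several position delete files.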
Other considerations for V3 migration
• Row Lineage:
§ After the version is changed to V3, next-row-id is assigned, but this does not affect the existing records.
§ Once table records are changed, _row_id is assigned to each row, and _last_updated_sequence_number is inherited.
Iceberg V4 Spec
• The V3 spec was finalized around the summer of 2025.
• The V4 spec is currently under discussion.
• V4 topics:
§ Single file commits & Adaptive metadata tree (the doc was merged)
– proposal doc: https://s.apache.org/iceberg-single-file-commit
– Learn: https://qiita.com/m-masataka/items/aa3de63618e2d48433a6
§ Relative Path Spec (Issue#13141)
– proposal doc: https://s.apache.org/iceberg-spec-relative-path
§ Column stats improvement (Issue#13153)
– proposal doc: https://s.apache.org/iceberg-column-stats
Key takeaways
• Iceberg format versions and the evolution of Iceberg
• Some V3 features are still under development in each engine.
• The Variant type enables performance optimizations when processing semi-structured values.
• Deletion Vectors can reduce storage cost and help improve read/write performance.
• Migration steps to V3 (readers/writers first, then the format version), and considerations
• Iceberg V4 Spec