Introducing Apache Iceberg in a Large-Scale Data Analytics Platform: Snowflake × Iceberg × α

1 Introducing and Standardizing Apache Iceberg in a Large-Scale Data Analytics Platform: Snowflake × Iceberg × α

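2 History of the data analytics platform: a data platform centered on the network domain
Timeline: On-premise (Netezza, Greenplum) → 2014: AWS Redshift → 2020: AWS Redshift & Google Cloud BigQuery → 2024 / 2025: Snowflake
• Tens of PB of compressed data held at steady state
• Hundreds of TB of compressed data processed every day
• Up to roughly 650 TB and 25 trillion records in a single table
Challenge: flexibility and scalability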

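3 The data analytics platform so far: Centralized Data Architecture
[Diagram labels: Data Warehouse / Lake, ETL / ELT, Replication]
• A single Data Warehouse / Lake
• Users consume the data
• ETL/ELT runs in one place
Challenges:
- Limited understanding of the characteristics and content of the data
  → Hard to shape the data optimally
  → Questions about the data cannot be answered
- Limited flexibility and capacity of human resources
  → Delays in processing and publishing data
Coping with growing data volumes requires higher scalability, flexibility, and agility.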

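4 Data Mesh: Decentralized Data Architecture
• Self-service, decentralized management
• Data Owners take responsibility for and manage their data
Each Data Owner manages the ROI of the data they own.
[Diagram labels: Data Mesh, Data collection team, Unidirectional, Bidirectional]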

5 Realizing a Data Mesh with Snowflake Internal Marketplace
[Diagram labels: Snowflake Account (Domain) × 3, Organization Account, Zero Data Account]
Internal Marketplace and the Organization Account are used for:
• Per-domain access control
• Centralized governance (e.g. log management)
• Turning data into products and publishing those data products
• Clear cost separation, etc.

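6 Bringing multiple Data Lakes / Data Warehouses into the Data Mesh
[Diagram labels: External Data Lake / Warehouse × 2]
What is the best way to use data held in external Data Lakes / Warehouses?
• Keep duplicate copies of data to a minimum
• Guarantee at least a certain level of query speed
• Integrate data closer to real time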

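7 Data Mesh with Iceberg: Decentralized Data Architecture
[Diagram labels: External Iceberg Catalog & Data, 3rd party Analytics Service]
Apache Iceberg, an open table format, was adopted:
• Iceberg is already supported, or is planned to be supported, by many services
• It delivers at least a certain level of performance
"Build a world of advanced interoperability with Apache Iceberg as the standard."
The use of Apache Iceberg in Data Lakes / Warehouses is recommended and standardized.
* Mainly in the network domain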

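8 Processing performance of Managed Iceberg tables on Gen2 warehouses
The performance of Apache Iceberg tables was verified under the following conditions:
• Warehouse: Gen2 Warehouse
• Warehouse size: changed as needed while keeping the measurement conditions aligned
• SQL: queries similar to those commonly run in our environment
[Charts: credits consumed and execution time for Query0 (insert / sort)]
Switching to Iceberg tables caused almost no performance degradation.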

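9 Processing performance of Managed Iceberg tables on Gen2 warehouses
[Charts: credits consumed and execution time]
• Query1: select / sort
• Query2: select / group
• Query3: select / join
• Query3': same as Query3 with a different warehouse size
Switching to Iceberg tables caused almost no performance degradation.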

10 Reference: measurements under the same conditions using a standard warehouse
[Charts: execution time and credits consumed]

11 Reference: measurement results on a different data set in the Japan region, April to May 2025
* A standard warehouse was used
Excerpt from Snowflake Summit 2025 material

12 Iceberg table performance
Comparing the performance of standard tables and Iceberg tables (* Snowflake Managed) in Snowflake:
• Data loading
• Transformations on standard tables / Iceberg tables

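13 Data loading
Conditions:
• Data is loaded every hour
• The credits and execution time required to load data into the standard table and into the Iceberg table are recorded separately
Note: the copy option "add_files" is not used.
[Charts: processing execution time; credits consumed and number of records]
In most cases, standard tables were slightly more cost-efficient than Iceberg tables.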

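14 Data transformation
Pattern 1
• Warehouse size: xlarge
• Records processed per hour: 1 billion to 4 billion
Pattern 2
• Warehouse size: xlarge
• Records processed per hour: 200 million to 600 million
Hourly processing time (in slide order): 200 s to 600 s; 100 s to 300 s
* The processed results are inserted into the target table.
Query cost and execution time on Iceberg tables were about the same as, or only very slightly worse than, on standard tables.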

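15 Data transformation (continued)
Pattern 3
• Warehouse size: xlarge
• Records processed per hour: 250 million to 100 million
Pattern 4
• Warehouse size: xlarge
• Records processed per hour: 50 to 20
Pattern 5
• Warehouse size: xlarge
• Records processed per hour: 500 to 1,000
Hourly processing time (in slide order): 100 s to 400 s; 50 s to 125 s; 50 s to 200 s
Note: The upper graph shows query execution time only and excludes compilation and provisioning time, which is why it differs from the credit-consumption graph below it.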

16 Table overview (number of columns per data type category)
Pattern  Number of records  String & binary  Numeric  Date & time  Logical
1        166178953          27               43       2            0
2        42302224392        4                12       3            0
3        52164400520        3                14       3            0
4        75981040316        15               50       4            1
5        79968409141        21               44       2            0

17 Performance of Iceberg in our environment
Storage for Snowflake-managed Iceberg tables is more cost-effective than storage for standard managed tables.
• AWS S3 on-demand price: $25.00 (USD) per TB per month [AWS Tokyo, first 50 TB / month]
• Snowflake on AWS Tokyo storage price: $25.00 (USD) per TB per month
Comparison of active storage (Time Travel and Fail-safe are not included)
[Chart panels: 1 GB and above; 100 GB (0.1 TB) and above]
At large data sizes, Iceberg is about 1% to 30% more cost-effective than standard tables.
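As a rough illustration of the storage comparison above, the following Python sketch computes a monthly active-storage bill at the listed $25.00 per TB and applies the 1% and 30% ends of the reported range. The 40 TB table size and the function names are hypothetical, and the sketch assumes a flat price with no tiering and that the saving applies directly to billed bytes.

```python
# Price listed on the slide for both AWS S3 (Tokyo, first 50 TB) and Snowflake on AWS Tokyo.
PRICE_PER_TB_MONTH_USD = 25.00


def monthly_storage_cost(size_tb: float, price_per_tb: float = PRICE_PER_TB_MONTH_USD) -> float:
    """Monthly active-storage cost in USD, assuming a flat per-TB price."""
    if size_tb < 0:
        raise ValueError("size_tb must be non-negative")
    return size_tb * price_per_tb


def cost_after_saving(base_cost: float, saving_ratio: float) -> float:
    """Cost after applying a fractional saving, e.g. 0.30 for 30%."""
    if not 0.0 <= saving_ratio <= 1.0:
        raise ValueError("saving_ratio must be between 0 and 1")
    return base_cost * (1.0 - saving_ratio)


if __name__ == "__main__":
    table_size_tb = 40.0  # hypothetical table size, not taken from the slides
    base = monthly_storage_cost(table_size_tb)
    for ratio in (0.01, 0.30):  # both ends of the reported 1% to 30% range
        print(f"saving {ratio:.0%}: {cost_after_saving(base, ratio):.2f} USD / month")
```

Because the per-TB price is the same on both sides, any difference in the bill in this sketch comes entirely from the number of bytes stored.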

18 Iceberg shows high performance
In our workload:
• The cost of executing queries remained almost the same or was slightly worse.
• The cost of storage improved by 1% to 30%.

19 Analysis
Snowflake Managed Iceberg Table
• The FLOAT type is written to the file as 32-bit single precision, following the Iceberg float type specification.
• 32-bit floating-point number (float): because fewer bits are used, the range of representable numbers is narrower and precision is limited to approximately seven decimal digits.
Snowflake Managed Table
• A stored FLOAT is always treated as 64-bit double precision.
• 64-bit floating-point number (double): more bits are used for the exponent and mantissa, so a very wide range of numbers can be stored with a precision of about 15 to 16 decimal digits.
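The precision gap described above can be checked with a small Python sketch. It is an illustration only: it round-trips values through IEEE 754 single precision with the standard `struct` module, which mimics storing a value as a 32-bit float. The helper name `to_float32` and the sample values are assumptions made for this example and are not taken from the measured workload.

```python
import struct


def to_float32(value: float) -> float:
    """Round-trip a Python float (64-bit double) through 32-bit single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Bytes needed per value in each representation.
FLOAT32_BYTES = struct.calcsize("<f")
FLOAT64_BYTES = struct.calcsize("<d")

if __name__ == "__main__":
    print(f"bytes per value: single={FLOAT32_BYTES}, double={FLOAT64_BYTES}")
    # Sample values chosen for illustration only.
    for sample in (0.5, 0.1, 123456789.0):
        print(f"double={sample!r:>20}  single={to_float32(sample)!r}")
```

Values that are exactly representable in binary, such as 0.5, survive unchanged, while 0.1 and nine-digit integers are rounded to the nearest single-precision value. Each value also occupies half as many bytes before encoding and compression, which is one plausible contributor to smaller files.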

20 Patterns currently in production use
[Diagram labels: External Iceberg Catalog, Snowflake managed Iceberg Catalog, Managed Table]
Databricks and Google BigLake Metastore; others are under verification.
(Shown on-site only)

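21 Snowflake × Databricks: the "always restrict by IP" doctrine cannot always be followed
◾Point:
Snowflake
- Inbound: fine-grained IP restrictions can be set per environment and per user
- Outbound: the source IP cannot be fixed
Databricks
- Outbound: the source IP can be fixed
- Inbound: IP restrictions can only be set at the coarse level of the environment (workspace)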

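22 A pattern where the doctrine cannot be followed
Case: Databricks Unity Catalog is used as an external catalog so that Snowflake can query Iceberg tables held in Databricks.
IP restriction is not possible = the workspace's inbound access is configured without IP restrictions.
* The source IP on the Snowflake side cannot be fixed.
As long as access comes from the Snowflake side, IP restrictions cannot be applied.
(Shown on-site only)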

23 Data Mesh and maximizing Data ROI
[Diagram labels: Data ROI, Data Owner, CoE/Governance Team, Internal Marketplace (Catalog of Data Products), Data Source, Domain per Account, Organization Account, Zero Data Account, Snowflake Data Cloud, 3rd Party Analytics Service, Data Subscriber / Consumer]
[Diagram flows: Provide Data Products; Subscribe Data Products; Access External Iceberg Table from Snowflake; Access Snowflake Managed Iceberg Table from Spark]
With self-service as the foundation, the platform keeps a high degree of flexibility while appropriate governance helps maximize Data ROI.

Matsubara
