PG-Strom v6.0 新機能とその先の未来

PG-Strom v6.0 新機能とその先の未来 HeteroDB,Inc Chief Architect & CEO 海外浩平
<[email protected]>

自己紹介／HeteroDB社について 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 2 会社概要  商号ヘテロDB株式会社
 創業 2017年7月4日  拠点東京都品川区北品川  事業内容高速データベース製品の販売 GPU＆DB領域の技術コンサルティングヘテロジニアスコンピューティング技術をデータベース領域に適用し、誰もが使いやすく、安価で高速なデータ解析基盤を提供する。代表者プロフィール  海外浩平（KaiGai Kohei）  OSS開発者コミュニティにおいて、PostgreSQLやLinux kernelの開発に10年以上従事。主にセキュリティ・FDW等の分野でアップストリームへの貢献。  IPA未踏ソフト事業において“天才プログラマー”認定 (2006)  GPU Technology Conference Japan 2017にてInception Awardを受賞

製品概要 PG-Stromとは？ 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 3 【機能】  集計／解析ワークロードの透過的なGPU高速化
➔ PostgreSQLの運用ノウハウやソフトウェア資産を”そのまま”活かす事ができる。  SQLからGPU命令コードを自動生成し数千コアによる超並列実行  GPU-Direct SQL：NVMEからGPUへの直接読み出しによるI/O最適化  Apache Arrowに対応し、IoT/M2Mログデータを“そのまま”処理できる PG-Strom: GPUとNVMEの能力を最大限に引き出し、テラバイト級のデータを高速処理するPostgreSQL向け拡張モジュール App GPU off-loading ➢ GpuScan / GpuJoin / GpuPreAgg ➢ GpuSort / Window関数 ➢ GPU-Direct SQL ➢ Apache Arrow support (with min/max statistics, virtual columns by the filenames) ➢ Updatable GPU Cache ➢ Asymmetric Partition-wise JOIN/GROUP BY ➢ Large Tables GPU-JOINs ➢ BRIN-Index support ➢ PostGIS support with GiST-index ➢ pg2Arrow / mysql2arrow / pcap2arrow

本日のアジェンダ ▌PG-Stromの歴史 ▌PG-Strom v6.0の新機能 ▌今後のロードマップ 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 4

1. PG-Stromの歴史 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 5

PG-Stromの歴史（1/4）（2017～） PG-Strom （2012～）（2014～）（旧MAP-D; 2013～）（2016～）（2014～）（2015～）
（2016～） ➢ 2010年頃「ビッグデータ」が盛り上がる。Hadoopなど大人気。 ➢ GPUの性能・信頼性が向上を続け、DBMSへの組み込みも現実味。 ➔ 2012年にPG-Stromのプロトタイプ実装。GPU-DBとして最古参の一角。 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 6

PG-Stromの歴史（2/4）開発のきっかけ ▌PGconf 2011 (Ottawa) で聴講したあるセッション・・・  Parallel Image Searching
Using PostgreSQL and PgOpenCL Running PostgreSQL Stored Procedures on a GPU ✓ https://www.pgcon.org/2011/schedule/events/352.en.html  画像処理用のストアドプロシジャを、GPU用のプログラミング環境 OpenCL で作成するための拡張モジュールについて。 A列 B列 C列 D列 E列幅の小さいデータでも、大量の行を並べれば長大なBLOBと同じく GPU並列処理が効くのでは？ 2011 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 7

PG-Stromの歴史（3/4）最初期のプロトタイプ 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 8 実行イメージ：PostgreSQLは行単位、PG-Stromはバルクかつ非同期並列実行イメージ：FDWのAPIを通じてオプティマイザ、エグゼキュータに介入 2012

PG-Stromの歴史（4/4）その後の拡張＆改良 ▌PostgreSQLの改良  Custom Background Worker [PgSQL v9.3] 
Custom Scan API [PgSQL v9.5] ▌Scan/Join/Group-Byの実装 [v1.0]  GPU-JoinとGPU-PreAggを追加 ▌GPU-Direct SQL機構 [v2.0]  I/O処理を高速化 ▌Apache Arrow形式への対応 [v3.0]  列形式データへの対応 ▌内部構造の改修と安定化 [v5.0]  GPU-Serviceの分離と疑似コードの導入 2013 2016 2016 2018 2021 2023 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 9

CustomScan API（1/3） Parser Optimizer Executor execBegin Exec execEnd CustomScan API
PostgreSQL本体へのパッチを必要せず、純粋に拡張モジュールのみでGPU実行を実装できるように。 SQL 実行結果パース木実行計画 add_scan_path add_join_path BeginCustomScan ExecCustomScan EndCustomScan 拡張モジュールによる実装 CustomScan API 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 10 2015

CustomScan API（3/3） 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 12 現在では、PG-Strom以外にもかなりのDBエンジンで利用されている Noriyuki Shinoda『列指向アクセスメソッド徹底比較』,
PostgreSQL Conference Japan 2004 https://speakerdeck.com/nori_shinoda/postgresql-conference-japan-2024-a4-comparison-of-column-oriented-access-methods 2024

GPU-Join Hash-JoinアルゴリズムをGPUで超並列化 inner-table outer-table inner-hash-buffer hash-slot GPU cores outer buffer
• • • • • • • • • • • • 一度に約64MBずつロードして GPU-Join処理 Results Buffer 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 13

GPU-PreAgg（1/2） ✓ AVG(X) ... count(X) と sum(X) があれば導出できる。 ✓ STDDEV(X)
... count(X) と sum(X*X), sum(X) があれば導出できる。 outer-table outer buffer • • • • • • • • • • • • GPU cores GPU GroupBy Buffer (per session) hash-slot sumX cntX Key sumX cntX Key sumX cntX Key by hash(key) atomicAdd() pavg(cntX, sumX) 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 14

GPU-PreAgg（2/2） ssbm=# explain verbose select count(*), avg(lo_revenue), lo_shipmode from lineorder
where lo_discount > 0 group by lo_shipmode; QUERY PLAN --------------------------------------------------------------------------------------------- HashAggregate (cost=13191053.13..13191053.22 rows=7 width=51) Output: pgstrom.fcount(pgstrom.nrows()), ¥ pgstrom.avg_numeric(pgstrom.pavg(lo_revenue)), lo_shipmode Group Key: lineorder.lo_shipmode -> Gather (cost=13191052.33..13191053.08 rows=7 width=51) Output: (pgstrom.nrows()), (pgstrom.pavg(lo_revenue)), lo_shipmode Workers Planned: 2 -> Parallel Custom Scan (GpuPreAgg) on public.lineorder ¥ (cost=13190052.33..13190052.38 rows=7 width=51) Output: pgstrom.nrows(), pgstrom.pavg(lo_revenue), lo_shipmode GPU Projection: pgstrom.nrows(), pgstrom.pavg(lineorder.lo_revenue), ¥ lineorder.lo_shipmode GPU Scan Quals: (lineorder.lo_discount > '0'::numeric) ¥ [plan: 6000026000 -> 2274427000] GPU Group Key: lineorder.lo_shipmode Scan-Engine: GPU-Direct with 2 GPUs <0,1> (12 rows) PostgreSQLの並列ワーカーがそれぞれ cntX と sumX を返す ➔ CPUのHashAggregateで結果を統合する。 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 15

なぜScan, Join, Group-Byをターゲットとして選んだか？ 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 16  Scan(34.4%)、Join(36.9%)、Aggregate(23.3%)で総処理時間の95%
✓ 但し、Scanは予めバッファにロードした状態である事に留意。 ✓ この３つを倒せば、OLAP系のワークロードを大半をカバーできることに？  その他 (5.3%) の中で目立つのは、Sort、SetOp、Window関数 OLAPの性能評価に使われるTPC-DSの実行時間を分析 2015

GPU Direct SQLの着想と開発（1/3） GPUコア GPU Device Memory CPU Host Memory
ストレージ（HDD/SSD） CPU 720GB/s 16GB/s 0.5GB/s 60GB/s NVME-SSD 一台あたり 3GB/s （当時の）PG-Stromを含む、 GPU-DBはデータがオンメモリ前提ストレージに落ちたら「負け」一方この頃、NVME-SSD製品が登場安価な高速SSDがブレイクの兆し。 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 17 2016

GPU Direct SQLの着想と開発（2/3） NVIDIA GPUDirect RDMA nvme.ko ドライバ nvme_strom.ko ドライバ
ホストメモリデバイスメモリ物理アドレス空間 NVME-READ From: Block 1234 Qty: 32 Dest: 0x34567000 NVME-READ From: Block 2345 Qty: 32 Dest: 0x45678000 NVME コマンド NVME コマンド 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 18 2016

GPU Direct SQLの着想と開発（3/3） ✓ GPU-Direct SQL（当時は SSD-to-GPU Direct SQL）により、メモリサイズを越えた大量データ（数TB～）を処理する事が可能に。
✓ この特徴が、PG-Stromが他のGPU-DB製品とは異なった進化をするきっかけに。実験用に購入した Intel SSD 750 (400GB) の理論帯域まで出ている (!) 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 19 2016

SSBMによるベンチマーク（1/2） 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 20 ▌SSBM: Star Schema Benchmark
 TPC-Hを簡素化し、Join/GroupByといったDBMSの基本的なOLAP性能の測定に特化。  13本のクエリが定義されており、v2.0の頃からPG-Stromの性能定点観測に利用。 [Q3-2] select c_city, s_city, d_year, sum(lo_revenue) as revenue from customer, lineorder, supplier, date1 where lo_custkey = c_custkey and lo_suppkey = s_suppkey and lo_orderdate = d_datekey and c_nation = 'UNITED STATES’ and s_nation = 'UNITED STATES’ and d_year >= 1992 and d_year <= 1997 group by c_city, s_city, d_year order by d_year asc, revenue desc; サーバースペック model: Supermicro AS-2015CS-TNR CPU: AMD EPYC 9254 (24C, 2.9GHz) x1 RAM: 16GB DIMM[DDR5-4800; ECC] x12 GPU: NVIDIA H100 [80GB; PCI-E] x1 SSD: Intel D7-P5510 (3.84TB) x8 N/W: 2-port 10GBase OS: Red Hat Enterprise Linux 8.9 DBMS: PostgreSQL v16.9 + PG-Strom v6.0dev customer [4.0GB] date [416kB] part [229MB] supplier [1.2GB] lineorder [875GB] SF=999 の場合のデータサイズ

SSBMによるベンチマーク（2/2） 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 21 (DBサイズ ÷応答時間) の指標単位時間で
処理できるデータ量が大きいほど性能が高い事を示す。 nvme7 nvme6 nvme5 nvme4 nvme3 nvme2 nvme1 nvme0 ファイルシステム経由、細切れ（8kB単位）のREADにより、 NVME-SSD本来の性能を出し切れていないと推察。

Apache Arrow対応（1/4） 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 22 ▌Apache Arrow形式の特徴 
列指向で分析用途向けに設計されたデータ形式  アプリケーションによらず、共通のデータ交換形式として利用可能  整数、実数、日付時刻、文字列など基本的なデータ型を定義 NVIDIA GPU PostgreSQL / PG-Strom 2020

Apache Arrow対応（2/4） Header “ARROW1¥0¥0” Schema Definition Footer Schema Definition Custom
Metadata Apache Arrow File RecordBatch-1 RecordBatch-k RecordBatch-N Terminator “ARROW1” RecordBatch: 長さの同じ配列を列ごとに寄せ集めたもの。 Schema Definition: 列名やデータ型、属性が列の数だけ列挙されている。テーブル定義に相当する。 A列 B列 C列 D列物理的な（ファイル上の）レイアウト A列 NULL-Bitmap A列 Values-array B列 NULL-Bitmap B列 Values-index B列 Values-body C列 NULL-Bitmap C列 Values-array D列 NULL-Bitmap D列 Values-array Record Batch-k SELECT SUM(D) FROM f_arrow_tbl WHERE B LIKE ‘%abc%’; 被参照列のみのロードが容易 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 23 2020

《背景技術》FDW (Foreign Data Wrapper) 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 24 外部テーブル（Foreign
Table）－ PostgreSQL管理外のデータソースを、あたかもテーブルであるかのように取り扱うための機能 PostgreSQL Table CSV postgres_fdw file_fdw arrow_fdw Foreign Data Wrapper (FDW) API 外部データ（Foreign Data） PostgreSQL 内部データ PostgreSQLテーブルには、 PostgreSQLの行データ形式でデータが格納されている Foreign Table 外部データPostgreSQL内部データの相互変換に責任を持つ。

Apache Arrow対応（3/4） 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 25 ▌なぜApache Arrow形式が優れているのか？ 
被参照列のみ読み出すため、I/O量が少なくて済む  GPUメモリバスの特性から、プロセッサ実行効率が高い  Read-onlyデータなので、実行時のMVCC検査を行う必要がない GPU Direct SQL機構を使って、被参照列だけを転送する。 PCIe Bus NVMe SSD GPU SSD-to-GPU P2P DMA WHERE-clause JOIN GROUP BY クエリの被参照列のみ、ダイレクトデータ転送 Apache Arrow形式を解釈し、データを取り出せるよう GPUコード側での対応。小規模の処理結果だけを PostgreSQLデータ形式で返す Results metadata 2020

Apache Arrow対応（4/4） ▌縦軸は [百万行/秒] で単位時間あたりに処理できた行数を示す。  列データなので被参照列の数/幅によってI/Oの負荷は変わる。  列データの場合、律速要因はI/OよりもむしろGPU側。 ➔
I/O自体は15～20GB/s程度しか出ておらず、JOINの結果生成される行数によって大きく応答時間が変動している事が傍証。 2020 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 26

PG-Strom v5.0：内部アーキテクチャの刷新（1/2）  PostgreSQLはマルチプロセスモデル → クライアント毎にfork(2)してバックエンドを生成 GPU PostgreSQL Backend PG-Strom
Postmaster Process fork(2) DBクライアント 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 27 2023

PG-Strom v5.0：内部アーキテクチャの刷新（1/2）  PostgreSQLはマルチプロセスモデル → クライアント毎にfork(2)してバックエンドを生成  PG-Stromを使用するバックエンドプロセスは、APIの使用時に必ずCUDA Contextを生成する。 ✓
CUDA Context = GPU処理にとってのプロセス（リソースやスケジューリングの単位） GPU PostgreSQL Backend CUDA Context PG-Strom Postmaster Process working memory fork(2) DBクライアント 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 28 2023

CUDA Context = GPU処理にとってのプロセス（リソースやスケジューリングの単位）  CUDA Contextは存在するだけでリソースを消費する（GPUメモリ数百MB） ➔ 同時接続クライアント数が増えると地獄！！ GPU PostgreSQL Backend CUDA Context PG-Strom PostgreSQL Backend CUDA Context PG-Strom PostgreSQL Backend CUDA Context PG-Strom Postmaster Process working memory working memory working memory fork(2) DBクライアント 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 29 2023

CUDA Context = GPU処理にとってのプロセス（リソースやスケジューリングの単位）  CUDA Contextは存在するだけでリソースを消費する（GPUメモリ数百MB） ➔ 同時接続クライアント数や並列処理ワーカーが増えると地獄！！ GPU PostgreSQL Backend CUDA Context PG-Strom PostgreSQL Backend CUDA Context PG-Strom PostgreSQL Backend CUDA Context PG-Strom Postmaster Process working memory working memory working memory PostgreSQL Worker CUDA Context PG-Strom PostgreSQL Backend CUDA Context PG-Strom working memory working memory fork(2) 何もしなくても消費するリソース DBクライアント 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 30 2023

PostgreSQL Backend PG-Strom v5.0：内部アーキテクチャの刷新（2/2） CUDA Contextの生成ごとに消費されるワーキングメモリの節約 PG-Strom v3.x 系列 PG-Strom
v5.0 GPU PostgreSQL Backend CUDA Context PG-Strom PostgreSQL Backend CUDA Context PG-Strom PostgreSQL Backend CUDA Context PG-Strom Postmaster Process working memory working memory working memory fork(2) GPU PG-Strom PostgreSQL Backend Background Worker CUDA Context GPU Service (multi- threads) Postmaster Process working memory fork(2) PG-Strom Local connection to GPU Service ✓ コード品質やデバッグ手法に対する知見が十分でなかった時期、デバッグを容易にするメリット。 ✓ メモリコピーのコストをやや過大に見積もり。 ✓ CUDA Context初期化時間、およびワーキングメモリ（数百MB～1GB）の節約。 ✓ クエリ毎の「内部状態」を持つ事が容易に。 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 31 2023 クエリ毎の内部状態

PG-Strom v5.0：CUDA C++ネイティブコードの廃止（1/2） =# explain verbose select sum(lo_extendedprice*lo_discount) as revenue
from lineorder,date1 where lo_orderdate = d_datekey and d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25; QUERY PLAN --------------------------------------------------------------------------------------------- Aggregate (cost=3161089.82..3161089.83 rows=1 width=32) Output: pgstrom.sum_fp_num((pgstrom.psum(((lineorder.lo_extendedprice * lineorder.lo_discount))::double precision))) -> Custom Scan (GpuPreAgg) on public.lineorder (cost=3161089.81..3161089.82 rows=1 width=32) Output: (pgstrom.psum(((lineorder.lo_extendedprice * lineorder.lo_discount))::double precision)) GPU Projection: pgstrom.psum(((lineorder.lo_extendedprice * lineorder.lo_discount))::double precision) GPU Scan Quals: ((lineorder.lo_discount >= '1'::numeric) AND (lineorder.lo_discount <= '3'::numeric) AND (lineorder.lo_quantity < '25'::numeric)) [rows: 600128800 -> 80699690] GPU Join Quals [1]: (lineorder.lo_orderdate = date1.d_datekey) ... [nrows: 80699690 -> 11519510] GPU Outer Hash [1]: lineorder.lo_orderdate GPU Inner Hash [1]: date1.d_datekey GPU-Direct SQL: enabled (GPU-0) : Scan Quals OpCode: {Bool::AND args=[{Func(bool)::numeric_ge args=[{Var(numeric): slot=0, expr='lo_discount'}, {Const(numeric): value='1'}]}, {Func(bool)::numeric_le args=[{Var(numeric): slot=0, expr='lo_discount'}, {Const(numeric): value='3'}]}, {Func(bool)::numeric_lt args=[{Var(numeric): slot=1, expr='lo_quantity'}, {Const(numeric): value='25'}]}]} Join Quals OpCode: {Packed items[1]={JoinQuals: {Func(bool)::int4eq args=[{Var(int4): kvec=0x2a000-2c800, expr='lo_orderdate'}, {Var(int4): slot=5, expr='d_datekey'}]}}} Join HashValue OpCode: {Packed items[1]={HashValue arg={Var(int4): kvec=0x2a000-2c800, expr='lo_orderdate'}}} Partial Aggregation OpCode: {AggFuncs <psum::fp[slot=2, expr='(((lo_extendedprice * lo_discount))::double precision)']> arg={SaveExpr: <slot=2, type='float8'> arg={Func(float8)::float8 arg={Func(numeric)::numeric_mul args=[{Var(numeric): kvec=0x1c000-2a000, expr='lo_extendedprice'}, {Var(numeric): kvec=0x0000-e000, expr='lo_discount'}]}}}} Partial Function BufSz: 16 -> Seq Scan on public.date1 (cost=0.00..78.96 rows=365 width=4) Output: date1.d_datekey Filter: (date1.d_year = 1993) (22 rows) v3.x系列では、自動生成した CUDA C++ソースコードのファイル名が表示されていた部分 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 32 2023

PG-Strom v5.0：CUDA C++ネイティブコードの廃止（2/2）背景：ARM搭載Computational Storage Driveでの実験 dpuserv NFSマウント評価用データベース（ssbm:
lineorder [87GB]) テーブルスペース実行結果 CSD側でScan, Join, (Partial) GroupBy を実行した結果のみ SQL疑似命令列 PostgreSQL & PG-Strom CSDデバイスの詳細製造元： NGD Systems 記憶素子：NAND Flash 8.0TB QLC（詳細不明）インターフェース：U.2 NVME PCI-E 3.0 x4 CPU： Arm Cortex A53 (Quad cores, 1.5GHz) RAM： 8.0GB DDR4 (6.0GB for user programs) OS： Ubuntu 18.04.5 LTS (aarch64) 平均 SeqRead 425MB/s 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 33 2023

2. PG-Strom v6.0の新機能～Large Tables JOIN～ 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～
34

Large Tables JOINに関する課題（1/2） tpch=# explain select l_shipmode, o_orderpriority, sum(l_extendedprice) from
lineitem, orders where l_orderkey = o_orderkey and o_orderdate > '1997-01-01’ group by l_shipmode, o_orderpriority; QUERY PLAN -------------------------------------------------------------------------------------------------- Gather (cost=26836522.57..26836526.78 rows=35 width=59) Workers Planned: 2 -> Result (cost=26835522.57..26835523.28 rows=35 width=59) -> Parallel Custom Scan (GpuPreAgg) on lineitem (cost=26835522.57..26835522.75 rows=35 width=59) GPU Projection: pgstrom.psum(l_extendedprice), l_shipmode, o_orderpriority GPU Join Quals [1]: (l_orderkey = o_orderkey) [plan: 2500102000 -> 594023000] GPU Outer Hash [1]: l_orderkey GPU Inner Hash [1]: o_orderkey GPU Group Key: l_shipmode, o_orderpriority Scan-Engine: GPU-Direct with 2 GPUs <0,1> -> Parallel Custom Scan (GpuScan) on orders (cost=100.00..4322222.28 rows=148523665 width=24) GPU Projection: o_orderpriority, o_orderkey GPU Scan Quals: (o_orderdate > '1997-01-01'::date) [plan: 1500242000 -> 148523700] Scan-Engine: GPU-Direct with 2 GPUs <0,1> (14 rows) 882GB 205GB GPU-Scanで処理したこの部分を、いったんCPUに戻し、Inner Hash Bufferとして再びGPUにロードし直さなければならない。 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 35

Large Tables JOINに関する課題（2/2） ▌GPU-Scanが処理結果をPostgreSQL側（CPU側）へ戻すための処理 ▌CPUでInner Hash Bufferのセットアップ処理（ハッシュ値計算など） ➔ これらが足を引っ張り、本来の速度を出せない。 Inner
Hash Bufferのセットアップ中に処理速度がスローダウン GPU-Scanを実行しながら処理結果をCPU側へ戻しているため、最大限の速度で NVME-SSDから読み出せない CPU側でInner Hash Bufferのセットアップ。ハッシュ値の計算や、共有バッファへのメモリコピー OUTER-Tableを読み出しながら、フルスピードのGPU-Join 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 36

次ステップを意識したGPU-Scanの改良（1/2） outer-table Results Buffer Row-index GPU cores outer buffer •
• • • • • • • • • • • 一度に約64MBずつロードして GPU-Scan処理 WHERE句の評価 1ブロックを処理するたびに、CPU側の PostgreSQLへデータを書き戻していた。 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 37

次ステップを意識したGPU-Scanの改良（2/2） outer-table Pinned Inner Buffer Hash-Slot GPU cores outer buffer
• • • • • • • • • • • • 一度に約64MBずつロードして GPU-Scan処理 WHERE句の評価このブロックの処理が終わった事を通知するだけ（次のリクエストを準備） JOIN-Keyの Hash値を計算 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 38

次ステップを意識したGPU-Scanの改良（2/2） outer-table Pinned Inner Buffer Hash-Slot GPU cores outer buffer
• • • • • • • • • • • • WHERE句の評価 JOIN-Keyの Hash値を計算次のGPU-Join処理へブロックの処理完了を通知 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 39

Pinned Inner Bufferの利用（1/2） tpch=# set pg_strom.pinned_inner_buffer_threshold = '1GB'; SET tpch=#
explain (costs off, analyze) select l_shipmode, o_orderpriority, sum(l_extendedprice) from lineitem, orders where l_orderkey = o_orderkey and o_orderdate > '1997-01-01’ group by l_shipmode, o_orderpriority; QUERY PLAN --------------------------------------------------------------------------------------------------------------- Gather (actual time=64433.580..64437.036 rows=35 loops=1) Workers Planned: 2 Workers Launched: 2 -> Result (actual time=64416.792..64416.800 rows=12 loops=3) -> Parallel Custom Scan (GpuPreAgg) on lineitem (actual time=64416.790..64416.794 rows=12 loops=3) GPU Projection: pgstrom.psum(l_extendedprice), l_shipmode, o_orderpriority GPU Join Quals [1]: (l_orderkey = o_orderkey) [exec: 26964463 -> 6646784] GPU Outer Hash [1]: l_orderkey GPU Inner Hash [1]: o_orderkey GpuJoin buffer usage: 4096B GPU Group Key: l_shipmode, o_orderpriority Scan-Engine: GPU-Direct with GPU0; direct=115517748, ntuples=26964463 -> Parallel Custom Scan (GpuScan) on orders (actual time=12428.602..12428.604 rows=-1 loops=3) GPU Projection: o_orderpriority, o_orderkey GPU Pinned Buffer: nitems: 360355753, usage: 21.48GB, total: 26.94GB GPU Scan Quals: (o_orderdate > '1997-01-01'::date) [exec: 1503918629 -> 361297593] Scan-Engine: GPU-Direct with GPU0; direct=26843246, ntuples=1503918629 Planning Time: 0.322 ms Execution Time: 64437.236 ms (19 rows) CPUを介して受渡した行は存在しない 3.6億行、27GBの Inner Hash Bufferを GPU上に構築 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 40

Pinned Inner Bufferの利用（2/2） ✓ GPU-Scanの処理結果を、都度、CPU側に返す必要がなくなった。 ✓ CPUがハッシュ値の計算をする必要がなくなった。 ➔ 結果として、フルスピードでINNER/OUTER側両方を読み出せるように。 ※
ただし、Zero-Copyでのバッファ受渡しには問題があり、後日、修正を行っている。 INNERテーブルのスキャン中もスローダウンしていない INNER GPU-Scan + OUTER GPU-Join をフルスピードで実行できている。 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 41

GPU Memory Over Subscriptionとの戦い（1/3） ▌Pinned Inner Bufferのサイズを変えてみる  検索条件を変えれば比較的容易にPinned Inner
Bufferのサイズを変更可能  GPUメモリ搭載量の近辺で急速にパフォーマンスが悪化 ➔ GPUメモリとCPUメモリとの間でスラッシングが発生し、一向に処理が進まない。 ➔ 最悪の場合、OUT OF MEMORYエラーでクラッシュに至る事も。 ▌何が起こっているのか？  Hash-JOINの特性上、ランダムアクセスが非常に多い（原理上、不可避） 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 42

GPU Memory Over Subscriptionとの戦い（2/3） Hash表を分割して複数GPUにInner-Bufferを分散させる GPU-Scan Results (GPU-0) GPU-Scan Results
(GPU-1) GPU-Scan Results (GPU-2) GPU-Scan GPU-Scan GPU-Scan INNER TABLE GPU1 GPU2 GPU0 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 43

GPU Memory Over Subscriptionとの戦い（2/3） Hash表を分割して複数GPUにInner-Bufferを分散させる Inner Buffer (GPU-1) Inner Buffer
(GPU-2) Inner Buffer (GPU-0) GPU-Scan Results (GPU-0) GPU-Scan Results (GPU-1) GPU-Scan Results (GPU-2) INNER TABLE GPU1 GPU2 GPU0 hash%3=0 hash%3=1 hash%3=2 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 44

(GPU-2) Inner Buffer (GPU-0) GPU-Scan Results (GPU-0) GPU-Scan Results (GPU-1) GPU-Scan Results (GPU-2) INNER TABLE GPU1 GPU2 GPU0 GPU-Join Inner Buffer Reconstruction hash%3=0 hash%3=1 hash%3=2 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 45

(GPU-2) Inner Buffer (GPU-0) INNER TABLE GPU1 GPU2 GPU0 hash%3=0 hash%3=1 hash%3=2 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 46

(GPU-2) Inner Buffer (GPU-0) OUTER TABLE GPU1 GPU2 GPU0 hash%3=0 hash%3=1 hash%3=2 Parallel Hash-Join ～数十GB ～数十GB ～数十GB 1ブロックあたり約64MB GPU-Join Source (GPU-0) GPU-Join 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 47

(GPU-2) Inner Buffer (GPU-0) OUTER TABLE GPU1 GPU2 GPU0 hash%3=0 hash%3=1 hash%3=2 GPU-Join Source (GPU-1) Parallel Hash-Join ～数十GB ～数十GB ～数十GB GPU-Join Source (GPU-0) GPU to GPU Copy 約64MB GPU-Join 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 48

(GPU-2) Inner Buffer (GPU-0) OUTER TABLE GPU1 GPU2 GPU0 hash%3=0 hash%3=1 hash%3=2 GPU-Join Source (GPU-1) GPU-Join Source (GPU-2) Parallel Hash-Join ～数十GB ～数十GB ～数十GB GPU to GPU Copy 約64MB GPU-Join 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 49

GPU Memory Over Subscriptionとの戦い（3/3） GPU-Scan結果バッファとGPU-Join Inner Hash-Bufferが共に大きいと…。 GPU-Scan Results Buffer
GPU-Join Inner Buffer GPU 搭載メモリ：40GB 使用済：25GB 結果サイズ：30GB GPU-Join Inner Buffer 使用済：25GB Reconstruction処理中にメモリ溢れによるスラッシングが発生する可能性がある。結果バッファ① GPU-Scan Results Buffer 結果サイズ：30GB 結果バッファ② 結果バッファ③ 結果バッファ④ 結果バッファ⑤ 結果バッファ N 結果バッファを細分化。512MBずつマージして、その都度、明示的にバッファを解放する。改良 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 50

Pinned Inner Bufferの効果 51 ▌評価に使用したクエリ select l_shipmode, o_orderpriority, sum(l_extendedprice) from
lineitem, orders  882GBと205GBのテーブルのJOIN where l_orderkey = o_orderkey and o_orderdate > ‘1997-09-01‘  ここを調整するとInner-Bufferのサイズを group by l_shipmode, o_orderpriority; 変える事ができる。 ▌残課題  Reconstruction処理のマルチスレッド化（GPUメモリの使い過ぎにならない程度に） o_orderdate Inner nitems [million-rows] OLD GPU Join [sec] NEW GPU Join GPU hash memory usage[GB] ※※ CPU Hash-Join total [sec] CPU hash memory usage [GB] total [sec] reconstruction [sec] 1995-01-01 ※ 816.62 ------ 156.54 58.56 60.89GB (3) ------ ------ 1995-01-01 816.62 -------- 68.23 13.46 60.89GB (2) 780.62 43.14 1995-05-01 744.86 -------- 66.95 13.71 55.32GB (2) 721.53 39.23 1995-09-01 665.32 -------- 64.05 11.05 49.61GB (2) 666.86 35.23 1996-01-01 588.53 -------- 63.57 10.61 43.94GB (2) 605.14 31.25 1996-05-01 513.09 -------- 63.82 8.59 38.28GB (2) 549.02 27.31 1996-09-01 436.41 111.573 59.91 5.44 32.61GB (1) 489.52 23.31 1997-01-01 360.36 99.13 57.19 3.94 26.94GB (1) 433.00 19.33 1997-05-01 285.54 86.44 56.34 3.59 21.37GB (1) 376.77 15.42 1997-09-01 208.86 74.54 55.26 2.55 15.65GB (1) 319.31 11.41 1998-01-01 132.80 63.65 54.38 1.55 9.99GB (1) 263.05 7.44 ※ SET pg_strom.pinned_inner_buffer_partition_size = ‘24GB’ 設定により、意図的にInner Hash Bufferの分割数を増やしてみた。 ※※ ()内の数字はInner Hash Bufferの分割数 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～

2. PG-Strom v6.0の新機能～GPU-Sort & Window関数～ 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～
52

PG-StromのFAQを見ると、、、要約：ソートって全部のデータをGPUメモリに載せへんと実行できひんやん？でも、それやと非同期実行にならへんねん。しかもGPUメモリって小さいやろ？小さいデータのソートを高速化してホンマに嬉しいんか？昔に試した事あるけど、あんまり出来のエエもんやなかったんやわ…。当時（2016年頃）使っていたGPU： TESLA K20c
2496 CUDA cores[ 706MHz] 5GB GDDR5 [208GB/s] PCI-E 3.0 x16 GeForce GTX980 2048 CUDA cores[1124MHz] 4GB GDDR5 [224GB/s] PCI-E 3.0 x16 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 53

GPU-Scan / GPU-Joinにおけるデータの流れ outer-table 従来のGPU-Scan/GPU-Join 読み出した64MBブロック単位で、都度、CPU側へ処理結果を返す。 ➔GPUメモリ消費量は少ない
GPU-Join GPU-Scan Pinned Inner Bufferを使う場合 outer-table 処理結果をGPUに留め置き、後の工程で使用する ➔GPUメモリの消費量は大きいが、一度にデータセット全体を参照する事ができる。 GPU-Join GPU-Scan 実行ステータス 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 54 64MB以下数十GB

Bitonic-sortingアルゴリズム  データ型に大小比較の演算子が定義されていればOK  ソーティングネットワークを辿っていけば、最終的にデータは整列される。  個々の比較演算は独立に行うことができるため、並列化が容易 ➔ GPUでの並列処理に適する。 20250523爆速DBセミナー
～PG-Strom v6.0 新機能とその先の未来～ 55

GPU-Sortを含む実行計画（1/2） ssbm=# explain (costs off, analyze) select d_year, p_brand1 from
lineorder, date1, part, supplier where lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and p_brand1 in ('MFGR#2221','MFGR#2228') and s_region in ('ASIA','AMERICA') order by d_year, lo_discount limit 15; QUERY PLAN -------------------------------------------------------------------------------------------- Gather (actual time=48353.408..48376.961 rows=15 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Custom Scan (GpuJoin) on lineorder (actual time=48170.939..48174.475 rows=5 loops=3) GPU Projection: lo_discount, d_year, p_brand1 GPU Join Quals [1]: (p_partkey = lo_partkey) [exec: 5999989709 -> 12086435] GPU Join Quals [2]: (s_suppkey = lo_suppkey) [exec: 12086435 -> 4841975] GPU Join Quals [3]: (d_datekey = lo_orderdate) [exec: 4841975 -> 4841975] GpuJoin buffer usage: 214.07MB Scan-Engine: GPU-Direct with GPU0; direct=114826068, ntuples=5999989709 GPU-Sort keys: d_year, lo_discount [buffer reconstruction: 0msec, GPU-sorting 240msec] GPU-Sort Limit: 15, 4841975 rows filtered -> Parallel Custom Scan (GpuScan) on part (actual time=3.759..41.799 rows=1337 loops=3) : -> Parallel Custom Scan (GpuScan) on supplier (actual time=31.064..136.006 rows=1333704 : -> Parallel Seq Scan on date1 (actual time=0.005..0.105 rows=852 loops=3) Planning Time: 1.033 ms Execution Time: 48377.340 ms GPU-Join[depth=3]では 484万行が生成されているが、結果として出力されたのは僅か 5x3 = 15 行のみ 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 56

GPU-Sortを含む実行計画（2/2） ssbm=# explain (costs off, analyze) select d_year, p_brand1 from
lineorder, date1, part, supplier where lo_orderdate = d_datekey and lo_partkey = p_partkey and lo_suppkey = s_suppkey and p_brand1 in ('MFGR#2221','MFGR#2228') and s_region in ('ASIA','AMERICA') order by d_year, lo_discount limit 15; QUERY PLAN ----------------------------------------------------------------------------------------- Limit (actual time=48221.380..48251.421 rows=15 loops=1) -> Gather Merge (actual time=48221.379..48251.418 rows=15 loops=1) Workers Planned: 2 Workers Launched: 2 -> Sort (actual time=48210.779..48210.783 rows=15 loops=3) Sort Key: date1.d_year, lineorder.lo_discount Sort Method: top-N heapsort Memory: 26kB Worker 0: Sort Method: top-N heapsort Memory: 27kB Worker 1: Sort Method: top-N heapsort Memory: 27kB -> Parallel Custom Scan (GpuJoin) on lineorder (actual time=966.941..47976.717 rows=1613992 ... GPU Projection: d_year, p_brand1, lo_discount GPU Join Quals [1]: (p_partkey = lo_partkey) [exec: 5999989709 -> 12086435] GPU Join Quals [2]: (s_suppkey = lo_suppkey) [exec: 12086435 -> 4841975] GPU Join Quals [3]: (d_datekey = lo_orderdate) [exec: 4841975 -> 4841975] GpuJoin buffer usage: 214.07MB Scan-Engine: GPU-Direct with GPU0; direct=114826068, ntuples=5999989709 -> Parallel Custom Scan (GpuScan) on part (actual time=23.568..59.577 rows=1337 loops=3) : Execution Time: 48251.857 ms ただし、単純な「上位xx件」だと PostgreSQLも工夫するので、それほど差はつかない。 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 57

Window関数の例（1/2） =# select * from ( select "都道府県","市町村","総人口", rank() OVER(PARTITION
BY "都道府県" ORDER BY "総人口" DESC) from population where "都道府県" in ('東京都','神奈川県','埼玉県','千葉県','茨城県’) ) qry WHERE rank <= 3; 都道府県 | 市町村 | 総人口 | rank ----------+-------------------+---------+------ 千葉県 | 千葉県千葉市 | 974951 | 1 千葉県 | 千葉県船橋市 | 642907 | 2 千葉県 | 千葉県松戸市 | 498232 | 3 埼玉県 | 埼玉県さいたま市 | 1324025 | 1 埼玉県 | 埼玉県川口市 | 594274 | 2 埼玉県 | 埼玉県川越市 | 354571 | 3 東京都 | 東京都世田谷区 | 943664 | 1 東京都 | 東京都練馬区 | 752608 | 2 東京都 | 東京都大田区 | 748081 | 3 神奈川県 | 神奈川県横浜市 | 3777491 | 1 神奈川県 | 神奈川県川崎市 | 1538262 | 2 神奈川県 | 神奈川県相模原市 | 725493 | 3 茨城県 | 茨城県水戸市 | 270685 | 1 茨城県 | 茨城県つくば市 | 241656 | 2 茨城県 | 茨城県日立市 | 174508 | 3 (15 rows) 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 58

Window関数の例（2/2） =# explain select * from ( select "都道府県","市町村","総人口", rank()
OVER(PARTITION BY "都道府県" ORDER BY "総人口" DESC) from population where "都道府県" in ('東京都','神奈川県','埼玉県','千葉県','茨城県’) ) qry WHERE rank <= 3; QUERY PLAN --------------------------------------------------------------------------------------------- WindowAgg (cost=28.85..32.39 rows=177 width=42) Run Condition: (rank() OVER (?) < 5) -> Sort (cost=28.85..29.30 rows=177 width=34) Sort Key: population."都道府県", population."総人口" DESC -> Seq Scan on population (cost=0.00..22.24 rows=177 width=34) Filter: ("都道府県" = ANY ('{東京都,神奈川県,埼玉県,千葉県,茨城県}'::text[])) (6 rows) ➔ WindowAggの入力データは、”都道府県”,”総人口”の順で並んでいる必要がある。 ✓ WindowAggは入力行を一つずつチェックし、”都道府県”列が一つ前の入力と異なる場合は異なるパーティションであると判断する。（rank()がリセットされる） ✓ 処理の特性上、部分ソートが効かない ➔ 必ずシングルCPUで大域ソートを行う 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 59

GPUでのWindow関数サポート（1/3） ssbm=# explain (costs off, analyze) select * from (
select c_region, c_nation, c_city, lo_orderdate, sum(lo_revenue) lo_rev, rank() over(partition by c_region, c_nation, c_city order by sum(lo_revenue)) cnt from lineorder, customer where lo_custkey = c_custkey group by c_region, c_nation, c_city, lo_orderdate ) subqry where cnt < 4; QUERY PLAN ------------------------------------------------------------------------------------------------------------- WindowAgg (actual time=54634.772..54796.749 rows=750 loops=1) Run Condition: (rank() OVER (?) < 4) -> Gather (actual time=54634.759..54795.769 rows=750 loops=1) Workers Planned: 2 Workers Launched: 2 -> Result (actual time=54507.899..54508.045 rows=250 loops=3) -> Parallel Custom Scan (GpuPreAgg) on lineorder (actual time=54507.897..54507.948 rows=250 loops=3) GPU Projection: pgstrom.psum(lo_revenue), c_region, c_nation, c_city, lo_orderdate GPU Join Quals [1]: (lo_custkey = c_custkey) [exec: 4096 -> 4096] GpuJoin buffer usage: 2746.59MB GPU Group Key: c_region, c_nation, c_city, lo_orderdate Scan-Engine: GPU-Direct with 2 GPUs <0,1>; direct=114826068, ntuples=4096 GPU-Sort keys: c_region, c_nation, c_city, pgstrom.fsum_numeric((pgstrom.psum(lo_revenue))) ¥ [buffer reconstruction: 35msec, GPU-sorting 105msec] Window-Rank Filter: rank() over(PARTITION BY c_region, c_nation, c_city ¥ ORDER BY pgstrom.fsum_numeric((pgstrom.psum(lo_revenue)))) < 4 ¥ [4096 rows filtered by window function] -> Parallel Seq Scan on customer (actual time=0.031..1535.490 rows=10000000 loops=3) Execution Time: 55150.584 ms GPUでは各パーティションあたり 3行しか返却していない ➔ 後処理の負荷軽減になる 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 60

GPUでのWindow関数サポート（2/3） ssbm=# explain (costs off, analyze) select * from (
select c_region, c_nation, c_city, lo_orderdate, sum(lo_revenue) lo_rev, rank() over(partition by c_region, c_nation, c_city order by sum(lo_revenue)) cnt from lineorder, customer where lo_custkey = c_custkey group by c_region, c_nation, c_city, lo_orderdate ) subqry where cnt < 4; QUERY PLAN ------------------------------------------------------------------------------------------------------------ WindowAgg (actual time=58528.418..58701.257 rows=750 loops=1) Run Condition: (rank() OVER (?) < 4) -> Sort (actual time=58528.406..58607.314 rows=601500 loops=1) Sort Key: customer.c_region, customer.c_nation, customer.c_city, ¥ (pgstrom.fsum_numeric((pgstrom.psum(lineorder.lo_revenue)))) Sort Method: quicksort Memory: 76268kB -> Gather (actual time=55050.926..55724.274 rows=601500 loops=1) Workers Planned: 2 Workers Launched: 2 -> Result (actual time=54978.469..55066.634 rows=200500 loops=3) -> Parallel Custom Scan (GpuPreAgg) on lineorder (actual time=54978.466..55002.391 rows=200500 loops=3) GPU Projection: pgstrom.psum(lo_revenue), c_region, c_nation, c_city, lo_orderdate GPU Join Quals [1]: (lo_custkey = c_custkey) [exec: 4096 -> 4096] GpuJoin buffer usage: 2746.59MB GPU Group Key: c_region, c_nation, c_city, lo_orderdate Scan-Engine: GPU-Direct with 2 GPUs <0,1>; direct=114826068, ntuples=4096 -> Parallel Seq Scan on customer (actual time=0.032..1522.518 rows=10000000 loops=3) Planning Time: 0.358 ms Execution Time: 59062.355 ms (20 rows) ソートしてみるまで rank()は分からないので、全部まとめてCPUに戻すしかない。 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 61

GPUでのWindow関数サポート（3/3） ▌サポートされている Window関数  row_number() 1から数えたパーティション内の現在行の数  rank() 現在行の順位を返す（ギャップを含む） 
dense_rank() 現在行の順位を返す（ギャップを含まない） ➔ これらのWindow関数を用いて、 rank() < 5 のように、行数の絞り込みが発生する場合のみ有効 ▌それ以外の Window 関数は？  GPUでのWindow関数サポートに意味がある理由を考える ➔ CPUが処理する（特にソートする）行数を減らす。 GPUから余計な行を出さない。 ✓ 行数を減らせる最適化が可能なパターンは可能性あり。 ✓ 行数を減らさない、単なる計算の前処理であればあまり意味はない。 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 62

GPU版Window関数の実装（1/5）都道府県市町村総人口 : : : 東京都東京都町田市
431079 茨城県茨城県水戸市 270685 埼玉県神奈川県横浜市 3777491 千葉県千葉県君津市 82206 東京都東京都豊島区 301599 千葉県千葉県館山市 45153 茨城県茨城県稲敷市 39039 千葉県千葉県八千代市 199498 茨城県茨城県小美玉市 48870 埼玉県埼玉県本庄市 78569 東京都東京都杉並区 591108 茨城県茨城県牛久市 84651 埼玉県埼玉県東松山市 91791 : : : GPU-Scan/Join/PreAgg 結果バッファ GPU-Scan GPU-Join GPU-PreAgg 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 63

GPU版Window関数の実装（2/5）都道府県市町村総人口 : : : 埼玉県埼玉県日高市
54571 埼玉県埼玉県羽生市 52862 埼玉県埼玉県白岡市 52214 埼玉県埼玉県幸手市 50066 東京都東京都世田谷区 943664 東京都東京都練馬区 752608 東京都東京都大田区 748081 東京都東京都江戸川区 697932 東京都東京都足立区 695043 東京都東京都杉並区 591108 東京都東京都板橋区 584483 東京都東京都八王子市 579355 東京都東京都江東区 524310 : : : GPU-Scan/Join/PreAgg 結果バッファ GPU-Sort ORDER BY 都道府県, 総人口 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 64

GPU版Window関数の実装（3/5）都道府県市町村総人口 P-hash : : : : 埼玉県
埼玉県日高市 54571 0x1c2e605e 埼玉県埼玉県羽生市 52862 0x1c2e605e 埼玉県埼玉県白岡市 52214 0x1c2e605e 埼玉県埼玉県幸手市 50066 0x1c2e605e 東京都東京都世田谷区 943664 0x3c02094f 東京都東京都練馬区 752608 0x3c02094f 東京都東京都大田区 748081 0x3c02094f 東京都東京都江戸川区 697932 0x3c02094f 東京都東京都足立区 695043 0x3c02094f 東京都東京都杉並区 591108 0x3c02094f 東京都東京都板橋区 584483 0x3c02094f 東京都東京都八王子市 579355 0x3c02094f 東京都東京都江東区 524310 0x3c02094f : : : : GPU-Scan/Join/PreAgg 結果バッファハッシュ値を計算 P-hash = hash(都道府県) 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 65

GPU版Window関数の実装（4/5）都道府県市町村総人口 P-hash Rank() : : : :
: 埼玉県埼玉県日高市 54571 0x1c2e605e 埼玉県埼玉県羽生市 52862 0x1c2e605e 埼玉県埼玉県白岡市 52214 0x1c2e605e 埼玉県埼玉県幸手市 50066 0x1c2e605e 東京都東京都世田谷区 943664 0x3c02094f 東京都東京都練馬区 752608 0x3c02094f 東京都東京都大田区 748081 0x3c02094f 東京都東京都江戸川区 697932 0x3c02094f 東京都東京都足立区 695043 0x3c02094f 東京都東京都杉並区 591108 0x3c02094f 東京都東京都板橋区 584483 0x3c02094f 東京都東京都八王子市 579355 0x3c02094f 東京都東京都江東区 524310 0x3c02094f : : : : : GPU-Scan/Join/PreAgg 結果バッファハッシュ値に基づいて、パーティションの先頭を特定。 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 66

: 埼玉県埼玉県日高市 54571 0x1c2e605e 37 埼玉県埼玉県羽生市 52862 0x1c2e605e 38 埼玉県埼玉県白岡市 52214 0x1c2e605e 39 埼玉県埼玉県幸手市 50066 0x1c2e605e 40 東京都東京都世田谷区 943664 0x3c02094f 1 東京都東京都練馬区 752608 0x3c02094f 2 東京都東京都大田区 748081 0x3c02094f 3 東京都東京都江戸川区 697932 0x3c02094f 4 東京都東京都足立区 695043 0x3c02094f 5 東京都東京都杉並区 591108 0x3c02094f 6 東京都東京都板橋区 584483 0x3c02094f 7 東京都東京都八王子市 579355 0x3c02094f 8 東京都東京都江東区 524310 0x3c02094f 9 : : : : : GPU-Scan/Join/PreAgg 結果バッファハッシュ値に基づいて、パーティションの先頭を特定。 ➔ 自身のRank()を確定する。距離=6 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 67

: 埼玉県埼玉県日高市 54571 0x1c2e605e 37 埼玉県埼玉県羽生市 52862 0x1c2e605e 38 埼玉県埼玉県白岡市 52214 0x1c2e605e 39 埼玉県埼玉県幸手市 50066 0x1c2e605e 40 東京都東京都世田谷区 943664 0x3c02094f 1 東京都東京都練馬区 752608 0x3c02094f 2 東京都東京都大田区 748081 0x3c02094f 3 東京都東京都江戸川区 697932 0x3c02094f 4 東京都東京都足立区 695043 0x3c02094f 5 東京都東京都杉並区 591108 0x3c02094f 6 東京都東京都板橋区 584483 0x3c02094f 7 東京都東京都八王子市 579355 0x3c02094f 8 東京都東京都江東区 524310 0x3c02094f 9 : : : : : GPU-Scan/Join/PreAgg 結果バッファハッシュ値に基づいて、パーティションの先頭を特定。 ➔ 自身のRank()を確定する。 ➔ rank() < 4 は消す。距離=6 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 68

2. PG-Strom v6.0の新機能～Arrow_Fdwの仮想列～ 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 69

RecordBatch: 長さの同じ配列を列ごとに寄せ集めたもの。既存機能：Arrow_Fdwにおけるmin/max統計情報（1/2） 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 70
中身を読むまでもなく、明らかに検索条件にマッチしないブロックを読み飛ばす。 Header “ARROW1¥0¥0” Schema Definition Footer Schema Definition Custom Metadata Apache Arrow File RecordBatch-1 RecordBatch-k RecordBatch-N Terminator “ARROW1” Custom Metadata min_values=“202009,202107, ...” max_values=“202107,202111, ...” 各レコードバッチ毎に最小値・最大値をカスタムメタデータとして埋め込む事ができる。 SELECT ymd, count(*) FROM arrow_table WHERE ymd <= 202103; RecordBatch-1 RecordBatch-k RecordBatch-N ymd最小値 <= 202103 ➔ true ymd最小値 <= 202103 ➔ false ymd最小値 <= 202103 ➔ false 独自機能 min/max統計情報に基づいて、検索条件にマッチする行を含むブロック（RecordBatch）だけを読み出すよう最適化が可能。

既存機能：Arrow_Fdwにおけるmin/max統計情報（2/2） ssbm=# EXPLAIN (analyze, costs off) SELECT count(*), sum(lo_revenue) FROM
f_lineorder_sorted WHERE lo_orderdate BETWEEN 19940401 AND 19940630 AND lo_orderpriority = '2-HIGH’; QUERY PLAN --------------------------------------------------------------------------------------------------- Aggregate (actual time=334.706..334.708 rows=1 loops=1) -> Custom Scan (GpuPreAgg) on f_lineorder_sorted (actual time=334.691..334.694 rows=2 loops=1) GPU Projection: pgstrom.nrows(), pgstrom.psum((lo_revenue)::bigint) GPU Scan Quals: ((lo_orderdate >= 19940401) AND (lo_orderdate <= 19940630) AND (lo_orderpriority = '2-HIGH'::text)) ¥ [plan: 5999990000 -> 150000, exec: 232539843 -> 45396726] referenced: lo_orderdate, lo_orderpriority, lo_revenue Stats-Hint: (lo_orderdate >= 19940401), ¥ (lo_orderdate <= 19940630) [loaded: 77, skipped: 1910] file0: /opt/arrow/f_lineorder_sorted.arrow (read: 109.52GB, size: 496.69GB) GPU-Direct SQL: enabled (N=2,GPU0,1; direct=556444, ntuples=232539843) Planning Time: 2.028 ms Execution Time: 336.700 ms (10 rows) 明らかに検索条件に合致しない RecordBatch を読み飛ばしているのが分かる検索条件にマッチしない事が明らかなら、そもそも読み出す必要もない。 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 71

Arrowテーブルを組み込んだデータベース構造（1/2）ログデータトランザクションデータ（PostgreSQL Heap形式）肥大化古いデータを Arrow形式に変換 pg2arrow: SELECT
* FROM my_table WHERE ymd >= ‘2023-01-01’ AND ymd < ‘2024-01-01’ すごく古いデータは、gzipなどで固めてバックアップ用ストレージへテーブルパーティション mytable_2022.arrow mytable_2023.arrow 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 72

Arrowテーブルを組み込んだデータベース構造（2/2） $ cd /opt/arrow/monthly/ $ ls f_lineorder_199301_AIR.arrow f_lineorder_199409_AIR.arrow f_lineorder_199605_FOB.arrow f_lineorder_199301_FOB.arrow
f_lineorder_199409_FOB.arrow f_lineorder_199605_MAIL.arrow f_lineorder_199301_MAIL.arrow f_lineorder_199409_MAIL.arrow f_lineorder_199605_RAIL.arrow f_lineorder_199301_RAIL.arrow f_lineorder_199409_RAIL.arrow f_lineorder_199605_SHIP.arrow f_lineorder_199301_SHIP.arrow f_lineorder_199409_SHIP.arrow f_lineorder_199605_TRUCK.arrow f_lineorder_199301_TRUCK.arrow f_lineorder_199409_TRUCK.arrow f_lineorder_199606_AIR.arrow f_lineorder_199302_AIR.arrow f_lineorder_199410_AIR.arrow f_lineorder_199606_FOB.arrow f_lineorder_199302_FOB.arrow f_lineorder_199410_FOB.arrow f_lineorder_199606_MAIL.arrow f_lineorder_199302_MAIL.arrow f_lineorder_199410_MAIL.arrow f_lineorder_199606_RAIL.arrow f_lineorder_199302_RAIL.arrow f_lineorder_199410_RAIL.arrow f_lineorder_199606_SHIP.arrow f_lineorder_199302_SHIP.arrow f_lineorder_199410_SHIP.arrow f_lineorder_199606_TRUCK.arrow f_lineorder_199302_TRUCK.arrow f_lineorder_199410_TRUCK.arrow f_lineorder_199607_AIR.arrow f_lineorder_199303_AIR.arrow f_lineorder_199411_AIR.arrow f_lineorder_199607_FOB.arrow f_lineorder_199303_FOB.arrow f_lineorder_199411_FOB.arrow f_lineorder_199607_MAIL.arrow f_lineorder_199303_MAIL.arrow f_lineorder_199411_MAIL.arrow f_lineorder_199607_RAIL.arrow f_lineorder_199303_RAIL.arrow f_lineorder_199411_RAIL.arrow f_lineorder_199607_SHIP.arrow f_lineorder_199303_SHIP.arrow f_lineorder_199411_SHIP.arrow f_lineorder_199607_TRUCK.arrow f_lineorder_199303_TRUCK.arrow f_lineorder_199411_TRUCK.arrow f_lineorder_199608_AIR.arrow f_lineorder_199304_AIR.arrow f_lineorder_199412_AIR.arrow f_lineorder_199608_FOB.arrow f_lineorder_199304_FOB.arrow f_lineorder_199412_FOB.arrow f_lineorder_199608_MAIL.arrow f_lineorder_199304_MAIL.arrow f_lineorder_199412_MAIL.arrow f_lineorder_199608_RAIL.arrow f_lineorder_199304_RAIL.arrow f_lineorder_199412_RAIL.arrow f_lineorder_199608_SHIP.arrow f_lineorder_199304_SHIP.arrow f_lineorder_199412_SHIP.arrow f_lineorder_199608_TRUCK.arrow f_lineorder_199304_TRUCK.arrow f_lineorder_199412_TRUCK.arrow f_lineorder_199609_AIR.arrow f_lineorder_199305_AIR.arrow f_lineorder_199501_AIR.arrow f_lineorder_199609_FOB.arrow f_lineorder_199305_FOB.arrow f_lineorder_199501_FOB.arrow f_lineorder_199609_MAIL.arrow f_lineorder_199305_MAIL.arrow f_lineorder_199501_MAIL.arrow f_lineorder_199609_RAIL.arrow : : : 習慣的に、ファイル名に “yyyymmdd” を付加したりしますよね？ 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 73

Arrow_FdwのVirtual Column機構（1/3） ssbm=# IMPORT FOREIGN SCHEMA f_lineorder_vcol FROM SERVER arrow_fdw
INTO public OPTIONS (dir '/opt/arrow/monthly', pattern 'f_lineorder_@{year_month}_${shipmode}.arrow'); IMPORT FOREIGN SCHEMA ssbm=# ¥d f_lineorder_vcol Foreign table "public.f_lineorder_vcol" Column | Type | Collation | Nullable | Default | FDW options --------------------+--------------+-----------+----------+---------+------------------------ lo_orderkey | bigint | | | | lo_linenumber | integer | | | | lo_custkey | integer | | | | lo_partkey | integer | | | | lo_suppkey | integer | | | | lo_orderdate | integer | | | | lo_orderpriority | text | | | | lo_shippriority | character(1) | | | | lo_quantity | integer | | | | lo_extendedprice | integer | | | | lo_ordertotalprice | integer | | | | lo_discount | integer | | | | lo_revenue | integer | | | | lo_supplycost | integer | | | | lo_tax | integer | | | | lo_commit_date | text | | | | lo_shipmode | text | | | | year_month | bigint | | | | (virtual 'year_month') shipmode | text | | | | (virtual 'shipmode') Server: arrow_fdw FDW options: (dir '/opt/arrow/monthly', pattern 'f_lineorder_@{year_month}_${shipmode}.arrow') 実はArrowファイル上には存在していない列。ファイル名の一部を列の値として読み出す事ができる。 @{xxxx} … 数値列（bigint） ${xxxx} … 文字列（text） 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 74

Arrow_FdwのVirtual Column機構（2/3） SSBM Q1_2をVirtual Column用に改修して比較 =# select sum(lo_extendedprice*lo_discount) from f_lineorder,
date1 where lo_orderdate = d_datekey and d_yearmonthnum = 199401 and lo_discount between 4 and 6 and lo_quantity between 26 and 35; revenue ---------------- 96427226949654 (1 row) Time: 5549.456 ms ssbm=# select sum(lo_extendedprice*lo_discount) from f_lineorder_vcol where year_month = 199401 and lo_discount between 4 and 6 and lo_quantity between 26 and 35; sum ---------------- 96427226949654 (1 row) Time: 142.665 ms 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 75 =# select sum(lo_extendedprice*lo_discount) from lineorder, date1 where lo_orderdate = d_datekey and d_yearmonthnum = 199401 and lo_discount between 4 and 6 and lo_quantity between 26 and 35; revenue ---------------- 96427226949654 (1 row) Time: 116134.440 ms =# select sum(lo_extendedprice*lo_discount) from lineorder, date1 where lo_orderdate = d_datekey and d_yearmonthnum = 199401 and lo_discount between 4 and 6 and lo_quantity between 26 and 35; revenue ---------------- 96427226949654 (1 row) Time: 22404.024 ms 817倍高速に！？元々の PostgreSQL PG-Strom と GPU-Direct SQL データ形式を Arrowに変換して GPUで処理 Arrow_Fdw 仮想列を用いた読み飛ばし

Arrow_FdwのVirtual Column機構（3/3） ssbm=# explain (analyze, costs off) select sum(lo_extendedprice*lo_discount) from
f_lineorder_vcol where year_month = 199401 and lo_discount between 4 and 6 and lo_quantity between 26 and 35; QUERY PLAN -------------------------------------------------------------------------------------------------- Aggregate (actual time=123.326..123.328 rows=1 loops=1) -> Custom Scan (GpuPreAgg) on f_lineorder_vcol (actual time=123.315..123.317 rows=2 loops=1) GPU Projection: pgstrom.psum(((lo_extendedprice * lo_discount))::bigint) GPU Scan Quals: ((lo_discount >= 4) AND (lo_discount <= 6) AND (lo_quantity >= 26) AND (lo_quantity <= 35) AND (year_month = 199401)) [plan: 3914078000 -> 489, exec: 77300885 -> 4217587] referenced: lo_quantity, lo_extendedprice, lo_discount, year_month Stats-Hint: (year_month = 199401) [loaded: 28, skipped: 1416] file0: /opt/arrow/monthly/f_lineorder_199411_FOB.arrow (read: 122.30MB, size: 892.79MB) file1: /opt/arrow/monthly/f_lineorder_199406_AIR.arrow (read: 122.28MB, size: 892.68MB) file2: /opt/arrow/monthly/f_lineorder_199709_FOB.arrow (read: 122.38MB, size: 893.37MB) file3: /opt/arrow/monthly/f_lineorder_199401_RAIL.arrow (read: 126.33MB, size: 932.76MB) file4: /opt/arrow/monthly/f_lineorder_199508_TRUCK.arrow (read: 126.39MB, size: 943.73MB) : : : : file360: /opt/arrow/monthly/f_lineorder_199405_SHIP.arrow (read: 126.43MB, size: 933.48MB) GPU-Direct SQL: enabled (N=2,GPU0,1; direct=113261, ntuples=77300885) Planning Time: 4.772 ms Execution Time: 128.046 ms min/max統計情報と同じ仕組みで、明らかにマッチしないRecord Batchを読み飛ばし 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 76

3. 今後のロードマップ 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 77

SELECT * INTO ... ▌NVIDIA GPU Direct Storageは書き込みにも対応している。 ➔ PostgreSQLをバイパスしてGPU-Directで書き込めたら速いハズ。
▌発動できる条件は限定的ではあるが、、、  対象テーブルに EXCLUSIVE ロックを取る必要がある。  GPU処理の結果を「そのまま」INSERTするタイプのSQL文。 ETL/バッチ処理で集計テーブル作成の高速化を目指す GPU-PreAgg GPU-Join GPU-Scan GPU Direct Storage (Read) memory copy GPU->RAM writes via filesystem トランザクションログ書込みやバッファ管理、並行処理の制御 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 78

SELECT * INTO ... ▌NVIDIA GPU Direct Storageは書き込みにも対応している。 ➔ PostgreSQLをバイパスしてGPU-Directで書き込めたら速いハズ。
▌発動できる条件は限定的ではあるが、、、  対象テーブルに EXCLUSIVE ロックを取る必要がある。  GPU処理の結果を「そのまま」INSERTするタイプのSQL文。 ETL/バッチ処理で集計テーブル作成の高速化を目指す GPU-PreAgg GPU-Join GPU-Scan GPU Direct Storage (Read) GPU Direct Storage (Write) status only 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 79

クラウド向け機能強化（1/4） ▌GPUの搭載数に比べて、NVMEやN/Wが貧弱である事が多い  例：g6e.48xlargeなら、GPU 8台に対して帯域を埋めるにはNVME-SSDは32台ほしい。 ➔ 実際は1.9TB SSDが4枚搭載されているだけ。 N/W経由の高速ストレージが使えれば御の字だが、、、 ▌AI用途の需要、円安の影響によりGPU価格は高止まり
 PG-Strom「専用」のGPUだと中々投資を正当化しにくい。 ➔ クラウドの場合、オンデマンドでGPUを利用できるのがメリット。オンプレほど理想的な構成を取れるわけではない。例：AWSのg6eインスタンスのスペック 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 80

クラウド向け機能強化（2/4）リモートGPUの利用 GPU-Serviceがリモートマシンでも支障はないのでは？ 81 PostgreSQL Backend PG-Strom v5.0 GPU PG-Strom
PostgreSQL Backend Background Worker CUDA Context GPU Service (multi- threads) Postmaster Process working memory fork(2) PG-Strom Local connection to GPU Service ✓ CUDA Context初期化時間、およびワーキングメモリ（数百MB～1GB）の節約。 ✓ CUDA Contextスイッチの削減とGPU使用率改善 PostgreSQL Backend PG-Strom PostgreSQL Backend Postmaster Process fork(2) PG-Strom PG-Strom v7.0(?) Remote Process GPU Service GPU NVME Cache Remote Process GPU Service GPU NVME Cache 共有ファイルシステム (NFS, Lusterなど) コンテナなど利用して、バッチ処理の時間帯だけ GPUサーバ上で稼働。 TCP/IPで接続遠隔サーバも可能 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～

クラウド向け機能強化（3/4） Parquet形式の対応 ▌課題と前提  クラウド環境ではGPUサーバへデータをロードする帯域が十分でない事が多い。  高密度な列データでは、I/OよりもGPUの方がボトルネックとなる事がしばしば。 ▌どうするか？  帯域が大した事ないなら、GPU-Direct
Storageである必要性は、薄い。  代わりに、圧縮データの展開にCPUも使いつつ、転送すべきデータ量を最小化する。 ➔ 無圧縮のArrowよりも、圧縮マニアのParquetを用いた方がフィットする可能性。細いストレージの帯域と高密度な列データを前提とした設計 GPU-Service DBサーバ GPUサーバ共有ストレージ “低速” を前提とする必要あり 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 82

クラウド向け機能強化（4/4）ローカルNVMEキャッシュ ▌課題と前提  GPUインスタンスのストレージ帯域は不足している事が多い。  一方で、揮発性のローカルNVMEが搭載されている事も多い。 ▌どうするか？  低更新頻度のデータ（ArrowやParquet）であれば、ローカルNVMEに複製をコピーしても
invalidationが必要になる事は滅多にない。  NFSのFS-cacheを使用するか、自前の実装を行うかは要検討。 IoT/M2Mログデータなど更新頻度が低いデータの読み出しを補完するため、ローカルNVMEをキャッシュとして利用する。 GPUサーバ (g6e.8xlarge) EBSストレージ（永続ストレージ）ローカルNVME （揮発ストレージ） EBS Band 16Gbps NVME 900GB 3.5～7.0GB/s 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 83

まとめ ▌PG-Stromの歴史 ▌PG-Strom v6.0の新機能  Pinned Inner Buffer  GPU-Sort
& Window関数  Arrow_Fdw仮想列 ▌今後のロードマップ  ETL/バッチ処理向け  クラウドでの利用に向けて 20250523爆速DBセミナー～PG-Strom v6.0 新機能とその先の未来～ 84

オモシロ技術を、カタチにする。

PG-Strom v6.0 新機能とその先の未来

PG-Strom v6.0 新機能とその先の未来

More Decks by KaiGai Kohei

Other Decks in Technology

Featured

Transcript