PG-Strom v3.0 New Feature: GPU Cache

HeteroDB, Inc.
Chief Architect & CEO
KaiGai Kohei <[email protected]>

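About Me / About HeteroDB
PostgreSQL Unconference 2021-May - PG-Strom New Features / GpuCache

▌Company overview
- Company name: HeteroDB, Inc.
- Founded: July 4, 2017
- Location: Osaki Bright Core 4F, 5-5-15 Kita-Shinagawa, Shinagawa-ku
- Business: sales of high-speed database products; technical consulting in the GPU and database domain
- Mission: apply heterogeneous computing technology to the database domain and provide a fast, affordable data analytics platform that anyone can use.

▌Representative's profile
- KaiGai Kohei (海外 浩平)
- More than 10 years in OSS developer communities, working on PostgreSQL and the Linux kernel. Upstream contributions mainly in security, FDW and related areas.
- Certified as a "genius programmer" by the IPA MITOH software program (2006).
- Received the Inception Award at GPU Technology Conference Japan 2017.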

What is PG-Strom?

PG-Strom: an extension module for PostgreSQL that draws out the full capability of GPUs and NVME/PMEM to process terabyte-scale data at high speed.

▌Features
- Transparent GPU acceleration of aggregation and analytics workloads
- PCIe-bus-level optimization by SSD-to-GPU Direct SQL
- Data exchange through Apache Arrow support, bringing import time close to zero
- Data kept resident in GPU memory; coexistence with OLTP workloads
- Support for (some) PostGIS functions, accelerating location data analysis

[Figure: GPU off-loading between the application side (for IoT/Big-Data) and the analytics side (for ML/Analytics), with the feature list below.]
➢ SSD-to-GPU Direct SQL
➢ Columnar Store (Arrow_Fdw)
➢ GPU Memory Store (Gstore_Fdw)
➢ Asymmetric Partition-wise JOIN/GROUP BY
➢ BRIN-Index Support
➢ NVME-over-Fabric Support
➢ Inter-process Data Frame for Python scripts
➢ Procedural Language for GPU native code (w/ cuPy)
➢ PostGIS Support

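About PG-Strom v3.0

▌Schedule
- Barring anything unexpected, it should be released around the end of May.

▌Major new features
- GPU version of PostGIS functions
  ✓ Several PostGIS functions, such as st_contains() and st_dwithin(), are implemented for the GPU.
  ✓ GiST indexes are also supported, enabling fast matching of polygon definitions against coordinate data.
- GPU Cache mechanism
  ✓ Keeps relatively small data (up to about 10GB) resident in GPU device memory, so that search queries can skip reading the data at execution time.
  ✓ Applies update logs asynchronously, which keeps the CPU <-> GPU synchronization latency low.
- NVIDIA GPUDirect Storage support [experimental]
  ✓ Supports the driver that performs direct SSD->GPU data transfer from NVME/NVME-oF devices using P2P RDMA. (Its current status is 0.9 beta.)
  ✓ It offers functionality equivalent to the HeteroDB driver (nvme_strom.ko) and also supports SDS devices.
- User-defined GPU-aware data types / functions / operators
  ✓ Separately from PG-Strom itself, users can define their own data types, functions and operators for the GPU and have them executed as part of SCAN/JOIN/GROUP BY processing on the GPU.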

SQL Execution on GPU and Storage (1/3)

Case: using the PostgreSQL storage layer as-is

▌Description
- Data is read from tables through the PostgreSQL storage API.
- Host ➔ device data transfer is an additional processing cost, but parallel processing on the GPU and asynchronous DMA make the whole faster.

▌Problem
- It is slow.

[Figure: using the PostgreSQL storage layer as-is. Labels: PostgreSQL tables, filesystem, shared buffer, source buffer, asynchronous DMA, destination buffer, parallel Scan/Join/GroupBy, next step.]

SQL Execution on GPU and Storage (2/3)

Case: using SSD-to-GPU Direct SQL

▌Description
- Direct NVME-SSD ➔ GPU data transfer using P2P DMA, from tables placed on NVME/NVME-oF devices.
- Because it bypasses the filesystem, data can be read at close to the theoretical speed of the NVME devices (e.g. 10GB/s with an md-raid0 of 4 SSDs).
- Supports PostgreSQL tables (heap format) and the Apache Arrow format (columnar data).

▌Use cases
- The benefit is hard to notice unless the data is large (100GB or more).
  ➔ Log data analysis and similar workloads are the main target.

[Figure: using SSD-to-GPU Direct SQL. Labels: PostgreSQL tables, filesystem, shared buffer, P2P DMA memcpy (NVME/NVME-oF ➔ GPU), source buffer, destination buffer, parallel Scan/Join/GroupBy, next step.]

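SQL Execution on GPU and Storage (3/3)

Case: keeping the contents of a DB table resident on the GPU with GpuCache

▌Description
- The table contents are loaded in advance into a cache area allocated on GPU device memory.
- An AFTER trigger appends the update history to the REDO Log Buffer.
- At a suitable interval, or just before an analytic query runs, the REDO log is transferred to the GPU and the GPU Cache contents are updated.
- The data easily keeps up with update queries, and analytic queries do not need to read from the table at execution time.

▌Use cases
- Relatively small datasets (up to about 10GB per GPU) combined with workloads that contain complex search conditions
  ➔ Location data analysis and the like.

[Figure: keeping DB table contents resident on the GPU with GpuCache. Labels: transactional workloads (high-frequency, small-volume), PostgreSQL tables, shared buffer, REDO Log buffer, Asynchronous Log Applying, GPU Cache, destination buffer, parallel Scan/Join/GroupBy, next step.]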

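Target: Searching Data from Mobile Devices

▌Mobile devices
- Positions (latitude, longitude) are updated at high frequency.
- The data structure is often (device ID, timestamp, position, other attributes).

▌Area definitions
- Relatively few rows and almost static, but the area definitions (polygons) have very complex structures, so the hit test is "heavy".

▌Applications
- Trade-area analysis, logistics analysis, ad delivery, alert notification, etc.

[Figure: extracting the mobile devices that are inside an area, from the position data (latitude, longitude) of mobile devices and the area definitions (polygons).]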

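GpuCache Architecture (1/4): Background and Challenges

▌The GPU and host memory are "far apart"
- A GPU normally connects to the host system over the PCI-E bus.
- With PCI-E Gen 3.0 x16 lanes, bandwidth is "only" 16GB/s in one direction, and a round trip has a latency on the order of tens of microseconds.
- DRAM transfers at 140GB/s with a latency of about 100ns. Data that sits in the L1/L2 cache can be accessed within a few ns.
  Source: https://gist.github.com/eshelman/343a1c46cb3fba142c1afdcdeec17646

▌High update frequency
- What if one million devices each update their current position once every 10 seconds?
  ➔ A simple UPDATE runs 100,000 times per second.
  ➔ 1.0sec / 100,000 = 10us per update

▌Demand for real-time results
- What users want to search and analyze is "the latest state at that moment".
  ➔ Collecting the updates and applying them later in a batch does not work.

[Figure: CPU with host RAM (140GB/s) and GPU with device RAM (900GB/s), connected by PCI-E Gen3.0 (16GB/s).]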
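The update budget above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 20us round-trip figure is an assumed value picked from the "tens of microseconds" range, not a measurement.

```python
# Numbers taken from the slide.
DEVICES = 1_000_000        # one million devices
UPDATE_PERIOD_SEC = 10     # each device reports once every 10 seconds

updates_per_sec = DEVICES // UPDATE_PERIOD_SEC
budget_us = 1_000_000 / updates_per_sec      # microseconds available per UPDATE

# Assumption for illustration only: one PCI-E round trip takes 20 us.
ASSUMED_PCIE_ROUND_TRIP_US = 20.0
# Upper bound if every UPDATE waited for its own round trip on a single stream.
sync_updates_per_sec = 1_000_000 / ASSUMED_PCIE_ROUND_TRIP_US

print(f"required rate      : {updates_per_sec} updates/s")
print(f"budget per update  : {budget_us} us")
print(f"one-by-one GPU sync: {sync_updates_per_sec:.0f} updates/s at most")
```

Under that assumption, a design that synchronizes each row with the GPU individually cannot reach the required rate on a single stream, which is the motivation for buffering the changes on the host and applying them in bulk.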

GpuCache Architecture (2/4)

- An AFTER ROW trigger on the table that has the GPU Cache attached writes the update history into the REDO Log Buffer on shared memory.
- All of this is done in host memory, so it can keep up with high-frequency OLTP-style updates.

[Figure: several PostgreSQL backends run INSERT / UPDATE / DELETE on the cached table, and the AFTER ROW trigger appends REDO Log Entries to the REDO Log Buffer (host shared memory), which is used as a circular buffer. Two positions are tracked: the current position to write log entries, and the last position up to which log entries have already been applied.]
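The REDO Log Buffer described here behaves like a ring with a write position and an applied position. The following is a minimal sketch in Python, not PG-Strom's actual implementation: the class name `RedoLogBuffer`, the entry layout `(op, key, row)` and the choice to raise an error when the ring is full are all assumptions made for illustration.

```python
class RedoLogBuffer:
    """Toy circular log: positions only grow, slot indexes wrap around."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.write_pos = 0   # current position to write the next log entry
        self.read_pos = 0    # position up to which entries are already applied

    def pending(self) -> int:
        """Number of entries written but not yet applied."""
        return self.write_pos - self.read_pos

    def append(self, op: str, key: int, row=None) -> None:
        """What an AFTER ROW trigger would do for one changed row."""
        if self.pending() >= self.capacity:
            # Assumption of this sketch: refuse to overwrite unapplied entries.
            raise BufferError("REDO log buffer is full")
        self.slots[self.write_pos % self.capacity] = (op, key, row)
        self.write_pos += 1

    def drain(self) -> list:
        """Hand over every not-yet-applied entry, oldest first."""
        batch = [self.slots[pos % self.capacity]
                 for pos in range(self.read_pos, self.write_pos)]
        self.read_pos = self.write_pos
        return batch


log = RedoLogBuffer(capacity=4)
log.append("INSERT", 1, {"x": 139.70, "y": 35.60})
log.append("UPDATE", 1, {"x": 139.71, "y": 35.61})
log.append("DELETE", 1)
print(log.pending(), [entry[0] for entry in log.drain()], log.pending())
```

Keeping the positions monotonic and taking the modulo only when indexing makes "how much is unapplied" a simple subtraction, which is convenient for a size-based synchronization trigger.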

GpuCache Architecture (3/4)

- Once a certain amount of REDO Log Entries has accumulated, or once updates have paused for a certain time, a background worker transfers the unapplied log to the GPU and updates the GPU Cache in parallel. (Because of this, applying the log itself finishes in an instant.)
- New REDO Log Entries may be written while the unapplied log is being applied. They are reflected in the GPU Cache at the next log application.

[Figure: the background worker (GPU memory keeper) loads multiple log entries at once from the REDO Log Buffer (host shared memory), moving readPos from its old to its new position while writePos keeps advancing. A GPU kernel then updates the GPU Cache in GPU device memory in parallel, using thousands of processor cores.]
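The two conditions that wake up the background worker, and the effect of applying one batch, can be sketched as follows. This is a simplified model under stated assumptions: `should_sync` and `apply_batch` are hypothetical names, a Python dict stands in for the GPU Cache, and the sequential loop only reproduces the end result of what a GPU kernel would do in parallel.

```python
def should_sync(pending_bytes: int, secs_since_last_write: float,
                threshold_bytes: int, interval_secs: float) -> bool:
    """Decide whether the unapplied log should be pushed to the GPU now."""
    if pending_bytes == 0:
        return False                                  # nothing to apply
    if pending_bytes >= threshold_bytes:
        return True                                   # enough log has accumulated
    return secs_since_last_write >= interval_secs     # updates have paused


def apply_batch(cache: dict, entries: list) -> None:
    """Apply one batch of (op, key, row) entries in log order."""
    for op, key, row in entries:
        if op == "DELETE":
            cache.pop(key, None)
        else:                                         # INSERT or UPDATE
            cache[key] = row


cache = {}
apply_batch(cache, [("INSERT", 7, {"x": 1.0}), ("UPDATE", 7, {"x": 2.0})])
# Entries logged while that batch was being applied are picked up next time.
apply_batch(cache, [("INSERT", 8, {"x": 3.0}), ("DELETE", 7, None)])
print(cache)
```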

GpuCache Architecture (4/4)

- When an analytic or search SQL that uses GpuScan/GpuJoin/GpuPreAgg is executed, the flow is:
  ① Apply to the GPU side the REDO Log Entries that are still unapplied when execution starts.
  ② Execute the SQL by referring only to the GPU Cache, without referring to the table.
- As a result, analytic and search workloads also always run against the latest data.

[Figure: ① the analytic PostgreSQL backend commands the background worker (GPU memory keeper) to apply redo log entries, if any; the log entries not yet applied at the time the OLAP workload begins SQL execution are loaded into the GPU Cache in GPU device memory. ② The backend launches the GPU kernels for Scan/Join/GroupBy against the GPU Cache.]
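The two-step flow (apply whatever is pending, then answer from the cache only) can be illustrated with a toy model. Everything here is hypothetical scaffolding: `CachedTable`, its methods and the dict used as the "GPU Cache" are stand-ins, not PG-Strom APIs.

```python
class CachedTable:
    """Toy model: a list of pending log entries plus a dict used as the cache."""

    def __init__(self):
        self.pending = []      # unapplied (op, key, row) entries
        self.gpu_cache = {}    # key -> row; stand-in for GPU device memory

    def log_change(self, op, key, row=None):
        """Record one row change, as the sync trigger would."""
        self.pending.append((op, key, row))

    def apply_pending(self):
        """Step 1: bring the cache up to date with everything logged so far."""
        for op, key, row in self.pending:
            if op == "DELETE":
                self.gpu_cache.pop(key, None)
            else:
                self.gpu_cache[key] = row
        self.pending.clear()

    def scan(self, predicate):
        """Step 2: answer the query from the cache only, never from the table."""
        self.apply_pending()
        return sorted(k for k, row in self.gpu_cache.items() if predicate(row))


table = CachedTable()
table.log_change("INSERT", 1, {"x": 0.5})
table.log_change("INSERT", 2, {"x": 2.0})
table.log_change("UPDATE", 2, {"x": 0.1})      # logged just before the query
print(table.scan(lambda row: row["x"] < 1.0))  # the query sees the update
```

Because step 1 always runs before step 2, a query never observes a cache that is older than the log at the moment the query started.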

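Note: How Do GpuCache and Gstore_Fdw Differ?

▌Improvements in GpuCache
✓ The "original" is always the PostgreSQL table, and the GPU only holds a copy of it. So if, for example, an IndexScan is more reasonable, the optimizer chooses that instead.
✓ Because it is implemented with triggers, it can coexist with logical replication.
  ➔ A configuration that splits roles between an OLTP DB without a GPU and an OLAP DB with a GPU is also possible.
✓ Transaction durability relies on PostgreSQL, so the usual OLTP tuning methods apply. No special hardware other than the GPU is required any more.

▌What is Gstore_Fdw?
✓ A feature that was being developed for PG-Strom v3.0. It was based on FDW and kept data resident in GPU memory.
✓ Asynchronous updates using a REDO log, skipping the data load from tables, and so on: in short, the same concept as GpuCache.

▌Problems (of Gstore_Fdw)
✓ Being an FDW, it cannot have arbitrary indexes (only '=' equality comparison on the designated primary key).
✓ It cannot be used in a replication configuration.
✓ Update performance requires PMEM, which makes it hard to use in the cloud.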

Configuring GpuCache (1/2)

▌Conditions under which GpuCache is activated
- The table has a trigger, defined as AFTER INSERT OR UPDATE OR DELETE FOR ROW, that executes pgstrom.gpucache_sync_trigger().
- The table has a trigger, defined as AFTER TRUNCATE FOR STATEMENT, that executes pgstrom.gpucache_sync_trigger().
- (With logical replication, in addition) the triggers above are configured to fire on the replication destination as well.

▌Example
    # create trigger row_sync after insert or update or delete on mytest
          for each row execute function pgstrom.gpucache_sync_trigger();
    # create trigger stmt_sync after truncate on mytest
          for each statement execute function pgstrom.gpucache_sync_trigger();
    # alter table mytest enable always trigger row_sync;
    # alter table mytest enable always trigger stmt_sync;

Configuring GpuCache (2/2)

▌Customizing the GpuCache configuration
    # create trigger row_sync after insert or update or delete on mytest
          for each row execute function
          pgstrom.gpucache_sync_trigger('max_num_rows=25000000');

The trigger argument takes the settings in the form <key>=<value>[,...].

▌Available parameters
✓ max_num_rows=NROWS (default: 10M rows)
  Number of rows that can be held in the GPU cache. The row count can grow temporarily, for example because of updated rows that are not yet committed, so specify it with some headroom.
✓ redo_buffer_size=SIZE (default: 160MB)
  Size of the REDO buffer. The units 'k', 'm' and 'g' can be used.
✓ gpu_sync_interval=SECONDS (default: 5sec)
  When SECONDS seconds have passed since the last write to the REDO buffer, the changes are applied to the GPU side even if only a few rows were updated.
✓ gpu_sync_threshold=SIZE (default: 25% of redo_buffer_size)
  When the unapplied part of what was written to the REDO buffer reaches SIZE bytes, the log is applied on the GPU side. The units 'k', 'm' and 'g' can be used.
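To make the `<key>=<value>[,...]` format and the defaults concrete, here is a small stand-alone parser. It is a sketch for illustration, not the parser PG-Strom uses: the function names are hypothetical, the units are assumed to be binary multiples (1k = 1024), and "10M rows" is taken as 10 x 2^20.

```python
UNITS = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}   # assumed binary units
SIZE_KEYS = ("redo_buffer_size", "gpu_sync_threshold")


def parse_size(text: str) -> int:
    """Turn '160m' or '4096' into a number of bytes."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def parse_gpucache_options(arg: str = "") -> dict:
    """Parse '<key>=<value>[,...]' and fill in the defaults listed above."""
    opts = {
        "max_num_rows": 10 * (1 << 20),          # "10M rows" (assumed 10 * 2^20)
        "redo_buffer_size": 160 * UNITS["m"],    # 160MB
        "gpu_sync_interval": 5,                  # seconds
        "gpu_sync_threshold": None,              # derived below when not given
    }
    for item in filter(None, (part.strip() for part in arg.split(","))):
        key, sep, value = item.partition("=")
        key = key.strip()
        if not sep or key not in opts:
            raise ValueError(f"unknown or malformed option: {item!r}")
        opts[key] = parse_size(value) if key in SIZE_KEYS else int(value)
    if opts["gpu_sync_threshold"] is None:
        # Default: 25% of redo_buffer_size.
        opts["gpu_sync_threshold"] = opts["redo_buffer_size"] // 4
    return opts


print(parse_gpucache_options("max_num_rows=25000000"))
print(parse_gpucache_options("redo_buffer_size=1g,gpu_sync_interval=2"))
```

Deriving the threshold after parsing means that enlarging `redo_buffer_size` alone also raises the default threshold, which matches the "25% of redo_buffer_size" wording.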

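GpuCache Performance Evaluation (1/3)

- Evaluation ①: update performance
  ✓ Update the coordinate columns 'lat' and 'lon' for a specified device ID and measure the TPS. (The scripts on the following slides refer to these columns as x and y.)
- Evaluation ②: aggregation performance
  ✓ Count the points contained in "Tokyo", aggregated by municipality, and measure the response time.

    CREATE TABLE dpoints (
      dev_id int primary key,
      ts     timestamp,
      lat    float,
      lon    float
    );

8 million random points are generated inside the rectangular region with corners (138.565, 36.155) and (141.000, 35.000).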

GpuCache Performance Evaluation (2/3): Update Performance

▌Test script mytest.sql
    \set __dev_id (int(random(1, 8000000)))
    \set __new_x (double(random(1385650, 1410000)) / 10000.0)
    \set __new_y (double(random( 350000, 361550)) / 10000.0)
    UPDATE dpoints SET ts = now(), x = :__new_x, y = :__new_y WHERE dev_id = :__dev_id;

    $ pgbench -n -f mytest.sql postgres -c 32 -j 32 -T 10 -M prepared
     :
    number of transactions actually processed: 1258486
    latency average = 0.255 ms
    tps = 125661.080344 (excluding connections establishing)

➔ The script above is run continuously for 10 seconds with 32 connections and 32 threads.

▌Results
- With GpuCache: TPS=125,661
- Without GpuCache: TPS=133,938

▌Reference results (dpoints defined as an unlogged table)
- With GpuCache: TPS=144,096
- Without GpuCache: TPS=163,724

Appending to GpuCache costs roughly 10% in TPS (about 6% for the regular table and about 12% for the unlogged table).
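The relative cost can be recomputed directly from the four TPS values reported above. The helper name `tps_overhead` is introduced here only for this calculation.

```python
def tps_overhead(with_cache: float, without_cache: float) -> float:
    """Relative TPS loss, in percent, when GpuCache is maintained."""
    return (without_cache - with_cache) / without_cache * 100.0


logged = tps_overhead(with_cache=125_661, without_cache=133_938)
unlogged = tps_overhead(with_cache=144_096, without_cache=163_724)

print(f"regular table : {logged:.1f}% lower TPS with GpuCache")
print(f"unlogged table: {unlogged:.1f}% lower TPS with GpuCache")
```

The overhead is larger in relative terms for the unlogged table, which is consistent with the trigger cost staying roughly constant while the baseline write path gets cheaper.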

GpuCache Performance Evaluation (3/3): Analytic Performance

▌Test script
    SELECT n03_001 pref, n03_004 city, count(*)
      FROM giscity g, dpoints d
     WHERE n03_001 = '東京都'
       AND st_contains(g.geom, st_makepoint(d.x, d.y))
     GROUP BY pref, city
     ORDER BY pref, city;

▌Results
- PG-Strom + GpuCache: 857.203ms
- PG-Strom (without GpuCache): 1368.695ms
  ➔ This probably depends heavily on the data structure. The data here has only (device ID, timestamp, latitude, longitude), whereas normally additional metadata fields would have to be loaded as well.
- PostgreSQL: 55749.379ms
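Expressed as ratios, the three response times above compare as follows. The dictionary keys are labels chosen for this sketch.

```python
# Response times in milliseconds, copied from the results above.
ELAPSED_MS = {
    "pgstrom_gpucache": 857.203,
    "pgstrom_only": 1368.695,
    "postgresql": 55749.379,
}


def speedup(baseline: str, target: str) -> float:
    """How many times faster `target` is than `baseline`."""
    return ELAPSED_MS[baseline] / ELAPSED_MS[target]


print(f"vs PostgreSQL               : {speedup('postgresql', 'pgstrom_gpucache'):.1f}x")
print(f"vs PG-Strom without GpuCache: {speedup('pgstrom_only', 'pgstrom_gpucache'):.2f}x")
print(f"PG-Strom alone vs PostgreSQL: {speedup('postgresql', 'pgstrom_only'):.1f}x")
```

Most of the gain over plain PostgreSQL comes from running the PostGIS join on the GPU; GpuCache then removes the table read from what remains.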

Note: Analytic Query Execution Plan (PG-Strom = On)

     GroupAggregate (actual time=848.333..848.357 rows=54 loops=1)
       Group Key: g.n03_001, g.n03_004
       ->  Sort (actual time=848.322..848.328 rows=54 loops=1)
             Sort Key: g.n03_004
             Sort Method: quicksort  Memory: 29kB
             ->  Custom Scan (GpuPreAgg) (actual time=847.395..847.407 rows=54 loops=1)
                   Reduction: Local
                   Combined GpuJoin: enabled
                   GPU Preference: GPU0 (NVIDIA Tesla V100-PCIE-16GB)
                   ->  Custom Scan (GpuJoin) on dpoints d (never executed)
                         Outer Scan: dpoints d (never executed)
                         Depth 1: GpuGiSTJoin
                                  HeapSize: 7841.91KB (estimated: 3251.36KB), IndexSize: 13.25MB
                                  IndexFilter: (g.geom ~ st_makepoint(d.x, d.y)) on giscity_geom
                                  Rows Fetched by Index: 804104
                                  JoinQuals: st_contains(g.geom, st_makepoint(d.x, d.y))
                         GPU Preference: GPU0 (NVIDIA Tesla V100-PCIE-16GB)
                         GPU Cache: NVIDIA Tesla V100-PCIE-16GB [max_num_rows: 12000000]
                         GPU Cache Size: main: 772.51M, extra: 0
                         ->  Seq Scan on giscity g (actual time=0.173..19.394 rows=6173 loops=1)
                               Filter: ((n03_001)::text = '東京都'::text)
                               Rows Removed by Filter: 112726
     Planning Time: 0.351 ms
     Execution Time: 857.203 ms
    (24 rows)

Note: Analytic Query Execution Plan (PG-Strom = Off)

     Finalize GroupAggregate (actual time=56053.249..56084.157 rows=54 loops=1)
       Group Key: g.n03_001, g.n03_004
       ->  Gather Merge (actual time=56052.975..56084.093 rows=270 loops=1)
             Workers Planned: 4
             Workers Launched: 4
             ->  Partial GroupAggregate (actual time=56033.094..56053.135 rows=54 loops=5)
                   Group Key: g.n03_001, g.n03_004
                   ->  Sort (actual time=56032.296..56040.879 rows=81620 loops=5)
                         Sort Key: g.n03_004
                         Sort Method: quicksort  Memory: 10188kB
                         Worker 0:  Sort Method: quicksort  Memory: 10090kB
                         Worker 1:  Sort Method: quicksort  Memory: 10147kB
                         Worker 2:  Sort Method: quicksort  Memory: 9950kB
                         Worker 3:  Sort Method: quicksort  Memory: 10060kB
                         ->  Nested Loop (actual time=0.841..55383.095 rows=81620 loops=5)
                               ->  Parallel Seq Scan on dpoints d (actual time=0.017..161.622 rows=1600000 loops=5)
                               ->  Index Scan using giscity_geom on giscity g (actual time=0.033..0.034 rows=0 loops=8000000)
                                     Index Cond: (geom ~ st_makepoint(d.x, d.y))
                                     Filter: (((n03_001)::text = '東京都'::text) AND st_contains(geom, st_makepoint(d.x, d.y)))
                                     Rows Removed by Filter: 1
     Planning Time: 0.162 ms
     Execution Time: 56087.117 ms
    (22 rows)

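GpuCache and Logical Replication (1/6)

▌Background
- Assuming a cloud environment, the hardware configuration often cannot be chosen freely. GPU instances in particular are limited (AWS: p3.*, Azure: NC*s_v3).
- Fast storage such as PMEM or NVME-SSD is also hard to use.
- If the CPU of a GPU instance is too weak to handle the transactional load, the chain "more CPU cores ➔ more GPUs at the same time ➔ higher instance cost" is painful.
- Gstore_Fdw, the FDW-based feature implemented first, did not take these points into account.

▌What about GpuCache?
- The "original" is always the PostgreSQL table, so logical replication can be used.
- The analytics DB side can use unlogged tables to gain update performance.

[Figure: transactional DBs feed the analytics DB through logical replication; analytic queries run on the analytics DB.]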

GpuCache and Logical Replication (2/6)

▌Publication-side setup
- Edit postgresql.conf
  ✓ Set wal_level = logical
- Create the table (CREATE TABLE ...)
- Load the initial data
- Create the publication (CREATE PUBLICATION ...)

▌Subscription-side setup
- Edit postgresql.conf
  ✓ Set shared_preload_libraries = '$libdir/pg_strom'
- Create the table (CREATE UNLOGGED TABLE)
- Create the GpuCache-related triggers and enable them during replication
- Create the subscription (CREATE SUBSCRIPTION ...)

* The setup procedure is the same as for ordinary logical replication.

▌Instances
- Publication side (m5.2xlarge): cost 0.496USD/h, 8 vCPUs, 32GB memory, EBS(gp3) storage. Holds the dpoints_even table and the dpoints_odd table.
- Subscription side (p3.2xlarge): cost 4.194USD/h, 8 vCPUs, 61GB memory, EBS(gp3) storage. Holds the dpoints table (partitioned):
    ├ dpoints_odd table (unlogged)
    └ dpoints_even table (unlogged)

Persistent storage on IaaS is slow (compared with PMEM or NVME), but if the transactional side already makes the data durable, does the analytics side need durability at all?
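The cost argument can be made concrete with the hourly prices quoted above. The comparison below is a hypothetical illustration: it assumes two publication-side instances plus one GPU instance, and compares that with an alternative (not described in the talk) that reaches the same 24 vCPUs using GPU instances only.

```python
# Hourly prices in USD, as quoted above.
PRICE_USD_PER_HOUR = {"m5.2xlarge": 0.496, "p3.2xlarge": 4.194}
VCPUS = {"m5.2xlarge": 8, "p3.2xlarge": 8}


def hourly_cost(instances: dict) -> float:
    """Total USD/h for a mapping of instance type to instance count."""
    return round(sum(PRICE_USD_PER_HOUR[t] * n for t, n in instances.items()), 3)


def total_vcpus(instances: dict) -> int:
    return sum(VCPUS[t] * n for t, n in instances.items())


split = {"p3.2xlarge": 1, "m5.2xlarge": 2}   # analytics on GPU, OLTP on CPU-only
gpu_only = {"p3.2xlarge": 3}                 # hypothetical: scale GPU instances

for name, layout in (("split", split), ("gpu_only", gpu_only)):
    print(name, total_vcpus(layout), "vCPUs", hourly_cost(layout), "USD/h")
```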

GpuCache and Logical Replication (3/6)

Transactional-side setup (for 2 instances)

(Setup on instance ①)
    =# create table dpoints_odd (dev_id int primary key, ts timestamp,
                                 x float, y float, check (dev_id % 2 = 1));
    CREATE TABLE
    =# grant SELECT on dpoints_odd to slave;
    GRANT
    =# insert into dpoints_odd (select x, now(), random(), random()
                                  from generate_series(1,8000000, 2) x);
    INSERT 0 4000000
    =# CREATE PUBLICATION repl_odd FOR TABLE dpoints_odd;
    CREATE PUBLICATION

(Setup on instance ②)
    =# create table dpoints_even (dev_id int primary key, ts timestamp,
                                  x float, y float, check (dev_id % 2 = 0));
    CREATE TABLE
    =# grant SELECT on dpoints_even to slave;
    GRANT
    =# insert into dpoints_even (select x, now(), random(), random()
                                   from generate_series(2,8000000, 2) x);
    INSERT 0 4000000
    =# CREATE PUBLICATION repl_even FOR TABLE dpoints_even;
    CREATE PUBLICATION

GpuCache and Logical Replication (4/6)

Analytics-side setup (GPU instance)

    =# create table dpoints (dev_id int not null, ts timestamp, x float, y float)
           partition by range ((dev_id % 2));

    =# create table dpoints_even partition of dpoints for values from ( 0 ) to ( 1 );
    =# create unique index dpoints_even__dev_id on dpoints_even(dev_id);
    =# alter table dpoints_even add primary key using index dpoints_even__dev_id;
    =# create trigger row_sync after insert or update or delete on dpoints_even
           for row execute function pgstrom.gpucache_sync_trigger();
    =# create trigger stmt_sync after truncate on dpoints_even
           for statement execute function pgstrom.gpucache_sync_trigger();
    =# alter table dpoints_even enable always trigger row_sync;
    =# alter table dpoints_even enable always trigger stmt_sync;

    -- repeat the same steps for dpoints_odd
    =# create table dpoints_odd partition of dpoints for values from ( 1 ) to ( 2 );
    =# create unique index dpoints_odd__dev_id on dpoints_odd(dev_id);
    =# alter table dpoints_odd add primary key using index dpoints_odd__dev_id;
    =# create trigger row_sync after insert or update or delete on dpoints_odd
           for row execute function pgstrom.gpucache_sync_trigger();
    =# create trigger stmt_sync after truncate on dpoints_odd
           for statement execute function pgstrom.gpucache_sync_trigger();
    =# alter table dpoints_odd enable always trigger row_sync;
    =# alter table dpoints_odd enable always trigger stmt_sync;

    -- start logical replication
    =# CREATE SUBSCRIPTION repl_odd
           CONNECTION 'host=172.31.19.228 dbname=postgres user=slave'
           PUBLICATION repl_odd;
    =# CREATE SUBSCRIPTION repl_even
           CONNECTION 'host=172.31.20.230 dbname=postgres user=slave'
           PUBLICATION repl_even;

The "enable always trigger" statements are the setting required for the triggers to fire on the subscriber side of logical replication.
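How the range partitioning on `dev_id % 2` routes a row can be mimicked in a few lines. This is an illustrative sketch only: `route` is a hypothetical helper, and it rejects negative IDs because Python's `%` and PostgreSQL's `%` disagree on the sign of the result for negative operands.

```python
# Partition bounds copied from the DDL above: [lower, upper) on (dev_id % 2).
PARTITIONS = {
    "dpoints_even": (0, 1),   # for values from (0) to (1)
    "dpoints_odd": (1, 2),    # for values from (1) to (2)
}


def route(dev_id: int) -> str:
    """Return the partition that a row with this dev_id would be stored in."""
    if dev_id < 0:
        raise ValueError("sketch covers non-negative device IDs only")
    key = dev_id % 2                      # the partition key expression
    for name, (lower, upper) in PARTITIONS.items():
        if lower <= key < upper:          # range bounds: inclusive lower, exclusive upper
            return name
    raise LookupError(f"no partition for key {key}")


for dev_id in (932294, 1396325):
    print(dev_id, "->", route(dev_id))
```

The same even/odd rule matches the CHECK constraints on the two publisher tables, so each replicated row lands in the partition of the same name.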

GpuCache and Logical Replication (5/6)

    $ cat mytest_even.sql
    \set __dev_id (int(random(1, 4000000) * 2))
    \set __new_x (double(random(1,1000000)) / 10000.0)
    \set __new_y (double(random(1,1000000)) / 10000.0)
    UPDATE dpoints_even SET ts = now(), x = :__new_x, y = :__new_y WHERE dev_id = :__dev_id;

    $ /usr/pgsql-12/bin/pgbench -h 172.31.19.228 -n -f mytest_odd.sql postgres \
          -c 96 -j 96 -T 10 -M prepared &
      /usr/pgsql-12/bin/pgbench -h 172.31.20.230 -n -f mytest_even.sql postgres \
          -c 96 -j 96 -T 10 -M prepared
     :
    number of transactions actually processed: 242223
    latency average = 3.988 ms
    tps = 24073.372665 (including connections establishing)
    tps = 31549.989987 (excluding connections establishing)
     :
    number of transactions actually processed: 210376
    latency average = 4.600 ms
    tps = 20870.313617 (including connections establishing)
    tps = 26237.257896 (excluding connections establishing)

Even instances with 8 vCPUs + EBS(gp3) reach about 57k TPS in total across the two instances (31.5k + 26.2k, excluding connection establishment).
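The combined throughput of the two transactional instances can be checked by adding up the two pgbench reports. The values are copied from the output above; the helper name `total_tps` is introduced only for this sketch.

```python
# tps values copied from the two pgbench reports above.
REPORTS = [
    {"including": 24073.372665, "excluding": 31549.989987},
    {"including": 20870.313617, "excluding": 26237.257896},
]


def total_tps(reports: list, key: str) -> float:
    """Sum one tps figure ('including' or 'excluding') over all reports."""
    return sum(report[key] for report in reports)


print(f"excluding connection setup: {total_tps(REPORTS, 'excluding'):.0f} TPS")
print(f"including connection setup: {total_tps(REPORTS, 'including'):.0f} TPS")
```

With a 10-second run and 96 connections per instance, connection setup takes a visible share of the measured time, which is why the two totals differ noticeably.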

GpuCache and Logical Replication (6/6)

Checking the execution plan and the data contents on the GPU server

    postgres=# explain select * from dpoints where x < 0.001 and y < 0.001 order by dev_id;
                                               QUERY PLAN
    -------------------------------------------------------------------------------------------------
     Sort  (cost=10598.31..10598.31 rows=3 width=28)
       Sort Key: dpoints_even.dev_id
       ->  Append  (cost=4285.90..10598.28 rows=3 width=28)
             ->  Custom Scan (GpuScan) on dpoints_even  (cost=4285.90..5301.80 rows=2 width=28)
                   GPU Filter: ((x < '0.001'::double precision) AND (y < '0.001'::double precision))
                   GPU Cache: NVIDIA Tesla V100-SXM2-16GB [max_num_rows: 10485760]
                   GPU Cache Size: main: 675.03M, extra: 0
             ->  Custom Scan (GpuScan) on dpoints_odd  (cost=4285.90..5296.47 rows=1 width=28)
                   GPU Filter: ((x < '0.001'::double precision) AND (y < '0.001'::double precision))
                   GPU Cache: NVIDIA Tesla V100-SXM2-16GB [max_num_rows: 10485760]
                   GPU Cache Size: main: 675.03M, extra: 0
    (11 rows)

    postgres=# select * from dpoints where x < 0.001 and y < 0.001 order by dev_id;
     dev_id  |             ts             |           x            |           y
    ---------+----------------------------+------------------------+-----------------------
      932294 | 2021-05-10 04:08:16.916965 |  6.436344180471565e-05 | 0.0008733261591729047
     1141774 | 2021-05-10 04:08:16.916965 |  0.0007466930297361785 | 0.0002913637187091922
     1396325 | 2021-05-10 04:08:59.035472 | 0.00012029229877441594 | 0.0006813698993006767
     2281051 | 2021-05-10 04:08:59.035472 |  0.0006355283323991046 | 0.0008848143919486517
     5580669 | 2021-05-10 04:08:59.035472 | 1.9338331895824012e-05 | 0.0009735620154209812
     7717535 | 2021-05-10 04:08:59.035472 |  0.0001952751708920175 | 0.0009738177919409452
    (6 rows)

The rows with an even dev_id originate from dpoints_even and the rows with an odd dev_id from dpoints_odd; in both cases the contents match.

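Summary

▌PG-Strom v3.0
- Work is under way toward a release around the end of May.
- Besides GpuCache, the new features include the GPU version of PostGIS, NVIDIA GPUDirect Storage, and user-defined GPU functions and data types.

▌GpuCache
- Originally an implementation to replace Gstore_Fdw, which was built on FDW.
  ✓ Problems: indexes other than the primary key cannot be used, replication is not supported, and performance is poor without PMEM (= unusable in the cloud).
- A feature that keeps the "original" in a PostgreSQL table and uses triggers to load a copy of it into GPU memory in advance.
  ➔ Analytic SQL does not have to reload the data at execution time.
- About 10% lower TPS than ordinary table writes (measured with the DB built on NVME-SSD).
- A good fit for heavy analytic workloads such as the GPU version of PostGIS.
- Can be used together with PostgreSQL logical replication. If the transactional load is too heavy for a GPU instance, lining up inexpensive CPU-only instances is one option (perhaps).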
