CEO of HeteroDB Contributor of PostgreSQL (2006-) Primary Developer of PG-Strom (2012-) Interested in: Big-data, GPU, NVME/PMEM, ... about Myself about HeteroDB Established: 4th-Jul-2017 Location: Shinagawa, Tokyo, Japan Businesses: ✓ Development of high-performance data-processing software on top of heterogeneous architecture. ✓ Technology consulting service on GPU&DB area. PGconf.online 2021 - GPU version of PostGIS and GiST-Index 2 PG-Strom
and GiST-Index 3 Adds custom-scan/join/aggregate path with GPU execution SQL to GPU binary code generation & Just-in-time compilation Designed for IoT/M2M grade log-data processing Closely connected NVME-SSD and GPU using P2P data transfer [I/O acceleration] Geospatial analytics with GPU-version of PostGIS and GiST-index support PG-Strom: An extension of PostgreSQL to pull out maximum capability of GPU and NVME for processing of terabytes scale data App GPU off-loading for IoT/Big-Data for ML/Analytics PG-Strom’s Features ➢ JIT of GPU code with SQL & Transparent SQL acceleration ➢ SSD-to-GPU Direct SQL ➢ Columnar Store (Arrow_Fdw) ➢ Asymmetric Partition-wise JOIN/GROUP BY ➢ GPU Memory Store ➢ PostGIS support ➢ GiST-Index on GPU ➢ BRIN-Index support ➢ NVME-over-Fabric support ➢ Data-frame exchange for Python scripts NEW
GiST-Index 4 Over 10years history in HPC, then massive popularization in Machine-Learning NVIDIA Tesla V100 Super Computer (TITEC; TSUBAME3.0) Computer Graphics Machine-Learning More than thousands cores and TB/s grade memory bandwidth on a chip for highly computing intensive workloads. Simulation
Location (Longitude, Latitude) is updated very frequently. Its log data often contains (device_id, timestamp, location (point), other attributes) ▌Area definitions Relatively small items, and almost static data. Polygon often has very complicated form, thus heavy “collision detection”. ▌Purpose Area marketing, Logistics analytics, Advertisement delivery, Emergency Alert, etc… Latest Location (Point) Area definition (Polygon) Mobile device Extract mobile devices within the target areas PGconf.online 2021 - GPU version of PostGIS and GiST-Index 5 GPS
contains a polygon (that is often very complicated). (x1 ,y1 ) - (x2 ,y2 ) form by FP32 (= 16bytes) PostGIS assigns a bounding-box when it stores geometry values It allows to skip heavy operations prior to geolocational operations obviously disjointed Tokyo Germany France PGconf.online 2021 - GPU version of PostGIS and GiST-Index 9
Tree) - A framework of index at PostgreSQL PostGIS implements R-tree for Geometry type Check for contains (‘@’ operator), overlaps (‘&&’ operator) Works efficiently to check relationship between very large number of points and many polygons. # Data published by Geospatial Information Authority of Japan $ shp2pgsql N03-20_200101.shp | psql gistest gistest=# ¥d+ List of relations Schema | Name | Type | Owner | Size | --------+-------------+----------+--------+------------+ public | geo_japan | table | kaigai | 243 MB | gistest=# ¥di+ List of relations Schema | Name | Type | Owner | Table | Size | --------+--------------------+-------+--------+-----------+---------+ public | geo_japan_pkey | index | kaigai | geo_japan | 2616 kB | public | geo_japan_geom_idx | index | kaigai | geo_japan | 14 MB | PGconf.online 2021 - GPU version of PostGIS and GiST-Index 10
GPU version of PostGIS and GiST-Index 11 ▌GPU’s characteristics SIMT (Single-Instruction Multiple-Threads) architecture More than thousands processor cores and 1.0TB/s grade memory band ➔ Designed for “same operations on large number of values” ▌SQL characteristics “same operations on large number of values”, like evaluation of WHERE-clause, JOIN or GROUP BY. SQL runs WHERE, JOIN, GROUP BY on very large number of rows Thousands processor units on GPU evaluates thousands rows in parallel CPU Parallel GPU Parallel
FROM ft WHERE st_contains('polygon ((10 10,90 10,90 12,12 12,12 88,90 88,90 90,¥ 10 90,10 10))’, st_makepoint(x,y)); count -------- 236610 (1 row) Time: 44.680 ms --- Vanilla PostGIS =# SET pg_strom.enabled = off; SET =# SELECT count(*) FROM tt WHERE st_contains('polygon ((10 10,90 10,90 12,12 12,12 88,90 88,90 90,¥ 10 90,10 10))', st_makepoint(x,y)); count -------- 236610 (1 row) Time: 622.314 ms Count number of points in the specified area from 5 million points. PGconf.online 2021 - GPU version of PostGIS and GiST-Index 14
float8 st_distance(geometry,geometry) bool st_dwithin(geometry,geometry,float8) bool st_contains(geometry,geometry) bool st_crosses(geometry,geometry) text st_relate(geometry,geometry) ...and more in the future version Current Status of the supported functions PGconf.online 2021 - GPU version of PostGIS and GiST-Index 16
PostGIS and GiST-Index 17 Number of Polygons x Points combination is too large even for GPU Latest Location (Point) Area definition (Polygon) Mobile device GPS 100K-10M devices 100 - 100K polygons 10M-1T Combinations?
is a rectangle [(Xmin ,Ymin ) – (Xmax ,Ymax )] that contains all of (R3, R4, R5) and their pointers ✓ R4 is a rectangle that contains all of (R11,R12) and their pointers ✓ R12 is a rectangle that contains the target geometry and its ItemPointer ✓ Sequentially evaluates the entry for each tree-node. Dive into the next depth only if matched. ✓ Not fast as like B-tree, because of sequential evaluation for each depth in R-tree. (xmin,ymin) (xmax,ymax) PGconf.online 2021 - GPU version of PostGIS and GiST-Index 19 Search Key (Lon,Lat) 〇 〇
is a rectangle [(Xmin ,Ymin ) – (Xmax ,Ymax )] that contains all of (R3, R4, R5) and their pointers ✓ R4 is a rectangle that contains all of (R11,R12) and their pointers ✓ R12 is a rectangle that contains the target geometry and its ItemPointer ✓ Sequentially evaluates the entry for each tree-node. Dive into the next depth only if matched. ✓ Not fast as like B-tree, because of sequential evaluation for each depth in R-tree. (xmin,ymin) (xmax,ymax) PGconf.online 2021 - GPU version of PostGIS and GiST-Index 20 Search Key (Lon,Lat) 〇 〇 × 〇 ×
is a rectangle [(Xmin ,Ymin ) – (Xmax ,Ymax )] that contains all of (R3, R4, R5) and their pointers ✓ R4 is a rectangle that contains all of (R11,R12) and their pointers ✓ R12 is a rectangle that contains the target geometry and its ItemPointer ✓ Sequentially evaluates the entry for each tree-node. Dive into the next depth only if matched. ✓ Not fast as like B-tree, because of sequential evaluation for each depth in R-tree. (xmin,ymin) (xmax,ymax) PGconf.online 2021 - GPU version of PostGIS and GiST-Index 21 Search Key (Lon,Lat) 〇 〇 × 〇 × 〇 ×
is a rectangle [(Xmin ,Ymin ) – (Xmax ,Ymax )] that contains all of (R3, R4, R5) and their pointers ✓ R4 is a rectangle that contains all of (R11,R12) and their pointers ✓ R12 is a rectangle that contains the target geometry and its ItemPointer ✓ Sequentially evaluates the entry for each tree-node. Dive into the next depth only if matched. ✓ Not fast as like B-tree, because of sequential evaluation for each depth in R-tree. (xmin,ymin) (xmax,ymax) PGconf.online 2021 - GPU version of PostGIS and GiST-Index 22 Search Key (Lon,Lat) 〇 〇 × 〇 × 〇 × × ×
PostGIS and GiST-Index 23 ▌Overview PG-Strom may utilize GiST-Index for joining a table with area definition (polygons; small) and a table with locational data (points; large). Both the area definition table and its index are loaded onto GPU on GpuJoin. GpuJoin looks at the GiST-index first for rough pruning, using bounding-box Then, evaluate the “collision detection” with polygon value on the table Above operations by thousands cores of GPU in parallel, so we expected its search performance is better, but... collision detection of Polygons x Points as a part of GpuJoin GiST-Index (R-tree) Area Definitions (Polygon) A table with Location Data (Points) Thousands threads search R-tree index in parallel
GPU version of PostGIS and GiST-Index 24 (123.0, 20.0) (154.2, 46.2) SELECT n03_001,n03_004,count(*) FROM geo_japan j, geopoint p WHERE st_contains(j.geom, st_makepoint(x,y)) AND j.n03_001 like ‘Tokyo’ GROUP BY n03_001,n03_004; 10M of randomly generated geolocational points n03_001 | n03_004 | count ---------+-------------+------- Tokyo | Akiruno | 105 Tokyo | Miyake | 76 Tokyo | Mitaka | 17 Tokyo | Setagaya-ku | 67 Tokyo | Chuo-ku | 12 Tokyo | Nakano-ku | 18 Tokyo | Hachijo | 105 : : : Tokyo | Toshima-ku | 14 Tokyo | Adachi-ku | 55 Tokyo | Aogashima | 7 Tokyo | Ome | 117 (63 rows) CPU-rev: 30.539s GPU-rev: 33.841s (Slow!) Area definition data by the Geospatial Information Authority of Japan
PostGIS and GiST-Index 25 GPU groups its processor cores by Streaming Multiprocessor(SM) ✓ Cores, Registers and L1 cache (shared memory) are managed per streaming-multiprocessor A SM can run up to 1024 threads simultaneously on the shared 64 cores/SM ✓ Like a very large scale hyper-threading in CPU Threads are scheduled per warp (32threads). called SIMT (Single-Instruction Multiple-Thread) architecture ✓ Uniformed workloads, like matrix operations, fully utilizes the processor cores ✓ If a particular thread tends to consume large cycles, other threads in a warp must wait. GPU Block Diagram (Tesla V100; 80SMs) Streaming Multiprocessors (64CUDA cores/SM) •••…••• •••…••• •••…••• •••…••• •••…••• •••…••• Thread Group (1~1024 Threads) Warp (32 threads)
of PostGIS and GiST-Index 26 SELECT * FROM A, B WHERE A.id = B.id; Table-A 512 threads fetches 512 rows at once. Table-B • • • • • • 〇 × 〇 × 〇 × GpuHashJoin / GpuNestLoop N = __syncthreads_count(...) Write out the JOIN results (if any) thread-0 allocates result buffer for N-items Fetch next frame, and repeat the above steps. GpuHashJoin • Hash calculation • Search Hash table • Evaluation of JOIN condition GpuNestLoop • Evaluation of JOIN condition ➔little differences in processing cycles between the threads References
of PostGIS and GiST-Index 27 Large variance of thread’s processing time, that leads synchronization wait! Table-A • • • • • • Reference Index-A × × × × × 〇 N = __syncthreads_count(...) Reference Walk down to the leaf node of R-tree, but no matched item No matched items, by just looking at the root node of R-tree Walk down to the leaf node of R-tree, and evaluate JOIN conditions Other GPU cores must be idle by the completion of the longest operations.
GPU version of PostGIS and GiST-Index 28 Little utilization ratio of GPU cores, due to inter-core synchronization • • • • • • • • Every thread loads a row from the input buffer, and extract the key for GiST-Index. Hit on the GiST-Index? Search the GiST-Index by the key Is the JOIN-condition true? Evaluation of the JOIN-condition write out JOIN-results (if found) nitems = __syncthreads_count(found); Repeat found=true found=false No No A thread that found a matched entry can block other 511 threads in the same thread-group until completion of the evaluation of JOIN-condition. Very low efficiency of GPU core usage.
version of PostGIS and GiST-Index 29 Minimization of the synchronization point, to pull up utilization ratio • • • • • • • • Every thread loads a row from the input buffer, and extract the key for GiST-Index. Search the GiST-Index by the key Hit on the GiST-Index? Allocation of the temporary buffer, to save the pointers that hit GiST-Index above. Consumption of temporary buffer exceeds 512 items. __syncthreads() Evaluation of the JOIN-condition Is the JOIN-condition true? No No nitems = __syncthreads_count(found) Write out JOIN results Repeat found=false found=true As long as temporary buffer has space, threads continue to fetch rows and search GiST-Index. ➔ Then, evaluate JOIN-conditions by all the threads at once.
2021 - GPU version of PostGIS and GiST-Index 30 (123.0, 20.0) (154.2, 46.2) SELECT n03_001,n03_004,count(*) FROM geo_japan j, geopoint p WHERE st_contains(j.geom, st_makepoint(x,y)) AND j.n03_001 like ‘Tokyo’ GROUP BY n03_001,n03_004; 10M of randomly generated geolocational points n03_001 | n03_004 | count ---------+-------------+------- Tokyo | Akiruno | 105 Tokyo | Miyake | 76 Tokyo | Mitaka | 17 Tokyo | Setagaya-ku | 67 Tokyo | Chuo-ku | 12 Tokyo | Nakano-ku | 18 Tokyo | Hachijo | 105 : : : Tokyo | Toshima-ku | 14 Tokyo | Adachi-ku | 55 Tokyo | Aogashima | 7 Tokyo | Ome | 117 (63 rows) CPU-rev: 30.539s GPU-rev: 0.316s Area definition data by the Geospatial Information Authority of Japan 100 times faster!!
GiST-Index 33 ▌GPU-version PostGIS Enhancement for parallel execution of PostGIS functions Also support GiST-Index (R-tree) of geometry type 100 times faster results to pickup “points in area” type workloads. ▌Expected Use scenarios Area marketing analytics Real-time advertisement delivery Push event notifications, etc... ➔ GPU+PostGIS allows to run “computing intensive” workloads on your workstation or cloud instance as like you are usually doing. ▌Resources GitHub: https://github.com/heterodb/pg-strom Document: http://heterodb.github.io/pg-strom/ja/ Contact: Tw: @kkaigai / ✉ [email protected]