Data processing more than billion rows per second ～unleash full capability of GPU and NVME～

Data processing more than billion rows per second ～unleash full
capability of GPU and NVME～ HeteroDB,Inc Chief Architect & CEO KaiGai Kohei <[email protected]>

Self-Introduction  KaiGai Kohei (海外浩平)  Chief Architect & CEO
of HeteroDB  Contributor of PostgreSQL (2006-)  Primary Developer of PG-Strom (2012-)  Interested in: Big-data, GPU, NVME/PMEM, ... about myself about our company  Established: 4th-Jul-2017  Location: Shinagawa, Tokyo, Japan  Businesses: ✓ Sales & development of high-performance data-processing software on top of heterogeneous architecture. ✓ Technology consulting service on GPU&DB area. PG-Strom PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 2

What is PG-Strom? PostgreSQL Webinar Series 2020 - Data processing
more than billion rows per second 3  Transparent GPU acceleration for analytics and reporting workloads  Binary code generation by JIT from SQL statements  PCIe-bus level optimization by SSD-to-GPU Direct SQL technology  Columnar-storage for efficient I/O and vector processors PG-Strom is an extension of PostgreSQL for terabytes scale data-processing and in-database analytics, by utilization of GPU and NVME-SSD. App GPU off-loading for IoT/Big-Data for ML/Analytics ➢ SSD-to-GPU Direct SQL ➢ Columnar Store (Arrow_Fdw) ➢ Asymmetric Partition-wise JOIN/GROUP BY ➢ BRIN-Index Support ➢ NVME-over-Fabric Support ➢ PostGIS + GiST Support (WIP) ➢ GPU Memory Store(WIP) ➢ Procedural Language for GPU native code. ➢ IPC on GPU device memory over cuPy data frame.

PG-Strom as an open source project PostgreSQL Webinar Series 2020
- Data processing more than billion rows per second 4 Official documentation in English / Japanese http://heterodb.github.io/pg-strom/ Many stars ☺ Distributed under GPL-v2.0 Has been developed since 2012 ▌Supported PostgreSQL versions  PostgreSQL v12, v11, and v10. WIP for v13 support ▌Related contributions to PostgreSQL  Writable FDW support (9.3)  Custom-Scan interface (9.5)  FDW and Custom-Scan JOIN pushdown (9.5) ▌Talks at PostgreSQL community  PGconf.EU 2012 (Prague), 2015 (Vienna), 2018 (Lisbon)  PGconf.ASIA 2018 (Tokyo), 2019 (Bali, Indonesia)  PGconf.SV 2016 (San Francisco)  PGconf.China 2015 (Beijing)

Target Workloads? PostgreSQL Webinar Series 2020 - Data processing more
than billion rows per second 5

Target: IoT/M2M Log-data processing PostgreSQL Webinar Series 2020 - Data
processing more than billion rows per second 6 Manufacturing Logistics Mobile Home electronics  Any kind of devices generate various form of log-data.  People want/try to find out insight from the data.  Data importing and processing must be rapid.  System administration should be simple and easy. JBoF: Just Bunch of Flash NVME over Fabric (RDMA) DB Admin BI Tools Machine-learning applications e.g, anomaly detection shared data-frame PG-Strom

How much data size we expect? (1/3) ▌This is a
factory of an international top-brand company. ▌Most of users don’t have bigger data than them. Case: A semiconductor factory managed up to 100TB with PostgreSQL Total 20TB Per Year Kept for 5 years Defect Investigation PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 7

How much data size we expect? (2/3) Nowadays, 2U server
is capable for 100TB capacity model Supermicro 2029U-TN24R4T Qty CPU Intel Xeon Gold 6226 (12C, 2.7GHz) 2 RAM 32GB RDIMM (DDR4-2933, ECC) 12 GPU NVIDIA Tesla P40 (3840C, 24GB) 2 HDD Seagate 1.0TB SATA (7.2krpm) 1 NVME Intel DC P4510 (8.0TB, U.2) 24 N/W built-in 10GBase-T 4 8.0TB x 24 = 192TB How much is it? 60,932USD by thinkmate.com (31st-Aug-2019) PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 8

How much data size we expect? (3/3) On the other
hands, it is not small enough. TB TB TB B PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 9

Weapons to tackle terabytes scale data • SSD-to-GPU Direct SQL
Efficient Storage • Arrow_Fdw Efficient Data Structure • Table Partitioning Efficient Data Deployment PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 10

PG-Strom: SSD-to-GPU Direct SQL PostgreSQL Webinar Series 2020 - Data
processing more than billion rows per second 11 Efficient Storage

GPU’s characteristics - mostly as a computing accelerator PostgreSQL Webinar
Series 2020 - Data processing more than billion rows per second 12 Over 10years history in HPC, then massive popularization in Machine-Learning NVIDIA Tesla V100 Super Computer (TITEC; TSUBAME3.0) Computer Graphics Machine-Learning How I/O workloads are accelerated by GPU that is a computing accelerator? Simulation

A usual composition of x86_64 server GPU SSD CPU RAM
HDD N/W PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 13

Data flow to process a massive amount of data CPU
RAM SSD GPU PCIe PostgreSQL Data Blocks Normal Data Flow All the records, including junks, must be loaded onto RAM once, because software cannot check necessity of the rows prior to the data loading. So, amount of the I/O traffic over PCIe bus tends to be large. PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 14 Unless records are not loaded to CPU/RAM once, over the PCIe bus, software cannot check its necessity even if they are “junk” records.

SSD-to-GPU Direct SQL (1/4) – Overview CPU RAM SSD GPU
PCIe PostgreSQL Data Blocks NVIDIA GPUDirect RDMA It allows to load the data blocks on NVME-SSD to GPU using peer-to-peer DMA over PCIe-bus; bypassing CPU/RAM. WHERE-clause JOIN GROUP BY Run SQL by GPU to reduce the data size Data Size: Small PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 15

Background - GPUDirect RDMA by NVIDIA Physical address space PCIe
BAR1 Area GPU device memory RAM NVMe-SSD Infiniband HBA PCIe devices GPUDirect RDMA It enables to map GPU device memory on physical address space of the host system Once “physical address of GPU device memory” appears, we can use is as source or destination address of DMA with PCIe devices. PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 16 0xf0000000 0xe0000000 DMA Request SRC: 1200th sector LEN: 40 sectors DST: 0xe0200000

SSD-to-GPU Direct SQL (2/4) – System configuration and workloads PostgreSQL
Webinar Series 2020 - Data processing more than billion rows per second 17 Supermicro SYS-1019GP-TT CPU Xeon Gold 6126T (2.6GHz, 12C) x1 RAM 192GB (32GB DDR4-2666 x 6) GPU NVIDIA Tesla V100 (5120C, 16GB) x1 SSD Intel SSD DC P4600 (HHHL; 2.0TB) x3 (striping configuration by md-raid0) HDD 2.0TB(SATA; 72krpm) x6 Network 10Gb Ethernet 2ports OS Red Hat Enterprise Linux 7.6 CUDA 10.1 + NVIDIA Driver 418.40.04 DB PostgreSQL v11.2 PG-Strom v2.2devel ▪ Query Example (Q2_3) SELECT sum(lo_revenue), d_year, p_brand1 FROM lineorder, date1, part, supplier WHERE lo_orderdate = d_datekey AND lo_partkey = p_partkey AND lo_suppkey = s_suppkey AND p_category = 'MFGR#12‘ AND s_region = 'AMERICA‘ GROUP BY d_year, p_brand1 ORDER BY d_year, p_brand1; customer 12M rows (1.6GB) date1 2.5K rows (400KB) part 1.8M rows (206MB) supplier 4.0M rows (528MB) lineorder 2.4B rows (351GB) Summarizing queries for typical Star-Schema structure on simple 1U server

SSD-to-GPU Direct SQL (3/4) – Benchmark results PostgreSQL Webinar Series
2020 - Data processing more than billion rows per second 18  Query Execution Throughput = (353GB; DB-size) / (Query response time [sec])  SSD-to-GPU Direct SQL runs the workloads close to the hardware limitation (8.5GB/s)  about x3 times faster than filesystem and CPU based mechanism 2343.7 2320.6 2315.8 2233.1 2223.9 2216.8 2167.8 2281.9 2277.6 2275.9 2172.0 2198.8 2233.3 7880.4 7924.7 7912.5 7725.8 7712.6 7827.2 7398.1 7258.0 7638.5 7750.4 7402.8 7419.6 7639.8 0 1,000 2,000 3,000 4,000 5,000 6,000 7,000 8,000 9,000 Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 Q3_1 Q3_2 Q3_3 Q3_4 Q4_1 Q4_2 Q4_3 Query Execution Throughput [MB/s] Star Schema Benchmark on 1xNVIDIA Tesla V100 + 3xIntel DC P4600 PostgreSQL 11.2 PG-Strom 2.2devel

SSD-to-GPU Direct SQL (4/4) – Software Stack PostgreSQL Webinar Series
2020 - Data processing more than billion rows per second 19 Filesystem (ext4) nvme_strom kernel module NVMe SSD PostgreSQL pg_strom extension read(2) ioctl(2) Operating System Software Layer Database Software Layer blk-mq nvme pcie nvme rdma Network HBA NVMe Request Network HBA NVMe SSD NVME-over-Fabric Target RDMA over Converged Ethernet GPUDirect RDMA Translation from logical location to physical location on the drive ▪ Other software ▪ Software by HeteroDB ▪ Hardware Special NVME READ command that loads the source blocks on NVME- SSD to GPU’s device memory. External Storage Server Local Storage Layer

PG-Strom: Arrow_Fdw (columnar store) PostgreSQL Webinar Series 2020 - Data
processing more than billion rows per second 20 Efficient Data Strucute

Background: Apache Arrow (1/2) ▌Characteristics  Column-oriented data format designed
for analytics workloads  A common data format for inter-application exchange  Various primitive data types like integer, floating-point, date/time and so on PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 21 PostgreSQL / PG-Strom NVIDIA GPU

Background: Apache Arrow (2/2) Apache Arrow Data Types PostgreSQL Data
Types extra description Int int2, int4, int8 FloatingPoint float2, float4, float8 float2 is an enhancement by PG-Strom Binary bytea Utf8 text Bool bool Decimal numeric Decimal = fixed-point 128bit value Date date adjusted to unitsz = Day Time time adjusted to unitsz = MicroSecond Timestamp timestamp adjusted to unitsz = MicroSecond Interval interval List array types Only 1-dimensional array is supportable Struct composite types Union ------ FixedSizeBinary char(n) FixedSizeList ------ Map ------ Most of data types are convertible between Apache Arrow and PostgreSQL. PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 22

Log-data characteristics (1/2) – WHERE is it generated on? ETL
OLTP OLAP Traditional OLTP&OLAP – Data is generated inside of database system Data Creation IoT/M2M use case – Data is generated outside of database system Log processing BI Tools BI Tools Gateway Server Data Creation Data Creation Many Devices PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 23 Data Importing becomes a heavy time-consuming operations for big-data processing. Data Import Import!

Arrow_Fdw - maps Apache Arrow files as a foreign table
PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 24 Apache Arrow Files Arrow_Fdw Foreign Table PostgreSQL Tables Not Importing Write out  Arrow_Fdw enables to map Apache Arrow files as a PostgreSQL foreign table.  Not Importing, so Apache Arrow files are immediately visible.

PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 25 Apache Arrow Files Arrow_Fdw Foreign Table PostgreSQL Tables Write out  Arrow_Fdw enables to map Apache Arrow files as a PostgreSQL foreign table.  Not Importing, so Apache Arrow files are immediately visible.  pg2arrow generates Apache Arrow file from query results of PostgreSQL. pg2arrow

PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 26 Apache Arrow Files Arrow_Fdw Foreign Table PostgreSQL Tables Write out  Arrow_Fdw enables to map Apache Arrow files as a PostgreSQL foreign table.  Not Importing, so Apache Arrow files are immediately visible.  pg2arrow generates Apache Arrow file from query results of PostgreSQL.  Arrow_Fdw can be writable, but only batched-INSERT is supported. pg2arrow

SSD-to-GPU Direct SQL on Arrow_Fdw (1/3) PostgreSQL Webinar Series 2020
- Data processing more than billion rows per second 27 ▌Why Apache Arrow is beneficial?  Less amount of I/O to be loaded; only referenced columns  Higher utilization of GPU core; by vector processing and wide memory bus  Read-only structure; No MVCC checks are required on run-time It transfers ONLY Referenced Columns over SSD-to-GPU Direct SQL mechanism PCIe Bus NVMe SSD GPU SSD-to-GPU P2P DMA WHERE-clause JOIN GROUP BY P2P data transfer only referenced columns GPU code supports Apache Arrow format as data source. Runs SQL workloads in parallel by thousands cores. Write back the small results built as heap-tuple of PostgreSQL Results metadata

SSD-to-GPU Direct SQL on Arrow_Fdw (2/3) - Benchmark Results 
Query Execution Throughput = (6.0B rows) / (Query Response Time [sec])  Combined usage of SSD-to-GPU Direct SQL and Columnar-store pulled out the extreme performance; 130-500M rows per second  Server configuration is identical to what we show on p.17. (1U, 1CPU, 1GPU, 3SSD) PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 28 16.3 16.0 16.0 15.7 15.5 15.6 15.6 15.9 15.7 15.7 15.5 15.6 15.4 55.0 55.1 55.1 54.4 54.4 54.6 52.9 54.0 54.9 54.8 53.2 53.1 54.0 481.0 494.2 499.6 246.6 261.7 282.0 156.2 179.0 183.6 185.0 138.3 139.1 152.6 0 100 200 300 400 500 Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 Q3_1 Q3_2 Q3_3 Q3_4 Q4_1 Q4_2 Q4_3 Number of rows processed per second [million rows/sec] Star Schema Benchmark (s=1000, DBsize: 879GB) on Tesla V100 x1 + DC P4600 x3 PostgreSQL 11.5 PG-Strom 2.2 (Row data) PG-Strom 2.2 + Arrow_Fdw

Table Partitioning Efficient Data Deployment PostgreSQL Webinar Series 2020 -
Data processing more than billion rows per second 30

Log-data characteristics (2/2) – INSERT-only ✓ MVCC visibility check is
(relatively) not significant. ✓ Rows with old timestamp will never inserted. INSERT UPDATE DELETE INSERT UPDATE DELETE Transactional Data Log Data with timestamp current 2019 2018 PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 31

Configuration of PostgreSQL partition for log-data (1/2) ▌Mixture of PostgreSQL
table and Arrow foreign table in partition declaration ▌Log-data should have timestamp, and never updated ➔ Old data can be moved to Arrow foreign table for more efficient I/O logdata_201912 logdata_202001 logdata_202002 logdata_current logdata table （PARTITION BY timestamp） 2020-03-21 12:34:56 dev_id: 2345 signal: 12.4 Log-data with timestamp PostgreSQL tables Row data store Read-writable but slot Arrow foreign table Column data store Read-only but fast Unrelated child tables are skipped, if query contains WHERE-clause on the timestamp E.g) WHERE timestamp > ‘2020-01-23’ AND device_id = 456 PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 32

Configuration of PostgreSQL partition for log-data (2/2) logdata_201912 logdata_202001 logdata_202002
logdata_current logdata table （PARTITION BY timestamp） 2019-03-21 12:34:56 dev_id: 2345 signal: 12.4 logdata_201912 logdata_202001 logdata_202002 logdata_202003 logdata_current logdata table （PARTITION BY timestamp） a month later Extract data of 2020-03 PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 33 ▌Mixture of PostgreSQL table and Arrow foreign table in partition declaration ▌Log-data should have timestamp, and never updated ➔ Old data can be moved to Arrow foreign table for more efficient I/O Log-data with timestamp

Asymmetric Partition-wise JOIN (1/4) PostgreSQL Webinar Series 2020 - Data
processing more than billion rows per second 34 lineorder lineorder_p0 lineorder_p1 lineorder_p2 reminder=0 reminder=1 reminder=2 customer date supplier parts tablespace: nvme0 tablespace: nvme1 tablespace: nvme2 Records from partition-leafs must be backed to CPU and processed once! Scan Scan Scan Append Join Agg Query Results Scan Massive records must be processed on CPU first

Asymmetric Partition-wise JOIN (2/4) postgres=# explain select * from ptable
p, t1 where p.a = t1.aid; QUERY PLAN ---------------------------------------------------------------------------------- Hash Join (cost=2.12..24658.62 rows=49950 width=49) Hash Cond: (p.a = t1.aid) -> Append (cost=0.00..20407.00 rows=1000000 width=12) -> Seq Scan on ptable_p0 p (cost=0.00..5134.63 rows=333263 width=12) -> Seq Scan on ptable_p1 p_1 (cost=0.00..5137.97 rows=333497 width=12) -> Seq Scan on ptable_p2 p_2 (cost=0.00..5134.40 rows=333240 width=12) -> Hash (cost=1.50..1.50 rows=50 width=37) -> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37) (8 rows) PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 35

Asymmetric Partition-wise JOIN (3/4) PostgreSQL Webinar Series 2020 - Data
processing more than billion rows per second 36 lineorder lineorder_p0 lineorder_p1 lineorder_p2 reminder=0 reminder=1 reminder=2 customer date supplier parts Push down JOIN/GROUP BY, even if smaller half is not a partitioned table. Join Append Agg Query Results Scan Scan PreAgg Join Scan PreAgg Join Scan PreAgg tablespace: nvme0 tablespace: nvme1 tablespace: nvme2 Very small number of partial results of JOIN/GROUP BY shall be gathered on CPU.

Asymmetric Partition-wise JOIN (4/4) postgres=# set enable_partitionwise_join = on; SET
postgres=# explain select * from ptable p, t1 where p.a = t1.aid; QUERY PLAN ---------------------------------------------------------------------------------- Append (cost=2.12..19912.62 rows=49950 width=49) -> Hash Join (cost=2.12..6552.96 rows=16647 width=49) Hash Cond: (p.a = t1.aid) -> Seq Scan on ptable_p0 p (cost=0.00..5134.63 rows=333263 width=12) -> Hash (cost=1.50..1.50 rows=50 width=37) -> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37) -> Hash Join (cost=2.12..6557.29 rows=16658 width=49) Hash Cond: (p_1.a = t1.aid) -> Seq Scan on ptable_p1 p_1 (cost=0.00..5137.97 rows=333497 width=12) -> Hash (cost=1.50..1.50 rows=50 width=37) -> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37) -> Hash Join (cost=2.12..6552.62 rows=16645 width=49) Hash Cond: (p_2.a = t1.aid) -> Seq Scan on ptable_p2 p_2 (cost=0.00..5134.40 rows=333240 width=12) -> Hash (cost=1.50..1.50 rows=50 width=37) -> Seq Scan on t1 (cost=0.00..1.50 rows=50 width=37) (16 rows) PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 37

PCIe-bus level optimization (1/3) PostgreSQL Webinar Series 2020 - Data
processing more than billion rows per second 38 Supermicro SYS-4029TRT2 x96 lane PCIe switch x96 lane PCIe switch CPU2 CPU1 QPI Gen3 x16 Gen3 x16 for each slot Gen3 x16 for each slot Gen3 x16 ▌HPC Server – optimization for GPUDirect RDMA ▌I/O Expansion Box NEC ExpEther 40G (4slots edition) Network Switch 4 slots of PCIe Gen3 x8 PCIe Swich 40Gb Ethernet CPU NIC Extra I/O Boxes

processing more than billion rows per second 39 By P2P DMA over PCIe-switch, major data traffic bypass CPU CPU CPU PCIe switch SSD GPU PCIe switch SSD GPU PCIe switch SSD GPU PCIe switch SSD GPU SCAN SCAN SCAN SCAN JOIN JOIN JOIN JOIN GROUP BY GROUP BY GROUP BY GROUP BY Pre-processed Data (very small) GATHER GATHER

processing more than billion rows per second 40 $ pg_ctl restart : LOG: - PCIe[0000:80] LOG: - PCIe(0000:80:02.0) LOG: - PCIe(0000:83:00.0) LOG: - PCIe(0000:84:00.0) LOG: - PCIe(0000:85:00.0) nvme0 (INTEL SSDPEDKE020T7) LOG: - PCIe(0000:84:01.0) LOG: - PCIe(0000:86:00.0) GPU0 (Tesla P40) LOG: - PCIe(0000:84:02.0) LOG: - PCIe(0000:87:00.0) nvme1 (INTEL SSDPEDKE020T7) LOG: - PCIe(0000:80:03.0) LOG: - PCIe(0000:c0:00.0) LOG: - PCIe(0000:c1:00.0) LOG: - PCIe(0000:c2:00.0) nvme2 (INTEL SSDPEDKE020T7) LOG: - PCIe(0000:c1:01.0) LOG: - PCIe(0000:c3:00.0) GPU1 (Tesla P40) LOG: - PCIe(0000:c1:02.0) LOG: - PCIe(0000:c4:00.0) nvme3 (INTEL SSDPEDKE020T7) LOG: - PCIe(0000:80:03.2) LOG: - PCIe(0000:e0:00.0) LOG: - PCIe(0000:e1:00.0) LOG: - PCIe(0000:e2:00.0) nvme4 (INTEL SSDPEDKE020T7) LOG: - PCIe(0000:e1:01.0) LOG: - PCIe(0000:e3:00.0) GPU2 (Tesla P40) LOG: - PCIe(0000:e1:02.0) LOG: - PCIe(0000:e4:00.0) nvme5 (INTEL SSDPEDKE020T7) LOG: GPU<->SSD Distance Matrix LOG: GPU0 GPU1 GPU2 LOG: nvme0 ( 3) 7 7 LOG: nvme5 7 7 ( 3) LOG: nvme4 7 7 ( 3) LOG: nvme2 7 ( 3) 7 LOG: nvme1 ( 3) 7 7 LOG: nvme3 7 ( 3) 7 PG-Strom chooses the closest GPU for the tables to be scanned, according to the PCI-E device distance.

Large-scale Benchmark PostgreSQL Webinar Series 2020 - Data processing more

Capable to process 100TB Log data on a single-node in
reasonable time 100-120GB/s of effective data transfer capability by 4-units’ parallel 25-30GB/s of effective data transfer capability by columnar data structure per unit Large-scale Benchmark (1/4) QPI All of SSD-to-GPU Direct SQL, Apache Arrow and PCI-E bus optimization are used CPU2 CPU1 RAM RAM PCIe-SW PCIe-SW NVME0 NVME1 NVME2 NVME3 NVME4 NVME5 NVME6 NVME7 NVME8 NVME9 NVME10 NVME11 NVME12 NVME13 NVME14 NVME15 GPU0 GPU1 GPU2 GPU3 HBA0 HBA1 HBA2 HBA3 8.0-10GB/s of physical data transfer capability per GPU+4xSSD unit 42 PCIe-SW PCIe-SW JBOF unit-0 JBOF unit-1 PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second

Large-scale Benchmark (2/4) PostgreSQL Webinar Series 2020 - Data processing
more than billion rows per second 43 Build HPC 4U Server + 4 of GPU + 16 of NVME-SSD configuration CPU2 CPU1 SerialCables dual PCI-ENC8G-08A (U.2 NVME JBOF; 8slots) x2 NVIDIA Tesla V100 x4 PCIe switch PCIe switch PCIe switch PCIe switch Gen3 x16 for each slot Gen3 x16 for each slot via SFF-8644 based PCIe x4 cables (3.2GB/s x 16 = max 51.2GB/s) Intel SSD DC P4510 (1.0TB) x16 SerialCables PCI-AD-x16HE x4 Supermicro SYS-4029GP-TRT SPECIAL THANKS

Large-scale Benchmark (3/4) PostgreSQL Webinar Series 2020 - Data processing
more than billion rows per second 44 Build HPC 4U Server + 4 of GPU + 16 of NVME-SSD configuration

Large-scale Benchmark (4/4) UPI Distributed 3.5TB test data into 4
partitions; 879GB for each CPU2 CPU1 RAM RAM PCIe-SW PCIe-SW NVME0 NVME1 NVME2 NVME3 NVME4 NVME5 NVME6 NVME7 NVME8 NVME9 NVME10 NVME11 NVME12 NVME13 NVME14 NVME15 GPU0 GPU1 GPU2 GPU3 HBA0 HBA1 HBA2 HBA3 45 PCIe-SW PCIe-SW PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second lineorder_p0 (879GB) lineorder_p1 (879GB) lineorder_p2 (879GB) lineorder_p3 (879GB) lineorder (---GB) Hash-Partition (N=4)

Benchmark Results (1/2) PostgreSQL Webinar Series 2020 - Data processing
more than billion rows per second 46 Billion rows per second at a single-node PostgreSQL!! 64.0 65.0 65.2 58.8 58.0 67.3 43.9 58.7 61.0 60.8 52.9 53.0 59.1 255.2 255.6 256.9 257.6 258.0 257.5 248.5 254.9 253.3 253.3 249.3 249.4 249.6 1,681 1,714 1,719 1,058 1,090 1,127 753 801 816 812 658 664 675 0 200 400 600 800 1,000 1,200 1,400 1,600 1,800 Q1_1 Q1_2 Q1_3 Q2_1 Q2_2 Q2_3 Q3_1 Q3_2 Q3_3 Q3_4 Q4_1 Q4_2 Q4_3 Number of rows processed per second [million rows / sec] Results of Star Schema Benchmark [SF=4,000 (3.5TB; 24B rows), 4xGPU, 16xNVME-SSD] PostgreSQL 11.5 PG-Strom 2.2 PG-Strom 2.2 + Arrow_Fdw

Benchmark Results (2/2) PostgreSQL Webinar Series 2020 - Data processing
more than billion rows per second 47 40GB/s Read from NVME-SSDs during SQL execution (95% of H/W limit) 0 5,000 10,000 15,000 20,000 25,000 30,000 35,000 40,000 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 NVME-SSD Read Throughput [MB/s] Elapsed Time [sec] /dev/md0 (GPU0) /dev/md1 (GPU1) /dev/md2 (GPU2) /dev/md3 (GPU3)

PostgreSQL Webinar Series 2020 - Data processing more than billion
rows per second 48 Future Direction

NVIDIA GPUDirect Storage (cuFile API) support PostgreSQL Webinar Series 2020
- Data processing more than billion rows per second 49 ▌NVIDIA GPUDirect Storage?  API & Driver stack for direct read from NVME to GPU ✓ Features are almost equivalend to NVME-Strom ✓ Ubuntu 18.04 & RHEL8/CentOS8 shall be released at the initial release  NVIDIA will officially release the software in 2020. ▌Why beneficial?  Linux kernel driver that is tested/evaluated by NVIDIA’s QA process.  Involvement of broader ecosystem with third-party solutions; like distributed-filesystem, block storage, software defined storage  Broader OS support; Ubuntu 18.04. PG-Strom is here Ref: GPUDIRECT STORAGE: A DIRECT GPU-STORAGE DATA PATH https://on-demand.gputechconf.com/supercomputing/2019/pdf/sc1922-gpudirect-storage-transfer-data-directly-to-gpu-memory-alleviating-io-bottlenecks.pdf WIP

PostGIS Functions & Operator support (1/2) PostgreSQL Webinar Series 2020
- Data processing more than billion rows per second 50 ▌PostGIS  An extension of handle geographic objects and relational operations on them. ✓ Geographic objects: Point, LineString, Polygon, and etc .... ✓ Geographic relational operations: contains, crosses, touch, distance, and etc...  Wise design for fast search ✓ Bounding-box, GiST-index, CPU Parallel  v1.0 is released at 2005, then its development has continued over 15 years. © GAIA RESOURCES Geometric master data Sensor Data Massive data including location information Aggregating and summarizing data by prefecture, city, town, village and so on 【IoT/M2Mデータに対しての想定利用シーン】 WIP

PostGIS Functions & Operator support (2/2) PostgreSQL Webinar Series 2020
- Data processing more than billion rows per second 51 ▌Current status  geometry st_makepoint(float8,float8,...)  float8 st_distance(geometry,geometry)  bool st_dwithin(geometry,geometry,float8)  bool st_contains(geometry,geometry)  bool st_crosses(geometry,geometry) ▌Next: GiST (R-tree) Index support  Index-based nested-loop up to million polygon x billion points.  Runs index-search by GPU’s thousands threads in parallel. ➔ Because of GiST internal structure, GPU-parallel is likely efficient. WIP GiST(R-tree) Index Polygon-definition Table with Location data Index search by thousands threads in parallel

Resources PostgreSQL Webinar Series 2020 - Data processing more than
billion rows per second 52 ▌Repository  https://github.com/heterodb/pg-strom  https://heterodb.github.io/swdc/ ▌Document  http://heterodb.github.io/pg-strom/ ▌Contact  [email protected]  Tw: @kkaigai / @heterodb

Backup Slides PostgreSQL Webinar Series 2020 - Data processing more

Apache Arrow Internal (1/3) ▌Apache Arrow Internal  Header •
Fixed byte string: “ARROW1¥0¥0”  Schema Definition • # of columns and columns definitions. • Data type/precision, column name, column number, ...  Record Batch • Internal data block that contains a certain # of rows. • E.g, if N=1M rows with (Int32, Float64), this record-batch begins from 1M of Int32 array, then 1M of Float64 array follows.  Dictionary Batch • Optional area for dictionary compression. • E.g) 1 = “Tokyo”, 2 = “Osaka”, 3 = “Kyoto” ➔ Int32 representation for frequently appearing words  Footer • Metadata - offset/length of RecordBatches and DictionaryBatches Header “ARROW1¥0¥0” Schema Definition DictionaryBatch-0 RecordBatch-0 RecordBatch-k Footer • DictionaryBatch[0] (offset, size) • RecordBatch[0] (offset, size) : • RecordBatch[k] (offset, size) Apache Arrow file PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 55

Apache Arrow Internal (2/3) - writable Arrow_Fdw Header “ARROW1¥0¥0” Schema
Definition DictionaryBatch-0 RecordBatch-0 RecordBatch-k Footer • DictionaryBatch[0] • RecordBatch[0] : • RecordBatch[k] Arrow file (before writes) Header “ARROW1¥0¥0” Schema Definition DictionaryBatch-0 RecordBatch-0 RecordBatch-k Footer (new revision) • DictionaryBatch[0] • RecordBatch[0] : • RecordBatch[k] • RecordBatch[k+1] Arrow file (after writes) RecordBatch-(k+1) Overwrite of the original Footer area Rows written by a single INSERT command PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 56

Apache Arrow Internal (3/3) - writable Arrow_Fdw Header “ARROW1¥0¥0” Schema
Definition DictionaryBatch-0 RecordBatch-0 RecordBatch-k Footer • DictionaryBatch[0] • RecordBatch[0] : • RecordBatch[k] Arrow file (before writes) Header “ARROW1¥0¥0” Schema Definition DictionaryBatch-0 RecordBatch-0 RecordBatch-k Footer (new revision) • DictionaryBatch[0] • RecordBatch[0] : • RecordBatch[k] • RecordBatch[k+1] Arrow file (after writes) RecordBatch-(k+1) Overwrite of the original Footer area Footer • DictionaryBatch[0] • RecordBatch[0] : • RecordBatch[k] Arrow_Fdw keeps the original Footer image until commit, for support of rollback. PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 57

Pg2Arrow / MySQL2Arrow $ pg2arrow --help Usage: pg2arrow [OPTION] [database]
[username] General options: -d, --dbname=DBNAME Database name to connect to -c, --command=COMMAND SQL command to run -t, --table=TABLENAME Table name to be dumped (-c and -t are exclusive, either of them must be given) -o, --output=FILENAME result file in Apache Arrow format --append=FILENAME result Apache Arrow file to be appended (--output and --append are exclusive. If neither of them are given, it creates a temporary file.) Arrow format options: -s, --segment-size=SIZE size of record batch for each Connection options: -h, --host=HOSTNAME database server host -p, --port=PORT database server port -u, --user=USERNAME database user name -w, --no-password never prompt for password -W, --password force password prompt Other options: --dump=FILENAME dump information of arrow file --progress shows progress of the job --set=NAME:VALUE config option to set before SQL execution --help shows this message ✓ mysql2arrow also has almost equivalent capability Pg2Arrow / MySQL2Arrow enables to dump SQL results as Arrow file Apache Arrow Data Files Arrow_Fdw Pg2Arrow PostgreSQL Webinar Series 2020 - Data processing more than billion rows per second 58

Data processing more than billion rows per seco...

Data processing more than billion rows per second ～unleash full capability of GPU and NVME～

More Decks by KaiGai Kohei

Other Decks in Technology

Featured

Transcript