Performance Evaluation of a Next-Generation SX-Aurora TSUBASA Vector Supercomputer

Slide 1

Slide 1 text

ISC’23
Performance Evaluation of a Next-Generation SX-Aurora TSUBASA Vector Supercomputer
Keichi Takahashi (1), Soya Fujimoto (2), Satoru Nagase (2), Yoko Isobe (2), Yoichi Shimomura (1), Ryusuke Egawa (3), and Hiroyuki Takizawa (1)
(1) Tohoku University, (2) NEC Corporation, (3) Tokyo Denki University

Slide 2

Slide 2 text

Agenda
• Vector Engine 3.0 (VE30) is the first major update to NEC’s Vector Engine series of vector processors [1,2].
• In this paper, we conduct the first performance evaluation of VE30.
• Specifically, we:
  • Describe the overall architecture of VE30 and its improvements over its predecessor.
  • Evaluate the basic performance using industry-standard benchmarks.
  • Analyze the impact of the architectural enhancements.
  • Evaluate the real-world performance using workloads including SPEChpc 2021.
  • Present several performance tuning techniques for VE30.
[1] K. Komatsu et al., “Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA,” SC’18.
[2] R. Egawa et al., “Exploiting the Potentials of the Second Generation SX-Aurora TSUBASA,” PMBS 2020.

Slide 3

Slide 3 text

SX-Aurora TSUBASA Vector Supercomputer
• SX-Aurora TSUBASA (SX-AT)
  • Latest generation of NEC’s SX-series supercomputers, based on the Vector Engine (VE) processor.
  • The VE is implemented as a PCIe card and attached to the host.
  • Applications run on the VE and “reverse”-offload syscalls to the host.
• Vector Engine (VE)
  • A vector processor tightly integrated with High Bandwidth Memory (HBM), primarily targeting memory-bound applications.
  • Can be programmed with standard parallel programming models (MPI+OpenMP).
  • Powerful autovectorizing compilers for C/C++ and Fortran are provided.
[Figure: system diagram. Vector Engines and an InfiniBand HCA are attached to the vector host (x86) through a PCIe switch; the MPI and syscall paths are marked. Vector Engine image: https://www.nec.com/en/global/solutions/hpc/sx/vector_engine.html]

Slide 4

Slide 4 text

Architecture of VE30
[Figure: block diagram of VE30. 16 cores, each with an SPU, a VPU and a 2 MB L3 cache, are connected through a network on chip (2D mesh) to the last-level cache (64 MB) and to main memory (96 GB, six HBM2E stacks). Bandwidth labels in the figure: 6.4 TB/s, 2.45 TB/s, and 410 GB/s (shown twice).]

Slide 5

Slide 5 text

Improvements from VE20
• Per-core private L3 cache
  • L3 cache can be bypassed by software.
• Compute-capable LLC
  • Each LLC bank contains a compute unit to perform accumulation in the LLC.
• Better FP32 performance
  • VE30 only requires 4-byte alignment for FP32 data, while VE20 required 8-byte alignment.

                              VE Type 20A   VE Type 30A   VE30 Improvement
# of Cores                    10            16            1.6x
FP64 Perf./Socket [TFLOP/s]   3.07          4.91          1.6x
Memory B/W [TB/s]             1.53          2.45          1.6x
Memory Capacity [GB]          48            96            2.0x
LLC B/W [TB/s]                3.0           6.4           2.1x
LLC Capacity [MB]             16            64            4.0x

Slide 6

Slide 6 text

Evaluation targets

                              NEC VE     NEC VE     Fujitsu   Intel Xeon      NVIDIA A100
                              Type 20B   Type 30A   A64FX     Platinum 8368   80GB PCIe
FP64 Perf./Core [GFLOP/s]     307        307        70        83.2            181 w/ TC, 90 w/o TC
# of Cores                    8          16         48        36              108
FP64 Perf./Socket [TFLOP/s]   2.4        4.9        3.3       3.1             19.5 w/ TC, 9.7 w/o TC
LLC B/W [TB/s]                3.0        6.4        3.6       3.21            4.91
LLC Capacity [MB]             16         64         32        54              40
Memory B/W [TB/s]             1.53       2.45       1.024     0.204           1.935
Memory Capacity [GB]          48         96         32        -               80
Process Rule [nm]             16         7          7         10              7

Slide 7

Slide 7 text

Basic performance

Slide 8

Slide 8 text

Benchmarks for basic performance measurements
• HPL: Compute-intensive benchmark that solves a dense linear system using LU decomposition with pivoting.
• BabelStream: Benchmark that measures the effective memory bandwidth.
• HPCG: Memory-intensive benchmark that solves a sparse linear system using the conjugate gradient method and a geometric multigrid preconditioner.
• Himeno benchmark: Memory-intensive benchmark that solves the Poisson equation using the Jacobi method.

Slide 9

Slide 9 text

Basic performance (HPL and BabelStream)
[Figure: two bar charts showing performance and efficiency per processor; the values below are the chart labels.]

HPL
              VE20   VE30   A64FX   IceLake   A100 40GB   A100 80GB
TFLOP/s       2.13   4.43   2.78    1.83      11.8        12.5
Efficiency    86%    90%    82%     57%       60%         64%
Annotations: "Excellent compute performance"; "Low efficiency due to throttling".

BabelStream
              VE20   VE30   A64FX   IceLake ×2   A100 40GB   A100 80GB
GB/s          1230   1793   826     163          1410        1657
Efficiency    80%    72%    81%     80%          91%         86%
Annotation: "Highest among all processors" (VE30).
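The efficiency figures on this slide appear to be the ratio of measured to peak performance. That reading is an assumption inferred from the numbers, not a definition given on the slide. As a worked check, using the peak values from the specification tables (4.91 TFLOP/s FP64 for VE30, 1.53 TB/s memory bandwidth for VE20):

```latex
\[
\text{efficiency} = \frac{\text{measured performance}}{\text{peak performance}},
\qquad
\text{HPL on VE30: } \frac{4.43}{4.91} \approx 0.90,
\qquad
\text{BabelStream on VE20: } \frac{1230}{1530} \approx 0.80
\]
```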

Slide 10

Slide 10 text

Basic performance (HPCG and Himeno)
[Figure: two bar charts showing performance and efficiency per processor; the values below are the chart labels.]

HPCG
              VE20   VE30   A64FX   IceLake   A100 40GB   A100 80GB
GFLOP/s       139    258    106     29        222         259
Efficiency    5.6%   5.2%   3.1%    0.94%     2.2%        2.6%
Annotation: "Almost identical to A100 80 GB" (VE30).

Himeno benchmark
              VE20   VE30   A64FX   IceLake   A100 40GB   A100 80GB
GFLOP/s       388    837    342     75        553         634
Efficiency    16%    17%    10%     2.3%      2.8%        3.2%
Annotation: "Highest among all processors" (VE30).

Slide 11

Slide 11 text

Multi-node performance (HPL, HPCG and Himeno)
[Figure: performance (TFLOP/s, logarithmic axis) and efficiency (%) of HPL, HPCG and Himeno versus the number of VEs.]
Performance on 128 VEs:
• HPL: 537 TFLOP/s, 85.5% efficiency
• HPCG: 30.6 TFLOP/s, 4.9% efficiency
• Himeno: 919 TFLOP/s, 15.2% efficiency
[Figure: node configuration. One AMD EPYC 7713P host, eight VE30 cards, four PCIe switches and two InfiniBand HDR 200G links.]

Slide 12

Slide 12 text

Per-core private L3 cache
VE30 adds a software-controllable per-core private L3 cache (2 MB, unified, write-through).
• Reduces NoC congestion
• Reduces LLC contention
• Reduces access latency
• Can be bypassed by software
[Figure: performance (TFLOP/s) of the Tohoku Univ. kernel collection (Earthquake, Turbulent Flow, Antenna, Land Mine, Turbine, Plasma) without and with the L3 cache. Annotation: "3.13x: Data fits in L3C, reducing gather latency".]

Slide 13

Slide 13 text

Compute-capable LLC
Indexed vector accumulation is used in the finite element method, particle methods, etc.:

    for (int i = 0; i < n; i++) {
        y[l[i]] = y[l[i]] + x[i];
    }

On VE20, the user was responsible for choosing from:
• scalar: Computes using scalar instructions only (compiler’s default).
• ivdep: Computes using vector instructions only. The user must ensure that the values of l[i] do not overlap (requires a compiler directive).
• list_vector: Computes using vector instructions, then corrects the results for overlapping indices using scalar instructions (requires a compiler directive).
On VE30:
• vlfa: Dedicated instruction for indexed vector accumulation (compiler’s default).
[Figure: data flow between core, LLC and memory for l[i], x[i] and y[l[i]]. Each LLC bank has a compute unit.]
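To illustrate why overlapping indices matter, here is a minimal C sketch that contrasts the scalar loop with an emulation of a plain gather/add/scatter sequence. This is not NEC's implementation: the function names and the chunk length `VL` are hypothetical and chosen only for illustration. When two lanes of a chunk share an index, the later store overwrites the earlier one and a contribution is lost, which is the case that ivdep forbids and that list_vector and vlfa are designed to handle.

```c
#include <stddef.h>

/* Reference semantics: elements are processed one at a time, so
 * repeated indices in l[] accumulate correctly. */
void accumulate_scalar(double *y, const int *l, const double *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[l[i]] += x[i];
    }
}

#define VL 4 /* hypothetical vector length, kept small for illustration */

/* Emulates a plain gather/add/scatter over chunks of VL elements.
 * All loads of a chunk happen before any store, so when two lanes
 * share an index the later store wins and one contribution is lost. */
void accumulate_gather_scatter(double *y, const int *l, const double *x, size_t n)
{
    for (size_t base = 0; base < n; base += VL) {
        size_t len = (n - base < VL) ? n - base : VL;
        double tmp[VL];
        for (size_t k = 0; k < len; k++) {   /* gather and add */
            tmp[k] = y[l[base + k]] + x[base + k];
        }
        for (size_t k = 0; k < len; k++) {   /* scatter */
            y[l[base + k]] = tmp[k];
        }
    }
}
```

With `l = {0, 1, 1, 2}` and `x = {1, 2, 3, 4}`, the scalar version accumulates both contributions into `y[1]`, while the emulated gather/scatter keeps only the last one.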

Slide 14

Slide 14 text

Hardware support for indexed vector accumulation
[Figure: single-core performance (GFLOP/s) versus the number of overlapping indices (1 to 32) for VE20 scalar, VE20 list_vector, VE30 scalar, VE30 list_vector and VE30 vlfa. Annotations: "vlfa falls behind scalar"; "vlfa is 3.48x faster than list_vector".]
Single-core performance of a microbenchmark that performs indexed vector accumulation with a varying degree of address overlap (x out of 32 addresses overlap).
Since vlfa is always faster than list_vector, and a high degree of overlap rarely occurs in real-world applications, the user can always use vlfa.

Slide 15

Slide 15 text

Real-world workloads

Slide 16

Slide 16 text

Tohoku University kernel collection
Six kernels extracted from production applications developed by the users of Cyberscience Center, Tohoku Univ.
[Figure: performance (TFLOP/s) of each kernel on VE20 and VE30.]

Kernel           Domain           Bottleneck     VE30 Speedup
Earthquake       Seismology       Mem. B/W       1.56x
Turbulent Flow   Fluid dynamics   LLC B/W        2.33x
Antenna          Electronics      Mem. B/W       1.77x
Land Mine        Electronics      Mem. B/W       1.92x
Turbine          Fluid dynamics   Mem. latency   2.40x
Plasma           Geophysics       Mem. latency   2.41x

Annotations: "Memory and LLC B/W improvement"; "L3C and LLC B/W improvement"; "Shorter latency thanks to L3C".
VE30 HW improvement: peak Mem. B/W 1.60x, peak LLC B/W 2.13x.

Slide 17

Slide 17 text

SPEChpc 2021
• Benchmark suite developed by the Standard Performance Evaluation Corporation (SPEC).
  • “a set of application benchmark suites using a comprehensive measure of real-world performance for the state-of-the-art HPC systems” (https://www.spec.org/hpc2021/)
• Programming models:
  • Used MPI+OpenMP on VE20/30, A64FX and IceLake-SP, and MPI+OpenACC on A100.
  • All benchmarks are executed as-is (no source modification) on all platforms.
• Workload sizes:
  • Tiny (9 benchmarks, requires ~60 GB of memory): Executed using the minimum possible number of sockets on each platform. Speedup is normalized by the number of sockets used.
  • Medium (6 benchmarks, requires ~4 TB of memory): Executed using 128 sockets on all platforms.

Slide 18

Slide 18 text

SPEChpc 2021 tiny workload results
• VE30 is the fastest in LBM, TeaLeaf and POT3D.
• VE30 slightly underperforms A100 in CloverLeaf and miniWeather.
• VE30 shows poor performance in SPH-EXA and HPGMG-FV.
[Figure: speedup over the baseline system for LBM, TeaLeaf, CloverLeaf, POT3D, SPH-EXA, HPGMG-FV and miniWeather on VE20 x2, VE30 x1, A100 80GB x1, A100 40GB x2, A64FX x3 and IceLake-SP x1. Annotation: "Discussed in the next slide".]

Slide 19

Slide 19 text

SPEChpc 2021 tiny workload performance analysis
• LBM, TeaLeaf, CloverLeaf, POT3D
  • Memory-bound and achieve good performance on VE.
  • In CloverLeaf, kernels that perform gather operations are slower than on A100.
• SPH-EXA
  • Octree traversal for the nearest-neighbor search cannot be vectorized.
  • Could benefit from (reverse) offloading the nearest-neighbor search to the host CPU.
• HPGMG-FV
  • The inner-most loop is too short (32 iterations) for VE, where a vector register holds 256 elements.
  • Could benefit from collapsing loops to increase the average vector length.
• miniWeather
  • Memory-bound kernels are faster than on A100, but compute-bound kernels are slower.
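The loop-collapsing idea mentioned for HPGMG-FV can be sketched in C as follows. This is a generic illustration, not code from HPGMG-FV: the function names, the scaling operation and the contiguous row layout are assumptions. Collapsing in this simple form is only valid when consecutive rows are contiguous in memory and no halo or padding separates them.

```c
#include <stddef.h>

#define NI 32 /* short inner-most extent, matching the 32 iterations cited above */

/* Nested form: the vectorizable inner loop has only NI = 32 iterations,
 * so a 256-element vector register is mostly empty. */
void scale_nested(double *out, const double *in, size_t nj, double alpha)
{
    for (size_t j = 0; j < nj; j++) {
        for (size_t i = 0; i < NI; i++) {
            out[j * NI + i] = alpha * in[j * NI + i];
        }
    }
}

/* Collapsed form: one loop of nj * NI iterations over the same data,
 * giving the compiler a much longer trip count to vectorize. */
void scale_collapsed(double *out, const double *in, size_t nj, double alpha)
{
    size_t n = nj * NI;
    for (size_t idx = 0; idx < n; idx++) {
        out[idx] = alpha * in[idx];
    }
}
```

Both functions produce identical results; only the trip count seen by the vectorizer differs. In OpenMP codes a similar effect is often requested with the `collapse` clause rather than by hand.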

Slide 20

Slide 20 text

SPEChpc 2021 medium workload results
• The trend looks similar to the tiny workload, since both the problem size and the number of nodes are increased.
• VE30’s MPI communication performance is worse than A100’s in CloverLeaf, POT3D and miniWeather; this requires further investigation (see paper for MPI profiling results).
[Figure: speedup over the baseline system for LBM, TeaLeaf, CloverLeaf, POT3D, HPGMG-FV and miniWeather on VE20 x128, VE30 x128, A100 40GB x128, A64FX x128 and IceLake-SP x128.]

Slide 21

Slide 21 text

Selective L3 caching
• On VE30, users can selectively cache reused data in the L3 cache.
• Here we use the Himeno benchmark to assess the impact of selective L3 caching.
  • Arrays a, b, c, bnd, wrk1 and wrk2 are accessed in a streaming manner.
  • Only array p is reused (ideally, 18 out of 19 loads in the inner-most loop hit in cache).

    for (i=1 ; i
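The access pattern described above can be sketched in C as follows. This is a sketch modeled on the publicly available Himeno benchmark's Jacobi kernel, not the code shown on the slide: the function name `jacobi_sweep`, the `IDX` macro and the flat array layout (coefficient planes stored back to back) are assumptions made to keep the example self-contained. Each inner iteration touches 19 distinct elements of p, most of them shared with neighboring iterations, while a, b, c, bnd, wrk1 and wrk2 are each visited once.

```c
#include <stddef.h>

/* Flat index into an ni x nj x nk array with k varying fastest. */
#define IDX(i, j, k) (((size_t)(i) * nj + (j)) * nk + (k))

/* One Jacobi sweep over the interior points; returns the squared residual.
 * a holds 4 coefficient planes, b and c hold 3 each (hypothetical layout). */
float jacobi_sweep(int ni, int nj, int nk, float omega,
                   const float *a, const float *b, const float *c,
                   const float *bnd, const float *wrk1,
                   const float *p, float *wrk2)
{
    size_t n = (size_t)ni * nj * nk; /* elements per plane */
    float gosa = 0.0f;
    for (int i = 1; i < ni - 1; i++) {
        for (int j = 1; j < nj - 1; j++) {
            for (int k = 1; k < nk - 1; k++) {
                size_t o = IDX(i, j, k);
                /* 19 loads from p; every other array is read once. */
                float s0 =
                    a[0 * n + o] * p[IDX(i + 1, j, k)]
                  + a[1 * n + o] * p[IDX(i, j + 1, k)]
                  + a[2 * n + o] * p[IDX(i, j, k + 1)]
                  + b[0 * n + o] * (p[IDX(i + 1, j + 1, k)] - p[IDX(i + 1, j - 1, k)]
                                  - p[IDX(i - 1, j + 1, k)] + p[IDX(i - 1, j - 1, k)])
                  + b[1 * n + o] * (p[IDX(i, j + 1, k + 1)] - p[IDX(i, j - 1, k + 1)]
                                  - p[IDX(i, j + 1, k - 1)] + p[IDX(i, j - 1, k - 1)])
                  + b[2 * n + o] * (p[IDX(i + 1, j, k + 1)] - p[IDX(i - 1, j, k + 1)]
                                  - p[IDX(i + 1, j, k - 1)] + p[IDX(i - 1, j, k - 1)])
                  + c[0 * n + o] * p[IDX(i - 1, j, k)]
                  + c[1 * n + o] * p[IDX(i, j - 1, k)]
                  + c[2 * n + o] * p[IDX(i, j, k - 1)]
                  + wrk1[o];
                float ss = (s0 * a[3 * n + o] - p[o]) * bnd[o];
                gosa += ss * ss;
                wrk2[o] = p[o] + omega * ss; /* streaming store */
            }
        }
    }
    return gosa;
}
```

Under this reading, caching only p keeps the reused stencil neighbors resident while the streamed arrays bypass the L3 cache and do not evict them.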

Slide 22

Slide 22 text

Impact of selective caching in Himeno benchmark
[Figure: three panels comparing "Cache all", "Bypass all" and "Cache p only": Performance (GFLOP/s versus problem size), Power (W, L size) and Power Efficiency (GFLOP/s per Watt, L size).]

Size   Dimensions
S      64x64x128
M      128x128x256
L      256x256x512
XL     512x512x1024

Annotations: "+6.9% w/ selective caching"; "p does not fit in L3C"; "+5.7% w/ selective caching"; "+8.2%"; "-0.6%"; "+6.5% w/ selective caching".
Power efficiency reference values: VE20 2.14 GFLOP/s/W, A100 2.21 GFLOP/s/W.

Slide 23

Slide 23 text

Partitioning mode
• Partitioning mode splits a single VE into two NUMA nodes.
• Each NUMA node has half the cores, LLC and HBM (capacity and B/W are both halved).
• Alleviates congestion in the NoC and increases the total effective LLC B/W.
• Cache-intensive applications will benefit from partitioning mode.
[Figure: VE30 block diagram with cores, LLC and HBM2E stacks divided into NUMA node #0 and NUMA node #1.]
[Figure: Himeno benchmark performance (GFLOP/s) on VE20 and VE30 without and with partitioning mode. Annotation: "+7.1% w/ partitioning mode".]

Slide 24

Slide 24 text

Summary
• VE30 attains massive speedup in memory-intensive standard benchmarks.
  • The speedup exceeds the improvement in peak compute and memory performance, indicating the benefits of the newly introduced L3 cache and improved LLC.
• VE30 outperforms other processors in many real-world workloads, including SPEChpc, without any source code modification.
• New architectural features could be used to further improve performance.
VE30 achieves high sustained performance equal to or greater than the latest GPUs and CPUs, while allowing programmers to use conventional programming models.

Slide 25

Slide 25 text

Backup slides

Slide 26

Slide 26 text

SPEChpc medium size MPI profile
[Figure: stacked bar charts of runtime (s) on VE30 and A100 for miniWeather, POT3D, CloverLeaf, TeaLeaf and LBM, and separately for HPGMG-FV, broken down into MPI_Isend, MPI_Irecv, MPI_Allreduce, MPI_Waitall, MPI_Barrier, MPI_Reduce, MPI_Init(_thread) and Others.]

Slide 27

Slide 27 text

Relaxed alignment restriction for packed vectors
• VE20 required 8-byte alignment for FP32 vectors, resulting in poor performance with some access patterns (e.g., stencil-like).
• VE30 relaxes the restriction to 4-byte alignment.
27-point stencil microbenchmark:

    do k = 1, nz
      do j = 1, ny
        do i = 1, nx
          a(i,j,k) = a(i,j,k) + &
            (b(i-1,j-1,k-1) + b(i  ,j-1,k-1) + b(i+1,j-1,k-1) + &
             b(i-1,j  ,k-1) + b(i  ,j  ,k-1) + b(i+1,j  ,k-1) + &
             b(i-1,j+1,k-1) + b(i  ,j+1,k-1) + b(i+1,j+1,k-1) + &
             b(i-1,j-1,k  ) + b(i  ,j-1,k  ) + b(i+1,j-1,k  ) + &
             b(i-1,j  ,k  ) + b(i  ,j  ,k  ) + b(i+1,j  ,k  ) + &
             b(i-1,j+1,k  ) + b(i  ,j+1,k  ) + b(i+1,j+1,k  ) + &
             b(i-1,j-1,k+1) + b(i  ,j-1,k+1) + b(i+1,j-1,k+1) + &
             b(i-1,j  ,k+1) + b(i  ,j  ,k+1) + b(i+1,j  ,k+1) + &
             b(i-1,j+1,k+1) + b(i  ,j+1,k+1) + b(i+1,j+1,k+1))/27.0
        end do
      end do
    end do

[Figure: performance (GFLOP/s) of the microbenchmark for VE20 w/o packed, VE30 w/o packed and VE30 w/ packed.]

Slide 28

Slide 28 text

Real-world kernel with indexed vector accumulation
• A kernel extracted from a real-world CFD application (4 out of 256 indices overlap, two pairs of identical indices).
• Using vlfa reduces the runtime from 175.6 s to 12.0 s (14.6x speedup).

    DO N = nstart,nend
      IF(flag3(N)==1) THEN
        COF(7,WI(N),WJ(N),WK(N))=COF(7,WI(N),WJ(N),WK(N))+W_TAUWC(N) * W_AREA_1(N)
        SOC(WI(N),WJ(N),WK(N))=SOC(WI(N),WJ(N),WK(N))+W_TAUWS(N) * W_AREA_1(N)
      ENDIF
    ENDDO

Slide 29

Slide 29 text

HPC systems used for evaluation

             System                       Node configuration                                    Interconnect
VE20         AOBA-C@Tohoku Univ.          AMD EPYC 7402P x1, Vector Engine Type 20B x8          InfiniBand HDR 200G x2
VE30         Prototype system@NEC Corp.   AMD EPYC 7713P x1, Vector Engine Type 30A x8          InfiniBand HDR 200G x2
A64FX        Flow Type I@Nagoya Univ.     Fujitsu A64FX x1                                      Tofu-D
IceLake-SP   SQUID@Osaka Univ.            Intel Xeon Platinum 8368 x2                           InfiniBand HDR 200G x1
A100 40GB    SQUID@Osaka Univ.            Intel Xeon Platinum 8368 x2, NVIDIA A100 40 GB x8     InfiniBand HDR 100G x4

Slide 30

Slide 30 text

Comparison to NVIDIA H100 family

                              NEC VE Type 30A   NVIDIA H100 PCIe          NVIDIA H100 SXM5            NVIDIA H100 NVL
FP64 Perf./Core [GFLOP/s]     307               449.2 w/ TC, 224 w/o TC   506.8 w/ TC, 253.4 w/o TC   1013.7 w/ TC, 506.8 w/o TC
# of Cores (SMs)              16                114                       132                         264
FP64 Perf./Socket [TFLOP/s]   4.9               51.2 w/ TC, 25.6 w/o TC   66.9 w/ TC, 33.5 w/o TC     133.8 w/ TC, 67 w/o TC
LLC Capacity [MB]             64                50                        50                          100
Memory B/W [TB/s]             2.45              2                         3.35                        7.8
Memory Capacity [GB]          96                80                        80                          188
Process Rule [nm]             7                 4                         4                           4
TDP                           300 W             350 W                     700 W                       700-800 W