
Performance Evaluation of a Next-Generation SX-Aurora TSUBASA Vector Supercomputer


My talk at ISC 2023. The proceedings paper is available at: https://link.springer.com/chapter/10.1007/978-3-031-32041-5_19

Keichi Takahashi

May 23, 2023
Transcript

  1. ISC’23
    Performance Evaluation of a Next-Generation
    SX-Aurora TSUBASA Vector Supercomputer
    Keichi Takahashi1 ([email protected]), Soya Fujimoto2,
    Satoru Nagase2, Yoko Isobe2, Yoichi Shimomura1, Ryusuke Egawa3,
    and Hiroyuki Takizawa1
    1Tohoku University, 2NEC Corporation, 3Tokyo Denki University


  2. ISC’23
    Agenda
    • Vector Engine 3.0 (VE30) is the first major update to NEC’s Vector Engine series of vector
    processors [1,2].
    • In this paper, we conduct the first performance evaluation of VE30.
    • Specifically, we
    • Describe the overall architecture of VE30 and improvements from its predecessor.
    • Evaluate the basic performance using industry-standard benchmarks.
    • Analyze the impact of architectural enhancements.
    • Evaluate the real-world performance using workloads including SPEChpc 2021.
    • Present several performance tuning techniques for VE30.
    Performance Evaluation of a Next-Generation SX-Aurora TSUBASA Vector Supercomputer 2
    [1] K. Komatsu et al., “Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA,” SC’18.
    [2] R. Egawa et al., “Exploiting the Potentials of the Second Generation SX-Aurora TSUBASA,” PMBS 2020.


  3. ISC’23
    SX-Aurora TSUBASA Vector Supercomputer
    • SX-Aurora TSUBASA (SX-AT)
    • Latest generation of NEC’s SX-series supercomputers based on the
    Vector Engine (VE) processor.
    • VE is implemented as a PCIe card and attached to the host.
    • Applications run on the VE, and syscalls are "reverse offloaded" to the host.
    • Vector Engine (VE)
    • A vector processor tightly integrated with High Bandwidth Memory
    (HBM), primarily targeting memory-bound applications.
    • Can be programmed with standard parallel programming models
    (MPI+OpenMP).
    • Powerful autovectorizing compilers for C/C++ and Fortran are
    provided.
    [Figure: SX-AT node — Vector Engines attached to a Vector Host (x86) through a PCIe switch,
    with an InfiniBand HCA. MPI runs between VEs; syscalls are offloaded to the host.
    Image: https://www.nec.com/en/global/solutions/hpc/sx/vector_engine.html]

  4. ISC’23
    Architecture of VE30
    [Figure: VE30 block diagram. Each of the 16 cores contains an SPU, a VPU, and a private
    2 MB L3 cache, and connects to the 64 MB last-level cache through a 2D-mesh network on
    chip at 410 GB/s per core. The LLC delivers 6.4 TB/s and connects to the 96 GB main
    memory (six HBM2E stacks) at 2.45 TB/s.]

  5. ISC’23
    Improvements from VE20
    • Per-core private L3 cache
    • L3 Cache can be bypassed by software.
    • Compute-capable LLC
    • Each LLC bank contains a compute unit to
    perform accumulation in the LLC.
    • Better FP32 performance
    • VE30 only requires 4-byte alignment for
    FP32 data, while VE20 required 8-byte
    alignment.
                                  VE Type 20A   VE Type 30A   Improvement
    # of Cores                    10            16            1.6x
    FP64 Perf./Socket [TFLOP/s]   3.07          4.91          1.6x
    Memory B/W [TB/s]             1.53          2.45          1.6x
    Memory Capacity [GB]          48            96            2.0x
    LLC B/W [TB/s]                3.0           6.4           2.1x
    LLC Capacity [MB]             16            64            4.0x

  6. ISC’23
    Evaluation targets
                                  NEC VE     NEC VE     Fujitsu   Intel Xeon      NVIDIA A100
                                  Type 20B   Type 30A   A64FX     Platinum 8368   80GB PCIe
    FP64 Perf./Core [GFLOP/s]     307        307        70        83.2            181 w/ TC, 90 w/o TC
    # of Cores                    8          16         48        36              108
    FP64 Perf./Socket [TFLOP/s]   2.4        4.9        3.3       3.1             19.5 w/ TC, 9.7 w/o TC
    LLC B/W [TB/s]                3.0        6.4        3.6       3.21            4.91
    LLC Capacity [MB]             16         64         32        54              40
    Memory B/W [TB/s]             1.53       2.45       1.024     0.204           1.935
    Memory Capacity [GB]          48         96         32        —               80
    Process Rule [nm]             16         7          7         10              7

  7. Basic performance

  8. ISC’23
    Benchmarks for basic performance measurements
    • HPL: Compute-intensive benchmark that solves a dense linear system using LU
    decomposition with pivoting.
    • BabelStream: Benchmark that measures the effective memory bandwidth.
    • HPCG: Memory-intensive benchmark that solves a sparse linear system using the
    conjugate gradient method with a geometric multigrid preconditioner.
    • Himeno benchmark: Memory-intensive benchmark that solves the Poisson equation
    using the Jacobi method.
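The measurement at the heart of STREAM-style benchmarks such as BabelStream can be illustrated with a triad kernel. The sketch below is a simplified stand-in, not BabelStream itself: it shows the kernel and how bytes moved per sweep are counted, so that effective bandwidth is bytes divided by elapsed time.

```c
#include <assert.h>
#include <stddef.h>

/* Triad kernel used by STREAM/BabelStream-style bandwidth measurements:
 * reads b and c, writes a, so each sweep moves 3*n*sizeof(double) bytes.
 * Effective bandwidth = triad_bytes(n) / elapsed seconds. */
void triad(double *a, const double *b, const double *c, double s, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}

/* Bytes moved per sweep: two streaming reads plus one streaming write. */
size_t triad_bytes(size_t n) { return 3 * n * sizeof(double); }
```

Timing many repetitions of this loop and dividing the byte count by the elapsed time gives the "effective memory bandwidth" numbers reported on the following slides.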


  9. ISC’23
    Basic performance (HPL and BabelStream)
    [Figure: HPL performance (TFLOP/s) and efficiency. VE20: 2.13 (86%), VE30: 4.43 (90%),
    A64FX: 2.78 (82%), IceLake: 1.83 (57%), A100 40GB: 11.8 (60%), A100 80GB: 12.5 (64%).
    VE30 shows excellent compute performance; the A100s' efficiency is low due to throttling.]
    [Figure: BabelStream bandwidth (GB/s) and efficiency. VE20: 1230 (80%), VE30: 1793 (72%),
    A64FX: 826 (81%), IceLake x2: 163 (80%), A100 40GB: 1410 (91%), A100 80GB: 1657 (86%).
    VE30's bandwidth is the highest among all processors.]

  10. ISC’23
    Basic performance (HPCG and Himeno)
    [Figure: HPCG performance (GFLOP/s) and efficiency. VE20: 388 (16%), VE30: 837 (17%),
    A64FX: 342 (10%), IceLake: 75 (2.3%), A100 40GB: 553 (2.8%), A100 80GB: 634 (3.2%).
    VE30 is the highest among all processors.]
    [Figure: Himeno benchmark performance (GFLOP/s) and efficiency. VE20: 139 (5.6%),
    VE30: 258 (5.2%), A64FX: 106 (3.1%), IceLake: 29 (0.94%), A100 40GB: 222 (2.2%),
    A100 80GB: 259 (2.6%). VE30 is almost identical to the A100 80GB.]

  11. ISC’23
    Multi-node performance (HPL, HPCG and Himeno)
    [Figure: multi-node performance (TFLOP/s, log scale) and efficiency of HPL, HPCG and
    Himeno from 1 to 128 VEs.]
    Performance on 128 VEs is:
    • HPL: 537 TFLOP/s, 85.5% efficiency
    • HPCG: 30.6 TFLOP/s, 4.9% efficiency
    • Himeno: 91.9 TFLOP/s, 15.2% efficiency
    [Figure: node configuration — an AMD EPYC 7713P host with eight VE30 cards behind four
    PCIe switches and two InfiniBand HDR 200G HCAs.]

  12. ISC’23
    Per-core private L3 cache
    VE30 adds a software-controllable, per-core private L3 cache (2 MB, unified, write-through).
    • Reduces NoC congestion, LLC contention, and access latency.
    • Can be bypassed by software.
    [Figure: TFLOP/s of the Tohoku Univ. kernel collection (Earthquake, Turbulent Flow,
    Antenna, Land Mine, Turbine, Plasma) with and without the L3 cache. One kernel gains
    3.13x because its data fits in the L3 cache, which reduces gather latency.]

  13. ISC’23
    Compute-capable LLC
    Indexed vector accumulation is used in the finite element method, particle method, etc.:

    for (int i = 0; i < n; i++) {
        y[l[i]] = y[l[i]] + x[i];
    }

    On VE20, the user was responsible for choosing from:
    • Scalar: Computes using scalar instructions only (compiler's default).
    • ivdep: Computes using vector instructions only. The user must ensure that the
    indices l[i] do not overlap (requires a compiler directive).
    • list_vector: Computes using vector instructions, then corrects the results for
    overlapped indices using scalar instructions (requires a compiler directive).
    On VE30:
    • vlfa: Dedicated instruction for indexed vector accumulation (compiler's default).
    Each LLC bank has a compute unit: the core sends l[i] and x[i], and y[l[i]] is
    accumulated directly in the LLC.
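Why overlapping indices break naive vectorization can be seen in a small sketch. This is a toy C model of a gather-add-scatter sequence (not VE code; VLEN is a stand-in for the 256-element vector length):

```c
#include <assert.h>

#define VLEN 4  /* toy vector length; a VE vector register holds 256 elements */

/* Scalar reference: always correct, even with duplicate indices. */
void accum_scalar(double *y, const double *x, const int *l, int n) {
    for (int i = 0; i < n; i++)
        y[l[i]] += x[i];
}

/* Naive "vectorized" version: gather y[l[i]], add, scatter back.
 * If two lanes of the same vector share an index, one update is lost.
 * This is why the compiler defaults to scalar code unless the user
 * asserts no overlap (ivdep) or requests a fix-up pass (list_vector). */
void accum_gather_scatter(double *y, const double *x, const int *l, int n) {
    for (int i = 0; i < n; i += VLEN) {
        double tmp[VLEN];
        for (int v = 0; v < VLEN; v++) tmp[v] = y[l[i + v]];  /* gather  */
        for (int v = 0; v < VLEN; v++) tmp[v] += x[i + v];    /* add     */
        for (int v = 0; v < VLEN; v++) y[l[i + v]] = tmp[v];  /* scatter */
    }
}
```

With l = {0, 0, 1, 2} and x = {1, 2, 3, 4}, the scalar version yields y[0] = 3, while the gather-add-scatter version yields y[0] = 2: both lanes gathered the same old value, so one increment is dropped. VE30's vlfa avoids the hazard by accumulating in the LLC bank itself.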


  14. ISC’23
    Hardware support for indexed vector accumulation
    Single-core performance of a microbenchmark that performs indexed vector accumulation
    with a varying degree of address overlap (x out of 32 addresses overlap).
    [Figure: GFLOP/s vs. number of overlapping indices (1–32) for VE20 scalar/list_vector
    and VE30 scalar/list_vector/vlfa. vlfa is up to 3.48x faster than list_vector, and
    falls behind scalar only at high degrees of overlap.]
    Since vlfa is always faster than list_vector, and a high degree of overlap rarely
    occurs in real-world applications, the user can always use vlfa.

  15. Real-world workloads

  16. ISC’23
    Tohoku University kernel collection
    Six kernels extracted from production applications developed by the users of the
    Cyberscience Center, Tohoku Univ.

    Kernel          Domain          Bottleneck     VE30 Speedup
    Earthquake      Seismology      Mem. B/W       1.56x
    Turbulent Flow  Fluid dynamics  LLC B/W        2.33x
    Antenna         Electronics     Mem. B/W       1.77x
    Land Mine       Electronics     Mem. B/W       1.92x
    Turbine         Fluid dynamics  Mem. latency   2.40x
    Plasma          Geophysics      Mem. latency   2.41x

    For reference, VE30's hardware improvements are 1.60x in peak memory B/W and 2.13x in
    peak LLC B/W. The bandwidth-bound kernels track these improvements, while the
    latency-bound kernels (Turbine, Plasma) gain even more thanks to the shorter access
    latency of the L3 cache.
    [Figure: TFLOP/s of each kernel on VE20 and VE30.]

  17. ISC’23
    SPEChpc 2021
    • Benchmark suite developed by the Standard Performance Evaluation Corporation (SPEC)
    • “a set of application benchmark suites using a comprehensive measure of real-world performance for
    the state-of-the-art HPC systems”1
    • Programming models:
    • Used MPI+OpenMP on VE20/30, A64FX and IceLake-SP, and MPI+OpenACC on A100.
    • All benchmarks are executed as-is (no source modification) on all platforms.
    • Workload sizes:
    • Tiny (9 benchmarks, requires ~60 GB of memory): Executed using the minimum possible number of
    sockets on each platform. Speedup is normalized by the number of used sockets.
    • Medium (6 benchmarks, requires ~4 TB of memory): Executed using 128 sockets on all platforms.
    1 https://www.spec.org/hpc2021/


  18. ISC’23
    SPEChpc 2021 tiny workload results
    • VE30 is the fastest in LBM, TeaLeaf and POT3D.
    • VE30 slightly underperforms A100 in CloverLeaf and miniWeather.
    • VE30 shows poor performance in SPH-EXA and HPGMG-FV.
    [Figure: SPEChpc 2021 tiny speedups over the baseline system for LBM, TeaLeaf,
    CloverLeaf, POT3D, SPH-EXA, HPGMG-FV and miniWeather, on VE20 x2, VE30 x1,
    A100 80GB x1, A100 40GB x2, A64FX x3 and IceLake-SP x1. SPH-EXA and HPGMG-FV are
    discussed on the next slide.]

  19. ISC’23
    SPEChpc 2021 tiny workload performance analysis
    • LBM, TeaLeaf, CloverLeaf, POT3D:
    • Memory-bound; these achieve good performance on the VE.
    • In CloverLeaf, kernels that perform gathers are slower than on the A100.
    • SPH-EXA:
    • The octree traversal for nearest-neighbor search cannot be vectorized.
    • Could benefit from (reverse) offloading the nearest-neighbor search to the host CPU.
    • HPGMG-FV:
    • The inner-most loop is too short (32 iterations) for the VE, where a vector register holds 256 elements.
    • Could benefit from collapsing loops to increase the average vector length.
    • miniWeather:
    • Memory-bound kernels are faster than on the A100, but compute-bound kernels are slower.
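The loop-collapsing remedy suggested for HPGMG-FV can be sketched in plain C. The array shapes here are hypothetical (NX = 32 mirrors the short inner loop; the real benchmark's loops differ); the point is that fusing two loops turns a 32-iteration vectorizable loop into a 2048-iteration one, filling the 256-element vector registers:

```c
#include <assert.h>
#include <stddef.h>

#define NY 64
#define NX 32  /* short inner loop: each vector op fills at most 32 of 256 lanes */

/* Nested form: the compiler vectorizes the inner loop of only NX iterations. */
void scale_nested(double out[NY][NX], double in[NY][NX], double s) {
    for (int j = 0; j < NY; j++)
        for (int i = 0; i < NX; i++)   /* vector length = 32 */
            out[j][i] = s * in[j][i];
}

/* Collapsed form: one loop of NY*NX = 2048 contiguous iterations lets the
 * compiler generate full-length (256-element) vector operations. Vendor
 * compilers can often do this via a loop-collapse directive when the
 * accesses are contiguous. */
void scale_collapsed(double out[NY][NX], double in[NY][NX], double s) {
    double *o = &out[0][0];
    double *p = &in[0][0];
    for (size_t n = 0; n < (size_t)NY * NX; n++)  /* vector length = 256 */
        o[n] = s * p[n];
}
```

Both forms compute the same result; only the average vector length, and hence the vector unit utilization, changes.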

  20. ISC’23
    SPEChpc 2021 medium workload results
    • The trend looks similar to tiny, since both the problem size and the number of nodes are increased.
    • VE30's MPI communication performance is worse than the A100's in CloverLeaf, POT3D and
    miniWeather – this requires further investigation (see the paper for MPI profiling results).
    [Figure: speedups over the baseline system for LBM, TeaLeaf, CloverLeaf, POT3D,
    HPGMG-FV and miniWeather, each platform using 128 sockets (VE20, VE30, A100 40GB,
    A64FX, IceLake-SP).]

  21. ISC’23
    Selective L3 caching
    • On VE30, users can selectively cache reused data in the L3 cache.
    • Here we use the Himeno benchmark to assess the impact of selective L3 caching:
    • Arrays a, b, c, bnd, wrk1 and wrk2 are accessed in a streaming manner.
    • Only array p is reused (ideally, 18 out of the 19 loads in the inner-most loop hit in the cache).

    Jacobi iteration kernel in the Himeno benchmark:

    for (i = 1; i < imax-1; i++)
      for (j = 1; j < jmax-1; j++)
        for (k = 1; k < kmax-1; k++) {
          s0 = a[0][i][j][k] * p[i+1][j  ][k  ]
             + a[1][i][j][k] * p[i  ][j+1][k  ]
             + a[2][i][j][k] * p[i  ][j  ][k+1]
             + b[0][i][j][k] * (p[i+1][j+1][k  ] - p[i+1][j-1][k  ]
                              - p[i-1][j+1][k  ] + p[i-1][j-1][k  ])
             + b[1][i][j][k] * (p[i  ][j+1][k+1] - p[i  ][j-1][k+1]
                              - p[i  ][j+1][k-1] + p[i  ][j-1][k-1])
             + b[2][i][j][k] * (p[i+1][j  ][k+1] - p[i-1][j  ][k+1]
                              - p[i+1][j  ][k-1] + p[i-1][j  ][k-1])
             + c[0][i][j][k] * p[i-1][j  ][k  ]
             + c[1][i][j][k] * p[i  ][j-1][k  ]
             + c[2][i][j][k] * p[i  ][j  ][k-1]
             + wrk1[i][j][k];
          ss = (s0 * a[3][i][j][k] - p[i][j][k]) * bnd[i][j][k];
          wgosa += ss * ss;
          wrk2[i][j][k] = p[i][j][k] + omega * ss;
        }
    /* Copy wrk2 back and sum wgosa across all ranks. */
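A rough back-of-the-envelope check hints at when caching p pays off. This is my own illustrative estimate, not a figure from the paper: it simply compares the full FP32 size of p against the aggregate L3 capacity of a VE30 (16 cores x 2 MB); the true per-core reuse window is smaller than the whole array, so this is only a crude bound.

```c
#include <assert.h>
#include <stdint.h>

/* Size of the reused array p for an nx x ny x nz Himeno grid (FP32). */
uint64_t p_bytes(uint64_t nx, uint64_t ny, uint64_t nz) {
    return nx * ny * nz * sizeof(float);
}

/* Compare against the aggregate per-core L3 capacity: 16 cores x 2 MB. */
int fits_in_aggregate_l3(uint64_t bytes) {
    return bytes <= 16ull * 2 * 1024 * 1024;
}
```

By this crude measure, p occupies 2 MiB for size S and 16 MiB for size M (within the 32 MiB aggregate L3), but 1 GiB for XL, consistent with the "p does not fit in L3C" annotation for the larger problem sizes on the next slide.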


  22. ISC’23
    Impact of selective caching in Himeno benchmark
    Size  Dimensions
    S     64x64x128
    M     128x128x256
    L     256x256x512
    XL    512x512x1024

    [Figure: performance (GFLOP/s) over problem sizes S–XL, power (W) at size L, and power
    efficiency (GFLOP/s per Watt) at size L, each comparing "cache all", "bypass all" and
    "cache p only". Selective caching ("cache p only") improves performance by +6.9% and
    +5.7% at sizes where p fits in the L3 cache; at the largest size, p no longer fits.
    At size L, selective caching raises performance by +8.2% at a -0.6% change in power,
    improving power efficiency by +6.5% (for reference, VE20: 2.14 GFLOP/s/W,
    A100: 2.21 GFLOP/s/W).]

  23. ISC’23
    Partitioning mode
    • Partitioning mode splits a single VE into two NUMA nodes
    • Each NUMA node has half the cores, LLC and HBM (capacity and B/W are both halved)
    • Alleviates congestion in the NoC and increases total effective LLC B/W
    • Cache-intensive applications will benefit from partitioning mode
    [Figure: partitioning mode splits the 16 cores, the LLC banks and the six HBM2E stacks
    into NUMA node #0 and NUMA node #1.]
    [Figure: Himeno benchmark performance (GFLOP/s) on VE20 and VE30, with and without
    partitioning mode. VE30 gains +7.1% with partitioning mode.]

  24. ISC’23
    Summary
    • VE30 attains massive speedup in memory-intensive standard benchmarks.
    • Speedup exceeds improvement in peak compute and memory performance,
    indicating the benefits of the newly introduced L3 cache and improved LLC.
    • VE30 outperforms other processors in many real-world workloads including SPEChpc
    without any source code modification.
    • New architectural features could be used to further improve performance.
    VE30 achieves high sustained performance equal to or greater than latest GPUs
    and CPUs, while allowing programmers to use conventional programming models.


  25. Backup slides

  26. ISC’23
    SPEChpc medium size MPI profile
    [Figure: runtime breakdown of MPI functions (MPI_Isend, MPI_Irecv, MPI_Allreduce,
    MPI_Waitall, MPI_Barrier, MPI_Reduce, MPI_Init(_thread), others) on VE30 vs. A100 for
    LBM, TeaLeaf, CloverLeaf, POT3D and miniWeather, with HPGMG-FV shown separately.]

  27. ISC’23
    Relaxed alignment restriction for packed vectors
    • VE20 required 8-byte alignment for FP32 vectors, resulting in poor performance with
    some access patterns (e.g., stencil-like).
    • VE30 relaxes the restriction to 4-byte alignment.
    [Figure: GFLOP/s of the 27-point stencil microbenchmark below on VE20 w/o packed,
    VE30 w/o packed, and VE30 w/ packed vectors.]

    27-point stencil microbenchmark:

    do k = 1, nz
      do j = 1, ny
        do i = 1, nx
          a(i,j,k) = a(i,j,k) + &
            (b(i-1,j-1,k-1) + b(i  ,j-1,k-1) + b(i+1,j-1,k-1) + &
             b(i-1,j  ,k-1) + b(i  ,j  ,k-1) + b(i+1,j  ,k-1) + &
             b(i-1,j+1,k-1) + b(i  ,j+1,k-1) + b(i+1,j+1,k-1) + &
             b(i-1,j-1,k  ) + b(i  ,j-1,k  ) + b(i+1,j-1,k  ) + &
             b(i-1,j  ,k  ) + b(i  ,j  ,k  ) + b(i+1,j  ,k  ) + &
             b(i-1,j+1,k  ) + b(i  ,j+1,k  ) + b(i+1,j+1,k  ) + &
             b(i-1,j-1,k+1) + b(i  ,j-1,k+1) + b(i+1,j-1,k+1) + &
             b(i-1,j  ,k+1) + b(i  ,j  ,k+1) + b(i+1,j  ,k+1) + &
             b(i-1,j+1,k+1) + b(i  ,j+1,k+1) + b(i+1,j+1,k+1))/27.0
        end do
      end do
    end do
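The alignment issue can be seen with plain pointer arithmetic. The sketch below is illustrative C, not vector code: in a float array whose base is 8-byte aligned, only even-indexed elements sit on an 8-byte boundary, so a stencil neighbor access like b(i-1,j,k) shifts packed FP32 pairs off the 8-byte grid that VE20 required.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if a float element sits on an 8-byte boundary. For an array
 * with an 8-byte-aligned base, element [i] qualifies only for even i, so
 * neighbor accesses like b[i-1] alternate on and off the 8-byte grid --
 * the pattern VE20's packed (paired-FP32) instructions could not handle,
 * and which VE30's relaxed 4-byte alignment requirement permits. */
int is_8byte_aligned(const float *p) {
    return ((uintptr_t)p % 8) == 0;
}
```

For example, if &b[0] is 8-byte aligned, &b[1] lies 4 bytes past the boundary, so a packed pair starting at b[i-1] straddles the 8-byte grid.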


  28. ISC’23
    Real-world kernel with indexed vector accumulation
    • A kernel extracted from a real-world CFD application (4 out of 256 indices overlap, two
    pairs of identical indices)
    • Using vlfa reduces the runtime from 175.6s to 12.0s (14.6x speedup)
    DO N = nstart, nend
      IF (flag3(N) == 1) THEN
        COF(7,WI(N),WJ(N),WK(N)) = COF(7,WI(N),WJ(N),WK(N)) + W_TAUWC(N) * W_AREA_1(N)
        SOC(WI(N),WJ(N),WK(N))   = SOC(WI(N),WJ(N),WK(N))   + W_TAUWS(N) * W_AREA_1(N)
      ENDIF
    ENDDO


  29. ISC’23
    HPC systems used for evaluation
    System                                  Node configuration                                Interconnect
    VE20 ([email protected] Univ.)            AMD EPYC 7402P x1,                                InfiniBand HDR 200G x2
                                            Vector Engine Type 20B x8
    VE30 (Prototype [email protected] Corp.)  AMD EPYC 7713P x1,                                InfiniBand HDR 200G x2
                                            Vector Engine Type 30A x8
    A64FX (Flow Type [email protected] Univ.)  Fujitsu A64FX x1                                  Tofu-D
    IceLake-SP ([email protected] Univ.)      Intel Xeon Platinum 8368 x2                       InfiniBand HDR 200G x1
    A100 40GB ([email protected] Univ.)       Intel Xeon Platinum 8368 x2,                      InfiniBand HDR 100G x4
                                            NVIDIA A100 40 GB x8

  30. ISC’23
    Comparison to NVIDIA H100 family
                                  NEC VE     NVIDIA H100              NVIDIA H100              NVIDIA H100
                                  Type 30A   PCIe                     SXM5                     NVL
    FP64 Perf./Core [GFLOP/s]     307        449.2 w/ TC, 224 w/o TC  506.8 w/ TC, 253.4 w/o   1013.7 w/ TC, 506.8 w/o
    # of Cores (SMs)              16         114                      132                      264
    FP64 Perf./Socket [TFLOP/s]   4.9        51.2 w/ TC, 25.6 w/o TC  66.9 w/ TC, 33.5 w/o     133.8 w/ TC, 67 w/o
    LLC Capacity [MB]             64         50                       50                       100
    Memory B/W [TB/s]             2.45       2                        3.35                     7.8
    Memory Capacity [GB]          96         80                       80                       188
    Process Rule [nm]             7          4                        4                        4
    TDP                           300 W      350 W                    700 W                    700-800 W