Slide 1

Slide 1 text

4th International Workshop on Quantum Computing Software @SC23 Prototype of a Batched Quantum Circuit Simulator for the Vector Engine Keichi Takahashi1, Toshio Mori2, Hiroyuki Takizawa1 1Cyberscience Center, Tohoku University, Japan 2Center for Quantum Information and Quantum Biology, Osaka University, Japan

Slide 2

Slide 2 text

Initial motivation for developing a batched simulator • Center for Quantum Information and Quantum Biology (QIQB) at Osaka University has been developing a state vector simulator named Qulacs [1], which is widely used in the Japanese research community. • One of the primary use cases of Qulacs is variational quantum algorithm including quantum machine learning algorithms. • A feedback we often receive from our users is to have a capability to simulate a large number of small- to intermediate-scale circuits (batched circuit simulation). 2 [1] Y. Suzuki et al., “Qulacs: a fast and versatile quantum circuit simulator for research purpose ,” Quantum, vol. 5, 2021.

Slide 3

Slide 3 text

Use cases for a batched simulator • Most state-of-the-art simulators focus on simulating large-scale noiseless circuits. ◦ Noise simulation • Exact simulation based on density matrices requires complexity. • Monte Carlo approach [1] can estimate the result with fewer resources. ◦ Variational Quantum Algorithms (VQA) • Iteratively updates a parameterized circuit based on the simulation of the outcome. • Extreme-scale simulators are needed for scalability evaluation, 
 but trial-and-error algorithm development is often done at smaller scales. O(22n) 3 [1] W. Berquist et al., “Stochastic Approach for Simulating Quantum Noise Using Tensor Networks,” QCS 2022.

Slide 4

Slide 4 text

Related work • NVIDIA cuQuantum (cuStateVec) implemented batched gate application and measurement functions in version 1.4.0. • However, to the best of our knowledge, no quantum simulations utilize the batched functions. • This work is the fi rst study to implement a state vector simulator on VE and compare with cuStateVec. 4 custatevecStatus_t custatevecApplyMatrixBatched( custatevecHandle_t handle, void *batchedSv, cudaDataType_t svDataType, const uint32_t nIndexBits, const uint32_t nSVs, custatevecIndex_t svStride, custatevecMatrixMapType_t mapType, const int32_t *matrixIndices, const void *matrices, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const int32_t adjoint, const uint32_t nMatrices, const int32_t *targets, const uint32_t nTargets, const int32_t *controls, const int32_t *controlBitValues, const uint32_t nControls, custatevecComputeType_t computeType, void *extraWorkspace, size_t extraWorkspaceSizeInBytes ) Target and control bits Gate matrices

Slide 5

Slide 5 text

SX-Aurora TSUBASA Vector Engine (VE) • The NEC Vector Engine is a vector processor implemented as a PCIe card. • Combination of HBM and long-vector architecture o ff ers high performance in memory-bound HPC applications. • State vector simulation, which is also memory-bound, could also be a potential fi t for VE. 5 Core Core Core Core Core Core Core Core Core Core Core Core LLC HBM2E HBM2E HBM2E HBM2E HBM2E HBM2E Core Core Core Core LLC SPU L3 Cache (2 MB) VPU Network on Chip (2D Mesh) Last Level Cache (64 MB) Main Memory (96 GB) 2.45 TB/s 6.4 TB/s 410 GB/s 410 GB/s VE Type 20B VE Type 30A A100 PCIe H100 PCIe Peak Performance 2.4 TFLOP/s 4.9 TFLOP/s 9.7 TFLOP/s 25.6 TFLOP/s Memory Bandwidth 1.53 TB/s 2.45 TB/s 1.93 TB/s 2.00 TB/s Memory Capacity 48 GB 96 GB 80 GB 80 GB Process Rule 16nm 7nm 7nm 4 nm [1] K. Takahashi et al., “Performance Evaluation of a Next-Generation SX-Aurora TSUBASA Vector Supercomputer ,” ISC 2023.

Slide 6

Slide 6 text

Performance tuning basics for the Vector Engine • Vectorization is vital to attain good performance on VE. ◦ Standard tuning principles apply: prefer consecutive memory accesses, increase loop length, avoid scatter/gather accesses, minimize conditionals, etc. ◦ VE o ff ers 16,384-bit wide registers (256 FP64 values), requiring long loops. • NEC o ff ers a powerful auto-vectorizing compiler for C/C++ and Fortran: ◦ Use of intrinsics or inline assembly is rarely needed. ◦ Pragma directives are provided to guide the auto-vectorization. ◦ Various loop patterns including conditionals, reduction, pre fi x sum, merging and indexed summation can be vectorized with support from the ISA. 6

Slide 7

Slide 7 text

Computational characteristics of state vector simulation 7 10 100 0 2 4 6 8 10 12 14 Runtime [µs] Target Qubit Index Gather-Scatter Contiguous Strided |ψ⟩ = a0…00 |0…00⟩ + a0…01 |0…01⟩ + … + a1…11 |1…11⟩ [ a′  *…*0i *…* a′  *…*1i *…*] = [ U00 U01 U10 U11 ] [ a*…*0i *…* a*…*1i *…* ] better To compute the -th element, the -th and -th elements of the state vector are be loaded and stored. i i i ⊕ 2k Unitary matrix corresponding to a gate Strided access pattern where the stride depends on the target qubit (stride= ). Maintaining both long loops and contiguous accesses for all target qubits is not possible. 2k Basis states Probability amplitudes

Slide 8

Slide 8 text

Memory layout • Making the state vector continuous on memory requires strided memory access pattern. • Instead, we make the same basis state from of state vectors contiguous memory. This gives ~6.9x speedup. • Separating the real and imaginary components gives further performance improvement (~1.9x). 8 Re |00> Im |00> Re |01> Im |01> Re |10> Im |10> Re |11> Im |11> Re |00> Im |00> Re |01> Im |01> Re |10> Im |10> Re |11> Im |11> State Vector #0 State Vector #1 Re |00> Im |00> Re |01> Im |01> Re |10> Im |10> Re |11> Im |11> Re |00> Im |00> Re |01> Im |01> Re |10> Im |10> Re |11> Im |11> SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 Re |00> Im |00> Re |01> Im |01> Re |10> Im |10> Re |11> Im |11> Re |00> Im |00> Re |01> Im |01> Re |10> Im |10> Re |11> Im |11> SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 SoA 
 cuStateVec AoS 1 AoS 2 
 our work

Slide 9

Slide 9 text

Kernel overview • Two levels of parallelism can be exploited: basis states and state vectors. • Basic design is to parallelize over the bases and vectorize over the samples. • This ensures stride one access pattern with the AoS layout of SVs. 9 uint64_t mask = 1ULL << target; uint64_t lo_mask = mask - 1; uint64_t hi_mask = ~lo_mask; #pragma omp parallel for for (uint64_t i = 0; i < 1ULL << (N - 1); i++) { ITYPE i0 = ((i & hi_mask) << 1) | (i & lo_mask); ITYPE i1 = i0 | mask; #pragma omp simd for (uint32_t sample = 0; sample < B; sample++) { double tmp0_re = state_re[sample + i0 * B]; double tmp0_im = state_im[sample + i0 * B]; double tmp1_re = state_re[sample + i1 * B]; double tmp1_im = state_im[sample + i1 * B]; state_re[sample + i0 * B] = tmp1_re; state_im[sample + i0 * B] = tmp1_im; state_re[sample + i1 * B] = tmp0_re; state_im[sample + i1 * B] = tmp0_im; } } Iterates over SVs Iterates over basis states Example of a Pauli-X gate application

Slide 10

Slide 10 text

Noise gate 10 ℰ(ρ) = (1 − p)ρ + p 3 ∑ A∈{X,Y,Z} AρA We assume the depolarizing noise [1] model. The density matrix representation of a one-qubit depolarizing gate is as follows (one of X/Y/Z gates is applied with probability ). p [1] M. A. Nielsen and I. L. Chuang. Quantum computation and quantum information. Cambridge university press, 2010. A naive implementation would iterate over the SVs and apply a gate to a SV if it should be a ff ected by noise. VE compiler generates a code that uses vector masking, but since , most of the vector elements are masked wasted. p ≪ 1

Slide 11

Slide 11 text

Noise gate simulation scheme • We eliminate conditionals by selecting the SVs a ff ected by noise in advance. • Expected length of the vectorized loop is , which can become suboptimal for the VE architecture very small error rates. pB 11 |0…00⟩ |0…00⟩ |0…00⟩ |0…00⟩ |0…01⟩ |0…01⟩ SV #0 SV #1 SV #2 SV #3 |0…01⟩ |0…01⟩ SV #2 SV #3 SV #0 SV #1 |0…00⟩ |0…00⟩ |0…00⟩ |0…00⟩ |0…01⟩ |0…01⟩ |0…01⟩ |0…01⟩ × × × Gather Scatter

Slide 12

Slide 12 text

Evaluation setup • VE ◦ Software: batched-qsim-ve (https://github.com/keichi/batched-qsim-ve) ◦ Hardware • NEC VE Type 20B (VE20) • NEC VE Type 30A (VE30) • GPU ◦ Software: cuStateVec 1.4.1 ◦ Hardware: • NVIDIA A100 40GB PCIe • NVIDIA A100 80GB PCIe 12 VE Type 20B VE Type 30A A100 40GB A100 80GB Peak Performance 2.4 TFLOP/s 4.9 TFLOP/s 9.7 TFLOP/s 9.7 TFLOP/s Memory Bandwidth 1.53 TB/s 2.45 TB/s 1.55 TB/s 1.93 TB/s Memory Capacity 48 GB 96 GB 40 GB 80 GB LLC B/W 3.0 TB/s 6.4 TB/s 4.9 TB/s 4.9 TB/s LLC Capacity 16 MB 64 MB 40 MB 40 MB AOBA supercomputer at Tohoku University

Slide 13

Slide 13 text

1-qubit gate (RX gate) performance • Measured the time to apply an RX gate to 105 state vectors with varying batch size. ◦ Performance of VE30 is identical to that of A100 80GB. ◦ Batch size on VE needs to be at least 200 for best performance. 13 0 2 4 6 8 10 12 1×102 1×103 1×104 1×105 Runtime [ms] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B 0 2 4 6 8 10 12 14 1×102 1×103 1×104 1×105 Runtime [ms] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B 0 50 100 150 200 250 300 350 1×102 1×103 1×104 1×105 Runtime [ms] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B 0 1 2 3 4 5 6 1×102 1×103 1×104 Runtime [s] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B 8 qubits 12 qubits 16 qubits 20 qubits better insu ff i cient loop length cache e ff ect

Slide 14

Slide 14 text

2-qubit gate (CNOT gate) performance • Measured the time to apply a CNOT gate to 105 state vectors with varying batch size. ◦ Trend is similar to RX gate, but A100 is ~25% faster than VE Type 30A. ◦ Suggests room for optimization since VE30’s e ff ective memory bandwidth is ~10% higher than that of A100 80GB. 14 8 qubits 12 qubits 16 qubits 20 qubits 0 1 2 3 4 5 6 7 8 9 10 1×102 1×103 1×104 1×105 Runtime [ms] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B 0 2 4 6 8 10 12 14 16 18 1×102 1×103 1×104 1×105 Runtime [ms] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B 0 50 100 150 200 250 300 350 400 450 1×102 1×103 1×104 1×105 Runtime [ms] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B 0 1 2 3 4 5 6 7 8 1×102 1×103 1×104 Runtime [s] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B better

Slide 15

Slide 15 text

Depolarizing noise gate performance • Measured the time to apply a 1-qubit depolarizing gate to state vectors. ◦ cuStateVec lacks a function to selectively apply gate to a subset of batch, we apply an identity gate to SVs that are not a ff ected by noise. ◦ VE performs better with lower noise rate and larger batch size. 15 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 1×102 1×103 1×104 1×105 Runtime [s] Batch size A100 80GB VE Type 30A 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 1×10-3 1×10-2 1×10-1 Runtime [s] Noise rate A100 80GB VE Type 30A 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 8 9 10 11 12 13 14 Runtime [s] Qubits A100 80GB VE Type 30A Varying noise rate (14 qubits, 105 batch size) Varying batch size (14 qubits, 10-3 error rate) Varying number of qubits (105 batch size, 10-3 error rate) better longer loop length less work

Slide 16

Slide 16 text

Random circuit simulation output 16 0 0.5 1 1.5 2 2.5 3 3.5 4 -5 -4 -3 -2 -1 0 1 2 Probability Density log(2np) Noiseless Noisy (error rate=1e-3) Noisy (error rate=5e-3) Noisy (error rate=1e-2) 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 -5 -4 -3 -2 -1 0 1 2 Probability Density log(2np) This work Qulacs Theoretical [1] F. Arute et al., “Quantum supremacy using a programmable superconducting processor,” Nature, vol. 574, no. 7779, 2019. Google Sycamore-style random circuit 1 0 0 0 0 cos θ −i sin θ 0 0 −sin θ cos θ 0 0 0 0 1 1 2 [ 1 + i 1 − i 1 − i 1 + i] 1 2 [ 1 + i −1 − i 1 + i 1 + i ] 1 2 2 −1 − i 1 − i − 2 X Y W iSWAP-like Noiseless Noisy PDF of probability that a bitstring is observed

Slide 17

Slide 17 text

Random circuit performance • VE largely outperforms A100 80GB up to 1.95x if the batch size is su ffi ciently large. • Even on VE, noise gate application dominates the runtime. Further e ff ort should be put into optimizing noise gates. 17 0 20 40 60 80 100 120 1×102 1×103 1×104 1×105 Runtime [s] Batch size A100 80GB VE Type 30A 0 10 20 30 40 50 60 70 1×102 1×103 1×104 1×105 Runtime [s] Batch size A100 80GB VE Type 30A 0 20 40 60 80 100 120 1×102 1×103 1×104 1×105 Runtime [s] Batch size A100 80GB VE Type 30A 0 10 20 30 40 50 60 70 80 90 100 1×102 1×103 1×104 1×105 Runtime [s] Batch size A100 80GB VE Type 30A better p = 10−3 p = 5 ⋅ 10−3 p = 10−2 p = 5 ⋅ 10−2

Slide 18

Slide 18 text

Conclusion and future work • We prototyped a batched quantum circuit simulator for NEC Vector Engine. • Performance of VE30 is comparable or better to that of NVIDIA A100. ◦ Identical for 1-qubit gates and 25% lower performance for 2-qubit gates. ◦ VE30 outperforms cuStateVec by up to 1.95x in simulating random circuits. • Future work includes: ◦ Evaluation using real-world VQA circuits with noise. ◦ Porting to other CPUs with SIMD instructions such as AVX-512 and SVE. 18 This work was partially supported by JSPS KAKENHI Grant Number JP23K16890. Many thanks to the Qulacs developers!