Prototype of a Batched Quantum Circuit Simulator for the Vector Engine

4th International Workshop on Quantum Computing Software @SC23 Prototype of
a Batched Quantum Circuit Simulator for the Vector Engine Keichi Takahashi1, Toshio Mori2, Hiroyuki Takizawa1 1Cyberscience Center, Tohoku University, Japan 2Center for Quantum Information and Quantum Biology, Osaka University, Japan

Initial motivation for developing a batched simulator • Center for
Quantum Information and Quantum Biology (QIQB) at Osaka University has been developing a state vector simulator named Qulacs [1], which is widely used in the Japanese research community. • One of the primary use cases of Qulacs is variational quantum algorithm including quantum machine learning algorithms. • A feedback we often receive from our users is to have a capability to simulate a large number of small- to intermediate-scale circuits (batched circuit simulation). 2 [1] Y. Suzuki et al., “Qulacs: a fast and versatile quantum circuit simulator for research purpose ,” Quantum, vol. 5, 2021.

Use cases for a batched simulator • Most state-of-the-art simulators
focus on simulating large-scale noiseless circuits. ◦ Noise simulation • Exact simulation based on density matrices requires complexity. • Monte Carlo approach [1] can estimate the result with fewer resources. ◦ Variational Quantum Algorithms (VQA) • Iteratively updates a parameterized circuit based on the simulation of the outcome. • Extreme-scale simulators are needed for scalability evaluation,   but trial-and-error algorithm development is often done at smaller scales. O(22n) 3 [1] W. Berquist et al., “Stochastic Approach for Simulating Quantum Noise Using Tensor Networks,” QCS 2022.

Related work • NVIDIA cuQuantum (cuStateVec) implemented batched gate application
and measurement functions in version 1.4.0. • However, to the best of our knowledge, no quantum simulations utilize the batched functions. • This work is the fi rst study to implement a state vector simulator on VE and compare with cuStateVec. 4 custatevecStatus_t custatevecApplyMatrixBatched( custatevecHandle_t handle, void *batchedSv, cudaDataType_t svDataType, const uint32_t nIndexBits, const uint32_t nSVs, custatevecIndex_t svStride, custatevecMatrixMapType_t mapType, const int32_t *matrixIndices, const void *matrices, cudaDataType_t matrixDataType, custatevecMatrixLayout_t layout, const int32_t adjoint, const uint32_t nMatrices, const int32_t *targets, const uint32_t nTargets, const int32_t *controls, const int32_t *controlBitValues, const uint32_t nControls, custatevecComputeType_t computeType, void *extraWorkspace, size_t extraWorkspaceSizeInBytes ) Target and control bits Gate matrices

SX-Aurora TSUBASA Vector Engine (VE) • The NEC Vector Engine
is a vector processor implemented as a PCIe card. • Combination of HBM and long-vector architecture o ff ers high performance in memory-bound HPC applications. • State vector simulation, which is also memory-bound, could also be a potential fi t for VE. 5 Core Core Core Core Core Core Core Core Core Core Core Core LLC HBM2E HBM2E HBM2E HBM2E HBM2E HBM2E Core Core Core Core LLC SPU L3 Cache (2 MB) VPU Network on Chip (2D Mesh) Last Level Cache (64 MB) Main Memory (96 GB) 2.45 TB/s 6.4 TB/s 410 GB/s 410 GB/s VE Type 20B VE Type 30A A100 PCIe H100 PCIe Peak Performance 2.4 TFLOP/s 4.9 TFLOP/s 9.7 TFLOP/s 25.6 TFLOP/s Memory Bandwidth 1.53 TB/s 2.45 TB/s 1.93 TB/s 2.00 TB/s Memory Capacity 48 GB 96 GB 80 GB 80 GB Process Rule 16nm 7nm 7nm 4 nm [1] K. Takahashi et al., “Performance Evaluation of a Next-Generation SX-Aurora TSUBASA Vector Supercomputer ,” ISC 2023.

Performance tuning basics for the Vector Engine • Vectorization is
vital to attain good performance on VE. ◦ Standard tuning principles apply: prefer consecutive memory accesses, increase loop length, avoid scatter/gather accesses, minimize conditionals, etc. ◦ VE o ff ers 16,384-bit wide registers (256 FP64 values), requiring long loops. • NEC o ff ers a powerful auto-vectorizing compiler for C/C++ and Fortran: ◦ Use of intrinsics or inline assembly is rarely needed. ◦ Pragma directives are provided to guide the auto-vectorization. ◦ Various loop patterns including conditionals, reduction, pre fi x sum, merging and indexed summation can be vectorized with support from the ISA. 6

Computational characteristics of state vector simulation 7 10 100 0
2 4 6 8 10 12 14 Runtime [µs] Target Qubit Index Gather-Scatter Contiguous Strided |ψ⟩ = a0…00 |0…00⟩ + a0…01 |0…01⟩ + … + a1…11 |1…11⟩ [ a′ *…*0i *…* a′ *…*1i *…*] = [ U00 U01 U10 U11 ] [ a*…*0i *…* a*…*1i *…* ] better To compute the -th element, the -th and -th elements of the state vector are be loaded and stored. i i i ⊕ 2k Unitary matrix corresponding to a gate Strided access pattern where the stride depends on the target qubit (stride= ). Maintaining both long loops and contiguous accesses for all target qubits is not possible. 2k Basis states Probability amplitudes

Memory layout • Making the state vector continuous on memory
requires strided memory access pattern. • Instead, we make the same basis state from of state vectors contiguous memory. This gives ~6.9x speedup. • Separating the real and imaginary components gives further performance improvement (~1.9x). 8 Re |00> Im |00> Re |01> Im |01> Re |10> Im |10> Re |11> Im |11> Re |00> Im |00> Re |01> Im |01> Re |10> Im |10> Re |11> Im |11> State Vector #0 State Vector #1 Re |00> Im |00> Re |01> Im |01> Re |10> Im |10> Re |11> Im |11> Re |00> Im |00> Re |01> Im |01> Re |10> Im |10> Re |11> Im |11> SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 Re |00> Im |00> Re |01> Im |01> Re |10> Im |10> Re |11> Im |11> Re |00> Im |00> Re |01> Im |01> Re |10> Im |10> Re |11> Im |11> SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 SV #0 SV #1 SoA   cuStateVec AoS 1 AoS 2   our work

Kernel overview • Two levels of parallelism can be exploited:
basis states and state vectors. • Basic design is to parallelize over the bases and vectorize over the samples. • This ensures stride one access pattern with the AoS layout of SVs. 9 uint64_t mask = 1ULL << target; uint64_t lo_mask = mask - 1; uint64_t hi_mask = ~lo_mask; #pragma omp parallel for for (uint64_t i = 0; i < 1ULL << (N - 1); i++) { ITYPE i0 = ((i & hi_mask) << 1) | (i & lo_mask); ITYPE i1 = i0 | mask; #pragma omp simd for (uint32_t sample = 0; sample < B; sample++) { double tmp0_re = state_re[sample + i0 * B]; double tmp0_im = state_im[sample + i0 * B]; double tmp1_re = state_re[sample + i1 * B]; double tmp1_im = state_im[sample + i1 * B]; state_re[sample + i0 * B] = tmp1_re; state_im[sample + i0 * B] = tmp1_im; state_re[sample + i1 * B] = tmp0_re; state_im[sample + i1 * B] = tmp0_im; } } Iterates over SVs Iterates over basis states Example of a Pauli-X gate application

Noise gate 10 ℰ(ρ) = (1 − p)ρ + p
3 ∑ A∈{X,Y,Z} AρA We assume the depolarizing noise [1] model. The density matrix representation of a one-qubit depolarizing gate is as follows (one of X/Y/Z gates is applied with probability ). p [1] M. A. Nielsen and I. L. Chuang. Quantum computation and quantum information. Cambridge university press, 2010. A naive implementation would iterate over the SVs and apply a gate to a SV if it should be a ff ected by noise. VE compiler generates a code that uses vector masking, but since , most of the vector elements are masked wasted. p ≪ 1

Noise gate simulation scheme • We eliminate conditionals by selecting
the SVs a ff ected by noise in advance. • Expected length of the vectorized loop is , which can become suboptimal for the VE architecture very small error rates. pB 11 |0…00⟩ |0…00⟩ |0…00⟩ |0…00⟩ |0…01⟩ |0…01⟩ SV #0 SV #1 SV #2 SV #3 |0…01⟩ |0…01⟩ SV #2 SV #3 SV #0 SV #1 |0…00⟩ |0…00⟩ |0…00⟩ |0…00⟩ |0…01⟩ |0…01⟩ |0…01⟩ |0…01⟩ × × × Gather Scatter

Evaluation setup • VE ◦ Software: batched-qsim-ve (https://github.com/keichi/batched-qsim-ve) ◦ Hardware
• NEC VE Type 20B (VE20) • NEC VE Type 30A (VE30) • GPU ◦ Software: cuStateVec 1.4.1 ◦ Hardware: • NVIDIA A100 40GB PCIe • NVIDIA A100 80GB PCIe 12 VE Type 20B VE Type 30A A100 40GB A100 80GB Peak Performance 2.4 TFLOP/s 4.9 TFLOP/s 9.7 TFLOP/s 9.7 TFLOP/s Memory Bandwidth 1.53 TB/s 2.45 TB/s 1.55 TB/s 1.93 TB/s Memory Capacity 48 GB 96 GB 40 GB 80 GB LLC B/W 3.0 TB/s 6.4 TB/s 4.9 TB/s 4.9 TB/s LLC Capacity 16 MB 64 MB 40 MB 40 MB AOBA supercomputer at Tohoku University

1-qubit gate (RX gate) performance • Measured the time to
apply an RX gate to 105 state vectors with varying batch size. ◦ Performance of VE30 is identical to that of A100 80GB. ◦ Batch size on VE needs to be at least 200 for best performance. 13 0 2 4 6 8 10 12 1×102 1×103 1×104 1×105 Runtime [ms] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B 0 2 4 6 8 10 12 14 1×102 1×103 1×104 1×105 Runtime [ms] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B 0 50 100 150 200 250 300 350 1×102 1×103 1×104 1×105 Runtime [ms] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B 0 1 2 3 4 5 6 1×102 1×103 1×104 Runtime [s] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B 8 qubits 12 qubits 16 qubits 20 qubits better insu ff i cient loop length cache e ff ect

2-qubit gate (CNOT gate) performance • Measured the time to
apply a CNOT gate to 105 state vectors with varying batch size. ◦ Trend is similar to RX gate, but A100 is ~25% faster than VE Type 30A. ◦ Suggests room for optimization since VE30’s e ff ective memory bandwidth is ~10% higher than that of A100 80GB. 14 8 qubits 12 qubits 16 qubits 20 qubits 0 1 2 3 4 5 6 7 8 9 10 1×102 1×103 1×104 1×105 Runtime [ms] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B 0 2 4 6 8 10 12 14 16 18 1×102 1×103 1×104 1×105 Runtime [ms] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B 0 50 100 150 200 250 300 350 400 450 1×102 1×103 1×104 1×105 Runtime [ms] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B 0 1 2 3 4 5 6 7 8 1×102 1×103 1×104 Runtime [s] Batch size A100 80GB A100 40GB VE Type 30A VE Type 20B better

Depolarizing noise gate performance • Measured the time to apply
a 1-qubit depolarizing gate to state vectors. ◦ cuStateVec lacks a function to selectively apply gate to a subset of batch, we apply an identity gate to SVs that are not a ff ected by noise. ◦ VE performs better with lower noise rate and larger batch size. 15 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 1×102 1×103 1×104 1×105 Runtime [s] Batch size A100 80GB VE Type 30A 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 1×10-3 1×10-2 1×10-1 Runtime [s] Noise rate A100 80GB VE Type 30A 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 8 9 10 11 12 13 14 Runtime [s] Qubits A100 80GB VE Type 30A Varying noise rate (14 qubits, 105 batch size) Varying batch size (14 qubits, 10-3 error rate) Varying number of qubits (105 batch size, 10-3 error rate) better longer loop length less work

Random circuit simulation output 16 0 0.5 1 1.5 2
2.5 3 3.5 4 -5 -4 -3 -2 -1 0 1 2 Probability Density log(2np) Noiseless Noisy (error rate=1e-3) Noisy (error rate=5e-3) Noisy (error rate=1e-2) 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 -5 -4 -3 -2 -1 0 1 2 Probability Density log(2np) This work Qulacs Theoretical [1] F. Arute et al., “Quantum supremacy using a programmable superconducting processor,” Nature, vol. 574, no. 7779, 2019. Google Sycamore-style random circuit 1 0 0 0 0 cos θ −i sin θ 0 0 −sin θ cos θ 0 0 0 0 1 1 2 [ 1 + i 1 − i 1 − i 1 + i] 1 2 [ 1 + i −1 − i 1 + i 1 + i ] 1 2 2 −1 − i 1 − i − 2 X Y W iSWAP-like Noiseless Noisy PDF of probability that a bitstring is observed

Random circuit performance • VE largely outperforms A100 80GB up
to 1.95x if the batch size is su ffi ciently large. • Even on VE, noise gate application dominates the runtime. Further e ff ort should be put into optimizing noise gates. 17 0 20 40 60 80 100 120 1×102 1×103 1×104 1×105 Runtime [s] Batch size A100 80GB VE Type 30A 0 10 20 30 40 50 60 70 1×102 1×103 1×104 1×105 Runtime [s] Batch size A100 80GB VE Type 30A 0 20 40 60 80 100 120 1×102 1×103 1×104 1×105 Runtime [s] Batch size A100 80GB VE Type 30A 0 10 20 30 40 50 60 70 80 90 100 1×102 1×103 1×104 1×105 Runtime [s] Batch size A100 80GB VE Type 30A better p = 10−3 p = 5 ⋅ 10−3 p = 10−2 p = 5 ⋅ 10−2

Conclusion and future work • We prototyped a batched quantum
circuit simulator for NEC Vector Engine. • Performance of VE30 is comparable or better to that of NVIDIA A100. ◦ Identical for 1-qubit gates and 25% lower performance for 2-qubit gates. ◦ VE30 outperforms cuStateVec by up to 1.95x in simulating random circuits. • Future work includes: ◦ Evaluation using real-world VQA circuits with noise. ◦ Porting to other CPUs with SIMD instructions such as AVX-512 and SVE. 18 This work was partially supported by JSPS KAKENHI Grant Number JP23K16890. Many thanks to the Qulacs developers!

Prototype of a Batched Quantum Circuit Simulato...

Prototype of a Batched Quantum Circuit Simulator for the Vector Engine

Keichi Takahashi

More Decks by Keichi Takahashi

Featured

Transcript

4th International Workshop on Quantum Computing Software @SC23 Prototype of

Initial motivation for developing a batched simulator • Center for

Use cases for a batched simulator • Most state-of-the-art simulators

Related work • NVIDIA cuQuantum (cuStateVec) implemented batched gate application

SX-Aurora TSUBASA Vector Engine (VE) • The NEC Vector Engine

Performance tuning basics for the Vector Engine • Vectorization is

Computational characteristics of state vector simulation 7 10 100 0

Memory layout • Making the state vector continuous on memory

Kernel overview • Two levels of parallelism can be exploited:

Noise gate 10 ℰ(ρ) = (1 − p)ρ + p

Noise gate simulation scheme • We eliminate conditionals by selecting

Evaluation setup • VE ◦ Software: batched-qsim-ve (https://github.com/keichi/batched-qsim-ve) ◦ Hardware

1-qubit gate (RX gate) performance • Measured the time to

2-qubit gate (CNOT gate) performance • Measured the time to

Depolarizing noise gate performance • Measured the time to apply

Random circuit simulation output 16 0 0.5 1 1.5 2

Random circuit performance • VE largely outperforms A100 80GB up

Conclusion and future work • We prototyped a batched quantum