$30 off During Our Annual Pro Sale. View Details »

Prototype of a Batched Quantum Circuit Simulator for the Vector Engine

Keichi Takahashi
November 13, 2023
11

Prototype of a Batched Quantum Circuit Simulator for the Vector Engine

Keichi Takahashi, Toshio Mori, Hiroyuki Takizawa, “Prototype of a Batched Quantum Circuit Simulator for the Vector Engine,” Fourth International Workshop on Quantum Computing Software held in conjunction with SC23: The International Conference for High Performance Computing, Networking, Storage, and Analysis, Nov. 2023. 10.1145/3624062.3624226

Keichi Takahashi

November 13, 2023
Tweet

Transcript

  1. 4th International Workshop on Quantum Computing Software @SC23
    Prototype of a Batched Quantum Circuit
    Simulator for the Vector Engine
    Keichi Takahashi1, Toshio Mori2, Hiroyuki Takizawa1


    1Cyberscience Center, Tohoku University, Japan


    2Center for Quantum Information and Quantum Biology, Osaka University, Japan

    View Slide

  2. Initial motivation for developing a batched simulator
    • Center for Quantum Information and Quantum Biology (QIQB) at Osaka University
    has been developing a state vector simulator named Qulacs [1], which is widely
    used in the Japanese research community.


    • One of the primary use cases of Qulacs is variational quantum algorithm including
    quantum machine learning algorithms.


    • A feedback we often receive from our users is to have a capability to simulate a
    large number of small- to intermediate-scale circuits (batched circuit simulation).
    2
    [1] Y. Suzuki et al., “Qulacs: a fast and versatile quantum circuit simulator for research purpose ,” Quantum, vol. 5, 2021.

    View Slide

  3. Use cases for a batched simulator
    • Most state-of-the-art simulators focus on simulating large-scale noiseless circuits.


    ◦ Noise simulation


    • Exact simulation based on density matrices requires complexity.


    • Monte Carlo approach [1] can estimate the result with fewer resources.


    ◦ Variational Quantum Algorithms (VQA)


    • Iteratively updates a parameterized circuit based on the simulation of the
    outcome.


    • Extreme-scale simulators are needed for scalability evaluation,

    but trial-and-error algorithm development is often done at smaller scales.
    O(22n)
    3
    [1] W. Berquist et al., “Stochastic Approach for Simulating Quantum Noise Using Tensor Networks,” QCS 2022.

    View Slide

  4. Related work
    • NVIDIA cuQuantum (cuStateVec)
    implemented batched gate application and
    measurement functions in version 1.4.0.


    • However, to the best of our knowledge, no
    quantum simulations utilize the batched
    functions.


    • This work is the
    fi
    rst study to implement a
    state vector simulator on VE and compare
    with cuStateVec.
    4
    custatevecStatus_t custatevecApplyMatrixBatched(
    custatevecHandle_t handle,
    void *batchedSv,
    cudaDataType_t svDataType,
    const uint32_t nIndexBits,
    const uint32_t nSVs,
    custatevecIndex_t svStride,
    custatevecMatrixMapType_t mapType,
    const int32_t *matrixIndices,
    const void *matrices,
    cudaDataType_t matrixDataType,
    custatevecMatrixLayout_t layout,
    const int32_t adjoint,
    const uint32_t nMatrices,
    const int32_t *targets,
    const uint32_t nTargets,
    const int32_t *controls,
    const int32_t *controlBitValues,
    const uint32_t nControls,
    custatevecComputeType_t computeType,
    void *extraWorkspace,
    size_t extraWorkspaceSizeInBytes
    )
    Target and
    control bits
    Gate matrices

    View Slide

  5. SX-Aurora TSUBASA Vector Engine (VE)
    • The NEC Vector Engine is a vector processor implemented as
    a PCIe card.


    • Combination of HBM and long-vector architecture o
    ff
    ers
    high performance in memory-bound HPC applications.


    • State vector simulation, which is also memory-bound, could
    also be a potential
    fi
    t for VE.
    5
    Core Core
    Core Core
    Core Core
    Core Core
    Core Core
    Core Core
    LLC
    HBM2E
    HBM2E
    HBM2E
    HBM2E
    HBM2E
    HBM2E
    Core
    Core
    Core
    Core
    LLC
    SPU
    L3 Cache (2 MB)
    VPU
    Network on Chip (2D Mesh)
    Last Level Cache (64 MB)
    Main Memory (96 GB)
    2.45 TB/s
    6.4 TB/s
    410 GB/s
    410 GB/s
    VE Type 20B VE Type 30A A100 PCIe H100 PCIe
    Peak Performance 2.4 TFLOP/s 4.9 TFLOP/s 9.7 TFLOP/s 25.6 TFLOP/s
    Memory Bandwidth 1.53 TB/s 2.45 TB/s 1.93 TB/s 2.00 TB/s
    Memory Capacity 48 GB 96 GB 80 GB 80 GB
    Process Rule 16nm 7nm 7nm 4 nm
    [1] K. Takahashi et al., “Performance Evaluation of a Next-Generation SX-Aurora TSUBASA Vector Supercomputer ,” ISC 2023.

    View Slide

  6. Performance tuning basics for the Vector Engine
    • Vectorization is vital to attain good performance on VE.


    ◦ Standard tuning principles apply: prefer consecutive memory accesses, increase
    loop length, avoid scatter/gather accesses, minimize conditionals, etc.


    ◦ VE o
    ff
    ers 16,384-bit wide registers (256 FP64 values), requiring long loops.


    • NEC o
    ff
    ers a powerful auto-vectorizing compiler for C/C++ and Fortran:


    ◦ Use of intrinsics or inline assembly is rarely needed.


    ◦ Pragma directives are provided to guide the auto-vectorization.


    ◦ Various loop patterns including conditionals, reduction, pre
    fi
    x sum, merging and
    indexed summation can be vectorized with support from the ISA.
    6

    View Slide

  7. Computational characteristics of state vector simulation
    7
    10
    100
    0 2 4 6 8 10 12 14
    Runtime [µs]
    Target Qubit Index
    Gather-Scatter
    Contiguous
    Strided
    |ψ⟩ = a0…00
    |0…00⟩ + a0…01
    |0…01⟩ + … + a1…11
    |1…11⟩
    [
    a′

    *…*0i
    *…*
    a′

    *…*1i
    *…*]
    = [
    U00
    U01
    U10
    U11
    ] [
    a*…*0i
    *…*
    a*…*1i
    *…*
    ]
    better
    To compute the -th element, the -th and -th elements
    of the state vector are be loaded and stored.
    i i i ⊕ 2k
    Unitary matrix corresponding to a gate
    Strided access pattern where the stride depends on the
    target qubit (stride= ).


    Maintaining both long loops and contiguous
    accesses for all target qubits is not possible.
    2k
    Basis states
    Probability amplitudes

    View Slide

  8. Memory layout
    • Making the state vector continuous on memory
    requires strided memory access pattern.


    • Instead, we make the same basis state from of
    state vectors contiguous memory. This gives
    ~6.9x speedup.


    • Separating the real and imaginary components
    gives further performance improvement (~1.9x).
    8
    Re |00>
    Im |00>
    Re |01>
    Im |01>
    Re |10>
    Im |10>
    Re |11>
    Im |11>
    Re |00>
    Im |00>
    Re |01>
    Im |01>
    Re |10>
    Im |10>
    Re |11>
    Im |11>
    State Vector #0
    State Vector #1
    Re |00>
    Im |00>
    Re |01>
    Im |01>
    Re |10>
    Im |10>
    Re |11>
    Im |11>
    Re |00>
    Im |00>
    Re |01>
    Im |01>
    Re |10>
    Im |10>
    Re |11>
    Im |11>
    SV #0
    SV #1
    SV #0
    SV #1
    SV #0
    SV #1
    SV #0
    SV #1
    Re |00>
    Im |00>
    Re |01>
    Im |01>
    Re |10>
    Im |10>
    Re |11>
    Im |11>
    Re |00>
    Im |00>
    Re |01>
    Im |01>
    Re |10>
    Im |10>
    Re |11>
    Im |11>
    SV #0
    SV #1
    SV #0
    SV #1
    SV #0
    SV #1
    SV #0
    SV #1
    SV #0
    SV #1
    SV #0
    SV #1
    SV #0
    SV #1
    SV #0
    SV #1
    SoA

    cuStateVec
    AoS 1 AoS 2

    our work

    View Slide

  9. Kernel overview
    • Two levels of parallelism can
    be exploited: basis states
    and state vectors.


    • Basic design is to parallelize
    over the bases and vectorize
    over the samples.


    • This ensures stride one
    access pattern with the AoS
    layout of SVs.
    9
    uint64_t mask = 1ULL << target;
    uint64_t lo_mask = mask - 1;
    uint64_t hi_mask = ~lo_mask;
    #pragma omp parallel for
    for (uint64_t i = 0; i < 1ULL << (N - 1); i++) {
    ITYPE i0 = ((i & hi_mask) << 1) | (i & lo_mask);
    ITYPE i1 = i0 | mask;
    #pragma omp simd
    for (uint32_t sample = 0; sample < B; sample++) {
    double tmp0_re = state_re[sample + i0 * B];
    double tmp0_im = state_im[sample + i0 * B];
    double tmp1_re = state_re[sample + i1 * B];
    double tmp1_im = state_im[sample + i1 * B];
    state_re[sample + i0 * B] = tmp1_re;
    state_im[sample + i0 * B] = tmp1_im;
    state_re[sample + i1 * B] = tmp0_re;
    state_im[sample + i1 * B] = tmp0_im;
    }
    }
    Iterates over SVs
    Iterates over basis states
    Example of a Pauli-X gate application

    View Slide

  10. Noise gate
    10
    ℰ(ρ) = (1 − p)ρ +
    p
    3 ∑
    A∈{X,Y,Z}
    AρA
    We assume the depolarizing noise [1] model. The density matrix representation of a one-qubit
    depolarizing gate is as follows (one of X/Y/Z gates is applied with probability ).
    p
    [1] M. A. Nielsen and I. L. Chuang. Quantum computation and quantum information. Cambridge university press, 2010.
    A naive implementation would iterate over the SVs and
    apply a gate to a SV if it should be a
    ff
    ected by noise.


    VE compiler generates a code that uses vector masking,
    but since , most of the vector elements are masked
    wasted.
    p ≪ 1

    View Slide

  11. Noise gate simulation scheme
    • We eliminate conditionals by selecting the SVs a
    ff
    ected by noise in advance.


    • Expected length of the vectorized loop is , which can become suboptimal for the
    VE architecture very small error rates.
    pB
    11
    |0…00⟩
    |0…00⟩
    |0…00⟩
    |0…00⟩
    |0…01⟩
    |0…01⟩
    SV #0
    SV #1
    SV #2
    SV #3
    |0…01⟩
    |0…01⟩
    SV #2
    SV #3
    SV #0
    SV #1
    |0…00⟩
    |0…00⟩
    |0…00⟩
    |0…00⟩
    |0…01⟩
    |0…01⟩
    |0…01⟩
    |0…01⟩
    ×
    ×
    ×
    Gather Scatter

    View Slide

  12. Evaluation setup
    • VE


    ◦ Software: batched-qsim-ve (https://github.com/keichi/batched-qsim-ve)


    ◦ Hardware


    • NEC VE Type 20B (VE20)


    • NEC VE Type 30A (VE30)


    • GPU


    ◦ Software: cuStateVec 1.4.1


    ◦ Hardware:


    • NVIDIA A100 40GB PCIe


    • NVIDIA A100 80GB PCIe
    12
    VE Type 20B VE Type 30A A100 40GB A100 80GB
    Peak Performance 2.4 TFLOP/s 4.9 TFLOP/s 9.7 TFLOP/s 9.7 TFLOP/s
    Memory Bandwidth 1.53 TB/s 2.45 TB/s 1.55 TB/s 1.93 TB/s
    Memory Capacity 48 GB 96 GB 40 GB 80 GB
    LLC B/W 3.0 TB/s 6.4 TB/s 4.9 TB/s 4.9 TB/s
    LLC Capacity 16 MB 64 MB 40 MB 40 MB
    AOBA supercomputer at Tohoku University

    View Slide

  13. 1-qubit gate (RX gate) performance
    • Measured the time to apply an RX gate to 105 state vectors with varying batch size.


    ◦ Performance of VE30 is identical to that of A100 80GB.


    ◦ Batch size on VE needs to be at least 200 for best performance.
    13
    0
    2
    4
    6
    8
    10
    12
    1×102 1×103 1×104 1×105
    Runtime [ms]
    Batch size
    A100 80GB
    A100 40GB
    VE Type 30A
    VE Type 20B
    0
    2
    4
    6
    8
    10
    12
    14
    1×102 1×103 1×104 1×105
    Runtime [ms]
    Batch size
    A100 80GB
    A100 40GB
    VE Type 30A
    VE Type 20B
    0
    50
    100
    150
    200
    250
    300
    350
    1×102 1×103 1×104 1×105
    Runtime [ms]
    Batch size
    A100 80GB
    A100 40GB
    VE Type 30A
    VE Type 20B
    0
    1
    2
    3
    4
    5
    6
    1×102 1×103 1×104
    Runtime [s]
    Batch size
    A100 80GB
    A100 40GB
    VE Type 30A
    VE Type 20B
    8 qubits 12 qubits 16 qubits 20 qubits
    better
    insu
    ff i
    cient loop length
    cache e
    ff
    ect

    View Slide

  14. 2-qubit gate (CNOT gate) performance
    • Measured the time to apply a CNOT gate to 105 state vectors with varying batch size.


    ◦ Trend is similar to RX gate, but A100 is ~25% faster than VE Type 30A.


    ◦ Suggests room for optimization since VE30’s e
    ff
    ective memory bandwidth is
    ~10% higher than that of A100 80GB.
    14
    8 qubits 12 qubits 16 qubits 20 qubits
    0
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    1×102 1×103 1×104 1×105
    Runtime [ms]
    Batch size
    A100 80GB
    A100 40GB
    VE Type 30A
    VE Type 20B
    0
    2
    4
    6
    8
    10
    12
    14
    16
    18
    1×102 1×103 1×104 1×105
    Runtime [ms]
    Batch size
    A100 80GB
    A100 40GB
    VE Type 30A
    VE Type 20B
    0
    50
    100
    150
    200
    250
    300
    350
    400
    450
    1×102 1×103 1×104 1×105
    Runtime [ms]
    Batch size
    A100 80GB
    A100 40GB
    VE Type 30A
    VE Type 20B
    0
    1
    2
    3
    4
    5
    6
    7
    8
    1×102 1×103 1×104
    Runtime [s]
    Batch size
    A100 80GB
    A100 40GB
    VE Type 30A
    VE Type 20B
    better

    View Slide

  15. Depolarizing noise gate performance
    • Measured the time to apply a 1-qubit depolarizing gate to state vectors.


    ◦ cuStateVec lacks a function to selectively apply gate to a subset of batch, we
    apply an identity gate to SVs that are not a
    ff
    ected by noise.


    ◦ VE performs better with lower noise rate and larger batch size.
    15
    0
    0.05
    0.1
    0.15
    0.2
    0.25
    0.3
    0.35
    0.4
    1×102 1×103 1×104 1×105
    Runtime [s]
    Batch size
    A100 80GB
    VE Type 30A
    0
    0.05
    0.1
    0.15
    0.2
    0.25
    0.3
    0.35
    1×10-3 1×10-2 1×10-1
    Runtime [s]
    Noise rate
    A100 80GB
    VE Type 30A
    0
    0.05
    0.1
    0.15
    0.2
    0.25
    0.3
    0.35
    8 9 10 11 12 13 14
    Runtime [s]
    Qubits
    A100 80GB
    VE Type 30A
    Varying noise rate (14
    qubits, 105 batch size)
    Varying batch size (14
    qubits, 10-3 error rate)
    Varying number of qubits
    (105 batch size, 10-3 error rate)
    better
    longer loop length
    less work

    View Slide

  16. Random circuit simulation output
    16
    0
    0.5
    1
    1.5
    2
    2.5
    3
    3.5
    4
    -5 -4 -3 -2 -1 0 1 2
    Probability Density
    log(2np)
    Noiseless
    Noisy (error rate=1e-3)
    Noisy (error rate=5e-3)
    Noisy (error rate=1e-2)
    0
    0.05
    0.1
    0.15
    0.2
    0.25
    0.3
    0.35
    0.4
    -5 -4 -3 -2 -1 0 1 2
    Probability Density
    log(2np)
    This work
    Qulacs
    Theoretical
    [1] F. Arute et al., “Quantum supremacy using a programmable superconducting processor,” Nature, vol. 574, no. 7779, 2019.
    Google Sycamore-style random circuit
    1 0 0 0
    0 cos θ −i sin θ 0
    0 −sin θ cos θ 0
    0 0 0 1
    1
    2 [
    1 + i 1 − i
    1 − i 1 + i]
    1
    2 [
    1 + i −1 − i
    1 + i 1 + i ]
    1
    2
    2 −1 − i
    1 − i − 2
    X
    Y
    W
    iSWAP-like
    Noiseless Noisy
    PDF of probability that a bitstring is observed

    View Slide

  17. Random circuit performance
    • VE largely outperforms A100 80GB up to 1.95x if the batch size is su
    ffi
    ciently large.


    • Even on VE, noise gate application dominates the runtime. Further e
    ff
    ort should be
    put into optimizing noise gates.
    17
    0
    20
    40
    60
    80
    100
    120
    1×102 1×103 1×104 1×105
    Runtime [s]
    Batch size
    A100 80GB
    VE Type 30A
    0
    10
    20
    30
    40
    50
    60
    70
    1×102 1×103 1×104 1×105
    Runtime [s]
    Batch size
    A100 80GB
    VE Type 30A
    0
    20
    40
    60
    80
    100
    120
    1×102 1×103 1×104 1×105
    Runtime [s]
    Batch size
    A100 80GB
    VE Type 30A
    0
    10
    20
    30
    40
    50
    60
    70
    80
    90
    100
    1×102 1×103 1×104 1×105
    Runtime [s]
    Batch size
    A100 80GB
    VE Type 30A
    better
    p = 10−3 p = 5 ⋅ 10−3 p = 10−2 p = 5 ⋅ 10−2

    View Slide

  18. Conclusion and future work
    • We prototyped a batched quantum circuit simulator for NEC Vector Engine.


    • Performance of VE30 is comparable or better to that of NVIDIA A100.


    ◦ Identical for 1-qubit gates and 25% lower performance for 2-qubit gates.


    ◦ VE30 outperforms cuStateVec by up to 1.95x in simulating random circuits.


    • Future work includes:


    ◦ Evaluation using real-world VQA circuits with noise.


    ◦ Porting to other CPUs with SIMD instructions such as AVX-512 and SVE.
    18
    This work was partially supported by JSPS KAKENHI Grant Number JP23K16890.


    Many thanks to the Qulacs developers!

    View Slide