Development of a Batched Quantum Circuit Simulator for the NISQ Era

Keichi Takahashi
July 10, 2025

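Presented at the 17th Symposium of the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN)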


Transcript

  1. Quantum circuit simulation
    - As interest in quantum computing grows, quantum circuit simulators that reproduce gate-based quantum computers on classical computers are being actively developed.
    - Examples: Qiskit Aer (IBM), Cirq (Google), QuEST (Oxford), Qulacs (QIQB).
    - Current work aims to increase the number of qubits that can be simulated, through accelerators such as GPUs [1], efficient computational methods such as tensor networks [2], and large-scale distributed parallelization [3].

    [1] T. Jones et al., "QuEST and High Performance Simulation of Quantum Computers," Sci Rep 9, 10736 (2019).
    [2] Y. Liu et al., "Closing the 'quantum supremacy' gap: achieving real-time simulation of a random quantum circuit using a new Sunway supercomputer," SC'21, 2021.
    [3] A. Tabuchi et al., "mpiQulacs: A Scalable Distributed Quantum Computer Simulator for ARM-based Clusters," 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), 2023.
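  2. Development of a batched quantum circuit simulator
    - Rather than pursuing simulation scale alone, we aim for a simulator that meets the needs of the quantum information field [1] and accelerates research and development.
    - NVIDIA's cuStateVec library provides batched computation, but no existing simulator actually uses it (as of [date illegible]).

    Goal: develop and release a batched quantum circuit simulator that can simulate many quantum circuits at once, and identify and organize the design and optimization challenges of batched simulators so that they can inform future simulator development.

    [1] T. Ichikawa et al., "Current numbers of qubits and their uses," Nat Rev Phys 6, 345–347 (2024).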
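  3. Requirements for the batched quantum circuit simulator
    - Speed: many circuits can be simulated quickly.
      - Investigate parallelization and vectorization schemes suited to batched computation.
    - Flexibility: a wide variety of quantum circuits can be described, and the simulator can interoperate with other software.
      - Implemented as a library with Python and C++ APIs.
      - Provides an API for expressing many quantum circuits efficiently.
    - Practicality: quantum algorithms that are actually being studied can be implemented.
      - Provides various quantum gates, observables, and aggregation functions over quantum states.
      - The major quantum algorithms for NISQ devices can be implemented.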
  4. SX-Aurora TSUBASA Vector Engine (VE)
    - A vector processor implemented on a PCIe card [1].
    - The combination of HBM and a long-vector architecture delivers excellent performance on memory-bound HPC applications.
    - State-vector simulation is also memory-bound, so high performance can be expected on the VE.

    [Figure: VE block diagram. Labels: cores (SPU, VPU, 2 MB L3 cache), network on chip (2D mesh), last-level cache (64 MB), HBM2E main memory (96 GB); bandwidth labels 2.45 TB/s, 6.4 TB/s, 410 GB/s.]

                        VE Type 20B   VE Type 30A   A100 PCIe     H100 PCIe
    Peak performance    2.4 TFLOP/s   4.9 TFLOP/s   9.7 TFLOP/s   25.6 TFLOP/s
    Memory bandwidth    1.53 TB/s     2.45 TB/s     1.93 TB/s     2.00 TB/s
    Memory capacity     48 GB         96 GB         80 GB         80 GB
    Process rule        16 nm         7 nm          7 nm          4 nm

    [1] K. Takahashi et al., "Performance Evaluation of a Next-Generation SX-Aurora TSUBASA Vector Supercomputer," ISC 2023.
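    The memory-bound argument can be made concrete with a back-of-the-envelope estimate. The sketch below is not part of the talk: it assumes 16 bytes per amplitude (two doubles, matching the separate real and imaginary arrays in the kernel shown later), assumes every amplitude is read once and written once per gate, and ignores caches. The function names are hypothetical.

```python
def state_bytes(n_qubits: int, batch_size: int, bytes_per_amplitude: int = 16) -> int:
    """Memory footprint of `batch_size` state vectors of `n_qubits` qubits each."""
    return (2 ** n_qubits) * batch_size * bytes_per_amplitude


def gate_time_lower_bound(n_qubits: int, batch_size: int,
                          bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound on one gate: read and write every amplitude once."""
    traffic = 2 * state_bytes(n_qubits, batch_size)
    return traffic / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Peak memory bandwidths taken from the comparison table (bytes per second).
    for name, bandwidth in [("VE Type 30A", 2.45e12), ("A100 PCIe", 1.93e12)]:
        t = gate_time_lower_bound(14, 10**5, bandwidth)
        print(f"{name}: at least {t * 1e3:.1f} ms per gate")
```

    Under these assumptions the estimate depends only on memory bandwidth and not on peak FLOP/s, which is why a bandwidth-rich processor is a reasonable target for this workload.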
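  5. Computational characteristics of state-vector simulation
    An n-qubit state is stored as 2^n probability amplitudes (complex numbers), one per computational basis state:

    $$|\psi\rangle = a_{0\ldots00}\,|0\ldots00\rangle + a_{0\ldots01}\,|0\ldots01\rangle + \cdots + a_{1\ldots11}\,|1\ldots11\rangle$$

    A gate on target qubit k applies the corresponding unitary matrix U to every pair of amplitudes whose indices differ only in bit k:

    $$\begin{bmatrix} a'_{*\ldots*0_k*\ldots*} \\ a'_{*\ldots*1_k*\ldots*} \end{bmatrix} = \begin{bmatrix} U_{00} & U_{01} \\ U_{10} & U_{11} \end{bmatrix} \begin{bmatrix} a_{*\ldots*0_k*\ldots*} \\ a_{*\ldots*1_k*\ldots*} \end{bmatrix}$$

    - Updating element i requires loading and storing elements i and i ⊕ 2^k of the state vector, where k is the target qubit.
    - Because the stride (2^k) depends on the target qubit, it is difficult to achieve both a long loop and contiguous access.

    [Figure: runtime [µs] vs. target qubit index for gather-scatter, contiguous and strided access; lower is better.]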
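    To illustrate the pairing of element i with element i ⊕ 2^k, here is a minimal pure-Python sketch for a single (non-batched) state vector. It is written for clarity and is not the simulator's implementation; the function names are hypothetical.

```python
import math


def apply_single_qubit_gate(state, k, u):
    """Apply the 2x2 matrix `u` to target qubit `k` of `state`, in place."""
    stride = 1 << k                      # distance between paired amplitudes: 2^k
    for i in range(len(state)):
        if i & stride:                   # handle each pair once, from its bit-k = 0 member
            continue
        j = i ^ stride                   # partner index i XOR 2^k
        a0, a1 = state[i], state[j]
        state[i] = u[0][0] * a0 + u[0][1] * a1
        state[j] = u[1][0] * a0 + u[1][1] * a1
    return state


def rx_matrix(theta):
    """RX(theta) = exp(-i * theta * X / 2)."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * s], [-1j * s, c]]


if __name__ == "__main__":
    pauli_x = [[0, 1], [1, 0]]
    # Two qubits in |00>; flipping qubit 1 moves the amplitude to index 2.
    print(apply_single_qubit_gate([1, 0, 0, 0], 1, pauli_x))
```

    For small k the two members of a pair sit close together in memory, while for large k they are 2^k elements apart, which is the stride dependence described above.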
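  6. Implementing quantum gates
    - A doubly nested loop over computational basis states and state vectors.
    - The loop over basis states is thread-parallelized, and the loop over state vectors is vectorized.
    - Placing the same basis state of different state vectors contiguously in memory makes the accesses contiguous.

    Kernel that applies a Pauli-X gate:

        uint64_t mask = 1ULL << target;   /* bit of the target qubit */
        uint64_t lo_mask = mask - 1;      /* bits below the target */
        uint64_t hi_mask = ~lo_mask;      /* bits at and above the target */

        /* Loop over computational basis states (thread-parallel) */
        #pragma omp parallel for
        for (uint64_t i = 0; i < 1ULL << (N - 1); i++) {
            /* Insert a 0 (i0) or a 1 (i1) at the target bit position */
            uint64_t i0 = ((i & hi_mask) << 1) | (i & lo_mask);
            uint64_t i1 = i0 | mask;

            /* Loop over state vectors in the batch (vectorized, unit stride) */
            #pragma omp simd
            for (uint32_t sample = 0; sample < B; sample++) {
                double tmp0_re = state_re[sample + i0 * B];
                double tmp0_im = state_im[sample + i0 * B];
                double tmp1_re = state_re[sample + i1 * B];
                double tmp1_im = state_im[sample + i1 * B];

                /* Pauli-X swaps the two amplitudes of each pair */
                state_re[sample + i0 * B] = tmp1_re;
                state_im[sample + i0 * B] = tmp1_im;
                state_re[sample + i1 * B] = tmp0_re;
                state_im[sample + i1 * B] = tmp0_im;
            }
        }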
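    The index arithmetic and the batch-interleaved layout of the Pauli-X kernel can be checked with a small pure-Python port. This is a sketch for experimentation only: it uses a single list of complex numbers instead of separate real and imaginary arrays, and the function names are hypothetical.

```python
def insert_zero_bit(i, target):
    """Insert a 0 at bit position `target` of `i`; bits at or above it shift up by one."""
    lo_mask = (1 << target) - 1
    return ((i & ~lo_mask) << 1) | (i & lo_mask)


def pauli_x_batched(state, n_qubits, batch_size, target):
    """Apply Pauli-X to `target` for every sample.

    Amplitude of basis state `i` in sample `s` lives at state[s + i * batch_size],
    so the inner loop over samples touches consecutive elements.
    """
    mask = 1 << target
    for i in range(1 << (n_qubits - 1)):          # loop over amplitude pairs
        i0 = insert_zero_bit(i, target)           # target bit = 0
        i1 = i0 | mask                            # target bit = 1
        for sample in range(batch_size):          # contiguous loop over the batch
            a = sample + i0 * batch_size
            b = sample + i1 * batch_size
            state[a], state[b] = state[b], state[a]
    return state


if __name__ == "__main__":
    n_qubits, batch_size = 2, 3
    state = [0j] * ((1 << n_qubits) * batch_size)
    state[0:batch_size] = [1 + 0j] * batch_size   # every sample starts in |00>
    print(pauli_x_batched(state, n_qubits, batch_size, target=0))
```

    With this layout the inner loop length equals the batch size regardless of the target qubit, which is what removes the stride dependence seen with a single state vector.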
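  7. Simulator API design
    - Designed with reference to the Qiskit and Qulacs APIs.
    - Functions that create and initialize a batched state vector.
    - Functions that apply gates to a batched state vector.
    - Functions that retrieve information from a batched state vector.

        import numpy as np

        state = State(3, 5)
        state.set_zero_state()                    # initialize to |0...0>
        state.act_h_gate(0)                       # H on qubit 0
        state.act_rx_gate(0, np.pi/4)             # RX on qubit 0, same angle for the whole batch
        state.act_rx_gate(1, [0.1, 0.2, 0.3])     # RX on qubit 1, one angle per batch element
        state.get_vector(0)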
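  8. Design of the gate-application API
    - Apply the same gate to every batch element: state.act_rx(0, 0.1)
    - Apply the same gate with a different parameter per batch element: state.act_rx(0, [0.1,0.2,0.3])
    - Noise gate: state.act_noise_gate(0, 0.1)

    [Figure: three batches of two-qubit circuits starting from |0⟩|0⟩. Same gate: every circuit applies RX(0.1) and CX. Per-batch parameters: the circuits apply RX(0.1), RX(0.2) and RX(0.3), each with CX. Noise gate: every circuit applies CX and an X gate appears in only one of them.]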
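    The scalar-or-list parameter convention can be sketched in a few lines of pure Python. This is an illustration of the semantics described on the slide, not veqsim's code: `broadcast_angles` and `rx_batched` are hypothetical names, and the batch-interleaved layout `state[sample + i * batch_size]` is assumed from the Pauli-X kernel.

```python
import math
from numbers import Real


def broadcast_angles(theta, batch_size):
    """Turn a scalar into one angle per batch element; validate a sequence."""
    if isinstance(theta, Real):
        return [float(theta)] * batch_size
    angles = [float(t) for t in theta]
    if len(angles) != batch_size:
        raise ValueError(f"expected {batch_size} angles, got {len(angles)}")
    return angles


def rx_batched(state, n_qubits, batch_size, target, theta):
    """Apply RX to `target` in every sample, with a shared or per-sample angle."""
    angles = broadcast_angles(theta, batch_size)
    cos_sin = [(math.cos(t / 2), math.sin(t / 2)) for t in angles]
    mask = 1 << target
    for i0 in range(1 << n_qubits):
        if i0 & mask:                       # visit each amplitude pair once
            continue
        i1 = i0 | mask
        for sample, (c, s) in enumerate(cos_sin):
            a = sample + i0 * batch_size
            b = sample + i1 * batch_size
            x0, x1 = state[a], state[b]
            state[a] = c * x0 - 1j * s * x1
            state[b] = -1j * s * x0 + c * x1
    return state


if __name__ == "__main__":
    # One qubit, three samples, all in |0>; a different angle for each sample.
    state = [1 + 0j, 1 + 0j, 1 + 0j, 0j, 0j, 0j]
    print(rx_batched(state, 1, 3, 0, [0.1, 0.2, 0.3]))
```

    Accepting either form in one function keeps the common case (one angle for the whole batch) short while still allowing parameter sweeps within a single batch.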
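  9. Simulator implementation
    - Python has no track record of running on the VE, so the Python interpreter runs on the CPU side.
    - Only the compute portion is offloaded to the VE through VE Driver API (VEDA), a library modeled on the CUDA Driver API, which gives a structure similar to a GPU setup.
    - For comparison, the compute portion is implemented so that it can be swapped for cuStateVec (GPU) or OpenMP (CPU).

    [Figure: software stack split between the CPU and the Vector Engine. Labels: user application, Python, nanobind, libveqsim.so, VEDA, AVEO, ASL, libveqsim.vso. A legend distinguishes components developed in this work from existing software.]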
  10. Evaluation environment
    - Software: veqsim (https://github.com/keichi/veqsim)
    - Hardware
      - VE: NEC VE Type 20B/30A (Tohoku University AOBA)
      - GPU: NVIDIA A100 40GB/80GB PCIe

                          VE Type 20B   VE Type 30A   A100 40GB     A100 80GB
    Compute performance   2.4 TFLOP/s   4.9 TFLOP/s   9.7 TFLOP/s   9.7 TFLOP/s
    Memory bandwidth      1.53 TB/s     2.45 TB/s     1.55 TB/s     1.93 TB/s
    Memory capacity       48 GB         96 GB         40 GB         80 GB
    LLC bandwidth         3.0 TB/s      6.4 TB/s      4.9 TB/s      4.9 TB/s
    LLC capacity          16 MB         64 MB         40 MB         40 MB
  11. Performance of a single-qubit gate (RX gate)
    - Measured the runtime of applying an RX gate to [?] state vectors.
      - VE performance is on par with the A100 [?]GB.
      - On the VE, a batch size of at least [?] is required (because the vector length is [?]).

    [Figure: four panels (8, 12, 16 and 20 qubits) plotting runtime [ms or s] against batch size (1×10^2 to 1×10^5) for A100 80GB, A100 40GB, VE Type 30A and VE Type 20B; lower is better. Annotations: "insufficient loop length", "cache effect".]
  12. Performance of a two-qubit gate (CNOT gate)
    - Measured the runtime of applying a CNOT gate to [?] state vectors.
      - The trend resembles that of the RX gate, but the A100 [?]GB is about [?] faster than the VE30A.
      - The VE's effective bandwidth is higher than that of the A100 [?]GB, so there is room for optimization.

    [Figure: four panels (8, 12, 16 and 20 qubits) plotting runtime [ms or s] against batch size (1×10^2 to 1×10^5) for A100 80GB, A100 40GB, VE Type 30A and VE Type 20B; lower is better.]
  13. Performance of the depolarizing noise gate
    - Apply a [?]-qubit depolarizing noise gate to the state vectors.
      - cuStateVec has no way to apply a gate to only some of the state vectors, so an identity gate is applied to the state vectors in which no noise occurs.
      - The VE does better the lower the noise rate and the larger the batch size.

    [Figure: three panels comparing A100 80GB and VE Type 30A, runtime [s], lower is better. Varying batch size (14 qubits, 10^-3 error rate); varying noise rate (14 qubits, 10^5 batch size); varying number of qubits (10^5 batch size, 10^-3 error rate). Annotations: "longer loop length", "less work".]
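    One way to see why a low noise rate means less work is to look at how noise events are distributed over a batch. The sketch below is an assumption about how such sampling could be done, not the talk's implementation: each state vector independently receives X, Y or Z with total probability `noise_rate` and is otherwise left untouched. Libraries differ in how they parameterize the depolarizing channel, and the function names are hypothetical.

```python
import random


def sample_depolarizing(batch_size, noise_rate, rng):
    """Pick 'I', 'X', 'Y' or 'Z' independently for each state vector in the batch."""
    paulis = []
    for _ in range(batch_size):
        if rng.random() < noise_rate:
            paulis.append(rng.choice("XYZ"))   # noise event: uniformly random Pauli
        else:
            paulis.append("I")                 # no noise for this state vector
    return paulis


def group_by_pauli(paulis):
    """Group sample indices by the Pauli they received."""
    groups = {"I": [], "X": [], "Y": [], "Z": []}
    for sample, pauli in enumerate(paulis):
        groups[pauli].append(sample)
    return groups


if __name__ == "__main__":
    rng = random.Random(0)
    groups = group_by_pauli(sample_depolarizing(100_000, 1e-3, rng))
    touched = sum(len(v) for k, v in groups.items() if k != "I")
    print(f"{touched} of 100000 state vectors need a gate applied")
```

    A backend that can update only the samples in the X, Y and Z groups skips almost the entire batch at low noise rates, whereas a backend that must apply an identity gate to the rest still touches every state vector.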
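  14. Quantum support vector machine
    A kernel SVM is an SVM with a nonlinear transformation added:

    $$\max_{c_1,\ldots,c_n}\ \sum_{i=1}^{n} c_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} y_i c_i\, k(x_i, x_j)\, y_j c_j \quad \text{s.t.}\quad \sum_{i=1}^{n} c_i y_i = 0 \ \text{ and }\ 0 \le c_i \le \frac{1}{2n\lambda}\ \ \forall i$$

    Classically, the kernel is a polynomial kernel or an RBF kernel:

    $$k(x_i, x_j) = (x_i^{T} x_j)^{d}, \qquad k(x_i, x_j) = \exp\!\left(-\gamma \lVert x_i - x_j \rVert^{2}\right)$$

    A quantum SVM converts the features into quantum states and uses the inner product of the quantum states as the kernel:

    $$k(x_i, x_j) = \lVert \langle \Phi(x_i) | \Phi(x_j) \rangle \rVert^{2}$$

    where $\Phi(x_i)$ is the map that embeds the features into a quantum state. Because quantum states are highly expressive, better classification performance than a conventional SVM can be expected.

    Havlíček, V., Córcoles, A.D., Temme, K. et al. Supervised learning with quantum-enhanced feature spaces. Nature 567, 209–212 (2019).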
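    The three kernels can be written side by side in a few lines. This is a minimal pure-Python sketch with hypothetical function names; the quantum kernel takes two already-prepared, normalized state vectors and does not construct the feature map itself.

```python
import math


def polynomial_kernel(x, y, d):
    """k(x, y) = (x^T y)^d"""
    return sum(a * b for a, b in zip(x, y)) ** d


def rbf_kernel(x, y, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def fidelity_kernel(phi_x, phi_y):
    """Squared magnitude of the inner product <phi_x|phi_y> of two state vectors."""
    overlap = sum(complex(a).conjugate() * b for a, b in zip(phi_x, phi_y))
    return abs(overlap) ** 2


if __name__ == "__main__":
    print(polynomial_kernel([1, 2], [3, 4], 2))
    print(rbf_kernel([0.0, 0.0], [1.0, 0.0], 0.5))
    plus = [math.sqrt(0.5), math.sqrt(0.5)]
    print(fidelity_kernel([1, 0], plus))
```

    In the quantum case the state vectors come from simulating (or running) one circuit per pair of samples, which is where the simulation cost arises.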
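  15. Quantum kernel
    [Figure: kernel-estimation circuit on four qubits initialized to |0⟩, built from layers of H gates, U_Φ(x) and U†_Φ(y). An inset shows part of the feature map with the gates P(2x_1), X, P(2(π − x_1)(π − x_2)), X. Annotations: "prepares the state ⟨Φ(x)|" and "prepares the state ⟨Φ(x)|Φ(y)⟩ …".]

    - The probability of observing |0…0⟩ is ∥⟨Φ(x)|Φ(y)⟩∥^2.
    - The kernel must be computed for every pair of samples, so on the order of n^2 quantum circuits must be simulated.

    Havlíček, V., Córcoles, A.D., Temme, K. et al. Supervised learning with quantum-enhanced feature spaces. Nature 567, 209–212 (2019).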
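    The quadratic growth in the number of kernel evaluations can be seen in a small Gram-matrix sketch. The feature state used here is a deliberately simplified assumption: two qubits, a single layer of diagonal phases applied to the uniform superposition, with a cross term (π − x1)(π − x2) loosely modeled on the gate parameters visible in the figure. It is not the circuit from the talk or from the cited paper, and the function names are hypothetical.

```python
import cmath
import math


def feature_state(x):
    """Simplified 2-qubit feature state: diagonal phases on H|0> (x) H|0>."""
    x1, x2 = x
    amplitudes = []
    for basis in range(4):
        z1 = 1 - 2 * (basis & 1)            # Z eigenvalue of qubit 0 (+1 or -1)
        z2 = 1 - 2 * ((basis >> 1) & 1)     # Z eigenvalue of qubit 1 (+1 or -1)
        phase = x1 * z1 + x2 * z2 + (math.pi - x1) * (math.pi - x2) * z1 * z2
        amplitudes.append(0.5 * cmath.exp(1j * phase))
    return amplitudes


def kernel_matrix(samples):
    """Gram matrix of squared overlaps; the symmetric half is filled by mirroring."""
    states = [feature_state(x) for x in samples]
    n = len(states)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            overlap = sum(a.conjugate() * b for a, b in zip(states[i], states[j]))
            gram[i][j] = gram[j][i] = abs(overlap) ** 2
    return gram


if __name__ == "__main__":
    for row in kernel_matrix([(0.1, 0.2), (1.0, 2.0), (2.5, 0.3)]):
        print(["%.3f" % v for v in row])
```

    Here the overlaps are computed directly from stored state vectors. When each entry instead requires its own circuit simulation, the n(n+1)/2 independent entries become independent circuits, which makes the workload a natural fit for batched execution.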
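  16. Quantum kernel performance
    - veqsim is faster than cuStateVec and Qulacs at every qubit count.
    - Sequential execution with Qulacs is especially inefficient when the number of qubits is small.
    - As the number of qubits grows, sequential execution becomes more efficient and the performance gap to batched execution narrows.
    - veqsim is [?] to [?] times faster than cuStateVec → the cause is under analysis.

    [Figure: runtime [s] (log scale, 0.01 to 1000) vs. number of qubits (4 to 20) for veqsim (VE30A), veqsim (A100) and Qulacs (A100). Samples: [?]; batch size: [?].]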