Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Realization and Evaluation of Measurement Feedback Coherent Ising Machines for Combinatorial Optimization Problems

Realization and Evaluation of Measurement Feedback Coherent Ising Machines for Combinatorial Optimization Problems

現代の科学技術において、数理的最適化が様々な分野を支える基盤となっており、近年ハードウェア を用いた計算の高速化も重要性を増している。最適化のためのアクセラレータとして、GPU、FPGAや ASIC等に加えて、最近、組合せ最適化問題などへの応用を目的として、イジング問題のための専用マシ ンの研究・開発が活発に行われている。本論文は、イジング問題に対するアクセラレータとして光パラメ トリック発振器(optical parametric oscillator; OPO)とFPGAを併用した光電子計算アーキテクチャを提案し、数理モデリングによるシミュレーション、実験結果との比較による数理モデルの検証、さらには組 合せ最適化問題に関して既存アルゴリズムとの比較による性能評価を報告している。
本論文は「Realization and Evaluation of Measurement Feedback Coherent Ising Machines for Combinatorial Optimization Problems」(組合せ最適化問題のための測定フィードバック型コヒーレン ト・イジングマシンの実現と評価)と題し、8章からなる。
第1章「Introduction」(序論)では、現在の計算機アーキテクチャと計算複雑性理論の概要を述べ、本研究の位置づけを示している。まずプロセッサ性能の変化と関連する技術についてまとめている。次に、 NP困難である組合せ最適化問題と計算複雑性理論についての導入を行い、NP困難問題に対するアクセラレータの意義を述べている。さらに、物理系を用いた新たな計算手法を紹介し、イジング問題に特化した専用マシンの関連研究についてまとめている。特に、その一つとして光学系を用いたコヒーレント・イ ジングマシン(coherent Ising machine; CIM)に関連した先行研究を紹介し、その課題とともに本論文の 意義を述べている。
第2章「Maximum Cut Problem」(最大カット問題)では、本論文の主な対象問題であるグラフの最大 カット問題(maximum cut problem; MAX-CUT)を定義し、近似アルゴリズムとヒューリスティクスを概 観している。まず基礎的なグラフの定義からはじめ、次にMAX-CUTの定義とこれに対する既存手法につ いて簡潔にまとめている。具体的には、半正定値計画(semidefinite programming; SDP)緩和、焼きなまし法(simulated annealing; SA)、ニューラルネットワークを用いた手法などのアルゴリズムを解説して いる。また、効率的に計算可能なクラスとMAX-CUT問題の近似可能性についても言及している。
第3章「Degenerate OPO」(縮退OPO)では本論文で用いられる縮退OPO(DOPO)と定式化について説 明している。系のハミルトニアン、マスター方程式を経てFokker-Planck方程式により表現される位相空間における主要な定式化を説明した後、DOPOのTruncated Wigner表示に基づく確率微分方程式の定式化を導入している。また、DOPOの分岐など基本的な性質についても述べている。
第4章「Coherent Ising Machine」(コヒーレント・イジングマシン)では測定フィードバックCIMの数 理モデルを提示している。先行研究での欠点であったスケーラビリティの問題を解消する具体的な実装 方法が提案され、確率微分方程式を用いた数理モデルにより定式化されている。また、測定フィードバックに用いられるFPGAの設計も述べている。
第5章「Numerical Simulation」(数値シミュレーション)では前章のモデルに基づく数値計算により実機の計算性能・計算時間が予測されている。数値計算手法とパラメータについて述べた後、具体的な問題 を与えた際のDOPO時間発展を例示している。ランダムなベンチマークグラフセットを用いて、頂点数 800≦n≦20000のスパースなグラフで既存手法と比較してほぼ同等な解が得られることを示している。ま た、CPUの単一コア・単一スレッド上で動作する半正定値計画緩和や焼きなまし法と比較し、グラフが 十分密である時の提案システムの有効性が予測されている。
第6章「Experimental Results and Validation」(実験結果と検証)では頂点数n≦100の問題が実装可能 な実機に関して、数値シミュレーションとの比較において実機の計算性能が検証されている。単一試行 の時間発展はシミュレーション結果とよく一致する。実験結果より、最適解付近にピークを持つ分布が 得られることが示されている。また、立方グラフでのスケーリング評価、厳密解正答率の分布、パラメータの一致が確認されている。
第7章「Performance Evaluation with Larger Problems」(大規模問題での性能評価)では前章より大 規模な問題(n=2000)において、近似アルゴリズム及びヒューリスティクスとの比較を中心として解析さ れている。CPUの単一コア・単一スレッド上で実装されたSAと比較し、n = 2000 の場合に平均ケース で12倍高速であるという結果が得られている。
最後に第8章「Conclusion」(結論)では、本論文の成果を簡潔にまとめると共に、関連した実装の可能性、実用上の課題、物理的性質について議論し、今後の研究課題を提示している。

5c772b62f1974e9da3a88fbb4ef02696?s=128

Yoshitaka Haribara

January 23, 2018
Tweet

Transcript

  1. Realization and Evaluation of Measurement Feedback Coherent Ising Machines for

    Combinatorial Optimization Problems ૊߹ͤ࠷దԽ໰୊ͷͨΊͷ ଌఆϑΟʔυόοΫܕίώʔϨϯτɾΠδϯάϚγϯ ͷ࣮ݱͱධՁ Yoshitaka Haribara Advisor: Prof. Kazuyuki Aihara January 23, 2018 Department of Mathematical Informatics University of Tokyo
  2. Thesis Structure 1.Introduction 2.Maximum Cut Problem 3.Degenerate OPO 4.Coherent Ising

    Machine 5.Numerical simulation 6.Experiment 1: Results and Validation 7.Experiment 2: Performance Evaluation 8.Conclusion 2 [1,2,5] Ref. [3,4,6]
  3. 1. Introduction

  4. Processor Trends 4 • Uniprocessor performance declined around 2003 (from

    52% to 22% / year) • GPU (cf. NVIDIA Tesla V100: 5120 core) • deep learning [1] / scientific simulation (molecular dynamics [2], visualizing CT scan [3], astrophysics many body system [4]) • Field programmable gate array (FPGA; programmable hardware) • Image processing (medical[5], broadcast[6]) / Datacenter (search[7]) [Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp]                                                                                                                                                                                                                                                                                                  1980 1990 2000 2010 1 10 100 1000 104 105 Year #Cores Single-thread performance (SPECint x 103) Freq.(MHz) Power (W) [1] A. Krizhevsky, et al., NIPS 1097 (2012).; Y. LeCun, et al., Nature 521, 436 (2015). [2] A. Götz, et al., Journal of chemical theory and computation 8, 1542 (2012). [3] O. Kutter, et al., Computer Methods and Programs in Biomedicine 94, 250 (2009). [4] L. Nyland, et al., GPU gems 3, 677 (2007). [5] https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-medical.pdf [6] https://www.altera.com/en_US/pdfs/literature/wp/wp-brdcst0306.pdf [7] A. Putnam, et al. Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on. IEEE (2014).
  5. Google+UCSB [http://web.physics.ucsb.edu/~martinisgroup/ photos/Linear9XmonSurfaceCode.jpeg] Traffic flow in Beijing (from Ref. [6])

    TPU (Google) [https://cloud.google.com/blog/big-data/ 2017/05/an-in-depth-look-at-googles- first-tensor-processing-unit-tpu] Problem Specific Processors • Application specific IC (ASIC) • TPU (1st gen. for prediction) in neural network [1,2] • 8 bit quantization to reduce the number of multiplier • Matrix (systolic array) and nonlinear activation unit • (Universal) Quantum Computer (n ≧ 49 [3]) • Quantum simulation for Fermionic structures in chemistry and material science [4] • Quantum Annealer (Ising problem) • Machine learning to find Higgs Boson [5] • Traffic flow optimization [6] 5 [1] https://cloud.google.com/blog/big-data/2017/05/an-in- depth-look-at-googles-first-tensor-processing-unit-tpu [2] D. Silver, et al., Nature 550, 354 (2017). [3] J. Kerry, et al., Nature 519, 66 (2015) for n = 9. [4] J.R. McClean, et al., arXiv:1710.07629 [quant-ph] (2017). [5] A Mott, et al., Nature 550, 375 (2017). [6] F. Neukart, et al., arXiv:1708.01625 [quant-ph] (2017). D-Wave [https://www.dwavesys.com/ resources/media-resources]
  6. Complexity Classes NP P = NP = NP-complete P NP-complete

    NP-hard NP-hard P ≠ NP P = NP Computational Complexity • Ising problem (NP-hard) • minimize: • Quantum Annealers [1] • D-Wave (n = 2000, chimera graph) [2] / NEDO • Ising Processor • Hitachi (n = 20,000 CMOS and FPGA) [3] / Fujitsu (FPGA) [4] Ising machines 6 1k-spin sub-array 780 × 380 μm2 4 mm 3 mm 1k spin circuits SRAM I/F Fig. 5—Ising Chip Photograph. The chip has 20,000 spin circuits in a 3-mm × 4-mm = 12-mm2 area. I/F: interface Hitachi [http://www.hitachi.com/rev/ pdf/2016/r2016_06_110.pdf] H = X i6=j Jij i j, i 2 { 1, 1} (i = 1, . . . , n) HQA = X i6=j Jij z i z j (t) n 1 X i=0 x i Ising Transverse field [1] T. Kadowaki and H. Nishimori, Phys. Rev. E 58, 5355 (1998). [2] https://www.dwavesys.com/sites/default/files/D- Wave%202000Q%20Tech%20Collateral_0117F2.pdf [3] M. Yamaoka, et al., IEEE Journal of Solid-State Circuits 51, 303; C. Yoshimura, et al., IEEE CANDAR 436. [4] S. Matsubara, et al., In Conference on Complex, Intelligent, and Software Intensive Systems, 432. K4,4 in chimera [R. M. Karp, Complexity of Computer Computations, (Plenum, New York, 1972), pp. 85–103]
  7. N = 4, 16 CIM [A. Marandi, Z. Wang, K.

    Takata, R. L. Byer, and Y. Yamamoto, Nat. Photonics 8, 937 (2014)] [K. Takata, A. Marandi, R. Hamerly, Y. Haribara, et al., Scientific Reports 6, 34089 (2016)] • Free space coherent Ising machines of N = 4 [A. Marandi, et al.] and N = 16 [K. Takata, et al.]. N=4;*Experiment* •  Time*Division*MulJplexing* –  Pump:*250*MHz,*Yb&Fiber*Laser* –  OPO:*4.8&m*Cavity*(TR =4*ns,*Tcavity =*16*ns)* –  Couplings:*3*Locked*Delay*Lines*with*π*Phase*Shin* Clock* fs*Laser* @*1045*nm* M2* M1* M4* M3* PPLN* 40OPOs ) CP* BS* Filter* PD* OC* Delay*1* OC* IC* OC* IC* Delay*2* OC* IC* Delay*3* OPO1* OPO2* OPO3* OPO4* TR * Tcavity =4TR* Delay*2*=*2TR* Delay*1*=*TR * OPO4* OPO3* OPO2* OPO1* OPO4* OPO OPO OPO1* OPO1* π* MAX&CUT*Problem*on*OPO N=4;*Experiment* •  Time*Division*MulJplexing* –  Pump:*250*MHz,*Yb&Fiber*Laser* –  OPO:*4.8&m*Cavity*(TR =4*ns,*Tcavity =*16*ns)* –  Couplings:*3*Locked*Delay*Lines*with*π*Phase*Shin* •  OC*and*IC*~*4%* •  N=4*MAX&CUT* –  Expected*Result:*2*OPOs*in*|0〉,*2*OPOs*in*|π〉 •  Measurement:*Unequal&Arm*Interferometer* * Clock* fs*Laser* @*1045*nm* M2* M1* M4* M3* PPLN* 40OPOs ) CP* BS* Filter* PD* OC* Delay*1* OC* IC* OC* IC* Delay*2* OC* IC* Delay*3* OPO1* OPO2* OPO3* OPO4* OPO1* OPO2* TR * Tcavity =4TR* Jme* ...* Delay*3*=*3TR ** Delay*2*=*2TR* Delay*1*=*TR * OPO4* OPO3* OPO2* OPO1* OPO4* OPO3* OPO2* OPO1* OPO4* OPO3* OPO2* OPO1* OPO1* OPO2* OPO3* OPO4* π* π* π* π* π* π* MAX&CUT*Problem*on*OPO*Network* November*2014* OPO*Ising*Machine* 8* 1TR 2TR 3TR periment* Delay*1* Delay*2* Delay*3* OPO1* OPO2* OPO3* OPO4* OPO1* OPO2* TR * Tcavity =4TR* Jme* ...* Pulse interval For N = 16, 1TR, 8TR, 15TR delay lines 㱺 Complete graph K4 7 Optical Delay Lines
  8. My Contribution • Proposed a scalable architecture of coherent Ising

    machine with DOPO and FPGA, which enabled to implement arbitrary topology of graphs. • Mathematical modeling with CSDE and numerical simulations with reasonable parameters which led to the experimental success • Benchmark on MAX-CUT instances up to n ≦ 2,000 (experiment) and n ≦ 20,000 (simulation) against existing algorithms
  9. 2. Maximum Cut Problems

  10. CUT = 4 Maximum Cut (MAX-CUT) • Given a graph

    G(V, E) with weight wij. • Find the largest cut in G. • Cut is a partition of V into 2 subsets. • #Cut = total weight of crossing edges. • The term (1 - σi σj)/2 only counts the crossing edge, where σ = {+1, -1}. • Equivalent to anti-ferromagnetic Ising problem (minimizing Σ wij σ i σ j ). +1 +1 -1 -1 10 [R. M. Karp, Complexity of Computer Computations, (Plenum, New York, 1972), pp. 85–103]
  11. Algorithms for MAX-CUT -0.87856: Semi-definite Programming (SDP) relaxation [Goemans and

    Williamson, 1995] -0.53112: Variant of Spectral partitioning (minimum eigenvalue) [Trevisan, 2008] -0.5 + ε (ε > 0): Random Walk [Kale, et al., 2010] -0.5: Greedy [Sahni and Gonzalez, 1976] Approx. ratio -Breakout Local Search [Benlic, et al., 2013] -Simulated Annealing [Kirkpatrick, et al., 1983] -SG3 (Variant of Greedy) [Kahruman, et al., 2007] UGC [2] 0.94117 PCP [1] NP-hard P Planar [Orlova and Dorfman, 1972; Hadlock, 1975], Weakly bipartite [Grötschel, et al., 1981], Positive weighted without long odd cycle [Grötschel, et al., 1984], Integer weight (≦ N) with fixed genus [Galluccio, et al., 2001] Graphs solvable in P Heuristics 11 [1] S. Arora and S. Safra., JACM 45, 70 (1998). [2] S. Khot, STOC 767 (2002).
  12. SDP relaxation • A continuous-valued relaxation problem of MAX-CUT: •

    The matrix becomes Positive Semi-Definite. • The optimal value of this reduction problem (class P) is called SDP Upper Bound. • A solution for the original MAX-CUT is obtained by random projections. Expectation values > 0.87856 of OPT (MAX-CUT). vi vj [M. X. Goemans and D. P. Williamson, J. ACM 42, 1115 (1995)] 12
  13. Simulated Annealing (SA) • Meta-heuristic : gradually decreasing the “temperature”,

    i.e., noise, then converge to the ground state in infinite time [Geman and Geman, IEEE PAMI 6, 721 (1984)] • Performance is heavily dependent on : temperature scheduling (log), data structure (vector or array), multi-spin coding (below) 0 1 0 0 0 1 … 1 1 0 1 0 1 0 1 0 1 0 … 1 0 0 1 1 ↓ bitwise XOR 1 1 1 0 1 1 … 0 1 0 0 1 64-bit multi-spin coding spin σj weight wij (wij=±1 complete graph) → bit count (SIMD instruction) where wij ^ σj (Courtesy of S. Tamate & T. Sonobe @ NII) 13 [Kirkpatrick, et al., Science 220, 671 (1982)] Pflip = ⇢ 1 ( E  0) exp( E) ( E > 0)
  14. BLS: Breakout Local Search • Combination of local search &

    perturbation • 3 types of perturbation: uni-directional, bi-directional, random • Constant time spin update with bucket data structure 1. Local search (steepest descent) 2. Perturbation (forced spin flipping) P 1-P Q 1 - Q → ⁶ random Perturbation 14
  15. Backet Data V1 V2 V3 V4 V5 V6 Cut 1

    -1 1 -1 1 1 -1 1 -1 ɾ ɾ ɾ ɾ ɾ ɾ 1 2 3 4 5 6 gmax : 3 2 1 0 -1 -2 -3 : gmin gmax : 3 2 1 0 -1 -2 -3 : gmin S1 S2 S1 S2 • Gain list for each side of the partition and vertex address list enable constant O(1) time spin configuration update. 6 4 5 2 1 3 maxgain maxgain 15
  16. Greedy Algorithm and its Variant • Input : A graph

    with edge weights wij. • output : A cut {S1, S2} and the cut value cut(S1, S2). 1. ∀i ∈ V \{S1 ∪ S2 } , score(i ) = max{ w(i, S1 ), w(i, S2 )} SG3: score(i ) = |w(i, S1 ) - w(i, S2 )| 2. cut += score(imax) x y S1 S2 ∀i ∈ V’ x y S1 S2 ∀i ∈ V’ imax “best in” [S. Kahruman, et al., Int. J. Comp. Sci. and Eng. 3, 211 (2005)]
  17. Neural Network based Optimization Algorithms • Hopfield Network (HN) [J.

    J. Hopfield, PNAS 79, 2554 (1982)] • Update rule : majority vote: • Derandomized SA • Hopfield-Tank NN (HTNN) [Hopfield and Tank, Science 233, 625 (1986)] • Parallel implementation with a many-core processor sigmoid sgn 0 @ X j Jijxj bj 1 A
  18. QUBO-MILP • MAX-CUT → UBQP/QUBO → IBM CPLEX (MILP). Mixed-

    integer programming with a linear objective function. • zij = xi xj when xi ,xj ∈ {0, 1}, and define the convex hull of {(xi , xj , xixj ) | xi ,xj ∈ {0, 1} }. (Rooted Correlation Semimetric Polytope) • LP-based branch&cut minimize : X i<j Qijzij + n X i=1 Qiixi subject to : zij  xi, zij  xj, xi + xj zij  1, (8i < j) zij 0, x 2 {0, 1}n
  19. 3. Degenerate OPO

  20. Optical Parametric Oscillator (OPO) • In a degenerate optical parametric

    oscillator (DOPO), a nonlinear crystal is placed in a cavity (mirrors). • Frequency down conversion via molecular vibration occurs. The input with frequency ωp (pump) is converted to ωs (signal). • Degeneracy: • OPO can be used as a physical random number generator with quantum noise [A. Marandi, et al., Optics express 20, 19322 (2012)] [C. Nabors, et al., JOSA B 7, 815 (1990)] frequency phase Self-phase-locking 20 ( ) Pump Signal Nonlinear crystal Pump laser
  21. Formulation of a DOPO • System Hamiltonian H ↓ Liouville

    von-Neumann Equation • Quantum Master Equation for density matrix ρ ↓ Phase Space Expansion • Fokker-Planck Equation for Wigner quasi-distribution W(α) ↓ Ito Formula • Langevin Equation: phase space variable α 21 i~ @ˆ ⇢ @t = [ ˆ H, ˆ ⇢] [H. J. Carmichael, “Statistical Methods in Quantum Optics 2”, (Springer, 2009)] ˆ ⇢ = Z d e ⇤ˆ a ˆ a† ⇢Z e ˆ a⇤ ⇤ˆ aW(↵) d↵
  22. Pump Field System Hamiltonian Signal Field Parametric Coupling External Pumping

    Irreversible Interaction with the Reservoir where H = Hfree + Hinteraction + Hpump + Hsys-res Hfree = ~!sˆ a† s ˆ as + ~!pˆ a† p ˆ ap Hinteraction = i~  2 (ˆ a†2 s ˆ ap ˆ a† p ˆ a2 s ) Hpump = i~("ˆ a† p "⇤ˆ ap) Hsys-res = ~(ˆ as ˆ† Rs + ˆ Rsˆ a† s + ˆ ap ˆ† Rp + ˆ Rpˆ a† p ) ( ) Pump Signal 22 : photon creation operator : photon annihilation operator ˆ a ˆ a† [H. J. Carmichael, “Statistical Methods in Quantum Optics 2”, (Springer, 2009)]
  23. • Using the Liouville von-Neumann Equation • Master equation for

    the density matrix ρ • Born and Markov approximation, Adiabatic elimination • No analytic solution in the κ ≠ 0 case Master Equation i~ @ˆ ⇢ @t = [ ˆ H, ˆ ⇢] 23 @ˆ ⇢ @t = i!s[ˆ a† s ˆ as, ˆ ⇢] + i  2 [ˆ a†2 s ˆ ap ˆ a† p ˆ a2 s , ˆ ⇢] + s 2 (2ˆ as ˆ ⇢ˆ a† s ˆ a† s ˆ as ˆ ⇢ ˆ ⇢ˆ a† s ˆ as) + s¯ ns(ˆ as ˆ ⇢ˆ a† s + ˆ a† s ˆ ⇢ˆ as ˆ a† s ˆ as ˆ ⇢ ˆ ⇢ˆ asˆ a† s ). [H. J. Carmichael, “Statistical Methods in Quantum Optics 2”, (Springer, 2009)]
  24. Fokker-Planck Equation • Expanded with Wigner quasi-distribution W(α) • Fokker-Planck

    Eq. for a single DOPO in truncated Wigner representation • Note: • Truncation (dropping the cubic derivatives) is valid in the small quantum noise limit • Wigner representation can describe the nonclassical field @W @t = ⇣ 2 + i! ⌘ @ @↵ ↵ + ⇣ 2 i! ⌘ @ @↵⇤ ↵⇤ + (¯ n + 1 2 ) @2 @↵@↵⇤ W 24 [H. J. Carmichael, “Statistical Methods in Quantum Optics 2”, (Springer, 2009)] ˆ ⇢ = Z d e ⇤ˆ a ˆ a† ⇢Z e ˆ a⇤ ⇤ˆ aW(↵) d↵
  25. • Complex amplitude (c + is) ∈ is described as

    the SDE: Normalized pump rate p (pth = 1), Steady state amplitude As Langevin Equation Squeezed vacuum Phase Sensitive Amplification Squeezed states Pitchfork bifurcation Independent Gaussian noise (μ=0, σ=1) loss saturated gain [P. Kinsler and P. D. Drummond, Phys. Rev. A. 43, 6194 (1991)] 25
  26. Ising problem Coherent Ising Machine (CIM) Objective function Whole loss

    of DOPO network Spin variable DOPO phase Adjacency matrix Coupling constant DOPO network • Feedback (injection) term: • Effective potential : detected amplitude : coupling constant [Z. Wang, A. Marandi, K. Wen, R. L. Byer, and Y. Yamamoto, Phys. Rev. A 88, 063853 (2013)] -1.0 -0.5 0.5 1.0 -0.004 -0.002 0.002 26
  27. 4. Coherent Ising Machine [Y. Haribara, S. Utsunomiya, and Y.

    Yamamoto, Entropy 18, 151 (2016)]
  28. N = 4, 16 CIM [A. Marandi, Z. Wang, K.

    Takata, R. L. Byer, and Y. Yamamoto, Nat. Photonics 8, 937 (2014)] [K. Takata, A. Marandi, R. Hamerly, Y. Haribara, et al., Scientific Reports 6, 34089 (2016)] • Free space coherent Ising machines of N = 4 [A. Marandi, et al.] and N = 16 [K. Takata, et al.]. N=4;*Experiment* •  Time*Division*MulJplexing* –  Pump:*250*MHz,*Yb&Fiber*Laser* –  OPO:*4.8&m*Cavity*(TR =4*ns,*Tcavity =*16*ns)* –  Couplings:*3*Locked*Delay*Lines*with*π*Phase*Shin* Clock* fs*Laser* @*1045*nm* M2* M1* M4* M3* PPLN* 40OPOs ) CP* BS* Filter* PD* OC* Delay*1* OC* IC* OC* IC* Delay*2* OC* IC* Delay*3* OPO1* OPO2* OPO3* OPO4* TR * Tcavity =4TR* Delay*2*=*2TR* Delay*1*=*TR * OPO4* OPO3* OPO2* OPO1* OPO4* OPO OPO OPO1* OPO1* π* MAX&CUT*Problem*on*OPO N=4;*Experiment* •  Time*Division*MulJplexing* –  Pump:*250*MHz,*Yb&Fiber*Laser* –  OPO:*4.8&m*Cavity*(TR =4*ns,*Tcavity =*16*ns)* –  Couplings:*3*Locked*Delay*Lines*with*π*Phase*Shin* •  OC*and*IC*~*4%* •  N=4*MAX&CUT* –  Expected*Result:*2*OPOs*in*|0〉,*2*OPOs*in*|π〉 •  Measurement:*Unequal&Arm*Interferometer* * Clock* fs*Laser* @*1045*nm* M2* M1* M4* M3* PPLN* 40OPOs ) CP* BS* Filter* PD* OC* Delay*1* OC* IC* OC* IC* Delay*2* OC* IC* Delay*3* OPO1* OPO2* OPO3* OPO4* OPO1* OPO2* TR * Tcavity =4TR* Jme* ...* Delay*3*=*3TR ** Delay*2*=*2TR* Delay*1*=*TR * OPO4* OPO3* OPO2* OPO1* OPO4* OPO3* OPO2* OPO1* OPO4* OPO3* OPO2* OPO1* OPO1* OPO2* OPO3* OPO4* π* π* π* π* π* π* MAX&CUT*Problem*on*OPO*Network* November*2014* OPO*Ising*Machine* 8* 1TR 2TR 3TR periment* Delay*1* Delay*2* Delay*3* OPO1* OPO2* OPO3* OPO4* OPO1* OPO2* TR * Tcavity =4TR* Jme* ...* Pulse interval For N = 16, 1TR, 8TR, 15TR delay lines 㱺 Complete graph K4 28 Optical Delay Lines
  29. Measurement Feedback CIM (proposed) • To stabilize the O(N) delay

    lines is experimentally hard for large N • Multiple DOPO pulses in a fiber is coupled by a FPGA circuit • Stochastic differential equation for each DOPO signal amplitude: SHG Pump pulse SHG pulse Signal DOPO pulses #2 #1 Beam splitter #1 PPLN waveguide OPA LO pulse Homodyne detection IM/PM FPGAs DAC ADC Feedback pulse fi Beam splitter #2 Fiber ring cavity  discretize in time 29 gain I/O (without noise) Time evolution g(c) = c p 1 + c2 c(t) = s 1 1 + exp( t) [Y. Haribara, S. Utsunomiya, and Y. Yamamoto, Entropy 18, 151 (2016)] dci = [( 1 + p c2 i s2 i )ci + N X j=1 ⇠ij ˜ cj] dt + 1 As r c2 i + s2 i + 1 2 dWci
  30. 5. Numerical Simulations [Y. Haribara, S. Utsunomiya, K. Kawarabayashi, Y.

    Yamamoto, Encyclopedia of Spectroscopy and Spectrometry 3, 824 (2017)]
  31. Working Equations ˜ ci = cout i p Tmes =

    ci r 1 Tmes Tmes fi As Measured Value SHG Pump pulse SHG pulse Signal DOPO pulses #2 #1 Beam splitter #1 PPLN waveguide OPA LO pulse Homodyne detection IM/PM FPGAs DAC ADC Feedback pulse fi Beam splitter #2 Fiber ring cavity  0 20 40 60 80 100 1.0 0.5 0.0 0.5 1.0 Normalized time Number of round trips DOPO signal amplitude ci 33 34 35 36 0.80 0.85 0.90 0.95 1.00 1.05 1.10 A C B A B C [Y. Haribara, S. Utsunomiya, and Y. Yamamoto, Entropy 18, 151 (2016)] [Y. Haribara, S. Utsunomiya, K. Kawarabayashi, Y. Yamamoto, Encyclopedia of Spectroscopy and Spectrometry 3, 824 (2017)] A B C dci = [(p c2 i s2 i )ci] dt + 1 As r c2 i + s2 i + 1 2 dWci dsi = [( p c2 i s2 i )si] dt + 1 As r c2 i + s2 i + 1 2 dWsi ci(t + t) 7! p 1 Tmesci(t) + p Tmes fi As si(t + t) 7! p 1 Tmessi(t) + p Tmes fi As ci(t + t) 7! p 1 Tinjci(t) + p Tinj⇠ n X j=1 Jij ˜ cj si(t + t) 7! p 1 Tinjsi(t)
  32. Parameter sweep (g14) Mean Best 0.2 0.4 0.6 0.8 1.0

    -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 Pump rate p Coupling constant  2850 2870 2890 2910 0.2 0.4 0.6 0.8 1.0 -1.2 -1.0 -0.8 -0.6 -0.4 -0.2 0.0 Pump rate p Coupling constant  2890 2910 2930 2950 100 trials for each parameter
  33. Parameter sweep (K100) Mean Best 100 trials for each parameter

  34. Simulation results g22, g39 in G-set • g22, |V| =

    2000, |E| = 19990 (density 1%), random graph • Gradually pumped from p = 0.9 to p = 1.3 linearly. • Coupling constant ξ = -0.02/√<k> = 0.004 • g39, |V| = 2000, |E| = 11778 (density 0.59%), scale-free with ±1 weight • Gradual pump (linear: from p = 0.9 to 1.3) • Coupling constant ξ = -0.05/√<k> = 0.0042 CUT Ratio SDP UB 14136 — BestKnown 13359 0.945 SA 13351 0.944 Best 13081 0.925 Mean 12997.99 0.919 GW-SDP 12993 0.919 CUT Ratio SDP UB 2877 — BestKnown 2408 0.946 SA 2379 0.943 Best 2273 0.931 Mean 2228.69 0.926 GW-SDP 2200 0.923
  35. Results for G-set (sparse) • For the random graphs in

    G-set, the number of edges m are independent from n, which results in the constant scaling for SA as O(|E|) • GW (PD relative gap 10-3), best in 100 trials of SA and CIM with hysteretic optimization (estimated time) on Intel Xeon X5650 (2.67 GHz), 94 GB RAM. • MILP solved by CPLEX-MILP (mixed integer linear programming) [Ikuta, et al., 2015], BLS [Benlic, et al., 2013] CIM SA GW 800 2000 5000 7000 10000 0.001 0.1 10 1000 105 Problem size N Computation time s 35 Topology GW-SDP n roundings MILP CIM w/ hysteretic SA BLS Comp. time 2 s - 105 s 12 s - 105 s 0.05 s 0.05 s 1 s - 104 s Random 0.9273 0.9116 0.9448 0.9471 0.9491 Scale-free 0.9288 0.9269 0.9451 0.9510 0.9547 Toroidal 0.9375 0.9618 0.9511 0.9548 0.9616 All 0.9303 0.9349 0.9464 0.9502 0.9540 (Random graph)
  36. Computation time scaling on complete graphs • GW : interior

    point method for solving SDP → O(N 3.5) • SA : Constant Monte Carlo steps × avg. degree → O(N 2) • CIM : Optical time-scale of DOPO turn-on delay time ()                       CIM (simulation) SA GW SG3 BLS 40 160 800 4000 20000 10-4 0.01 1 100 104 106 Problem size N Computation time (s) () 200 m fiber 2 km fiber 20 km fiber 2 GHz clock 20 GHz clock 200 GHz clock 100 1000 104 105 106 107 10-5 10-4 0.001 0.010 0.100 1 Problem size N Computation time (s) Graph order N Graph order N Cavity length Pulse repetition Note for SA: N = 105 㱺 3 sec for 20 MCS (multi-core CPU)
  37. 6. Experimental Results and Validation [P. L. McMahon*, A. Marandi*,

    Y.haribara, et al., Science 354, 614 (2016)]
  38. Machine configurations 38 Stanford NTT N Number of pulses 100

    (+60 for stabilization) 2048 (+2834 for stabilization) Round trip time (Fiber length) 1.6 μs (320 m) 5 μs (1 km) Pulse repetition freq. (Pulse interval) 100 MHz (= 10 ns) 1 GHz (= 1 ns) FPGA Xilinx Virtex-6 (SX475T) 2x Xilinx Virtex-7 (VX690T) Connection arbitrary topology with Jij = {+1, 0, -1} Current operational scheme Gradual coupling Gradual pumping [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)] [T. Inagaki, Y. Haribara, et al., Science 354, 603 (2016)] SHG Pump pulse SHG pulse Signal DOPO pulses #2 #1 Beam splitter #1 PPLN waveguide OPA LO pulse Homodyne detection IM/PM FPGAs DAC ADC Feedback pulse fi Beam splitter #2 Fiber ring cavity  Chap. 6 Chap. 7
  39. Single Trial N = 16 Cubic graph (Möbius ladder) N

    = 100 Random graph (Erdös-Rényi) and in cases in which exact solutions are not easy to obtain, we can find good approximate solutions. The schematic of our experimental setup (Fig. 1) shows that our Ising machine is formed by the combination of time-division– multiplexed OPOs (18) in a single fiber-ring cavity, with measurement and feedback (injec- tion) stages that act to couple the pulses in the cavity such that the Ising Hamiltonian is real- ized. Details are provided in the supplementary materials (26). imental schematic of a measurement-feedback–based coherent Ising machine. n–multiplexed pulsed degenerate optical parametric oscillator is formed by a nonlinear dically poled lithium niobate (PPLN)] in a fiber ring cavity containing 160 pulses. A fraction e is measured and used to compute a feedback signal that effectively couples the ependent pulses in the cavity. IM, intensity modulator; PM, phase modulator; LO, local G, second-harmonic generation; FPGA, field-programmable gate array. A Roundtrip Number OPO Pulse In-Phase Amplitude (arb.) 0 50 100 150 -600 -400 -200 0 200 400 600 OPO 1 OPO 2 OPO 3 OPO 4 OPO 5 OPO 6 OPO 7 OPO 8 OPO 9 OPO 10 OPO 11 OPO 12 OPO 13 OPO 14 OPO 15 OPO 16 Computation Time (µs) Computation Time (µs) 0 50 100 150 200 Graph A er of Problem Instances 400 600 800 1000 All 4060 16-vertex cubic graphs Roundtrip Number Graph Cut Size 0 50 100 150 0 5 10 15 20 25 0 50 100 150 200 Ising Energy -20 -10 0 10 20 Graph A MAX CUT / Ground State Energy Graph A 100% 38% Graph C 58% Graph B Exact Solution Exact Solution Exact Solution on November 6, 2 http://science.sciencemag.org/ Downloaded from 0 50 100 150 200 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 Round trip number in-phase amplitude 0 50 100 150 200 0 5 10 15 20 Round trip number Cut ments Jij ) and the length-N vector h (with elements hi ). We have realized a system with a scalable architecture that uses measurement feedback in place of a network of optical delay lines [which were used in initial, low-connectivity, nonre- programmable demonstrations of the concept (18, 24, 25)]. Our 100-spin Ising machine allows connections between any spin and any other spin and is fully programmable. We show that measurement-feedback–based OPO Ising ma- chines can solve many different Ising problems, and in cases in which exact solutions are not easy to obtain, we can find good approximate solutions. The schematic of our experimental setup (Fig. 1) shows that our Ising machine is formed by the combination of time-division– multiplexed OPOs (18) in a single fiber-ring cavity, with measurement and feedback (injec- tion) stages that act to couple the pulses in the cavity such that the Ising Hamiltonian is real- ized. Details are provided in the supplementary materials (26). ack–based coherent Ising machine. etric oscillator is formed by a nonlinear cavity containing 160 pulses. A fraction ck signal that effectively couples the ulator; PM, phase modulator; LO, local ammable gate array. Roundtrip Number 50 100 150 OPO 1 OPO 2 OPO 3 OPO 4 OPO 5 OPO 6 OPO 7 OPO 8 OPO 9 OPO 10 OPO 11 OPO 12 OPO 13 OPO 14 OPO 15 OPO 16 mputation Time (µs) Computation Time (µs) 50 100 150 200 ph A Roundtrip Number Graph Cut Size 0 50 100 150 0 5 10 15 20 25 0 50 100 150 200 Ising Energy -20 -10 0 10 20 Graph A MAX CUT / Ground State Energy on November 6, 2016 http://science.sciencemag.org/ Downloaded from between qubits remains a major challenge (15), with important implications for the efficiency of AQC/QA systems (16). Networks of coupled optical parametric oscil- lators (OPOs) are an alternative physical system, with an unconventional operating mech- anism (17–20), for solving the Ising problem (21, 22) and by extension many other com- binatorial optimization problems (23). For- mally, the N-spin Ising problem is to find the configuration of spins si ∈ f−1; þ1g (i = 1, ..., N) that minimizes the energy function H ¼ − X 1≤i < j ≤N Jij si sj − X 1≤i ≤N hi si , where the par- ticular problem instance being solved is specified by the N × N matrix J (with ele- ments Jij ) and the length-N vector h (with elements hi ). We have realized a system with a scalable architecture that uses measurement feedback in place of a network of optical delay lines [which were used in initial, low-connectivity, nonre- programmable demonstrations of the concept (18, 24, 25)]. Our 100-spin Ising machine allows connections between any spin and any other spin and is fully programmable. We show that measurement-feedback–based OPO Ising ma- chines can solve many different Ising problems, and in cases in which exact solutions are not easy to obtain, we can find good approximate solutions. The schematic of our experimental setup (Fig. 1) shows that our Ising machine is formed by the combination of time-division– multiplexed OPOs (18) in a single fiber-ring cavity, with measurement and feedback (injec- tion) stages that act to couple the pulses in the cavity such that the Ising Hamiltonian is real- ized. Details are provided in the supplementary materials (26). Fig. 1. Experimental schematic of a measurement-feedback–based coherent Ising machine. A time-division–multiplexed pulsed degenerate optical parametric oscillator is formed by a nonlinear crystal [periodically poled lithium niobate (PPLN)] in a fiber ring cavity containing 160 pulses. A fraction of each pulse is measured and used to compute a feedback signal that effectively couples the otherwise-independent pulses in the cavity. IM, intensity modulator; PM, phase modulator; LO, local oscillator; SHG, second-harmonic generation; FPGA, field-programmable gate array. Graph A Roundtrip Number OPO Pulse In-Phase Amplitude (arb.) 0 50 100 150 -600 -400 -200 0 200 400 600 OPO 1 OPO 2 OPO 3 OPO 4 OPO 5 OPO 6 OPO 7 OPO 8 OPO 9 OPO 10 OPO 11 OPO 12 OPO 13 OPO 14 OPO 15 OPO 16 Computation Time (µs) Computation Time (µs) 0 50 100 150 200 Graph A em Instances 600 800 1000 All 4060 16-vertex cubic graphs Roundtrip Number Graph Cut Size 0 50 100 150 0 5 10 15 20 25 0 50 100 150 200 Ising Energy -20 -10 0 10 20 Graph A MAX CUT / Ground State Energy f Runs 60 80 100 100% State Energy 58% Exact Solution Exact Solution xact Solution RESEARCH | REPORTS on November 6, 2016 http://science.sciencemag.org/ Downloaded from the M energ Th (cubi Fig. 2 matr Fig. 4. Results with various- size and various-density random graphs. (A) Observed probability of obtaining a solution whose cut size is at least x% of the global optimum (maximum cut), as a function of graph size N, for random cubic graph instances. Error bars indicate 1 SD, which is dominated by the difference in difficulty between the various problem instances. (B) The runtime that would be required to obtain a solution of a particular accuracy with 99% probability. (C) The evolution of the in-phase components ci of the N = 100 OPO pulses as a function of the computation time, for a single run with the graph shown in the (D) inset. (D) The graph cut size achieved as a function of the computation time. (Inset) The graph being solved. (E) Observed success probability of obtaining a solution with a particular accuracy as a function of the density of edges in the graph. Experiments were performed on randomly generated N = 100-vertex graphs Runtime to obtain 99% Success Probability (s) 0 10-4 10-3 10-2 10-1 100 Rand Graph Size N=|V | Success Probability (%) 0 20 40 60 80 100 0 20 40 60 80 100 100% 99% 98% 97% 96% 95% 94% 93% 92% 91% 90% Solution accuracy Random cubic graphs Roundtrip Number OPO Pulse In-Phase Amplitude (arb.) 0 50 100 -600 -400 -200 0 200 400 600 0 50 100 150 Graph D: |V|=100, |E|=495 A B C Computation Time (µs) Computation Time (µs) Roundtrip Number 0 50 100 0 100 200 300 Ising Energy -200 -100 0 100 200 300 400 0 50 100 150 MAX CUT / Ground State Energy Graph Cut Size Graph D: |V |=100, |E|=495 D Ising Energy -160 -140 -120 Ising Energy -100 -80 -60 Ising Energy -20 0 20 Fig. 3. Results with various-size Möbius ladder graphs. (A) Observed probability of obtaining a ground state of the Möbius ladder graph in a single run, as a function of the size N of the graph. Multiple 100-run batches were performed for each graph size to obtain the standard deviations, which are shown as error bars. (B to D) Histograms of obtained solutions in 100 runs for the graphs shown in the insets. 0 50 100 150 200 -2 -1 0 1 2 Round trip number In-phase amplitude 0 50 100 150 200 220 240 260 280 300 320 340 Round trip number Cut 616 4 NOVEMBER 2016 • VOL 354 ISSUE 6312 solution whose cut size is at least x% of the global optimum (maximum cut), as a function of graph size N, for random cubic graph instances. Error bars indicate 1 SD, which is dominated by the difference in difficulty between the various problem instances. (B) The runtime that would be required to obtain a solution of a particular accuracy with 99% probability. (C) The evolution of the in-phase components ci of the N = 100 OPO pulses as a function of the computation time, for a single run with the graph shown in the (D) inset. (D) The graph cut size achieved as a function of the computation time. (Inset) The graph being solved. (E) Observed success probability of obtaining a solution with a particular accuracy as a function of the density of edges in the graph. Experiments were performed on randomly generated N = 100-vertex graphs with fixed numbers of edges. Error bars indicate 1 SD. Graph Size N=|V | Success Probability 0 20 40 60 80 0 20 40 60 80 Roundtrip Number OPO Pulse In-Phase Amplitude (arb.) 0 50 100 -600 -400 -200 0 200 400 600 0 50 100 150 Graph D: |V|=100, |E|=495 C Computation Time (µs) 1 2 3 Graph Cut Size D Computation Time (μs) Simulation Experiment Simulation Experiment 39 ground state [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)]
  40. All cubic graphs of N = 16 • MAX-CUT in

    cubic graph is NP-hard [E. Halperin, D. Livnat, and U. Zwick (SODA 2002)]. • 4060 instances for n = 16. 100 trials for each instance. 300 round trips for each trial. • Gradual coupling with dξ/dt = -0.001 and pump rate p = 0.88. • Pearson correlation coefficient ρ = 0.80, (cf. simulations between different parameters ρ ~ 0.9) 40 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Simulation Experiment 0 20 40 60 80 100 Simulation Experiment 0.0 0.2 0.4 0.6 0.8 1.0 0 100 200 300 400 500 600 700 Success probability Number of Instances Number of instances A B 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Simulation Experiment 0 20 40 60 Simulation Experiment #Instances [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)]
  41. Möbius ladder Möbius ladder 41 N = 4n ξ =

    -0.02 N = 4n+2 Ê Ê Ê Ê Ê Ê Ê Ê ‡ ‡ ‡ ‡ ‡ ‡ ‡ ‡ 5 10 15 20 25 30 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Problem size N Success rate Ground State Probability (%) Moebius Ladder Graphs Ising Energy -160 -140 -120 Ising Energy -100 -80 -60 Ising Energy Number of Runs -20 0 20 0 20 40 60 80 100 N=100 N=52 N=8 100% 48% 21% 10 20 30 40 50 60 70 80 90 100 0 20 40 60 80 100 Graph Size N =|V | MAX CUT / Ground State Energy Fig. 3. Results with various-size Möbius ladder graphs. (A) Observed probability of obtaining a ground state of the Möbius ladder graph in a single run, as a function of the size N of the graph. Multiple 100-run batches were performed for each graph size to obtain the standard deviations, which are shown as error bars. (B to D) Histograms of obtained solutions in 100 runs for the graphs shown in the insets. RESEARCH | REPORTS N = 4n [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)] Simulation Experiment
  42. Computation Time Scaling • Time to 99% success (find ground

    state at least once in 99% probability). • Tcomp x log(1-0.99) / log(1-psuccess) (minimum k for (1-psuccess )k < (1-0.99) is required) • Exponential time to the ground states. Almost constant time to solution ≦ .94 OPT 42 The same spin configuration {si } represents a cut of size CðfsigÞ ¼ −1 2 X 1≤i <j ≤N Jij − 1 2 HðfsigÞ; there is a direct mapping between the ener- gies in the Ising problem and the cut sizes in the Max-Cut problem, and minimizing the Ising energy maximizes the cut. The unweighted, undirected Möbius ladder (cubic) graph with N = 16 vertices is shown in Fig. 2A. We programmed the corresponding J matrix into our feedback electronics and ran Graph Size N=|V | Runtime to obtain 99% Success Probability (s) 0 20 40 60 80 100 10-4 10-3 10-2 10-1 100 100% 99% 98% 97% 96% 95% 94% 93% 92% 91% 90% Random cubic graphs | 80 100 100% 99% 98% 97% 96% 95% 94% 93% 92% 91% 90% Solution accuracy Solution accuracy B Ising Energy 60 -140 -120 21% erved probability of obtaining a the size N of the graph. Multiple dard deviations, which are shown the graphs shown in the insets. on November http://science.sciencemag.org/ Downloaded from ¬ ¬ [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)]
  43. Effect of Edge Density • Lower success rate in dense

    graphs are due to insufficient pump (or short computation time) • For the dense graphs, it is easier to modulate the pump: gradual pumping 43 Graph Edges |E| Graph Edge Density d=2|E|/(|V|(|V|-1)) Success Probability (%) 100% 99% 98% 97% 96% 95% 94% 93% 92% 91% 90% Solution accuracy 0 2000 4000 0 20 40 60 80 100 0 0.5 1 Simulation Experiment (p = 0.88) 0 1000 2000 3000 4000 5000 0 20 40 60 80 100 Graph Edges |E| Success Probability (%) 0 1000 2000 3000 4000 5000 0 20 40 60 80 100 Graph Edges |E| Success Probability (%) 0 1000 2000 3000 4000 5000 0 20 40 60 80 100 Graph Edges |E| Success Probability (%) (p = 0.7) (p = 0.9) (p = 1.1) [P. L. McMahon*, A. Marandi*, Y.haribara, et al., Science 354, 614 (2016)]
  44. 7. Performance Evaluation with Larger Problems [T. Inagaki, Y. Haribara,

    et al., Science 354, 603 (2016)] [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)]
  45. Optimal scheduling for SA • N = 2000 complete graphs

    with weight wij = {-1, +1} • The time to SDP solution is ~ 5ms (SA on NII server; Intel Xeon CPU W3530 @ 2.8 GHz, 94 GB Memory). • Temperature scheduling is optimized (to use the different normalization of time) to various energy target. Solution by SDP (-60278) SA (200MCS; avg in 100 trials) 0 200 400 600 800 1000 0.006 0.008 0.010 0.012 0.014 Total step (MCS) Computation time (s) to SDP (-60278) -61000 -61500 -62000 0.00 0.01 0.02 0.03 0.04 -63000 -62000 -61000 -60000 -59000 -58000 Computation time (s) Ising energy SA 45 [T. Inagaki, Y. Haribara, et al., Science 354, 603 (2016)]
  46. G22: random G39: scale-free K2000 : complete Node degree distribution

    Best known 13359 2408 − SA 50 ms (Max in 100) 13336 2384 32781 SA 50 ms (Ave. in 100) 13298 2359 32314 CIM 5 ms (Max in 100) 13313 2361 33191 CIM 5 ms (Ave. in 100) 13248 2328 32457 GW-SDP 12992 2200 29619 Cut value histogram 2000 nodes, 19990 edges edge weight: w∈{0, +1} 2000 nodes, 11778 edges edge weight: w∈{-1, 0, +1} 2000 nodes, 1999000 edges edge weight: w∈{-1, +1} SA CIM GW-SDP Node degree 10 15 20 25 30 35 Amount 200 150 100 50 0 Node degree 1 10 100 Amount 1000 100 10 1 Node degree 1990 1990 2000 2010 Amount 2000 1000 0 G22: random G39: scale-free K2000 : complete Node degree distribution Best known 13359 2408 − SA 50 ms (Max in 100) 13336 2384 32781 SA 50 ms (Ave. in 100) 13298 2359 32314 CIM 5 ms (Max in 100) 13313 2361 33191 CIM 5 ms (Ave. in 100) 13248 2328 32457 GW-SDP 12992 2200 29619 Cut value histogram 2000 nodes, 19990 edges edge weight: w∈{0, +1} 2000 nodes, 11778 edges edge weight: w∈{-1, 0, +1} 2000 nodes, 1999000 edges edge weight: w∈{-1, +1} SA CIM GW-SDP Node degree 10 15 20 25 30 35 Amount 200 150 100 50 0 Node degree 1 10 100 Amount 1000 100 10 1 Node degree 1990 1990 2000 2010 Amount 2000 1000 0 Cut value 30000 31000 32000 33000 Amount 20 15 10 5 0 Amount 20 15 10 5 0 Cut value 13000 13200 13400 35 30 25 Cut value 2200 2300 2400 Amount 20 15 10 5 0 25 (50 ms) G22: random G39: scale-free K2000 : complete Node degree distribution Best known 13359 2408  SA 5 ms (Max in 100) 13245 2339 30118 SA 5 ms (Ave. in 100) 13183 2297 29308 CIM 5 ms (Max in 100) 13313 2361 33191 CIM 5 ms (Ave. in 100) 13248 2328 32457 GW-SDP 12992 2200 29619 Cut value histogram 2000 nodes, 19990 edges edge weight: w{0, +1} 2000 nodes, 11778 edges edge weight: w{-1, 0, +1} 2000 nodes, 1999000 edges edge weight: w{-1, +1} SA CIM GW-SDP Table 1 Node degree 10 15 20 25 30 35 Amount 200 150 100 50 0 Node degree 1 10 100 Amount 1000 100 10 1 Node degree 1990 1990 2000 2010 Amount 2000 1000 0 SA5msẚ㍑ (5 ms) (5 ms) (5 ms) Performance of solution quality (CUT histogram in 100 trials) 46 SA and SDP : Intel Xeon X5650 @ 2.67 GHz, Westmere (2010) [T. Inagaki, Y. Haribara, et al., Science 354, 603 (2016)]
  47. Table of approaches Deterministic Stochastic Binary neuron* Hopfield Network [Hopfield

    1982] Simulated Annealing [Kirkpatrick, et al. 1982] Analog neuron Hopfield-Tank Neural Net [Hopfield&Tank 1986] CIM-cSDE *[McCulloch&Pitts 1943] Comment: Boltzmann Machine is also proposed around the time [Hinton&Sejnowski 1983] but the context is slightly different from optimization problems. 47 [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)]
  48. Many-core Processor • PEZY-SC (MIMD architecture) @ 733 MHz •

    8192 threads in parallel on chip (1024 cores x 8 threads on chip) • 3 TFlops (single precision) 48 [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)] PEZY-SC (1024PE) Prefecture (256PE) City (16PE) Village (4PE) PE Program Counter × 8 L1 Instruction Cache 64bit × 256w (2KB) ALU 4FP ops/cycle Register File 32bit × 256w (1KB) Local Storage 32bit x 4096w (16KB) PE L1 Data Cache 2KB Village (4PE) Village (4PE) Village (4PE) Village (4PE) Special Function Unit City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) City (16PE) Prefecture (256PE) Prefecture (256PE) Prefecture (256PE) Prefecture (256PE) DDR4 DRAM 19GB/s PCIe Gen3 x8 PCIe Gen3 x8 PCIe Gen3 x8 PCIe Gen3 x8 L3 Data Cache 2MB DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s DDR4 DRAM 19GB/s ARM 926 ARM 926 L2 Data Cache 64KB L2 Instruction Cache 32KB L3 Instruction Cache 128KB Atomic Cache 128B L1 Data Cache 2KB PE PE PE Thread0 front Thread1 front Thread3 front Thread2 front Thread1 back Thread0 back Thread3 back Thread2 back clock clock clock clock (B) (A)
  49. Result : Time to target 49 Ensemble average in 100

    trials (26 for experiment) with best/worst envelope. Note : worst 3 cases of HN didn’t reach the target. Best (ms) Avg (ms) Hardware CIM 0.071 0.264 DOPO+FPGA HN 0.924 1.84 CPU SA 2.10 3.20 CPU HTNN 7.04 9.67 Manycore CIM (experiment) HN SA HTNN 10-5 10-4 0.001 0.010 -30 -25 -20 -15 -10 -5 0 Time (s) Ising energy Compared to: Best Avg HN 13.0141 6.9697 SA 29.5775 12.1212 HTNN 99.1549 36.6288 Relative speed-up for CIM [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)]
  50. Result : N=20000 50 CIM (simulation) HN SA HTNN 10-5

    10-4 0.001 0.010 0.100 1 -100 -80 -60 -40 -20 0 Time (s) Ising energy Best (ms) Avg (ms) Hardware CIM 0.14 0.15 Simulation HN 23 26 CPU SA 60 65 CPU HTNN 500 540 Manycore Compared to: Best Avg HN 164.3 173.3 SA 428.6 433.3 HTNN 3571.4 3600.0 Ensemble average in 100 trials with best/worst envelope. (simulation: c-number Langevin via truncated-Wigner) Relative speed-up for CIM [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)]
  51. 8. Conclusion • Scalable implementation of measurement feedback CIM is

    proposed with pulsed fiber DOPO and FPGA • Validation with numerical simulation and experimental implementations of CIM with n = 100 and n = 2048 • In the complete graph of n = 2000, experimental CIM is 7.0 and 12.1 times faster than HN and SA implemented on Intel Xeon CPU 51
  52. Publication list 1. Yoshitaka Haribara, Shoko Utsunomiya, Yoshihisa Yamamoto, Entropy

    18, 151 (2016). 2. Kenta Takata, Alireza Marandi, Ryan Hamerly, Yoshitaka Haribara, Daiki Maruo, Shuhei Tamate, Hiromasa Sakaguchi, Shoko Utsunomiya, Yoshihisa Yamamoto, Scientific Reports 6, 34089 (2016). 3. Takahiro Inagaki, Yoshitaka Haribara, Koji Igarashi, Tomohiro Sonobe, Shuhei Tamate, Toshimori Honjo, Alireza Marandi, Peter L. McMahon, Takeshi Umeki, Koji Enbutsu, Osamu Tadanaga, Hirokazu Takenouchi, Kazuyuki Aihara, Ken-ichi Kawarabayashi, Kyo Inoue, Shoko Utsunomiya, Hiroki Takesue, Science 354, 603 (2016). 4. Peter L. McMahon*, Alireza Marandi*, Yoshitaka Haribara, Ryan Hamerly, Carsten Langrock, Shuhei Tamate, Takahiro Inagaki, Hiroki Takesue, Shoko Utsunomiya, Kazuyuki Aihara, Robert L. Byer, M. M. Fejer, Hideo Mabuchi, Yoshihisa Yamamoto, Science 354, 614 (2016). *equally contributed 5. Yoshitaka Haribara, Shoko Utsunomiya, Ken-ichi Kawarabayashi, Yoshihisa Yamamoto, Encyclopedia of Spectroscopy and Spectrometry 3, 824 (2017). 6. Yoshitaka Haribara, Hitoshi Ishikawa, Shoko Utsunomiya, Kazuyuki Aihara, Yoshihisa Yamamoto, Quantum Science and Technology 2, 044002 (2017). 52
  53. Appendix

  54. Appendices Index • Appendix A: Slack variable with constraint optimization

    • Appendix B: Multimode simulation • Appendix C: Parallel SA implementation on multithreaded CPU • Appendix D: Order-N Monte Carlo • Appendix E: Iterative method • Appendix F: FPGA implementation
  55. Pump rate normalization • The CSDE: • In the numerical

    simulation, 3 components are calculated separately. • gain: 10 steps, loss: instantaneous, coupling: instantaneous • The total loss of the cavity consists of 2: out-coupler loss γout and injection-coupler loss γinj . This should be 1 per unit time dt as in CSDE • When dt = .01, the parameters can be set as γout = .009, γinj = .001, i.e., (γout + γinj )/dt = 1. • Suppose the integration step in the gain medium is tg = 10, gain loss coupling 0.5 1.0 1.5 2.0 1 10 100 1000 Pump rate Photon number cf. NTT setup: 8.9 dB loss per round trip (3.3 dB in PPLN)
  56. Lossy cavity (T = 0.25) Random graph g1 (N =

    800), As = 400 ξ = -0.4/(avg.degree) = -0.0085 600 RoundTrips × 100 runs p=0.4 0 100 200 300 400 500 600 -1.0 -0.5 0.0 0.5 1.0 Round trip time Real amplitude p = 0.6 0 100 200 300 400 500 600 -1.0 -0.5 0.0 0.5 1.0 Round trip time Real amplitude p = 1.1 0 100 200 300 400 500 600 -1.0 -0.5 0.0 0.5 1.0 Round trip time Real amplitude Amplitude @ each pump rate 1 N ⁄ i=1 N †ci § 0.0 0.5 1.0 1.5 0.0 0.2 0.4 0.6 0.8 Pump rate Average amplitude max avg 0.0 0.5 1.0 1.5 11000 11100 11200 11300 11400 11500 11600 Pump rate Number of cut Oscillation threshold #Cut ≦ 11603 11529.6
  57. CUT values to SDP UB SDP UB SG SG3 GW

    CIM SA BLS g1 12083 0.90615 0.93925 0.93867 0.95630 0.95862 0.96201 g2 12089 0.91405 0.94251 0.94367 0.95583 0.95790 0.96120 g15 3171 0.90350 0.94387 0.93882 0.94954 0.95427 0.96184 g17 3171 0.90918 0.93914 0.93598 0.94765 0.95080 0.96090 g53 4009 0.90846 0.94138 0.93540 0.94812 0.95161 0.96034 cut values are compared to #E in the original paper SG, SG3 in [Kahruman et al. 05] GW and CIM by Haribara SA by Tamate BLS in [Benlic et al. 12]
  58. Appendix A: Application

  59. Boxed-Constrained Quadratic Programming (BCQP) • Objective function : • BCQP

    is NP-hard • Benchmark instances have discrete values of x: • Discrete steps r for variables x : 2 (Ising) ≦ r ≦ 13 • Quadratic coupling matrix : Qij ∈ {0, ±1, ±2(diag)} min f(x) = 1 2 xT Qx s.t. x 2 ⌦ = { 1  x  1 | x 2 Rn} • If Q is positive semide nite (convex) 㱺 solvable in polynomial time • Ellipsoid method • Gradient descent • Interior point method • If Q is negative semide nite or inde nite 㱺 NP-hard • Branch&Bound with SDP lower bounding [S. Burer, et al., D. Comput Optim Appl 43, 181 (2009)] • Local methods: [N.I.M. Gould, and P.L. Toint, Appl. Optim. 72, 149 (2002).] • Global methods: [P. De Angelis, et al., Developments in Global Optimization, pp. 73-94 (1997).] [P.M. Pardalos, et al., J. Glob. Optim. 1, 15 (1991); S. Burer, et al., D. Comput Optim Appl 43, 181 (2009)] 59
  60. Parameter Sweep • Instance : n=10 spins of r=5 valued

    (x = -1, -1/2, 0, 1/2, 1), OPT = -15.25 Parameter search (Best in 100 runs for each parameter set) n=10, #7 (r=5) n=10, #7 (r=5) OPT = -15.25 n=10, #0 (r=5) OPT = -10.25 n=10, #6 (r=13) OPT = -13.58 60
  61. Slack Variable [T. Leleu] PSJHJOBM QBSBNFUFS PQUJNJ[FE TMBDL WBSJBCMFT TVDDFTT

       range of parameters 0.6 ≦ p ≦ 1.3 0.06 ≦ ξ ≦ 1.4 dxi = [( 1 + p x2 i ei)xi X j Qijxj] dt + g dWi dei = ei[ 1/2 + IR+ (x xi) + IR+ (xi x+)]dt 61
  62. From BCQP • constraint Σ xi = 1 • mean-variance

    portfolio optimization (convex) • max-clique/max-independent set [Motzkin and Straus] (with powerful heuristic solvers) • References: standard quadratic optimization problems [Bomze IM (1998)]; both applications [J Glob Optim 13:369– 387] • more linear constraints • general quadratic programming • below (continuous) & above (binary) threshold OPOs • Mixed-integer programming 62
  63. Truss Topology Optimization [Giger+’06] Ai = {0, 1} Euler’s critical

    load Compression, tension, and buckling limit smin i si(Ai)  smax i sEuler i si(Ai) sEuler i = ⇡EiAi 4l2 i  0 Stress of member i • minimize: • subject to: W(A) = m X i=1 ⇢iliAi Material weight density Member length Weight of the structure Equilibrium of forces si = Fi/Ai (Ai = 1) P P 10-member Truss Ground Structure Ai si A Solution
  64. Truss Topology Optimization on CIM W(A) = m X i=1

    ⇢iliAi • Objective function minimize: • ci = 2Ai - 1: area of the truss member i (Knapsack-type problem) Strategy: 1. ໨తؔ਺ʹ੍໿߲ΛؚΊΔ (buckling constraint) 2. ৳ॖ੍໿ΛεϥοΫม਺ (slack variable) Ͱಋೖ V (c) = m X i=1 ⇢ilici + ↵ K X k=1 f(sE i m X i=1 sikci) Sigmoid Potential dci = [( 1 + p c2 i ei)ci (dV (c)/dci)]dt + gdWi dei = ei[ 1/2 + 1R >0 (s si) + 1R >0 (si s+)]dt
  65. Appendix B: Multimode

  66. 4 2 0 2 4 0.6 0.4 0.2 0.0 0.2

    0.4 0.6 Longitudinal cavity mode m Ns Amplitude a b c d Normalized amplitude Si t 0 50 100 150 200 250 300 0.001 0.01 0.1 1 Normalized time 0 50 100 150 200 250 300 0.001 0.01 0.1 1 Normalized time 0 50 100 150 200 250 300 0.001 0.01 0.1 1 Normalized time 0 50 100 150 200 250 300 0.001 0.01 0.1 1 Normalized time Multimode simulation • Hermite-Gauss modes in free-space cavity
  67. All cubic graphs of n ≦ 16 • Success probability

    to find the ground state against cubic (3-regular) graphs of order ≦ 16. The parameters are fixed as (p, ξ) = (1.1, -0.1). • Note that the MAX-CUT is still NP-hard against cubic graphs [E. Halperin, et al., 2002] • Number of problems: N 4 6 8 10 12 14 16 #graphs 1 2 5 19 85 509 4060 Ê Ê Ê Ê Ê Ê Ê ‡ ‡ ‡ ‡ ‡ ‡ ‡ Ê Ê Ê Ê Ê Ê Ê ‡ ‡ ‡ ‡ ‡ ‡ ‡ Ï Ï Ï Ï Ï Ï Ï Ú Ú Ú Ú Ú Ú Ú 4 6 8 10 12 14 16 0.2 0.4 0.6 0.8 1.0 Graph order N Success rate Success probability Number of vertices N ← Multimode avg. ± SD ← Single-mode avg. ± SD ← Multimode worst case ← Single-mode worst case Note: Up to 5 modes 67
  68. n = 800 sparse graphs • Multimode͸ऑ͍݁ ߹ఆ਺Ͱ΋ͦͦ͜ ͜ͷղΛಘΔ͜ͱ ͕Ͱ͖Δ͕ɺ࠷େ

    ஋͸্͕Βͳ͍ɻ • Single-mode͸݁߹ ఆ਺ΛߋʹڧΊΔ ͱΑΓྑ͍ղΛಘ ΒΕΔɻ g1 cubic 68
  69. Appendix C: CPU and processor Implementation

  70. 0.000 0.005 0.010 0.015 0.020 -32 -31 -30 -29 -28

    -27 Time (s) Ising energy Difference in generation of CPU architecture Note: Intel Core i7-6567U @ 3.3 GHz Skylake (2016) was almost the same as Haswell GW-SDP target Intel Xeon X5650 2.67 GHz Westmere (2010) Intel Xeon E3-1225 v3 3.2 GHz Haswell (2013) (2.5x faster) 70
  71. • Serious problem : the numerical simulation is very slow

    when the number of vertices N scales up even when we use light- weight SDEs • Accelerator configurations • CPU : all processes are calculated on CPU • CPU+GPU : matrix-vector product is off-loaded on GPU (float) • PEZY-SC : all calculations are parallelized (float) Bottleneck of SDE O(|E|) ≦ O(N2) in total Other part : O(N) in total 71 [Y. Haribara, H. Ishikawa, S. Utsunomiya, K. Aihara, Y. Yamamoto, Quantum Science and Technology 2, 044002 (2017)]
  72. Simulation time scaling • Complete graphs for different sizes •

    Simulation time for c-number Langevin eq. (200 round trips) • Spec: • CPU : Intel Xeon W3530, 2.80 GHz, Nehalem-WS (2010) • GPU : NVIDIA Tesla C2075, 1.15 GHz, Fermi (2011) w/cuBLAS • PEZY-SC, 1024 core (x8 threads), 733 MHz (2014) 72 O(N2) ?? depends on FPGA resources                      CPU CPU+GPU PEZY-SC CIM 500 1000 5000 104 10-4 0.001 0.010 0.100 1 10 100 Problem size N Simulation time (s) CPU → PEZY-SC 250x faster
  73. SAͷฒྻԽ 1. ΞϓϦέʔγϣϯґଘͷฒྻԽ 2. ྖҬ෼ׂ (෼ׂ౷࣏๏) 3. ෳ਺Ϛϧίϑ࿈࠯ [A.M. Ferreiro,

    et al., J. Global. Optim. 57, 863 (2012)] ඇಉظ ಉظ sync. sync. sync. sync. (Sync. : ࠷খΤωϧΪʔঢ়ଶΛબͿ)
  74. SAͷGPUʹΑΔฒྻԽ • ؔ਺࠷దԽ: ਖ਼نԽ Schwefelؔ਺ • NVIDIA FeForce GTX 470

    • ಉظߋ৽ͷฒྻԽ͸Τ ϥʔݮ • ܭࢉ࣌ؒ 76x [A.M. Ferreiro, et al., J. Global. Optim. 57, 863 (2012)] 876 J Glob Optim (2013) 57:863–890 Table 1 Error of the solution obtained by the algorithm, both in the value of the function at the minimum (columns | fa − fr |, where fa is the objective function value found by the algorithm and fr is the exact function value in the real minimum) and in the minimum (columns relative error, measured in · 2) n V0 V1 V2 | fa − fr | Relative error | fa − fr | Relative error | fa − fr | Relative error 8 1.3190 × 10−1 2.4283 × 10−3 1.2891 × 10−2 7.4675 × 10−4 1.7000 × 10−5 4.1656 × 10−5 16 2.3712 × 10−1 3.2557 × 10−3 7.4586 × 10−2 1.8240 × 10−3 1.9000 × 10−6 5.0686 × 10−5 32 3.3774 × 10−1 3.8852 × 10−3 2.8171 × 10−1 3.5468 × 10−3 1.5730 × 10−4 6.0577 × 10−5 64 7.9651 × 10−1 5.9664 × 10−3 9.7831 × 10−1 6.6126 × 10−3 3.1880 × 10−4 1.2132 × 10−4 128 1.9198 9.2648 × 10−3 3.0461 1.1674 × 10−2 1.2225 × 10−4 1.5304 × 10−4 256 3.6230 1.2733 × 10−2 9.5765 8.0283 × 10−2 1.4953 × 10−2 8.2214 × 10−4 512 7.3054 1.8097 × 10−2 26.2282 4.0424 × 10−1 4.6350 × 10−1 4.5503 × 10−3 Table 2 Performance of CUDA version versus sequential version with one CPU core for different number of parameters n V0 V1 V2 Time Time Speedup Time Speedup 8 1,493.7686 5.5436 269.4595 5.6859 262.7121 16 2,529.3072 15.3942 164.3027 15.5889 162.2502 32 4,618.5820 56.9808 81.0550 60.1882 76.7356 64 8,773.0560 106.6075 82.2930 110.2702 79.5596 128 17,169.0000 210.9499 81.3890 215.5416 79.6552 256 34,251.9240 455.4910 75.1978 462.8035 74.0096 512 68,134.5760 871.7434 78.1589 893.7668 76.2330 (ඇಉظ) (ຖεςοϓಉظ) (γϦΞϧ) (ඇಉظ) (ຖεςοϓಉظ) (γϦΞϧ) ical experiments have been performed with the following hardware and software configu- rations: a GPU Nvidia Geforce GTX470, a recent CPU Xeon E5620 clocked at 2.4 Ghz with 16 GB of RAM, CentOS Linux, NVIDIA CUDA SDK 3.2 and GNU C++ compiler 4.1.2. In what follows, we denote by V0 the sequential implementation, by V1 the parallel asynchronous version and by V2 the parallel synchronous one. 4.1 Analysis of a sample test problem: normalized Schwefel function A typical benchmark for testing optimization techniques is the normalized Schwefel function (see [31], for example): f (x x x) = − 1 n n i=1 xi sin |xi | , −512 ≤ xi ≤ 512, x x x = (x1, . . . , xn). (6) For any dimension n, the global minimum is achieved at the point x x x , the coordinates of which are x i = 420.968746, i = 1, . . . , n, and f (x x x ) = −418.982887. Table 1 illustrates the accuracy for the three versions of the SA algorithm: sequential (V0), asynchronous (V1) and synchronous (V2). For these three versions we use the following configuration: T0 = 1000, Tmin = 0.01, N = 100 and ρ = 0.99. For the parallel versions we use the choice b = 256 and g = 64, for the number of threads per block (block size) and the number of blocks per grid (grid size), respectively, so that the number of Markov chains is 16,384. With this configuration, the algorithm performs 1.8776×109 function evaluations in all cases. 123 code, following the ideas of Sect. 2.2, so that both codes ons and their performance can thus be compared. The numer- ormed with the following hardware and software configu- GTX470, a recent CPU Xeon E5620 clocked at 2.4 Ghz inux, NVIDIA CUDA SDK 3.2 and GNU C++ compiler by V0 the sequential implementation, by V1 the parallel the parallel synchronous one. blem: normalized Schwefel function ptimization techniques is the normalized Schwefel function |xi | , −512 ≤ xi ≤ 512, x x x = (x1, . . . , xn). (6) minimum is achieved at the point x x x , the coordinates of = 1, . . . , n, and f (x x x ) = −418.982887. y for the three versions of the SA algorithm: sequential (V0), nous (V2). For these three versions we use the following = 0.01, N = 100 and ρ = 0.99. For the parallel versions = 64, for the number of threads per block (block size) and rid size), respectively, so that the number of Markov chains n, the algorithm performs 1.8776×109 function evaluations 123 ( x ∈ Rn) Image source : [http://www-optima.amp.i.kyoto-u.ac.jp/member/student/hedar/Hedar_files/TestGO_files/Page2530.htm]
  75. Speedup for SA (OpenMP) Concurrency #flips time 1 27011.7 0.00530

    2 19722.9 0.00437 4 16302.1 0.00448 8 9538.8 0.00408 16 6200.5 0.00838 24 4635.5 0.01014 n = 2000 Intel Xeon E5-2680 @ 2.50 GHz with OpenMP Weighted complete graph K2000 (Jij = ±1) (Courtesy of T. Sonobe & K. Kawarabayashi @ NII) 1.3x 75 n = 105 Concurrency #flips time 1 2218790.8 4.258 2 1358508.9 3.132 4 923020.6 2.479 8 652692.7 2.124 16 533915.5 2.194 24 471558.2 2.527 2.0x
  76. Appendix D: Monte Carlo

  77. Single spin flip O(N) Monte Carlo • Partition function :

    • k = (k1, …, kNb) ͷ૊߹ͤͰల։ mean Poisson distribution Inverse temperature Positive parameter =:W 77
  78. SA with single spin flip O(N) MC • Complete graphs

    (J=±1-weighted with uniform distribution) • Scaling : O(βεN), while naive MC scales as O(N2) 0 50 100 150 200 250 300 -30 -25 -20 -15 -10 -5 0 Monte Carlo sweeps Ising energy Naive MC (2.0 s)  = 2 (24.9 s)  = 4 (36.6 s)  = 6 (47.3 s)  = 8 (59.0 s)  = 10 (70.0 s) cf. Constant temperature T = 1.0, ε = 0.1 Simulated Annealing N = 2000, β(t) = 40 log(1+10t), t∈[0,1]                            Naive Monte Carlo  Order-N Monte Carlo 10 100 1000 104 10-6 10-5 10-4 0.001 0.010 0.100 Number of spins N CPU time per MCS (s) 78
  79. Appendix E: Extended Implementation

  80. w12 =2 v1 v2 v3 v4 w34 =3 w23 =1

    w24 =3 w14 =-2 w 13=-2 v1 v2 v3 v4 v1 v2 v3 v4 v1 v2 v3 v4 time (in single run) Jij ={0,±1} Weighted graph Temporal graph Weight decomposition weight distribution -3 -2 -1 0 1 2 3 0.0 0.1 0.2 0.3 0.4 Edge weight w Frequency Calculate from large (significant) weight 80
  81. Variant of iterative method Method coupling Jij CUT magnify 2k

    scaling 280.5 2bit-itr single bit 297.4 accumulate or from MSB 370.4 accumulate 2bit-itr bit representation wij = 0 0 1 0 0 1 … 1 1 0 1 0 digit : k … 4 3 2 1 0 ↓ ↓ ↓ ↓ ↓ 2k 2k-4 24 23 2 0 0 1 1 1 1 … 1 1 1 1 1 MSB accumulate bit (or operation): Jij is on if wij > threshold Jij is 2bit (on/off, +1/-1) 81
  82. Torus with Gaussian dist. • torusg3-8 : 3D toroidal graph

    (degree 6) with Gaussian weight: wij ∈ N(0, 1), equivalent to 3D Edwards-Anderson spin glass model • |V| = 512, |E| = 1536, 20-bit weight Implementation Cut SDP UB 457.36 BestKnown 416.85 SA 392.23 continuous Jij ∈ R 393.59 proposed (itr) 385.57 Jij ∈ {-1,0,+1} 354.10 [http://dimacs.rutgers.edu/Challenges/Seventh/Instances/] 280 300 320 340 360 380 0.00 0.05 0.10 0.15 0.20 0.25 CUT Probability 280 300 320 340 360 380 0.00 0.05 0.10 0.15 0.20 0.25 CUT Probability t = 800 best continuous accumulate weight distribution -3 -2 -1 0 1 2 3 0.0 0.1 0.2 0.3 0.4 Edge weight w Frequency best in 1000 trials 82
  83. torusg3-15 from 7-th DIMACS • 3D toroidal graph (degree 6)

    with Gaussian weight: wij ∈ N(0, 1) • |V| = 153 = 3375, |E| = 10125 Implementation Cut SDP UB 3134.60 BestKnown 2602.03 SA 2367.84 continuous Jij ∈ R 2560.43 proposed (itr) 2470.80 Jij ∈ {-1,0,+1} 2214.50 [http://dimacs.rutgers.edu/Challenges/Seventh/Instances/] 10 trials 83
  84. Appendix F: FPGA [patent]